Scrapy Learning Series (4): Data Storage

Reposted article. 2019-02-01 14:44. Category: Python crawlers, Scrapy

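In the previous article we implemented a simple crawl of the quotes.toscrape.com pages. This article is about storing the scraped data. Using a text file and a MongoDB database as the two examples, we learn how to persist items, continuing with the example from the previous section.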

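Writing the spider

Modify items.py to define our item

An Item is a container for the scraped data, and it is used much like a Python dictionary. You can also use a plain dict in Scrapy, but an Item adds a safeguard: assigning to a field that was never declared, for example because of a typo, raises an error instead of silently creating a new key. Put simply, everything you want to save should be declared as a field of an item. For the page we are scraping, we want to save the quote text, the author and the tags, so those are the fields to define in items.py. As you will see later, items.py is the simplest file you have to fill in.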

import scrapy
class QuoteItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    text = scrapy.Field()
    author = scrapy.Field()
    tags = scrapy.Field()

That completes the item definition.

Writing the spider file

In the project's spiders folder, create a file named quotes.py. This is where we write the spider. Code first, explanation after.

# -*- coding: utf-8 -*-
import scrapy
from bs4 import BeautifulSoup  # newly added
from tutorial.items import QuoteItem  # newly added

class QuotesSpider(scrapy.Spider):

    name = 'quotes'
    allowed_domains = ['quotes.toscrape.com']
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        quotes = BeautifulSoup(response.text,'lxml')
        for quote in quotes.find_all(name = 'div',class_='quote'):
            item = QuoteItem()  # use the data structure defined in items.py
            for s in quote.find_all(name = 'span',class_='text'):
                item['text'] = s.text
            for s in quote.find_all(name= 'small',class_='author'):
                item['author'] = s.text
            for s in quote.find_all(name='div', class_='tags'):
                item['tags'] = s.text.replace('\n','').strip().replace(' ','')
            yield item

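        # Follow the "next page" link, if there is one
        nexts = quotes.find_all(name='li', class_='next')
        for next_li in nexts:
            n = next_li.find(name='a')
            # urljoin resolves the relative href without producing a double slash
            url = response.urljoin(n['href'])
            yield scrapy.Request(url=url, callback=self.parse)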

The following explains the parts that were added or changed.

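Import the custom QuoteItem class. Note: the directory that contains scrapy.cfg is treated as the project root by default, so from tutorial.items import QuoteItem imports the class named QuoteItem from items.py in the tutorial project, that is, the QuoteItem we defined above. Only after importing this class can we store our fields in it.
item = QuoteItem() creates an instance; nothing more to say.
item['text'] = s.text, item['author'] = s.text and item['tags'] = s.text.replace('\n','').strip().replace(' ','') assign values to the item. An item can be thought of simply as a dictionary, and these lines set the values of its keys.
yield item turns parse into a generator. Scrapy passes each yielded item to the pipelines for further processing, provided you have enabled them in the settings; the relevant settings are covered shortly.
nexts = quotes.find_all(name='li', class_='next') looks for the next-page link. We iterate over nexts, and if there is a next page we yield a Request, which joins the scheduler queue as a new request and waits to be scheduled.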
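Two details of the parse method are easier to see in isolation: how the tags text is cleaned, and how the relative next-page link becomes an absolute URL. The sketch below uses only the standard library. The sample strings `raw_tags` and `href` are assumptions made up for illustration, not text captured from the site.

```python
from urllib.parse import urljoin

# 1) What the tags clean-up chain does to a multi-line text block (made-up sample).
raw_tags = '\n    Tags:\n    change\n    deep-thoughts\n'
cleaned = raw_tags.replace('\n', '').strip().replace(' ', '')
print(cleaned)    # Tags:changedeep-thoughts

# An alternative that keeps the tags separate: split on whitespace, drop the label.
tag_list = raw_tags.split()[1:]
print(tag_list)   # ['change', 'deep-thoughts']

# 2) Building the next-page URL from a relative href (assumed value).
base = 'http://quotes.toscrape.com/'
href = '/page/2/'
naive = base + href            # plain string concatenation
joined = urljoin(base, href)   # standard URL resolution
print(naive)      # http://quotes.toscrape.com//page/2/
print(joined)     # http://quotes.toscrape.com/page/2/
```

Because the clean-up chain removes every space and newline, the individual tags run together into one string. If you need them separately, storing a list is probably more useful. For the link, Scrapy's response.urljoin resolves an href against the response URL in the same way, which avoids the double slash.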

Modify pipelines.py to implement saving.

import pymongo


class MongoPipeline(object):

    def __init__(self, mongo_url, mongo_db):
        self.mongo_url = mongo_url
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        # Read the connection settings from settings.py
        return cls(
            mongo_url=crawler.settings.get('MONGO_URL'),
            mongo_db=crawler.settings.get('MONGO_DB')
        )

    def open_spider(self, spider):
        # Connect once when the spider starts
        self.client = pymongo.MongoClient(self.mongo_url)
        self.db = self.client[self.mongo_db]

    def process_item(self, item, spider):
        # Use the item class name (here 'QuoteItem') as the collection name
        name = item.__class__.__name__
        # insert_one replaces Collection.insert, which was removed in PyMongo 4
        self.db[name].insert_one(dict(item))
        return item

    def close_spider(self, spider):
        self.client.close()
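The settings in the next step also refer to a textPipeline class for saving to a file, but its code is not reproduced in this article. The following is only a hypothetical sketch of what such a pipeline could look like. The output file name quotes.txt, the tab-separated line format and the constructor argument are all assumptions. It follows the same open_spider / process_item / close_spider structure as MongoPipeline and would live in the same pipelines.py.

```python
import os
import tempfile


class textPipeline(object):
    """Hypothetical file pipeline: appends one tab-separated line per item."""

    def __init__(self, path='quotes.txt'):
        self.path = path  # assumed output file name

    def open_spider(self, spider):
        # Open the file once when the spider starts
        self.file = open(self.path, 'a', encoding='utf-8')

    def process_item(self, item, spider):
        data = dict(item)
        fields = [data.get('text', ''), data.get('author', ''), data.get('tags', '')]
        self.file.write('\t'.join(fields) + '\n')
        return item  # hand the item on to the next pipeline

    def close_spider(self, spider):
        self.file.close()


# Quick check outside Scrapy, with a plain dict standing in for a QuoteItem.
out_path = os.path.join(tempfile.mkdtemp(), 'quotes.txt')
pipeline = textPipeline(out_path)
pipeline.open_spider(spider=None)
pipeline.process_item({'text': 'A quote', 'author': 'Someone', 'tags': 'Tags:demo'}, spider=None)
pipeline.close_spider(spider=None)

with open(out_path, encoding='utf-8') as f:
    saved = f.read()
print(saved)
```

Returning the item from process_item matters: it is what lets a later pipeline, such as MongoPipeline, receive the same item.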

Modify settings.py

Previously we changed two settings, ROBOTSTXT_OBEY and DEFAULT_REQUEST_HEADERS. On top of those, we now add the following.

ITEM_PIPELINES = {
    'tutorial.pipelines.textPipeline':300,
    'tutorial.pipelines.MongoPipeline':400
}

MONGO_URL = 'localhost'
MONGO_DB = 'test'

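A brief explanation of the new settings: if you only want to save to a txt file, comment out the second entry; likewise, if you only want to save to the database, comment out the first. To save to both, leave both enabled. tutorial.pipelines.textPipeline refers to the textPipeline class in the tutorial/pipelines module, the one we wrote earlier. The numbers 300 and 400 set the execution order: pipelines with lower values run first, so here each item is written to the file first and then to the database. The values are conventionally chosen in the 0-1000 range, and you can pick your own.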
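The ordering rule can be illustrated with a small stand-alone simulation. This is not Scrapy's internal code, only a sketch of the idea under that caveat: the two stage functions are hypothetical stand-ins for the real pipeline classes, and each one returns the item so the next stage receives it.

```python
ITEM_PIPELINES = {
    'tutorial.pipelines.MongoPipeline': 400,
    'tutorial.pipelines.textPipeline': 300,
}

# Lower values run first, regardless of the order in which the entries are written.
order = [path for path, _ in sorted(ITEM_PIPELINES.items(), key=lambda kv: kv[1])]
print(order)

calls = []


def text_stage(item):
    calls.append('textPipeline')   # stand-in for writing to the file
    return item


def mongo_stage(item):
    calls.append('MongoPipeline')  # stand-in for inserting into MongoDB
    return item


stages = {
    'tutorial.pipelines.textPipeline': text_stage,
    'tutorial.pipelines.MongoPipeline': mongo_stage,
}

item = {'text': 'A quote'}
for path in order:
    item = stages[path](item)
print(calls)
```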

Running the spider

Go to the project directory and run
scrapy crawl quotes
You can then see the newly added records in the MongoDB database.
