Python web crawling: using Scrapy to automatically crawl multiple pages

Reposted article, dated 2017-06-25 09:41. Category: Python web crawling

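The Scrapy spider introduced earlier can only crawl a single page. What if we want to crawl many pages, such as an online novel? Consider the structure below: this is the first chapter of a novel, with links for returning to the table of contents and for going to the next page.

The corresponding page source:

Now look at the page for a later chapter: a "previous page" link has been added.

The corresponding page source:

Comparing the two page sources shows that the previous-page, contents, and next-page links all live in the href of <a> elements under a <div>. The difference is that the first chapter has only 2 <a> elements, while every chapter from the second onward has 3. We can therefore count the <a> elements under the <div> to tell whether the page has a previous-page link and, from that, which <a> holds the next-page link. The complete spider further below implements this check.

This produces the link to the next page, and Request is then called to fetch that page's data.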
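The check described above can be sketched outside Scrapy with the standard library only. This is a minimal illustration, not the article's original code: the helper name `pick_next_href` and the sample href lists are hypothetical, and the link order (contents, next on the first chapter; previous, contents, next afterwards) is taken from the description above.

```python
from urllib.parse import urljoin


def pick_next_href(nav_hrefs):
    """Return the href of the next-page link.

    nav_hrefs is the list of href values of the <a> elements in the
    navigation <div>, in page order (hypothetical input for this sketch).
    """
    if len(nav_hrefs) > 2:
        # previous page, contents, next page
        return nav_hrefs[2]
    # first chapter: contents, next page
    return nav_hrefs[1]


current = "http://www.xunread.com/article/8c39f5a0-ca54-44d7-86cc-148eee4d6615/1.shtml"

# Assumed href values for a first chapter and a later chapter
first = pick_next_href(["index.shtml", "2.shtml"])
later = pick_next_href(["1.shtml", "index.shtml", "3.shtml"])

# Resolve the relative href against the URL of the current page
next_url = urljoin(current, first)
print(next_url)
```

Inside a spider, `response.urljoin(href)` resolves a relative link against the current page in the same way, which avoids hard-coding the base URL.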

The storage code in pipelines.py needs to change as well, as shown below. Here we no longer use JSON; the content is written directly to a txt file.

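class Test1Pipeline(object):

    def open_spider(self, spider):
        # Open the file once per crawl. 'ab' appends, so each chapter is kept
        # ('wb' would overwrite the file on every item).
        self.file = open(r'E:\scrapy_project\xiaoshuo.txt', 'ab')

    def process_item(self, item, spider):
        # item['content'] is a UTF-8 encoded byte string built by the spider
        self.file.write(item['content'])
        return item

    def close_spider(self, spider):
        self.file.close()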

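The complete code is below. Note the two uses of yield. The first yield hands the item to Test1Pipeline for storage. The second yields a Request, so Scrapy fetches the next page and calls parse on it again.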

from scrapy import Spider, Request, Selector
from test1.items import Test1Item  # assumes the project package is named test1


class testSpider(Spider):
    name = "test1"
    # allowed_domains takes bare domain names, not URLs, and must match start_urls
    allowed_domains = ['www.xunread.com']
    start_urls = ["http://www.xunread.com/article/8c39f5a0-ca54-44d7-86cc-148eee4d6615/1.shtml"]

    def parse(self, response):
        init_urls = "http://www.xunread.com/article/8c39f5a0-ca54-44d7-86cc-148eee4d6615"
        sel = Selector(response)
        content = sel.xpath('//div[@id="content_1"]/text()').extract()
        items = Test1Item()
        # Join the text nodes and encode once; the pipeline writes bytes
        items['content'] = ''.join(content).encode('utf-8')
        # The first chapter has 2 links (contents, next); later chapters have 3
        links = sel.xpath('//div[@id="nav_1"]/a')
        if len(links) > 2:
            next_link = links[2].xpath('@href').extract()
        else:
            next_link = links[1].xpath('@href').extract()
        # First yield: hand the item to the pipeline for storage
        yield items
        # Second yield: request the next chapter and parse it the same way
        for n in next_link:
            url = init_urls + '/' + n
            print(url)
            yield Request(url, callback=self.parse)

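For crawling pages automatically, Scrapy offers a more convenient tool: CrawlSpider

The Spider introduced earlier only parses the pages listed in start_urls. The previous section did implement automatic crawling on top of it, but the approach is somewhat complicated. Scrapy provides CrawlSpider to follow links automatically.

The prototype of a crawl rule is: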

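class scrapy.spiders.Rule(link_extractor, callback=None, cb_kwargs=None, follow=None, process_links=None, process_request=None)

(Older Scrapy releases exposed this class as scrapy.contrib.spiders.Rule.)

link_extractor: a LinkExtractor object that defines how links are extracted from each crawled page.

callback: the function called for the response of each link obtained from the link extractor. It receives the response as its first argument. Note: when using CrawlSpider, do not use parse as the callback. CrawlSpider uses the parse method to implement its own logic, so overriding it breaks the crawl.

follow: a boolean indicating whether links extracted from the response should themselves be followed. If it is not given, it defaults to True when no callback is set and to False otherwise.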

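Take extracting links from www.sina.com.cn in scrapy shell as an example.

The allow argument of LinkExtractor is applied only to the URLs found in href attributes:

For example, the extraction below matches its regular expression only against href values.

The result is as follows: each matching link is returned.

The links can be restricted further with restrict_xpaths, as follows: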

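Example 2: the Xunread site (迅讀網) used earlier

Extract the address of the next section from the page.

Page URL:

http://www.xunread.com/article/8c39f5a0-ca54-44d7-86cc-148eee4d6615/1.shtml

The relative URL of the next page is 2.shtml.

It can be extracted with the following rule: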

>>> from scrapy.linkextractors import LinkExtractor
>>> item = LinkExtractor(allow=r'\d\.shtml').extract_links(response)
>>> for i in item:
...     print(i.url)
...
http://www.xunread.com/article/8c39f5a0-ca54-44d7-86cc-148eee4d6615/2.shtml
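To see which URLs this pattern accepts, it can be tried with the standard re module. LinkExtractor searches for the pattern anywhere in the absolute URL, so `re.search` is used here as a stand-in. The 12.shtml and index.shtml URLs are assumed examples, not links confirmed on the site.

```python
import re

# Same pattern as in the rule above, written as a raw string
pattern = re.compile(r"\d\.shtml")

base = "http://www.xunread.com/article/8c39f5a0-ca54-44d7-86cc-148eee4d6615"
candidates = [
    base + "/2.shtml",      # chapter page: a digit directly before ".shtml"
    base + "/12.shtml",     # assumed two-digit chapter: the last digit matches
    base + "/index.shtml",  # index page: no digit before ".shtml"
]

# Keep only the URLs in which the pattern is found
matched = [url for url in candidates if pattern.search(url)]
for url in matched:
    print(url)
```

The digits inside the article id do not trigger a match, because none of them is immediately followed by `.shtml`.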

All chapter links can also be obtained directly from the index page:

C:\Users\Administrator>scrapy shell http://www.xunread.com/article/8c39f5a0-ca54-44d7-86cc-148eee4d6615/index.shtml
>>> from scrapy.linkextractors import LinkExtractor
>>> item = LinkExtractor(allow=r'\d\.shtml').extract_links(response)
>>> for i in item:
...     print(i.url)

This returns the full list of chapter links.

Next, build the corresponding spider in Scrapy, as follows:

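from scrapy import Selector
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from test1.items import Test1Item  # assumes the project package is named test1


class testSpider(CrawlSpider):
    name = "test1"
    # allowed_domains takes bare domain names, not URLs
    allowed_domains = ['www.xunsee.com']
    start_urls = ["http://www.xunsee.com/article/8c39f5a0-ca54-44d7-86cc-148eee4d6615/1.shtml"]
    # Follow every link matching \d\.shtml and pass each response to parse_item
    rules = (Rule(LinkExtractor(allow=r'\d\.shtml'), callback='parse_item', follow=True),)

    def parse_item(self, response):
        print(response.url)
        sel = Selector(response)
        content = sel.xpath('//div[@id="content_1"]/text()').extract()
        items = Test1Item()
        # Join the text nodes and encode once; the pipeline writes bytes
        items['content'] = ''.join(content).encode('utf-8')
        yield items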

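The key part is the rules tuple, which pairs a LinkExtractor matching \d\.shtml with callback='parse_item' and follow=True. It defines how links are extracted from each page. In the example above, the crawl proceeds in the following steps:

1. Start from http://www.xunsee.com/article/8c39f5a0-ca54-44d7-86cc-148eee4d6615/1.shtml. The Rule is applied to the response and extracts the link to 2.shtml. (CrawlSpider passes the responses for start_urls to parse_start_url; parse_item runs on the pages reached through the Rule.)

2. Request 2.shtml. parse_item extracts the page content with XPath, and the Rule is applied again to find further links. This repeats until the Rule extracts no new links.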

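We can also optimize this by setting start_urls to the index page:

http://www.xunsee.com/article/8c39f5a0-ca54-44d7-86cc-148eee4d6615/index.shtml

The Rule then extracts all chapter links in one pass, and parse_item is called on each of them to extract the page content. This is much more efficient than starting from 1.shtml, because the chapter requests no longer have to be discovered one after another.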
