(4) How to Build a Distributed Crawler with Scrapy: Rule-Based Automatic Crawling and Passing Command-Line Arguments

This article is a repost. 2015-09-15 16:48. Tags: crawler / python / Python Web / crawler framework / crawler classes / CrawlSpider / data scraping / scrapy / big data

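This installment covers rule-based crawling and passing custom arguments on the command line. In my view, a rule-driven crawler is a crawler in the true sense.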

Let us first look at how this kind of crawler works, logically:

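We start from a given URL. After fetching the page we extract all the links on it, then apply a rule (a regular expression) to keep only links of the form we want. We crawl those pages, process them further (extract data or perform some other action), and repeat the cycle until the crawl stops. One potential problem here is crawling the same page more than once. Scrapy already handles this within the framework. In general, the usual way to filter duplicates is to keep a table of addresses and check it before each fetch; if the URL has already been crawled, it is skipped. Another option is an off-the-shelf general-purpose solution, the Bloom filter.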
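The address-table idea can be sketched in a few lines of plain Python. This is only an illustration of the general technique, not how Scrapy implements its duplicate filter; the `crawl` function, the `get_links` callback and the `FAKE_SITE` link graph are all hypothetical names made up for this example.

```python
from collections import deque


def crawl(start_url, get_links, max_pages=100):
    """Breadth-first crawl that never visits the same URL twice."""
    seen = {start_url}            # the "address table" of URLs already queued
    queue = deque([start_url])
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        visited.append(url)       # a real crawler would download and parse here
        for link in get_links(url):
            if link in seen:      # already crawled or queued: filter it out
                continue
            seen.add(link)
            queue.append(link)
    return visited


# Hypothetical link graph standing in for real pages (note the cycles)
FAKE_SITE = {
    'page1': ['page2', 'page3'],
    'page2': ['page1', 'page3'],
    'page3': ['page1'],
}

order = crawl('page1', lambda url: FAKE_SITE.get(url, []))
print(order)
```

A plain set keeps every URL in memory; a Bloom filter trades a small false-positive rate for much lower memory use on very large crawls.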

This time we use CrawlSpider to crawl the information of every group under a Douban tag:

Step 1. Create a new class that inherits from CrawlSpider

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from douban.items import GroupInfo


class MySpider(CrawlSpider):

For more about CrawlSpider, see: http://doc.scrapy.org/en/latest/topics/spiders.html#crawlspider

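Step 2. To pass arguments on the command line, we declare the parameters we want in the class constructor (the __init__ method in the full listing below takes a target parameter).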

On the command line, use it like this:

scrapy crawl douban.xp --logfile=test.log -a target=%E6%96%87%E5%85%B7

This passes the custom argument into the spider.
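The target value in the command above is a percent-encoded tag. As a small sketch using only the standard library, and assuming the tag is encoded as UTF-8, this is one way to produce such a value and assemble the command line (the `tag` and `command` variables are just illustrative names):

```python
from urllib.parse import quote, unquote

tag = '文具'                       # the Douban tag we want to crawl
encoded = quote(tag)               # percent-encode the UTF-8 bytes
print(encoded)

# Decoding gives back the original tag
print(unquote(encoded) == tag)

# Build the command line shown in the article
command = 'scrapy crawl douban.xp --logfile=test.log -a target=%s' % encoded
print(command)
```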

Note in particular the last line of the constructor: super(MySpider, self).__init__()

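If we jump to the definition of CrawlSpider, we see that its constructor calls a private method that compiles the rules variable. If the constructor of our own Spider does not call the parent constructor, the spider fails with an error.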
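The mechanism can be illustrated with a simplified stand-in. The classes below are not Scrapy's source code; `BaseCrawler`, `GoodSpider`, `BrokenSpider` and the `crawl` method are hypothetical and only mimic the pattern of a parent constructor that compiles `rules` into a private attribute.

```python
class BaseCrawler:
    """Simplified stand-in for CrawlSpider; not Scrapy's real source."""
    rules = ()

    def __init__(self):
        self._compile_rules()

    def _compile_rules(self):
        # Copy the user-supplied rules into a private attribute used later
        self._rules = list(self.rules)

    def crawl(self):
        return ['following %s' % rule for rule in self._rules]


class GoodSpider(BaseCrawler):
    def __init__(self, target=None):
        self.rules = ('rule for %s' % target,)
        super().__init__()        # the rules are compiled here


class BrokenSpider(BaseCrawler):
    def __init__(self, target=None):
        self.rules = ('rule for %s' % target,)
        # parent constructor never called, so _rules is never created


print(GoodSpider('books').crawl())
try:
    BrokenSpider('books').crawl()
except AttributeError as exc:
    print('broken spider failed:', exc)
```

In this sketch the rules are assigned before the parent constructor runs, which is also the order used in the article's constructor.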

Step 3. Write the rules:

        self.rules = (
            Rule(LinkExtractor(allow=('/group/explore[?]start=.*?[&]tag=.*?$', ), restrict_xpaths=('//span[@class="next"]')), callback='parse_next_page', follow=True),
        )

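allow defines the form of the links to extract, using regular-expression matching; restrict_xpaths strictly limits link extraction to the specified elements; callback is the function called for each page fetched from an extracted link.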
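To get a feel for what the allow pattern accepts, it can be tried against a few URLs with the standard re module. This is a sketch that assumes the pattern is applied as a search over the absolute URL; the candidate URLs are made-up examples, not links taken from the site.

```python
import re

# The allow pattern from the rule above
ALLOW = r'/group/explore[?]start=.*?[&]tag=.*?$'

# Hypothetical URLs for illustration
candidates = [
    'http://www.douban.com/group/explore?start=20&tag=%E6%96%87%E5%85%B7',
    'http://www.douban.com/group/explore?tag=%E6%96%87%E5%85%B7',
    'http://www.douban.com/group/12345/',
]

matches = [url for url in candidates if re.search(ALLOW, url)]
for url in matches:
    print('would follow:', url)
```

Only the paginated form (with a start parameter followed by tag) matches, which is why the first page is handled separately by parse_start_url in the full listing.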

Step 4. Full code for reference:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from douban.items import GroupInfo


class MySpider(CrawlSpider):
    name = 'douban.xp'
    current = ''
    allowed_domains = ['douban.com']
    def __init__(self, target=None):
        # prefer a tag that is already stored on the class; otherwise remember the new one
        if self.current != '':
            target = self.current
        if target is not None:
            self.current = target
        self.start_urls = [
            'http://www.douban.com/group/explore?tag=%s' % (target)
        ]
        self.rules = (
            Rule(LinkExtractor(allow=('/group/explore[?]start=.*?[&]tag=.*?$', ), restrict_xpaths=('//span[@class="next"]')), callback='parse_next_page', follow=True),
        )
        # call the parent-class constructor, which compiles self.rules
        super(MySpider, self).__init__()

    def parse_next_page(self, response):
        self.logger.info(msg='begin init the page %s ' % response.url)
        list_item = response.xpath('//a[@class="nbg"]')

        # stop if nothing matched (xpath() returns an empty list, never None)
        if not list_item:
            self.logger.info(msg="can't select anything in selector")
            return
        for a_item in list_item:
            item = GroupInfo()
            item['group_url'] = ''.join(a_item.xpath('@href').extract())
            item['group_tag'] = self.current
            item['group_name'] = ''.join(a_item.xpath('@title').extract())
            yield item
    

    def parse_start_url(self, response):
        self.logger.info(msg='begin init the start page %s ' % response.url)
        list_item = response.xpath('//a[@class="nbg"]')

        # stop if nothing matched (xpath() returns an empty list, never None)
        if not list_item:
            self.logger.info(msg="can't select anything in selector")
            return
        for a_item in list_item:
            item = GroupInfo()
            item['group_url'] = ''.join(a_item.xpath('@href').extract())
            item['group_tag'] = self.current
            item['group_name'] = ''.join(a_item.xpath('@title').extract())
            yield item

    def parse_next_page_people(self, response):
        self.logger.info('Hi, this is the next page! %s', response.url)

Step 5. Running it:

scrapy crawl douban.xp --logfile=test.log -a target=%E6%96%87%E5%85%B7

The resulting data:

This installment addressed two main questions:

1. How to pass arguments from the command line

2. How to write a CrawlSpider

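The functionality demonstrated here is fairly limited. In real-world use you need to write further rules and deal with other concerns, such as how to avoid being banned; the next installment will cover that briefly.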
