Scrapy Basics: RedisCrawlSpider

Reposted article, 2017-06-09 13:37

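This spider class inherits from RedisCrawlSpider, which lets it crawl in a distributed fashion. Because it is built on CrawlSpider, it has to follow the Rule conventions, and the callback must not be named parse().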

There is no start_urls here either. It is replaced by redis_key: scrapy-redis pops entries from that key in Redis and uses them as the URLs of the initial requests.
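The hand-off from redis_key to requests can be pictured with a small in-memory stand-in. This is only a sketch: `FakeRedisList`, `next_start_urls` and the batch size are hypothetical names made up for illustration, a `deque` takes the place of the Redis server, and URLs are assumed to be taken from the head of the list. The real scrapy-redis internals differ in detail.

```python
from collections import deque


class FakeRedisList:
    """In-memory stand-in for one Redis list (hypothetical, illustration only)."""

    def __init__(self):
        self._items = deque()

    def lpush(self, *values):
        # Like Redis LPUSH: each value is inserted at the head, in the order given.
        for value in values:
            self._items.appendleft(value)
        return len(self._items)

    def lpop(self):
        # Like Redis LPOP: remove and return the head, or None when the list is empty.
        return self._items.popleft() if self._items else None


def next_start_urls(queue, batch_size=2):
    """Pop up to batch_size URLs; each one would become a request URL."""
    urls = []
    for _ in range(batch_size):
        url = queue.lpop()
        if url is None:
            break  # nothing left: a waiting spider would simply stay idle
        urls.append(url)
    return urls


queue = FakeRedisList()
queue.lpush("http://example.com/a")
queue.lpush("http://example.com/b")
first_batch = next_start_urls(queue)   # both URLs, most recently pushed first
second_batch = next_start_urls(queue)  # empty: the key has been drained
```

Because entries are popped rather than read, each URL is handed to exactly one of the waiting spider processes, which is what makes the shared key usable for distributed crawling.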

from scrapy.spiders import Rule
from scrapy.linkextractors import LinkExtractor
from scrapy_redis.spiders import RedisCrawlSpider


class MyCrawler(RedisCrawlSpider):
    name = 'mycrawler_redis'
    redis_key = 'mycrawler:start_urls'

    rules = (
        # follow all links
        Rule(LinkExtractor(), callback='parse_page', follow=True),
    )

    # __init__ must follow this pattern; when reusing it, only the class name
    # passed to super() needs to change
    def __init__(self, *args, **kwargs):
        # Dynamically define the allowed domains list.
        domain = kwargs.pop('domain', '')
        # list() keeps this a real list on Python 3, where filter() is lazy
        self.allowed_domains = list(filter(None, domain.split(',')))
        # Change the class name here to the current class
        super(MyCrawler, self).__init__(*args, **kwargs)

    def parse_page(self, response):
        return {
            'name': response.css('title::text').extract_first(),
            'url': response.url,
        }
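The handling of the `domain` argument in `__init__` can be tried in isolation. The following is a minimal sketch, assuming Python 3 and a comma-separated string such as one passed with Scrapy's `-a domain=...` spider argument; `parse_domains` is a hypothetical helper name, not part of scrapy-redis.

```python
def parse_domains(domain_arg):
    """Split a comma-separated domain string and drop empty entries."""
    return [d for d in domain_arg.split(",") if d]


# A stray double comma is tolerated because empty pieces are filtered out.
allowed = parse_domains("dmoz.org,,example.com")

# With no domain given, the result is an empty list.
none_given = parse_domains("")
```

The list comprehension does the same filtering as `filter(None, ...)` but always returns a real list, which matters on Python 3 where `filter()` returns a one-shot iterator.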

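Notes:

Likewise, a RedisCrawlSpider class does not need to declare allowed_domains or start_urls:

The crawl domain scope is defined dynamically in the constructor __init__(); alternatively, allowed_domains can be written directly on the class.
redis_key must be specified. It names the Redis key the spider waits on for its start URLs, for example: redis_key = 'myspider:start_urls'
Following that format, the start URLs are pushed into the Redis database with lpush from redis-cli on the Master, and the spider fetches its start URLs from there.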
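The push can also be done from Python instead of redis-cli. The sketch below assumes a client object exposing `lpush(name, *values)`, which is the shape of the redis-py method of that name; `seed_start_urls` and `RecordingClient` are hypothetical names, and the recording client exists only so the example runs without a Redis server.

```python
def seed_start_urls(client, redis_key, urls):
    """LPUSH the URLs onto redis_key; returns the list length after the push."""
    return client.lpush(redis_key, *urls)


class RecordingClient:
    """Hypothetical stand-in that records pushes instead of talking to Redis."""

    def __init__(self):
        self.lists = {}

    def lpush(self, name, *values):
        items = self.lists.setdefault(name, [])
        for value in values:
            items.insert(0, value)  # LPUSH inserts each value at the head
        return len(items)


client = RecordingClient()
length = seed_start_urls(client, "mycrawler:start_urls", ["http://www.dmoz.org/"])
```

With the redis-py package installed and a Redis server reachable, a `redis.Redis()` instance could be passed as `client` in place of the stand-in. The key must match the spider's redis_key exactly, otherwise the waiting spiders never see the URL.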

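How to run:

Run the spider's .py file with runspider (it can be launched several times to start multiple instances). The spider or spiders then sit idle, waiting for URLs: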

scrapy runspider mycrawler_redis.py
In redis-cli on the Master, enter the push command, for example:

$redis > lpush mycrawler:start_urls http://www.dmoz.org/
The spider picks up the URL and starts crawling.
