Reference: http://www.cnblogs.com/rwxwsblog/p/4578764.html
Think back: the experiments and examples so far have all used a single spider. In real development, however, a crawler project will almost certainly contain more than one. That raises two questions: 1. How do you create multiple spiders in the same project? 2. How do you run several spiders together?
Note: this article builds on the previous articles and experiments. If you missed them, or anything here is unclear, you can find the background in:
Scrapy crawler growth diary: creating a project, extracting data, and saving it as JSON
I. Creating spiders
1. Create multiple spiders: scrapy genspider spidername domain
scrapy genspider CnblogsHomeSpider cnblogs.com
The command above creates a spider named CnblogsHomeSpider whose start_urls is http://www.cnblogs.com/.
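For reference, the generated file under spiders/ looks roughly like the sketch below; the exact class name and field formatting depend on the Scrapy template version, so treat this only as an illustration.

# -*- coding: utf-8 -*-
# Rough sketch of what `scrapy genspider CnblogsHomeSpider cnblogs.com`
# produces with the default template (details may differ between versions).
import scrapy

class CnblogshomespiderSpider(scrapy.Spider):
    name = "CnblogsHomeSpider"
    allowed_domains = ["cnblogs.com"]
    start_urls = (
        'http://www.cnblogs.com/',
    )

    def parse(self, response):
        pass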
2. List the spiders in the project: scrapy list
[root@bogon cnblogs]# scrapy list
CnblogsHomeSpider
CnblogsSpider
So my project contains two spiders, one named CnblogsHomeSpider and the other named CnblogsSpider.
For more about Scrapy commands, see: http://doc.scrapy.org/en/latest/topics/commands.html
II. Running several spiders at the same time
Our project now has two spiders, so how do we run both of them? You might suggest a shell script that calls them one by one, or a Python script that runs them in turn. Indeed, on stackoverflow.com I saw quite a few people doing exactly that. The official documentation, however, describes the following approaches.
1. Run Scrapy from a script
import scrapy
from scrapy.crawler import CrawlerProcess

class MySpider(scrapy.Spider):
    # Your spider definition
    ...

process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
})

process.crawl(MySpider)
process.start()  # the script will block here until the crawling is finished
The key here is scrapy.crawler.CrawlerProcess, which lets you run a spider inside a script. More examples are available at: https://github.com/scrapinghub/testspiders
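As a minimal sketch (not from the official example above), the same approach can run one of this project's spiders by name, assuming the script is executed inside the Scrapy project so that get_project_settings() can find settings.py:

# Minimal sketch: run one of this project's spiders by name from a script.
# Assumes the script runs inside the project so get_project_settings() works.
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

process = CrawlerProcess(get_project_settings())
process.crawl('CnblogsHomeSpider')  # crawl() also accepts a spider name
process.start()                     # blocks until the crawl is finished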
2. Running multiple spiders in the same process
- Using CrawlerProcess
import scrapy
from scrapy.crawler import CrawlerProcess

class MySpider1(scrapy.Spider):
    # Your first spider definition
    ...

class MySpider2(scrapy.Spider):
    # Your second spider definition
    ...

process = CrawlerProcess()
process.crawl(MySpider1)
process.crawl(MySpider2)
process.start()  # the script will block here until all crawling jobs are finished
- Using CrawlerRunner
import scrapy
from twisted.internet import reactor
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging

class MySpider1(scrapy.Spider):
    # Your first spider definition
    ...

class MySpider2(scrapy.Spider):
    # Your second spider definition
    ...

configure_logging()
runner = CrawlerRunner()
runner.crawl(MySpider1)
runner.crawl(MySpider2)
d = runner.join()
d.addBoth(lambda _: reactor.stop())

reactor.run()  # the script will block here until all crawling jobs are finished
- Using CrawlerRunner and chaining deferreds to run the spiders sequentially
import scrapy
from twisted.internet import reactor, defer
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging

class MySpider1(scrapy.Spider):
    # Your first spider definition
    ...

class MySpider2(scrapy.Spider):
    # Your second spider definition
    ...

configure_logging()
runner = CrawlerRunner()

@defer.inlineCallbacks
def crawl():
    yield runner.crawl(MySpider1)
    yield runner.crawl(MySpider2)
    reactor.stop()

crawl()
reactor.run()  # the script will block here until the last crawl call is finished
These are the approaches the official documentation provides for running spiders from a script.
III. Running spiders via a custom Scrapy command
1. Create a commands directory
mkdir commands
Note: the commands directory sits at the same level as the spiders directory.
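After the steps below, the project layout should look roughly like this (file names other than commands/ and spiders/ are just the standard Scrapy project files and may vary):

cnblogs/
├── scrapy.cfg
└── cnblogs/
    ├── __init__.py
    ├── items.py
    ├── pipelines.py
    ├── settings.py
    ├── commands/
    │   ├── __init__.py
    │   └── crawlall.py
    └── spiders/
        ├── __init__.py
        └── ...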
2. Add a file named crawlall.py under commands
The idea is to adapt Scrapy's built-in crawl command so that it runs all spiders at once. The source of crawl can be found here: https://github.com/scrapy/scrapy/blob/master/scrapy/commands/crawl.py
from scrapy.commands import ScrapyCommand
from scrapy.crawler import CrawlerRunner
from scrapy.exceptions import UsageError
from scrapy.utils.conf import arglist_to_dict

class Command(ScrapyCommand):

    requires_project = True

    def syntax(self):
        return '[options]'

    def short_desc(self):
        return 'Runs all of the spiders'

    def add_options(self, parser):
        ScrapyCommand.add_options(self, parser)
        parser.add_option("-a", dest="spargs", action="append", default=[], metavar="NAME=VALUE",
                          help="set spider argument (may be repeated)")
        parser.add_option("-o", "--output", metavar="FILE",
                          help="dump scraped items into FILE (use - for stdout)")
        parser.add_option("-t", "--output-format", metavar="FORMAT",
                          help="format to use for dumping items with -o")

    def process_options(self, args, opts):
        ScrapyCommand.process_options(self, args, opts)
        try:
            opts.spargs = arglist_to_dict(opts.spargs)
        except ValueError:
            raise UsageError("Invalid -a value, use -a NAME=VALUE", print_help=False)

    def run(self, args, opts):
        # settings = get_project_settings()
        spider_loader = self.crawler_process.spider_loader
        for spidername in args or spider_loader.list():
            print "*********crawlall spidername************" + spidername
            self.crawler_process.crawl(spidername, **opts.spargs)
        self.crawler_process.start()
The key points are self.crawler_process.spider_loader.list(), which returns every spider in the project, and self.crawler_process.crawl(), which schedules each of them to run.
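For illustration only (this snippet is not part of the command itself), the same SpiderLoader can be queried directly to see which spider names crawlall would pick up; it assumes it is run inside the project so the settings can be loaded:

# Illustration: the names crawlall iterates over come from the project's
# SpiderLoader, the same source that backs `scrapy list`.
from scrapy.utils.project import get_project_settings
from scrapy.spiderloader import SpiderLoader

loader = SpiderLoader.from_settings(get_project_settings())
print(loader.list())  # e.g. ['CnblogsHomeSpider', 'CnblogsSpider']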
3. Add an __init__.py file under the commands directory
touch __init__.py
Note: this step must not be skipped. I wasted an entire day on this very problem; blame it on being self-taught, I suppose.
If you omit it, you will get an exception like this:
Traceback (most recent call last):
  File "/usr/local/bin/scrapy", line 9, in <module>
    load_entry_point('Scrapy==1.0.0rc2', 'console_scripts', 'scrapy')()
  File "/usr/local/lib/python2.7/site-packages/Scrapy-1.0.0rc2-py2.7.egg/scrapy/cmdline.py", line 122, in execute
    cmds = _get_commands_dict(settings, inproject)
  File "/usr/local/lib/python2.7/site-packages/Scrapy-1.0.0rc2-py2.7.egg/scrapy/cmdline.py", line 50, in _get_commands_dict
    cmds.update(_get_commands_from_module(cmds_module, inproject))
  File "/usr/local/lib/python2.7/site-packages/Scrapy-1.0.0rc2-py2.7.egg/scrapy/cmdline.py", line 29, in _get_commands_from_module
    for cmd in _iter_command_classes(module):
  File "/usr/local/lib/python2.7/site-packages/Scrapy-1.0.0rc2-py2.7.egg/scrapy/cmdline.py", line 20, in _iter_command_classes
    for module in walk_modules(module_name):
  File "/usr/local/lib/python2.7/site-packages/Scrapy-1.0.0rc2-py2.7.egg/scrapy/utils/misc.py", line 63, in walk_modules
    mod = import_module(path)
  File "/usr/local/lib/python2.7/importlib/__init__.py", line 37, in import_module
    __import__(name)
ImportError: No module named commands
At first I simply could not find the cause, and it cost me an entire day; in the end some helpful people on http://stackoverflow.com/ pointed it out. Thanks again to the almighty Internet - how much nicer things would be without that wall! But I digress; back to the topic.
4. Create setup.py in the same directory as settings.py (this step can be dropped without any effect; I am not sure why the official documentation includes it).
from setuptools import setup, find_packages

setup(name='scrapy-mymodule',
      entry_points={
          'scrapy.commands': [
              'crawlall=cnblogs.commands:crawlall',
          ],
      },
)
This file defines a crawlall command: cnblogs.commands is the package containing the command module, and crawlall is the command name.
5. Add the following setting to settings.py:
COMMANDS_MODULE = 'cnblogs.commands'
6. Run the command: scrapy crawlall
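Run it from the project root; every spider returned by scrapy list is scheduled into the same process. Arguments given with -a are passed to every spider (the value below is purely hypothetical and only useful if your spiders accept such an argument):

[root@bogon cnblogs]# scrapy crawlall
[root@bogon cnblogs]# scrapy crawlall -a tag=python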