from scrapy.commands import ScrapyCommand

# Resume an interrupted crawl: scrapy crawl spider_name -s JOBDIR=crawls/spider_name
# Run this command with:       scrapy crawlall
class Command(ScrapyCommand):
    requires_project = True

    def syntax(self):
        return '[options]'

    def short_desc(self):
        return 'Runs all of the spiders'

    def run(self, args, opts):
        # spider_loader knows every spider registered in the project
        # (the older crawler_process.spiders alias is deprecated)
        spider_list = self.crawler_process.spider_loader.list()
        for name in spider_list:
            self.crawler_process.crawl(name, **opts.__dict__)
        # start() blocks until all scheduled crawls have finished
        self.crawler_process.start()
Running multiple spiders at the same time

Create a commands package inside the project and add a crawlall.py file containing the Command class above.
To make scrapy crawlall available, set COMMANDS_MODULE = 'project.commands' in settings.py.
Then run: scrapy crawlall

How it works: the command uses the user-initialized crawler_process to load the project's spider loader, collects the names of all registered spiders, and then iterates over that list, scheduling a crawl for each name before starting them all together. A minimal sketch of the wiring is shown below.
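A minimal sketch of the setting and directory layout, assuming the project package is named myproject (hypothetical; substitute your own package name, and note the commands folder needs an __init__.py so Python treats it as a package):

    # myproject/settings.py
    COMMANDS_MODULE = 'myproject.commands'

    # Expected layout:
    # myproject/
    #     __init__.py
    #     settings.py
    #     commands/
    #         __init__.py
    #         crawlall.py   <- contains the Command class above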
Resuming an interrupted crawl

Run this command in the terminal:

    scrapy crawl spider_name -s JOBDIR=crawls/spider_name

Scrapy records the crawl state (the scheduled request queue and the seen-request fingerprints) under the crawls directory; stop the spider gracefully (press Ctrl-C once) and rerun the exact same command to resume from the checkpoint.

For details, see the developer documentation:
https://doc.scrapy.org/en/latest/topics/jobs.html?highlight=jobdir
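The same JOBDIR mechanism also persists a spider's state dict between runs, as described in the jobs documentation linked above. A minimal sketch of that pattern (the spider name and URL here are hypothetical):

    import scrapy

    class ResumableSpider(scrapy.Spider):
        # run with: scrapy crawl resumable -s JOBDIR=crawls/resumable
        name = 'resumable'
        start_urls = ['https://example.com']

        def parse(self, response):
            # self.state is saved into JOBDIR on shutdown and
            # restored on the next run that uses the same JOBDIR
            self.state['pages_seen'] = self.state.get('pages_seen', 0) + 1
            yield {'url': response.url, 'pages_seen': self.state['pages_seen']}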
