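Question: When running Scrapy, how do you start spiders in a specific order?
Background: Spider A scrapes dynamic proxy IPs, and spider B uses the proxy IPs collected by A to disguise itself while crawling the target site. A must therefore finish before B starts. How do you guarantee that?
IDE: PyCharm
Version: Python 3
Framework: Scrapy
OS: Windows 10
The code is as follows (adapt it to your own project):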
#!/usr/bin/env python3
# -*- coding:utf-8 -*-

from scrapy import cmdline  # only needed by the single-spider variant at the bottom
from twisted.internet import reactor, defer
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
from torrentSpider.spiders.proxy_ip_spider import ProxyIpSpider
from torrentSpider.spiders.douban_spider import DoubanSpider
from scrapy.utils.project import get_project_settings

'''
Running several spiders one after another
'''
configure_logging()
# Pass in the project settings, otherwise the project configuration does not take effect.
# get_project_settings() loads the configuration from settings.py.
runner = CrawlerRunner(get_project_settings())


@defer.inlineCallbacks
def crawl():
    # Each yield waits until the previous crawl has finished,
    # so ProxyIpSpider always completes before DoubanSpider starts.
    yield runner.crawl(ProxyIpSpider)
    yield runner.crawl(DoubanSpider)
    reactor.stop()


crawl()
reactor.run()  # the script will block here until the last crawl call is finished

'''
Running a single spider
'''
# def execute():
#     cmdline.execute(['scrapy', 'crawl', 'proxy_ip_spider'])
#
#
# execute()
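
The commented-out single-spider variant cannot simply be called twice to get the same ordering: as far as I know, cmdline.execute() exits the interpreter once the command finishes, so a second call in the same script never runs. If you prefer the command-line route over CrawlerRunner, one possible alternative is to launch each `scrapy crawl` as its own child process and wait for it before starting the next. The sketch below is only an illustration under some assumptions: the helper run_spiders_in_order and its run_command parameter are hypothetical names, the spider name 'douban_spider' is guessed from the module name, and the script is assumed to run from the project directory with the scrapy command on the PATH.

```python
import subprocess
from typing import Callable, List, Optional, Sequence


def run_spiders_in_order(
    spider_names: Sequence[str],
    run_command: Optional[Callable[[List[str]], int]] = None,
) -> List[str]:
    """Run `scrapy crawl <name>` for each name, strictly one after another.

    Stops at the first spider whose process returns a non-zero exit code,
    so a later spider never starts if an earlier one failed.
    Returns the names of the spiders that finished successfully.
    """
    if run_command is None:
        # Default: start a real child process and block until it exits.
        def run_command(argv: List[str]) -> int:
            return subprocess.run(argv).returncode

    finished: List[str] = []
    for name in spider_names:
        exit_code = run_command(["scrapy", "crawl", name])
        if exit_code != 0:
            break  # do not start the remaining spiders
        finished.append(name)
    return finished


if __name__ == "__main__":
    # 'proxy_ip_spider' comes from the code above; 'douban_spider' is an assumed name.
    done = run_spiders_in_order(["proxy_ip_spider", "douban_spider"])
    print("finished:", done)
```

Compared with the CrawlerRunner version, this starts a fresh Python process per spider, which is slower but keeps the spiders fully isolated from each other. The run_command parameter exists only so the ordering logic can be tried without actually launching Scrapy.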