# Running several spiders at once from a script
Directory structure:
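For reference, the layout looks roughly like this (a sketch assuming a standard Scrapy project; the project name `scrapydoubanmovie` is inferred from the `ScrapydoubanmovieItem` class used below):

```text
scrapydoubanmovie/
├── scrapy.cfg
└── scrapydoubanmovie/
    ├── items.py
    ├── settings.py
    ├── run.py              # created in step 2, next to items.py
    └── spiders/
        ├── __init__.py
        ├── HeadersHelper.py
        ├── test.py         # TestSpider
        └── test2.py        # Test2Spider
```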
1. Create two spiders that already run correctly from the command line, e.g. TestSpider and Test2Spider (a minimal sketch of such a spider is shown below).
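For context, test.py might contain something like the following minimal sketch. The start URL and parsing logic are placeholders, not from the original project (which scrapes Douban movie pages); only the spider name `Test` matters, since it is what the `scrapy crawl Test` command in step 3 refers to:

```python
# spiders/test.py -- minimal sketch; URL and parse logic are placeholders
import scrapy

class TestSpider(scrapy.Spider):
    name = "Test"  # the name used with `scrapy crawl Test`
    start_urls = ["https://movie.douban.com/top250"]  # assumed example URL

    def parse(self, response):
        # placeholder: yield the page title as a bare dict item
        yield {"title": response.css("title::text").get()}
```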
2. Create a run.py file in the same directory as items.py. There are three ways to write it; pick any one. The code for each follows:
Approach 1: run several spiders at once with CrawlerProcess
Source code of run_by_CrawlerProcess.py:
```python
# Run several spiders at once with CrawlerProcess
from scrapy.crawler import CrawlerProcess
# Helper that loads the project settings
from scrapy.utils.project import get_project_settings
# Import the spider classes (the spiders you created)
from spiders.test import TestSpider
from spiders.test2 import Test2Spider

# get_project_settings() is required here; without it requests fail with
# "HTTP status code is not handled or not allowed"
process = CrawlerProcess(get_project_settings())
process.crawl(TestSpider)   # each spider class must be imported above
process.crawl(Test2Spider)
process.start()
```
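Incidentally, `crawl()` also accepts a spider's name as a string (e.g. `process.crawl("Test")`) when the process is created with the project settings, since the spider loader can then resolve the name; it is the class-based form above that requires the explicit imports and the import changes described in step 3.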
Approach 2: run several spiders at once with CrawlerRunner
Source code of run_by_CrawlerRunner.py:
```python
# Run several spiders at once with CrawlerRunner
from twisted.internet import reactor
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
# Helper that loads the project settings
from scrapy.utils.project import get_project_settings
# Import the spider classes (the spiders you created)
from spiders.test import TestSpider
from spiders.test2 import Test2Spider

configure_logging()
# get_project_settings() is required here; without it requests fail with
# "HTTP status code is not handled or not allowed"
runner = CrawlerRunner(get_project_settings())
runner.crawl(TestSpider)
runner.crawl(Test2Spider)
d = runner.join()
d.addBoth(lambda _: reactor.stop())
reactor.run()  # the script will block here until all crawling jobs are finished
```
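The difference from approach 1 is who owns the Twisted reactor: CrawlerProcess starts and stops the reactor for you, whereas CrawlerRunner leaves that to the calling script, which is why this version must call reactor.run() itself and stop the reactor once runner.join() signals that all crawls have finished. CrawlerRunner is the better fit when run.py lives inside an application that already runs its own reactor.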
Approach 3: run the spiders one after another with CrawlerRunner by chaining deferreds
Source code of run_by_CrawlerRunner_and_Deferred.py:
```python
# Run the spiders sequentially with CrawlerRunner by chaining deferreds
from twisted.internet import reactor, defer
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
# Helper that loads the project settings
from scrapy.utils.project import get_project_settings
# Import the spider classes (the spiders you created)
from spiders.test import TestSpider
from spiders.test2 import Test2Spider

configure_logging()
# get_project_settings() is required here; without it requests fail with
# "HTTP status code is not handled or not allowed"
runner = CrawlerRunner(get_project_settings())

@defer.inlineCallbacks
def crawl():
    yield runner.crawl(TestSpider)
    yield runner.crawl(Test2Spider)
    reactor.stop()

crawl()
reactor.run()  # the script will block here until the last crawl call is finished
```
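Note that unlike approaches 1 and 2, which start both crawls in parallel inside one reactor, the inlineCallbacks generator here waits for each runner.crawl() deferred to fire before starting the next, so the spiders run one after another. This is useful when the spiders would otherwise compete for the same resources.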
3. In both spider files, change how items.py and external helper classes (e.g. HeadersHelper.py) are imported, treating the directory that contains run.py as the import root.
Original imports:
```python
from ..items import ScrapydoubanmovieItem
from .HeadersHelper import HeadersHelper
```
Note: with these imports the spider runs normally from the command line via `scrapy crawl Test`.
Change them to:
```python
from items import ScrapydoubanmovieItem
from .HeadersHelper import HeadersHelper
```
Note: after this change `scrapy crawl Test` fails on the command line, but running run.py starts both spiders at once. The reason is that run.py imports the spiders as `spiders.test`, which makes `spiders` a top-level package: the relative `..items` import then has no parent package to climb into, while the plain `items` import works because run.py's directory is on sys.path.
4. Run run.py the way you would run any Python file (e.g. `python run.py`), and you will get the results of both spiders.