Scrapy not only provides the scrapy crawl <spider> command for launching a spider, it also exposes an API for starting crawls from your own script.
Scrapy is built on the Twisted asynchronous networking library, so the crawl has to run inside the Twisted reactor.
Two API classes can run spiders: scrapy.crawler.CrawlerProcess and scrapy.crawler.CrawlerRunner.
scrapy.crawler.CrawlerProcess
This class starts the Twisted reactor internally, configures logging, and shuts the reactor down automatically when the crawl finishes; it is the class used by all Scrapy commands.
Example: running a single spider
import scrapy


class QiushispiderSpider(scrapy.Spider):
    name = 'qiushiSpider'
    # allowed_domains = ['qiushibaike.com']
    start_urls = ['https://tianqi.2345.com/']

    def start_requests(self):
        return [scrapy.Request(url=self.start_urls[0], callback=self.parse)]

    def parse(self, response):
        print('proxy simida')


if __name__ == '__main__':
    from scrapy.crawler import CrawlerProcess

    process = CrawlerProcess()
    process.crawl(QiushispiderSpider)  # or the spider name 'qiushiSpider'
    process.start()
The argument to process.crawl() can be either the spider name 'qiushiSpider' or the spider class QiushispiderSpider.
Note that called this way, CrawlerProcess does not pick up the spider's settings file; the log shows that no settings were overridden:
2019-05-27 14:39:57 [scrapy.crawler] INFO: Overridden settings: {}
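When the script does not live inside a Scrapy project, settings can also be passed to CrawlerProcess as a plain dict. A minimal sketch, assuming the QiushispiderSpider class from the example above; the setting values are illustrative only:

from scrapy.crawler import CrawlerProcess

# Settings passed as a plain dict override the defaults;
# these values are illustrative, not from the original post.
process = CrawlerProcess(settings={
    'LOG_LEVEL': 'INFO',
    'DOWNLOAD_DELAY': 1.0,
})
process.crawl(QiushispiderSpider)
process.start()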
Loading the project settings
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

process = CrawlerProcess(get_project_settings())
process.crawl(QiushispiderSpider)  # or the spider name 'qiushiSpider'
process.start()
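get_project_settings() resolves the settings through scrapy.cfg (or the SCRAPY_SETTINGS_MODULE environment variable), so the script should be run from inside the project directory; that is also what lets process.crawl() look spiders up by name.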
Running multiple spiders
import scrapy
from scrapy.crawler import CrawlerProcess


class MySpider1(scrapy.Spider):
    ...


class MySpider2(scrapy.Spider):
    ...


process = CrawlerProcess()
process.crawl(MySpider1)
process.crawl(MySpider2)
process.start()
scrapy.crawler.CrawlerRunner
1. Gives finer control over the crawling process.
2. The Twisted reactor is started and stopped explicitly by your code.
3. Callbacks must be added to the Deferred returned by CrawlerRunner.crawl.
Example: running a single spider
import scrapy


class QiushispiderSpider(scrapy.Spider):
    name = 'qiushiSpider'
    # allowed_domains = ['qiushibaike.com']
    start_urls = ['https://tianqi.2345.com/']

    def start_requests(self):
        return [scrapy.Request(url=self.start_urls[0], callback=self.parse)]

    def parse(self, response):
        print('proxy simida')


if __name__ == '__main__':
    # test CrawlerRunner
    from twisted.internet import reactor
    from scrapy.crawler import CrawlerRunner
    from scrapy.utils.log import configure_logging
    from scrapy.utils.project import get_project_settings

    configure_logging({'LOG_FORMAT': '%(levelname)s: %(message)s'})
    runner = CrawlerRunner(get_project_settings())

    d = runner.crawl(QiushispiderSpider)
    d.addBoth(lambda _: reactor.stop())
    reactor.run()  # the script will block here until the crawling is finished
configure_logging sets up the log output format.
addBoth adds a callback to the Deferred returned by runner.crawl() that stops the Twisted reactor once the crawl ends, whether it succeeded or failed.
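If success and failure need different handling, addCallback and addErrback can be chained ahead of the final addBoth. A minimal sketch (not from the original post), reusing the QiushispiderSpider class defined above:

from twisted.internet import reactor
from scrapy.crawler import CrawlerRunner
from scrapy.utils.project import get_project_settings

runner = CrawlerRunner(get_project_settings())
d = runner.crawl(QiushispiderSpider)                           # spider class from above
d.addCallback(lambda _: print('crawl finished cleanly'))       # fires on success only
d.addErrback(lambda failure: print('crawl failed:', failure))  # fires on error only
d.addBoth(lambda _: reactor.stop())                            # fires in either case
reactor.run()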
Running multiple spiders
import scrapy
from twisted.internet import reactor
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging


class MySpider1(scrapy.Spider):
    ...


class MySpider2(scrapy.Spider):
    ...


configure_logging()
runner = CrawlerRunner()
runner.crawl(MySpider1)
runner.crawl(MySpider2)
d = runner.join()
d.addBoth(lambda _: reactor.stop())
reactor.run()  # the script will block here until all crawling jobs are finished
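runner.join() returns a Deferred that fires only after every crawl started through this runner has finished, so stopping the reactor in its callback ends the script once both spiders are done.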
The Deferreds can also be chained with inlineCallbacks, which runs the spiders sequentially (MySpider2 starts only after MySpider1 finishes):
import scrapy
from twisted.internet import reactor, defer
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging


class MySpider1(scrapy.Spider):
    ...


class MySpider2(scrapy.Spider):
    ...


configure_logging()
runner = CrawlerRunner()


@defer.inlineCallbacks
def crawl():
    yield runner.crawl(MySpider1)
    yield runner.crawl(MySpider2)
    reactor.stop()


crawl()
reactor.run()  # the script will block here until the last crawl call is finished