How to start Scrapy spiders from a script


As is well known, running scrapy crawl yourspidername on the command line starts the spider named yourspidername in your project. From a Python script, you can invoke the same command line through Scrapy's cmdline module:

$ cat yourspider1start.py
import sys
import os
import subprocess

from scrapy import cmdline

# Note: the four methods below are alternatives, not steps to run in sequence;
# in particular, cmdline.execute() exits the process once the command finishes.

# Method 1: pass the full command line to cmdline.execute
cmdline.execute('scrapy crawl yourspidername'.split())

# Method 2: set sys.argv and let cmdline.execute pick it up
sys.argv = ['scrapy', 'crawl', 'yourspidername']
cmdline.execute()

# Method 3: spawn a child process to run the external command.
# os.system only returns the command's exit status; 0 means success.
os.system('scrapy crawl yourspidername')

# Method 4: spawn a child process with subprocess
subprocess.Popen('scrapy crawl yourspidername'.split())

Between methods 3 and 4, subprocess is the recommended one:

The subprocess module intends to replace several older modules and functions, such as:

os.system
os.spawn*
os.popen*
popen2.*
commands.*

The poll() method of the Popen object it returns tells you whether the child process has finished.
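As an example, here is a minimal sketch of launching a crawl through subprocess and polling it until it exits (the spider name is just a placeholder):

import subprocess
import time

# Launch the crawl as a child process; passing the command as a list avoids going through a shell.
proc = subprocess.Popen(['scrapy', 'crawl', 'yourspidername'])

# poll() returns None while the child is still running,
# and its exit code (0 on success) once it has finished.
while proc.poll() is None:
    time.sleep(1)

print('crawl finished with return code', proc.returncode)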

We can also launch all the spiders directly from a shell script, starting a new round every 2 seconds:

$ cat startspiders.sh
#!/usr/bin/env bash
# Usage: ./startspiders.sh <number_of_rounds>
count=0
while [ "$count" -lt "$1" ];
do
  sleep 2 
  nohup python yourspider1start.py >/dev/null 2>&1 &
  nohup python yourspider2start.py >/dev/null 2>&1 &
  let count+=1
done
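The same staggered launch can also be done from Python with subprocess; here is a minimal sketch, assuming the two launcher scripts above sit next to it:

import subprocess
import sys
import time

# Start both launcher scripts <rounds> times, two seconds apart, mirroring startspiders.sh.
rounds = int(sys.argv[1])
for _ in range(rounds):
    time.sleep(2)
    subprocess.Popen([sys.executable, 'yourspider1start.py'],
                     stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    subprocess.Popen([sys.executable, 'yourspider2start.py'],
                     stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)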

All of the methods above essentially just invoke the scrapy command line. How can we start spiders programmatically, by calling Scrapy's internal API?

The official documentation provides two Scrapy utilities:

  1. scrapy.crawler.CrawlerRunner, which runs crawlers inside an already set-up Twisted reactor
  2. scrapy.crawler.CrawlerProcess, whose parent class is CrawlerRunner and which also manages the Twisted reactor for you

The Scrapy framework is built on the Twisted asynchronous networking library; CrawlerRunner and CrawlerProcess help us start Scrapy from inside the Twisted reactor.

Using CrawlerRunner directly gives you finer-grained control over the crawl, but you have to register the callback that shuts down the Twisted reactor yourself. If your application is not going to run another Twisted reactor of its own, the subclass CrawlerProcess is the more suitable choice.

The following are brief usage examples based on those given in the documentation:

# encoding: utf-8
__author__ = 'fengshenjie'
from twisted.internet import reactor
from scrapy.utils.project import get_project_settings

def run1_single_spider():
    '''Running a spider outside a project:
    only the spider itself runs; the project pipelines are not enabled.'''
    from scrapy.crawler import CrawlerProcess
    from scrapy_test1.spiders import myspider1
    process = CrawlerProcess({
        'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
    })

    process.crawl(myspider1)  # expects a Spider subclass (a spider name string also works when project settings are loaded)
    process.start()  # the script will block here until the crawling is finished

def run2_inside_scrapy():
    '''Uses the project settings, so the pipelines are enabled.'''
    from scrapy.crawler import CrawlerProcess
    process = CrawlerProcess(get_project_settings())
    process.crawl('spidername')  # the name attribute of a spider in the Scrapy project
    process.start()

def spider_closing(arg):
    print('spider close')
    reactor.stop()

def run3_crawlerRunner():
    '''If your application already uses Twisted, prefer CrawlerRunner over CrawlerProcess.
    Note that you will also have to shut down the Twisted reactor yourself after the
    spider is finished. This can be achieved by adding callbacks to the deferred
    returned by the CrawlerRunner.crawl method.
    '''
    from scrapy.crawler import CrawlerRunner
    runner = CrawlerRunner(get_project_settings())

    # 'spidername' is the name of one of the spiders of the project.
    d = runner.crawl('spidername')
    
    # stop reactor when spider closes
    # d.addBoth(lambda _: reactor.stop())
    d.addBoth(spider_closing)  # equivalent to the commented-out lambda above

    reactor.run()  # the script will block here until the crawling is finished

def run4_multiple_spider():
    from scrapy.crawler import CrawlerProcess
    process = CrawlerProcess()

    from scrapy_test1.spiders import myspider1, myspider2
    for s in [myspider1, myspider2]:
        process.crawl(s)
    process.start()

def run5_multiplespider():
    '''using CrawlerRunner'''
    from twisted.internet import reactor
    from scrapy.crawler import CrawlerRunner
    from scrapy.utils.log import configure_logging

    configure_logging()
    runner = CrawlerRunner()
    from scrapy_test1.spiders import myspider1, myspider2
    for s in [myspider1, myspider2]:
        runner.crawl(s)

    d = runner.join()
    d.addBoth(lambda _: reactor.stop())

    reactor.run()  # the script will block here until all crawling jobs are finished

def run6_multiplespider():
    '''Run the spiders sequentially by chaining the deferreds.'''
    from twisted.internet import reactor, defer
    from scrapy.crawler import CrawlerRunner
    from scrapy.utils.log import configure_logging
    configure_logging()
    runner = CrawlerRunner()

    @defer.inlineCallbacks
    def crawl():
        from scrapy_test1.spiders import myspider1, myspider2
        for s in [myspider1, myspider2]:
            yield runner.crawl(s)
        reactor.stop()

    crawl()
    reactor.run()  # the script will block here until the last crawl call is finished


if __name__=='__main__':
    # run4_multiple_spider()
    # run5_multiplespider()
    run6_multiplespider()

References

  1. Running Scrapy spiders programmatically (based on Scrapy 1.0)

