A brief introduction to the options of the scrapy shell command

Reposted article; originally posted 2018-07-15 13:19. Tag: Scrapy

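While testing a website with scrapy shell, the site returned 400 Bad Request, so the next step was to change the User-Agent request header and try again. (In the commands and logs below, 某網站 is a placeholder for the site's name.)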

DEBUG: Crawled (400) <GET https://www.某網站.com> (referer: None)

But how can it be changed?

Run scrapy shell --help to see the command's usage:

Nothing relevant appears under Options.

What about Global Options? There, the --set/-s option sets or overrides a setting.

After overriding the USER_AGENT setting with -s and testing the site again, the page was returned successfully (status 200):

...>scrapy shell -s USER_AGENT="Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.86 Safari/537.36" https://www.某網站.com

2018-07-15 12:11:00 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.某網站.com> (referer: None)
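The reason -s wins over the built-in default can be pictured as a priority-based settings store. The sketch below is only a simplified model, not Scrapy's actual implementation: the class name PrioritySettings is hypothetical, the default User-Agent string is a placeholder, and the priority numbers are assumed to follow the levels Scrapy documents (default 0, project 20, cmdline 40).

```python
# Simplified model of priority-based settings; NOT Scrapy's real code.
# Assumption: priority numbers mirror Scrapy's documented levels.
PRIORITIES = {"default": 0, "project": 20, "cmdline": 40}


class PrioritySettings:
    """Hypothetical settings store that keeps the highest-priority value."""

    def __init__(self):
        self._values = {}  # name -> (priority level, value)

    def set(self, name, value, priority):
        level = PRIORITIES[priority]
        current = self._values.get(name)
        # Overwrite only when the new value has an equal or higher priority.
        if current is None or level >= current[0]:
            self._values[name] = (level, value)

    def get(self, name):
        return self._values[name][1]


settings = PrioritySettings()
# Placeholder for the framework's built-in default User-Agent.
settings.set("USER_AGENT", "Scrapy/x.y (+https://scrapy.org)", "default")
# What "-s USER_AGENT=..." amounts to: a command-line priority override.
settings.set("USER_AGENT", "Mozilla/5.0 (Windows NT 6.1)", "cmdline")
# A later project-level value is ignored because its priority is lower.
settings.set("USER_AGENT", "project-level-agent", "project")

print(settings.get("USER_AGENT"))
```

In this model the command-line value survives even though a project-level value is set afterwards, which matches the observed behaviour of the -s override above.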

--------Moving on--------

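To be clear, I did not actually discover this use of -s through the steps above. I had only been looking at the entries under Options and overlooked Global Options, presumably assuming they were of no use. I found it by searching the web, which turned up the following article: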

"scrapy shell 用法（慢慢更新...）" (scrapy shell usage, updated gradually), original author: 木木&侃侃 (a fellow blogger; see the link to the original).

Going a step further: Scrapy's source code gives more detail about these options.

Open the C:\Python36\Lib\site-packages\scrapy\commands directory. It contains the Python files for Scrapy's built-in commands, and shell.py is the source file of the scrapy shell command.

The source defines a Command class that inherits from scrapy.commands.ScrapyCommand. Its add_options method adds three options:

-c: evaluate the code in the shell, print the result and exit (that is, run a snippet of code against the fetched response)

--spider: use this spider

--no-redirect: do not handle HTTP 3xx status codes and print response as-is

There is no -s option here, so where does it come from? Look at the source of scrapy.commands.ScrapyCommand (in __init__.py): its add_options method is where the -s option is added:

def add_options(self, parser):
    """
    Populate option parse with options available for this command
    """
    group = OptionGroup(parser, "Global Options")
    group.add_option("--logfile", metavar="FILE",
        help="log file. if omitted stderr will be used")
    group.add_option("-L", "--loglevel", metavar="LEVEL", default=None,
        help="log level (default: %s)" % self.settings['LOG_LEVEL'])
    group.add_option("--nolog", action="store_true",
        help="disable logging completely")
    group.add_option("--profile", metavar="FILE", default=None,
        help="write python cProfile stats to FILE")
    group.add_option("--pidfile", metavar="FILE",
        help="write process ID to FILE")
    group.add_option("-s", "--set", action="append", default=[], metavar="NAME=VALUE",
        help="set/override setting (may be repeated)")
    group.add_option("--pdb", action="store_true", help="enable pdb on failure")

    parser.add_option_group(group)
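To see what action="append" with metavar NAME=VALUE means in practice, here is a small standalone sketch using only the standard-library optparse module. The helper name parse_set_options and the way the pairs are turned into a dict are assumptions made for illustration; Scrapy's own handling of the collected values may differ in detail.

```python
from optparse import OptionGroup, OptionParser


def parse_set_options(argv):
    """Parse repeated -s NAME=VALUE options into a dict (illustrative helper)."""
    parser = OptionParser()
    group = OptionGroup(parser, "Global Options")
    # Same declaration as in the listing above: every -s occurrence is
    # appended to a list stored under the destination name "set".
    group.add_option("-s", "--set", action="append", default=[],
                     metavar="NAME=VALUE",
                     help="set/override setting (may be repeated)")
    parser.add_option_group(group)
    opts, args = parser.parse_args(argv)
    # Split on the first "=" only, so a value may itself contain "=".
    overrides = dict(pair.split("=", 1) for pair in opts.set)
    return overrides, args


overrides, args = parse_set_options([
    "-s", "USER_AGENT=Mozilla/5.0 (Windows NT 6.1)",
    "-s", "DOWNLOAD_DELAY=2",
    "https://www.baidu.com",
])
print(overrides)  # setting names mapped to string values
print(args)       # remaining positional arguments (the URL)
```

Note that every value arrives as a string, so a numeric setting such as DOWNLOAD_DELAY still has to be converted by whatever code reads it.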

So the origin of -s has been found.

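However, while searching for a solution earlier I noticed that scrapy crawl and scrapy runspider also provide an -a option, which takes NAME=VALUE pairs as well. Since -s already exists, why add -a? What is the difference between the two?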

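Judging from its help text, -a only sets arguments of the spider, whereas -s has a broader scope and should be able to override any setting listed in the official Settings documentation (not tested).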

parser.add_option("-a", dest="spargs", action="append", default=[], metavar="NAME=VALUE",
                  help="set spider argument (may be repeated)")

--------Moving on--------

Practice 1: the -c option of scrapy shell

(env0626) D:\ws\env0626\ws>scrapy shell -c "response.xpath('//title/text()')" https://www.baidu.com

Output:

2018-07-15 13:07:23 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.baidu.com> (referer: None)
[<Selector xpath='//title/text()' data='百度一下，你就知道'>]

Practice 2: changing the User-Agent request header with the -a and -s options of scrapy runspider

# -*- coding: utf-8 -*-
import scrapy


class MousiteSpider(scrapy.Spider):
    name = 'mousite'
    allowed_domains = ['www.zhihu.com']
    start_urls = ['https://www.zhihu.com/']

    def parse(self, response):
        # Yield a dict item: a callback must yield items or Requests,
        # not a bare SelectorList.
        yield {'title': response.xpath('//title/text()')}

Test result: with -a no data is retrieved and the request returns 400; with -s the request succeeds and returns 200.

The -a option:

(env0626) D:\ws\env0626\ws>scrapy runspider -a USER_AGENT="Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.86 Safari/537.36" mousite.py

DEBUG: Crawled (400) <GET https://www.zhihu.com/> (referer: None)

INFO: Ignoring response <400 https://www.zhihu.com/>: HTTP status code is not handled or not allowed

The -s option:

(env0626) D:\ws\env0626\ws>scrapy runspider -s USER_AGENT="Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.86 Safari/537.36" mousite.py

DEBUG: Crawled (200) <GET https://www.zhihu.com/> (referer: None)

{'title': [<Selector xpath='//title/text()' data='知乎 - 發現更大的世界'>]}

So the two options really do behave differently.
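One way to picture the difference is the toy model below. It is a sketch under stated assumptions, not Scrapy's code: ToySpider and request_user_agent are hypothetical names, and the default User-Agent string is a placeholder. It assumes that spider arguments passed with -a simply become attributes of the spider object, and that the outgoing header is taken from the USER_AGENT setting unless the spider has a lowercase user_agent attribute (as far as I know this is what Scrapy's built-in UserAgentMiddleware does, but that detail was not tested in this post).

```python
class ToySpider:
    """Hypothetical stand-in for a spider: keyword arguments become attributes."""

    def __init__(self, **kwargs):
        self.__dict__.update(kwargs)


def request_user_agent(spider, settings):
    """Toy version of how the outgoing User-Agent header is chosen.

    Assumption: the header comes from the USER_AGENT setting, unless the
    spider defines a lowercase `user_agent` attribute.
    """
    return getattr(spider, "user_agent", settings["USER_AGENT"])


# Placeholder for the framework's built-in default User-Agent.
default_settings = {"USER_AGENT": "Scrapy/x.y (+https://scrapy.org)"}
browser_ua = "Mozilla/5.0 (Windows NT 6.1)"

# -a USER_AGENT=...: the value lands on the spider, the setting is untouched.
spider_a = ToySpider(USER_AGENT=browser_ua)
ua_with_a = request_user_agent(spider_a, default_settings)

# -s USER_AGENT=...: the setting itself is overridden.
spider_s = ToySpider()
ua_with_s = request_user_agent(spider_s, {**default_settings, "USER_AGENT": browser_ua})

print(ua_with_a)  # still the default agent
print(ua_with_s)  # the browser-like agent
```

Under these assumptions, -a USER_AGENT=... only creates an unused attribute on the spider, which would explain why the site kept answering 400 in the -a run.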

Note that all of the experiments above were run outside a Scrapy project.
