Deduplication Rules
In a crawler application we can set dont_filter=True on a Request object to switch deduplication off. The scrapy framework deduplicates by default, so how does it do that internally?
from scrapy.dupefilter import RFPDupeFilter
When a request comes in, the from_settings method runs first and reads the DUPEFILTER_DEBUG setting from the settings file; then __init__ runs and creates a set, self.fingerprints = set(); finally request_seen is executed. So to customize the deduplication rule we only need to subclass BaseDupeFilter.

class RFPDupeFilter(BaseDupeFilter):
    """Request Fingerprint duplicates filter"""

    def __init__(self, path=None, debug=False):
        self.file = None
        self.fingerprints = set()
        self.logdupes = True
        self.debug = debug
        self.logger = logging.getLogger(__name__)
        if path:
            self.file = open(os.path.join(path, 'requests.seen'), 'a+')
            self.file.seek(0)
            self.fingerprints.update(x.rstrip() for x in self.file)

    @classmethod
    def from_settings(cls, settings):
        debug = settings.getbool('DUPEFILTER_DEBUG')
        return cls(job_dir(settings), debug)

    def request_seen(self, request):
        fp = self.request_fingerprint(request)
        if fp in self.fingerprints:
            return True
        self.fingerprints.add(fp)
        if self.file:
            self.file.write(fp + os.linesep)

    def request_fingerprint(self, request):
        return request_fingerprint(request)

    def close(self, reason):
        if self.file:
            self.file.close()

    def log(self, request, spider):
        if self.debug:
            msg = "Filtered duplicate request: %(request)s"
            self.logger.debug(msg, {'request': request}, extra={'spider': spider})
        elif self.logdupes:
            msg = ("Filtered duplicate request: %(request)s"
                   " - no more duplicates will be shown"
                   " (see DUPEFILTER_DEBUG to show all duplicates)")
            self.logger.debug(msg, {'request': request}, extra={'spider': spider})
            self.logdupes = False

        spider.crawler.stats.inc_value('dupefilter/filtered', spider=spider)
By default scrapy deduplicates with scrapy.dupefilter.RFPDupeFilter. The related settings are:
DUPEFILTER_CLASS = 'scrapy.dupefilter.RFPDupeFilter'
DUPEFILTER_DEBUG = False
JOBDIR = "directory where the seen-requests log is kept, e.g. /root/"
# the final path will be /root/requests.seen
Customizing the deduplication rule with a Redis set:

import redis
from scrapy.dupefilter import BaseDupeFilter
from scrapy.utils.request import request_fingerprint


class Myfilter(BaseDupeFilter):
    def __init__(self, key):
        self.conn = None
        self.key = key

    @classmethod
    def from_settings(cls, settings):
        key = settings.get('DUP_REDIS_KEY')
        return cls(key)

    def open(self):
        self.conn = redis.Redis(host='127.0.0.1', port=6379)

    def request_seen(self, request):
        fp = request_fingerprint(request)
        ret = self.conn.sadd(self.key, fp)
        return ret == 0
Note: scrapy's request_fingerprint helper hashes every request object into a fixed-length fingerprint, which is convenient to store.
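A small illustration of what the fingerprint gives you (the URLs are placeholders; the exact digest differs by scrapy version):

from scrapy.http import Request
from scrapy.utils.request import request_fingerprint

fp1 = request_fingerprint(Request('http://example.com/?a=1&b=2'))
fp2 = request_fingerprint(Request('http://example.com/?b=2&a=1'))

# the URL is canonicalized first, so query-string order should not matter:
# both calls return the same fixed-length sha1 hex digest
print(fp1 == fp2, len(fp1))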
Once the custom rule is written, how does it take effect? Just change the settings file:
In settings.py, set DUPEFILTER_CLASS = '<path of the class implementing your custom dedup rule>'.
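For example, to switch to the Redis-based Myfilter above (the module path is an assumption about where the class is saved):

# settings.py
DUPEFILTER_CLASS = 'myspider.dupefilters.Myfilter'   # hypothetical path to the class above
DUP_REDIS_KEY = 'dupefilter:fingerprints'            # the key Myfilter.from_settings reads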
So deduplication is decided by two factors: the dont_filter parameter on the Request object and the dedup class. How are these two combined? That is handled by the scheduler's enqueue_request method.

# scrapy/core/scheduler.py
class Scheduler(object):

    def enqueue_request(self, request):
        if not request.dont_filter and self.df.request_seen(request):
            self.df.log(request, self.spider)
            return False
        dqok = self._dqpush(request)
        if dqok:
            self.stats.inc_value('scheduler/enqueued/disk', spider=self.spider)
        else:
            self._mqpush(request)
            self.stats.inc_value('scheduler/enqueued/memory', spider=self.spider)
        self.stats.inc_value('scheduler/enqueued', spider=self.spider)
        return True
Scheduler
1. Use a queue (breadth-first)
2. Use a stack (depth-first)
3. Use a priority queue (for example a Redis sorted set); a short sketch of the three orderings follows.
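A minimal sketch of the three orderings in plain Python/redis, not scrapy's actual queue classes:

from collections import deque

pending = ['a', 'b', 'c']

# 1. queue (FIFO) -> breadth-first: requests come out in the order they were added
fifo = deque(pending)
print([fifo.popleft() for _ in range(len(fifo))])   # ['a', 'b', 'c']

# 2. stack (LIFO) -> depth-first: the most recently added request comes out first
lifo = list(pending)
print([lifo.pop() for _ in range(len(lifo))])       # ['c', 'b', 'a']

# 3. priority queue, e.g. a redis sorted set keyed by priority (convention assumed):
# conn.zadd('scheduler:requests', {fingerprint: priority})
# conn.zrange('scheduler:requests', 0, 0)   # take the entry with the best score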
Downloader Middleware
On its way to the downloader, a request passes through a chain of downloader middlewares: on the way down it goes through each middleware's process_request method, and once the download finishes the response comes back through each middleware's process_response method. What are these middlewares for?
Purpose: apply uniform processing to every request before and after the download.
We can write our own middleware, for example to add request headers before the download and read cookies from the response afterwards.
A custom downloader middleware only takes effect after it is registered in settings.py:
# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
    'myspider.middlewares.MyspiderDownloaderMiddleware': 543,
}
If you want to swap out the URL right before download, you can do it in process_request (this is rarely done in practice):
class MyspiderDownloaderMiddleware(object):
    def process_request(self, request, spider):
        request._set_url('the new url')
        return None
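And a sketch of the more common case mentioned above: add a header on the way out and look at the cookies that come back (the class name and header value are made up):

class HeaderAndCookieMiddleware(object):
    def process_request(self, request, spider):
        # add a request header before the download
        request.headers.setdefault('User-Agent', 'Mozilla/5.0 (my-crawler)')
        return None

    def process_response(self, request, response, spider):
        # inspect the cookies the server sent back
        for set_cookie in response.headers.getlist('Set-Cookie'):
            spider.logger.debug('Set-Cookie: %s', set_cookie)
        return response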
We can add request headers or cookies in middleware of our own, but scrapy already ships middleware for most of this; a custom middleware only needs to add what scrapy does not cover. So which downloader middlewares does scrapy provide?
For example, the User-Agent header that requests usually carry is handled in useragent.py, and redirect.py handles redirects: when a request gets redirected, the scrapy framework takes care of it for us.

class BaseRedirectMiddleware(object):

    enabled_setting = 'REDIRECT_ENABLED'

    def __init__(self, settings):
        if not settings.getbool(self.enabled_setting):
            raise NotConfigured

        self.max_redirect_times = settings.getint('REDIRECT_MAX_TIMES')
        self.priority_adjust = settings.getint('REDIRECT_PRIORITY_ADJUST')

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler.settings)
We can limit redirects by setting the maximum number of redirects (REDIRECT_MAX_TIMES) in the settings file.
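A hedged example of the relevant settings (5 is an arbitrary value; scrapy's documented default for REDIRECT_MAX_TIMES is 20):

# settings.py
REDIRECT_ENABLED = True     # set to False to disable the redirect middleware entirely
REDIRECT_MAX_TIMES = 5      # stop following after this many redirects for one request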
The downloader middlewares also handle cookies for us.
When the cookies middleware is instantiated, it creates a defaultdict (whatever factory you pass in determines the default value for missing keys; for example, after ret = defaultdict(list) and s = ret[1], ret is a dict mapping the key 1 to []).
When a request comes in, the middleware reads the cookiejar value from the request's meta, cookiejarkey = request.meta.get("cookiejar"), and uses it to index the dict created at instantiation, jar = self.jars[cookiejarkey]; so self.jars becomes {cookiejarkey: CookieJar object}. Cookies are then taken from that CookieJar and attached to the outgoing request.
After the download, on the response path, the cookies are extracted from the response with jar.extract_cookies(response, request) and added back into the cookiejar.

class CookiesMiddleware(object):
    """This middleware enables working with sites that need cookies"""

    def __init__(self, debug=False):
        self.jars = defaultdict(CookieJar)
        self.debug = debug

    @classmethod
    def from_crawler(cls, crawler):
        if not crawler.settings.getbool('COOKIES_ENABLED'):
            raise NotConfigured
        return cls(crawler.settings.getbool('COOKIES_DEBUG'))

    def process_request(self, request, spider):
        if request.meta.get('dont_merge_cookies', False):
            return

        cookiejarkey = request.meta.get("cookiejar")
        jar = self.jars[cookiejarkey]
        cookies = self._get_request_cookies(jar, request)
        for cookie in cookies:
            jar.set_cookie_if_ok(cookie, request)

        # set Cookie header
        request.headers.pop('Cookie', None)
        jar.add_cookie_header(request)
        self._debug_cookie(request, spider)

    def process_response(self, request, response, spider):
        if request.meta.get('dont_merge_cookies', False):
            return response

        # extract cookies from Set-Cookie and drop invalid/expired cookies
        cookiejarkey = request.meta.get("cookiejar")
        jar = self.jars[cookiejarkey]
        jar.extract_cookies(response, request)
        self._debug_set_cookie(response, spider)

        return response
So when sending a request we can set meta={"cookiejar": some_value}; carrying the same meta on the next request reuses that cookie jar. If you do not want to carry the current cookies, set a different value, e.g. meta={"cookiejar": some_other_value}; later requests can then pick whichever jar they need.
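A minimal sketch of per-session cookie jars (the spider name and URLs are hypothetical):

import scrapy
from scrapy.http import Request


class LoginSpider(scrapy.Spider):
    name = 'login_demo'

    def start_requests(self):
        # each distinct "cookiejar" value gets its own CookieJar inside the middleware
        for i, url in enumerate(['http://example.com/login/a', 'http://example.com/login/b']):
            yield Request(url, meta={'cookiejar': i}, callback=self.parse)

    def parse(self, response):
        # reuse the same jar by carrying the same value forward
        yield Request('http://example.com/profile',
                      meta={'cookiejar': response.meta['cookiejar']},
                      callback=self.parse_profile,
                      dont_filter=True)   # same URL is fetched once per session

    def parse_profile(self, response):
        self.logger.info('profile fetched with jar %s', response.meta['cookiejar'])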
scrapy ships many built-in middlewares. Custom middlewares must be registered in our settings file, yet we do not see the built-in ones there, because they are configured in scrapy's own default settings file.
Open that default settings file and you can see scrapy's default middlewares and their priority numbers:

DOWNLOADER_MIDDLEWARES_BASE = {
    # Engine side
    'scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware': 100,
    'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware': 300,
    'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware': 350,
    'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware': 400,
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': 500,
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': 550,
    'scrapy.downloadermiddlewares.ajaxcrawl.AjaxCrawlMiddleware': 560,
    'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware': 580,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 590,
    'scrapy.downloadermiddlewares.redirect.RedirectMiddleware': 600,
    'scrapy.downloadermiddlewares.cookies.CookiesMiddleware': 700,
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 750,
    'scrapy.downloadermiddlewares.stats.DownloaderStats': 850,
    'scrapy.downloadermiddlewares.httpcache.HttpCacheMiddleware': 900,
    # Downloader side
}
So when we write a custom middleware that corresponds to one of these defaults, its number should be larger than the corresponding default's on the request side and smaller on the response side, otherwise the default middleware will override the custom one (execution order: process_request runs from small numbers to large, process_response from large to small) and the customization will not take effect.
These middlewares also have return values. For process_request: returning None means the remaining middlewares continue; returning a Response (how do you return a Response? Fake one: fetch some other URL yourself with the requests module, or instantiate one via from scrapy.http import Response) skips the remaining request middlewares and jumps straight to all of the response middlewares (all of them, unlike Django's middlewares); returning a Request object abandons the current download and puts the returned request back into the scheduler; you can also raise an exception.
On the response path, process_response must return something: normally the response, but it can also return a Request object or raise an exception.
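A sketch of the "return a Response from process_request" path described above (the cache dict is purely illustrative):

from scrapy.http import HtmlResponse


class LocalCacheMiddleware(object):
    cache = {}   # url -> html body, filled elsewhere

    def process_request(self, request, spider):
        body = self.cache.get(request.url)
        if body is not None:
            # short-circuit the download: the remaining process_request methods are
            # skipped, and every middleware's process_response still runs on this response
            return HtmlResponse(url=request.url, body=body,
                                encoding='utf-8', request=request)
        return None   # fall through to the real download

    def process_response(self, request, response, spider):
        # must return a Response (or a Request, or raise)
        return response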
Proxies can also be configured in a downloader middleware.
Spider Middleware
The spider's output (item or request objects) passes through each spider middleware's process_spider_output method on its way to the engine for dispatch; once a download finishes, the response passes through each spider middleware's process_spider_input method.
Return values: process_spider_input must return None or raise an exception; process_spider_output must return an iterable of Request or Item objects.
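A minimal custom spider middleware following those rules (the class name and filtering rule are made up):

from scrapy.http import Request


class DropExternalLinksMiddleware(object):
    def process_spider_input(self, response, spider):
        # runs on each downloaded response before the callback; return None or raise
        return None

    def process_spider_output(self, response, result, spider):
        # must yield Request/Item objects
        for obj in result:
            if isinstance(obj, Request) and 'example.com' not in obj.url:
                continue   # silently drop off-site requests
            yield obj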
Likewise, a custom spider middleware has to be registered in the settings file:
# Enable or disable spider middlewares
# See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
SPIDER_MIDDLEWARES = {
    'myspider.middlewares.MyspiderSpiderMiddleware': 543,
}
那么爬蟲中間件有什么作用呢?我們爬取的深度(DEPTH_LIMIT參數)和優先級是如何實現的呢?就是通過內置的爬蟲中間件實現的。
scrapy ships a number of spider middlewares built in. The crawl depth limit and priority are implemented in one of them: depth.py.
When the crawl reaches the depth middleware, from_crawler runs first and reads several settings: DEPTH_LIMIT (crawl depth), DEPTH_PRIORITY (priority) and DEPTH_STATS_VERBOSE (whether to collect stats per depth level). Then process_spider_output checks whether a depth has been set; if not, it sets depth=0 on the current response's meta, and each level is incremented with depth = response.meta['depth'] + 1, which is what controls the depth.

class DepthMiddleware(object):

    def __init__(self, maxdepth, stats=None, verbose_stats=False, prio=1):
        self.maxdepth = maxdepth
        self.stats = stats
        self.verbose_stats = verbose_stats
        self.prio = prio

    @classmethod
    def from_crawler(cls, crawler):
        settings = crawler.settings
        maxdepth = settings.getint('DEPTH_LIMIT')
        verbose = settings.getbool('DEPTH_STATS_VERBOSE')
        prio = settings.getint('DEPTH_PRIORITY')
        return cls(maxdepth, crawler.stats, verbose, prio)

    def process_spider_output(self, response, result, spider):
        def _filter(request):
            if isinstance(request, Request):
                depth = response.meta['depth'] + 1
                request.meta['depth'] = depth
                if self.prio:
                    request.priority -= depth * self.prio
                if self.maxdepth and depth > self.maxdepth:
                    logger.debug(
                        "Ignoring link (depth > %(maxdepth)d): %(requrl)s ",
                        {'maxdepth': self.maxdepth, 'requrl': request.url},
                        extra={'spider': spider}
                    )
                    return False
                elif self.stats:
                    if self.verbose_stats:
                        self.stats.inc_value('request_depth_count/%s' % depth,
                                             spider=spider)
                    self.stats.max_value('request_depth_max', depth,
                                         spider=spider)
            return True

        # base case (depth=0)
        if self.stats and 'depth' not in response.meta:
            response.meta['depth'] = 0
            if self.verbose_stats:
                self.stats.inc_value('request_depth_count/0', spider=spider)

        return (r for r in result or () if _filter(r))
Note: response.request is the request object that produced the current response.
response.meta is equivalent to response.request.meta; it gives access to the meta attribute of the request corresponding to the current response.
Even if you set no meta yourself, it carries some default entries, for example how long the current page took to download:
{'download_timeout': 180.0, 'download_slot': 'dig.chouti.com', 'download_latency': 0.5455923080444336}
At the same time, the request's priority is adjusted by subtracting the depth from its own priority: request.priority -= depth * self.prio
If DEPTH_PRIORITY is set to 1, request priority decreases with depth (0, -1, -2, ...).
If DEPTH_PRIORITY is set to -1, request priority increases with depth (0, 1, 2, ...).
By flipping the sign of this setting you therefore control whether the crawl behaves depth-first (deeper requests get higher priority) or breadth-first (shallower requests get higher priority).
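A sketch of the corresponding settings; the queue class paths are copied from the annotated settings.py at the end of this article, so double-check them against your scrapy version:

# settings.py
# breadth-first: deeper requests get lower priority, FIFO queues
# DEPTH_PRIORITY = 1
# SCHEDULER_DISK_QUEUE = 'scrapy.squeue.PickleFifoDiskQueue'
# SCHEDULER_MEMORY_QUEUE = 'scrapy.squeue.FifoMemoryQueue'

# depth-first: deeper requests get higher priority, LIFO queues
# DEPTH_PRIORITY = -1
# SCHEDULER_DISK_QUEUE = 'scrapy.squeue.PickleLifoDiskQueue'
# SCHEDULER_MEMORY_QUEUE = 'scrapy.squeue.LifoMemoryQueue'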
scrapy's defaults are DEPTH_LIMIT = 0 (no depth limit) and DEPTH_PRIORITY = 0 (no priority adjustment).
scrapy's default spider middleware configuration:

SPIDER_MIDDLEWARES_BASE = {
    # Engine side
    'scrapy.spidermiddlewares.httperror.HttpErrorMiddleware': 50,
    'scrapy.spidermiddlewares.offsite.OffsiteMiddleware': 500,
    'scrapy.spidermiddlewares.referer.RefererMiddleware': 700,
    'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware': 800,
    'scrapy.spidermiddlewares.depth.DepthMiddleware': 900,
    # Spider side
}
Note: the scrapy framework follows the open-closed principle nicely (source closed, configuration open).
Custom Commands
There are two ways to customize commands.
To run a single spider, just write a python script (a .py file); scrapy supports this out of the box.
Running a single spider from a script:

import sys
from scrapy.cmdline import execute

if __name__ == '__main__':
    # Option 1: hard-code the arguments, then run `python <script>` from the
    # directory containing the script
    # execute(["scrapy", "crawl", "chouti", "--nolog"])

    # Option 2: pass the spider name on the command line; sys.argv captures it
    # (argv is a list: the first element is the script path, the rest are the arguments)
    # e.g. run: python <script> chouti
    # print(sys.argv)
    execute(['scrapy', 'crawl', sys.argv[1], '--nolog'])
If we want to run several spiders at once, we need a custom command.
Steps to create a custom command:
- Create a directory (any name, e.g. commands) at the same level as spiders
- Create crawlall.py inside it (this file name is the custom command's name). Note: whatever the .py file is called, that is what the command will be called.
from scrapy.commands import ScrapyCommand
from scrapy.utils.project import get_project_settings


class Command(ScrapyCommand):
    requires_project = True

    def syntax(self):
        return '[options]'

    def short_desc(self):
        # the description shown by `scrapy --help`
        return 'Runs all of the spiders'

    def run(self, args, opts):
        # the method the custom command executes; add whatever logic you need here
        # run every spider in the project
        spider_list = self.crawler_process.spiders.list()
        for name in spider_list:
            self.crawler_process.crawl(name, **opts.__dict__)
        self.crawler_process.start()
- Add COMMANDS_MODULE = '<project name>.<directory name>' to settings.py
- Run the command from the project directory: scrapy crawlall (see the layout sketch below)
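Assuming the project is named myspider and the directory is called commands (both names are hypothetical), the layout and setting would look like this:

myspider/
    scrapy.cfg
    myspider/
        settings.py          # add: COMMANDS_MODULE = 'myspider.commands'
        spiders/
        commands/
            __init__.py
            crawlall.py      # contains the Command class above

Then run scrapy crawlall from the directory that contains scrapy.cfg.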
Custom Extensions
A custom extension uses signals to register an action at a chosen point.
Custom extensions are built on scrapy's signals.

from scrapy import signals


class MyExtension(object):
    def __init__(self):
        pass

    @classmethod
    def from_crawler(cls, crawler):
        # parameters can be read from the settings file
        # val = crawler.settings.getint('MMMM')
        # ext = cls(val)
        ext = cls()

        # run func_open when the spider starts (spider_opened)
        crawler.signals.connect(ext.func_open, signal=signals.spider_opened)
        # run func_close when the spider finishes (spider_closed)
        crawler.signals.connect(ext.func_close, signal=signals.spider_closed)

        return ext

    def func_open(self, spider):
        print('open')

    def func_close(self, spider):
        print('close')
Again, a custom extension must be registered in the settings file before it takes effect:
# Enable or disable extensions
# See https://doc.scrapy.org/en/latest/topics/extensions.html
EXTENSIONS = {
    # 'scrapy.extensions.telnet.TelnetConsole': None,
    'xxx.xxx.xxxx': 500,
}
Custom extensions hook into points that scrapy defines, so which extension points (signals) does scrapy give us?
Explanation: engine_started and engine_stopped fire when the engine starts and stops, i.e. at the very beginning and end of the whole crawl.
spider_opened and spider_closed fire when a spider starts and finishes.
spider_idle means the spider is idle; spider_error means the spider raised an error.
request_scheduled fires when the scheduler schedules a request; request_dropped when a request is dropped.
response_received fires when a response is received; response_downloaded when the download has finished.
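For instance, a sketch that hooks two of the signals listed above (the extension name and what it logs are made up):

from scrapy import signals


class ResponseCounterExtension(object):
    def __init__(self):
        self.responses = 0

    @classmethod
    def from_crawler(cls, crawler):
        ext = cls()
        crawler.signals.connect(ext.on_response, signal=signals.response_received)
        crawler.signals.connect(ext.on_idle, signal=signals.spider_idle)
        return ext

    def on_response(self, response, request, spider):
        self.responses += 1

    def on_idle(self, spider):
        spider.logger.info('spider idle after %s responses', self.responses)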
Proxies
There are three ways to set a proxy:
Via environment variables (adds the proxy to every request in the current process)
Using os.environ from the os module (print(os.environ) shows the variables shared within the current process), set the proxy as key/value pairs.
Set the environment variables before the crawler starts:
In the launch script:
import os
os.environ['http_proxy'] = 'http://xxx.com'
os.environ['https_proxy'] = 'https://xxx.com'
Or in the start_requests method:
def start_requests(self):
    import os
    os.environ['http_proxy'] = 'http://xxx.com'
    yield Request(url='xxx')
Via the request's meta parameter (adds a proxy to a single request)
Set meta={'proxy': 'http://xxx.com'} on the request, as in the sketch below.
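A minimal example inside a spider (the spider name, URL and proxy address are placeholders):

import scrapy
from scrapy.http import Request


class ProxyDemoSpider(scrapy.Spider):
    name = 'proxy_demo'

    def start_requests(self):
        # user:pass in the proxy URL is optional; HttpProxyMiddleware turns it into
        # a Proxy-Authorization header, as its source below shows
        yield Request(url='http://example.com',
                      meta={'proxy': 'http://user:pass@1.2.3.4:8080'},
                      callback=self.parse)

    def parse(self, response):
        pass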
Via a downloader middleware
How is this used? Look at how the source implements it first.
When a request reaches the HttpProxyMiddleware, from_crawler runs first and checks the HTTPPROXY_ENABLED setting, which controls whether the proxy middleware is enabled at all.
On instantiation it then creates an empty dict, self.proxies = {}, and loops over getproxies() - so what is getproxies?
class HttpProxyMiddleware(object):

    def __init__(self, auth_encoding='latin-1'):
        self.auth_encoding = auth_encoding
        self.proxies = {}
        for type, url in getproxies().items():
            self.proxies[type] = self._get_proxy(url, type)
getproxies = getproxies_environment, i.e. it is just a function:
def getproxies_environment():
    proxies = {}
    for name, value in os.environ.items():
        name = name.lower()
        if value and name[-6:] == '_proxy':
            proxies[name[:-6]] = value
    if 'REQUEST_METHOD' in os.environ:
        proxies.pop('http', None)
    for name, value in os.environ.items():
        if name[-6:] == '_proxy':
            name = name.lower()
            if value:
                proxies[name[:-6]] = value
            else:
                proxies.pop(name[:-6], None)
    return proxies
This function loops over the environment variables, looks for keys ending in _proxy, slices the key and stores the result in the proxies dict built at instantiation. For instance, if we set the environment variable os.environ["http_proxy"] = 'http://xxx.com', the resulting dict is {"http": "http://xxx.com"}. So we can add proxies this way, but the environment variables have to be set up front: in start_requests at the latest, or before the script starts.
When process_request runs, it first looks for 'proxy' in the request's meta; if present it is used, otherwise the middleware falls back to self.proxies, whose values come from the environment variables.
Hence a request's meta parameter takes priority over the environment variables.
def process_request(self, request, spider):
    # ignore if proxy is already set
    if 'proxy' in request.meta:
        if request.meta['proxy'] is None:
            return
        # extract credentials if present
        creds, proxy_url = self._get_proxy(request.meta['proxy'], '')
        request.meta['proxy'] = proxy_url
        if creds and not request.headers.get('Proxy-Authorization'):
            request.headers['Proxy-Authorization'] = b'Basic ' + creds
        return
    elif not self.proxies:
        return

    # ... scheme is parsed from the request URL; see the full source below
    if scheme in self.proxies:
        self._set_proxy(request, scheme)

class HttpProxyMiddleware(object):

    def __init__(self, auth_encoding='latin-1'):
        self.auth_encoding = auth_encoding
        self.proxies = {}
        for type, url in getproxies().items():
            self.proxies[type] = self._get_proxy(url, type)

    @classmethod
    def from_crawler(cls, crawler):
        if not crawler.settings.getbool('HTTPPROXY_ENABLED'):
            raise NotConfigured
        auth_encoding = crawler.settings.get('HTTPPROXY_AUTH_ENCODING')
        return cls(auth_encoding)

    def _basic_auth_header(self, username, password):
        user_pass = to_bytes(
            '%s:%s' % (unquote(username), unquote(password)),
            encoding=self.auth_encoding)
        return base64.b64encode(user_pass).strip()

    def _get_proxy(self, url, orig_type):
        proxy_type, user, password, hostport = _parse_proxy(url)
        proxy_url = urlunparse((proxy_type or orig_type, hostport, '', '', '', ''))

        if user:
            creds = self._basic_auth_header(user, password)
        else:
            creds = None

        return creds, proxy_url

    def process_request(self, request, spider):
        # ignore if proxy is already set
        if 'proxy' in request.meta:
            if request.meta['proxy'] is None:
                return
            # extract credentials if present
            creds, proxy_url = self._get_proxy(request.meta['proxy'], '')
            request.meta['proxy'] = proxy_url
            if creds and not request.headers.get('Proxy-Authorization'):
                request.headers['Proxy-Authorization'] = b'Basic ' + creds
            return
        elif not self.proxies:
            return

        parsed = urlparse_cached(request)
        scheme = parsed.scheme

        # 'no_proxy' is only supported by http schemes
        if scheme in ('http', 'https') and proxy_bypass(parsed.hostname):
            return

        if scheme in self.proxies:
            self._set_proxy(request, scheme)

    def _set_proxy(self, request, scheme):
        creds, proxy = self.proxies[scheme]
        request.meta['proxy'] = proxy
        if creds:
            request.headers['Proxy-Authorization'] = b'Basic ' + creds
Summary: setting the proxy through request meta or environment variables suits fairly small download volumes; with large volumes, hammering one or a few proxies over and over makes them easy to ban.
So when the request volume is large, use the third approach: a custom downloader middleware that picks a random proxy from a pool for every request. Without a fixed pattern, bans are much less likely.

import random
import base64


class ProxyMiddleware(object):
    def process_request(self, request, spider):
        PROXIES = [
            {'ip_port': '111.11.228.75:80', 'user_pass': ''},
            {'ip_port': '120.198.243.22:80', 'user_pass': ''},
            {'ip_port': '111.8.60.9:8123', 'user_pass': ''},
            {'ip_port': '101.71.27.120:80', 'user_pass': ''},
            {'ip_port': '122.96.59.104:80', 'user_pass': ''},
            {'ip_port': '122.224.249.122:8088', 'user_pass': ''},
        ]
        proxy = random.choice(PROXIES)
        request.meta['proxy'] = "http://%s" % proxy['ip_port']
        if proxy['user_pass']:
            # attach basic-auth credentials only when the proxy actually needs them
            encoded_user_pass = base64.b64encode(
                proxy['user_pass'].encode('utf8')).strip()
            request.headers['Proxy-Authorization'] = b'Basic ' + encoded_user_pass
After writing it, register it in settings.py.
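For example (the module path is hypothetical; point it at wherever ProxyMiddleware actually lives):

# settings.py
DOWNLOADER_MIDDLEWARES = {
    'myspider.middlewares.ProxyMiddleware': 543,
    # optionally disable the built-in proxy middleware if proxies are managed only here
    # 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': None,
}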
A walkthrough of scrapy's settings.py

# -*- coding: utf-8 -*-

# Scrapy settings for step8_king project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     http://doc.scrapy.org/en/latest/topics/settings.html
#     http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
#     http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html

# 1. crawler name
BOT_NAME = 'step8_king'

# 2. spider module paths
SPIDER_MODULES = ['step8_king.spiders']
NEWSPIDER_MODULE = 'step8_king.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
# 3. client User-Agent request header
# USER_AGENT = 'step8_king (+http://www.yourdomain.com)'

# Obey robots.txt rules
# 4. whether to obey robots.txt
# ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
# 5. number of concurrent requests
# CONCURRENT_REQUESTS = 4

# Configure a delay for requests for the same website (default: 0)
# See http://scrapy.readthedocs.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
# 6. download delay in seconds
# DOWNLOAD_DELAY = 2

# The download delay setting will honor only one of:
# 7. concurrency per domain; the download delay is also applied per domain
# CONCURRENT_REQUESTS_PER_DOMAIN = 2
# concurrency per IP; if set, CONCURRENT_REQUESTS_PER_DOMAIN is ignored,
# and the download delay is applied per IP instead
# CONCURRENT_REQUESTS_PER_IP = 3

# Disable cookies (enabled by default)
# 8. whether cookies are enabled (cookiejar handling)
# COOKIES_ENABLED = True
# COOKIES_DEBUG = True

# Disable Telnet Console (enabled by default)
# 9. the Telnet console lets you inspect and control the running crawler;
#    connect with `telnet <ip> <port>` and issue commands
# TELNETCONSOLE_ENABLED = True
# TELNETCONSOLE_HOST = '127.0.0.1'
# TELNETCONSOLE_PORT = [6023,]

# 10. default request headers
# Override the default request headers:
# DEFAULT_REQUEST_HEADERS = {
#     'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#     'Accept-Language': 'en',
# }

# Configure item pipelines
# See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
# 11. item pipelines that process the scraped output
# ITEM_PIPELINES = {
#     'step8_king.pipelines.JsonPipeline': 700,
#     'step8_king.pipelines.FilePipeline': 500,
# }

# 12. custom extensions, invoked via signals
# Enable or disable extensions
# See http://scrapy.readthedocs.org/en/latest/topics/extensions.html
# EXTENSIONS = {
#     # 'step8_king.extensions.MyExtension': 500,
# }

# 13. maximum crawl depth; the current depth is available via meta; 0 means unlimited
# DEPTH_LIMIT = 3

# 14. 0 = depth-first, LIFO (default); 1 = breadth-first, FIFO

# last in, first out: depth-first
# DEPTH_PRIORITY = 0
# SCHEDULER_DISK_QUEUE = 'scrapy.squeue.PickleLifoDiskQueue'
# SCHEDULER_MEMORY_QUEUE = 'scrapy.squeue.LifoMemoryQueue'
# first in, first out: breadth-first
# DEPTH_PRIORITY = 1
# SCHEDULER_DISK_QUEUE = 'scrapy.squeue.PickleFifoDiskQueue'
# SCHEDULER_MEMORY_QUEUE = 'scrapy.squeue.FifoMemoryQueue'

# 15. scheduler queue
# SCHEDULER = 'scrapy.core.scheduler.Scheduler'
# from scrapy.core.scheduler import Scheduler

# 16. URL deduplication
# DUPEFILTER_CLASS = 'step8_king.duplication.RepeatUrl'

# Enable and configure the AutoThrottle extension (disabled by default)
# See http://doc.scrapy.org/en/latest/topics/autothrottle.html
"""
17. auto-throttle algorithm
    from scrapy.contrib.throttle import AutoThrottle
    How the automatic throttling works:
    1. take the minimum delay DOWNLOAD_DELAY
    2. take the maximum delay AUTOTHROTTLE_MAX_DELAY
    3. set the initial download delay AUTOTHROTTLE_START_DELAY
    4. when a download finishes, take its latency, i.e. the time between
       connecting and receiving the response headers
    5. the value used in the calculation:
       AUTOTHROTTLE_TARGET_CONCURRENCY

    target_delay = latency / self.target_concurrency
    new_delay = (slot.delay + target_delay) / 2.0   # slot.delay is the previous delay
    new_delay = max(target_delay, new_delay)
    new_delay = min(max(self.mindelay, new_delay), self.maxdelay)
    slot.delay = new_delay
"""
# enable automatic throttling
# AUTOTHROTTLE_ENABLED = True
# The initial download delay
# AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
# AUTOTHROTTLE_MAX_DELAY = 10
# The average number of requests Scrapy should be sending in parallel to each remote server
# AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
# AUTOTHROTTLE_DEBUG = True

# Enable and configure HTTP caching (disabled by default)
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
"""
18. HTTP caching
    caches requests/responses that have already been sent so they can be reused later
    from scrapy.downloadermiddlewares.httpcache import HttpCacheMiddleware
    from scrapy.extensions.httpcache import DummyPolicy
    from scrapy.extensions.httpcache import FilesystemCacheStorage
"""
# whether caching is enabled
# HTTPCACHE_ENABLED = True

# cache policy: cache every request; later requests are served straight from the cache
# HTTPCACHE_POLICY = "scrapy.extensions.httpcache.DummyPolicy"
# cache policy: follow HTTP response headers such as Cache-Control and Last-Modified
# HTTPCACHE_POLICY = "scrapy.extensions.httpcache.RFC2616Policy"

# cache expiration time
# HTTPCACHE_EXPIRATION_SECS = 0

# cache directory
# HTTPCACHE_DIR = 'httpcache'

# HTTP status codes that are never cached
# HTTPCACHE_IGNORE_HTTP_CODES = []

# cache storage backend
# HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

"""
19. proxy, set via environment variables
    from scrapy.contrib.downloadermiddleware.httpproxy import HttpProxyMiddleware

    Option 1: use the default middleware
        os.environ
        {
            http_proxy:http://root:woshiniba@192.168.11.11:9999/
            https_proxy:http://192.168.11.11:9999/
        }
    Option 2: use a custom downloader middleware

    def to_bytes(text, encoding=None, errors='strict'):
        if isinstance(text, bytes):
            return text
        if not isinstance(text, six.string_types):
            raise TypeError('to_bytes must receive a unicode, str or bytes '
                            'object, got %s' % type(text).__name__)
        if encoding is None:
            encoding = 'utf-8'
        return text.encode(encoding, errors)

    class ProxyMiddleware(object):
        def process_request(self, request, spider):
            PROXIES = [
                {'ip_port': '111.11.228.75:80', 'user_pass': ''},
                {'ip_port': '120.198.243.22:80', 'user_pass': ''},
                {'ip_port': '111.8.60.9:8123', 'user_pass': ''},
                {'ip_port': '101.71.27.120:80', 'user_pass': ''},
                {'ip_port': '122.96.59.104:80', 'user_pass': ''},
                {'ip_port': '122.224.249.122:8088', 'user_pass': ''},
            ]
            proxy = random.choice(PROXIES)
            if proxy['user_pass'] is not None:
                request.meta['proxy'] = to_bytes("http://%s" % proxy['ip_port'])
                encoded_user_pass = base64.encodestring(to_bytes(proxy['user_pass']))
                request.headers['Proxy-Authorization'] = to_bytes('Basic ' + encoded_user_pass)
                print "**************ProxyMiddleware have pass************" + proxy['ip_port']
            else:
                print "**************ProxyMiddleware no pass************" + proxy['ip_port']
                request.meta['proxy'] = to_bytes("http://%s" % proxy['ip_port'])

    DOWNLOADER_MIDDLEWARES = {
       'step8_king.middlewares.ProxyMiddleware': 500,
    }
"""

"""
20. HTTPS access
    There are two cases when crawling HTTPS sites:
    1. the site uses a trusted certificate (supported by default)
        DOWNLOADER_HTTPCLIENTFACTORY = "scrapy.core.downloader.webclient.ScrapyHTTPClientFactory"
        DOWNLOADER_CLIENTCONTEXTFACTORY = "scrapy.core.downloader.contextfactory.ScrapyClientContextFactory"

    2. the site uses a custom certificate
        DOWNLOADER_HTTPCLIENTFACTORY = "scrapy.core.downloader.webclient.ScrapyHTTPClientFactory"
        DOWNLOADER_CLIENTCONTEXTFACTORY = "step8_king.https.MySSLFactory"

        # https.py
        from scrapy.core.downloader.contextfactory import ScrapyClientContextFactory
        from twisted.internet.ssl import (optionsForClientTLS, CertificateOptions, PrivateCertificate)

        class MySSLFactory(ScrapyClientContextFactory):
            def getCertificateOptions(self):
                from OpenSSL import crypto
                v1 = crypto.load_privatekey(crypto.FILETYPE_PEM, open('/Users/wupeiqi/client.key.unsecure', mode='r').read())
                v2 = crypto.load_certificate(crypto.FILETYPE_PEM, open('/Users/wupeiqi/client.pem', mode='r').read())
                return CertificateOptions(
                    privateKey=v1,   # pKey object
                    certificate=v2,  # X509 object
                    verify=False,
                    method=getattr(self, 'method', getattr(self, '_ssl_method', None))
                )
    Other:
    related classes
        scrapy.core.downloader.handlers.http.HttpDownloadHandler
        scrapy.core.downloader.webclient.ScrapyHTTPClientFactory
        scrapy.core.downloader.contextfactory.ScrapyClientContextFactory
    related settings
        DOWNLOADER_HTTPCLIENTFACTORY
        DOWNLOADER_CLIENTCONTEXTFACTORY
"""

"""
21. spider middleware
    class SpiderMiddleware(object):

        def process_spider_input(self, response, spider):
            '''
            called after the download finishes, before the response reaches the parse callback
            :param response:
            :param spider:
            :return:
            '''
            pass

        def process_spider_output(self, response, result, spider):
            '''
            called when the spider has finished processing and returns its output
            :param response:
            :param result:
            :param spider:
            :return: must return an iterable containing Request or Item objects
            '''
            return result

        def process_spider_exception(self, response, exception, spider):
            '''
            called on exception
            :param response:
            :param exception:
            :param spider:
            :return: None, hand the exception on to the remaining middlewares;
                     an iterable containing Response or Item objects, handed to the
                     scheduler or the pipelines
            '''
            return None

        def process_start_requests(self, start_requests, spider):
            '''
            called when the spider starts
            :param start_requests:
            :param spider:
            :return: an iterable containing Request objects
            '''
            return start_requests

    built-in spider middlewares:
        'scrapy.contrib.spidermiddleware.httperror.HttpErrorMiddleware': 50,
        'scrapy.contrib.spidermiddleware.offsite.OffsiteMiddleware': 500,
        'scrapy.contrib.spidermiddleware.referer.RefererMiddleware': 700,
        'scrapy.contrib.spidermiddleware.urllength.UrlLengthMiddleware': 800,
        'scrapy.contrib.spidermiddleware.depth.DepthMiddleware': 900,
"""
# from scrapy.contrib.spidermiddleware.referer import RefererMiddleware
# Enable or disable spider middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html
SPIDER_MIDDLEWARES = {
    # 'step8_king.middlewares.SpiderMiddleware': 543,
}

"""
22. downloader middleware
    class DownMiddleware1(object):
        def process_request(self, request, spider):
            '''
            called via every downloader middleware's process_request when a request
            is about to be downloaded
            :param request:
            :param spider:
            :return:
                None: continue with the remaining middlewares and download
                Response object: stop process_request and start process_response
                Request object: stop the middlewares and hand the Request back to the scheduler
                raise IgnoreRequest: stop process_request and start process_exception
            '''
            pass

        def process_response(self, request, response, spider):
            '''
            called on the way back, after the download
            :param response:
            :param result:
            :param spider:
            :return:
                Response object: passed on to the other middlewares' process_response
                Request object: stop the middlewares; the request is rescheduled for download
                raise IgnoreRequest: Request.errback is called
            '''
            print('response1')
            return response

        def process_exception(self, request, exception, spider):
            '''
            called when a download handler or a process_request() (downloader middleware)
            raises an exception
            :param response:
            :param exception:
            :param spider:
            :return:
                None: hand the exception on to the remaining middlewares
                Response object: stop the remaining process_exception methods
                Request object: stop the middlewares; the request will be rescheduled for download
            '''
            return None

    default downloader middlewares:
    {
        'scrapy.contrib.downloadermiddleware.robotstxt.RobotsTxtMiddleware': 100,
        'scrapy.contrib.downloadermiddleware.httpauth.HttpAuthMiddleware': 300,
        'scrapy.contrib.downloadermiddleware.downloadtimeout.DownloadTimeoutMiddleware': 350,
        'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': 400,
        'scrapy.contrib.downloadermiddleware.retry.RetryMiddleware': 500,
        'scrapy.contrib.downloadermiddleware.defaultheaders.DefaultHeadersMiddleware': 550,
        'scrapy.contrib.downloadermiddleware.redirect.MetaRefreshMiddleware': 580,
        'scrapy.contrib.downloadermiddleware.httpcompression.HttpCompressionMiddleware': 590,
        'scrapy.contrib.downloadermiddleware.redirect.RedirectMiddleware': 600,
        'scrapy.contrib.downloadermiddleware.cookies.CookiesMiddleware': 700,
        'scrapy.contrib.downloadermiddleware.httpproxy.HttpProxyMiddleware': 750,
        'scrapy.contrib.downloadermiddleware.chunked.ChunkedTransferMiddleware': 830,
        'scrapy.contrib.downloadermiddleware.stats.DownloaderStats': 850,
        'scrapy.contrib.downloadermiddleware.httpcache.HttpCacheMiddleware': 900,
    }
"""
# from scrapy.contrib.downloadermiddleware.httpauth import HttpAuthMiddleware
# Enable or disable downloader middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
# DOWNLOADER_MIDDLEWARES = {
#     'step8_king.middlewares.DownMiddleware1': 100,
#     'step8_king.middlewares.DownMiddleware2': 500,
# }