The error log is as follows:
2021-07-11 02:19:11 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <503 https://xxxx.com/tags/undef>: HTTP status code is not handled or not allowed
Problem analysis
- First, translate the HTML content returned with the 503 response.
The 503 error message reads:
Checking your browser before accessing xxxx.com
This process is automatic. Your browser will redirect to your requested content shortly.
Please allow up to 5 seconds…
- The message indicates the page is waiting about 5 seconds for a browser verification check. Searching online shows this is a Cloudflare mechanism designed to stop bots from scraping data, and the usual workaround is to use cfscrape to get past the waiting page. The configuration is as follows:
Install it with pip install cfscrape
import cfscrape
import scrapy

# Same User-Agent string as in the requests test further below
USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.78 Safari/537.36'

class DrdSpider(scrapy.Spider):
    def start_requests(self):
        cf_requests = []
        for url in self.start_urls:
            # Solve the Cloudflare challenge once, then reuse the clearance cookie
            token, agent = cfscrape.get_tokens(url, USER_AGENT)
            #token, agent = cfscrape.get_tokens(url)
            cf_requests.append(scrapy.Request(url=url,
                                              cookies={'__cfduid': token['__cfduid']},
                                              headers={'User-Agent': agent}))
            print("useragent in cfrequest:", agent)
            print("token in cfrequest:", token)
        return cf_requests
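One detail worth knowing about this approach: cfscrape.get_tokens() solves the challenge in an internal requests session and returns the clearance cookies together with the User-Agent it used, and as far as I know Cloudflare ties the cookies to that User-Agent, which is why the same agent value is passed back into each scrapy.Request.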
- But after setting this up, running the spider raised an error:
Traceback (most recent call last):
File "C:\workspace\new-crm-agent\env\lib\site-packages\scrapy\core\engine.py", line 129, in _next_request
request = next(slot.start_requests)
File "C:\workspace\phub\scrapy_obj\mySpider\spiders\drd.py", line 35, in start_requests
token, agent = cfscrape.get_tokens(url)
File "C:\workspace\new-crm-agent\env\lib\site-packages\cfscrape\__init__.py", line 398, in get_tokens
'Unable to find Cloudflare cookies. Does the site actually have Cloudflare IUAM ("I\'m Under Attack Mode") enabled?'
ValueError: Unable to find Cloudflare cookies. Does the site actually have Cloudflare IUAM ("I'm Under Attack Mode") enabled?
- The error message says this site doesn't appear to use Cloudflare at all. So I set a breakpoint on the line just before the error and inspected the request, and found the response status was actually 200.
So here is the question: why does cfscrape get a normal 200 while the Scrapy crawl gets a 503? (A quick way to inspect the 503 inside Scrapy is sketched below.)
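To see what Scrapy is actually being served, the 503 response can be routed to the spider callback instead of being dropped by HttpErrorMiddleware. A minimal sketch, with a placeholder spider name and the same placeholder URL as the log above:

import scrapy

class DebugSpider(scrapy.Spider):
    name = "debug"
    start_urls = ["https://xxxx.com/tags/undef"]
    handle_httpstatus_list = [503]  # let 503 responses reach parse() instead of being filtered out

    def parse(self, response):
        self.logger.info("status=%s", response.status)
        self.logger.info(response.text[:500])  # beginning of the Cloudflare challenge page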
- I thought it might be a problem with the Scrapy framework itself, so I made the same request with the requests module to see whether it could get through. It still came back with a 503:
import requests
from collections import OrderedDict

if __name__ == "__main__":
    session = requests.session()
    # Send the headers in the same order a real browser would
    heads = OrderedDict([('Host', None),
                         ('Connection', 'keep-alive'),
                         ('Upgrade-Insecure-Requests', '1'),
                         ('User-Agent',
                          'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.78 Safari/537.36'),
                         ('Accept',
                          'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8'),
                         ('Accept-Language', 'en-US,en;q=0.9'),
                         ('Accept-Encoding', 'gzip, deflate')])
    session.headers = heads
    resp = session.get("https://drd.com/tags/undi")
    print(resp)
The result:
<Response [503]>
Process finished with exit code 0
- So apart from cfscrape, neither Python's requests nor Scrapy can get through, even with browser-identical headers, which suggests the difference lies below the HTTP layer. cfscrape's source code must be doing something special; the relevant part looks like this:
class CloudflareAdapter(HTTPAdapter):
    """ HTTPS adapter that creates a SSL context with custom ciphers """

    def get_connection(self, *args, **kwargs):
        conn = super(CloudflareAdapter, self).get_connection(*args, **kwargs)
        if conn.conn_kw.get("ssl_context"):
            conn.conn_kw["ssl_context"].set_ciphers(DEFAULT_CIPHERS)
        else:
            context = create_urllib3_context(ciphers=DEFAULT_CIPHERS)
            conn.conn_kw["ssl_context"] = context
        return conn

class CloudflareScraper(Session):
    def __init__(self, *args, **kwargs):
        self.delay = kwargs.pop("delay", None)
        # Use headers with a random User-Agent if no custom headers have been set
        headers = OrderedDict(kwargs.pop("headers", DEFAULT_HEADERS))
        # Set the User-Agent header if it was not provided
        headers.setdefault("User-Agent", DEFAULT_USER_AGENT)
        super(CloudflareScraper, self).__init__(*args, **kwargs)
        # Define headers to force using an OrderedDict and preserve header order
        self.headers = headers
        self.org_method = None
        self.mount("https://", CloudflareAdapter())
- The problem is right here: self.mount("https://", CloudflareAdapter()). I reproduced this request logic with plain requests and got a normal 200. So the issue is likely that the HTTPS connection needs an ssl_context with specific ciphers set before the TLS handshake. I then looked up what set_ciphers actually does; the official Python documentation says:
SSLContext.set_ciphers(ciphers)
Set the available ciphers for sockets created with this context. It should be a string in the OpenSSL cipher list format. If no cipher can be selected (because compile-time options or other configuration forbids use of all the specified ciphers), an SSLError will be raised.
Note: when connected, the SSLSocket.cipher() method of SSL sockets will give the currently selected cipher.
TLS 1.3 cipher suites cannot be disabled with set_ciphers().
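The effect of a cipher filter can be checked directly with the standard library. A minimal sketch, using example.com as a stand-in host rather than the site from this post:

import socket
import ssl

context = ssl.create_default_context()
context.set_ciphers("DEFAULT:!DH")  # OpenSSL default list, minus DH key-exchange suites

with socket.create_connection(("example.com", 443)) as sock:
    with context.wrap_socket(sock, server_hostname="example.com") as tls:
        # SSLSocket.cipher() returns (name, protocol, secret_bits) for the negotiated cipher
        print(tls.cipher())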
- It looks like the site's port 443 connections are sensitive to which TLS/SSL cipher suites the client offers. For Scrapy the fix is a single setting:
# settings.py
DOWNLOADER_CLIENT_TLS_CIPHERS = "DEFAULT:!DH"
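Two notes on this setting: it takes the same OpenSSL cipher list format that set_ciphers() uses, and as far as I can tell DOWNLOADER_CLIENT_TLS_CIPHERS only became available in Scrapy 2.0, so older versions need the adapter approach shown next. The value "DEFAULT:!DH" simply means OpenSSL's default cipher list with the Diffie-Hellman (DH) key-exchange suites excluded.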
- With the requests module you have to swap in a custom HTTP adapter instead. The code is as follows:
import requests
from collections import OrderedDict
from requests.adapters import HTTPAdapter
from urllib3.util.ssl_ import create_urllib3_context

if __name__ == "__main__":
    # Exclude the DH cipher suites so the TLS handshake no longer trips the block
    ciphers = "DEFAULT:!DH"

    class TestAdapter(HTTPAdapter):
        def get_connection(self, *args, **kwargs):
            conn = super(TestAdapter, self).get_connection(*args, **kwargs)
            if conn.conn_kw.get("ssl_context"):
                conn.conn_kw["ssl_context"].set_ciphers(ciphers)
            else:
                context = create_urllib3_context(ciphers=ciphers)
                conn.conn_kw["ssl_context"] = context
            return conn

    session = requests.session()
    heads = OrderedDict([('Host', None),
                         ('Connection', 'keep-alive'),
                         ('Upgrade-Insecure-Requests', '1'),
                         ('User-Agent',
                          'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.78 Safari/537.36'),
                         ('Accept',
                          'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8'),
                         ('Accept-Language', 'en-US,en;q=0.9'),
                         ('Accept-Encoding', 'gzip, deflate')])
    session.headers = heads
    session.mount('https://', TestAdapter())
    resp = session.get("https://javdb.com/tags/uncensored")
    print(resp)
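With TestAdapter mounted on https://, the session negotiates TLS without the DH suites, and the request that previously returned <Response [503]> goes through, the same 200 I got when reproducing cfscrape's request logic earlier, with no need for cfscrape itself.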