The error log is as follows:
2021-07-11 02:19:11 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <503 https://xxxx.com/tags/undef>: HTTP status code is not handled or not allowed
Problem analysis
- First, translate the HTML content returned with the 503 response.
The 503 error message reads:
Checking your browser before accessing xxxx.com
This process is automatic. Your browser will redirect to your requested content shortly.
Please allow up to 5 seconds…
- The message indicates the page is waiting about 5 seconds for a browser verification check. Searching online shows this is a Cloudflare mechanism designed to stop bots from scraping data, and the usual workaround is to use cfscrape to get past the waiting page. The configuration is as follows:
Install it with pip install cfscrape
import cfscrape
import scrapy

# Same User-Agent string as in the requests test further below
USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.78 Safari/537.36'

class DrdSpider(scrapy.Spider):
    def start_requests(self):
        cf_requests = []
        for url in self.start_urls:
            # Solve the Cloudflare challenge once, then reuse the clearance cookie
            token, agent = cfscrape.get_tokens(url, USER_AGENT)
            #token, agent = cfscrape.get_tokens(url)
            cf_requests.append(scrapy.Request(url=url,
                                              cookies={'__cfduid': token['__cfduid']},
                                              headers={'User-Agent': agent}))
            print("useragent in cfrequest:", agent)
            print("token in cfrequest:", token)
        return cf_requests
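One detail worth knowing about this approach: cfscrape.get_tokens() solves the challenge in an internal requests session and returns the clearance cookies together with the User-Agent it used, and as far as I know Cloudflare ties the cookies to that User-Agent, which is why the same agent value is passed back into each scrapy.Request.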
- But after setting this up, running the spider raised an error:
Traceback (most recent call last):
File "C:\workspace\new-crm-agent\env\lib\site-packages\scrapy\core\engine.py", line 129, in _next_request
request = next(slot.start_requests)
File "C:\workspace\phub\scrapy_obj\mySpider\spiders\drd.py", line 35, in start_requests
token, agent = cfscrape.get_tokens(url)
File "C:\workspace\new-crm-agent\env\lib\site-packages\cfscrape\__init__.py", line 398, in get_tokens
'Unable to find Cloudflare cookies. Does the site actually have Cloudflare IUAM ("I\'m Under Attack Mode") enabled?'
ValueError: Unable to find Cloudflare cookies. Does the site actually have Cloudflare IUAM ("I'm Under Attack Mode") enabled?
- The error message says this site doesn't appear to use Cloudflare at all. So I set a breakpoint on the line just before the error and inspected the request, and found the response status was actually 200.
So here is the question: why does cfscrape get a normal 200 while the Scrapy crawl gets a 503? (A quick way to inspect the 503 inside Scrapy is sketched below.)
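To see what Scrapy is actually being served, the 503 response can be routed to the spider callback instead of being dropped by HttpErrorMiddleware. A minimal sketch, with a placeholder spider name and the same placeholder URL as the log above:

import scrapy

class DebugSpider(scrapy.Spider):
    name = "debug"
    start_urls = ["https://xxxx.com/tags/undef"]
    handle_httpstatus_list = [503]  # let 503 responses reach parse() instead of being filtered out

    def parse(self, response):
        self.logger.info("status=%s", response.status)
        self.logger.info(response.text[:500])  # beginning of the Cloudflare challenge page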
- I thought it might be a problem with the Scrapy framework itself, so I made the same request with the requests module to see whether it could get through. It still came back with a 503:
import requests
from collections import OrderedDict

if __name__ == "__main__":
    session = requests.session()
    # Send the headers in the same order a real browser would
    heads = OrderedDict([('Host', None),
                         ('Connection', 'keep-alive'),
                         ('Upgrade-Insecure-Requests', '1'),
                         ('User-Agent',
                          'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.78 Safari/537.36'),
                         ('Accept',
                          'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8'),
                         ('Accept-Language', 'en-US,en;q=0.9'),
                         ('Accept-Encoding', 'gzip, deflate')])
    session.headers = heads
    resp = session.get("https://drd.com/tags/undi")
    print(resp)
The result:
<Response [503]>
Process finished with exit code 0
- So apart from cfscrape, neither Python's requests nor Scrapy can get through, even with browser-identical headers, which suggests the difference lies below the HTTP layer. cfscrape's source code must be doing something special; the relevant part looks like this:
class CloudflareAdapter(HTTPAdapter):
    """ HTTPS adapter that creates a SSL context with custom ciphers """

    def get_connection(self, *args, **kwargs):
        conn = super(CloudflareAdapter, self).get_connection(*args, **kwargs)
        if conn.conn_kw.get("ssl_context"):
            conn.conn_kw["ssl_context"].set_ciphers(DEFAULT_CIPHERS)
        else:
            context = create_urllib3_context(ciphers=DEFAULT_CIPHERS)
            conn.conn_kw["ssl_context"] = context
        return conn

class CloudflareScraper(Session):
    def __init__(self, *args, **kwargs):
        self.delay = kwargs.pop("delay", None)
        # Use headers with a random User-Agent if no custom headers have been set
        headers = OrderedDict(kwargs.pop("headers", DEFAULT_HEADERS))
        # Set the User-Agent header if it was not provided
        headers.setdefault("User-Agent", DEFAULT_USER_AGENT)
        super(CloudflareScraper, self).__init__(*args, **kwargs)
        # Define headers to force using an OrderedDict and preserve header order
        self.headers = headers
        self.org_method = None
        self.mount("https://", CloudflareAdapter())
- The problem is right here: self.mount("https://", CloudflareAdapter()). I reproduced this request logic with plain requests and got a normal 200. So the issue is likely that the HTTPS connection needs an ssl_context with specific ciphers set before the TLS handshake. I then looked up what set_ciphers actually does; the official Python documentation says:
SSLContext.set_ciphers(ciphers)
Set the available ciphers for sockets created with this context. It should be a string in the OpenSSL cipher list format. If no cipher can be selected (because compile-time options or other configuration forbids use of all the specified ciphers), an SSLError will be raised.
Note: when connected, the SSLSocket.cipher() method of SSL sockets will give the currently selected cipher.
TLS 1.3 cipher suites cannot be disabled with set_ciphers().
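The effect of a cipher filter can be checked directly with the standard library. A minimal sketch, using example.com as a stand-in host rather than the site from this post:

import socket
import ssl

context = ssl.create_default_context()
context.set_ciphers("DEFAULT:!DH")  # OpenSSL default list, minus DH key-exchange suites

with socket.create_connection(("example.com", 443)) as sock:
    with context.wrap_socket(sock, server_hostname="example.com") as tls:
        # SSLSocket.cipher() returns (name, protocol, secret_bits) for the negotiated cipher
        print(tls.cipher())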
- It looks like the site's port 443 connections are sensitive to which TLS/SSL cipher suites the client offers. For Scrapy the fix is a single setting:
# settings.py
DOWNLOADER_CLIENT_TLS_CIPHERS = "DEFAULT:!DH"
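Two notes on this setting: it takes the same OpenSSL cipher list format that set_ciphers() uses, and as far as I can tell DOWNLOADER_CLIENT_TLS_CIPHERS only became available in Scrapy 2.0, so older versions need the adapter approach shown next. The value "DEFAULT:!DH" simply means OpenSSL's default cipher list with the Diffie-Hellman (DH) key-exchange suites excluded.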
- With the requests module you have to swap in a custom HTTP adapter instead. The code is as follows:
import requests
from collections import OrderedDict
from requests.adapters import HTTPAdapter
from urllib3.util.ssl_ import create_urllib3_context

if __name__ == "__main__":
    # Exclude the DH cipher suites so the TLS handshake no longer trips the block
    ciphers = "DEFAULT:!DH"

    class TestAdapter(HTTPAdapter):
        def get_connection(self, *args, **kwargs):
            conn = super(TestAdapter, self).get_connection(*args, **kwargs)
            if conn.conn_kw.get("ssl_context"):
                conn.conn_kw["ssl_context"].set_ciphers(ciphers)
            else:
                context = create_urllib3_context(ciphers=ciphers)
                conn.conn_kw["ssl_context"] = context
            return conn

    session = requests.session()
    heads = OrderedDict([('Host', None),
                         ('Connection', 'keep-alive'),
                         ('Upgrade-Insecure-Requests', '1'),
                         ('User-Agent',
                          'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.78 Safari/537.36'),
                         ('Accept',
                          'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8'),
                         ('Accept-Language', 'en-US,en;q=0.9'),
                         ('Accept-Encoding', 'gzip, deflate')])
    session.headers = heads
    session.mount('https://', TestAdapter())
    resp = session.get("https://javdb.com/tags/uncensored")
    print(resp)
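With TestAdapter mounted on https://, the session negotiates TLS without the DH suites, and the request that previously returned <Response [503]> goes through, the same 200 I got when reproducing cfscrape's request logic earlier, with no need for cfscrape itself.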