Resolving 503 status codes when using scrapy and requests

This article is a repost. 2021-07-11 04:13

The error log is as follows:

2021-07-11 02:19:11 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <503 https://xxxx.com/tags/undef>: HTTP status code is not handled or not allowed
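This log line comes from Scrapy's HttpErrorMiddleware, which by default drops non-2xx responses before they reach the spider callback. To look at the body of the 503 page, the status code can be let through. The fragment below is a minimal sketch for the project's settings.py; the same effect can be had per spider with the `handle_httpstatus_list` attribute. Whether 503 should stay allowed after debugging is a judgment call, not something the original post prescribes.

```python
# settings.py (sketch): let 503 responses reach the spider callback
# so that their HTML body can be inspected while debugging.
HTTPERROR_ALLOWED_CODES = [503]
```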

Problem analysis

Looking at the HTML body returned with the 503 response

503 error message:

Checking your browser before accessing xxxx.com
This process is automatic. Your browser will redirect to your requested content shortly.
Please allow up to 5 seconds…
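To tell this browser-check page apart from an ordinary 503 (for example a real outage), the response body can be matched against the phrases quoted above. The helper below is a sketch only: the function name and the marker tuple are made up for illustration, and the markers are simply the strings seen in this particular page, which may differ on other sites.

```python
# Marker strings copied from the interstitial text quoted above (assumption:
# other sites may word the page differently).
CHALLENGE_MARKERS = (
    "Checking your browser before accessing",
    "Please allow up to 5 seconds",
)


def looks_like_browser_check(status_code: int, body: str) -> bool:
    """Return True when a 503 response body contains the browser-check text."""
    if status_code != 503:
        return False
    return any(marker in body for marker in CHALLENGE_MARKERS)


sample = "Checking your browser before accessing xxxx.com"
print(looks_like_browser_check(503, sample))
print(looks_like_browser_check(503, "Service Unavailable"))
```

In a Scrapy callback the same check could be applied to `response.status` and `response.text`, provided 503 responses are allowed to reach the callback.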

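Judging from this text, the page makes the browser wait about 5 seconds for verification. A quick search suggested that this is Cloudflare's anti-bot mechanism, which stops automated clients from fetching data, and that cfscrape can be used to get past the waiting page. The configuration is as follows:

Install: pip install cfscrape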

import cfscrape
import scrapy

class DrdSpider(scrapy.Spider):
    def start_requests(self):
        cf_requests = []
        for url in self.start_urls:
            # USER_AGENT is assumed to be defined elsewhere (e.g. imported from the project settings)
            token, agent = cfscrape.get_tokens(url, USER_AGENT)
            #token, agent = cfscrape.get_tokens(url)
            cf_requests.append(scrapy.Request(url=url, cookies={'__cfduid': token['__cfduid']}, headers={'User-Agent': agent}))
            print("useragent in cfrequest: ", agent)
            print("token in cfrequest: ", token)
        return cf_requests

However, running the spider with this configuration raised the following error:

Traceback (most recent call last):
  File "C:\workspace\new-crm-agent\env\lib\site-packages\scrapy\core\engine.py", line 129, in _next_request
    request = next(slot.start_requests)
  File "C:\workspace\phub\scrapy_obj\mySpider\spiders\drd.py", line 35, in start_requests
    token, agent = cfscrape.get_tokens(url)
  File "C:\workspace\new-crm-agent\env\lib\site-packages\cfscrape\__init__.py", line 398, in get_tokens
    'Unable to find Cloudflare cookies. Does the site actually have Cloudflare IUAM ("I\'m Under Attack Mode") enabled?'
ValueError: Unable to find Cloudflare cookies. Does the site actually have Cloudflare IUAM ("I'm Under Attack Mode") enabled?

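The error message suggests that the site is not using Cloudflare's protection at all, so I set a breakpoint on the line just before the exception and inspected the response. Its status code was 200.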

So the question is: why does a request made through cfscrape succeed with 200, while Scrapy gets a 503?

I suspected the Scrapy framework itself, so I tried fetching the page with the requests module to see whether that would work. It still returned 503:

from collections import OrderedDict

import requests

if __name__ == "__main__":
    session = requests.session()
    # Browser-like headers in a fixed order
    heads = OrderedDict([('Host', None),
             ('Connection', 'keep-alive'),
             ('Upgrade-Insecure-Requests', '1'),
             ('User-Agent',
              'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.78 Safari/537.36'),
             ('Accept',
              'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8'),
             ('Accept-Language', 'en-US,en;q=0.9'),
             ('Accept-Encoding', 'gzip, deflate')])
    session.headers = heads
    resp = session.get("https://drd.com/tags/undi")
    print(resp)

Result:

<Response [503]>
Process finished with exit code 0

So apart from cfscrape, neither plain requests nor Scrapy can access the site. Presumably cfscrape applies some special setup in its source. The relevant part is shown below:

class CloudflareAdapter(HTTPAdapter):
    """ HTTPS adapter that creates a SSL context with custom ciphers """

    def get_connection(self, *args, **kwargs):
        conn = super(CloudflareAdapter, self).get_connection(*args, **kwargs)

        if conn.conn_kw.get("ssl_context"):
            conn.conn_kw["ssl_context"].set_ciphers(DEFAULT_CIPHERS)
        else:
            context = create_urllib3_context(ciphers=DEFAULT_CIPHERS)
            conn.conn_kw["ssl_context"] = context

        return conn
        
class CloudflareScraper(Session):
    def __init__(self, *args, **kwargs):
        self.delay = kwargs.pop("delay", None)
        # Use headers with a random User-Agent if no custom headers have been set
        headers = OrderedDict(kwargs.pop("headers", DEFAULT_HEADERS))

        # Set the User-Agent header if it was not provided
        headers.setdefault("User-Agent", DEFAULT_USER_AGENT)

        super(CloudflareScraper, self).__init__(*args, **kwargs)

        # Define headers to force using an OrderedDict and preserve header order
        self.headers = headers
        self.org_method = None

        self.mount("https://", CloudflareAdapter())

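The key is this line: self.mount("https://", CloudflareAdapter()). When I replicated this logic with requests, the request succeeded with 200. The difference appears to lie in the TLS handshake that precedes the HTTPS request, specifically in how the ssl_context is configured. So I looked up what set_ciphers does. The official Python documentation explains: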

SSLContext.set_ciphers(ciphers)
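Set the available ciphers for sockets created with this context. It should be a string in the OpenSSL cipher list format. If no cipher can be selected (because compile-time options or other configuration forbids use of all the specified ciphers), an SSLError will be raised.

Note: when connected, the SSLSocket.cipher() method of SSL sockets will give the currently selected cipher.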
TLS 1.3 cipher suites cannot be disabled with set_ciphers().
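The effect of a cipher string can be checked locally with the standard library alone, without touching the network. The sketch below applies "DEFAULT:!DH" to a fresh context and lists what remains; the helper names are invented for illustration, and the exact suites printed depend on the local OpenSSL build. Only DHE-RSA and DHE-DSS names are checked, since those are the plain finite-field Diffie-Hellman suites that "!DH" is expected to remove.

```python
import ssl


def cipher_names(cipher_string):
    """Return the cipher suite names a context offers for a given cipher string."""
    ctx = ssl.create_default_context()
    ctx.set_ciphers(cipher_string)
    return [c["name"] for c in ctx.get_ciphers()]


def rejects(cipher_string):
    """Return True when set_ciphers() raises SSLError for the given string."""
    try:
        ssl.create_default_context().set_ciphers(cipher_string)
    except ssl.SSLError:
        return True
    return False


restricted = cipher_names("DEFAULT:!DH")
# Plain finite-field DHE suites (RSA or DSS authenticated) should be gone.
dhe_left = [n for n in restricted if n.startswith(("DHE-RSA-", "DHE-DSS-"))]

print("suites offered:", len(restricted))
print("DHE-RSA/DHE-DSS suites left:", dhe_left)
print("unknown cipher string rejected:", rejects("NOT-A-REAL-CIPHER"))
```

Because TLS 1.3 suites cannot be disabled this way, names such as TLS_AES_256_GCM_SHA384 may still appear in the list on builds that support TLS 1.3.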

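The site's TLS connection on port 443 apparently depends on which cipher suites the client offers, so the cipher list needs to be set as follows: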

/settings.py


DOWNLOADER_CLIENT_TLS_CIPHERS = "DEFAULT:!DH"

With the requests module, the HTTP adapter has to be modified instead. The code is as follows:

from collections import OrderedDict

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.ssl_ import create_urllib3_context

if __name__ == "__main__":
    ciphers = "DEFAULT:!DH"
    class TestAdapter(HTTPAdapter):
        # Same approach as cfscrape's CloudflareAdapter: apply a custom cipher list
        def get_connection(self, *args, **kwargs):
            conn = super(TestAdapter, self).get_connection(*args, **kwargs)
            if conn.conn_kw.get("ssl_context"):
                conn.conn_kw["ssl_context"].set_ciphers(ciphers)
            else:
                context = create_urllib3_context(ciphers=ciphers)
                conn.conn_kw["ssl_context"] = context
            return conn
    session = requests.session()
    heads = OrderedDict([('Host', None),
             ('Connection', 'keep-alive'),
             ('Upgrade-Insecure-Requests', '1'),
             ('User-Agent',
              'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.78 Safari/537.36'),
             ('Accept',
              'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8'),
             ('Accept-Language', 'en-US,en;q=0.9'),
             ('Accept-Encoding', 'gzip, deflate')])
    session.headers = heads
    # Route all https:// requests through the adapter with the custom ciphers
    session.mount('https://', TestAdapter())
    resp = session.get("https://javdb.com/tags/uncensored")
    print(resp)
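The adapter above hooks into get_connection, following cfscrape. As far as I know, requests 2.32 and later no longer call get_connection when sending, so that override may silently stop applying on newer installs; this should be checked against the installed version. An alternative sketch is to pass the context when the connection pool is created, by overriding init_poolmanager. The class and function names here (CipherAdapter, build_session) are hypothetical, the example.com URL is a placeholder, and no network request is made.

```python
import ssl

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.ssl_ import create_urllib3_context


class CipherAdapter(HTTPAdapter):
    """Hypothetical adapter that builds its connection pools with a custom cipher list."""

    def __init__(self, ciphers="DEFAULT:!DH", **kwargs):
        # Must be set before HTTPAdapter.__init__, which calls init_poolmanager().
        self._ciphers = ciphers
        super().__init__(**kwargs)

    def init_poolmanager(self, *args, **kwargs):
        # Hand urllib3 an SSLContext restricted to the chosen cipher string.
        kwargs["ssl_context"] = create_urllib3_context(ciphers=self._ciphers)
        return super().init_poolmanager(*args, **kwargs)


def build_session(ciphers="DEFAULT:!DH"):
    """Return a requests session whose https:// traffic uses CipherAdapter."""
    session = requests.Session()
    session.mount("https://", CipherAdapter(ciphers=ciphers))
    return session


session = build_session()
adapter = session.get_adapter("https://example.com/")
context = adapter.poolmanager.connection_pool_kw["ssl_context"]
print(type(adapter).__name__, isinstance(context, ssl.SSLContext))
# resp = session.get(url)  # the actual request is left out of this sketch
```

The browser-like headers from the original example can still be assigned to `session.headers` on the returned session.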
