Crawlers -- Scrapy Downloader Middleware


Downloader Middleware

The downloader middleware is a framework of hooks into Scrapy's request/response processing. It is a light, low-level system for globally altering Scrapy's requests and responses.

Activating a downloader middleware

To activate a downloader middleware component, add it to the DOWNLOADER_MIDDLEWARES setting. This setting is a dict whose keys are the middleware class paths and whose values are the middleware orders.

Here is an example:

DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.CustomDownloaderMiddleware': 543,
}

The DOWNLOADER_MIDDLEWARES setting is merged with (but does not override) the DOWNLOADER_MIDDLEWARES_BASE setting defined in Scrapy, and the result is then sorted by order to get the final list of enabled middlewares: the first middleware is the one closest to the engine, and the last is the one closest to the downloader.

To decide which order to assign to your middleware, look at the DOWNLOADER_MIDDLEWARES_BASE setting and pick a value according to where you want it to sit. The order matters because each middleware performs a different action, and your middleware may depend on middlewares that run before (or after) it.

If you want to disable a built-in middleware (one defined in DOWNLOADER_MIDDLEWARES_BASE and enabled by default), you must define it in your project's DOWNLOADER_MIDDLEWARES setting and assign None as its value. For example, to disable the user-agent middleware:

DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.CustomDownloaderMiddleware': 543,
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
}

Finally, keep in mind that some middlewares may need to be enabled through a particular setting.
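For example (taking the built-in HTTP cache middleware as one such case; this snippet is an illustration, not part of the project built below), the middleware is registered by default but only starts doing anything once its setting is switched on in settings.py:

# settings.py -- HttpCacheMiddleware is listed in DOWNLOADER_MIDDLEWARES_BASE,
# but it stays inactive until this setting is enabled
HTTPCACHE_ENABLED = True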

Writing your own downloader middleware

Writing a downloader middleware is straightforward. Each middleware component is a Python class that defines one or more of the following methods:

class scrapy.contrib.downloadermiddleware.DownloaderMiddleware

1. process_request(request, spider)

This method is called for each request that passes through the downloader middleware.

process_request() must either return None, return a Response object, return a Request object, or raise IgnoreRequest.

If it returns None, Scrapy continues processing this request, executing the corresponding methods of the other middlewares, until the appropriate download handler is called and the request is performed (and its response downloaded).

If it returns a Response object, Scrapy will not call any other process_request() or process_exception() method, nor the download function; it returns that response instead. The process_response() methods of the installed middlewares are still called for every response.

If it returns a Request object, Scrapy stops calling process_request() methods and reschedules the returned request. Once the newly returned request has been performed, the middleware chain is called on the downloaded response as usual.

If it raises an IgnoreRequest exception, the process_exception() methods of the installed downloader middlewares are called. If none of them handles the exception, the request's errback (Request.errback) is called. If no code handles the raised exception, it is ignored and not logged (unlike other exceptions).

Parameters:
  • request (Request object) – the request being processed
  • spider (Spider object) – the spider this request is intended for
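
As a minimal sketch of these return values (the class name and the in-memory cache are hypothetical, not part of the project built below), a process_request() could serve a locally cached page and let everything else pass through:

import logging

from scrapy.http import HtmlResponse


class LocalCacheMiddleware(object):
    """Illustrates the return values of process_request()."""

    logger = logging.getLogger(__name__)
    cache = {}  # url -> body (bytes), purely in-memory for illustration

    def process_request(self, request, spider):
        body = self.cache.get(request.url)
        if body is not None:
            self.logger.debug('Serving %s from the local cache', request.url)
            # Returning a Response short-circuits the download; only the
            # process_response() methods of installed middlewares run afterwards.
            return HtmlResponse(url=request.url, body=body, encoding='utf-8', request=request)
        # Returning None lets Scrapy keep processing the request as usual.
        return None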

 

Demonstration:

Enter the following commands:

scrapy startproject httpbintest

cd httpbintest

scrapy genspider httpbin httpbin.org

之后進入工程修改httpbin.py的內容,修改為:

 

import scrapy

class HttpbinSpider(scrapy.Spider):
    name = 'httpbin'
    allowed_domains = ['httpbin.org']
    start_urls = ['http://httpbin.org/']

    def parse(self, response):
        print(response.text)

Run scrapy crawl httpbin.

The output is:

2018-10-11 12:04:53 [scrapy.utils.log] INFO: Scrapy 1.5.1 started (bot: httpbintest)
2018-10-11 12:04:53 [scrapy.utils.log] INFO: Versions: lxml 4.1.0.0, libxml2 2.9.4, cssselect 1.0.3, parsel 1.5.0, w3lib 1.19.0, Twisted 18.7.0, Python 3.6.3 |A
naconda custom (64-bit)| (default, Oct 15 2017, 03:27:45) [MSC v.1900 64 bit (AMD64)], pyOpenSSL 17.2.0 (OpenSSL 1.0.2p  14 Aug 2018), cryptography 2.0.3, Platf
orm Windows-7-6.1.7601-SP1
2018-10-11 12:04:53 [scrapy.crawler] INFO: Overridden settings: {'BOT_NAME': 'httpbintest', 'NEWSPIDER_MODULE': 'httpbintest.spiders', 'ROBOTSTXT_OBEY': True, '
SPIDER_MODULES': ['httpbintest.spiders']}
2018-10-11 12:04:54 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.logstats.LogStats']
2018-10-11 12:04:55 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2018-10-11 12:04:55 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2018-10-11 12:04:55 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2018-10-11 12:04:55 [scrapy.core.engine] INFO: Spider opened
2018-10-11 12:04:55 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-10-11 12:04:55 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2018-10-11 12:04:56 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://httpbin.org/robots.txt> (referer: None)
2018-10-11 12:04:58 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://httpbin.org/> (referer: None)
<!DOCTYPE html>
<html lang="en">

<head>
    <meta charset="UTF-8">
    <title>httpbin.org</title>
    <link href="https://fonts.googleapis.com/css?family=Open+Sans:400,700|Source+Code+Pro:300,600|Titillium+Web:400,600,700"
        rel="stylesheet">
    <link rel="stylesheet" type="text/css" href="/flasgger_static/swagger-ui.css">
    <link rel="icon" type="image/png" href="/static/favicon.ico" sizes="64x64 32x32 16x16" />
    <style>
        html {
            box-sizing: border-box;
            overflow: -moz-scrollbars-vertical;
            overflow-y: scroll;
        }

        *,
        *:before,
        *:after {
            box-sizing: inherit;
        }

        body {
            margin: 0;
            background: #fafafa;
        }
    </style>
</head>

<body>
    <a href="https://github.com/requests/httpbin" class="github-corner" aria-label="View source on Github">
        <svg width="80" height="80" viewBox="0 0 250 250" style="fill:#151513; color:#fff; position: absolute; top: 0; border: 0; right: 0;"
            aria-hidden="true">
            <path d="M0,0 L115,115 L130,115 L142,142 L250,250 L250,0 Z"></path>
            <path d="M128.3,109.0 C113.8,99.7 119.0,89.6 119.0,89.6 C122.0,82.7 120.5,78.6 120.5,78.6 C119.2,72.0 123.4,76.3 123.4,76.3 C127.3,80.9 125.5,87.3 1
25.5,87.3 C122.9,97.6 130.6,101.9 134.4,103.2"
                fill="currentColor" style="transform-origin: 130px 106px;" class="octo-arm"></path>
            <path d="M115.0,115.0 C114.9,115.1 118.7,116.5 119.8,115.4 L133.7,101.6 C136.9,99.2 139.9,98.4 142.2,98.6 C133.8,88.0 127.5,74.4 143.8,58.0 C148.5,5
3.4 154.0,51.2 159.7,51.0 C160.3,49.4 163.2,43.6 171.4,40.1 C171.4,40.1 176.1,42.5 178.8,56.2 C183.1,58.6 187.2,61.8 190.9,65.4 C194.5,69.0 197.7,73.2 200.1,77.
6 C213.8,80.2 216.3,84.9 216.3,84.9 C212.7,93.1 206.9,96.0 205.4,96.6 C205.1,102.4 203.0,107.8 198.3,112.5 C181.9,128.9 168.3,122.5 157.7,114.1 C157.9,116.9 156
.7,120.9 152.7,124.9 L141.0,136.5 C139.8,137.7 141.6,141.9 141.8,141.8 Z"
                fill="currentColor" class="octo-body"></path>
        </svg>
    </a>
    <svg xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" style="position:absolute;width:0;height:0">
        <defs>
            <symbol viewBox="0 0 20 20" id="unlocked">
                <path d="M15.8 8H14V5.6C14 2.703 12.665 1 10 1 7.334 1 6 2.703 6 5.6V6h2v-.801C8 3.754 8.797 3 10 3c1.203 0 2 .754 2 2.199V8H4c-.553 0-1 .646-1
1.199V17c0 .549.428 1.139.951 1.307l1.197.387C5.672 18.861 6.55 19 7.1 19h5.8c.549 0 1.428-.139 1.951-.307l1.196-.387c.524-.167.953-.757.953-1.306V9.199C17 8.64
6 16.352 8 15.8 8z"></path>
            </symbol>

            <symbol viewBox="0 0 20 20" id="locked">
                <path d="M15.8 8H14V5.6C14 2.703 12.665 1 10 1 7.334 1 6 2.703 6 5.6V8H4c-.553 0-1 .646-1 1.199V17c0 .549.428 1.139.951 1.307l1.197.387C5.672 18
.861 6.55 19 7.1 19h5.8c.549 0 1.428-.139 1.951-.307l1.196-.387c.524-.167.953-.757.953-1.306V9.199C17 8.646 16.352 8 15.8 8zM12 8H8V5.199C8 3.754 8.797 3 10 3c1
.203 0 2 .754 2 2.199V8z"
                />
            </symbol>

            <symbol viewBox="0 0 20 20" id="close">
                <path d="M14.348 14.849c-.469.469-1.229.469-1.697 0L10 11.819l-2.651 3.029c-.469.469-1.229.469-1.697 0-.469-.469-.469-1.229 0-1.697l2.758-3.15-2
.759-3.152c-.469-.469-.469-1.228 0-1.697.469-.469 1.228-.469 1.697 0L10 8.183l2.651-3.031c.469-.469 1.228-.469 1.697 0 .469.469.469 1.229 0 1.697l-2.758 3.152 2
.758 3.15c.469.469.469 1.229 0 1.698z"
                />
            </symbol>

            <symbol viewBox="0 0 20 20" id="large-arrow">
                <path d="M13.25 10L6.109 2.58c-.268-.27-.268-.707 0-.979.268-.27.701-.27.969 0l7.83 7.908c.268.271.268.709 0 .979l-7.83 7.908c-.268.271-.701.27-
.969 0-.268-.269-.268-.707 0-.979L13.25 10z"
                />
            </symbol>

            <symbol viewBox="0 0 20 20" id="large-arrow-down">
                <path d="M17.418 6.109c.272-.268.709-.268.979 0s.271.701 0 .969l-7.908 7.83c-.27.268-.707.268-.979 0l-7.908-7.83c-.27-.268-.27-.701 0-.969.271-.
268.709-.268.979 0L10 13.25l7.418-7.141z"
                />
            </symbol>


            <symbol viewBox="0 0 24 24" id="jump-to">
                <path d="M19 7v4H5.83l3.58-3.59L8 6l-6 6 6 6 1.41-1.41L5.83 13H21V7z" />
            </symbol>

            <symbol viewBox="0 0 24 24" id="expand">
                <path d="M10 18h4v-2h-4v2zM3 6v2h18V6H3zm3 7h12v-2H6v2z" />
            </symbol>

        </defs>
    </svg>


    <div id="swagger-ui">
        <div data-reactroot="" class="swagger-ui">
            <div>
                <div class="information-container wrapper">
                    <section class="block col-12">
                        <div class="info">
                            <hgroup class="main">
                                <h2 class="title">httpbin.org
                                    <small>
                                        <pre class="version">0.9.2</pre>
                                    </small>
                                </h2>
                                <pre class="base-url">[ Base URL: httpbin.org/ ]</pre>
                            </hgroup>
                            <div class="description">
                                <div class="markdown">
                                    <p>A simple HTTP Request &amp; Response Service.
                                        <br>
                                        <br>
                                        <b>Run locally: </b>
                                        <code>$ docker run -p 80:80 kennethreitz/httpbin</code>
                                    </p>
                                </div>
                            </div>
                            <div>
                                <div>
                                    <a href="https://kennethreitz.org" target="_blank">the developer - Website</a>
                                </div>
                                <a href="mailto:me@kennethreitz.org">Send email to the developer</a>
                            </div>
                        </div>
                        <!-- ADDS THE LOADER SPINNER -->
                        <div class="loading-container">
                            <div class="loading"></div>
                        </div>

                    </section>
                </div>
            </div>
        </div>
    </div>


    <div class='swagger-ui'>
        <div class="wrapper">
            <section class="clear">
                <span style="float: right;">
                    [Powered by
                    <a target="_blank" href="https://github.com/rochacbruno/flasgger">Flasgger</a>]
                    <br>
                </span>
            </section>
        </div>
    </div>



    <script src="/flasgger_static/swagger-ui-bundle.js"> </script>
    <script src="/flasgger_static/swagger-ui-standalone-preset.js"> </script>
    <script src='/flasgger_static/%20lib/jquery.min.js' type='text/javascript'></script>
    <script>

        window.onload = function () {


            fetch("/spec.json")
                .then(function (response) {
                    response.json()
                        .then(function (json) {
                            var current_protocol = window.location.protocol.slice(0, -1);
                            if (json.schemes[0] != current_protocol) {
                                // Switches scheme to the current in use
                                var other_protocol = json.schemes[0];
                                json.schemes[0] = current_protocol;
                                json.schemes[1] = other_protocol;

                            }
                            json.host = window.location.host;  // sets the current host

                            const ui = SwaggerUIBundle({
                                spec: json,
                                validatorUrl: null,
                                dom_id: '#swagger-ui',
                                deepLinking: true,
                                jsonEditor: true,
                                docExpansion: "none",
                                apisSorter: "alpha",
                                //operationsSorter: "alpha",
                                presets: [
                                    SwaggerUIBundle.presets.apis,
                                    // yay ES6 modules ↘
                                    Array.isArray(SwaggerUIStandalonePreset) ? SwaggerUIStandalonePreset : SwaggerUIStandalonePreset.default
                                ],
                                plugins: [
                                    SwaggerUIBundle.plugins.DownloadUrl
                                ],

            // layout: "StandaloneLayout"  // uncomment to enable the green top header
        })

        window.ui = ui

        // uncomment to rename the top brand if layout is enabled
        // $(".topbar-wrapper .link span").replaceWith("<span>httpbin</span>");
        })
    })
}
    </script>

<script type="text/javascript">
  var _gauges = _gauges || [];
  (function() {
    var t   = document.createElement('script');
    t.type  = 'text/javascript';
    t.async = true;
    t.id    = 'gauges-tracker';
    t.setAttribute('data-site-id', '58cb2e71c88d9043ac01d000');
    t.setAttribute('data-track-path', 'https://track.gaug.es/track.gif');
    t.src = 'https://d36ee2fcip1434.cloudfront.net/track.js';
    var s = document.getElementsByTagName('script')[0];
    s.parentNode.insertBefore(t, s);
  })();
</script>  <div class='swagger-ui'>
    <div class="wrapper">
        <section class="block col-12 block-desktop col-12-desktop">
            <div>

                <h2>Other Utilities</h2>

                <ul>
                    <li>
                        <a href="/forms/post">HTML form</a> that posts to /post /forms/post</li>
                </ul>

                <br />
                <br />
            </div>
        </section>
    </div>
</div>
</body>

</html>
2018-10-11 12:04:58 [scrapy.core.engine] INFO: Closing spider (finished)
2018-10-11 12:04:58 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 430,
 'downloader/request_count': 2,
 'downloader/request_method_count/GET': 2,
 'downloader/response_bytes': 10556,
 'downloader/response_count': 2,
 'downloader/response_status_count/200': 2,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2018, 10, 11, 4, 4, 58, 747159),
 'log_count/DEBUG': 3,
 'log_count/INFO': 7,
 'response_received_count': 2,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2018, 10, 11, 4, 4, 55, 342964)}
2018-10-11 12:04:58 [scrapy.core.engine] INFO: Spider closed (finished)

Looking at the printed output, we find a line like "origin": "221.208.253.90". Looking that IP up online shows it is our own local address (Harbin, Heilongjiang, China Unicom). What we want now is a proxy middleware that disguises our IP.

Now we write the middleware in the project's middlewares.py file, as follows:

import logging


class ProxyMiddleware(object):  # proxy middleware

    logger = logging.getLogger(__name__)  # convenient handle for Scrapy log output

    def process_request(self, request, spider):
        self.logger.debug('Using Proxy')
        request.meta['proxy'] = 'http://211.101.136.86:49784'  # attach a proxy to the request; the download will go through it
        return None

To enable it, add the following to settings.py:

DOWNLOADER_MIDDLEWARES = {
   'httpbintest.middlewares.ProxyMiddleware': 543,
}

Then run scrapy crawl httpbin.

The output is:

2018-10-11 14:16:58 [scrapy.utils.log] INFO: Scrapy 1.5.1 started (bot: httpbintest)
2018-10-11 14:16:58 [scrapy.utils.log] INFO: Versions: lxml 4.1.0.0, libxml2 2.9.4, cssselect 1.0.3, parsel 1.5.0, w3lib 1.19.0, Twisted 18.7.0, Python 3.6.3 |A
naconda custom (64-bit)| (default, Oct 15 2017, 03:27:45) [MSC v.1900 64 bit (AMD64)], pyOpenSSL 17.2.0 (OpenSSL 1.0.2p  14 Aug 2018), cryptography 2.0.3, Platf
orm Windows-7-6.1.7601-SP1
2018-10-11 14:16:58 [scrapy.crawler] INFO: Overridden settings: {'BOT_NAME': 'httpbintest', 'NEWSPIDER_MODULE': 'httpbintest.spiders', 'ROBOTSTXT_OBEY': True, '
SPIDER_MODULES': ['httpbintest.spiders']}
2018-10-11 14:16:58 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.logstats.LogStats']
2018-10-11 14:16:59 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'httpbintest.middlewares.ProxyMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2018-10-11 14:16:59 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2018-10-11 14:16:59 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2018-10-11 14:16:59 [scrapy.core.engine] INFO: Spider opened
2018-10-11 14:16:59 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-10-11 14:16:59 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2018-10-11 14:16:59 [httpbintest.middlewares] DEBUG: Using Proxy
2018-10-11 14:16:59 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://httpbin.org/robots.txt> (referer: None)
2018-10-11 14:16:59 [httpbintest.middlewares] DEBUG: Using Proxy
2018-10-11 14:17:00 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://httpbin.org/get> (referer: None)
{
  "args": {},
  "headers": {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Encoding": "gzip,deflate",
    "Accept-Language": "en",
    "Connection": "close",
    "Host": "httpbin.org",
    "User-Agent": "Scrapy/1.5.1 (+https://scrapy.org)"
  },
  "origin": "211.101.136.86",
  "url": "http://httpbin.org/get"
}

2018-10-11 14:17:00 [scrapy.core.engine] INFO: Closing spider (finished)
2018-10-11 14:17:00 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 433,
 'downloader/request_count': 2,
 'downloader/request_method_count/GET': 2,
 'downloader/response_bytes': 794,
 'downloader/response_count': 2,
 'downloader/response_status_count/200': 2,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2018, 10, 11, 6, 17, 0, 472256),
 'log_count/DEBUG': 5,
 'log_count/INFO': 7,
 'response_received_count': 2,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2018, 10, 11, 6, 16, 59, 96177)}
2018-10-11 14:17:00 [scrapy.core.engine] INFO: Spider closed (finished)

2. process_response(request, response, spider)

process_response() must either return a Response object, return a Request object, or raise an IgnoreRequest exception.

If it returns a Response (it could be the same response that was passed in, or a brand-new one), that response continues to be processed by the process_response() methods of the other middlewares in the chain.

If it returns a Request object, the middleware chain is halted and the returned request is rescheduled to be downloaded, just as when process_request() returns a request.

If it raises an IgnoreRequest exception, the request's errback (Request.errback) is called. If no code handles the raised exception, it is ignored and not logged (unlike other exceptions).

Parameters:
  • request (Request object) – the request that originated the response
  • response (Response object) – the response being processed
  • spider (Spider object) – the spider this response is intended for
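
As a small sketch of these return values (the class name and the status list are assumptions, not something used in the project below), a process_response() could re-schedule a request when the server answers with a temporary error and pass every other response straight through:

import logging


class RetryOnStatusMiddleware(object):
    """Illustrates the return values of process_response()."""

    logger = logging.getLogger(__name__)
    retry_statuses = (500, 502, 503)

    def process_response(self, request, response, spider):
        if response.status in self.retry_statuses and not request.meta.get('retried'):
            self.logger.debug('Re-scheduling %s (got status %s)', request.url, response.status)
            new_meta = dict(request.meta, retried=True)
            # Returning a Request halts the middleware chain and re-schedules the download.
            return request.replace(meta=new_meta, dont_filter=True)
        # Returning a Response hands it to the next middleware's process_response().
        return response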

Now we put the following in middlewares.py:

import logging


class ProxyMiddleware(object):  # proxy middleware

    logger = logging.getLogger(__name__)  # convenient handle for Scrapy log output

    def process_request(self, request, spider):
        self.logger.debug('Using Proxy')
        request.meta['proxy'] = 'http://211.101.136.86:49784'  # attach a proxy to the request

    def process_response(self, request, response, spider):
        response.status = 201  # rewrite the status code before the response reaches the spider
        return response

And change httpbin.py to:

import scrapy

class HttpbinSpider(scrapy.Spider):
    name = 'httpbin'
    allowed_domains = ['httpbin.org']
    start_urls = ['http://httpbin.org/get']

    def parse(self, response):
        print(response.text)
        print(response.status)

Run scrapy crawl httpbin and look at the output:

C:\Users\Administrator\Desktop\爬蟲程序\Scrapy\httpbintest>scrapy crawl httpbin
2018-10-11 14:50:40 [scrapy.utils.log] INFO: Scrapy 1.5.1 started (bot: httpbintest)
2018-10-11 14:50:40 [scrapy.utils.log] INFO: Versions: lxml 4.1.0.0, libxml2 2.9.4, cssselect 1.0.3, parsel 1.5.0, w3lib 1.19.0, Twisted 18.7.0, Python 3.6.3 |A
naconda custom (64-bit)| (default, Oct 15 2017, 03:27:45) [MSC v.1900 64 bit (AMD64)], pyOpenSSL 17.2.0 (OpenSSL 1.0.2p  14 Aug 2018), cryptography 2.0.3, Platf
orm Windows-7-6.1.7601-SP1
2018-10-11 14:50:40 [scrapy.crawler] INFO: Overridden settings: {'BOT_NAME': 'httpbintest', 'NEWSPIDER_MODULE': 'httpbintest.spiders', 'ROBOTSTXT_OBEY': True, '
SPIDER_MODULES': ['httpbintest.spiders']}
2018-10-11 14:50:40 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.logstats.LogStats']
2018-10-11 14:50:40 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'httpbintest.middlewares.ProxyMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2018-10-11 14:50:40 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2018-10-11 14:50:40 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2018-10-11 14:50:40 [scrapy.core.engine] INFO: Spider opened
2018-10-11 14:50:40 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-10-11 14:50:40 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2018-10-11 14:50:40 [httpbintest.middlewares] DEBUG: Using Proxy
2018-10-11 14:50:41 [scrapy.core.engine] DEBUG: Crawled (201) <GET http://httpbin.org/robots.txt> (referer: None)
2018-10-11 14:50:41 [httpbintest.middlewares] DEBUG: Using Proxy
2018-10-11 14:50:41 [scrapy.core.engine] DEBUG: Crawled (201) <GET http://httpbin.org/get> (referer: None)
{
  "args": {},
  "headers": {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Encoding": "gzip,deflate",
    "Accept-Language": "en",
    "Connection": "close",
    "Host": "httpbin.org",
    "User-Agent": "Scrapy/1.5.1 (+https://scrapy.org)"
  },
  "origin": "211.101.136.86",
  "url": "http://httpbin.org/get"
}

201
2018-10-11 14:50:41 [scrapy.core.engine] INFO: Closing spider (finished)
2018-10-11 14:50:41 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 433,
 'downloader/request_count': 2,
 'downloader/request_method_count/GET': 2,
 'downloader/response_bytes': 794,
 'downloader/response_count': 2,
 'downloader/response_status_count/200': 2,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2018, 10, 11, 6, 50, 41, 966878),
 'log_count/DEBUG': 5,
 'log_count/INFO': 7,
 'response_received_count': 2,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2018, 10, 11, 6, 50, 40, 436791)}
2018-10-11 14:50:41 [scrapy.core.engine] INFO: Spider closed (finished)

We can see this line in the output:

A 201 status code was printed. What this means is that after the downloader fetched the response, the middleware rewrote the status code to 201 before passing the response on, so the status code the spider receives becomes 201.

3. process_exception(request, exception, spider)

Scrapy calls process_exception() when a download handler or a process_request() (from a downloader middleware) raises an exception (including an IgnoreRequest exception).

process_exception() should return either None, a Response object, or a Request object.

If it returns None, Scrapy continues processing this exception, executing the process_exception() methods of the other installed middlewares, until no middleware is left and the default exception handling kicks in.

If it returns a Response object, the process_response() method chain of the installed middlewares is started, and Scrapy will not call any other process_exception() method.

If it returns a Request object, the returned request is rescheduled to be downloaded. This stops the execution of the process_exception() methods of the middlewares, the same as returning a response would.

Parameters:
  • request (Request object) – the request that generated the exception
  • exception (Exception object) – the raised exception
  • spider (Spider object) – the spider this request is intended for
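
As a brief sketch (the class name and the idea of substituting an empty page are assumptions, not the approach used below), process_exception() can also swallow a download error entirely by returning a Response, so the spider callback still runs:

import logging

from scrapy.http import HtmlResponse


class SilenceErrorsMiddleware(object):
    """Illustrates returning a Response from process_exception()."""

    logger = logging.getLogger(__name__)

    def process_exception(self, request, exception, spider):
        self.logger.debug('Swallowing %r for %s', exception, request.url)
        # Returning a Response makes Scrapy run the process_response() chain and
        # then the spider callback, instead of retrying or failing the request.
        return HtmlResponse(url=request.url, status=200, body=b'', encoding='utf-8', request=request)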

Let's try it by crawling Google through a proxy. First run scrapy genspider google www.google.com, which creates a google.py file under spiders. To avoid unnecessary trouble, set ROBOTSTXT_OBEY to False in settings.py.

Now put the following into google.py:

# -*- coding: utf-8 -*-
import scrapy


class GoogleSpider(scrapy.Spider):
    name = 'google'
    allowed_domains = ['www.google.com']
    start_urls = ['http://www.google.com/']

    def make_requests_from_url(self, url):
        return scrapy.Request(url=url, meta={'download_timeout': 10}, callback=self.parse)  # set the download timeout


    def parse(self, response):
        print(response.text)

And change the middleware in middlewares.py to:

import logging


class ProxyMiddleware(object):  # proxy middleware

    logger = logging.getLogger(__name__)  # convenient handle for Scrapy log output

    def process_exception(self, request, exception, spider):
        return request

Run scrapy crawl google and look at the output:

2018-10-11 16:21:07 [scrapy.utils.log] INFO: Scrapy 1.5.1 started (bot: httpbintest)
2018-10-11 16:21:07 [scrapy.utils.log] INFO: Versions: lxml 4.1.0.0, libxml2 2.9.4, cssselect 1.0.3, parsel 1.5.0, w3lib 1.19.0, Twisted 18.7.0, Python 3.6.3 |A
naconda custom (64-bit)| (default, Oct 15 2017, 03:27:45) [MSC v.1900 64 bit (AMD64)], pyOpenSSL 17.2.0 (OpenSSL 1.0.2p  14 Aug 2018), cryptography 2.0.3, Platf
orm Windows-7-6.1.7601-SP1
2018-10-11 16:21:07 [scrapy.crawler] INFO: Overridden settings: {'BOT_NAME': 'httpbintest', 'NEWSPIDER_MODULE': 'httpbintest.spiders', 'SPIDER_MODULES': ['httpb
intest.spiders']}
2018-10-11 16:21:07 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.logstats.LogStats']
2018-10-11 16:21:08 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'httpbintest.middlewares.ProxyMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2018-10-11 16:21:08 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2018-10-11 16:21:08 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2018-10-11 16:21:08 [scrapy.core.engine] INFO: Spider opened
2018-10-11 16:21:08 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-10-11 16:21:08 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2018-10-11 16:21:08 [py.warnings] WARNING: G:\Anaconda3-5.0.1\install\lib\site-packages\scrapy\spiders\__init__.py:76: UserWarning: Spider.make_requests_from_ur
l method is deprecated; it won't be called in future Scrapy releases. Please override Spider.start_requests method instead (see httpbintest.spiders.google.Googl
eSpider).
  cls.__module__, cls.__name__

2018-10-11 16:21:09 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET http://www.google.com/> (failed 1 times): Connection was refused by other side: 10
061: 由於目標計算機積極拒絕,無法連接。.
2018-10-11 16:21:10 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET http://www.google.com/> (failed 2 times): Connection was refused by other side: 10
061: 由於目標計算機積極拒絕,無法連接。.
2018-10-11 16:21:11 [scrapy.downloadermiddlewares.retry] DEBUG: Gave up retrying <GET http://www.google.com/> (failed 3 times): Connection was refused by other
side: 10061: 由於目標計算機積極拒絕,無法連接。.
2018-10-11 16:21:12 [scrapy.downloadermiddlewares.retry] DEBUG: Gave up retrying <GET http://www.google.com/> (failed 3 times): Connection was refused by other
side: 10061: 由於目標計算機積極拒絕,無法連接。.
2018-10-11 16:21:13 [scrapy.downloadermiddlewares.retry] DEBUG: Gave up retrying <GET http://www.google.com/> (failed 3 times): Connection was refused by other
side: 10061: 由於目標計算機積極拒絕,無法連接。.
2018-10-11 16:21:14 [scrapy.downloadermiddlewares.retry] DEBUG: Gave up retrying <GET http://www.google.com/> (failed 3 times): Connection was refused by other
side: 10061: 由於目標計算機積極拒絕,無法連接。.
2018-10-11 16:21:15 [scrapy.downloadermiddlewares.retry] DEBUG: Gave up retrying <GET http://www.google.com/> (failed 3 times): Connection was refused by other
side: 10061: 由於目標計算機積極拒絕,無法連接。.
2018-10-11 16:21:16 [scrapy.downloadermiddlewares.retry] DEBUG: Gave up retrying <GET http://www.google.com/> (failed 3 times): Connection was refused by other
side: 10061: 由於目標計算機積極拒絕,無法連接。.
2018-10-11 16:21:17 [scrapy.downloadermiddlewares.retry] DEBUG: Gave up retrying <GET http://www.google.com/> (failed 3 times): Connection was refused by other
side: 10061: 由於目標計算機積極拒絕,無法連接。.
2018-10-11 16:21:18 [scrapy.downloadermiddlewares.retry] DEBUG: Gave up retrying <GET http://www.google.com/> (failed 3 times): Connection was refused by other
side: 10061: 由於目標計算機積極拒絕,無法連接。.
2018-10-11 16:21:19 [scrapy.downloadermiddlewares.retry] DEBUG: Gave up retrying <GET http://www.google.com/> (failed 3 times): Connection was refused by other
side: 10061: 由於目標計算機積極拒絕,無法連接。.
2018-10-11 16:21:20 [scrapy.downloadermiddlewares.retry] DEBUG: Gave up retrying <GET http://www.google.com/> (failed 3 times): Connection was refused by other
side: 10061: 由於目標計算機積極拒絕,無法連接。.
2018-10-11 16:21:21 [scrapy.downloadermiddlewares.retry] DEBUG: Gave up retrying <GET http://www.google.com/> (failed 3 times): Connection was refused by other
side: 10061: 由於目標計算機積極拒絕,無法連接。.
2018-10-11 16:21:22 [scrapy.downloadermiddlewares.retry] DEBUG: Gave up retrying <GET http://www.google.com/> (failed 3 times): Connection was refused by other
side: 10061: 由於目標計算機積極拒絕,無法連接。.
2018-10-11 16:21:23 [scrapy.downloadermiddlewares.retry] DEBUG: Gave up retrying <GET http://www.google.com/> (failed 3 times): Connection was refused by other
side: 10061: 由於目標計算機積極拒絕,無法連接。.
2018-10-11 16:21:24 [scrapy.downloadermiddlewares.retry] DEBUG: Gave up retrying <GET http://www.google.com/> (failed 3 times): Connection was refused by other
side: 10061: 由於目標計算機積極拒絕,無法連接。.
2018-10-11 16:21:25 [scrapy.downloadermiddlewares.retry] DEBUG: Gave up retrying <GET http://www.google.com/> (failed 3 times): Connection was refused by other
side: 10061: 由於目標計算機積極拒絕,無法連接。.
2018-10-11 16:21:26 [scrapy.downloadermiddlewares.retry] DEBUG: Gave up retrying <GET http://www.google.com/> (failed 3 times): Connection was refused by other
side: 10061: 由於目標計算機積極拒絕,無法連接。.
2018-10-11 16:21:27 [scrapy.downloadermiddlewares.retry] DEBUG: Gave up retrying <GET http://www.google.com/> (failed 3 times): Connection was refused by other
side: 10061: 由於目標計算機積極拒絕,無法連接。.
2018-10-11 16:21:28 [scrapy.downloadermiddlewares.retry] DEBUG: Gave up retrying <GET http://www.google.com/> (failed 3 times): Connection was refused by other
side: 10061: 由於目標計算機積極拒絕,無法連接。.
2018-10-11 16:21:29 [scrapy.downloadermiddlewares.retry] DEBUG: Gave up retrying <GET http://www.google.com/> (failed 3 times): Connection was refused by other
side: 10061: 由於目標計算機積極拒絕,無法連接。.
2018-10-11 16:21:30 [scrapy.downloadermiddlewares.retry] DEBUG: Gave up retrying <GET http://www.google.com/> (failed 3 times): Connection was refused by other
side: 10061: 由於目標計算機積極拒絕,無法連接。.
2018-10-11 16:21:31 [scrapy.downloadermiddlewares.retry] DEBUG: Gave up retrying <GET http://www.google.com/> (failed 3 times): Connection was refused by other
side: 10061: 由於目標計算機積極拒絕,無法連接。.
2018-10-11 16:21:31 [scrapy.crawler] INFO: Received SIGINT, shutting down gracefully. Send again to force
2018-10-11 16:21:31 [scrapy.core.engine] INFO: Closing spider (shutdown)
2018-10-11 16:21:32 [scrapy.downloadermiddlewares.retry] DEBUG: Gave up retrying <GET http://www.google.com/> (failed 3 times): Connection was refused by other
side: 10061: 由於目標計算機積極拒絕,無法連接。.
2018-10-11 16:21:32 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/exception_count': 24,
 'downloader/exception_type_count/twisted.internet.error.ConnectionRefusedError': 24,
 'downloader/request_bytes': 5112,
 'downloader/request_count': 24,
 'downloader/request_method_count/GET': 24,
 'finish_reason': 'shutdown',
 'finish_time': datetime.datetime(2018, 10, 11, 8, 21, 32, 343622),
 'log_count/DEBUG': 25,
 'log_count/INFO': 8,
 'log_count/WARNING': 1,
 'retry/count': 2,
 'retry/max_reached': 22,
 'retry/reason_count/twisted.internet.error.ConnectionRefusedError': 2,
 'scheduler/dequeued': 24,
 'scheduler/dequeued/memory': 24,
 'scheduler/enqueued': 25,
 'scheduler/enqueued/memory': 25,
 'start_time': datetime.datetime(2018, 10, 11, 8, 21, 8, 197241)}
2018-10-11 16:21:32 [scrapy.core.engine] INFO: Spider closed (shutdown)

To keep it from repeating these failures over and over, disable the built-in retry middleware by adding the following to settings.py:

DOWNLOADER_MIDDLEWARES = {
   'httpbintest.middlewares.ProxyMiddleware': 543,
   'scrapy.downloadermiddlewares.retry.RetryMiddleware':None,
}

Run the crawl again and you will see that the repetition is gone:

2018-10-11 16:40:39 [scrapy.utils.log] INFO: Scrapy 1.5.1 started (bot: httpbintest)
2018-10-11 16:40:39 [scrapy.utils.log] INFO: Versions: lxml 4.1.0.0, libxml2 2.9.4, cssselect 1.0.3, parsel 1.5.0, w3lib 1.19.0, Twisted 18.7.0, Python 3.6.3 |A
naconda custom (64-bit)| (default, Oct 15 2017, 03:27:45) [MSC v.1900 64 bit (AMD64)], pyOpenSSL 17.2.0 (OpenSSL 1.0.2p  14 Aug 2018), cryptography 2.0.3, Platf
orm Windows-7-6.1.7601-SP1
2018-10-11 16:40:39 [scrapy.crawler] INFO: Overridden settings: {'BOT_NAME': 'httpbintest', 'NEWSPIDER_MODULE': 'httpbintest.spiders', 'SPIDER_MODULES': ['httpb
intest.spiders']}
2018-10-11 16:40:39 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.logstats.LogStats']
2018-10-11 16:40:39 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'httpbintest.middlewares.ProxyMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2018-10-11 16:40:39 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2018-10-11 16:40:39 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2018-10-11 16:40:39 [scrapy.core.engine] INFO: Spider opened
2018-10-11 16:40:39 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-10-11 16:40:39 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2018-10-11 16:40:39 [py.warnings] WARNING: G:\Anaconda3-5.0.1\install\lib\site-packages\scrapy\spiders\__init__.py:76: UserWarning: Spider.make_requests_from_ur
l method is deprecated; it won't be called in future Scrapy releases. Please override Spider.start_requests method instead (see httpbintest.spiders.google.Googl
eSpider).
  cls.__module__, cls.__name__

2018-10-11 16:40:40 [scrapy.dupefilters] DEBUG: Filtered duplicate request: <GET http://www.google.com/> - no more duplicates will be shown (see DUPEFILTER_DEBU
G to show all duplicates)
2018-10-11 16:40:40 [scrapy.core.engine] INFO: Closing spider (finished)
2018-10-11 16:40:40 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/exception_count': 1,
 'downloader/exception_type_count/twisted.internet.error.ConnectionRefusedError': 1,
 'downloader/request_bytes': 213,
 'downloader/request_count': 1,
 'downloader/request_method_count/GET': 1,
 'dupefilter/filtered': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2018, 10, 11, 8, 40, 40, 872314),
 'log_count/DEBUG': 2,
 'log_count/INFO': 7,
 'log_count/WARNING': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2018, 10, 11, 8, 40, 39, 842255)}
2018-10-11 16:40:40 [scrapy.core.engine] INFO: Spider closed (finished)

Next we modify process_exception. The first request to Google cannot get through, so the failure is caught by process_exception; adding self.logger.debug('Get Exception') there shows that an error occurred.

import logging


class ProxyMiddleware(object):  # proxy middleware

    logger = logging.getLogger(__name__)  # convenient handle for Scrapy log output

    def process_exception(self, request, exception, spider):
        self.logger.debug('Get Exception')
        return request

Run it again and you will see:

2018-10-11 16:45:36 [scrapy.utils.log] INFO: Scrapy 1.5.1 started (bot: httpbintest)
2018-10-11 16:45:36 [scrapy.utils.log] INFO: Versions: lxml 4.1.0.0, libxml2 2.9.4, cssselect 1.0.3, parsel 1.5.0, w3lib 1.19.0, Twisted 18.7.0, Python 3.6.3 |A
naconda custom (64-bit)| (default, Oct 15 2017, 03:27:45) [MSC v.1900 64 bit (AMD64)], pyOpenSSL 17.2.0 (OpenSSL 1.0.2p  14 Aug 2018), cryptography 2.0.3, Platf
orm Windows-7-6.1.7601-SP1
2018-10-11 16:45:36 [scrapy.crawler] INFO: Overridden settings: {'BOT_NAME': 'httpbintest', 'NEWSPIDER_MODULE': 'httpbintest.spiders', 'SPIDER_MODULES': ['httpb
intest.spiders']}
2018-10-11 16:45:36 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.logstats.LogStats']
2018-10-11 16:45:36 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'httpbintest.middlewares.ProxyMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2018-10-11 16:45:36 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2018-10-11 16:45:36 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2018-10-11 16:45:36 [scrapy.core.engine] INFO: Spider opened
2018-10-11 16:45:36 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-10-11 16:45:36 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2018-10-11 16:45:36 [py.warnings] WARNING: G:\Anaconda3-5.0.1\install\lib\site-packages\scrapy\spiders\__init__.py:76: UserWarning: Spider.make_requests_from_ur
l method is deprecated; it won't be called in future Scrapy releases. Please override Spider.start_requests method instead (see httpbintest.spiders.google.Googl
eSpider).
  cls.__module__, cls.__name__

2018-10-11 16:45:37 [httpbintest.middlewares] DEBUG: Get Exception
2018-10-11 16:45:37 [scrapy.dupefilters] DEBUG: Filtered duplicate request: <GET http://www.google.com/> - no more duplicates will be shown (see DUPEFILTER_DEBU
G to show all duplicates)
2018-10-11 16:45:37 [scrapy.core.engine] INFO: Closing spider (finished)
2018-10-11 16:45:37 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/exception_count': 1,
 'downloader/exception_type_count/twisted.internet.error.ConnectionRefusedError': 1,
 'downloader/request_bytes': 213,
 'downloader/request_count': 1,
 'downloader/request_method_count/GET': 1,
 'dupefilter/filtered': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2018, 10, 11, 8, 45, 37, 546283),
 'log_count/DEBUG': 3,
 'log_count/INFO': 7,
 'log_count/WARNING': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2018, 10, 11, 8, 45, 36, 500223)}
2018-10-11 16:45:37 [scrapy.core.engine] INFO: Spider closed (finished)

Look carefully at the log above: the 'Get Exception' line shows the middleware caught the failed request.

Now let's set a proxy inside process_exception in middlewares.py and see whether the request can then succeed:

import logging


class ProxyMiddleware(object):  # proxy middleware

    logger = logging.getLogger(__name__)  # convenient handle for Scrapy log output

    def process_exception(self, request, exception, spider):
        self.logger.debug('Get Exception')
        request.meta['proxy'] = 'http://211.101.136.86:49784'  # attach a proxy to the retried request
        return request

With this in place, return request puts the request back into the scheduling queue, so it is executed again (this time through the proxy); when it succeeds, the parse callback runs and prints the page source.

After setting this up, run the crawl again; the output is:

2018-10-11 17:06:05 [scrapy.utils.log] INFO: Scrapy 1.5.1 started (bot: httpbintest)
2018-10-11 17:06:05 [scrapy.utils.log] INFO: Versions: lxml 4.1.0.0, libxml2 2.9.4, cssselect 1.0.3, parsel 1.5.0, w3lib 1.19.0, Twisted 18.7.0, Python 3.6.3 |A
naconda custom (64-bit)| (default, Oct 15 2017, 03:27:45) [MSC v.1900 64 bit (AMD64)], pyOpenSSL 17.2.0 (OpenSSL 1.0.2p  14 Aug 2018), cryptography 2.0.3, Platf
orm Windows-7-6.1.7601-SP1
2018-10-11 17:06:05 [scrapy.crawler] INFO: Overridden settings: {'BOT_NAME': 'httpbintest', 'NEWSPIDER_MODULE': 'httpbintest.spiders', 'SPIDER_MODULES': ['httpb
intest.spiders']}
2018-10-11 17:06:06 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.logstats.LogStats']
2018-10-11 17:06:06 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'httpbintest.middlewares.ProxyMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2018-10-11 17:06:06 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2018-10-11 17:06:06 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2018-10-11 17:06:06 [scrapy.core.engine] INFO: Spider opened
2018-10-11 17:06:06 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-10-11 17:06:06 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2018-10-11 17:06:06 [py.warnings] WARNING: G:\Anaconda3-5.0.1\install\lib\site-packages\scrapy\spiders\__init__.py:76: UserWarning: Spider.make_requests_from_ur
l method is deprecated; it won't be called in future Scrapy releases. Please override Spider.start_requests method instead (see httpbintest.spiders.google.Googl
eSpider).
  cls.__module__, cls.__name__

2018-10-11 17:06:07 [httpbintest.middlewares] DEBUG: Get Exception
2018-10-11 17:06:17 [httpbintest.middlewares] DEBUG: Get Exception
2018-10-11 17:06:21 [scrapy.crawler] INFO: Received SIGINT, shutting down gracefully. Send again to force
2018-10-11 17:06:21 [scrapy.core.engine] INFO: Closing spider (shutdown)
2018-10-11 17:06:27 [httpbintest.middlewares] DEBUG: Get Exception
2018-10-11 17:06:27 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/exception_count': 3,
 'downloader/exception_type_count/twisted.internet.error.ConnectionRefusedError': 1,
 'downloader/exception_type_count/twisted.internet.error.TimeoutError': 2,
 'downloader/request_bytes': 639,
 'downloader/request_count': 3,
 'downloader/request_method_count/GET': 3,
 'finish_reason': 'shutdown',
 'finish_time': datetime.datetime(2018, 10, 11, 9, 6, 27, 317766),
 'log_count/DEBUG': 4,
 'log_count/INFO': 8,
 'log_count/WARNING': 1,
 'scheduler/dequeued': 3,
 'scheduler/dequeued/memory': 3,
 'scheduler/enqueued': 4,
 'scheduler/enqueued/memory': 4,
 'start_time': datetime.datetime(2018, 10, 11, 9, 6, 6, 268562)}
2018-10-11 17:06:27 [scrapy.core.engine] INFO: Spider closed (shutdown)

Let's modify the code one more time:

# google.py
import scrapy


class GoogleSpider(scrapy.Spider):
    name = 'google'
    allowed_domains = ['www.google.com']
    start_urls = ['http://www.google.com/']

    def make_requests_from_url(self, url):
        self.logger.debug('Try First Time')
        # set the download timeout; dont_filter=True keeps the re-scheduled request
        # from being dropped by the duplicate filter
        return scrapy.Request(url=url, meta={'download_timeout': 10}, callback=self.parse, dont_filter=True)

    def parse(self, response):
        print(response.text)


# middlewares.py
import logging


class ProxyMiddleware(object):  # proxy middleware

    logger = logging.getLogger(__name__)  # convenient handle for Scrapy log output

    def process_exception(self, request, exception, spider):
        self.logger.debug('Get Exception')
        self.logger.debug('Try Second Time')
        request.meta['proxy'] = 'http://60.208.32.201:80'  # switch the proxy for the retried request
        return request

The output is:

 

OK, bye!

 

