Scrapy spider middlewares: the Offsite and Referer middlewares


The environment is a Python 3.6 environment created with Anaconda.

On macOS:

source activate python36

mac@macdeMacBook-Pro:~$     source activate python36
(python36) mac@macdeMacBook-Pro:~$     cd /www
(python36) mac@macdeMacBook-Pro:/www$     scrapy startproject testMiddlewile
New Scrapy project 'testMiddlewile', using template directory '/Users/mac/anaconda3/envs/python36/lib/python3.6/site-packages/scrapy/templates/project', created in:
    /www/testMiddlewile

You can start your first spider with:
    cd testMiddlewile
    scrapy genspider example example.com
(python36) mac@macdeMacBook-Pro:/www$     cd testMiddlewile/
(python36) mac@macdeMacBook-Pro:/www/testMiddlewile$        scrapy genspider -t crawl yeves yeves.cn
Created spider 'yeves' using template 'crawl' in module:
  testMiddlewile.spiders.yeves
(python36) mac@macdeMacBook-Pro:/www/testMiddlewile$     

  

Start the spider:

scrapy crawl yeves

 

(python36) mac@macdeMacBook-Pro:/www/testMiddlewile$     scrapy crawl yeves
2019-11-10 09:10:27 [scrapy.utils.log] INFO: Scrapy 1.8.0 started (bot: testMiddlewile)
2019-11-10 09:10:27 [scrapy.utils.log] INFO: Versions: lxml 4.4.1.0, libxml2 2.9.9, cssselect 1.1.0, parsel 1.5.2, w3lib 1.21.0, Twisted 19.7.0, Python 3.6.9 |Anaconda, Inc.| (default, Jul 30 2019, 13:42:17) - [GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)], pyOpenSSL 19.0.0 (OpenSSL 1.1.1d  10 Sep 2019), cryptography 2.7, Platform Darwin-17.7.0-x86_64-i386-64bit
2019-11-10 09:10:27 [scrapy.crawler] INFO: Overridden settings: {'BOT_NAME': 'testMiddlewile', 'NEWSPIDER_MODULE': 'testMiddlewile.spiders', 'ROBOTSTXT_OBEY': True, 'SPIDER_MODULES': ['testMiddlewile.spiders']}
2019-11-10 09:10:27 [scrapy.extensions.telnet] INFO: Telnet Password: 29995a24067c48f8
2019-11-10 09:10:27 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.logstats.LogStats']
2019-11-10 09:10:27 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2019-11-10 09:10:27 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2019-11-10 09:10:27 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2019-11-10 09:10:27 [scrapy.core.engine] INFO: Spider opened
2019-11-10 09:10:27 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-11-10 09:10:27 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2019-11-10 09:10:27 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://www.yeves.cn/robots.txt> from <GET http://yeves.cn/robots.txt>
2019-11-10 09:10:30 [scrapy.core.engine] DEBUG: Crawled (404) <GET https://www.yeves.cn/robots.txt> (referer: None)
2019-11-10 09:10:30 [protego] DEBUG: Rule at line 9 without any user agent to enforce it on.
2019-11-10 09:10:30 [protego] DEBUG: Rule at line 10 without any user agent to enforce it on.
2019-11-10 09:10:30 [protego] DEBUG: Rule at line 14 without any user agent to enforce it on.
2019-11-10 09:10:30 [protego] DEBUG: Rule at line 15 without any user agent to enforce it on.
2019-11-10 09:10:30 [protego] DEBUG: Rule at line 21 without any user agent to enforce it on.
2019-11-10 09:10:30 [protego] DEBUG: Rule at line 22 without any user agent to enforce it on.
2019-11-10 09:10:30 [protego] DEBUG: Rule at line 27 without any user agent to enforce it on.
2019-11-10 09:10:30 [protego] DEBUG: Rule at line 28 without any user agent to enforce it on.
2019-11-10 09:10:30 [protego] DEBUG: Rule at line 29 without any user agent to enforce it on.
2019-11-10 09:10:30 [protego] DEBUG: Rule at line 30 without any user agent to enforce it on.
2019-11-10 09:10:30 [protego] DEBUG: Rule at line 31 without any user agent to enforce it on.
2019-11-10 09:10:30 [protego] DEBUG: Rule at line 32 without any user agent to enforce it on.
2019-11-10 09:10:30 [protego] DEBUG: Rule at line 36 without any user agent to enforce it on.
2019-11-10 09:10:30 [protego] DEBUG: Rule at line 37 without any user agent to enforce it on.
2019-11-10 09:10:30 [protego] DEBUG: Rule at line 39 without any user agent to enforce it on.
2019-11-10 09:10:30 [protego] DEBUG: Rule at line 41 without any user agent to enforce it on.
2019-11-10 09:10:30 [protego] DEBUG: Rule at line 42 without any user agent to enforce it on.
2019-11-10 09:10:30 [protego] DEBUG: Rule at line 43 without any user agent to enforce it on.
2019-11-10 09:10:30 [protego] DEBUG: Rule at line 47 without any user agent to enforce it on.
2019-11-10 09:10:30 [protego] DEBUG: Rule at line 48 without any user agent to enforce it on.
2019-11-10 09:10:30 [protego] DEBUG: Rule at line 49 without any user agent to enforce it on.
2019-11-10 09:10:30 [protego] DEBUG: Rule at line 53 without any user agent to enforce it on.
2019-11-10 09:10:30 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://www.yeves.cn/> from <GET http://yeves.cn/>
2019-11-10 09:10:30 [scrapy.core.engine] DEBUG: Crawled (404) <GET https://www.yeves.cn/robots.txt> (referer: None)
2019-11-10 09:10:30 [protego] DEBUG: Rule at l

  

From the log output above you can see that Scrapy enables five spider middlewares by default:

 

2019-11-10 09:10:27 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
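
These defaults can be disabled, or a custom middleware added, through the SPIDER_MIDDLEWARES setting in settings.py. A minimal sketch (the commented custom entry is only an illustration, not a file generated above):

# settings.py
SPIDER_MIDDLEWARES = {
    # None disables a built-in middleware
    'scrapy.spidermiddlewares.offsite.OffsiteMiddleware': None,
    # a custom middleware is enabled with an order number
    # (lower values run closer to the engine, higher closer to the spider)
    # 'testMiddlewile.middlewares.DemoSpiderMiddleware': 543,
}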

 

To read the source code in PyCharm, first add the imports:

from scrapy.spidermiddlewares.offsite import OffsiteMiddleware
from scrapy.spidermiddlewares.referer import RefererMiddleware
from scrapy.spidermiddlewares.httperror import HttpErrorMiddleware
from scrapy.spidermiddlewares.urllength import UrlLengthMiddleware
from scrapy.spidermiddlewares.depth import DepthMiddleware

  

The Offsite middleware

Jump into the OffsiteMiddleware source by holding Option and clicking:

"""
Offsite Spider Middleware

See documentation in docs/topics/spider-middleware.rst
"""
import re
import logging
import warnings

from scrapy import signals
from scrapy.http import Request
from scrapy.utils.httpobj import urlparse_cached

logger = logging.getLogger(__name__)


class OffsiteMiddleware(object):

    def __init__(self, stats):
        self.stats = stats

    @classmethod
    def from_crawler(cls, crawler):
        o = cls(crawler.stats)
        crawler.signals.connect(o.spider_opened, signal=signals.spider_opened)
        return o

    def process_spider_output(self, response, result, spider):
        for x in result:
            if isinstance(x, Request):
                if x.dont_filter or self.should_follow(x, spider):
                    yield x
                else:
                    domain = urlparse_cached(x).hostname
                    if domain and domain not in self.domains_seen:
                        self.domains_seen.add(domain)
                        logger.debug(
                            "Filtered offsite request to %(domain)r: %(request)s",
                            {'domain': domain, 'request': x}, extra={'spider': spider})
                        self.stats.inc_value('offsite/domains', spider=spider)
                    self.stats.inc_value('offsite/filtered', spider=spider)
            else:
                yield x

    def should_follow(self, request, spider):
        regex = self.host_regex
        # hostname can be None for wrong urls (like javascript links)
        host = urlparse_cached(request).hostname or ''
        return bool(regex.search(host))

    def get_host_regex(self, spider):
        """Override this method to implement a different offsite policy"""
        allowed_domains = getattr(spider, 'allowed_domains', None)
        if not allowed_domains:
            return re.compile('')  # allow all by default
        url_pattern = re.compile("^https?://.*$")
        for domain in allowed_domains:
            if url_pattern.match(domain):
                message = ("allowed_domains accepts only domains, not URLs. "
                           "Ignoring URL entry %s in allowed_domains." % domain)
                warnings.warn(message, URLWarning)
        domains = [re.escape(d) for d in allowed_domains if d is not None]
        regex = r'^(.*\.)?(%s)$' % '|'.join(domains)
        return re.compile(regex)

    def spider_opened(self, spider):
        self.host_regex = self.get_host_regex(spider)
        self.domains_seen = set()


class URLWarning(Warning):
    pass

 

__init__: initialises the instance (it just stores the stats collector).

from_crawler: called by Scrapy's middleware manager to build the middleware; it also connects spider_opened to the spider_opened signal.

process_spider_output: processes everything the spider yields and filters out offsite requests.

should_follow: decides whether a request's host matches the allowed domains and should be followed.

get_host_regex: builds the regular expression from the spider's allowed_domains.

spider_opened: runs when the spider_opened signal fires; it builds host_regex and initialises domains_seen.

 

Call flow: from_crawler → __init__ → spider_opened → get_host_regex

The Offsite middleware simply checks whether each outgoing request URL belongs to the allowed domains declared on the spider, so the crawl cannot wander off to other sites:

allowed_domains = ['yeves.cn']
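
To make the check concrete, here is a minimal standalone sketch (not part of the project above) that mirrors the get_host_regex / should_follow logic for this allowed_domains value:

import re
from urllib.parse import urlparse

# Mirror of get_host_regex(): match the allowed domain and any of its subdomains.
allowed_domains = ['yeves.cn']
host_regex = re.compile(r'^(.*\.)?(%s)$' % '|'.join(re.escape(d) for d in allowed_domains))

def should_follow(url):
    # Mirror of should_follow(): hostname can be missing for malformed URLs.
    host = urlparse(url).hostname or ''
    return bool(host_regex.search(host))

print(should_follow('https://www.yeves.cn/archives/1'))  # True  - subdomain of an allowed domain
print(should_follow('https://example.com/page'))          # False - would be filtered as offsite

Requests yielded with dont_filter=True skip this check entirely, as process_spider_output above shows.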

 

The Referer middleware exists mainly because some resources (for example images behind the hotlink protection configured in the Alibaba Cloud OSS console) can only be fetched if the request carries a Referer header showing where it came from.

It works by setting the URL of the previous response (the parent of the new request) as that request's Referer.

The source code:

class RefererMiddleware(object):

    def __init__(self, settings=None):
        self.default_policy = DefaultReferrerPolicy
        if settings is not None:
            self.default_policy = _load_policy_class(
                settings.get('REFERRER_POLICY'))

    @classmethod
    def from_crawler(cls, crawler):
        if not crawler.settings.getbool('REFERER_ENABLED'):
            raise NotConfigured
        mw = cls(crawler.settings)

        # Note: this hook is a bit of a hack to intercept redirections
        crawler.signals.connect(mw.request_scheduled, signal=signals.request_scheduled)

        return mw

    def policy(self, resp_or_url, request):
        """
        Determine Referrer-Policy to use from a parent Response (or URL),
        and a Request to be sent.

        - if a valid policy is set in Request meta, it is used.
        - if the policy is set in meta but is wrong (e.g. a typo error),
          the policy from settings is used
        - if the policy is not set in Request meta,
          but there is a Referrer-policy header in the parent response,
          it is used if valid
        - otherwise, the policy from settings is used.
        """
        policy_name = request.meta.get('referrer_policy')
        if policy_name is None:
            if isinstance(resp_or_url, Response):
                policy_header = resp_or_url.headers.get('Referrer-Policy')
                if policy_header is not None:
                    policy_name = to_native_str(policy_header.decode('latin1'))
        if policy_name is None:
            return self.default_policy()

        cls = _load_policy_class(policy_name, warning_only=True)
        return cls() if cls else self.default_policy()

    def process_spider_output(self, response, result, spider):
        def _set_referer(r):
            if isinstance(r, Request):
                referrer = self.policy(response, r).referrer(response.url, r.url)
                if referrer is not None:
                    r.headers.setdefault('Referer', referrer)
            return r
        return (_set_referer(r) for r in result or ())

    def request_scheduled(self, request, spider):
        # check redirected request to patch "Referer" header if necessary
        redirected_urls = request.meta.get('redirect_urls', [])
        if redirected_urls:
            request_referrer = request.headers.get('Referer')
            # we don't patch the referrer value if there is none
            if request_referrer is not None:
                # the request's referrer header value acts as a surrogate
                # for the parent response URL
                #
                # Note: if the 3xx response contained a Referrer-Policy header,
                #       the information is not available using this hook
                parent_url = safe_url_string(request_referrer)
                policy_referrer = self.policy(parent_url, request).referrer(
                    parent_url, request.url)
                if policy_referrer != request_referrer:
                    if policy_referrer is None:
                        request.headers.pop('Referer')
                    else:
                        request.headers['Referer'] = policy_referrer
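
To see the middleware in action, a minimal spider sketch follows (the spider name and the REFERRER_POLICY value are illustrative, not taken from the project above). Requests yielded from a callback pass through RefererMiddleware, which fills in the Referer header from the parent response's URL according to the configured policy:

# settings.py (optional) - override the default referrer policy
# REFERRER_POLICY = 'same-origin'

import scrapy

class RefererDemoSpider(scrapy.Spider):
    name = 'referer_demo'
    allowed_domains = ['yeves.cn']
    start_urls = ['https://www.yeves.cn/']

    def parse(self, response):
        # Requests yielded here go through RefererMiddleware's
        # process_spider_output, which sets Referer to response.url.
        for href in response.css('a::attr(href)').getall():
            yield response.follow(href, callback=self.parse_detail)

    def parse_detail(self, response):
        # The header was added by the middleware, not by the spider.
        self.logger.info('Referer: %s', response.request.headers.get('Referer'))

A per-request policy can also be set through request.meta['referrer_policy'], as the policy() method above shows.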

  

These are the hook methods a spider middleware can implement; the Offsite middleware only uses process_spider_output:

process_spider_input 3

process_spider_output 2

process_start_requests 1

process_spider_exception
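
For reference, here is a minimal skeleton of a custom spider middleware implementing these hooks (the class and module names are illustrative; it would be enabled through SPIDER_MIDDLEWARES in settings.py):

# middlewares.py (illustrative skeleton)
class DemoSpiderMiddleware:

    @classmethod
    def from_crawler(cls, crawler):
        return cls()

    def process_start_requests(self, start_requests, spider):
        # called once with the spider's initial requests
        for request in start_requests:
            yield request

    def process_spider_input(self, response, spider):
        # called for each response passed to the spider; return None to continue
        return None

    def process_spider_output(self, response, result, spider):
        # called with every request/item the spider yields for a response
        for request_or_item in result:
            yield request_or_item

    def process_spider_exception(self, response, exception, spider):
        # called when a spider callback or a later middleware raises an exception
        pass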

 

