Scrapy Depth and Priority

This article is a repost. 2019-10-26 16:29. Category: web crawlers

I. Depth

Configuration file settings.py (a value of 0 means no depth limit):

DEPTH_LIMIT = 5

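II. Priority

Configuration file (settings.py):

DEPTH_PRIORITY = 1

When DEPTH_PRIORITY is positive, a request's priority drops as its depth increases, so deeper requests are scheduled later.

In the source code, the priority is adjusted like this: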

request.priority -= depth * self.prio
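
To make the formula concrete, here is a minimal sketch that applies it once to a request, assuming the request starts at priority 0. The helper name priority_at_depth and its parameters are hypothetical and not part of Scrapy.

```python
def priority_at_depth(depth, depth_priority, base_priority=0):
    """Apply `request.priority -= depth * self.prio` once to a base priority."""
    priority = base_priority
    priority -= depth * depth_priority
    return priority


# With DEPTH_PRIORITY = 1, the priority falls by one per level of depth.
for depth in range(1, 4):
    print(depth, priority_at_depth(depth, depth_priority=1))

# With a negative DEPTH_PRIORITY, the priority rises with depth instead.
for depth in range(1, 4):
    print(depth, priority_at_depth(depth, depth_priority=-1))
```

Because Scrapy schedules requests with a higher priority value first, a positive DEPTH_PRIORITY favours shallow requests and a negative one favours deep requests.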

III. Source Code Analysis

1. Depth

import scrapy


class QuoteSpider(scrapy.Spider):
    name = 'quote'
    allowed_domains = ['quotes.toscrape.com']
    start_urls = ['http://quotes.toscrape.com/']

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        # response.meta is a shortcut for response.request.meta.
        # The 'depth' key is set by scrapy.spidermiddlewares.depth.DepthMiddleware.
        print(response.request.url, response.meta.get('depth'))
        next_url = response.xpath('//div[@class="col-md-8"]/nav/ul[@class="pager"]/li[@class="next"]/a/@href').extract_first()
        # The last page has no "next" link, so stop there.
        if next_url:
            # Build an absolute URL from the relative href.
            _next = response.urljoin(next_url)
            # callback is the method that handles the next response.
            yield scrapy.Request(url=_next, callback=self.parse)

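Background: a Request object yielded by the spider passes through the spider middlewares before it reaches the scheduler.

The yielded Request does not set meta. The constructor's meta argument defaults to None, and the request.meta property then returns an empty dict, so there is no 'depth' key at first.

In parse, response.request is the Request that produced the response, so response.request.meta is that same dict and initially has no 'depth'.

In scrapy.http.Response, response.meta is equivalent to response.request.meta, so response.meta has no 'depth' at first either.

In DepthMiddleware, if 'depth' is not in response.meta, then response.meta['depth'] = 0.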
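
The chain above can be modelled without Scrapy. The sketch below uses hypothetical stand-in classes (FakeRequest, FakeResponse) and a hypothetical helper (ensure_depth); it imitates the behaviour described here and is not Scrapy's actual code.

```python
class FakeRequest:
    """Stand-in for a request whose meta dict is created on first access."""

    def __init__(self, url, meta=None):
        self.url = url
        self._meta = dict(meta) if meta else None

    @property
    def meta(self):
        if self._meta is None:
            self._meta = {}
        return self._meta


class FakeResponse:
    """Stand-in for a response that exposes its request's meta."""

    def __init__(self, request):
        self.request = request

    @property
    def meta(self):
        return self.request.meta


def ensure_depth(response):
    """Base case: a response with no recorded depth is treated as depth 0."""
    if 'depth' not in response.meta:
        response.meta['depth'] = 0
    return response.meta['depth']


response = FakeResponse(FakeRequest('http://quotes.toscrape.com/'))
print(response.meta.get('depth'))              # depth before the base case runs
print(ensure_depth(response))                  # depth after the base case runs
print(response.meta is response.request.meta)  # both names refer to one dict
```

Since both attributes point at the same dict, a depth written through response.meta is also visible through response.request.meta.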

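# result is an iterable of what the spider yielded (requests and items); each element goes through _filter
# _filter returns True: the element is passed on (requests go to the scheduler)
# _filter returns False: the request is dropped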
return (r for r in result or () if _filter(r))

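When a request exceeds the maximum depth, _filter returns False and the request is dropped.

# The maximum depth comes from DEPTH_LIMIT in the settings file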
maxdepth = settings.getint('DEPTH_LIMIT')
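
Putting the filter and the depth limit together, here is a simplified model of the check. It assumes plain dicts in place of Request objects, and the function name filter_by_depth and the example URL are illustrative only; the real middleware also updates priorities and crawl statistics.

```python
def filter_by_depth(parent_depth, requests, maxdepth):
    """Yield only the requests whose depth stays within maxdepth.

    Each request is a plain dict standing in for a Request object.
    A maxdepth of 0 means no limit.
    """
    def _filter(request):
        depth = parent_depth + 1
        request['depth'] = depth
        if maxdepth and depth > maxdepth:
            return False  # too deep: the request is dropped
        return True       # within the limit: the request is passed on

    return (r for r in requests or () if _filter(r))


children = [{'url': 'http://quotes.toscrape.com/page/6/'}]
print(list(filter_by_depth(4, children, maxdepth=5)))  # child depth 5, limit 5
print(list(filter_by_depth(5, children, maxdepth=5)))  # child depth 6, limit 5
print(list(filter_by_depth(5, children, maxdepth=0)))  # no limit
```

The comparison is strict, so with DEPTH_LIMIT = 5 a request at depth 5 is still kept and only depth 6 and beyond is discarded.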

2. Priority

To be continued...
