Scrapy: not filtering duplicate URLs

This article is a repost. 2020-05-25 09:54

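Today I was crawling a North Korean website, http://www.rodong.rep.kp/cn/index.php?strPageID=SF01_01_02&iMenuID=2, and found that it redirects several times before coming back to the original URL. If Scrapy filters duplicate URLs, the final request to that URL is dropped as already seen, so the page cannot be crawled.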
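To make the failure concrete, here is a toy model of a seen-set duplicate filter. It is not Scrapy's actual implementation; the function `crawl`, the `example.test` URLs, and the assumption that each redirect fires only once (for instance after a cookie has been set) are all hypothetical and only serve to illustrate the loop described above.

```python
# Toy model of a seen-set duplicate filter; NOT Scrapy's real code.


def crawl(start_url, redirects, dont_filter=False):
    """Follow a redirect chain and return (final_url, dropped).

    `redirects` maps a URL to the URL it redirects to. A URL that is
    absent from the mapping is assumed to serve the page itself.
    """
    seen = set()
    hops = dict(redirects)  # copy so the caller's mapping is not mutated
    url = start_url
    while True:
        if url in seen and not dont_filter:
            # Filtered as a duplicate: the page is never fetched.
            return url, True
        seen.add(url)
        if url not in hops:
            # No redirect left for this URL: the page is served.
            return url, False
        # Assumption: each redirect fires only once.
        url = hops.pop(url)


# Hypothetical URLs: the list page bounces through a check page and back.
START = "http://example.test/list"
CHECK = "http://example.test/check"
REDIRECTS = {START: CHECK, CHECK: START}

filtered = crawl(START, REDIRECTS)
unfiltered = crawl(START, REDIRECTS, dont_filter=True)
print(filtered)    # ('http://example.test/list', True)
print(unfiltered)  # ('http://example.test/list', False)
```

With the filter active, the second visit to the start URL is discarded. With `dont_filter=True`, it goes through and the page is fetched.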

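After some searching, I found that Scrapy can be told to crawl the same URL again, and the setting is fairly simple: pass dont_filter=True when creating the Request.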

Reference:

https://blog.csdn.net/huyoo/article/details/75570668

The actual code is as follows:

from scrapy import Request


# Method of the spider class.
def parse(self, response):
    # Carry the incoming meta forward and tag it with site-specific fields.
    meta = response.meta
    meta["website"] = "http://www.rodong.rep.kp/ko/"
    meta['area'] = 'xj_rodong_rep_kp'

    start_url_list = [
        # "http://www.rodong.rep.kp/cn/index.php?strPageID=SF01_01_02&iMenuID=3",
        # "http://www.rodong.rep.kp/cn/index.php?strPageID=SF01_01_02&iMenuID=5",
        # "http://www.rodong.rep.kp/cn/index.php?strPageID=SF01_01_02&iMenuID=6",
        # "http://www.rodong.rep.kp/cn/index.php?strPageID=SF01_01_02&iMenuID=7",
        # "http://www.rodong.rep.kp/cn/index.php?strPageID=SF01_01_02&iMenuID=1&iSubMenuID=1",
        "http://www.rodong.rep.kp/cn/index.php?strPageID=SF01_01_02&iMenuID=2"
    ]
    for url in start_url_list:
        # dont_filter=True makes the scheduler skip the duplicate filter, so
        # the request is scheduled even if this URL has already been seen.
        yield Request(url, meta=meta, callback=self.parse_list, dont_filter=True)
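
If every request in a project needs this behaviour, Scrapy also has a project-wide switch. The fragment below is a sketch of a settings.py entry, not something from the original post. It swaps the default duplicate filter for the no-op base class.

```python
# settings.py (fragment)
# BaseDupeFilter never marks a request as seen, so nothing is filtered.
# Caution: with filtering off everywhere, a crawl can loop.
DUPEFILTER_CLASS = "scrapy.dupefilters.BaseDupeFilter"
```

Since this turns filtering off for the whole crawl, the per-request dont_filter=True is usually the narrower and safer choice for a single problematic site.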
