While crawling a North Korean site today, http://www.rodong.rep.kp/cn/index.php?strPageID=SF01_01_02&iMenuID=2, I noticed that it redirects several times and eventually lands back on the original URL. Since Scrapy filters out duplicate URLs by default, the page could not be crawled.
After looking into it, I found that Scrapy can be told to fetch the same URL repeatedly, and the setting is quite simple: pass dont_filter=True when building the Request, which bypasses the scheduler's duplicate filter.
Reference:
https://blog.csdn.net/huyoo/article/details/75570668
The actual code is as follows:
from scrapy import Request

def parse(self, response):
    meta = response.meta
    meta["website"] = "http://www.rodong.rep.kp/ko/"
    meta['area'] = 'xj_rodong_rep_kp'
    start_url_list = [
        # "http://www.rodong.rep.kp/cn/index.php?strPageID=SF01_01_02&iMenuID=3",
        # "http://www.rodong.rep.kp/cn/index.php?strPageID=SF01_01_02&iMenuID=5",
        # "http://www.rodong.rep.kp/cn/index.php?strPageID=SF01_01_02&iMenuID=6",
        # "http://www.rodong.rep.kp/cn/index.php?strPageID=SF01_01_02&iMenuID=7",
        # "http://www.rodong.rep.kp/cn/index.php?strPageID=SF01_01_02&iMenuID=1&iSubMenuID=1",
        "http://www.rodong.rep.kp/cn/index.php?strPageID=SF01_01_02&iMenuID=2"
    ]
    for url in start_url_list:
        # dont_filter=True tells Scrapy's scheduler not to drop this
        # request even if the URL has already been seen.
        yield Request(url, meta=meta, callback=self.parse_list, dont_filter=True)
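For context, here is a minimal self-contained spider sketch built around the same idea. The spider name, allowed start URL, and the parse_list stub are assumptions for illustration; only the dont_filter=True mechanism comes from the snippet above. Note that a request created by Scrapy's redirect middleware keeps the original request's dont_filter value, which is why this flag also lets a redirect chain land back on the starting URL without being dropped.

import scrapy
from scrapy import Request

class RodongDemoSpider(scrapy.Spider):
    # Hypothetical scaffold; the names here are illustrative, not from the post.
    name = "rodong_demo"
    start_urls = [
        "http://www.rodong.rep.kp/cn/index.php?strPageID=SF01_01_02&iMenuID=2"
    ]

    def parse(self, response):
        # Re-request the same URL. Without dont_filter=True, the scheduler's
        # duplicate filter (RFPDupeFilter) would silently drop this request
        # because the fingerprint matches one that was already crawled.
        yield Request(response.url, callback=self.parse_list, dont_filter=True)

    def parse_list(self, response):
        # Placeholder: real list-page extraction logic would go here.
        self.logger.info("Fetched %s (status %s)", response.url, response.status)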