warnings.warn("allowed_domains accepts only domains, not URLs. Ignoring URL entry %s in allowed_doma

Reposted article (see original) · 2018-11-24 16:13 · 1132 · Spider

When crawling data across multiple pages in a loop, Scrapy emits the following warning:

warnings.warn("allowed_domains accepts only domains, not URLs. Ignoring URL entry %s in allowed_doma

The code raised no error, but it only output the crawl results for the first level of pages. The second level was never crawled.

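Problem analysis
The log showed no error messages. The first level was crawled correctly, but the second-level pages were never requested, even though the code itself looked fine.
So what was wrong? Was the second-level code not running at all? After adding log statements, the rest of the log output appeared normally, but the code in question still did not run. Were the requests being filtered or intercepted before the code could be reached?
A code review showed that the problem was the allowed_domains setting: because it was set incorrectly, all the follow-up links were filtered out.
allowed_domains must be a list of domain names, not a list of URLs.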
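To make the filtering concrete, here is a minimal sketch of the kind of check involved. It is not Scrapy's own code, only a simplified stand-in that assumes two rules: entries that look like URLs are ignored with a warning, and a link is allowed when its host equals an allowed domain or is a subdomain of one. The function names `usable_domains` and `is_allowed` and the path `/some/page` are hypothetical, made up for illustration.

```python
import re
import warnings
from urllib.parse import urlparse

# Entries that look like URLs rather than bare domain names.
URL_PATTERN = re.compile(r"^https?://")


def usable_domains(allowed_domains):
    """Return only the bare-domain entries, warning about URL entries."""
    domains = []
    for entry in allowed_domains:
        if URL_PATTERN.match(entry):
            warnings.warn(f"ignoring URL entry {entry!r}; expected a bare domain")
            continue
        domains.append(entry.lower())
    return domains


def is_allowed(url, allowed_domains):
    """Simplified check: is the URL's host an allowed domain or a subdomain of one?"""
    if not allowed_domains:
        return True  # nothing configured, so nothing is filtered
    host = (urlparse(url).hostname or "").lower()
    return any(
        host == domain or host.endswith("." + domain)
        for domain in usable_domains(allowed_domains)
    )


link = "http://www.heao.gov.cn/some/page"  # hypothetical second-level link
print(is_allowed(link, ["http://www.heao.gov.cn/"]))  # False: the only entry is a URL and is ignored
print(is_allowed(link, ["heao.gov.cn"]))              # True: the host is a subdomain of heao.gov.cn
```

Under these assumptions, a list that contains only URL entries ends up with no usable domain at all, which matches the symptom described above: every follow-up link is dropped without an error.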

Solution
The previous allowed_domains value needs to be changed. It was:

allowed_domains = ['http://www.heao.gov.cn/']

Change it to:

allowed_domains = ['heao.gov.cn']

After re-running the spider, all levels were crawled correctly.
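If the start URL is already at hand, the domain can be derived from it rather than typed by hand. The helper below is a small sketch using only the standard library; `to_allowed_domain` and its `strip_www` flag are hypothetical names, and dropping the leading "www." is a choice made here to reproduce the value used in the fix above, not something Scrapy requires.

```python
from urllib.parse import urlparse


def to_allowed_domain(entry, strip_www=True):
    """Turn a URL or bare host into a value suitable for allowed_domains."""
    # urlparse only fills in the host when a "//" separator is present,
    # so prepend it for bare hosts such as "heao.gov.cn".
    parsed = urlparse(entry if "://" in entry else "//" + entry)
    host = (parsed.hostname or "").lower()
    if strip_www and host.startswith("www."):
        host = host[len("www."):]
    return host


allowed_domains = [to_allowed_domain("http://www.heao.gov.cn/")]
print(allowed_domains)  # ['heao.gov.cn']
```

Keeping the broader domain without "www." means links to other subdomains of the same site are also followed; pass `strip_www=False` to restrict the crawl to the exact host.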

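Summary
Scrapy is a complete crawling framework, and many of its settings and configuration options only become clear after working through different examples repeatedly. That practice is what makes it possible to locate problems like this one quickly.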
