Scraping detail pages with a spider in the Scrapy framework


 

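This example crawls the 陽光熱線問政平台 (Sunshine Hotline political inquiry platform) site, wz.sun0769.com: the spider reads each listing page, then follows every entry to its detail page.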

# -*- coding: utf-8 -*-
import scrapy
from yanguang.items import YanguangItem


class SunSpider(scrapy.Spider):
    name = 'sun'
    allowed_domains = ['sun0769.com']
    start_urls = ['http://wz.sun0769.com/index.php/question/questionType?type=4']

    def parse(self, response):
        # Each row of the listing table is one post
        tr_list = response.xpath("//div[@class='greyframe']/table[2]/tr/td/table/tr")
        for tr in tr_list:
            item = YanguangItem()
            item["title"] = tr.xpath("./td[2]/a[@class='news14']/@title").extract_first()
            item["href"] = tr.xpath("./td[2]/a[@class='news14']/@href").extract_first()
            item["publish_date"] = tr.xpath("./td[last()]/text()").extract_first()

            # Request the detail page and carry the partially filled item along in meta
            yield scrapy.Request(
                item["href"],
                callback=self.parse_detail,
                meta={"item": item},
            )
        # Pagination: follow the ">" link if there is one
        next_url = response.xpath(".//a[text()='>']/@href").extract_first()
        if next_url is not None:
            yield scrapy.Request(
                next_url,
                callback=self.parse,  # pass the method itself; do not call it
            )

    def parse_detail(self, response):  # handle the detail page
        item = response.meta["item"]
        item["content"] = response.xpath("//div[@class='c1 text14_2']//text()").extract()
        item["content_img"] = response.xpath("//div[@class='c1 text14_2']//img/@src").extract()
        # Image paths are relative, so prefix the site root
        item["content_img"] = ["http://wz.sun0769.com" + i for i in item["content_img"]]
        yield item
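
The spider above builds image URLs by prefixing the site root to each `src` value, which assumes every `src` is a root-relative path. As a minimal sketch of a more defensive alternative, the standard library's `urllib.parse.urljoin` resolves relative paths and leaves absolute URLs alone. The helper name `absolutize` and the sample paths are made up for illustration and do not come from the real site.

```python
from urllib.parse import urljoin

BASE_URL = "http://wz.sun0769.com"  # site root used by the spider above


def absolutize(srcs, base=BASE_URL):
    """Resolve each src against base; already-absolute URLs pass through unchanged."""
    return [urljoin(base, src) for src in srcs]


# Hypothetical sample values, not taken from the real site.
samples = ["/upload/a.jpg", "http://wz.sun0769.com/upload/b.jpg"]
print(absolutize(samples))
```

Inside a Scrapy callback, `response.urljoin(src)` does the same job, resolving against the URL of the current response.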

The scraped data is then processed in pipelines.py:

import re


class YanguangPipeline(object):
    def process_item(self, item, spider):
        item["content"] = self.process_content(item["content"])
        print(item)
        return item

    def process_content(self, content):  # clean up the text fragments
        # Strip non-breaking spaces and all other whitespace from each fragment
        content = [re.sub(r"\xa0|\s", "", i) for i in content]
        # Drop fragments that are now empty strings
        content = [i for i in content if len(i) > 0]
        return content
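
To see what the cleaning step does without running a crawl, the same two list comprehensions can be tried on their own. This is a standalone sketch: the function name `clean_content` and the sample fragments are invented here, shaped like what an XPath `//text()` query tends to return.

```python
import re


def clean_content(fragments):
    """Strip whitespace (including non-breaking spaces) and drop empty fragments."""
    stripped = [re.sub(r"\xa0|\s", "", text) for text in fragments]
    return [text for text in stripped if text]


# Hypothetical fragments, not taken from the real site.
raw = ["\r\n\t", "\xa0\xa0Hello world", "  ", "second\nline "]
print(clean_content(raw))  # ['Helloworld', 'secondline']
```

Note that the pattern removes whitespace inside a fragment as well as around it. That is harmless for Chinese text, but it would glue English words together.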

Changing USER_AGENT in settings.py keeps the target server from recognizing at a glance that our requests come from a crawler.

The default USER_AGENT in settings.py is:

# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'tencent (+http://www.yourdomain.com)'

Change USER_AGENT in settings.py to a browser-style string:

# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36'
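
For the pipeline shown earlier to run at all, it also has to be enabled in settings.py, a step the text does not mention. A sketch of that setting, assuming the project package is named `yanguang` (inferred from the spider's `from yanguang.items import ...`) and that the pipeline lives in the default `pipelines.py` module:

```python
# settings.py (fragment); the module path is an assumption based on the spider's import
ITEM_PIPELINES = {
    "yanguang.pipelines.YanguangPipeline": 300,  # lower numbers run earlier (range 0-1000)
}
```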

 

