I. Scrapy pagination handling
1. Pagination
As in the previous post, we got the Scrapy framework working, but it could only crawl a single page unless every URL to crawl was added to start_urls by hand, which is tedious.
This section shows how to handle pagination by manually issuing requests for the remaining pages.
Spider file:
# -*- coding: utf-8 -*-
import scrapy
from qiubaiPage.items import QiubaiproItem


class QiubaiSpider(scrapy.Spider):
    name = 'qiubai'
    # allowed_domains = ['www.xxx.com']
    start_urls = ['https://www.qiushibaike.com/text/']
    url = 'https://www.qiushibaike.com/text/page/%d/'
    page_num = 1

    # 2. Pipeline-based persistent storage (the storage logic itself must live in the pipeline file)
    def parse(self, response):
        div_list = response.xpath('//div[@id="content-left"]/div')
        for div in div_list:
            try:
                author = div.xpath('./div[1]/a[2]/h2/text()')[0].extract()
            except Exception as e:
                print(e)
                continue
            content = div.xpath('./a[1]/div/span//text()').extract()
            content = ''.join(content)
            # Instantiate an item object (the container)
            item = QiubaiproItem()
            item['author'] = author
            item['content'] = content
            # Hand the item to the pipeline for persistent storage
            yield item
        if self.page_num < 10:  # condition for issuing the next request
            self.page_num += 1
            url = self.url % self.page_num
            # Manually issue the request; parse is called again to handle the response
            yield scrapy.Request(url=url, callback=self.parse)
items.py
import scrapy


class QiubaiproItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    author = scrapy.Field()
    content = scrapy.Field()
pipelines.py
class QiubaipagePipeline(object):
    f = None

    # Runs exactly once when the spider is opened; overrides the parent method.
    # A good place to open files or database connections. Don't forget the spider parameter.
    def open_spider(self, spider):
        self.f = open('./qiushibaike.txt', 'w', encoding='utf-8')

    # Process (persist) each item
    def process_item(self, item, spider):
        self.f.write(item['author'] + ':' + item['content'] + '\n')
        return item

    # Runs exactly once when the spider is closed; overrides the parent method.
    # A good place to close files or database connections. Don't forget the spider parameter.
    def close_spider(self, spider):
        self.f.close()
Note: for pipeline-based storage, remember to uncomment the ITEM_PIPELINES setting in settings.py.
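A minimal sketch of what that uncommented block looks like, assuming the project module is named qiubaiPage (the pipeline class name matches the one above; 300 is just the generated default priority):

# settings.py
ITEM_PIPELINES = {
    'qiubaiPage.pipelines.QiubaipagePipeline': 300,
}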
2. POST requests
- Question: in the code so far we never manually sent requests for the start URLs stored in start_urls, yet those URLs were clearly requested. How does that happen?
- Answer: the spider class inherits the start_requests(self) method from the Spider parent class, and it is this method that issues the requests for the URLs in start_urls:
def start_requests(self):
    for u in self.start_urls:
        yield scrapy.Request(url=u, callback=self.parse)
[Note] The default implementation sends GET requests for the start URLs. To send POST requests instead, override this method in the subclass:
def start_requests(self):
    # URL to POST to
    post_url = 'http://fanyi.baidu.com/sug'
    # POST request parameters
    formdata = {
        'kw': 'wolf',
    }
    # Send the POST request
    yield scrapy.FormRequest(url=post_url, formdata=formdata, callback=self.parse)
3. Cookie handling
Cookies need no real handling at all: just go to settings.py and enable the cookie-related configuration.
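The switch in question is COOKIES_ENABLED in settings.py; it defaults to True. A minimal sketch:

# settings.py
# True (the default): Scrapy's cookie middleware stores and resends cookies automatically
# False: cookies are neither tracked nor sent
COOKIES_ENABLED = True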
4. Request interception: using a proxy pool in downloader middleware
I. Downloader middleware: a component layer that sits between the Scrapy engine and the downloader.
- Role:
(1) While the engine passes a request to the downloader, the downloader middleware can process the request in various ways, e.g. set the request's User-Agent or a proxy.
(2) While the downloader passes the Response back to the engine, the downloader middleware can process the response, e.g. gzip decompression.
We mainly use downloader middleware to process requests, typically setting a random User-Agent and a random proxy, in order to defeat the target site's anti-crawling measures.
II. UA pool: User-Agent pool
- Purpose: disguise the requests of a Scrapy project as coming from as many different browser identities as possible.
- Workflow (a minimal sketch follows this list):
1. Intercept requests in the downloader middleware
2. Replace the UA in the intercepted request's headers with a forged one
3. Enable the downloader middleware in the settings file
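A minimal sketch of such a UA middleware. The class name and the contents of user_agent_list are illustrative assumptions; any list of real browser UA strings will do:

import random

# Illustrative UA pool (assumption: swap in whatever UA strings you want to rotate)
user_agent_list = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.0 Safari/605.1.15',
    'Mozilla/5.0 (X11; Linux x86_64; rv:88.0) Gecko/20100101 Firefox/88.0',
]


class RandomUserAgentMiddleware(object):
    # Intercept every outgoing request and overwrite its User-Agent header
    def process_request(self, request, spider):
        request.headers['User-Agent'] = random.choice(user_agent_list)
        return None

Like any downloader middleware, this only takes effect once it is registered under DOWNLOADER_MIDDLEWARES in settings.py; a combined registration sketch appears after the middleware code below.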
How request interception helps here: first understand the overall Scrapy flow. Between the engine and the downloader sits the downloader middleware, which can intercept every request object and every response object, including failed requests and failed responses.
That is exactly the opening a proxy pool needs: intercept the request object, swap in a different IP address, and send the request back out to the network (see the sketch after the middleware code below).
One more thing to note: remember to enable the middleware-related configuration in settings.py.
middlewares.py
from scrapy import signals


# Downloader middleware
class QiubaipageDownloaderMiddleware(object):
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the downloader middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def spider_opened(self, spider):
        # Hook connected above; the generated template logs the spider name here
        pass

    # Intercept requests
    def process_request(self, request, spider):
        request.meta['proxy'] = 'https://60.251.156.116:8080'
        print('this is process_request!!!')
        # Called for each request that goes through the downloader
        # middleware.
        # Must either:
        # - return None: continue processing this request
        # - or return a Response object
        # - or return a Request object
        # - or raise IgnoreRequest: process_exception() methods of
        #   installed downloader middleware will be called
        return None

    # Intercept responses
    def process_response(self, request, response, spider):
        # Called with the response returned from the downloader.
        # Must either:
        # - return a Response object
        # - return a Request object
        # - or raise IgnoreRequest
        return response

    # Intercept exceptions
    def process_exception(self, request, exception, spider):
        # Called when a download handler or a process_request()
        # (from other downloader middleware) raises an exception.
        # Must either:
        # - return None: continue processing this exception
        # - return a Response object: stops process_exception() chain
        # - return a Request object: stops process_exception() chain
        # Multiple proxies can be wrapped in a list and one drawn at random
        # per request, which turns this into a proxy pool
        request.meta['proxy'] = 'https://60.251.156.116:8080'
        print('this is process_exception!!!')
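As a sketch of the "proxy pool" idea mentioned in the comment above, the proxy can be drawn at random from a list instead of being hard-coded. The class name and the addresses in PROXY_POOL are placeholders (assumptions):

import random

# Placeholder proxy addresses (assumption): replace with proxies that actually work
PROXY_POOL = [
    'https://60.251.156.116:8080',
    'https://61.160.233.123:8080',
]


class RandomProxyDownloaderMiddleware(object):
    # Pick a different proxy for every intercepted request
    def process_request(self, request, spider):
        request.meta['proxy'] = random.choice(PROXY_POOL)
        return None

    # On a failed download, rotate to another proxy and re-schedule the request
    def process_exception(self, request, exception, spider):
        request.meta['proxy'] = random.choice(PROXY_POOL)
        return request

Whichever middleware classes you use, they only run once registered in settings.py, for example (the module path assumes the project is named qiubaiPage; the numbers are priorities):

DOWNLOADER_MIDDLEWARES = {
    'qiubaiPage.middlewares.QiubaipageDownloaderMiddleware': 543,
    'qiubaiPage.middlewares.RandomUserAgentMiddleware': 544,
    'qiubaiPage.middlewares.RandomProxyDownloaderMiddleware': 545,
}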
5. Request parameter passing: crawling data across linked pages
In some cases the data we want is not all on one page. For example, on a movie site the title and rating sit on the first-level page, while the remaining details live on a second-level detail page.
This is where request parameter passing (meta) comes in.
Spider file:
# -*- coding: utf-8 -*-
import scrapy
from bossPro.items import BossproItem


class BossSpider(scrapy.Spider):
    name = 'boss'
    # allowed_domains = ['www.xxx.com']
    start_urls = [
        'https://www.zhipin.com/job_detail/?query=python%E7%88%AC%E8%99%AB&scity=101280600&industry=&position=']

    def parse(self, response):
        li_list = response.xpath('//div[@class="job-list"]/ul/li')
        for li in li_list:
            job_title = li.xpath('.//div[@class="job-title"]/text()').extract_first()
            company = li.xpath('.//div[@class="company-text"]/h3/a/text()').extract_first()
            # URL of the second-level (detail) page
            detail_url = 'https://www.zhipin.com' + li.xpath('.//div[@class="info-primary"]/h3/a/@href').extract_first()
            # Instantiate an item object
            item = BossproItem()
            item["job_title"] = job_title
            item['company'] = company
            # Pass the item along to the next callback via meta (request parameter passing)
            yield scrapy.Request(url=detail_url, callback=self.detail_parse, meta={'item': item})

    # Parse the second-level page.
    # The item must be passed in from the first-level parse so the remaining data can be added to it.
    def detail_parse(self, response):
        item = response.meta["item"]
        job_detail = response.xpath('//*[@id="main"]/div[3]/div/div[2]/div[2]/div[1]/div//text()').extract()
        job_detail = ''.join(job_detail)
        item['job_detail'] = job_detail
        # Remember to yield the item, otherwise the pipeline gets nothing
        yield item
settings.py
The robots.txt setting, the item pipeline, and the User-Agent all need to be configured here; a sketch follows below.
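A minimal sketch of those three settings, assuming the project module is named bossPro and its pipeline class is BossproPipeline (the pipeline class is not shown in this post, so its name here is an assumption; the UA string is just an example):

# settings.py
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0 Safari/537.36'

# Ignore robots.txt so the spider is not blocked by it
ROBOTSTXT_OBEY = False

# Enable the pipeline so yielded items get persisted
ITEM_PIPELINES = {
    'bossPro.pipelines.BossproPipeline': 300,
}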
items.py
import scrapy


class BossproItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    job_title = scrapy.Field()
    company = scrapy.Field()
    job_detail = scrapy.Field()