I. Scrapy pagination handling
1. Pagination
As in the previous post, we got the Scrapy framework working, but it could only crawl a single page unless every URL to crawl was added to start_urls by hand, which is tedious.
This section shows how to handle pagination by manually issuing requests for the remaining pages.
Spider file:
# -*- coding: utf-8 -*-
import scrapy
from qiubaiPage.items import QiubaiproItem


class QiubaiSpider(scrapy.Spider):
    name = 'qiubai'
    # allowed_domains = ['www.xxx.com']
    start_urls = ['https://www.qiushibaike.com/text/']
    url = 'https://www.qiushibaike.com/text/page/%d/'
    page_num = 1

    # 2. Pipeline-based persistent storage (the storage logic itself must live in the pipeline file)
    def parse(self, response):
        div_list = response.xpath('//div[@id="content-left"]/div')
        for div in div_list:
            try:
                author = div.xpath('./div[1]/a[2]/h2/text()')[0].extract()
            except Exception as e:
                print(e)
                continue
            content = div.xpath('./a[1]/div/span//text()').extract()
            content = ''.join(content)
            # Instantiate an item object (the container)
            item = QiubaiproItem()
            item['author'] = author
            item['content'] = content
            # Hand the item to the pipeline for persistent storage
            yield item
        if self.page_num < 10:  # condition for issuing the next request
            self.page_num += 1
            url = self.url % self.page_num
            # Manually issue the request; parse is called again to handle the response
            yield scrapy.Request(url=url, callback=self.parse)
items.py
import scrapy


class QiubaiproItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    author = scrapy.Field()
    content = scrapy.Field()
pipelines.py
class QiubaipagePipeline(object):
    f = None

    # Runs exactly once when the spider is opened; overrides the parent method.
    # A good place to open files or database connections. Don't forget the spider parameter.
    def open_spider(self, spider):
        self.f = open('./qiushibaike.txt', 'w', encoding='utf-8')

    # Process (persist) each item
    def process_item(self, item, spider):
        self.f.write(item['author'] + ':' + item['content'] + '\n')
        return item

    # Runs exactly once when the spider is closed; overrides the parent method.
    # A good place to close files or database connections. Don't forget the spider parameter.
    def close_spider(self, spider):
        self.f.close()
Note: for pipeline-based storage, remember to uncomment the ITEM_PIPELINES setting in settings.py.
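A minimal sketch of what that uncommented block looks like, assuming the project module is named qiubaiPage (the pipeline class name matches the one above; 300 is just the generated default priority):

# settings.py
ITEM_PIPELINES = {
    'qiubaiPage.pipelines.QiubaipagePipeline': 300,
}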
2. POST requests
- Question: in the code so far we never manually sent requests for the start URLs stored in start_urls, yet those URLs were clearly requested. How does that happen?
- Answer: the spider class inherits the start_requests(self) method from the Spider parent class, and it is this method that issues the requests for the URLs in start_urls:
def start_requests(self):
    for u in self.start_urls:
        yield scrapy.Request(url=u, callback=self.parse)
[Note] The default implementation sends GET requests for the start URLs. To send POST requests instead, override this method in the subclass:
def start_requests(self):
    # URL to POST to
    post_url = 'http://fanyi.baidu.com/sug'
    # POST request parameters
    formdata = {
        'kw': 'wolf',
    }
    # Send the POST request
    yield scrapy.FormRequest(url=post_url, formdata=formdata, callback=self.parse)
3. Cookie handling
Cookies need no real handling at all: just go to settings.py and enable the cookie-related configuration.
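The switch in question is COOKIES_ENABLED in settings.py; it defaults to True. A minimal sketch:

# settings.py
# True (the default): Scrapy's cookie middleware stores and resends cookies automatically
# False: cookies are neither tracked nor sent
COOKIES_ENABLED = True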
4. Request interception: using a proxy pool in downloader middleware
I. Downloader middleware: a component layer that sits between the Scrapy engine and the downloader.
- Role:
(1) While the engine passes a request to the downloader, the downloader middleware can process the request in various ways, e.g. set the request's User-Agent or a proxy.
(2) While the downloader passes the Response back to the engine, the downloader middleware can process the response, e.g. gzip decompression.
We mainly use downloader middleware to process requests, typically setting a random User-Agent and a random proxy, in order to defeat the target site's anti-crawling measures.
II. UA pool: User-Agent pool
- Purpose: disguise the requests of a Scrapy project as coming from as many different browser identities as possible.
- Workflow (a minimal sketch follows this list):
1. Intercept requests in the downloader middleware
2. Replace the UA in the intercepted request's headers with a forged one
3. Enable the downloader middleware in the settings file
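A minimal sketch of such a UA middleware. The class name and the contents of user_agent_list are illustrative assumptions; any list of real browser UA strings will do:

import random

# Illustrative UA pool (assumption: swap in whatever UA strings you want to rotate)
user_agent_list = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.0 Safari/605.1.15',
    'Mozilla/5.0 (X11; Linux x86_64; rv:88.0) Gecko/20100101 Firefox/88.0',
]


class RandomUserAgentMiddleware(object):
    # Intercept every outgoing request and overwrite its User-Agent header
    def process_request(self, request, spider):
        request.headers['User-Agent'] = random.choice(user_agent_list)
        return None

Like any downloader middleware, this only takes effect once it is registered under DOWNLOADER_MIDDLEWARES in settings.py; a combined registration sketch appears after the middleware code below.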
How request interception helps here: first understand the overall Scrapy flow. Between the engine and the downloader sits the downloader middleware, which can intercept every request object and every response object, including failed requests and failed responses.
That is exactly the opening a proxy pool needs: intercept the request object, swap in a different IP address, and send the request back out to the network (see the sketch after the middleware code below).
One more thing to note: remember to enable the middleware-related configuration in settings.py.
middlewares.py
from scrapy import signals


# Downloader middleware
class QiubaipageDownloaderMiddleware(object):
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the downloader middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def spider_opened(self, spider):
        # Hook connected above; the generated template logs the spider name here
        pass

    # Intercept requests
    def process_request(self, request, spider):
        request.meta['proxy'] = 'https://60.251.156.116:8080'
        print('this is process_request!!!')
        # Called for each request that goes through the downloader
        # middleware.
        # Must either:
        # - return None: continue processing this request
        # - or return a Response object
        # - or return a Request object
        # - or raise IgnoreRequest: process_exception() methods of
        #   installed downloader middleware will be called
        return None

    # Intercept responses
    def process_response(self, request, response, spider):
        # Called with the response returned from the downloader.
        # Must either:
        # - return a Response object
        # - return a Request object
        # - or raise IgnoreRequest
        return response

    # Intercept exceptions
    def process_exception(self, request, exception, spider):
        # Called when a download handler or a process_request()
        # (from other downloader middleware) raises an exception.
        # Must either:
        # - return None: continue processing this exception
        # - return a Response object: stops process_exception() chain
        # - return a Request object: stops process_exception() chain
        # Multiple proxies can be wrapped in a list and one drawn at random
        # per request, which turns this into a proxy pool
        request.meta['proxy'] = 'https://60.251.156.116:8080'
        print('this is process_exception!!!')
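As a sketch of the "proxy pool" idea mentioned in the comment above, the proxy can be drawn at random from a list instead of being hard-coded. The class name and the addresses in PROXY_POOL are placeholders (assumptions):

import random

# Placeholder proxy addresses (assumption): replace with proxies that actually work
PROXY_POOL = [
    'https://60.251.156.116:8080',
    'https://61.160.233.123:8080',
]


class RandomProxyDownloaderMiddleware(object):
    # Pick a different proxy for every intercepted request
    def process_request(self, request, spider):
        request.meta['proxy'] = random.choice(PROXY_POOL)
        return None

    # On a failed download, rotate to another proxy and re-schedule the request
    def process_exception(self, request, exception, spider):
        request.meta['proxy'] = random.choice(PROXY_POOL)
        return request

Whichever middleware classes you use, they only run once registered in settings.py, for example (the module path assumes the project is named qiubaiPage; the numbers are priorities):

DOWNLOADER_MIDDLEWARES = {
    'qiubaiPage.middlewares.QiubaipageDownloaderMiddleware': 543,
    'qiubaiPage.middlewares.RandomUserAgentMiddleware': 544,
    'qiubaiPage.middlewares.RandomProxyDownloaderMiddleware': 545,
}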
5. Request parameter passing: crawling data across linked pages
In some cases the data we want is not all on one page. For example, on a movie site the title and rating sit on the first-level page, while the remaining details live on a second-level detail page.
This is where request parameter passing (meta) comes in.
Spider file:
# -*- coding: utf-8 -*-
import scrapy
from bossPro.items import BossproItem


class BossSpider(scrapy.Spider):
    name = 'boss'
    # allowed_domains = ['www.xxx.com']
    start_urls = [
        'https://www.zhipin.com/job_detail/?query=python%E7%88%AC%E8%99%AB&scity=101280600&industry=&position=']

    def parse(self, response):
        li_list = response.xpath('//div[@class="job-list"]/ul/li')
        for li in li_list:
            job_title = li.xpath('.//div[@class="job-title"]/text()').extract_first()
            company = li.xpath('.//div[@class="company-text"]/h3/a/text()').extract_first()
            # URL of the second-level (detail) page
            detail_url = 'https://www.zhipin.com' + li.xpath('.//div[@class="info-primary"]/h3/a/@href').extract_first()
            # Instantiate an item object
            item = BossproItem()
            item["job_title"] = job_title
            item['company'] = company
            # Pass the item along to the next callback via meta (request parameter passing)
            yield scrapy.Request(url=detail_url, callback=self.detail_parse, meta={'item': item})

    # Parse the second-level page.
    # The item must be passed in from the first-level parse so the remaining data can be added to it.
    def detail_parse(self, response):
        item = response.meta["item"]
        job_detail = response.xpath('//*[@id="main"]/div[3]/div/div[2]/div[2]/div[1]/div//text()').extract()
        job_detail = ''.join(job_detail)
        item['job_detail'] = job_detail
        # Remember to yield the item, otherwise the pipeline gets nothing
        yield item
settings.py
The robots.txt setting, the item pipeline, and the User-Agent all need to be configured here; a sketch follows below.
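A minimal sketch of those three settings, assuming the project module is named bossPro and its pipeline class is BossproPipeline (the pipeline class is not shown in this post, so its name here is an assumption; the UA string is just an example):

# settings.py
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0 Safari/537.36'

# Ignore robots.txt so the spider is not blocked by it
ROBOTSTXT_OBEY = False

# Enable the pipeline so yielded items get persisted
ITEM_PIPELINES = {
    'bossPro.pipelines.BossproPipeline': 300,
}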
items.py
import scrapy


class BossproItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    job_title = scrapy.Field()
    company = scrapy.Field()
    job_detail = scrapy.Field()