Python: Using the Scrapy Module (Part 1)

Reposted article. 2018-06-27 15:21, Python

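Scrapy is an application framework written for crawling websites and extracting structured data. It can be used in a range of programs, such as data mining, information processing, and archiving historical data.

Scrapy uses the Twisted asynchronous networking library to handle network communication. Its overall architecture consists of the components described below.

Components: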

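Scrapy Engine:

The core of the framework. It controls the data flow between all the other components.

Scheduler:

Receives requests from the engine, pushes them onto a queue, and hands them back when the engine asks for the next one. In other words, every URL to be crawled goes into a priority queue, and the scheduler decides which URL is processed next. It also removes duplicate URLs automatically.

Note: when a project is created, the spider contains start_urls = ['http://dig.chouti.com/']. These are the initial URLs; after the project starts, the engine puts them into the scheduler and processing begins.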
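
The scheduler's two jobs, ordering and de-duplication, can be sketched with the standard library alone. This is a toy stand-in, not Scrapy's scheduler: ToyScheduler and its methods are hypothetical names, the second URL is made up, and the rule that a smaller number is more urgent is a convention of this sketch only.

```python
import heapq
import itertools


class ToyScheduler:
    """Toy stand-in for a crawl scheduler: priority queue plus duplicate filter."""

    def __init__(self):
        self._heap = []                   # entries are (priority, order, url)
        self._seen = set()                # URLs that were already enqueued
        self._order = itertools.count()   # tie-breaker that keeps insertion order

    def enqueue(self, url, priority=0):
        """Queue a URL unless it was seen before; return True if it was added."""
        if url in self._seen:
            return False
        self._seen.add(url)
        heapq.heappush(self._heap, (priority, next(self._order), url))
        return True

    def next_url(self):
        """Return the most urgent URL (smallest number first), or None when empty."""
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[2]


scheduler = ToyScheduler()
scheduler.enqueue("http://dig.chouti.com/")                  # plays the role of start_urls
scheduler.enqueue("http://dig.chouti.com/page/2", priority=1)  # hypothetical follow-up URL
scheduler.enqueue("http://dig.chouti.com/")                  # duplicate, not queued again
print(scheduler.next_url())
```

The duplicate is rejected at enqueue time, so the engine never sees the same URL twice however often a spider reports it.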

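Downloader:

Downloads page content and returns it to the spider (the downloader is built on Twisted).

Item:

Defines the storage template used to structure the scraped data in preparation for persistence. It is similar to models in Django: it declares the fields that the parsed data is stored in.

Item Pipeline:

Processes the items that the spider extracts from pages. Its main jobs are persisting items, validating them, and removing unwanted information. After a page has been parsed by the spider, its items are sent to the item pipeline and pass through several processing steps in a fixed order.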

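Downloader Middlewares:

A layer between the Scrapy engine and the downloader. It processes the requests and responses passed between the two.

Spider Middlewares:

A layer between the Scrapy engine and the spiders. It processes the responses going into a spider and the requests coming out of it.

Scheduler Middlewares:

A layer between the Scrapy engine and the scheduler. It processes the requests and responses sent from the engine to the scheduler.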

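Scrapy execution flow

1. The engine takes a URL from the scheduler for the next fetch.

2. The engine wraps the URL in a Request and passes it to the downloader.

3. The downloader fetches the resource and wraps it in a Response.

4. The spider parses the Response.

5. If the spider extracts an Item, the Item is sent through the pipeline for processing and persistence.

6. If the spider extracts a URL, the URL is put into the scheduler to wait for fetching.

Note: before step 1, the initial URLs are read from the spider's start_urls.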
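
The six steps can be traced in a few lines of plain Python. The sketch below is only an illustration of the loop, under the assumption of a tiny in-memory "website"; FAKE_SITE, download, parse and crawl are hypothetical names, and nothing here is Scrapy's actual engine code.

```python
from collections import deque

# A tiny in-memory "website": URL -> (items on the page, links on the page).
# All URLs and titles are made up for the illustration.
FAKE_SITE = {
    "/": (["news 1"], ["/page/2", "/page/3"]),
    "/page/2": (["news 2"], ["/page/3"]),
    "/page/3": (["news 3"], []),
}


def download(url):
    """Steps 2-3: turn a URL into a 'response' (here just a dictionary lookup)."""
    return FAKE_SITE[url]


def parse(response):
    """Step 4: the spider splits a response into items and follow-up URLs."""
    items, links = response
    return items, links


def crawl(start_urls):
    queue = deque(start_urls)   # the scheduler, seeded from start_urls
    seen = set(start_urls)      # duplicate filter
    stored = []                 # stands in for the item pipeline
    while queue:
        url = queue.popleft()              # step 1: take the next URL
        response = download(url)           # steps 2-3: request and response
        items, links = parse(response)     # step 4: the spider parses the response
        stored.extend(items)               # step 5: items go to the pipeline
        for link in links:                 # step 6: new URLs go back to the scheduler
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return stored


print(crawl(["/"]))   # ['news 1', 'news 2', 'news 3']
```

In Scrapy the same loop runs asynchronously, so several downloads are in flight at once instead of one after another.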

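Basic commands for creating a project

1. Create a project

scrapy startproject <project name>

scrapy startproject scrapyPro

2. Enter the project and create a spider

cd <project name>
scrapy genspider <spider name> <initial domain>    # the initial URL can be changed later

cd scrapyPro
scrapy genspider chouti chouti.com

3. List the spiders in the project

scrapy list

chouti

4. Run a spider

scrapy crawl <spider name>

scrapy crawl chouti --nolog    # --nolog suppresses log output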

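Project structure

project_name/
   scrapy.cfg              # main configuration file of the project
   project_name/
       __init__.py
       items.py            # storage templates for structuring data, similar to models in Django
       pipelines.py        # data processing, such as persistence
       settings.py         # settings: crawl depth, concurrency, and so on
       spiders/            # spider directory; any number of spiders can be created here
           __init__.py
           spider1.py
           spider2.py
           spider3.py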

Simple example (fetch news titles and URLs from Chouti and save them to a file for persistence)

Writing the chouti spider

import io
import sys

import scrapy
from scrapy.selector import Selector
from ..items import ChoutiItem

# On Windows the project is started from cmd, which defaults to GBK, while Python 3
# uses UTF-8; re-wrapping stdout keeps the encodings consistent and avoids garbled output.
sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='gb18030')


class ChoutiSpider(scrapy.Spider):
    name = 'chouti'
    allowed_domains = ['chouti.com']         # domains the spider may crawl
    start_urls = ['http://dig.chouti.com/']  # initial URL

    def parse(self, response):  # response.url/text/body/meta; meta holds the crawl depth
        # @class='...' matches the whole attribute value, so list every class, not just one
        item_objs = Selector(response=response).xpath("//div[@class='item']//a[@class='show-content color-chag']")
        for item in item_objs:
            title = item.xpath("text()").extract_first(default="").strip()
            href = item.xpath("@href").extract_first()
            # to persist the data, its fields must be declared on the Item
            item_obj = ChoutiItem(title=title, href=href)
            yield item_obj  # hand the item to the pipeline

Writing items.py: define the storage template first; only then can the data be persisted

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class ScrapyproItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    pass

class ChoutiItem(scrapy.Item):
    # define the fields for your item here like:
    title = scrapy.Field()
    href = scrapy.Field()
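
What "declaring fields" buys can be shown without Scrapy. The following is a toy stand-in written for this note, not the scrapy.Item implementation: ToyItem and ToyChoutiItem are hypothetical classes that, like an Item, behave as a dictionary and reject keys that were never declared.

```python
class ToyItem(dict):
    """Dictionary that only accepts the keys listed in `fields`."""

    fields = ()

    def __init__(self, **values):
        super().__init__()
        for key, value in values.items():
            self[key] = value          # goes through the check in __setitem__

    def __setitem__(self, key, value):
        if key not in self.fields:
            raise KeyError(f"{type(self).__name__} does not support field: {key}")
        super().__setitem__(key, value)


class ToyChoutiItem(ToyItem):
    fields = ("title", "href")         # mirrors the two fields of ChoutiItem


item = ToyChoutiItem(title="some headline", href="/link/1")
print(dict(item))

try:
    item["author"] = "nobody"          # not declared, so it is rejected
except KeyError as error:
    print("rejected:", error)
```

A misspelled field name therefore fails at the point of assignment rather than silently producing incomplete records in the output file.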

Writing pipelines.py: persisting the data

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html


class ScrapyproPipeline(object):
    def process_item(self, item, spider):  # append each item to a file
        tpl = "%s\n%s\n\n" % (item['href'], item['title'])
        with open("news.json", 'a', encoding="utf-8") as fp:  # mind the file encoding
            fp.write(tpl)
        return item  # pass the item on to any later pipeline

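Note: to connect the Item produced by the spider with the pipeline that persists it, the pipeline must be enabled in settings.py

ITEM_PIPELINES = {
   'scrapyPro.pipelines.ScrapyproPipeline': 300,    # 300 is the priority; lower values run first
}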
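
The effect of the priority number can be pictured with a small stand-alone sketch. It does not use Scrapy: run_pipelines, strip_title and add_absolute_url are hypothetical helpers, and treating None as "drop the item" is an assumption of this sketch (Scrapy itself signals a drop by raising DropItem).

```python
def run_pipelines(item, pipelines):
    """Pass an item through (priority, function) pairs, lowest number first."""
    for _priority, process in sorted(pipelines, key=lambda pair: pair[0]):
        item = process(item)
        if item is None:        # in this sketch, None means "drop the item"
            break
    return item


def strip_title(item):
    """Return a copy of the item with surrounding whitespace removed from the title."""
    return {**item, "title": item["title"].strip()}


def add_absolute_url(item):
    """Return a copy of the item with an absolute URL built from href."""
    return {**item, "url": "https://dig.chouti.com" + item["href"]}


# Registration order does not matter; the numbers decide the order of execution.
pipelines = [(300, add_absolute_url), (100, strip_title)]
result = run_pipelines({"title": "  headline  ", "href": "/link/1"}, pipelines)
print(result)
```

Each stage receives what the previous stage returned, which is why a process_item method that forgets to return the item starves every pipeline registered after it.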

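Supplement: working with Selector

Selector(response=response).xpath('//a')    # every <a> element in the response, at any depth
Selector(response=response).xpath('//a[2]')    # every <a> that is the second <a> child of its parent
Selector(response=response).xpath('//a[@id="i1"]')    # <a> elements whose id attribute is "i1"; @ introduces an attribute
Selector(response=response).xpath('//a[@href="link.html"][@id="i1"]')    # several attribute conditions chained together
Selector(response=response).xpath('//a[starts-with(@href, "link")]')    # attribute starts with the given string
Selector(response=response).xpath('//a[contains(@href, "link")]')    # attribute contains the given string
Selector(response=response).xpath(r'//a[re:test(@id, "i\d+")]')    # attribute matches a regular expression
Selector(response=response).xpath(r'//a[re:test(@id, "i\d+")]/text()')    # text of the matching elements
Selector(response=response).xpath(r'//a[re:test(@id, "i\d+")]/@href')    # href attribute of the matching elements

All of the calls above return selector objects. To get the actual data (strings), use

Selector(response=response).xpath('/html/body/ul/li/a/@href').extract()    # extract all matches, as a list
Selector(response=response).xpath('//body/ul/li/a/@href').extract_first()    # extract the first match only

Note: '//' selects descendants, '/' selects direct children, and './' searches relative to the current element.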
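
The difference between the three path forms is easy to try out on a small document. The sketch below uses the standard library's xml.etree.ElementTree as a stand-in for Scrapy's Selector; ElementTree supports only a subset of XPath (no starts-with, contains or re:test), and the markup is made up for the illustration.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed markup; real pages would go through Scrapy's Selector.
doc = ET.fromstring(
    "<html><body>"
    "<ul>"
    "<li><a id='i1' href='link1.html'>first</a></li>"
    "<li><a id='i2' href='link2.html'>second</a>"
    "<div><a id='i3' href='link3.html'>nested</a></div></li>"
    "</ul>"
    "</body></html>"
)

# './/a': every <a> below the current node, at any depth (descendants)
all_links = [a.get("href") for a in doc.findall(".//a")]

# './body/ul/li/a': only <a> elements that are direct children of an <li>
direct_links = [a.get("href") for a in doc.findall("./body/ul/li/a")]

# attribute predicate, the same idea as //a[@id="i1"]
first = doc.find(".//a[@id='i1']")

# relative search that starts from an element selected earlier
second_li = doc.findall("./body/ul/li")[1]
inside_second = [a.text for a in second_li.findall(".//a")]

print(all_links)       # ['link1.html', 'link2.html', 'link3.html']
print(direct_links)    # ['link1.html', 'link2.html']
print(first.text)      # first
print(inside_second)   # ['second', 'nested']
```

The nested link3.html shows up only in the descendant search, which is the practical difference between '//' and '/'.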

Simple example: collecting the Chouti page links

import hashlib

import scrapy
from scrapy.http import Request
from scrapy.selector import Selector


class ChoutiSpider(scrapy.Spider):
    name = 'chouti'
    allowed_domains = ['chouti.com']
    start_urls = ['http://dig.chouti.com/']
    visited_urls = set()  # MD5 digests of the URLs seen so far; the set removes duplicates

    def md5(self, url):  # convert a URL to its MD5 hex digest
        ha_obj = hashlib.md5()
        ha_obj.update(bytes(url, encoding="utf-8"))
        key = ha_obj.hexdigest()
        return key

    def parse(self, response):
        # select the pager links
        page_objs = Selector(response=response).xpath("//div[@id='dig_lcpage']//a[@class='ct_pagepa']")
        for page in page_objs:
            href = page.xpath("@href").extract_first()
            ha_href = self.md5(href)
            if ha_href not in self.visited_urls:
                self.visited_urls.add(ha_href)  # remember the URL
                new_url = "https://dig.chouti.com%s" % href
                # Request the new page and parse it for further page links. Yielding
                # hands the request to the engine, which downloads it asynchronously.
                yield Request(url=new_url, callback=self.parse)

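Note: by default the spider follows every page, more than 100 in total. The crawl depth can be limited in settings.py

DEPTH_LIMIT = 1    # depth 1: links found on the start page are followed once, and no further

That follow-up is the second round of parsing, so the crawl only gets as far as page 14.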
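
Why a depth limit cuts the crawl short can be seen in a small simulation. It is a sketch under stated assumptions, not Scrapy code: reachable_pages is a hypothetical helper, and the pager in which every page links to the next three pages is invented (the real site shows a different number of links per page).

```python
from collections import deque


def reachable_pages(links, start, depth_limit):
    """Breadth-first walk that stops following links beyond `depth_limit`."""
    seen = {start}
    queue = deque([(start, 0)])        # (page, depth); the start page has depth 0
    while queue:
        page, depth = queue.popleft()
        if depth >= depth_limit:       # links found at the limit are not followed
            continue
        for nxt in links.get(page, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, depth + 1))
    return sorted(seen)


# Hypothetical pager: every page links to the next three pages, up to page 20.
links = {n: [m for m in range(n + 1, n + 4) if m <= 20] for n in range(1, 21)}

print(reachable_pages(links, start=1, depth_limit=1))   # [1, 2, 3, 4]
print(reachable_pages(links, start=1, depth_limit=2))   # [1, 2, 3, 4, 5, 6, 7]
```

With a limit of 1 only the pages linked directly from the start page are visited, and each extra level of depth extends the reach by one more window of pager links.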

Simple example: download the pictures and names from the Xiaohua site (xiaohuar.com) and store the files grouped by name

xiaohua.py

# -*- coding: utf-8 -*-
import hashlib

import scrapy
from scrapy.selector import Selector

from ..items import XiaohuaItem

class XiaohuaSpider(scrapy.Spider):
    name = 'xiaohua'
    allowed_domains = ['xiaohua.com']
    start_urls = ['http://www.xiaohuar.com/']
    visited_url = set()
    visited_url_img = {}
    visited_url_title = {}

    def md5(self,url):
        hash_obj = hashlib.md5()
        hash_obj.update(bytes(url,encoding="utf-8"))
        return hash_obj.hexdigest()

    def parse(self, response):
        # collect the detail-page URL of each person on the home page, skipping entries whose title contains "校草"
        xh_a = Selector(response=response).xpath("//ul[@class='twoline']/li")
        for xh in xh_a:
            a_url = xh.xpath("./a[@class='xhpic']/@href").extract_first()
            a_title = xh.xpath("./a/span/text()").extract_first()
            if a_title.find("校草") != -1:
                continue
            if not a_url.startswith("http"):
                a_url = "http://www.xiaohuar.com%s" % a_url
            ha_url = self.md5(a_url)
            if ha_url in self.visited_url:
                pass
            else:
                self.visited_url.add(ha_url)
                self.visited_url_title[ha_url]=a_title

                yield scrapy.Request(url=a_url,callback=self.parse,dont_filter=True)

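        # Next, find every photo on each person's detail page. Note: a thumbnail name starts with "small"
        # and has extra trailing characters; the full-size name is just the first 32 characters after "small".
        #/d/file/20171202/small062adbed4692d28b77a45e269d8f19031512203361.jpg   thumbnail
        #/d/file/20171202/     062adbed4692d28b77a45e269d8f1903.jpg             full-size photo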
       
        xh_img = Selector(response=response).xpath("//div[@class='post_entry']")
        xh_img_a = xh_img.xpath("./ul//img/@src")

        if len(xh_img) == 0:
            xh_img = Selector(response=response).xpath("//div[@class='photo-Middle']/div")
            if len(xh_img) != 0:
                xh_img = xh_img[1]
            else:
                xh_img = Selector(response=response).xpath("//div[@class='photo-m']/div")[1]
            xh_img_a = xh_img.xpath("./table//img/@src")
        
        # the elements have been located above; now loop over them and collect every photo URL

        for xh_img_item in xh_img_a:
            xh_img_url = xh_img_item.extract()

            # str.find returns -1 (truthy) when nothing is found, so test membership instead
            if "small" in xh_img_url:
                tmp_list = xh_img_url.rsplit("/",1)
                tmp_name_list = tmp_list[1].replace("small","").split(".")
                xh_img_url = "/".join([tmp_list[0],".".join([tmp_name_list[0][:32],tmp_name_list[1]])])
            if not xh_img_url.startswith("http"):
                xh_img_url = "http://www.xiaohuar.com%s" % xh_img_url
            # add every photo to the dictionary as url: name
            self.visited_url_img[xh_img_url] = self.visited_url_title[self.md5(response.url)]

        # once collection is complete the two lengths match, and persistence can begin
        if len(set(self.visited_url_img.values())) == len(self.visited_url) and len(self.visited_url) != 0:
            for xh_url in self.visited_url_img.items():
                item_obj = XiaohuaItem(title=xh_url[1],img_url=xh_url[0])
                yield item_obj
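
The thumbnail-to-full-size rule described in the spider's comments can be isolated into one function and checked on the example file name given there. large_image_url is a hypothetical helper written for this note; it assumes, as the comments state, that the full-size name is the 32 characters that follow "small".

```python
def large_image_url(url, base="http://www.xiaohuar.com"):
    """Turn a thumbnail URL into the full-size URL, following the naming rule above."""
    folder, filename = url.rsplit("/", 1)
    if filename.startswith("small"):
        stem, ext = filename[len("small"):].rsplit(".", 1)
        filename = f"{stem[:32]}.{ext}"     # keep only the 32-character hash part
    full = f"{folder}/{filename}"
    return full if full.startswith("http") else base + full


thumb = "/d/file/20171202/small062adbed4692d28b77a45e269d8f19031512203361.jpg"
print(large_image_url(thumb))
# http://www.xiaohuar.com/d/file/20171202/062adbed4692d28b77a45e269d8f1903.jpg
```

Keeping the rule in a separate function makes it testable without running a crawl, and the spider's loop body shrinks to a single call.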

items.py

import scrapy


class XiaohuaItem(scrapy.Item):
    title = scrapy.Field()
    img_url = scrapy.Field()

pipelines.py

import requests,os


class XiaohuaPipeline(object):
    def process_item(self, item, spider):
        '''
        title
        img_url
        '''
        file_path = os.path.join(os.path.dirname(os.path.abspath(__file__)),'upload',item['title'])
        if not os.path.isdir(file_path):
            os.makedirs(file_path)

        response = requests.get(item['img_url'],stream=False)
        with open(os.path.join(file_path,item['img_url'].rsplit("/",1)[1]),"wb") as fp:
            fp.write(response.content)

        return item

settings.py

ITEM_PIPELINES = {
   'scrapyPro.pipelines.ScrapyproPipeline': 300,
   'scrapyPro.pipelines.XiaohuaPipeline': 100,
}

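Supplement: a Request callback is never executed

Cause: the domain listed in allowed_domains is probably not the domain of the site being crawled

    allowed_domains = ['xiaohua.com']    # wrong domain; it should be xiaohuar.com
    start_urls = ['http://www.xiaohuar.com/']

Why allowed_domains is set:

A site may link to external sites. We only want to crawl the site itself, not the sites it links to, so allowed_domains is set to filter out external links. To crawl an external site as well, add its domain to the list.

Cause of the error above:

The site being crawled is xiaohuar.com, but the configured domain is xiaohua.com, so the crawl fails: the follow-up requests are filtered out before they are sent, and their callbacks never run.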
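
The mismatch is easy to reproduce outside Scrapy. The function below is only a sketch of the matching rule (a host is accepted if it equals an allowed domain or is a subdomain of one); is_allowed is a hypothetical name, and Scrapy's own offsite filtering has its own implementation.

```python
from urllib.parse import urlparse


def is_allowed(url, allowed_domains):
    """True if the URL's host equals an allowed domain or is a subdomain of one."""
    host = urlparse(url).hostname or ""
    return any(host == d or host.endswith("." + d) for d in allowed_domains)


url = "http://www.xiaohuar.com/"
print(is_allowed(url, ["xiaohua.com"]))    # False: one missing letter is enough
print(is_allowed(url, ["xiaohuar.com"]))   # True
```

Because the comparison works on whole domain labels, a near miss such as xiaohua.com fails in exactly the same way as a completely unrelated domain.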

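Solutions (either one works):

1. Fix allowed_domains (recommended)

allowed_domains = ['xiaohuar.com']

2. Set dont_filter on the Request (the URL is then not filtered)

yield scrapy.Request(url=a_url,callback=self.parse,dont_filter=True)

A custom class derived from BaseDupeFilter (used to filter URLs) can implement de-duplication, data collection, and similar operations. Requests sent with dont_filter=True bypass that filter, so using it may interfere with such logic.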
