Scrapy Quick Start

This article is a repost. Published 2019-05-15 11:58. Tag: scrapy

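1. What is Scrapy?

　　The official site sums it up as "A Fast & Powerful Scraping & Crawling Framework". Under the hood it uses Twisted, an event-driven networking engine written in Python, as its network layer, which is why its crawling throughput and performance are good.

　　Definition: Scrapy is an application framework written for crawling websites and extracting structured data. It can be used in a range of programs for data mining, information processing, or archiving historical data. It was originally designed for page scraping (more precisely, web scraping), but it can also be used to fetch data returned by APIs (for example Amazon Associates Web Services) or as a general-purpose web crawler. Scrapy has a wide range of uses, including data mining, monitoring, and automated testing.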

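2. Scrapy components and how they communicate

　　2.1 Flow diagram:

　　2.2 What each component does:

　　　　Engine (Scrapy Engine): controls the data flow across the whole system and triggers events (the core of the framework)

　　　　Scheduler: receives requests from the engine, pushes them onto a request queue, and hands them back when the engine asks for the next one. Think of it as a priority queue of URLs (the pages waiting to be crawled): it decides which URL is fetched next and also filters out duplicate URLs

　　　　Downloader: as the name suggests, downloads the page content and returns the result to the spider (via the engine)

　　　　Spiders: the spider does the actual work, parsing the information in a page into items (Item)

　　　　Item Pipeline: processes the items the spider extracts from pages. Its main jobs are persisting items, validating them, and removing unwanted data. Once a page has been parsed by the spider, the resulting items are sent to the item pipeline and pass through several processing steps in a specific order

　　　　Downloader Middlewares: hooks that sit between the engine and the downloader; they process the requests and responses passing between the Scrapy engine and the downloader

　　　　Spider Middlewares: hooks that sit between the engine and the spiders; they process the responses going into the spider and the requests coming out of it

　　　　Scheduler Middlewares: middleware between the Scrapy engine and the scheduler; it processes the requests and responses passed between the two (this component appears in descriptions of older Scrapy versions; the current architecture overview does not list it)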
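　　The scheduler described above (a priority queue of URLs that also drops duplicates) can be sketched in a few lines of plain Python. This is a toy model, not Scrapy's implementation: the class name ToyScheduler and its methods are made up for illustration, and real Scrapy deduplicates on a request fingerprint rather than on the raw URL string.

```python
import heapq
import itertools


class ToyScheduler:
    """Hypothetical stand-in: a priority queue of URLs that drops repeats."""

    def __init__(self):
        self._heap = []                    # entries: (negated priority, sequence, url)
        self._seen = set()                 # every URL ever enqueued
        self._counter = itertools.count()  # tie-breaker: FIFO within one priority

    def enqueue(self, url, priority=0):
        """Return True if the URL was queued, False if it was a duplicate."""
        if url in self._seen:
            return False
        self._seen.add(url)
        # heapq is a min-heap, so negate the priority to pop the highest first.
        heapq.heappush(self._heap, (-priority, next(self._counter), url))
        return True

    def next_url(self):
        """Return the next URL to fetch, or None when the queue is empty."""
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[2]


scheduler = ToyScheduler()
scheduler.enqueue("https://www.qiushibaike.com/text/page/1/")
scheduler.enqueue("https://www.qiushibaike.com/text/page/2/", priority=5)
duplicate_accepted = scheduler.enqueue("https://www.qiushibaike.com/text/page/1/")
order = [scheduler.next_url(), scheduler.next_url(), scheduler.next_url()]
```

　　Here the second URL is returned first because it has the higher priority, and the repeated first URL is rejected.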

　　2.3 Still feeling lost? Walking through the process step by step makes it much clearer
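　　As a rough walk-through of that process, the loop below imitates how the engine moves data between the scheduler, downloader, spider, and item pipeline. Everything here is a simplified assumption for illustration: FAKE_SITE, download, parse, and run_engine are hypothetical names, the "website" is a dictionary, and real Scrapy runs these steps asynchronously on Twisted with middlewares in between.

```python
from collections import deque

# Hypothetical in-memory "website": URL -> (jokes on the page, next page or None).
FAKE_SITE = {
    "/text/page/1/": (["joke A", "joke B"], "/text/page/2/"),
    "/text/page/2/": (["joke C"], None),
}


def download(url):
    """Downloader: fetch the page for a URL (here, a dictionary lookup)."""
    return FAKE_SITE[url]


def parse(page):
    """Spider: yield ("item", data) for scraped data and ("request", url) for links."""
    jokes, next_url = page
    for joke in jokes:
        yield ("item", {"content": joke})
    if next_url:
        yield ("request", next_url)


def run_engine(start_url):
    """Engine: pass requests, pages, and items between the other components."""
    queue = deque([start_url])  # plays the scheduler's request queue
    seen = {start_url}          # plays the scheduler's duplicate filter
    stored = []                 # plays the item pipeline's storage
    while queue:
        url = queue.popleft()
        page = download(url)
        for kind, value in parse(page):
            if kind == "item":
                stored.append(value)
            elif value not in seen:
                seen.add(value)
                queue.append(value)
    return stored


items = run_engine("/text/page/1/")
```

　　The loop ends when the spider stops producing new requests, which is also when a real crawl finishes.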

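3. Basic usage of Scrapy

　　3.1 Installation

# Create a conda virtual environment named crawling_py36
conda create --name crawling_py36 python=3.6

# Enter the environment (on newer conda versions: conda activate crawling_py36)
activate crawling_py36

# Install scrapy
conda install scrapy

# Leave the current environment (Windows)
deactivate

　　Note: you can also install with pip install scrapy. On Windows, however, Scrapy depends on the pypiwin32 module, so run pip install pypiwin32 before running pip install scrapy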

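　　3.2 Getting started (using the jokes on Qiushibaike as the example)

　　　　a. Create a project

# Create a folder and enter it
mkdir spider_projects
cd ./spider_projects

# Enter the virtual environment
activate crawling_py36

# Create the crawler project
scrapy startproject qsbk

# A success message is printed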

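　　　　b. Create a spider

　　　　The site's domain is qiushibaike.com, so we create a spider for that domain

# This must be run inside the project folder
cd qsbk

scrapy genspider qsbk_spider qiushibaike.com

　　　　Once this succeeds, the file tree looks like this: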

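　　　　c. Analyze the page structure (https://www.qiushibaike.com/text/)

　　　　　The div circled in red in the original screenshot holds all the jokes on the current page. We parse it with XPath; the expression that matches each individual joke is //div[@id='content-left']/div[contains(@class,'article')]

　　　　　The XPath expression for the next page's address is //ul[@class='pagination']/li[last()]/a/@href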
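　　To see what these two expressions select, here is a small sketch that runs equivalent queries against a hand-written fragment using only the standard library. The markup in PAGE is a hypothetical simplification of the real page. ElementTree supports only a subset of XPath, so contains() is replaced by a Python check on the class attribute and the trailing /@href by a call to get("href"); inside Scrapy, response.xpath accepts the full expressions as written above.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified fragment; the real page markup is much larger.
PAGE = """
<html><body>
  <div id="content-left">
    <div class="article block untagged mb15"><h2>alice</h2></div>
    <div class="article block untagged mb15"><h2>bob</h2></div>
    <div class="ad-banner">not a joke</div>
  </div>
  <ul class="pagination">
    <li><a href="/text/page/1/">1</a></li>
    <li><a href="/text/page/2/">next</a></li>
  </ul>
</body></html>
""".strip()

root = ET.fromstring(PAGE)

# ElementTree has no contains(), so filter on the class attribute in Python.
container = root.find(".//div[@id='content-left']")
articles = [d for d in container.findall("./div") if "article" in d.get("class", "")]
authors = [a.findtext("h2") for a in articles]

# li[last()] selects the final pagination entry; read href from its <a>.
next_link = root.find(".//ul[@class='pagination']/li[last()]/a")
next_href = next_link.get("href") if next_link is not None else None
```

　　The banner div is skipped because its class does not contain "article", and the next-page link is taken from the last list item whatever its position.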

　　　　d. Write the code

　　　　Crawler settings (settings.py):

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Override the default request headers:
DEFAULT_REQUEST_HEADERS = {
  'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
  'Accept-Language': 'en',
  'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.119 Safari/537.36',
}


# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
   'qsbk.pipelines.QsbkPipeline': 300,
}
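　　The number attached to each entry in ITEM_PIPELINES decides the order in which pipelines run: items pass through the lower values first, and the values are conventionally kept in the 0-1000 range. The sketch below only illustrates that ordering with a plain sort; the two extra pipeline names are hypothetical and do not exist in this project.

```python
# "qsbk.pipelines.QsbkPipeline" comes from the settings above; the other two
# entries are hypothetical names used only to show the ordering.
ITEM_PIPELINES = {
    "qsbk.pipelines.QsbkPipeline": 300,
    "qsbk.pipelines.DropEmptyPipeline": 100,
    "qsbk.pipelines.StatsPipeline": 800,
}

# Items visit the pipelines in ascending order of the configured number.
run_order = [path for path, _ in sorted(ITEM_PIPELINES.items(), key=lambda kv: kv[1])]
```

　　With these values an item would be checked by DropEmptyPipeline, then written by QsbkPipeline, then counted by StatsPipeline.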

　　　　Parsing the data (qsbk_spider.py):

# -*- coding: utf-8 -*-
import scrapy
from ..items import QsbkItem


class QsbkSpiderSpider(scrapy.Spider):
    name = 'qsbk_spider'
    allowed_domains = ['qiushibaike.com']
    start_urls = ['https://www.qiushibaike.com/text/page/1/']

    def parse(self, response):
        # Select every joke block on the page
        texts = response.xpath("//div[@id='content-left']/div[contains(@class, 'article')]")
        # Extract the fields of each joke
        for text in texts:
            author_info = text.xpath("./div[contains(@class,'author')]")
            # Prepend a scheme to the avatar src; skip it if the avatar is missing
            img_src = author_info.xpath(".//img/@src").get()
            head_img = ('http:' + img_src.strip()) if img_src else None
            # get(default='') avoids an AttributeError when the node is missing
            author_name = author_info.xpath(".//h2/text()").get(default='').strip()
            content_list = text.xpath(".//div[@class='content']/span/text()").getall()
            content = ''.join(content_list).strip()

            article_item = QsbkItem(author_name=author_name, head_img=head_img, content=content)

            yield article_item

        # Address of the next page (None on the last page)
        next_page_url = response.xpath("//ul[@class='pagination']/li[last()]/a/@href").get()

        if next_page_url:
            # urljoin resolves the relative href against the current page URL
            yield scrapy.Request(url=response.urljoin(next_page_url), callback=self.parse)
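　　The spider above has to turn the relative href of the next page into an absolute URL before requesting it. A minimal sketch of that resolution step with the standard library's urljoin, which Scrapy's response.urljoin builds on; the image host in the last example is a made-up value for illustration.

```python
from urllib.parse import urljoin

current_page = "https://www.qiushibaike.com/text/page/1/"

# A root-relative href replaces the whole path of the current page.
next_page = urljoin(current_page, "/text/page/2/")

# An already absolute URL is returned unchanged.
absolute = urljoin(current_page, "https://www.qiushibaike.com/text/page/3/")

# A protocol-relative URL keeps only the scheme of the current page.
# (pic.example.com is a hypothetical host.)
image = urljoin(current_page, "//pic.example.com/avatar.jpg")
```

　　Compared with concatenating a fixed domain and the href, this also copes with links that are already absolute or that point at another host.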

　　　　Structured item (items.py):

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class QsbkItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    author_name = scrapy.Field()
    head_img = scrapy.Field()
    content = scrapy.Field()

　　　　Persisting the data (pipelines.py):

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html

# Variant 1: serialize each item by hand with the json module
# import json
# class QsbkPipeline(object):
#
#     def open_spider(self, spider):
#         print('start crawling....')
#         self.file = open('qsbk.json', 'w', encoding='utf-8')
#
#     def process_item(self, item, spider):
#         item_json = json.dumps(dict(item), ensure_ascii=False)
#         self.file.write(item_json + '\n')
#         return item
#
#     def close_spider(self, spider):
#         self.file.close()
#         print('stop crawling.....')

# Variant 2: write everything at once as a single JSON array
# from scrapy.exporters import JsonItemExporter, JsonLinesItemExporter
# class QsbkPipeline(object):
#
#     def open_spider(self, spider):
#         print('start crawling....')
#         self.file = open('qsbk.json', 'wb')
#         self.exporter = JsonItemExporter(self.file, ensure_ascii=False)
#         self.exporter.start_exporting()
#
#     def process_item(self, item, spider):
#         self.exporter.export_item(item)
#         return item  # pass the item on to any later pipeline
#
#     def close_spider(self, spider):
#         self.exporter.finish_exporting()
#         self.file.close()
#         print('stop crawling.....')

# Variant 3: write one item per line (recommended when there is a lot of data)
from scrapy.exporters import JsonLinesItemExporter


class QsbkPipeline(object):

    def open_spider(self, spider):
        print('start crawling....')
        # The exporter writes bytes, so the file is opened in binary mode
        self.file = open('qsbk.json', 'wb')
        self.exporter = JsonLinesItemExporter(self.file, ensure_ascii=False)

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item

    def close_spider(self, spider):
        self.file.close()
        print('stop crawling.....')
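　　The pipeline variants above differ mainly in the layout of the output file: JsonItemExporter produces one JSON array, while JsonLinesItemExporter produces one JSON object per line. The sketch below reproduces both layouts with the standard json module only, to show why the line-based form suits large crawls; the sample items are invented for illustration.

```python
import io
import json

# Hypothetical sample items with the same fields as QsbkItem.
items = [
    {"author_name": "alice", "content": "first joke"},
    {"author_name": "bob", "content": "second joke"},
]

# JSON array: a single document, valid only once the closing bracket is written.
array_text = json.dumps(items, ensure_ascii=False)

# JSON Lines: one standalone object per line, appended as each item arrives.
buffer = io.StringIO()
for item in items:
    buffer.write(json.dumps(item, ensure_ascii=False) + "\n")
lines_text = buffer.getvalue()

# Reading back: the array is parsed as a whole, the lines one at a time.
from_array = json.loads(array_text)
from_lines = [json.loads(line) for line in lines_text.splitlines()]
```

　　Because every line is complete on its own, a line-based file can be processed as a stream and stays readable even if the crawl is interrupted.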

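　　Note: to avoid typing the command every time you crawl, you can create a run.py in the project root and run it instead; Scrapy provides the cmdline module for executing its commands from Python

　　Starting the crawler (run.py)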

from scrapy import cmdline

cmdline.execute("scrapy crawl qsbk_spider".split())
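　　cmdline.execute receives the command as a list of arguments, which is why run.py splits the string first. A plain split() is fine for the command above, but it breaks as soon as an argument contains a space. The sketch below compares it with shlex.split from the standard library; the quoted output path is an assumption added for illustration and is not part of the original project.

```python
import shlex

# The command from run.py: no quoting involved, so str.split is enough.
simple = "scrapy crawl qsbk_spider".split()

# Assumption for illustration: an output path that contains a space.
command = 'scrapy crawl qsbk_spider -o "qsbk output.json"'

naive = command.split()      # cuts the quoted path in two
argv = shlex.split(command)  # honors the quotes, like a shell would
```

　　Passing argv rather than naive to cmdline.execute would keep the path as a single argument.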

　　e. Crawl finished (qsbk.json)
