Python Basics: An Introduction to Scrapy

Reposted article, 2021-08-14 22:31. Tags: Web crawler / Scrapy / Python

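Scrapy is a step up from hand-written crawlers. It fetches pages concurrently (through asynchronous I/O rather than one thread per request), simplifies the code structure, and speeds up development, which makes it popular among crawler developers. This article uses a stock-quote website as an example to outline how to build a crawler with Scrapy. It is shared for learning purposes only, and corrections are welcome.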

What is Scrapy?

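Scrapy is an application framework written in Python for crawling websites and extracting structured data. It uses Twisted, an efficient asynchronous networking framework, to handle network communication.

The components of the Scrapy architecture are described below: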

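ScrapyEngine: the engine. It controls the flow of data between all components of the system and triggers events when the corresponding actions occur. It is the "brain" of the crawler and the dispatch center of the whole process.
Scheduler: the scheduler. It receives requests from the engine and enqueues them. Both the initial URLs and the URLs discovered later in crawled pages are placed in the scheduler to wait their turn. The scheduler automatically filters out duplicate URLs.
Downloader: the downloader. It fetches page data and hands it to the engine, which then passes it on to the spider.
Spider: the crawler logic. It is the user-written code that analyzes a response, extracts items, and extracts additional URLs to follow. The URLs to follow are submitted to ScrapyEngine and added to the Scheduler. Each spider handles one specific website (or a few).
ItemPipeline: the item pipeline. It processes the items extracted by the spider. Once a page has been parsed and the required data stored in an Item, the Item is sent to the pipeline and processed by the pipeline components in their configured order.
DownloaderMiddlewares: the downloader middlewares. These are specific hooks between the engine and the downloader that process the requests and responses passing between them. They provide a simple mechanism for extending Scrapy by plugging in custom code. For example, a downloader middleware can rotate the user-agent or the IP address automatically.
SpiderMiddlewares: the spider middlewares. These are specific hooks between the engine and the spider that process the spider's input (responses) and output (items or requests). They provide the same simple mechanism for extending Scrapy with custom code.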
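
As an illustration of the user-agent rotation mentioned for downloader middlewares, here is a minimal sketch. The class name RandomUserAgentMiddleware and the USER_AGENTS list are assumptions made for this example and do not come from the original project. The only Scrapy convention it relies on is the process_request(request, spider) hook, which Scrapy calls for each request on its way to the downloader.

```python
import random


class RandomUserAgentMiddleware:
    """Downloader middleware that sets a random User-Agent on every request.

    The class name and the USER_AGENTS pool are illustrative assumptions.
    """

    # Hypothetical pool of user-agent strings; replace with real ones.
    USER_AGENTS = [
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ExampleBrowser/1.0",
        "Mozilla/5.0 (X11; Linux x86_64) ExampleBrowser/2.0",
    ]

    def process_request(self, request, spider):
        # Overwrite the header before the request reaches the downloader.
        request.headers["User-Agent"] = random.choice(self.USER_AGENTS)
        # Returning None tells Scrapy to keep processing this request.
        return None
```

To take effect, such a class would have to be registered in the DOWNLOADER_MIDDLEWARES setting of settings.py, for example under a path like 'stockstar.middlewares.RandomUserAgentMiddleware' (a hypothetical location) with a priority number.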

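Scrapy data flow:

1. ScrapyEngine opens a website, finds the Spider that handles it, and asks that Spider for the first URL(s) to crawl.
2. ScrapyEngine takes the first URL(s) from the Spider and hands them to the Scheduler, where they are queued as Requests.
3. ScrapyEngine asks the Scheduler for the next URL to crawl.
4. The Scheduler returns the next URL to ScrapyEngine, which forwards it to the Downloader through the DownloaderMiddlewares.
5. When the page has been downloaded, the Downloader generates a Response for it and sends it to ScrapyEngine through the DownloaderMiddlewares.
6. ScrapyEngine receives the Response from the Downloader and sends it to the Spider through the SpiderMiddlewares.
7. The Spider processes the Response and returns the extracted Items and any new Requests to ScrapyEngine.
8. ScrapyEngine passes the Items returned by the Spider to the ItemPipeline and the Requests to the Scheduler. The process repeats from step 2 until the Scheduler has no pending Requests, at which point ScrapyEngine shuts down.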
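
The loop described above can be mimicked with a deliberately simplified, synchronous toy model. This is not how Scrapy is implemented (the real engine is asynchronous and built on Twisted). The FAKE_SITE data and the function names are assumptions made up for this sketch, which only shows the roles of the scheduler queue, the duplicate filter, the downloader, the spider callback, and the pipeline.

```python
from collections import deque

# Hypothetical in-memory "website": URL -> (items on the page, links on the page).
FAKE_SITE = {
    "/page1": (["item-a"], ["/page2", "/page3"]),
    "/page2": (["item-b"], ["/page3"]),
    "/page3": (["item-c"], ["/page1"]),
}


def download(url):
    """Stand-in for the Downloader: returns the 'response' for a URL."""
    return FAKE_SITE[url]


def parse(response):
    """Stand-in for a Spider callback: yields items, then follow-up URLs."""
    items, links = response
    for item in items:
        yield ("item", item)
    for link in links:
        yield ("request", link)


def crawl(start_urls):
    """Toy engine loop: schedule, download, parse, repeat until the queue is empty."""
    queue = deque()   # plays the role of the Scheduler
    seen = set()      # duplicate filter
    collected = []    # plays the role of the ItemPipeline

    def schedule(url):
        if url not in seen:
            seen.add(url)
            queue.append(url)

    for url in start_urls:
        schedule(url)
    while queue:
        response = download(queue.popleft())
        for kind, value in parse(response):
            if kind == "item":
                collected.append(value)
            else:
                schedule(value)
    return collected


print(crawl(["/page1"]))  # ['item-a', 'item-b', 'item-c']
```

Each page is visited once even though the pages link to each other, because the duplicate filter rejects URLs that have already been scheduled.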

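Installing Scrapy

From the command line, install Scrapy with pip install scrapy.

When the installation succeeds, pip ends its output with a "Successfully installed ..." line that lists Scrapy and its dependencies.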

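Creating a Scrapy project

From the command line, change to the directory where the project should live and create the crawler project with scrapy startproject stockstar.

Following the prompt printed by that command, create a spider from the built-in template. The command format is scrapy genspider <spider name> <domain>. For the spider used in this article, that would be scrapy genspider stock quote.stockstar.com.

Note: the spider name must not be the same as the project name, otherwise the command reports an error.

Open the newly created Scrapy project in PyCharm.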

Crawl target

This example scrapes stock IDs and names from the market-quotes section of a securities website.

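Developing the Scrapy crawler

Once the project has been created from the command line, the basic Scrapy skeleton is in place. What remains is filling in the business logic.

Defining the item

Define the fields to be scraped, as follows: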

import scrapy


class StockstarItem(scrapy.Item):
    """
    Fields to be scraped for each stock
    """
    # define the fields for your item here like:
    # name = scrapy.Field()
    stock_type = scrapy.Field()  # stock type
    stock_id = scrapy.Field()  # stock ID
    stock_name = scrapy.Field()  # stock name

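Writing the spider logic

A Scrapy spider has a fixed structure: define a class that inherits from scrapy.Spider, set its attributes (spider name, allowed domains, start URLs), and override the parent's parse method. The code inside parse depends on the structure of the page being scraped, as follows: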

import scrapy

# Assumes the default project layout, with the item class in stockstar/items.py
from stockstar.items import StockstarItem


class StockSpider(scrapy.Spider):
    name = 'stock'
    allowed_domains = ['quote.stockstar.com']  # domain the spider may crawl
    start_urls = ['http://quote.stockstar.com/stock/stock_index.htm']  # start URL

    def parse(self, response):
        """
        Parse the index page and yield one item per stock.
        :param response: the downloaded page
        :return: a generator of StockstarItem objects
        """
        # Shanghai A, Shanghai B, Shenzhen A, Shenzhen B
        styles = ['滬A', '滬B', '深A', '深B']
        # The loop index selects the list with id "index_data_<index>"
        base = ('//div[@class="w"]/div[@class="main clearfix"]/div[@class="seo_area"]'
                '/div[@class="seo_keywordsCon"]/ul[@id="index_data_%d"]')
        for index, style in enumerate(styles):
            print('******************** Scraping ' + style + ' stocks ********************')
            ids = response.xpath((base % index) + '/li/span/a/text()').getall()
            names = response.xpath((base % index) + '/li/a/text()').getall()
            for stock_id, stock_name in zip(ids, names):
                # Create a new item per stock so yielded items do not share state
                item = StockstarItem()
                item['stock_type'] = style
                item['stock_id'] = stock_id
                item['stock_name'] = stock_name
                yield item
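
To see how the two XPath queries in parse pair up IDs and names, the following sketch runs equivalent queries against a small hand-written snippet. Two things here are assumptions: the markup is inferred only from the XPath expressions in the spider and may not match the real page, and the standard library's xml.etree.ElementTree is used in place of Scrapy's selectors so that the sketch runs without Scrapy. The IDs and names are placeholders.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup, inferred from the spider's XPath expressions only.
SAMPLE = """
<div class="seo_keywordsCon">
  <ul id="index_data_0">
    <li><span><a>111111</a></span><a>Example Co. A</a></li>
    <li><span><a>222222</a></span><a>Example Co. B</a></li>
  </ul>
</div>
"""

root = ET.fromstring(SAMPLE)
ul = root.find(".//ul[@id='index_data_0']")

# Stock IDs sit in <li><span><a>, names in the <a> directly under <li>.
ids = [a.text for a in ul.findall("./li/span/a")]
names = [a.text for a in ul.findall("./li/a")]

# Pair the two lists by position, as the spider does.
pairs = list(zip(ids, names))
print(pairs)  # [('111111', 'Example Co. A'), ('222222', 'Example Co. B')]
```

Pairing by position only works if both queries return the same number of entries in the same order. If a list item lacked one of the two links, IDs and names would be mismatched.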

Processing the data

The pipeline processes the scraped data. To keep this example simple, it only prints each item to the console, as follows:

class StockstarPipeline:
    def process_item(self, item, spider):
        # Print each item to the console, then pass it on to the next pipeline
        print(f"Stock type: {item['stock_type']} | Stock ID: {item['stock_id']} | Stock name: {item['stock_name']}")
        return item

Note: item fields can only be assigned with item['key'] = value. Assigning with item.key = value is not supported and raises an AttributeError.
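
The pipeline in this article only prints items. As a sketch of what storing them could look like, the following pipeline writes one CSV row per item. The class name StockCsvPipeline and the file name stocks.csv are assumptions for this example. The method names open_spider, process_item, and close_spider are the hooks Scrapy calls on a pipeline.

```python
import csv


class StockCsvPipeline:
    """Writes each scraped item as one row of a CSV file.

    The class name and the default file name are illustrative assumptions.
    """

    FIELDS = ["stock_type", "stock_id", "stock_name"]

    def __init__(self, path="stocks.csv"):
        self.path = path
        self.file = None
        self.writer = None

    def open_spider(self, spider):
        # Called once when the spider starts: open the file, write the header.
        self.file = open(self.path, "w", newline="", encoding="utf-8")
        self.writer = csv.DictWriter(self.file, fieldnames=self.FIELDS)
        self.writer.writeheader()

    def process_item(self, item, spider):
        # Items support dict-style access, so plain dicts work here as well.
        self.writer.writerow({field: item[field] for field in self.FIELDS})
        return item

    def close_spider(self, spider):
        # Called once when the spider finishes.
        self.file.close()
```

Like StockstarPipeline, such a class would only run if it were listed in ITEM_PIPELINES in settings.py.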

Configuring Scrapy

Configuration lives in settings.py and covers request headers, item pipelines, the robots.txt policy, and so on, as follows:

# Scrapy settings for stockstar project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'stockstar'

SPIDER_MODULES = ['stockstar.spiders']
NEWSPIDER_MODULE = 'stockstar.spiders'


# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'stockstar (+http://www.yourdomain.com)'

# Obey robots.txt rules (whether to follow the robots protocol)
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
DEFAULT_REQUEST_HEADERS = {
    # 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.131 Mobile Safari/537.36',
    # 'Accept-Language': 'en,zh-CN,zh;q=0.9',
}

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'stockstar.middlewares.StockstarSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'stockstar.middlewares.StockstarDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
   'stockstar.pipelines.StockstarPipeline': 300,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
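
The configuration above turns off robots.txt checking and leaves the throttling options commented out. For comparison, a more conservative fragment might look like the sketch below. It uses only setting names that already appear in the file, but the values are illustrative assumptions, not the author's configuration.

```python
# settings.py (fragment): a more conservative alternative, values are illustrative.
ROBOTSTXT_OBEY = True                # respect the site's robots.txt
DOWNLOAD_DELAY = 3                   # seconds to wait between requests to the same site
CONCURRENT_REQUESTS_PER_DOMAIN = 4   # limit parallel requests per domain
AUTOTHROTTLE_ENABLED = True          # adapt the delay to the server's latency
```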


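Running Scrapy

A Scrapy project is a set of cooperating modules rather than a single script, so it is normally started from the terminal, inside the project directory, with scrapy crawl <spider name>: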

scrapy crawl stock
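
If starting the crawl from a Python script is more convenient, for example from an IDE run configuration, one option is to launch the same command in a child process. The sketch below is one way to do that. The helper names build_crawl_command and run_crawl are hypothetical. It relies on python -m scrapy being equivalent to the scrapy command and on the -o option of scrapy crawl, which writes the scraped items to a feed file.

```python
import subprocess
import sys


def build_crawl_command(spider_name, output_file=None):
    """Build the argument list for `scrapy crawl <spider_name>`."""
    # `python -m scrapy` runs the same command-line tool as `scrapy`.
    command = [sys.executable, "-m", "scrapy", "crawl", spider_name]
    if output_file is not None:
        # -o appends the scraped items to a feed file (for example JSON or CSV).
        command += ["-o", output_file]
    return command


def run_crawl(spider_name, output_file=None, project_dir="."):
    """Run the crawl in a child process; project_dir must contain scrapy.cfg."""
    return subprocess.run(
        build_crawl_command(spider_name, output_file),
        cwd=project_dir,
        check=True,  # raise CalledProcessError if the crawl exits with an error
    )


# Example call, to be run from a script in the project directory:
# run_crawl("stock", output_file="stocks.json")
```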


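Remarks

This example is relatively simple and only illustrates common Scrapy usage. Everything it scrapes is present in the page source returned by the first request, that is, what you see is what you get.

Two open questions remain:

How should content that spans multiple pages, and therefore needs multiple requests, be handled?
How should asynchronously loaded content be handled, where the page request returns only a skeleton and the data is filled in afterwards (the common Ajax pattern)?

These two questions will be analyzed further when they come up. To close, here is a poem by Tao Yuanming, "Returning to Dwell in Gardens and Fields".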

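Returning to Dwell in Gardens and Fields (No. 1)

[Author] Tao Yuanming [Dynasty] Wei-Jin

From youth I was never suited to the common world; by nature I loved the hills and mountains. By mistake I fell into the dusty net, and thirty years went by.

The caged bird longs for its old woods; the pond fish misses its native deep. I clear wasteland at the edge of the southern wilds and, keeping to my simple ways, return to garden and field.

My homestead covers some ten mu, with eight or nine thatched rooms. Elms and willows shade the back eaves; peach and plum trees stand in rows before the hall.

Dim in the distance lie the villages; smoke drifts softly from the hamlet. A dog barks deep in the lane; a rooster crows atop the mulberry tree.

No dust or clutter enters my courtyard; the empty rooms hold leisure to spare. Long confined in a cage, I have returned to nature at last.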
