Python in Practice: Asynchronous Crawlers (Coroutines) + Distributed Crawlers (Multiprocessing)

Reposted on 2019-11-09 12:12 from: https://blog.csdn.net/SL_World/article/details/86633611

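Before going into detail, let's look at a figure that shows how multiprocess crawlers and coroutine-based crawlers work and how they differ. (Image source: the web)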

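Unlike a multiprocess crawler, an asynchronous crawler handles many tasks concurrently on a single thread: it creates one event loop and adds all tasks to it. When the loop reaches a task that hits a slow operation (such as requesting a URL), it suspends that task and moves on to the next one. Once the suspended task's state changes (for example, the page response arrives), the task is woken up and resumes from where it was suspended. This greatly reduces the time wasted on waiting.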
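
The suspend-and-resume behaviour described above can be seen without any network access. The following is a minimal sketch, not part of the original article: fake_request is a hypothetical stand-in for a web request, and asyncio.sleep plays the role of the slow operation.

```python
import asyncio
import time

async def fake_request(name, delay):
    """Hypothetical stand-in for a network request."""
    # The task is suspended here; the event loop runs other tasks meanwhile.
    await asyncio.sleep(delay)
    return name

async def run_all():
    start = time.perf_counter()
    # Three simulated 0.2 s requests run concurrently on a single thread.
    results = await asyncio.gather(
        fake_request('a', 0.2),
        fake_request('b', 0.2),
        fake_request('c', 0.2),
    )
    elapsed = time.perf_counter() - start
    return results, elapsed

if __name__ == '__main__':
    results, elapsed = asyncio.run(run_all())
    print(results, 'in %.2f s' % elapsed)
```

Because the three waits overlap, the total time should be close to the longest single wait rather than the sum of all three.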

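For the principles and implementation of coroutines (the asyncio library), see "Python Asynchronous IO: Coroutines (Explained)".
For background on multiprocessing and how to use it, see "Liao Xuefeng: Python Multiprocessing".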

With asyncio providing coroutines, we still need a way to make web requests asynchronously. That is what the aiohttp library is for.

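Asynchronous web requests with aiohttp
When writing an ordinary crawler, we usually use the requests library to request pages and get the server's response. In a coroutine, however, the calls that requests provides are blocking and do not return awaitable objects, so they cannot be placed after await, and requests cannot make non-blocking requests inside a coroutine program. The third-party aiohttp library, built on asyncio, fills this gap: it provides asynchronous web requests (among other features) and is essentially an asynchronous counterpart to requests.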
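
To make the "not awaitable" point concrete, here is a small sketch that is not from the original article. blocking_fetch is a hypothetical stand-in for a blocking call such as requests.get, and run_in_executor is shown only as a general workaround for blocking functions; the article itself uses aiohttp instead.

```python
import asyncio
import inspect
import time

def blocking_fetch(url):
    """Hypothetical stand-in for a blocking call such as requests.get(url)."""
    time.sleep(0.1)
    return 'response from ' + url

async def fetch_in_thread(url):
    loop = asyncio.get_running_loop()
    # run_in_executor returns an awaitable future; None selects the
    # loop's default thread pool, so the event loop itself is not blocked.
    return await loop.run_in_executor(None, blocking_fetch, url)

async def main():
    urls = ['u1', 'u2', 'u3']  # placeholder names, not real URLs
    return await asyncio.gather(*(fetch_in_thread(u) for u in urls))

if __name__ == '__main__':
    # A plain return value is not awaitable, so `await blocking_fetch(...)` would fail.
    print(inspect.isawaitable(blocking_fetch('u0')))
    print(asyncio.run(main()))
```

This keeps the event loop responsive, but each blocking call still occupies a thread, which is why a natively asynchronous client is usually the better fit for a crawler.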

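[Basic usage]: The official documentation recommends making requests through a ClientSession object. Inside a coroutine, we call the session's get() or request() method to request a page. (async with is an asynchronous context manager; it acquires and releases the session and the response asynchronously.)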

import asyncio
import aiohttp

async def main():
    # async with must be used inside a coroutine
    async with aiohttp.ClientSession() as session:
        async with session.get('http://httpbin.org/get') as resp:
            print(resp.status)
            print(await resp.text())

asyncio.run(main())

Besides get(), ClientSession also provides methods for the other common HTTP methods.

session.request(method='GET', url='http://httpbin.org/request')
session.post('http://httpbin.org/post', data=b'data')
session.put('http://httpbin.org/put', data=b'data')
session.delete('http://httpbin.org/delete')
session.head('http://httpbin.org/get')
session.options('http://httpbin.org/get')
session.patch('http://httpbin.org/patch', data=b'data')

[Case study]: Scrape the titles of 10 papers from the AAAI 2018 conference.

1. Testing an ordinary crawler

import time
from lxml import etree
import requests

urls = [
    'https://aaai.org/ocs/index.php/AAAI/AAAI18/paper/viewPaper/16488',
    'https://aaai.org/ocs/index.php/AAAI/AAAI18/paper/viewPaper/16583',
    # remaining 8 URLs omitted...
]

def get_title(url, cnt):
    """Request an AAAI page and parse the HTML to extract the title."""
    response = requests.get(url)  # send the request and get the response
    html = response.content       # page body (.content is bytes, .text is str)
    title = etree.HTML(html).xpath('//*[@id="title"]/text()')  # parse the HTML with XPath
    print('Title %d: %s' % (cnt, ''.join(title)))

if __name__ == '__main__':
    start1 = time.time()
    for i, url in enumerate(urls, start=1):
        start = time.time()
        get_title(url, i)
        print('Title %d took %.5f s' % (i, time.time() - start))
    print('Total time: %.5f s' % (time.time() - start1))

The output is as follows:

Title 1: Norm Conflict Resolution in Stochastic Domains
Title 1 took 1.41810 s
Title 2: Algorithms for Trip-Vehicle Assignment in Ride-Sharing
Title 2 took 1.31734 s
Title 3: Tensorized Projection for High-Dimensional Binary Embedding
Title 3 took 1.31826 s
Title 4: Synthesis of Programs from Multimodal Datasets
Title 4 took 1.28625 s
Title 5: Video Summarization via Semantic Attended Networks
Title 5 took 1.33226 s
Title 6: TIMERS: Error-Bounded SVD Restart on Dynamic Networks
Title 6 took 1.52718 s
Title 7: Memory Management With Explicit Time in Resource-Bounded Agents
Title 7 took 1.35522 s
Title 8: Mitigating Overexposure in Viral Marketing
Title 8 took 1.35722 s
Title 9: Neural Link Prediction over Aligned Networks
Title 9 took 1.51317 s
Title 10: Dual Deep Neural Networks Cross-Modal Hashing
Title 10 took 1.30624 s
Total time: 13.73324 s

As shown, requesting one URL and parsing its HTML takes about 1.4 seconds on average, and this run took 13.7 seconds in total.

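2. Testing a coroutine-based asynchronous crawler

Below is the asynchronous crawler that uses coroutines. The etree module parses the HTML, and aiohttp is a library built on asyncio whose API looks much like that of requests; for now you can think of it as a coroutine version of requests.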

import time
from lxml import etree
import aiohttp
import asyncio

urls = [
    'https://aaai.org/ocs/index.php/AAAI/AAAI18/paper/viewPaper/16488',
    'https://aaai.org/ocs/index.php/AAAI/AAAI18/paper/viewPaper/16583',
    # remaining 8 URLs omitted...
]

async def get_title(sem, url):
    """Request an AAAI page and parse the HTML to extract the title."""
    async with sem:  # the semaphore limits how many requests run at once
        # async with is an asynchronous context manager
        async with aiohttp.ClientSession() as session:  # open a session
            async with session.request('GET', url) as resp:  # send the request
                # html_unicode = await resp.text()
                # html = bytes(bytearray(html_unicode, encoding='utf-8'))
                html = await resp.read()  # get bytes directly
                title = etree.HTML(html).xpath('//*[@id="title"]/text()')
                print(''.join(title))

async def main():
    """Caller: schedule all tasks and wait for them to finish."""
    sem = asyncio.Semaphore(10)  # semaphore: caps concurrent coroutines so we don't crawl too fast
    tasks = [get_title(sem, url) for url in urls]  # collect all tasks in a list
    await asyncio.gather(*tasks)  # run the coroutines concurrently

if __name__ == '__main__':
    start = time.time()
    asyncio.run(main())  # creates the event loop, runs main(), then closes the loop
    print('Total time: %.5f s' % (time.time() - start))

The output is as follows:

Memory Management With Explicit Time in Resource-Bounded Agents
Norm Conflict Resolution in Stochastic Domains
Video Summarization via Semantic Attended Networks
Tensorized Projection for High-Dimensional Binary Embedding
Algorithms for Trip-Vehicle Assignment in Ride-Sharing
Dual Deep Neural Networks Cross-Modal Hashing
Neural Link Prediction over Aligned Networks
Mitigating Overexposure in Viral Marketing
TIMERS: Error-Bounded SVD Restart on Dynamic Networks
Synthesis of Programs from Multimodal Datasets
Total time: 2.43371 s

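As shown, crawling the 10 URLs with coroutines took only 2.4 seconds, roughly 5 to 6 times faster than the ordinary synchronous program in this run (13.7 s vs. 2.4 s).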

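[Notes]:

In requests, response.text holds the page as a str (Unicode) and response.content holds it as bytes; in aiohttp, await resp.text() returns a str and await resp.read() returns bytes. etree.HTML(html) is safest with bytes input, because lxml rejects a str that carries an XML encoding declaration. So either ① get the bytes directly with resp.read(), or ② if you use text(), convert the str back to bytes, e.g. bytes(bytearray(html_unicode, encoding='utf-8')), which first builds a bytearray and then a bytes object (html_unicode.encode('utf-8') gives the same result more directly).
Besides session.request('GET', url) used above, a request can also be made with session.get(url); the two are equivalent.
If too many requests are made at once, connections may be dropped. That is why sem = asyncio.Semaphore(10) is used: a Semaphore is a synchronization primitive that limits how many coroutines run a section at the same time.
async with is an asynchronous context manager; if it is unfamiliar, see "Usage of async with in Python".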
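
The effect of the semaphore can be checked in isolation. The sketch below is illustrative only: worker, run and the state dictionary are made-up names, and asyncio.sleep stands in for a real request. It records the highest number of coroutines that were inside the guarded section at the same moment.

```python
import asyncio

async def worker(sem, state):
    async with sem:  # at most `limit` workers are inside this block at once
        state['active'] += 1
        state['peak'] = max(state['peak'], state['active'])
        await asyncio.sleep(0.05)  # simulated request
        state['active'] -= 1

async def run(limit, n_tasks):
    sem = asyncio.Semaphore(limit)  # created inside the running event loop
    state = {'active': 0, 'peak': 0}
    await asyncio.gather(*(worker(sem, state) for _ in range(n_tasks)))
    return state['peak']

if __name__ == '__main__':
    # 10 tasks compete, but the peak never exceeds the limit of 3.
    print(asyncio.run(run(limit=3, n_tasks=10)))
```

Creating the semaphore inside the coroutine, rather than at module level, avoids tying it to a different event loop on older Python versions.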

3. Testing a multiprocess-based distributed crawler
Next we test the multiprocess crawler. My computer has a 4-core CPU, so I set the process pool size to 4.

import multiprocessing
from multiprocessing import Pool
import time
import requests
from lxml import etree

urls = [
    'https://aaai.org/ocs/index.php/AAAI/AAAI18/paper/viewPaper/16488',
    'https://aaai.org/ocs/index.php/AAAI/AAAI18/paper/viewPaper/16583',
    # remaining 8 URLs omitted...
]

def get_title(url, cnt):
    """Request an AAAI page and parse the HTML to extract the title."""
    response = requests.get(url)  # send the request
    html = response.content       # page body as bytes
    title = etree.HTML(html).xpath('//*[@id="title"]/text()')  # parse the HTML with XPath
    print('Title %d: %s' % (cnt, ''.join(title)))

def main():
    """Caller: distribute the URLs over a process pool."""
    print('CPU cores in this environment: %d' % multiprocessing.cpu_count())
    p = Pool(4)  # process pool
    for i, url in enumerate(urls, start=1):
        p.apply_async(get_title, args=(url, i))
    p.close()
    p.join()   # wait for all child processes before continuing

if __name__ == '__main__':
    start = time.time()
    main()
    print('Total time: %.5f s' % (time.time() - start))

The output:

CPU cores in this environment: 4
Title 2: Algorithms for Trip-Vehicle Assignment in Ride-Sharing
Title 1: Norm Conflict Resolution in Stochastic Domains
Title 4: Synthesis of Programs from Multimodal Datasets
Title 3: Tensorized Projection for High-Dimensional Binary Embedding
Title 5: Video Summarization via Semantic Attended Networks
Title 6: TIMERS: Error-Bounded SVD Restart on Dynamic Networks
Title 7: Memory Management With Explicit Time in Resource-Bounded Agents
Title 8: Mitigating Overexposure in Viral Marketing
Title 9: Neural Link Prediction over Aligned Networks
Title 10: Dual Deep Neural Networks Cross-Modal Hashing
Total time: 5.01228 s
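As shown, the multiprocess distributed crawler is also much faster than the ordinary synchronous program, taking 5 seconds in this run, but it is about twice as slow as the coroutine version.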

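[Timing comparison]:
The crawl times for the 10 URLs in the example above are summarized in the table below.

CPU cores \ Implementation	Synchronous crawler	Multiprocess crawler	Asynchronous crawler
4 cores	13.7 s	5.0 s	2.4 s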
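
To read the table as speedups rather than raw times, the arithmetic can be written out as a short sketch. The timings are the ones reported in the table; the dictionary keys and the speedup helper are just illustrative names.

```python
# Timings in seconds, taken from the table above (single run on a 4-core machine).
timings = {'synchronous': 13.7, 'multiprocess': 5.0, 'async': 2.4}

def speedup(name, baseline='synchronous'):
    """How many times faster `name` is than the baseline."""
    return timings[baseline] / timings[name]

if __name__ == '__main__':
    for name, seconds in timings.items():
        print('%-12s %5.1f s  x%.1f' % (name, seconds, speedup(name)))
```

On these numbers the multiprocess crawler is about 2.7 times faster than the synchronous one and the asynchronous crawler about 5.7 times faster. These are single-run measurements, so the ratios should be treated as rough.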

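Increasing n in the multiprocess pool Pool(n) can speed up the crawler. The figure below shows the time taken (in seconds) as a function of the Pool() argument.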
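
The relationship between pool size and running time can be explored without hitting a real server. This is a rough sketch under stated assumptions: fake_fetch is a hypothetical stand-in for requests.get that simply waits 0.2 s, and timed_run is an illustrative helper, so the numbers it prints say nothing about the AAAI site itself.

```python
import time
from multiprocessing import Pool

def fake_fetch(i):
    """Hypothetical stand-in for a blocking request: waits 0.2 s, returns i."""
    time.sleep(0.2)
    return i

def timed_run(n_procs, n_tasks=8):
    """Run n_tasks simulated requests on a pool of n_procs processes."""
    start = time.perf_counter()
    with Pool(n_procs) as pool:
        results = pool.map(fake_fetch, range(n_tasks))  # blocks until all finish
    return results, time.perf_counter() - start

if __name__ == '__main__':
    for n in (1, 2, 4, 8):
        _, elapsed = timed_run(n)
        print('Pool(%d): %.2f s' % (n, elapsed))
```

Since the simulated work is waiting rather than computing, n can usefully exceed the number of CPU cores; with a real server the gain eventually flattens once the network or the server becomes the limit.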

If you thought that was the end, you would miss the best part :)

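4. Testing a crawler that combines async with multiprocessing
Parsing HTML also takes time, and neither aiohttp nor asyncio provides parsing. So we can request the pages asynchronously and parse the HTML with multiple processes; combining the two is more efficient.
[Requesting pages]: use coroutines.
[Parsing HTML]: use multiple processes.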
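
The two-stage idea can be sketched with the standard library alone. Everything here is an assumption for illustration: fake_fetch stands in for an aiohttp request, the HTML snippet and the regular expression stand in for the real page and the XPath query, and Pool(2) is an arbitrary size. One detail matters in practice: a worker process has its own copy of module-level lists, so parsed results are passed back as return values.

```python
import asyncio
import re
from multiprocessing import Pool

TITLE_RE = re.compile(rb'<h1 id="title">(.*?)</h1>')

async def fake_fetch(i):
    """Hypothetical stand-in for an aiohttp request; returns HTML bytes."""
    await asyncio.sleep(0.1)
    return b'<html><h1 id="title">Paper %d</h1></html>' % i

async def fetch_all(n):
    # Stage 1: all pages are requested concurrently by coroutines.
    return await asyncio.gather(*(fake_fetch(i) for i in range(n)))

def parse_title(html):
    """Runs in a worker process; the result travels back as the return value."""
    match = TITLE_RE.search(html)
    return match.group(1).decode('utf-8') if match else ''

if __name__ == '__main__':
    htmls = asyncio.run(fetch_all(4))
    with Pool(2) as pool:  # Stage 2: parsing is spread over processes
        titles = pool.map(parse_title, htmls)
    print(titles)
```

For pages this small, the cost of starting processes may outweigh the parsing time, so the combination pays off mainly when parsing is the heavy part.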

from multiprocessing import Pool
import time
from lxml import etree
import aiohttp
import asyncio

urls = [
    'https://aaai.org/ocs/index.php/AAAI/AAAI18/paper/viewPaper/16488',
    'https://aaai.org/ocs/index.php/AAAI/AAAI18/paper/viewPaper/16583',
    # remaining 8 URLs omitted...
]
htmls = []
titles = []

async def get_html(sem, url):
    """Request an AAAI page and store its HTML."""
    async with sem:  # the semaphore limits how many requests run at once
        # async with is an asynchronous context manager
        async with aiohttp.ClientSession() as session:  # open a session
            async with session.request('GET', url) as resp:  # send the request
                html = await resp.read()  # get bytes directly
                htmls.append(html)
                print('Fetched HTML asynchronously from %s.' % url)

async def main_get_html():
    """Coroutine caller: request all pages."""
    sem = asyncio.Semaphore(10)  # semaphore: caps concurrent coroutines so we don't crawl too fast
    tasks = [get_html(sem, url) for url in urls]  # collect all tasks in a list
    await asyncio.gather(*tasks)  # run the coroutines concurrently

def multi_parse_html(html, cnt):
    """Parse one HTML page in a worker process and return its title."""
    title = ''.join(etree.HTML(html).xpath('//*[@id="title"]/text()'))
    print('HTML %d parsed - title: %s' % (cnt, title))
    return title  # appending to titles here would only change the worker's copy

def main_parse_html():
    """Multiprocessing caller: parse all pages and collect the titles."""
    p = Pool(4)
    results = [p.apply_async(multi_parse_html, args=(html, i))
               for i, html in enumerate(htmls, start=1)]
    p.close()
    p.join()
    titles.extend(r.get() for r in results)  # gather the titles in the parent process


if __name__ == '__main__':
    start = time.time()
    asyncio.run(main_get_html())  # request the pages
    main_parse_html()             # parse the HTML
    print('Total time: %.5f s' % (time.time() - start))

The output is as follows:

Fetched HTML asynchronously from https://aaai.org/ocs/index.php/AAAI/AAAI18/paper/viewPaper/16380.
Fetched HTML asynchronously from https://aaai.org/ocs/index.php/AAAI/AAAI18/paper/viewPaper/16674.
Fetched HTML asynchronously from https://aaai.org/ocs/index.php/AAAI/AAAI18/paper/viewPaper/16583.
Fetched HTML asynchronously from https://aaai.org/ocs/index.php/AAAI/AAAI18/paper/viewPaper/16911.
Fetched HTML asynchronously from https://aaai.org/ocs/index.php/AAAI/AAAI18/paper/viewPaper/17343.
Fetched HTML asynchronously from https://aaai.org/ocs/index.php/AAAI/AAAI18/paper/viewPaper/16449.
Fetched HTML asynchronously from https://aaai.org/ocs/index.php/AAAI/AAAI18/paper/viewPaper/16488.
Fetched HTML asynchronously from https://aaai.org/ocs/index.php/AAAI/AAAI18/paper/viewPaper/16659.
Fetched HTML asynchronously from https://aaai.org/ocs/index.php/AAAI/AAAI18/paper/viewPaper/16581.
Fetched HTML asynchronously from https://aaai.org/ocs/index.php/AAAI/AAAI18/paper/viewPaper/16112.
HTML 3 parsed - title: Algorithms for Trip-Vehicle Assignment in Ride-Sharing
HTML 1 parsed - title: Tensorized Projection for High-Dimensional Binary Embedding
HTML 2 parsed - title: TIMERS: Error-Bounded SVD Restart on Dynamic Networks
HTML 4 parsed - title: Synthesis of Programs from Multimodal Datasets
HTML 6 parsed - title: Dual Deep Neural Networks Cross-Modal Hashing
HTML 7 parsed - title: Norm Conflict Resolution in Stochastic Domains
HTML 8 parsed - title: Neural Link Prediction over Aligned Networks
HTML 5 parsed - title: Mitigating Overexposure in Viral Marketing
HTML 9 parsed - title: Video Summarization via Semantic Attended Networks
HTML 10 parsed - title: Memory Management With Explicit Time in Resource-Bounded Agents

