Python has native support for asynchronous programming through the asyncio standard library. The async IO model eliminates most of the time otherwise spent waiting on IO, which makes it a good fit for crawler workloads.
1. Basic usage
import time
import asyncio
import aiohttp  # fetch page content asynchronously

urls = ['https://www.baidu.com'] * 400

async def get_html(url, sem):
    async with sem:  # acquire the semaphore before making a request
        async with aiohttp.ClientSession() as session:
            async with session.get(url) as resp:
                html = await resp.text()

def main():
    loop = asyncio.get_event_loop()  # get the event loop
    sem = asyncio.Semaphore(10)  # cap the number of concurrent requests
    tasks = [get_html(url, sem) for url in urls]  # collect all the coroutines in a list
    loop.run_until_complete(asyncio.gather(*tasks))  # run the coroutines to completion
    loop.close()  # close the event loop

if __name__ == '__main__':
    start = time.time()
    main()
    print(time.time() - start)  # 5.03 s
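On Python 3.7+ the event-loop boilerplate is usually replaced by asyncio.run(), and reusing a single ClientSession for all requests avoids rebuilding a connection pool per URL. A minimal sketch of the same crawl under those assumptions (fetch and crawl are illustrative names, not from the original):

    import asyncio
    import aiohttp

    async def fetch(url, session, sem):
        async with sem:  # same concurrency cap as above
            async with session.get(url) as resp:
                return await resp.text()

    async def crawl(urls):
        sem = asyncio.Semaphore(10)
        async with aiohttp.ClientSession() as session:  # one shared session for all requests
            return await asyncio.gather(*(fetch(u, session, sem) for u in urls))

    # pages = asyncio.run(crawl(['https://www.baidu.com'] * 400))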
2. Multiprocessing + coroutines
To push the crawl speed further, and given that Python's global interpreter lock (GIL) limits what multithreading can gain, a multiprocessing + coroutine scheme can be used instead:
import time
import asyncio
import aiohttp  # fetch page content asynchronously
from multiprocessing import Pool

all_urls = ['https://www.baidu.com'] * 400

async def get_html(url, sem):
    async with sem:  # acquire the semaphore before making a request
        async with aiohttp.ClientSession() as session:
            async with session.get(url) as resp:
                html = await resp.text()

def main(urls):
    loop = asyncio.get_event_loop()  # get this process's event loop
    sem = asyncio.Semaphore(10)  # cap concurrent requests per process
    tasks = [get_html(url, sem) for url in urls]  # collect all the coroutines in a list
    loop.run_until_complete(asyncio.gather(*tasks))  # run the coroutines to completion
    loop.close()  # close the event loop

if __name__ == '__main__':
    start = time.time()
    p = Pool(4)  # four worker processes, each crawling 100 URLs
    for i in range(4):
        p.apply_async(main, args=(all_urls[i*100:(i+1)*100],))
    p.close()
    p.join()
    print(time.time() - start)  # 2.87 s
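One caveat: Pool.apply_async swallows any exception raised inside the worker unless the returned AsyncResult is checked, so a failing main() would go unnoticed and the timing would look deceptively good. A minimal sketch of surfacing worker errors, reusing main and all_urls from the snippet above:

    from multiprocessing import Pool

    if __name__ == '__main__':
        p = Pool(4)
        # keep the AsyncResult handles instead of discarding them
        results = [p.apply_async(main, args=(all_urls[i*100:(i+1)*100],))
                   for i in range(4)]
        p.close()
        p.join()
        for r in results:
            r.get()  # re-raises any exception that occurred in that worker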
As the timings show, multiprocessing speeds up the crawl further; the exact speedup depends on the machine's CPU configuration.