aiohttp is an asynchronous HTTP module for Python 3, split into a server side and a client side. Liao Xuefeng's Python 3 tutorial covers the server side; this article focuses on the client side, which is what you use to write crawlers. Writing a crawler with async coroutines can noticeably improve the program's throughput.
1. Installation
pip install aiohttp
2. Making a single request
import aiohttp
import asyncio

async def fetch(session, url):
    async with session.get(url) as response:
        return await response.text()

async def main(url):
    async with aiohttp.ClientSession() as session:
        html = await fetch(session, url)
        print(html)

url = 'http://junyiseo.com'
loop = asyncio.get_event_loop()
loop.run_until_complete(main(url))
3. Requesting multiple URLs
import aiohttp
import asyncio

async def fetch(session, url):
    async with session.get(url) as response:
        return await response.text()

async def main(url):
    async with aiohttp.ClientSession() as session:
        html = await fetch(session, url)
        print(html)

loop = asyncio.get_event_loop()
# Build several request coroutines
url = "http://junyiseo.com"
tasks = [main(url), main(url)]
loop.run_until_complete(asyncio.wait(tasks))
loop.close()
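Note that newer Python versions prefer asyncio.gather over asyncio.wait for bare coroutines, and gather also returns results in the order you passed the tasks in. The fan-out pattern can be sketched without any network access; fake_fetch below is a stand-in for a real aiohttp request, not part of the library:

```python
import asyncio

async def fake_fetch(url):
    # Stand-in for an aiohttp request; it just echoes the URL.
    await asyncio.sleep(0)
    return f"fetched {url}"

async def main():
    urls = ["http://junyiseo.com/a", "http://junyiseo.com/b"]
    # gather runs the coroutines concurrently and preserves input order
    results = await asyncio.gather(*(fake_fetch(u) for u in urls))
    return results

results = asyncio.run(main())
print(results)  # results arrive in the same order as urls
```

In a real crawler you would pass your main(url) coroutines to gather in the same way.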
4. Other HTTP methods
In the code above we created a ClientSession object named session, then used its get method to obtain a ClientResponse object. get takes one required parameter, url: the HTTP URL whose content we want. That is all it takes to perform an asynchronous GET request from a coroutine.
aiohttp supports the other HTTP methods as well:
session.post('http://httpbin.org/post', data=b'data')
session.put('http://httpbin.org/put', data=b'data')
session.delete('http://httpbin.org/delete')
session.head('http://httpbin.org/get')
session.options('http://httpbin.org/get')
session.patch('http://httpbin.org/patch', data=b'data')
5. Passing parameters with a request
GET with query parameters:
params = {'key1': 'value1', 'key2': 'value2'}
async with session.get('http://httpbin.org/get',
                       params=params) as resp:
    # Since Python 3.7, dicts keep insertion order, so key1 comes first
    expect = 'http://httpbin.org/get?key1=value1&key2=value2'
    assert str(resp.url) == expect
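The query string aiohttp builds for you is ordinary URL encoding. As a quick network-free illustration (using the stdlib rather than aiohttp), urllib.parse.urlencode produces the same key=value pairs:

```python
from urllib.parse import urlencode

params = {'key1': 'value1', 'key2': 'value2'}
# Dicts preserve insertion order since Python 3.7,
# so the encoded query string is deterministic.
query = urlencode(params)
print(query)  # key1=value1&key2=value2
```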
POST with form data:
payload = {'key1': 'value1', 'key2': 'value2'}
async with session.post('http://httpbin.org/post',
                        data=payload) as resp:
    print(await resp.text())
6. Reading the response
resp.status is the HTTP status code, and resp.text() returns the response body as text:
async with session.get('https://api.github.com/events') as resp:
    print(resp.status)
    print(await resp.text())
Bodies compressed with gzip or deflate are decoded for you automatically.
7. Sending and receiving JSON
async with aiohttp.ClientSession() as session:
    async with session.post(url, json={'test': 'object'}) as resp:
        print(resp.status)
Parsing a JSON response:
async with session.get('https://api.github.com/events') as resp:
    print(await resp.json())
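Conceptually, resp.json() is close to fetching the body as text and feeding it to the stdlib json module (a sketch of the idea, not aiohttp's actual implementation):

```python
import json

# What a JSON API body might look like as raw text
body = '{"type": "PushEvent", "public": true}'
data = json.loads(body)
print(data["type"])  # PushEvent
```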
8. Reading the body as a byte stream (useful for downloads)
async with session.get('https://api.github.com/events') as resp:
    await resp.content.read(10)  # read the first 10 bytes
Saving a download to a file:
with open(filename, 'wb') as fd:
    while True:
        chunk = await resp.content.read(chunk_size)
        if not chunk:
            break
        fd.write(chunk)
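The read/write loop above is a generic streaming-copy pattern. It can be exercised without any network by swapping the response body for any file-like object; this sketch uses io.BytesIO and a synchronous read, purely to show the loop's shape:

```python
import io

def stream_copy(src, dst, chunk_size=4):
    # Same loop shape as the aiohttp download code, but synchronous:
    # read a chunk, stop on an empty read, otherwise write it out.
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break
        dst.write(chunk)

src = io.BytesIO(b"0123456789abcdef")
dst = io.BytesIO()
stream_copy(src, dst)
print(dst.getvalue())  # b'0123456789abcdef'
```

Streaming in chunks keeps memory use flat no matter how large the download is.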
9. Uploading files
url = 'http://httpbin.org/post'
files = {'file': open('report.xls', 'rb')}
await session.post(url, data=files)
You can also set the filename and content type explicitly with FormData:
url = 'http://httpbin.org/post'
data = aiohttp.FormData()
data.add_field('file',
               open('report.xls', 'rb'),
               filename='report.xls',
               content_type='application/vnd.ms-excel')
await session.post(url, data=data)
10. Timeouts
By default, I/O operations time out after 5 minutes. You can override this with the timeout parameter; timeout=None or timeout=0 disables the timeout check entirely, i.e. no time limit.
async with session.get('https://github.com', timeout=60) as r: ...
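Under the hood a client timeout behaves like asyncio's own cancellation machinery. The behaviour can be sketched with asyncio.wait_for and a deliberately slow coroutine, no network involved (slow is a placeholder, not an aiohttp call):

```python
import asyncio

async def slow():
    # Pretends to be a request that takes far too long
    await asyncio.sleep(10)
    return "done"

async def main():
    try:
        # Give the "request" only 10 ms before cancelling it
        return await asyncio.wait_for(slow(), timeout=0.01)
    except asyncio.TimeoutError:
        return "timed out"

result = asyncio.run(main())
print(result)  # timed out
```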
11. Custom request headers
url = 'http://example.com/image'
payload = (b'GIF89a\x01\x00\x01\x00\x00\xff\x00,\x00\x00'
           b'\x00\x00\x01\x00\x01\x00\x00\x02\x00;')
headers = {'content-type': 'image/gif'}

await session.post(url, data=payload, headers=headers)
Setting default headers on the session:
headers = {"Authorization": "Basic bG9naW46cGFzcw=="}
async with aiohttp.ClientSession(headers=headers) as session:
    async with session.get("http://httpbin.org/headers") as r:
        json_body = await r.json()
        assert json_body['headers']['Authorization'] == \
            'Basic bG9naW46cGFzcw=='
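The Authorization value in that example is simply the credentials "login:pass" base64-encoded, which you can verify with the stdlib (aiohttp.BasicAuth performs the same encoding for you):

```python
import base64

credentials = b"login:pass"
# HTTP Basic auth is base64("user:password") prefixed with "Basic "
token = base64.b64encode(credentials).decode("ascii")
header = f"Basic {token}"
print(header)  # Basic bG9naW46cGFzcw==
```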
12. Custom cookies
url = 'http://httpbin.org/cookies'
cookies = {'cookies_are': 'working'}
async with aiohttp.ClientSession(cookies=cookies) as session:
    async with session.get(url) as resp:
        assert await resp.json() == {
            "cookies": {"cookies_are": "working"}}
Cookies are shared across requests made with the same session:
async with aiohttp.ClientSession() as session:
    await session.get(
        'http://httpbin.org/cookies/set?my_cookie=my_value')
    filtered = session.cookie_jar.filter_cookies(
        'http://httpbin.org')
    assert filtered['my_cookie'].value == 'my_value'
    async with session.get('http://httpbin.org/cookies') as r:
        json_body = await r.json()
        assert json_body['cookies']['my_cookie'] == 'my_value'
13. Limiting the number of simultaneous connections
limit defaults to 100; limit=0 means no limit:
conn = aiohttp.TCPConnector(limit=30)
session = aiohttp.ClientSession(connector=conn)
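TCPConnector caps simultaneous connections per session. The same kind of cap can also be imposed at the task level with asyncio.Semaphore; this network-free sketch (limited_fetch is a placeholder for a real request) counts concurrency to show the limit holds:

```python
import asyncio

async def limited_fetch(sem, state):
    async with sem:
        # Track how many coroutines are inside the semaphore at once
        state["active"] += 1
        state["peak"] = max(state["peak"], state["active"])
        await asyncio.sleep(0.01)  # stand-in for a real request
        state["active"] -= 1

async def main():
    sem = asyncio.Semaphore(3)          # at most 3 "requests" at a time
    state = {"active": 0, "peak": 0}
    await asyncio.gather(*(limited_fetch(sem, state) for _ in range(10)))
    return state["peak"]

peak = asyncio.run(main())
print(peak)  # 3
```

A semaphore limits in-flight coroutines regardless of how the connections are pooled, so it composes well with the connector limit.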
14. SSL and certificate verification
Some requests check the server's TLS certificate; you can pass ssl=False to disable verification:
r = await session.get('https://example.com', ssl=False)
Or supply your own CA certificate:
import ssl

sslcontext = ssl.create_default_context(
    cafile='/path/to/ca-bundle.crt')
r = await session.get('https://example.com', ssl=sslcontext)
15. Requests through a proxy
async with aiohttp.ClientSession() as session:
    async with session.get("http://python.org",
                           proxy="http://proxy.com") as resp:
        print(resp.status)
Proxy authentication:
async with aiohttp.ClientSession() as session:
    proxy_auth = aiohttp.BasicAuth('user', 'pass')
    async with session.get("http://python.org",
                           proxy="http://proxy.com",
                           proxy_auth=proxy_auth) as resp:
        print(resp.status)
Or put the credentials directly in the proxy URL:
session.get("http://python.org", proxy="http://user:pass@some.proxy.com")
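Credentials embedded this way are standard URL userinfo fields; urllib.parse shows how they split out (the proxy host is the placeholder from the example above):

```python
from urllib.parse import urlsplit

proxy = "http://user:pass@some.proxy.com"
parts = urlsplit(proxy)
# username/password come from the userinfo part before the '@'
print(parts.username, parts.password, parts.hostname)
```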
16. Shutting down gracefully
When no SSL is involved, add await asyncio.sleep(0) before closing the loop so the underlying connections can finish closing:
async def read_website():
    async with aiohttp.ClientSession() as session:
        async with session.get('http://example.org/') as resp:
            await resp.read()

loop = asyncio.get_event_loop()
loop.run_until_complete(read_website())
# Zero-sleep to allow underlying connections to close
loop.run_until_complete(asyncio.sleep(0))
loop.close()
For SSL connections, wait a short moment before closing instead:
loop.run_until_complete(asyncio.sleep(0.250))
loop.close()
17. Summary
This article is adapted from the official documentation; if you spot a problem, please leave a comment.
Official documentation:
http://aiohttp.readthedocs.io/en/stable/