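urllib is the basic HTTP library in Python's standard library; requests is a higher-level library built on top of urllib3.
urllib is less convenient to use than requests. For example, the previous article on urllib left out proxy settings and cookie handling because they felt tedious, and sending a POST request with urllib is also cumbersome.
In short, requests is a simple, easy-to-use HTTP library implemented in Python.
For future crawler work, requests is recommended over urllib.
Usage: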
#!/usr/bin/env python
# -*- coding:utf-8 -*-
import requests

response = requests.get('http://www.baidu.com')
print(type(response))
print(response.status_code)
print(type(response.text))
print(response.text)
print(response.cookies)
Output:
<class 'requests.models.Response'>
200
<class 'str'>
<!DOCTYPE html>
<!--STATUS OK--><html> (omitted) </html>
<RequestsCookieJar[<Cookie BDORZ=27315 for .baidu.com/>]>
As the small program above shows, response.text gives you str data directly, whereas in urllib the result of urllib.request.urlopen(...).read() is bytes and still has to be decoded, which is tedious. Likewise, response.cookies gives you the cookies directly, without the extra work urllib requires.
Request methods
import requests

requests.post('http://www.baidu.com')
requests.put('http://www.baidu.com')
requests.delete('http://www.baidu.com')
requests.head('http://www.baidu.com')
requests.options('http://www.baidu.com')
Basic GET request:
Use http://httpbin.org/get to test a GET request:
import requests
response = requests.get('http://httpbin.org/get')
print(response.text)
Output:
{"args":{},"headers":{"Accept":"*/*","Accept-Encoding":"gzip, deflate","Connection":"close","Host":"httpbin.org","User-Agent":"python-requests/2.18.4"},"origin":"113.71.243.133","url":"http://httpbin.org/get"}
GET request with parameters:
import requests

data = {'name': 'geme', 'age': '22'}
response = requests.get('http://httpbin.org/get', params=data)
print(response.text)
Output:
{"args":{"age":"22","name":"geme"},"headers":{"Accept":"*/*","Accept-Encoding":"gzip, deflate","Connection":"close","Host":"httpbin.org","User-Agent":"python-requests/2.18.4"},"origin":"113.71.243.133","url":"http://httpbin.org/get?age=22&name=geme"}
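To see what params does without contacting the server, the request can be prepared but not sent. This is a minimal sketch; it relies only on requests.Request and its prepare() method, and the order of the query parameters may differ from the httpbin output above depending on the Python version.

```python
import requests

# Build the request without sending it, so no network access is needed.
data = {'name': 'geme', 'age': '22'}
prepared = requests.Request('GET', 'http://httpbin.org/get', params=data).prepare()

# The params dict has been URL-encoded into the query string.
print(prepared.url)
```

The printed URL is the same one httpbin echoes back in its "url" field.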
Parsing JSON
Here response.text is actually a JSON-formatted string. It can be parsed directly with response.json(), and the result is exactly the same as what the json module's loads function returns.
import requests
import json

response = requests.get('http://httpbin.org/get')
print(type(response.text))
print(response.json())
print(json.loads(response.text))
print(type(response.json()))
Output:
<class 'str'>
{'headers': {'Connection': 'close', 'User-Agent': 'python-requests/2.18.4', 'Accept': '*/*', 'Accept-Encoding': 'gzip, deflate', 'Host': 'httpbin.org'}, 'url': 'http://httpbin.org/get', 'args': {}, 'origin': '113.71.243.133'}
{'headers': {'Connection': 'close', 'User-Agent': 'python-requests/2.18.4', 'Accept': '*/*', 'Accept-Encoding': 'gzip, deflate', 'Host': 'httpbin.org'}, 'url': 'http://httpbin.org/get', 'args': {}, 'origin': '113.71.243.133'}
<class 'dict'>
This method is commonly used when analyzing Ajax requests.
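When an Ajax endpoint returns something other than JSON (an error page, for instance), response.json() raises a ValueError or one of its subclasses. A small guard such as the one below can help; json_or_none is a hypothetical helper name, not part of requests.

```python
def json_or_none(response):
    """Return the parsed JSON body, or None if the body is not valid JSON."""
    try:
        return response.json()
    except ValueError:
        # response.json() raises ValueError (or a subclass) on invalid JSON
        return None
```

It would be called as json_or_none(requests.get(url)), leaving the caller to decide what to do with None.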
Fetching binary data
Binary data is what you typically work with when downloading images or videos.
import requests

response = requests.get('https://github.com/favicon.ico')
print(type(response.text), type(response.content))
print(response.text)
print(response.content)
# response.content returns the binary content
Saving files was covered in the article on crawler principles, so it is not repeated here.
Adding headers
import requests

response = requests.get('https://www.zhihu.com/explore')
print(response.text)
Output:
<html>
<head><title>400 Bad Request</title></head>
<body bgcolor="white">
<center><h1>400 Bad Request</h1></center>
<hr><center>openresty</center>
</body>
</html>
This URL was requested without headers and the server answered with a 400 status code. Try again with headers added:
import requests

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.181 Safari/537.36'}
response = requests.get('https://www.zhihu.com/explore', headers=headers)
print(response.text)
With these headers added, the request goes through normally.
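One plausible explanation for the 400 (an assumption, since the server does not say why) is that the site rejects the default User-Agent that requests sends. The sketch below prepares two requests without sending them and compares the User-Agent header in each; the shortened browser string is only a placeholder.

```python
import requests

session = requests.Session()
url = 'https://www.zhihu.com/explore'

# Prepare (but do not send) a request with no custom headers.
default_req = session.prepare_request(requests.Request('GET', url))
print(default_req.headers['User-Agent'])  # looks like python-requests/<version>

# Prepare the same request with a browser-style User-Agent (placeholder value).
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}
custom_req = session.prepare_request(requests.Request('GET', url, headers=headers))
print(custom_req.headers['User-Agent'])
```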
Basic POST request
import requests

data = {'name': 'haha', 'age': '12'}
response = requests.post('http://httpbin.org/post', data=data)
print(response.text)
Output:
{"args":{},"data":"","files":{},"form":{"age":"12","name":"haha"},"headers":{"Accept":"*/*","Accept-Encoding":"gzip, deflate","Connection":"close","Content-Length":"16","Content-Type":"application/x-www-form-urlencoded","Host":"httpbin.org","User-Agent":"python-requests/2.18.4"},"json":null,"origin":"113.71.243.133","url":"http://httpbin.org/post"}
Unlike urllib, no manual encoding step is needed, which is much more convenient, and headers can still be added to a POST request.
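The encoding that requests performs for data can be inspected by preparing the POST request without sending it. This is a sketch only; the User-Agent value is made up for illustration.

```python
import requests

data = {'name': 'haha', 'age': '12'}
headers = {'User-Agent': 'my-crawler/0.1'}  # hypothetical value, for illustration only
prepared = requests.Request('POST', 'http://httpbin.org/post',
                            data=data, headers=headers).prepare()

print(prepared.body)                       # the dict, form-encoded into one string
print(prepared.headers['Content-Type'])    # set automatically for form data
print(prepared.headers['Content-Length'])  # computed from the encoded body
print(prepared.headers['User-Agent'])      # the custom header is kept
```

The Content-Type and Content-Length values correspond to the ones httpbin echoes back in the output above.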
Response
Commonly used response attributes:
response.status_code
response.headers
response.cookies
response.url
response.history
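Checking the status code:
Each status code has a corresponding name in requests.codes:
import requests

response = requests.get('http://httpbin.org/get.html')
# Exit unless the status is 404; otherwise report it
exit() if not response.status_code == requests.codes.not_found else print('404 not found')
# Or replace that line with:
# exit() if not response.status_code == 404 else print('404 not found')
# because each named status corresponds to a number
Output: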
404 not found
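The mapping between names and numbers can be checked locally. In the sketch below, describe is a hypothetical helper written for this example; only requests.codes comes from the library.

```python
import requests

# Named constants map to the numeric status codes.
print(requests.codes.ok)
print(requests.codes.not_found)


def describe(status_code):
    """Hypothetical helper: turn a numeric status code into a short label."""
    if status_code == requests.codes.ok:
        return 'ok'
    if status_code == requests.codes.not_found:
        return '404 not found'
    return 'other: %d' % status_code


print(describe(404))
```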
Advanced usage of requests
File upload:
import requests

url = 'http://httpbin.org/post'
# Open the file in binary mode; the with block closes it after the upload
with open('tt.jpeg', 'rb') as f:
    response = requests.post(url, files={'file': f})
print(response.text)
Getting cookies
import requests

response = requests.get('https://www.baidu.com')
print(response.cookies)
print(type(response.cookies))
The cookies come back as a RequestsCookieJar,
which behaves like a dictionary. Print its keys and values:
import requests

response = requests.get('https://www.baidu.com')
print(response.cookies)
# print(type(response.cookies))
for key, value in response.cookies.items():
    print(key + '=' + value)
Output:
<RequestsCookieJar[<Cookie BDORZ=27315 for .baidu.com/>]>
BDORZ=27315
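The dictionary-like behaviour can be tried without a network request by filling a RequestsCookieJar by hand. The cookie values below simply mirror the output above; this is an illustrative sketch.

```python
import requests
from requests.cookies import RequestsCookieJar

jar = RequestsCookieJar()
jar.set('BDORZ', '27315', domain='.baidu.com', path='/')

print(jar['BDORZ'])                             # index by cookie name, like a dict
print(jar.get('missing', 'n/a'))                # get() accepts a default value
print(requests.utils.dict_from_cookiejar(jar))  # convert to a plain dict
```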
Session persistence:
import requests

requests.get('http://httpbin.org/cookies/set/number/12345678')  # line 1
response = requests.get('http://httpbin.org/cookies')  # line 2
print(response.text)
Output: {"cookies":{}}
http://httpbin.org/cookies has a set endpoint that sets a cookie.
A second get is then used to read that cookie back.
But the output is empty, because the two get calls on line 1 and line 2 are independent, as if they were made in two different browsers. Using a Session keeps the cookie between requests:
import requests

s = requests.Session()
s.get('http://httpbin.org/cookies/set/number/12345678')
response = s.get('http://httpbin.org/cookies')
print(response.text)
Output: {"cookies":{"number":"12345678"}}
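The difference can be made visible offline by comparing the headers a Session would send with those of a standalone request. In this sketch the cookie is placed in the session by hand, as a stand-in for what the /cookies/set call would store; nothing is actually sent.

```python
import requests

url = 'http://httpbin.org/cookies'

s = requests.Session()
# Simulate the cookie that /cookies/set/number/12345678 would leave in the session.
s.cookies.set('number', '12345678', domain='httpbin.org', path='/')

with_session = s.prepare_request(requests.Request('GET', url))
standalone = requests.Request('GET', url).prepare()

print(with_session.headers.get('Cookie'))  # the session attaches its stored cookie
print(standalone.headers.get('Cookie'))    # a bare request has no cookie to send
```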
Certificate verification:
import requests

response = requests.get('https://www.12306.cn')
print(response.status_code)
Output:
......
requests.exceptions.SSLError: HTTPSConnectionPool(host='www.12306.cn', port=443): Max retries exceeded with url: / (Caused by SSLError(CertificateError("hostname 'www.12306.cn' doesn't match either of 'webssl.chinanetcenter.com', 'i.l.inmobicdn.net', '*.fn-mart.com', 'www.1zhe.com', 'appcdn.liwusj.com', 'static.liwusj.com', 'download.liwusj.com', 'ptlogin.liwusj.com', 'app.liwusj.com', '*.pinganfang.com', '*.anhouse.com', 'dl.jphbpk.gxpan.cn', 'dl.givingtales.gxpan.cn', 'dl.toyblast.gxpan.cn', 'dl.sds.gxpan.cn', 'yxhhd.5054399.com', 'download.ctrip.com', 'mh.tiancity.com', 'yxhhd2.5054399.com', 'app.4399.cn', 'i.4399.cn', 'm.4399.cn', 'a.4399.cn', 'newsimg.5054399.com', 'cdn.hxjyios.iwan4399.com', 'ios.hxjy.iwan4399.com', 'gjzx.gjzq.com.cn', 'f.3000test.com', 'tj.img4399.com', 'vedio.5054399.com', '*.zhe800.com', '*.qiyipic.com', '*.vxinyou.com', '*.gdjh.vxinyou.com', '*.3000.com', 'pay.game2.cn', 'static1.j.cn', 'static2.j.cn', 'static3.j.cn', 'static4.j.cn', 'video1.j.cn', 'video2.j.cn', 'video3.j.cn', 'online.j.cn', 'playback.live.j.cn', 'audio1.guang.j.cn', 'audio2.guang.j.cn', 'audio3.guang.j.cn', 'img1.guang.j.cn', 'img2.guang.j.cn', 'img3.guang.j.cn', 'img4.guang.j.cn', 'img5.guang.j.cn', 'img6.guang.j.cn', '*.4399youpai.com', 'v.3304399.net', 'w.tancdn.com', '*.3000api.com', 'static11.j.cn', '*.kuyinyun.com', '*.kuyin123.com', '*.diyring.cc', '3000test.com', '*.3000test.com', 'hdimg.5054399.com', 'www.3387.com', 'bbs.4399.cn', '*.cankaoxiaoxi.com', '*.service.kugou.com', 'test.macauslot.com', 'testm.macauslot.com', 'testtran.macauslot.com', 'xiuxiu.huodong.meitu.com', '*.meitu.com', '*.meitudata.com', '*.wheetalk.com', '*.shanliaoapp.com', 'xiuxiu.web.meitu.com', 'api.account.meitu.com', 'open.web.meitu.com', 'id.api.meitu.com', 'api.makeup.meitu.com', 'im.live.meipai.com', '*.meipai.com', 'm.macauslot.com', 'www.macauslot.com', 'web.macauslot.com',
'translation.macauslot.com', 'img1.homekoocdn.com', 'cdn.homekoocdn.com', 'cdn1.homekoocdn.com', 'cdn2.homekoocdn.com', 'cdn3.homekoocdn.com', 'cdn4.homekoocdn.com', 'img.homekoocdn.com', 'img2.homekoocdn.com', 'img3.homekoocdn.com', 'img4.homekoocdn.com', '*.macauslot.com', '*.samsungapps.com', 'auto.tancdn.com', '*.winbo.top', 'static.bst.meitu.com', 'api.xiuxiu.meitu.com', 'api.photo.meituyun.com', 'h5.selfiecity.meitu.com', 'api.selfiecity.meitu.com', 'h5.beautymaster.meiyan.com', 'api.beautymaster.meiyan.com', 'www.yawenb.com', 'm.yawenb.com', 'www.biqugg.com', 'www.dawenxue.net', 'cpg.meitubase.com', 'www.qushuba.com', 'www.ranwena.com', 'www.u8xsw.com', '*.4399sy.com', 'ms.qaqact.cn', 'ms.awqsaged.cn', 'fanxing2.kugou.com', 'fanxing.kugou.com', 'sso.56.com', 'upload.qf.56.com', 'sso.qianfan.tv', 'cdn.danmu.56.com', 'www-ppd.hermes.cn', 'www-uat.hermes.cn', 'www-ts2.hermes.cn', 'www-tst.hermes.cn', '*.syyx.com', 'img.wgeqr.cn', 'img.wgewa.cn', 'img.09mk.cn', 'img.85nh.cn', '*.zhuoquapp.com', 'img.dtmpekda8.cn', 'img.etmpekda6.cn'",),))
import requests

# verify=False skips certificate verification
response = requests.get('https://www.12306.cn', verify=False)
print(response.status_code)
Output:
D:\python-3.5.4.amd64\python.exe E:/PythonProject/Test1/爬蟲/requests模塊.py
D:\python-3.5.4.amd64\lib\site-packages\urllib3\connectionpool.py:858: InsecureRequestWarning: Unverified HTTPS request is being made. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings
InsecureRequestWarning)
200
......
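If verify=False is kept, the InsecureRequestWarning shown above can be silenced through urllib3, which requests depends on. This is a sketch of one option, not a recommendation to skip verification in general; verify also accepts a path to a CA bundle file when the proper certificate chain is available.

```python
import warnings

import urllib3

# Ignore only the warning that verify=False triggers.
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

# Confirm that an "ignore" filter is now registered for this warning category.
ignored = [f for f in warnings.filters
           if f[0] == 'ignore' and f[2] is urllib3.exceptions.InsecureRequestWarning]
print(len(ignored) > 0)
```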
Proxy settings
When the proxy type is http or https:
import requests

# Declare a dict that specifies the proxies and pass it with the request;
# this is much easier than with urllib
proxies = {
    'http': 'http://127.0.0.1:9743',
    'https': 'https://127.0.0.1:9743'
}
# If the proxy requires a password, declare it like this instead:
# proxies = {
#     'http': 'http://user:password@127.0.0.1:9743',
# }
response = requests.get('https://www.taobao.com', proxies=proxies)
print(response.status_code)
When the proxy type is socks, first install the extra dependency: pip3 install requests[socks]
import requests

proxies = {
    'http': 'socks5://127.0.0.1:9743',
    'https': 'socks5://127.0.0.1:9743'
}
response = requests.get('https://www.taobao.com', proxies=proxies)
print(response.status_code)
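Since the three proxy dicts above differ only in scheme and credentials, they could be produced by one small function. build_proxies is a hypothetical helper written for this example, and it assumes the same proxy URL is used for both http and https traffic.

```python
from urllib.parse import quote


def build_proxies(host, port, scheme='http', user=None, password=None):
    """Hypothetical helper: build a dict suitable for requests.get(..., proxies=...)."""
    auth = ''
    if user is not None and password is not None:
        # Percent-encode credentials so characters like '@' or ':' do not break the URL.
        auth = '%s:%s@' % (quote(user, safe=''), quote(password, safe=''))
    proxy_url = '%s://%s%s:%d' % (scheme, auth, host, port)
    return {'http': proxy_url, 'https': proxy_url}


print(build_proxies('127.0.0.1', 9743))
print(build_proxies('127.0.0.1', 9743, scheme='socks5', user='user', password='p@ss'))
```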
Timeout settings:
import requests

# timeout is given in seconds
response = requests.get('http://www.baidu.com', timeout=1)
Authentication settings
Some sites require login authentication; in that case, use the auth parameter:
import requests
from requests.auth import HTTPBasicAuth

r = requests.get('http://120.27.34.24:900', auth=HTTPBasicAuth('user', 'password'))
# r = requests.get('http://120.27.34.24:900', auth=('user', 'password')) also works
print(r.status_code)
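That the tuple shorthand and HTTPBasicAuth produce the same request can be checked by preparing both without sending them. A minimal sketch, reusing the placeholder credentials from the example above:

```python
import base64

import requests
from requests.auth import HTTPBasicAuth

url = 'http://120.27.34.24:900'
explicit = requests.Request('GET', url, auth=HTTPBasicAuth('user', 'password')).prepare()
shorthand = requests.Request('GET', url, auth=('user', 'password')).prepare()

# Basic auth is the base64 encoding of "user:password" in the Authorization header.
expected = 'Basic ' + base64.b64encode(b'user:password').decode('ascii')
print(explicit.headers['Authorization'] == shorthand.headers['Authorization'] == expected)
```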
Exception handling:
import requests
from requests.exceptions import ReadTimeout, HTTPError, RequestException

try:
    response = requests.get('http://www.baidu.com', timeout=1)
    print(response.status_code)
except ReadTimeout:
    print('timeout')
except HTTPError:
    print('http error')
except RequestException:
    print('error')
The example imports only three exceptions. The official documentation lists others (see the figure above); they can be imported and handled in the same way as the three shown here.
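The order of the except clauses matters because the exceptions form a hierarchy with RequestException at the top. The sketch below checks that hierarchy and also shows that HTTPError comes from raise_for_status() rather than from requests.get() itself. The Response object is built by hand purely for illustration, to avoid network access.

```python
import requests
from requests.exceptions import HTTPError, ReadTimeout, RequestException, Timeout

# Every exception here derives from RequestException, so that clause belongs last.
for exc in (ReadTimeout, Timeout, HTTPError):
    print(exc.__name__, issubclass(exc, RequestException))

# A hand-built response with an error status, for illustration only.
fake = requests.Response()
fake.status_code = 404

caught = False
try:
    fake.raise_for_status()  # raises HTTPError for 4xx and 5xx status codes
except HTTPError as err:
    caught = True
    print('http error:', err)
```

In the example above, the HTTPError branch would only be reached if response.raise_for_status() were called inside the try block.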