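requests is a more powerful library built to make crawling more convenient. With it, handling cookies, login authentication, proxy settings, and similar tasks is no longer a hassle.
I. Installing the requests module (run in a cmd window)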
pip3 install requests
II. Basic usage of requests
import requests

response = requests.get("https://www.baidu.com/")
print(type(response))        # <class 'requests.models.Response'>, the response type
print(response.status_code)  # 200, the status code
print(response.text)         # page source as a str
print(response.content)      # page source as raw bytes
print(response.cookies)      # cookies set by the page, a RequestsCookieJar
print(response.headers)      # response headers
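III. A recommended test site: http://httpbin.org, a request-testing service you can experiment with freely (other request methods)
import requests

r = requests.post("http://httpbin.org/post")
print(r.text)  # print the POST request details echoed back by httpbin
r = requests.put("http://httpbin.org/put")
r = requests.delete("http://httpbin.org/delete")
r = requests.options("http://httpbin.org/post")
Here post(), put(), delete(), and similar methods are used to send POST, PUT, DELETE, and other requests. Each method is sent to the httpbin endpoint that accepts it.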
IV. GET requests
Inspect the information a GET request carries:
import requests

r = requests.get("http://httpbin.org/get")
print(r.text)  # print the GET request details
Output:
{
  "args": {},
  "headers": {
    "Accept": "*/*",
    "Accept-Encoding": "gzip, deflate",
    "Connection": "close",
    "Host": "httpbin.org",
    "User-Agent": "python-requests/2.20.1"
  },
  "origin": "119.123.196.143",
  "url": "http://httpbin.org/get"
}
What the output shows: a request carries its request headers, the client IP address (origin), the URL, and similar information.
(1) Adding extra information to a request
Method 1: append ?key=value&key2=value2... to the URL (? starts the query string, & separates the parameters)
r = requests.get("http://httpbin.org/get?name=germey&age=22")

import requests

r = requests.get("http://httpbin.org/get?name=germey&age=22")
print(r.text)

{
  "args": {
    "age": "22",
    "name": "germey"
  },
  "headers": {
    "Accept": "*/*",
    "Accept-Encoding": "gzip, deflate",
    "Connection": "close",
    "Host": "httpbin.org",
    "User-Agent": "python-requests/2.20.1"
  },
  "origin": "119.123.196.143",
  "url": "http://httpbin.org/get?name=germey&age=22"
}
The output shows that the server parsed the query string into args, and that the requested URL was: http://httpbin.org/get?name=germey&age=22
Method 2: pass the params argument of get(), which encodes the parameters into the URL for you (recommended)
import requests

data = {
    "name": "germey",
    "age": 22
}
r = requests.get("http://httpbin.org/get", params=data)
print(r.text)

Output:
{
  "args": {
    "age": "22",
    "name": "germey"
  },
  "headers": {
    "Accept": "*/*",
    "Accept-Encoding": "gzip, deflate",
    "Connection": "close",
    "Host": "httpbin.org",
    "User-Agent": "python-requests/2.20.1"
  },
  "origin": "119.123.196.143",
  "url": "http://httpbin.org/get?name=germey&age=22"
}
Both methods produce the same URL: http://httpbin.org/get?name=germey&age=22. Method 2 is the more practical of the two, because the URL is built from a dict instead of by hand.
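To see how params turns a dict into a query string without sending anything over the network, the request can be prepared locally. This is a minimal sketch, assuming the requests library is installed; requests.Request(...).prepare() builds the final URL but does not contact the server.

```python
import requests

data = {"name": "germey", "age": 22}

# Build the request object locally; nothing is sent over the network here.
prepared = requests.Request("GET", "http://httpbin.org/get", params=data).prepare()

print(prepared.method)  # GET
print(prepared.url)     # http://httpbin.org/get?name=germey&age=22
```

The integer 22 is converted to text during encoding, which is why httpbin reports "age": "22" as a string.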
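(2) The response body returned by httpbin is a JSON-formatted string. Call .json() to convert it to a dict.
If the body is not valid JSON, .json() raises a JSONDecodeError (json.decoder.JSONDecodeError).
import requests

r = requests.get("http://httpbin.org/get")
print(type(r.text))    # data type of the response body
print(r.text)          # print the response body
# r.json() converts the JSON string to a dict
print(type(r.json()))  # print the data type after conversion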

<class 'str'>
{
  "args": {},
  "headers": {
    "Accept": "*/*",
    "Accept-Encoding": "gzip, deflate",
    "Connection": "close",
    "Host": "httpbin.org",
    "User-Agent": "python-requests/2.20.1"
  },
  "origin": "119.123.196.143",
  "url": "http://httpbin.org/get"
}
<class 'dict'>
The output shows that r.text is a str, while r.json() returns a dict.
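Because .json() fails on a body that is not JSON, it can help to guard the call. The sketch below reproduces the parsing step offline with the standard json module. parse_json_or_none is a hypothetical helper name, not part of requests, and the sample strings stand in for r.text.

```python
import json


def parse_json_or_none(text):
    """Return the parsed JSON value, or None if text is not valid JSON."""
    try:
        return json.loads(text)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return None


body = '{"args": {}, "url": "http://httpbin.org/get"}'  # stand-in for r.text
parsed = parse_json_or_none(body)
print(type(parsed))                  # <class 'dict'>
print(parsed["url"])                 # http://httpbin.org/get
print(parse_json_or_none("<html>"))  # None
```

With a real response, the same pattern should apply by wrapping r.json() in try/except ValueError.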
V. Examples of fetching pages with GET
(1) Fetching a Zhihu page successfully

# Request Zhihu
import requests

# Build the query parameters
data = {
    "type": "content",
    "q": "趙麗穎"
}
url = "https://www.zhihu.com/search"
# Build the request headers (browser identity and origin)
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.110 Safari/537.36",
    "origin": "119.123.196.143",
}
response = requests.get(url, params=data, headers=headers)
print(response.text)
Here we add headers that include a User-Agent field, which identifies the browser. Without it, Zhihu refuses the request. The data dict builds the search query.
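A custom User-Agent matters because requests identifies itself as python-requests by default, as the httpbin output earlier shows. The sketch below, assuming requests is installed, compares that default with a locally prepared request carrying the browser string. No request is sent.

```python
import requests

# Default identifier that requests sends when no User-Agent is supplied.
default_ua = requests.utils.default_user_agent()
print(default_ua)  # python-requests/<installed version>

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/70.0.3538.110 Safari/537.36"
}
# Prepare the request locally to inspect the headers it would carry.
prepared = requests.Request(
    "GET", "https://www.zhihu.com/search", headers=headers
).prepare()
print(prepared.headers["User-Agent"])
```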
(2) Downloading the GitHub site icon

import requests

r = requests.get("https://github.com/favicon.ico")
print(r.text)
print(r.content)
# Write the raw bytes to a file in binary mode
with open("github.ico", "wb") as f:
    f.write(r.content)
(3) Request header information (headers)
{
  "args": {},
  "headers": {
    "Accept": "*/*",
    "Accept-Encoding": "gzip, deflate",
    "Connection": "close",
    "Host": "httpbin.org",
    "User-Agent": "python-requests/2.20.1"
  },
  "origin": "119.123.196.143",
  "url": "https://httpbin.org/get"
}
(4) Fetching the GitHub icon
r.text returns the data as a string (str).
r.content returns the data as bytes.
import requests

r = requests.get("https://github.com/favicon.ico")
print('text', r.text)        # decoded as a string
print('content', r.content)  # raw binary data

import requests

r = requests.get("https://github.com/favicon.ico")
print(r.text)
print(r.content)
# Write the raw bytes to a file in binary mode
with open("gg.ico", "wb") as f:
    f.write(r.content)
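The relationship between r.text and r.content is decoding: content holds the raw bytes, and text is those bytes decoded into a str. A small sketch with plain Python values standing in for a response (the sample string is made up for illustration):

```python
# Stand-in for r.content: raw bytes as they arrive from the server.
content = "圖標 favicon".encode("utf-8")

# Stand-in for r.text: the same bytes decoded into a str.
text = content.decode("utf-8")

print(type(content))  # <class 'bytes'>
print(type(text))     # <class 'str'>
print(text)           # 圖標 favicon
```

An image such as favicon.ico has no meaningful text form, which is why the examples above save r.content with mode "wb" rather than writing r.text.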
VI. POST requests
Sending a request with form data
import requests

data = {
    'name': 'pig',
    'age': 18
}
r = requests.post('http://httpbin.org/post', data=data)
print(r.text)

{
  "args": {},
  "data": "",
  "files": {},
  "form": {
    "age": "18",
    "name": "pig"
  },
  "headers": {
    "Accept": "*/*",
    "Accept-Encoding": "gzip, deflate",
    "Connection": "close",
    "Content-Length": "15",
    "Content-Type": "application/x-www-form-urlencoded",
    "Host": "httpbin.org",
    "User-Agent": "python-requests/2.20.1"
  },
  "json": null,
  "origin": "119.123.198.80",
  "url": "http://httpbin.org/post"
}
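The form, Content-Type, and Content-Length values in this output come from how requests encodes the data dict. The sketch below, assuming requests is installed, prepares the same POST locally to show the encoded body without sending it.

```python
import requests

data = {"name": "pig", "age": 18}

# Prepare the POST locally; the body and headers are built without sending.
prepared = requests.Request("POST", "http://httpbin.org/post", data=data).prepare()

print(prepared.body)                       # name=pig&age=18
print(prepared.headers["Content-Type"])    # application/x-www-form-urlencoded
print(prepared.headers["Content-Length"])  # 15
```

The body is 15 characters long, which matches the "Content-Length": "15" that httpbin echoes back.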
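VII. Response status codes
1. 1xx status codes: informational
2. 2xx status codes: success
3. 3xx status codes: redirection
4. 4xx status codes: client error
5. 5xx status codes: server error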
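The five classes above are determined by the first digit of the code. A minimal sketch of that mapping follows; status_class is a hypothetical helper name introduced for illustration, not part of requests.

```python
def status_class(code):
    """Map an HTTP status code to its class name by its first digit."""
    classes = {
        1: "informational",
        2: "success",
        3: "redirection",
        4: "client error",
        5: "server error",
    }
    return classes.get(code // 100, "unknown")


for code in (200, 301, 404, 500):
    print(code, status_class(code))
```

With a real response, the value to pass in is response.status_code.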

import os
import re

import requests


def get(url, session):
    # Fetch a page and return its HTML decoded as UTF-8
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.100 Safari/537.36",
        "Host": "book.douban.com"
    }
    response = session.get(url, headers=headers)
    return response.content.decode("utf-8")


def parse(path, data, session):
    # Download the image at the URL `data` and save it to `path`
    res = session.get(data, timeout=2)
    with open(path, "wb") as f:
        f.write(res.content)


if __name__ == '__main__':
    # re.S goes in compile(); a compiled pattern's findall() takes a start position, not flags
    obj = re.compile('<img src="(?P<picture>.*?)"', re.S)
    session = requests.session()
    os.makedirs("book_picture", exist_ok=True)  # make sure the output directory exists
    html = get("https://book.douban.com/", session)
    pic_url_list = obj.findall(html)
    n = 1
    for pic_url in pic_url_list:
        print(pic_url)
        try:
            path = f"book_picture/{n}.jpg"
            parse(path, pic_url, session)
            n += 1
        except Exception as e:
            print(e)
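The image-extraction step in the script above relies on a regex with a named group. The sketch below runs the same pattern offline on a made-up HTML snippet (the example.com URLs are placeholders). It also shows that the second argument of a compiled pattern's findall() is a start position rather than a flags value, which is easy to get wrong.

```python
import re

html = '''<div>
<img src="https://example.com/a.jpg" alt="a">
<img src="https://example.com/b.jpg">
</div>'''

# Flags such as re.S are passed to compile(), not to findall().
pattern = re.compile(r'<img src="(?P<picture>.*?)"', re.S)

# With a single group, findall returns the list of captured strings.
print(pattern.findall(html))

# finditer exposes the named group explicitly.
for match in pattern.finditer(html):
    print(match.group("picture"))

# A second argument is a start index: matches beginning before it are skipped.
print(pattern.findall(html, 10))
```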

# coding=utf-8
import os
import re

import requests


def getHtml(url):
    # Fetch the search result page with browser-like headers and return its HTML
    headers = {
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
        "Cookie": "BDqhfp=%E5%90%8D%E8%83%9C%E5%8F%A4%E8%BF%B9%26%26-10-1undefined%26%260%26%261; BAIDUID=3E9EFDBD86FBFCF2DA4C32A06482351C:FG=1; BIDUPSID=E013F7397D0EF0F46FFAF97FC8A7F349; PSTM=1543903169; pgv_pvi=2584268800; delPer=0; PSINO=7; BDRCVFR[5VG_cZ6c41T]=9xWipS8B-FspA7EnHc1QhPEUf; ZD_ENTRY=baidu; BDRCVFR[dG2JNJb_ajR]=mk3SLVN4HKm; indexPageSugList=%5B%22%E5%90%8D%E8%83%9C%E5%8F%A4%E8%BF%B9%22%5D; cleanHistoryStatus=0; BDRCVFR[-pGxjrCMryR]=mk3SLVN4HKm; BCLID=7951205421696906576; BDSFRCVID=_rLOJeC629zHWkc9ByJPtBGK9T-me2jTH6bH27NMSSL8H8ptFoAvEG0PjM8g0KubkDpOogKK3gOTH4PF_2uxOjjg8UtVJeC6EG0P3J; H_BDCLCKID_SF=tJI8_DLytD_3JRcFDDTb-tCqMfTM2n58a5IX3buQX-od8pcNLTDKeJIn3UR2qxj8WHRJ5q3cfx-VDJK4MlO1j4DnDGJwWlJgBnb7VPJ23J7Nfl5jDh38XjksD-Rt5tnR-gOy0hvctb3cShPm0MjrDRLbXU6BK5vPbNcZ0l8K3l02VKO_e4bK-Tr3jaDetU5; H_PS_PSSID=; BDORZ=FFFB88E999055A3F8A630C64834BD6D0; userFrom=blog.csdn.net",
        "Host": "image.baidu.com",
        "Upgrade-Insecure-Requests": "1",
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.110 Safari/537.36"
    }
    page = requests.get(url, headers=headers)
    html = page.content.decode()
    return html


if __name__ == '__main__':
    msg = input("Enter a search term: ")
    os.makedirs(msg, exist_ok=True)  # one folder per search term; reuse it if it already exists
    html = getHtml(f"https://image.baidu.com/search/index?ct=201326592&cl=2&st=-1&lm=-1&nc=1&ie=utf-8&tn=baiduimage&ipn=r&rps=1&pv=&word={msg}&hs=0&oriquery=%E5%90%8D%E8%83%9C%E5%8F%A4%E8%BF%B9&ofr=%E5%90%8D%E8%83%9C%E5%8F%A4%E8%BF%B9&z=0&ic=0&face=0&width=0&height=0&latest=0&s=0&hd=0&copyright=0&selected_tags=%E7%AE%80%E7%AC%94%E7%94%BB")
    # Thumbnail URLs are embedded in the page as "thumbURL":"..." pairs
    obj = re.compile('"thumbURL":"(?P<pic>.*?)"', re.S)
    pic_list = obj.findall(html)
    n = 1
    for pic_url in pic_list:
        print(pic_url)
        try:
            data = requests.get(pic_url, timeout=2)
            path = f"{msg}/{n}.jpg"
            with open(path, "wb") as f:
                print("----------- Downloading image ----------")
                f.write(data.content)
            n += 1
            print("---------- Download complete ----------")
        except Exception as e:
            print(e)
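The search URL above contains runs such as %E5%90%8D%E8%83%9C%E5%8F%A4%E8%BF%B9, which are percent-encoded UTF-8 text. The sketch below uses the standard library to show that encoding and to build a query string from a dict. The params dict is a shortened, hypothetical subset of the original query parameters, chosen only for illustration.

```python
from urllib.parse import quote, unquote, urlencode

keyword = "名胜古迹"

# Percent-encode the keyword as UTF-8, as it appears in the search URL.
encoded = quote(keyword)
print(encoded)           # %E5%90%8D%E8%83%9C%E5%8F%A4%E8%BF%B9
print(unquote(encoded))  # 名胜古迹

# Only a few of the original parameters are kept here for brevity.
params = {"tn": "baiduimage", "ie": "utf-8", "word": keyword}
url = "https://image.baidu.com/search/index?" + urlencode(params)
print(url)
```

Passing a dict like this as params= to requests.get(), as in Section IV, would let requests handle the encoding instead of an f-string.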