To scrape data from a page, we first need to access it by sending an HTTP request. Below are several simple ways to send requests in Python.
Libraries used: urllib, requests
1. urlopen
import urllib.request
import urllib.parse
import urllib.error
import socket

# The POST body must be bytes: urlencode the form fields, then encode
data = bytes(urllib.parse.urlencode({"hello": "world"}), encoding='utf8')
try:
    # Passing data makes urlopen send a POST; timeout is in seconds
    response = urllib.request.urlopen('http://httpbin.org/post', data=data, timeout=10)
    print(response.status)
    print(response.read().decode('utf-8'))
except urllib.error.URLError as e:
    # A timeout surfaces as a URLError whose reason is socket.timeout
    if isinstance(e.reason, socket.timeout):
        print("TIMEOUT")
2. requests
This approach uses the get, post, delete, and put methods of requests to send the corresponding requests, which is simpler than the first approach.
Each method accepts the relevant parameters, for example get takes params (query parameters), proxies (proxy settings), auth (authentication), timeout (request timeout), and so on; a sketch combining some of these follows the download example below.
import requests

# .content holds the raw response bytes, suitable for binary files
ico = requests.get("https://github.com/favicon.ico")
with open("favicon.ico", "wb") as file:
    file.write(ico.content)
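The parameters listed above can be combined freely. A minimal sketch, assuming the httpbin echo service (any endpoint that echoes the request back would do):

import requests

# params are encoded into the URL query string; timeout is in seconds
resp = requests.get(
    "http://httpbin.org/get",
    params={"hello": "world"},
    timeout=10,
)
print(resp.status_code)
print(resp.json())  # httpbin echoes the query parameters back as JSON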
3. Request and Session
from requests import Session, Request

url = "https://home.cnblogs.com/u/qiutian-guniang/"
s = Session()
# headers is the dict defined at the end of this section
req = Request('GET', url=url, headers=headers)
# prepare_request merges the Session's state into the request before sending
pred = s.prepare_request(req)
r = s.send(pred)
print(r.text)
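A prepared request is useful when you want to inspect or modify the request (headers, body, URL) after the Session has merged in its cookies and defaults but before it is actually sent.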
Some sites refuse obvious scrapers. We can get around this by setting a User-Agent, and we can use cookies to keep a logged-in session. For example, the cookie value below can be copied from the browser's developer tools (F12) and pasted into the headers:
cookies = "_gat=1"
headers = {
    "Cookie": cookies,
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; "
                  "x64) AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/68.0.3440.106 Safari/537.36",
}
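Note that a Session also tracks cookies automatically across requests, so for cookies set by the server you often don't need to copy them by hand. A minimal sketch, assuming httpbin's cookie endpoints:

import requests

s = requests.Session()
# The server sets a cookie; the Session stores it...
s.get("http://httpbin.org/cookies/set/sessioncookie/123456789")
# ...and sends it back on the next request automatically
resp = s.get("http://httpbin.org/cookies")
print(resp.json())  # {'cookies': {'sessioncookie': '123456789'}}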