Learning `urllib` and `requests`

urllib
requests
References

urllib

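urllib is part of Python's standard library and contains four modules: request, error, parse, and robotparser. The two used most often are request, which sends HTTP requests, and error, which defines the exceptions raised when a request fails. parse handles URLs: splitting them into components, joining them, and so on.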
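As a small illustration of what parse does, here is a minimal sketch that splits a URL, joins a relative path onto a base URL, and builds a query string. The sample URLs and the query parameters (wd, pn) are only illustrative values, not taken from any real API.

```python
from urllib.parse import urlencode, urljoin, urlparse

# Split a URL into its components.
parts = urlparse("http://www.baidu.com/s?wd=python#top")
print(parts.scheme)    # http
print(parts.netloc)    # www.baidu.com
print(parts.path)      # /s
print(parts.query)     # wd=python
print(parts.fragment)  # top

# Resolve a relative link against a base URL.
joined = urljoin("http://www.baidu.com/a/b.html", "c.html")
print(joined)  # http://www.baidu.com/a/c.html

# Build a query string from a dict (illustrative parameters).
query = urlencode({"wd": "python", "pn": 10})
print(query)  # wd=python&pn=10
```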

Basic usage

# Import the request module from urllib
from urllib import request
# Open a URL with urlopen
urllib_response = request.urlopen("http://www.baidu.com")
# read() returns the response body as bytes; decode() turns it into text.
# The body can only be read once, so keep it in a variable.
body = urllib_response.read()
print(body.decode("utf-8"))
print(type(body))  # <class 'bytes'>

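Question 1: What happens if the network is down when the request is sent?
I tested this with the network disconnected. urllib raises urllib.error.URLError with the message urlopen error [Errno -3] Temporary failure in name resolution, meaning the domain name could not be resolved.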

from urllib import request, error
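# The fix is to import the error module and catch the exception with try/except,
# so that the failure does not stop the rest of the program.
# The error can also be written to a log.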
try:
    urllib_response = request.urlopen("http://www.baidu.com")
    print(urllib_response.read().decode('utf-8'))
except error.URLError as e:
    print(e)
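The try/except pattern above can be extended a little. The sketch below, which is only one possible arrangement, separates HTTPError (the server replied with an error status) from the more general URLError (no usable reply at all), and writes both to a log. The function name fetch, the logger name, and the 5-second timeout are assumptions made for this example. HTTPError is a subclass of URLError, so it has to be caught first.

```python
import logging
import socket
from urllib import error, request

logging.basicConfig(level=logging.WARNING)
logger = logging.getLogger("fetch_demo")  # hypothetical logger name


def fetch(url, timeout=5):
    """Return the decoded body, or None on HTTPError or URLError."""
    try:
        with request.urlopen(url, timeout=timeout) as resp:
            return resp.read().decode("utf-8")
    except error.HTTPError as e:
        # The server answered, but with an error status such as 404 or 500.
        logger.warning("HTTP error %s for %s", e.code, url)
    except error.URLError as e:
        # No usable response: DNS failure, refused connection, and so on.
        logger.warning("Could not reach %s: %s", url, e.reason)
    return None


# Usage: pick a local port with nothing listening, so the request fails
# without needing to disconnect the network.
with socket.socket() as s:
    s.bind(("127.0.0.1", 0))
    free_port = s.getsockname()[1]

result = fetch(f"http://127.0.0.1:{free_port}")
print(result)
```

The program keeps running after the failure, and the caller only has to check for None.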

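Question 2: How do you add request headers?

Before adding request headers, it is worth asking why they need to be added at all.
To answer that, we first need to see what the default request headers look like, so I wrote a simple web server, referred to below as the server.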

import socket

with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as server:
    server.bind(('127.0.0.1', 5000))
    server.listen(5)
    while True:
        conn, ip = server.accept()
        req = b''
        while True:
            temp = conn.recv(1024)
            req += temp
            if len(temp) < 1024:
                break
        print(req.decode('utf-8'))
        conn.sendall(b'HTTP/1.1 200 OK\r\nContent-Type:text/html\r\n\r\n')
        conn.close()

Now open 127.0.0.1:5000 directly with urlopen. The server prints the following:

GET / HTTP/1.1
Accept-Encoding: identity
Host: 127.0.0.1:5000
User-Agent: Python-urllib/3.7
Connection: close

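The User-Agent shows that the client is Python-urllib/3.7, which is not an ordinary client such as a browser. The target site can use this field to judge whether a visit is legitimate and decide whether to block it. Setting a browser-like request header does not make a crawler safe, of course: the site can still use other methods to decide whether the visit is legitimate. Still, we should do what we can.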
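The default User-Agent can also be checked without running a server. The sketch below assumes only that an opener created by build_opener carries its default headers in the addheaders attribute; the exact version string depends on the Python installation.

```python
from urllib import request

# A fresh opener holds the headers urllib adds to every request by default.
opener = request.build_opener()
default_headers = dict(opener.addheaders)

# Prints something like {'User-agent': 'Python-urllib/3.7'};
# the version part follows the running interpreter.
print(default_headers)
```

The other headers seen by the server (Host, Accept-Encoding, Connection) are filled in when the request is actually sent, so they do not appear here.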

Setting request headers with `urllib`

from urllib import request
"""
headers holds the header fields as a dict.
request.Request builds the request.
request.urlopen sends the request.
"""
headers = {
    'User-Agent': ('Mozilla/5.0 (X11; Linux x86_64)'
                   ' AppleWebKit/537.36 (KHTML, like Gecko)'
                   ' Chrome/68.0.3440.106 Safari/537.36')
}

req = request.Request(url='http://127.0.0.1:5000', headers=headers)
response = request.urlopen(req)
print(response.read().decode("utf-8"))

The server prints the following; the User-Agent has been changed:

GET / HTTP/1.1
Accept-Encoding: identity
Host: 127.0.0.1:5000
User-Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36
Connection: close
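The headers stored on a Request object can be inspected before anything is sent, which is a quick way to confirm what urlopen will transmit. This is a minimal sketch; the shortened User-Agent string and the Referer value are illustrative only.

```python
from urllib import request

headers = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64)"}
req = request.Request(url="http://127.0.0.1:5000", headers=headers)

# A header can also be added after the Request has been built.
req.add_header("Referer", "http://127.0.0.1:5000/")  # illustrative value

# Request stores header names with str.capitalize(),
# so the lookup key is "User-agent", not "User-Agent".
print(req.get_header("User-agent"))
print(req.header_items())
```

Nothing is sent over the network here; the request only goes out once req is passed to request.urlopen.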

requests

requests is a third-party Python library; install it with pip install requests.

Basic usage

import requests

response = requests.get("http://www.baidu.com")
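# The Baidu page may come out garbled: the page itself is encoded as UTF-8,
# but requests falls back to ISO-8859-1 when the response headers declare no charset.
# Set response.encoding to choose the encoding used to decode this response.
response.encoding = 'utf-8'
# response.text holds the response body as text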
print(response.text)
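Why the wrong encoding produces garbled text can be shown with the standard library alone. The sketch below does not call requests; it takes a sample UTF-8 byte string (an assumed stand-in for a response body) and decodes it both ways.

```python
# Sample body: four Chinese characters encoded as UTF-8 (3 bytes each).
raw = "百度一下".encode("utf-8")

# Decoding as ISO-8859-1 never fails, but maps every byte to its own
# character, which is what produces the garbled text.
garbled = raw.decode("iso-8859-1")
correct = raw.decode("utf-8")

print(len(raw), len(garbled), len(correct))  # 12 12 4
print(garbled == correct)                    # False

# The bytes are intact, so re-encoding the garbled text recovers the original.
print(garbled.encode("iso-8859-1").decode("utf-8") == correct)  # True
```

Setting response.encoding = 'utf-8' has the same effect as choosing the second decode: the bytes stay the same and only their interpretation changes.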

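To set request headers, requests lets you pass a headers argument directly. Unlike urllib's request module, there is no need to build a request object first and then send it, because requests builds the request internally.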

import requests

headers = {
    'User-Agent': ('Mozilla/5.0 (X11; Linux x86_64)'
                   ' AppleWebKit/537.36 (KHTML, like Gecko)'
                   ' Chrome/68.0.3440.106 Safari/537.36')
}
response = requests.get("http://127.0.0.1:5000", headers=headers)
print(response.text)

The request headers received by the server, before and after setting headers:

GET / HTTP/1.1
Host: 127.0.0.1:5000
User-Agent: python-requests/2.20.1
Accept-Encoding: gzip, deflate
Accept: */*
Connection: keep-alive


GET / HTTP/1.1
Host: 127.0.0.1:5000
User-Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36
Accept-Encoding: gzip, deflate
Accept: */*
Connection: keep-alive

References

Cui Qingcai (崔慶才), 《Python3網絡爬蟲開發實戰》
Official urllib documentation
Official requests documentation
