Web Scraping: the urllib Package and Its request and parse Modules

This article is a repost. 2018-05-09 22:43. Category: web scraping.

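Introduction to urllib

Overview

Python 3 merges the urllib and urllib2 packages of Python 2.7 into a single urllib package.

In Python 3, the urllib package contains four modules:

urllib.request: opens and reads URLs
urllib.error: defines the exceptions raised by urllib.request
urllib.parse: parses URLs
urllib.robotparser: parses robots.txt files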
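To make the role of urllib.parse concrete, here is a minimal sketch that splits a URL into its components and decodes its query string. The URL is a hypothetical one chosen only for illustration.

```python
from urllib import parse

# Hypothetical URL used only for illustration
url = 'https://www.example.com/search?q=python&page=2#results'

parts = parse.urlparse(url)
print(parts.scheme)    # https
print(parts.netloc)    # www.example.com
print(parts.path)      # /search
print(parts.fragment)  # results

# parse_qs maps each query key to a list of values
query = parse.parse_qs(parts.query)
print(query)           # {'q': ['python'], 'page': ['2']}
```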

Installing and importing

urllib ships with Python's standard library, so nothing needs to be installed. Import it like this:

from urllib import request
...

urllib.request

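urllib.request is the most frequently used module, especially its urlopen function. For an HTTP(S) URL, urlopen returns a response object; calling read() on it yields the page body as a bytes object, which decode() then turns into a string of HTML: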

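Syntax:

urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)

Parameters: url can be either a URL string or a Request object; the latter is generally used when extra information such as headers has to be attached. (Since Python 3.6, cafile, capath and cadefault are deprecated in favor of context.)

For a POST request, use the data parameter to carry the form fields. Before passing the form data to urlopen, convert it with urlencode() from urllib.parse and encode the result to bytes, i.e. data = urllib.parse.urlencode(data).encode('utf-8'), then pass data to the function.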
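The encoding step described above can be sketched as follows. The form fields and the endpoint are hypothetical, and the request is only built, not sent, so the snippet runs without network access.

```python
from urllib import parse, request

# Hypothetical form fields and endpoint, for illustration only
form = {'user': 'alice', 'lang': 'zh'}

# urlencode() produces a str; encode() turns it into the bytes urlopen expects
data = parse.urlencode(form).encode('utf-8')
print(data)  # b'user=alice&lang=zh'

# Supplying data switches the request method from GET to POST
req = request.Request('https://www.example.com/login', data=data)
print(req.get_method())  # POST

# Nothing is sent until the request is opened:
# response = request.urlopen(req, timeout=10)
```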

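The Request class in urllib.request also takes a URL string and accepts additional arguments such as headers. It returns a Request object that can be passed as the url argument of urlopen.

Syntax:

urllib.request.Request(url, data=None, headers={}, origin_req_host=None, unverifiable=False, method=None)

Parameters: url is a URL string.

data is used the same way as in urlopen.

headers is a dict. A server inspects the headers of an incoming request to work out what kind of client sent it, so we can write our own headers and pass them to urllib.request.Request to disguise the client. The User-Agent header is the one that identifies the client software. In addition, routing requests through a proxy changes the IP address the server sees.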
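A minimal sketch of both ideas, custom headers and a proxy, might look like this. The User-Agent string is just an example value and the proxy address is hypothetical; the request is built but not sent.

```python
from urllib import request

# Example User-Agent string and hypothetical proxy address
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}
proxies = {'http': 'http://127.0.0.1:8080', 'https': 'http://127.0.0.1:8080'}

req = request.Request('https://www.example.com/', headers=headers)
# Request stores header names in capitalized form, hence 'User-agent'
print(req.get_header('User-agent'))

# An opener built with ProxyHandler routes its requests through the proxy
opener = request.build_opener(request.ProxyHandler(proxies))

# Nothing is sent until the opener is used:
# response = opener.open(req, timeout=10)
```

Using a dedicated opener rather than request.install_opener() keeps the proxy setting local to the calls that need it.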

urllib.error

To be continued...

urllib.parse

To be continued...

urllib.robotparser

To be continued...

A small scraper example

Code

from urllib import request

# Define the URL to fetch
url = 'https://www.baidu.com/'

# Open the URL with request.urlopen()
response = request.urlopen(url)

# The return value is an HTTPResponse object
print(type(response))    # <class 'http.client.HTTPResponse'>
print(response)          # <http.client.HTTPResponse object at 0x00000196C95CB550>

# read() returns the HTML served at the URL, as bytes
html = response.read()
print(type(html))        # <class 'bytes'>

# Decode the bytes into a str (decode() defaults to UTF-8)
html = html.decode()
print(html)

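Analysis

As the code above shows, the urlopen function of the request module in the urllib package returns an HTTPResponse object. Calling its read() method returns the HTML of the page as bytes, and decoding that result gives the text.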
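In practice the open, read and decode steps can fail or meet a page that is not UTF-8. The following is one possible sketch, not the article's own method: fetch_text is a hypothetical helper that closes the response with a with block, takes the charset from the Content-Type header when one is declared, and catches the exceptions defined in urllib.error.

```python
from urllib import error, request


def fetch_text(url, timeout=10):
    """Return the decoded body of url, or None if the request fails."""
    try:
        with request.urlopen(url, timeout=timeout) as response:
            # Use the charset declared in Content-Type, falling back to UTF-8
            charset = response.headers.get_content_charset() or 'utf-8'
            return response.read().decode(charset, errors='replace')
    except error.HTTPError as exc:
        # The server answered, but with an error status such as 404
        print('HTTP error:', exc.code, exc.reason)
    except error.URLError as exc:
        # No usable answer: DNS failure, refused connection, unknown scheme
        print('URL error:', exc.reason)
    return None


# A data: URL is handled by urllib without touching the network
print(fetch_text('data:text/plain;charset=utf-8,hello'))  # hello
```

HTTPError is a subclass of URLError, so it has to be caught first.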

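The HTTPResponse object returned by urlopen

The small example above shows that opening a URL with urlopen returns an HTTPResponse object. It has the following commonly used methods:

read()

Reads the HTML content of the URL and returns it as bytes.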

geturl()

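Returns the URL of the resource that was actually retrieved. This is normally the url passed to urlopen, but it can differ if the request was redirected.

For example, calling it in the example above: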

print(response.geturl())
 
# https://www.baidu.com/

info()

Returns the response's meta-information, i.e. its headers.

print(response.info())

'''
Accept-Ranges: bytes
Cache-Control: no-cache
Content-Length: 227
Content-Type: text/html
Date: Wed, 09 May 2018 13:59:22 GMT
Last-Modified: Tue, 08 May 2018 03:45:00 GMT
P3p: CP=" OTI DSP COR IVA OUR IND COM "
Pragma: no-cache
Server: BWS/1.1
Set-Cookie: BD_NOT_HTTPS=1; path=/; Max-Age=300
Set-Cookie: BIDUPSID=E163F6688178D6656D765FF58DBA2D01; expires=Thu, 31-Dec-37 23:55:55 GMT; max-age=2147483647; path=/; domain=.baidu.com
Set-Cookie: PSTM=1525874362; expires=Thu, 31-Dec-37 23:55:55 GMT; max-age=2147483647; path=/; domain=.baidu.com
Strict-Transport-Security: max-age=0
X-Ua-Compatible: IE=Edge,chrome=1
Connection: close
'''

getcode()

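Returns the HTTP status code, such as 200. For error statuses such as 404 or 403, urlopen raises urllib.error.HTTPError instead of returning a response.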

print(response.getcode())

# 200
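Since Python 3.9 the documentation marks geturl(), info() and getcode() as deprecated in favor of the url, headers and status attributes. The sketch below reads the same details through those attributes; summarize is a hypothetical helper, and a data: URL is used only so the snippet runs offline.

```python
from urllib import request


def summarize(response):
    """Collect the same details as geturl(), info() and getcode()."""
    return {
        'url': response.url,        # like geturl()
        'status': response.status,  # like getcode()
        # info() returns the headers object that this value is read from
        'content_type': response.headers.get_content_type(),
    }


# A data: URL keeps the demo offline; for it, status is None because
# no HTTP exchange takes place. An http(s) URL would report e.g. 200.
with request.urlopen('data:text/html,hello') as response:
    summary = summarize(response)
print(summary['url'], summary['content_type'])
```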

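Passing parameters to urlopen

Passing parameters with GET

A GET request passes information to the server through the query string of the URL.

The parameters are held in a dict, which must be encoded with parse.urlencode() and appended to the URL after a "?". (Passing the encoded dict as the data argument instead would turn the request into a POST.)

The general form is:

response = request.urlopen(url + '?' + query)

For example:

from urllib import request
from urllib import parse

url = 'https://www.baidu.com/s'

# Ask the user for a search keyword
keyword = input('Enter what you want to search for: ')

# Put the keyword in a dict; Baidu's search page reads it from the wd parameter
params = {'wd': keyword}

# URL-encode the dict into a query string such as wd=python
query = parse.urlencode(params)

# Appending the query string to the URL keeps this a GET request
response = request.urlopen(url + '?' + query)

print(response.read().decode())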
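To see what the encoding step does to a keyword containing spaces or non-ASCII characters, here is a small offline sketch. The parameter names wd and pn and their values are assumptions made for illustration.

```python
from urllib import parse

base = 'https://www.baidu.com/s'
# 'wd' and 'pn' and their values are assumptions of this sketch
params = {'wd': 'python 爬虫', 'pn': 10}

# Spaces become '+', non-ASCII characters become percent-encoded UTF-8
query = parse.urlencode(params)
full_url = base + '?' + query
print(full_url)

# Decoding the query string recovers the original values (as strings)
decoded = parse.parse_qs(parse.urlparse(full_url).query)
print(decoded)
```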

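The example above hands its arguments straight to urlopen. The request can also be wrapped in a Request object, which is then passed to urlopen.

This approach can carry more information, such as headers, which lets the script present itself as a browser, as follows:

from urllib import request
from urllib import parse

keyword = input('Enter what you want to search for: ')
url = 'https://www.baidu.com/s'
params = {'wd': keyword}

# Example browser-style User-Agent string; any realistic value can be used
header = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}

# No data argument, so this stays a GET request
req = request.Request(url + '?' + parse.urlencode(params), headers=header)

response = request.urlopen(req)

print(response.read().decode())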

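Passing parameters with POST

A small example that connects to the Baidu Translate API

from urllib import request
from urllib import parse

url = 'http://fanyi.baidu.com/sug'

keyword = input('Enter the word to translate: ')

# Form data must be URL-encoded and then converted to bytes
data = {'kw': keyword}
data = parse.urlencode(data).encode()

# Optional: urlopen fills in Content-Length itself when it is omitted
header = {'Content-Length': len(data)}

# Supplying data makes this a POST request
req = request.Request(url, data=data, headers=header)

response = request.urlopen(req)

res = response.read().decode()
print(res)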

The example above returns a JSON string. Entering 'girl' gives the following result:

{"errno":0,"data":[{"k":"girl","v":"n. \u5973\u5b69; \u59d1\u5a18\uff0c\u672a\u5a5a\u5973\u5b50; \u5973\u804c\u5458\uff0c\u5973\u6f14\u5458; \uff08\u7537\u4eba\u7684\uff09\u5973\u670b\u53cb;"},{"k":"girls","v":"n. \u5973\u5b69; \u5973\u513f( girl\u7684\u540d\u8bcd\u590d\u6570 ); \u5973\u5de5; \uff08\u7537\u4eba\u7684\uff09\u5973\u670b\u53cb;"},{"k":"girlfriend","v":"n. \u5973\u670b\u53cb; \u5973\u6027\u670b\u53cb;"},{"k":"girl friend","v":"n. \u5973\u670b\u53cb\uff0c\uff08\u7537\u4eba\u7684\uff09\u60c5\u4eba; \u5bf9\u8c61;"},{"k":"Girls' Generation","v":" \u5c11\u5973\u65f6\u4ee3\uff08\u97e9\u56fdSM\u5a31\u4e50\u6709\u9650\u516c\u53f8\u4e8e2007\u5e74\u63a8\u51fa\u7684\u4e5d\u540d\u5973\u5b50\u5c11\u5973\u7ec4\u5408\uff09;"}]}

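We only need to parse it into a dict with the json module and then loop over the entries to display the result:

from urllib import request
from urllib import parse
import json

url = 'http://fanyi.baidu.com/sug'

keyword = input('Enter the word to translate: ')

# Form data must be URL-encoded and then converted to bytes
data = {'kw': keyword}
data = parse.urlencode(data).encode()

# Optional: urlopen fills in Content-Length itself when it is omitted
header = {'Content-Length': len(data)}

req = request.Request(url, data=data, headers=header)

response = request.urlopen(req)

res = response.read().decode()

# The 'data' key holds a list of {'k': word, 'v': meaning} entries
fanyi_res = json.loads(res)['data']

for item in fanyi_res:
    print(item['k'], item['v'])

The result looks like this:

Enter the word to translate: girl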
girl n. 女孩; 姑娘，未婚女子; 女職員，女演員; （男人的）女朋友;
girls n. 女孩; 女兒( girl的名詞復數 ); 女工; （男人的）女朋友;
girlfriend n. 女朋友; 女性朋友;
girl friend n. 女朋友，（男人的）情人; 對象;
Girls' Generation  少女時代（韓國SM娛樂有限公司於2007年推出的九名女子少女組合）;
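Keeping the parsing step separate from the network call makes it easy to check without contacting the server. The sketch below is one way to do that: parse_suggestions is a hypothetical helper, the sample is a shortened copy of the JSON response shown earlier, and treating a non-zero errno as a failure is an assumption.

```python
import json


def parse_suggestions(body):
    """Turn the JSON text returned by the endpoint into (word, meaning) pairs."""
    payload = json.loads(body)
    # Assumption: a non-zero errno signals a failed lookup
    if payload.get('errno') != 0:
        return []
    return [(item['k'], item['v']) for item in payload.get('data', [])]


# Shortened copy of the JSON response shown earlier
sample = '{"errno":0,"data":[{"k":"girl","v":"n. \\u5973\\u5b69;"}]}'
for word, meaning in parse_suggestions(sample):
    print(word, meaning)  # girl n. 女孩;
```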
