The urllib Module in Python 3

Reposted article · 2020-03-30 11:16 · Tags: Python3 / Python crawlers

Contents

1. urllib.request.urlopen()
2. urllib.request.Request()
3. Advanced classes in urllib.request
4. Exception handling
5. Parsing URLs
6. Analyzing the Robots protocol

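urllib is Python's built-in HTTP request library; it needs no installation. It contains four modules:

request: the basic HTTP request module, used to send requests

error: the exception module; errors raised while making a request can be caught through it

parse: a utility module with many URL-handling functions for splitting, parsing, joining, and so on

robotparser: parses a site's robots.txt file to determine which pages may be crawled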

1. urllib.request.urlopen()

urllib.request.urlopen(url,data=None,[timeout,],cafile=None,capath=None,cadefault=False,context=None)

It sends a request and returns an HTTPResponse object, which provides the following methods and attributes:

Methods: read(), readinto(), getheader(name), getheaders(), fileno()

Attributes: msg, version, status, reason, debuglevel, closed

import urllib.request

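response=urllib.request.urlopen('https://www.python.org')  # send the request; returns an HTTPResponse object
#print(response.read().decode('utf-8'))   # page body decoded as text
#print(response.getheader('server')) # value of the Server response header
#print(response.getheaders()) # response headers as a list of (name, value) tuples
#print(response.fileno()) # file descriptor of the underlying socket
#print(response.version)  # HTTP protocol version (11 means HTTP/1.1)
#print(response.status)  # status code: 200 for success, 404 for page not found
#print(response.debuglevel) # debug level
#print(response.closed)  # whether the response object is closed (bool)
#print(response.geturl()) # URL that was actually retrieved
#print(response.info()) # response headers
#print(response.getcode()) # HTTP status code of the response
#print(response.msg)  # 'OK' on success
#print(response.reason) # reason phrase sent by the server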

urlopen() accepts these arguments:

url: the address to open, either a str or a Request object

data: optional; must be bytes. When data is supplied, urlopen sends the request with POST instead of GET
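That a request switches from GET to POST once data is supplied can be checked without any network access. The sketch below builds two Request objects (the class is described in section 2) and asks each which method it would use; the httpbin.org URLs are only placeholders and nothing is sent.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Encode the form fields; str.encode() is equivalent to bytes(..., encoding='utf-8')
data = urlencode({'word': 'hello'}).encode('utf-8')

get_req = Request('http://httpbin.org/get')               # no data
post_req = Request('http://httpbin.org/post', data=data)  # with data

print(data)                   # the bytes that would be sent as the request body
print(get_req.get_method())   # method chosen when data is None
print(post_req.get_method())  # method chosen when data is given
```

The body is 10 bytes long, which matches the Content-Length header that httpbin echoes back for this form.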

from urllib.request import urlopen
import urllib.parse

data = bytes(urllib.parse.urlencode({'word':'hello'}),encoding='utf-8')
# data must be bytes: urlencode() from urllib.parse turns the dict into a query string, and bytes() encodes that string as UTF-8
response = urlopen('http://httpbin.org/post',data=data)
print(response.read())

#output
b'{
"args":{},
"data":"",
"files":{},
"form":{"word":"hello"},  # the form field shows that the data was submitted as a form via POST
"headers":{"Accept-Encoding":"identity",
    "Connection":"close",
    "Content-Length":"10",
    "Content-Type":"application/x-www-form-urlencoded",
    "Host":"httpbin.org",
    "User-Agent":"Python-urllib/3.5"},
"json":null,
"origin":"114.245.157.49",
"url":"http://httpbin.org/post"}\n'

timeout: sets the timeout in seconds. If no response arrives within that time, an exception is raised. It applies to HTTP, HTTPS, and FTP requests

import urllib.request
response=urllib.request.urlopen('http://httpbin.org/get',timeout=0.1)  # a 0.1-second timeout; this will most likely raise an exception
print(response.read())

#output
urllib.error.URLError: <urlopen error timed out>

# the exception can be caught with try/except
import urllib.request
import urllib.error
import socket
try:
    response=urllib.request.urlopen('http://httpbin.org/get',timeout=0.1)
    print(response.read())
except urllib.error.URLError as e:
    if isinstance(e.reason,socket.timeout): # check whether the reason is a timeout
        print(e.reason) # print the error message
#output
timed out
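A timeout does not always arrive wrapped in URLError. In CPython, a timeout while connecting is reported as URLError with a socket.timeout reason, whereas a timeout while waiting for the response can surface as a bare socket.timeout. The helper below is a minimal sketch that treats both the same way; fetch_or_none and the silent local socket used to provoke the timeout are illustrative, not part of urllib.

```python
import socket
from urllib.error import URLError
from urllib.request import ProxyHandler, build_opener

# An opener that ignores proxy environment variables, so the local demo URL is contacted directly
opener = build_opener(ProxyHandler({}))


def fetch_or_none(url, timeout):
    """Return the response body, or None if the request timed out."""
    try:
        with opener.open(url, timeout=timeout) as response:
            return response.read()
    except URLError as e:
        if isinstance(e.reason, socket.timeout):  # timeout wrapped in URLError
            return None
        raise
    except socket.timeout:                        # timeout raised directly
        return None


# Demo: a listening socket that accepts connections at the OS level but never replies
silent = socket.socket()
silent.bind(('127.0.0.1', 0))   # port 0 lets the OS choose a free port
silent.listen(1)
url = 'http://127.0.0.1:%d/' % silent.getsockname()[1]

result = fetch_or_none(url, timeout=0.5)
print(result)
silent.close()
```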

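Other parameters: context must be an ssl.SSLContext instance and specifies the SSL settings. cafile and capath give a CA certificate file and a directory of CA certificates for HTTPS requests; both were deprecated in Python 3.6 in favor of context and removed in Python 3.13.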
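As a rough sketch of the context parameter, the code below creates a default SSL context and wires it into an opener. It only constructs objects and sends nothing; the CA bundle path in the comment is a placeholder.

```python
import ssl
from urllib.request import HTTPSHandler, build_opener

# Default context: verifies the server certificate and host name against the system CA store
context = ssl.create_default_context()
# A private CA bundle could be supplied instead (the path is a placeholder):
# context = ssl.create_default_context(cafile='/path/to/ca-bundle.pem')

# Either pass it per call:  urllib.request.urlopen(url, context=context)
# or build an opener that uses it for every HTTPS request:
opener = build_opener(HTTPSHandler(context=context))

print(context.verify_mode == ssl.CERT_REQUIRED, context.check_hostname)
```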

2. urllib.request.Request()

urllib.request.Request(url,data=None,headers={},origin_req_host=None,unverifiable=False,method=None)

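Parameters:

url: the URL to request; the only required argument, all others are optional

data: the data to upload; must be bytes. If it is a dict, encode it first with urlencode() from urllib.parse

headers: a dict of request headers. Headers can be set through this argument or added afterwards by calling add_header() on the Request instance

For example, you can change the User-Agent header to pose as a browser; the following value imitates an old version of Internet Explorer:

{'User-Agent':'Mozilla/5.0 (compatible; MSIE 5.5; Windows NT)'}

origin_req_host: the host name or IP address of the party that originated the request

unverifiable: whether the request is unverifiable, meaning the user had no chance to approve it; defaults to False. For example, when an image embedded in an HTML page is fetched automatically and the user has no way to approve that fetch, the value should be True

method: a string naming the HTTP method to use, such as GET, POST, or PUT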

#!/usr/bin/env python
#coding:utf8
from urllib import request,parse

url='http://httpbin.org/post'
headers={
    'User-Agent':'Mozilla/5.0 (compatible; MSIE 5.5; Windows NT)',
    'Host':'httpbin.org'
}  # request headers

form={'name':'germey'}  # named form rather than dict, to avoid shadowing the built-in
data = bytes(parse.urlencode(form),encoding='utf-8')
req = request.Request(url=url,data=data,headers=headers,method='POST')
#req.add_header('User-Agent','Mozilla/5.0 (compatible; MSIE 8.4; Windows NT)') # headers can also be added with the Request method add_header()
response = request.urlopen(req)
print(response.read())
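A Request can be inspected before it is sent, which is a convenient way to check what the parameters above do. This sketch assumes a placeholder httpbin.org URL and sends nothing; it shows method overriding the default, add_header(), and the way header names are stored.

```python
from urllib import parse, request

data = parse.urlencode({'name': 'germey'}).encode('utf-8')
req = request.Request(
    'http://httpbin.org/put',
    data=data,
    headers={'User-Agent': 'Mozilla/5.0 (compatible; MSIE 5.5; Windows NT)'},
    method='PUT',            # overrides the POST that data would otherwise imply
)
req.add_header('Accept', 'application/json')  # add one more header afterwards

print(req.get_method())               # the method that will be sent
print(req.host, req.selector)         # host and path parsed from the URL
# Header names are stored capitalized, so look them up as 'User-agent'
print(req.get_header('User-agent'))
print(sorted(req.headers))
```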

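3. Advanced classes in urllib.request

BaseHandler in urllib.request is the parent class of all other Handlers. A Handler is a processor that deals with one aspect of a request, such as login authentication, cookies, proxy settings, or redirects.

It provides the following members, for direct use and for use by derived classes:

add_parent(director): adds a director (an OpenerDirector) as the parent

close(): removes any parents

parent: an attribute, not a method; it holds a valid OpenerDirector, which can be used to open a URL with a different protocol or to handle errors

default_open(req): not defined by BaseHandler itself; a subclass defines it to catch all URLs, and it is called before any protocol-specific open method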

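Subclasses of BaseHandler include:

HTTPDefaultErrorHandler: handles HTTP error responses; every error is turned into an HTTPError exception

HTTPRedirectHandler: handles redirects

HTTPCookieProcessor: handles cookies

ProxyHandler: sets proxies; when no proxy mapping is passed in, the proxies are read from the environment variables named after the protocol, such as http_proxy

HTTPBasicAuthHandler: handles authentication; if opening a link requires HTTP basic authentication, this class can supply the credentials

A related helper class that is not itself a Handler:

HTTPPasswordMgr: manages passwords; it maintains a table of user names and passwords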

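OpenerDirector is the high-level class that opens URLs. It does so in three stages, and within each stage the order in which handler methods are called is determined by sorting the handler instances:

Every handler with a method named protocol_request() has that method called to pre-process the request.

Handlers with a method named protocol_open() are called to handle the request.

Every handler with a method named protocol_response() has that method called to post-process the response.

Here protocol stands for the URL scheme, so for HTTP the methods are http_request(), http_open(), and http_response().

The urlopen() function used earlier is itself backed by an Opener that urllib provides. By building an Opener from Handlers you can add cookie handling, proxy settings, password handling, and so on.

Methods of an Opener include:

add_handler(handler): adds a handler to the chain

open(url,data=None[,timeout]): opens the given URL; the arguments and behavior are the same as for urlopen()

error(proto,*args): handles an error of the given protocol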
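The three stages can be observed with a small custom handler. The sketch below is an illustration only: DemoHandler and the made-up demo:// scheme are hypothetical, chosen so that the example runs without any network access. The handler records each stage as OpenerDirector calls it.

```python
import io
from email.message import Message
from urllib.request import BaseHandler, OpenerDirector
from urllib.response import addinfourl


class DemoHandler(BaseHandler):
    """Hypothetical handler for a made-up 'demo' scheme; records each stage it goes through."""

    def __init__(self):
        self.calls = []

    def demo_request(self, req):              # stage 1: pre-process the request
        self.calls.append('request')
        req.add_header('X-Demo', '1')
        return req

    def demo_open(self, req):                 # stage 2: produce a response
        self.calls.append('open')
        body = ('hello from ' + req.selector).encode('utf-8')
        return addinfourl(io.BytesIO(body), Message(), req.full_url, 200)

    def demo_response(self, req, response):   # stage 3: post-process the response
        self.calls.append('response')
        return response


handler = DemoHandler()
opener = OpenerDirector()
opener.add_handler(handler)   # methods are registered by their scheme_stage names

response = opener.open('demo://example/path')
body = response.read()
print(handler.calls)
print(body)
```

In real code build_opener() is usually preferable to a bare OpenerDirector, because it also installs the default handlers for HTTP, redirects, and errors.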


Password authentication:

#!/usr/bin/env python
#coding:utf8
from urllib.request import HTTPPasswordMgrWithDefaultRealm,HTTPBasicAuthHandler,build_opener
from urllib.error import URLError

username='username'
password='password'
url='http://localhost'
p=HTTPPasswordMgrWithDefaultRealm() # create a password manager
p.add_password(None,url,username,password) # register the user name and password for this URL
auth_handler=HTTPBasicAuthHandler(p) # build an authentication handler from the password manager
opener=build_opener(auth_handler)  # build an Opener
try:
    result=opener.open(url)  # open the link; authentication happens automatically and the result is the page behind it
    html=result.read().decode('utf-8')
    print(html)
except URLError as e:
    print(e.reason)
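How the password manager decides which credentials to use can be seen without a server. In this sketch the realm name and the example.com URL are placeholders; passing None as the realm registers default credentials that apply to any realm under the given URL.

```python
from urllib.request import HTTPPasswordMgrWithDefaultRealm

manager = HTTPPasswordMgrWithDefaultRealm()
# realm=None registers default credentials for every realm under this URL
manager.add_password(None, 'http://localhost', 'username', 'password')

# Any realm name and any path below the registered URL resolves to the stored pair
found = manager.find_user_password('Some Realm', 'http://localhost/private/page')
# A different host has no credentials registered
missing = manager.find_user_password('Some Realm', 'http://example.com/')
print(found, missing)
```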

Proxy settings:

#!/usr/bin/env python
#coding:utf8
from urllib.error import URLError
from urllib.request import ProxyHandler,build_opener

proxy_handler=ProxyHandler({
    'http':'http://127.0.0.1:8888',
    'https':'http://127.0.0.1:9999'
})
opener=build_opener(proxy_handler) # build an Opener
try:
    response=opener.open('https://www.baidu.com')
    print(response.read().decode('utf-8'))
except URLError as e:
    print(e.reason)

Cookies:

Fetching a site's cookies

#!/usr/bin/env python
#coding:utf8
import http.cookiejar,urllib.request
cookie=http.cookiejar.CookieJar() # create a CookieJar
handler=urllib.request.HTTPCookieProcessor(cookie) # build a handler
opener=urllib.request.build_opener(handler) # build an Opener
response=opener.open('http://www.baidu.com') # send the request
print(cookie)
for item in cookie:
    print(item.name+"="+item.value)

Saving cookies to a file in the Mozilla browser format:

#!/usr/bin/env python
#coding:utf8
import http.cookiejar,urllib.request
filename='cookies.txt'
cookie=http.cookiejar.MozillaCookieJar(filename=filename) # a jar that saves cookies in the Mozilla browser file format
#cookie=http.cookiejar.CookieJar() # plain in-memory CookieJar
handler=urllib.request.HTTPCookieProcessor(cookie) # build a handler
opener=urllib.request.build_opener(handler) # build an Opener
response=opener.open('http://www.baidu.com') # send the request
cookie.save(ignore_discard=True,ignore_expires=True)

Cookies can also be saved in the libwww-perl (LWP) format:

cookie=http.cookiejar.LWPCookieJar(filename=filename)

Reading cookies from a file:

#!/usr/bin/env python
#coding:utf8
import http.cookiejar,urllib.request
#filename='cookiesLWP.txt'
#cookie=http.cookiejar.MozillaCookieJar(filename=filename) # a jar that uses the Mozilla browser file format
#cookie=http.cookiejar.LWPCookieJar(filename=filename) # a jar that uses the LWP format
#cookie=http.cookiejar.CookieJar() # plain in-memory CookieJar
cookie=http.cookiejar.LWPCookieJar()
cookie.load('cookiesLWP.txt',ignore_discard=True,ignore_expires=True)

handler=urllib.request.HTTPCookieProcessor(cookie) # build a handler
opener=urllib.request.build_opener(handler) # build an Opener
response=opener.open('http://www.baidu.com') # send the request
print(response.read().decode('utf-8'))
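The save and load steps can be tried offline by putting a hand-made cookie into the jar. This is a sketch under stated assumptions: make_cookie is a hypothetical helper, the cookie values and example.com are invented, and a temporary directory stands in for the working directory.

```python
import os
import tempfile
from http.cookiejar import Cookie, MozillaCookieJar


def make_cookie(name, value, domain):
    """Build a session cookie by hand (a real crawler would receive these from responses)."""
    return Cookie(
        version=0, name=name, value=value,
        port=None, port_specified=False,
        domain=domain, domain_specified=False, domain_initial_dot=False,
        path='/', path_specified=True,
        secure=False, expires=None, discard=True,
        comment=None, comment_url=None, rest={},
    )


path = os.path.join(tempfile.mkdtemp(), 'cookies.txt')

jar = MozillaCookieJar(path)
jar.set_cookie(make_cookie('session', 'abc123', 'example.com'))
# ignore_discard=True keeps session cookies, which would otherwise be skipped
jar.save(ignore_discard=True, ignore_expires=True)

restored = MozillaCookieJar()
restored.load(path, ignore_discard=True, ignore_expires=True)
pairs = [(c.name, c.value, c.domain) for c in restored]
print(pairs)
```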

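4. Exception handling

The error module of urllib defines the exceptions raised by the request module; when something goes wrong, request raises one of them.

1) URLError

URLError comes from urllib's error module. It inherits from OSError and is the base class of the error module, so any exception produced by the request module can be handled by catching it.

It has one attribute, reason, which gives the cause of the error.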

#!/usr/bin/env python
#coding:utf8
from urllib import request,error

try:
    response=request.urlopen('https://hehe,com/index')
except error.URLError as e:
    print(e.reason)  # the exception is caught, so the program prints the cause of the error instead of crashing

reason is not always a string: on a timeout, for example, it is an exception object (socket.timeout):

#!/usr/bin/env python
#coding:utf8

import socket
import urllib.request
import urllib.error
try:
    response=urllib.request.urlopen('https://www.baidu.com',timeout=0.001)
except urllib.error.URLError as e:
    print(e.reason)
    if isinstance(e.reason,socket.timeout):
        print('time out')

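2) HTTPError

A subclass of URLError dedicated to HTTP errors, such as a failed authentication request. It has three attributes:

code: the HTTP status code, e.g. 404 for page not found or 500 for an internal server error

reason: as in the parent class, the cause of the error

headers: the response headers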


#!/usr/bin/env python
#coding:utf8
from urllib import request,error

try:
    response=request.urlopen('http://cuiqingcai.com/index.htm')
except error.HTTPError as e:  # catch the subclass first
    print(e.reason,e.code,e.headers,sep='\n')
except error.URLError as e:  # then the parent class
    print(e.reason)
else:
    print('request successfully')
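Why the subclass has to be caught first can be shown without a network connection. In the sketch below the exceptions are constructed by hand for demonstration (describe is a hypothetical helper and example.com a placeholder); in real code they would be raised by urlopen().

```python
from urllib.error import HTTPError, URLError


def describe(exc):
    """Classify an exception with the same except ordering as the example above."""
    try:
        raise exc
    except HTTPError as e:      # subclass first: has code, reason, headers
        return 'HTTP error %d: %s' % (e.code, e.reason)
    except URLError as e:       # parent class: only reason
        return 'URL error: %s' % e.reason


# Exceptions built by hand for the demonstration
not_found = HTTPError('http://example.com/missing', 404, 'Not Found', None, None)
unreachable = URLError('timed out')

print(describe(not_found))
print(describe(unreachable))
print(issubclass(HTTPError, URLError), issubclass(URLError, OSError))
```

If the two except clauses were swapped, the URLError clause would also match HTTPError and the status code would never be reported.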

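5. Parsing URLs

The parse module of urllib defines a standard interface for handling URLs: extracting their components, joining them, and converting links. It supports URLs of the following schemes: file,ftp,gopher,hdl,http,https,imap,mailto,mms,news,nntp,prospero,rsync,rtsp,rtspu,sftp,sip,sips,snews,svn,svn+ssh,telnet,wais

urllib.parse.urlparse(urlstring,scheme='',allow_fragments=True)

As the signature shows, urlparse takes three parameters:

urlstring: the URL to parse, a string

scheme: the default scheme, such as http or https. It is used only when the URL itself carries no scheme; a scheme given in the URL takes precedence

allow_fragments: whether fragments (anchors) are recognized. If set to False, the fragment is not split out: it stays part of the path, params, or query, and the fragment field of the result is empty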


1) urlparse()

This function identifies a URL and splits it into six parts: scheme, netloc (network location, i.e. the domain), path, params (parameters), query, and fragment (anchor)

#!/usr/bin/env python
#coding:utf8
from urllib.parse import urlparse
result=urlparse('http://www.baidu.com/index.html;user?id=5#comment')
print(type(result),result,sep='\n')  # the result is a named tuple
print(result.scheme,result[0])  # values are available by attribute or by index
print(result.netloc,result[1])
print(result.path,result[2])
print(result.params,result[3])
print(result.query,result[4])
print(result.fragment,result[5])

#output
# The result is a ParseResult object with six parts:
# scheme, netloc (domain), path, params (parameters), query, and fragment (anchor)

<class 'urllib.parse.ParseResult'>
ParseResult(scheme='http', netloc='www.baidu.com', path='/index.html', params='user', query='id=5', fragment='comment')
http http
www.baidu.com www.baidu.com
/index.html /index.html
user user
id=5 id=5
comment comment

Specifying a default scheme and, with allow_fragments=False, turning off fragment parsing:

from urllib.parse import urlparse
result=urlparse('www.baidu.com/index.html;user?id=5#comment',scheme='https',allow_fragments=False)
print(result) 

#output
ParseResult(scheme='https', netloc='', path='www.baidu.com/index.html', params='user', query='id=5#comment', fragment='')
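The empty netloc in the output above follows from a rule of urlparse: the network location is recognized only when it is introduced by //. A short sketch showing this, together with the precedence of a scheme in the URL over the scheme argument:

```python
from urllib.parse import urlparse

# Without a leading '//' the host name is treated as part of the path
no_slashes = urlparse('www.baidu.com/index.html', scheme='https')
# With '//' the host is recognized as netloc and the default scheme is applied
with_slashes = urlparse('//www.baidu.com/index.html', scheme='https')
# A scheme in the URL itself wins over the scheme argument
explicit = urlparse('http://www.baidu.com/index.html', scheme='https')

print(no_slashes.netloc, no_slashes.path)
print(with_slashes.scheme, with_slashes.netloc, with_slashes.path)
print(explicit.scheme)
```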

2) urlunparse()

The opposite of urlparse(): it takes an iterable such as a list or tuple, which must have exactly six items, and builds a URL from it

#!/usr/bin/env python
#coding:utf8
from urllib.parse import urlunparse
data=['http','www.baidu.com','index.html','user','a=6','comment']
print(urlunparse(data)) # build the complete URL

#output
http://www.baidu.com/index.html;user?a=6#comment

3) urlsplit()

Similar to urlparse(), but it returns five parts: params is not split out and stays in path

#!/usr/bin/env python
#coding:utf8
from urllib.parse import urlsplit
result=urlsplit('http://www.baidu.com/index.html;user?id=5#comment')
print(result)

#output
SplitResult(scheme='http', netloc='www.baidu.com', path='/index.html;user', query='id=5', fragment='comment')

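4) urlunsplit()

Like urlunparse(), it combines the parts of a link into a complete URL. The argument is again an iterable such as a list or tuple; the only difference is that it must have exactly five items, because params is omitted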

#!/usr/bin/env python
#coding:utf8
from urllib.parse import urlsplit,urlunsplit
data=['http','www.baidu.com','index.html','a=5','comment']
result=urlunsplit(data)
print(result)

#output
http://www.baidu.com/index.html?a=5#comment
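Splitting and joining are inverse operations, and because the split result is a named tuple, a single component can be swapped before the URL is put back together. A minimal sketch using the same URL as the urlsplit() example:

```python
from urllib.parse import urlsplit, urlunsplit

url = 'http://www.baidu.com/index.html;user?id=5#comment'
parts = urlsplit(url)

# Splitting and re-joining gives back the original URL
round_trip = urlunsplit(parts)

# The result is a named tuple, so _replace() returns a modified copy
changed = parts._replace(query='id=9', fragment='').geturl()

print(round_trip == url)
print(changed)
```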

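5) urljoin()

Builds a complete URL by combining a base URL (base) with another URL (url). The components of the base URL, namely scheme, netloc, and path, fill in whatever is missing from url, and the result is returned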

#!/usr/bin/env python
#coding:utf8
from urllib.parse import urljoin

print(urljoin('http://www.baidu.com','index.html'))
print(urljoin('http://www.baidu.com','http://cdblogs.com/index.html'))
print(urljoin('http://www.baidu.com/home.html','https://cnblog.com/index.html'))
print(urljoin('http://www.baidu.com?id=3','https://cnblog.com/index.html?id=6'))
print(urljoin('http://www.baidu.com','?id=2#comment'))
print(urljoin('www.baidu.com','https://cnblog.com/index.html?id=6'))

#output
http://www.baidu.com/index.html
http://cdblogs.com/index.html
https://cnblog.com/index.html
https://cnblog.com/index.html?id=6
http://www.baidu.com?id=2#comment
https://cnblog.com/index.html?id=6

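The base URL contributes three things: scheme, netloc, and path. Any of them that the new link lacks is filled in from the base; any that the new link has is taken from the new link. The params, query, and fragment of the base URL have no effect. urljoin() can thus be used to resolve, join, and generate links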
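In a crawler, urljoin() is mostly used to resolve relative links found on a page. The sketch below assumes a made-up page URL as the base and shows the common kinds of relative reference; cdn.example.com is a placeholder host.

```python
from urllib.parse import urljoin

base = 'http://www.baidu.com/a/b/c.html?id=3#top'

sibling = urljoin(base, 'd.html')       # replaces the last path segment
parent = urljoin(base, '../d.html')     # '..' climbs one directory
rooted = urljoin(base, '/d.html')       # a leading '/' restarts at the site root
other_host = urljoin(base, '//cdn.example.com/x.js')  # '//' keeps only the scheme

for link in (sibling, parent, rooted, other_host):
    print(link)
```

In every case the query and fragment of the base are dropped.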

6) urlencode()

urlencode() is handy for building GET parameters: it turns a dict into a query string

#!/usr/bin/env python
#coding:utf8
from urllib.parse import urlencode
params = {'username':'zs','password':'123'}
base_url='http://www.baidu.com'
url=base_url+'?'+urlencode(params) # turn the dict into GET parameters
print(url)

#output
http://www.baidu.com?username=zs&password=123
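urlencode() also escapes special characters and can expand list values. A short sketch with made-up parameter names:

```python
from urllib.parse import quote, urlencode

plain = urlencode({'q': 'a b&c'})                     # spaces become '+', '&' is escaped
percent = urlencode({'q': 'a b&c'}, quote_via=quote)  # spaces become '%20' instead
multi = urlencode({'tag': ['x', 'y']}, doseq=True)    # one key=value pair per list item

print(plain)
print(percent)
print(multi)
```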

7) parse_qs()

parse_qs() is the reverse of urlencode(): it deserializes, turning GET parameters back into a dict

#!/usr/bin/env python
#coding:utf8
from urllib.parse import urlencode,parse_qs,urlsplit
params = {'username':'zs','password':'123'}
base_url='http://www.baidu.com'
url=base_url+'?'+urlencode(params) # turn the dict into GET parameters

query=urlsplit(url).query  # get the query part of the URL
print(parse_qs(query))  # convert the GET parameters back into a dict

#output
{'username': ['zs'], 'password': ['123']}

8) parse_qsl(): converts the parameters into a list of tuples

#!/usr/bin/env python
#coding:utf8
from urllib.parse import urlencode,urlsplit,parse_qsl

params = {'username':'zs','password':'123'}
base_url='http://www.baidu.com'
url=base_url+'?'+urlencode(params) # turn the dict into GET parameters

query=urlsplit(url).query  # get the query part of the URL
print(parse_qsl(query)) # convert the parameters into a list of (name, value) tuples

#output
[('username', 'zs'), ('password', '123')]
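The values in the parse_qs() result are lists because a key may occur more than once in a query string. The sketch below, using an invented query string, shows repeated keys, blank values, and a way to get a flat dict when every key is unique.

```python
from urllib.parse import parse_qs, parse_qsl

query = 'tag=x&tag=y&empty='

as_dict = parse_qs(query)                             # blank values are dropped by default
with_blank = parse_qs(query, keep_blank_values=True)
as_pairs = parse_qsl(query, keep_blank_values=True)   # keeps order and duplicates
flat = dict(parse_qsl('username=zs&password=123'))    # plain dict when keys are unique

print(as_dict)
print(with_blank)
print(as_pairs)
print(flat)
```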

9) quote(): converts text to URL-encoded form. Non-ASCII characters such as Chinese in a parameter can end up garbled; quote() avoids this by percent-encoding them

#!/usr/bin/env python
#coding:utf8
from urllib.parse import quote
key='中文'
url='https://www.baidu.com/s?key='+quote(key)
print(url)

#output
https://www.baidu.com/s?key=%E4%B8%AD%E6%96%87

10) unquote(): the opposite of quote(); it decodes URL-encoded text

#!/usr/bin/env python
#coding:utf8
from urllib.parse import quote,urlsplit,unquote
key='中文'
url='https://www.baidu.com/s?key='+quote(key)
print(url)
unq=urlsplit(url).query.split('=')[1] # get the parameter value

print(unquote(unq))  # decode the parameter value
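quote() leaves the slash unescaped by default, which matters when the text is a query value rather than a path. A brief sketch of the safe parameter and of the plus-style variants used for form data:

```python
from urllib.parse import quote, quote_plus, unquote, unquote_plus

default = quote('a b/c')            # '/' is treated as safe by default
strict = quote('a b/c', safe='')    # escape '/' as well
plus = quote_plus('a b/c')          # form style: space becomes '+'
decoded = unquote('%E4%B8%AD%E6%96%87')
decoded_plus = unquote_plus('a+b%2Fc')

print(default, strict, plus)
print(decoded, decoded_plus)
```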

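6. Analyzing the Robots protocol

With urllib's robotparser module we can analyze a site's Robots protocol

1) The Robots protocol

The Robots protocol, also called the crawler protocol or robot protocol, is formally named the Robots Exclusion Protocol. It tells crawlers and search engines which pages may be fetched and which may not. It usually takes the form of a text file named robots.txt placed in the site's root directory.

When a search crawler visits a site, it first checks whether robots.txt exists in the site root. If it does, the crawler limits itself to the crawl scope defined there; if it does not, the crawler visits every page it can reach directly

Here is a sample robots.txt: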

User-agent: *
Disallow: /
Allow: /public/

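It allows all search crawlers to fetch only the public directory. Save the content above as robots.txt in the site root, next to the site's entry file (index.html)

User-agent names the search crawler. Setting it to * makes the rules apply to every crawler; setting it to Baiduspider makes them apply to Baidu's crawler. Several User-agent lines restrict several crawlers, and at least one must be given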

Names of some common search crawlers:

BaiduSpider: Baidu's crawler, www.baidu.com

Googlebot: Google's crawler, www.google.com

360Spider: 360's crawler, www.so.com

YodaoBot: Youdao's crawler, www.youdao.com

ia_archiver: Alexa's crawler, www.alexa.cn

Scooter: AltaVista's crawler, www.altavista.com

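Disallow specifies directories that must not be fetched; the / in the example above forbids fetching any page

Allow is normally used together with Disallow to carve out exceptions to a restriction; /public/ in the example above means that no page may be fetched except those in the public directory

Configuration examples: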

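# Block all crawlers
User-agent: *
Disallow: /

# Allow all crawlers to access every directory (an empty file has the same effect)
User-agent: *
Disallow:

# Block all crawlers from certain directories
User-agent: *
Disallow: /home/
Disallow: /tmp/

# Allow only one crawler
User-agent: BaiduSpider
Disallow:
User-agent: *
Disallow: /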

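2) robotparser

The robotparser module parses robots.txt. It provides one class, RobotFileParser, which uses a site's robots.txt to decide whether a given crawler is allowed to fetch a given page

urllib.robotparser.RobotFileParser(url='')

Commonly used methods of RobotFileParser:

set_url(): sets the URL of the robots.txt file; not needed if the URL was passed in when the RobotFileParser object was created

read(): fetches the robots.txt file and parses it. It returns nothing, but the fetching and parsing do take place

parse(): parses robots.txt content; the argument is a list of lines from the file, which are analyzed according to the robots.txt syntax rules

can_fetch(): takes two arguments, the User-agent and the URL to fetch, and returns True or False depending on whether that crawler may fetch the URL

mtime(): returns the time robots.txt was last fetched and parsed

modified(): sets the time robots.txt was last fetched and parsed to the current time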

#!/usr/bin/env python
#coding:utf8
from urllib.robotparser import RobotFileParser
rp = RobotFileParser()  # create the parser
rp.set_url('https://www.cnblogs.com/robots.txt') # set the robots.txt URL; it can also be passed to the constructor
rp.read()  # fetch and parse the file
print(rp.can_fetch('*','https://i.cnblogs.com/EditPosts.aspx?postid=9170312&update=1')) # check whether the link may be fetched
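Rules can also be tested offline by handing robots.txt content straight to parse(). The sketch below assumes a hypothetical helper parser_for and a placeholder example.com URL. One point worth knowing: Python's RobotFileParser applies the first rule that matches, in file order, so an Allow exception only takes effect if it is listed before the broader Disallow.

```python
from urllib.robotparser import RobotFileParser


def parser_for(text):
    """Build a RobotFileParser from robots.txt content held in a string."""
    rp = RobotFileParser()
    rp.parse(text.splitlines())
    return rp


# Allow listed before Disallow, because the first matching rule wins
allow_first = parser_for('User-agent: *\nAllow: /public/\nDisallow: /')
# The same rules with Disallow first: the Allow line is never reached
disallow_first = parser_for('User-agent: *\nDisallow: /\nAllow: /public/')
# Only BaiduSpider is allowed
one_crawler = parser_for('User-agent: BaiduSpider\nDisallow:\n\nUser-agent: *\nDisallow: /')

page = 'http://example.com/public/index.html'   # placeholder URL
print(allow_first.can_fetch('*', page))
print(disallow_first.can_fetch('*', page))
print(one_crawler.can_fetch('BaiduSpider', page), one_crawler.can_fetch('Googlebot', page))
```

Crawlers that pick the most specific matching rule instead would allow /public/ in both orderings, so the result of this parser is not necessarily what a given search engine does.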

Reposted from: https://www.cnblogs.com/zhangxinqi/p/9170312.html
