Data Acquisition: Web Crawling, Part 2 (A Walkthrough of the urllib Package)

Reposted article, 2019-08-18 19:51. Tags: crawler / urllib / full workflow

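The urllib Library

urllib is Python's built-in HTTP request library and is used to send requests. It consists of the following modules:

urllib.request: the request module, which simulates the process of opening a web page.
urllib.error: the exception-handling module, used to catch and handle the errors raised by requests.
urllib.parse: the parsing module, which provides many utilities for working with URLs.
urllib.robotparser: a robots.txt parser, used to decide whether a given URL may be crawled.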
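
Of the four modules, urllib.robotparser is the only one not demonstrated later in this article, so here is a minimal sketch of it. The robots.txt content and the example.com URLs are made up for illustration, and the rules are parsed from a string so that nothing is fetched from the network.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, supplied inline so no network access is needed.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def build_parser(robots_text):
    """Return a RobotFileParser loaded from a robots.txt string."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)
allowed = parser.can_fetch("*", "http://example.com/index.html")
blocked = parser.can_fetch("*", "http://example.com/private/data.html")
print(allowed, blocked)
```

Against a real site you would normally call set_url() with the site's robots.txt address and then read() instead of parse().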

Request

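Although urllib ships with Python, it still has to be imported. Once imported, urllib.request.urlopen() can be called to send a request to a server. When the request carries data it is sent as a POST; otherwise it is a GET. The code is as follows: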

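# Signature of urlopen; the first three parameters are the ones used most often:
# urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)

# GET request
import urllib.request  # import the library

response = urllib.request.urlopen('http://www.baidu.com')  # send the request
# Print the response body. If the common encodings fail to decode the page,
# view the page source: the charset attribute near the top of <head> may name the encoding.
print(response.read().decode('utf-8'))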

# POST request
import urllib.parse
import urllib.request

data = bytes(urllib.parse.urlencode({'word': 'hello'}), encoding='utf8')
# A POST request differs from a GET request in that it carries a data payload
response = urllib.request.urlopen('http://httpbin.org/post', data=data)
print(response.read())

# Setting a timeout
import socket
import urllib.request
import urllib.error

try:
    response = urllib.request.urlopen('http://httpbin.org/get', timeout=0.1)  # give up after 0.1 s
except urllib.error.URLError as e:
    if isinstance(e.reason, socket.timeout):  # check what caused the error
        print('TIME OUT')
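
The rule that the presence of data decides between GET and POST can be checked without sending anything. The sketch below builds two Request objects and asks each for its method; the URL is only a placeholder and is never contacted.

```python
from urllib import parse, request

URL = "http://httpbin.org/anything"  # placeholder; nothing is sent in this sketch

# No data: urllib treats the request as a GET.
get_req = request.Request(URL)

# With data (which must be bytes): urllib treats the request as a POST.
payload = parse.urlencode({"word": "hello"}).encode("utf-8")
post_req = request.Request(URL, data=payload)

print(get_req.get_method())
print(post_req.get_method())
print(payload)
```

Note that data has to be bytes, which is why the form fields are first URL-encoded into a string and then encoded to UTF-8.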

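urlopen() can send a request, but it offers no direct way to configure anything further, such as request headers. In that case, create a Request object first, fill in the relevant information, and then pass the Request object to urlopen().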

from urllib import request, parse  # import the modules

url = 'http://httpbin.org/post'  # target URL
headers = {
    'User-Agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)',
    'Host': 'httpbin.org'
}  # request headers

form_data = {
    'name': 'Germey'
}  # form data (named form_data so the built-in dict is not shadowed)

data = bytes(parse.urlencode(form_data), encoding='utf8')  # encode the form data as bytes
req = request.Request(url=url, data=data, headers=headers, method='POST')  # build the Request object
# If a header was not passed to the constructor, it can be added afterwards with add_header
req.add_header('User-Agent', 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)')
response = request.urlopen(req)  # pass the Request to urlopen
print(response.read().decode('utf-8'))  # print the response body
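
Before sending a Request it can be useful to inspect what headers it actually carries. The following is a small offline sketch; the User-Agent string "demo-crawler/0.1" is invented for the example. One detail worth knowing is that urllib stores header names in capitalized form ("User-agent"), so lookups have to use that spelling.

```python
from urllib import request

req = request.Request(
    "http://httpbin.org/get",
    headers={"User-Agent": "demo-crawler/0.1"},  # hypothetical User-Agent value
)
req.add_header("Accept-Language", "en")  # added after construction

# Header names are stored via str.capitalize(), e.g. "User-agent".
user_agent = req.get_header("User-agent")
has_language = req.has_header("Accept-language")
original_spelling = req.get_header("User-Agent")  # not found under this spelling

print(req.header_items())
print(user_agent, has_language, original_spelling)
```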

Response

For the response returned by the server, we can get its type, status code, and response headers.

import urllib.request

response = urllib.request.urlopen('https://www.python.org')
print(type(response))  # type of the response object
print(response.status)  # status code
print(response.getheaders())  # all response headers
print(response.getheader('Server'))  # a single response header
print(response.read().decode('utf-8'))  # response body
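
Hard-coding 'utf-8' fails on pages served in another encoding. One possible approach, sketched below, is to read the charset from the Content-Type response header and fall back to UTF-8 when none is declared. The helper name decode_body is an assumption of this sketch, not part of urllib, and a data: URL is used so the example runs without a network connection.

```python
import urllib.request


def decode_body(response, fallback="utf-8"):
    """Decode a response body using the charset named in its Content-Type header.

    decode_body is a hypothetical helper; fallback is used when the
    header does not declare a charset.
    """
    charset = response.headers.get_content_charset() or fallback
    return response.read().decode(charset, errors="replace")


# A data: URL is handled by urlopen locally, so no server is contacted.
with urllib.request.urlopen("data:text/plain;charset=utf-8,hello%20urllib") as resp:
    text = decode_body(resp)

print(text)
```

This only covers the encoding declared in the HTTP header; an encoding declared solely in the page's HTML would still need to be read from the source, as described earlier.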

Handler

Beyond plain requests, urllib offers many additional features, which are usually implemented through handlers.

proxy

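To use a proxy, first create a ProxyHandler, build an opener from it, and then open the URL with the opener's open() method. The urlopen() function used above works the same way internally: it builds an opener and calls open() on it.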

import urllib.request

proxy_handler = urllib.request.ProxyHandler({
    'http': 'http://127.0.0.1:9743',
    'https': 'https://127.0.0.1:9743'
})
opener = urllib.request.build_opener(proxy_handler)
response = opener.open('http://httpbin.org/get')
print(response.read())
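
The handler-to-opener step can be wrapped in a small function and inspected without making a request. This is a sketch only: make_proxy_opener is a hypothetical helper name, and the proxy address is the same local placeholder used above, which is assumed rather than known to be running.

```python
import urllib.request


def make_proxy_opener(proxy_url):
    """Build an opener that routes http and https traffic through proxy_url."""
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    return urllib.request.build_opener(handler)


opener = make_proxy_opener("http://127.0.0.1:9743")  # placeholder proxy address

# build_opener replaces the default ProxyHandler with the one passed in.
proxy_handlers = [
    h for h in opener.handlers if isinstance(h, urllib.request.ProxyHandler)
]
print(proxy_handlers[0].proxies)
```

Calling urllib.request.install_opener(opener) afterwards would make plain urlopen() calls go through the same opener.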

Cookies maintain login state and are needed when crawling sites that require a login. Common patterns for working with cookies are shown below:

import http.cookiejar, urllib.request

cookie = http.cookiejar.CookieJar()  # create a CookieJar object
handler = urllib.request.HTTPCookieProcessor(cookie)  # handler that processes cookies
opener = urllib.request.build_opener(handler)  # build the opener
response = opener.open('http://www.baidu.com')  # open the page
for item in cookie:
    print(item.name + "=" + item.value)  # print each cookie as name=value
	
import http.cookiejar, urllib.request
filename = "cookie.txt"
cookie = http.cookiejar.MozillaCookieJar(filename)  # store cookies in the Mozilla (Firefox) cookies.txt format
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
cookie.save(ignore_discard=True, ignore_expires=True)  # save to file

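import http.cookiejar, urllib.request
# LWPCookieJar uses a different file format (libwww-perl Set-Cookie3), so cookie.txt here
# must have been saved by LWPCookieJar. A file saved by MozillaCookieJar has to be
# loaded with MozillaCookieJar instead.
cookie = http.cookiejar.LWPCookieJar()
cookie.load('cookie.txt', ignore_discard=True, ignore_expires=True)  # load cookies from file
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
print(response.read().decode('utf-8'))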
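
Because the two file-based jars use different formats, a file has to be loaded with the same class that saved it. The sketch below illustrates this offline: it builds one cookie by hand (the name, value, and domain are invented), saves it in Mozilla format to a temporary file, loads it back, and then shows that LWPCookieJar rejects the same file. make_cookie is a hypothetical helper.

```python
import http.cookiejar
import os
import tempfile


def make_cookie(name, value, domain="example.com"):
    """Build a session cookie by hand (hypothetical values) so no request is needed."""
    return http.cookiejar.Cookie(
        version=0, name=name, value=value,
        port=None, port_specified=False,
        domain=domain, domain_specified=False, domain_initial_dot=False,
        path="/", path_specified=True,
        secure=False, expires=None, discard=True,
        comment=None, comment_url=None, rest={},
    )


path = os.path.join(tempfile.mkdtemp(), "cookie.txt")

# Save in Mozilla format; ignore_discard keeps session cookies in the file.
saved = http.cookiejar.MozillaCookieJar(path)
saved.set_cookie(make_cookie("session", "abc123"))
saved.save(ignore_discard=True, ignore_expires=True)

# Loading with the same class works.
loaded = http.cookiejar.MozillaCookieJar()
loaded.load(path, ignore_discard=True, ignore_expires=True)
names = [c.name for c in loaded]

# Loading the Mozilla-format file with LWPCookieJar raises LoadError.
try:
    http.cookiejar.LWPCookieJar().load(path, ignore_discard=True, ignore_expires=True)
    mismatch_error = None
except http.cookiejar.LoadError as exc:
    mismatch_error = exc

print(names, type(mismatch_error).__name__)
```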

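Exception Handling

urllib.error provides two main exception classes, URLError and its subclass HTTPError. try/except is commonly used to catch them and determine the type of error.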

from urllib import request, error
try:
    response = request.urlopen('http://cuiqingcai.com/index.htm')
except error.URLError as e:
    print(e.reason)


from urllib import request, error

try:
    response = request.urlopen('http://cuiqingcai.com/index.htm')
except error.HTTPError as e:
    print(e.reason, e.code, e.headers, sep='\n')
except error.URLError as e:
    print(e.reason)
else:
    print('Request Successfully')
	
import socket
import urllib.request
import urllib.error

try:
    response = urllib.request.urlopen('https://www.baidu.com', timeout=0.01)
except urllib.error.URLError as e:
    print(type(e.reason))
    if isinstance(e.reason, socket.timeout):
        print('TIME OUT')
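
The three snippets above can be folded into one classification routine. The sketch below is one way to do it; describe_error is a hypothetical helper, and the exception objects are constructed by hand so the example runs offline. HTTPError is tested first because it is a subclass of URLError and would otherwise be swallowed by the more general check.

```python
import socket
from urllib import error


def describe_error(exc):
    """Return a short description of a urllib exception (hypothetical helper)."""
    if isinstance(exc, error.HTTPError):  # subclass first, otherwise URLError matches
        return f"HTTP {exc.code}: {exc.reason}"
    if isinstance(exc, error.URLError):
        if isinstance(exc.reason, socket.timeout):
            return "timed out"
        return f"URL error: {exc.reason}"
    raise TypeError("not a urllib.error exception")


# Hand-built exceptions standing in for real failures.
samples = [
    error.HTTPError("http://example.com/missing", 404, "Not Found", None, None),
    error.URLError(socket.timeout("timed out")),
    error.URLError("no host given"),
]
messages = [describe_error(exc) for exc in samples]
for message in messages:
    print(message)
```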

URL Parsing

urllib.parse works like a toolbox that bundles many functions.

urlparse: splits a URL and returns a ParseResult object that holds each component of the URL.

# Parse a URL into its components
from urllib.parse import urlparse

result = urlparse('http://www.baidu.com/index.html;user?id=5#comment')
print(type(result), result)
# Supply a default scheme; it has no effect if the URL already contains one
from urllib.parse import urlparse

result = urlparse('www.baidu.com/index.html;user?id=5#comment', scheme='https')
print(result)
# Disabling a component changes how the URL is split
from urllib.parse import urlparse

result = urlparse('http://www.baidu.com/index.html;user?id=5#comment', allow_fragments=False)  # do not split off the fragment; it stays attached to the preceding component
print(result)
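
The ParseResult is a named tuple, so each component can be read by attribute. The sketch below also illustrates a common pitfall with the scheme argument: without a leading "//", the host name is treated as part of the path and netloc stays empty.

```python
from urllib.parse import urlparse

full = urlparse("http://www.baidu.com/index.html;user?id=5#comment")
print(full.scheme, full.netloc, full.path, full.params, full.query, full.fragment)

# Without "//" the host ends up in path and netloc is empty.
bare = urlparse("www.baidu.com/index.html", scheme="https")
print(repr(bare.netloc), bare.path)

# With a leading "//" the host is recognised and the default scheme applies.
with_slashes = urlparse("//www.baidu.com/index.html", scheme="https")
print(with_slashes.scheme, with_slashes.netloc, with_slashes.path)

# With allow_fragments=False the "#comment" part stays in the query.
no_frag = urlparse("http://www.baidu.com/index.html?id=5#comment", allow_fragments=False)
print(no_frag.query, repr(no_frag.fragment))
```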

urlunparse: assembles a URL from its six components.

from urllib.parse import urlunparse

data = ['http', 'www.baidu.com', 'index.html', 'user', 'a=6', 'comment']
print(urlunparse(data))
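
urlunparse is the inverse of urlparse, which makes it convenient for changing a single component of an existing URL. A brief sketch using the same example URL as above:

```python
from urllib.parse import urlparse, urlunparse

original = "http://www.baidu.com/index.html;user?id=5#comment"
parts = urlparse(original)

# Round trip: parsing and reassembling gives back the same URL.
rebuilt = urlunparse(parts)

# ParseResult is a named tuple, so _replace swaps individual components.
updated = urlunparse(parts._replace(query="id=6", fragment=""))

print(rebuilt == original)
print(updated)
```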

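urljoin: combines two URLs. A URL can be split into six components. urljoin takes the second URL as the primary one and fills in whatever components it lacks (scheme, netloc, path) from the first (base) URL, producing a new URL.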

from urllib.parse import urljoin

print(urljoin('http://www.baidu.com', 'FAQ.html'))
print(urljoin('http://www.baidu.com', 'https://cuiqingcai.com/FAQ.html'))
print(urljoin('http://www.baidu.com/about.html', 'https://cuiqingcai.com/FAQ.html'))
print(urljoin('http://www.baidu.com/about.html', 'https://cuiqingcai.com/FAQ.html?question=2'))
print(urljoin('http://www.baidu.com?wd=abc', 'https://cuiqingcai.com/index.php'))
print(urljoin('http://www.baidu.com', '?category=2#comment'))
print(urljoin('www.baidu.com', '?category=2#comment'))
print(urljoin('www.baidu.com#comment', '?category=2'))
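
In a crawler, urljoin is typically used to turn the relative links found on a page into absolute URLs. The following sketch assumes a made-up page address and a handful of link forms:

```python
from urllib.parse import urljoin

BASE = "http://example.com/docs/guide/index.html"  # hypothetical page URL

links = [
    "intro.html",                   # relative to the current directory
    "../api.html",                  # one directory up
    "/about.html",                  # relative to the site root
    "//cdn.example.com/app.js",     # scheme-relative: only the scheme is inherited
    "https://other.example.org/x",  # already absolute: the base is ignored
]
resolved = [urljoin(BASE, link) for link in links]
for url in resolved:
    print(url)
```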

urlencode: converts a dictionary of parameters into a GET query string, producing a URL that can be used directly.

from urllib.parse import urlencode

params = {
    'name': 'germey',
    'age': 22
}
base_url = 'http://www.baidu.com?'
url = base_url + urlencode(params)  # build the full URL
print(url)
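
urlencode also takes care of escaping, and parse_qs reverses the process. A short sketch follows; the extra "q" parameter and the "/s" path are added here only to show how a space is encoded.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

params = {"name": "germey", "age": 22, "q": "hello world"}

# Values are converted to strings and escaped; a space becomes "+".
query = urlencode(params)
url = "http://www.baidu.com/s?" + query

# parse_qs decodes the query string back into a dict of lists of strings.
decoded = parse_qs(urlsplit(url).query)

print(query)
print(decoded)
```

Every value comes back as a list of strings because a query string may repeat the same key.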
