Python 3.x Crawler Basics: urllib in Detail

Reposted article. 2018-02-23 14:24. Tags: python, crawler

Python 3.x Crawler Basics

Python 3.x Crawler Basics: HTTP Headers in Detail

Python 3.x Crawler Basics: urllib in Detail

Python 3.x Crawler Basics: Requests, BeautifulSoup4 (bs4)

Python 3.x Crawler Basics: Regular Expressions

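Preface

I have been studying crawlers for a while now and hope to wrap that up within half a month, then move on to new territory in Python. Today I will roughly summarize urllib, one of the basic libraries for crawlers.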

Urllib

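Official documentation: https://docs.python.org/3/library/urllib.html

urllib provides a set of modules for working with URLs.

Python 3 merged Python 2.7's urllib and urllib2 packages into a single urllib package, which consists mainly of the following modules:

urllib.request: the request module

urllib.error: the exception handling module

urllib.parse: the URL parsing module

urllib.robotparser: the robots.txt parsing module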

urllib.request

urllib.request.urlopen

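urlopen returns a response object (an http.client.HTTPResponse for HTTP and HTTPS URLs). Calling read() on it returns the page content as bytes, and decode() then turns those bytes into a string of HTML.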
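A minimal sketch of that read-then-decode flow. The helper name fetch_text and the 10-second timeout are illustrative choices, not something from the original article.

```python
import urllib.request


def fetch_text(url, encoding="utf-8", timeout=10):
    """Open url, read the raw bytes and decode them into a str.

    fetch_text is a hypothetical helper name; the timeout value is arbitrary.
    """
    with urllib.request.urlopen(url, timeout=timeout) as response:
        raw = response.read()      # bytes
    return raw.decode(encoding)    # str


# urlopen also handles data: URLs, so this line runs without network access.
print(fetch_text("data:text/plain;charset=utf-8,hello%20urllib"))
```

For a real page the call would look like fetch_text('http://python.org/').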

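The signature of urlopen is:

urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)

Common parameters:

  url: the address to request. It can be a URL string or a Request object.

  data: optional. Note that if it is supplied, the request is sent as a POST. The value must be bytes; if you have a dictionary, convert it with urllib.parse.urlencode and encode the result: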

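import urllib.parse
import urllib.request

# urlencode turns the dict into 'word=hello'; encode() converts it to bytes
data = urllib.parse.urlencode({'word': 'hello'}).encode('utf-8')
# 'http://xxxxx' is the placeholder URL from the original; passing data makes this a POST
response = urllib.request.urlopen('http://xxxxx', data=data)
print(response.read().decode('utf-8'))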

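  timeout: the timeout for the request, in seconds.

Other parameters:

  context: must be an ssl.SSLContext instance; it specifies the SSL settings.

  cafile and capath: specify a CA certificate file and a directory of CA certificates, which is useful when requesting HTTPS links. Both were deprecated in Python 3.6 in favor of context and removed in Python 3.13.

  cadefault: deprecated (and likewise removed in Python 3.13); defaults to False.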
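As a small illustration of the context parameter, the sketch below builds an ssl.SSLContext with ssl.create_default_context() and passes it to urlopen. The helper name open_with_context is made up for this example, and the actual request needs network access, so it is left commented out.

```python
import ssl
import urllib.request

# create_default_context() returns an SSLContext that verifies certificates
# and host names against the system's trusted CA store.
context = ssl.create_default_context()


def open_with_context(url, timeout=10):
    """Hypothetical helper: open an HTTPS URL with an explicit SSL context."""
    return urllib.request.urlopen(url, timeout=timeout, context=context)


print(type(context).__name__, context.verify_mode == ssl.CERT_REQUIRED)
# response = open_with_context("https://www.python.org/")
```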

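Methods provided by the object that urlopen returns:

  read(), readline(), readlines(), fileno(), close(): operate on the HTTPResponse data.

  info(): returns an HTTPMessage object holding the headers sent back by the remote server.

  getcode(): returns the HTTP status code.

  geturl(): returns the URL of the resource actually retrieved, which can differ from the requested URL after a redirect.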

import urllib.request
response = urllib.request.urlopen('http://python.org/')
print("Type of the response:", type(response))
print("Response object: ", response)
print("Headers 1 (http header):\n", response.info())
print("Headers 2 (http header):\n", response.getheaders())
print("A single header value:", response.getheader("Server"))
print("Status 1 (http status):\n", response.status)
print("Status 2 (http status):\n", response.getcode())
print("Response URL:\n", response.geturl())
page = response.read()
print("Page source:", page.decode('utf-8'))

urllib.request.Request

import urllib.request
headers = {'Host': 'www.xicidaili.com',
           'User-Agent': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)',
           'Accept': r'application/json, text/javascript, */*; q=0.01',
           'Referer': r'http://www.xicidaili.com/', }
req = urllib.request.Request(r'http://www.xicidaili.com/nn/', headers=headers)
response = urllib.request.urlopen(req)
html = response.read().decode('utf-8')
print(html)

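As the code shows, urlopen is no longer passed a URL but a Request object. This not only makes the request a standalone object, it also lets us configure the request parameters more flexibly, which is an essential step for an HTTP crawler.

The signature of Request is:

urllib.request.Request(url, data=None, headers={}, origin_req_host=None, unverifiable=False, method=None)

Common parameters:

  url: the address to request.

  data: optional. The value must be bytes; if you have a dictionary, convert it with urllib.parse.urlencode and encode the result.

  headers: the HTTP request headers. They can be supplied in two ways: through the headers parameter of the constructor, or by calling the Request object's add_header() method. See the article "Python 3.x Crawler Basics: HTTP Headers in Detail" for reference.

Other parameters:

  origin_req_host: the host name or IP address of the requesting side.

  unverifiable: indicates whether the request is unverifiable; defaults to False. A request is unverifiable when the user had no option to approve it, for example an image fetched automatically while loading an HTML page. In that case unverifiable should be True.

  method: the HTTP method to use for the request, such as GET, POST or PUT.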
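The parameters above can be tried without sending anything, because a Request object can be inspected before it is opened. This is only a sketch: the example.com URLs and the User-Agent string are placeholders, not values from the original article.

```python
import urllib.parse
import urllib.request

# Form fields must be urlencoded and then converted to bytes.
form = urllib.parse.urlencode({"word": "hello"}).encode("utf-8")

# Way 1: pass headers to the constructor. The URL is a placeholder.
req = urllib.request.Request(
    "http://example.com/search",
    data=form,
    headers={"User-Agent": "Mozilla/5.0 (compatible; demo-crawler)"},
)
# Way 2: add a header afterwards with add_header().
req.add_header("Referer", "http://example.com/")

print(req.get_method())            # POST, because data is present
print(req.get_header("Referer"))   # http://example.com/

# method overrides the default choice between GET and POST.
put_req = urllib.request.Request("http://example.com/item/1", data=form, method="PUT")
print(put_req.get_method())        # PUT
```

Passing req to urllib.request.urlopen would then send it.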

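urllib.request.ProxyHandler (IP proxy)

The above is fine for simple demos, but if you want a more capable crawler you need to know how to set a proxy with urllib.request.ProxyHandler. A site may track how many requests an IP makes within a period of time and block that IP if there are too many, so at that point you need to crawl through a proxy.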

import random
import time
from urllib import request

count = 0  # number of successful fetches so far

def Proxy_read(proxy_list, user_agent_list, i):
    global count
    proxy_ip = proxy_list[i]
    print('Current proxy IP: %s' % proxy_ip)
    user_agent = random.choice(user_agent_list)
    print('Current User-Agent: %s' % user_agent)
    sleep_time = random.randint(1, 3)
    print('Waiting: %s s' % sleep_time)
    time.sleep(sleep_time)
    print('Fetching...')
    headers = {'User-Agent': user_agent,
               'Accept': r'application/json, text/javascript, */*; q=0.01',
               'Referer': r'https://www.cnblogs.com'}
    # The target URL is https, so the proxy is registered for both schemes
    proxy_support = request.ProxyHandler({'http': proxy_ip, 'https': proxy_ip})
    opener = request.build_opener(proxy_support)
    # install_opener makes this opener the default used by request.urlopen
    request.install_opener(opener)
    req = request.Request(r'https://www.cnblogs.com/kmonkeywyl/p/8409715.html', headers=headers)
    try:
        html = request.urlopen(req).read().decode('utf-8')
    except Exception as e:
        print('****** Failed to open! ******', e)
    else:
        # count only the requests that actually succeeded
        count += 1
        print('OK! %s successful requests in total!' % count)

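The code above is something I wrote a while ago to refresh a page. It did not achieve the effect I wanted, but it does show request.ProxyHandler in use. urllib.request.build_opener() builds an opener from this handler, so requests sent through that opener go via the proxy. request.install_opener(opener) then installs it as the global default, so later urlopen calls use the proxy as well.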
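If the proxy should apply only to some requests, one option is to skip install_opener and call the opener directly. The sketch below assumes a proxy at 127.0.0.1:8080, which is a placeholder address, and build_proxy_opener is a hypothetical helper name.

```python
import urllib.request


def build_proxy_opener(proxy_url):
    """Return an opener that sends http and https requests through proxy_url.

    build_proxy_opener is a hypothetical helper; proxy_url looks like
    'http://127.0.0.1:8080'.
    """
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    return urllib.request.build_opener(handler)


# 127.0.0.1:8080 is a placeholder address, not a working proxy.
opener = build_proxy_opener("http://127.0.0.1:8080")
proxy_handlers = [h for h in opener.handlers
                  if isinstance(h, urllib.request.ProxyHandler)]
print(proxy_handlers[0].proxies)

# Only requests made through this opener use the proxy:
# html = opener.open("http://python.org/", timeout=10).read().decode("utf-8")
```

Because nothing is installed globally, plain urllib.request.urlopen calls elsewhere in the program keep their normal behavior.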

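urllib.request.HTTPCookieProcessor (working with cookies)

Sites commonly use cookies to check permissions, and urllib.request.HTTPCookieProcessor(cookie) lets us work with them. As with a proxy IP, using cookies requires building your own opener. The http package provides the cookiejar module for cookie support. http.cookiejar is powerful: a CookieJar object captures cookies and sends them back on later requests, which makes it possible, for example, to simulate a login. The main classes in the module are CookieJar, FileCookieJar, MozillaCookieJar and LWPCookieJar.

Getting cookies (CookieJar)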

import http.cookiejar, urllib.request
cookie = http.cookiejar.CookieJar()
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
for item in cookie:
   print(item.name+"="+item.value)

Saving cookies (MozillaCookieJar)

import http.cookiejar, urllib.request
filename = 'cookie.txt'
cookie = http.cookiejar.MozillaCookieJar(filename)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
# keep session cookies and expired cookies in the file as well
cookie.save(ignore_discard=True, ignore_expires=True)

Using cookies

import http.cookiejar, urllib.request
cookie = http.cookiejar.MozillaCookieJar()
cookie.load('cookie.txt', ignore_discard=True, ignore_expires=True)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
print(response.read().decode('utf-8'))

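FileCookieJar, MozillaCookieJar and LWPCookieJar all save cookie information to a file; they differ only in the file format. FileCookieJar is the base class, MozillaCookieJar writes the Mozilla cookies.txt format, and LWPCookieJar writes the libwww-perl Set-Cookie3 format. When working with saved cookies, load them with the class that matches the format.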
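To see the format difference without making a request, the sketch below writes and reloads a cookie with LWPCookieJar. The cookie is built by hand, and its name, value and domain (as well as the helper make_cookie) are made up for the example.

```python
import http.cookiejar
import os
import tempfile


def make_cookie(name, value, domain):
    """Build a Cookie by hand so the example runs without a network request."""
    return http.cookiejar.Cookie(
        version=0, name=name, value=value,
        port=None, port_specified=False,
        domain=domain, domain_specified=True, domain_initial_dot=False,
        path="/", path_specified=True,
        secure=False, expires=None, discard=True,
        comment=None, comment_url=None, rest={},
    )


path = os.path.join(tempfile.mkdtemp(), "cookie_lwp.txt")

jar = http.cookiejar.LWPCookieJar(path)
jar.set_cookie(make_cookie("session", "abc123", "www.example.com"))
jar.save(ignore_discard=True, ignore_expires=True)

# A file should be loaded with the same class that wrote it.
loaded = http.cookiejar.LWPCookieJar()
loaded.load(path, ignore_discard=True, ignore_expires=True)
print([(c.name, c.value, c.domain) for c in loaded])

with open(path) as f:
    print(f.readline().strip())   # header line that identifies the LWP format
```

Swapping LWPCookieJar for MozillaCookieJar in both places produces a cookies.txt style file instead.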

urllib.error

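Use try-except to catch exceptions. There are two main exception types: URLError (which carries a reason) and HTTPError (which carries an HTTP status code). HTTPError is a subclass of URLError, so it has to be caught first.

import urllib.error
import urllib.request

url = 'http://python.org/'  # example URL; the original snippet left url undefined
try:
    data = urllib.request.urlopen(url)
    print(data.read().decode('utf-8'))
except urllib.error.HTTPError as e:
    # the server answered, but with an error status such as 404 or 500
    print(e.code)
except urllib.error.URLError as e:
    # no valid response at all (DNS failure, refused connection, ...)
    print(e.reason)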
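The same pattern can be wrapped in a small function so a crawler keeps going when one URL fails. This is a sketch, not a complete retry policy; try_fetch is a hypothetical name, and the file: URL is used only because it fails predictably without network access.

```python
import urllib.error
import urllib.request


def try_fetch(url, timeout=10):
    """Return (ok, detail): the decoded body on success, an error description otherwise.

    try_fetch is a hypothetical helper name used only for this sketch.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return True, response.read().decode("utf-8")
    except urllib.error.HTTPError as e:
        # Must come first: HTTPError is a subclass of URLError.
        return False, "HTTP error %s" % e.code
    except urllib.error.URLError as e:
        return False, "URL error: %s" % e.reason


print(issubclass(urllib.error.HTTPError, urllib.error.URLError))   # True

# A file: URL that does not exist raises URLError without touching the network.
ok, detail = try_fetch("file:///no/such/file-for-urllib-demo")
print(ok, detail)
```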

urllib.parse

urllib.parse.urlparse

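Parses a URL into six components (scheme, netloc, path, params, query, fragment) and returns them as a ParseResult named tuple.

import urllib.parse
o = urllib.parse.urlparse('http://www.cnblogs.com/kmonkeywyl/')
# ParseResult(scheme='http', netloc='www.cnblogs.com', path='/kmonkeywyl/', params='', query='', fragment='')
print(o)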

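Parameters

result = urlparse('url', scheme='https'): scheme is a default that is used only when the URL itself does not specify one (for example, when the leading http:// has been left off).

result = urlparse('url', scheme='http'): the same, with http as the default.

result = urlparse('url', allow_fragments=False), URL with a query string: the fragment is not split off and stays part of the query.

result = urlparse('url', allow_fragments=False), URL without a query string: the fragment is not split off and stays part of the path.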
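The four cases above can be checked directly. The URLs below are sample values chosen for illustration; note that a URL without a leading // has no netloc, so the host name ends up in path.

```python
from urllib.parse import urlparse

# scheme is only a default: it applies when the URL carries no scheme of its own.
no_scheme = urlparse("www.baidu.com/index.html;user?id=5#comment", scheme="https")
print(no_scheme.scheme, repr(no_scheme.netloc), no_scheme.path)

has_scheme = urlparse("http://www.baidu.com/index.html", scheme="https")
print(has_scheme.scheme)   # http

# allow_fragments=False with a query string: '#comment' stays in the query.
with_query = urlparse("http://www.baidu.com/index.html;user?id=5#comment",
                      allow_fragments=False)
print(with_query.query, repr(with_query.fragment))   # id=5#comment ''

# allow_fragments=False without a query string: '#comment' stays in the path.
no_query = urlparse("http://www.baidu.com/index.html#comment",
                    allow_fragments=False)
print(no_query.path, repr(no_query.fragment))   # /index.html#comment ''
```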

urllib.parse.urlunparse

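Builds a URL from its components.

import urllib.parse

# six components in order: scheme, netloc, path, params, query, fragment
data = ['http', 'www.baidu.com', 'index.html', 'user', 'a=1', 'comment']
print(urllib.parse.urlunparse(data))   # http://www.baidu.com/index.html;user?a=1#comment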

urllib.parse.urljoin

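Joins a base URL with another URL. If the second argument is a relative URL, it is resolved against the base; if it is an absolute URL with its own scheme and host, it replaces the base entirely.

from urllib.parse import urljoin
print(urljoin('http://www.baidu.com','FAQ.html'))                                   # http://www.baidu.com/FAQ.html
print(urljoin('http://www.badiu.com','https://www.baidu.com/FAQ.html'))             # https://www.baidu.com/FAQ.html
print(urljoin('http://www.baidu.com/about.html','http://www.baidu.com/FAQ.html'))   # http://www.baidu.com/FAQ.html
print(urljoin('www.baidu.com#comment','?category=2'))                               # www.baidu.com?category=2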
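In a crawler, urljoin is typically used to turn the href values found on a page into absolute URLs. The following is a sketch under that assumption: absolute_links is a hypothetical helper, and the hrefs list stands in for values already extracted from a page.

```python
from urllib.parse import urljoin, urlparse


def absolute_links(page_url, hrefs):
    """Resolve href values against page_url and keep only links on the same host.

    absolute_links is a hypothetical helper; hrefs stands for values already
    extracted from a page's <a> tags.
    """
    host = urlparse(page_url).netloc
    resolved = (urljoin(page_url, href) for href in hrefs)
    return [link for link in resolved if urlparse(link).netloc == host]


links = absolute_links(
    "http://www.baidu.com/docs/index.html",
    ["FAQ.html", "/about.html", "../top.html", "https://www.python.org/"],
)
print(links)
```

The last href points at another host, so it is dropped by the same-host filter.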

I won't go into more detail here; if you are interested, see the documentation.

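Author: 王延領

Source: http://wyl1924.cnblogs.com

Copyright of this article is shared by the author and cnblogs (博客園). Reposting is welcome, but unless the author agrees otherwise, this notice must be kept and a link to the original article must be placed prominently on the page.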
