Python Web Scraping: Basic Usage of the urllib Library

This article is a repost. 2017-11-22 12:37, Python web scraping

urllib is the HTTP request library that ships with Python, and it is reasonably complete in features. It consists of the following four modules:

urllib.request: the request module

urllib.error: the exception handling module

urllib.parse: the URL parsing module

urllib.robotparser: the robots.txt parsing module
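The last module in the list is not used anywhere else in this post, so here is a minimal sketch of it. The robots.txt rules and the example.com URLs are made up for illustration, and the lines are fed to parse() directly so nothing is downloaded.

```python
import urllib.robotparser

# Hypothetical robots.txt content, used only for this illustration.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = urllib.robotparser.RobotFileParser()
# parse() accepts an iterable of lines, so no network access is needed here.
parser.parse(ROBOTS_TXT.splitlines())

# Ask whether a crawler with the given user agent may fetch each URL.
allowed_public = parser.can_fetch('*', 'http://example.com/index.html')
allowed_private = parser.can_fetch('*', 'http://example.com/private/data.html')
print(allowed_public, allowed_private)
```

Against a live site you would normally call set_url() with the address of the site's robots.txt and then read() instead of parse().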

Below are some ways to use the urllib library.

Using urllib.request

import urllib.request

response = urllib.request.urlopen('http://www.baidu.com')
print(response.read().decode('utf-8'))

read() returns the page's HTML as a byte stream (bytes), so it has to be decoded before it can be printed as text.
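Not every page is encoded as UTF-8, so a hard-coded decode('utf-8') can fail. As a minimal sketch, the helper below (decode_body is a name invented here) reads the charset from the Content-Type header and falls back to UTF-8. The headers object is built by hand to keep the example offline; with a real response, response.headers provides the same get_content_charset() method.

```python
from email.message import Message


def decode_body(body, headers, default='utf-8'):
    """Decode bytes using the charset declared in the headers, if any."""
    charset = headers.get_content_charset() or default
    # errors='replace' avoids an exception on the occasional bad byte.
    return body.decode(charset, errors='replace')


# Hand-built stand-in for response.headers (an assumption for this demo).
fake_headers = Message()
fake_headers['Content-Type'] = 'text/html; charset=gbk'
fake_body = '中文'.encode('gbk')

text = decode_body(fake_body, fake_headers)
print(text)
```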

import urllib.request

response = urllib.request.urlopen('http://www.baidu.com')
print(response.status)  # status code of the response; response.getcode() returns the same value
print(response.getheaders())  # all response headers: server type, date, content type, connection state, etc.
print(response.getheader('Server'))  # a single header; the argument names the header you want
print(response.geturl())  # URL of the response
print(response.read())  # response body as bytes; decode it with the page's charset to get readable text
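The urllib.error module from the list at the top is not demonstrated in this post. A minimal sketch of how it is typically used follows; fetch is a hypothetical helper name and the 10-second timeout is an arbitrary choice. HTTPError is caught first because it is a subclass of URLError.

```python
import urllib.error
import urllib.request


def fetch(url, timeout=10):
    """Return (status, payload).

    payload is the body bytes on success and the error reason otherwise.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status, response.read()
    except urllib.error.HTTPError as exc:
        # The server answered, but with an error status such as 404 or 500.
        return exc.code, exc.reason
    except urllib.error.URLError as exc:
        # No usable answer: DNS failure, refused connection, missing file...
        return None, exc.reason
```

A call such as fetch('http://www.baidu.com') would return the status code together with the raw bytes of the page, while a failed request returns the reason instead of raising.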

Adding request headers so the request looks like it comes from a browser

1. Pass the headers when the request is created

import urllib.request

# /get is used because no data is sent: httpbin answers a GET to /post with 405
url = 'http://httpbin.org/get'
header = {}
header['User-Agent'] = ('Mozilla/5.0 '
                        '(Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/534.50 '
                        '(KHTML, like Gecko) Version/5.1 Safari/534.50')

req = urllib.request.Request(url=url, headers=header)
res = urllib.request.urlopen(req)

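Request is the class that represents a request; here the headers are passed to its constructor as an argument.

2. Append headers dynamically

To append headers dynamically, you must first create an instance of the Request class.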

import urllib.request

# /get is used because no data is sent: httpbin answers a GET to /post with 405
url = 'http://httpbin.org/get'
req = urllib.request.Request(url=url)
req.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:53.0) Gecko/20100101 Firefox/53.0')
res = urllib.request.urlopen(req)
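Both approaches leave the Request object in the same state, which can be checked without sending anything. The sketch below also shows a detail that is easy to trip over: Request stores header names through str.capitalize(), so 'User-Agent' is kept as 'User-agent'. The shortened User-Agent string is a placeholder.

```python
import urllib.request

URL = 'http://httpbin.org/get'
UA = 'Mozilla/5.0 (placeholder browser string)'

# Approach 1: headers passed to the constructor.
req_a = urllib.request.Request(url=URL, headers={'User-Agent': UA})

# Approach 2: header appended after construction.
req_b = urllib.request.Request(url=URL)
req_b.add_header('User-Agent', UA)

# Neither request is sent; we only inspect the objects.
print(req_a.header_items() == req_b.header_items())

# Header names are stored capitalized, so look them up as 'User-agent'.
print(req_b.get_header('User-agent'))
print(req_b.get_header('User-Agent'))  # None: the lookup is case-sensitive

# Without a data argument the request method is GET.
print(req_b.get_method())
```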

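Using a proxy:

ProxyHandler is a class in urllib.request that lets you send requests through a proxy.

Its argument is a dict: each key is a protocol type, and each value is the proxy's IP address and port.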

import urllib.request

# sample proxy addresses; replace them with proxies that are currently reachable
proxy_handler = urllib.request.ProxyHandler({
    'http': '112.35.29.53:8088',
    'https': '165.227.169.12:80'
})
opener = urllib.request.build_opener(proxy_handler)
response = opener.open('http://www.baidu.com')
print(response.read())
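A small variation, sketched under assumptions: the helper name make_proxy_opener and the 127.0.0.1:8080 address are placeholders, not working proxies. The opener is only built and inspected here, so the example runs without a network connection.

```python
import urllib.request


def make_proxy_opener(proxies):
    """Build an opener that routes requests through the given proxies."""
    handler = urllib.request.ProxyHandler(proxies)
    return urllib.request.build_opener(handler)


# Placeholder address: substitute a proxy you actually control.
opener = make_proxy_opener({'http': '127.0.0.1:8080',
                            'https': '127.0.0.1:8080'})

# Confirm the proxy handler is part of the opener's handler chain.
proxy_handlers = [h for h in opener.handlers
                  if isinstance(h, urllib.request.ProxyHandler)]
print(proxy_handlers[0].proxies)

# To make every later urllib.request.urlopen() call use this opener:
# urllib.request.install_opener(opener)
```

Wrapping the setup in a function makes it easy to swap proxies, for example when one of them stops responding.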

Using urllib.parse

import urllib.request
import urllib.parse

url = 'http://httpbin.org/post'
header = {}
header['User-Agent'] = ('Mozilla/5.0 '
                        '(Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/534.50 '
                        '(KHTML, like Gecko) Version/5.1 Safari/534.50')

data = {}
data['name'] = 'us'
# urlencode() builds the form string; encode() turns it into the bytes a POST body requires
data = urllib.parse.urlencode(data).encode('utf-8')

req = urllib.request.Request(url=url, data=data, headers=header, method='POST')
response = urllib.request.urlopen(req)
print(response.read().decode('utf-8'))
print(type(data))  # <class 'bytes'>
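Beyond urlencode(), urllib.parse can also split, join, and decode URLs. The following sketch runs entirely offline; the query values are made up for illustration.

```python
from urllib.parse import parse_qs, quote, urlencode, urljoin, urlparse

# Dict -> query string; spaces become '+'.
query = urlencode({'name': 'us', 'q': 'urllib basics'})
print(query)  # name=us&q=urllib+basics

# Query string -> dict of lists (a key may repeat in a query string).
params = parse_qs(query)
print(params)  # {'name': ['us'], 'q': ['urllib basics']}

# Split a URL into its components.
parts = urlparse('http://httpbin.org/get?name=us#top')
print(parts.scheme, parts.netloc, parts.path, parts.query, parts.fragment)

# Resolve a relative link against a base URL.
joined = urljoin('http://httpbin.org/', 'post')
print(joined)  # http://httpbin.org/post

# Percent-encode a path segment; quote() turns a space into %20.
print(quote('a b'))  # a%20b

# A POST body must be bytes, which is why the query string gets encoded.
body = query.encode('utf-8')
print(type(body))  # <class 'bytes'>
```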

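In my experience urllib is full of pitfalls, and I suggest dropping it: code I wrote with urllib last month now hits all sorts of problems when I run it.

So use the requests library instead; its syntax is much more concise.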
