Web Crawling: Common Python Crawler Libraries


1. Common Libraries

1. requests: used for making HTTP requests.

  requests.get("url")

2. selenium: used for browser automation.

3. lxml: fast HTML/XML parsing.

4. beautifulsoup: HTML parsing.

5. pyquery: a web page parsing library, reputedly easier to use than BeautifulSoup, with syntax very close to jQuery (see the short sketch after this list).

6. pymysql: a storage library, for working with MySQL databases.

7. pymongo: for working with the MongoDB database.

8. redis: client for the Redis non-relational database.

9. jupyter: a notebook that runs in the browser.
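A minimal sketch of the jQuery-like selector syntax claimed above; the HTML snippet is made up purely for illustration:

  from pyquery import PyQuery as pq

  doc = pq('<div><p class="title">Hello</p><p>World</p></div>')
  print(doc('p.title').text())  # prints: Hello (CSS selectors, as in jQuery)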

2. What is urllib?

Python's built-in HTTP request library. It is part of the standard library, so nothing needs to be installed, and learning it makes the more convenient requests library easier to understand later. Its modules:

urllib.request: request module, simulates a browser sending a request

urllib.error: exception handling module

urllib.parse: URL parsing module, with utilities for splitting and joining URLs

urllib.robotparser: robots.txt parsing module (see the sketch right after this list)
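urllib.robotparser is not demonstrated anywhere else in these notes, so here is a minimal sketch; it assumes the target site serves a robots.txt at the usual path:

  import urllib.robotparser

  rp = urllib.robotparser.RobotFileParser()
  rp.set_url('http://www.baidu.com/robots.txt')
  rp.read()  # fetch and parse robots.txt
  # can_fetch(useragent, url): may this user agent crawl this URL?
  print(rp.can_fetch('*', 'http://www.baidu.com/s?wd=python'))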

 

Differences between Python 2 and Python 3

Python2

import urllib2

response = urllib2.urlopen('http://www.baidu.com')

 

Python3

import urllib.request

response = urllib.request.urlopen('http://www.baidu.com')

Usage:

urlopen sends a request to the server.

urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)

Example 1:

  import urllib.request

  response = urllib.request.urlopen('http://www.baidu.com')
  print(response.read().decode('utf-8'))

 

Example 2:

  import urllib.request
  import urllib.parse

  data = bytes(urllib.parse.urlencode({'word': 'hello'}), encoding='utf8')
  response = urllib.request.urlopen('http://httpbin.org/post', data=data)
  print(response.read())

Note: passing data sends a POST request; omitting it sends a GET.

 

Example 3: timeout test

  import urllib.request

  response = urllib.request.urlopen('http://httpbin.org/get', timeout=1)
  print(response.read())  # completes normally within 1 second

  import socket
  import urllib.request
  import urllib.error

  try:
      response = urllib.request.urlopen('http://httpbin.org/get', timeout=0.1)
  except urllib.error.URLError as e:
      if isinstance(e.reason, socket.timeout):
          print('TIME OUT')

This prints TIME OUT, because 0.1 seconds is too short for the request to finish.

 

Response

Response type:

  import urllib.request

  response = urllib.request.urlopen('https://www.python.org')
  print(type(response))

Output: <class 'http.client.HTTPResponse'>

 

     

Status code and response headers:

  import urllib.request

  response = urllib.request.urlopen('http://www.python.org')
  print(response.status)               # 200 on success
  print(response.getheaders())         # all response headers
  print(response.getheader('Server'))  # a single response header

 

3. Request: allows adding headers

Declare a Request object, which can carry headers and other information, then open it with urlopen:

  import urllib.request

  request = urllib.request.Request('https://python.org')
  response = urllib.request.urlopen(request)
  print(response.read().decode('utf-8'))

 

 

Example:

  from urllib import request, parse

  url = 'http://httpbin.org/post'
  headers = {
      'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.75 Safari/537.36',
      'Host': 'httpbin.org'
  }
  dict = {
      'name': 'Germey'
  }

  data = bytes(parse.urlencode(dict), encoding='utf8')
  req = request.Request(url=url, data=data, headers=headers, method='POST')
  response = request.urlopen(req)
  print(response.read().decode('utf-8'))
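Headers can also be attached after the Request object has been constructed, using add_header(key, value):

  from urllib import request, parse

  url = 'http://httpbin.org/post'
  data = bytes(parse.urlencode({'name': 'Germey'}), encoding='utf8')
  req = request.Request(url=url, data=data, method='POST')
  req.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.75 Safari/537.36')
  response = request.urlopen(req)
  print(response.read().decode('utf-8'))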

 

 

4. Proxies

Handlers let urllib deal with more complex scenarios; ProxyHandler routes requests through a proxy:

  import urllib.request

  proxy_handler = urllib.request.ProxyHandler({
      'http': 'http://127.0.0.1:9743',   # assumes a proxy listening locally on port 9743
      'https': 'http://127.0.0.1:9743',
  })
  opener = urllib.request.build_opener(proxy_handler)
  response = opener.open('http://httpbin.org/get')
  print(response.read())
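If plain urllib.request.urlopen() calls should also go through the proxy, the opener can be installed as the global default; install_opener is a standard urllib.request facility not shown in the original notes:

  urllib.request.install_opener(opener)
  response = urllib.request.urlopen('http://httpbin.org/get')  # now proxied as well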

 

 

5. Cookies

Cookies are stored on the client to record the user's identity and keep login state:

  import http.cookiejar, urllib.request

  cookie = http.cookiejar.CookieJar()
  handler = urllib.request.HTTPCookieProcessor(cookie)
  opener = urllib.request.build_opener(handler)
  response = opener.open('http://www.baidu.com')
  for item in cookie:
      print(item.name + "=" + item.value)

 

First way to save cookies (Mozilla/Netscape file format):

  import http.cookiejar, urllib.request

  filename = 'cookie.txt'
  cookie = http.cookiejar.MozillaCookieJar(filename)
  handler = urllib.request.HTTPCookieProcessor(cookie)
  opener = urllib.request.build_opener(handler)
  response = opener.open('http://www.baidu.com')
  cookie.save(ignore_discard=True, ignore_expires=True)

 

Second way to save cookies (LWP file format):

  import http.cookiejar, urllib.request

  filename = 'cookie.txt'
  cookie = http.cookiejar.LWPCookieJar(filename)
  handler = urllib.request.HTTPCookieProcessor(cookie)
  opener = urllib.request.build_opener(handler)
  response = opener.open('http://www.baidu.com')
  cookie.save(ignore_discard=True, ignore_expires=True)

Reading cookies back (use the CookieJar class that matches the saved format):

  import http.cookiejar, urllib.request

  cookie = http.cookiejar.LWPCookieJar()
  cookie.load('cookie.txt', ignore_discard=True, ignore_expires=True)
  handler = urllib.request.HTTPCookieProcessor(cookie)
  opener = urllib.request.build_opener(handler)
  response = opener.open('http://www.baidu.com')
  print(response.read().decode('utf-8'))
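Symmetrically, a file written by MozillaCookieJar has to be loaded with MozillaCookieJar; a sketch of just the load step:

  import http.cookiejar

  cookie = http.cookiejar.MozillaCookieJar()
  cookie.load('cookie.txt', ignore_discard=True, ignore_expires=True)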

 

 

6. Exception Handling

Catching exceptions keeps the crawler running reliably.

Example 1:

  from urllib import request, error

  try:
      response = request.urlopen('http://cuiqingcai.com/index.htm')
  except error.URLError as e:
      print(e.reason)  # report why the URL failed

Example 2 (catch the more specific HTTPError first, then URLError):

  from urllib import request, error

  try:
      response = request.urlopen('http://cuiqingcai.com/index.htm')
  except error.HTTPError as e:
      print(e.reason, e.code, e.headers, sep='\n')
  except error.URLError as e:
      print(e.reason)
  else:
      print('Request Successfully')

 

 

7. URL Parsing

urllib.parse is mainly a utility module, useful for building and splitting the URLs a crawler works with.

urlparse: splits a URL into its components

  urllib.parse.urlparse(urlstring, scheme='', allow_fragments=True)

  

Example 1:

  from urllib.parse import urlparse  # split a URL

  result = urlparse('http://www.baidu.com/index.html;user?id=5#comment')
  print(type(result), result)

Result:

  <class 'urllib.parse.ParseResult'> ParseResult(scheme='http', netloc='www.baidu.com', path='/index.html', params='user', query='id=5', fragment='comment')

 

Example 2 (no scheme in the URL; the scheme argument supplies a default):

  from urllib.parse import urlparse

  result = urlparse('www.baidu.com/index.html;user?id=5#comment', scheme='https')
  print(result)

Result (without a leading //, everything before the query is treated as the path):

  ParseResult(scheme='https', netloc='', path='www.baidu.com/index.html', params='user', query='id=5', fragment='comment')

 

Example 3 (the scheme inside the URL takes priority over the scheme argument):

  from urllib.parse import urlparse

  result = urlparse('http://www.baidu.com/index.html;user?id=5#comment', scheme='https')
  print(result)

Result:

  ParseResult(scheme='http', netloc='www.baidu.com', path='/index.html', params='user', query='id=5', fragment='comment')

 

Example 4 (allow_fragments=False folds the fragment into the query):

  from urllib.parse import urlparse

  result = urlparse('http://www.baidu.com/index.html;user?id=5#comment', allow_fragments=False)
  print(result)

Result:

  ParseResult(scheme='http', netloc='www.baidu.com', path='/index.html', params='user', query='id=5#comment', fragment='')

 

Example 5 (no query present, so the fragment is folded into the path):

  from urllib.parse import urlparse

  result = urlparse('http://www.baidu.com/index.html#comment', allow_fragments=False)
  print(result)

Result:

  ParseResult(scheme='http', netloc='www.baidu.com', path='/index.html#comment', params='', query='', fragment='')

 

 

8. URL Joining

urlunparse: the reverse of urlparse

Example:

  from urllib.parse import urlunparse

  data = ['http', 'www.baidu.com', 'index.html', 'user', 'a=6', 'comment']
  print(urlunparse(data))

Result:

  http://www.baidu.com/index.html;user?a=6#comment

 

urljoin: joins a base URL with a second (possibly relative) URL

  from urllib.parse import urljoin

  print(urljoin('http://www.baidu.com', 'FAQ.html'))

Result:

  http://www.baidu.com/FAQ.html

Fields present in the second URL override the corresponding fields of the first.
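Two more calls illustrating the override rule, following standard urljoin behavior:

  from urllib.parse import urljoin

  print(urljoin('http://www.baidu.com', 'https://cuiqingcai.com/FAQ.html'))
  # https://cuiqingcai.com/FAQ.html  (the second URL is complete, so it wins)
  print(urljoin('http://www.baidu.com/about.html', 'http://cuiqingcai.com/FAQ.html'))
  # http://cuiqingcai.com/FAQ.html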

 

urlencode: converts a dict into a GET query string

  from urllib.parse import urlencode

  params = {
      'name': 'gemey',
      'age': 22
  }
  base_url = 'http://www.baidu.com?'
  url = base_url + urlencode(params)
  print(url)

Result:

  http://www.baidu.com?name=gemey&age=22
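The reverse direction also exists in urllib.parse: parse_qs turns a query string back into a dict, with each value wrapped in a list:

  from urllib.parse import parse_qs

  print(parse_qs('name=gemey&age=22'))
  # {'name': ['gemey'], 'age': ['22']}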

 

 




Author: hoptop
Link: https://www.jianshu.com/p/cfbdacbeac6e
Source: Jianshu
Copyright belongs to the author. Contact the author for commercial reuse; credit the source for non-commercial reuse.

