一、常用庫
1、requests 做請求的時候用到。
requests.get("url")
2、selenium 自動化會用到。
3、lxml
4、beautifulsoup
5、pyquery 網頁解析庫 說是比beautiful 好用,語法和jquery非常像。
6、pymysql 存儲庫。操作mysql數據的。
7、pymongo 操作MongoDB 數據庫。
8、redis 非關系型數據庫。
9、jupyter 在線記事本。
二、什么是Urllib
Python內置的Http請求庫
urllib.request 請求模塊 模擬瀏覽器
urllib.error 異常處理模塊
urllib.parse url解析模塊 工具模塊,如:拆分、合並
urllib.robotparser robots.txt 解析模塊
2和3的區別
Python2
import urllib2
response = urllib2.urlopen('http://www.baidu.com');
Python3
import urllib.request
response =urllib.request.urlopen('http://www.baidu.com');
用法:
urlOpen 發送請求給服務器。
urllib.request.urlopen(url,data=None[參數],[timeout,]*,cafile=None,capath=None,cadefault=false,context=None)
例子:
例子1:
import urllib.requests
response=urllib.reqeust.urlopen('http://www.baidu.com')
print(response.read().decode('utf-8'))
例子2:
import urllib.request
import urllib.parse
data=bytes(urllib.parse.urlencode({'word':'hello'}),encoding='utf8')
response=urllib.reqeust.urlopen('http://httpbin.org/post',data=data)
print(response.read())
注:加data就是post發送,不加就是以get發送。
例子3:
超時測試
import urllib.request
response =urllib.request.urlopen('http://httpbin.org/get',timeout=1)
print(response.read())
-----正常
import socket
import urllib.reqeust
import urllib.error
try:
response=urllib.request.urlopen('http://httpbin.org/get',timeout=0.1)
except urllib.error.URLError as e:
if isinstance(e.reason,socket.timeout):
print('TIME OUT')
這是就是輸出 TIME OUT
響應
響應類型
import urllib.request
response=urllib.request.urlopen('https://www.python.org')
print(type(response))
輸出:print(type(response))
狀態碼、響應頭
import urllib.request
response = urllib.request.urlopen('http://www.python.org')
print(response.status) // 正確返回200
print(response.getheaders()) //返回請求頭
print(response.getheader('Server'))
三、Request 可以添加headers
import urllib.request
request=urllib.request.Request('https://python.org')
response=urllib.request.urlopen(request)
print(response.read().decode('utf-8'))
例子:
from urllib import request,parse
url='http://httpbin.org/post'
headers={
}
dict={
'name':'Germey'
}
data=bytes(parse.urlencode(dict),encoding='utf8')
req= request.Request(url=url,data=data,headers=headers,method='POST')
response = request.urlopen(req)
print(response.read().decode('utf-8'))
四、代理
import urllib.request
proxy_handler =urllib.request.ProxyHandler({
'http':'http://127.0.0.1:9743',
'https':'http://127.0.0.1:9743',
})
opener =urllib.request.build_opener(proxy_handler)
response= opener.open('http://httpbin.org/get')
print(response.read())
五、Cookie
import http.cookiejar,urllib.request
cookie = http.cookiejar.Cookiejar()
handler=urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
for item in cookie:
print(item.name+"="+item.value)
第一種保存cookie方式
import http.cookiejar,urllib.request
filename = 'cookie.txt'
cookie =http.cookiejar.MozillaCookieJar(filename)
handler= urllib.request.HTTPCookieProcessor(cookie)
opener=urllib.request.build_opener(handler)
response= opener.open('http://www.baidu.com')
cookie.save(ignore_discard=True,ignore_expires=True)
第二種保存cookie方式
import http.cookiejar,urllib.request
filename = 'cookie.txt'
cookie =http.cookiejar.LWPCookieJar(filename)
handler=urllib.request.HTTPCookieProcessor(cookie)
opener=urllib.request.build_opener(handler)
response=opener.open('http://www.baidu.com')
cookie.save(ignore_discard=True,ignore_expires=True)
讀取cookie
import http.cookiejar,urllib.request
cookie=http.cookiejar.LWPCookieJar()
cookie.load('cookie.txt',ignore_discard=True,ignore_expires=True)
handler=urllib.request.HTTPCookieProcessor(cookie)
opener=urllib.request.build_opener(handler)
response=opener.open('http://www.baidu.com')
print(response.read().decode('utf-8'))
六、異常處理
例子1:
from urllib import reqeust,error
try:
response =request.urlopen('http://cuiqingcai.com/index.htm')
except error.URLError as e:
print(e.reason) //url異常捕獲
例子2:
from urllib import reqeust,error
try:
response =request.urlopen('http://cuiqingcai.com/index.htm')
except error.HTTPError as e:
print(e.reason,e.code,e.headers,sep='\n') //url異常捕獲
except error.URLError as e:
print(e.reason)
else:
print('Request Successfully')
7、URL解析
urlparse //url 拆分
urllib.parse.urlparse(urlstring,scheme='',allow_fragments=True)
例子:
from urllib.parse import urlparse //url 拆分
result = urlparse('http://www.baidu.com/index.html;user?id=5#comment')
print(type(result),result)
結果:
例子2:
from urllib.parse import urlparse //沒有http
result = urlparse('www.baidu.com/index.html;user?id=5#comment',scheme='https')
print(result)
例子3:
from urllib.parse import urlparse
result = urlparse('http://www.baidu.com/index.html;user?id=5#comment',scheme='https')
print(result)
例子4:
from urllib.parse import urlparse
result = urlparse('http://www.baidu.com/index.html;user?id=5#comment',allow_fragments=False)
print(result)
例子5:
from urllib.parse import urlparse
result = urlparse('http://www.baidu.com/index.html#comment',allow_fragments=False)
print(result)
七、拼接
urlunparse
例子:
from urllib.parse import urlunparse
data=['http','www.baidu.com','index.html','user','a=6','comment']
print(urlunparse(data))
urljoin
from urllib.parse import urljoin
print(urljoin('http://www.baidu.com','FAQ.html'))
后面覆蓋前面的
urlencode
from urllib.parse import urlencode
params={
'name':'gemey',
'age':22
}
base_url='http//www.baidu.com?'
url = base_url+urlencode(params)
print(url)
http://www.baidu.com?name=gemey&age=22