慕課網-Python網絡爬蟲與信息提取(嵩天)
第一周:網絡爬蟲之規則
單元1:requests庫入門
>>> import requests
>>> r = requests.get('http://www.baidu.com')
>>> r.status_code
200
>>> r.encoding ='uft-8'
>>> r
<Response [200]>
>>> r.text
Requests庫的7個主要方法
方法 | 說明 |
---|---|
requests.request() | 構造一個請求,支撐以下各方法的基礎方法 |
requests.get() | 獲取HTML網頁的主要方法 |
requests.head() | 獲取HTML網頁頭信息的方法 |
requests.post() | 向HTML網頁提交POST請求的方法 |
requests.put() | 向HTML網頁提交PUT請求的方法 |
requests.patch() | 向HTML網易提交局部修改請求 |
requests.delete | 向HTML網易提交刪除請求 |
requests.get方法
Response對象的屬性
屬性 | 說明 |
---|---|
r.status_code | 請求返回狀態,200 連接成功;404 連接失敗 |
r.text | 響應內容的字符串形式,即URL對應的頁面內容 |
r.encoding | 從header中猜測的響應內容編碼方式 |
r.apparent_encoding | 從內容中分析出中的響應內容編碼方式(備選編碼方式) |
r.content | 響應內容的二進制樣式 |
示例
Python 3.8.2 (tags/v3.8.2:7b3ab59, Feb 25 2020, 22:45:29) [MSC v.1916 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license()" for more information.
>>> import requests
>>> r = requests.get('http://www.baidu.com')
>>> r.status_code
200
>>> r.text
'<!DOCTYPE html>\r\n<!--STATUS OK--><html> <head><meta http-equiv=content-type content=text/html;charset=utf-8><meta http-equiv=X-UA-Compatible content=IE=Edge><meta content=always name=referrer><link rel=stylesheet type=text/css href=http://s1.bdstatic.com/r/www/cache/bdorz/baidu.min.css><title>ç\x99¾åº¦ä¸\x80ä¸\x8bï¼\x8cä½\xa0å°±ç\x9f¥é\x81\x93</title></head> <body link=#0000cc> <div id=wrapper> <div id=head> <div class=head_wrapper> <div class=s_form> <div class=s_form_wrapper> <div id=lg> <img hidefocus=true src=//www.baidu.com/img/bd_logo1.png width=270 height=129> </div> <form id=form name=f action=//www.baidu.com/s class=fm> <input type=hidden name=bdorz_come value=1> <input type=hidden name=ie value=utf-8> <input type=hidden name=f value=8> <input type=hidden name=rsv_bp value=1> <input type=hidden name=rsv_idx value=1> <input type=hidden name=tn value=baidu><span class="bg s_ipt_wr"><input id=kw name=wd class=s_ipt value maxlength=255 autocomplete=off autofocus></span><span class="bg s_btn_wr"><input type=submit id=su value=ç\x99¾åº¦ä¸\x80ä¸\x8b class="bg s_btn"></span> </form> </div> </div> <div id=u1> <a href=http://news.baidu.com name=tj_trnews class=mnav>æ\x96°é\x97»</a> <a href=http://www.hao123.com name=tj_trhao123 class=mnav>hao123</a> <a href=http://map.baidu.com name=tj_trmap class=mnav>å\x9c°å\x9b¾</a> <a href=http://v.baidu.com name=tj_trvideo class=mnav>è§\x86é¢\x91</a> <a href=http://tieba.baidu.com name=tj_trtieba class=mnav>è´´å\x90§</a> <noscript> <a href=http://www.baidu.com/bdorz/login.gif?login&tpl=mn&u=http%3A%2F%2Fwww.baidu.com%2f%3fbdorz_come%3d1 name=tj_login class=lb>ç\x99»å½\x95</a> </noscript> <script>document.write(\'<a href="http://www.baidu.com/bdorz/login.gif?login&tpl=mn&u=\'+ encodeURIComponent(window.location.href+ (window.location.search === "" ? "?" : "&")+ "bdorz_come=1")+ \'" name="tj_login" class="lb">ç\x99»å½\x95</a>\');</script> <a href=//www.baidu.com/more/ name=tj_briicon class=bri style="display: block;">æ\x9b´å¤\x9a产å\x93\x81</a> </div> </div> </div> <div id=ftCon> <div id=ftConw> <p id=lh> <a href=http://home.baidu.com>å\x85³äº\x8eç\x99¾åº¦</a> <a href=http://ir.baidu.com>About Baidu</a> </p> <p id=cp>©2017 Baidu <a href=http://www.baidu.com/duty/>使ç\x94¨ç\x99¾åº¦å\x89\x8då¿\x85读</a> <a href=http://jianyi.baidu.com/ class=cp-feedback>æ\x84\x8fè§\x81å\x8f\x8dé¦\x88</a> 京ICPè¯\x81030173å\x8f· <img src=//www.baidu.com/img/gs.gif> </p> </div> </div> </div> </body> </html>\r\n'
>>>
>>> r.encoding
'ISO-8859-1'
>>> r.apparent_encoding
'utf-8'
>>> r.encoding = 'utf-8'
>>> r.text
'<!DOCTYPE html>\r\n<!--STATUS OK--><html> <head><meta http-equiv=content-type content=text/html;charset=utf-8><meta http-equiv=X-UA-Compatible content=IE=Edge><meta content=always name=referrer><link rel=stylesheet type=text/css href=http://s1.bdstatic.com/r/www/cache/bdorz/baidu.min.css><title>百度一下,你就知道</title></head> <body link=#0000cc> <div id=wrapper> <div id=head> <div class=head_wrapper> <div class=s_form> <div class=s_form_wrapper> <div id=lg> <img hidefocus=true src=//www.baidu.com/img/bd_logo1.png width=270 height=129> </div> <form id=form name=f action=//www.baidu.com/s class=fm> <input type=hidden name=bdorz_come value=1> <input type=hidden name=ie value=utf-8> <input type=hidden name=f value=8> <input type=hidden name=rsv_bp value=1> <input type=hidden name=rsv_idx value=1> <input type=hidden name=tn value=baidu><span class="bg s_ipt_wr"><input id=kw name=wd class=s_ipt value maxlength=255 autocomplete=off autofocus></span><span class="bg s_btn_wr"><input type=submit id=su value=百度一下 class="bg s_btn"></span> </form> </div> </div> <div id=u1> <a href=http://news.baidu.com name=tj_trnews class=mnav>新聞</a> <a href=http://www.hao123.com name=tj_trhao123 class=mnav>hao123</a> <a href=http://map.baidu.com name=tj_trmap class=mnav>地圖</a> <a href=http://v.baidu.com name=tj_trvideo class=mnav>視頻</a> <a href=http://tieba.baidu.com name=tj_trtieba class=mnav>貼吧</a> <noscript> <a href=http://www.baidu.com/bdorz/login.gif?login&tpl=mn&u=http%3A%2F%2Fwww.baidu.com%2f%3fbdorz_come%3d1 name=tj_login class=lb>登錄</a> </noscript> <script>document.write(\'<a href="http://www.baidu.com/bdorz/login.gif?login&tpl=mn&u=\'+ encodeURIComponent(window.location.href+ (window.location.search === "" ? "?" : "&")+ "bdorz_come=1")+ \'" name="tj_login" class="lb">登錄</a>\');</script> <a href=//www.baidu.com/more/ name=tj_briicon class=bri style="display: block;">更多產品</a> </div> </div> </div> <div id=ftCon> <div id=ftConw> <p id=lh> <a href=http://home.baidu.com>關於百度</a> <a href=http://ir.baidu.com>About Baidu</a> </p> <p id=cp>©2017 Baidu <a href=http://www.baidu.com/duty/>使用百度前必讀</a> <a href=http://jianyi.baidu.com/ class=cp-feedback>意見反饋</a> 京ICP證030173號 <img src=//www.baidu.com/img/gs.gif> </p> </div> </div> </div> </body> </html>\r\n'
>>>
爬取網頁的通用代碼框架
Requests庫的異常
異常 | 說明 |
---|---|
requests.ConnectionError | 網絡連接錯誤,如DNS查詢失敗、拒絕連接等 |
requests.HTTPError | HTTP錯誤異常 |
requests.URLRequired | URL缺失異常 |
requests.TooManyRedirects | 超過最大重定向次數,產生重定向異常 |
requests.ConnectTimeout | 連接遠程服務器超時異常 |
requests.Timeout | 請求URL超時,產生超時異常 |
通用代碼
import requests
def getHTMLText(url):
try:
r = requests.get(url,timeout=30)
r.raise_for_status() #如果返回不是200,返回異常
r.encoding = r.apparent_encoding
return r.text
except:
return "產生異常"
if __name__ == "__main__":
url = 'http://www.baidu.com'
print(getHTMLText(url))
HTTP協議及requests庫主要方法
HTTP 超文本傳輸協議
head方法
>>> import requests
>>> r = requests.head('http://httpbin.org/get')
>>> r.headers
{'Date': 'Mon, 13 Apr 2020 05:30:55 GMT', 'Content-Type': 'application/json', 'Content-Length': '305', 'Connection': 'keep-alive', 'Server': 'gunicorn/19.9.0', 'Access-Control-Allow-Origin': '*', 'Access-Control-Allow-Credentials': 'true'}
>>> r.text
''
psot方法
>>> payload = {'key1':'value1','key2':'value2'}
>>> r = requests.post('http://httpbin.org/post',data=payload)
>>> print.(r.text)
SyntaxError: invalid syntax
>>> print(r.text)
{
"args": {},
"data": "",
"files": {},
"form": {
"key1": "value1",
"key2": "value2"
},
"headers": {
"Accept": "*/*",
"Accept-Encoding": "gzip, deflate",
"Content-Length": "23",
"Content-Type": "application/x-www-form-urlencoded",
"Host": "httpbin.org",
"User-Agent": "python-requests/2.23.0",
"X-Amzn-Trace-Id": "Root=1-5e93fa0c-ca170ac8131ca8be595a1a54"
},
"json": null,
"origin": "61.50.105.30",
"url": "http://httpbin.org/post"
}
r = requests.post('http://httpbin.org/post',data = 'ABC')
>>> print(r.text)
{
"args": {},
"data": "ABC",
"files": {},
"form": {},
"headers": {
"Accept": "*/*",
"Accept-Encoding": "gzip, deflate",
"Content-Length": "3",
"Host": "httpbin.org",
"User-Agent": "python-requests/2.23.0",
"X-Amzn-Trace-Id": "Root=1-5e93fab9-139eb74dbc89fb6237f88e2f"
},
"json": null,
"origin": "61.50.105.30",
"url": "http://httpbin.org/post"
}
Requests庫主要解析方法
params 修改URL
import requests
kv = {'key1':'value1','key2':'value2'}
r = requests.request('get','http://python123.io/ws',params=kv)
print(r.url)
data
json
headers
cookies
auth
files
timeout
proxies
allow_redirects
stream
verify
cert
單元小結
requests.get()
requests.head()
重點掌握
爬取網頁的通用代碼框架 掌握
單元2:網絡爬蟲的 “盜亦有道”
網絡爬蟲引發的問題
Requests 小規模數據爬取
Scrapy 中規模數據爬取。
服務器
性能騷擾
法律風險
隱私泄露
來源審查 :判斷User-Agent 進行限制
發布公告:Rebots協議
網絡爬蟲排除標准
在網站根目錄下 rebots.txt
無此文件,允許所有爬取。
Rebots協議
京東robots協議
User-agent: *
Disallow: /?*
Disallow: /pop/*.html
Disallow: /pinpai/*.html?*
User-agent: EtaoSpider
Disallow: /
User-agent: HuihuiSpider
Disallow: /
User-agent: GwdangSpider
Disallow: /
User-agent: WochachaSpider
Disallow: /
百度robots協議
User-agent: Baiduspider
Disallow: /baidu
Disallow: /s?
Disallow: /ulink?
Disallow: /link?
Disallow: /home/news/data/
Disallow: /bh
User-agent: Googlebot
Disallow: /baidu
Disallow: /s?
Disallow: /shifen/
Disallow: /homepage/
Disallow: /cpro
Disallow: /ulink?
Disallow: /link?
Disallow: /home/news/data/
Disallow: /bh
User-agent: MSNBot
Disallow: /baidu
Disallow: /s?
Disallow: /shifen/
Disallow: /homepage/
Disallow: /cpro
Disallow: /ulink?
Disallow: /link?
Disallow: /home/news/data/
Disallow: /bh
User-agent: Baiduspider-image
Disallow: /baidu
Disallow: /s?
Disallow: /shifen/
Disallow: /homepage/
Disallow: /cpro
Disallow: /ulink?
Disallow: /link?
Disallow: /home/news/data/
Disallow: /bh
User-agent: YoudaoBot
Disallow: /baidu
Disallow: /s?
Disallow: /shifen/
Disallow: /homepage/
Disallow: /cpro
Disallow: /ulink?
Disallow: /link?
Disallow: /home/news/data/
Disallow: /bh
User-agent: Sogou web spider
Disallow: /baidu
Disallow: /s?
Disallow: /shifen/
Disallow: /homepage/
Disallow: /cpro
Disallow: /ulink?
Disallow: /link?
Disallow: /home/news/data/
Disallow: /bh
User-agent: Sogou inst spider
Disallow: /baidu
Disallow: /s?
Disallow: /shifen/
Disallow: /homepage/
Disallow: /cpro
Disallow: /ulink?
Disallow: /link?
Disallow: /home/news/data/
Disallow: /bh
User-agent: Sogou spider2
Disallow: /baidu
Disallow: /s?
Disallow: /shifen/
Disallow: /homepage/
Disallow: /cpro
Disallow: /ulink?
Disallow: /link?
Disallow: /home/news/data/
Disallow: /bh
User-agent: Sogou blog
Disallow: /baidu
Disallow: /s?
Disallow: /shifen/
Disallow: /homepage/
Disallow: /cpro
Disallow: /ulink?
Disallow: /link?
Disallow: /home/news/data/
Disallow: /bh
User-agent: Sogou News Spider
Disallow: /baidu
Disallow: /s?
Disallow: /shifen/
Disallow: /homepage/
Disallow: /cpro
Disallow: /ulink?
Disallow: /link?
Disallow: /home/news/data/
Disallow: /bh
User-agent: Sogou Orion spider
Disallow: /baidu
Disallow: /s?
Disallow: /shifen/
Disallow: /homepage/
Disallow: /cpro
Disallow: /ulink?
Disallow: /link?
Disallow: /home/news/data/
Disallow: /bh
User-agent: ChinasoSpider
Disallow: /baidu
Disallow: /s?
Disallow: /shifen/
Disallow: /homepage/
Disallow: /cpro
Disallow: /ulink?
Disallow: /link?
Disallow: /home/news/data/
Disallow: /bh
User-agent: Sosospider
Disallow: /baidu
Disallow: /s?
Disallow: /shifen/
Disallow: /homepage/
Disallow: /cpro
Disallow: /ulink?
Disallow: /link?
Disallow: /home/news/data/
Disallow: /bh
User-agent: yisouspider
Disallow: /baidu
Disallow: /s?
Disallow: /shifen/
Disallow: /homepage/
Disallow: /cpro
Disallow: /ulink?
Disallow: /link?
Disallow: /home/news/data/
Disallow: /bh
User-agent: EasouSpider
Disallow: /baidu
Disallow: /s?
Disallow: /shifen/
Disallow: /homepage/
Disallow: /cpro
Disallow: /ulink?
Disallow: /link?
Disallow: /home/news/data/
Disallow: /bh
User-agent: *
Disallow: /
Robots協議的遵守方式
約束性:網絡爬蟲可以不遵守,但存在法律風險。
類人類行為,可以不參考rebots協議。
單元小結
單元3:Requests庫網絡爬蟲實戰(5個案例)
實例1:京東商品頁面的爬取
import requests
url = 'https://item.jd.com/100005787046.html'
try:
r = requests.get(url)
r.raise_for_status()
r.encoding = r.apparent_encoding
print(r.text)
except:
print('爬取失敗')
代碼結果與課程不一致,運行結果鏈接 需要登錄京東。
"<script>window.location.href='https://passport.jd.com/uc/login?ReturnUrl=http://item.jd.com/100005787046.html'</script>"
實例2:亞馬遜商品頁面的爬取
import requests
url = 'https://www.amazon.cn/dp/B0785D5L1H/ref=sr_1_1?__mk_zh_CN=%E4%BA%9A%E9%A9%AC%E9%80%8A%E7%BD%91%E7%AB%99&keywords=%E6%9E%81%E7%AE%80&qid=1586760393&sr=8-1'
try:
kv = {'user-agent':'Mozilla/5.0'} # 偽裝瀏覽器
r = requests.get(url,headers = kv)
r.raise_for_status()
r.encoding = r.apparent_encoding
print(r.text)
except:
print('爬取失敗')
實例3:百度/360搜索關鍵詞提交
>>> kv = {'wd':'python'}
>>> r = requests.get('http://www.baidu.com/s',params = kv)
>>> r.status_code
200
>>> r.request.url
'https://wappass.baidu.com/static/captcha/tuxing.html?&ak=c27bbc89afca0463650ac9bde68ebe06&backurl=https%3A%2F%2Fwww.baidu.com%2Fs%3Fwd%3Dpython&logid=7947542543176541686&signature=3fed0b3a823c62d34101863c3639b327×tamp=1586761583'
>>> len(r.text)
1519
import requests
keyword = 'Python'
try:
kv = {'wd':keyword}
r = requests.get("http://www.baidu.com/s",params = kv)
print(r.request.url)
r.raise_for_status()
print(len(r.text))
except:
print('爬取失敗')
import requests
keyword = 'Python'
try:
kv = {'q':keyword}
r = requests.get("http://www.so.com/s",params = kv)
print(r.request.url)
r.raise_for_status()
print(len(r.text))
except:
print('爬取失敗')
實例4:網絡圖片的爬取和存儲
>>> path = "D:/abc.jpg"
>>> url = "http://image.ngchina.com.cn/2015/0323/20150323111422966.jpg"
>>> r = requests.get(url)
>>> r.status_code
200
>>> with open(path,'wb') as f:
f.write(r.content)
625427
>>> f.close()
import requests
import os
url ='http://image.ngchina.com.cn/2015/0323/20150323111422966.jpg'
root = 'D://pic//'
path = root + url.split('/')[-1]
try:
if not os.path.exists(root):
os.mkdir(root)
if not os.path.exists(path):
r = requests.get(url)
with open(path,'wb') as f:
f.write(r.content)
f.close()
print('文件保存成功')
else:
print('文件已存在')
except:
print('爬取失敗')
實例5:IP地址歸屬地的自動查詢
import requests
url ='https://m.ip138.com/iplookup.asp?ip='
try:
r = requests.get(url+'202.204.80.112')
r.raise_for_status()
r.encoding = r.apparent_encoding
print(r.text[-500:])
except:
print('爬取失敗')
執行結果 爬取失敗
r = requests.get(url+'202.204.80.112') 執行報錯
Python 3.8.2 (tags/v3.8.2:7b3ab59, Feb 25 2020, 22:45:29) [MSC v.1916 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license()" for more information.
>>> import requests
>>> url ='https://m.ip138.com/iplookup.asp?ip='
>>> r = requests.get(url+ '202.204.80.112')
Traceback (most recent call last):
File "D:\develop_study\python\Python38-32\lib\site-packages\urllib3\connection.py", line 156, in _new_conn
conn = connection.create_connection(
File "D:\develop_study\python\Python38-32\lib\site-packages\urllib3\util\connection.py", line 84, in create_connection
raise err
File "D:\develop_study\python\Python38-32\lib\site-packages\urllib3\util\connection.py", line 74, in create_connection
sock.connect(sa)
TimeoutError: [WinError 10060] 由於連接方在一段時間后沒有正確答復或連接的主機沒有反應,連接嘗試失敗。
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "D:\develop_study\python\Python38-32\lib\site-packages\urllib3\connectionpool.py", line 665, in urlopen
httplib_response = self._make_request(
File "D:\develop_study\python\Python38-32\lib\site-packages\urllib3\connectionpool.py", line 387, in _make_request
conn.request(method, url, **httplib_request_kw)
File "D:\develop_study\python\Python38-32\lib\http\client.py", line 1230, in request
self._send_request(method, url, body, headers, encode_chunked)
File "D:\develop_study\python\Python38-32\lib\http\client.py", line 1276, in _send_request
self.endheaders(body, encode_chunked=encode_chunked)
File "D:\develop_study\python\Python38-32\lib\http\client.py", line 1225, in endheaders
self._send_output(message_body, encode_chunked=encode_chunked)
File "D:\develop_study\python\Python38-32\lib\http\client.py", line 1004, in _send_output
self.send(msg)
File "D:\develop_study\python\Python38-32\lib\http\client.py", line 944, in send
self.connect()
File "D:\develop_study\python\Python38-32\lib\site-packages\urllib3\connection.py", line 184, in connect
conn = self._new_conn()
File "D:\develop_study\python\Python38-32\lib\site-packages\urllib3\connection.py", line 168, in _new_conn
raise NewConnectionError(
urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPConnection object at 0x03E697C0>: Failed to establish a new connection: [WinError 10060] 由於連接方在一段時間后沒有正確答復或連接的主機沒有反應,連接嘗試失敗。
單元小結
requests的強大功能