- Question: how can we fetch a web page's source with minimal code?
- Solution: use the urllib library to fetch the page's source.
------------------------------------------------------------------------------------
- Code example
#python3.4
import urllib.request

response = urllib.request.urlopen("http://zzk.cnblogs.com/b")
print(response.read())
- Output
b'\n<!DOCTYPE html>\n<html>\n<head>\n <meta charset="utf-8"/>\n <title>\xe6\x89\xbe\xe6\x89\xbe\xe7\x9c\x8b - \xe5\x8d\x9a\xe5\xae\xa2\xe5\x9b\xad</title> \n <link rel="shortcut icon" href="/Content/Images/favicon.ico" type="image/x-icon"/>\n <meta content="\xe6\x8a\x80\xe6\x9c\xaf\xe6\x90\x9c\xe7\xb4\xa2,IT\xe6\x90\x9c\xe7\xb4\xa2,\xe7\xa8\x8b\xe5\xba\x8f\xe6\x90\x9c\xe7\xb4\xa2,\xe4\xbb\xa3\xe7\xa0\x81\xe6\x90\x9c\xe7\xb4\xa2,\xe7\xa8\x8b\xe5\xba\x8f\xe5\x91\x98\xe6\x90\x9c\xe7\xb4\xa2\xe5\xbc\x95\xe6\x93\x8e" name="keywords" />\n <meta content="\xe9\x9d\xa2\xe5\x90\x91\xe7\xa8\x8b\xe5\xba\x8f\xe5\x91\x98\xe7\x9a\x84\xe4\xb8\x93\xe4\xb8\x9a\xe6\x90\x9c\xe7\xb4\xa2\xe5\xbc\x95\xe6\x93\x8e\xe3\x80\x82\xe9\x81\x87\xe5\x88\xb0\xe6\x8a\x80\xe6\x9c\xaf\xe9\x97\xae\xe9\xa2\x98\xe6\x80\x8e\xe4\xb9\x88\xe5\x8a\x9e\xef\xbc\x8c\xe5\x88\xb0\xe5\x8d\x9a\xe5\xae\xa2\xe5\x9b\xad\xe6\x89\xbe\xe6\x89\xbe\xe7\x9c\x8b..." name="description" />\n <link type="text/css" href="/Content/Style.css" rel="stylesheet" />\n <script src="http://common.cnblogs.com/script/jquery.js" type="text/javascript"></script>\n <script src="/Scripts/Common.js" type="text/javascript"></script>\n <script src="/Scripts/Home.js" type="text/javascript"></script>\n</head>\n<body>\n <div class="top">\n \n <div class="top_tabs">\n <a href="http://www.cnblogs.com">\xc2\xab \xe5\x8d\x9a\xe5\xae\xa2\xe5\x9b\xad\xe9\xa6\x96\xe9\xa1\xb5 </a>\n </div>\n <div id="span_userinfo" class="top_links">\n </div>\n </div>\n <div style="clear: both">\n </div>\n <center>\n <div id="main">\n <div class="logo_index">\n <a href="http://zzk.cnblogs.com">\n <img alt="\xe6\x89\xbe\xe6\x89\xbe\xe7\x9c\x8blogo" src="/images/logo.gif" /></a>\n </div>\n <div class="index_sozone">\n <div class="index_tab">\n <a href="/n" onclick="return channelSwitch('n');">\xe6\x96\xb0\xe9\x97\xbb</a>\n<a class="tab_selected" href="/b" onclick="return channelSwitch('b');">\xe5\x8d\x9a\xe5\xae\xa2</a> <a href="/k" onclick="return 
channelSwitch('k');">\xe7\x9f\xa5\xe8\xaf\x86\xe5\xba\x93</a>\n <a href="/q" onclick="return channelSwitch('q');">\xe5\x8d\x9a\xe9\x97\xae</a>\n </div>\n <div class="search_block">\n <div class="index_btn">\n <input type="button" class="btn_so_index" onclick="Search();" value=" \xe6\x89\xbe\xe4\xb8\x80\xe4\xb8\x8b " />\n <span class="help_link"><a target="_blank" href="/help">\xe5\xb8\xae\xe5\x8a\xa9</a></span>\n </div>\n <input type="text" onkeydown="searchEnter(event);" class="input_index" name="w" id="w" />\n </div>\n </div>\n </div>\n <div class="footer">\n ©2004-2016 <a href="http://www.cnblogs.com">\xe5\x8d\x9a\xe5\xae\xa2\xe5\x9b\xad</a>\n </div>\n </center>\n</body>\n</html>\n'
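- For comparison, the Python 2.7 implementation:
#python2.7
import urllib2

response = urllib2.urlopen("http://zzk.cnblogs.com/b")
print response.read()
- As you can see, the Python 3.4 and Python 2.7 versions differ: Python 3 moved urlopen from urllib2 to urllib.request, and print became a function.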
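If the same script has to run under both interpreters, one common approach is a guarded import. The sketch below is only an illustration: the helper name fetch_bytes and the 10-second timeout are assumptions, not part of the original post.

```python
# Import urlopen from whichever module the running interpreter provides.
try:
    from urllib.request import urlopen  # Python 3
except ImportError:
    from urllib2 import urlopen  # Python 2


def fetch_bytes(url, timeout=10):
    """Return the raw response body as bytes (hypothetical helper)."""
    response = urlopen(url, timeout=timeout)
    try:
        return response.read()
    finally:
        response.close()


# Usage (needs network access):
# page = fetch_bytes("http://zzk.cnblogs.com/b")
```

The helper returns undecoded bytes on purpose, so the caller decides how to decode them.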
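----------@_@? A problem appears!----------------------------------------------------------------------
- The problem: in the output above, the Chinese text is not displayed properly. response.read() returns bytes, so every non-ASCII character shows up as \x escapes.
- The fix: handle the Chinese encoding.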
--------------------------------------------------------------------------------------------------
- Handling Chinese text in the page source
- Modify the code as follows:
#python3.4
import urllib.request

response = urllib.request.urlopen("http://zzk.cnblogs.com/b")
print(response.read().decode('UTF-8'))
- Run it; the output is:
C:\Python34\python.exe E:/pythone_workspace/mydemo/spider/demo.py <!DOCTYPE html> <html> <head> <meta charset="utf-8"/> <title>找找看 - 博客園</title> <link rel="shortcut icon" href="/Content/Images/favicon.ico" type="image/x-icon"/> <meta content="技術搜索,IT搜索,程序搜索,代碼搜索,程序員搜索引擎" name="keywords" /> <meta content="面向程序員的專業搜索引擎。遇到技術問題怎么辦,到博客園找找看..." name="description" /> <link type="text/css" href="/Content/Style.css" rel="stylesheet" /> <script src="http://common.cnblogs.com/script/jquery.js" type="text/javascript"></script> <script src="/Scripts/Common.js" type="text/javascript"></script> <script src="/Scripts/Home.js" type="text/javascript"></script> </head> <body> <div class="top"> <div class="top_tabs"> <a href="http://www.cnblogs.com">« 博客園首頁 </a> </div> <div id="span_userinfo" class="top_links"> </div> </div> <div style="clear: both"> </div> <center> <div id="main"> <div class="logo_index"> <a href="http://zzk.cnblogs.com"> <img alt="找找看logo" src="/images/logo.gif" /></a> </div> <div class="index_sozone"> <div class="index_tab"> <a href="/n" onclick="return channelSwitch('n');">新聞</a> <a class="tab_selected" href="/b" onclick="return channelSwitch('b');">博客</a> <a href="/k" onclick="return channelSwitch('k');">知識庫</a> <a href="/q" onclick="return channelSwitch('q');">博問</a> </div> <div class="search_block"> <div class="index_btn"> <input type="button" class="btn_so_index" onclick="Search();" value=" 找一下 " /> <span class="help_link"><a target="_blank" href="/help">幫助</a></span> </div> <input type="text" onkeydown="searchEnter(event);" class="input_index" name="w" id="w" /> </div> </div> </div> <div class="footer"> ©2004-2016 <a href="http://www.cnblogs.com">博客園</a> </div> </center> </body> </html> Process finished with exit code 0
- Result: after decoding, the Chinese text in the page source displays correctly.
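Hard-coding 'UTF-8' works here because this page declares charset="utf-8". As a rough sketch, the charset can instead be read from the response's Content-Type header and UTF-8 used only as a fallback. The helper names decode_body and fetch_text are hypothetical. The sample bytes are the page title taken from the raw output above.

```python
import urllib.request


def decode_body(raw, declared_charset=None, fallback="utf-8"):
    """Decode bytes with the declared charset, falling back to UTF-8."""
    charset = declared_charset or fallback
    try:
        return raw.decode(charset)
    except (LookupError, UnicodeDecodeError):
        # Unknown codec name or wrong declaration: decode leniently instead.
        return raw.decode(fallback, errors="replace")


def fetch_text(url, timeout=10):
    """Fetch a page and decode it using the charset from the headers."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        declared = response.headers.get_content_charset()  # None if absent
        return decode_body(response.read(), declared)


# Offline check using the title bytes from the raw output shown earlier.
title_bytes = b"\xe6\x89\xbe\xe6\x89\xbe\xe7\x9c\x8b"
title = decode_body(title_bytes)
print(len(title_bytes), len(title))  # 9 bytes decode to 3 characters
```

fetch_text needs network access, so it is defined but not called here.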
-----------@_@! Another Chinese-encoding question ----------------------------------------------------------
Question: "If the URL itself contains Chinese characters, how should we handle it?"
For example: url = "http://zzk.cnblogs.com/s?w=python爬蟲&t=b"
-----------------------------------------------------------------------------------------------------
- Next, let's deal with Chinese characters in the URL.
(1) Test 1: keep the URL as it is and request it directly, with no processing
- Code example:
#python3.4
import urllib.request

url = "http://zzk.cnblogs.com/s?w=python爬蟲&t=b"
resp = urllib.request.urlopen(url)
print(resp.read().decode('UTF-8'))
- Output:
C:\Python34\python.exe E:/pythone_workspace/mydemo/spider/demo.py
Traceback (most recent call last):
  File "E:/pythone_workspace/mydemo/spider/demo.py", line 9, in <module>
    response = urllib.request.urlopen(url)
  File "C:\Python34\lib\urllib\request.py", line 161, in urlopen
    return opener.open(url, data, timeout)
  File "C:\Python34\lib\urllib\request.py", line 463, in open
    response = self._open(req, data)
  File "C:\Python34\lib\urllib\request.py", line 481, in _open
    '_open', req)
  File "C:\Python34\lib\urllib\request.py", line 441, in _call_chain
    result = func(*args)
  File "C:\Python34\lib\urllib\request.py", line 1210, in http_open
    return self.do_open(http.client.HTTPConnection, req)
  File "C:\Python34\lib\urllib\request.py", line 1182, in do_open
    h.request(req.get_method(), req.selector, req.data, headers)
  File "C:\Python34\lib\http\client.py", line 1088, in request
    self._send_request(method, url, body, headers)
  File "C:\Python34\lib\http\client.py", line 1116, in _send_request
    self.putrequest(method, url, **skips)
  File "C:\Python34\lib\http\client.py", line 973, in putrequest
    self._output(request.encode('ascii'))
UnicodeEncodeError: 'ascii' codec can't encode characters in position 15-16: ordinal not in range(128)

Process finished with exit code 1
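As expected, this fails: http.client encodes the request line as ASCII, and the Chinese characters cannot be encoded.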
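The failure can be reproduced without any network access. The sketch below assumes the request line has the form "GET <path> HTTP/1.1", which is consistent with the reported positions 15-16 but is reconstructed here rather than copied from the library.

```python
# Reconstructed request line (assumption): method, selector, HTTP version.
request_line = "GET /s?w=python爬蟲&t=b HTTP/1.1"

failure = None
try:
    request_line.encode("ascii")  # same call as in http/client.py putrequest
except UnicodeEncodeError as err:
    failure = err
    # err.end is exclusive, so the last offending index is err.end - 1.
    print("cannot encode positions", err.start, "to", err.end - 1)
```

Indexes 15 and 16 are exactly the two Chinese characters that follow "python" in the query string.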
(2) Test 2: percent-encode only the Chinese part
- Code example:
import urllib.request
import urllib.parse

url = "http://zzk.cnblogs.com/s?w=python" + urllib.parse.quote("爬蟲") + "&t=b"
resp = urllib.request.urlopen(url)
print(resp.read().decode('utf-8'))
- Output:

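- Result: with only the Chinese part of the URL encoded, the page is fetched successfully.
------@_@! Yet another question-----------------------------------------------------------
- Question: what if we encode the whole URL, Chinese and English together? Will the fetch still succeed?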
----------------------------------------------------------------------------------------
(3) Hence test 3: encode the whole URL, Chinese and English parts together
- Code example:
#python3.4
import urllib.request
import urllib.parse

url = urllib.parse.quote("http://zzk.cnblogs.com/s?w=python爬蟲&t=b")
resp = urllib.request.urlopen(url)
print(resp.read().decode('utf-8'))
- Output:
C:\Python34\python.exe E:/pythone_workspace/mydemo/spider/demo.py
Traceback (most recent call last):
  File "E:/pythone_workspace/mydemo/spider/demo.py", line 21, in <module>
    resp = urllib.request.urlopen(url)
  File "C:\Python34\lib\urllib\request.py", line 161, in urlopen
    return opener.open(url, data, timeout)
  File "C:\Python34\lib\urllib\request.py", line 448, in open
    req = Request(fullurl, data)
  File "C:\Python34\lib\urllib\request.py", line 266, in __init__
    self.full_url = url
  File "C:\Python34\lib\urllib\request.py", line 292, in full_url
    self._parse()
  File "C:\Python34\lib\urllib\request.py", line 321, in _parse
    raise ValueError("unknown url type: %r" % self.full_url)
ValueError: unknown url type: 'http%3A//zzk.cnblogs.com/s%3Fw%3Dpython%E7%88%AC%E8%99%AB%26t%3Db'

Process finished with exit code 1
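- Result: ValueError! The page cannot be fetched.
- Putting tests 1, 2 and 3 together, we can conclude:
(1) In Python 3.4, Chinese characters in a URL can be handled with urllib.parse.quote("爬蟲").
(2) Only the Chinese part should be encoded, not the whole URL. By default quote() also escapes ":", "?", "=" and "&", so the scheme becomes "http%3A" and urlopen no longer recognizes the URL.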
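The conclusions above can be checked offline. The sketch below also shows two alternatives that the original tests do not cover and that are offered only as suggestions: passing a safe argument to quote() so that URL delimiters are left alone, and building the query string with urllib.parse.urlencode. The variable names are illustrative.

```python
from urllib.parse import quote, unquote, urlencode, urlsplit

raw_url = "http://zzk.cnblogs.com/s?w=python爬蟲&t=b"

# Test 3 revisited: the default safe="/" escapes ":", "?", "=" and "&" too,
# so no scheme can be recognized in the result.
over_quoted = quote(raw_url)
print(repr(urlsplit(over_quoted).scheme))  # empty scheme

# Alternative 1: keep the URL delimiters unescaped via the safe argument.
safe_quoted = quote(raw_url, safe=":/?&=")

# Alternative 2: let urlencode build and escape the query string.
built = "http://zzk.cnblogs.com/s?" + urlencode([("w", "python爬蟲"), ("t", "b")])

print(safe_quoted == built)
print(unquote(safe_quoted) == raw_url)
```

Note that urlencode escapes spaces as "+" while quote uses "%20", so the two alternatives agree here only because the query contains no spaces.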
- Tip: to see which parameters a function accepts, use help()
#python3.4
import urllib.request

help(urllib.request.urlopen)
- Running the code above prints the following to the console:
C:\Python34\python.exe E:/pythone_workspace/mydemo/spider/demo.py
Help on function urlopen in module urllib.request:

urlopen(url, data=None, timeout=<object object at 0x00A50490>, *, cafile=None, capath=None, cadefault=False, context=None)

Process finished with exit code 0
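For a programmatic view of the same information, the standard inspect module can list the parameters one by one. This is a small sketch, not part of the original tip. The exact keyword-only parameters depend on the Python version, so they may differ from the Python 3.4 output above.

```python
import inspect
import urllib.request

# Inspect the signature instead of reading the help() text.
signature = inspect.signature(urllib.request.urlopen)

for name, param in signature.parameters.items():
    has_default = param.default is not inspect.Parameter.empty
    print(name, "(optional)" if has_default else "(required)")
```

Only url has no default, so it is the one argument that must always be supplied.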