A First Look at the urllib Library, and a Discussion of Chinese Encoding Problems

Reposted article · 2016-12-26 19:02 · Python

Problem: how can we fetch the source code of a web page in a simple way?
Solution: use the urllib library to download the page source.

------------------------------------------------------------------------------------

Code example

#python3.4
import urllib.request

response = urllib.request.urlopen("http://zzk.cnblogs.com/b")
print(response.read())

Output

b'\n<!DOCTYPE html>\n<html>\n<head>\n    <meta charset="utf-8"/>\n    <title>\xe6\x89\xbe\xe6\x89\xbe\xe7\x9c\x8b - \xe5\x8d\x9a\xe5\xae\xa2\xe5\x9b\xad</title>    \n    <link rel="shortcut icon" href="/Content/Images/favicon.ico" type="image/x-icon"/>\n    <meta content="\xe6\x8a\x80\xe6\x9c\xaf\xe6\x90\x9c\xe7\xb4\xa2,IT\xe6\x90\x9c\xe7\xb4\xa2,\xe7\xa8\x8b\xe5\xba\x8f\xe6\x90\x9c\xe7\xb4\xa2,\xe4\xbb\xa3\xe7\xa0\x81\xe6\x90\x9c\xe7\xb4\xa2,\xe7\xa8\x8b\xe5\xba\x8f\xe5\x91\x98\xe6\x90\x9c\xe7\xb4\xa2\xe5\xbc\x95\xe6\x93\x8e" name="keywords" />\n    <meta content="\xe9\x9d\xa2\xe5\x90\x91\xe7\xa8\x8b\xe5\xba\x8f\xe5\x91\x98\xe7\x9a\x84\xe4\xb8\x93\xe4\xb8\x9a\xe6\x90\x9c\xe7\xb4\xa2\xe5\xbc\x95\xe6\x93\x8e\xe3\x80\x82\xe9\x81\x87\xe5\x88\xb0\xe6\x8a\x80\xe6\x9c\xaf\xe9\x97\xae\xe9\xa2\x98\xe6\x80\x8e\xe4\xb9\x88\xe5\x8a\x9e\xef\xbc\x8c\xe5\x88\xb0\xe5\x8d\x9a\xe5\xae\xa2\xe5\x9b\xad\xe6\x89\xbe\xe6\x89\xbe\xe7\x9c\x8b..." name="description" />\n    <link type="text/css" href="/Content/Style.css" rel="stylesheet" />\n    <script src="http://common.cnblogs.com/script/jquery.js" type="text/javascript"></script>\n    <script src="/Scripts/Common.js" type="text/javascript"></script>\n    <script src="/Scripts/Home.js" type="text/javascript"></script>\n</head>\n<body>\n    <div class="top">\n        \n        <div class="top_tabs">\n            <a href="http://www.cnblogs.com">\xc2\xab \xe5\x8d\x9a\xe5\xae\xa2\xe5\x9b\xad\xe9\xa6\x96\xe9\xa1\xb5 </a>\n        </div>\n        <div id="span_userinfo" class="top_links">\n        </div>\n    </div>\n    <div style="clear: both">\n    </div>\n    <center>\n        <div id="main">\n            <div class="logo_index">\n                <a href="http://zzk.cnblogs.com">\n                    <img alt="\xe6\x89\xbe\xe6\x89\xbe\xe7\x9c\x8blogo" src="/images/logo.gif" /></a>\n            </div>\n            <div class="index_sozone">\n                <div class="index_tab">\n                    <a href="/n" onclick="return  
channelSwitch(&#39;n&#39;);">\xe6\x96\xb0\xe9\x97\xbb</a>\n<a class="tab_selected" href="/b" onclick="return  channelSwitch(&#39;b&#39;);">\xe5\x8d\x9a\xe5\xae\xa2</a>                    <a href="/k" onclick="return  channelSwitch(&#39;k&#39;);">\xe7\x9f\xa5\xe8\xaf\x86\xe5\xba\x93</a>\n                    <a href="/q" onclick="return  channelSwitch(&#39;q&#39;);">\xe5\x8d\x9a\xe9\x97\xae</a>\n                </div>\n                <div class="search_block">\n                    <div class="index_btn">\n                        <input type="button" class="btn_so_index" onclick="Search();" value="&nbsp;\xe6\x89\xbe\xe4\xb8\x80\xe4\xb8\x8b&nbsp;" />\n                        <span class="help_link"><a target="_blank" href="/help">\xe5\xb8\xae\xe5\x8a\xa9</a></span>\n                    </div>\n                    <input type="text" onkeydown="searchEnter(event);" class="input_index" name="w" id="w" />\n                </div>\n            </div>\n        </div>\n        <div class="footer">\n            &copy;2004-2016 <a href="http://www.cnblogs.com">\xe5\x8d\x9a\xe5\xae\xa2\xe5\x9b\xad</a>\n        </div>\n    </center>\n</body>\n</html>\n'

For comparison, the Python 2.7 version:

#python2.7
import urllib2
 
response = urllib2.urlopen("http://zzk.cnblogs.com/b")
print response.read()

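As you can see, the Python 3.4 and Python 2.7 versions differ: Python 3 provides urlopen in urllib.request instead of urllib2, and print is a function rather than a statement.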
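If one script has to run under both interpreters, a common pattern is to try the Python 3 import first and fall back to the Python 2 module. The following is a minimal sketch that only resolves the name and sends no request:

```python
# Resolve urlopen under either Python version.
try:
    from urllib.request import urlopen  # Python 3
except ImportError:
    from urllib2 import urlopen  # Python 2

# The same call then works in both versions, for example:
# response = urlopen("http://zzk.cnblogs.com/b")
print(urlopen.__name__)
```

The rest of the script can then call urlopen() without caring which module it came from.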

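----------@_@? A problem appears!----------------------------------------------------------------------

Problem found: in the output above, the Chinese text is not displayed properly. read() returns bytes, so every non-ASCII character is printed as a run of \x escape sequences.
Fix: handle the Chinese encoding by decoding the bytes.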

--------------------------------------------------------------------------------------------------

Handling Chinese text in the page source

Modify the code as follows:

#python3.4
import urllib.request

response = urllib.request.urlopen("http://zzk.cnblogs.com/b")
print(response.read().decode('UTF-8'))

Run it; the output is:

C:\Python34\python.exe E:/pythone_workspace/mydemo/spider/demo.py

<!DOCTYPE html>
<html>
<head>
    <meta charset="utf-8"/>
    <title>找找看 - 博客園</title>    
    <link rel="shortcut icon" href="/Content/Images/favicon.ico" type="image/x-icon"/>
    <meta content="技術搜索,IT搜索,程序搜索,代碼搜索,程序員搜索引擎" name="keywords" />
    <meta content="面向程序員的專業搜索引擎。遇到技術問題怎么辦，到博客園找找看..." name="description" />
    <link type="text/css" href="/Content/Style.css" rel="stylesheet" />
    <script src="http://common.cnblogs.com/script/jquery.js" type="text/javascript"></script>
    <script src="/Scripts/Common.js" type="text/javascript"></script>
    <script src="/Scripts/Home.js" type="text/javascript"></script>
</head>
<body>
    <div class="top">
        
        <div class="top_tabs">
            <a href="http://www.cnblogs.com">« 博客園首頁 </a>
        </div>
        <div id="span_userinfo" class="top_links">
        </div>
    </div>
    <div style="clear: both">
    </div>
    <center>
        <div id="main">
            <div class="logo_index">
                <a href="http://zzk.cnblogs.com">
                    <img alt="找找看logo" src="/images/logo.gif" /></a>
            </div>
            <div class="index_sozone">
                <div class="index_tab">
                    <a href="/n" onclick="return  channelSwitch(&#39;n&#39;);">新聞</a>
<a class="tab_selected" href="/b" onclick="return  channelSwitch(&#39;b&#39;);">博客</a>                    <a href="/k" onclick="return  channelSwitch(&#39;k&#39;);">知識庫</a>
                    <a href="/q" onclick="return  channelSwitch(&#39;q&#39;);">博問</a>
                </div>
                <div class="search_block">
                    <div class="index_btn">
                        <input type="button" class="btn_so_index" onclick="Search();" value="&nbsp;找一下&nbsp;" />
                        <span class="help_link"><a target="_blank" href="/help">幫助</a></span>
                    </div>
                    <input type="text" onkeydown="searchEnter(event);" class="input_index" name="w" id="w" />
                </div>
            </div>
        </div>
        <div class="footer">
            &copy;2004-2016 <a href="http://www.cnblogs.com">博客園</a>
        </div>
    </center>
</body>
</html>


Process finished with exit code 0

Result: once the bytes are decoded as UTF-8, the Chinese text in the page source displays correctly.
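The fix relies on read() returning bytes that have to be decoded with the page's encoding. The sketch below repeats that step offline on a short excerpt of the bytes printed earlier. The names decode_body and fallback are hypothetical, introduced only for illustration; they are not part of urllib.

```python
def decode_body(raw, charset=None, fallback="utf-8"):
    """Decode response bytes, preferring the declared charset if there is one."""
    # errors="replace" keeps the script running if a few bytes are malformed.
    return raw.decode(charset or fallback, errors="replace")


# Excerpt of the raw bytes shown in the first run (the <title> element).
raw = (b"<title>\xe6\x89\xbe\xe6\x89\xbe\xe7\x9c\x8b - "
       b"\xe5\x8d\x9a\xe5\xae\xa2\xe5\x9b\xad</title>")

text = decode_body(raw)
print(text)
```

With a live response, the declared charset can usually be read from response.headers.get_content_charset(), which returns None when the server does not declare one. In that case the sketch falls back to UTF-8.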

-----------@_@! A new Chinese encoding problem ----------------------------------------------------------

    Question: "What should we do if the URL itself contains Chinese characters?"

    For example: url = "http://zzk.cnblogs.com/s?w=python爬蟲&t=b"

-----------------------------------------------------------------------------------------------------

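Next, let's solve the problem of Chinese characters in a URL.

(1) Test 1: keep the URL as it is and request it directly, without any processing

Code example: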

#python3.4
import urllib.request

url="http://zzk.cnblogs.com/s?w=python爬蟲&t=b"
resp = urllib.request.urlopen(url)
print(resp.read().decode('UTF-8'))

Output:

C:\Python34\python.exe E:/pythone_workspace/mydemo/spider/demo.py
Traceback (most recent call last):
  File "E:/pythone_workspace/mydemo/spider/demo.py", line 9, in <module>
    response = urllib.request.urlopen(url)
  File "C:\Python34\lib\urllib\request.py", line 161, in urlopen
    return opener.open(url, data, timeout)
  File "C:\Python34\lib\urllib\request.py", line 463, in open
    response = self._open(req, data)
  File "C:\Python34\lib\urllib\request.py", line 481, in _open
    '_open', req)
  File "C:\Python34\lib\urllib\request.py", line 441, in _call_chain
    result = func(*args)
  File "C:\Python34\lib\urllib\request.py", line 1210, in http_open
    return self.do_open(http.client.HTTPConnection, req)
  File "C:\Python34\lib\urllib\request.py", line 1182, in do_open
    h.request(req.get_method(), req.selector, req.data, headers)
  File "C:\Python34\lib\http\client.py", line 1088, in request
    self._send_request(method, url, body, headers)
  File "C:\Python34\lib\http\client.py", line 1116, in _send_request
    self.putrequest(method, url, **skips)
  File "C:\Python34\lib\http\client.py", line 973, in putrequest
    self._output(request.encode('ascii'))
UnicodeEncodeError: 'ascii' codec can't encode characters in position 15-16: ordinal not in range(128)

Process finished with exit code 1

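    As expected, it fails: http.client encodes the request line as ASCII, and the Chinese characters in the query string cannot be encoded.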
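The traceback ends in request.encode('ascii') inside http.client, and that failure can be reproduced without any network access. The request line below is written out by hand as an approximation of what http.client builds for this URL; it is an assumption for illustration, not captured output.

```python
# Hand-written approximation of the HTTP request line for the URL in test 1.
request_line = "GET /s?w=python爬蟲&t=b HTTP/1.1"

error = None
try:
    request_line.encode("ascii")
except UnicodeEncodeError as exc:
    error = exc
    # exc.start and exc.end delimit the characters that could not be encoded.
    print(exc)
    print("offending text:", request_line[exc.start:exc.end])
```

The offending slice should be exactly the two Chinese characters, which is consistent with the positions reported in the traceback above.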

(2) Test 2: percent-encode only the Chinese part

Code example:

import urllib.request
import urllib.parse

url = "http://zzk.cnblogs.com/s?w=python"+ urllib.parse.quote("爬蟲")+"&t=b"
resp = urllib.request.urlopen(url)
print(resp.read().decode('utf-8'))

Output:

[Figure: screenshot of the run output]

Result: after quoting only the Chinese part of the URL, the page can be fetched normally.
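Concatenating quoted pieces by hand gets awkward once there are several parameters. One alternative, which the original tests do not use, is to let urllib.parse.urlencode build the query string from a dict. The sketch below only constructs the URL and does not send a request; the variable names base and params are illustrative.

```python
from urllib.parse import urlencode

base = "http://zzk.cnblogs.com/s"
params = {"w": "python爬蟲", "t": "b"}

# urlencode percent-encodes each value (UTF-8 by default) and joins pairs with '&'.
url = base + "?" + urlencode(params)
print(url)
```

The resulting string contains only ASCII characters, so it can be passed to urllib.request.urlopen in the same way as the URL built in test 2.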

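------@_@! Yet another question-----------------------------------------------------------

Question: what if we quote the whole URL, Chinese and English parts together? Can the page still be fetched?

----------------------------------------------------------------------------------------

(3) Hence test 3: quote the whole URL, Chinese and English parts together

Code example: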

#python3.4
import urllib.request
import urllib.parse

url = urllib.parse.quote("http://zzk.cnblogs.com/s?w=python爬蟲&t=b")
resp = urllib.request.urlopen(url)
print(resp.read().decode('utf-8'))

Output:

C:\Python34\python.exe E:/pythone_workspace/mydemo/spider/demo.py
Traceback (most recent call last):
  File "E:/pythone_workspace/mydemo/spider/demo.py", line 21, in <module>
    resp = urllib.request.urlopen(url)
  File "C:\Python34\lib\urllib\request.py", line 161, in urlopen
    return opener.open(url, data, timeout)
  File "C:\Python34\lib\urllib\request.py", line 448, in open
    req = Request(fullurl, data)
  File "C:\Python34\lib\urllib\request.py", line 266, in __init__
    self.full_url = url
  File "C:\Python34\lib\urllib\request.py", line 292, in full_url
    self._parse()
  File "C:\Python34\lib\urllib\request.py", line 321, in _parse
    raise ValueError("unknown url type: %r" % self.full_url)
ValueError: unknown url type: 'http%3A//zzk.cnblogs.com/s%3Fw%3Dpython%E7%88%AC%E8%99%AB%26t%3Db'

Process finished with exit code 1

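Result: ValueError! The page cannot be fetched. With its default arguments, quote() also escapes ':', '?', '=' and '&', so the string no longer starts with a recognizable scheme such as http://.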

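Putting tests 1, 2 and 3 together, we can conclude:

(1) In Python 3.4, if a URL contains Chinese characters, they can be percent-encoded with urllib.parse.quote("爬蟲").

(2) With the default arguments, only the Chinese part of the URL should be passed to quote(); the whole URL must not be quoted as a single string.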
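A variation that the tests above do not try is the safe parameter of quote(), which lists characters that should be left unescaped. The sketch below compares the default behaviour with a call that keeps the URL delimiters intact. The chosen safe set ":/?&=" is an assumption that fits this particular URL, not a general-purpose setting, and no request is sent.

```python
from urllib.parse import quote

url = "http://zzk.cnblogs.com/s?w=python爬蟲&t=b"

# Default: '/' is the only delimiter kept, so ':', '?', '=' and '&' are escaped.
default_quoted = quote(url)

# Marking the delimiters as safe leaves the URL structure untouched.
safe_quoted = quote(url, safe=":/?&=")

print(default_quoted)
print(safe_quoted)
```

The first line printed has the same broken form as the string in the ValueError from test 3. The second keeps http:// and the query delimiters, and escapes only the Chinese characters.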

Tip: to see which parameters a function accepts, use help():

#python3.4
import urllib.request

help(urllib.request.urlopen)

Running the code above prints the following to the console:

C:\Python34\python.exe E:/pythone_workspace/mydemo/spider/demo.py
Help on function urlopen in module urllib.request:

urlopen(url, data=None, timeout=<object object at 0x00A50490>, *, cafile=None, capath=None, cadefault=False, context=None)

Process finished with exit code 0
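When the parameters are needed programmatically rather than as printed help text, the standard inspect module offers a similar view. The following is a minimal sketch; the exact parameter list depends on the Python version, so no specific output is shown here.

```python
import inspect
import urllib.request

# Build a Signature object describing urlopen's parameters.
signature = inspect.signature(urllib.request.urlopen)

for name, parameter in signature.parameters.items():
    # parameter.kind tells positional-or-keyword apart from keyword-only.
    print(name, parameter.kind, parameter.default)
```

Parameters without a default value report inspect.Parameter.empty as their default.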
