A General Fix for Garbled Text When Scraping Chinese Web Pages with Python

Reposted article. Published 2013-08-11 18:19. Category: Python

Note: reposted from http://www.cnpythoner.com/

When scraping web pages with Python, we often run into garbled text (mojibake). Today I'd like to share a general way to fix it, especially for Chinese pages.

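First, install the chardet module; you can do this with easy_install or pip.

Once it is installed, import the module in a Python console. If the import succeeds without an error, you're ready to go.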
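If you'd rather check for the module from a script than from the console, something like the following Python 3 sketch should work. The helper name `is_installed` is made up for illustration; `importlib.util.find_spec` is part of the standard library.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module can be found without importing it."""
    return importlib.util.find_spec(module_name) is not None


# Report whether chardet is available in the current environment
print("chardet installed:", is_installed("chardet"))
```

If this reports False, run the pip or easy_install step again in the same environment that runs your script.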

For example, pages that show up as ISO-8859-2 can also be handled with the method below.

Straight to the code:

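# Python 2 (urllib2 and the print statement)
import urllib2
import sys
import chardet

req = urllib2.Request("http://www.163.com/")  # you can also try http://www.baidu.com or http://www.sohu.com
content = urllib2.urlopen(req).read()
typeEncode = sys.getfilesystemencoding()  # filesystem encoding, used here as the system's default encoding
# Detect the page encoding with the third-party chardet module.
# detect() can report None for 'encoding', so fall back to utf-8 in that case.
infoencode = chardet.detect(content).get('encoding') or 'utf-8'
# Decode to unicode first, then encode to the system encoding for output
html = content.decode(infoencode, 'ignore').encode(typeEncode, 'ignore')
print html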

The code above should take care of the garbled text you run into when scraping.
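When chardet is not available, or when it cannot name an encoding, a simpler fallback is to try a short list of likely encodings in order. This is a different technique from chardet's statistical detection, not a replacement for it. The sketch below is Python 3; the function name `decode_page` and the candidate list (utf-8, then gb18030) are assumptions chosen for Chinese pages.

```python
def decode_page(raw, candidates=("utf-8", "gb18030")):
    """Decode raw page bytes by trying each candidate encoding in order."""
    for name in candidates:
        try:
            return raw.decode(name)
        except UnicodeDecodeError:
            continue  # wrong guess, try the next encoding
    # Nothing decoded cleanly: use the first candidate and replace the bad bytes
    return raw.decode(candidates[0], errors="replace")


# The same text encoded two different ways decodes to the same string
gbk_bytes = "中文網頁".encode("gbk")
utf8_bytes = "中文網頁".encode("utf-8")
print(decode_page(gbk_bytes))
print(decode_page(utf8_bytes))
```

Putting utf-8 first matters: it is strict enough that bytes in another encoding usually fail to decode, while a permissive single-byte encoding would accept almost anything and hide the error.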

Next, let's go a little deeper into web crawling:

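We'll use the article list of Han Han's blog as the example: http://blog.sina.com.cn/s/articlelist_1191258123_0_1.html

[Screenshot: the blog's article list page]

I use Chrome (Firefox works too). Open the page above, right-click, and choose Inspect Element to bring up the inspector panel.

[Screenshot: the inspector panel open on the article list page]

First, let's get the URL of the page for the first article, "一次告別".

Press Ctrl+F to open the search bar at the bottom left of the inspector panel and type "一次告別". The inspector jumps to the matching place in the HTML source and highlights it.

Next, we need to extract the corresponding URL: http://blog.sina.com.cn/s/blog_4701280b0102ek51.html

The full code is as follows: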

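# coding: utf-8
# Python 2: str0 is a byte string, so the indexes below count UTF-8 bytes
import urllib

str0 = '<a title="一次告別" target="_blank" href="http://blog.sina.com.cn/s/blog_4701280b0102ek51.html">一次告別</a>'
title = str0.find('<a title')  # start of the link tag
print title
href = str0.find('href=')  # start of the href attribute
print href
html = str0.find('.html')  # start of the .html suffix
print html
url = str0[href + 6:html + 5]  # skip 'href="' (6 characters), keep '.html' (5 characters)
print url
content = urllib.urlopen(url).read()
# print content
filename = url[-26:]  # last 26 characters: blog_4701280b0102ek51.html
print filename
with open(filename, 'w') as f:  # the file is closed once the page is written
    f.write(content)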

catchBlog.py

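Here is how the code works:

First, find() looks up '<a title', 'href=', and '.html' in turn, which gives the index of each marker. With those indexes, the URL http://blog.sina.com.cn/s/blog_4701280b0102ek51.html can be sliced out as [href+6:html+5].
Finally, urllib opens the URL and reads the page into content, which is then written to a local file.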
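The slicing step can be wrapped in a small function. The following is a Python 3 sketch of the same find-and-slice idea; the function name `extract_url` is invented for illustration, and the magic numbers 6 and 5 are spelled out as the lengths of the markers they skip.

```python
def extract_url(snippet):
    """Slice the href value out of an <a title=...> snippet using str.find."""
    href = snippet.find('href="')
    html = snippet.find(".html", href)
    if href == -1 or html == -1:
        return None  # one of the markers is missing
    start = href + len('href="')  # same as href + 6 in the script
    end = html + len(".html")     # same as html + 5 in the script
    return snippet[start:end]


snippet = ('<a title="一次告別" target="_blank" '
           'href="http://blog.sina.com.cn/s/blog_4701280b0102ek51.html">一次告別</a>')
print(extract_url(snippet))
```

Note that in Python 3 the snippet is a text string, so find() counts characters and reports 32 for 'href=' and 85 for '.html'. The Python 2 script works on UTF-8 bytes, where each of the four Chinese characters takes three bytes, which is why it prints 40 and 93. The slice yields the same URL either way.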

Running the program prints:

0
40
93
http://blog.sina.com.cn/s/blog_4701280b0102ek51.html
blog_4701280b0102ek51.html

The HTML file blog_4701280b0102ek51.html is then created in the directory the script was run from.
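The script gets that file name by taking the last 26 characters of the URL, which only works while every article URL ends in a name of exactly that length. A less brittle option, sketched here in Python 3 with the standard library, is to take the last component of the URL path. The function name `filename_from_url` is an assumption for illustration.

```python
import posixpath
from urllib.parse import urlparse


def filename_from_url(url):
    """Return the last path component of a URL, ignoring any query string."""
    return posixpath.basename(urlparse(url).path)


print(filename_from_url("http://blog.sina.com.cn/s/blog_4701280b0102ek51.html"))
```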

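At this point we have the first article's URL and page content. The steps above covered three things: 1. analyzing the structure of the blog's article list, 2. extracting a URL from a string, and 3. downloading the post to disk.

Going one step further: collect the links to all articles on the first page of the list and download every one of them.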

# coding: utf-8
# Python 2 (urllib.urlopen and the print statement)
import os
import time
import urllib

MAX_ARTICLES = 50
urls = []
con = urllib.urlopen('http://blog.sina.com.cn/s/articlelist_1191258123_0_1.html').read()

# Locate the first '<a title=' link, then its href= and the closing .html
title = con.find('<a title=')
href = con.find('href=', title)
html = con.find('.html', href)

while title != -1 and href != -1 and html != -1 and len(urls) < MAX_ARTICLES:
    urls.append(con[href + 6:html + 5])
    print urls[-1]
    # Continue searching after the link that was just found
    title = con.find('<a title=', html)
    href = con.find('href=', title)
    html = con.find('.html', href)
print 'Find end!'

if not os.path.isdir('hanhan'):
    os.mkdir('hanhan')  # the output directory must exist before writing

# Download only the links that were actually found
for url in urls:
    print 'downloading', url
    content = urllib.urlopen(url).read()
    with open('hanhan/' + url[-26:], 'w') as f:
        f.write(content)
    time.sleep(15)  # pause between requests
print 'Download article finish!'
# print 'con', con

catchBlog1.py
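Chained find() calls depend on the exact order of attributes inside each tag. As an alternative technique (not the one used in catchBlog1.py), the links can be collected with the standard library's HTML parser. This is a minimal Python 3 sketch; the class name `TitledLinkParser`, the function name `extract_article_links`, and the inline sample markup are all assumptions for illustration, and the real page may need extra filtering.

```python
from html.parser import HTMLParser


class TitledLinkParser(HTMLParser):
    """Collect href values from <a> tags that carry a title attribute."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr_map = dict(attrs)
        href = attr_map.get("href")
        # Keep only titled links that point at an .html page
        if "title" in attr_map and href and href.endswith(".html"):
            self.links.append(href)


def extract_article_links(page_html, limit=50):
    """Return up to `limit` article URLs found in the page markup."""
    parser = TitledLinkParser()
    parser.feed(page_html)
    return parser.links[:limit]


# Hypothetical sample: one titled article link and one untitled link
sample = (
    '<div><a title="一次告別" target="_blank" '
    'href="http://blog.sina.com.cn/s/blog_4701280b0102ek51.html">一次告別</a>'
    '<a href="http://blog.sina.com.cn/">home</a></div>'
)
print(extract_article_links(sample))
```

The parser expects text, so decode the downloaded bytes before calling feed().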
