Python Network Programming Study Notes (7): Parsing HTML and XHTML (HTMLParser, BeautifulSoup)

This article is a repost. Originally published 2012-10-15 11:02.

When reposting, please credit: @小五義 http://www.cnblogs.com/xiaowuyi

Python has many libraries that can parse HTML and XHTML, such as HTMLParser, sgmllib, htmllib, BeautifulSoup, mxTidy, and uTidylib. These notes cover the HTMLParser and BeautifulSoup modules.

I. Parsing web pages with HTMLParser
For details, see the official HTMLParser documentation: http://docs.python.org/library/htmlparser.html#HTMLParser.HTMLParser

1. Starting with a simple parsing example
Example 1:
The file test1.html contains:

<html> 
<head> 
<title> XHTML 與 HTML 4.01 標准沒有太多的不同</title> 
</head> 
<body> 
i love you 
</body> 
</html>

The following program prints the contents of the title and body tags:

# -*- coding: utf-8 -*- 
##@小五義：http://www.cnblogs.com/xiaowuyi 
##HTMLParser example 
import HTMLParser 
class TitleParser(HTMLParser.HTMLParser): 
    def __init__(self): 
        self.taglevels=[] 
        self.handledtags=['title','body'] #tags whose text is extracted 
        self.processing=None 
        HTMLParser.HTMLParser.__init__(self) 
    def handle_starttag(self,tag,attrs): 
        if tag in self.handledtags: 
            self.data='' 
            self.processing=tag 
    def handle_data(self,data): 
        if self.processing: 
            self.data +=data 
    def handle_endtag(self,tag): 
        if tag==self.processing: 
            print str(tag)+':'+str(self.gettitle()) 
            self.processing=None 
    def gettitle(self): 
        return self.data 
fd=open('test1.html') 
tp=TitleParser() 
tp.feed(fd.read()) 
fd.close()

The output is:
title: XHTML 與 HTML 4.01 標准沒有太多的不同
body:
i love you
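The program defines a TitleParser class, a subclass of HTMLParser. The feed method of HTMLParser receives the data, and the parser object calls its handler methods as it works through that data. handle_starttag and handle_endtag are called for opening and closing tags, and handle_data is called for text, which is accumulated only while self.processing is not None.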
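
For readers on Python 3, the same flow can be sketched as below. This is a minimal sketch, not the original author's code: it assumes Python 3, where the module is named html.parser, and the results list, the handled_tags parameter, and the inline sample string are names introduced here for illustration. It stores the collected text instead of printing it inside the handler.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collect the text found inside selected tags (here: title and body)."""

    def __init__(self, handled_tags=("title", "body")):
        super().__init__()
        self.handled_tags = handled_tags
        self.processing = None  # tag currently being collected, if any
        self.data = ""          # text collected for that tag so far
        self.results = []       # (tag, text) pairs, in the order the tags close

    def handle_starttag(self, tag, attrs):
        if tag in self.handled_tags:
            self.data = ""
            self.processing = tag

    def handle_data(self, data):
        # Text is kept only while a handled tag is open.
        if self.processing:
            self.data += data

    def handle_endtag(self, tag):
        if tag == self.processing:
            self.results.append((tag, self.data))
            self.processing = None


# Hypothetical inline sample standing in for test1.html.
sample = ("<html><head><title>XHTML and HTML 4.01</title></head>"
          "<body>i love you</body></html>")
parser = TitleParser()
parser.feed(sample)
parser.close()
for tag, text in parser.results:
    print(tag + ":" + text)
```

Keeping the results on the parser object avoids any dependence on a module-level variable inside the handlers.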

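2. Handling HTML entities
(Useful character entities in HTML)
(1) Entity names
The example above breaks down when the HTML contains entities. For instance, change test1.html to:
Example 2: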

<html> 
<head> 
<title> XHTML 與&quot; HTML 4.01 &quot;標准沒有太多的不同</title> 
</head> 
<body> 
i love you&times; 
</body> 
</html>

Parsing this file with the program above gives:
title: XHTML 與 HTML 4.01 標准沒有太多的不同
body:
i love you
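The entities have disappeared entirely. When the parser encounters an entity it calls handle_entityref(); the code above does not define that method, so nothing happens. The revised code is: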

# -*- coding: utf-8 -*- 
##@小五義：http://www.cnblogs.com/xiaowuyi 
##HTMLParser example: handling entities 
from htmlentitydefs import entitydefs 
import HTMLParser 
class TitleParser(HTMLParser.HTMLParser): 
    def __init__(self): 
        self.taglevels=[] 
        self.handledtags=['title','body'] 
        self.processing=None 
        HTMLParser.HTMLParser.__init__(self) 
    def handle_starttag(self,tag,attrs): 
        if tag in self.handledtags: 
            self.data='' 
            self.processing=tag 
    def handle_data(self,data): 
        if self.processing: 
            self.data +=data 
    def handle_endtag(self,tag): 
        if tag==self.processing: 
            print str(tag)+':'+str(self.gettitle()) 
            self.processing=None 
    def handle_entityref(self,name): 
        if name in entitydefs: 
            self.handle_data(entitydefs[name]) 
        else: 
            self.handle_data('&'+name+';') 
    def gettitle(self): 
        return self.data 
fd=open('test1.html') 
tp=TitleParser() 
tp.feed(fd.read()) 
fd.close()

The output is:
title: XHTML 與" HTML 4.01 "標准沒有太多的不同
body:
i love you×
All the entities are now displayed.
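
As a point of comparison, Python 3's html.parser accepts a convert_charrefs argument (on by default since Python 3.5) that converts entities into text before handle_data is called. The sketch below assumes Python 3; TextCollector, collect_text, and the sample markup are hypothetical names introduced here. It runs the same markup in both modes to show where the entities go.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Gather every piece of text passed to handle_data."""

    def __init__(self, convert_charrefs=True):
        super().__init__(convert_charrefs=convert_charrefs)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def collect_text(markup, convert_charrefs):
    parser = TextCollector(convert_charrefs=convert_charrefs)
    parser.feed(markup)
    parser.close()
    return "".join(parser.chunks)


markup = "<p>i love&#247; you&times; &quot;ok&quot;</p>"

# Entities are turned into characters and delivered through handle_data.
converted = collect_text(markup, convert_charrefs=True)

# Entities go to handle_entityref / handle_charref instead; this class does
# not override them, so they are silently dropped, as in Example 1.
dropped = collect_text(markup, convert_charrefs=False)

print(converted)
print(dropped)
```

With conversion switched off, the behaviour matches what the article describes for Python 2: the entity handlers must be overridden or the entities vanish.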

(2) Numeric character references
Example 3:

<html> 
<head> 
<title> XHTML 與&quot; HTML 4.01 &quot;標准沒有太多的不同</title> 
</head> 
<body> 
i love&#247; you&times; 
</body> 
</html>

Running the Example 2 code on this file gives:
title: XHTML 與" HTML 4.01 "標准沒有太多的不同
body:
i love you×
The ÷ that corresponds to &#247; is missing from the output.
Adding handle_charref() fixes this:

# -*- coding: utf-8 -*- 
##@小五義：http://www.cnblogs.com/xiaowuyi 
##HTMLParser example: handling entities 
from htmlentitydefs import entitydefs 
import HTMLParser 
class TitleParser(HTMLParser.HTMLParser): 
    def __init__(self): 
        self.taglevels=[] 
        self.handledtags=['title','body'] 
        self.processing=None 
        HTMLParser.HTMLParser.__init__(self) 
    def handle_starttag(self,tag,attrs): 
        if tag in self.handledtags: 
            self.data='' 
            self.processing=tag 
    def handle_data(self,data): 
        if self.processing: 
            self.data +=data 
    def handle_endtag(self,tag): 
        if tag==self.processing: 
            print str(tag)+':'+str(self.gettitle()) 
            self.processing=None 
    def handle_entityref(self,name): 
        if name in entitydefs: 
            self.handle_data(entitydefs[name]) 
        else: 
            self.handle_data('&'+name+';') 

    def handle_charref(self,name): 
        try: 
            charnum=int(name) 
        except ValueError: 
            return 
        if charnum<1 or charnum>255: 
            return 
        self.handle_data(chr(charnum)) 

    def gettitle(self): 
        return self.data 
fd=open('test1.html') 
tp=TitleParser() 
tp.feed(fd.read()) 
fd.close()

The output is:
title: XHTML 與" HTML 4.01 "標准沒有太多的不同
body:
i love÷ you×
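
The handle_charref shown above accepts only decimal references in the range 1 to 255. A possible extension, sketched here under the assumption of Python 3 (html.parser and html.entities), also accepts hexadecimal references such as &#xF7; and any valid Unicode code point. EntityAwareParser and the sample string are hypothetical names used only for this illustration.

```python
from html.entities import name2codepoint
from html.parser import HTMLParser


class EntityAwareParser(HTMLParser):
    """Resolve named, decimal, and hexadecimal references by hand."""

    def __init__(self):
        # Turn off automatic conversion so the two handlers below are called.
        super().__init__(convert_charrefs=False)
        self.text = ""

    def handle_data(self, data):
        self.text += data

    def handle_entityref(self, name):
        if name in name2codepoint:
            self.handle_data(chr(name2codepoint[name]))
        else:
            # Unknown name: keep the reference as written.
            self.handle_data("&" + name + ";")

    def handle_charref(self, name):
        # name is "247" for &#247; and "xF7" for &#xF7;
        try:
            if name.lower().startswith("x"):
                codepoint = int(name[1:], 16)
            else:
                codepoint = int(name)
            self.handle_data(chr(codepoint))
        except (ValueError, OverflowError):
            # Not a number, or outside the Unicode range: keep it as written.
            self.handle_data("&#" + name + ";")


parser = EntityAwareParser()
parser.feed("a&#247;b&#xF7;c&times;d&bogus;e")
parser.close()
print(parser.text)
```

Unknown or invalid references are passed through unchanged rather than dropped, so nothing in the source text is lost silently.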

3. Extracting links
Example 4:

<html> 
<head> 
<title> XHTML 與&quot; HTML 4.01 &quot;標准沒有太多的不同</title> 
</head> 
<body> 

<a href="http://pypi.python.org/pypi" title="link1">i love&#247; you&times;</a> 
</body> 
</html>

In handle_starttag(self,tag,attrs), when tag is a, attrs holds the tag's attributes as (name, value) pairs, so it is enough to pick out the value whose name is href:

# -*- coding: cp936 -*- 
##@小五義：http://www.cnblogs.com/xiaowuyi 
##HTMLParser example: extracting links 
from htmlentitydefs import entitydefs 
import HTMLParser 
class TitleParser(HTMLParser.HTMLParser): 
    def __init__(self): 
        self.taglevels=[] 
        self.handledtags=['title','body'] 
        self.processing=None 
        HTMLParser.HTMLParser.__init__(self)        
    def handle_starttag(self,tag,attrs): 
        if tag in self.handledtags: 
            self.data='' 
            self.processing=tag 
        if tag =='a': 
            for name,value in attrs: 
                if name=='href': 
                    print '連接地址：'+value 
    def handle_data(self,data): 
        if self.processing: 
            self.data +=data 
    def handle_endtag(self,tag): 
        if tag==self.processing: 
            print str(tag)+':'+str(self.gettitle()) 
            self.processing=None 
    def handle_entityref(self,name): 
        if name in entitydefs: 
            self.handle_data(entitydefs[name]) 
        else: 
            self.handle_data('&'+name+';') 

    def handle_charref(self,name): 
        try: 
            charnum=int(name) 
        except ValueError: 
            return 
        if charnum<1 or charnum>255: 
            return 
        self.handle_data(chr(charnum)) 

    def gettitle(self): 
        return self.data 
fd=open('test1.html') 
tp=TitleParser() 
tp.feed(fd.read()) 
fd.close()

The output is:
title: XHTML 與" HTML 4.01 "標准沒有太多的不同
連接地址：http://pypi.python.org/pypi
body:

i love÷ you×
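
Because attrs arrives as a list of (name, value) pairs, it can be turned into a dictionary when more than one attribute is of interest. The following is a minimal sketch assuming Python 3; LinkExtractor and the sample markup (including the second anchor without an href) are hypothetical and only illustrate the lookup.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Record (href, title) for every <a> tag that carries an href."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            attr_map = dict(attrs)  # [(name, value), ...] -> {name: value}
            href = attr_map.get("href")
            if href is not None:
                # title may be absent, in which case get() returns None
                self.links.append((href, attr_map.get("title")))


extractor = LinkExtractor()
extractor.feed('<a href="http://pypi.python.org/pypi" title="link1">text</a>'
               '<a name="top">anchor without href</a>')
extractor.close()
print(extractor.links)
```

Anchors without an href are skipped instead of raising an error, which matters on real pages where not every a tag is a link.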

4. Extracting images
If the page contains an image, download it and save it as a separate file.
Example 5:

<html> 
<head> 
<title> XHTML 與&quot; HTML 4.01 &quot;標准沒有太多的不同</title> 
</head> 
<body> 
i love&#247; you&times; 
<a href="http://pypi.python.org/pypi" title="link1">我想你</a> 
<div id="m"><img src="http://www.baidu.com/img/baidu_sylogo1.gif" width="270" height="129" ></div> 
</body> 
</html>

The following code downloads and saves baidu_sylogo1.gif:

# -*- coding: cp936 -*- 
##@小五義：http://www.cnblogs.com/xiaowuyi 
##HTMLParser example: extracting images 
from htmlentitydefs import entitydefs 
import HTMLParser,urllib 
def getimage(addr):#download the image and save it in the current directory 
    u = urllib.urlopen(addr) 
    data = u.read() 
    filename=addr.split('/')[-1] 
    f=open(filename,'wb') 
    f.write(data) 
    f.close() 
    print filename+'已經生成！' 

class TitleParser(HTMLParser.HTMLParser): 
    def __init__(self): 
        self.taglevels=[] 
        self.handledtags=['title','body'] 
        self.processing=None 
        HTMLParser.HTMLParser.__init__(self)        
    def handle_starttag(self,tag,attrs): 
        if tag in self.handledtags: 
            self.data='' 
            self.processing=tag 
        if tag =='a': 
            for name,value in attrs: 
                if name=='href': 
                    print '連接地址：'+value 
        if tag=='img': 
            for name,value in attrs: 
                if name=='src': 
                    getimage(value) 
    def handle_data(self,data): 
        if self.processing: 
            self.data +=data 
    def handle_endtag(self,tag): 
        if tag==self.processing: 
            print str(tag)+':'+str(self.gettitle()) 
            self.processing=None 
    def handle_entityref(self,name): 
        if name in entitydefs: 
            self.handle_data(entitydefs[name]) 
        else: 
            self.handle_data('&'+name+';') 

    def handle_charref(self,name): 
        try: 
            charnum=int(name) 
        except ValueError: 
            return 
        if charnum<1 or charnum>255: 
            return 
        self.handle_data(chr(charnum)) 

    def gettitle(self): 
        return self.data 
fd=open('test1.html') 
tp=TitleParser() 
tp.feed(fd.read()) 
fd.close()

The output is:
title: XHTML 與" HTML 4.01 "標准沒有太多的不同
連接地址：http://pypi.python.org/pypi
baidu_sylogo1.gif已經生成！
body:
i love÷ you×
我想你
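
The getimage function above takes the file name from addr.split('/')[-1], which would also pick up any query string. A slightly more careful approach is sketched below, assuming Python 3; ImageSrcCollector, filename_from_url, and the default name download.bin are hypothetical names introduced here. The download itself is left out so that the sketch runs without network access (in Python 3 that step would use urllib.request.urlopen).

```python
import posixpath
from html.parser import HTMLParser
from urllib.parse import urlsplit


class ImageSrcCollector(HTMLParser):
    """Record the src attribute of every <img> tag."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            for name, value in attrs:
                if name == "src" and value:
                    self.sources.append(value)


def filename_from_url(url, default="download.bin"):
    """Use the last path segment of the URL, ignoring query and fragment."""
    name = posixpath.basename(urlsplit(url).path)
    return name or default


collector = ImageSrcCollector()
collector.feed('<div id="m"><img src="http://www.baidu.com/img/baidu_sylogo1.gif" '
               'width="270" height="129" ></div>')
collector.close()
for src in collector.sources:
    print(src, "->", filename_from_url(src))
```

Falling back to a default name covers URLs that end in a slash, where splitting on / would yield an empty file name.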

5. A practical example:
Example 6: list the link addresses on the Renren (人人網) homepage:

#coding: utf-8 
##@小五義：http://www.cnblogs.com/xiaowuyi 
##HTMLParser example: listing the link addresses on the Renren homepage 
from htmlentitydefs import entitydefs 
import HTMLParser,urllib 
def getimage(addr): 
    u = urllib.urlopen(addr) 
    data = u.read() 
    filename=addr.split('/')[-1] 
    f=open(filename,'wb') 
    f.write(data) 
    f.close() 
    print filename+'已經生成！' 
class TitleParser(HTMLParser.HTMLParser): 
    def __init__(self): 
        self.taglevels=[] 
        self.handledtags=['a'] 
        self.processing=None 
        self.linkstring='' 
        self.linkaddr='' 
        HTMLParser.HTMLParser.__init__(self)        
    def handle_starttag(self,tag,attrs): 
        if tag in self.handledtags: 
            for name,value in attrs: 
                if name=='href': 
                    self.linkaddr=value 
            self.processing=tag 

    def handle_data(self,data): 
        if self.processing: 
            self.linkstring +=data 
            #print data.decode('utf-8')+':'+self.linkaddr 
    def handle_endtag(self,tag): 
        if tag==self.processing: 
            print self.linkstring.decode('utf-8')+':'+self.linkaddr 
            self.processing=None 
            self.linkstring='' 
    def handle_entityref(self,name): 
        if name in entitydefs: 
            self.handle_data(entitydefs[name]) 
        else: 
            self.handle_data('&'+name+';') 

    def handle_charref(self,name): 
        try: 
            charnum=int(name) 
        except ValueError: 
            return 
        if charnum<1 or charnum>255: 
            return 
        self.handle_data(chr(charnum)) 

    def gettitle(self): 
        return self.linkaddr 
tp=TitleParser() 
tp.feed(urllib.urlopen('http://www.renren.com/').read())

The output is:
分享:http://share.renren.com
應用程序:http://app.renren.com
公共主頁:http://page.renren.com
人人生活:http://life.renren.com
人人小組:http://xiaozu.renren.com/
同名同姓:http://name.renren.com
人人中學:http://school.renren.com/allpages.html
大學百科:http://school.renren.com/daxue/
人人熱點:http://life.renren.com/hot
人人小站:http://zhan.renren.com/
人人逛街:http://j.renren.com/
人人校招:http://xiaozhao.renren.com/
:http://www.renren.com
注冊:http://wwv.renren.com/xn.do?ss=10113&rt=27
登錄:http://www.renren.com/
幫助:http://support.renren.com/helpcenter
給我們提建議:http://support.renren.com/link/suggest
更多:#
:javascript:closeError();
打開郵箱查收確認信:#
重新輸入:javascript:closeError();
:javascript:closeStop();
客服:http://help.renren.com/#http://help.renren.com/support/contomvice?pid=2&selection={couId:193,proId:342,cityId:1000375}
:javascript:closeLock();
立即解鎖:http://safe.renren.com/relive.do
忘記密碼？:http://safe.renren.com/findPass.do
忘記密碼？:http://safe.renren.com/findPass.do
換一張:javascript:refreshCode_login();
MSN:#
360:https://openapi.360.cn/oauth2/authorize?client_id=5ddda4458747126a583c5d58716bab4c&response_type=code&redirect_uri=http://www.renren.com/bind/tsz/tszLoginCallBack&scope=basic&display=default
天翼:https://oauth.api.189.cn/emp/oauth2/authorize?app_id=296961050000000294&response_type=code&redirect_uri=http://www.renren.com/bind/ty/tyLoginCallBack
為什么要填寫我的生日？:#birthday
看不清換一張?:javascript:refreshCode();
想了解更多人人網功能？點擊此處:javascript:;
:javascript:;
:javascript:;
立刻注冊:http://reg.renren.com/xn6245.do?ss=10113&rt=27
關於:http://www.renren.com/siteinfo/about
開放平台:http://dev.renren.com
人人游戲:http://wan.renren.com
公共主頁:http://page.renren.com/register/regGuide/
手機人人:http://mobile.renren.com/mobilelink.do?psf=40002
團購:http://www.nuomi.com
皆喜網:http://www.jiexi.com
營銷服務:http://ads.renren.com
招聘:http://job.renren-inc.com/
客服幫助:http://support.renren.com/helpcenter
隱私:http://www.renren.com/siteinfo/privacy
京ICP證090254號:http://www.miibeian.gov.cn/
互聯網葯品信息服務資格證:http://a.xnimg.cn/n/core/res/certificate.jpg
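
The output above mixes real addresses with page-internal entries such as # and javascript: handlers. One way to filter a collected list is sketched below, assuming Python 3 and assuming the parser has already produced (text, href) pairs. The function name navigable_links and the relative /siteinfo/about entry are hypothetical; the other sample pairs are taken from the output above.

```python
from urllib.parse import urljoin


def navigable_links(pairs, base_url):
    """Drop fragment-only and javascript: links; resolve the rest against base_url."""
    result = []
    for text, href in pairs:
        href = href.strip()
        if not href or href.startswith("#"):
            continue  # empty or same-page anchor
        if href.lower().startswith("javascript:"):
            continue  # script handler, not an address
        result.append((text.strip(), urljoin(base_url, href)))
    return result


pairs = [
    ("分享", "http://share.renren.com"),
    ("更多", "#"),
    ("", "javascript:closeError();"),
    ("關於", "/siteinfo/about"),  # hypothetical relative link
]
kept = navigable_links(pairs, "http://www.renren.com/")
for text, href in kept:
    print(text + ":" + href)
```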

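II. Parsing web pages with BeautifulSoup
1. Downloading and installing BeautifulSoup
Download: http://www.crummy.com/software/BeautifulSoup/download/3.x/
Documentation (Chinese): http://www.crummy.com/software/BeautifulSoup/bs3/documentation.zh.html#Entity%20Conversion
Installation: unpack the downloaded archive; the resulting folder contains a setup.py file. From a command prompt (cmd), change to that folder and run python setup.py install. After a successful install, import BeautifulSoup works directly in Python.
2. Starting with a simple parsing example
Example 7: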

<html> 
<head> 
<title> XHTML 與&quot; HTML 4.01 &quot;標准沒有太多的不同</title> 
</head> 
<body> 
i love&#247; you&times; 
<a href="http://pypi.python.org/pypi" title="link1">我想你</a> 
<div id="m"><img src="http://www.baidu.com/img/baidu_sylogo1.gif" width="270" height="129" ></div> 
</body> 
</html>

Code to get the title:

#coding: utf8 
##@小五義：http://www.cnblogs.com/xiaowuyi 
##BeautifulSoup example: title 
import BeautifulSoup 

a=open('test1.html','r') 
htmlline=a.read() 
a.close() 
soup=BeautifulSoup.BeautifulSoup(htmlline.decode('gb2312')) 
#print soup.prettify()#pretty-print the html 
titleTag=soup.html.head.title 
print titleTag.string

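The output is:
XHTML 與&quot; HTML 4.01 &quot;標准沒有太多的不同
The code and its output raise two points:
First, pay attention to the character encoding when initializing with BeautifulSoup.BeautifulSoup(htmlline.decode('gb2312')). After some searching online I first tried utf-8, which displayed incorrectly; switching to gb2312 fixed it. The soup.originalEncoding attribute can be used to check the encoding of the source file.
Second, the output leaves the character entities unconverted. The BeautifulSoup Chinese documentation has a section on entity conversion; changing the code as follows makes the result display correctly: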

#coding: utf8 
##@小五義：http://www.cnblogs.com/xiaowuyi 
##BeautifulSoup example: title 
import BeautifulSoup 
a=open('test1.html','r') 
htmlline=a.read() 
a.close() 
soup=BeautifulSoup.BeautifulStoneSoup(htmlline.decode('gb2312'),convertEntities=BeautifulSoup.BeautifulStoneSoup.ALL_ENTITIES) 
#print soup.prettify()#pretty-print the html 
titleTag=soup.html.head.title 
print titleTag.string

Here, ALL_ENTITIES in convertEntities=BeautifulSoup.BeautifulStoneSoup.ALL_ENTITIES covers the entity definitions of both XML and HTML. XML_ENTITIES or HTML_ENTITIES can also be used directly. The output is:
XHTML 與" HTML 4.01 "標准沒有太多的不同
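
The two points above, decoding the bytes with the right codec and then converting entities, can also be illustrated with the Python 3 standard library alone. This is a sketch under that assumption and is not how BeautifulSoup works internally; decode_with_fallback and the sample bytes are hypothetical names and data introduced here.

```python
import html


def decode_with_fallback(raw, encodings=("utf-8", "gb2312")):
    """Try each candidate codec in order; return (text, codec_that_worked)."""
    for name in encodings:
        try:
            return raw.decode(name), name
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings matched")


# Hypothetical bytes standing in for a page saved as gb2312.
raw = '<title>我想你 &quot;HTML 4.01&quot; &#247; &times;</title>'.encode("gb2312")

text, used = decode_with_fallback(raw)
plain = html.unescape(text)  # &quot; -> ", &#247; -> ÷, &times; -> ×
print(used)
print(plain)
```

The order of candidates matters: a codec that accepts any byte sequence would never fail and would hide a wrong guess, so stricter codecs such as utf-8 are tried first.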
3. Extracting links
Using the same example file, the code becomes:

#coding: utf8 
##@小五義：http://www.cnblogs.com/xiaowuyi 
##BeautifulSoup example: extracting links 
import BeautifulSoup 
a=open('test1.html','r') 
htmlline=a.read() 
a.close() 
soup=BeautifulSoup.BeautifulStoneSoup(htmlline.decode('gb2312'),convertEntities=BeautifulSoup.BeautifulStoneSoup.ALL_ENTITIES) 
name=soup.find('a').string 
links=soup.find('a')['href'] 
print name+':'+links

The output is:
我想你:http://pypi.python.org/pypi
4. Extracting images
Again using the same file, extract the Baidu image.
The code is:

##@小五義：http://www.cnblogs.com/xiaowuyi
#coding: utf8 
import BeautifulSoup,urllib 
def getimage(addr):#download the image and save it in the current directory 
    u = urllib.urlopen(addr) 
    data = u.read() 
    filename=addr.split('/')[-1] 
    f=open(filename,'wb') 
    f.write(data) 
    f.close() 
    print filename+' finished!' 
a=open('test1.html','r') 
htmlline=a.read() 
a.close() 
soup=BeautifulSoup.BeautifulStoneSoup(htmlline.decode('gb2312'),convertEntities=BeautifulSoup.BeautifulStoneSoup.ALL_ENTITIES) 
links=soup.find('img')['src'] 
getimage(links)

Both link extraction and image extraction rely mainly on the find method, whose signature is:
find(name, attrs, recursive, text, **kwargs)
findAll returns every match, while find returns only the first. Note that findAll returns a list.
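
The difference between the two calls can be illustrated without BeautifulSoup. The sketch below emulates the idea with Python 3's html.parser; it is not BeautifulSoup's implementation or API, and TagIndex with its find and find_all methods is a hypothetical class written only to show the "first match" versus "list of matches" behaviour.

```python
from html.parser import HTMLParser


class TagIndex(HTMLParser):
    """Remember every start tag so that it can be looked up by name."""

    def __init__(self):
        super().__init__()
        self.tags = []  # (tag name, attribute dict) in document order

    def handle_starttag(self, tag, attrs):
        self.tags.append((tag, dict(attrs)))

    def find_all(self, name):
        # Always a list, possibly empty.
        return [attrs for tag, attrs in self.tags if tag == name]

    def find(self, name):
        # First match only, or None when nothing matches.
        matches = self.find_all(name)
        return matches[0] if matches else None


index = TagIndex()
index.feed('<a href="/one">1</a><a href="/two">2</a>')
index.close()
print(index.find("a")["href"])
print([attrs["href"] for attrs in index.find_all("a")])
print(index.find("img"))
```

Since a "first match" lookup yields None when nothing matches, an expression such as soup.find('img')['src'] should be guarded on pages that may have no img tag.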
5. A practical example:
Example 8: list the link addresses on the Renren (人人網) homepage:

#coding: utf8 
##@小五義：http://www.cnblogs.com/xiaowuyi 
##BeautifulSoup example: listing the link addresses on the Renren homepage 
import BeautifulSoup,urllib 
linkname='' 
htmlline=urllib.urlopen('http://www.renren.com/').read() 
soup=BeautifulSoup.BeautifulStoneSoup(htmlline.decode('utf-8')) 
links=soup.findAll('a') 
for i in links: 
    ##skip <a> tags that have no href attribute 
    linkaddr=i.get('href') 
    if linkaddr is not None: 
        linkname=i.string 
        if linkname is None:#i.string is None when the tag has no single text child 
            print linkaddr 
        else: 
            print linkname+':'+linkaddr

The output is:
分享:http://share.renren.com
應用程序:http://app.renren.com
公共主頁:http://page.renren.com
人人生活:http://life.renren.com
人人小組:http://xiaozu.renren.com/
同名同姓:http://name.renren.com
人人中學:http://school.renren.com/allpages.html
大學百科:http://school.renren.com/daxue/
人人熱點:http://life.renren.com/hot
人人小站:http://zhan.renren.com/
人人逛街:http://j.renren.com/
人人校招:http://xiaozhao.renren.com/
http://www.renren.com
注冊:http://wwv.renren.com/xn.do?ss=10113&rt=27
登錄:http://www.renren.com/
幫助:http://support.renren.com/helpcenter
給我們提建議:http://support.renren.com/link/suggest
更多:#
javascript:closeError();
打開郵箱查收確認信:#
重新輸入:javascript:closeError();
javascript:closeStop();
客服:http://help.renren.com/#http://help.renren.com/support/contomvice?pid=2&selection={couId:193,proId:342,cityId:1000375}
javascript:closeLock();
立即解鎖:http://safe.renren.com/relive.do
忘記密碼？:http://safe.renren.com/findPass.do
忘記密碼？:http://safe.renren.com/findPass.do
換一張:javascript:refreshCode_login();
MSN:#
360:https://openapi.360.cn/oauth2/authorize?client_id=5ddda4458747126a583c5d58716bab4c&response_type=code&redirect_uri=http://www.renren.com/bind/tsz/tszLoginCallBack&scope=basic&display=default
天翼:https://oauth.api.189.cn/emp/oauth2/authorize?app_id=296961050000000294&response_type=code&redirect_uri=http://www.renren.com/bind/ty/tyLoginCallBack
#birthday
看不清換一張?:javascript:refreshCode();
javascript:;
javascript:;
立刻注冊:http://reg.renren.com/xn6245.do?ss=10113&rt=27
關於:http://www.renren.com/siteinfo/about
開放平台:http://dev.renren.com
人人游戲:http://wan.renren.com
公共主頁:http://page.renren.com/register/regGuide/
手機人人:http://mobile.renren.com/mobilelink.do?psf=40002
團購:http://www.nuomi.com
皆喜網:http://www.jiexi.com
營銷服務:http://ads.renren.com
招聘:http://job.renren-inc.com/
客服幫助:http://support.renren.com/helpcenter
隱私:http://www.renren.com/siteinfo/privacy
京ICP證090254號:http://www.miibeian.gov.cn/
互聯網葯品信息服務資格證:http://a.xnimg.cn/n/core/res/certificate.jpg
