python--Crawler Basics (8): Trying Out HTMLParser to Parse Web Pages, and a Combined Fetch-and-Parse Exercise

Reposted article. 2016-03-31 12:54. Tags: crawler / python

All posts in this python series assume a python3.4 environment.

Basic concepts

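The core of html.parser is the HTMLParser class. The workflow is as follows: when you feed it a string of HTML-like markup, it calls the goahead method to step forward through the markup, calls the matching parse_xxxx methods to extract start tags, end tags, data, comments and similar pieces, and then calls the corresponding handle_xxxx methods to process what was extracted.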
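Because feed only processes what it can fully recognise, a tag split across two feed calls is reported only once the rest arrives. The following is a minimal sketch of that behaviour; TagCollector and the sample markup are made up for illustration and are not part of the original post.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Hypothetical helper: remembers every start/end tag it is told about."""

    def __init__(self):
        super().__init__()
        self.seen = []

    def handle_starttag(self, tag, attrs):
        self.seen.append(("start", tag))

    def handle_endtag(self, tag):
        self.seen.append(("end", tag))


parser = TagCollector()
parser.feed("<di")                    # incomplete start tag: kept in the buffer
after_first = list(parser.seen)       # snapshot before the rest arrives
parser.feed('v class="top"></div>')   # completes the tag and closes it
parser.close()                        # flush anything still buffered
print(after_first)
print(parser.seen)
```

This is why close() should be called once all input has been fed: it tells the parser that no further data is coming.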

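Some of the more commonly used handler methods:

handle_startendtag  # handles XHTML-style empty tags such as <img ... />; by default it calls handle_starttag and then handle_endtag
handle_starttag     # handles a start tag, e.g. <xx>
handle_endtag       # handles an end tag, e.g. </xx> (also reached for <.../> through the default handle_startendtag)
handle_charref      # handles numeric character references of the form &#NNN; or &#xNNN; (only called when convert_charrefs is False)
handle_entityref    # handles named character references of the form &name;, e.g. &nbsp; (only called when convert_charrefs is False)
handle_data         # handles the text between <xx> and </xx>
handle_comment      # handles comments, e.g. <!-- ... -->
handle_decl         # handles declarations starting with <!, e.g. <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
handle_pi           # handles processing instructions of the form <?instruction>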
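To see which handler fires for which piece of markup, the methods listed above can all be overridden in one small logging subclass. This is only an illustrative sketch: EventLogger, log_events and the sample string are hypothetical, and convert_charrefs=False is passed explicitly so that handle_charref and handle_entityref are actually called.

```python
from html.parser import HTMLParser


class EventLogger(HTMLParser):
    """Hypothetical helper: records (handler, payload) for every callback."""

    def __init__(self):
        # Keep character references as separate events instead of
        # having them merged into the surrounding text.
        super().__init__(convert_charrefs=False)
        self.events = []

    def handle_decl(self, decl):
        self.events.append(("decl", decl))

    def handle_pi(self, data):
        self.events.append(("pi", data))

    def handle_comment(self, data):
        self.events.append(("comment", data))

    def handle_starttag(self, tag, attrs):
        self.events.append(("starttag", tag))

    def handle_endtag(self, tag):
        self.events.append(("endtag", tag))

    def handle_startendtag(self, tag, attrs):
        # Overriding this replaces the default start+end dispatch.
        self.events.append(("startendtag", tag))

    def handle_data(self, data):
        self.events.append(("data", data))

    def handle_charref(self, name):
        self.events.append(("charref", name))

    def handle_entityref(self, name):
        self.events.append(("entityref", name))


def log_events(markup):
    """Parse markup and return the list of recorded events."""
    parser = EventLogger()
    parser.feed(markup)
    parser.close()
    return parser.events


sample = "<!DOCTYPE html><?php x?><!--note--><p>a&#62;b&nbsp;c</p><br/>"
for event in log_events(sample):
    print(event)
```

Because handle_startendtag is overridden here, `<br/>` is logged once as a startendtag event rather than as a start tag followed by an end tag.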

@_@) Next, let's try out html.parser!

The following HTML fragment is used as test data:

<head>
    <meta charset="utf-8"/>
    <title>找找看 - 博客園</title>
    <link rel="shortcut icon" href="/Content/Images/favicon.ico" type="image/x-icon"/>
    <meta content="技術搜索,IT搜索,程序搜索,代碼搜索,程序員搜索引擎" name="keywords" />
    <meta content="面向程序員的專業搜索引擎。遇到技術問題怎么辦，到博客園找找看..." name="description" />
    <link type="text/css" href="/Content/Style.css" rel="stylesheet" />
    <script src="http://common.cnblogs.com/script/jquery.js" type="text/javascript"></script>
    <script src="/Scripts/Common.js" type="text/javascript"></script>
    <script src="/Scripts/Home.js" type="text/javascript"></script>
</head>

Trying out three basic handler methods:

def handle_starttag(self, tag, attrs)  # handles a start tag, e.g. <xx>
def handle_data(self, data)            # handles the text between <xx> and </xx>
def handle_endtag(self, tag)           # handles an end tag, e.g. </xx> or <.../>

Code example (python3.4):

import html.parser as h

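class MyHTMLParser(h.HTMLParser):

    # True while the parser is inside a <title> element
    a_t = False

    # Handle a start tag, e.g. <xx>
    def handle_starttag(self, tag, attrs):
        print("開始一個標簽:", tag)    # label means "start tag:"

        if tag == "title":
            self.a_t = True

        for attr in attrs:
            print("屬性值：", attr)    # label means "attribute:"; attr is a (name, value) tuple

    # Handle the text between <xx> and </xx>
    def handle_data(self, data):
        if self.a_t:
            print("得到的數據: ", data)    # label means "data:"

    # Handle an end tag, e.g. </xx>; <.../> also ends up here
    # through the default handle_startendtag
    def handle_endtag(self, tag):
        self.a_t = False
        print("結束一個標簽:", tag)    # label means "end tag:"
        print()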

p=MyHTMLParser()
mystr = '''<head>
    <meta charset="utf-8"/>
    <title>找找看 - 博客園</title>
    <link rel="shortcut icon" href="/Content/Images/favicon.ico" type="image/x-icon"/>
    <meta content="技術搜索,IT搜索,程序搜索,代碼搜索,程序員搜索引擎" name="keywords" />
    <meta content="面向程序員的專業搜索引擎。遇到技術問題怎么辦，到博客園找找看..." name="description" />
    <link type="text/css" href="/Content/Style.css" rel="stylesheet" />
    <script src="http://common.cnblogs.com/script/jquery.js" type="text/javascript"></script>
    <script src="/Scripts/Common.js" type="text/javascript"></script>
    <script src="/Scripts/Home.js" type="text/javascript"></script>
</head>'''
p.feed(mystr)
p.close()

Output:

C:\Python34\python.exe E:/pythone_workspace/mydemo/spider/h2.py
開始一個標簽: head
開始一個標簽: meta
屬性值： ('charset', 'utf-8')
結束一個標簽: meta

開始一個標簽: title
得到的數據:  找找看 - 博客園
結束一個標簽: title

開始一個標簽: link
屬性值： ('rel', 'shortcut icon')
屬性值： ('href', '/Content/Images/favicon.ico')
屬性值： ('type', 'image/x-icon')
結束一個標簽: link

開始一個標簽: meta
屬性值： ('content', '技術搜索,IT搜索,程序搜索,代碼搜索,程序員搜索引擎')
屬性值： ('name', 'keywords')
結束一個標簽: meta

開始一個標簽: meta
屬性值： ('content', '面向程序員的專業搜索引擎。遇到技術問題怎么辦，到博客園找找看...')
屬性值： ('name', 'description')
結束一個標簽: meta

開始一個標簽: link
屬性值： ('type', 'text/css')
屬性值： ('href', '/Content/Style.css')
屬性值： ('rel', 'stylesheet')
結束一個標簽: link

開始一個標簽: script
屬性值： ('src', 'http://common.cnblogs.com/script/jquery.js')
屬性值： ('type', 'text/javascript')
結束一個標簽: script

開始一個標簽: script
屬性值： ('src', '/Scripts/Common.js')
屬性值： ('type', 'text/javascript')
結束一個標簽: script

開始一個標簽: script
屬性值： ('src', '/Scripts/Home.js')
屬性值： ('type', 'text/javascript')
結束一個標簽: script

結束一個標簽: head


Process finished with exit code 0
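One detail of the parser above: handle_endtag clears the a_t flag on every end tag, not only on `</title>`. That is harmless for this test data, but a variant that reacts only to the title tag and collects the text instead of printing it could look like the sketch below. TitleExtractor and extract_title are hypothetical names chosen for illustration.

```python
from html.parser import HTMLParser


class TitleExtractor(HTMLParser):
    """Hypothetical helper: collects the text inside <title>...</title>."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        # Only the matching end tag leaves "title mode".
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        # Data may arrive in several chunks, so collect and join later.
        if self.in_title:
            self.parts.append(data)


def extract_title(markup):
    """Return the stripped title text, or an empty string if there is none."""
    parser = TitleExtractor()
    parser.feed(markup)
    parser.close()
    return "".join(parser.parts).strip()


print(extract_title('<head><meta charset="utf-8"/><title>找找看 - 博客園</title></head>'))
```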


-------@_@? html.parser------------------------------------------------------------

Question: besides the commonly used features listed above, what else does it offer?

--------------------------------------------------------------------------------------

Let's find out what else html.parser can do!

The code:

import html.parser
help(html.parser)

Output:

C:\Python34\python.exe E:/pythone_workspace/mydemo/test.py
Help on module html.parser in html:

NAME
    html.parser - A parser for HTML and XHTML.

CLASSES
    _markupbase.ParserBase(builtins.object)
        HTMLParser
    
    class HTMLParser(_markupbase.ParserBase)
     |  Find tags and other markup and call handler functions.
     |  
     |  Usage:
     |      p = HTMLParser()
     |      p.feed(data)
     |      ...
     |      p.close()
     |  
     |  Start tags are handled by calling self.handle_starttag() or
     |  self.handle_startendtag(); end tags by self.handle_endtag().  The
     |  data between tags is passed from the parser to the derived class
     |  by calling self.handle_data() with the data as argument (the data
     |  may be split up in arbitrary chunks).  If convert_charrefs is
     |  True the character references are converted automatically to the
     |  corresponding Unicode character (and self.handle_data() is no
     |  longer split in chunks), otherwise they are passed by calling
     |  self.handle_entityref() or self.handle_charref() with the string
     |  containing respectively the named or numeric reference as the
     |  argument.
     |  
     |  Method resolution order:
     |      HTMLParser
     |      _markupbase.ParserBase
     |      builtins.object
     |  
     |  Methods defined here:
     |  
     |  __init__(self, strict=<object object at 0x00A50488>, *, convert_charrefs=<object object at 0x00A50488>)
     |      Initialize and reset this instance.
     |      
     |      If convert_charrefs is True (default: False), all character references
     |      are automatically converted to the corresponding Unicode characters.
     |      If strict is set to False (the default) the parser will parse invalid
     |      markup, otherwise it will raise an error.  Note that the strict mode
     |      and argument are deprecated.
     |  
     |  check_for_whole_start_tag(self, i)
     |      # Internal -- check to see if we have a complete starttag; return end
     |      # or -1 if incomplete.
     |  
     |  clear_cdata_mode(self)
     |  
     |  close(self)
     |      Handle any buffered data.
     |  
     |  error(self, message)
     |  
     |  feed(self, data)
     |      Feed data to the parser.
     |      
     |      Call this as often as you want, with as little or as much text
     |      as you want (may include '\n').
     |  
     |  get_starttag_text(self)
     |      Return full source of start tag: '<...>'.
     |  
     |  goahead(self, end)
     |      # Internal -- handle data as far as reasonable.  May leave state
     |      # and data to be processed by a subsequent call.  If 'end' is
     |      # true, force handling all data as if followed by EOF marker.
     |  
     |  handle_charref(self, name)
     |      # Overridable -- handle character reference
     |  
     |  handle_comment(self, data)
     |      # Overridable -- handle comment
     |  
     |  handle_data(self, data)
     |      # Overridable -- handle data
     |  
     |  handle_decl(self, decl)
     |      # Overridable -- handle declaration
     |  
     |  handle_endtag(self, tag)
     |      # Overridable -- handle end tag
     |  
     |  handle_entityref(self, name)
     |      # Overridable -- handle entity reference
     |  
     |  handle_pi(self, data)
     |      # Overridable -- handle processing instruction
     |  
     |  handle_startendtag(self, tag, attrs)
     |      # Overridable -- finish processing of start+end tag: <tag.../>
     |  
     |  handle_starttag(self, tag, attrs)
     |      # Overridable -- handle start tag
     |  
     |  parse_bogus_comment(self, i, report=1)
     |      # Internal -- parse bogus comment, return length or -1 if not terminated
     |      # see http://www.w3.org/TR/html5/tokenization.html#bogus-comment-state
     |  
     |  parse_endtag(self, i)
     |      # Internal -- parse endtag, return end or -1 if incomplete
     |  
     |  parse_html_declaration(self, i)
     |      # Internal -- parse html declarations, return length or -1 if not terminated
     |      # See w3.org/TR/html5/tokenization.html#markup-declaration-open-state
     |      # See also parse_declaration in _markupbase
     |  
     |  parse_pi(self, i)
     |      # Internal -- parse processing instr, return end or -1 if not terminated
     |  
     |  parse_starttag(self, i)
     |      # Internal -- handle starttag, return end or -1 if not terminated
     |  
     |  reset(self)
     |      Reset this instance.  Loses all unprocessed data.
     |  
     |  set_cdata_mode(self, elem)
     |  
     |  unescape(self, s)
     |      # Internal -- helper to remove special character quoting
     |  
     |  unknown_decl(self, data)
     |  
     |  ----------------------------------------------------------------------
     |  Data and other attributes defined here:
     |  
     |  CDATA_CONTENT_ELEMENTS = ('script', 'style')
     |  
     |  ----------------------------------------------------------------------
     |  Methods inherited from _markupbase.ParserBase:
     |  
     |  getpos(self)
     |      Return current line number and offset.
     |  
     |  parse_comment(self, i, report=1)
     |      # Internal -- parse comment, return length or -1 if not terminated
     |  
     |  parse_declaration(self, i)
     |      # Internal -- parse declaration (for use by subclasses).
     |  
     |  parse_marked_section(self, i, report=1)
     |      # Internal -- parse a marked section
     |      # Override this to handle MS-word extension syntax <![if word]>content<![endif]>
     |  
     |  updatepos(self, i, j)
     |      # Internal -- update line number and offset.  This should be
     |      # called for each piece of data exactly once, in order -- in other
     |      # words the concatenation of all the input strings to this
     |      # function should be exactly the entire input.
     |  
     |  ----------------------------------------------------------------------
     |  Data descriptors inherited from _markupbase.ParserBase:
     |  
     |  __dict__
     |      dictionary for instance variables (if defined)
     |  
     |  __weakref__
     |      list of weak references to the object (if defined)

DATA
    __all__ = ['HTMLParser']

FILE
    c:\python34\lib\html\parser.py



Process finished with exit code 0
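The help text is long; if only the names of the overridable handlers are of interest, they can also be listed programmatically. A small sketch, relying only on the handle_ naming convention visible in the help output above:

```python
from html.parser import HTMLParser

# Every overridable callback in HTMLParser starts with "handle_".
handler_names = sorted(
    name for name in dir(HTMLParser) if name.startswith("handle_")
)
for name in handler_names:
    print(name)
```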


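---------@_@! Combined exercise--------------------------------------------------------------

The previous post, python--Crawler Basics (7): First Look at the urllib Library and a Discussion of Chinese Encoding Issues, covered fetching web pages.

So let's combine the material above with the previous post and practice a little.

----------------------------------------------------------------------------------------

Starting the combined exercise!

Create a new package named spider and add two .py files to it.

(1) HtmlParser.py: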

import html.parser as h

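class MyHTMLParser(h.HTMLParser):

    # True while the parser is inside a <title> element
    a_t = False

    # Handle a start tag, e.g. <xx>
    def handle_starttag(self, tag, attrs):
        print("開始一個標簽:", tag)    # label means "start tag:"

        if tag == "title":
            self.a_t = True

        for attr in attrs:
            print("屬性值：", attr)    # label means "attribute:"; attr is a (name, value) tuple

    # Handle the text between <xx> and </xx>
    def handle_data(self, data):
        if self.a_t:
            print("得到的數據: ", data)    # label means "data:"

    # Handle an end tag, e.g. </xx>; <.../> also ends up here
    # through the default handle_startendtag
    def handle_endtag(self, tag):
        self.a_t = False
        print("結束一個標簽:", tag)    # label means "end tag:"
        print()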

(2) Demo.py:

import urllib.request
import spider.HtmlParser

# Fetch the page and decode the bytes; this assumes the server sends UTF-8
response = urllib.request.urlopen("http://zzk.cnblogs.com/b")
myStr = response.read().decode('UTF-8')
print("-----------網頁源碼-----------------")    # banner means "page source"
print(myStr)
print("-----------開始解析網頁-------------")    # banner means "start parsing the page"
p = spider.HtmlParser.MyHTMLParser()
p.feed(myStr)
p.close()
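Demo.py hard-codes 'UTF-8' when decoding the response. As a hedged sketch, the charset could instead be taken from the Content-Type header. charset_from_content_type is a hypothetical helper, and falling back to utf-8 when the header names no charset is an assumption made for this example.

```python
from email.message import Message


def charset_from_content_type(content_type, default="utf-8"):
    """Return the charset named in a Content-Type value, or the default."""
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() returns None when no charset parameter exists.
    return msg.get_content_charset() or default


print(charset_from_content_type("text/html; charset=GBK"))
print(charset_from_content_type("text/html"))
```

With a real urlopen response, the same lookup should be available as response.headers.get_content_charset(), since the headers object is an email-style message.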

Running Demo.py prints:

C:\Python34\python.exe E:/pythone_workspace/mydemo/spider/Demo.py
-----------網頁源碼-----------------

<!DOCTYPE html>
<html>
<head>
    <meta charset="utf-8"/>
    <title>找找看 - 博客園</title>    
    <link rel="shortcut icon" href="/Content/Images/favicon.ico" type="image/x-icon"/>
    <meta content="技術搜索,IT搜索,程序搜索,代碼搜索,程序員搜索引擎" name="keywords" />
    <meta content="面向程序員的專業搜索引擎。遇到技術問題怎么辦，到博客園找找看..." name="description" />
    <link type="text/css" href="/Content/Style.css" rel="stylesheet" />
    <script src="http://common.cnblogs.com/script/jquery.js" type="text/javascript"></script>
    <script src="/Scripts/Common.js" type="text/javascript"></script>
    <script src="/Scripts/Home.js" type="text/javascript"></script>
</head>
<body>
    <div class="top">
        
        <div class="top_tabs">
            <a href="http://www.cnblogs.com">« 博客園首頁 </a>
        </div>
        <div id="span_userinfo" class="top_links">
        </div>
    </div>
    <div style="clear: both">
    </div>
    <center>
        <div id="main">
            <div class="logo_index">
                <a href="http://zzk.cnblogs.com">
                    <img alt="找找看logo" src="/images/logo.gif" /></a>
            </div>
            <div class="index_sozone">
                <div class="index_tab">
                    <a href="/n" onclick="return  channelSwitch(&#39;n&#39;);">新聞</a>
<a class="tab_selected" href="/b" onclick="return  channelSwitch(&#39;b&#39;);">博客</a>                    <a href="/k" onclick="return  channelSwitch(&#39;k&#39;);">知識庫</a>
                    <a href="/q" onclick="return  channelSwitch(&#39;q&#39;);">博問</a>
                </div>
                <div class="search_block">
                    <div class="index_btn">
                        <input type="button" class="btn_so_index" onclick="Search();" value="&nbsp;找一下&nbsp;" />
                        <span class="help_link"><a target="_blank" href="/help">幫助</a></span>
                    </div>
                    <input type="text" onkeydown="searchEnter(event);" class="input_index" name="w" id="w" />
                </div>
            </div>
        </div>
        <div class="footer">
            &copy;2004-2016 <a href="http://www.cnblogs.com">博客園</a>
        </div>
    </center>
</body>
</html>

-----------開始解析網頁-------------
開始一個標簽: html
開始一個標簽: head
開始一個標簽: meta
屬性值： ('charset', 'utf-8')
結束一個標簽: meta

開始一個標簽: title
得到的數據:  找找看 - 博客園
結束一個標簽: title

開始一個標簽: link
屬性值： ('rel', 'shortcut icon')
屬性值： ('href', '/Content/Images/favicon.ico')
屬性值： ('type', 'image/x-icon')
結束一個標簽: link

開始一個標簽: meta
屬性值： ('content', '技術搜索,IT搜索,程序搜索,代碼搜索,程序員搜索引擎')
屬性值： ('name', 'keywords')
結束一個標簽: meta

開始一個標簽: meta
屬性值： ('content', '面向程序員的專業搜索引擎。遇到技術問題怎么辦，到博客園找找看...')
屬性值： ('name', 'description')
結束一個標簽: meta

開始一個標簽: link
屬性值： ('type', 'text/css')
屬性值： ('href', '/Content/Style.css')
屬性值： ('rel', 'stylesheet')
結束一個標簽: link

開始一個標簽: script
屬性值： ('src', 'http://common.cnblogs.com/script/jquery.js')
屬性值： ('type', 'text/javascript')
結束一個標簽: script

開始一個標簽: script
屬性值： ('src', '/Scripts/Common.js')
屬性值： ('type', 'text/javascript')
結束一個標簽: script

開始一個標簽: script
屬性值： ('src', '/Scripts/Home.js')
屬性值： ('type', 'text/javascript')
結束一個標簽: script

結束一個標簽: head

開始一個標簽: body
開始一個標簽: div
屬性值： ('class', 'top')
開始一個標簽: div
屬性值： ('class', 'top_tabs')
開始一個標簽: a
屬性值： ('href', 'http://www.cnblogs.com')
結束一個標簽: a

結束一個標簽: div

開始一個標簽: div
屬性值： ('id', 'span_userinfo')
屬性值： ('class', 'top_links')
結束一個標簽: div

結束一個標簽: div

開始一個標簽: div
屬性值： ('style', 'clear: both')
結束一個標簽: div

開始一個標簽: center
開始一個標簽: div
屬性值： ('id', 'main')
開始一個標簽: div
屬性值： ('class', 'logo_index')
開始一個標簽: a
屬性值： ('href', 'http://zzk.cnblogs.com')
開始一個標簽: img
屬性值： ('alt', '找找看logo')
屬性值： ('src', '/images/logo.gif')
結束一個標簽: img

結束一個標簽: a

結束一個標簽: div

開始一個標簽: div
屬性值： ('class', 'index_sozone')
開始一個標簽: div
屬性值： ('class', 'index_tab')
開始一個標簽: a
屬性值： ('href', '/n')
屬性值： ('onclick', "return  channelSwitch('n');")
結束一個標簽: a

開始一個標簽: a
屬性值： ('class', 'tab_selected')
屬性值： ('href', '/b')
屬性值： ('onclick', "return  channelSwitch('b');")
結束一個標簽: a

開始一個標簽: a
屬性值： ('href', '/k')
屬性值： ('onclick', "return  channelSwitch('k');")
結束一個標簽: a

開始一個標簽: a
屬性值： ('href', '/q')
屬性值： ('onclick', "return  channelSwitch('q');")
結束一個標簽: a

結束一個標簽: div

開始一個標簽: div
屬性值： ('class', 'search_block')
開始一個標簽: div
屬性值： ('class', 'index_btn')
開始一個標簽: input
屬性值： ('type', 'button')
屬性值： ('class', 'btn_so_index')
屬性值： ('onclick', 'Search();')
屬性值： ('value', '\xa0找一下\xa0')
結束一個標簽: input

開始一個標簽: span
屬性值： ('class', 'help_link')
開始一個標簽: a
屬性值： ('target', '_blank')
屬性值： ('href', '/help')
結束一個標簽: a

結束一個標簽: span

結束一個標簽: div

開始一個標簽: input
屬性值： ('type', 'text')
屬性值： ('onkeydown', 'searchEnter(event);')
屬性值： ('class', 'input_index')
屬性值： ('name', 'w')
屬性值： ('id', 'w')
結束一個標簽: input

結束一個標簽: div

結束一個標簽: div

結束一個標簽: div

開始一個標簽: div
屬性值： ('class', 'footer')
開始一個標簽: a
屬性值： ('href', 'http://www.cnblogs.com')
結束一個標簽: a

結束一個標簽: div

結束一個標簽: center

結束一個標簽: body

結束一個標簽: html


Process finished with exit code 0
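For a crawler, the attribute tuples printed above are mostly useful for finding the next pages to visit. A minimal sketch that collects the href of every `<a>` tag and resolves it against the page URL follows; LinkCollector and collect_links are hypothetical names, and the markup is a shortened copy of the page shown above.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Hypothetical helper: gathers absolute URLs from <a href="..."> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            # value is None for attributes written without a value
            if name == "href" and value:
                self.links.append(urljoin(self.base_url, value))


def collect_links(markup, base_url):
    """Return all link targets in markup, resolved against base_url."""
    parser = LinkCollector(base_url)
    parser.feed(markup)
    parser.close()
    return parser.links


page = '<a href="/n">新聞</a><a href="http://www.cnblogs.com">博客園</a>'
for link in collect_links(page, "http://zzk.cnblogs.com/b"):
    print(link)
```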


(@_@)Y That's all for this post! To be continued~
