Parsing HTML with BeautifulSoup in Python

Reposted article, dated 2015-12-15 17:48. Tags: python, python script, python note, BeautifulSoup, html, parsing

Download: http://www.crummy.com/software/BeautifulSoup/bs4/download/4.3/beautifulsoup4-4.3.2.tar.gz

Note: this version (4.3.2) works best with Python 2.7.

Install: unpack the archive, then run python setup.py install

On Debian/Ubuntu Linux you can also run: sudo apt-get install python-bs4

Or, with pip: pip install beautifulsoup4

Official documentation:

http://www.crummy.com/software/BeautifulSoup/bs4/doc/

(PyQuery is an alternative.)

Usage

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_str, 'html.parser')
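
The snippets in the rest of this post need a concrete html_str to run against, and the post never shows its input. A minimal sketch, assuming the "three sisters" sample document from the official Beautiful Soup documentation (this choice of input is an assumption, picked because the outputs quoted later match it):

```python
from bs4 import BeautifulSoup

# Assumed sample input: the "three sisters" document from the bs4 docs.
html_str = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body></html>
"""

# 'html.parser' ships with Python, so no extra parser package is required.
soup = BeautifulSoup(html_str, 'html.parser')

print(soup.title.string)        # text inside the <title> tag
print(len(soup.find_all('a')))  # number of <a> tags in the document
```

Naming the parser explicitly keeps the result the same on machines that happen to have lxml or html5lib installed.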

Writing the document out

# Open in binary mode because encode() returns bytes.
with open('test.html', 'wb') as f:
    f.write(soup.prettify().encode('utf-8'))

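When you call prettify() or encode(), you can specify the output encoding; the default is UTF-8. (Passing an encoding to __str__ is BeautifulSoup 3 usage; in bs4, __str__ takes no encoding argument, and renderContents is kept only for backward compatibility.) For example, to output the document as ISO-8859-1 instead:

soup.encode("ISO-8859-1")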
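
To make the encoding choice concrete, here is a small sketch that renders the same parsed fragment in two encodings. The input string is an assumption made up for illustration; encode() and prettify() are the bs4 calls being shown.

```python
from bs4 import BeautifulSoup

# Assumed input: a short fragment containing one non-ASCII character (e with acute accent).
soup = BeautifulSoup(u'<p>caf\xe9</p>', 'html.parser')

utf8_bytes = soup.encode('utf-8')            # the accented letter becomes two bytes: C3 A9
latin1_bytes = soup.encode('ISO-8859-1')     # the accented letter becomes one byte: E9
pretty_latin1 = soup.prettify('ISO-8859-1')  # indented output, returned as bytes

print(repr(utf8_bytes))
print(repr(latin1_bytes))
```

Both byte strings decode back to the same text; only the byte representation differs.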

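The four kinds of objects

Beautiful Soup converts a complex HTML document into a tree structure in which every node is a Python object. All of these objects fall into four kinds:

Tag: an HTML tag; its two most important attributes are name and attrs
NavigableString: the text inside a tag
BeautifulSoup: the document as a whole; you can treat it as a Tag object
Comment: a comment in the document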
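
The four kinds can be checked directly with isinstance. A minimal sketch, assuming a one-line fragment invented for illustration:

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, Tag

# Assumed input: one paragraph holding a <b> tag and an HTML comment.
html_str = '<p class="title"><b>The Dormouse\'s story</b><!-- a note --></p>'
soup = BeautifulSoup(html_str, 'html.parser')

tag = soup.b                  # Tag
text = soup.b.string          # NavigableString
comment = soup.p.contents[1]  # Comment (second child of <p>)

print(type(soup).__name__)     # BeautifulSoup
print(type(tag).__name__)      # Tag
print(type(text).__name__)     # NavigableString
print(type(comment).__name__)  # Comment
```

Comment is a subclass of NavigableString, so code that must tell text from comments should test for Comment first.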

Tag:

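print(type(soup.a))
# <class 'bs4.element.Tag'>

print(soup.p.attrs)
# {'class': ['title'], 'name': 'dromouse'}   (for a tag such as <p class="title" name="dromouse">)

# class is a multi-valued attribute, so it comes back as a list
css_soup = BeautifulSoup('<p class="body strikeout"></p>', 'html.parser')
css_soup.p['class']
# ["body", "strikeout"]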
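
Attribute access works like a dictionary lookup on the tag. The sketch below uses an invented <a> element (an assumption for illustration) to show name, attrs, indexing, and the safer get():

```python
from bs4 import BeautifulSoup

# Assumed input: a single link with three attributes.
soup = BeautifulSoup(
    '<a class="sister big" href="http://example.com/elsie" id="link1">Elsie</a>',
    'html.parser')
link = soup.a

name = link.name             # tag name as a string
href = link['href']          # single-valued attribute: a string
classes = link['class']      # multi-valued attribute: a list of strings
missing = link.get('title')  # get() returns None instead of raising KeyError

print(name, href, classes, missing)
```

Indexing with link['title'] would raise KeyError here, so get() is the better fit when an attribute may be absent.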

NavigableString:

print(soup.p.string)
# The Dormouse's story

Some handy operations:

soup.title
# <title>The Dormouse's story</title>

soup.title.name
# u'title'

soup.title.string
# u'The Dormouse's story'

soup.title.parent.name
# u'head'

soup.p
# <p class="title"><b>The Dormouse's story</b></p>

soup.p['class']
# [u'title']  (class is multi-valued, so a list is returned)

soup.a
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

soup.find_all('a')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

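print(soup.find("a", attrs={"class": "sister"}))  # returns only the first match

print(soup.find_all("a", attrs={"class": "sister"}, limit=2))  # at most two matches

import re
# The string argument needs Beautiful Soup 4.4.0 or later; on 4.3.2 use text= instead.
soup.find(string=re.compile("sisters"))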

soup.find(id="link3")
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
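
The difference between find and find_all is easiest to see side by side. A self-contained sketch, assuming a shortened version of the sisters paragraph and a Beautiful Soup release new enough (4.4.0 or later) to accept string=:

```python
import re
from bs4 import BeautifulSoup

# Assumed input: a trimmed-down "three sisters" paragraph.
html_str = (
    '<p class="story">three little sisters: '
    '<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, '
    '<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and '
    '<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a></p>'
)
soup = BeautifulSoup(html_str, 'html.parser')

first = soup.find('a', attrs={'class': 'sister'})             # one Tag, or None
two = soup.find_all('a', attrs={'class': 'sister'}, limit=2)  # list of up to two Tags
by_id = soup.find(id='link3')                                 # keyword filters on an attribute
text_node = soup.find(string=re.compile('sisters'))           # matches text, not a tag
nothing = soup.find('table')                                  # no match gives None

print(first['id'], [a['id'] for a in two], by_id.string)
```

Because find returns None when nothing matches, chained calls such as soup.find(...).find(...) raise AttributeError on a miss and are worth guarding in real scripts.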

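head_tag = soup.head
head_tag.contents
# [<title>The Dormouse's story</title>]

# .children is an iterator, so wrap it in list() to see the items
list(head_tag.children)
# [<title>The Dormouse's story</title>]

title_tag = soup.title
title_tag.parent
# <head><title>The Dormouse's story</title></head>

sibling_soup = BeautifulSoup("<a><b>text1</b><c>text2</c></a>", 'html.parser')
sibling_soup.b.next_sibling
# <c>text2</c>

sibling_soup.c.previous_sibling
# <b>text1</b>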
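
The navigation attributes above can be exercised together on one tiny tree. A sketch, assuming the same three-tag fragment used in the official documentation's sibling example:

```python
from bs4 import BeautifulSoup

# Assumed input: <a> with two children, <b> and <c>.
sibling_soup = BeautifulSoup('<a><b>text1</b><c>text2</c></a>', 'html.parser')
a_tag = sibling_soup.a
b_tag = sibling_soup.b
c_tag = sibling_soup.c

contents = a_tag.contents        # a real list of direct children
children = list(a_tag.children)  # .children is an iterator over the same nodes

print(b_tag.parent.name)         # name of the enclosing tag
print(b_tag.next_sibling)        # the node right after <b>
print(c_tag.previous_sibling)    # the node right before <c>
print(c_tag.next_sibling)        # None: <c> is the last child
```

In real pages next_sibling is often a whitespace string rather than the next tag, because line breaks between tags are nodes too.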

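findAll is the older BeautifulSoup 3 name for find_all; both work in bs4.

Signature (the string argument is called text before Beautiful Soup 4.4.0):

find_all(name, attrs, recursive, string, limit, **kwargs)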
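
A short sketch of how the recursive, limit, and keyword arguments change what find_all returns. The nested markup is an assumption invented for illustration:

```python
from bs4 import BeautifulSoup

# Assumed input: one <p> directly under <div>, one nested deeper.
soup = BeautifulSoup(
    '<div><p id="a">one</p><section><p id="b">two</p></section></div>',
    'html.parser')
div = soup.div

all_p = div.find_all('p')                      # searches all descendants
direct_p = div.find_all('p', recursive=False)  # direct children only
limited = div.find_all('p', limit=1)           # stop after the first match
by_kwarg = div.find_all(id='b')                # extra keywords filter on attributes

print([p['id'] for p in all_p])
print([p['id'] for p in direct_p])
```

find_all always returns a list, so an empty result is [] rather than None.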

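My program:

from bs4 import BeautifulSoup

# Placeholder: the original listing uses TEAMCITY_HOME without defining it.
TEAMCITY_HOME = 'http://teamcity.example.com'

def parse_html(text):
    # from_encoding only takes effect when text is a byte string
    soup = BeautifulSoup(text, 'html.parser', from_encoding="UTF-8")
    # Find the element with id="historyTable", take the first table inside it, and get all its tr rows
    target = soup.find(id="historyTable").find('table').findAll('tr')
    results = []
    for tr in target[1:]:  # skip the header row (th)
        tds = tr.findAll('td')  # all td cells in this row
        build_no = str(tds[1].span.string.strip())  # text of the span in the second td
        patch = str(tds[0].a.string)  # text of the a node in the first td
        status_node = tds[2].find('a')
        status = str(status_node.find('span').string)
        status_link = '%s/%s' % (TEAMCITY_HOME, status_node.attrs['href'])  # read an attribute
        started = str(tds[5].string.replace(u'\xa0', ' '))  # replace non-breaking spaces

        print('-' * 10)
        print('\t'.join([patch, build_no, status, started]))
        results.append((patch, build_no, status, status_link, started))
    return results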
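
The page this program scrapes is not shown in the post, so the following is only a sketch of the same technique on hypothetical markup: the table layout, cell order, and link target are all invented for illustration and will differ from a real TeamCity page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the real build-history page.
html_str = u"""
<div id="historyTable"><table>
<tr><th>Patch</th><th>Build</th><th>Status</th><th>Started</th></tr>
<tr><td><a href="#">p1</a></td><td><span> #12 </span></td>
    <td><a href="viewLog.html?buildId=1"><span>Success</span></a></td>
    <td>15 Dec\xa017:48</td></tr>
</table></div>
"""
soup = BeautifulSoup(html_str, 'html.parser')

# Same chain as the program: element by id, first table inside it, all rows.
rows = soup.find(id='historyTable').find('table').find_all('tr')

records = []
for tr in rows[1:]:                        # skip the header row
    tds = tr.find_all('td')
    patch = tds[0].a.string                # text of the link in the first cell
    build_no = tds[1].span.string.strip()  # strip padding around the build number
    status = tds[2].a.span.string          # text nested two levels down
    href = tds[2].a['href']                # attribute lookup
    started = tds[3].string.replace(u'\xa0', ' ')  # non-breaking space to plain space
    records.append((patch, build_no, status, href, started))

print(records)
```

Collecting tuples and printing once at the end keeps parsing separate from output, which makes the parsing step easier to test.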
