Using the Python module BeautifulSoup


BeautifulSoup is a library dedicated to parsing HTML/XML. Official site: http://www.crummy.com/software/BeautifulSoup/

Note that BS now has a 4.x release. The official site says:

Beautiful Soup 3 has been replaced by Beautiful Soup 4. You may be looking for the Beautiful Soup 4 documentation

Beautiful Soup 3 only works on Python 2.x, but Beautiful Soup 4 also works on Python 3.x. Beautiful Soup 4 is faster, has more features, and works with third-party parsers like lxml and html5lib. You should use Beautiful Soup 4 for all new projects.

On my machine, running

print BeautifulSoup.__version__

shows the version number:

3.2.1

Beautiful Soup 4 works on both Python 2 (2.6+) and Python 3.

Installation is simple: BeautifulSoup 3 is a single file, so all you need to do is copy that file into your working directory.

from BeautifulSoup import BeautifulSoup          # For processing HTML
from BeautifulSoup import BeautifulStoneSoup     # For processing XML
import BeautifulSoup # To get everything

Creating a BeautifulSoup object

A BeautifulSoup object can be created from nothing more than a piece of HTML text.

The following code creates a BeautifulSoup object:

from BeautifulSoup import BeautifulSoup
doc = ['<html><head><title>PythonClub.org</title></head>',
       '<body><p id="firstpara" align="center">This is paragraph <b>one</b> of ptyhonclub.org.',
       '<p id="secondpara" align="blah">This is paragraph <b>two</b> of pythonclub.org.',
       '</html>']
soup = BeautifulSoup(''.join(doc))

Calling
print soup.prettify()
prints:
# <html>
#  <head>
#   <title>
#    PythonClub.org
#   </title>
#  </head>
#  <body>
#   <p id="firstpara" align="center">
#    This is paragraph
#    <b>
#     one
#    </b>
#    of ptyhonclub.org.
#   </p>
#   <p id="secondpara" align="blah">
#    This is paragraph
#    <b>
#     two
#    </b>
#    of pythonclub.org.
#   </p>
#  </body>
# </html>

 

 

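Finding specific elements in HTML

BeautifulSoup lets you reach a specific HTML element directly with ".".

Searching by HTML tag: finding the HTML title

soup.html.head.title returns the title tag, from which you can read the tag's name and its string value.

>>> soup.html.head.title  # note: the result includes the title tag itself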
<title>PythonClub.org</title>
>>> soup.html.head.title.name
u'title'
>>> soup.html.head.title.string
u'PythonClub.org'
>>> 

You can also jump straight to a given HTML element with soup.title:

>>> soup.title
<title>PythonClub.org</title>
>>> 
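
The session above uses Beautiful Soup 3 on Python 2. As a rough sketch of the same dot navigation under Beautiful Soup 4, assuming the bs4 package is installed and using the standard-library "html.parser" backend (the <p> tags are closed explicitly here, which is a change from the document above):

```python
from bs4 import BeautifulSoup  # assumption: Beautiful Soup 4 is installed

# Same content as the PythonClub.org document, with the <p> tags closed.
doc = (
    '<html><head><title>PythonClub.org</title></head>'
    '<body><p id="firstpara" align="center">This is paragraph <b>one</b>.</p>'
    '<p id="secondpara" align="blah">This is paragraph <b>two</b>.</p>'
    '</body></html>'
)
soup = BeautifulSoup(doc, "html.parser")

title = soup.html.head.title   # walk down the tree one tag at a time
print(title)                   # the whole tag, markup included
print(title.name)              # the tag name
print(title.string)            # the text inside the tag
print(soup.title is title)     # the shortcut reaches the same object
```

Naming the parser explicitly keeps the result independent of which optional parsers happen to be installed.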

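Searching by HTML content: finding the whole tag that contains a given string

The following example finds the text nodes containing "para" and, through .parent, the tags that enclose them:

>>> import re
>>> soup.findAll(text=re.compile("para"))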
[u'This is paragraph ', u'This is paragraph ']
>>> soup.findAll(text=re.compile("para"))[0].parent
<p id="firstpara" align="center">This is paragraph <b>one</b> of ptyhonclub.org.</p>
>>> soup.findAll(text=re.compile("para"))[0].parent.contents
[u'This is paragraph ', <b>one</b>, u' of ptyhonclub.org.']
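
For comparison, a minimal sketch of the same text search under Beautiful Soup 4. It assumes bs4 4.4 or later, where the keyword is spelled string= (the older text= spelling is the one used above); the HTML snippet is made up for illustration.

```python
import re
from bs4 import BeautifulSoup  # assumption: Beautiful Soup 4.4+ is installed

html = (
    '<p id="firstpara">This is paragraph <b>one</b> of the page.</p>'
    '<p id="secondpara">This is paragraph <b>two</b> of the page.</p>'
)
soup = BeautifulSoup(html, "html.parser")

# Each match is a text node, not a tag.
matches = soup.find_all(string=re.compile("para"))

# .parent climbs from the text node to the tag that encloses it.
first_parent = matches[0].parent
print(first_parent["id"])
print(first_parent.contents)   # text, <b> tag, text
```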

The basic method: findAll

findAll(name=None, attrs={}, recursive=True, text=None, limit=None, **kwargs)

These arguments show up over and over again throughout the Beautiful Soup API. The most important arguments are name and the keyword arguments.

  1. The simplest usage is to just pass in a tag name. This code finds all the <B> Tags in the document:

    soup.findAll('b')
    #[<b>one</b>, <b>two</b>]
    
  2. You can also pass in a regular expression. This code finds all the tags whose names start with B:

    import re
    tagsStartingWithB = soup.findAll(re.compile('^b'))
    [tag.name for tag in tagsStartingWithB]
    #[u'body', u'b', u'b']
    
  3. You can pass in a list or a dictionary. These two calls find all the <TITLE> and all the <P> tags. They work the same way, but the second call runs faster:

    soup.findAll(['title', 'p'])
    #[<title>PythonClub.org</title>, 
    # <p id="firstpara" align="center">This is paragraph <b>one</b> of ptyhonclub.org.</p>, 
    # <p id="secondpara" align="blah">This is paragraph <b>two</b> of pythonclub.org.</p>]
    
    soup.findAll({'title' : True, 'p' : True})
    #[<title>PythonClub.org</title>, 
    # <p id="firstpara" align="center">This is paragraph <b>one</b> of ptyhonclub.org.</p>, 
    # <p id="secondpara" align="blah">This is paragraph <b>two</b> of pythonclub.org.</p>]
 

The keyword arguments impose restrictions on the attributes of a tag. This simple example finds all the tags which have a value of "center" for their "align" attribute:

soup.findAll(align="center")
#[<p id="firstpara" align="center">This is paragraph <b>one</b> of ptyhonclub.org.</p>]
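
The name and keyword forms above carry over to Beautiful Soup 4, where findAll is spelled find_all. The following is only a sketch under that assumption, run against a closed-tag version of the PythonClub.org document with the "html.parser" backend:

```python
import re
from bs4 import BeautifulSoup  # assumption: Beautiful Soup 4 is installed

doc = (
    '<html><head><title>PythonClub.org</title></head>'
    '<body><p id="firstpara" align="center">This is paragraph <b>one</b>.</p>'
    '<p id="secondpara" align="blah">This is paragraph <b>two</b>.</p>'
    '</body></html>'
)
soup = BeautifulSoup(doc, "html.parser")

bold = soup.find_all("b")                                   # by tag name
b_names = [t.name for t in soup.find_all(re.compile("^b"))]  # names starting with "b"
title_and_p = soup.find_all(["title", "p"])                 # any name in the list
centered = soup.find_all(align="center")                    # by attribute value

print(bold)
print(b_names)
print([t.name for t in title_and_p])
print(centered)
```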

Searching by CSS class

The attrs argument would be a pretty obscure feature were it not for one thing: CSS. It's very useful to search for a tag that has a certain CSS class, but the name of the CSS attribute, class, is also a Python reserved word.

You could search by CSS class with soup.find("tagName", { "class" : "cssClass" }), but that's a lot of code for such a common operation. Instead, you can pass a string for attrs instead of a dictionary. The string will be used to restrict the CSS class.

from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup("""Bob's <b>Bold</b> Barbeque Sauce now available in 
                        <b class="hickory">Hickory</b> and <b class="lime">Lime</a>""")

soup.find("b", { "class" : "lime" })
#<b class="lime">Lime</b>

soup.find("b", "hickory")
#<b class="hickory">Hickory</b>
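
A sketch of the same class lookups under Beautiful Soup 4, which also accepts the class_ keyword. It assumes bs4 is installed; the last tag is closed with </b> here rather than the </a> in the snippet above, so that the result does not depend on how a parser repairs the markup.

```python
from bs4 import BeautifulSoup  # assumption: Beautiful Soup 4 is installed

soup = BeautifulSoup(
    "Bob's <b>Bold</b> Barbeque Sauce now available in "
    '<b class="hickory">Hickory</b> and <b class="lime">Lime</b>',
    "html.parser",
)

lime = soup.find("b", {"class": "lime"})      # explicit attrs dictionary
hickory = soup.find("b", "hickory")           # a string for attrs means "CSS class"
hickory_kw = soup.find("b", class_="hickory")  # keyword form with a trailing underscore

print(lime)
print(hickory)
print(hickory is hickory_kw)   # both spellings find the same tag
```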
 

Searching HTML content by attribute

These calls use the soup built from the PythonClub.org document at the start.

soup.findAll(id=re.compile("para$"))
# [<p id="firstpara" align="center">This is paragraph <b>one</b> of ptyhonclub.org.</p>,
#  <p id="secondpara" align="blah">This is paragraph <b>two</b> of pythonclub.org.</p>]
 
soup.findAll(attrs={'id' : re.compile("para$")})
# [<p id="firstpara" align="center">This is paragraph <b>one</b> of ptyhonclub.org.</p>,
#  <p id="secondpara" align="blah">This is paragraph <b>two</b> of pythonclub.org.</p>]

Understanding BeautifulSoup in depth

Reposted from: http://www.pythonclub.org/modules/beautifulsoup/start

http://www.crummy.com/software/BeautifulSoup/bs4/doc/

Another article

------------------------------------

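Soup ingredients: the objects in a Soup

Tag

A Tag corresponds to an HTML element, that is, a pair of HTML tags together with everything they enclose (nested tags and text included). For example: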

soup = BeautifulSoup('<b class="boldest">Extremely bold</b>')
tag = soup.b

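soup.b is a tag, and soup itself can also be treated as a tag; the whole HTML document is simply tags nested inside tags.

Name

The name corresponds to the name in the HTML tag (the first item inside the angle brackets). Every tag has a name, accessed through .name. In the example above: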

tag.name == u'b'
soup.name == u'[document]'

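Attributes

Attributes correspond to the attribute part of an HTML tag (the items with an equals sign inside the angle brackets). A tag can have many attributes or none at all. Attributes are accessed dictionary-style, with square brackets around the attribute name. In the example above (Beautiful Soup 4 treats class as a multi-valued attribute, so its value is a list):

tag['class'] == [u'boldest']

.attrs returns the dictionary itself, for example:

tag.attrs == {u'class': [u'boldest']}

Text

Text corresponds to the text in the HTML (the part outside the angle brackets). Text is accessed through .text. In the example above:

tag.text == u'Extremely bold'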

The difference between string and text
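
A small sketch of the distinction, assuming Beautiful Soup 4 with the "html.parser" backend (the HTML snippet is made up for illustration): .string is only set when a tag has exactly one child, while .text concatenates every piece of text underneath the tag.

```python
from bs4 import BeautifulSoup  # assumption: Beautiful Soup 4 is installed

soup = BeautifulSoup('<p class="intro">Hello <b>bold</b> world</p>', "html.parser")
p = soup.p

mixed_string = p.string    # None: <p> has three children (text, <b>, text)
mixed_text = p.text        # all text below <p>, joined together
inner_string = p.b.string  # <b> has a single text child, so .string is that text

print(repr(mixed_string))
print(repr(mixed_text))
print(repr(inner_string))
```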


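Finding ingredients: searching in a Soup

You usually parse HTML in order to find the parts you care about and extract them. BeautifulSoup provides the find and find_all methods for this: find returns only the first matching tag, while find_all returns a list. Because searching is so common, BeautifulSoup offers some convenient shorthand:

tag.find_all("a")  # equivalent to tag("a"); find_all is the 4.0 function name
tag.find("a")  # equivalent to tag.a

When nothing matches, find_all returns an empty list and find returns None rather than raising an exception, so there is no need to worry that tag("a") or tag.a will fail because nothing was found. Because of Python's rules for identifiers, the tag.a form can only search by name, since only an identifier can follow the dot; the call forms tag() and tag.find() support all of the search styles below.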

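Searches can take several forms: string, list, key-value (dictionary), regular expression, function

  • String: a string matches tag names, for example tag.a or tag("a")

  • List: search with a list of strings to get the tags whose name matches any of them, for example tag(["h2", "p"])

  • Key-value: use tag(key=value) to search by tag attribute. Key-value search has quite a few small tricks; here are some:

    1. class
      class is a Python reserved word and cannot be used as an identifier, yet class=XXX is everywhere in HTML. BeautifulSoup's answer is a trailing underscore: use class_ instead, as in tag(class_=XXX)
    2. True
      When the value is True, every tag that has the key is matched, as in tag(href=True)
    3. text
      With text as the key, the search goes by the text inside the tags, as in tag(text=something)
  • Regular expression: for example tag(href=re.compile("elsie"))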

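  • Function: when none of the above works, a function is the last resort. Write a function that takes a single tag as its argument and pass it to find or find_all. For example:

    def fun(tag):
        return tag.has_attr("class") and not tag.has_attr("id")
    tag(fun) # returns all tags that have a class attribute but no id attribute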
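
The search styles listed above can be tried side by side. This is a sketch assuming Beautiful Soup 4 with the "html.parser" backend; the HTML, the example.com URLs, and the helper name has_class_but_no_id are made up for illustration.

```python
import re
from bs4 import BeautifulSoup  # assumption: Beautiful Soup 4 is installed

html = (
    '<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>'
    '<a class="sister" href="http://example.com/lacie">Lacie</a>'
    '<span id="note">no link here</span>'
)
soup = BeautifulSoup(html, "html.parser")

by_class = soup(class_="sister")         # class_ stands in for the reserved word class
with_href = soup(href=True)              # any tag that has an href attribute
elsie = soup(href=re.compile("elsie"))   # regular expression against the attribute value

def has_class_but_no_id(tag):
    # Hypothetical helper: keep tags with a class attribute and no id attribute.
    return tag.has_attr("class") and not tag.has_attr("id")

class_no_id = soup(has_class_but_no_id)  # function filter
missing = soup.find("table")             # nothing matches: None, no exception

print([a.get_text() for a in by_class])
print([t.get_text() for t in class_no_id])
print(missing, soup("table"))
```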
    

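Another bowl: searching by document structure

HTML parses into a tree of tags, so you can also search by how tags relate to each other in the tree.

  • Search upward through ancestors: find_parents() and find_parent()

  • Search the following siblings: find_next_siblings() and find_next_sibling()

  • Search the preceding siblings: find_previous_siblings() and find_previous_sibling()

The four sibling methods above only look at siblings under the same parent.

  • Search downward through descendants: this is exactly what find and find_all, described above, do

  • Search the following nodes (ignoring parent, child, and sibling relationships): find_all_next() and find_next()

  • Search the preceding nodes (ignoring parent, child, and sibling relationships): find_all_previous() and find_previous()

All of these take the same arguments as find and can be combined freely.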
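
A short sketch of these structural searches, assuming Beautiful Soup 4 with the "html.parser" backend. The HTML is invented for illustration; note how the sibling search stops at the parent's boundary while find_all_next keeps going through the rest of the document.

```python
from bs4 import BeautifulSoup  # assumption: Beautiful Soup 4 is installed

html = (
    '<div id="box"><h2>Title</h2><p id="a">first</p><p id="b">second</p></div>'
    '<p id="c">outside</p>'
)
soup = BeautifulSoup(html, "html.parser")
a = soup.find(id="a")

parent = a.find_parent("div")                               # nearest enclosing <div>
next_sib = a.find_next_sibling("p")                         # next <p> under the same parent
prev_sib = a.find_previous_sibling("h2")                    # earlier <h2> under the same parent
later_sibs = [p["id"] for p in a.find_next_siblings("p")]   # siblings only
later_all = [p["id"] for p in a.find_all_next("p")]         # anything later in the document

print(parent["id"], next_sib["id"], prev_sib.name)
print(later_sibs)
print(later_all)
```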


Choosing soup by color: searching by CSS

Use the .select() method; see http://www.crummy.com/software/BeautifulSoup/bs4/doc/#css-selectors
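
A minimal sketch of .select(), assuming Beautiful Soup 4 is installed together with its CSS selector dependency (recent bs4 releases delegate selectors to the soupsieve package, which pip normally installs alongside it). The HTML is made up for illustration.

```python
from bs4 import BeautifulSoup  # assumption: Beautiful Soup 4 with selector support

html = (
    '<div id="menu">'
    '<p class="soup hot">tomato</p>'
    '<p class="soup">miso</p>'
    '<p class="salad">caesar</p>'
    '</div>'
)
soup = BeautifulSoup(html, "html.parser")

soups = [p.get_text() for p in soup.select("p.soup")]  # every <p> with class "soup"
hot = soup.select_one("#menu > p.hot")                 # first match only, or None

print(soups)
print(hot.get_text())
```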

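A few tricks

  • BeautifulSoup supports several parsers, such as lxml, html5lib, and html.parser. For example: BeautifulSoup("<a></b>", "html.parser")

For how they differ, see http://www.crummy.com/software/BeautifulSoup/bs4/doc/#differences-between-parsers

  • Before parsing, BeautifulSoup converts the text to Unicode. You can name the encoding with from_encoding, for example: BeautifulSoup(markup, from_encoding="iso-8859-8")

  • soup.prettify() outputs nicely laid-out HTML text. For Chinese text you can name an encoding so that it displays correctly, for example soup.prettify("gbk")

  • If you still have encoding problems, see: http://www.crummy.com/software/BeautifulSoup/bs4/doc/#unicode-dammit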
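
The two encoding tricks can be sketched together, assuming Beautiful Soup 4 with the "html.parser" backend. The byte string is a short Hebrew word encoded as ISO-8859-8, chosen only to match the from_encoding example above.

```python
from bs4 import BeautifulSoup  # assumption: Beautiful Soup 4 is installed

markup = b"<h1>\xed\xe5\xec\xf9</h1>"   # bytes in ISO-8859-8

# Tell Beautiful Soup which encoding the bytes are in instead of letting it guess.
soup = BeautifulSoup(markup, "html.parser", from_encoding="iso-8859-8")
heading = soup.h1.string                # now a Unicode string

# Passing an encoding to prettify() returns bytes in that encoding.
pretty_bytes = soup.prettify("utf-8")

print(heading)
print(type(pretty_bytes))
```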

Reposted from: http://cndenis.iteye.com/blog/1746706

 

Two important attributes of a soup:

 

.contents and .children

A tag’s children are available in a list called .contents:

head_tag = soup.head
head_tag
# <head><title>The Dormouse's story</title></head>

head_tag.contents
# [<title>The Dormouse's story</title>]

type(head_tag.contents[0])
# <class 'BeautifulSoup.Tag'>  (the items in .contents are not plain strings but the library's own types)

title_tag = head_tag.contents[0]
title_tag
# <title>The Dormouse's story</title>
title_tag.contents
# [u"The Dormouse's story"]

The BeautifulSoup object itself has children. In this case, the <html> tag is the child of the BeautifulSoup object:

len(soup.contents)
# 1
soup.contents[0].name
# u'html'

A string does not have .contents, because it can’t contain anything:

text = title_tag.contents[0]
text.contents
# AttributeError: 'NavigableString' object has no attribute 'contents'
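In Beautiful Soup 3, if a tag contains child tags, its .string is None, whether or not any text comes before the child tag. (Beautiful Soup 4 differs: when a tag's only child is another tag, .string returns that child's .string.)

soup=BeautifulSoup("<head><title>The Dormouse's story</title></head>")
head=soup.head
 
print head.string

Under Beautiful Soup 3 this prints None, which illustrates the point.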

Instead of getting them as a list, you can iterate over a tag’s children using the .children generator:

for child in title_tag.children:
    print(child)
# The Dormouse's story
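
A sketch of .contents and .children side by side under Beautiful Soup 4, assuming bs4 is installed and using the "html.parser" backend. It also shows the Beautiful Soup 4 behavior of .string on a tag whose only child is another tag.

```python
from bs4 import BeautifulSoup  # assumption: Beautiful Soup 4 is installed

soup = BeautifulSoup("<head><title>The Dormouse's story</title></head>", "html.parser")
head_tag = soup.head

as_list = head_tag.contents               # a real list: index it, take len() of it
title_tag = as_list[0]
from_iterator = list(head_tag.children)   # .children yields the same items lazily

# In Beautiful Soup 4, a tag whose only child is a tag borrows that child's .string.
head_string = head_tag.string

print(as_list)
print(from_iterator == as_list)
print(head_string)
```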



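A function that recursively collects text:
def gettextonly(self, soup):
    # .string is set only when the node holds a single string
    v = soup.string
    if v is None:
        # otherwise recurse into each child and join the pieces with newlines
        c = soup.contents
        resulttext = ''
        for t in c:
            subtext = self.gettextonly(t)
            resulttext += subtext + '\n'
        return resulttext
    else:
        return v.strip()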
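
The method above belongs to a class that is not shown. As a standalone variant for Beautiful Soup 4 (a sketch, not the original author's code; the name get_text_only and the sample HTML are made up), the recursion can test for text nodes directly and skip empty pieces. The built-in get_text() is printed next to it for comparison.

```python
from bs4 import BeautifulSoup              # assumption: Beautiful Soup 4 is installed
from bs4.element import NavigableString

def get_text_only(node):
    """Collect the text under node, one stripped piece per line."""
    if isinstance(node, NavigableString):
        return node.strip()                # a text node: nothing to recurse into
    parts = [get_text_only(child) for child in node.contents]
    return "\n".join(part for part in parts if part)

soup = BeautifulSoup(
    "<div><h1>Title</h1><p>Hello <b>bold</b> world</p></div>", "html.parser"
)
collected = get_text_only(soup)
print(collected)
print(soup.get_text())   # built-in alternative: concatenates without separators
```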

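A function that splits a string into words:

def separatewords(self, text):
    # split on runs of non-word characters and lowercase what remains
    splitter = re.compile(r'\W+')
    return [s.lower() for s in splitter.split(text) if s != '']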
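
The same splitter as a plain function with a usage example, using only the standard library. The name separate_words and the sample sentence are made up for illustration.

```python
import re

# Compile once; \W+ matches a run of characters that are not letters, digits, or "_".
_SPLITTER = re.compile(r"\W+")

def separate_words(text):
    """Split text on non-word characters and return the lowercased words."""
    return [word.lower() for word in _SPLITTER.split(text) if word]

words = separate_words("Beautiful Soup 4, works on Python-3!")
print(words)
```

The `if word` filter drops the empty strings that split() produces when the text starts or ends with punctuation.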
    

 

 

 

