BeautifulSoup is a library dedicated to parsing HTML and XML. Official site: http://www.crummy.com/software/BeautifulSoup/
Note that Beautiful Soup now has a 4.x release. The official site says:
Beautiful Soup 3 has been replaced by Beautiful Soup 4. You may be looking for the Beautiful Soup 4 documentation
Beautiful Soup 3 only works on Python 2.x, but Beautiful Soup 4 also works on Python 3.x. Beautiful Soup 4 is faster, has more features, and works with third-party parsers like lxml and html5lib. You should use Beautiful Soup 4 for all new projects.
On my machine, running
import BeautifulSoup; print BeautifulSoup.__version__
shows the version number:
3.2.1
Beautiful Soup 4 works on both Python 2 (2.6+) and Python 3.
Installation is simple: Beautiful Soup 3 is a single file, BeautifulSoup.py, so copying that file into your working directory is enough.
from BeautifulSoup import BeautifulSoup       # For processing HTML
from BeautifulSoup import BeautifulStoneSoup  # For processing XML
import BeautifulSoup                          # To get everything
Creating a BeautifulSoup object
A BeautifulSoup object can be created from nothing more than a string of HTML.
The following code creates one:
from BeautifulSoup import BeautifulSoup
doc = ['<html><head><title>PythonClub.org</title></head>',
       '<body><p id="firstpara" align="center">This is paragraph <b>one</b> of ptyhonclub.org.',
       '<p id="secondpara" align="blah">This is paragraph <b>two</b> of pythonclub.org.',
       '</html>']
soup = BeautifulSoup(''.join(doc))
Calling
print soup.prettify()
prints the parsed document as an indented tree:
# <html>
#  <head>
#   <title>
#    PythonClub.org
#   </title>
#  </head>
#  <body>
#   <p id="firstpara" align="center">
#    This is paragraph
#    <b>
#     one
#    </b>
#    of ptyhonclub.org.
#   </p>
#   <p id="secondpara" align="blah">
#    This is paragraph
#    <b>
#     two
#    </b>
#    of pythonclub.org.
#   </p>
#  </body>
# </html>
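For readers on Beautiful Soup 4 and Python 3, the same steps might look like the sketch below. It assumes the bs4 package is installed and names the standard-library html.parser explicitly. The paragraphs are closed by hand here, because parsers differ in how they repair unclosed tags.

```python
from bs4 import BeautifulSoup

doc = [
    '<html><head><title>PythonClub.org</title></head>',
    '<body><p id="firstpara" align="center">This is paragraph <b>one</b> of pythonclub.org.</p>',
    '<p id="secondpara" align="blah">This is paragraph <b>two</b> of pythonclub.org.</p>',
    '</body></html>',
]

# Name the parser explicitly so the result does not depend on what is installed.
soup = BeautifulSoup(''.join(doc), 'html.parser')

pretty = soup.prettify()  # a str with the document laid out as an indented tree
print(pretty)
```

Without an encoding argument, prettify() returns text rather than bytes, so it can be printed directly.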
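Finding specific elements in the HTML
BeautifulSoup lets you reach a specific HTML element directly with "." attribute access.
Searching by HTML tag: finding the title
soup.html.head.title returns the title tag, from which you can read its name and its string value.
>>> soup.html.head.title    # note: the result includes the title tag itself
<title>PythonClub.org</title>
>>> soup.html.head.title.name
u'title'
>>> soup.html.head.title.string
u'PythonClub.org'
You can also go straight to the element with soup.title:
>>> soup.title
<title>PythonClub.org</title>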
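As a rough Python 3 illustration of dotted access, assuming Beautiful Soup 4 (bs4) and a small document made up for this sketch:

```python
from bs4 import BeautifulSoup

html = '<html><head><title>PythonClub.org</title></head><body><p>Hello</p></body></html>'
soup = BeautifulSoup(html, 'html.parser')

title = soup.html.head.title   # walk down the tree one tag name at a time
print(title)                   # the whole element, tags included
print(title.name)              # the tag name
print(title.string)            # the text inside the tag

shortcut = soup.title          # first <title> found anywhere in the document
missing = soup.table           # there is no <table> here, so this is None
```

Dotted access always yields the first matching tag, which is why soup.title and soup.html.head.title refer to the same element here.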
Searching by content: finding the tags that contain a particular string
The following example finds the text nodes that contain "para" and then, through .parent, the tag that encloses them:
>>> import re
>>> soup.findAll(text=re.compile("para"))
[u'This is paragraph ', u'This is paragraph ']
>>> soup.findAll(text=re.compile("para"))[0].parent
<p id="firstpara" align="center">This is paragraph <b>one</b> of ptyhonclub.org.</p>
>>> soup.findAll(text=re.compile("para"))[0].parent.contents
[u'This is paragraph ', <b>one</b>, u' of ptyhonclub.org.']
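A comparable text search in Beautiful Soup 4 could be sketched as follows. The use of bs4, the string= keyword (recent releases; older ones spell it text=), and the well-formed sample markup are assumptions of this sketch, not part of the original session.

```python
import re
from bs4 import BeautifulSoup

html = ('<html><body>'
        '<p id="firstpara">This is paragraph <b>one</b> of pythonclub.org.</p>'
        '<p id="secondpara">This is paragraph <b>two</b> of pythonclub.org.</p>'
        '</body></html>')
soup = BeautifulSoup(html, 'html.parser')

# string= matches text nodes, not tags.
matches = soup.find_all(string=re.compile('para'))

# Step from each matching text node up to the tag that encloses it.
parent_ids = [text.parent['id'] for text in matches]
print(matches, parent_ids)
```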
The basic method: findAll
findAll(name=None, attrs={}, recursive=True, text=None, limit=None, **kwargs)
These arguments show up over and over again throughout the Beautiful Soup API. The most important arguments are name and the keyword arguments.
- The simplest usage is to just pass in a tag name. This code finds all the <B> Tags in the document:
  soup.findAll('b')
  # [<b>one</b>, <b>two</b>]
- You can also pass in a regular expression. This code finds all the tags whose names start with B:
  import re
  tagsStartingWithB = soup.findAll(re.compile('^b'))
  [tag.name for tag in tagsStartingWithB]
  # [u'body', u'b', u'b']
- You can pass in a list or a dictionary. These two calls find all the <TITLE> and all the <P> tags. They work the same way, but the second call runs faster:
  soup.findAll(['title', 'p'])
  # [<title>PythonClub.org</title>,
  #  <p id="firstpara" align="center">This is paragraph <b>one</b> of ptyhonclub.org.</p>,
  #  <p id="secondpara" align="blah">This is paragraph <b>two</b> of pythonclub.org.</p>]
  soup.findAll({'title' : True, 'p' : True})
  # [<title>PythonClub.org</title>,
  #  <p id="firstpara" align="center">This is paragraph <b>one</b> of ptyhonclub.org.</p>,
  #  <p id="secondpara" align="blah">This is paragraph <b>two</b> of pythonclub.org.</p>]
The keyword arguments impose restrictions on the attributes of a tag. This simple example finds all the tags which have a value of "center" for their "align" attribute:
soup.findAll(align="center")
# [<p id="firstpara" align="center">This is paragraph <b>one</b> of ptyhonclub.org.</p>]
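In Beautiful Soup 4 the method is spelled find_all. The sketch below assumes bs4 on Python 3 and a well-formed copy of the sample document, and tries each kind of argument described above.

```python
import re
from bs4 import BeautifulSoup

html = ('<html><head><title>PythonClub.org</title></head><body>'
        '<p id="firstpara" align="center">This is paragraph <b>one</b>.</p>'
        '<p id="secondpara" align="blah">This is paragraph <b>two</b>.</p>'
        '</body></html>')
soup = BeautifulSoup(html, 'html.parser')

bold = soup.find_all('b')                                    # by tag name
b_names = [t.name for t in soup.find_all(re.compile('^b'))]  # by regex on the name
title_or_p = soup.find_all(['title', 'p'])                   # by list of names
centered = soup.find_all(align='center')                     # by attribute keyword
first_only = soup.find_all('p', limit=1)                     # stop after one match
```

Results come back in document order, so the regex search lists body before the two b tags.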
Searching by CSS class
The attrs argument would be a pretty obscure feature were it not for one thing: CSS. It's very useful to search for a tag that has a certain CSS class, but the name of the CSS attribute, class, is also a Python reserved word.
You could search by CSS class with soup.find("tagName", { "class" : "cssClass" }), but that's a lot of code for such a common operation. Instead, you can pass a string for attrs instead of a dictionary. The string will be used to restrict the CSS class.
from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup("""Bob's <b>Bold</b> Barbeque Sauce now available in
                        <b class="hickory">Hickory</b> and <b class="lime">Lime</b>""")
soup.find("b", { "class" : "lime" })
# <b class="lime">Lime</b>
soup.find("b", "hickory")
# <b class="hickory">Hickory</b>
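The same class lookups can be sketched for Beautiful Soup 4, which also accepts a class_ keyword. The bs4 import and the extra "smoky" class, added here only to show matching against one of several classes, are assumptions of this example.

```python
from bs4 import BeautifulSoup

markup = ("Bob's <b>Bold</b> Barbeque Sauce now available in "
          '<b class="hickory">Hickory</b> and <b class="lime smoky">Lime</b>')
soup = BeautifulSoup(markup, 'html.parser')

by_dict = soup.find('b', {'class': 'lime'})   # dictionary form
by_string = soup.find('b', 'hickory')         # a bare string is treated as a CSS class
by_keyword = soup.find('b', class_='lime')    # keyword form; matches any one of the tag's classes
```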
Searching HTML by attribute value
# soup here is the two-paragraph PythonClub.org document created at the start
soup.findAll(id=re.compile("para$"))
# [<p id="firstpara" align="center">This is paragraph <b>one</b> of ptyhonclub.org.</p>,
#  <p id="secondpara" align="blah">This is paragraph <b>two</b> of pythonclub.org.</p>]
soup.findAll(attrs={'id' : re.compile("para$")})
# [<p id="firstpara" align="center">This is paragraph <b>one</b> of ptyhonclub.org.</p>,
#  <p id="secondpara" align="blah">This is paragraph <b>two</b> of pythonclub.org.</p>]
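The attrs dictionary matters most when an attribute name cannot be written as a Python keyword. A minimal sketch, assuming Beautiful Soup 4 and a made-up document with a hypothetical data-role attribute:

```python
import re
from bs4 import BeautifulSoup

html = ('<p id="firstpara" data-role="intro">one</p>'
        '<p id="secondpara">two</p>'
        '<p id="footer">three</p>')
soup = BeautifulSoup(html, 'html.parser')

by_keyword = soup.find_all(id=re.compile('para$'))
by_attrs = soup.find_all(attrs={'id': re.compile('para$')})

# data-role is not a valid Python identifier, so only the attrs form can express it.
by_data = soup.find_all(attrs={'data-role': 'intro'})
```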
Understanding BeautifulSoup in depth
Reposted from: http://www.pythonclub.org/modules/beautifulsoup/start
http://www.crummy.com/software/BeautifulSoup/bs4/doc/
Another article
------------------------------------
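Ingredients: the objects in a Soup
Tag
A tag corresponds to an HTML element, that is, a pair of HTML tags together with everything they enclose (nested tags and text). For example: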
soup = BeautifulSoup('<b class="boldest">Extremely bold</b>')
tag = soup.b
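soup.b is a tag, and soup itself can also be treated as a tag; an entire HTML document is just tags nested inside tags, layer by layer.
Name
The name corresponds to the tag name in the HTML (the first item inside the angle brackets). Every tag has a name, which is accessed with .name. In the example above:
tag.name == u'b'
soup.name == u'[document]'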
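Attributes
Attributes correspond to the attribute part of an HTML tag (the items with an equals sign inside the angle brackets). A tag can have many attributes or none at all. Attributes are accessed dictionary-style, with the attribute name in square brackets. Because class can hold several values, Beautiful Soup 4 returns it as a list. In the example above:
tag['class'] == [u'boldest']
The dictionary itself is available as .attrs, for example:
tag.attrs == {u'class': [u'boldest']}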
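The name and attribute accessors can be tried out with a sketch like this one. It assumes Beautiful Soup 4 on Python 3; the id="x1" attribute is added here only to contrast a single-valued attribute with class.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<b class="boldest" id="x1">Extremely bold</b>', 'html.parser')
tag = soup.b

print(tag.name)         # name of the tag
print(soup.name)        # the soup object reports a special document name
print(tag['id'])        # single-valued attribute: a plain string
print(tag['class'])     # multi-valued attribute: a list of strings
print(tag.get('href'))  # .get() gives None for a missing attribute instead of raising
print(tag.attrs)        # the whole attribute dictionary
```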
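Text
Text corresponds to the text in the HTML (the parts outside the angle brackets). Text is accessed with .text. In the example above:
tag.text == u'Extremely bold'
Difference between string and text: .string returns a tag's single string child and is None when the tag has several children, whereas .text concatenates all the text found anywhere inside the tag.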
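A small sketch of how .string and .text differ, assuming Beautiful Soup 4 and markup invented for the purpose:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p>This is <b>bold</b> text</p><i><u>nested</u></i>', 'html.parser')

p = soup.p
print(p.string)       # <p> has three children, so there is no single string
print(p.text)         # all text inside <p>, joined together
print(p.b.string)     # the one string inside <b>
print(soup.i.string)  # <i> has a single child tag, so it borrows that tag's string
```

When a tag may contain nested markup, .text (or get_text()) is usually the safer accessor.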
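Finding ingredients: searching in a Soup
An HTML document is usually parsed in order to find the interesting parts and extract them. BeautifulSoup provides the find and find_all methods for searching. find returns only the first matching tag, while find_all returns a list. Because searching is so common, BeautifulSoup offers some convenient shorthand:
tag.find_all("a")  # equivalent to tag("a"); find_all is the 4.0 name of this function
tag.find("a")      # equivalent to tag.a
When nothing matches, find_all returns an empty list and find returns None; neither raises an exception, so there is no need to worry that tag("a") or tag.a will fail just because nothing was found. Because of Python's rules for identifiers, the tag.a form can only search by name, since only an identifier may follow the dot. The call forms tag() and tag.find() support all of the search methods described below.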
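The shorthand forms and the "nothing found" behaviour can be checked with a sketch like this, assuming Beautiful Soup 4 and a tiny made-up document:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<div><a href="/x">x</a><a href="/y">y</a></div>', 'html.parser')
div = soup.div

all_links = div.find_all('a')      # every match, in document order
same_links = div('a')              # calling the tag is shorthand for find_all
first_link = div.find('a')         # first match only
same_first = div.a                 # attribute access is shorthand for find

no_tables = div.find_all('table')  # nothing matches: an empty result
no_table = div.find('table')       # nothing matches: None
also_none = div.table              # the shorthand returns None too
```

Because a miss yields None, chaining such as div.table.tr would still fail with an AttributeError, so check for None before going deeper.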
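Searches can use several kinds of filter: strings, lists, key-value pairs (keyword arguments), regular expressions, and functions.
- String: a string matches the tag name, for example tag.a or tag("a").
- List: search with a list of strings to get the tags whose name matches any one of them, for example tag(["h2", "p"]).
- Key-value: use the form tag(key=value) to search by tag attribute. Key-value searches have quite a few small tricks; here are some:
  - class: class is a Python reserved word and cannot be used as an argument name, yet class=XXX is everywhere in HTML. BeautifulSoup's solution is to append an underscore and use class_ instead, as in tag(class_=XXX).
  - True: a value of True matches every tag that has the attribute, as in tag(href=True).
  - text: used as a key, text searches by the text inside tags, as in tag(text=something).
- Regular expression: for example tag(href=re.compile("elsie")).
- Function: when none of the above is enough, a function is the last resort. Write a function that takes a single tag as its argument and pass it to find or find_all, for example:
  def fun(tag):
      return tag.has_attr("class") and not tag.has_attr("id")
  tag(fun)  # returns every tag that has a class attribute but no id attribute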
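The filter types listed above can be exercised side by side. This is only a sketch: it assumes Beautiful Soup 4, uses the string= spelling of the text filter, and the markup and the helper name class_but_no_id are invented for illustration.

```python
import re
from bs4 import BeautifulSoup

html = ('<h2 id="top">Title</h2>'
        '<p class="intro">Intro <a href="http://example.com/elsie" id="link1">Elsie</a></p>'
        '<p>Plain <a id="link2">no href</a></p>')
soup = BeautifulSoup(html, 'html.parser')

def class_but_no_id(tag):
    # Hypothetical predicate: True for tags with a class attribute and no id.
    return tag.has_attr('class') and not tag.has_attr('id')

by_name = soup('a')                          # string: matches the tag name
by_list = soup(['h2', 'p'])                  # list: any of the names
by_class = soup(class_='intro')              # keyword with the class_ spelling
has_href = soup(href=True)                   # True: the attribute merely has to exist
by_regex = soup(href=re.compile('elsie'))    # regular expression on an attribute
by_text = soup(string='Elsie')               # text filter: returns strings, not tags
by_func = soup(class_but_no_id)              # function: arbitrary predicate
```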
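Another bowl: searching by document structure
HTML parses into a tree of tags, so you can also search by the relationships between tags in that tree.
- Searching the ancestors: find_parents() and find_parent()
- Searching the following siblings: find_next_siblings() and find_next_sibling()
- Searching the preceding siblings: find_previous_siblings() and find_previous_sibling()
  These four sibling methods only look at siblings under the same parent.
- Searching the descendants: this is what the find and find_all described above do.
- Searching the following nodes (regardless of parent, child, or sibling relationships): find_all_next() and find_next()
- Searching the preceding nodes (regardless of parent, child, or sibling relationships): find_all_previous() and find_previous()
All of these methods take the same arguments as find, so the filters above can be combined with them.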
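To make the directions concrete, here is a sketch that starts from one list item and searches outward. It assumes Beautiful Soup 4; the menu markup is made up for the example.

```python
from bs4 import BeautifulSoup

html = ('<ul id="menu"><li>one</li><li class="current">two</li><li>three</li></ul>'
        '<p>after</p>')
soup = BeautifulSoup(html, 'html.parser')
current = soup.find('li', class_='current')

parent = current.find_parent('ul')                # upward
next_li = current.find_next_sibling('li')         # forward, same parent only
prev_li = current.find_previous_sibling('li')     # backward, same parent only
sibling_p = current.find_next_sibling('p')        # <p> is not a sibling, so None
any_p = current.find_next('p')                    # forward through the whole document
earlier = current.find_all_previous('li')         # backward through the whole document
```

The contrast between sibling_p and any_p shows why the sibling methods and the "next node" methods are listed separately.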
Choosing soup by color: searching by CSS
Use the .select() method; see http://www.crummy.com/software/BeautifulSoup/bs4/doc/#css-selectors
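A few common selector shapes, sketched under the assumption of a recent Beautiful Soup 4 install (where select() and select_one() are available) and an invented document:

```python
from bs4 import BeautifulSoup

html = ('<div id="main"><p class="intro">Hi <a href="/a" id="link1">A</a></p>'
        '<p>Bye <a href="/b" id="link2">B</a></p></div>')
soup = BeautifulSoup(html, 'html.parser')

intro_links = soup.select('p.intro a')      # descendant of a tag with a class
by_id = soup.select('#link2')               # by id
direct = soup.select('div > p')             # direct children only
one = soup.select_one('a[href="/b"]')       # first match by attribute value
nothing = soup.select('table')              # no match: empty result
```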
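A few small tricks
- BeautifulSoup supports several parsers, such as lxml, html5lib, and html.parser. For example: BeautifulSoup("<a></b>", "html.parser"). For how their results differ, see http://www.crummy.com/software/BeautifulSoup/bs4/doc/#differences-between-parsers
- Before parsing, BeautifulSoup converts the text to Unicode. Use from_encoding to specify the encoding, for example: BeautifulSoup(markup, from_encoding="iso-8859-8")
- soup.prettify() outputs nicely indented HTML. For Chinese text you can pass an encoding so that it displays correctly, for example soup.prettify("gbk")
- If you still have encoding problems, see http://www.crummy.com/software/BeautifulSoup/bs4/doc/#unicode-dammit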
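The two encoding tips can be combined in a round trip. This sketch assumes Beautiful Soup 4 on Python 3 and simulates GBK input by encoding a literal string:

```python
from bs4 import BeautifulSoup

raw = '<p>中文</p>'.encode('gbk')   # bytes as they might arrive from a GBK page

# Tell the parser which encoding to decode with instead of letting it guess.
soup = BeautifulSoup(raw, 'html.parser', from_encoding='gbk')
text = soup.p.string                # already decoded to Unicode

encoded = soup.prettify('gbk')      # with an encoding argument, prettify returns bytes
print(encoded.decode('gbk'))
```

Passing from_encoding only makes sense for byte input; a str is already Unicode and needs no decoding.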
Reposted from: http://cndenis.iteye.com/blog/1746706
Two important attributes of a soup:
.contents and .children
A tag’s children are available in a list called .contents:
head_tag = soup.head
head_tag
# <head><title>The Dormouse's story</title></head>
head_tag.contents
# [<title>The Dormouse's story</title>]
type(head_tag.contents[0])
# <class 'BeautifulSoup.Tag'>  (the items in .contents are not plain strings but Beautiful Soup's own types)
title_tag = head_tag.contents[0]
title_tag
# <title>The Dormouse's story</title>
title_tag.contents
# [u'The Dormouse's story']
The BeautifulSoup object itself has children. In this case, the <html> tag is the child of the BeautifulSoup object:
len(soup.contents)
# 1
soup.contents[0].name
# u'html'
A string does not have .contents, because it can’t contain anything:
text = title_tag.contents[0]
text.contents
# AttributeError: 'NavigableString' object has no attribute 'contents'
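In Beautiful Soup 3, if a tag contains another HTML tag, its .string is None, whether or not a string precedes the inner tag. (Beautiful Soup 4 differs in one case: a tag whose only child is another tag takes that child's .string.)
soup=BeautifulSoup("<head><title>The Dormouse's story</title></head>")
head=soup.head
print head.string
Under Beautiful Soup 3 this prints None, which illustrates the point.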
Instead of getting them as a list, you can iterate over a tag’s children using the .children generator:
for child in title_tag.children:
    print(child)
# The Dormouse's story
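The relationship between .contents, .children, and the node types can be checked with a short sketch, assuming Beautiful Soup 4 on Python 3 (where the classes live in bs4 rather than in the BeautifulSoup module shown above):

```python
from bs4 import BeautifulSoup, NavigableString, Tag

soup = BeautifulSoup("<html><head><title>The Dormouse's story</title></head></html>",
                     'html.parser')
head_tag = soup.head
title_tag = head_tag.contents[0]      # .contents is a plain list of child nodes

# Tags contain other nodes; the text inside <title> is a NavigableString.
kinds = [type(node).__name__
         for node in (soup.contents[0], title_tag, title_tag.contents[0])]

children = list(title_tag.children)   # .children iterates over the same nodes
print(kinds, children)
```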
A function that collects text recursively:
def gettextonly(self, soup):
    v = soup.string
    if v is None:
        # No single string here: recurse into every child node
        c = soup.contents
        resulttext = ''
        for t in c:
            subtext = self.gettextonly(t)
            resulttext += subtext + '\n'
        return resulttext
    else:
        return v.strip()
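A standalone variant of the same recursion might look like this. It is a sketch assuming Beautiful Soup 4 on Python 3; the function name get_text_only and the sample markup are illustrative, and it joins the pieces with newlines instead of appending one after each piece.

```python
from bs4 import BeautifulSoup

def get_text_only(node):
    """Collect the stripped text of node and its descendants, one piece per line."""
    text = node.string
    if text is None:
        # Several (or zero) children: recurse into each of them.
        return '\n'.join(get_text_only(child) for child in node.contents)
    return text.strip()

soup = BeautifulSoup('<div><p>Hello <b>world</b></p><p>again</p></div>', 'html.parser')
collected = get_text_only(soup.div)

# Beautiful Soup 4 also ships a built-in that does much the same job.
builtin = soup.div.get_text('\n', strip=True)
print(collected)
```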
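A function that splits a string into words:
def separatewords(self, text):
    # Split on non-word characters and lower-case the non-empty pieces
    splitter = re.compile('\\W')
    return [s.lower() for s in splitter.split(text) if s != '']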
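For experimenting outside a class, the splitter can be written as a plain function. This sketch uses only the standard library; the name separate_words and the sample sentence are made up, and it splits on runs of non-word characters (\W+) rather than on single characters.

```python
import re

# Compile once; \W+ matches one or more characters that are not letters, digits, or underscore.
SPLITTER = re.compile(r'\W+')

def separate_words(text):
    """Return the lower-cased words of text, dropping empty pieces."""
    return [s.lower() for s in SPLITTER.split(text) if s != '']

words = separate_words("Hello, World! It's 2024.")
print(words)
```

Note that an apostrophe counts as a non-word character, so "It's" becomes two tokens.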