Python Crawlers: Basic Functions and Framework of the BeautifulSoup Library

This article is a repost. 2019-08-03 19:41. Python Crawler

Installation:

On Windows: open cmd with "Run as administrator" and execute:

pip install beautifulsoup4

Understanding the Beautiful Soup library:

Beautiful Soup parsers:

Basic elements of the Beautiful Soup library:

HTML traversal methods based on the bs4 library:

Downward traversal:

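from bs4 import BeautifulSoup

# html must hold the page's HTML text (a URL string would be parsed as plain text)
soup = BeautifulSoup(html, "html.parser")

# Iterate over the direct children
for child in soup.body.children:
    print(child)


# Iterate over all descendants
for child in soup.body.descendants:
    print(child)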
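
To make the difference between the two loops concrete, here is a minimal sketch. The HTML snippet and variable names are made up for illustration; only tags are collected, so the text nodes that both iterators also yield are skipped.

```python
from bs4 import BeautifulSoup, Tag

# Hypothetical snippet, used only for this illustration
html = "<body><p>Hello <b>world</b></p><div>bye</div></body>"
soup = BeautifulSoup(html, "html.parser")

# .children stops at the first level below <body>
child_tags = [node.name for node in soup.body.children if isinstance(node, Tag)]

# .descendants walks the whole subtree, so the nested <b> shows up too
descendant_tags = [node.name for node in soup.body.descendants if isinstance(node, Tag)]

print(child_tags)       # ['p', 'div']
print(descendant_tags)  # ['p', 'b', 'div']
```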

Upward traversal of the tag tree:

Sideways traversal of the tag tree:

# Iterate over the siblings that follow
for sibling in soup.a.next_siblings:
    print(sibling)


# Iterate over the siblings that precede
for sibling in soup.a.previous_siblings:
    print(sibling)

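Summary:

Function calls:

soup = BeautifulSoup(open("index.html"), "html.parser")
# Opens index.html in the current directory and parses it

soup.prettify() returns the whole HTML file's DOM tree as an indented string; pass the result to print() to display it.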
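
A small sketch of what prettify() gives back, using a one-line document invented for this example. It shows that the method returns a string rather than printing anything itself.

```python
from bs4 import BeautifulSoup

# Hypothetical one-line document
soup = BeautifulSoup("<html><body><p>Hi</p></body></html>", "html.parser")

pretty = soup.prettify()       # a str, one tag or text node per line
print(type(pretty).__name__)   # str
print(pretty)
```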

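Parsing the BeautifulSoup object

There are three ways, as I see it, to pull the content you want out of the HTML:

1) Using Tag objects

As noted above, BeautifulSoup converts a complex HTML document into a tree structure in which every node is a Python object. This is similar in spirit to the Gson library on Android. Node objects come in 4 kinds: Tag, NavigableString, BeautifulSoup, and Comment.

A Tag object corresponds to a tag in the HTML. The examples below make this more concrete. They all assume the document that prettify() printed.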
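
The sample document itself is not reproduced in this article. The sketch below rebuilds a plausible version of it from the outputs shown in the examples. The exact markup is an assumption, so some outputs may differ slightly from the ones printed here.

```python
from bs4 import BeautifulSoup

# Assumed sample document, pieced together from the outputs in this article
SAMPLE_HTML = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>
</p>
</body></html>"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
print(soup.prettify())
```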

Example 1

Get the head tag

print(soup.head)
# Output:
<head><title>The Dormouse's story</title></head>

Example 2

Get the title tag

print(soup.title)
# Output:
<title>The Dormouse's story</title>

Example 3

Get the p tag

print(soup.p)
# Output:
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>

If the document contains several tags with the requested name, this form of access returns only the first one that matches.
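
A short sketch of that first-match behaviour, on a two-paragraph snippet made up for this example. It also shows that asking for a tag that does not exist yields None.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet with two <p> tags
soup = BeautifulSoup("<p>first</p><p>second</p>", "html.parser")

first_p = soup.p                                     # only the first <p>
all_p_texts = [p.get_text() for p in soup.find_all("p")]
missing = soup.table                                 # no <table> in the document

print(first_p.get_text())  # first
print(all_p_texts)         # ['first', 'second']
print(missing)             # None
```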

Objects generally have attributes, and Tag objects are no exception. A Tag has two very important attributes: name and attrs.

name
The name attribute is the Tag's tag name. There is one special case: the name of the soup object itself is [document].

print(soup.name)
print(soup.head.name)
# Output:
[document]
head

attrs
The attrs attribute holds the HTML attributes of the Tag, as a dictionary.

print(soup.p.attrs)
# Output:
{'class': ['title'], 'name': 'dromouse'}
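
Individual attributes can also be read straight from the tag. A minimal sketch, reusing the p tag from the output above as a standalone snippet:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<p class="title" name="dromouse"><b>The Dormouse\'s story</b></p>',
    "html.parser",
)
p = soup.p

css_classes = p["class"]      # class is multi-valued, so this is a list
name_attr = p.get("name")     # the HTML attribute called "name"
missing_attr = p.get("id")    # .get() returns None instead of raising KeyError

print(p.name)        # p  (the tag name, not the HTML attribute)
print(css_classes)   # ['title']
print(name_attr)     # dromouse
print(missing_attr)  # None
```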

The other three object types, in brief:

NavigableString

Simply put, this is the text content inside a Tag.

print(soup.title.string)
# Output:
The Dormouse's story
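
A brief sketch of how .string behaves, on a snippet invented for this example. It is only set when a tag has a single child; for mixed content, get_text() is the usual fallback.

```python
from bs4 import BeautifulSoup, NavigableString

# Hypothetical snippet: one tag with a single string, one with mixed content
soup = BeautifulSoup(
    "<title>The Dormouse's story</title><p>one <b>two</b></p>",
    "html.parser",
)

title_text = soup.title.string   # a NavigableString (a str subclass)
mixed = soup.p.string            # None, because <p> has more than one child
all_text = soup.p.get_text()     # concatenates every string under <p>

print(isinstance(title_text, NavigableString))  # True
print(mixed)                                    # None
print(all_text)                                 # one two
```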

BeautifulSoup

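A BeautifulSoup object represents the entire content of a document. Most of the time it can be treated as a Tag object; it is a special kind of Tag.

print(type(soup.name))
print(soup.name)
print(soup.attrs)
# Output (Python 3):
<class 'str'>
[document]
{}  (an empty dictionary)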

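Comment

A Comment object is a special type of NavigableString. An HTML page may contain comments and other special strings that we usually do not want, so it is best to check the type before using a string. For example:

import bs4
if isinstance(soup.a.string, bs4.element.Comment):
    ...    # do something else, such as printing the content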
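
As a runnable sketch of that check, the snippet below filters out links whose only content is a comment. The second link (Lacie) is a made-up addition for contrast.

```python
from bs4 import BeautifulSoup, Comment

# Hypothetical snippet: the first link holds only a comment
html = '<a id="link1"><!-- Elsie --></a><a id="link2">Lacie</a>'
soup = BeautifulSoup(html, "html.parser")

# Keep only the strings that are real text, not HTML comments
visible = [
    str(a.string)
    for a in soup.find_all("a")
    if not isinstance(a.string, Comment)
]

print(visible)  # ['Lacie']
```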

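2) Using filters

The filter is the find_all() function, which returns everything that matches the given conditions as a list. Its signature is:

find_all(name, attrs, recursive, string, limit, **kwargs)

The name argument can take several forms:

(1) Tag name

print(soup.find_all('p'))
# Output: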
[<p class="title" name="dromouse"><b>The Dormouse's story</b></p>, <p class="story">Once upon a time there were three little sisters; and their names were</p>]

(2) Regular expression

import re
print(soup.find_all(re.compile('^p')))
# Output:
[<p class="title" name="dromouse"><b>The Dormouse's story</b></p>, <p class="story">Once upon a time there were three little sisters; and their names were</p>]

(3) List
If the argument is a list, a tag matches when its name equals any element of the list. The code makes this clear.

print(soup.find_all(['p', 'a']))
# Output:
[<p class="title" name="dromouse"><b>The Dormouse's story</b></p>,  <p class="story">Once upon a time there were three little sisters; and their names were</p>,  <a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>]

In addition, the attrs argument can be used as a filter condition, and the limit argument caps the number of results returned.
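
A minimal sketch of attrs and limit in use. The three links are hypothetical; only the first resembles the sample document.

```python
from bs4 import BeautifulSoup

# Hypothetical links for illustration
html = """
<a class="sister" id="link1" href="http://example.com/elsie">Elsie</a>
<a class="sister" id="link2" href="http://example.com/lacie">Lacie</a>
<a class="brother" id="link3" href="http://example.com/tom">Tom</a>
"""
soup = BeautifulSoup(html, "html.parser")

# attrs: a dictionary of attribute values the tag must have
sister_ids = [a["id"] for a in soup.find_all("a", attrs={"class": "sister"})]

# limit: stop searching after this many matches
first_only = soup.find_all("a", limit=1)

# Keyword arguments are treated as attribute filters as well
by_id = soup.find_all(id="link3")

print(sister_ids)           # ['link1', 'link2']
print(len(first_only))      # 1
print(by_id[0].get_text())  # Tom
```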

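3) Using CSS selectors

This approach finds Tags using CSS selector syntax. It also relies on a single function, select(), which likewise returns a list. Its usage is shown below, again assuming the document that prettify() printed:

(1) Find by tag name

print(soup.select('head'))
# Output: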
[<head><title>The Dormouse's story</title></head>]

(2) Find by id

print(soup.select('#link1'))
# Output:
[<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>]

(3) Find by class

print(soup.select('.sister'))
# Output:
[<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>]

(4) Find by attribute

print(soup.select('p[name=dromouse]'))
# Output:
[<p class="title" name="dromouse"><b>The Dormouse's story</b></p>]

print(soup.select('p[class=title]'))
# Output:
[<p class="title" name="dromouse"><b>The Dormouse's story</b></p>]

(5) Combined selectors

print(soup.select("body p"))
# Output:
[<p class="title" name="dromouse"><b>The Dormouse's story</b></p>,
<p class="story">Once upon a time there were three little sisters; and their names were</p>]

print(soup.select("p > a"))
# Output:
[<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>]

print(soup.select("p > .sister"))
# Output:
[<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>]
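
When only one element is expected, select_one() avoids indexing into the list. A small sketch on a snippet adapted from the sample document (the link text is an assumption):

```python
from bs4 import BeautifulSoup

# Snippet adapted from the sample document; the link text is made up
html = (
    '<p class="story">Once upon a time '
    '<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a></p>'
)
soup = BeautifulSoup(html, "html.parser")

# select() always returns a list, even for a single match
hrefs = [a["href"] for a in soup.select("p.story > a.sister")]

# select_one() returns the first match, or None when nothing matches
first_link = soup.select_one("#link1")
no_match = soup.select_one("#does-not-exist")

print(hrefs)                  # ['http://example.com/elsie']
print(first_link.get_text())  # Elsie
print(no_match)               # None
```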

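5 Handling parent and child relationships

From the above we can already obtain node objects, but sometimes we need the content of a node's parent or children. How do we do that? By traversing the parse tree.

(1) Getting child nodes
Use the .children attribute, which returns all direct children of the current node. Its return type is an iterator, not a list.

(2) Getting all descendant nodes
Use the .descendants attribute, which returns an iterator over all descendant nodes.

(3) Getting the parent node
The .parent attribute returns the node's direct parent (a single node, not an iterator).

(4) Getting all ancestor nodes
The .parents attribute returns an iterator over all of the node's ancestors, from the direct parent up to the document root.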
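
A minimal sketch of upward traversal, on a nested snippet invented for this example:

```python
from bs4 import BeautifulSoup

# Hypothetical nested snippet
html = "<html><body><p><a href='#'>link</a></p></body></html>"
soup = BeautifulSoup(html, "html.parser")
link = soup.a

# .parent is the single enclosing node
parent_name = link.parent.name

# .parents walks upward until it reaches the BeautifulSoup object itself
ancestor_names = [ancestor.name for ancestor in link.parents]

print(parent_name)     # p
print(ancestor_names)  # ['p', 'body', 'html', '[document]']
```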

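(5) Getting sibling nodes
Sibling nodes are nodes at the same level as the current node. The .next_sibling attribute returns the node's next sibling and .previous_sibling returns the previous one. If there is no such node, the result is None.

Note: in real HTML, a tag's .next_sibling and .previous_sibling are often strings or whitespace, because whitespace and line breaks between tags also count as nodes. The result may therefore be blank or a newline.

(6) Getting all sibling nodes
The .next_siblings and .previous_siblings attributes iterate over the siblings that follow or precede the current node.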
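
The whitespace caveat is easy to trip over, so here is a short sketch on a made-up snippet. It shows the newline node and two ways to skip it: filtering .next_siblings by type, or calling find_next_sibling().

```python
from bs4 import BeautifulSoup, Tag

# Hypothetical snippet with line breaks between the tags
html = """<div>
<a id="one">1</a>
<a id="two">2</a>
<a id="three">3</a>
</div>"""
soup = BeautifulSoup(html, "html.parser")
first = soup.find("a", id="one")

# The immediate sibling is the newline between the two tags, not the next <a>
raw_sibling = first.next_sibling

# Filter by type to keep only tags
later_ids = [s["id"] for s in first.next_siblings if isinstance(s, Tag)]

# find_next_sibling() skips text nodes and returns the next matching tag
next_tag_id = first.find_next_sibling("a")["id"]

print(repr(str(raw_sibling)))  # '\n'
print(later_ids)               # ['two', 'three']
print(next_tag_id)             # two
```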
