Python Crawling for Beginners (22): The Beautiful Soup Parsing Library (Part 2)

Reposted from the original article · 2019-12-19 09:15 · Python crawler / Python

Life is short, I use Python

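Previous posts in this series:

Python Crawling for Beginners (1): Introduction

Python Crawling for Beginners (2): Prerequisites (1) Installing the Basic Libraries

Python Crawling for Beginners (3): Prerequisites (2) Linux Basics

Python Crawling for Beginners (4): Prerequisites (3) Docker Basics

Python Crawling for Beginners (5): Prerequisites (4) Database Basics

Python Crawling for Beginners (6): Prerequisites (5) Installing Crawler Frameworks

Python Crawling for Beginners (7): HTTP Basics

Python Crawling for Beginners (8): Web Page Basics

Python Crawling for Beginners (9): Crawler Basics

Python Crawling for Beginners (10): Session and Cookies

Python Crawling for Beginners (11): urllib Basics (1)

Python Crawling for Beginners (12): urllib Basics (2)

Python Crawling for Beginners (13): urllib Basics (3)

Python Crawling for Beginners (14): urllib Basics (4)

Python Crawling for Beginners (15): urllib Basics (5)

Python Crawling for Beginners (16): urllib in Practice, Scraping Meizitu

Python Crawling for Beginners (17): Requests Basics

Python Crawling for Beginners (18): Requests Advanced Usage

Python Crawling for Beginners (19): XPath Basics

Python Crawling for Beginners (20): XPath Advanced

Python Crawling for Beginners (21): The Beautiful Soup Parsing Library (Part 1)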

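Introduction

The selection methods covered in the previous post all work through attribute access. That approach is very simple to use, but it becomes unwieldy once the DOM structure gets complicated.

So Beautiful Soup also provides search methods such as find_all() and find(). When a DOM node is awkward to reach through attribute access, we can simply search for it.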

find_all()

First, the signature:

find_all(name, attrs, recursive, string, limit, **kwargs)

find_all() searches all descendant tags of the current tag and returns every one that matches the filter conditions.
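The recursive argument is not demonstrated elsewhere in this post, so here is a minimal sketch of it, together with the limit argument that find_all() also accepts. The tiny HTML snippet is made up for illustration, and the built-in html.parser is used instead of lxml only so the sketch runs without an extra dependency.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature document, made up for this sketch.
html = "<div><p>outer</p><section><p>inner</p></section></div>"
soup = BeautifulSoup(html, "html.parser")
div = soup.div

# Default behaviour (recursive=True): every descendant <p> is considered.
all_p = div.find_all("p")

# recursive=False: only the direct children of <div> are considered.
direct_p = div.find_all("p", recursive=False)

# limit=1: stop searching after the first match.
first_p = div.find_all("p", limit=1)

print([p.string for p in all_p])     # ['outer', 'inner']
print([p.string for p in direct_p])  # ['outer']
print(len(first_p))                  # 1
```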

name

The name parameter finds every tag whose name is name. Text strings in the document are ignored automatically.

from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

soup = BeautifulSoup(html_doc, 'lxml')

print(soup.find_all(name = "a"))
print(type(soup.find_all(name = "a")[0]))

The output:

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
<class 'bs4.element.Tag'>

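This time the sample document is an inline string, mainly so that it is easier to follow: no more cross-referencing screenshots.

This example calls find_all() with the name parameter set to a, meaning we want every <a> node. The return value is a list (strictly a bs4.element.ResultSet, which is a list subclass) of length 3, and each element is a bs4.element.Tag.

Because each element is a bs4.element.Tag, we can read its content using the attributes introduced in the previous post: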

for a in soup.find_all(name = "a"):
    print(a.string)

The output:

Elsie
Lacie
Tillie
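Besides a plain string, the name filter can also be a list, a compiled regular expression, True, or a function. The sketch below illustrates these forms on a shortened, made-up variant of the sample document; html.parser is used here only so it runs without lxml installed.

```python
import re
from bs4 import BeautifulSoup

# Shortened, hypothetical variant of the sample document.
html = (
    '<p class="title"><b>The Dormouse\'s story</b></p>'
    '<a href="http://example.com/elsie" id="link1">Elsie</a>'
)
soup = BeautifulSoup(html, "html.parser")

# A list matches any of the given tag names.
from_list = [tag.name for tag in soup.find_all(["a", "b"])]

# A compiled regex is matched against each tag name.
from_regex = [tag.name for tag in soup.find_all(re.compile("^b"))]

# True matches every tag (but no text strings).
from_true = [tag.name for tag in soup.find_all(True)]

# A function receives each tag and returns True to keep it.
from_func = [tag.name for tag in soup.find_all(lambda tag: tag.has_attr("id"))]

print(from_list)   # ['b', 'a']
print(from_regex)  # ['b']
print(from_true)   # ['p', 'b', 'a']
print(from_func)   # ['a']
```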

attrs

Besides searching by name, we can also query by attribute:

print(soup.find_all(attrs={'id': 'link1'}))
print(soup.find_all(attrs={'id': 'link2'}))
print(type(soup.find_all(attrs={'id': 'link1'})))
print(type(soup.find_all(attrs={'id': 'link2'})))

The output:

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
[<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
<class 'bs4.element.ResultSet'>
<class 'bs4.element.ResultSet'>

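In this example we pass the attrs parameter, whose value is a dictionary. Note that find_all() returns a ResultSet even when only one node matches.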
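The attrs dictionary can carry several keys at once and can be combined with a tag name. The following is a small sketch on made-up markup (the featured class and the <span> are assumptions added for illustration), parsed with html.parser so it runs without lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for this sketch.
html = """
<a class="sister" id="link1" href="/elsie">Elsie</a>
<a class="sister featured" id="link2" href="/lacie">Lacie</a>
<span class="sister" id="note">not a link</span>
"""
soup = BeautifulSoup(html, "html.parser")

# Tag name plus attrs: only <a> tags that carry the class "sister".
# A class filter matches if it is any one of the tag's classes.
sister_links = soup.find_all("a", attrs={"class": "sister"})

# Several keys in attrs must all match.
exact = soup.find_all(attrs={"class": "sister", "id": "link2"})

print([a["id"] for a in sister_links])  # ['link1', 'link2']
print([t["id"] for t in exact])         # ['link2']
```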

string

This parameter matches against the text of a node. It accepts either a string or a compiled regular expression object:

import re

print(soup.find_all(string=re.compile('sisters')))

The output:

['Once upon a time there were three little sisters; and their names were\n']
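As the output above shows, a string-only search returns the text nodes themselves, not tags. Here is a hedged sketch of getting back to the enclosing tag, and of combining string with a tag name. The one-line document is a made-up abbreviation of the sample, and html.parser is used so the sketch runs without lxml.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical one-line abbreviation of the sample document.
html = (
    "<p>Once upon a time there were three little sisters: "
    '<a id="link1">Elsie</a> and <a id="link2">Lacie</a>.</p>'
)
soup = BeautifulSoup(html, "html.parser")

# string alone returns NavigableString objects; .parent is the enclosing tag.
texts = soup.find_all(string=re.compile("sisters"))
parents = [text.parent.name for text in texts]

# Combined with a tag name, the result is the tags whose .string matches.
links = soup.find_all("a", string="Elsie")

# A list of strings matches any of them exactly.
exact = soup.find_all(string=["Elsie", "Lacie"])

print(parents)                     # ['p']
print([a["id"] for a in links])    # ['link1']
print([str(s) for s in exact])     # ['Elsie', 'Lacie']
```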

keyword

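If a keyword argument is not one of the built-in parameter names, find_all() treats it as a filter on the tag attribute of that name. The example below searches for the node whose id is link1 and the node whose class is title. Because class is a reserved word in Python, the keyword is spelled class_: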

print(soup.find_all(id='link1'))
print(soup.find_all(class_='title'))

The output:

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
[<p class="title"><b>The Dormouse's story</b></p>]

Of course, we can also pass several keyword arguments to filter on multiple attributes of a tag at once:

print(soup.find_all(href=re.compile("elsie"), id='link1'))

The output:

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

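Some tag attributes cannot be used as keyword arguments, for example the HTML5 data-* attributes, because a name such as data-foo is not a valid Python identifier. In those cases, use the attrs parameter described above.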
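The data-* case can be sketched as follows. The markup and the attribute name data-foo are made up for illustration, and html.parser is used so the sketch runs without lxml. The same workaround applies to an HTML attribute that happens to be called name, since the name keyword is already taken by the tag-name filter.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for this sketch.
html = (
    '<div data-foo="value">foo!</div>'
    '<div data-foo="other">bar</div>'
    '<input name="email" type="text"/>'
)
soup = BeautifulSoup(html, "html.parser")

# soup.find_all(data-foo="value") would be a SyntaxError in Python,
# so the attribute is passed through the attrs dictionary instead.
data_matches = soup.find_all(attrs={"data-foo": "value"})

# name="email" as a keyword would look for <email> tags and find none.
by_tag_name = soup.find_all(name="email")

# Passing it through attrs filters on the HTML name attribute.
by_name_attr = soup.find_all(attrs={"name": "email"})

print([d.string for d in data_matches])   # ['foo!']
print(by_tag_name)                        # []
print([i["type"] for i in by_name_attr])  # ['text']
```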

find()

find() is very similar to find_all(), except that instead of returning every matching node it returns only the first match, or None if nothing matches. A few simple examples:

print(soup.find(name = "a"))
print(type(soup.find(name = "a")))

The output:

<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
<class 'bs4.element.Tag'>
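Since find() returns a single Tag rather than a list, a failed search yields None, and accessing an attribute on it raises an error. A minimal sketch of guarding against that, on made-up markup and with html.parser so it runs without lxml:

```python
from bs4 import BeautifulSoup

# Hypothetical markup for this sketch.
html = '<p class="story"><a id="link1" href="http://example.com/elsie">Elsie</a></p>'
soup = BeautifulSoup(html, "html.parser")

first_link = soup.find("a")     # a Tag
missing = soup.find("table")    # None: there is no <table> in the document
missing_all = soup.find_all("table")  # find_all() gives an empty list instead

# Check for None before reading attributes from the result.
href = first_link["href"] if first_link is not None else None
missing_href = missing["href"] if missing is not None else None

print(href)          # http://example.com/elsie
print(missing_href)  # None
print(missing_all)   # []
```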

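For the remaining search methods, see the official documentation. Here is a brief list:

find_parents() and find_parent(): search the ancestors of the current node. The former returns all matching ancestors, the latter only the nearest one.
find_next_siblings() and find_next_sibling(): the former returns all following siblings, the latter the first following sibling.
find_previous_siblings() and find_previous_sibling(): the former returns all preceding siblings, the latter the first preceding sibling.
find_all_next() and find_next(): the former returns all matching nodes after the current node, the latter the first matching one.
find_all_previous() and find_previous(): the former returns all matching nodes before the current node, the latter the first matching one.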
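A short sketch of a few of these methods, starting from the middle link of a made-up, whitespace-free variant of the sample document (html.parser is used so it runs without lxml). Passing a tag name to each call keeps text nodes out of the results.

```python
from bs4 import BeautifulSoup

# Hypothetical compact variant of the sample document.
html = (
    '<p class="story"><a id="link1">Elsie</a>, '
    '<a id="link2">Lacie</a> and <a id="link3">Tillie</a>;</p>'
    '<p class="end">...</p>'
)
soup = BeautifulSoup(html, "html.parser")
link2 = soup.find(id="link2")

# Nearest enclosing <p>; class is multi-valued, so it comes back as a list.
parent_class = link2.find_parent("p")["class"]

# Siblings that share the same parent, after and before link2.
next_ids = [a["id"] for a in link2.find_next_siblings("a")]
prev_id = link2.find_previous_sibling("a")["id"]

# Document-order search, not limited to siblings.
next_p_class = link2.find_next("p")["class"]
earlier_ids = [a["id"] for a in link2.find_all_previous("a")]

print(parent_class)  # ['story']
print(next_ids)      # ['link3']
print(prev_id)       # link1
print(next_p_class)  # ['end']
print(earlier_ids)   # ['link1']
```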

CSS

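Besides attribute access and the search methods above, Beautiful Soup offers one more way to select nodes: CSS selectors.

If you are not familiar with CSS selectors, see: https://www.w3school.com.cn/css/index.asp .

Using CSS selectors is straightforward: call select() and pass in the selector. A few simple examples: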

print(soup.select('#link1'))
print(type(soup.select('#link1')[0]))
print(soup.select('.story .sister'))

The output:

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
<class 'bs4.element.Tag'>
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

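As you can see, the result of a CSS selector query is again a list whose elements are bs4.element.Tag objects, which means we can use their attributes to get at the information we need.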
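To make that last point concrete, here is a minimal sketch that reads attributes and text from select() results and uses select_one() to fetch a single node. The markup is a made-up two-link excerpt of the sample, and html.parser is used so the sketch runs without lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical two-link excerpt of the sample document.
html = (
    '<p class="story">'
    '<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a> '
    '<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>'
    '</p>'
)
soup = BeautifulSoup(html, "html.parser")

# Child combinator: <a class="sister"> directly inside <p class="story">.
hrefs = [a["href"] for a in soup.select("p.story > a.sister")]

# select_one() returns the first match as a Tag, or None if nothing matches.
first_text = soup.select_one("#link1").get_text()
nothing = soup.select_one("#no-such-id")

# Attribute selector: href ending in "tillie".
tillie_ids = [a["id"] for a in soup.select('a[href$="tillie"]')]

print(hrefs)       # ['http://example.com/elsie', 'http://example.com/tillie']
print(first_text)  # Elsie
print(nothing)     # None
print(tillie_ids)  # ['link3']
```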

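Summary

That wraps up this brief introduction to Beautiful Soup. A few takeaways:

When choosing a parser, prefer lxml where possible: the official documentation recommends it for speed.
Selecting nodes through attribute access is simple but limited.
find_all() and find() comfortably handle the vast majority of tasks.
CSS selectors are recommended for readers who already know them; for picking DOM nodes, they are hard to beat for convenience.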

Sample code

All code for this series is kept in repositories on Github and Gitee for easy access.

Sample code on Github

Sample code on Gitee

References

https://beautifulsoup.readthedocs.io/zh_CN/v4.4.0/#
