Python Crawlers: The BeautifulSoup find_all() and find() Methods Explained

Reposted article, 2019-08-08 12:57. Tags: Python, Crawler

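find() and findAll() are defined as follows (in Beautiful Soup 4, findAll is the legacy name of find_all, and the corresponding parameters are called name, attrs, recursive, string, limit and **kwargs):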

findAll(tag, attributes, recursive, text, limit, keywords)
find(tag, attributes, recursive, text, keywords)

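The only difference:

*find() returns the first matching tag (or None if nothing matches)

*find_all() returns a list of all matching results (an empty list if nothing matches)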
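
The difference in return types can be seen in a minimal sketch like the one below. The HTML fragment and the variable names (html, soup, first_p, all_p) are made up for illustration, and the built-in "html.parser" backend is assumed so that no extra parser needs to be installed.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML fragment, used only for illustration.
html = """
<div>
  <p class="intro">first</p>
  <p>second</p>
  <p>third</p>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

first_p = soup.find("p")      # a single Tag, or None when nothing matches
all_p = soup.find_all("p")    # a list of every matching Tag

print(first_p.get_text())                  # first
print([p.get_text() for p in all_p])       # ['first', 'second', 'third']

# When nothing matches, find() gives None and find_all() gives an empty list.
print(soup.find("table"))                  # None
print(soup.find_all("table"))              # []
```

Because find() can return None, it is usually worth checking the result before calling methods on it.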

In most cases only the first two arguments, tag and attributes, are needed.

tag
Pass either a single tag name or a Python list of tag names as the tag argument. For example, the following code returns a list of all heading tags in an HTML document:
.findAll(["h1","h2","h3","h4","h5","h6"])
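
As a rough, self-contained illustration of passing a list of tag names, the sketch below runs the same search on a small made-up document. The html string and the names soup, headings and heading_names are assumptions for the example, and find_all is used as the current spelling of findAll.

```python
from bs4 import BeautifulSoup

# Hypothetical document with a few headings mixed in with other tags.
html = "<h1>Title</h1><p>Body</p><h2>Section</h2><h3>Subsection</h3>"
soup = BeautifulSoup(html, "html.parser")

# A list of names matches any tag whose name is in the list.
headings = soup.find_all(["h1", "h2", "h3", "h4", "h5", "h6"])

# Results come back in document order.
heading_names = [tag.name for tag in headings]
print(heading_names)  # ['h1', 'h2', 'h3']
```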

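attributes
The attributes argument is a Python dictionary that maps attribute names to the values they should match. For example, the following call returns the span tags in an HTML document whose class is either green or red:
.findAll("span", {"class": ["green", "red"]})
To return only one color, such as green:
.findAll("span", {"class": "green"})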
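
A small sketch of the attributes dictionary in use is shown below. The span contents and the variable names are invented for the example; only the find_all call pattern comes from the text above.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment with spans in three colors.
html = """
<span class="green">Anna</span>
<span class="red">said</span>
<span class="blue">nothing</span>
"""
soup = BeautifulSoup(html, "html.parser")

# A list value matches tags whose class is any one of the listed values.
green_or_red = soup.find_all("span", {"class": ["green", "red"]})
print([s.get_text() for s in green_or_red])  # ['Anna', 'said']

# A plain string value matches a single class.
green_only = soup.find_all("span", {"class": "green"})
print([s.get_text() for s in green_only])    # ['Anna']
```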

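recursive
The recursive argument is a Boolean that controls how deep into the document the search goes. If recursive is True, findAll searches all children of the tag it is called on, and their children in turn, for tags that match. If recursive is False, findAll only looks at the direct children of that tag (the top-level tags when called on the document itself). findAll is recursive by default (recursive defaults to True), and in most cases there is no need to change it.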
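
The effect of recursive can be sketched as follows. The nested markup and the names outer, deep and shallow are assumptions made for the example.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one <p> is a direct child of the div, one is nested deeper.
html = (
    '<div id="outer">'
    "<p>direct child</p>"
    "<section><p>nested</p></section>"
    "</div>"
)
soup = BeautifulSoup(html, "html.parser")
outer = soup.find("div", {"id": "outer"})

# Default (recursive=True): every descendant is searched.
deep = outer.find_all("p")
print([p.get_text() for p in deep])     # ['direct child', 'nested']

# recursive=False: only direct children of `outer` are considered.
shallow = outer.find_all("p", recursive=False)
print([p.get_text() for p in shallow])  # ['direct child']
```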

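text
The text argument is different in that it matches on the text content of tags rather than on their attributes. A plain string matches only text nodes whose content is exactly that string, and the result is a list of those strings rather than the tags. Since Beautiful Soup 4.4.0 this argument is named string; text is the older name. For example, to count how many times "the prince" appears as the complete text of a tag on the example page (parsed as bsObj), the earlier findAll call can be replaced with: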

nameList = bsObj.findAll(text="the prince")
print(len(nameList))
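
To make the exact-match behaviour concrete, here is a minimal sketch on a made-up fragment. It assumes Beautiful Soup 4.4.0 or later (for the string argument) and adds a regular-expression search for comparison; the regex variant is not part of the original article.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical fragment: two tags contain exactly "the prince", one contains more.
html = "<p>the prince</p><p>the prince smiled</p><b>the prince</b>"
soup = BeautifulSoup(html, "html.parser")

# A plain string matches only text nodes that are exactly equal to it.
exact = soup.find_all(string="the prince")
print(len(exact))  # 2

# The results are strings; .parent leads back to the enclosing tag.
parent_names = [s.parent.name for s in exact]
print(parent_names)  # ['p', 'b']

# A compiled regular expression matches text nodes that contain the pattern.
partial = soup.find_all(string=re.compile("the prince"))
print(len(partial))  # 3
```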

limit
The limit argument obviously applies only to findAll; find is equivalent to findAll with limit set to 1. Set it when you only care about the first x results found on the page.
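
A short sketch of limit, again on invented markup with illustrative variable names:

```python
from bs4 import BeautifulSoup

# Hypothetical list with four items.
html = "<ul><li>a</li><li>b</li><li>c</li><li>d</li></ul>"
soup = BeautifulSoup(html, "html.parser")

# Stop after the first two matches.
first_two = soup.find_all("li", limit=2)
print([li.get_text() for li in first_two])  # ['a', 'b']

# limit=1 finds the same tag as find(), but wrapped in a list.
first_as_list = soup.find_all("li", limit=1)
first = soup.find("li")
print(first_as_list[0].get_text() == first.get_text())  # True
```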

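keywords
Keyword arguments let you select tags that have a given attribute. They are technically redundant, as the following shows: the first line uses a keyword argument, and the second uses the first two arguments, tag and attributes: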

bsObj.findAll(id="text")
bsObj.findAll("", {"id":"text"})

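Keyword arguments occasionally cause problems, most notably when searching by the class attribute, because class is a reserved word in Python and writing class="green" is a syntax error. Beautiful Soup accepts class_ (with a trailing underscore) as a workaround, but in general the first two arguments, tag and attributes, are enough.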
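
The equivalence between keyword arguments and the attributes dictionary, and the class_ workaround, can be sketched as follows. The markup and variable names are assumptions for the example; attrs= is the Beautiful Soup 4 keyword name for the attributes dictionary.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment with one id and one class to search for.
html = '<div id="text">body</div><span class="green">Anna</span>'
soup = BeautifulSoup(html, "html.parser")

# Keyword form and dictionary form select the same tags.
by_keyword = soup.find_all(id="text")
by_attrs = soup.find_all(attrs={"id": "text"})
print(by_keyword == by_attrs)  # True

# soup.find_all(class="green") would be a SyntaxError, because `class`
# is reserved in Python. Use class_ or the dictionary form instead.
green_kw = soup.find_all(class_="green")
green_attrs = soup.find_all("span", {"class": "green"})
print([s.get_text() for s in green_kw])  # ['Anna']
```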
