Using find() and find_all() in practice

This article is a repost. 2019-12-12 16:32

Once we know the basics of the BeautifulSoup library, we can use it to parse HTML and extract the content we need from a web page.

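find() and find_all() are defined as follows:

    find(tag, attributes, recursive, text, keywords)

    find_all(tag, attributes, recursive, text, limit, keywords)

The names above are descriptive labels. In the bs4 documentation the corresponding parameters are called name, attrs, recursive, string (text is the older name), limit, and **kwargs.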

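The difference between the two: find() returns only the first element that matches, while find_all() returns matching elements up to the number set by the limit parameter. If limit is not set, find_all() returns every match. find() therefore behaves like find_all() with limit=1, except that it returns the element itself (or None when nothing matches) rather than a list.

find_all() collects all matching elements into a list (a ResultSet, which is a subclass of list).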
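The difference in return types can be seen without any network access. The following is a minimal sketch; the HTML snippet is made up for illustration, and it uses Python's built-in html.parser instead of lxml so that no extra parser is required.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML snippet, used only for illustration.
html = """
<ul>
  <li>first</li>
  <li>second</li>
  <li>third</li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

first = soup.find("li")                 # a single Tag (None if nothing matches)
all_items = soup.find_all("li")         # list-like ResultSet with every match
limited = soup.find_all("li", limit=2)  # stops after the first two matches

print(first.get_text())
print([li.get_text() for li in all_items])
print([li.get_text() for li in limited])

# When nothing matches, find() gives None and find_all() gives an empty list.
print(soup.find("table"))
print(len(soup.find_all("table")))
```

Because find() can return None, it is usually worth checking the result before calling methods on it.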

The sections below go through what each parameter does:

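1. The tag parameter

The tag parameter accepts a single tag name or a collection of several tag names, such as a list. For example, the following call returns a list of all link tags in the HTML document: find_all("a")

Take the Baidu home page as an example. Suppose we want to extract every link on the page. Inspecting the page source shows that links use the a tag, so we call soup.find_all('a').

The code is as follows: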

from bs4 import BeautifulSoup
import requests

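url = 'https://www.baidu.com/'
urlhtml = requests.get(url)                 # download the page
urlhtml.encoding = 'utf-8'                  # decode the response body as UTF-8
soup = BeautifulSoup(urlhtml.text, 'lxml')  # parse with lxml (must be installed)

n = soup.find_all('a')                      # every <a> tag in the document
print(n)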

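This finds every <a> tag on the page.

The example above uses a single tag. Searching for several tags works the same way; just put all the tag names in one collection, such as a list.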
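As a small sketch of the multi-tag case, the snippet below passes a list of tag names. The HTML string is invented for illustration, and html.parser is used so the example runs without lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML snippet, used only for illustration.
html = "<h1>Title</h1><p>Intro</p><h2>Section</h2><p>Body</p>"
soup = BeautifulSoup(html, "html.parser")

# A tag matches if its name is any one of the names in the list.
headings = soup.find_all(["h1", "h2"])
names = [tag.name for tag in headings]
texts = [tag.get_text() for tag in headings]

print(names)
print(texts)
```

Results come back in document order, regardless of the order of the names in the list.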

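2. The attributes parameter

The attributes parameter is a dictionary holding one or more attributes of a tag and their values. For example, the following call returns the a tags in the HTML document whose class is "mnav": find_all("a", {"class": "mnav"})

Suppose we now want the "新闻" (News) link and the other entries next to it in the navigation bar. Inspecting the page shows that they are a tags whose class attribute is "mnav".

So we can write: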

from bs4 import BeautifulSoup
import requests

url = 'https://www.baidu.com/'
urlhtml = requests.get(url)
urlhtml.encoding = 'utf-8'
soup = BeautifulSoup(urlhtml.text, 'lxml')

n = soup.find_all('a', {'class': 'mnav'})  # <a> tags whose class is mnav
print(n)

In the output, every link returned has the class "mnav".
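The same filter can be tried on a small offline document. This is a sketch under assumptions: the HTML, the class names other than mnav, and the href values are invented for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical navigation bar, used only for illustration.
html = """
<div id="nav">
  <a class="mnav" href="/news">News</a>
  <a class="mnav bold" href="/map">Map</a>
  <a class="login" href="/login">Log in</a>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# class is a multi-valued attribute: "mnav" matches any tag that has
# mnav among its classes, including class="mnav bold".
nav_links = soup.find_all("a", {"class": "mnav"})
texts = [a.get_text() for a in nav_links]

# With several entries in the dictionary, every one of them must match.
both = soup.find_all("a", {"class": "mnav", "href": "/map"})
hrefs = [a["href"] for a in both]

print(texts)
print(hrefs)
```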

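3. The recursive parameter

The recursive parameter is a Boolean that controls how deep the search goes. When recursive is True, find_all() examines all descendants of the element it is called on: its children, their children, and so on. When recursive is False, find_all() examines only the direct children of that element. The default is True, and in practice it rarely needs to be changed.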
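Although recursive is rarely changed, a short sketch makes the difference concrete. The HTML below is invented for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML snippet, used only for illustration.
html = """
<div id="outer">
  <p>direct child</p>
  <section>
    <p>nested deeper</p>
  </section>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
outer = soup.find("div", id="outer")

deep = outer.find_all("p")                      # searches all descendants
shallow = outer.find_all("p", recursive=False)  # direct children only

print([p.get_text() for p in deep])
print([p.get_text() for p in shallow])

# Called on the soup itself, recursive=False sees only top-level nodes,
# and the only top-level tag here is the <div>.
top_level = soup.find_all("p", recursive=False)
print(len(top_level))
```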

4. The text parameter

The text parameter is a little different: it matches against the text content of a tag rather than the tag's attributes.

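Using the Baidu home page again, let us check how many times "新闻" (News) appears on the page (it appears only once).

from bs4 import BeautifulSoup
import requests

url = 'https://www.baidu.com/'
urlhtml = requests.get(url)
urlhtml.encoding = 'utf-8'
soup = BeautifulSoup(urlhtml.text, 'lxml')

# The link text on the page is the simplified-Chinese string '新闻' ("News").
# Newer versions of bs4 call this argument string instead of text.
n = soup.find_all(text='新闻')
print(n)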

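The result is a list containing the one matching string.

Note one important point: a plain string is matched exactly. If you call find_all(text="新") here, you get 0 results.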
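The exact-match rule, and one way around it, can be sketched offline. The HTML below is invented for illustration. The example uses string=, which is the current name of the text argument, and a compiled regular expression for the partial match.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical HTML snippet, used only for illustration.
html = '<a href="/news">新闻</a><a href="/map">地图</a><p>新闻热点</p>'
soup = BeautifulSoup(html, "html.parser")

exact = soup.find_all(string="新闻")             # whole string must equal "新闻"
no_match = soup.find_all(string="新")            # no string is exactly "新"
partial = soup.find_all(string=re.compile("新"))  # regex search matches substrings

print(exact)
print(no_match)
print(partial)

# The results are strings, not tags; .parent gives the enclosing tag.
print(exact[0].parent.name)
```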

5. Keyword arguments (keywords)

Keyword arguments let you select tags that have a particular attribute value.

Using the same page, to get the element with id='wrapper':

from bs4 import BeautifulSoup
import requests

url = 'https://www.baidu.com/'
urlhtml = requests.get(url)
urlhtml.encoding = 'utf-8'
soup = BeautifulSoup(urlhtml.text, 'lxml')

n = soup.find_all(id='wrapper')  # tags whose id attribute is wrapper
print(n)

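The result is the content of the selected <div>.

Note: for an attribute such as id, a keyword argument and the attributes dictionary behave the same. class is a reserved word in Python, so its keyword form is class_. Attributes whose names are not valid Python identifiers (for example data-* attributes), or that clash with a parameter name (for example name), can only be passed through the attributes dictionary.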
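The following sketch compares the keyword form with the dictionary form. The HTML and attribute values are invented for illustration. The dictionary is passed by its real parameter name, attrs.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML snippet, used only for illustration.
html = """
<div id="wrapper" class="main" data-role="page">
  <input name="q" type="text">
</div>
"""
soup = BeautifulSoup(html, "html.parser")

by_keyword = soup.find_all(id="wrapper")          # keyword form
by_dict = soup.find_all(attrs={"id": "wrapper"})  # dictionary form, same result

by_class = soup.find_all(class_="main")           # class needs the trailing underscore

# data-role is not a valid Python identifier, so it must go in the dictionary.
by_data = soup.find_all(attrs={"data-role": "page"})

# name="q" as a keyword would search for <q> tags, so use the dictionary.
by_name = soup.find_all(attrs={"name": "q"})

print(by_keyword == by_dict)
print(by_class[0]["id"], by_data[0]["id"], by_name[0].name)
```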
