find and find_all in BeautifulSoup

This article is a repost. 2017-11-20 20:14. Tag: web crawlers

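1. In general, use the find() method to locate the first tag inside a BeautifulSoup object that matches a given filter.

The examples use ecologicalpyramid.html, a simple representation of an ecological pyramid (the file's markup is not reproduced in this text). Beautiful Soup can be used to find the first producer, the primary consumers or the secondary consumers in it.

Finding the first producer:

The producers sit in the first <ul> tag of the HTML document, so find() is enough to reach the first producer. Put the following code in ecologicalpyramid.py; it builds a BeautifulSoup object from the ecologicalpyramid.html file.

from bs4 import BeautifulSoup

# Parse the ecological pyramid document
with open('ecologicalpyramid.html', 'r') as ecological_pyramid:
    # Name the parser explicitly; without it bs4 picks one itself and warns
    soup = BeautifulSoup(ecological_pyramid, 'html.parser')
producer_entries = soup.find('ul')  # first <ul> in the document
print(producer_entries.li.div.string)

Output: plants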
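Because the HTML file itself is missing from this text, the following is a minimal, self-contained sketch that uses a reconstructed stand-in. The ids, class names and the names plants, algae, deer, fox, lion and tiger are taken from the code and outputs quoted in this article; the remaining names, the numbers and the PYRAMID_HTML variable are assumptions made only so the example runs.

```python
from bs4 import BeautifulSoup

# Assumed stand-in for ecologicalpyramid.html (a reconstruction, not the author's file).
PYRAMID_HTML = """
<div class="ecopyramid">
  <ul id="producers">
    <li class="producerlist"><div class="name">plants</div><div class="number">100000</div></li>
    <li class="producerlist"><div class="name">algae</div><div class="number">100000</div></li>
  </ul>
  <ul id="primaryconsumers">
    <li class="primaryconsumerlist"><div class="name">deer</div><div class="number">1000</div></li>
    <li class="primaryconsumerlist"><div class="name">rabbit</div><div class="number">2000</div></li>
  </ul>
  <ul id="secondaryconsumers">
    <li class="secondaryconsumerlist"><div class="name">fox</div><div class="number">100</div></li>
    <li class="secondaryconsumerlist"><div class="name">bear</div><div class="number">100</div></li>
  </ul>
  <ul id="tertiaryconsumers">
    <li class="tertiaryconsumerslist"><div class="name">lion</div><div class="number">80</div></li>
    <li class="tertiaryconsumerslist"><div class="name">tiger</div><div class="number">50</div></li>
  </ul>
</div>
"""

soup = BeautifulSoup(PYRAMID_HTML, "html.parser")

first_ul = soup.find("ul")            # first <ul> in document order
first_name = first_ul.li.div.string   # first <li>, then its first <div>
print(first_name)                     # plants
```

Saving the same markup to ecologicalpyramid.html should let the file-based snippets in this article run unchanged.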

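2. find()

Signature of find:

find(name, attrs, recursive, text, **kwargs)

These parameters act as filters that narrow the search. (Since Beautiful Soup 4.4.0 the text argument is also available under the name string.) The filters cover the following kinds of search:

Searching for tags, based on the name parameter
Searching for text, based on the text parameter
Searching with regular expressions
Searching by tag attributes, based on the attrs parameter
Searching with a function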

Searching by tag:

Pass any tag name to find its first occurrence. When a match is found, find returns a BeautifulSoup Tag object.

from bs4 import BeautifulSoup

with open('ecologicalpyramid.html', 'r') as ecological_pyramid:
    soup = BeautifulSoup(ecological_pyramid, 'html.parser')
producer_entries = soup.find('ul')
print(type(producer_entries))

Output: <class 'bs4.element.Tag'>
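One detail worth keeping in mind is that find() returns None when nothing matches, so chained attribute access on the result can fail. The sketch below illustrates a simple guard; the one-line document and the variable names are made up for the example.

```python
from bs4 import BeautifulSoup

html = "<ul><li>plants</li></ul>"   # tiny assumed document
soup = BeautifulSoup(html, "html.parser")

table = soup.find("table")          # there is no <table> in the markup
print(table)                        # None

# Check for None before dereferencing; table.tr would raise AttributeError here.
first_item = soup.find("li")
item_text = first_item.string if first_item is not None else None
print(item_text)                    # plants
```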

Searching by text:

A bare string passed to find() is treated as a tag name. To search for text instead, use the text parameter, as shown here:

from bs4 import BeautifulSoup

with open('ecologicalpyramid.html', 'r') as ecological_pyramid:
    soup = BeautifulSoup(ecological_pyramid, 'html.parser')
plants_string = soup.find(text='plants')  # matches the text node, not a tag
print(plants_string)

Output: plants

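Searching with a regular expression:

Consider the HTML held in email_id_example below. We want the first e-mail address, but it is not wrapped in any tag, which makes it hard to reach by other means. A regular expression that matches e-mail addresses solves the problem.

import re
from bs4 import BeautifulSoup

email_id_example = """<br/>
<div>The below HTML has the information that has email ids.</div> 
abc@example.com 
<div>xyz@example.com</div> 
<span>foo@example.com</span> 
"""

soup = BeautifulSoup(email_id_example, 'html.parser')
# Raw string, so the backslashes reach the regex engine unchanged
emailid_regexp = re.compile(r"\w+@\w+\.\w+")
first_email_id = soup.find(text=emailid_regexp)
print(first_email_id)

Output: abc@example.com (the match is the whole text node, so it is printed together with its surrounding whitespace)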
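Since a regular-expression match returns the entire text node, a common follow-up is to extract just the matched part. The sketch below shows one way to do that. It uses the string argument, which is the name Beautiful Soup 4.4.0 and later give to text; the markup and variable names are assumptions for illustration.

```python
import re
from bs4 import BeautifulSoup

html = """<div>Contact list</div>
abc@example.com
<div>xyz@example.com</div>"""
soup = BeautifulSoup(html, "html.parser")
email_re = re.compile(r"\w+@\w+\.\w+")

node = soup.find(string=email_re)          # whole text node, newlines included
address = email_re.search(node).group()    # only the part the pattern matched

print(repr(str(node)))
print(address)                             # abc@example.com
```

Calling node.strip() would also work here, but re-running the pattern is the safer choice when the text node holds more than the address.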

Searching by tag attribute:

In the pyramid HTML, the primary consumers are inside the ul tag whose id attribute is primaryconsumers.

from bs4 import BeautifulSoup

with open('ecologicalpyramid.html', 'r') as ecological_pyramid:
    soup = BeautifulSoup(ecological_pyramid, 'html.parser')
primary_consumer = soup.find(id='primaryconsumers')
print(primary_consumer.li.div.string)

Output: deer

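Searching by custom attribute:

Keyword-argument lookup works for most tag attributes, including id, style and title. The exceptions are attribute names that contain "-" and the class attribute.

Take the HTML5 data-custom attribute. Suppose we write:

from bs4 import BeautifulSoup

customattr = """<p data-custom='custom'>custom attribute
example</p>
"""
customsoup = BeautifulSoup(customattr, 'lxml')
customsoup.find(data-custom="custom")  # SyntaxError: not a valid keyword argument

This raises an error, because a Python identifier cannot contain the "-" character and the keyword data-custom does.

The fix is to pass the attribute through the attrs parameter as a dictionary.

using_attrs = customsoup.find(attrs={'data-custom':'custom'})
print(using_attrs)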
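As a runnable illustration of the attrs dictionary, the sketch below matches a data-custom attribute both by value and by mere presence (a value of True matches any tag that carries the attribute). The markup and variable names are assumptions, and html.parser is used only to avoid the lxml dependency.

```python
from bs4 import BeautifulSoup

html = "<p data-custom='custom'>custom attribute example</p><p>plain</p>"
soup = BeautifulSoup(html, "html.parser")

# Match on the exact attribute value.
by_value = soup.find(attrs={"data-custom": "custom"})

# True matches every tag that has the attribute, whatever its value.
by_presence = soup.find_all(attrs={"data-custom": True})

print(by_value.string)     # custom attribute example
print(len(by_presence))    # 1
```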

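Searching by CSS class:

class is a reserved keyword in Python, so it cannot be used as a keyword argument name.

First method: pass the class through the attrs parameter as a dictionary.

css_class = soup.find(attrs={'class':'primaryconsumerlist'})
print(css_class)

Second method: use BeautifulSoup's special keyword argument class_.

css_class = soup.find(class_='primaryconsumerlist')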
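A tag may carry several CSS classes, and a class search matches if any one of them equals the given string. The following sketch shows this behaviour; the herbivore class and the markup are invented for the example.

```python
from bs4 import BeautifulSoup

html = (
    '<li class="primaryconsumerlist herbivore">deer</li>'
    '<li class="herbivore">rabbit</li>'
)
soup = BeautifulSoup(html, "html.parser")

one = soup.find(class_="primaryconsumerlist")   # matches one of the two classes
both = soup.find_all(class_="herbivore")        # matches both <li> tags
names = [li.string for li in both]

print(one.string)     # deer
print(names)          # ['deer', 'rabbit']
print(one["class"])   # ['primaryconsumerlist', 'herbivore']
```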

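Searching with a user-defined function:

A function can be passed to find() so that the match depends on a condition the function defines. The function receives a tag and must return True or False.

def is_secondary_consumers(tag):
    # Match only the tag whose id is exactly 'secondaryconsumers'
    return tag.has_attr('id') and tag.get('id') == 'secondaryconsumers'

secondary_consumer = soup.find(is_secondary_consumers)
print(secondary_consumer.li.div.string)

Output: fox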

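Combining search methods:

Any of the methods above can be combined in a single call, for example filtering on a tag name and an id at the same time.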
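A minimal sketch of combined filters follows. The level class and the legend div are assumptions added so that the tag-name filter visibly changes the result.

```python
from bs4 import BeautifulSoup

html = """
<ul id="producers" class="level"><li>plants</li></ul>
<ul id="primaryconsumers" class="level"><li>deer</li></ul>
<div class="level">legend</div>
"""
soup = BeautifulSoup(html, "html.parser")

# Tag name and id together.
combined = soup.find("ul", id="primaryconsumers")

# Tag name and CSS class together, compared with the class filter alone.
ul_levels = soup.find_all("ul", class_="level")
any_levels = soup.find_all(class_="level")

print(combined.li.string)                 # deer
print(len(ul_levels), len(any_levels))    # 2 3
```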

3. Searching with find_all

find() returns the first match; find_all() returns every match.

Finding all tertiary consumers:

all_tertiaryconsumers = soup.find_all(class_='tertiaryconsumerslist')

all_tertiaryconsumers is a list (more precisely a bs4 ResultSet, which is a list subclass).

We can therefore iterate over it and print the name of each tertiary consumer.

for tertiaryconsumer in all_tertiaryconsumers:
    print(tertiaryconsumer.div.string)

Output:

lion
tiger

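Parameters of find_all():

find_all(name, attrs, recursive, text, limit, **kwargs)

The limit parameter caps the number of results returned.

Going back to the e-mail example, this collects every e-mail address:

email_ids = soup.find_all(text=emailid_regexp)
print(email_ids)

Output: [u'abc@example.com', u'xyz@example.com', u'foo@example.com'] (Python 2 repr; under Python 3 the u prefix is absent, and the first entry also carries the whitespace around its text node)

Using the limit parameter:

email_ids_limited = soup.find_all(text=emailid_regexp, limit=2)
print(email_ids_limited)

The result is limited to two matches, so the output is: [u'abc@example.com', u'xyz@example.com']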

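A boolean can also be passed as the filter. Passing True to find_all() returns every tag in the soup object, and passing it to find() returns the first tag. Used with the text parameter, as below, True matches every text node instead.

all_texts = soup.find_all(text=True)
print(all_texts)

Output: a list of every text node in the document (the listing itself is not reproduced here).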
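To make the difference between the two uses of True concrete, here is a small sketch under assumed markup. It uses string, the Beautiful Soup 4.4.0+ name for the text argument.

```python
from bs4 import BeautifulSoup

html = "<ul><li>plants</li><li>algae</li></ul>"
soup = BeautifulSoup(html, "html.parser")

all_tags = [tag.name for tag in soup.find_all(True)]   # True as the name filter
first_tag = soup.find(True).name                        # first tag only
all_strings = soup.find_all(string=True)                # True as the text filter

print(all_tags)       # ['ul', 'li', 'li']
print(first_tag)      # ul
print(all_strings)    # ['plants', 'algae']
```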

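Similarly, the text parameter accepts a list of strings, in which case find_all() returns every text node equal to any string in the list.

all_texts_in_list = soup.find_all(text=['plants', 'algae'])
print(all_texts_in_list)

Output:

[u'plants', u'algae']

The same applies when searching for tags, tag attributes, custom attributes and CSS classes. For example:

div_li_tags = soup.find_all(['div', 'li'])

By default both find() and find_all() search all descendants of an object; the recursive parameter controls this.

With recursive=False, only the object's direct children are searched.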
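The effect of recursive=False can be sketched as follows; the markup and ids are assumptions chosen so that one p tag is a direct child and the other is nested deeper.

```python
from bs4 import BeautifulSoup

html = "<div id='outer'><p>direct</p><section><p>nested</p></section></div>"
soup = BeautifulSoup(html, "html.parser")
outer = soup.find(id="outer")

all_p = outer.find_all("p")                      # every descendant <p>
child_p = outer.find_all("p", recursive=False)   # direct children only

print([p.string for p in all_p])     # ['direct', 'nested']
print([p.string for p in child_p])   # ['direct']
```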

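Searching by relationships between tags

Finding parent tags

Use find_parents() or find_parent(). They differ in the same way as find_all() and find().

find_parents() returns every matching ancestor, while find_parent() returns only the nearest matching one. Every filter that works with find() also works with these two methods.

In the primary consumer example, we can find the ul tag that encloses the first primaryconsumerlist entry.

primaryconsumers = soup.find_all(class_='primaryconsumerlist')
primaryconsumer = primaryconsumers[0]
# find_parents() returns a list of every enclosing <ul>, nearest first;
# find_parent('ul') would return only the nearest one
parent_ul = primaryconsumer.find_parents('ul')
print(parent_ul)

A simple way to get a tag's immediate parent is to call find_parent() with no arguments.

immediateprimary_consumer_parent = primary_consumer.find_parent()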
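The sketch below contrasts the parent-search calls on a cut-down fragment. The outer div with id pyramid and the variable names are assumptions for illustration.

```python
from bs4 import BeautifulSoup

html = (
    '<div id="pyramid"><ul id="primaryconsumers">'
    '<li class="primaryconsumerlist"><div class="name">deer</div></li>'
    '</ul></div>'
)
soup = BeautifulSoup(html, "html.parser")
name_div = soup.find(class_="name")

immediate = name_div.find_parent()          # no filter: the direct parent
nearest_ul = name_div.find_parent("ul")     # nearest ancestor named ul
chain = [t.name for t in name_div.find_parents(["li", "ul"])]  # nearest first
outer_divs = name_div.find_parents("div")   # the starting tag itself is excluded

print(immediate.name)          # li
print(nearest_ul["id"])        # primaryconsumers
print(chain)                   # ['li', 'ul']
print(outer_divs[0]["id"])     # pyramid
```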

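Finding siblings

Tags at the same level are siblings. In the pyramid example, all the ul tags are siblings of one another: the ul tags with the ids producers, primaryconsumers, secondaryconsumers and tertiaryconsumers.

The divs holding plants and algae are not siblings, but the plants div and the number div next to it are.

Beautiful Soup ships with methods for finding siblings.

For example, find_next_siblings() and find_next_sibling() search the siblings that follow an object.

producers = soup.find(id='producers')
next_siblings = producers.find_next_siblings()
print(next_siblings)

The output is the HTML of every sibling that follows the producers tag.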
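A compact sketch of sibling navigation in both directions follows. The ul tags are left empty and written without whitespace between them purely to keep the example short; the ids come from the article.

```python
from bs4 import BeautifulSoup

html = (
    "<div>"
    "<ul id='producers'></ul>"
    "<ul id='primaryconsumers'></ul>"
    "<ul id='secondaryconsumers'></ul>"
    "</div>"
)
soup = BeautifulSoup(html, "html.parser")
middle = soup.find(id="primaryconsumers")

next_id = middle.find_next_sibling()["id"]          # first sibling after
prev_id = middle.find_previous_sibling()["id"]      # first sibling before
following = [ul["id"] for ul in soup.find(id="producers").find_next_siblings("ul")]

print(next_id)      # secondaryconsumers
print(prev_id)      # producers
print(following)    # ['primaryconsumers', 'secondaryconsumers']
```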

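Finding the next element

For any tag, the next element may be a navigable string, a Tag object or another BeautifulSoup object. The next element is defined as the one that immediately follows the current element in the document.

This is not the same as the sibling relationship. find_all_next() returns every matching object that comes after the current element, whereas find_next() returns only the first one.

For example, to find all li tags after the first div tag:

first_div = soup.div
all_li_tags = first_div.find_all_next('li')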

Finding the previous element

The opposite of searching forward is searching backward, which is done with find_previous() and find_all_previous().
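The backward and forward searches can be compared in one short sketch. The marker div and the surrounding lists are assumptions; note that the backward results come out nearest first, that is, in reverse document order.

```python
from bs4 import BeautifulSoup

html = (
    "<ul><li>plants</li><li>algae</li></ul>"
    "<div>marker</div>"
    "<ul><li>deer</li></ul>"
)
soup = BeautifulSoup(html, "html.parser")
marker = soup.find("div")

previous_one = marker.find_previous("li").string                  # closest <li> before
previous_all = [li.string for li in marker.find_all_previous("li")]
next_one = marker.find_next("li").string                          # closest <li> after

print(previous_one)    # algae
print(previous_all)    # ['algae', 'plants']
print(next_one)        # deer
```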
