Using BS4 (BeautifulSoup4): find_all()

Reposted from the original article. 2016-12-03 18:15. Tags: python, BS4, BeautifulSoup4, find_all(), HTML scraping and parsing

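You can refer directly to the BS4 documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html#find-all

Points to note:

1. Some tag attributes cannot be used as keyword arguments in a search, for example the HTML5 data-* attributes: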

 
    data_soup = BeautifulSoup('<div data-foo="value">foo!</div>')
    data_soup.find_all(data-foo="value")
    # SyntaxError: keyword can't be an expression
          

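However, you can search for tags with such attributes by passing a dictionary as the attrs argument of find_all():

    data_soup.find_all(attrs={"data-foo": "value"})
    # [<div data-foo="value">foo!</div>]

The value can be a string, a boolean, or a regular expression.

2. The class attribute must be searched with class_="", because class is a reserved word in Python.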

find_all( name , attrs , recursive , text , **kwargs )

find_all() looks through the descendants of the current tag and returns every one that matches the filter. The examples use the "three sisters" document from the Beautiful Soup documentation, parsed into soup. Here are a few examples:

    soup.find_all("title")
    # [<title>The Dormouse's story</title>]

    soup.find_all("p", "title")
    # [<p class="title"><b>The Dormouse's story</b></p>]

    soup.find_all("a")
    # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
    #  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
    #  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

    soup.find_all(id="link2")
    # [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

    import re
    soup.find(text=re.compile("sisters"))
    # u'Once upon a time there were three little sisters; and their names were\n'

Some of these calls look familiar and some are new. What do the text and id arguments mean? Why does find_all("p", "title") return the <p> tag whose CSS class is "title"? Let's look at the arguments of find_all() one by one.

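name argument

Pass a value for name and Beautiful Soup considers only tags with that name. Text strings are ignored automatically.

Simple usage:

    soup.find_all("title")
    # [<title>The Dormouse's story</title>]

To reiterate: the value of name can be any kind of filter: a string, a regular expression, a list, a function, or True.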
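
As a rough sketch of the filter types listed above, the snippet below runs each one against a small document. The markup and the helper names (names, has_href) are made up for illustration, and it assumes Python's built-in "html.parser" backend.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup for illustration; "html.parser" is Python's built-in parser.
html = (
    "<html><head><title>Demo</title></head>"
    "<body><p><b>bold</b> <a href='#top'>link</a></p></body></html>"
)
soup = BeautifulSoup(html, "html.parser")


def names(tags):
    """Return the tag names of a result list, in document order."""
    return [tag.name for tag in tags]


def has_href(tag):
    """Function filter: accept only tags that carry an href attribute."""
    return tag.has_attr("href")


by_string = names(soup.find_all("b"))               # exact tag name
by_regex = names(soup.find_all(re.compile("^b")))   # names starting with "b"
by_list = names(soup.find_all(["a", "b"]))          # any name in the list
by_function = names(soup.find_all(has_href))        # custom predicate
everything = names(soup.find_all(True))             # every tag, no text strings

print(by_string, by_regex, by_list, by_function, everything)
```

Note that the regular expression is applied as a search on the tag name, so "^b" picks up both <body> and <b>.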

Keyword arguments

Any named argument that find_all() does not recognize is treated as a filter on the tag attribute of that name. If you pass a value for an argument called id, Beautiful Soup filters on each tag's "id" attribute:

    soup.find_all(id='link2')
    # [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

If you pass a value for href, Beautiful Soup filters on each tag's "href" attribute:

    soup.find_all(href=re.compile("elsie"))
    # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

An attribute filter can be a string, a regular expression, a list, or True.

The following example finds every tag in the document that has an id attribute, whatever its value:

    soup.find_all(id=True)
    # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
    #  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
    #  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

You can filter on several attributes at once by passing more than one keyword argument:

    soup.find_all(href=re.compile("elsie"), id='link1')
    # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
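
To make the keyword filters concrete, here is a minimal, self-contained sketch. It assumes a shortened copy of the "three sisters" markup and the built-in "html.parser" backend; the variable names are illustrative only.

```python
import re
from bs4 import BeautifulSoup

# Shortened copy of the "three sisters" markup, for illustration only.
html_doc = """
<p class="story">Once upon a time there were three little sisters:
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>.</p>
"""
soup = BeautifulSoup(html_doc, "html.parser")

# Both conditions must hold: href matches the regex AND id is in the list.
matches = soup.find_all(href=re.compile(r"example\.com"), id=["link1", "link3"])
ids = [tag["id"] for tag in matches]

# id=True keeps every tag that has an id attribute, whatever its value.
with_id = [tag["id"] for tag in soup.find_all(id=True)]

print(ids, with_id)
```

Keyword filters are combined with a logical AND, which is why link2 is dropped from the first result even though its href matches.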
           

Some tag attributes cannot be used as keyword arguments in a search, for example the HTML5 data-* attributes:

    data_soup = BeautifulSoup('<div data-foo="value">foo!</div>')
    data_soup.find_all(data-foo="value")
    # SyntaxError: keyword can't be an expression

However, you can search for tags with such attributes by passing a dictionary as the attrs argument of find_all():

    data_soup.find_all(attrs={"data-foo": "value"})
    # [<div data-foo="value">foo!</div>]
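
The attrs dictionary takes the same kinds of values as a keyword filter. The sketch below tries a string, True, and a regular expression against invented markup, assuming the built-in "html.parser" backend.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup for illustration; "html.parser" is Python's built-in parser.
html = (
    '<div data-foo="value">foo!</div>'
    '<div data-foo="other">bar</div>'
    "<div>baz</div>"
)
data_soup = BeautifulSoup(html, "html.parser")

# Exact string match on the attribute value.
exact = data_soup.find_all(attrs={"data-foo": "value"})

# True matches any tag that has the attribute, whatever its value.
any_value = data_soup.find_all(attrs={"data-foo": True})

# A compiled regular expression is searched against the attribute value.
by_regex = data_soup.find_all(attrs={"data-foo": re.compile("^oth")})

print(exact, len(any_value), by_regex)
```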
           

Searching by CSS class

Searching for tags by CSS class is very useful, but the attribute name, class, is a reserved word in Python, so using class as a keyword argument is a syntax error. As of Beautiful Soup 4.1.1, you can search by CSS class with the class_ argument:

    soup.find_all("a", class_="sister")
    # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
    #  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
    #  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

Like any keyword argument, class_ accepts different kinds of filter: a string, a regular expression, a function, or True:

    soup.find_all(class_=re.compile("itl"))
    # [<p class="title"><b>The Dormouse's story</b></p>]

    def has_six_characters(css_class):
        return css_class is not None and len(css_class) == 6

    soup.find_all(class_=has_six_characters)
    # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
    #  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
    #  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
           

The class attribute of a tag is multi-valued. When you search by CSS class, the filter is matched against each of the tag's classes separately:

    css_soup = BeautifulSoup('<p class="body strikeout"></p>')
    css_soup.find_all("p", class_="strikeout")
    # [<p class="body strikeout"></p>]

    css_soup.find_all("p", class_="body")
    # [<p class="body strikeout"></p>]

You can also search for the exact string value of the class attribute:

    css_soup.find_all("p", class_="body strikeout")
    # [<p class="body strikeout"></p>]

An exact match fails if the class names are given in a different order than in the document, so class_="strikeout body" finds nothing here. The class attribute can also be matched through the attrs dictionary:

    soup.find_all("a", attrs={"class": "sister"})
    # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
    #  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
    #  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
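
The ordering pitfall with multi-valued class attributes can be checked with a short sketch like the one below. The has_both_classes helper is a hypothetical workaround written for this example, not part of Beautiful Soup, and the built-in "html.parser" backend is assumed.

```python
from bs4 import BeautifulSoup

css_soup = BeautifulSoup('<p class="body strikeout"></p>', "html.parser")

# Same order as in the document: the whole attribute string matches.
exact = css_soup.find_all("p", class_="body strikeout")

# Reversed order: matches neither the whole string nor a single class.
reordered = css_soup.find_all("p", class_="strikeout body")


def has_both_classes(tag):
    """Accept <p> tags that carry both classes, in any order."""
    classes = set(tag.get("class", []))
    return tag.name == "p" and {"body", "strikeout"} <= classes


order_free = css_soup.find_all(has_both_classes)

print(len(exact), len(reordered), len(order_free))
```

A function filter like this is one way to require several classes regardless of order; the CSS selector support in select() is another option.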
           

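`text` argument

With the text argument you can search for strings in the document instead of tags. Like name, text accepts a string, a regular expression, a list, a function, or True. Some examples: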

soup.find_all(text="Elsie")
# [u'Elsie']

soup.find_all(text=["Tillie", "Elsie", "Lacie"])
# [u'Elsie', u'Lacie', u'Tillie']

soup.find_all(text=re.compile("Dormouse"))
# [u"The Dormouse's story", u"The Dormouse's story"]

def is_the_only_string_within_a_tag(s):
    """Return True if this string is the only child of its parent tag."""
    return (s == s.parent.string)

soup.find_all(text=is_the_only_string_within_a_tag)
# [u"The Dormouse's story", u"The Dormouse's story", u'Elsie', u'Lacie', u'Tillie', u'...']

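Although text is meant for finding strings, you can combine it with arguments that find tags: Beautiful Soup then returns the tags whose .string matches the value of text. The following code finds the <a> tags whose .string is "Elsie":

    soup.find_all("a", text="Elsie")
    # [<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>]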
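
The difference between searching for strings and searching for tags by their string can be seen side by side in the sketch below. It uses the keyword string, which is the name Beautiful Soup 4.4.0 and later give to the text argument, so it assumes a release at least that recent; the markup is a shortened copy of the "three sisters" document.

```python
from bs4 import BeautifulSoup

# Shortened copy of the "three sisters" markup, for illustration only.
html_doc = """
<p class="story">Once upon a time there were three little sisters:
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>.</p>
"""
soup = BeautifulSoup(html_doc, "html.parser")

# Without a tag name, the search returns the matching strings themselves.
strings = soup.find_all(string="Elsie")

# With a tag name, it returns the <a> tags whose .string equals "Elsie".
tags = soup.find_all("a", string="Elsie")
tag_ids = [tag["id"] for tag in tags]

print(strings, tag_ids)
```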
           

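`limit` argument

find_all() returns every result that matches, which can be slow on a large document tree. If you don't need all the results, pass a number for limit. It works like the LIMIT keyword in SQL: the search stops and returns as soon as it has found that many results.

The document tree has three tags that match the search, but only two are returned because of the limit:

    soup.find_all("a", limit=2)
    # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
    #  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]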
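
A small sketch of limit in action, again on a shortened copy of the "three sisters" markup with the built-in "html.parser" backend. It also compares limit=1 with find(), which returns the first match directly rather than a one-element list.

```python
from bs4 import BeautifulSoup

# Shortened copy of the "three sisters" markup, for illustration only.
html_doc = """
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>.</p>
"""
soup = BeautifulSoup(html_doc, "html.parser")

# Stop after the first two matches, in document order.
first_two = [tag["id"] for tag in soup.find_all("a", limit=2)]

# limit=1 still returns a list; find() returns the element itself.
as_list = soup.find_all("a", limit=1)
as_element = soup.find("a")

print(first_two, as_list[0]["id"], as_element["id"])
```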
           

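`recursive` argument

When you call find_all() on a tag, Beautiful Soup examines all of that tag's descendants: its children, its children's children, and so on. To consider only direct children, pass recursive=False.

A simple document: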

<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
...

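Search results with and without recursive=False. The <title> tag sits below <head>, so it is a descendant of <html> but not a direct child:

    soup.html.find_all("title")
    # [<title>The Dormouse's story</title>]

    soup.html.find_all("title", recursive=False)
    # []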
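
The same comparison as a self-contained sketch, assuming the built-in "html.parser" backend. The extra search for <head> is added here only to show that direct children are still found with recursive=False.

```python
from bs4 import BeautifulSoup

html = "<html><head><title>The Dormouse's story</title></head></html>"
soup = BeautifulSoup(html, "html.parser")

# Default: every descendant of <html> is examined, so <title> is found.
descendants = soup.html.find_all("title")

# recursive=False: only direct children of <html> are examined.
children_only = soup.html.find_all("title", recursive=False)

# <head> is a direct child of <html>, so it is still found.
direct_head = soup.html.find_all("head", recursive=False)

print(len(descendants), len(children_only), len(direct_head))
```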
           

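Calling a tag is like calling `find_all()`

find_all() is probably the most commonly used search method in Beautiful Soup, so there is a shortcut for it. A BeautifulSoup object or a Tag object can be called like a function, and the result is the same as calling find_all() on that object. These two lines are equivalent:

    soup.find_all("a")
    soup("a")

These two lines are also equivalent:

    soup.title.find_all(text=True)
    soup.title(text=True)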
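
The equivalence can be checked directly, as in this minimal sketch. The markup is invented for illustration, the built-in "html.parser" backend is assumed, and the keyword string (the name used for text since Beautiful Soup 4.4.0) is used for the string search.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration.
html = (
    "<html><head><title>The Dormouse's story</title></head>"
    "<body><a href='#a'>one</a><a href='#b'>two</a></body></html>"
)
soup = BeautifulSoup(html, "html.parser")

# Calling the object is shorthand for calling its find_all() method.
same_tags = soup("a") == soup.find_all("a")
same_strings = soup.title(string=True) == soup.title.find_all(string=True)

print(same_tags, same_strings, soup.title(string=True))
```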
          
