python爬蟲（1）——BeautifulSoup庫函數find_all() (轉)

本文轉載自查看原文 2017-09-29 10:45 19946 python

原文地址:http://blog.csdn.net/depers15/article/details/51934210

python——BeautifulSoup庫函數find_all()

一、語法介紹

find_all( name , attrs , recursive , string , **kwargs ) 
find_all() 方法搜索當前tag的所有tag子節點,並判斷是否符合過濾器的條件

二、參數及用法介紹

1、name參數

這是最簡單而直接的一種辦法了，我么可以通過html標簽名來索引；

sb = soup.find_all('img')

2、keyword參數

所謂關鍵字參數其實就是通過一個html標簽的id、href(這個主要指的是a標簽的 ）和title,我測試了class，這個方法好像不行，不過沒有關系，下面我會談到這個點的！
soup.find_all(href=re.compile("elsie"))
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

soup.find_all(id='link2')
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

這里的true指的就是選中所有有id這個屬性的標簽；
soup.find_all(id=True)
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

當然還可以設置多個篩選的屬性；
soup.find_all(href=re.compile("elsie"), id='link1')
# [<a class="sister" href="http://example.com/elsie" id="link1">three</a>]

還有有些屬性在搜索時就不能使用，就比如HTML5中的 data-* 屬性，咋辦？

data_soup = BeautifulSoup('<div data-foo="value">foo!</div>')
data_soup.find_all(data-foo="value")
# SyntaxError: keyword can't be an expression

但是可以通過 find_all() 方法的 attrs 參數定義一個字典參數來搜索包含特殊屬性的tag:

data_soup.find_all(attrs={"data-foo": "value"})
# [<div data-foo="value">foo!</div>]

雖然我們不能像id他們那樣使用，因為class在python中是保留字（保留字(reserved word)，指在高級語言中已經定義過的字，使用者不能再將這些字作為變量名或過程名使用。 
），所以呢，直接使用是回報錯的，所以class_應運而生； 
所以呢，順便上一張圖片，讓我們看一看python都有哪些保留字： 
![](http://images2017.cnblogs.com/blog/825729/201709/825729-20170929143447247-961841526.png)
通過標簽名和屬性名一起用：
soup.find_all("a", class_="sister")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
除此之外呢，還有就是class_ 參數同樣接受不同類型的 過濾器 ,字符串,正則表達式,方法或 True :當然，上面的屬性也可以和標簽名結合起來使用；
soup.find_all(class_=re.compile("itl"))
# [<p class="title"><b>The Dormouse's story</b></p>]

def has_six_characters(css_class):
    return css_class is not None and len(css_class) == 6

soup.find_all(class_=has_six_characters) 
#這里的這個函數，其實就是一個布爾值True；
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

3.sting參數

通過 string 參數可以搜搜文檔中的字符串內容.與 name 參數的可選值一樣, string 參數接受字符串 , 正則表達式 , 列表, True;

soup.find_all("a", string="Elsie")
# [<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>]

4.limit參數

這個參數其實就是控制我們獲取數據的數量，效果和SQL語句中的limit一樣；

soup.find_all("a", limit=2)
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

5.recursive參數

調用tag的 find_all() 方法時,Beautiful Soup會檢索當前tag的所有子孫節點,如果只想搜索tag的直接子節點,可以使用參數 recursive=False;
Html:

<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
...

python:

soup.html.find_all("title")
# [<title>The Dormouse's story</title>]

soup.html.find_all("title", recursive=False)
# []

所以他只獲取自己的直接子節點，也就是他自己,這個標簽自己就是他的直接子節點；

Beautiful Soup 提供了多種DOM樹搜索方法. 這些方法都使用了類似的參數定義. 比如這些方法: find_all(): name, attrs, text, limit. 但是只有 find_all() 和 find() 支持 recursive 參數.

免責聲明！

本站轉載的文章為個人學習借鑒使用，本站對版權不負任何法律責任。如果侵犯了您的隱私權益，請聯系本站郵箱yoyou2525@163.com刪除。

猜您在找 python爬蟲：BeautifulSoup庫find_all ()、find()方法詳解初識python 之爬蟲：BeautifulSoup 的 find、find_all、select 方法 BeautifulSoup庫之find_all函數 BeautifulSoup4的find_all()和select()，簡單爬蟲學習 python3爬蟲（find_all用法等） find_all的用法 Python（bs4，BeautifulSoup） [Python]find_all函數 2020.2.7 BeautifulSoup中的find，find_all python爬蟲時如何使用find和find_all的講解 beautifulsoup用法2 (find_all select)