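Web Scraping Basics: A Simple Guide to Beautiful Soup

Reposted article, 2019-02-18 17:13

Basic usage of Beautiful Soup

In short, Beautiful Soup is a Python library whose main job is extracting data from web pages. The official description reads:

''' Beautiful Soup provides a few simple, Pythonic functions for navigating, searching, and modifying a parse tree. It is a toolkit that parses a document and hands you the data you want to extract; because it is simple, a complete application takes very little code. '''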

Installation

 
                 pip3 install beautifulsoup4

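Parsers

Beautiful Soup supports the HTML parser in the Python standard library as well as several third-party parsers. If no third-party parser is installed, it uses Python's built-in one. The lxml parser is more capable and faster, so installing it is recommended.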

 
                 pip3 install lxml

Another option is html5lib, a pure-Python parser that parses pages the same way a browser does. It can be installed with:

 
                 pip install html5lib
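Since lxml may or may not be installed on a given machine, one possible approach is to try it first and fall back to the built-in parser. The sketch below assumes a hypothetical helper named make_soup (not part of Beautiful Soup); FeatureNotFound is the exception bs4 raises when the requested parser is unavailable.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred="lxml"):
    """Parse markup with the preferred parser, falling back to html.parser.

    make_soup is a hypothetical helper for illustration only.
    """
    try:
        return BeautifulSoup(markup, preferred)
    except FeatureNotFound:
        # The preferred parser is not installed; use the standard-library one.
        return BeautifulSoup(markup, "html.parser")


soup = make_soup("<p>Hello, <b>soup</b></p>")
print(soup.b.string)
```

Naming the parser explicitly also keeps results consistent across machines, because different parsers can build slightly different trees from the same broken markup.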

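Parser comparison

See the parser comparison table in the official documentation.

Quick start

The HTML below is used as an example throughout. It is an excerpt from Alice in Wonderland (referred to from here on as the Alice document):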

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

Parsing this markup with BeautifulSoup gives a BeautifulSoup object, which can print the document with standard indentation:

from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.prettify())

A few simple ways to navigate the structured data:

soup.title
# <title>The Dormouse's story</title>

soup.title.name
# 'title'

soup.title.string
# "The Dormouse's story"

soup.title.parent.name
# 'head'

soup.p
# <p class="title"><b>The Dormouse's story</b></p>

soup.p['class']
# ['title']

soup.a
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

soup.find_all('a')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

soup.find(id="link3")
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

Find the link of every <a> tag in the document:

for link in soup.find_all('a'):
    print(link.get('href'))
# http://example.com/elsie
# http://example.com/lacie
# http://example.com/tillie

Extract all the text from the document:

print(soup.get_text())

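How to use it

Pass a document to the BeautifulSoup constructor to get a document object. The document can be a string or an open file handle.

from bs4 import BeautifulSoup

with open("index.html") as fp:
    soup = BeautifulSoup(fp, 'html.parser')

soup = BeautifulSoup("<html>data</html>", 'html.parser')

If the parser argument is omitted, Beautiful Soup picks the best parser it can find (and issues a warning); if a parser is named explicitly, as above, it uses that one.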
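To show that a string and a file handle really are interchangeable, here is a minimal sketch that uses io.StringIO as a stand-in for an opened file (so that no index.html is needed). The markup and variable names are made up for illustration.

```python
import io

from bs4 import BeautifulSoup

markup = "<html><body><p>data</p></body></html>"

# Parse directly from a string.
soup_from_string = BeautifulSoup(markup, "html.parser")

# Parse from a file-like object; BeautifulSoup calls .read() on it.
soup_from_file = BeautifulSoup(io.StringIO(markup), "html.parser")

print(soup_from_string.p.string)
print(soup_from_file.p.string)
```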

Kinds of objects

Beautiful Soup turns a complex HTML document into a tree in which every node is a Python object. All of these objects fall into four kinds:

Tag, NavigableString, BeautifulSoup, Comment.
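As a quick illustration of the four kinds, the sketch below parses a tiny made-up snippet and checks the type of each node. It relies only on the classes exported by bs4.element.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, Tag

soup = BeautifulSoup("<p>text<!-- note --></p>", "html.parser")
p = soup.p

# The <p> tag has two children: a plain string and a comment.
kinds = [type(node).__name__ for node in p.children]
print(kinds)

print(isinstance(soup, BeautifulSoup))  # the whole document
print(isinstance(p, Tag))               # an element
print(isinstance(p.contents[0], NavigableString))
print(isinstance(p.contents[1], Comment))
```

Note that Comment is a subclass of NavigableString, and BeautifulSoup is a subclass of Tag, which is why the document object can be searched like any other tag.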

Tag

Put simply, a Tag is one of the tags in the HTML. A Tag object corresponds to a tag in the original XML or HTML document:

soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', 'html.parser')
tag = soup.b
type(tag)
# <class 'bs4.element.Tag'>

Tag names

Taking soup to be the Alice document (html_doc) again, the simplest way to navigate the tree is to name the tag you want. To get the <head> tag, just write soup.head:

soup.head
# <head><title>The Dormouse's story</title></head>

soup.title
# <title>The Dormouse's story</title>

This trick can be chained, calling it repeatedly on tags further down the tree. The code below gets the first <b> tag inside <body>:

soup.body.b
# <b>The Dormouse's story</b>

Accessing a tag name as an attribute only returns the first tag with that name:

soup.a
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

To get every <a> tag, or anything more than the first tag with a given name, use one of the methods described under searching the tree, such as find_all():

soup.find_all('a')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

Writing soup followed by a tag name is an easy way to get at a tag, but keep in mind that it returns only the first matching tag in the whole document.

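The name and attributes properties

A Tag has many methods and properties. The two most important are its name and its attributes.

Every tag has a name, available as .name, and its attributes are accessed like dictionary entries (here tag is the <b class="boldest"> tag created earlier):

tag.name
# 'b'

tag['class']
# ['boldest']

tag.attrs
# {'class': ['boldest']}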

A tag's attributes can be added, removed, or modified, using the same operations as a dictionary:

tag['class'] = 'verybold'
tag['id'] = 1
tag
# <b class="verybold" id="1">Extremely bold</b>

del tag['class']
del tag['id']
tag
# <b>Extremely bold</b>

tag['class']
# KeyError: 'class'
print(tag.get('class'))
# None
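One detail worth seeing in isolation: class is a multi-valued attribute, so Beautiful Soup returns it as a list, while single-valued attributes such as id come back as plain strings. The sketch below uses a made-up <p> tag and also shows .get() with a default for attributes that may be missing.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p class="body strikeout" id="intro">text</p>', "html.parser")
p = soup.p

classes = p["class"]                # multi-valued attribute: a list of strings
ident = p["id"]                     # single-valued attribute: a string
title = p.get("title", "untitled")  # missing attribute: fall back to a default

print(classes)
print(ident)
print(title)
```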

NavigableString (strings)

Now that we can get a tag, how do we get the text inside it? Simply use .string.

Strings are usually contained in tags. Beautiful Soup wraps the text inside a tag in the NavigableString class; in Python 3, str() converts a NavigableString into an ordinary string (in Python 2 this was unicode()):

tag.string
# 'Extremely bold'
type(tag.string)
# <class 'bs4.element.NavigableString'>

plain_string = str(tag.string)
plain_string
# 'Extremely bold'
type(plain_string)
# <class 'str'>

The string inside a tag cannot be edited in place, but it can be replaced with another string using replace_with():

tag.string.replace_with("No longer bold")
tag
# <b>No longer bold</b>

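BeautifulSoup

The BeautifulSoup object represents the document as a whole. Most of the time it can be treated as a Tag object; it is a special Tag whose type, name, and attributes can all be inspected:

print(type(soup.name))
# <class 'str'>
print(soup.name)
# [document]
print(soup.attrs)
# {} (an empty dictionary)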

Comment

html_doc = '<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>'
soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.a.string)
#  Elsie
print(type(soup.a.string))
# <class 'bs4.element.Comment'>

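The content of the a tag is actually a comment, yet printing it through .string shows the text with the comment markers stripped, which can cause unexpected trouble.

Printing its type shows that it is a Comment, so it is best to check the type before using the value:

from bs4.element import Comment

if isinstance(soup.a.string, Comment):
    print(soup.a.string)

The code above first checks whether the value is a Comment and only then does something with it, such as printing it.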
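In practice the check is often used the other way round, to ignore comments and keep only real text. Below is a minimal sketch with a hypothetical helper named visible_string (not part of Beautiful Soup) applied to two made-up links.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment


def visible_string(tag):
    """Return tag.string as a plain str, or None if it is missing or a comment.

    visible_string is a hypothetical helper for illustration only.
    """
    value = tag.string
    if value is None or isinstance(value, Comment):
        return None
    return str(value)


soup = BeautifulSoup("<a><!-- Elsie --></a><a>Lacie</a>", "html.parser")
results = [visible_string(a) for a in soup.find_all("a")]
print(results)
```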

Navigating the tree with Beautiful Soup

We use the Alice document again as the example:

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc, 'html.parser')

This example is used to show how to move from one part of a document to another.

(1) Child nodes

A Tag may contain several strings or other Tags; these are the Tag's children. Beautiful Soup provides a number of properties for working with and iterating over children.

Note: string nodes in Beautiful Soup do not support these properties, because a string has no children.

.contents and .children

A tag's .contents property returns its children as a list:

head_tag = soup.head
head_tag
# <head><title>The Dormouse's story</title></head>

head_tag.contents
# [<title>The Dormouse's story</title>]

title_tag = head_tag.contents[0]
title_tag
# <title>The Dormouse's story</title>
title_tag.contents
# ["The Dormouse's story"]

A string has no .contents property, because a string has no children:

text = title_tag.contents[0]
text.contents
# AttributeError: 'NavigableString' object has no attribute 'contents'

.children does not return a list, but all the children can still be reached by iterating over it. Printing .children shows that it is an iterator (a list_iterator in the version used here).

A tag's .children iterator can be used to loop over its children:

print(title_tag.children)
# <list_iterator object at 0x101b78860>
print(type(title_tag.children))
# <class 'list_iterator'>

for child in title_tag.children:
    print(child)
# The Dormouse's story

.descendants

.contents and .children only cover a tag's direct children. For example, the <head> tag has a single direct child, <title>:

head_tag.contents
# [<title>The Dormouse's story</title>]

But the <title> tag has a child of its own: the string "The Dormouse's story". That string is therefore also a descendant of the <head> tag.

The .descendants property iterates recursively over all of a tag's descendants.

for child in head_tag.descendants:
    print(child)
# <title>The Dormouse's story</title>
# The Dormouse's story

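In the example above, the <head> tag has only one child but two descendants: the <title> tag and the <title> tag's child string. The BeautifulSoup object has very few direct children (essentially the <html> tag) but many descendants. The figures below come from the original example; the exact counts depend on the parser and on the whitespace in the markup:

len(list(soup.children))
# 1
len(list(soup.descendants))
# 25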

(2) Node content

If a tag has exactly one child and that child is a NavigableString, .string returns that child. If a tag's only child is another tag, .string returns the same result as that child's .string.

Put simply: if a tag contains no other tags, .string returns its content; if it contains exactly one tag, .string returns the innermost content. For example:

print(soup.head.string)
# The Dormouse's story
print(soup.title.string)
# The Dormouse's story

If a tag contains more than one child, it is unclear which child .string should refer to, so .string is None:

print(soup.html.string)
# None

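(3) Multiple strings

The .strings and .stripped_strings properties

.strings

.strings yields every string inside a tag; it is a generator, so it has to be iterated over, as in this example: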

for string in soup.strings:
    print(repr(string))
    
    
'''
'\n'
"The Dormouse's story"
'\n'
'\n'
"The Dormouse's story"
'\n'
'Once upon a time there were three little sisters; and their names were\n'
'Elsie'
',\n'
'Lacie'
' and\n'
'Tillie'
';\nand they lived at the bottom of a well.'
'\n'
'...'
'\n'
'''

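.stripped_strings

The strings produced this way may contain a lot of spaces and blank lines. .stripped_strings strips the surrounding whitespace from each string and skips strings that are whitespace only: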

for string in soup.stripped_strings:
    print(repr(string))


'''

"The Dormouse's story"
"The Dormouse's story"
'Once upon a time there were three little sisters; and their names were'
'Elsie'
','
'Lacie'
'and'
'Tillie'
';\nand they lived at the bottom of a well.'
'...'

'''
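A common use of .stripped_strings is to collapse a tag's text into one clean line. The sketch below uses a small made-up snippet; get_text() with a separator and strip=True is shown alongside as an alternative that gives the same result here.

```python
from bs4 import BeautifulSoup

markup = "<p>  Once upon a time\n <b> three </b> sisters </p>"
soup = BeautifulSoup(markup, "html.parser")

# Each string is stripped; whitespace-only strings are dropped.
pieces = list(soup.stripped_strings)
joined = " ".join(pieces)

# get_text() can do the stripping and joining in one call.
same = soup.get_text(" ", strip=True)

print(pieces)
print(joined)
```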

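(4) Parent nodes

Continuing with the tree: every tag and every string has a parent, the tag that contains it.

.parent

The .parent property gives an element's parent. In the Alice document, the <head> tag is the parent of the <title> tag:

title_tag = soup.title
title_tag
# <title>The Dormouse's story</title>
title_tag.parent
# <head><title>The Dormouse's story</title></head>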

The parent of a top-level node such as <html> is the BeautifulSoup object itself:

html_tag = soup.html
type(html_tag.parent)
# <class 'bs4.BeautifulSoup'>

.parents

An element's .parents property iterates over all of its ancestors. The example below uses .parents to walk from the <a> tag up to the root of the document:

link = soup.a
link
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

for parent in link.parents:
    print(parent.name)
# p
# body
# html
# [document]

(5) Sibling nodes

sibling_soup = BeautifulSoup("<a><b>text1</b><c>text2</c></a>", 'html.parser')

.next_sibling and .previous_sibling

Siblings are nodes at the same level as the current node. .next_sibling returns the node's next sibling and .previous_sibling the previous one; if there is no such node, the result is None.

Use .next_sibling and .previous_sibling to look up siblings in the tree:

sibling_soup.b.next_sibling
# <c>text2</c>

sibling_soup.c.previous_sibling
# <b>text1</b>

Note: in real documents, a tag's .next_sibling and .previous_sibling are usually strings or whitespace, because whitespace and line breaks also count as nodes.

Consider the Alice document:

<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>

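It would be a mistake to assume that the .next_sibling of the first <a> tag is the second <a> tag. The actual result is the comma and line break between the first and second <a> tags:

link = soup.a
link
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

link.next_sibling
# ',\n'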

The second <a> tag is the .next_sibling of that comma string:

link.next_sibling.next_sibling
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
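When only tags are of interest, the whitespace and punctuation strings between them can be skipped by filtering .next_siblings on type. The sketch below assumes a hypothetical helper named next_tag_sibling and a shortened, made-up version of the three links.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

markup = (
    '<p><a id="link1">Elsie</a>,\n'
    '<a id="link2">Lacie</a> and\n'
    '<a id="link3">Tillie</a></p>'
)
soup = BeautifulSoup(markup, "html.parser")
first_link = soup.a


def next_tag_sibling(node):
    """Return the next sibling that is a Tag, or None (hypothetical helper)."""
    return next((s for s in node.next_siblings if isinstance(s, Tag)), None)


print(repr(first_link.next_sibling))       # the string between the two links
print(next_tag_sibling(first_link)["id"])  # the next actual tag
```

The built-in find_next_sibling() method, covered later in this article, does the same job when called with a tag name.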

All siblings

The .next_siblings and .previous_siblings properties

.next_siblings and .previous_siblings iterate over the siblings that follow or precede the current node:

for sibling in soup.a.next_siblings:
    print(repr(sibling))

'''
',\n'
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
' and\n'
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
';\nand they lived at the bottom of a well.'
'''

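Next and previous elements

The .next_element and .previous_element properties

Unlike .next_sibling and .previous_sibling, these are not limited to siblings: they move through all nodes in parse order, regardless of level.

For example, take the head node: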

 
                 <head><title>The Dormouse's story</title></head>

Its next element is the title tag, since nesting level is ignored:

print(soup.head.next_element)
# <title>The Dormouse's story</title>
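The difference between the two properties is easiest to see side by side. The following sketch uses a small made-up document in which <head> has both a child and a following sibling.

```python
from bs4 import BeautifulSoup

markup = "<html><head><title>T</title></head><body><p>x</p></body></html>"
soup = BeautifulSoup(markup, "html.parser")

head = soup.head
title = soup.title

# next_element follows parse order, so it steps *into* <head>.
print(head.next_element.name)   # title
# next_sibling stays on the same level, so it steps *over* <head>.
print(head.next_sibling.name)   # body

# <title> has a string inside it but no sibling after it.
print(repr(title.next_element))  # 'T'
print(title.next_sibling)        # None
```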

All next and previous elements

The .next_elements and .previous_elements properties

The .next_elements and .previous_elements iterators move forward or backward through the document's content in the order it was parsed, as if the document were being parsed again:

for i in soup.a.next_elements:
    print(repr(i))

'''
'Elsie'
',\n'
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
'Lacie'
' and\n'
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
'Tillie'
';\nand they lived at the bottom of a well.'
'\n'
<p class="story">...</p>
'...'
'\n'
'''

That covers the basics of navigating the tree.

Searching the tree with Beautiful Soup

find_all

 
                 find_all( name , attrs , recursive , string , limit , **kwargs )

The find_all() method searches all tag descendants of the current tag and checks each one against the filter:

soup.find_all("title")
# [<title>The Dormouse's story</title>]

soup.find_all("p", "title")
# [<p class="title"><b>The Dormouse's story</b></p>]

soup.find_all("a")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

soup.find_all(id="link2")
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

import re
soup.find(string=re.compile("sisters"))
# 'Once upon a time there were three little sisters; and their names were\n'

Some of these calls look familiar and some are new. What do the string and id arguments mean? Why does find_all("p", "title") return the <p> tag whose CSS class is "title"? Let's look at the arguments of find_all() one by one.

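The name argument

The name argument finds all tags with the given name; string nodes are ignored automatically.

The simplest usage looks like this:

soup.find_all("title")
# [<title>The Dormouse's story</title>]

The value of name can be any kind of filter: a string, a regular expression, a list, a function, or True.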

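<1> Passing a string

The simplest filter is a string. Pass a string to a search method and Beautiful Soup looks for an exact match against it. The example below finds all the <b> tags in the document:

soup.find_all('b')
# [<b>The Dormouse's story</b>]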

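<2> Passing a regular expression

If a regular expression is passed, Beautiful Soup filters against it using its search() method. The example below finds all tags whose names start with b, so both the <body> and <b> tags should be found:

import re

for tag in soup.find_all(re.compile("^b")):
    print(tag.name)
# body
# b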
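Because the pattern is applied with search() rather than match(), a pattern without an anchor matches anywhere in the tag name. The sketch below contrasts an anchored and an unanchored pattern on a small made-up document.

```python
import re

from bs4 import BeautifulSoup

markup = "<html><head><title>T</title></head><body><b>x</b></body></html>"
soup = BeautifulSoup(markup, "html.parser")

# Anchored: only names that start with "b".
starts_with_b = [tag.name for tag in soup.find_all(re.compile("^b"))]

# Unanchored: any name that contains "t" somewhere.
contains_t = [tag.name for tag in soup.find_all(re.compile("t"))]

print(starts_with_b)
print(contains_t)
```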

<3> Passing a list

If a list is passed, Beautiful Soup returns anything that matches any item in the list. The code below finds all the <a> tags and <b> tags in the document:

soup.find_all(["a", "b"])
# [<b>The Dormouse's story</b>,
#  <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

<4> Passing True

True matches anything. The code below finds every tag in the document but returns no string nodes:

for tag in soup.find_all(True):
    print(tag.name)

'''
html
head
title
body
p
b
p
a
a
a
p
'''

<5> Passing a function

If none of the other filters fit, you can define a function that takes an element as its only argument. It should return True if the element matches and False otherwise.

The function below returns True if the element has a class attribute but no id attribute:

def has_class_but_no_id(tag):
    return tag.has_attr('class') and not tag.has_attr('id')

Passing this function to find_all() returns all the <p> tags:

print(soup.find_all(has_class_but_no_id))

'''
[<p class="title"><b>The Dormouse's story</b></p>,
 <p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>,
 <p class="story">...</p>]
'''
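A function can also be used as the filter for a single attribute, in which case it receives the attribute's value rather than the whole tag. The sketch below uses a shortened, made-up version of the three links plus one anchor with no href; the function name not_lacie is just an illustrative choice.

```python
from bs4 import BeautifulSoup

markup = (
    '<a href="http://example.com/elsie" id="link1">Elsie</a>'
    '<a href="http://example.com/lacie" id="link2">Lacie</a>'
    '<a href="http://example.com/tillie" id="link3">Tillie</a>'
    '<a id="nolink">No href</a>'
)
soup = BeautifulSoup(markup, "html.parser")


def not_lacie(href):
    """Match href values that exist and do not mention 'lacie'."""
    return href is not None and "lacie" not in href


ids = [a["id"] for a in soup.find_all(href=not_lacie)]
print(ids)
```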

Keyword arguments

Any keyword argument that is not one of the built-in argument names is treated as a filter on the tag attribute of that name. If you pass an argument called id, Beautiful Soup filters on each tag's "id" attribute.

soup.find_all(id='link2')
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

import re
print(soup.find_all(href=re.compile("elsie")))
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

When filtering on a named attribute, the value can be a string, a regular expression, a list, or True.

The example below finds every tag that has an id attribute, whatever its value:

soup.find_all(id=True)
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

Several attributes can be filtered at once by passing more than one keyword argument:

soup.find_all(href=re.compile("elsie"), id='link1')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

What if we want to filter on class? class is a reserved word in Python, so append an underscore and use class_ instead:

print(soup.find_all("a", class_="sister"))

'''
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
'''

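Some attributes, such as HTML5 data-* attributes, have names that cannot be used as Python keyword arguments. To search for tags with such attributes, pass a dictionary through the attrs argument of find_all():

data_soup = BeautifulSoup('<div data-foo="value">foo!</div>', 'html.parser')
data_soup.find_all(attrs={"data-foo": "value"})
# [<div data-foo="value">foo!</div>]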

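The string argument (text in older versions)

The string argument searches the document's strings instead of its tags. Older versions of Beautiful Soup called this argument text, and that name is still accepted as an alias. Like name, it accepts a string, a regular expression, a list, or True.

import re

print(soup.find_all(string="Elsie"))
# ['Elsie']

print(soup.find_all(string=["Tillie", "Elsie", "Lacie"]))
# ['Elsie', 'Lacie', 'Tillie']

print(soup.find_all(string=re.compile("Dormouse")))
# ["The Dormouse's story", "The Dormouse's story"]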

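The limit argument

find_all() returns every result of the search, which can be slow on a large document. If you do not need all the results, use the limit argument to cap the number returned. It works like the LIMIT keyword in SQL: once the number of results reaches the limit, the search stops and returns what it has found.

print(soup.find_all("a", limit=2))

'''
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
'''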

The recursive argument

When find_all() is called on a tag, Beautiful Soup examines all of that tag's descendants. To search only the tag's direct children, pass recursive=False.

print(soup.html.find_all("title"))
# [<title>The Dormouse's story</title>]

print(soup.html.find_all("title", recursive=False))
# []

find()

 
                 find( name , attrs , recursive , string , **kwargs )

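find_all() returns every matching tag in the document, but sometimes only one result is wanted. If a document has only one <body> tag, for instance, find_all() is a poor fit; rather than calling find_all() with limit=1, it is simpler to call find(). The two lines below are nearly equivalent:

soup.find_all('title', limit=1)
# [<title>The Dormouse's story</title>]

soup.find('title')
# <title>The Dormouse's story</title>

The only difference is that find_all() returns a list containing a single element, while find() returns the element itself.

When nothing is found, find_all() returns an empty list and find() returns None.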

print(soup.find("nosuchtag")) # None
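Because find() returns None on a miss, chaining straight onto its result (for example soup.find("h1").get_text()) raises an AttributeError when the tag is absent. One way to guard against that is sketched below, using a hypothetical helper named text_of and a made-up document.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<html><head><title> Demo </title></head></html>", "html.parser")


def text_of(document, name, default=""):
    """Return the stripped text of the first <name> tag, or default if absent.

    text_of is a hypothetical helper for illustration only.
    """
    node = document.find(name)
    return node.get_text(strip=True) if node is not None else default


title = text_of(soup, "title")
heading = text_of(soup, "h1", default="(no heading)")

print(title)
print(heading)
```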

soup.head.title is shorthand for navigating by tag name. The shorthand works by calling find() on the current tag repeatedly:

soup.head.title
# <title>The Dormouse's story</title>

soup.find("head").find("title")
# <title>The Dormouse's story</title>

find_parents() and find_parent()

These methods search upward through an element's ancestors: find_parent() returns the closest matching ancestor and find_parents() returns all matching ancestors.

a_string = soup.find(string="Lacie")
print(a_string)
# Lacie

print(a_string.find_parent())
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>

print(a_string.find_parents())

print(a_string.find_parent("p"))

'''
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
'''

find_next_siblings() and find_next_sibling()

                 find_next_siblings( name , attrs , string , limit , **kwargs )

                 find_next_sibling( name , attrs , string , **kwargs )

These two methods use .next_siblings to iterate over the siblings that come after the current tag in the parsed document. find_next_siblings() returns all the following siblings that match, while find_next_sibling() returns only the first one.

first_link = soup.a

print(first_link.find_next_sibling("a"))
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>

print(first_link.find_next_siblings("a"))

'''
[<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
'''

find_previous_siblings() and find_previous_sibling() work the same way as find_next_siblings() and find_next_sibling(), but in the opposite direction.

find_all_next() and find_next()

                 find_all_next( name , attrs , string , limit , **kwargs )

                 find_next( name , attrs , string , **kwargs )

These two methods use .next_elements to iterate over the tags and strings that come after the current tag. find_all_next() returns all matching nodes and find_next() returns the first matching node:

first_link = soup.a

print(first_link.find_all_next(string=True))
# ['Elsie', ',\n', 'Lacie', ' and\n', 'Tillie', ';\nand they lived at the bottom of a well.', '\n', '...', '\n']

print(first_link.find_next(string=True))
# Elsie

find_all_previous() and find_previous() work the same way as find_all_next() and find_next(), but in the opposite direction.

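CSS selectors in Beautiful Soup

In CSS, a tag name is written bare, a class name is prefixed with a dot, and an id is prefixed with #. The same notation can be used here to select elements, through the soup.select() method, which returns a list.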

(1) Selecting by tag name

print(soup.select("title"))
# [<title>The Dormouse's story</title>]

print(soup.select("b"))
# [<b>The Dormouse's story</b>]

(2) Selecting by class name

print(soup.select(".sister"))

'''
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
'''

(3) Selecting by id

print(soup.select("#link1"))
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

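(4) Combined selectors

Combined selectors work the same way as combining tag names, class names, and ids in a CSS file. For example, to find the element with id link2 inside a p tag, separate the two parts with a space:

print(soup.select("p #link2"))
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

Selecting direct children

print(soup.select("p > #link2"))
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]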

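(5) Selecting by attribute

A selector can also include an attribute condition, written in square brackets. The attribute and the tag name refer to the same node, so there must be no space between them; otherwise nothing will match.

print(soup.select("a[href='http://example.com/tillie']"))
# [<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

select() always returns a list, so the results can be looped over, with get_text() used to read the content of each one:

for title in soup.select('a'):
    print(title.get_text())

'''
Elsie
Lacie
Tillie
'''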
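When only the first match is needed, select_one() plays the same role for CSS selectors that find() plays for find_all(): it returns a single element, or None if nothing matches. A minimal sketch, using a shortened, made-up version of the sister links:

```python
from bs4 import BeautifulSoup

markup = (
    '<p class="story">'
    '<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>\n'
    '<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>'
    "</p>"
)
soup = BeautifulSoup(markup, "html.parser")

first = soup.select_one("a.sister")     # first match only
missing = soup.select_one("a.brother")  # no match: None

# select() still returns a list, here of the direct <a> children of p.story.
hrefs = [a["href"] for a in soup.select("p.story > a")]

print(first["id"])
print(missing)
print(hrefs)
```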

Rewriting the Douban example

from bs4 import BeautifulSoup

# s is assumed to hold the HTML of the Douban page fetched earlier (not shown in this article)
soup = BeautifulSoup(s, 'html.parser')
items = soup.find_all(class_="item")
for item in items:
    print(item.find(class_="pic").a.get("href"))
    print(item.find(class_="pic").em.string)
    print(item.find(class_="info").contents[1].a.span.string)
    print(item.find(class_="info").contents[3].contents[3].contents[3].string)
    print(item.find(class_="info").contents[3].contents[3].contents[7].string)
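Indexing into .contents by position is fragile, because whitespace strings count as children and any change in the page layout shifts the indices. The sketch below shows the same kind of extraction done with CSS selectors instead. The markup is entirely hypothetical: it only borrows the class names item, pic, and info from the code above, and the title class, URLs, and film names are invented for illustration, so it should not be taken as the real structure of the Douban page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, loosely modelled on the class names used above.
html = """
<ol>
  <li><div class="item">
    <div class="pic"><em>1</em><a href="https://example.com/movie/1">poster</a></div>
    <div class="info"><div class="hd"><a href="#"><span class="title">First Film</span></a></div></div>
  </div></li>
  <li><div class="item">
    <div class="pic"><em>2</em><a href="https://example.com/movie/2">poster</a></div>
    <div class="info"><div class="hd"><a href="#"><span class="title">Second Film</span></a></div></div>
  </div></li>
</ol>
"""

soup = BeautifulSoup(html, "html.parser")

rows = []
for item in soup.select(".item"):
    rank = item.select_one(".pic em").get_text(strip=True)
    link = item.select_one(".pic a")["href"]
    title = item.select_one(".info .title").get_text(strip=True)
    rows.append((rank, title, link))

for row in rows:
    print(row)
```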

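Summary

This article is fairly long and covers most of Beautiful Soup's methods for finding and extracting data. It is not complete: Beautiful Soup can also modify and delete parts of the tree, but those features are used less often, so only searching and extraction are covered here. I hope it is helpful.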
