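BeautifulSoup is a Python library for parsing HTML and XML that makes it easy to extract data from web pages. (These are study notes on Cui Qingcai's web-scraping book.)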
I. Installation
# Install beautifulsoup4
pip install beautifulsoup4
# Install lxml (the parser used in the examples below)
pip install lxml
II. Basic syntax
1. Node selectors: basic usage
html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
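Suppose you want the title node and its text from the HTML above. The steps are as follows.
Import and initialize BeautifulSoup
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
When the document is parsed, BeautifulSoup automatically repairs some kinds of non-standard HTML, for example by adding missing closing tags.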
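The automatic repair described above can be seen with a deliberately broken fragment. This is a minimal sketch, not taken from the book: the string `broken` is made up for illustration, and it uses Python's built-in 'html.parser' rather than lxml so that it runs without extra dependencies. Different parsers may repair the same input differently.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment with two unclosed tags (not from the original notes).
broken = "<p>unclosed <b>bold"

# 'html.parser' ships with Python. lxml may repair the input differently,
# for example by also wrapping the fragment in <html> and <body>.
soup = BeautifulSoup(broken, "html.parser")

# Serializing the tree shows the closing tags that were added.
repaired = str(soup)
print(repaired)
```

Printing `soup.prettify()` instead gives the same tree with one node per line, which is often easier to read for larger documents.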
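Get the title node and check its type
print(soup.title)
print(type(soup.title))
# Output:
# <title>The Dormouse's story</title>
# <class 'bs4.element.Tag'>
The result is the whole title node: the tag together with its text content.
Get the text content of the title node
print(soup.title.string)
# Output:
# The Dormouse's story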
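Other information, such as a node's name and attributes, is just as easy to get.
Get the name of the title node
print(soup.title.name)
# Output:
# title
Get all attributes, or a single attribute, of the p node
The p node has several attributes, such as class and name. Use attrs to get all of them.
# Get all attributes
print(soup.p.attrs)
# Output: {'class': ['title'], 'name': 'dromouse'}
# Get a single attribute, method 1
print(soup.p.attrs['name'])
# Output: dromouse
# Get a single attribute, method 2
print(soup.p['name'])
# Output: dromouse
# Note what comes back for class
print(soup.p['class'])
# Output: ['title']
Note that some attributes come back as a string and others as a list of strings. The name attribute holds a single value, so the result is one string. An element can have several classes, so class is returned as a list. Also note that soup.p here is the first p node in the document.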
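The string-versus-list behaviour is easier to see on a node that really has two classes. The following is a small sketch under stated assumptions: the one-line `html` snippet is made up for illustration, 'html.parser' is used instead of lxml to avoid the extra dependency, and `.get()` is shown as one way to read an attribute that may be absent.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: one p node with two classes and a name attribute.
html = '<p class="title big" name="dromouse">The Dormouse\'s story</p>'
soup = BeautifulSoup(html, "html.parser")

p = soup.p
classes = p["class"]    # multi-valued attribute: a list of strings
name = p["name"]        # single-valued attribute: a plain string
missing = p.get("id")   # .get() returns None instead of raising KeyError

print(classes, name, missing)
```

Indexing with `p["id"]` would raise a KeyError here, so `.get()` is the safer choice when the attribute is optional.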
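Nested (hierarchical) selection
When nodes are nested inside one another, you can select them level by level. For example, to select the title node and its content we used soup.title earlier; soup.head.title works as well:
html = """
<html><head><title>The Dormouse's story</title></head>
<body>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.head.title)
print(type(soup.head.title))
print(soup.head.title.string)
# Output:
# <title>The Dormouse's story</title>
# <class 'bs4.element.Tag'>
# The Dormouse's story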
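One thing to watch for with chained selection is a tag that does not exist. The sketch below assumes a shortened version of the document above and uses 'html.parser' instead of lxml. Selecting a missing tag returns None, so the next step in the chain has to be guarded.

```python
from bs4 import BeautifulSoup

html = "<html><head><title>The Dormouse's story</title></head><body></body></html>"
soup = BeautifulSoup(html, "html.parser")

# Each step in the chain returns a Tag, so the chain can continue.
title_text = soup.head.title.string

# There is no h1 inside head, so this step returns None.
missing = soup.head.h1

# Guard before chaining further; None has no .string attribute.
heading_text = missing.string if missing is not None else None

print(title_text, missing, heading_text)
```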
2. Node selectors: advanced usage
Parent and ancestor nodes
To get the parent of a node element, use the parent attribute.
html = """
<html>
<head>
<title>The Dormouse's story</title>
</head>
<body>
<p class="story">
Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">
<span>Elsie</span>
</a>
</p>
<p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.a.parent)
# Output:
# <p class="story">
# Once upon a time there were three little sisters; and their names were
# <a href="http://example.com/elsie" class="sister" id="link1">
# <span>Elsie</span>
# </a>
# </p>
Here we selected the parent of the first a node. Its parent is the p node, so the output is the p node together with everything inside it.
To get all ancestor elements, use the parents attribute:
html = """
<html>
<body>
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">
<span>Elsie</span>
</a>
</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(type(soup.a.parents))
print(list(enumerate(soup.a.parents)))
# Output:
# <class 'generator'>
# [(0, <p class="story">
# <a class="sister" href="http://example.com/elsie" id="link1">
# <span>Elsie</span>
# </a>
# </p>), (1, <body>
# <p class="story">
# <a class="sister" href="http://example.com/elsie" id="link1">
# <span>Elsie</span>
# </a>
# </p>
# </body>), (2, <html>
# <body>
# <p class="story">
# <a class="sister" href="http://example.com/elsie" id="link1">
# <span>Elsie</span>
# </a>
# </p>
# </body></html>), (3, <html>
# <body>
# <p class="story">
# <a class="sister" href="http://example.com/elsie" id="link1">
# <span>Elsie</span>
# </a>
# </p>
# </body></html>)]
Why do two entries starting with <html> appear? Because parents walks up the tree in the order p -> body -> html -> [document]. The last entry is the BeautifulSoup object itself, which represents the whole document and therefore prints the same markup as the html node.
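The traversal order is easier to read if only the tag names are printed. A minimal sketch, assuming a compact version of the same document and the built-in 'html.parser' instead of lxml:

```python
from bs4 import BeautifulSoup

html = """
<html><body>
<p class="story"><a href="http://example.com/elsie" class="sister" id="link1"><span>Elsie</span></a></p>
</body></html>
"""
soup = BeautifulSoup(html, "html.parser")

# .parents is a generator; collect only the name of each ancestor.
ancestor_names = [parent.name for parent in soup.a.parents]
print(ancestor_names)
```

The final name, '[document]', belongs to the BeautifulSoup object at the root of the tree.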
Child and descendant nodes
After selecting a node element, use the contents attribute to get its direct children:
html = """
<html>
<head>
<title>The Dormouse's story</title>
</head>
<body>
<p class="story">
Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elise" class="sister" id="link1">
<span>Elise</span>
</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
and they lived at the bottom of a well.
</p>
<p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.p.contents)
# Output:
# ['\n Once upon a time there were three little sisters; and their names were\n ', <a class="sister" href="http://example.com/elise" id="link1">
# <span>Elise</span>
# </a>, '\n', <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, '\nand\n', <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>, '\nand they lived at the bottom of a well.\n']
The result is a list. The p node contains both text and nodes, and both appear as list items.
The span node is a grandchild of p, so it is not listed on its own; it appears only inside its parent a node. This shows that contents returns a list of direct children only.
The children attribute gives the same nodes, but as an iterator rather than a list:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.p.children)
for i, child in enumerate(soup.p.children):
    print(i, child)
# Output:
# <list_iterator object at 0x000000000303F7B8>
# 0
# Once upon a time there were three little sisters; and their names were
# 1 <a class="sister" href="http://example.com/elise" id="link1">
# <span>Elise</span>
# </a>
# 2
# 3 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
# 4
# and
# 5 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
# 6
# and they lived at the bottom of a well.
To get all descendants as well, use the descendants attribute:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.p.descendants)
for i, child in enumerate(soup.p.descendants):
    print(i, child)
# Output:
# <generator object Tag.descendants at 0x000000000301F228>
# 0
# Once upon a time there were three little sisters; and their names were
# 1 <a class="sister" href="http://example.com/elise" id="link1">
# <span>Elise</span>
# </a>
# 2
# 3 <span>Elise</span>
# 4 Elise
# 5
# 6
# 7 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
# 8 Lacie
# 9
# and
# 10 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
# 11 Tillie
# 12
# and they lived at the bottom of a well.
This time the output includes the span node. descendants recursively visits every child, so it yields all descendant nodes, including text nodes.
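The difference between direct children and descendants can be checked by counting them. This is a sketch on a made-up one-line fragment (no whitespace between tags, so the counts stay small), parsed with 'html.parser' instead of lxml. The isinstance filter is one way to skip text nodes when only tags are of interest.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical compact fragment: text, a tag with a nested span, text, a tag.
html = "<p>Once <a><span>Elsie</span></a> and <a>Lacie</a></p>"
soup = BeautifulSoup(html, "html.parser")
p = soup.p

direct = p.contents                # direct children only (a list)
everything = list(p.descendants)   # children, grandchildren, and so on

# Keep only Tag objects, dropping the text nodes.
child_tags = [c.name for c in p.children if isinstance(c, Tag)]
descendant_tags = [d.name for d in p.descendants if isinstance(d, Tag)]

print(len(direct), len(everything))
print(child_tags, descendant_tags)
```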
Sibling nodes
How do you get a node's siblings?
html = """
<html>
<body>
<p class="story">
Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">
<span>Elsie</span>
</a>
Hello
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
and they lived at the bottom of a well.
</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print('Next Sibling', soup.a.next_sibling)
print('Prev Sibling', soup.a.previous_sibling)
print('Next Siblings', list(enumerate(soup.a.next_siblings)))
print('Prev Siblings', list(enumerate(soup.a.previous_siblings)))
# Output:
# Next Sibling
# Hello
# Prev Sibling
# Once upon a time there were three little sisters; and their names were
# Next Siblings [(0, '\n Hello\n'), (1, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>), (2, '\n and\n'), (3, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>), (4, '\n and they lived at the bottom of a well.\n')]
# Prev Siblings [(0, '\n Once upon a time there were three little sisters; and their names were\n')]
next_sibling and previous_sibling return the next and the previous sibling of a node. next_siblings and previous_siblings iterate over all the siblings after and before it. Note that text between tags counts as a sibling too.
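Because text nodes count as siblings, the immediate neighbour of a tag is often a string rather than another tag. The sketch below uses a made-up one-line fragment and 'html.parser' instead of lxml, and filters next_siblings with isinstance to keep only tags. This filtering step is an illustration, not something from the original notes.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html = '<p><a id="link1">Elsie</a> Hello <a id="link2">Lacie</a> and <a id="link3">Tillie</a></p>'
soup = BeautifulSoup(html, "html.parser")
first = soup.a

raw_next = first.next_sibling       # the text between the first two a tags
raw_prev = first.previous_sibling   # nothing precedes the first a inside p

# Keep only the siblings that are tags and read their id attribute.
tag_sibling_ids = [s["id"] for s in first.next_siblings if isinstance(s, Tag)]

print(repr(raw_next), raw_prev, tag_sibling_ids)
```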
3. Method selectors
find_all(): finds all elements that match the given conditions
html = '''
<div class="panel">
<div class="panel-heading">
<h4>Hello</h4>
</div>
<div class="panel-body">
<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
<ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>
</div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.find_all(name='ul'))
print(type(soup.find_all(name='ul')[0]))
# Output:
# [<ul class="list" id="list-1">
# <li class="element">Foo</li>
# <li class="element">Bar</li>
# <li class="element">Jay</li>
# </ul>, <ul class="list list-small" id="list-2">
# <li class="element">Foo</li>
# <li class="element">Bar</li>
# </ul>]
# <class 'bs4.element.Tag'>
Querying for ul nodes with find_all returns a list of length 2, and each element is of type bs4.element.Tag.
Because each result is a Tag, queries can be nested. Here we get the text of the li nodes inside each ul:
for ul in soup.find_all(name='ul'):
    print(ul.find_all(name='li'))
    for li in ul.find_all(name='li'):
        print(li.string)
# Output:
# [<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>]
# Foo
# Bar
# Jay
# [<li class="element">Foo</li>, <li class="element">Bar</li>]
# Foo
# Bar
Besides querying by node name, you can also query by attributes:
html = '''
<div class="panel">
<div class="panel-heading">
<h4>Hello</h4>
</div>
<div class="panel-body">
<ul class="list" id="list-1" name="elements">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
<ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>
</div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.find_all(attrs={'id': 'list-1'}))
print(soup.find_all(attrs={'name': 'elements'}))
# Output:
# [<ul class="list" id="list-1" name="elements">
# <li class="element">Foo</li>
# <li class="element">Bar</li>
# <li class="element">Jay</li>
# </ul>]
# [<ul class="list" id="list-1" name="elements">
# <li class="element">Foo</li>
# <li class="element">Bar</li>
# <li class="element">Jay</li>
# </ul>]
For commonly used attributes such as id and class, you do not need attrs. To find the node whose id is list-1, pass id directly as a keyword argument. Because class is a reserved word in Python, it is passed as class_. Using the same HTML as above:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.find_all(id='list-1'))
print(soup.find_all(class_='element'))
# Output:
# [<ul class="list" id="list-1" name="elements">
# <li class="element">Foo</li>
# <li class="element">Bar</li>
# <li class="element">Jay</li>
# </ul>]
# [<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>, <li class="element">Foo</li>, <li class="element">Bar</li>]
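Name and attribute filters can also be combined in one call. The sketch below is an illustration on a trimmed version of the HTML above, parsed with 'html.parser' instead of lxml. It also uses the limit argument of find_all, which the notes do not cover, to stop after a given number of matches.

```python
from bs4 import BeautifulSoup

html = """
<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>
<ul class="list list-small" id="list-2">
<li class="element">Foo</li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

# class_ matches any one of an element's classes, so both ul nodes qualify.
lists = soup.find_all("ul", class_="list")
list_ids = [ul["id"] for ul in lists]

# limit caps the number of results returned.
first_two = soup.find_all("li", class_="element", limit=2)
first_two_text = [li.string for li in first_two]

print(list_ids, first_two_text)
```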
The text parameter matches against the text of nodes. It accepts either a string or a compiled regular expression object.
html = '''
<div class="panel">
<div class="panel-body">
<a>Hello, this is a link</a>
<a>Hello, this is a link, too</a>
</div>
</div>
'''
from bs4 import BeautifulSoup
import re
soup = BeautifulSoup(html, 'lxml')
print(soup.find_all(text=re.compile('link')))
# Output:
# ['Hello, this is a link', 'Hello, this is a link, too']
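Note that the results of a text query are strings, not tags. The sketch below shows one way to get back to the enclosing tag through .parent. It assumes Beautiful Soup 4.4.0 or later, where the documented name of this argument is string (text is the older name), and it uses 'html.parser' and a made-up fragment with a third link added for contrast.

```python
import re
from bs4 import BeautifulSoup

html = "<div><a>Hello, this is a link</a><a>Hello, this is a link, too</a><a>Bye</a></div>"
soup = BeautifulSoup(html, "html.parser")

# A compiled regex matches any text node in which the pattern is found.
matches = soup.find_all(string=re.compile("link"))

# Each match is a text node; .parent gives the tag that contains it.
parent_names = [m.parent.name for m in matches]

# A plain string must equal the whole text of the node.
exact = soup.find_all(string="Bye")

print(matches, parent_names, exact)
```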
find(): returns a single element, namely the first one that matches
html = '''
<div class="panel">
<div class="panel-heading">
<h4>Hello</h4>
</div>
<div class="panel-body">
<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
<ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>
</div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.find(name='ul'))
print(type(soup.find(name='ul')))
print(soup.find(class_='list'))
# Output:
# <ul class="list" id="list-1">
# <li class="element">Foo</li>
# <li class="element">Bar</li>
# <li class="element">Jay</li>
# </ul>
# <class 'bs4.element.Tag'>
# <ul class="list" id="list-1">
# <li class="element">Foo</li>
# <li class="element">Bar</li>
# <li class="element">Jay</li>
# </ul>
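The two methods also differ in what they return when nothing matches, which matters for error handling. A short sketch, using a made-up one-line fragment and 'html.parser' instead of lxml:

```python
from bs4 import BeautifulSoup

html = '<ul class="list" id="list-1"><li class="element">Foo</li><li class="element">Bar</li></ul>'
soup = BeautifulSoup(html, "html.parser")

first_li = soup.find("li")           # first match only
no_table = soup.find("table")        # no match: None
no_tables = soup.find_all("table")   # no match: an empty list

print(first_li.string, no_table, len(no_tables))
```

Code that chains on the result of find() should therefore check for None first, while a loop over find_all() simply runs zero times.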
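Other query methods
find_parents() and find_parent(): the former returns all ancestor nodes, the latter returns the direct parent
find_next_siblings() and find_next_sibling(): the former returns all following siblings, the latter returns the first following sibling
find_previous_siblings() and find_previous_sibling(): the former returns all preceding siblings, the latter returns the first preceding sibling
find_all_next() and find_next(): the former returns all matching nodes after the current node, the latter returns the first matching one
find_all_previous() and find_previous(): the former returns all matching nodes before the current node, the latter returns the first matching one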
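These methods take the same filters as find_all() and find(), but search relative to a starting node. The sketch below tries a few of them from the middle li of a trimmed version of the earlier HTML. The fragment and the variable names are chosen for illustration, and 'html.parser' is used instead of lxml.

```python
from bs4 import BeautifulSoup

html = """
<div class="panel-body">
<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# Start from the li whose text is exactly "Bar".
bar = soup.find("li", string="Bar")

parent_ul = bar.find_parent("ul")            # nearest ul ancestor
outer_div = bar.find_parent("div")           # nearest div ancestor
next_li = bar.find_next_sibling("li")        # first following li sibling
prev_li = bar.find_previous_sibling("li")    # first preceding li sibling
following = [li.string for li in bar.find_all_next("li")]

print(parent_ul["id"], outer_div["class"], next_li.string, prev_li.string, following)
```

Passing a tag name to the sibling methods skips the whitespace text nodes that next_sibling and previous_sibling would return.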
4. CSS selectors
To use CSS selectors, call the select() method and pass in the selector.
html = '''
<div class="panel">
<div class="panel-heading">
<h4>Hello</h4>
</div>
<div class="panel-body">
<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
<ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>
</div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.select('.panel .panel-heading'))
print(soup.select('ul li'))
print(soup.select('#list-2 .element'))
print(type(soup.select('ul')[0]))
# Output:
# [<div class="panel-heading">
# <h4>Hello</h4>
# </div>]
# [<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>, <li class="element">Foo</li>, <li class="element">Bar</li>]
# [<li class="element">Foo</li>, <li class="element">Bar</li>]
# <class 'bs4.element.Tag'>
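A few other selector forms may be useful alongside the ones above. The sketch below is not from the book: it uses select_one(), which returns only the first match, together with a child combinator, a compound class selector, and an attribute selector. The HTML is a trimmed variant of the example with the last item renamed to Baz so that the results are distinguishable, and 'html.parser' is used instead of lxml.

```python
from bs4 import BeautifulSoup

html = """
<div class="panel">
<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>
<ul class="list list-small" id="list-2">
<li class="element">Baz</li>
</ul>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# select_one returns the first match (or None) instead of a list.
first = soup.select_one("#list-1 > li")

# Compound class selector: ul elements that carry both classes.
small_items = [li.string for li in soup.select("ul.list.list-small li")]

# Attribute selector.
by_attr = soup.select('ul[id="list-2"]')

nothing = soup.select_one("table")

print(first.string, small_items, len(by_attr), nothing)
```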
Nested selection
select() also supports nested selection. For example, first select all ul nodes, then iterate over them and select the li nodes inside each one:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
for ul in soup.select('ul'):
    print(ul.select('li'))
# Output:
# [<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>]
# [<li class="element">Foo</li>, <li class="element">Bar</li>]
For each ul node, this prints the list of all li nodes beneath it.
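Getting attributes
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
for ul in soup.select('ul'):
    print(ul['id'])
    print(ul.attrs['id'])
# Output:
# list-1
# list-1
# list-2
# list-2
Both approaches work: indexing the node with square brackets and the attribute name, or going through the attrs attribute.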
Getting text
To get the text, use the string attribute described earlier or the get_text() method.
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
for li in soup.select('li'):
    print('Get Text:', li.get_text())
    print('String:', li.string)
# Output:
# Get Text: Foo
# String: Foo
# Get Text: Bar
# String: Bar
# Get Text: Jay
# String: Jay
# Get Text: Foo
# String: Foo
# Get Text: Bar
# String: Bar
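In the example above the two give the same result because every li contains a single piece of text. They differ once a node has more than one child. The sketch below uses a made-up fragment to show this, parsed with 'html.parser' instead of lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: the second li mixes text with a nested b tag.
html = "<ul><li>Foo</li><li>Bar <b>Baz</b></li></ul>"
soup = BeautifulSoup(html, "html.parser")

simple, mixed = soup.select("li")

simple_string = simple.string        # single text child: that text
mixed_string = mixed.string          # several children: None
mixed_text = mixed.get_text()        # all nested text, concatenated

# A separator plus strip=True trims each piece before joining.
mixed_clean = mixed.get_text(" ", strip=True)
mixed_joined = mixed.get_text(strip=True)

print(simple_string, mixed_string, repr(mixed_text), mixed_clean, mixed_joined)
```

So get_text() is usually the safer choice when a node may contain nested markup, and string is convenient when it is known to hold exactly one text node.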