Basic Usage of BeautifulSoup

Reposted article, 2019-06-22 16:35. Tag: web scraping

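BeautifulSoup is a Python library for parsing HTML and XML that makes it easy to extract data from web pages. (These are study notes from Cui Qingcai's web-scraping book.)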

I. Installation

# Install beautifulsoup4
pip install beautifulsoup4

# Install lxml
pip install lxml

II. Basic Syntax

1. Node selectors: basic usage

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

To get the title node and its text from the HTML above, use the following syntax:

Import and initialize BeautifulSoup

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')

During initialization, BeautifulSoup automatically repairs nonstandard HTML, for example by adding missing closing tags.
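
As a small illustration of that repair step, the sketch below feeds BeautifulSoup a fragment with two unclosed tags. The fragment and the variable names are made up for this example, and it uses Python's built-in html.parser instead of lxml so that it only needs bs4; with lxml the result would additionally be wrapped in html and body tags.

```python
from bs4 import BeautifulSoup

# Hypothetical broken fragment: neither <p> nor <b> is closed
broken = "<p>Hello <b>world"

# html.parser is an assumption made here to avoid the lxml dependency
soup = BeautifulSoup(broken, "html.parser")

# Serializing the tree shows the closing tags that were added
fixed = str(soup)
print(fixed)
```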

Get the title node and check its type

print(soup.title)
print(type(soup.title))


# Output:
<title>The Dormouse's story</title>
<class 'bs4.element.Tag'>

The title node returned is the tag itself together with its text content

Get the text content of the title node

print(soup.title.string)


# Output:
The Dormouse's story

Other information, such as a node's name and its attributes, is just as easy to get

Get the name of the title node

print(soup.title.name)


# Output:
title

Get all attributes or a single attribute of the p node

The p node has several attributes, such as class and name; use attrs to get all of them

# Get all attributes
print(soup.p.attrs)

# Output:
{'class': ['title'], 'name': 'dromouse'}


# Get a single attribute: method 1
print(soup.p.attrs['name'])

# Output:
dromouse


# Get a single attribute: method 2
print(soup.p['name'])

# Output:
dromouse


# Caveat when getting a single attribute
print(soup.p['class'])

# Output:
['title']

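Note that some attributes return a string while others return a list of strings. The name attribute has a single value, so the result is a single string; an element can have several classes, so class returns a list. Also note that soup.p here refers to the first p node.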
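
The difference between single-valued and multi-valued attributes can be sketched as follows. The markup and variable names are invented for the illustration, and html.parser is used only to keep the example free of the lxml dependency. It also shows get(), which returns None for an absent attribute instead of raising KeyError.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with a two-valued class attribute
soup = BeautifulSoup('<p class="title big" name="dromouse">x</p>', "html.parser")
p = soup.p

name_value = p["name"]    # single-valued attribute: a plain string
class_value = p["class"]  # multi-valued attribute: a list of strings
missing = p.get("id")     # absent attribute: None instead of KeyError

print(name_value, class_value, missing)
```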

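Nested (hierarchical) selection

When nodes are nested, you can select them step by step through the hierarchy. For example, to select the title node and its content we previously used soup.title; it can also be reached as soup.head.title: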

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
"""

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
print(soup.head.title)
print(type(soup.head.title))
print(soup.head.title.string)


# Output:
<title>The Dormouse's story</title>
<class 'bs4.element.Tag'>
The Dormouse's story

2. Node selectors: advanced usage

Parent and ancestor nodes

To get the parent of a node element, use the parent attribute

html = """
<html>
<head>
<title>The Dormouse's story</title>
</head>
<body>
<p class="story">
            Once upon a time there were three little sisters; and their names were 
<a href="http://example.com/elsie" class="sister" id="link1">
<span>Elsie</span>
</a>
</p>
<p class="story">...</p>
"""

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
print(soup.a.parent)


# Output:

<p class="story">
            Once upon a time there were three little sisters; and their names were 
<a href="http://example.com/elsie" class="sister" id="link1">
<span>Elsie</span>
</a>
</p>

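Here we selected the parent of the first a node. Its parent is clearly the p node, so the output is the p node together with everything inside it.

To get all ancestor elements, use the parents attribute: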

html = """
<html>
<body>
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">
<span>Elsie</span>
</a>
</p>
"""

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(type(soup.a.parents))
print(list(enumerate(soup.a.parents)))


# Output:
<class 'generator'>
[(0, <p class="story">
<a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
</p>), (1, <body>
<p class="story">
<a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
</p>
</body>), (2, <html>
<body>
<p class="story">
<a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
</p>
</body></html>), (3, <html>
<body>
<p class="story">
<a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
</p>
</body></html>)]

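Why do two entries start with <html>? Because parents walks upward in the order p, body, html, [document]: the last item is the BeautifulSoup object itself, which prints the same way as the html node.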
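
Printing only the name of each ancestor makes that order easier to see. This is a minimal sketch on a shortened, made-up version of the markup, parsed with html.parser (an assumption made to avoid the lxml dependency):

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical version of the example document
html = "<html><body><p class='story'><a id='link1'><span>Elsie</span></a></p></body></html>"
soup = BeautifulSoup(html, "html.parser")

# parents is a generator; collect just the tag names, innermost first
ancestor_names = [node.name for node in soup.a.parents]
print(ancestor_names)
```

The final name in the list belongs to the BeautifulSoup object that represents the whole document.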

Child and descendant nodes

After selecting a node element, use the contents attribute to get its direct children:

html = """
<html>
<head>
<title>The Dormouse's story</title>
</head>
<body>
<p class="story">
    Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elise" class="sister" id="link1">
<span>Elise</span>
</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
and they lived at the bottom of a well.
</p>
<p class="story">...</p>
"""

As the output of the following code shows, the result is a list: the p node contains both text and element nodes

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.p.contents)


# Output:
['\n    Once upon a time there were three little sisters; and their names were\n    ', <a class="sister" href="http://example.com/elise" id="link1">
<span>Elise</span>
</a>, '\n', <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, '\nand\n', <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>, '\nand they lived at the bottom of a well.\n']

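The span node is a grandchild of p, so it is not listed separately; it appears inside the a node. This shows that contents returns a list of direct children only.

The children attribute gives the same nodes, but as an iterator rather than a list: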

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.p.children)
for i, child in enumerate(soup.p.children):
    print(i, child)


# Output:
<list_iterator object at 0x000000000303F7B8>
0 
    Once upon a time there were three little sisters; and their names were
    
1 <a class="sister" href="http://example.com/elise" id="link1">
<span>Elise</span>
</a>
2 

3 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
4 
and

5 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
6 
and they lived at the bottom of a well.

To get all descendants as well, use the descendants attribute:

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.p.descendants)
for i, child in enumerate(soup.p.descendants):
    print(i, child)


# Output:
<generator object Tag.descendants at 0x000000000301F228>
0 
    Once upon a time there were three little sisters; and their names were
    
1 <a class="sister" href="http://example.com/elise" id="link1">
<span>Elise</span>
</a>
2 

3 <span>Elise</span>
4 Elise
5 

6 

7 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
8 Lacie
9 
and

10 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
11 Tillie
12 
and they lived at the bottom of a well.

Iterating over the output shows that the span node is included this time: descendants recursively walks every child node and yields all descendants.
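
The contrast between direct children and all descendants can be checked on a tiny document. The sketch below uses made-up markup and html.parser (an assumption, to keep it independent of lxml) and also filters out text nodes with isinstance so that only tags remain:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical one-line document: text, a nested tag, more text
html = "<p>Hi <a><span>Elsie</span></a> bye</p>"
soup = BeautifulSoup(html, "html.parser")

direct = soup.p.contents                # direct children only
everything = list(soup.p.descendants)   # recursive walk, text nodes included

# Keep only element nodes, dropping the text nodes
tag_names = [node.name for node in everything if isinstance(node, Tag)]

print(len(direct), len(everything), tag_names)
```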

Sibling nodes

What about getting sibling nodes?

html = """
<html>
<body>
<p class="story">
            Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">
<span>Elsie</span>
</a>
            Hello
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
            and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
            and they lived at the bottom of a well.
</p>
"""

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print('Next Sibling', soup.a.next_sibling)
print('Prev Sibling', soup.a.previous_sibling)
print('Next Siblings', list(enumerate(soup.a.next_siblings)))
print('Prev Siblings', list(enumerate(soup.a.previous_siblings)))


# Output:
Next Sibling 
            Hello

Prev Sibling 
            Once upon a time there were three little sisters; and their names were

Next Siblings [(0, '\n            Hello\n'), (1, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>), (2, '\n            and\n'), (3, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>), (4, '\n            and they lived at the bottom of a well.\n')]
Prev Siblings [(0, '\n            Once upon a time there were three little sisters; and their names were\n')]

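next_sibling and previous_sibling return the next and previous sibling of a node; next_siblings and previous_siblings return generators over all following and all preceding siblings, respectively.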
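
Because text between tags counts as a sibling too, the sibling right after a tag is often a string rather than the next element. A minimal sketch, with invented markup and html.parser assumed as the parser, compares next_sibling with find_next_sibling(), which only considers tags matching the given name:

```python
from bs4 import BeautifulSoup

# Hypothetical fragment with text between two links
html = '<p><a id="link1">Elsie</a> Hello <a id="link2">Lacie</a></p>'
soup = BeautifulSoup(html, "html.parser")

first = soup.a
raw_next = first.next_sibling             # the text node right after the first link
next_link = first.find_next_sibling("a")  # skips text, returns the next <a> tag

print(repr(raw_next), next_link["id"])
```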

3. Method selectors

find_all(): finds all elements that match the given conditions

html = '''
<div class="panel">
<div class="panel-heading">
<h4>Hello</h4>
</div>
<div class="panel-body">
<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
<ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>
</div>
</div>
'''

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.find_all(name='ul'))
print(type(soup.find_all(name='ul')[0]))



# Output:
[<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>, <ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>]
<class 'bs4.element.Tag'>

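Here find_all queries the ul nodes; the result is a list of length 2, and each element is of type bs4.element.Tag

Queries can also be nested, for example to get the text content of each li node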

for ul in soup.find_all(name='ul'):
    print(ul.find_all(name='li'))
    for li in ul.find_all(name='li'):
        print(li.string)


# Output:
[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>]
Foo
Bar
Jay
[<li class="element">Foo</li>, <li class="element">Bar</li>]
Foo
Bar

Besides the node name, you can also query by attributes

html = '''
<div class="panel">
<div class="panel-heading">
<h4>Hello</h4>
</div>
<div class="panel-body">
<ul class="list" id="list-1" name="elements">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
<ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>
</div>
</div>
'''

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.find_all(attrs={'id': 'list-1'}))
print(soup.find_all(attrs={'name': 'elements'}))


# Output:
[<ul class="list" id="list-1" name="elements">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>]
[<ul class="list" id="list-1" name="elements">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>]

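For common attributes such as id and class, attrs is not required. For example, to find the node whose id is list-1, pass id directly as a keyword argument; because class is a reserved word in Python, it is written as class_. Using the same HTML as above, the query becomes: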

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.find_all(id='list-1'))
print(soup.find_all(class_='element'))


# Output:
[<ul class="list" id="list-1" name="elements">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>]
[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>, <li class="element">Foo</li>, <li class="element">Bar</li>]
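
One detail worth illustrating is how class_ behaves on elements with several classes: it matches if any one of the classes equals the given value. The sketch below is a reduced, made-up version of the example markup, parsed with html.parser as an assumption; it also combines a tag name with an attribute filter.

```python
from bs4 import BeautifulSoup

# Reduced, hypothetical version of the example markup
html = """
<ul class="list" id="list-1"><li class="element">Foo</li></ul>
<ul class="list list-small" id="list-2"><li class="element">Bar</li></ul>
"""
soup = BeautifulSoup(html, "html.parser")

# Matches both <ul> tags: each has "list" among its classes
lists = soup.find_all(class_="list")

# Tag name and class filter combined: only the second <ul> qualifies
small = soup.find_all("ul", class_="list-small")

# Several attributes at once through attrs
by_attrs = soup.find_all("ul", attrs={"class": "list", "id": "list-2"})

print(len(lists), small[0]["id"], by_attrs[0]["id"])
```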

The text parameter matches the text of nodes; it accepts either a string or a compiled regular expression object

html = '''
<div class="panel">
<div class="panel-body">
<a>Hello, this is a link</a>
<a>Hello, this is a link, too</a>
</div>
</div>
'''

from bs4 import BeautifulSoup
import re

soup = BeautifulSoup(html, 'lxml')
print(soup.find_all(text=re.compile('link')))


# Output:
['Hello, this is a link', 'Hello, this is a link, too']
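
The same kind of filter can be sketched with the string keyword, which, to my knowledge, recent bs4 releases (4.4.0 and later) accept as the newer name for text. The markup is a shortened copy of the example, html.parser is assumed as the parser, and the second query shows that adding a tag name makes find_all return the matching tags rather than the strings.

```python
import re

from bs4 import BeautifulSoup

html = "<div><a>Hello, this is a link</a><a>Hello, this is a link, too</a></div>"
soup = BeautifulSoup(html, "html.parser")

# Without a tag name, the matching text nodes themselves are returned
strings = soup.find_all(string=re.compile("link"))

# With a tag name, the tags whose text matches are returned
tags = soup.find_all("a", string=re.compile("too"))

print(strings)
print(tags)
```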

find(): returns a single element, namely the first match

html = '''
<div class="panel">
<div class="panel-heading">
<h4>Hello</h4>
</div>
<div class="panel-body">
<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
<ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>
</div>
</div>
'''

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.find(name='ul'))
print(type(soup.find(name='ul')))
print(soup.find(class_='list'))


# Output:
<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
<class 'bs4.element.Tag'>
<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
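
When nothing matches, find() returns None rather than an empty list, so a result should be checked before it is used. A minimal sketch with invented markup and variable names, again assuming html.parser:

```python
from bs4 import BeautifulSoup

html = '<ul id="list-1"><li>Foo</li></ul>'
soup = BeautifulSoup(html, "html.parser")

first_ul = soup.find("ul")   # a Tag
missing = soup.find("table")  # no <table> in the document, so None

# Guard against None before calling Tag methods
label = missing.get_text() if missing is not None else "(no table)"
print(first_ul["id"], label)
```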

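Other query methods

find_parents() and find_parent(): the former returns all ancestor nodes, the latter returns the direct parent node

find_next_siblings() and find_next_sibling(): the former returns all following sibling nodes, the latter returns the first following sibling node

find_previous_siblings() and find_previous_sibling(): the former returns all preceding sibling nodes, the latter returns the first preceding sibling node

find_all_next() and find_next(): the former returns all matching nodes after the current node, the latter returns the first matching one

find_all_previous() and find_previous(): the former returns all matching nodes before the current node, the latter returns the first matching one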
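
A few of these methods can be tried out starting from a node in the middle of a list. The sketch below uses a made-up three-item list and html.parser as an assumption; the starting node is located by its text.

```python
from bs4 import BeautifulSoup

html = '<ul id="list-1"><li>Foo</li><li>Bar</li><li>Jay</li></ul>'
soup = BeautifulSoup(html, "html.parser")

# Start from the middle item
bar = soup.find("li", string="Bar")

parent_id = bar.find_parent("ul")["id"]                   # enclosing <ul>
after = bar.find_next_sibling("li").string                # first <li> after it
before = bar.find_previous_sibling("li").string           # first <li> before it
all_after = [li.string for li in bar.find_next_siblings("li")]

print(parent_id, before, after, all_after)
```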

4. CSS selectors

To use CSS selectors, call the select() method and pass in the CSS selector string

html = '''
<div class="panel">
<div class="panel-heading">
<h4>Hello</h4>
</div>
<div class="panel-body">
<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
<ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>
</div>
</div>
'''

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.select('.panel .panel-heading'))
print(soup.select('ul li'))
print(soup.select('#list-2 .element'))
print(type(soup.select('ul')[0]))


# Output:
[<div class="panel-heading">
<h4>Hello</h4>
</div>]
[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>, <li class="element">Foo</li>, <li class="element">Bar</li>]
[<li class="element">Foo</li>, <li class="element">Bar</li>]
<class 'bs4.element.Tag'>
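
Alongside select(), bs4 also provides select_one(), which returns only the first match or None. The sketch below shows it together with a child combinator on a reduced, made-up version of the markup; html.parser is assumed as the parser.

```python
from bs4 import BeautifulSoup

# Reduced, hypothetical version of the example markup
html = """
<ul class="list" id="list-1"><li class="element">Jay</li></ul>
<ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

second = soup.select_one("#list-2")           # first match only, as a Tag
items = soup.select("ul.list-small > li")     # direct <li> children of that list
texts = [li.get_text() for li in items]
nothing = soup.select_one("table")            # no match: None

print(second["id"], texts, nothing)
```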

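Nested selection

select() also supports nested selection. For example, first select all ul nodes, then iterate over each of them and select its li nodes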

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
for ul in soup.select('ul'):
    print(ul.select('li'))


# Output:
[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>]
[<li class="element">Foo</li>, <li class="element">Bar</li>]

As shown, this prints the list of li nodes under each ul node

Getting attributes

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
for ul in soup.select('ul'):
    print(ul['id'])
    print(ul.attrs['id'])


# Output:
list-1
list-1
list-2
list-2

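As shown, both indexing with square brackets and the attribute name, and going through the attrs attribute, return the attribute value

Getting text

To get text, use the string attribute described earlier or the get_text() method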

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
for li in soup.select('li'):
    print('Get Text:', li.get_text())
    print('String:', li.string)


# Output:
Get Text: Foo
String: Foo
Get Text: Bar
String: Bar
Get Text: Jay
String: Jay
Get Text: Foo
String: Foo
Get Text: Bar
String: Bar
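
The two give identical results above because each li holds a single piece of text. They differ once a node has more than one child: string then returns None, while get_text() joins the text of all descendants. A minimal sketch with made-up markup, assuming html.parser:

```python
from bs4 import BeautifulSoup

# The second <li> mixes plain text with a nested <b> tag
html = "<ul><li>Foo</li><li>Foo <b>Bar</b></li></ul>"
soup = BeautifulSoup(html, "html.parser")

simple, mixed = soup.find_all("li")

simple_string = simple.string   # single child: its text
mixed_string = mixed.string     # several children: None
mixed_text = mixed.get_text()   # all descendant text, concatenated

print(simple_string, mixed_string, mixed_text)
```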
