Web Scraping Basics: The BeautifulSoup HTML Parsing Library


BeautifulSoup

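BeautifulSoup is a flexible, convenient library for parsing web pages. It is efficient and supports several parsers, and it lets you extract information from a page without writing regular expressions.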

 

Installing BeautifulSoup

pip3 install beautifulsoup4

 

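Using BeautifulSoup

Parsers

Parser | Usage | Advantages | Disadvantages
Python standard library | BeautifulSoup(markup, "html.parser") | Built into Python; moderate speed; lenient with malformed documents | Less lenient in versions before Python 2.7.3 / 3.2.2
lxml HTML parser | BeautifulSoup(markup, "lxml") | Fast; lenient with malformed documents | Requires an external C library
lxml XML parser | BeautifulSoup(markup, "xml") | Fast; the only parser that supports XML | Requires an external C library
html5lib | BeautifulSoup(markup, "html5lib") | Most lenient; parses documents the way a browser does; produces valid HTML5 | Slow; requires an external Python package (no C extension needed)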
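
Because lxml is an optional install, a script can try the faster parser first and fall back to the built-in one. The sketch below is one possible way to do that. make_soup and its preferred parameter are hypothetical names chosen for illustration, not part of BeautifulSoup; the code relies on bs4 raising FeatureNotFound when the requested parser is not installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred=("lxml", "html.parser")):
    """Hypothetical helper: parse with the first parser that is installed."""
    for parser in preferred:
        try:
            return BeautifulSoup(markup, parser), parser
        except FeatureNotFound:
            # This parser is not available; try the next one in the tuple.
            continue
    raise RuntimeError("none of the requested parsers is installed")


# The markup deliberately leaves <p> and <b> unclosed.
soup, parser_used = make_soup("<p>Hello<b>world")
print(parser_used)
print(soup.b.string)
```

html.parser ships with Python, so the fallback should always succeed as long as bs4 itself is installed.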


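Basic usage

import requests
from bs4 import BeautifulSoup

html = requests.get('http://www.baidu.com').text  # fetch the page and decode the body to text
soup = BeautifulSoup(html, 'lxml')
print(soup.prettify())    # pretty-print the parsed tree; tags left unclosed in the source are closed
print(soup.title.string)  # print the text of the <title> inside <head>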
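
To see what prettify() does without touching the network, the same calls can be run on a literal string. This is a minimal sketch, assuming the built-in html.parser (so nothing beyond bs4 is required) and a made-up fragment whose tags are left unclosed:

```python
from bs4 import BeautifulSoup

# Made-up fragment: <p>, <body> and <html> are never closed.
broken = "<html><head><title>Demo page</title></head><body><p>Unclosed paragraph"

soup = BeautifulSoup(broken, "html.parser")
pretty = soup.prettify()

print(pretty)             # indented markup, with the missing closing tags present
print(soup.title.string)  # text of the <title> element
```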

 

Tag selectors

Selecting elements

from bs4 import BeautifulSoup

html = """
    <html><head><title>This is a test Html code</title></head>
    <body>
    <p class="title" name="dropmouse"><b>The doc story</b></p>
    <p class="story">Once upon a time there were three little sisters;and their names lll
    <a href="http://www.baidu.com" class="sister" id="link1"><!---Elsa---></a>
    <a href="http://www.baidu.com" class="sister" id="link2">Lacie</a> and
    <a href="http://www.baidu.com" class="sister" id="link3">Tille</a>;
    and They lived at the bottom of a well.</p>
    <p class="story">...</p>
"""
soup = BeautifulSoup(html, 'lxml')
print(soup.title)        # the <title> element, printed together with its tags
print(type(soup.title))  # <class 'bs4.element.Tag'>
print(soup.head)         # the <head> element
print(soup.p)            # only the first <p> that is found
print(soup.p.name)       # the name of the tag, here 'p'

 

Getting the tag name

See the example above: soup.p.name returns the name of the tag, 'p'.

 

Getting attributes

This works somewhat like jQuery.

from bs4 import BeautifulSoup

html = """
    <html><head><title>This is a test Html code</title></head>
    <body>
    <p class="title" name="dropmouse"><b>The doc story</b></p>
    <p class="story">Once upon a time there were three little sisters;and their names lll
    <a href="http://www.baidu.com" class="sister" id="link1"><!---Elsa---></a>
    <a href="http://www.baidu.com" class="sister" id="link2">Lacie</a> and
    <a href="http://www.baidu.com" class="sister" id="link3">Tille</a>;
    and They lived at the bottom of a well.</p>
    <p class="story">...</p>
"""
soup = BeautifulSoup(html, 'lxml')
# soup.p.attrs is a dict of the first <p>'s attributes:
# {'class': ['title'], 'name': 'dropmouse'}
print(soup.p.attrs['name'])  # dropmouse
print(soup.p['name'])        # also dropmouse; tag[key] is shorthand for tag.attrs[key]
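
Indexing a tag with a missing attribute raises KeyError, so for optional attributes the dict-style get() is often safer. The sketch below uses a made-up one-line fragment and html.parser; it also shows that class comes back as a list because it can hold several values.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p class="title big" name="dropmouse">The doc story</p>', "html.parser")
p = soup.p

classes = p["class"]              # multi-valued attribute: a list of strings
missing = p.get("id")             # None instead of KeyError when the attribute is absent
fallback = p.get("id", "no-id")   # get() accepts a default, like dict.get
has_name = p.has_attr("name")     # explicit presence check

print(classes, missing, fallback, has_name)
```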

 

Getting the text content

For example, to get the text inside a p tag:

from bs4 import BeautifulSoup

html = """
    <html><head><title>This is a test Html code</title></head>
    <body>
    <p class="title" name="dropmouse"><b>The doc story</b></p>
    <p class="story">Once upon a time there were three little sisters;and their names lll
    <a href="http://www.baidu.com" class="sister" id="link1"><!---Elsa---></a>
    <a href="http://www.baidu.com" class="sister" id="link2">Lacie</a> and
    <a href="http://www.baidu.com" class="sister" id="link3">Tille</a>;
    and They lived at the bottom of a well.</p>
    <p class="story">...</p>
"""
soup = BeautifulSoup(html, 'lxml')
print(soup.p.string)  # .string is the text inside the tag, without any HTML markup: The doc story
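
One detail worth knowing about .string: when a tag's only child is an HTML comment, as with the first a tag in the sample, .string returns a Comment object rather than ordinary text. The following sketch, using a shortened made-up fragment and html.parser, shows how to tell the two apart.

```python
from bs4 import BeautifulSoup, Comment

soup = BeautifulSoup('<a id="link1"><!-- Elsa --></a><a id="link2">Lacie</a>', "html.parser")
first, second = soup.find_all("a")

is_comment = isinstance(first.string, Comment)  # the only child is a comment node
visible_text = first.get_text()                 # get_text() leaves comments out

print(repr(first.string), is_comment)
print(repr(visible_text))
print(repr(second.string))
```

Checking isinstance(..., Comment) before using .string avoids treating commented-out markup as page text.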

 

Nested selection

A bs4.element.Tag can in turn be used to select the tags nested inside it. For example:

from bs4 import BeautifulSoup

html = """
    <html><head><title>This is a test Html code</title></head>
    <body>
    <p class="title" name="dropmouse"><b>The doc story</b></p>
    <p class="story">Once upon a time there were three little sisters;and their names lll
    <a href="http://www.baidu.com" class="sister" id="link1"><!---Elsa---></a>
    <a href="http://www.baidu.com" class="sister" id="link2">Lacie</a> and
    <a href="http://www.baidu.com" class="sister" id="link3">Tille</a>;
    and They lived at the bottom of a well.</p>
    <p class="story">...</p>
"""
soup = BeautifulSoup(html, 'lxml')
print(soup.body.p.string)  # chained access, again similar to jQuery

 

Children and descendants

from bs4 import BeautifulSoup

html = """
    <html><head><title>This is a test Html code</title></head>
    <body>
    <p class="story">Once upon a time there were three little sisters;and their names lll
    <a href="http://www.baidu.com" class="sister" id="link1"><!---Elsa---></a>
    <a href="http://www.baidu.com" class="sister" id="link2">Lacie</a> and
    <a href="http://www.baidu.com" class="sister" id="link3">Tille</a>;
    and They lived at the bottom of a well.</p>
    <p class="story">...</p>
"""
soup = BeautifulSoup(html, 'lxml')
print(soup.p.contents)  # a list of everything directly inside the first <p>, including newline text nodes
print(soup.p.string)    # None: this <p> has several children (text and tags), so there is no single string
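
When .string is None because a tag has mixed content, get_text() and stripped_strings still give access to the text. A minimal sketch on a shortened, made-up version of the paragraph, assuming html.parser:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p class="story">Once <a>Lacie</a> and <a>Tille</a>.</p>', "html.parser")
p = soup.p

single = p.string                  # None: <p> has more than one child
full_text = p.get_text()           # all text inside <p>, concatenated
pieces = list(p.stripped_strings)  # each text node, whitespace-trimmed

print(single)
print(full_text)
print(pieces)
```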

Another way to get the child nodes

from bs4 import BeautifulSoup

html = """
    <html><head><title>This is a test Html code</title></head>
    <body>
    <p class="story">Once upon a time there were three little sisters;and their names lll
    <a href="http://www.baidu.com" class="sister" id="link1"><!---Elsa---></a>
    <a href="http://www.baidu.com" class="sister" id="link2">Lacie</a> and
    <a href="http://www.baidu.com" class="sister" id="link3">Tille</a>;
    and They lived at the bottom of a well.</p>
    <p class="story">...</p>
"""
soup = BeautifulSoup(html, 'lxml')
print(soup.p.children)  # an iterator over the direct children of the first <p>
for i, child in enumerate(soup.p.children):
    print(i, child)

Output (abridged: the leading indexes and whitespace-only text nodes are left out; the iterator type and address on the first line depend on your environment):

<list_iterator object at 0x7fda5c186c88>
Once upon a time there were three little sisters;and their names lll
<a class="sister" href="http://www.baidu.com" id="link1"><!---Elsa---></a>
<a class="sister" href="http://www.baidu.com" id="link2">Lacie</a>
and
<a class="sister" href="http://www.baidu.com" id="link3">Tille</a>
;
and They lived at the bottom of a well.

 

Descendants

from bs4 import BeautifulSoup

html = """
    <html><head><title>This is a test Html code</title></head>
    <body>
    <p class="story">Once upon a time there were three little sisters;and their names lll
    <a href="http://www.baidu.com" class="sister" id="link1">
        <span>Elsle</span>
    </a>
    <a href="http://www.baidu.com" class="sister" id="link2">Lacie</a> and
    <a href="http://www.baidu.com" class="sister" id="link3">Tille</a>;
    and They lived at the bottom of a well.</p>
    <p class="story">...</p>
"""
soup = BeautifulSoup(html, 'lxml')
print(soup.p.descendants)  # a generator over every node nested anywhere inside the first <p>
for i, child in enumerate(soup.p.descendants):
    print(i, child)

This yields every descendant of the first p that is found. Output (abridged: the leading indexes and whitespace-only text nodes are left out; the generator address will differ):

<generator object descendants at 0x7f0b04eceaf0>
Once upon a time there were three little sisters;and their names lll
<a class="sister" href="http://www.baidu.com" id="link1">
<span>Elsle</span>
</a>
<span>Elsle</span>
Elsle
<a class="sister" href="http://www.baidu.com" id="link2">Lacie</a>
Lacie
and
<a class="sister" href="http://www.baidu.com" id="link3">Tille</a>
Tille
;
and They lived at the bottom of a well.
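
Both children and descendants include text nodes as well as tags. If only the tags matter, they can be filtered with isinstance, as in this sketch (a shortened, made-up fragment parsed with html.parser):

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html = '<p>Intro <a id="link1"><span>Elsle</span></a> and <a id="link2">Lacie</a></p>'
p = BeautifulSoup(html, "html.parser").p

child_count = len(list(p.children))  # direct children: text, <a>, text, <a>
child_tags = [c.name for c in p.children if isinstance(c, Tag)]
descendant_tags = [d.name for d in p.descendants if isinstance(d, Tag)]

print(child_count, child_tags)
print(descendant_tags)  # the nested <span> appears only among the descendants
```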

 

Parents and ancestors

from bs4 import BeautifulSoup

html = """
    <html><head><title>This is a test Html code</title></head>
    <body>
    <p class="story">Once upon a time there were three little sisters;and their names lll
    <a href="http://www.baidu.com" class="sister" id="link1">
        <span>Elsle</span>
    </a>
    <a href="http://www.baidu.com" class="sister" id="link2">Lacie</a> and
    <a href="http://www.baidu.com" class="sister" id="link3">Tille</a>;
    and They lived at the bottom of a well.</p>
    <p class="story">...</p>
"""
soup = BeautifulSoup(html, 'lxml')
print(soup.a.parent)

Result: the first a tag is located, then its parent node, and the whole p tag is printed with everything it contains.

<p class="story">Once upon a time there were three little sisters;and their names lll
<a class="sister" href="http://www.baidu.com" id="link1">
<span>Elsle</span>
</a>
<a class="sister" href="http://www.baidu.com" id="link2">Lacie</a> and
<a class="sister" href="http://www.baidu.com" id="link3">Tille</a>;
and They lived at the bottom of a well.</p>

 

Ancestors

soup.a.parents  # a generator over every ancestor of the first <a>, innermost first: the p tag, then body, then html, and finally the BeautifulSoup object itself
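
Since parents is lazy, collecting the tag names into a list is an easy way to inspect it. A small sketch on a made-up document, assuming html.parser:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<html><body><p><a>Lacie</a></p></body></html>", "html.parser")

# Walk outward from the first <a>; the last entry is the BeautifulSoup object,
# whose name is the special string "[document]".
ancestor_names = [node.name for node in soup.a.parents]
print(ancestor_names)
```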

 

Siblings

from bs4 import BeautifulSoup

html = """
    <html><head><title>This is a test Html code</title></head>
    <body>
    <p class="story">Once upon a time there were three little sisters;and their names lll
    <a href="http://www.baidu.com" class="sister" id="link1">
        <span>Elsle</span>
    </a>
    <a href="http://www.baidu.com" class="sister" id="link2">Lacie</a> and
    <a href="http://www.baidu.com" class="sister" id="link3">Tille</a>;
    and They lived at the bottom of a well.</p>
    <p class="story">...</p>
"""
soup = BeautifulSoup(html, 'lxml')
print(list(enumerate(soup.a.next_siblings)))      # every sibling node after the first <a>
print(list(enumerate(soup.a.previous_siblings)))  # every sibling node before the first <a>
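
The sibling iterators return text nodes too, so the node right after a tag is often just punctuation or whitespace. The sketch below, on a made-up one-line fragment parsed with html.parser, filters the siblings down to tags:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html = '<p><a id="link1">Elsle</a>, <a id="link2">Lacie</a> and <a id="link3">Tille</a>;</p>'
first = BeautifulSoup(html, "html.parser").a

immediate = first.next_sibling  # the text node ", " and not the next <a>
sibling_ids = [s["id"] for s in first.next_siblings if isinstance(s, Tag)]

print(repr(immediate))
print(sibling_ids)
```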

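With the tag selectors described above it is hard to pinpoint a specific element, because attribute-style access such as soup.p only ever returns the first match. BeautifulSoup therefore also provides standard selectors which, like CSS selectors, can search the document by tag name, attributes, or text.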

 

Standard selectors

  • find_all(name,attrs,recursive,text,**kwargs)

name: the tag name

from bs4 import BeautifulSoup
html = """
<div class="pannel">
    <div class="pannel-heading">
        <h4>Hello</h4>
    </div>
    <div class="pannel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">That's ok</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">FOO</li>
            <li class="element">BAR</li>
        </ul>
    </div>
</div>            
"""
soup = BeautifulSoup(html,'lxml')
print(soup.find_all('ul'))           # find_all returns a list: every <ul>, including everything inside it
print(type(soup.find_all('ul')[0]))  # each item in the list is a Tag

Output:

[<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">That's ok</li>
</ul>, <ul class="list list-small" id="list-2">
<li class="element">FOO</li>
<li class="element">BAR</li>
</ul>]
<class 'bs4.element.Tag'>

Because every item in the list returned by find_all is a bs4.element.Tag, you can search inside each one in turn. This allows nested, level-by-level lookups.

from bs4 import BeautifulSoup
html = """
<div class="pannel">
    <div class="pannel-heading">
        <h4>Hello</h4>
    </div>
    <div class="pannel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">That's ok</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">FOO</li>
            <li class="element">BAR</li>
        </ul>
    </div>
</div>            
"""
soup = BeautifulSoup(html,'lxml')
for ul in soup.find_all('ul'):
    print(ul.find_all('li'))

Output: the li elements under each ul.

[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">That's ok</li>]
[<li class="element">FOO</li>, <li class="element">BAR</li>]
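
find_all also accepts limit and recursive, which control how many results are returned and how deep the search goes. A short sketch on a trimmed copy of the sample markup, assuming html.parser:

```python
from bs4 import BeautifulSoup

html = """
<div class="pannel-body">
    <ul class="list" id="list-1">
        <li class="element">Foo</li>
        <li class="element">Bar</li>
    </ul>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

first_only = soup.find_all("li", limit=1)             # stop after the first match
direct_li = soup.div.find_all("li", recursive=False)  # direct children of <div> only
nested_li = soup.div.find_all("li")                   # default: search all descendants

print(first_only)
print(direct_li)
print(len(nested_li))
```

The li elements sit inside the ul, not directly inside the div, so the non-recursive search comes back empty.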

 

attrs

  • find_all(attrs={'name':'element'}) finds every element whose name attribute equals 'element'

from bs4 import BeautifulSoup
html = """
<div class="pannel">
    <div class="pannel-heading">
        <h4>Hello</h4>
    </div>
    <div class="pannel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">That's ok</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">FOO</li>
            <li class="element">BAR</li>
        </ul>
    </div>
</div>            
"""
soup = BeautifulSoup(html,'lxml')
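# Common attributes can also be passed as keyword arguments: class_="list"
# (with a trailing underscore, because class is a Python keyword) or id="list-1"
print(soup.find_all(attrs={"class": "list"}))
print(soup.find_all(attrs={"id": "list-1"}))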
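
The keyword forms mentioned in the comment can be sketched as follows. The fragment is a trimmed copy of the sample lists and html.parser is assumed. Note that a class search matches any one of an element's classes, so class_="list" also finds the ul whose class is "list list-small".

```python
from bs4 import BeautifulSoup

html = """
<ul class="list" id="list-1"><li class="element">Foo</li></ul>
<ul class="list list-small" id="list-2"><li class="element">FOO</li></ul>
"""
soup = BeautifulSoup(html, "html.parser")

by_class = [ul["id"] for ul in soup.find_all(class_="list")]
small = [ul["id"] for ul in soup.find_all("ul", class_="list-small")]
by_id = soup.find_all(id="list-1")

print(by_class)
print(small)
print(len(by_id))
```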

 

text

  • find_all(text="Foo")

from bs4 import BeautifulSoup
html = """
<div class="pannel">
    <div class="pannel-heading">
        <h4>Hello</h4>
    </div>
    <div class="pannel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">That's ok</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">FOO</li>
            <li class="element">BAR</li>
        </ul>
    </div>
</div>            
"""
soup = BeautifulSoup(html,'lxml')
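print(soup.find_all(text="Foo"))  # Beautiful Soup 4.4.0 and later also accept this argument under the name string

Output: ['Foo']

This returns the matching strings themselves rather than the elements that contain them, so on its own it mostly tells you whether the text is present.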
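
A text search becomes more useful when combined with a tag name or followed by .parent, which leads back to the enclosing element. The sketch below assumes Beautiful Soup 4.4.0 or later (for the string argument) and html.parser, and uses a flattened copy of the sample list:

```python
import re
from bs4 import BeautifulSoup

html = '<ul><li class="element">Foo</li><li class="element">Bar</li><li class="element">FOO</li></ul>'
soup = BeautifulSoup(html, "html.parser")

# Case-insensitive match on the text nodes themselves.
strings = soup.find_all(string=re.compile("foo", re.IGNORECASE))
# Step from each matching string up to the element that contains it.
items = [s.parent for s in strings]
# Or ask directly for the <li> whose text is exactly "Foo".
exact = soup.find("li", string="Foo")

print(strings)
print(items)
print(exact)
```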

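  • find(name,attrs,recursive,text,**kwargs)

Returns the first matching element, or None if nothing matches, whereas find_all returns a list of all matching elements.

  • find_parents(), find_parent()

Analogous to find_all() and find(): they return all matching ancestors and the nearest matching ancestor, respectively.

  • find_next_siblings(), find_next_sibling()

Return all matching siblings after the node, and the first matching sibling after it.

  • find_previous_siblings(), find_previous_sibling()

Return all matching siblings before the node, and the first matching sibling before it.

  • find_all_next(), find_next()

Return all matching nodes that come after the node in the document, and the first such node.

  • find_all_previous(), find_previous()

Return all matching nodes that come before the node in the document, and the first such node.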
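
A few of the methods listed above can be tried together on a trimmed copy of the sample markup. This is only an illustrative sketch, assuming html.parser; the variable names are arbitrary.

```python
from bs4 import BeautifulSoup

html = """
<div class="pannel-body">
<ul class="list" id="list-1"><li>Foo</li><li>Bar</li></ul>
<ul class="list list-small" id="list-2"><li>FOO</li></ul>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

first_li = soup.find("li")                    # first match only
missing = soup.find("table")                  # None: nothing matches
parent_ul = first_li.find_parent("ul")        # nearest <ul> ancestor
next_li = first_li.find_next_sibling("li")    # next <li> under the same parent
later_li = [li.get_text() for li in first_li.find_all_next("li")]  # every later <li> in the document
next_ul = parent_ul.find_next_sibling("ul")   # text nodes in between are skipped

print(missing, parent_ul["id"], next_li.get_text(), later_li, next_ul["id"])
```

Because find() can return None, it is worth checking the result before chaining further calls on it.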

 

CSS selectors

Pass a CSS selector straight to select() to make the selection.

from bs4 import BeautifulSoup
html = """
<div class="pannel">
    <div class="pannel-heading">
        <h4>Hello</h4>
    </div>
    <div class="pannel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">That's ok</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">FOO</li>
            <li class="element">BAR</li>
        </ul>
    </div>
</div>            
"""
soup = BeautifulSoup(html,'lxml')
print(soup.select('.pannel .pannel-heading'))  # elements with class pannel-heading inside an element with class pannel
print(soup.select('ul li'))                    # every <li> inside a <ul>, contents included
print(soup.select('#list-2 .element'))         # elements with class element inside the element with id list-2

Output:

[<div class="pannel-heading">
<h4>Hello</h4>
</div>]
[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">That's ok</li>, <li class="element">FOO</li>, <li class="element">BAR</li>]
[<li class="element">FOO</li>, <li class="element">BAR</li>]
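
select() always returns a list. When a single element is wanted, select_one() returns the first match or None. The sketch below also shows a compound class selector with a child combinator. It assumes html.parser and a bs4 installation with its usual CSS selector support.

```python
from bs4 import BeautifulSoup

html = """
<div class="pannel-body">
    <ul class="list" id="list-1"><li class="element">Foo</li><li class="element">Bar</li></ul>
    <ul class="list list-small" id="list-2"><li class="element">FOO</li></ul>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

first = soup.select_one("#list-1 .element")      # first match, as a Tag
small = soup.select("ul.list.list-small > li")   # <li> children of the ul carrying both classes
no_match = soup.select_one("table")              # None when nothing matches
empty = soup.select("table")                     # select() gives an empty list instead

print(first.get_text(), [li.get_text() for li in small], no_match, empty)
```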

 

Getting attributes

from bs4 import BeautifulSoup
html = """
<div class="pannel">
    <div class="pannel-heading">
        <h4>Hello</h4>
    </div>
    <div class="pannel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">That's ok</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">FOO</li>
            <li class="element">BAR</li>
        </ul>
    </div>
</div>            
"""
soup = BeautifulSoup(html,'lxml')
for ul in soup.select('ul'):
    print(ul['id'])        # the value of this <ul>'s id attribute
    print(ul.attrs['id'])  # the same value via the attrs dict; either form works for any attribute

 

Getting text with get_text()

from bs4 import BeautifulSoup
html = """
<div class="pannel">
    <div class="pannel-heading">
        <h4>Hello</h4>
    </div>
    <div class="pannel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">That's ok</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">FOO</li>
            <li class="element">BAR</li>
        </ul>
    </div>
</div>            
"""
soup = BeautifulSoup(html,'lxml')
for li in soup.select('li'):
    print(li.get_text())

Output:

Foo
Bar
That's ok
FOO
BAR
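
get_text() can also be called on a container, where its separator and strip arguments help tidy the result. A minimal sketch on the first sample list, assuming html.parser:

```python
from bs4 import BeautifulSoup

html = """
<ul class="list" id="list-1">
    <li class="element">Foo</li>
    <li class="element">Bar</li>
    <li class="element">That's ok</li>
</ul>
"""
ul = BeautifulSoup(html, "html.parser").ul

raw = ul.get_text()                               # keeps the newlines and indentation between items
joined = ul.get_text(separator=", ", strip=True)  # trims each piece and drops the empty ones

print(repr(raw))
print(joined)
```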

 

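Summary

  • Prefer the lxml parser; fall back to html.parser or html5lib when necessary
  • Tag selectors are fast but offer little filtering
  • Use find() and find_all() to match a single result or multiple results
  • If you are comfortable with CSS selectors, use select()
  • Remember the common ways to get attributes and text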

 

