Python Crawler 06: Data Parsing with BeautifulSoup

Reposted article. Originally published 2021-02-21 01:30. Tag: Python crawler study

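The previous chapter covered the most direct way to pull data out of a page: regular expressions. This chapter introduces a somewhat friendlier parsing method, beautifulsoup4 (bs4). bs4 is a Python library, so this approach is specific to Python. Regex-based parsing, as the name says, depends only on regular expressions and is therefore not tied to any programming language.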

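The bs4 parsing workflow

As explained earlier, data parsing comes down to locating tags and then reading the data stored in a tag or in its attributes. Following that idea, parsing with bs4 goes like this:

Instantiate a BeautifulSoup object and load the page source into it.
Call the object's attributes and methods to locate tags and extract data.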
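
The two steps above can be sketched as follows. This is a minimal illustration, not the article's own code: the HTML string is a made-up stand-in for a real page, and the built-in "html.parser" is used only so the snippet runs without lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical page source, used only for illustration.
html = '<html><body><p class="title"><b>The Dormouse\'s story</b></p></body></html>'

# Step 1: instantiate a BeautifulSoup object and load the page source into it.
soup = BeautifulSoup(html, "html.parser")

# Step 2: locate a tag, then extract its text and an attribute value.
p_tag = soup.p
text = p_tag.get_text()
css_classes = p_tag["class"]

print(text)         # The Dormouse's story
print(css_classes)  # ['title']
```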

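Installing bs4

bs4 can be installed directly with pip (the package is published on PyPI as beautifulsoup4). After that you also need to install the lxml parser.

pip install beautifulsoup4
pip install lxml

During installation you can use -i to point pip at a mirror in mainland China.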

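BeautifulSoup supports the HTML parser in the Python standard library as well as several third-party parsers. The table below compares them.

Parser	Usage	Advantages	Disadvantages
Python standard library	`BeautifulSoup(markup, "html.parser")`	Built into Python; moderate speed; lenient with malformed documents	Less lenient in versions before Python 2.7.3 and 3.2.2
lxml HTML parser	`BeautifulSoup(markup, "lxml")`	Fast; lenient with malformed documents	Requires an external C library
lxml XML parser	`BeautifulSoup(markup, "lxml-xml")` `BeautifulSoup(markup, "xml")`	Fast; the only parser that supports XML	Requires an external C library
html5lib	`BeautifulSoup(markup, "html5lib")`	Most lenient; parses pages the way a browser does; produces valid HTML5	Slow; requires an external Python package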
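
Because the third-party parsers may not be installed everywhere, one possible pattern is to try the preferred parser and fall back to the built-in one. The sketch below assumes a helper named make_soup, which is hypothetical and not part of bs4; it relies on bs4 raising FeatureNotFound when the requested parser is unavailable.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred="lxml"):
    """Hypothetical helper: parse with `preferred`, else with html.parser."""
    try:
        return BeautifulSoup(markup, preferred)
    except FeatureNotFound:
        # The preferred parser is not installed; use the standard library one.
        return BeautifulSoup(markup, "html.parser")


soup = make_soup("<p>hello</p>")
print(soup.p.string)  # hello
```

Keep in mind that different parsers can build slightly different trees from malformed markup, so a fallback like this is best suited to reasonably well-formed pages.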

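Instantiating BeautifulSoup

There are two situations when instantiating BeautifulSoup: loading a local HTML document, and loading data crawled from the web.

Loading a local HTML file

First, write a simple HTML file for the examples that follow (file name: test.html).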

<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    Elsie
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link2">
    Tillie
   </a>
   ; and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>

There are two ways to instantiate from a local file.

Method 1

Pass the file handle directly.

from bs4 import BeautifulSoup

# Pass the open file handle straight to BeautifulSoup
with open('./test.html','r',encoding='utf-8') as f:
    soup = BeautifulSoup(f,'lxml')
    print(soup.title)

########## Output ##########
<title>
   The Dormouse's story
  </title>

This uses a bs4 attribute, .title, to get the document's <title> tag.

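Method 2

The second way is to read the document first and then instantiate.

from bs4 import BeautifulSoup

# Read the whole file into a string, then parse the string
with open('./test.html','r',encoding='utf-8') as f:
    data = f.read()
soup = BeautifulSoup(data,'lxml')
print(soup.title)

########## Output ##########
<title>
   The Dormouse's story
  </title>

Both approaches give the same result; which one to use is a matter of personal preference.

Loading crawled content

Crawled content is loaded the same way as in Method 2: fetch the HTML with the requests module's get, then instantiate directly.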
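
A minimal sketch of that idea is shown here. The function name fetch_soup, the timeout value and the default parser are assumptions made for illustration; only requests.get and the BeautifulSoup constructor come from the text above.

```python
import requests
from bs4 import BeautifulSoup


def fetch_soup(url, parser="html.parser", timeout=10):
    """Hypothetical helper: download a page and return it as a soup object."""
    resp = requests.get(url, timeout=timeout)
    resp.raise_for_status()  # fail early on 4xx/5xx responses
    # Same as Method 2: hand the downloaded text to BeautifulSoup.
    return BeautifulSoup(resp.text, parser)
```

A call such as fetch_soup('https://example.com').title would then return the page's title tag, provided the request succeeds.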

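Working with the BeautifulSoup object

Working with the BeautifulSoup object is the focus of this section. We keep using the test.html file above to show how BeautifulSoup is normally used to parse data.

During instantiation, BeautifulSoup turns a complex HTML document into a tree. Every node in the tree is a Python object, and all of these objects fall into four kinds:

Tag
NavigableString
BeautifulSoup
Comment

Tag

A Tag is the same thing as a tag in HTML. The earlier examples already used .title once: writing soup.<tag name> returns the first matching tag in the soup object. Tags have many attributes and methods, which are covered in detail under navigating and searching the document tree. Here we look at just one: getting a tag's attributes (attrs).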

from bs4 import BeautifulSoup

with open('./test.html','r',encoding='utf-8') as f:
    soup = BeautifulSoup(f,'lxml')
    tag = soup.a          # first <a> tag in the document
    print(tag.attrs)      # all of its attributes as a dict

########## Output ##########
{'class': ['sister'], 'href': 'http://example.com/elsie', 'id': 'link1'}

Because a tag can carry several attributes, attrs is a dictionary. To get one particular attribute, such as class, index it like a dictionary (['class']).
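
As a small illustration of dictionary-style access, the sketch below reads individual attributes from a single tag. The inline HTML and the use of "html.parser" are assumptions chosen to keep the snippet self-contained.

```python
from bs4 import BeautifulSoup

html = '<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>'
soup = BeautifulSoup(html, "html.parser")
tag = soup.a

href = tag["href"]         # same as tag.attrs["href"]
classes = tag["class"]     # class is multi-valued, so this is a list
missing = tag.get("title") # .get() returns None instead of raising KeyError

print(href)     # http://example.com/elsie
print(classes)  # ['sister']
print(missing)  # None
```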

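Multi-valued attributes

In HTML5 some tag attributes can hold multiple values, the most common being class. In that case the value comes back as a list, even when the attribute holds only one value (like class above, which only contains sister).

If an attribute looks like it has multiple values but no version of the HTML standard defines it as multi-valued, BeautifulSoup returns the value as a single string.

from bs4 import BeautifulSoup

st = '<div id="c1 c2">123</div>'
soup = BeautifulSoup(st,'lxml')

tag = soup.div
print(tag.attrs['id'])  # id is not multi-valued, so this is one string
########## Output ##########
c1 c2

Because id can only hold one value, the result is one whole string even though it looks space-separated.

There is one more case: if we specify xml as the parser, multi-valued attributes are also returned as a single string.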

from bs4 import BeautifulSoup

st = '<div class="c1 c2">123</div>'
soup = BeautifulSoup(st,'xml')  # XML parser: no multi-valued attributes

tag = soup.div
print(tag.attrs['class'])

########## Output ##########
c1 c2

Note that xml was specified as the parser when instantiating above.
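
If the calling code would rather always receive a list, whether or not the attribute is multi-valued, Tag.get_attribute_list() can be used. The sketch below is only an illustration; the inline HTML and the "html.parser" choice are assumptions.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<div id="c1 c2" class="c1 c2">123</div>', "html.parser")
tag = soup.div

id_value = tag["id"]                        # plain string
class_value = tag["class"]                  # list, because class is multi-valued
id_as_list = tag.get_attribute_list("id")   # always a list

print(id_value)     # c1 c2
print(class_value)  # ['c1', 'c2']
print(id_as_list)   # ['c1 c2']
```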

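Searching the document tree

Since bs4 is mostly used to parse data, searching the document tree is its most frequently used feature. The two most common search methods are:

find()
find_all()

The other search methods take similar arguments and work in a similar way, so they are easy to pick up by analogy.

Using find_all()

find_all() searches the content of the document; its behavior depends on the kind of argument passed.

String

Pass a string, usually a tag name, and it returns every tag of that type, including its contents, as a list. We again use the HTML page from before.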

from bs4 import BeautifulSoup

with open('./test.html','r',encoding='utf-8') as f:
    soup = BeautifulSoup(f,'lxml')

    # A string argument is treated as a tag name
    print(soup.find_all('a'))

########## Output ##########

[<a class="sister" href="http://example.com/elsie" id="link1">
    Elsie
   </a>, <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>, <a class="sister" href="http://example.com/tillie" id="link2">
    Tillie
   </a>]

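In the example above, find_all looks up all <a> tags.

Regular expression

When a regular expression is passed in, BeautifulSoup matches tag names with the pattern's search() method, so the pattern may match anywhere in the name.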

from bs4 import BeautifulSoup
import re

with open('./test.html','r',encoding='utf-8') as f:
    soup = BeautifulSoup(f,'lxml')

    # Match every tag whose name contains the letter "d"
    tags = soup.find_all(re.compile('d'))
    for tag in tags:
        print(tag.name)
########## Output ##########
head
body

The code above prints the names of all tags whose name contains the letter d.
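
To restrict the match to the start of the tag name, the pattern can be anchored with ^. The following sketch contrasts the two patterns on a small made-up document; the HTML and the "html.parser" choice are assumptions for illustration.

```python
import re
from bs4 import BeautifulSoup

html = "<html><head><title>t</title></head><body><p><b>x</b></p></body></html>"
soup = BeautifulSoup(html, "html.parser")

# Unanchored: any tag name that contains "d".
contains_d = [t.name for t in soup.find_all(re.compile("d"))]
# Anchored: only tag names that start with "b".
starts_with_b = [t.name for t in soup.find_all(re.compile("^b"))]

print(contains_d)     # ['head', 'body']
print(starts_with_b)  # ['body', 'b']
```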

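List

When a list is passed in, any tag that matches any element of the list is returned.

from bs4 import BeautifulSoup

with open('./test.html','r',encoding='utf-8') as f:
    soup = BeautifulSoup(f,'lxml')
    # Match tags named either "a" or "b"
    print(soup.find_all(['a','b']))
########## Output ##########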
[<b>
    The Dormouse's story
   </b>, <a class="sister" href="http://example.com/elsie" id="link1">
    Elsie
   </a>, <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>, <a class="sister" href="http://example.com/tillie" id="link2">
    Tillie
   </a>]

The example above returns all <a> and <b> tags.

True

True matches everything, so it finds all the tags.

from bs4 import BeautifulSoup

with open('./test.html','r',encoding='utf-8') as f:
    soup = BeautifulSoup(f,'lxml')
    # True matches every tag, but not the text strings between them
    tags = soup.find_all(True)
    for tag in tags:
        print(tag.name)

########## Output ##########
html
head
title
body
p
b
p
a
a
a
p

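Function

We can also pass in a function. Note that it must take a single tag as its only argument. It should return True if the tag matches and False otherwise.

Say we want the tags that have both a class and an id attribute. We first define a function and then pass it in as the argument.

from bs4 import BeautifulSoup

def has_class_and_id(tag):
    # True only for tags that carry both a class and an id attribute
    return tag.has_attr('class') and tag.has_attr('id')

with open('./test.html','r',encoding='utf-8') as f:
    soup = BeautifulSoup(f,'lxml')
    tags = soup.find_all(has_class_and_id)
    for tag in tags:
        print(tag)
########## Output ##########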
<a class="sister" href="http://example.com/elsie" id="link1">
    Elsie
   </a>
<a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
<a class="sister" href="http://example.com/tillie" id="link2">
    Tillie
   </a>

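The function checks whether a tag has both the class and id attributes. If it does, the tag is returned; otherwise it is not a match.

find_all() has a few more usage details, which are covered below together with find(), since the two work the same way. The difference is that find_all() returns a list while find() returns the first matching element. When nothing matches, find_all() returns an empty list and find() returns None.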
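
The difference in return values can be seen in a short sketch like the one below. The inline HTML and the "html.parser" choice are assumptions made to keep it self-contained.

```python
from bs4 import BeautifulSoup

html = '<p><a id="link1">Elsie</a><a id="link2">Lacie</a></p>'
soup = BeautifulSoup(html, "html.parser")

first_link = soup.find("a")          # a single Tag
all_links = soup.find_all("a")       # a list-like ResultSet of Tags
missing_one = soup.find("table")     # None when nothing matches
missing_all = soup.find_all("table") # empty list when nothing matches

print(first_link["id"])   # link1
print(len(all_links))     # 2
print(missing_one)        # None
print(len(missing_all))   # 0
```

Because find() can return None, it is worth checking the result before accessing attributes on it.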

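Using find()

find() accepts the following parameters:

find(name, attrs, recursive, string, **kwargs)

We go through them one at a time.

The name parameter

Plain name

The name parameter looks up tags whose name is name. It also accepts any of the filters described above. The many examples so far all use it this way, so no further example is given.

Keyword arguments

If a keyword argument's name is not one of the built-in search parameters, the search treats it as a filter on the tag attribute with that name, for example id or href.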

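from bs4 import BeautifulSoup

with open('./test.html','r',encoding='utf-8') as f:
    soup = BeautifulSoup(f,'lxml')
    # href is not a built-in parameter, so it filters on the href attribute
    tag1 = soup.find(href='http://example.com/elsie')
    print('tag1\n',tag1)

    # Two tags share id="link2" in test.html; find() returns the first one
    tag2 = soup.find(id='link2')
    print('tag2\n',tag2)

########## Output ##########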
tag1
 <a class="sister" href="http://example.com/elsie" id="link1">
    Elsie
   </a>
tag2
 <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>

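Searching by CSS class (class)

class is a reserved keyword in Python, so when searching by CSS class use class_ instead.

from bs4 import BeautifulSoup

with open('./test.html','r',encoding='utf-8') as f:
    soup = BeautifulSoup(f,'lxml')
    # class_ (with a trailing underscore) filters on the CSS class
    tag = soup.find(class_='sister')
    print(tag)

########## Output ##########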
<a class="sister" href="http://example.com/elsie" id="link1">
    Elsie
   </a>

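This approach is used very often, so keep it in mind!

The name parameter can be combined with keyword arguments. For example, to find the a tag whose id is link1:

soup.find('a',id='link1')

This works as expected.

The string parameter

The string parameter searches for string content in the document. I hit a pitfall here: when using string= with a plain string, the whole string has to match exactly, so text that also contains newlines will not match!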
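
One way around this pitfall is to pass a compiled regular expression instead of a plain string. The sketch below uses a made-up fragment formatted like test.html (text surrounded by newlines and spaces) and "html.parser"; both are assumptions for illustration.

```python
import re
from bs4 import BeautifulSoup

html = '<p><a id="link1">\n    Elsie\n   </a></p>'
soup = BeautifulSoup(html, "html.parser")

# Exact match fails: the real string is "\n    Elsie\n   ", not "Elsie".
exact = soup.find(string="Elsie")
# A regex is applied with search(), so surrounding whitespace does not matter.
fuzzy = soup.find(string=re.compile("Elsie"))
# Combined with a tag name, the matching tag is returned instead of the string.
link = soup.find("a", string=re.compile("Elsie"))

print(exact)          # None
print(fuzzy.strip())  # Elsie
print(link["id"])     # link1
```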

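Using selectors

BeautifulSoup also provides a .select() method that works as a CSS selector. It handles simple id, class and tag selectors, and it can also be used as a hierarchical selector.

tag = soup.select('.story>a')

The code above finds every a tag that is a direct child of an element with class story. Note that the result is a list.

Using hierarchical selectors

For demonstration, here is a short snippet. The a tags in it are left unclosed; the parser repairs them, so the result is not affected.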

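from bs4 import BeautifulSoup

s = """<div class='test'>
     <ul>
      <li><span><a>1</span></li>
      <li><span><a>2</span></li>
      <li><span><a>3</span></li>
    </ul>
    </div>
    """
# Name the parser explicitly so bs4 does not have to guess one
soup = BeautifulSoup(s,'lxml')

Pay attention to the hierarchy.

The greater-than sign > stands for exactly one level (a direct child).

print(soup.select('.test>ul>li>span'))
########## Output ##########
[<span><a>1</a></span>, <span><a>2</a></span>, <span><a>3</a></span>]

A space stands for any number of levels in between (a descendant).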
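
The difference between > and a space can be checked with a short sketch such as this one. The HTML is a simplified, well-formed variant of the snippet above and "html.parser" is used; both are assumptions for illustration.

```python
from bs4 import BeautifulSoup

html = (
    '<div class="test"><ul>'
    "<li><span><a>1</a></span></li>"
    "<li><span><a>2</a></span></li>"
    "</ul></div>"
)
soup = BeautifulSoup(html, "html.parser")

# ">" requires a direct child: no <a> sits directly under .test.
direct = soup.select(".test>a")
# A space allows any depth: every <a> somewhere inside .test matches.
nested = soup.select(".test a")

print(len(direct))                     # 0
print([a.get_text() for a in nested])  # ['1', '2']
```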

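There are a few other usages as well.

Selecting by the presence of an attribute

soup.select('a[href]')

This finds the a tags that have an href attribute.

Selecting by attribute value

tag = soup.select('a[id="link2"]')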
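
Both attribute selectors can be tried on a small fragment, as in the sketch below. The inline HTML (including the id values) and the "html.parser" choice are made up for illustration.

```python
from bs4 import BeautifulSoup

html = (
    '<p><a href="http://example.com/elsie" id="link1">Elsie</a>'
    '<a id="link2">Lacie</a>'
    '<a href="http://example.com/tillie" id="link3">Tillie</a></p>'
)
soup = BeautifulSoup(html, "html.parser")

with_href = soup.select("a[href]")      # <a> tags that have an href attribute
by_id = soup.select('a[id="link2"]')    # <a> tags whose id equals link2

print([a["id"] for a in with_href])     # ['link1', 'link3']
print([a.get_text() for a in by_id])    # ['Lacie']
```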

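Getting the text inside tags

With the methods above we can locate whatever tag we need. The next step is getting the text inside it. There are several ways to do this:

soup.text
soup.string
soup.get_text()
soup.contents

Suppose we have the following HTML: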

from bs4 import BeautifulSoup

s = """<div class='test'>
    <span>span標簽
        <a>a標簽內</a>
    </span>
    </div>
    """
soup = BeautifulSoup(s,'lxml')  # name the parser explicitly

Here is how these differ.

.text returns all the text under a tag, as a single string.

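tag = soup.select_one('span')
print('txt',tag.text)       # all text under <span>, including the nested <a>
print('string',tag.string)  # None, because <span> has more than one child
########## Output ##########
txt span標簽
        a標簽內

string None

.string is None here. .string yields a NavigableString, and when a tag contains more than one child node, .string cannot tell which one is meant, so it returns None. When the tag has exactly one child node, .string returns that value.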

soup = BeautifulSoup(s,'lxml')

tag = soup.select_one('a')
print('txt',tag.text)
print('string',tag.string)   # <a> has a single child, so .string works
print(type(tag.string))
########## Output ##########
txt a標簽內
string a標簽內
<class 'bs4.element.NavigableString'>

The .text attribute and the get_text() method produce the same result.
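
get_text() additionally accepts a separator and a strip flag, which helps when the raw text is full of newlines and indentation. The sketch below is only an illustration; the English sample text and the "html.parser" choice are assumptions.

```python
from bs4 import BeautifulSoup

html = "<div><span>outer text\n  <a>inner text</a>\n</span></div>"
soup = BeautifulSoup(html, "html.parser")
span = soup.span

# .text and get_text() with no arguments return the same string.
same = span.text == span.get_text()
# Strip each piece of text and join the non-empty pieces with one space.
cleaned = span.get_text(" ", strip=True)

print(same)     # True
print(cleaned)  # outer text inner text
```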

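Sometimes, though, we only want the content at the first level, and text is clearly inconvenient for that. This is where the last attribute comes in.

tag = soup.select_one('span')
print(tag.contents[0])  # first child node: the text directly inside <span>
########## Output ##########
span標簽

contents is mainly meant for listing a tag's child nodes; the usage shown here is not its main purpose.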

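bs4 also provides a NavigableString class for working with these strings. It is not covered further here; see the explanation in the official documentation.

Worked example

Now that we have a rough idea of how bs4 is used, let's put it into practice.

Scraping Romance of the Three Kingdoms

Requirement: scrape every chapter title and the content of every chapter of the novel Romance of the Three Kingdoms.

url:https://www.shicimingju.com/book/sanguoyanyi.html

Let's go through it step by step. First, look at the HTML of the original page.

All we need is to locate the link in each a tag and the chapter title that follows it. Let's see how to get that data.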

import requests
from bs4 import BeautifulSoup

if __name__ == '__main__':
    
    url = 'https://www.shicimingju.com/book/sanguoyanyi.html'
    
    resp = requests.get(url=url)
    soup = BeautifulSoup(resp.text,'lxml')
    
    title_tags = soup.select('.book-mulu>ul a')
    print(title_tags)

This reveals a problem: the Chinese text comes out garbled. Here are the first two elements of the list:

<a href="/book/sanguoyanyi/1.html">ç¬¬ä¸åÂ·å®´æ¡åè±ªæ°ä¸ç»ä¹  æ©é»å·¾è±éé¦ç«å</a>, 
<a href="/book/sanguoyanyi/2.html">ç¬¬äºåÂ·å¼ ç¿¼å¾·æéç£é®    ä½å½èè°è¯å®¦ç«</a>

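The page's HTML source declares its encoding as utf8, so why does this happen? Print the encoding that the response is using:

print(resp.encoding)
########## Output ##########
ISO-8859-1

resp.text returns the response body decoded to a Unicode string using resp.encoding. The encoding here is ISO-8859-1 rather than UTF-8, so the text is decoded incorrectly. How do we get it decoded as UTF-8?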

import requests
from bs4 import BeautifulSoup

if __name__ == '__main__':
    
    url = 'https://www.shicimingju.com/book/sanguoyanyi.html'
    
    resp = requests.get(url=url)
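    resp.encoding = 'utf-8'  # set the encoding used to decode resp.text
    soup = BeautifulSoup(resp.text,'lxml')
    
    title_tags = soup.select('.book-mulu>ul a')
    print(title_tags[0])
########## Output ##########
<a href="/book/sanguoyanyi/1.html">第一回·宴桃園豪傑三結義  斬黃巾英雄首立功</a>

That fixes it. Note that the encoding is changed with an assignment, not a method call. Once the body is decoded with the specified encoding, the data comes out correctly. Also look at how the a tags are located: fairly simple, isn't it?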
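
The garbling itself can be reproduced offline, which may make the cause easier to see. The sketch below simulates UTF-8 bytes being decoded as ISO-8859-1 and then repaired. It also shows an alternative that is sometimes used: passing raw bytes (the equivalent of resp.content) to BeautifulSoup so that it detects the encoding itself. The sample strings and the "html.parser" choice are assumptions for illustration.

```python
from bs4 import BeautifulSoup

original = "第一回"
raw_bytes = original.encode("utf-8")       # bytes as sent by the server
garbled = raw_bytes.decode("iso-8859-1")   # decoded with the wrong codec
repaired = garbled.encode("iso-8859-1").decode("utf-8")

print(garbled == original)   # False
print(repaired == original)  # True

# Alternative: hand BeautifulSoup the bytes and let it work out the encoding.
page_bytes = (
    '<html><head><meta charset="utf-8"></head>'
    "<body><a>第一回</a></body></html>"
).encode("utf-8")
soup = BeautifulSoup(page_bytes, "html.parser")
print(soup.a.string == original)  # True
```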

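To crawl the content of each chapter, define a dictionary in which each key-value pair stores a chapter title and its link. Note that the link in href is a relative path, so 'https://www.shicimingju.com' has to be prepended.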

import requests
from bs4 import BeautifulSoup

if __name__ == '__main__':
    
    url = 'https://www.shicimingju.com/book/sanguoyanyi.html'
    
    resp = requests.get(url=url)
    resp.encoding = 'utf-8'
    soup = BeautifulSoup(resp.text,'lxml')
    
    title_tags = soup.select('.book-mulu>ul a')
    
    article_dic = {}
    for title in title_tags:
        article_dic[title.text]='https://www.shicimingju.com'+title['href']
        
    print(article_dic)

The code above builds the dictionary. Next, iterate over the dictionary and crawl the corresponding data.

import requests
from bs4 import BeautifulSoup

if __name__ == '__main__':
    
    url = 'https://www.shicimingju.com/book/sanguoyanyi.html'
    
    resp = requests.get(url=url)
    resp.encoding = 'utf-8'
    soup = BeautifulSoup(resp.text,'lxml')
    
    title_tags = soup.select('.book-mulu>ul a')
    
    article_dic = {}
    for title in title_tags:
        article_dic[title.text]='https://www.shicimingju.com'+title['href']
        
        
    # Fetch each chapter page and append its title and text to one file
    with open('三國演義.txt','w',encoding='utf-8') as f:
        for key in article_dic:
            f.write(key + '\n')  # chapter title on its own line
            url = article_dic[key]
            article_page = requests.get(url = url)
            article_page.encoding = 'utf-8'
            soup = BeautifulSoup(article_page.text,'lxml')
            article = soup.select_one('.chapter_content').text
            f.write(article + '\n')
            print(key,'finish')

And that completes the whole process!
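
As a possible refinement, the relative links can be resolved with urllib.parse.urljoin from the standard library instead of string concatenation. The sketch below works on a made-up table-of-contents fragment so that it runs offline. The function name chapter_links, the sample markup and the "html.parser" choice are all assumptions; only the selector and the page URL come from the example above.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE_URL = "https://www.shicimingju.com/book/sanguoyanyi.html"

# Hypothetical, trimmed-down stand-in for the real table-of-contents markup.
SAMPLE_HTML = """
<div class="book-mulu"><ul>
  <li><a href="/book/sanguoyanyi/1.html">Chapter 1</a></li>
  <li><a href="/book/sanguoyanyi/2.html">Chapter 2</a></li>
</ul></div>
"""


def chapter_links(html, base_url=BASE_URL, parser="html.parser"):
    """Map each chapter title to an absolute URL."""
    soup = BeautifulSoup(html, parser)
    return {
        a.get_text(strip=True): urljoin(base_url, a["href"])
        for a in soup.select(".book-mulu>ul a")
    }


links = chapter_links(SAMPLE_HTML)
print(links["Chapter 1"])  # https://www.shicimingju.com/book/sanguoyanyi/1.html
```

Keeping the parsing in a function that takes HTML text makes it possible to test the selector without touching the network.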
