Python Web Scraping for Beginners (3): The BeautifulSoup Library

Reposted article (originally published 2016-12-29 14:00). Tags: web scraping / Python

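1. Preface

The previous post showed how to use the requests module to send an HTTP request to a website and get back the page's HTML. This post shows how to use the BeautifulSoup module to extract the data we want from that HTML text.

Update on 2016-12-28: I forgot to link the official BeautifulSoup site earlier; it is added now, together with a few more notes on BeautifulSoup usage.

Update on 2017-08-16: Many readers have commented that Unsplash has been redesigned and that much of its content is now loaded dynamically. For dynamically loaded content I suggest using PhantomJS instead of the requests library (see my next post for PhantomJS). If the only change is that the class names used to locate elements in the HTML are different, locate the elements by their new names. What matters when learning to scrape is the logic of extracting the data; once you have that, it does not matter how the site changes.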

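2. Environment

My environment is as follows:

Operating system
Windows 10.
Python version
Python 3.5. I recommend the Anaconda distribution, which targets scientific computing, mainly because it ships with its own package manager (conda) and so avoids some package installation errors. Go to the Anaconda website, choose the Python 3.5 version, then download and install it.
IDE
I use PyCharm, an IDE built specifically for Python development. It is a JetBrains product.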

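3. Installing the modules

BeautifulSoup has several versions; we use BeautifulSoup4. For details, see the official BeautifulSoup4 documentation.
Open a cmd window with administrator privileges and run the following command to install it:
conda install beautifulsoup4

If you use plain Python 3.5 rather than Anaconda, install it with:
pip install beautifulsoup4

Next we install lxml, a parser that BeautifulSoup can use to parse HTML before extracting content.

With Anaconda, install lxml with:
conda install lxml

If you use plain Python 3.5, installing lxml directly with pip may fail (which is why I recommend Anaconda); in that case, see the installation tutorial linked in the original post.

If lxml is not installed and you do not name a parser, BeautifulSoup falls back to Python's built-in parser. We use lxml because it is fast.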

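The parsers compare as follows:

Parser	Usage	Advantages	Disadvantages
Python standard library	BeautifulSoup(markup, "html.parser")	1. Built into Python 2. Moderate speed 3. Lenient with malformed documents	Less lenient in versions before Python 2.7.3 or 3.2.2
lxml HTML parser	BeautifulSoup(markup, "lxml")	1. Fast 2. Lenient with malformed documents	Requires an external C library
lxml XML parser	BeautifulSoup(markup, "lxml-xml") BeautifulSoup(markup, "xml")	1. Fast 2. The only parser that supports XML	Requires an external C library
html5lib	BeautifulSoup(markup, "html5lib")	1. Most lenient 2. Parses documents the way a web browser does 3. Produces HTML5 documents	Slow; requires an external Python package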
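
The table suggests a simple strategy: prefer lxml for speed and fall back to the built-in parser when lxml is missing. Here is a minimal sketch of that idea. The helper name make_soup is hypothetical (it is not part of BeautifulSoup), and the sketch relies on BeautifulSoup raising FeatureNotFound when the requested parser is not installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup):
    """Parse with lxml when it is installed, otherwise use html.parser.

    make_soup is a hypothetical helper written for this sketch.
    """
    try:
        return BeautifulSoup(markup, "lxml")
    except FeatureNotFound:
        # lxml is not installed: fall back to the parser bundled with Python
        return BeautifulSoup(markup, "html.parser")


soup = make_soup("<p class='title'><b>Hello</b></p>")
print(soup.b.string)
```

Either parser handles this small snippet the same way; the difference shows up in speed and in how broken markup gets repaired.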

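4. Using the BeautifulSoup library

Official documentation I found online: the BeautifulSoup 4.4.0 Chinese documentation and the BeautifulSoup 4.2.0 Chinese documentation. Usage is much the same across versions, and the commonly used syntax is identical.

Let's first look at the kinds of objects BeautifulSoup produces, so that you know what you can do next with whatever you get back.

4.1 BeautifulSoup object types

Beautiful Soup transforms a complex HTML document into a tree in which every node is a Python object. All objects fall into four types: Tag, NavigableString, BeautifulSoup, and Comment. Let's look at each in turn.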

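4.1.1 Tag

A Tag corresponds to a tag in an HTML or XML document (yes, BeautifulSoup can parse XML too). This is the type that find() returns. (As an aside, find_all() returns a collection of these objects that you can iterate over with a for loop.) Once you have a tag, you can extract information from it.

Get the tag's name:

tag.name

Get one of the tag's attributes:

tag['attribute']

Let's use an example to get to know this type: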

from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""
soup = BeautifulSoup(html_doc, 'lxml')  # create the BeautifulSoup object
find = soup.find('p')  # find() returns the first p tag
print("find's return type is ", type(find))  # print the type of the return value
print("find's content is", find)  # print the tag that find() returned
print("find's Tag Name is ", find.name)  # print the tag's name
print("find's Attribute(class) is ", find['class'])  # print the tag's class attribute
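
Two details of attribute access are easy to miss: class can hold several values, so it comes back as a list, and indexing a missing attribute raises KeyError while .get() returns None. A small sketch follows; the one-line HTML string is made up for illustration, and html.parser is used only so the sketch runs without lxml.

```python
from bs4 import BeautifulSoup

# Illustrative markup; html.parser is chosen so no extra install is needed.
soup = BeautifulSoup(
    '<a href="http://example.com/elsie" class="sister big" id="link1">Elsie</a>',
    "html.parser",
)
tag = soup.a

tag_name = tag.name          # the tag's name
href = tag["href"]           # a single-valued attribute is a plain string
classes = tag["class"]       # class is multi-valued, so this is a list
missing = tag.get("title")   # .get() returns None instead of raising KeyError
all_attrs = tag.attrs        # dict of every attribute on the tag

print(tag_name, href, classes, missing)
print(all_attrs)
```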

4.1.2 NavigableString

A NavigableString is the text inside a tag (without the tag itself). Get it like this:
tag.string
Continuing the example above, add the following line and run it:
print('NavigableString is:', find.string)
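
One caveat worth knowing: .string only returns text when the tag has exactly one child. For a tag that mixes text and nested tags it returns None, and get_text() is the usual alternative. A minimal sketch, using a made-up snippet and html.parser so it runs without lxml:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>Once <b>upon</b> a time</p>", "html.parser")

bold_string = soup.b.string    # b has a single text child, so we get its text
para_string = soup.p.string    # p has three children, so .string is None
para_text = soup.p.get_text()  # get_text() joins all the text inside p

print(repr(bold_string), repr(para_string), repr(para_text))
```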

4.1.3 BeautifulSoup

The BeautifulSoup object represents the entire document. It supports navigating and searching the document tree.

4.1.4 Comment

This object is simply a comment in HTML or XML.

markup = "<b><!--Hey, buddy. Want to buy a used parser?--></b>"
soup = BeautifulSoup(markup, 'lxml')  # name the parser explicitly to avoid a warning
comment = soup.b.string
type(comment)
# <class 'bs4.element.Comment'>

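Sometimes we do not want the comments in the HTML, so we use this type to check whether a string is a comment.

from bs4.element import Comment

if isinstance(comment, Comment):  # comment comes from the example above
    print('This string is a comment')
else:
    print('This string is not a comment')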
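
The same check can be used to skip comments while collecting the visible text of a tag. A minimal sketch, with made-up markup and html.parser so it runs without lxml. Comment is a subclass of NavigableString, which is why both checks are needed.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString

soup = BeautifulSoup("<p>visible<!-- hidden note --> text</p>", "html.parser")

visible_parts = []
for node in soup.p.children:
    # Keep plain strings, skip comments (Comment subclasses NavigableString)
    if isinstance(node, NavigableString) and not isinstance(node, Comment):
        visible_parts.append(str(node))

visible_text = "".join(visible_parts)
print(visible_text)
```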

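4.2 Navigating with BeautifulSoup

4.2.1 Nodes and tag names

You can navigate by child nodes, parent nodes, and tag names:

soup.head  # the head tag
soup.p  # the first p tag

# Loop over a tag's direct children
title_tag = soup.title
for child in title_tag.children:
    print(child)

title_tag.parent  # parent node (here the head tag)

# All ancestors
link = soup.a
for parent in link.parents:
    if parent is None:
        print(parent)
    else:
        print(parent.name)

# Sibling nodes
sibling_soup = BeautifulSoup("<a><b>text1</b><c>text2</c></a>", 'lxml')
sibling_soup.b.next_sibling  # the sibling that follows
sibling_soup.c.previous_sibling  # the sibling that precedes

# All siblings
for sibling in soup.a.next_siblings:
    print(repr(sibling))

for sibling in soup.find(id="link3").previous_siblings:
    print(repr(sibling))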
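
To see what these navigation attributes return, here is a small self-contained sketch. The markup is made up and deliberately written without whitespace between tags, so every sibling is a tag rather than a whitespace string; html.parser is used so the sketch runs without lxml.

```python
from bs4 import BeautifulSoup

html = (
    "<html><body><div>"
    "<p id='first'>one</p><p id='second'>two</p>"
    "</div></body></html>"
)
soup = BeautifulSoup(html, "html.parser")
first = soup.find(id="first")

# Direct children of the div, in document order
child_ids = [child["id"] for child in soup.div.children]

# Walk upward; the last ancestor is the BeautifulSoup object itself
ancestor_names = [parent.name for parent in first.parents]

# Move sideways between the two paragraphs
next_id = first.next_sibling["id"]
previous_id = soup.find(id="second").previous_sibling["id"]

print(child_ids, ancestor_names, next_id, previous_id)
```

On real pages the markup usually contains newlines between tags, so next_sibling often returns a whitespace string first; find_next_sibling() skips straight to the next tag.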

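4.2.2 Searching the document tree

The most commonly used methods are of course find() and find_all(), but there are others, such as find_parent() and find_parents(), find_next_sibling() and find_next_siblings(), find_all_next() and find_next(), and find_all_previous() and find_previous().
We will only look at the common ones; consult the official documentation for the rest when you need them.

find_all()
Searches the descendants of the current tag and returns those that match the filter. The return type is bs4.element.ResultSet.
Full signature:
find_all( name , attrs , recursive , string , **kwargs )
Here are a few examples: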

soup.find_all("title")
# [<title>The Dormouse's story</title>]
#
soup.find_all("p", "title")
# [<p class="title"><b>The Dormouse's story</b></p>]
# 
soup.find_all("a")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
#
soup.find_all(id="link2")
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
#
import re
soup.find(string=re.compile("sisters"))
# 'Once upon a time there were three little sisters; and their names were\n'

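name parameter: finds all tags whose name is name.
attrs parameter: filters on the attributes of the tag.
string parameter: searches the strings in the document rather than the tags.
recursive parameter: when you call a tag's find_all(), Beautiful Soup examines all of the tag's descendants. To search only the tag's direct children, pass recursive=False.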
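
The parameters above can be compared side by side on a tiny document. This is only a sketch: the markup is made up for illustration, and html.parser is used so it runs without lxml.

```python
from bs4 import BeautifulSoup

html = (
    '<div id="root">'
    '<p class="story">Intro <a class="sister" href="/elsie">Elsie</a></p>'
    '<a class="sister" href="/lacie">Lacie</a>'
    "</div>"
)
soup = BeautifulSoup(html, "html.parser")
root = soup.find(id="root")

all_links = root.find_all("a")                      # name: every a descendant
direct_links = root.find_all("a", recursive=False)  # only direct children of root
by_class = soup.find_all("a", class_="sister")      # keyword filter on class
by_attrs = soup.find_all(attrs={"href": "/elsie"})  # attrs dictionary
by_string = soup.find_all(string="Elsie")           # matches strings, not tags
first_only = soup.find_all("a", limit=1)            # stop after one match

print(len(all_links), len(direct_links), by_string)
```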

find()
Works like find_all() but returns only the first match. The return type is bs4.element.Tag, or None when nothing matches.
Full signature:
find( name , attrs , recursive , string , **kwargs )
Examples:

soup.find('title')
# <title>The Dormouse's story</title>
#
soup.find("head").find("title")
# <title>The Dormouse's story</title>
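
Chained calls such as soup.find("head").find("title") raise AttributeError if an earlier find() matched nothing, because find() then returns None. A minimal sketch of a guard follows; the helper name text_or_default is hypothetical, and html.parser is used so the sketch runs without lxml.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    "<html><head><title>The Dormouse's story</title></head></html>",
    "html.parser",
)


def text_or_default(tag, default=""):
    """Return the tag's text, or a default when find() returned None.

    text_or_default is a hypothetical helper written for this sketch.
    """
    return tag.get_text() if tag is not None else default


title = soup.find("title")   # a Tag
table = soup.find("table")   # None: the document has no table

print(text_or_default(title))
print(text_or_default(table, default="(no table)"))
```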

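That covers the basics. Time for the real thing!

5. Continuing the example from the previous post

We continue with Unsplash, the site from the previous post. Select a picture on the home page and inspect its HTML: every picture sits inside an a tag whose class is cV68d.

Looking closely, the picture's link is the url inside background-image in the tag's style attribute. That url contains the image address followed by a string of parameters, including &w=XXX&h=XXX, which are the width and height. Removing the width and height parameters gives us the full-size image. So we first collect all the a tags that contain pictures, then loop over them to read each tag's style.

At the bottom right of each picture there is also a download button whose tag contains a download link, but that link does not return the image directly. It redirects several times, and the real image address has to be read from the Location response header. I will write a separate post that takes that route; for now we lightly process the url we can get directly. You may also discover other ways to obtain the image url. All of them are fine, so experiment freely!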

import requests  # import the requests module
from bs4 import BeautifulSoup  # import the BeautifulSoup module

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36'}  # request header that imitates the Chrome browser
web_url = 'https://unsplash.com'
r = requests.get(web_url, headers=headers)  # send a GET request to the target url; returns a response object
all_a = BeautifulSoup(r.text, 'lxml').find_all('a', class_='cV68d')  # all a tags on the page whose class is cV68d
for a in all_a:
    print(a['style'])  # print the style attribute of each a tag

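Here find_all('a', class_='cV68d') finds every a tag whose class is cV68d. It returns a list-like ResultSet, so a for loop can visit each a tag.
Also note that the GET request passes a headers argument, which makes the request look like it comes from a browser. How do you find out what your 'User-Agent' is?
In Chrome, press F12, refresh the page, and look at the request headers of any request in the Network panel.

OK, let's run the code above. It prints one style string per matching a tag.

The next task is to pull the image url out of each line of text. Looking at each string, the url is whatever sits between the two double quotes, so we use Python slicing to cut out that part.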

Rewrite the body of the for loop:

for a in all_a:
    img_str = a['style']  # the full style string of the a tag
    print(img_str[img_str.index('"')+1 : img_str.index('"', img_str.index('"')+1)])  # slice out the text between the two double quotes
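
Slicing by quote positions works, but a regular expression states the intent more directly and returns None instead of raising ValueError when the style has no url. This is a sketch of an alternative, not the approach the post uses; the helper name extract_url and the sample style string are made up for illustration.

```python
import re

# Matches url("...") and captures whatever sits between the double quotes
URL_IN_STYLE = re.compile(r'url\("([^"]+)"\)')


def extract_url(style):
    """Return the url inside a CSS style string, or None if there is none.

    extract_url is a hypothetical helper written for this sketch.
    """
    match = URL_IN_STYLE.search(style)
    return match.group(1) if match else None


# A made-up style string shaped like the ones described in the text
sample_style = (
    'background-color:#605848;'
    'background-image:url("https://images.unsplash.com/photo-123?ixlib=rb&w=400&h=300&q=80");'
)
print(extract_url(sample_style))
print(extract_url("color:red;"))
```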

After getting the url we still need to remove the width and height parameters.

for a in all_a:
    img_str = a['style']  # the full style string of the a tag
    print('style of the a tag:', img_str)
    first_pos = img_str.index('"') + 1
    second_pos = img_str.index('"', first_pos)
    img_url = img_str[first_pos: second_pos]  # slice out the text between the two double quotes
    width_pos = img_url.index('&w=')
    height_pos = img_url.index('&q=')  # assumes &q= comes right after the width and height parameters
    width_height_str = img_url[width_pos : height_pos]
    print('width and height parameters:', width_height_str)
    img_url_final = img_url.replace(width_height_str, '')
    print('image url after trimming:', img_url_final)
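
The slicing above depends on the parameters appearing in the order &w=, &h=, &q=. A sketch that does not depend on the order parses the query string with the standard library and drops w and h by name. The helper name strip_size_params and the sample url are assumptions made for illustration.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def strip_size_params(url, drop=("w", "h")):
    """Return the url without the query parameters named in drop.

    strip_size_params is a hypothetical helper written for this sketch.
    """
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in drop
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))


# A made-up url shaped like the ones described in the text
sample_url = "https://images.unsplash.com/photo-123?ixlib=rb&w=400&h=300&q=80"
print(strip_size_params(sample_url))
```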

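With these image urls we can fetch the pictures by sending further requests. First let's wrap the request code.
Start by creating a class:

class BeautifulPicture():
    def __init__(self):  # class initialization
        self.headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36'}  # request header that imitates the Chrome browser
        self.web_url = 'https://unsplash.com'  # the page to visit
        self.folder_path = r'D:\BeautifulPicture'  # directory in which to store the pictures (raw string so the backslash is kept as-is)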

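Then wrap the request call:

    def request(self, url):  # return the response for a url
        r = requests.get(url, headers=self.headers)  # send a GET request to the target url with our headers; returns a response object
        return r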
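
A variation on the wrapper above is to keep the headers on a requests.Session, so every request sends them automatically, and to add a timeout and a status check. This is only a sketch under assumptions: the function name fetch and the 10-second timeout are illustrative choices, not part of the original code, and the example does not actually send a request.

```python
import requests

# One session reuses connections and carries the headers for every request
session = requests.Session()
session.headers.update({
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36"
})


def fetch(url, timeout=10):
    """Send a GET request and raise an exception on HTTP error codes.

    fetch and the default timeout are assumptions made for this sketch.
    """
    response = session.get(url, timeout=timeout)
    response.raise_for_status()  # turn 4xx/5xx responses into exceptions
    return response


# Usage (needs network access): page = fetch("https://unsplash.com")
```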

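To save pictures into a directory we must create the directory first, so we add a method for that.
Import the os library first:
import os
Then define the method:

    def mkdir(self, path):  # create the folder
        path = path.strip()
        isExists = os.path.exists(path)
        if not isExists:
            print('Creating a folder named', path)
            os.makedirs(path)
            print('Created!')
        else:
            print(path, 'already exists; not creating it again')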
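
If the progress messages are not needed, the exists-then-create check can be replaced by os.makedirs with exist_ok=True, which does nothing when the folder is already there. A minimal sketch; the helper name ensure_dir is hypothetical, and a temporary directory stands in for the real download folder.

```python
import os
import tempfile


def ensure_dir(path):
    """Create path (and any missing parents); do nothing if it already exists.

    ensure_dir is a hypothetical helper written for this sketch.
    """
    os.makedirs(path, exist_ok=True)
    return path


# Use a temporary location so the sketch does not touch a real drive
base = tempfile.mkdtemp()
target = os.path.join(base, "BeautifulPicture")

ensure_dir(target)
ensure_dir(target)  # calling it again is harmless
print(os.path.isdir(target))
```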

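Next comes saving the picture. This method calls time.sleep, so add import time at the top of the file.

    def save_img(self, url, name):  # save a picture
        print('Saving picture...')
        img = self.request(url)
        time.sleep(5)  # pause between downloads
        file_name = name + '.jpg'
        print('Writing file')
        with open(file_name, 'wb') as f:  # 'wb' overwrites; 'ab' would append to a file left by an earlier run
            f.write(img.content)
        print(file_name, 'saved!')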
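
The file mode matters when the script is run more than once. A small sketch shows the difference between 'wb' (overwrite) and 'ab' (append); the helper name save_bytes and the fake image bytes are made up for illustration, and a temporary directory is used so nothing real is touched.

```python
import os
import tempfile


def save_bytes(path, data, mode="wb"):
    """Write bytes to path; the with block closes the file even on errors.

    save_bytes is a hypothetical helper written for this sketch.
    """
    with open(path, mode) as f:
        f.write(data)


folder = tempfile.mkdtemp()
path = os.path.join(folder, "1.jpg")
data = b"\xff\xd8not-a-real-jpeg"  # placeholder bytes, not a valid image

save_bytes(path, data)
save_bytes(path, data)             # 'wb' replaces the earlier content
overwritten_size = os.path.getsize(path)

save_bytes(path, data, mode="ab")  # 'ab' adds a second copy after the first
appended_size = os.path.getsize(path)

print(overwritten_size, appended_size)
```

With 'ab', re-running a scraper that reuses file names would keep growing each file and leave it unreadable as an image.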

The helper methods are ready, so let's write the main logic:

    def get_pic(self):
        print('Sending GET request for the page')
        r = self.request(self.web_url)
        print('Collecting all a tags')
        all_a = BeautifulSoup(r.text, 'lxml').find_all('a', class_='cV68d')  # all a tags on the page whose class is cV68d
        print('Creating the folder')
        self.mkdir(self.folder_path)  # create the folder
        print('Changing into the folder')
        os.chdir(self.folder_path)  # switch to the folder created above
        i = 1  # used below to name the pictures
        for a in all_a:
            img_str = a['style']  # the full style string of the a tag
            print('style of the a tag:', img_str)
            first_pos = img_str.index('"') + 1
            second_pos = img_str.index('"', first_pos)
            img_url = img_str[first_pos: second_pos]  # slice out the text between the two double quotes
            width_pos = img_url.index('&w=')
            height_pos = img_url.index('&q=')
            width_height_str = img_url[width_pos : height_pos]
            print('width and height parameters:', width_height_str)
            img_url_final = img_url.replace(width_height_str, '')
            print('image url after trimming:', img_url_final)
            self.save_img(img_url_final, str(i))
            i += 1

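Finally, run it:

beauty = BeautifulPicture()  # create an instance of the class
beauty.get_pic()  # call its method

Here is the complete code. Some parts have been refactored and changed along the way, and every part is commented, so it should be easy to follow. Leave a comment if anything is unclear.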

import requests  # import the requests module
from bs4 import BeautifulSoup  # import the BeautifulSoup module
import os  # import the os module

class BeautifulPicture():

    def __init__(self):  # class initialization
        self.headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1'}  # request header that imitates the Chrome browser
        self.web_url = 'https://unsplash.com'  # the page to visit
        self.folder_path = r'D:\BeautifulPicture'  # directory in which to store the pictures (raw string so the backslash is kept as-is)

    def get_pic(self):
        print('Sending GET request for the page')
        r = self.request(self.web_url)
        print('Collecting all a tags')
        all_a = BeautifulSoup(r.text, 'lxml').find_all('a', class_='cV68d')  # all a tags on the page whose class is cV68d
        print('Creating the folder')
        self.mkdir(self.folder_path)  # create the folder
        print('Changing into the folder')
        os.chdir(self.folder_path)  # switch to the folder created above

        for a in all_a:  # for each tag: get the image url, request it, then save the picture
            img_str = a['style']  # the full style string of the a tag
            print('style of the a tag:', img_str)
            first_pos = img_str.index('"') + 1
            second_pos = img_str.index('"', first_pos)
            img_url = img_str[first_pos: second_pos]  # slice out the text between the two double quotes
            # locate the width and height parameters in the url
            width_pos = img_url.index('&w=')
            height_pos = img_url.index('&q=')
            width_height_str = img_url[width_pos : height_pos]  # slice out the width and height parameters so they can be removed below
            print('width and height parameters:', width_height_str)
            img_url_final = img_url.replace(width_height_str, '')  # replace the width and height parameters with an empty string
            print('image url after trimming:', img_url_final)
            # use the part of the url after the host and before the parameters as the file name
            name_start_pos = img_url.index('photo')
            name_end_pos = img_url.index('?')
            img_name = img_url[name_start_pos : name_end_pos]
            self.save_img(img_url_final, img_name)  # call save_img to save the picture

    def save_img(self, url, name):  # save a picture
        print('Requesting the image; this may take a while...')
        img = self.request(url)
        file_name = name + '.jpg'
        print('Saving the picture')
        with open(file_name, 'wb') as f:  # 'wb' overwrites instead of appending to an existing file
            f.write(img.content)
        print(file_name, 'saved!')

    def request(self, url):  # return the response for a url
        r = requests.get(url, headers=self.headers)  # send a GET request to the target url; returns a response object. It works with or without the headers argument.
        return r

    def mkdir(self, path):  # create the folder
        path = path.strip()
        isExists = os.path.exists(path)
        if not isExists:
            print('Creating a folder named', path)
            os.makedirs(path)
            print('Created!')
        else:
            print(path, 'already exists; not creating it again')

beauty = BeautifulPicture()  # create an instance of the class
beauty.get_pic()  # call its method
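
The complete code derives the file name by searching for 'photo' and '?' in the url, which raises ValueError when either is missing. A sketch of an alternative takes the last path segment with the standard library instead. The helper name image_name and the sample urls are assumptions made for illustration.

```python
import posixpath
from urllib.parse import urlsplit


def image_name(url):
    """Return the last path segment of a url, ignoring the query string.

    image_name is a hypothetical helper written for this sketch.
    """
    return posixpath.basename(urlsplit(url).path)


# Made-up urls shaped like the ones described in the text
print(image_name("https://images.unsplash.com/photo-123?ixlib=rb&q=80"))
print(image_name("https://images.unsplash.com/photo-456"))
```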

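Execution may be slow because the images are large! If you only want to test the scraper, you can leave the width and height parameters in place; the images will be much smaller and the run much faster.

6. Afterword

You may have noticed that we only got 10 pictures rather than every photo on the site.

That is because the site loads more content as you scroll: each scroll loads another 10 photos. So how do we scrape a page like that? See the next post.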
