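Choosing a parser for bs4

The ultimate goal of a web crawler is to filter and extract information from the web, and the parser is arguably its most important part. The quality of the parser largely determines the crawler's speed and efficiency. Besides the 'html.parser' parser we used in the previous article, the bs4 library also supports several third-party parsers, so it is worth looking at how they compare.

The bs4 documentation recommends the lxml parser because it is faster, so that is the parser we will use here.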
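
As a rough illustration of why the choice of parser matters, the sketch below feeds the same invalid fragment to every parser that happens to be installed and records how each one repairs it. The fragment `<a></p>` and the `results` dictionary are illustrative names that are not part of the original article; a parser that is missing is skipped by catching `FeatureNotFound`.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# A deliberately invalid fragment (illustrative, not from the article).
broken = "<a></p>"

results = {}
for parser in ("html.parser", "lxml", "html5lib"):
    try:
        results[parser] = str(BeautifulSoup(broken, parser))
    except FeatureNotFound:
        # bs4 raises this when the requested parser is not installed.
        results[parser] = None

for parser, output in results.items():
    print(parser, "->", output)
```

With only the standard library available, just the `html.parser` entry has a result. When lxml or html5lib are installed they typically wrap the fragment in `<html>` and `<body>` tags, so the same input can produce different trees depending on the parser.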

Installing the lxml parser:

As before, we install it with pip:

$ pip install lxml

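Note: I am on a Unix-like system, where pip is very convenient. In my experience, installing on Windows tends to run into one problem or another, so Windows users may prefer to download an installer from the official lxml site that matches their system.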
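
One way to confirm that the installation worked, and to keep a script usable when it did not, is to check whether the `lxml` package can be found and fall back to the built-in parser otherwise. This is a minimal sketch; `pick_parser` and `parser_name` are hypothetical helper names, not part of bs4.

```python
import importlib.util

from bs4 import BeautifulSoup


def pick_parser():
    """Return "lxml" when the package can be found, else "html.parser"."""
    if importlib.util.find_spec("lxml") is not None:
        return "lxml"
    return "html.parser"


parser_name = pick_parser()
print("Using parser:", parser_name)

# Either parser can handle this tiny document.
soup = BeautifulSoup("<p>hi</p>", parser_name)
print(soup.p.string)  # hi
```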

Parsing a web page with lxml

We again use the Alice document from the previous article as the example:

    html_doc = """
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <p class="title"><b>The Dormouse's story</b></p>

    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>

    <p class="story">...</p>
    """

Let's try it:

    import bs4

    # First, turn the HTML file into a "soup" using the lxml parser.
    with open('Beautiful Soup 爬蟲/demo.html') as f:
        soup = bs4.BeautifulSoup(f, 'lxml')

    # Print the result: a clearly laid out tree structure.
    print(soup.prettify())
    '''
    OUT:
    <html>
     <head>
      <title>
       The Dormouse's story
      </title>
     </head>
     <body>
      <p class="title">
       <b>
        The Dormouse's story
       </b>
      </p>
      <p class="story">
       Once upon a time there were three little sisters; and their names were
       <a class="sister" href="http://example.com/elsie" id="link1">
        Elsie
       </a>
       ,
       <a class="sister" href="http://example.com/lacie" id="link2">
        Lacie
       </a>
       and
       <a class="sister" href="http://example.com/tillie" id="link3">
        Tillie
       </a>
       ; and they lived at the bottom of a well.
      </p>
      <p class="story">
       ...
      </p>
     </body>
    </html>
    '''

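How is it used in practice?

bs4 first converts the string or file handle it is given to Unicode, so scraping Chinese text does not run into awkward encoding problems. For less common encodings such as 'big5', however, we may need to set the encoding manually (this only has an effect when markup is a byte string):

    soup = BeautifulSoup(markup, from_encoding="encoding name")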
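
To make the encoding hint concrete, here is a small sketch that assumes the page arrives as Big5-encoded bytes. The sample markup in `raw` is made up for illustration, and the built-in 'html.parser' is used so the snippet does not depend on lxml being installed.

```python
from bs4 import BeautifulSoup

# Bytes as they might arrive from a Big5-encoded page (made-up sample).
raw = "<p>中文</p>".encode("big5")

# from_encoding is a hint for decoding byte input; a str is already Unicode.
soup = BeautifulSoup(raw, "html.parser", from_encoding="big5")

# The text comes back as an ordinary Unicode string.
print(soup.p.string == "中文")  # True

# The encoding bs4 ended up using to decode the bytes.
print(soup.original_encoding)
```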

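Kinds of objects:

bs4 turns a complex HTML document into a tree structure in which every node is a Python object. All of these objects fall into four types: Tag, NavigableString, BeautifulSoup, and Comment.
Let's go through them one by one:

Tag: essentially the same as a tag in the HTML source, and easy to start using.

NavigableString: the string wrapped inside a tag.

BeautifulSoup: represents the entire content of a document. Most of the time you can treat it as a Tag object; it supports the methods for traversing and searching the document tree.

Comment: a special kind of NavigableString. It represents content such as HTML comments and is written out in a special format when the document is output.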
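
The four types can be inspected directly. The sketch below uses a made-up one-line document (not the Alice example) and the built-in 'html.parser'; the variable names are illustrative.

```python
from bs4 import BeautifulSoup, Comment, NavigableString, Tag

# Made-up snippet containing a tag, a comment, and a plain string.
soup = BeautifulSoup('<b class="boldest"><!--a note-->bold text</b>', "html.parser")

tag = soup.b
comment, text = tag.contents  # the two children of <b>

print(type(soup).__name__)     # BeautifulSoup
print(type(tag).__name__)      # Tag
print(type(comment).__name__)  # Comment
print(type(text).__name__)     # NavigableString

print(tag.name, tag["class"])  # b ['boldest']

# Comment is a subclass of NavigableString, and BeautifulSoup of Tag.
print(isinstance(comment, NavigableString))  # True
print(isinstance(soup, Tag))                 # True
```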

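The simplest way to search the document tree is by the name of the tag you want:

    soup.head
    # <head><title>The Dormouse's story</title></head>

    soup.title
    # <title>The Dormouse's story</title>

To drill down to a smaller tag, for example the part wrapped in a b tag under body:

    soup.body.b
    # <b>The Dormouse's story</b>

However, this approach only finds the first tag with that name in document order.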

What about getting all of the matching tags?

For that we need the find_all() method, which returns a list-like result (a ResultSet, which is a subclass of list):

    tag = soup.find_all('a')
    # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
    #  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
    #  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

    # Suppose we want the second <a> tag:
    need = tag[1]  # Simple, isn't it?
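
The difference between attribute access and find_all() can be checked on a cut-down version of the Alice paragraph. This is a sketch with the built-in 'html.parser'; the `html` and `links` names are illustrative.

```python
from bs4 import BeautifulSoup

# A shortened copy of the "three sisters" paragraph.
html = (
    '<p class="story">'
    '<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>, '
    '<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and '
    '<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>'
    '</p>'
)
soup = BeautifulSoup(html, "html.parser")

links = soup.find_all("a")
print(len(links))          # 3
print(soup.a is links[0])  # True: soup.a is just the first match

# Attributes are read with dictionary-style access.
print([a["href"] for a in links])

# limit stops the search early; find() returns a single tag or None.
print([a.get_text() for a in soup.find_all("a", limit=2)])  # ['Elsie', 'Lacie']
print(soup.find(id="link2").get_text())                     # Lacie

# No match gives an empty result rather than an error.
print(soup.find_all("table"))  # []
```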

A tag's .contents attribute returns its children as a list:

    head_tag = soup.head
    head_tag
    # <head><title>The Dormouse's story</title></head>

    head_tag.contents
    # [<title>The Dormouse's story</title>]

    title_tag = head_tag.contents[0]
    print(title_tag)
    # <title>The Dormouse's story</title>

    title_tag.contents
    # ["The Dormouse's story"]

A tag's .children iterator can also be used to loop over its children:

    for child in title_tag.children:
        print(child)
    # The Dormouse's story

This only walks the direct children. How do we walk all of the descendants?

Descendants: for example, the child of head is <title>The Dormouse's story</title>, and title itself has a child, the string 'The Dormouse's story'. That string is also a descendant of head.

    for child in head_tag.descendants:
        print(child)
    # <title>The Dormouse's story</title>
    # The Dormouse's story
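
A quick count makes the difference between children and descendants visible. This sketch re-parses just the head element with the built-in 'html.parser', so it runs without the demo file.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    "<head><title>The Dormouse's story</title></head>", "html.parser"
)
head_tag = soup.head

# Direct children only: the single <title> tag.
print(len(head_tag.contents))        # 1
print(len(list(head_tag.children)))  # 1

# Descendants also include the string inside <title>.
descendants = list(head_tag.descendants)
print(len(descendants))              # 2
print(descendants[0].name)           # title
print(descendants[1])                # The Dormouse's story
```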

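How do we find all of the text under a tag?

1. If the tag has only one child and that child is a NavigableString, tag.string returns it directly.

2. If the tag has many children and descendants, each containing strings:

we can iterate over all of them:

    for string in soup.strings:
        print(repr(string))
    # "The Dormouse's story"
    # '\n\n'
    # "The Dormouse's story"
    # '\n\n'
    # 'Once upon a time there were three little sisters; and their names were\n'
    # 'Elsie'
    # ',\n'
    # 'Lacie'
    # ' and\n'
    # 'Tillie'
    # ';\nand they lived at the bottom of a well.'
    # '\n\n'
    # '...'
    # '\n'

That covers the basics of the bs4 library for now. The remaining parts, namely parent nodes, sibling nodes, and moving backward and forward through the document, work much like finding elements from child nodes as shown above.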
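
Two loose ends can be sketched together: how .string behaves when a tag has more than one child, and what the parent and sibling attributes look like. The one-line document below is made up for illustration and parsed with the built-in 'html.parser'.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p><b>one</b><i>two</i> three</p>", "html.parser")
b_tag = soup.b

# .string works for a single string child, but is None otherwise.
print(b_tag.string)   # one
print(soup.p.string)  # None

# stripped_strings yields every string with surrounding whitespace removed.
print(list(soup.p.stripped_strings))  # ['one', 'two', 'three']

# Moving up and sideways mirrors moving down through children.
print(b_tag.parent.name)             # p
print(b_tag.next_sibling.name)       # i
print(soup.i.previous_sibling.name)  # b
print(b_tag.previous_sibling)        # None: <b> is the first child

# next_element follows parse order, so it steps into <b> first.
print(b_tag.next_element)            # one
```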
