Notes on BeautifulSoup4 Parsers

Reposted article. Published 2019-06-20 09:36. Tag: web crawlers

1. Parser overview

As in the earlier notes in this series, when we write:

soup=BeautifulSoup(response.body)

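to parse a page, we do not specify a parser. In that case Beautiful Soup picks the best parser that is installed, preferring "lxml", then "html5lib", and falling back to Python's built-in "html.parser" only when neither is available. It also emits a warning, because the same code can then behave differently on another machine.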
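A minimal sketch of naming the parser explicitly, so that the result does not depend on which parser libraries happen to be installed. The markup string is a made-up sample, not taken from a real page.

```python
from bs4 import BeautifulSoup

# Made-up sample markup standing in for a downloaded page
html = "<html><body><p class='intro'>Hello, <b>world</b></p></body></html>"

# Naming the parser makes the result independent of what else is installed
soup = BeautifulSoup(html, "html.parser")

print(soup.b.get_text())   # text inside the <b> tag
print(soup.p["class"])     # class is a multi-valued attribute, so this is a list
```

"html.parser" is used here only because it ships with Python; any of the parser names in this article can be passed the same way.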

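What is a parser? It is the component Beautiful Soup relies on to interpret the HTML tags and build the document tree. Given the same markup, especially invalid markup, different parsers can produce different trees.

Here is an example from the official documentation: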

BeautifulSoup("<a></p>", "lxml")
# <html><body><a></a></body></html>

BeautifulSoup("<a></p>", "html5lib")
# <html><head></head><body><a><p></p></a></body></html>

BeautifulSoup("<a></p>", "html.parser")
# <a></a>

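The official documentation recommends "lxml", or "html5lib" where maximum leniency is needed, over the built-in "html.parser". As the example above shows, "html.parser" does the least to complete missing tags, so it is the most likely of the three to give surprising results on broken markup.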
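To repeat the comparison on your own markup, something like the following sketch may help. The helper name `parse_with_each` and the second sample string are hypothetical; the sketch assumes only that `bs4` is installed and skips any parser library that is missing.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def parse_with_each(markup, parsers=("lxml", "html5lib", "html.parser")):
    """Return {parser name: serialized tree}, or None for parsers not installed."""
    results = {}
    for name in parsers:
        try:
            results[name] = str(BeautifulSoup(markup, name))
        except FeatureNotFound:
            # Beautiful Soup raises this when the requested parser library is missing
            results[name] = None
    return results


# Hypothetical broken fragment: neither <p> nor <b> is ever closed
results = parse_with_each("<p>unclosed paragraph<b>bold")
for name, tree in results.items():
    print(f"{name:12} {tree if tree is not None else '(not installed)'}")
```

Printing the trees side by side makes it easy to see which parser repairs a given fragment the way you expect before committing to one.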

2. Comparison of the parsers

Parser	Typical usage	Advantages	Disadvantages
Python’s html.parser	`BeautifulSoup(markup, "html.parser")`	Batteries included; decent speed; lenient (as of Python 2.7.3 and 3.2.2)	Not very lenient (before Python 2.7.3 or 3.2.2)
lxml’s HTML parser	`BeautifulSoup(markup, "lxml")`	Very fast; lenient	External C dependency
lxml’s XML parser	`BeautifulSoup(markup, "lxml-xml")` or `BeautifulSoup(markup, "xml")`	Very fast; the only currently supported XML parser	External C dependency
html5lib	`BeautifulSoup(markup, "html5lib")`	Extremely lenient; parses pages the same way a web browser does; creates valid HTML5	Very slow; external Python dependency

As the table shows, "lxml" parses very quickly and tolerates errors reasonably well. "html5lib" is the most tolerant of errors and always produces valid HTML5, but it is slow.
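The speed difference can be checked locally with a rough timing sketch like the one below. The test page is synthetic (200 generated paragraphs, an assumption rather than a real site), the helper name `time_parsers` is hypothetical, and the absolute numbers will vary by machine and library version, so treat the output as a relative comparison only.

```python
import timeit

from bs4 import BeautifulSoup, FeatureNotFound

# Synthetic page: 200 simple paragraphs (not taken from a real site)
markup = (
    "<html><body>"
    + "".join(f"<p class='row'>item {i}</p>" for i in range(200))
    + "</body></html>"
)


def time_parsers(markup, parsers=("lxml", "html5lib", "html.parser"), repeat=5):
    """Return {parser name: average seconds per parse} for installed parsers."""
    timings = {}
    for name in parsers:
        try:
            BeautifulSoup(markup, name)  # probe; raises if the parser is missing
        except FeatureNotFound:
            continue
        total = timeit.timeit(lambda: BeautifulSoup(markup, name), number=repeat)
        timings[name] = total / repeat
    return timings


timings = time_parsers(markup)
for name, seconds in sorted(timings.items(), key=lambda kv: kv[1]):
    print(f"{name:12} {seconds * 1000:.2f} ms per parse")
```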

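In my own scraping, I met a page whose encoding was inconsistent and which contained one garbled sentence. With both "html.parser" and "lxml", parsing stopped at that sentence and every tag after it was ignored. "html5lib" handled the page without any trouble.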

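3. The from_encoding parameter (corresponds to the fromEncoding parameter in BeautifulSoup3)

Because different websites use different encodings, you may need to state the encoding when parsing with BeautifulSoup.

You can check a page's encoding as follows: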

import chardet
chardet.detect(response.body)

For example, this may return:

{'confidence': 0.99, 'encoding': 'GB2312'}

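This tells us the page is encoded as GB2312. from_encoding defaults to None, in which case Beautiful Soup tries to detect the encoding on its own. When you already know the encoding, or the automatic guess turns out to be wrong, pass it explicitly:

soup = BeautifulSoup(response.body, "html5lib", from_encoding='gb2312')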
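A small self-contained sketch of the same idea. The `body` bytes are a hypothetical stand-in for `response.body`, built by encoding a short page as GB2312, and "html.parser" is used only so the sketch runs without extra dependencies.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for response.body: a small page encoded as GB2312 bytes
text = "<html><head><title>编码测试</title></head><body><p>你好</p></body></html>"
body = text.encode("gb2312")

# State the encoding instead of letting Beautiful Soup guess it
soup = BeautifulSoup(body, "html.parser", from_encoding="gb2312")

print(soup.title.string)       # decoded title text
print(soup.original_encoding)  # the encoding Beautiful Soup ended up using
```

Checking `soup.original_encoding` after parsing is a quick way to confirm which encoding was actually applied.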
