HTML parsing (etree.xpath, BeautifulSoup, and pyquery)

This article is a repost. 2021-05-15 14:46, tagged python

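Using etree.xpath

Reference: https://www.w3school.com.cn/xpath/xpath_functions.asp

Step 1: import the lxml module.

Step 2: initialize the parser with the file or string you want to process.

Step 3: extract data from the HTML prepared in step 2 using XPath rules.

Method 1: load an existing HTML file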


from lxml import etree
html = etree.parse('./maoyan.html', etree.HTMLParser(encoding='utf-8'))  # load an existing HTML file
result01 = html.xpath('//i/ancestor::dd')  # every <dd> that is an ancestor of an <i>
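
The file-based route can be tried without the original page. The sketch below is only an illustration: the markup is a hypothetical stand-in for maoyan.html (its real structure is not shown here), written to a temporary file and then parsed the same way.

```python
import os
import tempfile

from lxml import etree

# Hypothetical markup standing in for maoyan.html; the real page layout is an assumption
page = (
    '<html><body><dl>'
    '<dd><i class="board-index">1</i><p class="name">Example Movie</p></dd>'
    '<dd><p class="name">No index here</p></dd>'
    '</dl></body></html>'
)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'sample.html')
    with open(path, 'w', encoding='utf-8') as f:
        f.write(page)
    # Same call shape as above: parse a file with an explicit HTML parser
    tree = etree.parse(path, etree.HTMLParser(encoding='utf-8'))

# The ancestor axis walks upward: keep only <dd> elements that contain an <i>
matches = tree.xpath('//i/ancestor::dd')
names = [dd.xpath('string(./p)') for dd in matches]

print(len(matches))
print(names)
```

Only the first dd is selected, because the second one has no i descendant.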

Method 2: use a string defined in the code

text = '''

<div>

<ul>

<li class="item-0"><a href="link1.html">first item</a></li>

<li class="item-1"><a href="link2.html">second item</a></li>

<li class="item-inactive"><a href="link3.html">third item</a></li>

<li class="item-1"><a href="link4.html">fourth item</a></li>

<li class="item-0"><a href="link5.html">fifth item</a></li>

</ul>

</div>

'''

html = etree.HTML(text)  # initialize directly from the text defined above

result = etree.tostring(html)

print(result.decode('utf-8'))
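
Step 3 (extracting data with XPath rules) might look like the following for the sample list. This is a minimal sketch; the variable names are chosen for illustration and the markup is repeated so the snippet runs on its own.

```python
from lxml import etree

# Same sample markup as above, repeated so the snippet is self-contained
sample = '''
<div>
<ul>
<li class="item-0"><a href="link1.html">first item</a></li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-inactive"><a href="link3.html">third item</a></li>
<li class="item-1"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
</div>
'''

tree = etree.HTML(sample)

# href attribute of every <a> directly under an <li>
hrefs = tree.xpath('//li/a/@href')

# Link text, restricted to <li> elements whose class is exactly "item-1"
item1_texts = tree.xpath('//li[@class="item-1"]/a/text()')

# Axis example: the <li> ancestor of the link that points at link3.html
ancestors = tree.xpath('//a[@href="link3.html"]/ancestor::li')
ancestor_classes = [li.get('class') for li in ancestors]

print(hrefs)
print(item1_texts)
print(ancestor_classes)
```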

BeautifulSoup

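Reference: https://beautifulsoup.readthedocs.io/zh_CN/v4.4.0/#id14

Step 1: import BeautifulSoup.

Step 2: initialize it with the file or string you want to process.

Step 3: extract data from the HTML prepared in step 2 using the rules you need.

Two ways to construct the object:

soup = BeautifulSoup(open("index.html"))
soup = BeautifulSoup("<html>data</html>")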



from bs4 import BeautifulSoup

# For an existing HTML file, open it with open() and an explicit encoding first; otherwise Chinese characters may come out garbled

html = open('maoyan.html', 'r', encoding='utf-8')
soup = BeautifulSoup(html,'lxml')

print(type(soup.div.div.div.div.div.contents))
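
Because maoyan.html is not included here, the following sketch shows the extraction step on a small inline string instead. It uses the built-in 'html.parser' backend rather than 'lxml' so that it has no extra dependency; that swap and the variable names are assumptions made for illustration.

```python
from bs4 import BeautifulSoup

sample = '''
<div>
<ul>
<li class="item-0"><a href="link1.html">first item</a></li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-inactive"><a href="link3.html">third item</a></li>
<li class="item-1"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
</div>
'''

# 'html.parser' ships with Python; pass 'lxml' instead if lxml is installed
soup = BeautifulSoup(sample, 'html.parser')

# find_all returns every matching tag; attributes are read like dict keys
hrefs = [a['href'] for a in soup.find_all('a')]

# Filter by CSS class (class_ avoids the Python keyword)
item1_texts = [li.a.get_text() for li in soup.find_all('li', class_='item-1')]

# CSS selectors are also supported through select / select_one
first_text = soup.select_one('li.item-0 a').get_text()

# Dotted access walks to the first matching child; .contents is a plain list
contents_type = type(soup.div.ul.contents).__name__

print(hrefs)
print(item1_texts)
print(first_text)
print(contents_type)
```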

pyquery

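Reference: https://pyquery.readthedocs.io/en/latest/index.html

Step 1: import it with from pyquery import PyQuery as pq.

Step 2: initialize it with the file or string you want to process.

Step 3: extract data from the HTML prepared in step 2 using the rules you need.

Constructors:
from pyquery import PyQuery as pq
from lxml import etree
import urllib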

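1. From a string

doc = pq("<html></html>")

pq accepts HTML code directly as its argument; doc then plays the role of the $ symbol in jQuery.

2. From lxml.etree

doc = pq(etree.fromstring("<html></html>"))

You can process the markup with lxml's etree first. Note that etree.fromstring is a strict XML parser and raises an error on malformed input; to have incomplete or sloppy HTML repaired into a complete, cleanly structured document, parse it with etree.HTML instead.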

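3. From a URL

doc = pq('http://www.baidu.com')

This requests the page directly, much like fetching the link with urllib2 (urllib.request in Python 3), and parses the HTML that comes back.

4. From a file

doc = pq(filename='hello.html')

The file name can be passed directly.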

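Note:

# Initialize from the contents of a file. When the default encoding is GBK, reading fails with an error on characters it cannot decode; passing encoding='utf-8' to open() fixes this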

html = open('maoyan.html','r',encoding='utf-8')

doc = pq(html.read())

print(doc('dd'))
