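Preface
lxml is a Python library, built on the C libraries libxml2 and libxslt, for processing XML and HTML quickly and flexibly. If you already know XPath locators, you can get started right away.
Environment:
Python 3.7
lxml 4.3.3
Installing lxml
Running pip install lxml reported an error; pinning the version to 4.4.3 (pip install lxml==4.4.3) made the installation succeed.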


Run pip show lxml to check the installed version.
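The version can also be checked from inside Python, which helps when several interpreters are installed. This is a minimal sketch using the version tuples that lxml.etree exposes; the formatting of the printed string is just one way to display them.

```python
from lxml import etree

# LXML_VERSION is a tuple of integers describing the installed lxml release
print("lxml:", ".".join(str(n) for n in etree.LXML_VERSION))
# LIBXML_VERSION is the version of the underlying libxml2 C library
print("libxml2:", ".".join(str(n) for n in etree.LIBXML_VERSION))
```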

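Parsing HTML
Here the etree.HTML function parses the HTML text into an element tree.
To print the parsed HTML, use etree.tostring. Passing encoding="utf-8" makes the Chinese text in the HTML come out correctly; without it, non-ASCII characters are written as numeric character references. pretty_print=True formats the output with line breaks.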
# coding:utf-8
from lxml import etree
htmldemo = '''
<meta charset="UTF-8"> <!-- for HTML5 -->
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<html><head><title>yoyo ketang</title></head>
<body>
<b><!--Hey, this in comment!--></b>
<p class="title"><b>yoyoketang</b></p>
<p class="yoyo">這里是我的微信公眾號:yoyoketang
<a href="http://www.cnblogs.com/yoyoketang/tag/fiddler/" class="sister" id="link1">fiddler教程</a>,
<a href="http://www.cnblogs.com/yoyoketang/tag/python/" class="sister" id="link2">python筆記</a>,
<a href="http://www.cnblogs.com/yoyoketang/tag/selenium/" class="sister" id="link3">selenium文檔</a>;
快來關注吧!</p>
<p class="story">...</p>
'''
# Parse the HTML text with etree.HTML
demo = etree.HTML(htmldemo)
# Serialize the tree; tostring returns bytes, so decode before printing
t = etree.tostring(demo, encoding="utf-8", pretty_print=True)
print(t.decode("utf-8"))
Output
<html>
<head><meta charset="UTF-8"/> <!-- for HTML5 -->
<meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>
<title>yoyo ketang</title>
</head>
<body>
<b><!--Hey, this in comment!--></b>
<p class="title"><b>yoyoketang</b></p>
<p class="yoyo">這里是我的微信公眾號:yoyoketang
<a href="http://www.cnblogs.com/yoyoketang/tag/fiddler/" class="sister" id="link1">fiddler教程</a>,
<a href="http://www.cnblogs.com/yoyoketang/tag/python/" class="sister" id="link2">python筆記</a>,
<a href="http://www.cnblogs.com/yoyoketang/tag/selenium/" class="sister" id="link3">selenium文檔</a>;
快來關注吧!</p>
<p class="story">...</p>
</body>
</html>
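The soupparser parser
soupparser tolerates malformed HTML better than etree.HTML above: it parses the input with BeautifulSoup (which must be installed separately) and then builds an lxml tree from the result, so it can cope with some badly broken HTML that etree.HTML handles less gracefully.
# Requires the BeautifulSoup package; reuses etree and htmldemo from the example above
import lxml.html.soupparser as soupparser
demo = soupparser.fromstring(htmldemo)
t = etree.tostring(demo, encoding="utf-8", pretty_print=True)
print(t.decode("utf-8"))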
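XPath examples
The point of an HTML parser is usually to pull certain element attributes and text out of a page. The following shows how to find what you want simply and with very little code.
For example, to get the text "這里是我的微信公眾號:yoyoketang" ("This is my WeChat official account: yoyoketang"):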
# coding:utf-8
from lxml import etree
htmldemo = '''
<meta charset="UTF-8"> <!-- for HTML5 -->
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<html><head><title>yoyo ketang</title></head>
<body>
<b><!--Hey, this in comment!--></b>
<p class="title"><b>yoyoketang</b></p>
<p class="yoyo">這里是我的微信公眾號:yoyoketang
<a href="http://www.cnblogs.com/yoyoketang/tag/fiddler/" class="sister" id="link1">fiddler教程</a>,
<a href="http://www.cnblogs.com/yoyoketang/tag/python/" class="sister" id="link2">python筆記</a>,
<a href="http://www.cnblogs.com/yoyoketang/tag/selenium/" class="sister" id="link3">selenium文檔</a>;
快來關注吧!</p>
<p class="story">...</p>
'''
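# Parse the HTML text with etree.HTML
demo = etree.HTML(htmldemo)
# xpath returns a list of matching elements
rs = demo.xpath('//p[@class="yoyo"]')
# .text holds only the text before the element's first child
t = rs[0].text
print(t)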
Output:
這里是我的微信公眾號:yoyoketang

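In terms of code volume, three short lines are enough to find the content. rs is the list returned by the XPath query; it contains every element that matches the expression. A for loop shows the details of each one.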
# coding:utf-8
from lxml import etree
htmldemo = '''
<meta charset="UTF-8"> <!-- for HTML5 -->
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<html><head><title>yoyo ketang</title></head>
<body>
<b><!--Hey, this in comment!--></b>
<p class="title"><b>yoyoketang</b></p>
<p class="yoyo">這里是我的微信公眾號:yoyoketang
<a href="http://www.cnblogs.com/yoyoketang/tag/fiddler/" class="sister" id="link1">fiddler教程</a>,
<a href="http://www.cnblogs.com/yoyoketang/tag/python/" class="sister" id="link2">python筆記</a>,
<a href="http://www.cnblogs.com/yoyoketang/tag/selenium/" class="sister" id="link3">selenium文檔</a>;
快來關注吧!</p>
<p class="story">...</p>
'''
# Parse the HTML text with etree.HTML
demo = etree.HTML(htmldemo)
rs = demo.xpath('//p[@class="yoyo"]')
print(rs)  # a list of elements
for j in rs:
    # Print the matched element and its attributes
    print(etree.tostring(j, encoding="utf-8", pretty_print=True).decode("utf-8"))
    print(j.attrib)
Output
[<Element p at 0x262b525a988>]
<p class="yoyo">這里是我的微信公眾號:yoyoketang
<a href="http://www.cnblogs.com/yoyoketang/tag/fiddler/" class="sister" id="link1">fiddler教程</a>,
<a href="http://www.cnblogs.com/yoyoketang/tag/python/" class="sister" id="link2">python筆記</a>,
<a href="http://www.cnblogs.com/yoyoketang/tag/selenium/" class="sister" id="link3">selenium文檔</a>;
快來關注吧!</p>
{'class': 'yoyo'}
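Besides the attrib mapping, a single attribute can be read with get(), and an XPath expression can return attribute values directly. The sketch below uses a shortened copy of the sample HTML; the variable names html, links and hrefs are only for illustration.

```python
from lxml import etree

# A shortened copy of the sample HTML used in this post
html = '''
<p class="yoyo">
<a href="http://www.cnblogs.com/yoyoketang/tag/fiddler/" class="sister" id="link1">fiddler教程</a>,
<a href="http://www.cnblogs.com/yoyoketang/tag/python/" class="sister" id="link2">python筆記</a>
</p>
'''
demo = etree.HTML(html)

links = demo.xpath('//a[@class="sister"]')
for a in links:
    # get() returns one attribute value, or None if the attribute is missing
    print(a.get("id"), a.get("href"), a.text)

# An XPath that ends in @href returns the attribute values as strings
hrefs = demo.xpath('//a[@class="sister"]/@href')
print(hrefs)
```

Selecting @href in the expression avoids a loop when only the attribute values are needed.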
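Searching within a result
The XPath expression //p[@class="yoyo"] selects the element with class="yoyo", and its child nodes come along with it. To locate one of those children, run a second XPath query on the result, for example to get "python筆記" ("python notes"):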
# coding:utf-8
from lxml import etree
htmldemo = '''
<meta charset="UTF-8"> <!-- for HTML5 -->
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<html><head><title>yoyo ketang</title></head>
<body>
<b><!--Hey, this in comment!--></b>
<p class="title"><b>yoyoketang</b></p>
<p class="yoyo">這里是我的微信公眾號:yoyoketang
<a href="http://www.cnblogs.com/yoyoketang/tag/fiddler/" class="sister" id="link1">fiddler教程</a>,
<a href="http://www.cnblogs.com/yoyoketang/tag/python/" class="sister" id="link2">python筆記</a>,
<a href="http://www.cnblogs.com/yoyoketang/tag/selenium/" class="sister" id="link3">selenium文檔</a>;
快來關注吧!</p>
<p class="story">...</p>
'''
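# Parse the HTML text with etree.HTML
demo = etree.HTML(htmldemo)
rs = demo.xpath('//p[@class="yoyo"]')
print(rs[0].text)
# Start with "." to search only inside rs[0]; a bare "//" searches the whole document
rs1 = rs[0].xpath('.//a[@id="link2"]')
print(rs1[0].text)
# The same element can also be located directly from the root
rs2 = demo.xpath('//a[@id="link2"]')
print(rs2[0].text)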
Output
這里是我的微信公眾號:yoyoketang

python筆記
python筆記
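When querying from an element, whether the expression starts with // or .// matters. The following sketch uses a hypothetical two-paragraph document (not the sample from this post) to show that // always searches from the document root, while .// stays inside the element the query is run on.

```python
from lxml import etree

# Hypothetical document with two paragraphs, for illustration only
html = '''
<div>
<p class="first"><a id="a1">one</a></p>
<p class="second"><a id="a2">two</a></p>
</div>
'''
root = etree.HTML(html)
second = root.xpath('//p[@class="second"]')[0]

# "//a" ignores the context element and scans the whole document
everywhere = [a.get("id") for a in second.xpath('//a')]
# ".//a" only looks at descendants of the context element
inside = [a.get("id") for a in second.xpath('.//a')]

print(everywhere)  # ['a1', 'a2']
print(inside)      # ['a2']
```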

XPath
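The .text attribute used earlier returns only the text before an element's first child. The sketch below, on a hypothetical one-line fragment, compares a few ways of reading text: .text and .tail, the XPath text() step, the XPath string() function, and itertext().

```python
from lxml import etree

# Hypothetical fragment, for illustration only
html = '<p class="yoyo">before <a id="link1">first</a>, <a id="link2">second</a>; after</p>'
p = etree.HTML(html).xpath('//p[@class="yoyo"]')[0]

print(repr(p.text))     # text before the first child: 'before '
print(repr(p[0].tail))  # text right after the first <a>: ', '

# text() selects the text nodes that sit directly under <p>
direct = p.xpath('text()')
print(direct)           # ['before ', ', ', '; after']

# string(.) concatenates all text inside <p>, including the children
full = p.xpath('string(.)')
print(full)             # before first, second; after

# itertext() yields the same pieces one by one
joined = "".join(p.itertext())
print(joined == full)   # True
```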

Notes
etree.tostring() returns bytes; call decode() to convert the result to a str.
lxml automatically repairs the HTML it parses, adding any missing tags.
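Both notes can be checked with a short sketch. The broken fragment below is hypothetical; it has no html or body element and leaves its tags unclosed. Passing encoding="unicode" to tostring is an alternative that returns a str directly.

```python
from lxml import etree

# Hypothetical fragment: no <html>/<body>, and <p> and <b> are never closed
broken = '<p class="yoyo">hello<b>world'
tree = etree.HTML(broken)

raw = etree.tostring(tree)                       # bytes
text = etree.tostring(tree, encoding="unicode")  # str, no decode() needed
print(type(raw), type(text))

# The serialized tree now has html and body wrappers and closing tags
print(raw.decode("utf-8"))
```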
