Finding Elements with XPath in Python

Reposted article, 2020-11-08 16:38, tagged python

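The reference text below is partly excerpted from W3School.

Selecting Nodes

XPath uses path expressions to select nodes in an XML document. A node is selected by following a path, which is made up of steps.

The most useful path expressions are listed below: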

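Expression	Description
nodename	Selects the child nodes of the current node that are named nodename.
/	Selects from the root node.
//	Selects matching nodes anywhere in the document below the current node, regardless of their position.
.	Selects the current node.
..	Selects the parent of the current node.
@	Selects attributes.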
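
As a minimal sketch of the expressions in the table, the snippet below runs each one with lxml against a small bookstore document. The document and the variable names are made up for this illustration and are not part of the original article.

```python
from lxml import etree  # third-party: pip install lxml

# Hypothetical sample document, invented for this illustration.
xml = """
<bookstore>
  <book lang="en"><title>Harry Potter</title><price>29.99</price></book>
  <book lang="en"><title>Learning XML</title><price>39.95</price></book>
</bookstore>
"""
root = etree.fromstring(xml)           # root is the <bookstore> element

books = root.xpath('book')             # nodename: child elements named book
abs_root = root.xpath('/bookstore')    # / : start from the document root
titles = root.xpath('//title')         # //: title elements at any depth
current = root.xpath('.')              # . : the context node itself
parent = books[0].xpath('..')          # ..: parent of the first book
langs = root.xpath('//@lang')          # @ : attribute values, as strings

print(len(books))                      # 2
print(abs_root[0].tag)                 # bookstore
print([t.text for t in titles])        # ['Harry Potter', 'Learning XML']
print(current[0].tag, parent[0].tag)   # bookstore bookstore
print(langs)                           # ['en', 'en']
```

Every call returns a list, even when only one node matches, so index into the result before calling `.tag` or `.xpath()` on it.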

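Selecting Unknown Nodes

XPath wildcards can be used to select XML elements whose names are not known in advance.

Wildcard	Description
*	Matches any element node.
@*	Matches any attribute node.
node()	Matches a node of any kind.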
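
The difference between the three wildcards is easiest to see on a document that mixes elements, text and a comment. The following is a small sketch with lxml; the sample document is an assumption made for the illustration.

```python
from lxml import etree  # third-party: pip install lxml

# Hypothetical sample document, invented for this illustration.
xml = ('<bookstore id="b1">'
       '<book lang="en">A<title>T</title></book>'
       '<!-- note -->'
       '</bookstore>')
root = etree.fromstring(xml)

elements = root.xpath('/bookstore/*')         # * : element children only
attrs = root.xpath('//@*')                    # @*: values of every attribute
nodes = root.xpath('/bookstore/node()')       # node(): the book element and the comment
book_nodes = root.xpath('//book/node()')      # node(): the text "A" and the title element

print([e.tag for e in elements])              # ['book']
print(attrs)                                  # ['b1', 'en']
print(len(nodes))                             # 2
print(book_nodes[0], book_nodes[1].tag)       # A title
```

Note that `*` skips the comment and the text node, while `node()` returns them alongside the elements.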

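XPath Axes

An axis defines a node-set relative to the current node.

Axis name	Result
ancestor	Selects all ancestors (parent, grandparent, etc.) of the current node.
ancestor-or-self	Selects all ancestors (parent, grandparent, etc.) of the current node and the current node itself.
attribute	Selects all attributes of the current node.
child	Selects all children of the current node.
descendant	Selects all descendants (children, grandchildren, etc.) of the current node.
descendant-or-self	Selects all descendants (children, grandchildren, etc.) of the current node and the current node itself.
following	Selects everything in the document after the closing tag of the current node.
namespace	Selects all namespace nodes of the current node.
parent	Selects the parent of the current node.
preceding	Selects all nodes that appear in the document before the opening tag of the current node, excluding its ancestors.
preceding-sibling	Selects all siblings before the current node.
self	Selects the current node.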
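
A short sketch of how several axes behave, evaluated from the second book of a made-up bookstore document (the document and the `tags` helper are assumptions for this illustration). It also uses `following-sibling`, a standard XPath 1.0 axis that is not listed in the table above.

```python
from lxml import etree  # third-party: pip install lxml

# Hypothetical sample document, invented for this illustration.
xml = """
<bookstore>
  <book><title>A</title><price>10</price></book>
  <book><title>B</title><price>20</price></book>
  <cd><title>C</title></cd>
</bookstore>
"""
root = etree.fromstring(xml)
second = root.xpath('/bookstore/book[2]')[0]   # context node for the axes below


def tags(nodes):
    """Helper for this example: reduce a list of elements to their tag names."""
    return [n.tag for n in nodes]


ancestors = tags(second.xpath('ancestor::*'))            # ['bookstore']
with_self = tags(second.xpath('ancestor-or-self::*'))    # ['bookstore', 'book']
children = tags(second.xpath('child::*'))                # ['title', 'price']
preceding = tags(second.xpath('preceding::*'))           # ['book', 'title', 'price']
prev_sibs = tags(second.xpath('preceding-sibling::*'))   # ['book']
following = tags(second.xpath('following::*'))           # ['cd', 'title']
next_sibs = tags(second.xpath('following-sibling::*'))   # ['cd']

print(ancestors, with_self, children, preceding,
      prev_sibs, following, next_sibs, sep='\n')
```

lxml returns the results in document order, including for reverse axes such as `ancestor` and `preceding`.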

XPath Operators

The operators that can be used in XPath expressions are listed below:

Operator	Description	Example	Return value
|	Computes the union of two node-sets	//book | //cd	Returns a node-set containing all book and cd elements
+	Addition	6 + 4	10
-	Subtraction	6 - 4	2
*	Multiplication	6 * 4	24
div	Division	8 div 4	2
=	Equal	price=9.80	true if price is 9.80; false if price is 9.90.
!=	Not equal	price!=9.80	true if price is 9.90; false if price is 9.80.
<	Less than	price<9.80	true if price is 9.00; false if price is 9.90.
<=	Less than or equal to	price<=9.80	true if price is 9.00; false if price is 9.90.
>	Greater than	price>9.80	true if price is 9.90; false if price is 9.80.
>=	Greater than or equal to	price>=9.80	true if price is 9.90; false if price is 9.70.
or	Logical or	price=9.80 or price=9.70	true if price is 9.80; false if price is 9.50.
and	Logical and	price>9.00 and price<9.90	true if price is 9.80; false if price is 8.50.
mod	Remainder of a division	5 mod 2	1
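
The sketch below evaluates a few of these operators with lxml to show what Python type each kind of result comes back as: node-sets become lists, numbers become floats, and comparisons become booleans. The bookstore document and its prices are assumptions chosen to mirror the table's examples.

```python
from lxml import etree  # third-party: pip install lxml

# Hypothetical sample document, invented for this illustration.
xml = """
<bookstore>
  <book><title>A</title><price>9.80</price></book>
  <book><title>B</title><price>9.90</price></book>
  <cd><title>C</title><price>9.50</price></cd>
</bookstore>
"""
root = etree.fromstring(xml)

union = root.xpath('//book | //cd')      # node-set -> list of elements
total = root.xpath('6 + 4')              # number -> float
quotient = root.xpath('8 div 4')
remainder = root.xpath('5 mod 2')
any_980 = root.xpath('//price=9.80')     # boolean: True if any price equals 9.80

# Operators are most often used inside predicates to filter nodes.
cheap = root.xpath('//book[price<9.85]/title/text()')
in_range = root.xpath('//*[price>9.00 and price<9.85]/title/text()')

print([e.tag for e in union])            # ['book', 'book', 'cd']
print(total, quotient, remainder)        # 10.0 2.0 1.0
print(any_980)                           # True
print(cheap, in_range)                   # ['A'] ['A', 'C']
```

When a node-set is compared with a number, the comparison is true as soon as one node in the set satisfies it, which is why `//price=9.80` is True even though other prices differ.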

Sample code:

# lxml is a third-party module and must be installed first (pip install lxml)
import lxml.etree as etree

html = """
<!DOCTYPE html>
<html>
<head lang="en">
    <title>XPath test</title>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>
</head>
<body>
<div id="content">
    <ul id="ul">
        <li>NO.1</li>
        <li>NO.2</li>
        <li>NO.3</li>
    </ul>
    <ul id="ul2">
        <li>one</li>
        <li>two</li>
    </ul>
    <div>
        <p>Second tag</p>
    </div>
</div>
<div id="url">
    <a href="http:www.58.com" title="58">58</a>
    <a href="http:www.csdn.net" title="CSDN">CSDN</a>
    <a href="http:www.csdn.net222" title="CSDN">sssss</a>
</div>
</body>
</html>
"""
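selector = etree.HTML(html)  # returns the root <html> Element
# etree.dump(selector)  # print the element's content
# tree = etree.parse('./hello.html', etree.HTMLParser())  # or parse an HTML file instead

print('-----------------------------Basic usage---------------------------')

# A relative path looks for html children of the context node. The context node is
# already <html>, so this is empty. Note that xpath() always returns a list here.
tag1 = selector.xpath('html')
print(tag1)
tag1 = selector.xpath('/html')  # / : start from the document root and select its html child
print(tag1)
tag1 = selector.xpath('/html//li')  # // : select all descendant li nodes
print(tag1)
tag1 = selector.xpath('.')  # . : the current node itself
print(tag1)
tag1 = selector.xpath('/html/head')  # the head child of the root html node
print(tag1)
tag1 = tag1[0].xpath('..')  # .. : the parent node; index into the list first, a list has no xpath()
print(tag1)
tag1 = selector.xpath('//@href')  # @ selects attributes: returns the value of every href in the document
print(tag1)
tag1 = selector.xpath('//@href="http:www.58.com"')  # True if any href equals "http:www.58.com", else False
print(tag1)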

print('-----------------------------Special filters----------------------------')

# [1] selects every a that is the first a child of its parent. It appears to do nothing in
# //@href[1] because the predicate is applied per element; use (//@href)[1] instead.
tag1 = selector.xpath('//a[1]')
print(tag1)
tag1 = selector.xpath('//a[last()]')  # last() selects the last a tag under its parent
print(tag1)
tag1 = selector.xpath('//a[last()-1]')  # the second-to-last a tag
print(tag1)
tag1 = selector.xpath('//a[position()<3]')  # the first two a tags
print(tag1)
tag1 = selector.xpath('//div[@id="content"]')  # every div whose id attribute equals "content"
print(tag1)
tag1 = selector.xpath('//div[@id]')  # every div that has an id attribute
print(tag1)
# Returns the div with id="url", provided it has a child a whose text, read as a number, is > 10
tag1 = selector.xpath('//div[@id="url"][a>10]')
print(tag1)
tag1 = selector.xpath('//@*')  # * means any, @ means attribute: returns the values of all attributes
print(tag1)
tag1 = selector.xpath('//*')  # selects all element nodes
print(tag1)
tag1 = selector.xpath('//div[@*]')  # every div with at least one attribute; divs without attributes are skipped
print(tag1)

print('--------------------------Operators--------------------------')

tag1 = selector.xpath('//head | //body')  # | is the union operator: selects both head and body
print(tag1)
tag1 = selector.xpath('6+5')  # arithmetic: returns the float 11.0
print(tag1)
tag1 = selector.xpath('//a=58')  # True if the text of any a tag, read as a number, equals 58; else False
print(tag1)

print('-------------------------Axes-----------------------')

# child::meta selects the meta children of the current node
child = selector.xpath('//head/child::meta[@content]')

# ancestor::* selects all ancestors of the current node (parent, grandparent, ...)
anc = selector.xpath('//div[@id="content"]/ancestor::*')

# ancestor-or-self: the ancestors plus the current node
anc_self = selector.xpath('//div[@id="content"]/ancestor-or-self::*')

# attribute: the values of all attributes of the current node
attr = selector.xpath('//div[@id="content"]/attribute::*')

# descendant: all descendants (children, grandchildren, ...)
desc = selector.xpath('//div[@id="content"]/descendant::*')

# descendant-or-self: all descendants plus the current node
descendant = selector.xpath('//div[@id="content"]/descendant-or-self::*')

# following: every element after the current node's closing tag, at any depth
following = selector.xpath('//div[@id="content"]/following::*')

# namespace: the namespace nodes of the current node
namespace = selector.xpath('//div[@id="content"]/namespace::*')

# parent: the parent of the current node
parent = selector.xpath('//div[@id="content"]/parent::*')

# preceding: every element before the current node's opening tag, at any depth, excluding ancestors
preceding = selector.xpath('//div[@id="content"]/preceding::*')

# preceding-sibling: all siblings before the current node
preceding_sibling = selector.xpath('//div[@id="content"]/preceding-sibling::*')

# self: the current node itself
self_node = selector.xpath('//div[@id="content"]/self::*')
print(child, anc, anc_self, attr, desc, descendant, following, namespace,
      parent, preceding, preceding_sibling, self_node, sep='\n')

print('-------------------------Functions---------------------------')

# contains(): select the a tags whose href value contains "58"
tag1 = selector.xpath('//a[contains(@href,"58")]')
print(tag1)

tag1 = selector.xpath('concat("1","6")')  # string concatenation: returns '16'
print(tag1)

tag2 = selector.xpath('//meta[@content]/@content')
# starts-with(): True if the content value begins with "text", else False
tag1 = selector.xpath('starts-with(//meta[@content]/@content,"text")')
print(tag2, tag1)

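print('-------------------------Common methods---------------------------')

# All text under the div, including text inside its descendants.
# Whitespace between tags is returned too, so strings such as '\n    ' may appear.
con = selector.xpath('//div[@id="url"]//text()')

# Only the text nodes that are direct children of the div, not those of its descendants
con2 = selector.xpath('//div[@id="url"]/text()')

# Read an element's text through its .text attribute
divs = selector.xpath('//div[@id="url"]/*')
for d in divs:
    # the element's text, its tag name, and its attributes (a dict-like mapping)
    print(d.text, d.tag, d.attrib)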
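
Two points from the sample above tend to cause confusion: a positional predicate such as `[1]` is applied per parent element rather than to the whole result, and `text()` returns whitespace exactly as it appears in the markup. The following is a minimal sketch of one way to handle both, using a trimmed copy of the `url` div with an extra `p` element added for the illustration.

```python
from lxml import etree  # third-party: pip install lxml

# Trimmed copy of the sample markup; the <p> element is added for this illustration.
html = """
<div id="url">
    <a href="http:www.58.com" title="58">58</a>
    <a href="http:www.csdn.net" title="CSDN">CSDN</a>
    <p>
        some   padded
        text
    </p>
</div>
"""
doc = etree.HTML(html)

# //@href[1] means "the first href attribute of each element", so every href matches.
per_element = doc.xpath('//@href[1]')
# Parentheses make the predicate apply to the combined node-set instead.
first_only = doc.xpath('(//@href)[1]')

# text() keeps the surrounding whitespace; normalize-space() trims and collapses it.
raw = doc.xpath('//div[@id="url"]/p/text()')
clean = doc.xpath('normalize-space(//div[@id="url"]/p)')

print(per_element)   # ['http:www.58.com', 'http:www.csdn.net']
print(first_only)    # ['http:www.58.com']
print(repr(raw[0]))
print(clean)         # some padded text
```

An alternative to `normalize-space()` is to clean up on the Python side, for example with `' '.join(raw[0].split())`.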
