Python3 Web Crawler (5): Using Parsing Libraries, XPath

This article is a repost (2018-04-27 11:23). Tag: Python crawler

Infi-chu:

http://www.cnblogs.com/Infi-chu/

XPath:

XPath, short for XML Path Language, is a language for locating information in XML documents. It works on HTML documents as well.

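1. Common XPath rules

Expression    Description

nodename      Selects all nodes named nodename

/             Selects direct children of the current node

//            Selects descendants of the current node, at any depth

.             Selects the current node

..            Selects the parent of the current node

@             Selects attributes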
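The rules in the table can be tried out with lxml. The following is a minimal sketch against a made-up one-line document; the markup and the variable names are illustrative assumptions, not part of the original article.

```python
from lxml import etree

# Hypothetical sample document, used only for illustration.
doc = etree.HTML('<div id="box"><ul><li class="a"><a href="x.html">x</a></li></ul></div>')

li_nodes = doc.xpath('//li')          # //  : li nodes anywhere in the document
a_nodes = doc.xpath('//li/a')         # /   : a nodes that are direct children of an li
li = li_nodes[0]
current = li.xpath('.')               # .   : the node the query starts from
parent = li.xpath('..')               # ..  : the parent of that node
classes = doc.xpath('//li/@class')    # @   : attribute values

print([e.tag for e in li_nodes])      # ['li']
print([e.tag for e in a_nodes])       # ['a']
print(current[0].tag, parent[0].tag)  # li ul
print(classes)                        # ['a']
```

Note that xpath() can be called on any element, in which case . and .. are resolved relative to that element.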

2. Preparation: install the lxml library (for example, with pip install lxml)

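3. Example:

from lxml import etree
text = '''
<div>
<ul>
<li class="ex1"><a href="ex1.html">ex1</a></li>
<li class="ex2"><a href="ex2.html">ex2</a>
</ul>
</div>
'''
html = etree.HTML(text)    # parse the string; the HTML parser repairs the markup (here, the unclosed second li)
r = etree.tostring(html)     # serialize the repaired tree, including the html and body nodes the parser added
print(r.decode('utf-8'))       # tostring() returns bytes, so decode them to a str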
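To check the repair rather than read it off the printed output, a small sketch like the one below can help. It reuses the sample text from this section; the variable name repaired is an assumption made for illustration.

```python
from lxml import etree

text = '''
<div>
<ul>
<li class="ex1"><a href="ex1.html">ex1</a></li>
<li class="ex2"><a href="ex2.html">ex2</a>
</ul>
</div>
'''
html = etree.HTML(text)

# encoding='unicode' makes tostring() return a str instead of bytes,
# so no decode step is needed.
repaired = etree.tostring(html, encoding='unicode')

print(html.tag)                 # the root element added by the parser
print(repaired.count('</li>'))  # number of closing li tags after the repair
```

The input has only one closing li tag, so a count of two in the serialized result shows that the parser closed the second li on its own.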

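4. All nodes

Select every node (the file ./test.html is not shown in the article; it is assumed here to hold the HTML from section 3):

from lxml import etree
html = etree.parse('./test.html',etree.HTMLParser())
res = html.xpath('//*')    # * matches element nodes of any name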
print(res)

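5. Child nodes

Select all a nodes that are direct children of li nodes (use //li//a instead to match a nodes at any depth below li):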

from lxml import etree
html = etree.parse('./test.html',etree.HTMLParser())
res = html.xpath('//li/a')
print(res)

6. Parent node

Use .. to step from a node up to its parent (. refers to the current node itself).
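This section has no example of its own, so here is a minimal sketch. It assumes the same sample markup as section 3 (with the second li closed) and looks up the class of the li that contains a given link.

```python
from lxml import etree

text = '''
<div>
<ul>
<li class="ex1"><a href="ex1.html">ex1</a></li>
<li class="ex2"><a href="ex2.html">ex2</a></li>
</ul>
</div>
'''
html = etree.HTML(text)

# Find the a node by its href, step up to its parent li with "..",
# then read that li's class attribute.
res = html.xpath('//a[@href="ex2.html"]/../@class')
print(res)       # ['ex2']

# The parent:: axis expresses the same step explicitly.
res_axis = html.xpath('//a[@href="ex2.html"]/parent::*/@class')
print(res_axis)  # ['ex2']
```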

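7. Attribute matching

Use [@attribute="value"] to keep only the nodes whose attribute has the given value:

from lxml import etree
html = etree.parse('./test.html',etree.HTMLParser())
res = html.xpath('//li[@class="ex1"]')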
print(res)

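8. Text retrieval

There are two ways to get the text inside an li node; the second is recommended.

a. Step through the a child and take its text

from lxml import etree
html = etree.parse('./test.html',etree.HTMLParser())
res = html.xpath('//li[@class="ex1"]/a/text()')
print(res)

b. Recommended: //text() collects the text of every descendant, so nothing is missed (the result can include whitespace-only strings)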

from lxml import etree
html = etree.parse('./test.html',etree.HTMLParser())
res = html.xpath('//li[@class="ex1"]//text()')
print(res)
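The difference between the two methods only shows up when the li holds text outside the a node. The sketch below uses a made-up li for that purpose; the markup is an assumption for illustration and is not the article's test.html.

```python
from lxml import etree

# Hypothetical markup: the li has its own text before and after the link.
html = etree.HTML('<ul><li class="ex1">intro <a href="ex1.html">ex1</a> tail</li></ul>')

# Method a: only the text directly inside the a node.
via_child = html.xpath('//li[@class="ex1"]/a/text()')
print(via_child)   # ['ex1']

# Method b: every text node below the li, in document order.
all_text = html.xpath('//li[@class="ex1"]//text()')
print(all_text)    # ['intro ', 'ex1', ' tail']

# Joining the pieces gives the full visible text of the li.
print(''.join(all_text).strip())   # intro ex1 tail
```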

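9. Attribute retrieval

Get the href attribute of every a node directly under an li node. Unlike the [@class="ex1"] filter in section 7, a trailing /@href returns the attribute values themselves: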

from lxml import etree
html = etree.parse('./test.html',etree.HTMLParser())
res = html.xpath('//li/a/@href')
print(res)
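In practice a link's address and its text are usually wanted together. One possible way, sketched below, is to select the a elements and read both from each element with the lxml element API (get() and .text). The inline markup is assumed to match the sample from section 3, since the contents of test.html are not shown.

```python
from lxml import etree

text = '''
<div>
<ul>
<li class="ex1"><a href="ex1.html">ex1</a></li>
<li class="ex2"><a href="ex2.html">ex2</a></li>
</ul>
</div>
'''
html = etree.HTML(text)

# XPath returns the attribute values as a list of strings.
hrefs = html.xpath('//li/a/@href')
print(hrefs)   # ['ex1.html', 'ex2.html']

# Selecting the elements instead keeps each href paired with its text.
links = [(a.get('href'), a.text) for a in html.xpath('//li/a')]
print(links)   # [('ex1.html', 'ex1'), ('ex2.html', 'ex2')]
```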

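10. Matching attributes with multiple values

When an attribute holds several values, such as class="li li-first", an exact comparison like @class="li" no longer matches. Use contains() instead:

from lxml import etree
text = '''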
<div>
<ul>
<li class="li li-first"><a href="ex1.html">li1</a></li>
</ul>
</div>
'''
html = etree.HTML(text)
res = html.xpath('//li[contains(@class,"li")]/a/text()')
print(res)

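[Note]

In contains(), the first argument is the attribute (for example @class) and the second is the value to look for. The test is a substring match, so "li" also matches a class such as "list".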
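The sketch below compares an exact match, contains(), and a stricter whole-word test. The second li (class="list") is an assumption added for illustration, and the concat/normalize-space expression is a common XPath 1.0 idiom rather than something taken from the original article.

```python
from lxml import etree

text = '''
<div>
<ul>
<li class="li li-first"><a href="ex1.html">li1</a></li>
<li class="list"><a href="ex2.html">other</a></li>
</ul>
</div>
'''
html = etree.HTML(text)

# Exact comparison: no class attribute equals "li" as a whole.
exact = html.xpath('//li[@class="li"]/a/text()')
print(exact)     # []

# Substring match: "li" occurs in both "li li-first" and "list".
partial = html.xpath('//li[contains(@class,"li")]/a/text()')
print(partial)   # ['li1', 'other']

# Whole-word match: pad the attribute with spaces and look for " li ".
token = html.xpath(
    '//li[contains(concat(" ", normalize-space(@class), " "), " li ")]/a/text()'
)
print(token)     # ['li1']
```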

11. Matching multiple attributes

Identify a node by several attributes at once, combining the conditions with and:

from lxml import etree
text = '''
<div>
<ul>
<li class="li" name="123"><a href="ex1.html">ex1</a></li>
</ul>
</div>
'''
html = etree.HTML(text)
res = html.xpath('//li[contains(@class,"li") and @name="123"]/a/text()')    # both conditions must hold
print(res)
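Besides and, XPath predicates also accept or and the not() function. The sketch below shows all three on a made-up pair of li nodes; the second li and its name value are assumptions added for illustration.

```python
from lxml import etree

text = '''
<div>
<ul>
<li class="li li-first" name="123"><a href="ex1.html">ex1</a></li>
<li class="li" name="456"><a href="ex2.html">ex2</a></li>
</ul>
</div>
'''
html = etree.HTML(text)

# and: both conditions must hold.
both = html.xpath('//li[contains(@class,"li") and @name="123"]/a/text()')
print(both)     # ['ex1']

# or: either condition is enough.
either = html.xpath('//li[@name="123" or @name="456"]/a/text()')
print(either)   # ['ex1', 'ex2']

# not(): invert a condition.
negated = html.xpath('//li[not(@name="123")]/a/text()')
print(negated)  # ['ex2']
```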

12. Selecting by position (multiple nodes)

from lxml import etree
text = '''
<div>
<ul>
<li class="ex1"><a href="ex1.html">ex1</a></li>
<li class="ex2"><a href="ex2.html">ex2</a></li>
<li class="ex3"><a href="ex3.html">ex3</a></li>
</ul>
</div>
'''
html = etree.HTML(text)
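res = html.xpath('//li[1]/a/text()')    # the first li
res = html.xpath('//li[last()]/a/text()')    # the last li
res = html.xpath('//li[position()<3]/a/text()')    # the first two li nodes
res = html.xpath('//li[last()-2]/a/text()')    # the third li from the end (the first li in this three-item list)

[Note]

Positions start at 1, not 0.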
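One detail worth knowing is that a position predicate is counted among the siblings under each parent, not across the whole document. The sketch below uses a made-up document with two lists to show the difference; the markup is an assumption for illustration.

```python
from lxml import etree

# Hypothetical document with two separate lists.
html = etree.HTML(
    '<div><ul><li>a</li><li>b</li></ul><ul><li>c</li><li>d</li></ul></div>'
)

# //li[1] picks the first li under EACH parent.
per_parent = html.xpath('//li[1]/text()')
print(per_parent)   # ['a', 'c']

# Parentheses apply the position to the whole result set instead.
whole_doc = html.xpath('(//li)[1]/text()')
print(whole_doc)    # ['a']

# last() follows the same per-parent rule.
last_each = html.xpath('//li[last()]/text()')
print(last_each)    # ['b', 'd']
```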

13. Selecting along node axes

from lxml import etree
text = '''
<div>
<ul>
<li class="ex1"><a href="ex1.html">ex1</a></li>
<li class="ex2"><a href="ex2.html">ex2</a></li>
<li class="ex3"><a href="ex3.html">ex3</a></li>
</ul>
</div>
'''
html = etree.HTML(text)
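res = html.xpath('//li[1]/ancestor::*')    # all ancestor nodes
res = html.xpath('//li[1]/ancestor::div')    # only the div ancestor
res = html.xpath('//li[1]/attribute::*')    # all attribute values
res = html.xpath('//li[1]/child::a[@href="ex1.html"]')    # direct a children whose href is ex1.html
res = html.xpath('//li[1]/descendant::span')    # descendant span nodes (none in this sample, so the result is empty)
res = html.xpath('//li[1]/following::*[2]')    # the second node after the current node, in document order
res = html.xpath('//li[1]/following-sibling::*')    # all later siblings of the current node

[Note] The names before :: are all axes:

ancestor, attribute, child, descendant, following, following-sibling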
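Because the snippet above overwrites res each time, the individual results are never seen. The sketch below runs the same kind of queries on the same sample and prints something readable for each; the variable names are assumptions chosen for illustration.

```python
from lxml import etree

text = '''
<div>
<ul>
<li class="ex1"><a href="ex1.html">ex1</a></li>
<li class="ex2"><a href="ex2.html">ex2</a></li>
<li class="ex3"><a href="ex3.html">ex3</a></li>
</ul>
</div>
'''
html = etree.HTML(text)

# Ancestors come back in document order, outermost first.
ancestors = [e.tag for e in html.xpath('//li[1]/ancestor::*')]
print(ancestors)   # ['html', 'body', 'div', 'ul']

# The attribute axis returns the attribute values.
attrs = html.xpath('//li[1]/attribute::*')
print(attrs)       # ['ex1']

# No span exists below the first li, so the result is empty.
spans = html.xpath('//li[1]/descendant::span')
print(spans)       # []

# following::* skips the node's own descendants: li.ex2, then its a node.
second_after = html.xpath('//li[1]/following::*[2]')[0]
print(second_after.tag, second_after.get('href'))   # a ex2.html

# Later siblings of the first li.
siblings = [e.get('class') for e in html.xpath('//li[1]/following-sibling::*')]
print(siblings)    # ['ex2', 'ex3']
```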
