Python Web Scraping: Parsing Data with XPath

Reposted article. 2019-04-18 14:13

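XPath is a language for finding information in XML documents. It can be used to traverse the elements and attributes of an XML document.

XPath has seven kinds of nodes: element, attribute, text, namespace, processing instruction, comment, and document (root) nodes.

Nodes

Consider the following example first: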

<?xml version="1.0" encoding="ISO-8859-1"?>

<bookstore>

<book>
  <title lang="en">Harry Potter</title>
  <author>J K. Rowling</author> 
  <year>2005</year>
  <price>29.99</price>
</book>

</bookstore>

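Examples of nodes in the document above:

<bookstore> (root element node)
<author>J K. Rowling</author> (element node)
lang="en" (attribute node)

Parent: in the example above, book is the parent of title, author, year, and price.

Children: conversely, title, author, year, and price are the children of book.

Siblings: title, author, year, and price are siblings.

Ancestors: the ancestors of title are book and bookstore.

Descendants: the descendants of bookstore are book, title, author, year, and price.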
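The relationships listed above can be checked with lxml. The following is a minimal sketch, assuming lxml is installed; the variable names are illustrative, and the axes used (ancestor, following-sibling) are standard XPath 1.0 axes rather than anything specific to this article.

```python
from lxml import etree

# The first example document, passed as bytes because it carries an
# encoding declaration.
xml = b"""<?xml version="1.0" encoding="ISO-8859-1"?>
<bookstore>
<book>
  <title lang="en">Harry Potter</title>
  <author>J K. Rowling</author>
  <year>2005</year>
  <price>29.99</price>
</book>
</bookstore>"""

root = etree.fromstring(xml)      # the <bookstore> element
title = root.xpath('//title')[0]  # the single <title> element

# Parent of title
parent = title.getparent().tag
# Children of book
children = [el.tag for el in root.xpath('/bookstore/book/*')]
# Siblings that follow title
siblings = [el.tag for el in title.xpath('following-sibling::*')]
# Ancestors of title
ancestors = [el.tag for el in title.xpath('ancestor::*')]
# Descendants of bookstore
descendants = [el.tag for el in root.xpath('/bookstore//*')]

print(parent, children, siblings, ancestors, descendants, sep='\n')
```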

Now consider a second example:

<?xml version="1.0" encoding="ISO-8859-1"?>

<bookstore>

<book>
  <title lang="eng">Harry Potter</title>
  <price>29.99</price>
</book>

<book>
  <title lang="eng">Learning XML</title>
  <price>39.95</price>
</book>

</bookstore>

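How do we select nodes?

XPath uses path expressions to select nodes in an XML document. A node is selected by following a path, that is, a sequence of steps. The basic building blocks are: nodename (child elements with that name), / (start from the root, or separate two steps), // (matching nodes at any depth), . (the current node), .. (the parent of the current node), and @ (an attribute).

Applied to the example above, these give:

/bookstore: selects the root element bookstore.
/bookstore/book: selects every book element that is a child of bookstore.
//book: selects every book element, wherever it is in the document.
bookstore//book: selects every book element that is a descendant of bookstore.
//@lang: selects every attribute named lang.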
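To see these path expressions return real results, here is a small sketch that runs a few of them against the second example document with lxml. It assumes lxml is installed, and the variable names are only illustrative.

```python
from lxml import etree

xml = b"""<?xml version="1.0" encoding="ISO-8859-1"?>
<bookstore>
<book>
  <title lang="eng">Harry Potter</title>
  <price>29.99</price>
</book>
<book>
  <title lang="eng">Learning XML</title>
  <price>39.95</price>
</book>
</bookstore>"""

root = etree.fromstring(xml)

# Absolute path: every book that is a child of the root element bookstore
books = root.xpath('/bookstore/book')
# "//" matches at any depth; text() returns the text of each title
titles = root.xpath('//title/text()')
# "@" selects attributes; lxml returns their values as strings
langs = root.xpath('//@lang')

# Relative paths are evaluated from the element they are called on
first_book = books[0]
own_price = first_book.xpath('./price/text()')  # "." is the current node
parent_tag = first_book.xpath('..')[0].tag      # ".." is the parent node

print(len(books), titles, langs, own_price, parent_tag)
```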

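Predicates: a predicate, written in square brackets, finds a specific node or a node that contains a specific value.

For example:

/bookstore/book[1]: selects the first book element that is a child of bookstore.
/bookstore/book[last()]: selects the last book element that is a child of bookstore.
//title[@lang]: selects every title element that has a lang attribute.
//title[@lang='eng']: selects every title element whose lang attribute equals eng.
/bookstore/book[price>35.00]: selects every book element whose price child has a value greater than 35.00.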
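Predicates can be tried out in the same way. This sketch assumes lxml is installed and reuses the second example document; note that XPath positions start at 1, not 0.

```python
from lxml import etree

xml = b"""<?xml version="1.0" encoding="ISO-8859-1"?>
<bookstore>
<book>
  <title lang="eng">Harry Potter</title>
  <price>29.99</price>
</book>
<book>
  <title lang="eng">Learning XML</title>
  <price>39.95</price>
</book>
</bookstore>"""

root = etree.fromstring(xml)

# Positional predicates: [1] is the first match, last() is the final one
first = root.xpath('/bookstore/book[1]/title/text()')
last = root.xpath('/bookstore/book[last()]/title/text()')
# Attribute predicate: title elements whose lang attribute equals "eng"
by_lang = root.xpath("//title[@lang='eng']/text()")
# Value predicate: books whose price child is numerically greater than 35.00
expensive = root.xpath('/bookstore/book[price>35.00]/title/text()')

print(first, last, by_lang, expensive)
```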

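Selecting unknown nodes: the wildcards * (any element node), @* (any attribute node), and node() (a node of any kind) match nodes whose names are not known in advance.

For example:

/bookstore/*: selects all child elements of bookstore.
//*: selects every element in the document.
//title[@*]: selects every title element that has at least one attribute.

Selecting several paths: by using the "|" operator in a path expression, you can select several paths at once. For example, //book/title | //book/price selects all title and price elements of every book element.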
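Wildcards and the "|" operator can be sketched as follows, again assuming lxml is installed and using the second example document. The counts in the comments follow directly from that document's seven elements.

```python
from lxml import etree

xml = b"""<?xml version="1.0" encoding="ISO-8859-1"?>
<bookstore>
<book>
  <title lang="eng">Harry Potter</title>
  <price>29.99</price>
</book>
<book>
  <title lang="eng">Learning XML</title>
  <price>39.95</price>
</book>
</bookstore>"""

root = etree.fromstring(xml)

# "*" matches any element: here, the two book children of bookstore
child_tags = [el.tag for el in root.xpath('/bookstore/*')]
# "//*" matches every element: 1 bookstore + 2 book + 2 title + 2 price
element_count = len(root.xpath('//*'))
# "@*" matches any attribute: title elements that carry at least one
with_attr = [el.tag for el in root.xpath('//title[@*]')]
# "|" merges the node sets selected by two path expressions
titles_and_prices = [el.tag for el in root.xpath('//book/title | //book/price')]

print(child_tags, element_count, with_attr, titles_and_prices)
```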

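Commonly used XPath expressions:

    # Attribute matching: find div tags whose class attribute is "song"
    //div[@class="song"]

    # Hierarchical positioning: find the a tag that is a direct child of the
    # second li under the ul that is a direct child of the div with class "tang"
    //div[@class='tang']/ul/li[2]/a

    # Logical operators: find a tags whose class attribute is empty and whose
    # href attribute is "tang"
    //a[@class='' and @href='tang']

    # Fuzzy matching: find div tags whose class attribute value contains 'ng'
    //div[contains(@class, 'ng')]
    # Fuzzy matching: find div tags whose class attribute starts with 'ta'
    //div[starts-with(@class, 'ta')]

    # Getting text
    //div[@class="song"]/p[1]/text()

    # Getting an attribute: the href of the a tag under the second li under
    # the div with class "tang"
    //div[@class="tang"]//li[2]/a/@href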
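The expressions above can be exercised against a small page. The HTML below is hypothetical, invented only so that each expression has something to match; the sketch assumes lxml is installed. The prefix function is spelled starts-with in XPath 1.0.

```python
from lxml import etree

# Hypothetical markup, made up for this illustration
html = """
<div class="song">
  <p>first line</p>
  <p>second line</p>
</div>
<div class="tang">
  <ul>
    <li><a href="http://example.com/1">one</a></li>
    <li><a href="http://example.com/2">two</a></li>
  </ul>
</div>
<a class="" href="tang">empty class</a>
"""

tree = etree.HTML(html)

song_divs = tree.xpath('//div[@class="song"]')
second_link = tree.xpath("//div[@class='tang']/ul/li[2]/a")
empty_class = tree.xpath("//a[@class='' and @href='tang']")
contains_ng = tree.xpath("//div[contains(@class, 'ng')]")
starts_ta = tree.xpath("//div[starts-with(@class, 'ta')]")
first_p = tree.xpath('//div[@class="song"]/p[1]/text()')
href = tree.xpath('//div[@class="tang"]//li[2]/a/@href')

print(len(song_divs), second_link[0].text, len(empty_class))
print(len(contains_ng), len(starts_ta), first_p, href)
```

Element results come back as a list of element objects, while text() and @attribute results come back as a list of strings, so the result is usually indexed or looped over.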

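Using XPath in Python

Convert the HTML or XML document into an etree object (from the lxml library), then call that object's xpath method to find the nodes you want.

1 Local file:

    tree = etree.parse(file_path)

    tree.xpath(xpath_expression)

etree.parse uses the XML parser by default. For an HTML file that is not well-formed XML, pass etree.HTMLParser() as the second argument.

2 Network data:

    tree = etree.HTML(page_source_string)

    tree.xpath(xpath_expression)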
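Both entry points can be compared without touching the network. The sketch below assumes lxml is installed; the page string and the temporary file are stand-ins invented for the illustration, and passing etree.HTMLParser() to etree.parse is a choice made here so that ordinary HTML is accepted.

```python
import os
import tempfile

from lxml import etree

# Stand-in page source; in practice this would come from disk or a response
page = "<html><body><div class='song'><p>hello</p></div></body></html>"

# 1. Local file: write the page to a temporary file, then parse it from disk
with tempfile.NamedTemporaryFile('w', suffix='.html', delete=False,
                                 encoding='utf-8') as f:
    f.write(page)
    path = f.name
try:
    file_tree = etree.parse(path, etree.HTMLParser())
    from_file = file_tree.xpath('//div[@class="song"]/p/text()')
finally:
    os.remove(path)  # clean up the temporary file

# 2. Network data: parse the page source string directly
string_tree = etree.HTML(page)
from_string = string_tree.xpath('//div[@class="song"]/p/text()')

print(from_file, from_string)
```

etree.parse returns an ElementTree object and etree.HTML returns the root element, but both expose the same xpath method.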

Example 1: download one random image from the front page of the Qiushibaike picture section

import random

import requests
from lxml import etree


def main():
    # Page URL
    url = 'https://www.qiushibaike.com/pic/'
    ua_headers = {"User-Agent": 'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0)'}
    # Page source
    response = requests.get(url=url, headers=ua_headers).text
    # Convert to an etree object
    tree = etree.HTML(response)
    # Match the src attribute of every img tag under div tags whose class is "thumb"; returns a list
    img_lst = tree.xpath('//div[@class="thumb"]//img/@src')
    # Pick one image at random and download it
    res = requests.get(url='https:' + random.choice(img_lst), headers=ua_headers).content
    # Save the image locally
    with open('image.jpg', 'wb') as f:
        f.write(res)


if __name__ == '__main__':
    main()
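Example 1 builds the image URL by prepending 'https:' to the src value, which works only when the page uses protocol-relative URLs such as //host/path. A slightly more tolerant sketch, using the standard library's urljoin, is shown below. The helper name absolute_url and the example.com hosts are hypothetical and not part of the original script.

```python
from urllib.parse import urljoin

# Page the src values were scraped from
page_url = 'https://www.qiushibaike.com/pic/'


def absolute_url(src):
    """Resolve an img src value against the page URL."""
    return urljoin(page_url, src)


# Protocol-relative, root-relative and already-absolute values all resolve
print(absolute_url('//pic.example.com/a.jpg'))
print(absolute_url('/static/a.jpg'))
print(absolute_url('http://pic.example.com/a.jpg'))
```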

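Example 2: download the images on the front page of Jandan

import os

import requests
from lxml import etree


def main():
    url = 'http://jandan.net/ooxx'
    headers = {
        "User-Agent": "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/534.50 (KHTML, like Gecko) "
                      "Version/5.1 Safari/534.50"}
    # Fetch the page source and convert it to an etree object
    response = requests.get(url=url, headers=headers).text
    tree = etree.HTML(response)
    # Collect the src attribute of every img tag under div tags whose class is "text"
    img_lst = tree.xpath('//div[@class="text"]//img/@src')
    # Make sure the output directory exists before writing into it
    os.makedirs('image', exist_ok=True)
    for one_image in img_lst:
        # Download each image and save it under its original file name plus ".gif"
        res = requests.get(url='http:' + one_image, headers=headers).content
        with open('image/' + one_image.split('/')[-1] + '.gif', 'wb') as f:
            f.write(res)


if __name__ == '__main__':
    main()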

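Example 2 always appends '.gif' to the last path segment, so a src that already ends in an extension is saved with two. One possible refinement is sketched below using only the standard library. The helper name local_name, the default_ext parameter, and the example.com URLs are assumptions made for illustration.

```python
import posixpath
from urllib.parse import urlsplit


def local_name(src, default_ext='.gif'):
    """Derive a local file name from an image URL."""
    # Take the last segment of the URL path, ignoring any query string
    name = posixpath.basename(urlsplit(src).path)
    _, ext = posixpath.splitext(name)
    # Keep an existing extension; otherwise fall back to the default
    return name if ext else name + default_ext


print(local_name('//wx1.example.com/large/abc123.gif'))
print(local_name('//wx1.example.com/large/abc123'))
print(local_name('http://img.example.com/a/b.jpg?size=1'))
```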