Web Scraping: XPath (Data Extraction)

Reposted article. 2018-06-13 18:12. Category: web scraping

XPath is one of the most commonly used ways to extract data from a page.

XPath is a language for locating information in an XML document. It navigates the document through its elements and attributes.

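XPath has seven kinds of nodes: element, attribute, text, namespace, processing instruction, comment, and document (root) nodes. An XML document is treated as a tree of nodes, and the root of that tree is called the document node or root node.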
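As a rough illustration of the node types listed above, the sketch below selects five of them from a tiny document with lxml. The sample XML and the variable names are made up for this example; namespace nodes and the document node are left out to keep it short.

```python
from lxml import etree

# Hypothetical sample document, invented for this illustration.
xml = b'<root id="r1"><!-- note --><?target data?><item>hello</item></root>'
root = etree.fromstring(xml)

elements = root.xpath('//*')                         # element nodes
attrs = root.xpath('//@id')                          # attribute nodes (returned as strings)
texts = root.xpath('//text()')                       # text nodes
comments = root.xpath('//comment()')                 # comment nodes
pis = root.xpath('//processing-instruction()')       # processing instruction nodes

print([e.tag for e in elements])
print(attrs, texts)
print(comments[0].text, pis[0].target)
```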

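Selecting nodes

XPath uses path expressions to select nodes in an XML document. A node is selected by following a path, which is made up of steps.

The most useful path expressions are listed below: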

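nodename	Selects the child elements of the current node that are named nodename.
/	Selects from the root node.
//	Selects matching nodes at any depth, regardless of their position in the document.
.	Selects the current node.
..	Selects the parent of the current node.
@	Selects attributes.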
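The expressions in the table can be tried out with a small sketch like the following. The bookstore document is a hypothetical sample, not something from the original article.

```python
from lxml import etree

# Hypothetical sample document, invented for this illustration.
xml = b"""<bookstore>
  <book lang="en"><title>Harry Potter</title><price>29.99</price></book>
  <book lang="zh"><title>Learning XML</title><price>39.95</price></book>
</bookstore>"""
root = etree.fromstring(xml)

# "/" starts at the root, so this is an absolute path.
titles_abs = root.xpath('/bookstore/book/title')

# "//" matches <title> elements at any depth.
titles_any = root.xpath('//title')

# A bare node name is relative: child elements of `root` named "book".
books = root.xpath('book')

# "." is the current node, ".." is its parent.
first_title = titles_any[0]
same_node = first_title.xpath('.')[0]
parent = first_title.xpath('..')[0]

# "@" selects attributes; lxml returns their values as strings.
langs = root.xpath('//book/@lang')

print([t.text for t in titles_any], parent.tag, langs)
```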

Steps:

1. Import

from lxml import etree
from lxml.etree import HTMLParser

2. Build the document tree

html_obj = etree.HTML(html, parser=HTMLParser(encoding='utf-8'))

def HTML(text, parser=None, base_url=None): # real signature unknown; restored from __doc__
    """
    HTML(text, parser=None, base_url=None)
    
        Parses an HTML document from a string constant.  Returns the root
        node (or the result returned by a parser target).  This function
        can be used to embed "HTML literals" in Python code.
    
        To override the parser with a different ``HTMLParser`` you can pass it to
        the ``parser`` keyword argument.
    
        The ``base_url`` keyword argument allows to set the original base URL of
        the document to support relative Paths when looking up external entities
        (DTD, XInclude, ...).
    """
    pass
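A minimal sketch of this step, assuming the page has already been downloaded as UTF-8 bytes. The markup below is a made-up stand-in for a real response body.

```python
from lxml import etree
from lxml.etree import HTMLParser

# Hypothetical page content; in a real crawler this would be the response body.
html = '<div class="l_post"><a href="/p/1">第一帖</a></div>'.encode('utf-8')

# Tell the parser explicitly that the bytes are UTF-8.
html_obj = etree.HTML(html, parser=HTMLParser(encoding='utf-8'))

# etree.HTML returns the root <html> element; the parser adds the
# missing <html> and <body> wrappers around the fragment.
print(html_obj.tag)
print(html_obj.xpath('//a/text()'))
```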

It can also be written like this:

html_obj = etree.fromstring(html, parser=HTMLParser(encoding='utf-8'))

def fromstring(text, parser=None, base_url=None): # real signature unknown; restored from __doc__
    """
    fromstring(text, parser=None, base_url=None)
    
        Parses an XML document or fragment from a string.  Returns the
        root node (or the result returned by a parser target).
    
        To override the default parser with a different parser you can pass it to
        the ``parser`` keyword argument.
    
        The ``base_url`` keyword argument allows to set the original base URL of
        the document to support relative Paths when looking up external entities
        (DTD, XInclude, ...).
    """
    pass

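parser=HTMLParser(encoding='utf-8') passes in a custom parser. If parser is omitted, the default parser is used: an HTML parser for etree.HTML, an XML parser for etree.fromstring.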
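The sketch below shows why the parser argument matters for fromstring. It assumes a made-up snippet containing an unclosed <br>, which is normal in HTML but not well-formed XML.

```python
from lxml import etree
from lxml.etree import HTMLParser

# Hypothetical snippet: <br> is never closed, which is fine in HTML only.
html = b'<div class="l_post">line one<br>line two</div>'

# With the default (XML) parser, fromstring rejects the input.
try:
    etree.fromstring(html)
    xml_ok = True
except etree.XMLSyntaxError:
    xml_ok = False

# With an HTMLParser, the same call accepts it and returns the <html> root.
html_obj = etree.fromstring(html, parser=HTMLParser(encoding='utf-8'))

print(xml_ok, html_obj.tag)
print(html_obj.xpath('//div/text()'))
```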

3. Extract data with xpath
For example:

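div_obj = html_obj.xpath('//div[@class="l_post"]')  # [@attr="value"] is a predicate: class must equal "l_post" exactly

divs = html_obj.xpath('//div[contains(@class, "l_post")]')  # contains() matches a substring, useful when class holds several values

divs = html_obj.xpath('//div[@class="l_post"]/text()')  # text() selects the text inside the tag

divs = html_obj.xpath('//div[@class="l_post"]/a/@href')  # @name selects the value of an attribute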
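To see how these four expressions differ, here is a small self-contained sketch. The page markup is hypothetical; only the l_post class name comes from the examples above.

```python
from lxml import etree

# Hypothetical page, invented for this illustration.
html = b"""
<html><body>
  <div class="l_post"><a href="/p/1">first</a>hello</div>
  <div class="l_post j_l_post"><a href="/p/2">second</a>world</div>
  <div class="footer">end</div>
</body></html>
"""
html_obj = etree.HTML(html)

# Exact match: only the div whose class is exactly "l_post".
exact = html_obj.xpath('//div[@class="l_post"]')

# Substring match: also finds class="l_post j_l_post".
partial = html_obj.xpath('//div[contains(@class, "l_post")]')

# Text directly inside the matched div (not inside its <a> child).
texts = html_obj.xpath('//div[@class="l_post"]/text()')

# Attribute values of the links inside every matching div.
hrefs = html_obj.xpath('//div[contains(@class, "l_post")]/a/@href')

# xpath() also works on each result; "./" keeps the query relative to it.
link_texts = [div.xpath('./a/text()')[0] for div in partial]

print(len(exact), len(partial), texts, hrefs, link_texts)
```

Because contains() is a plain substring test, it would also match a class such as "xl_post"; whether that matters depends on the page being scraped.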

Extras:

Converting the document tree back to a string:

html = etree.tostring(html_obj, encoding='utf-8').decode()

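tostring returns bytes, so the result has to be decoded.

In the browser's developer tools, right-click the node you want and choose Copy, then Copy XPath. The copied path expression can be pasted straight into your code, but it may not be entirely accurate.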
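A short sketch of the round trip, using a made-up fragment. It also shows encoding='unicode', an lxml option that returns a str directly and makes the decode step unnecessary.

```python
from lxml import etree

# Hypothetical fragment, invented for this illustration.
html_obj = etree.HTML(b'<div class="l_post">hi</div>')

# With a byte encoding such as utf-8, tostring returns bytes.
raw = etree.tostring(html_obj, encoding='utf-8')
text = raw.decode('utf-8')

# With encoding='unicode', tostring returns a str and no decode is needed.
as_str = etree.tostring(html_obj, encoding='unicode')

print(type(raw).__name__, type(text).__name__, type(as_str).__name__)
print(text)
```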
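One commonly cited reason a copied path fails is that browsers insert a <tbody> element into tables when building the DOM, while the raw HTML parsed by lxml may not contain one. The sketch below assumes a made-up table without <tbody>; the "copied" path is what a browser would typically produce for it.

```python
from lxml import etree

# Hypothetical raw HTML as served: the table has no <tbody>.
html = b"<table><tr><td>cell</td></tr></table>"
html_obj = etree.HTML(html)

# Path as a browser would copy it, including the <tbody> it added itself.
copied = html_obj.xpath('/html/body/table/tbody/tr/td/text()')

# A looser path that does not depend on the exact nesting.
robust = html_obj.xpath('//table//tr/td/text()')

print(copied, robust)
```

Shorter, attribute-based expressions such as //div[@class="l_post"] tend to survive this kind of difference better than long absolute paths.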
