Efficiently extracting text and tag attribute values with XPath in lxml

Reposted article, 2015-12-27 07:16. Tags: python, xpath, lxml

The code below was run and tested without errors in Python 3.5 with Jupyter Notebook.

# The goal of scraping a web page is simply to locate nodes in the DOM tree and then read their text or attribute values

myPage = '''<html>
        <title>TITLE</title>
        <body>
        <h1>My Blog</h1>
        <div>My articles</div>
        <div id="photos">
         <img src="pic1.jpeg"/><span id="pic1">PIC1 is beautiful!</span>
         <img src="pic2.jpeg"/><span id="pic2">PIC2 is beautiful!</span>
         <p><a href="http://www.example.com/more_pic.html">More beautiful pictures</a></p>
         <a href="http://www.baidu.com">Go to Baidu</a>
         <a href="http://www.163.com">Go to NetEase</a>
         <a href="http://www.sohu.com">Go to Sohu</a>
        </div>
        <p class="myclassname">Hello,\nworld!<br/>-- by Adam</p>
        <div class="foot">Some other notes placed in the footer</div>
        </body>
        </html>'''
        
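from lxml import etree  # third-party package: pip install lxml

# The sample page is well-formed, so the XML parser accepts it;
# for real-world HTML that is not well-formed, etree.HTML() is the usual choice
html = etree.fromstring(myPage)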

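# I. Locating nodes
divs1 = html.xpath('//div')                 # every div in the document
divs2 = html.xpath('//div[@id]')            # divs that have an id attribute
divs3 = html.xpath('//div[@class="foot"]')  # divs whose class is exactly "foot"
divs4 = html.xpath('//div[@*]')             # divs that have at least one attribute
divs5 = html.xpath('//div[1]')              # divs that are the first div child of their parent
divs6 = html.xpath('//div[last()-1]')       # divs that are the second-to-last div child of their parent
divs7 = html.xpath('//div[position()<3]')   # divs among the first two div children of their parent
divs8 = html.xpath('//div|//h1')            # union: every div and every h1
divs9 = html.xpath('//div[not(@*)]')        # divs that have no attributes at all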
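
A positional predicate such as `//div[1]` counts positions among the children of each parent, not across the whole document. In the sample page every div shares the same parent, so the difference is invisible there. The sketch below uses a small hypothetical document (the `root` and `section` elements are invented for illustration) to show how `//div[1]` differs from `(//div)[1]`.

```python
from lxml import etree

# Hypothetical document: two parents, each with its own div children.
page = '''<root>
  <section><div>a1</div><div>a2</div></section>
  <section><div>b1</div><div>b2</div></section>
</root>'''
doc = etree.fromstring(page)

# //div[1] tests the position within each parent, so both sections contribute.
first_per_parent = [d.text for d in doc.xpath('//div[1]')]

# (//div)[1] indexes the complete document-order result, so one node remains.
first_overall = [d.text for d in doc.xpath('(//div)[1]')]

print(first_per_parent)  # ['a1', 'b1']
print(first_overall)     # ['a1']
```

When only the first match in the document is wanted, the parenthesized form is the safer choice.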

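# II. Extracting text: text() returns a list of text nodes, unlike html.xpath('string()'), which returns one string
# text() selects only text nodes that are direct children; whitespace between child tags also counts as text
text1 = html.xpath('//div/text()')                 # text of every div
text2 = html.xpath('//div[@id]/text()')            # text of divs that have an id
text3 = html.xpath('//div[@class="foot"]/text()')  # text of divs whose class is "foot"
text4 = html.xpath('//div[@*]/text()')             # text of divs that have any attribute
text5 = html.xpath('//div[1]/text()')              # text of first-child divs
text6 = html.xpath('//div[last()-1]/text()')       # text of second-to-last divs
text7 = html.xpath('//div[position()<3]/text()')   # text of the first two divs under each parent
text8 = html.xpath('//div/text()|//h1/text()')     # text of every div and every h1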
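
The difference between `text()` and `string()` is easiest to see on an element with mixed content. The sketch below uses a fragment modeled on the `<p class="myclassname">` element of the sample page, with one assumption: a `<b>` tag has been added around "Adam" so that the element has nested text.

```python
from lxml import etree

# Variation of the sample paragraph; the <b> element is added for illustration.
fragment = '<body><p class="myclassname">Hello,\nworld!<br/>-- by <b>Adam</b></p></body>'
doc = etree.fromstring(fragment)

# text() returns the text nodes that are direct children of <p>, as a list.
pieces = doc.xpath('//p[@class="myclassname"]/text()')

# string() returns one string: all descendant text concatenated in order.
joined = doc.xpath('string(//p[@class="myclassname"])')

print(pieces)  # ['Hello,\nworld!', '-- by ']
print(joined == 'Hello,\nworld!-- by Adam')  # True
```

Here `text()` misses "Adam" because that text belongs to `<b>`, while `string()` includes it.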


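# III. Extracting attribute values with @
value1 = html.xpath('//a/@href')          # href of every a
value2 = html.xpath('//img/@src')         # src of every img
value3 = html.xpath('//div[2]/span/@id')  # id of each span directly under a div that is the second div child of its parent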
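
An `@attr` step returns only the attributes that exist, so the result list can be shorter than the list of elements. When one value per element is needed, `Element.get()` is an alternative. The following is a minimal sketch on a reduced, hypothetical version of the sample page.

```python
from lxml import etree

# Reduced stand-in for the sample page (contents are illustrative).
page = '''<body>
  <div>plain</div>
  <div id="photos"><img src="pic1.jpeg"/><img src="pic2.jpeg"/></div>
  <div class="foot">footer</div>
</body>'''
doc = etree.fromstring(page)

# XPath @id yields one string per id attribute that actually exists.
ids_xpath = doc.xpath('//div/@id')

# Element.get() runs once per element and returns None when the attribute is absent.
ids_get = [div.get('id') for div in doc.xpath('//div')]

srcs = doc.xpath('//img/@src')

print(ids_xpath)  # ['photos']
print(ids_get)    # [None, 'photos', None]
print(srcs)       # ['pic1.jpeg', 'pic2.jpeg']
```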


# IV. Locating nodes (advanced)
# 1. The find and findall methods of an Element
divs = html.xpath('//div[position()<3]')
for div in divs:
    links = div.findall('a')  # finds only direct children (div -> a), not div -> p -> a
    for a in links:
        print(a.text, a.attrib.get('href'))  # Element attributes used here: text, attrib

# 2. Selects the same <a> elements as 1, but returns only their href values
a_href = html.xpath('//div[position()<3]/a/@href')
print(a_href)

# 3. Note the difference from 1 and 2: // matches <a> at any depth, so div -> p -> a is included
a_href = html.xpath('//div[position()<3]//a/@href')
print(a_href)
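
The same direct-child versus any-depth distinction also applies to `findall`: the path `'a'` matches direct children only, while `'.//a'` descends to any depth. The sketch below illustrates this on a reduced, hypothetical copy of the photos div.

```python
from lxml import etree

# Reduced stand-in for the photos div of the sample page.
page = '''<div id="photos">
  <p><a href="http://www.example.com/more_pic.html">more</a></p>
  <a href="http://www.baidu.com">baidu</a>
</div>'''
div = etree.fromstring(page)

# 'a' matches only <a> elements that are direct children of div.
direct = [a.get('href') for a in div.findall('a')]

# './/a' matches <a> at any depth below div, like // in XPath.
nested = [a.get('href') for a in div.findall('.//a')]

# A relative XPath evaluated on the element selects the same nodes as './/a'.
via_xpath = div.xpath('.//a/@href')

print(direct)  # ['http://www.baidu.com']
print(nested)  # ['http://www.example.com/more_pic.html', 'http://www.baidu.com']
print(via_xpath == nested)  # True
```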
