Four different selectors for parsing scraped data in Python: XPath, Beautiful Soup, pyquery, and re


This post consolidates the data-parsing options available after scraping a page, for easy reference and to keep them from getting mixed up.

The techniques covered are XPath, BeautifulSoup, PyQuery, and re (regular expressions).

First, here are the sample HTML blocks that the examples below are based on.

 

Before parsing, the HTML string must first be converted into each library's document object. The respective setups are:

Xpath:

In [7]: from lxml import etree
In [8]: text = etree.HTML(html)

BeautifulSoup:

In [2]: from bs4 import BeautifulSoup
In [3]: soup = BeautifulSoup(html, 'lxml')

PyQuery:

In [10]: from pyquery import PyQuery as pq
In [11]: doc = pq(html)

re: no document object is needed; regular expressions match against the raw HTML string directly.

Example 1

html = '''
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
</body>
</html>
'''

Next, let's analyze the sample HTML with each of the parsing methods.

Match the title text:

Xpath:

In [16]: text.xpath('//title/text()')[0]
Out[16]: "The Dormouse's story"

BeautifulSoup:

In [18]: soup.title.string
Out[18]: "The Dormouse's story"

PyQuery:

In [20]: doc('title').text()
Out[20]: "The Dormouse's story"

re:

In [11]: re.findall(r'<title>(.*?)</title></head>', html)[0]
Out[11]: "The Dormouse's story"

Match the href attribute of the third a tag:

Xpath: # recommended

In [36]: text.xpath('//a[@id="link3"]/@href')[0]
Out[36]: 'http://example.com/tillie'

BeautifulSoup:

In [27]: soup.find_all(attrs={'id':'link3'})
Out[27]: [<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]


In [33]: soup.find_all(attrs={'id':'link3'})[0].attrs['href']
Out[33]: 'http://example.com/tillie'

PyQuery: # recommended

In [45]: doc("#link3").attr.href
Out[45]: 'http://example.com/tillie'

 re:

In [46]: re.findall(r'<a href="(.*?)" class="sister" id="link3">Tillie</a>;', html)[0]
Out[46]: 'http://example.com/tillie'

Extract the full text content inside the story p tags:

Xpath:

In [48]: text.xpath('string(//p[@class="story"])').strip()
Out[48]: 'Once upon a time there were three little sisters; and their names were\nElsie,\nLacie and\nTillie;\nand they lived at the bottom of a well.'

In [51]: ' '.join(text.xpath('string(//p[@class="story"])').split('\n'))
Out[51]: 'Once upon a time there were three little sisters; and their names were Elsie, Lacie and Tillie; and they lived at the bottom of a well.'
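Note the difference between `//text()` and `string()` in XPath: the former returns every text node as a separate list item, the latter flattens the whole subtree into one string. A small sketch using the story paragraph from Example 1:

```python
from lxml import etree

html = '''
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
'''
text = etree.HTML(html)

# //text() yields every text node separately (link texts included)
parts = text.xpath('//p[@class="story"]//text()')
# string() flattens the subtree into a single string
whole = text.xpath('string(//p[@class="story"])')

print(len(parts))               # 7 separate text nodes
print(' '.join(whole.split()))  # one whitespace-normalized sentence
```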

BeautifulSoup:

In [89]: ' '.join(list(soup.body.stripped_strings)).replace('\n', '')
Out[89]: "The Dormouse's story Once upon a time there were three little sisters; and their names were Elsie,Lacie and Tillie; and they lived at the bottom of a well. ..."

PyQuery:

In [99]: doc('.story').text()
Out[99]: 'Once upon a time there were three little sisters; and their names were Elsie, Lacie and Tillie; and they lived at the bottom of a well. ...'

re: not recommended here; far too cumbersome:

In [101]: re.findall(r'<p class="story">(.*?)<a href="http://example.com/elsie" class="sister" id="link1">(.*?)</a>(.*?)<a href="http://example.com/lacie" class="sister" id="link2">(.*?)</a>(.*?)<a href="http://example.com/tillie" class="sister" id="link3">(.*?)</a>;(.*?)</p>', html, re.S)[0]
Out[101]:
('Once upon a time there were three little sisters; and their names were\n',
 'Elsie',
 ',\n',
 'Lacie',
 ' and\n',
 'Tillie',
 '\nand they lived at the bottom of a well.')
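If you must use regex here, a somewhat less brittle sketch (still inferior to a real parser) is to capture the paragraph's inner HTML in one group and then strip the tags with `re.sub`:

```python
import re

html = '''
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
'''

# grab the inner HTML of the story paragraph, then drop all remaining tags
inner = re.search(r'<p class="story">(.*?)</p>', html, re.S).group(1)
plain = re.sub(r'<[^>]+>', '', inner)  # strip tags
plain = ' '.join(plain.split())        # normalize whitespace
print(plain)
```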


Example 2

html = '''
<div>
<ul>
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
</div>
'''

Match 'second item':

Xpath:

In [14]: text.xpath('//li[2]/a/text()')[0]
Out[14]: 'second item'

BeautifulSoup:

In [23]: soup.find_all(attrs={'class': 'item-1'})[0].string
Out[23]: 'second item'

PyQuery:

In [34]: doc('.item-1>a')[0].text
Out[34]: 'second item'

 

re:

In [35]: re.findall(r'<li class="item-1"><a href="link2.html">(.*?)</a></li>', html)[0]
Out[35]: 'second item'

Match the href attribute of the fifth li tag:

Xpath:

In [36]: text.xpath('//li[@class="item-0"]/a/@href')[0]
Out[36]: 'link5.html'

(The exact match @class="item-0" selects the first and fifth li; only the fifth contains an a tag, so index 0 already points at the fifth item's link.)

BeautifulSoup:

In [52]:  soup.find_all(attrs={'class': 'item-0'})
Out[52]:
[<li class="item-0">first item</li>,
 <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>,
 <li class="item-0"><a href="link5.html">fifth item</a></li>]

In [53]: soup.find_all(attrs={'class': 'item-0'})[-1].a.attrs['href']
Out[53]: 'link5.html'

PyQuery:

In [75]: [i.attr.href for i in doc('.item-0 a').items()][1]
Out[75]: 'link5.html'

re:

In [95]: re.findall(r'<li class="item-0"><a href="(.*?)">fifth item</a></li>', html)[0]
Out[95]: 'link5.html'


Example 3

html = '''
<li><span class="label">房屋用途</span>普通住宅</li>
'''

 

Extract 房屋用途 and 普通住宅 separately:

Xpath:

In [47]: text.xpath('//li/span/text()')[0]
Out[47]: '房屋用途'

In [49]: text.xpath('//li/text()')[0]
Out[49]: '普通住宅'

 

BeautifulSoup:

In [65]: soup.span.string
Out[65]: '房屋用途'

In [69]: soup.li.contents[1]  # contents returns the direct child nodes
Out[69]: '普通住宅'
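An alternative to indexing `contents` is to navigate from the span to the text node that follows it with `next_sibling`. A sketch using the Example 3 snippet:

```python
from bs4 import BeautifulSoup

html = '<li><span class="label">房屋用途</span>普通住宅</li>'
soup = BeautifulSoup(html, 'lxml')

print(soup.span.string)        # 房屋用途
print(soup.span.next_sibling)  # 普通住宅 — the text node right after the span
```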


PyQuery:

In [70]: doc('li span').text()
Out[70]: '房屋用途'

In [75]: doc('li .label')[0].tail
Out[75]: '普通住宅'

 

re: omitted.


Example 4

html = '''
<div class="unitPrice">
    <span class="unitPriceValue">26667<i>元/平米</i></span>
</div>
'''

 

Extract 26667 and 元/平米 separately:

Xpath:

In [81]: text.xpath('//div[@class="unitPrice"]/span/text()')[0]
Out[81]: '26667'

In [82]: text.xpath('//div[@class="unitPrice"]/span/i/text()')[0]
Out[82]: '元/平米'

 

BeautifulSoup:

In [97]: [i for i in soup.find('div', class_="unitPrice").strings]
Out[97]: ['\n', '26667', '元/平米', '\n']

In [98]: [i for i in soup.find('div', class_="unitPrice").strings][1]
Out[98]: '26667'

In [99]: [i for i in soup.find('div', class_="unitPrice").strings][2]
Out[99]: '元/平米'


PyQuery:

In [109]: doc('.unitPrice .unitPriceValue')[0].text
Out[109]: '26667'

In [110]: doc('.unitPrice .unitPriceValue i')[0].text
Out[110]: '元/平米'
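The re option is skipped above; for completeness, a regex sketch tied to this exact markup, where one pattern captures both values in a single pass:

```python
import re

html = '''
<div class="unitPrice">
    <span class="unitPriceValue">26667<i>元/平米</i></span>
</div>
'''

# one pattern, two groups: the number and the unit
price, unit = re.findall(
    r'<span class="unitPriceValue">(\d+)<i>(.*?)</i></span>', html)[0]
print(price)  # 26667
print(unit)   # 元/平米
```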

 

