Python Web Scraping for Beginners (23): Getting Started with the pyquery Parsing Library

from pyquery import PyQuery

html = '''
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
'''

d = PyQuery(html)
print(d('p'))

The result is as follows:

<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>

The example above initializes PyQuery directly from a string. PyQuery can also be initialized by passing in a URL:

d_url = PyQuery(url='https://www.geekdigging.com/', encoding='UTF-8')
print(d_url('title'))

The result is as follows:

<title>極客挖掘機</title>

Written this way, PyQuery first requests the URL and then initializes itself with the HTML from the response. It is essentially equivalent to the following:

import requests

# Fetch the page ourselves, then hand the decoded HTML text to PyQuery
r = requests.get('https://www.geekdigging.com/')
r.encoding = 'UTF-8'
d_requests = PyQuery(r.text)
print(d_requests('title'))

CSS Selectors

Let's start with a quick taste of CSS selectors, which are simple and convenient to use:

d_css = PyQuery(html)
print(d_css('.story .sister'))
print(type(d_css('.story .sister')))

The result is as follows:

<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.
<class 'pyquery.pyquery.PyQuery'>

This selector first finds the nodes whose class is story, and then searches among their descendants (not only direct children) for nodes whose class is sister.

The last line of the output shows that the result is still of type pyquery.pyquery.PyQuery, which means we can keep querying it.

Finding Nodes

Next, let's look at the commonly used lookup functions. The best thing about them is that they closely mirror their jQuery counterparts.

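find(): searches all descendants of the node.
children(): searches direct children only.
parent(): returns the parent node.
parents(): returns the ancestor nodes.
siblings(): returns the sibling nodes.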

Here are a few simple examples:

# Find child nodes (find() searches all descendants)
items = d('body')
print('Child nodes:', items.find('p'))
print(type(items.find('p')))

# Find the parent node
items = d('#link1')
print('Parent node:', items.parent())
print(type(items.parent()))

# Find sibling nodes
items = d('#link1')
print('Sibling nodes:', items.siblings())
print(type(items.siblings()))

The result is as follows:

Child nodes: <p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>

<class 'pyquery.pyquery.PyQuery'>
Parent node: <p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>


<class 'pyquery.pyquery.PyQuery'>
Sibling nodes: <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.
<class 'pyquery.pyquery.PyQuery'>

Iteration

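As the examples above show, when pyquery selects multiple nodes the result is still a single PyQuery object, rather than a plain list of nodes as in Beautiful Soup. To keep working with the individual nodes, we need to iterate over the result, which can be done with items():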

a = d('a')
for item in a.items():
    print(item)

The result is as follows:

<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,

<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and

<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.

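Calling items() returns a generator. Iterating over it yields the a nodes one at a time, and each of them is also a PyQuery object. Every a node can therefore use the methods described earlier for further selection, such as querying its children or looking up an ancestor, which makes this very flexible.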

Extracting Information

Once we have the nodes, the next step is to extract the information we need.

Extraction falls into two parts: getting a node's text and getting a node's attributes.

Getting Text

a_1 = d('#link1')
print(a_1.text())

The result is as follows:

Elsie

To get the HTML inside a node, use the html() method:

a_2 = d('.story')
print(a_2.html())

The result is as follows:

Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.

Getting Attributes

Once we have a node, we can use attr() to read its attributes:

attr_1 = d('#link1')
print(attr_1.attr('href'))

The result is as follows:

http://example.com/elsie

In addition to the attr() method, pyquery also provides an attr property, so the example above can also be written like this:

print(attr_1.attr.href)

The result is the same as in the previous example.

Summary

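That wraps up the parsers we installed in the preparation chapter. Comparing them overall: Beautiful Soup is friendly to beginners and can be picked up without much extra background, although parsing a complex DOM still requires some grounding in CSS selectors. If you are comfortable with XPath, using lxml directly is the most convenient option. And if, like me, you are familiar with both jQuery and CSS selectors, pyquery is a very good choice.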

Next, I plan to share a few simple hands-on examples, so stay tuned!

Sample Code

All the code for this series is kept in repositories on GitHub and Gitee so that it is easy to grab.

Sample code on GitHub

Sample code on Gitee
