Web Scraping: The pyquery Parsing Library

Reposted from the original article. 2019-07-26 17:37. Tag: web scraping

Initialization

Install: pip install pyquery

Initializing from a string

html = """
<html lang="en">
    <head>
        簡單好用的
        <title>PyQuery</title>
    </head>
    <body>
        <ul id="container">
            <li class="object-1">Python</li>
            <li class="object-2">大法</li>
            <li class="object-3">好</li>
        </ul>
    </body>
</html>
"""

from pyquery import PyQuery as pq

doc = pq(html)
print(doc("title"))

<title>PyQuery</title>

Initializing from a URL

# PyQuery first requests this URL, then initializes itself with the HTML it receives
doc = pq(url="https://www.cnblogs.com/songzhixue/")
print(doc("title"))

<title>村里唯一的架構師 - 博客園</title>


import requests

doc = pq(requests.get("https://www.cnblogs.com/songzhixue/").text)
print(doc("title"))

<title>村里唯一的架構師 - 博客園</title>

# Both approaches produce the same result

Initializing from a file

# Read a local HTML file and initialize PyQuery with its contents
doc = pq(filename="demo.html")   # demo.html is a local file
print(doc("title"))

CSS selectors

html = """
<html lang="en">
    <head>
        簡單好用的
        <title>PyQuery</title>
    </head>
    <body>
        <ul id="container">
            <li class="object-1">Python</li>
            <li class="object-2">大法</li>
            <li class="object-3">好</li>
        </ul>
    </body>
</html>
"""

# First select the node whose id is container, then the node inside it whose class is object-1
doc = pq(html)
print(doc("#container .object-1"))
print(type(doc("#container .object-1"))) # the result is still a PyQuery object

<li class="object-1">Python</li>
            
<class 'pyquery.pyquery.PyQuery'>

Finding nodes

html = """
<html lang="en">
    <head>
        簡單好用的
        <title>PyQuery</title>
    </head>
    <body>
        <ul id="container">
            <li class="object-1">
                Python
                <span>你好</span>
            </li>
            <li class="object-2">大法</li>
            <li class="object-3">好</li>
        </ul>
    </body>
</html>
"""

Child nodes

Getting all descendants

# Get all descendant li nodes
doc = pq(html)
a = doc("#container")
lis = a.find("li")   # find() searches all descendants of the node
print(lis)

<li class="object-1">
                Python
                <span>你好</span>
            </li>
            <li class="object-2">大法</li>
            <li class="object-3">好</li>

Getting all direct children

# Get all direct children
doc = pq(html)
a = doc("#container")
li = a.children()
print(li)

Filtering children with a CSS selector

# Keep only the children whose class is object-1
doc = pq(html)
a = doc("#container")
li = a.children(".object-1")
print(li)

<li class="object-1">
                Python
                <span>你好</span>
            </li>

Parent nodes

Direct parent

# parent() returns the node's direct parent
doc = pq(html)
a = doc(".object-1")
li = a.parent()
print(li)

<ul id="container">
            <li class="object-1">
                Python
                <span>你好</span>
            </li>
            <li class="object-2">大法</li>
            <li class="object-3">好</li>
        </ul>

Ancestor nodes

# parents() returns every ancestor of the node
doc = pq(html)
a = doc(".object-1")
li = a.parents()
print(li)
# The result holds all ancestors: the direct parent (ul) plus body and html

Filtering ancestors with a CSS selector

doc = pq(html)
a = doc(".object-1")
li = a.parents("#container")
print(li)

<ul id="container">
            <li class="object-1">
                Python
                <span>你好</span>
            </li>
            <li class="object-2">大法</li>
            <li class="object-3">好</li>
        </ul>

Sibling nodes

Getting all siblings

# Get all sibling nodes
doc = pq(html)
a = doc(".object-1")
li = a.siblings()
print(li)

<li class="object-2">大法</li>
            <li class="object-3">好</li>

Filtering siblings with a CSS selector

# Keep only the siblings whose class is object-3
doc = pq(html)
a = doc(".object-1")
li = a.siblings(".object-3")
print(li)

<li class="object-3">好</li>

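Iteration

- The selections above may contain one node or several; either way the result is a PyQuery object

A single node can be converted with str() and printed directly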

doc = pq(html)
a = doc(".object-1")
li = a.siblings(".object-3")
print(str(li))
print(type(str(li)))

<li class="object-3">好</li>
        
<class 'str'>

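A multi-node result has to be iterated

# A multi-node result has to be iterated
# Call items() to iterate over the nodes
doc = pq(html)
a = doc("li").items()    # items() returns a generator
print(a)

for i in a:    # each node yielded by the generator is also a PyQuery object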
    print(i)


<generator object PyQuery.items at 0x00000254B449CCA8>
<li class="object-1">
                Python
                <span>你好</span>
            </li>
            
<li class="object-2">大法</li>
            
<li class="object-3">好</li>

Extracting information

html = """
<html lang="en">
    <head>
        簡單好用的
        <title>PyQuery</title>
    </head>
    <body>
        <ul id="container">
            <li class="object-1">
                Python
                <a href="www.taobao.com">world</a>
                <a href="www.baidu.com">hello</a>
                
            </li>
            <li class="object-2">
                大法
                <a href="www.taobao.com">world</a>
            </li>
            <li class="object-3">好</li>
        </ul>
    </body>
</html>
"""

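Getting attributes

# Once a node is selected, call attr() to read one of its attributes
doc = pq(html)
a = doc(".object-1")
# print(a.find("a").attr("href"))
# When the result holds several nodes, attr() only returns the attribute of the first one

# To get the attribute from every a node, iterate
for i in a.find("a").items():
    print(i.attr("href"))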


www.taobao.com
www.baidu.com

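Getting text

- Call text() to get the text content
- When the result holds several nodes
    - text()  returns the text of all matched nodes, joined by spaces into one string
    - html()  returns the inner HTML of the first matched node only; iterate to get the HTML of every node

Getting plain text

# Get plain text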
doc = pq(html)
li = doc("li")
li = li.text()
print(li)

Python world hello 大法 world 好

Getting the HTML inside a node

# html() keeps the tags, but only returns the inner HTML of the first matched node
doc = pq(html)
li = doc("li")
print(li.html())

Python
<a href="www.taobao.com">world</a>
<a href="www.baidu.com">hello</a>

Getting the HTML inside every node

# Iterate to get the inner HTML of every node
doc = pq(html)
li = doc("li")
for i in li.items():
    print(i.html())

Python
<a href="www.taobao.com">world</a>
<a href="www.baidu.com">hello</a>
大法
<a href="www.taobao.com">world</a>         
好

Node manipulation

html = """
<html lang="en">
    <head>
        簡單好用的
        <title>PyQuery</title>
    </head>
    <body>
        <ul id="container">
            <li class="object-1">
                Python
                <a href="www.taobao.com">world</a>
                <a href="www.baidu.com">hello</a>
                
            </li>
            <li class="object-2">
                大法
                <a href="www.taobao.com">world</a>
            </li>
            <li class="object-3">好</li>
        </ul>
    </body>
</html>
"""

Removing a class

doc = pq(html)
a = doc(".object-2")
print(a)
a.removeClass("object-2")   # remove the class object-2
print(a)

<li class="object-2">
                大法
                <a href="www.taobao.com">world</a>
            </li>
            
<li class="">
                大法
                <a href="www.taobao.com">world</a>
            </li>

Adding a class

doc = pq(html)
a = doc(".object-2")
print(a)
a.removeClass("object-2")   # remove the class object-2
print(a)
a.addClass("item")     # add the class item to this tag
print(a)

<li class="object-2">
                大法
                <a href="www.taobao.com">world</a>
            </li>
            
<li class="">
                大法
                <a href="www.taobao.com">world</a>
            </li>
            
<li class="item">
                大法
                <a href="www.taobao.com">world</a>
            </li>

attr

# Attribute operations: one argument reads an attribute, two arguments set it
# Set an attribute
doc = pq(html)
a = doc(".object-1")
a.attr("name","henry")  # add a name attribute with the value henry to the li tag
print(a)

<li class="object-1" name="henry">
                Python
                <a href="www.taobao.com">world</a>
                <a href="www.baidu.com">hello</a>
                
            </li>

text

# Text operations: with an argument, set the text content; without one, read all text content
# Set the text content
doc = pq(html)
a = doc(".object-1")
a.text("hello world")
print(a)

<li class="object-1">hello world</li>

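html

# HTML operations: with an argument, set the inner HTML; without one, read the inner HTML of the first node (iterate to get all of them)
# Set the inner HTML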
doc = pq(html)
a = doc(".object-1")
a.html("<span>span標簽</span>")
print(a)

<li class="object-1"><span>span標簽</span></li>

Pseudo-class selectors

html = """
    <div class="wrap">
        <div id="container">
            <ul class="list">
                <li class="item-0">first item</li>
                <li class="item-1"><a href="link1.html">second</a></li>
                <li class="item-0 active"><a href="link2.html"><span class="bold">third item</span></a></li>
                <li class="item-1 active"><a href="link3.html">fourth item</a></li>
                <li class="item-0"><a href="link4.html">fifth item</a></li>
            </ul>
        </div>
    </div>
"""

Selecting the first node

# Select the first node
doc = pq(html)
a = doc("li:first-child")
print(a)

<li class="item-0">first item</li>

Selecting the last node

# Select the last node
doc = pq(html)
a = doc("li:last-child")
print(a)

<li class="item-0"><a href="link4.html">fifth item</a></li>

Selecting a node by position

# Select the second li node
doc = pq(html)
a = doc("li:nth-child(2)")
print(a)

<li class="item-1"><a href="link1.html">second</a></li>

Selecting the nodes after a given position

# :gt(2) uses a zero-based index, so it selects the nodes after the third one
doc = pq(html)
a = doc("li:gt(2)")
print(a)

<li class="item-1 active"><a href="link3.html">fourth item</a></li>
                <li class="item-0"><a href="link4.html">fifth item</a></li>

Selecting even-positioned nodes

# Select the nodes at even positions
doc = pq(html)
a = doc("li:nth-child(2n)")
print(a)

<li class="item-1"><a href="link1.html">second</a></li>
                <li class="item-1 active"><a href="link3.html">fourth item</a></li>

Selecting nodes that contain given text

# Select the nodes that contain the text second
doc = pq(html)
a = doc("li:contains(second)")
print(a)

<li class="item-1"><a href="link1.html">second</a></li>
