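The BeautifulSoup module (for parsing HTML and XML documents): finding elements (traversing the whole document), filtered searches (by tag name, attributes, and so on), and removing parts of the document tree (tags, comments)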

This article is a repost. 2019-04-13 21:10

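The BeautifulSoup module (for parsing HTML and XML documents)

Install: pip3 install beautifulsoup4 (pip3 install bs4 also works; bs4 is a wrapper package that pulls in beautifulsoup4)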

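Install a parser:

# lxml: depending on your operating system, install it with one of the following (recommended; it parses fast)
    $ apt-get install python3-lxml
    $ easy_install lxml    # legacy; easy_install is deprecated, prefer pip
    $ pip3 install lxml

# html5lib: parses pages the same way a browser does; install it with one of the following (the most robust with broken markup)
    $ apt-get install python3-html5lib
    $ easy_install html5lib    # legacy; easy_install is deprecated, prefer pip
    $ pip3 install html5lib
# lxml's HTML parser: BeautifulSoup(markup, "lxml")
# lxml's XML parser: BeautifulSoup(markup, "xml") or BeautifulSoup(markup, "lxml-xml")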

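Basic usage

import requests
from bs4 import BeautifulSoup

response = requests.get("https://www.baidu.com")
soup = BeautifulSoup(response.text, "lxml")  # parse the page and build the soup object; the first argument is the HTML document, the second is the parser
res = soup.prettify()  # returns the document as a string with consistent indentation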
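
To try the same two calls without a network request, here is a minimal sketch. The sample markup is made up for illustration, and it assumes the built-in "html.parser" instead of lxml so that nothing beyond bs4 needs to be installed.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, used only for illustration.
html = "<html><head><title>Demo</title></head><body><p class='intro'>Hello</p></body></html>"

# "html.parser" ships with Python, so no extra parser install is needed.
soup = BeautifulSoup(html, "html.parser")

title_text = soup.title.text   # text inside the <title> tag
pretty = soup.prettify()       # the whole document as an indented string

print(title_text)
print(pretty)
```

Swapping "html.parser" for "lxml" should behave the same on well-formed markup like this; the parsers mainly differ in how they repair broken HTML.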

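Finding elements (traversing the whole document)

Getting a tag's name, attributes, and text

# A tag consists of a tag name, attributes, and text
with open("xx.html", encoding="utf-8") as f:  # pass the file object (or its contents), not the file name
    soup = BeautifulSoup(f, "lxml")
tag = soup.body  # the body tag together with everything inside it
tag.name  # the tag's name
tag.attrs  # the tag's attributes, as a dict
tag.text  # all the text inside the tag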

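Finding elements with dot syntax

# Dot syntax returns only the first matching tag
tag.a  # the first a tag
tag.a.attrs.get("href")  # the value of that a tag's href attribute
tag.a.text  # the text of that a tag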
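
The following sketch shows what these attributes look like on a concrete tag. The markup and variable names are invented for illustration, and the built-in "html.parser" is assumed.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with two links; dot syntax only ever sees the first one.
html = (
    '<body>'
    '<a id="link1" class="sister big" href="http://example.com/elsie">Elsie</a>'
    '<a id="link2" href="http://example.com/lacie">Lacie</a>'
    '</body>'
)
soup = BeautifulSoup(html, "html.parser")

tag = soup.body.a                # first <a> inside <body>
tag_name = tag.name              # the tag name as a string
tag_attrs = tag.attrs            # dict of attributes; class is stored as a list
tag_text = tag.text              # the text inside the tag
href = tag.attrs.get("href")     # attribute lookup that tolerates missing keys
missing = tag.attrs.get("title") # None, because the tag has no title attribute

print(tag_name, tag_attrs, tag_text, href, missing)
```

Using .get() rather than tag["title"] avoids a KeyError when an attribute is absent.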

Nested lookups

tag.a.p.text  # the text of the first p tag inside the first a tag

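Getting the direct children of a tag (this does not reach the children's own children)

tag.a.contents  # all direct child nodes of the a tag, returned as a list
tag.a.children  # the same child nodes, returned as an iterator
for i in tag.a.children:
    print(i.name)  # the name of each child; text nodes count as children too and print None

tag.p.descendants  # all descendant nodes: everything under p is yielded, including children of children
# The text inside each child tag, whitespace included, is yielded as a separate node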
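
The difference between children and descendants is easier to see on a small tree. This is a sketch on made-up markup, assuming the built-in "html.parser".

```python
from bs4 import BeautifulSoup

# Hypothetical markup: <div> has two direct children, one of which nests a <b>.
html = "<div><p>one <b>bold</b></p><span>two</span></div>"
soup = BeautifulSoup(html, "html.parser")
div = soup.div

# Direct children only: the <p> and the <span>.
child_names = [child.name for child in div.children]

# Every node below <div>, in document order. Text nodes have name None.
descendant_names = [node.name for node in div.descendants]

print(child_names)
print(descendant_names)
```

The None entries in the second list are the text nodes, which is why code that loops over descendants usually checks the node type before reading tag attributes.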

Getting parent tags

tag.p.parent  # the tag one level above p
tag.p.parents  # every ancestor of p (parent, grandparent, and so on), returned as an iterator
# e.g. list(tag.p.parents)
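
As a small illustration of walking upward, the sketch below lists the ancestors of a p tag. The markup is invented and the built-in "html.parser" is assumed.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with a <p> nested three levels deep.
html = "<html><body><div><p>text</p></div></body></html>"
soup = BeautifulSoup(html, "html.parser")
p = soup.p

parent_name = p.parent.name                            # the immediate parent
ancestor_names = [ancestor.name for ancestor in p.parents]  # nearest first

print(parent_name)
print(ancestor_names)
```

The last ancestor is the BeautifulSoup object itself, whose name is the special string "[document]".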

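Getting sibling nodes (text is treated as a node too)

tag.a.next_sibling  # the node right after the a tag (this may be a text node)
tag.a.next_sibling.next_sibling  # the sibling after that one
tag.a.next_siblings  # all siblings after the a tag, as an iterator

tag.a.previous_sibling  # the node right before the a tag
tag.a.previous_siblings  # all siblings before the a tag, as an iterator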
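
Because text counts as a node, next_sibling often returns a string rather than a tag. The sketch below shows this on made-up markup (built-in "html.parser" assumed) and uses find_next_sibling() as one way to skip over text.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: three links separated by plain text.
html = '<p><a id="1">A</a>, <a id="2">B</a> and <a id="3">C</a></p>'
soup = BeautifulSoup(html, "html.parser")
first = soup.a

next_node = first.next_sibling            # the text node ", ", not a tag
next_tag = first.find_next_sibling("a")   # the next <a>, skipping text nodes
following = list(first.next_siblings)     # text and tags interleaved
before = first.previous_sibling           # None: nothing precedes the first <a>

print(repr(next_node), next_tag, len(following), before)
```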

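Removing parts of the document tree

Removing specific tags: PageElement.extract() removes the current tag from the document tree and returns it as the result of the call:

from bs4 import BeautifulSoup
# Remove every ul tag (soup("ul") is shorthand for soup.find_all("ul"))
for s in soup("ul"):
    s.extract()
# Remove every svg tag
for s in soup("svg"):
    s.extract()
# Remove every script tag
for s in soup("script"):
    s.extract()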
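
A self-contained sketch of the same idea follows. The markup is invented, the built-in "html.parser" is assumed, and passing a list of tag names in one call is just a compact variation on the separate loops.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: we want to keep only the <p>.
html = "<div><ul><li>x</li></ul><p>keep</p><script>var a = 1;</script></div>"
soup = BeautifulSoup(html, "html.parser")

# soup([...]) is shorthand for soup.find_all([...]); extract() detaches each
# tag from the tree and returns it, so the removed tags stay usable.
removed = [tag.extract() for tag in soup(["ul", "script"])]

remaining = str(soup)
removed_names = [tag.name for tag in removed]

print(remaining)
print(removed_names)
```

Since extract() returns the detached tag, it is the method to reach for when the removed fragment is still needed afterwards.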

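Removing comments: a Comment is a kind of text node, so extract() removes it from the document tree in the same way and returns it:

from bs4 import BeautifulSoup, Comment

# Remove all comments (string= is the current name of the older text= argument)
comments = soup.find_all(string=lambda text: isinstance(text, Comment))
for comment in comments:
    comment.extract()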
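
Here is a runnable sketch of comment removal on made-up markup, assuming the built-in "html.parser" and a bs4 version that accepts the string= argument.

```python
from bs4 import BeautifulSoup, Comment

# Hypothetical markup containing one nested and one top-level comment.
html = "<p>visible<!-- hidden note --></p><!-- another -->"
soup = BeautifulSoup(html, "html.parser")

# Comment is a subclass of NavigableString, so a string filter can pick
# comments out from ordinary text nodes.
comments = soup.find_all(string=lambda s: isinstance(s, Comment))
comment_count = len(comments)

for comment in comments:
    comment.extract()

result = str(soup)
print(comment_count, result)
```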

Using decompose(): this removes the current node from the document tree and destroys it completely:

markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup, "lxml")
a_tag = soup.a

soup.i.decompose()

a_tag
# <a href="http://example.com/">I linked to </a>

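Filtered searches

Simple usage

import requests
from bs4 import BeautifulSoup

response = requests.get("https://www.baidu.com")
soup = BeautifulSoup(response.text, "lxml")  # parse the page and build the soup object; the first argument is the HTML document, the second is the parser
soup.find_all("a")  # find all a tags (each result includes everything nested inside it)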

Filtering by tag name

soup.find_all("a")  # find all a tags (each result includes everything nested inside it)
soup.find_all(["a", "p"])  # find all a tags and all p tags

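Filtering by tag attributes

soup.find_all("a", attrs={"id": "link1"})  # find all a tags whose id is link1
soup.find_all(name="a", attrs={"id": "link1"})  # the same search, with the tag name passed by keyword


# Filtering by class (class is a Python keyword, so the argument is spelled class_)
soup.find_all(name="a", class_="sister brother")  # with several class names, the string must match the class attribute exactly, order included
# Attribute names containing special characters (such as the hyphen in data-a) go in the attrs dict
soup.find_all(name="a", attrs={"data-a": "sister"})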
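
The exact-match rule for multiple class names is easy to trip over, so here is a sketch comparing the variants. The markup and ids are invented for illustration, and the built-in "html.parser" is assumed.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the same two classes in different orders, plus one
# link with a single class.
html = (
    '<a id="link1" class="sister brother" data-a="sister">one</a>'
    '<a id="link2" class="brother sister">two</a>'
    '<a id="link3" class="sister">three</a>'
)
soup = BeautifulSoup(html, "html.parser")

by_id = [t["id"] for t in soup.find_all("a", attrs={"id": "link1"})]

# A single class name matches any tag that carries that class.
one_class = [t["id"] for t in soup.find_all("a", class_="sister")]

# Two names in one string match only that exact attribute value.
exact = [t["id"] for t in soup.find_all("a", class_="sister brother")]

# data-a is not a valid Python keyword argument, so it goes through attrs.
by_data = [t["id"] for t in soup.find_all("a", attrs={"data-a": "sister"})]

print(by_id, one_class, exact, by_data)
```

When the order of class names is not guaranteed, a CSS selector such as soup.select("a.sister.brother") is one order-independent alternative.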

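Filtering by text (the text must match exactly)

soup.find_all(string="xxxx")  # returns the matching text nodes (string= was called text= in older versions)
soup.find_all(name="ssss", string="xxxx")  # returns ssss tags whose text is xxxx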
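
The two calls return different kinds of objects, which the following sketch makes visible. The markup is made up, the built-in "html.parser" is assumed, and string= assumes a reasonably recent bs4.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: "Elsie" appears alone twice and once inside longer text.
html = "<p>Elsie</p><a>Elsie</a><a>Elsie and Lacie</a>"
soup = BeautifulSoup(html, "html.parser")

# Without a tag name, the results are the text nodes themselves.
strings = soup.find_all(string="Elsie")

# With a tag name, the results are tags whose text is exactly "Elsie".
tags = soup.find_all("a", string="Elsie")
tag_markup = [str(t) for t in tags]

print(strings)
print(tag_markup)
```

The third link is skipped in both searches because the match is against the whole string, not a substring.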

Filtering with a regular expression

import re
c = re.compile("a")
print(soup.find_all(name=c))  # tags whose name contains the letter a
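
Since the pattern is applied as a search rather than a full match, an unanchored "a" also picks up tags such as table. The sketch below contrasts it with an anchored pattern, using invented markup and the built-in "html.parser".

```python
import re

from bs4 import BeautifulSoup

# Hypothetical markup with several tag names, some containing the letter a.
html = "<html><body><a>x</a><table></table><b>y</b></body></html>"
soup = BeautifulSoup(html, "html.parser")

# Unanchored: any tag name that contains "a".
contains_a = [t.name for t in soup.find_all(re.compile("a"))]

# Anchored: only the tag named exactly "a".
exactly_a = [t.name for t in soup.find_all(re.compile("^a$"))]

print(contains_a)
print(exactly_a)
```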

Filtering with True

soup.find_all(name=True)  # find every tag
soup.find_all(id=True)  # find every tag that has an id attribute
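
A short sketch of both True filters on made-up markup, assuming the built-in "html.parser":

```python
from bs4 import BeautifulSoup

# Hypothetical markup: three tags, two of which have an id.
html = '<div id="top"><p>a</p><span id="s">b</span></div>'
soup = BeautifulSoup(html, "html.parser")

# True as the name matches every tag but no text nodes.
all_names = [t.name for t in soup.find_all(True)]

# True as an attribute value matches tags that have the attribute at all.
ids = [t["id"] for t in soup.find_all(id=True)]

print(all_names)
print(ids)
```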

Filtering with a function

def myFilter(tag):  # must take exactly one parameter: the tag being tested
    return tag.name == "a" and tag.text != "Elsie" and tag.has_attr("id")
soup.find_all(myFilter, limit=1)  # the first a tag that has an id attribute and whose text is not Elsie
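
To see which tags such a function keeps and which it rejects, here is a sketch on invented markup (built-in "html.parser" assumed). The function is renamed my_filter only to follow Python naming conventions.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: only link2 satisfies all three conditions.
html = (
    '<a id="link1">Elsie</a>'   # rejected: text is "Elsie"
    '<a id="link2">Lacie</a>'   # accepted
    '<a>Tillie</a>'             # rejected: no id attribute
    '<p id="x">para</p>'        # rejected: not an <a> tag
)
soup = BeautifulSoup(html, "html.parser")


def my_filter(tag):
    """Return True for <a> tags that have an id and whose text is not 'Elsie'."""
    return tag.name == "a" and tag.text != "Elsie" and tag.has_attr("id")


matched_ids = [t["id"] for t in soup.find_all(my_filter)]
first_only = soup.find_all(my_filter, limit=1)  # stops after the first match

print(matched_ids, first_only)
```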

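Filtering with CSS selectors

soup.select("a")  # find all a tags
soup.select(".sister")  # find all tags with the class sister (class selector)
soup.select("#bb")  # find the tags whose id is bb (id selector)
soup.select("#b #c")  # find tags with id c anywhere inside the tag with id b (descendant selector)
soup.select("#b>.c")  # find tags with class c that are direct children of the tag with id b (child selector)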
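
The difference between the descendant and child selectors shows up as soon as an element is nested more than one level deep. This sketch uses made-up markup and the built-in "html.parser", and assumes a bs4 install that includes its CSS selector support.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one .c element directly under #b, one nested deeper.
html = (
    '<div id="b">'
    '<p class="c">direct</p>'
    '<section><p class="c" id="c">nested</p></section>'
    '</div>'
)
soup = BeautifulSoup(html, "html.parser")

descendants = [t.text for t in soup.select("#b .c")]    # any depth below #b
children = [t.text for t in soup.select("#b > .c")]     # direct children only
by_nested_id = [t.text for t in soup.select("#b #c")]   # id c somewhere inside #b
first_match = soup.select_one(".c").text                # first match, or None if absent

print(descendants, children, by_nested_id, first_match)
```

select() always returns a list, while select_one() returns a single tag, which is convenient when only the first match matters.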
