Basic Operations in BeautifulSoup4

This article is a repost (2019-11-09 16:36). Tag: BeautifulSoup

BeautifulSoup is a Python library for extracting data from HTML and XML files.

1. The prettify() method: formats the Beautiful Soup document tree and returns it as a Unicode string, with each XML/HTML tag on its own line.

from bs4 import BeautifulSoup
from urllib.request import urlopen

f = urlopen('https://tieba.baidu.com/index.html').read()
soup = BeautifulSoup(f,'html.parser')

s = soup.prettify()
print(s)

Output: the indented HTML of the fetched page (not reproduced here).
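
The live page's output changes over time, so here is a minimal offline sketch of the same call. The inline markup is a made-up example, not taken from the page above.

```python
from bs4 import BeautifulSoup

# Hypothetical inline markup, used here instead of a live page.
html = '<div><p class="intro">Hello <b>world</b></p></div>'
soup = BeautifulSoup(html, 'html.parser')

# prettify() returns a str in which every tag starts on its own line
# and nesting is shown by indentation.
pretty = soup.prettify()
print(pretty)
```

Working from a string literal keeps the example reproducible; the call itself is the same as with the downloaded page.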

2. Basic operations

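s = BeautifulSoup('<p class="123">喜歡捕捉美的瞬間</p>','html.parser')
print(s.p)   # a tag in the document
print(s.p.name)     # the tag's name
# If a tag has exactly one child and that child is a NavigableString, .string returns it.
# If a tag contains several strings, iterate over them with .strings.
# Those strings may include extra spaces or blank lines; .stripped_strings removes the surplus whitespace.
print(s.p.string)
print(s.p.attrs)     # tag attributes work like a dictionary; .attrs returns the attribute names and values
print(s.p['class'])    # look up an attribute's value by name
print(s.get_text())    # only the text contained in the document
# A tag's string cannot be edited in place, but it can be replaced with another string.
# replace_with() returns the string that was removed, so this prints the original text.
print(s.p.string.replace_with('哩哩啦啦'))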

Result:

<p class="123">喜歡捕捉美的瞬間</p>
p
喜歡捕捉美的瞬間
{'class': ['123']}
['123']
喜歡捕捉美的瞬間
喜歡捕捉美的瞬間
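
The comments in the code above mention .strings and .stripped_strings without showing them. A small sketch, using a made-up snippet with two paragraphs, might look like this:

```python
from bs4 import BeautifulSoup

# Hypothetical markup with several strings and some surrounding whitespace.
html = '<div>\n  <p>first</p>\n  <p>second</p>\n</div>'
soup = BeautifulSoup(html, 'html.parser')
div = soup.div

# div has more than one child, so .string cannot pick a single string.
print(div.string)

# .strings yields every string, including whitespace-only ones.
raw = list(div.strings)
print(raw)

# .stripped_strings skips whitespace-only strings and strips the rest.
clean = list(div.stripped_strings)
print(clean)
```

Here div.string is None because the tag has several children, while clean contains only 'first' and 'second'.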

3. Replacing a comment with CDATA

from bs4 import CData

markup = "<b><!--Hey, buddy. Want to buy a used parser?--></b>"
soup = BeautifulSoup(markup, 'html.parser')
comment = soup.b.string

cdata = CData("A CDATA block")
comment.replace_with(cdata)
print(soup.b.prettify())

Output:

<b>
 <![CDATA[A CDATA block]]>
</b>
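
To see what actually changes in the tree, the following sketch checks the type of the string inside the b tag before and after the replacement. It reuses the markup from this section; the variable names was_comment and is_cdata are only for illustration.

```python
from bs4 import BeautifulSoup, CData, Comment

markup = "<b><!--Hey, buddy. Want to buy a used parser?--></b>"
soup = BeautifulSoup(markup, 'html.parser')

# The parser stores an HTML comment as a Comment object.
comment = soup.b.string
was_comment = isinstance(comment, Comment)

# After replace_with(), the tag's only child is the CData object.
comment.replace_with(CData("A CDATA block"))
is_cdata = isinstance(soup.b.string, CData)

print(was_comment, is_cdata)
print(soup.b)
```

Both Comment and CData are special kinds of string, which is why one can stand in for the other here.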

4. The .contents attribute returns a tag's child nodes as a list.

f = urlopen('https://tieba.baidu.com/index.html').read()
soup = BeautifulSoup(f,'html.parser')

#s = soup.prettify()
#print(s)
tag = soup.head
print(tag.contents)
print('------------------------')
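print(len(tag))     # len() on a tag gives the number of child nodes, same as len(tag.contents)
print('------------------------')
l = tag.contents[3].contents[1].contents[4]   # child nodes can also be reached by index; these indices depend on the page's markup at fetch time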
print(l.name)

Result: depends on the live page's markup (not reproduced here).
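
Because the live page can change, the indexing is easier to follow on a fixed snippet. The sketch below uses a made-up head element and shows that whitespace between tags also counts as a child node, which is why the indices in the code above skip around.

```python
from bs4 import BeautifulSoup

# Hypothetical head element with newlines between the tags.
html = '<head>\n<title>Demo</title>\n<meta charset="utf-8"/>\n</head>'
soup = BeautifulSoup(html, 'html.parser')
tag = soup.head

children = tag.contents
print(children)

# Whitespace strings between tags are children too, so the count is 5, not 2.
print(len(tag))

# The title tag sits at index 1 because index 0 is a newline string.
title = children[1]
print(title.name)
print(title.contents[0])
```

Filtering out strings first (for example with isinstance checks) is one way to make index-based access less fragile.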

5. The .children generator iterates over a tag's direct children (.descendants iterates over all nested descendants):

a = soup.head.contents[3].contents[1]
for child in a.children:
    print(child)
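
The difference between .children and .descendants is easier to see on a small fixed tree. This sketch uses a made-up list; the variable names are only for illustration.

```python
from bs4 import BeautifulSoup, Tag

# Hypothetical markup: a list with two items, each wrapping a link.
html = '<ul><li><a href="/a">A</a></li><li><a href="/b">B</a></li></ul>'
soup = BeautifulSoup(html, 'html.parser')
ul = soup.ul

# .children stops at the first level: only the two li tags.
child_names = [child.name for child in ul.children]
print(child_names)

# .descendants walks the whole subtree: tags and the strings inside them.
all_nodes = list(ul.descendants)
descendant_tags = [node.name for node in all_nodes if isinstance(node, Tag)]
descendant_text = [str(node) for node in all_nodes if not isinstance(node, Tag)]
print(descendant_tags)
print(descendant_text)
```

Here .children yields 2 nodes while .descendants yields 6: four tags plus the two text strings.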
