A Simple Guide to bs4 BeautifulSoup in Python 3

Reposted article (see the original). 2021-09-13 20:52. Tags: python, python crawler

python3 bs4

Beautiful Soup

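Beautiful Soup is a Python library for extracting data from HTML or XML files. It works with the parser of your choice to provide idiomatic ways of navigating, searching, and modifying the document.
Official documentation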

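Parsers

If you do not name a parser when parsing a page, Beautiful Soup picks the best HTML parser that is installed: lxml first, then html5lib, then Python's built-in "html.parser". The same script can therefore behave differently on different machines, so it is safer to name the parser explicitly.
The official documentation recommends installing "lxml" (for speed) or "html5lib" (for leniency), because the built-in "html.parser" is weaker at repairing incomplete or badly nested tags.
What is a parser?
- The parser turns the markup into a tree of tags and strings that Beautiful Soup then exposes. Different parsers can build different trees from the same HTML, especially when the HTML is malformed.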

Parser	Typical usage	Advantages	Disadvantages
Python's html.parser	BeautifulSoup(markup, "html.parser")	Built into the Python standard library; decent speed; lenient	Poor tolerance of malformed markup before Python 2.7.3 / 3.2.2
lxml HTML parser	BeautifulSoup(markup, "lxml")	Very fast; lenient	Requires an external C library
lxml XML parser	BeautifulSoup(markup, "lxml-xml") or BeautifulSoup(markup, "xml")	Very fast; the only XML parser currently supported	Requires an external C library
html5lib	BeautifulSoup(markup, "html5lib")	Extremely lenient; parses pages the same way a browser does; creates valid HTML5	Very slow; requires an external Python package (pure Python, no C extension)
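
To see how parsers differ, the sketch below feeds one broken fragment to each parser from the table and prints what comes back. The fragment and the helper name parse_with are made up for illustration; lxml and html5lib are only tried if they happen to be installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Illustrative broken markup: an unclosed <a> followed by a stray </p>
BROKEN_MARKUP = "<a></p>"


def parse_with(parser_name: str) -> str:
    """Return the repaired markup, or a note if the parser is not installed."""
    try:
        return str(BeautifulSoup(BROKEN_MARKUP, parser_name))
    except FeatureNotFound:
        return "(parser not installed)"


results = {name: parse_with(name) for name in ("html.parser", "lxml", "html5lib")}
for name, repaired in results.items():
    print(f"{name:12} -> {repaired}")
```

html.parser simply drops the stray closing tag, while the other two parsers wrap the result in a full html/body skeleton, which is why naming the parser explicitly keeps results reproducible.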

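Installation and basic usage

Installing on Windows

# Install BeautifulSoup
pip install beautifulsoup4

# Install a parser
# Beautiful Soup supports the HTML parser in the Python standard library as well as several third-party parsers; one of them is lxml
# Install lxml
pip install lxml
# Another option is html5lib, a pure-Python parser that parses HTML the way a web browser does; install it with:
pip install html5lib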

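Using BeautifulSoup

Instantiation

import os
from bs4 import BeautifulSoup

html_file_path = os.path.join(os.getcwd(), '../html_dir', 'test_lxml.html')

# Read the whole file into one string (assuming the file is UTF-8 encoded)
with open(html_file_path, 'r', encoding='utf-8') as f:
	html_file = f.read()

# The markup is repaired on load: the resulting tree includes html and body nodes; the output is not auto-indented
# 'lxml': explicitly select the lxml parser
soup = BeautifulSoup(html_file, 'lxml')     # pass the HTML as a string

# A file object works too; the with block closes it afterwards
with open(html_file_path, 'r', encoding='utf-8') as f:
	soup = BeautifulSoup(f, 'lxml')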

Contents of the HTML document

<head><title>The Dormouse's story</title></head>

<p class="story">
	this is P label
	<a href="http://www.baidu.com" class="baidu" id="link1"><span>baidu</span></a><span>this is span</span>
	<a href="http://www.cnblogs.com" class="cnblogs" id="link2"><span>cnblogs</span></a>
</p>
<div>
	<ul class="ul1">
		<li class="item-0 li" name="item0"><a href="link1.html">first item</a></li>
		<li class="item-1"><a href="link2.html">second item</a></li>
		<li class="item-inactive"><a href="link3.html">third item</a></li>
		<li class="item-1"><a href="link4.html">fourth item</a></li>
		<li class="item-0"><a href="link5.html">fifth item</a></li>
		<li class="aaa li-aaa"><a href="link6.html">aaaaa item</a></li>
		<li class="li li-first" name="item6"><a href="link6.html"><span>six item</span></a></li>
		<li class="li li-first" name="item7"><a href="link7.html"><span>seven item</span></a></li>
	</ul>
	
	<ul class="ul2">
		<li class="item-10 li" name="item10"><a href="link10.html">10 item</a></li>
		<li class="item-11 li" name="item11"><a href="link11.html">11 item</a></li>
		<li class="item-12 li" name="item12"><a href="link12.html">12 item</a></li>
		<li class="item-13 li" name="item13"><a href="link13.html">13 item</a></li>
		<li class="item-14 li" name="item14"><a href="link14.html">14 item</a></li>
		<li class="item-15 li" name="item15"><a href="link15.html">15 item</a></li>
		<li class="item-16 li" name="item16"><a href="link16.html">16 item</a></li>
	</ul>
</div>

Tag

name
- Every tag has a name, available as tag.name
- tag.name = "tag_new_name": renames the tag. Afterwards the tag has to be reached under its new name, e.g. soup.tag_new_name.name

print(f'Tag name via .name: {soup.p.name}')
soup.p.name = 'p_tag'   # rename the p tag
print(f'The renamed tag is reached as soup.p_tag: {soup.p_tag.name}')
print(f'Tag name via .name: {soup.ul.name}')
soup.ul.name = 'new_ul'   # rename the first ul tag; soup.ul now resolves to the second <ul>
print(f'The renamed tag is reached as soup.new_ul: {soup.new_ul.name}')

attributes
- A tag can have any number of attributes, such as class, name, id, and so on
- tag.attrs: returns all attributes of the tag as a dictionary of {attribute: value} pairs
- Attributes are handled exactly like dictionary entries: you can add, delete, modify, and look them up

print(f"class attribute of the renamed tag soup.p_tag: {soup.p_tag['class']}")
soup.p_tag['id'] = 'p1'     # add an id attribute to the p tag
soup.p_tag['class'] = 'story'     # modify the class attribute of the p tag
print(f"All attributes of soup.p_tag: {soup.p_tag.attrs}")
print(f"id attribute of soup.p_tag: {soup.p_tag['id']}")

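Multi-valued attributes
- HTML4 defines a number of attributes that can hold multiple values. HTML5 removes a few of them but adds several more. The most common multi-valued attribute is class (a tag can have more than one CSS class). Others include rel, rev, accept-charset, headers, and accesskey. Beautiful Soup returns the value of a multi-valued attribute as a list.

print(f"All attributes of the li tag; class is multi-valued, so its value is a list of two strings: {soup.li.attrs}")

If an attribute looks like it has several values but is not defined as multi-valued in any version of the HTML standard, Beautiful Soup returns it as a single string.

id_soup = BeautifulSoup("<p id='my id1'></p>", 'lxml')
print(f"Not a multi-valued attribute in HTML, so both words come back as one string: {id_soup.p['id']}")

If the document is parsed as XML, no attribute is treated as multi-valued.

id_soup = BeautifulSoup("<p id='my id1'></p>", 'xml')
print(f"Parsed as XML, the value also comes back as one string: {id_soup.p['id']}")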
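
The following sketch puts the list-versus-string behaviour side by side. It uses a one-line fragment modelled on the test document and the built-in html.parser so that nothing beyond bs4 is required; the variable names are illustrative.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<li class="item-0 li" id="my id1">first item</li>', "html.parser")
li = soup.li

class_value = li["class"]                  # multi-valued attribute: a list of strings
id_value = li["id"]                        # not multi-valued: one plain string
id_as_list = li.get_attribute_list("id")   # always a list, whatever the attribute

# Assigning a list to a multi-valued attribute is joined with spaces on output
li["class"] = ["item-0", "active"]
rendered = str(li)

print(class_value, id_value, id_as_list, rendered, sep="\n")
```

get_attribute_list() is convenient when code should not care whether a particular attribute is multi-valued.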

NavigableString

print(f'NavigableString: {soup.a.string}, type: {type(soup.a.string)}')
soup.a.string.replace_with('No longer bold')    # a tag's string cannot be edited in place, but it can be replaced
print(f"NavigableString after replacement: {soup.a.string}, type: {type(soup.a.string)}")

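Child nodes

Navigating by tag name
- The simplest way to navigate the parse tree is to give the name of the tag you want. To get the <head> tag, just write soup.head
- The same trick can be chained to zoom in on a part of the tree

print(f'Navigating by tag name prints the head tag and everything inside it: {soup.head}')
print(f'Get title: {soup.head.title}')
print(f'Get the a tag inside the first li inside ul: {soup.ul.li.a}')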

find_all()
- Finds every tag that matches the given criteria

print(f"Number of a tags: {len(soup.find_all('a'))}, result: {soup.find_all('a')}")

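.contents
- An attribute (not a method) that holds the children of a tag as a list
- .contents only contains the direct children of the tag

print(f"The a tag inside the second li under ul (index 3, because whitespace strings count as children): {soup.ul.contents[3].a}")
print(f'.contents returns the children as a list: count: {len(soup.ul.contents)}, result: {soup.ul.contents}')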

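children
- Returns an iterator
- .children only contains the direct children of the tag

li_list = soup.ul
for item in li_list.children:
	if str(item).strip():    # skip whitespace-only strings between the li tags
		print(f'Text of the a tag inside each li under ul: {item.a.string}')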

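descendants
- Returns a generator
- .descendants iterates recursively over all descendants of a tag: its children, their children, and so on

li_list = soup.ul
print(f'.descendants is a generator: count: {len(list(li_list.descendants))}, object: {li_list.descendants}')
for item in li_list.descendants:
	if str(item).strip():    # skip whitespace-only strings
		print(f'Descendant of ul reached by recursive iteration: {item}')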
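
As a compact comparison of .contents, .children and .descendants, here is a sketch on a two-item list. The markup is a shortened, whitespace-free stand-in for the test document, and html.parser is used so that only bs4 is needed.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html = (
    '<ul><li><a href="link1.html">first item</a></li>'
    '<li><a href="link2.html">second item</a></li></ul>'
)
ul = BeautifulSoup(html, "html.parser").ul

# Direct children only: the two li tags
direct_children = [child.name for child in ul.children]
contents_count = len(ul.contents)

# Every descendant in document order: tags by name, strings by their text
descendant_labels = [
    node.name if isinstance(node, Tag) else str(node)
    for node in ul.descendants
]

print(direct_children)
print(descendant_labels)
```

The isinstance(node, Tag) check is one way to tell tags from strings without depending on how the document happens to be indented.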

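string
- If a tag has exactly one child and that child is a NavigableString, the child is available as .string
- If a tag's only child is another tag, .string on the outer tag gives the same result as .string on that only child
- If a tag contains more than one child, it is unclear which child .string should refer to, so .string is None

print(f"head has a single child, title: {soup.head.string}")
print(f"title has a single text child: {soup.head.title.string}")
print(f"ul has several children: {soup.ul.string}")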

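strings and stripped_strings
- Both return a generator
- If a tag contains several strings, loop over .strings to get them; the result includes whitespace-only strings such as newlines
- .stripped_strings removes the extra whitespace: strings consisting only of whitespace are skipped, and leading and trailing whitespace is stripped from the rest

print('Strings under ul via .strings')
for item in soup.ul.strings:
	if item.strip():    # skip whitespace-only strings
		print(item)

print('Strings under ul via .stripped_strings')
for item in soup.ul.stripped_strings:    # whitespace-only strings are already skipped
	print(item)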

Parent nodes

parent
- The .parent attribute gives the parent of an element

print(f'Parent of title is head: {soup.title.parent}')
print(f'Parent of the title text is title: {soup.title.string.parent}')
print(f'Parent of the top-level html tag is the whole document, a bs4.BeautifulSoup object: {type(soup.html.parent)}')
print(f'The parent of soup is None: {soup.parent}')

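parents
- Returns a generator
- The .parents attribute walks up through all ancestors of an element

print(f'.parents of title is a generator: {soup.title.parents}')
for item in soup.title.parents:    # every ancestor is a tag or the BeautifulSoup object, so no whitespace filter is needed
	print(item)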

Sibling nodes

next_sibling (the next sibling) and previous_sibling (the previous sibling)
- In real documents, a tag's .next_sibling and .previous_sibling are usually strings or whitespace
- You might expect the .next_sibling of the first <li> tag to be the second <li> tag. It is not: the actual result is the newline between the first <li> tag and the second <li> tag

print(f'Children of ul: {list(soup.ul.children)}')
# Note: the elements selected below (indices 0, 2, 4) are all newline strings, so their siblings print as tags
print(f'Next sibling of child 0: {list(soup.ul.children)[0].next_sibling}')
print(f'Next sibling of child 2: {list(soup.ul.children)[2].next_sibling}')
print(f'Previous sibling of child 4: {list(soup.ul.children)[4].previous_sibling}')
print(f'Previous sibling of child 2: {list(soup.ul.children)[2].previous_sibling}')

The .next_siblings and .previous_siblings attributes iterate over the siblings of the current node
- .next_siblings and .previous_siblings: both return a generator

print(f'Children of ul: {list(soup.ul.children)}')
print(f'next_siblings : {list(soup.ul.children)[0].next_siblings}')
print(f'previous_siblings : {list(soup.ul.children)[4].previous_siblings}')
print('Iterating over next_siblings: ')
# The output of this loop includes the newline strings
for item in list(soup.ul.children)[0].next_siblings:
	print(item)

print('Iterating over previous_siblings: ')
# The output of this loop includes the newline strings
for item in list(soup.ul.children)[4].previous_siblings:
	print(item)
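
When only the neighbouring tags matter, find_next_sibling() and find_next_siblings() skip the whitespace strings. The sketch below uses a made-up three-item list and html.parser.

```python
from bs4 import BeautifulSoup

html = "<ul>\n<li>one</li>\n<li>two</li>\n<li>three</li>\n</ul>"
soup = BeautifulSoup(html, "html.parser")
first_li = soup.li

raw_next = first_li.next_sibling              # the newline string after </li>
next_tag = first_li.find_next_sibling("li")   # the next li tag, whitespace skipped
following = [li.get_text() for li in first_li.find_next_siblings("li")]

print(repr(raw_next), next_tag, following, sep="\n")
```

find_previous_sibling() and find_previous_siblings() work the same way in the other direction.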

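Going back and forth

.next_element and .previous_element
- next_element:
  - Points to whatever was parsed immediately afterwards (a string or a tag). It can be the same as .next_sibling, but usually it is not
  - For a tag that has contents, next_element is the first thing inside the tag, not the tag that follows its closing tag
  - Example: <li class="item-10 li" name="item10"><a href="link10.html">10 item</a></li>
    - The parser enters the <li> tag, then the <a> tag, then the string 10 item, then closes the </a> tag, then closes the </li> tag
    - The next_element of the <li> tag is the object parsed right after it, the <a> tag
- previous_element: the exact opposite of next_element; it points to whatever was parsed immediately before the current object

print(f'Children of ul: {list(soup.ul.children)}')
print(f'Child at index 3 (the second li): {list(soup.ul.children)[3]}')
print(f'next_element of the second li, i.e. the a tag inside it: {list(soup.ul.children)[3].next_element}')
print(f'previous_element of that a tag, i.e. the li again: {list(soup.ul.children)[3].next_element.previous_element}')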

.next_elements and .previous_elements
- Both return a generator
- Iterating over .next_elements or .previous_elements moves forward or backward through the document in the order it was parsed

print(f'Children of ul: {list(soup.ul.children)}')
print(f'Child at index 3 (the second li): {list(soup.ul.children)[3]}')
print('Everything parsed after the second li: next_elements')
for item in list(soup.ul.children)[3].next_elements:
	print(item, end='\n==========\n')

print('Everything parsed before the a tag inside the second li: previous_elements')
for item in list(soup.ul.children)[3].next_element.previous_elements:
	print(item, end='\n==========\n')
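
To make the difference between parse order and sibling order concrete, the sketch below follows the <li> example from above on a two-item list. The second <li> and the variable names are added purely for illustration; html.parser is used.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html = (
    '<ul><li class="item-10 li" name="item10"><a href="link10.html">10 item</a></li>'
    '<li class="item-11 li" name="item11"><a href="link11.html">11 item</a></li></ul>'
)
soup = BeautifulSoup(html, "html.parser")
li = soup.li

inside = li.next_element        # the <a> tag inside this li
beside = li.next_sibling        # the second li
text_node = li.a.next_element   # the string "10 item"

# After the last string inside the first li, parsing continues with the second li
after_text = text_node.next_element

# Parse order from the first li onwards: tags by name, strings by their text
parse_order = [
    el.name if isinstance(el, Tag) else str(el)
    for el in li.next_elements
]
print(parse_order)
```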

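Searching the tree

find()
- Returns the first matching tag
- find(name, attrs, recursive, string, **kwargs) (the string argument was called text before Beautiful Soup 4.4.0)
- find(...) behaves like find_all(..., limit=1); the only difference is that find_all() returns a list containing the single result, while find() returns the result itself
- When nothing matches, find_all() returns an empty list and find() returns None
- soup.head.title is the shorthand of navigating by tag name; it works by calling find() on each tag in turn
  - soup.head.title and soup.find('head').find('title') are effectively the same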
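find_all()
- find_all() looks through the descendants of the current tag and returns every one that matches the filters
- find_all(name, attrs, recursive, string, limit, **kwargs)
  - name:
    - The name argument finds all tags called name; text strings are ignored automatically
    - The name argument accepts any kind of filter: a string, a regular expression, a list, or True
  - recursive:
    - recursive=False: only search the direct children of the tag
  - keyword:
    - Any keyword argument that is not one of the built-in parameter names is treated as a filter on the tag attribute of that name. For example, if you pass an argument named id, Beautiful Soup filters on each tag's id attribute
    - If you pass href, Beautiful Soup filters on each tag's href attribute
    - Attribute filters accept a string, a regular expression, a list, or True
    - Some attribute names cannot be used as keyword arguments, for example the HTML5 data-* attributes (they are not valid Python identifiers). To search for them, pass a dictionary through the attrs argument of find_all()

# string=True matches every string; before Beautiful Soup 4.4.0 the argument was called text
print(f"Two equivalent calls (calling a tag is the same as calling find_all on it): {soup.title.find_all(string=True)}, {soup.title(string=True)}")
print(f"Use a dictionary to search for tags with a special attribute: {soup.find_all(attrs={'data-foo': 'value'})}")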
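
The return-value differences between find() and find_all(), together with the keyword, attrs and recursive arguments, can be sketched as follows. The fragment and its data-foo attribute are invented for illustration and parsed with html.parser.

```python
from bs4 import BeautifulSoup

html = (
    '<div data-foo="value" id="box">'
    '<a href="link1.html">first</a><a href="link2.html">second</a>'
    '</div>'
)
soup = BeautifulSoup(html, "html.parser")

first_link = soup.find("a")                    # the tag itself
one_link_list = soup.find_all("a", limit=1)    # a list holding that same tag

missing_one = soup.find("table")               # None when nothing matches
missing_all = soup.find_all("table")           # empty list when nothing matches

by_id = soup.find_all(id="box")                         # keyword argument -> attribute filter
by_data = soup.find_all(attrs={"data-foo": "value"})    # data-* needs the attrs dict

# The a tags are grandchildren of the document, so a non-recursive search misses them
direct_only = soup.find_all("a", recursive=False)

print(first_link, one_link_list, missing_one, missing_all, sep="\n")
```

Because find() can return None, chained calls such as soup.find("table").find("tr") should be guarded before the second call.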

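String: if you pass a string to a search method, Beautiful Soup looks for content that matches the string exactly
- If you pass a byte string, Beautiful Soup assumes it is encoded as UTF-8. Pass a Unicode string instead to avoid decoding errors

print(f"Find all a tags: {soup.find_all('a')}")
print(f"Find all title tags: {soup.find_all('title')}")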

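Regular expression: if you pass a compiled regular expression, Beautiful Soup filters with its search() method, so the pattern may match anywhere in the name unless it is anchored with ^

import re

print(f"Find all tags whose name starts with p: {soup.find_all(re.compile('^p'))}")
print(f"Find all tags whose name starts with ul: {soup.find_all(re.compile('^ul'))}")
print(f"Find all tags whose name contains l: count: {len(soup.find_all(re.compile('l')))}, result: {soup.find_all(re.compile('l'))}")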

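List: if you pass a list, Beautiful Soup returns everything that matches any item in the list

print(f"Find all a, title, and form tags: {soup.find_all(['a', 'title', 'form'])}")

True: matches any value, so it finds every tag, but it does not return text strings

print(f"Find all tags: {soup.find_all(True)}")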

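Passing a function
- If none of the other filters fits, you can define your own function. It takes a single argument (the element) and should return True if the element matches and False otherwise

print(f"Find all tags that have both a class and an id attribute: {soup.find_all(lambda tag: tag.has_attr('class') and tag.has_attr('id'))}")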
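
A named function is often easier to read than a lambda. The sketch below shows a tag-level filter and an attribute-level filter; the function names and the fragment are hypothetical, and html.parser is used.

```python
from bs4 import BeautifulSoup

html = (
    '<p class="story" id="p1">one</p><p class="story">two</p>'
    '<a href="link1.html">three</a><a href="http://www.baidu.com">four</a>'
)
soup = BeautifulSoup(html, "html.parser")


def has_class_but_no_id(tag) -> bool:
    """Tag-level filter: receives each tag, returns True to keep it."""
    return tag.has_attr("class") and not tag.has_attr("id")


def ends_with_html(value) -> bool:
    """Attribute-level filter: receives the attribute value (None if absent)."""
    return value is not None and value.endswith(".html")


class_only = [tag.get_text() for tag in soup.find_all(has_class_but_no_id)]
local_links = [tag.get_text() for tag in soup.find_all(href=ends_with_html)]

print(class_only, local_links)
```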

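Searching by CSS class
- The attribute name class is a reserved word in Python, so using class as a keyword argument is a syntax error. As of Beautiful Soup 4.1.2, you can search for tags with a given CSS class through the class_ argument
- class_ accepts the same kinds of filter: a string, a regular expression, a function, or True
- The class attribute is multi-valued; when you search by CSS class, each class of a tag is matched individually
- You can also search for the exact string value of the class attribute
- With an exact match, the search finds nothing if the classes are given in a different order than in the document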
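
The class_ rules above can be sketched as follows, using three <li> tags whose class values are taken from the test document (the tag contents a, b, c are shortened for illustration); html.parser is used.

```python
import re

from bs4 import BeautifulSoup

html = (
    '<ul><li class="item-0 li">a</li>'
    '<li class="li li-first">b</li>'
    '<li class="item-1">c</li></ul>'
)
soup = BeautifulSoup(html, "html.parser")


def texts(tags):
    """Collect the text of each tag in a result list."""
    return [tag.get_text() for tag in tags]


single_class = texts(soup.find_all("li", class_="li"))           # matches one class of a tag
exact_value = texts(soup.find_all("li", class_="item-0 li"))      # exact attribute string
wrong_order = texts(soup.find_all("li", class_="li item-0"))      # order differs: no match
by_pattern = texts(soup.find_all("li", class_=re.compile("^item-")))

print(single_class, exact_value, wrong_order, by_pattern)
```

If the order of the classes should not matter, one option is to filter with a function that checks several names against tag.get("class", []).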
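The string argument (called text before Beautiful Soup 4.4.0)
- The string argument searches the text content of the document instead of the tags. Like name, it accepts a string, a regular expression, a list, or True
The limit argument
- find_all() returns every match, which can be slow on a large tree. If you do not need all the results, pass limit to cap the number returned. It works like the LIMIT keyword in SQL: the search stops as soon as limit results have been found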
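
Here is a sketch of both arguments on a made-up three-item list. It assumes Beautiful Soup 4.4.0 or later, where the argument is spelled string, and uses html.parser.

```python
import re

from bs4 import BeautifulSoup

html = (
    "<ul><li><a>first item</a></li>"
    "<li><a>second item</a></li>"
    "<li><a>third item</a></li></ul>"
)
soup = BeautifulSoup(html, "html.parser")

# string alone returns the matching text nodes, not tags
exact_strings = soup.find_all(string="first item")

# Combined with a tag name, it returns tags whose .string matches
tags_by_text = soup.find_all("a", string="second item")

# limit stops the search after the given number of results
first_two = soup.find_all(string=re.compile("item"), limit=2)

print(exact_strings, tags_by_text, first_two, sep="\n")
```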
