Python Study Notes: Parsing HTML with the BeautifulSoup Module

Reposted article, 2019-07-17 16:14, category: Python Basics

Informal notes, kept so that I and fellow learners can look things up later.

#------------------------------------------------ divider -------------------------------------------

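　　Beautiful Soup is a module for extracting information from HTML pages (for this purpose it is much better than regular expressions). The module's name is bs4 (for Beautiful Soup, version 4). To install it, run pip install beautifulsoup4 on the command line. Although the name used for installation is beautifulsoup4, you import it with import bs4.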
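
As a quick check that the installation worked, the sketch below imports bs4 and parses a one-line HTML string. The HTML string and the variable names are made up for illustration; "html.parser" is the parser that ships with Python's standard library.

```python
import bs4

# A tiny, made-up HTML document used only to confirm that bs4 imports and parses
html = "<html><head><title>Hello</title></head><body><p>Hi there</p></body></html>"
soup = bs4.BeautifulSoup(html, "html.parser")

title_text = soup.title.get_text()   # text inside <title>
first_paragraph = soup.p.get_text()  # text inside the first <p>

print(title_text)
print(first_paragraph)
```

If the import fails with ModuleNotFoundError, the package was most likely installed for a different Python interpreter than the one running the script.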

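Create a new .txt file, copy the content below into it, and change the file extension to .html (the later examples read it from d:\example.html).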


<html><head><title>The Website Title</title></head>
<body>
<p>Browse my <strong>博樂園</strong> website from <a href="https://www.cnblogs.com/lirongyang/">my website</a>.</p>
<p class="slogan">Learn Python the easy way!</p>
<p>By <span id="author">Li Rong Yang</span></p>
</body></html>

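　　As you can see, even a simple HTML file contains many different tags and attributes, and with complex websites things quickly become confusing. Fortunately, Beautiful Soup makes working with HTML much easier.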
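
If you would rather not create the file by hand, a short script can write it. This is only a minimal sketch: the file name example.html in the current directory is an assumption (the later examples read d:\example.html, so adjust the path to match), and the file is written as UTF-8 so that the Chinese characters survive.

```python
from pathlib import Path

# The same HTML as above, stored in a string
EXAMPLE_HTML = """<html><head><title>The Website Title</title></head>
<body>
<p>Browse my <strong>博樂園</strong> website from <a href="https://www.cnblogs.com/lirongyang/">my website</a>.</p>
<p class="slogan">Learn Python the easy way!</p>
<p>By <span id="author">Li Rong Yang</span></p>
</body></html>
"""

# Assumed location: example.html in the current working directory
example_path = Path("example.html")
example_path.write_text(EXAMPLE_HTML, encoding="utf-8")

# Show where the file ended up
print(example_path.resolve())
```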

#------------------------------------------------ divider -------------------------------------------

　　1. Creating a BeautifulSoup object from HTML

　　Example code that calls bs4.BeautifulSoup():

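#! python3
# -*- coding: utf-8 -*-
# Author: Li Rong Yang
import sys

import bs4
import requests

# Download the page; requests.get() returns a Response object
res = requests.get('https://www.cnblogs.com/lirongyang/')

try:
    # Raises an exception if the server answered with a 4xx or 5xx status
    res.raise_for_status()
except requests.exceptions.HTTPError as exc:
    print('There was a problem: %s' % (exc))
    # Stop here instead of parsing an error page
    sys.exit(1)

# Parse the HTML text of the Response with bs4.BeautifulSoup()
noStarchSoup = bs4.BeautifulSoup(res.text, "html.parser")

print(noStarchSoup)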

　　Output: the parsed HTML of the whole page is printed (the original screenshot is not reproduced here).
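
bs4.BeautifulSoup() accepts either a string of HTML or an open file object, so the same call works for downloaded text and for a local file. The sketch below shows both on a made-up snippet written to a temporary file; the snippet and the variable names are illustrative only.

```python
import os
import tempfile

import bs4

html = '<p>By <span id="author">Li Rong Yang</span></p>'

# Write the snippet to a temporary file so that the example is self-contained
with tempfile.NamedTemporaryFile("w", suffix=".html", delete=False,
                                 encoding="utf-8") as tmp:
    tmp.write(html)
    tmp_path = tmp.name

# 1) Build the soup from a string
soup_from_string = bs4.BeautifulSoup(html, "html.parser")

# 2) Build the soup from an open file object
with open(tmp_path, encoding="utf-8") as fh:
    soup_from_file = bs4.BeautifulSoup(fh, "html.parser")

# Clean up the temporary file
os.remove(tmp_path)

# Both routes produce the same parsed document
print(str(soup_from_string) == str(soup_from_file))
```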

　　2. Finding elements with the select() method

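　　To find the elements you are looking for, call the select() method and pass in a string containing a CSS selector; this retrieves the matching elements of the web page. Selectors are like regular expressions: they specify a pattern to look for, in this case in an HTML page rather than in an ordinary text string.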

　　Examples of CSS selectors

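　　soup.select('div') # All elements named <div>
　　soup.select('#author') # The element with an id attribute of author
　　soup.select('.notice') # All elements that use a CSS class attribute named notice
　　soup.select('div span') # All <span> elements anywhere inside a <div> element
　　soup.select('div > span') # All <span> elements directly inside a <div> element, with no other element in between
　　soup.select('input[name]') # All <input> elements that have a name attribute with any value
　　soup.select('input[type="button"]') # All <input> elements that have a type attribute with the value button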
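
To see how the selectors listed above behave, the sketch below runs each one against a small made-up document and counts the matches. The HTML is invented for this illustration and is not part of the example file.

```python
import bs4

# Made-up HTML that contains one case for each selector in the list above
html = """
<div id="box">
  <span id="author">Li Rong Yang</span>
  <p class="notice">Note: <span>nested</span></p>
  <form><input name="q"><input type="button" value="Go"></form>
</div>
"""
soup = bs4.BeautifulSoup(html, "html.parser")

selectors = ['div', '#author', '.notice', 'div span', 'div > span',
             'input[name]', 'input[type="button"]']

# Map each selector to the number of elements it matches
counts = {sel: len(soup.select(sel)) for sel in selectors}

for sel, n in counts.items():
    print(sel, n)
```

Note the difference between 'div span' (two matches, because the nested <span> inside <p> also counts) and 'div > span' (one match, only the direct child).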

　　Example code using soup.select('#author'):

#! python3
# -*- coding: utf-8 -*-
# Author: Li Rong Yang
import bs4

# Read the example file saved earlier (assumes it was saved as UTF-8)
with open('d:\\example.html', encoding='utf-8') as exampleFile:
    exampleSoup = bs4.BeautifulSoup(exampleFile.read(), "html.parser")

# select('#author') returns a list of all elements with id="author"
elems = exampleSoup.select('#author')
# The type of the value returned by select()
print(type(elems))

# The length tells us how many Tag objects are in the list
print(len(elems))

# The type of the first item in the list (a Tag object)
print(type(elems[0]))

# getText() returns the text of the element
print(elems[0].getText())

# str() on a Tag returns a string with the opening and closing tags and the element's text
print(str(elems[0]))

# attrs is a dictionary containing the element's 'id' attribute and its value, 'author'
print(elems[0].attrs)

　　Output: the original screenshot is not reproduced here.
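
select() returns an empty list when nothing matches, so indexing the result with [0] would raise IndexError. A defensive sketch follows; select_one() is a Beautiful Soup method that returns the first match or None, and the '#missing' selector is made up to show the no-match case.

```python
import bs4

soup = bs4.BeautifulSoup('<p>By <span id="author">Li Rong Yang</span></p>',
                         "html.parser")

# No element has id="missing", so select() gives back an empty list
missing = soup.select('#missing')
print(len(missing))

# select_one() returns the first matching Tag, or None when there is no match
author = soup.select_one('#author')
nobody = soup.select_one('#missing')

# Fall back to an empty string when the element was not found
author_text = author.getText() if author is not None else ''
nobody_text = nobody.getText() if nobody is not None else ''
print(repr(author_text), repr(nobody_text))
```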

　　3. Getting data from an element's attributes, using the locally saved HTML file as an example

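#! python3
# -*- coding: utf-8 -*-
# Author: Li Rong Yang
import bs4

# Assumes the example file was saved as UTF-8
with open('d:\\example.html', encoding='utf-8') as exampleFile:
    exampleSoup = bs4.BeautifulSoup(exampleFile, "html.parser")

# The first <a> element in the document
elems = exampleSoup.select('a')[0]
# Show the matched element as a string
print(str(elems))
# The <a> element has no id attribute, so get() returns None
print(elems.get('id'))

# The first <span> element in the document
elems = exampleSoup.select('span')[0]

# Show the matched element as a string
print(str(elems))
# Passing the attribute name 'id' to get() returns its value, 'author'
print(elems.get('id'))
# attrs holds all of the element's attributes as a dictionary
print(elems.attrs)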

　　Output: the original screenshot is not reproduced here.
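
Besides get(), a Tag also supports dictionary-style access. The sketch below contrasts the two on an <a> element like the one in the example file: tag['id'] raises KeyError when the attribute is missing, while get() returns None or a default that you supply. The inline HTML string and the 'no-id' default are illustrative choices.

```python
import bs4

html = '<p>Browse <a href="https://www.cnblogs.com/lirongyang/">my website</a>.</p>'
soup = bs4.BeautifulSoup(html, "html.parser")
link = soup.select('a')[0]

# Dictionary-style access works for attributes that exist
href = link['href']

# get() accepts an optional default that is returned when the attribute is missing
link_id = link.get('id', 'no-id')

# Dictionary-style access raises KeyError for a missing attribute
try:
    link['id']
    raised = False
except KeyError:
    raised = True

print(href, link_id, raised)
```

get() is the safer choice when an attribute may be absent, which is common on real pages.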
