Python [Parsing and Extracting Web Page Data with BeautifulSoup]

This article is a repost. 2019-08-31 12:30. Tag: crawler

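[Parsing data]

When we browse the web, the browser translates the HTML source returned by the server into a form we can read.

A crawler likewise needs a tool that understands HTML before it can extract the data we want.

[Extracting data] means picking the data we need out of everything on the page.

Right-click the page and choose "View Page Source"; searching in that view is more accurate.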

Installation

pip install BeautifulSoup4 (on a Mac, use pip3 install BeautifulSoup4)

+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

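Parsing data

BeautifulSoup() takes two arguments in its parentheses.

Argument 0 is the markup to parse, normally a string (an open file object is also accepted).

Argument 1 is the parser. Here we use html.parser, which is built into Python.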

import requests
from bs4 import BeautifulSoup  # import the BeautifulSoup library

res = requests.get('https://localprod.pandateacher.com/python-manuscript/crawler-html/spider-men5.0.html')
html = res.text
soup = BeautifulSoup(html, 'html.parser')  # parse the page into a BeautifulSoup object
print(type(soup))  # check the type of soup: <class 'bs4.BeautifulSoup'>
print(soup)  # print soup

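res.text and soup print the same content.

They belong to different classes: <class 'str'> and <class 'bs4.BeautifulSoup'>. The former is a plain string; the latter is a BeautifulSoup object that has already been parsed.

Why the printed output looks the same: printing a BeautifulSoup object calls the object's __str__ method, so what print() shows for the object is the string that __str__ returns.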
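To make the difference between the two types concrete, here is a minimal offline sketch. The HTML string is made up for illustration and stands in for res.text, so no network request is needed.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML standing in for res.text (not the page used above)
html = '<html><body><p>Hello</p></body></html>'

soup = BeautifulSoup(html, 'html.parser')

print(type(html))  # the raw markup is a plain str
print(type(soup))  # the parsed result is a BeautifulSoup object

# print(soup) displays str(soup), which is why the two look alike on screen
text_form = str(soup)
print(text_form)

# Only the parsed object can be searched
first_p = soup.find('p')
print(first_p)
```

The string has no find() method, which is the practical reason for parsing before extracting.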

+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

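Extracting data

find() and find_all()

These are two methods of the BeautifulSoup object.

Both can match HTML tags and attributes, and both are called the same way.

The difference

find() extracts only the first element that meets the criteria.

find_all() extracts every element that meets the criteria.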

import requests
from bs4 import BeautifulSoup

url = 'https://localprod.pandateacher.com/python-manuscript/crawler-html/spder-men0.0.html'
res = requests.get(url)
print(res.status_code)
soup = BeautifulSoup(res.text, 'html.parser')
item = soup.find('div')  # use find() to extract the first <div> element and store it in item
print(type(item))  # print the type of item
print(item)  # print item

Output:

200
<class 'bs4.element.Tag'>
<div>大家好，我是一個塊</div>

item is a Tag object holding the first <div> on the page.

Using find_all() on the same soup:

items = soup.find_all('div')  # use find_all() to extract every matching element and store them in items
print(type(items))  # print the type of items
print(items)  # print items

Output (the script still prints the status code first):

200
<class 'bs4.element.ResultSet'>
[<div>大家好，我是一個塊</div>, <div>我也是一個塊</div>, <div>我還是一個塊</div>]

items is a ResultSet object. It has a list structure: the Tag objects are stored as a list, so it can be handled like a list.
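Handling a ResultSet like a list can be sketched as follows. The three-<div> markup here is a made-up stand-in with the same shape as the page above, so the example runs offline. It also shows what each method returns when nothing matches.

```python
from bs4 import BeautifulSoup

# Hypothetical three-<div> page, similar in shape to the one above
html = '<div>block one</div><div>block two</div><div>block three</div>'
soup = BeautifulSoup(html, 'html.parser')

items = soup.find_all('div')
print(len(items))   # number of matches
print(items[0])     # indexing works as it does on a list
for item in items:  # so does iteration
    print(item.text)

# With no match, find() returns None and find_all() returns an empty ResultSet
missing_one = soup.find('span')
missing_all = soup.find_all('span')
print(missing_one, len(missing_all))
```

Because find() can return None, it is worth checking the result before accessing attributes on it.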

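soup.find('div', class_='books')

class_ is written with a trailing underscore to distinguish it from class, which is a reserved keyword in Python, and so avoid a conflict.

Other attributes can be used as well, such as the style attribute.

Arguments in the parentheses: you can pass a tag, an attribute, or both together, depending on what you want to extract from the page.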
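The three ways of combining tag and attribute arguments can be compared side by side. This is a minimal sketch on made-up markup; the class and style values are assumptions chosen only for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the class and style values are made up
html = '''
<div class="books">Book list</div>
<p class="books">Also tagged books</p>
<div style="color:red">Red block</div>
'''
soup = BeautifulSoup(html, 'html.parser')

by_tag = soup.find_all('div')                   # tag only
by_class = soup.find_all(class_='books')        # attribute only
by_both = soup.find_all('div', class_='books')  # tag and attribute together
by_style = soup.find_all(style='color:red')     # a different attribute

print(len(by_tag), len(by_class), len(by_both), len(by_style))
```

Passing both a tag and an attribute narrows the match to elements that satisfy both conditions.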

import requests  # import the requests library
from bs4 import BeautifulSoup  # import the BeautifulSoup library

# requests.get() returns a Response object, assigned to res
res = requests.get('https://localprod.pandateacher.com/python-manuscript/crawler-html/spider-men5.0.html')
html = res.text  # the content of the Response object as a string
soup = BeautifulSoup(html, 'html.parser')  # parse the page into a BeautifulSoup object
items = soup.find_all(class_='books')  # extract the elements whose class attribute is 'books'
print(type(items))  # <class 'bs4.element.ResultSet'>, which can be treated as a list

# loop over the list
for item in items:
    print('The data we are looking for is all in here:\n', item)  # print item

print(type(item))  # <class 'bs4.element.Tag'>: each item is a Tag object

#####################################################################

Tag object

What find() and find_all() return is not yet the target data, because it still contains HTML tags.


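items = soup.find_all(class_='books')  # extract the elements whose class attribute is 'books'

for item in items:
    print(type(item))

The type is <class 'bs4.element.Tag'>, so each item is a Tag object.

At this point we need the three commonly used attributes and methods of a Tag object.

Tag.find() and Tag.find_all() search inside the tag. In addition, Tag.text extracts the text inside the tag, and Tag['attribute name'] extracts the value of an attribute.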

import requests  # import the requests library
from bs4 import BeautifulSoup  # import the BeautifulSoup library

# requests.get() returns a Response object, assigned to res
res = requests.get('https://localprod.pandateacher.com/python-manuscript/crawler-html/spider-men5.0.html')
html = res.text  # the response body as a string
soup = BeautifulSoup(html, 'html.parser')  # parse the page into a BeautifulSoup object
items = soup.find_all(class_='books')  # extract the elements whose class attribute is 'books'

for item in items:  # loop over the list items
    kind = item.find('h2')  # within each element, match the <h2> tag
    title = item.find(class_='title')  # within each element, match class='title'
    brief = item.find(class_='info')  # within each element, match class='info'
    # print each book's category, title, link, and description text
    print(kind.text, '\n', title.text, '\n', title['href'], '\n', brief.text)
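The same extraction pattern can be tried offline. In the sketch below the markup, the category name, and the example.com link are all invented for illustration; the real page's structure may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the real page's structure may differ
html = '''
<div class="books">
  <h2>Science Fiction</h2>
  <a class="title" href="https://example.com/book1">Example Book</a>
  <p class="info">A short description.</p>
</div>
'''
soup = BeautifulSoup(html, 'html.parser')

for item in soup.find_all(class_='books'):
    kind = item.find('h2')
    title = item.find(class_='title')
    brief = item.find(class_='info')
    print(kind.text)        # text inside <h2>
    print(title.text)       # text inside the link
    print(title['href'])    # value of the href attribute
    print(title.get('id'))  # .get() returns None when the attribute is missing
    print(brief.text)
```

Tag['attribute name'] raises KeyError when the attribute does not exist, so Tag.get() is the safer choice when an attribute may be absent.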

##################################################################

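How the object changes

We start by fetching data with the requests library,

then parse the data with the BeautifulSoup library,

and then extract the data, again with the BeautifulSoup library.

What keeps changing along the way is the type of the object we operate on: a Response object, then a string (res.text), then a BeautifulSoup object, then a ResultSet or Tag, and finally the strings taken from Tag.text or Tag['attribute name'].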
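The chain of types can be printed directly. This is a minimal sketch that starts from a made-up HTML string in place of res.text, so the Response step is left out and nothing is downloaded.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML string standing in for res.text
html = '<div class="books"><h2>Category</h2></div>'

soup = BeautifulSoup(html, 'html.parser')  # str -> BeautifulSoup
items = soup.find_all(class_='books')      # BeautifulSoup -> ResultSet
item = items[0]                            # ResultSet -> Tag
heading = item.find('h2')                  # Tag -> Tag
text = heading.text                        # Tag -> str

type_names = [type(obj).__name__ for obj in (html, soup, items, item, heading, text)]
print(type_names)
```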

################################################################

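Going further with extraction

Extraction can be nested several levels deep:

find('ul', class_='nav').find('ul').find_all('li')

# My own understanding of nested extraction:

The tag or attribute passed to each find() corresponds to one level of nesting.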
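The level-by-level reading of that chain can be sketched on a small nested menu. The markup is invented for illustration; only the class name nav is taken from the chain above.

```python
from bs4 import BeautifulSoup

# Hypothetical nested menu; only the class name 'nav' comes from the chain above
html = '''
<ul class="nav">
  <li>Menu
    <ul>
      <li>Home</li>
      <li>About</li>
    </ul>
  </li>
</ul>
'''
soup = BeautifulSoup(html, 'html.parser')

outer = soup.find('ul', class_='nav')  # level 1: <ul class="nav">
inner = outer.find('ul')               # level 2: the first <ul> inside it
entries = inner.find_all('li')         # level 3: every <li> inside that <ul>

names = [li.text for li in entries]
print(names)
```

Splitting the chain into named steps makes it easier to check each level, since calling find() on a None result from an earlier step raises AttributeError.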
