A Detailed Guide to the BeautifulSoup Module's Functions

Reposted article. 2018-07-13 15:01

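BeautifulSoup is a Python library whose main purpose is extracting the data we need from web pages. It parses HTML into a tree of objects, so the whole page can be navigated much like dictionaries and lists, which greatly simplifies processing compared with regular expressions.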

0x01 Installation

BeautifulSoup 4 is the recommended version. Install it with pip:

pip install beautifulsoup4

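BeautifulSoup supports the HTML parser in Python's standard library by default, and it also supports several third-party parsers:

No.	Parser	Usage	Advantages	Disadvantages
1	Python standard library	BeautifulSoup(html, 'html.parser')	Built into Python; decent execution speed	Less tolerant of malformed markup
2	lxml HTML parser	BeautifulSoup(html, 'lxml')	Fast; tolerant of malformed markup	Must be installed; depends on a C library
3	lxml XML parser	BeautifulSoup(html, ['lxml', 'xml'])	Fast; tolerant of malformed markup; supports XML	Depends on a C library
4	html5lib parser	BeautifulSoup(html, 'html5lib')	Parses pages the way a browser does; the most tolerant of malformed markup	Slow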

0x02 Creating the Object

Import the library:

from bs4 import BeautifulSoup

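Fetch a page to parse:

from urllib.request import urlopen  # Python 3 replacement for urllib2

url = 'http://www.baidu.com'
resp = urlopen(url)
html = resp.read()

Create the object, naming the parser explicitly:

bs = BeautifulSoup(html, 'html.parser')

Pretty-print the content:

print(bs.prettify())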
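
The steps above depend on network access. As a minimal offline sketch, the same kind of object can be built from an inline string. The markup below is made up for illustration and does not come from the original article.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for a downloaded page.
html = "<html><head><title>Demo</title></head><body><p>Hi</p></body></html>"

# Name the parser explicitly so the result does not depend on
# which third-party parsers happen to be installed.
bs = BeautifulSoup(html, "html.parser")

# prettify() returns the document as an indented, multi-line string.
pretty = bs.prettify()
print(pretty)
```

Working from a fixed string like this makes it easier to experiment with the methods in the following sections, because the input never changes between runs.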

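0x03 Object Types

BeautifulSoup converts a complex HTML document into a tree structure. Every node is an object, and these objects fall into a few kinds:

(1) Tag

A Tag corresponds to a tag in the HTML:

# Extract a Tag
print(bs.title)
print(type(bs.title))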

Result:

<title>百度一下，你就知道</title>
<class 'bs4.element.Tag'>

A Tag has several important attributes:

name: the name of the tag itself;
attrs: a dictionary containing all of the tag's attributes.

print(bs.a.name)
print(bs.a.attrs)

Output:

a
{'href': '/', 'id': 'result_logo', 'onmousedown': "return c({'fm':'tab','tab':'logo'})"}
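
Beyond .attrs, individual attributes can be read with dictionary-style access. The following is a small sketch rather than anything from the original article: it reuses the anchor markup from the Comment example in this article, and the link text is an assumption added for illustration.

```python
from bs4 import BeautifulSoup

# Anchor markup borrowed from the article; the text "link" is made up.
html = '<a class="css" href="http://example.com/test" id="test">link</a>'
bs = BeautifulSoup(html, "html.parser")
tag = bs.a

print(tag.name)          # a
print(tag["href"])       # http://example.com/test
print(tag.get("title"))  # None, because the attribute is absent
print(tag["class"])      # ['css'], class is multi-valued, so it is a list
```

Using tag.get() instead of tag[...] avoids a KeyError when an attribute may be missing.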

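(2) NavigableString

A NavigableString holds the text inside a tag and is reachable through .string. A Comment is a special kind of NavigableString that holds the content of a comment, but its output does not include the comment delimiters. Consider this example:

# coding: utf-8
from bs4 import BeautifulSoup

html = '''
<a class="css" href="http://example.com/test" id="test"><!--test --></a>
'''
bs = BeautifulSoup(html, "html.parser")
print(bs.a)
print(bs.a.string)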

Result:

<a class="css" href="http://example.com/test" id="test"><!--test --></a>
test 

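The content of the a tag is a comment, yet .string still returns it, and without the comment delimiters it looks like ordinary text. In that case, check the type first:

from bs4 import element

# Check whether the string is a comment
if isinstance(bs.a.string, element.Comment):
    print(bs.a.string)

Now consider the following example: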

<a class="css1" href="http://example.com/cdd" id="css">abc<!--def -->gh</a>

The content mixes a comment with strings. In this case, contents returns all of the child objects:

for i in bs.a.contents:
    print(i)

To ignore the comment, use get_text() or .text:

print(bs.a.get_text())

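To use a NavigableString outside BeautifulSoup, call unicode() (str() in Python 3) to convert it to an ordinary Unicode string. Otherwise the string keeps a reference to the entire BeautifulSoup parse tree even after you have finished with BeautifulSoup, which wastes memory.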
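
The comment handling described above can be put together in one sketch. It assumes the mixed markup from the earlier example and Python 3, where str() plays the role of unicode(). The variable names are chosen here for illustration.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

html = '<a class="css1" href="http://example.com/cdd" id="css">abc<!--def -->gh</a>'
bs = BeautifulSoup(html, "html.parser")

# Separate comment nodes from ordinary text nodes.
comments = [node for node in bs.a.contents if isinstance(node, Comment)]

# str() detaches each text node from the parse tree as a plain string.
texts = [str(node) for node in bs.a.contents if not isinstance(node, Comment)]

print(comments)         # ['def ']
print(texts)            # ['abc', 'gh']
print(bs.a.get_text())  # abcgh
```

isinstance() is used instead of comparing type() directly so that subclasses of Comment are also recognised.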

0x04 Searching the Document Tree

This section focuses on the find_all() method:

find_all(name, attrs, recursive, text, limit, **kwargs)

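(1) The name argument

The name argument finds every Tag whose name matches; string objects are ignored automatically.

print(bs.find_all('a'))

Passing a list:

print(bs.find_all(['a', 'b']))

Passing a regular expression:

import re

print(bs.find_all(re.compile('^b')))

Every tag whose name starts with b is found.

Passing a function:

def has_class_but_not_id(tag):
    return tag.has_attr('class') and not tag.has_attr('id')

print(bs.find_all(has_class_but_not_id))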
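
The four kinds of name filter can be compared side by side. This sketch runs them against a small made-up document, which is an assumption for illustration and not the page used elsewhere in this article.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup chosen so that each filter gives a different result.
html = '<body><a id="x" class="c">1</a><b class="c">2</b><p>3</p></body>'
bs = BeautifulSoup(html, "html.parser")

def has_class_but_not_id(tag):
    """Return True for tags that carry a class attribute but no id."""
    return tag.has_attr("class") and not tag.has_attr("id")

by_string = [t.name for t in bs.find_all("a")]
by_list = [t.name for t in bs.find_all(["a", "b"])]
by_regex = [t.name for t in bs.find_all(re.compile("^b"))]
by_func = [t.name for t in bs.find_all(has_class_but_not_id)]

print(by_string)  # ['a']
print(by_list)    # ['a', 'b']
print(by_regex)   # ['body', 'b']
print(by_func)    # ['b']
```

Note that the regular expression matches body as well as b, because both names start with the letter b.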

(2) Keyword arguments

print(bs.find_all(id='css'))
print(bs.find_all(id=re.compile('^a')))

Several keyword arguments can also be combined:

print(bs.find_all(id='css', href=re.compile('example')))

class can be used as a filter too, but class is a reserved word in Python, so use class_ instead, or pass the arguments as a dictionary:

print(bs.find_all(class_='css'))
print(bs.find_all(attrs={'class': 'css'}))
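
As a rough illustration of how these attribute filters behave, the sketch below runs them against three anchors. The first two are adapted from the examples in this article; the third anchor, the extra class, and the link texts are assumptions added so that each filter has something to exclude.

```python
import re
from bs4 import BeautifulSoup

html = '''
<a class="css" href="http://example.com/test" id="test">one</a>
<a class="css1 extra" href="http://example.com/cdd" id="css">two</a>
<a href="/local" id="abc">three</a>
'''
bs = BeautifulSoup(html, "html.parser")

# A regular expression on id: only ids that start with "a".
ids = [t["id"] for t in bs.find_all(id=re.compile("^a"))]

# Two conditions at once: both must hold for a tag to match.
both = [t["id"] for t in bs.find_all(id="css", href=re.compile("example"))]

# class_ and the attrs dictionary are two spellings of the same filter.
by_class = [t["id"] for t in bs.find_all(class_="css")]
by_dict = [t["id"] for t in bs.find_all(attrs={"class": "css"})]

# A tag with several classes matches on any single one of them.
by_one_of_many = [t["id"] for t in bs.find_all(class_="extra")]

print(ids)             # ['abc']
print(both)            # ['css']
print(by_class)        # ['test']
print(by_dict)         # ['test']
print(by_one_of_many)  # ['css']
```

The class filter compares against individual class names, which is why class_="css" does not match the anchor whose classes are css1 and extra.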

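(3) The text argument

The text argument searches the string content of the document. Like name, it accepts a string, a regular expression, a list, True, and so on. Newer releases of BeautifulSoup also accept this argument under the name string.

print(bs.find_all(text=re.compile('^abc')))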
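
The following sketch shows the different value types for a string search. It assumes BeautifulSoup 4.4.0 or later, where the argument is available as string (the newer name for text), and it uses made-up markup for illustration.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup with three short text nodes.
html = "<p>abc</p><p>abcdef</p><p>xyz</p>"
bs = BeautifulSoup(html, "html.parser")

# Used alone, the string filter returns the matching text nodes themselves.
exact = [str(s) for s in bs.find_all(string="abc")]
pattern = [str(s) for s in bs.find_all(string=re.compile("^abc"))]
several = [str(s) for s in bs.find_all(string=["abc", "xyz"])]

# Combined with a tag name, it returns the tags whose .string matches.
tags = bs.find_all("p", string=re.compile("^abc"))

print(exact)      # ['abc']
print(pattern)    # ['abc', 'abcdef']
print(several)    # ['abc', 'xyz']
print(len(tags))  # 2
```

The difference in return type matters in practice: a bare string search yields NavigableString objects, while adding a tag name yields Tag objects.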

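(4) The limit argument

Limits the number of objects returned, much like the LIMIT keyword in a SQL query. The search stops as soon as the limit is reached.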
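
A minimal sketch of limit, using a made-up list that is not part of the original article:

```python
from bs4 import BeautifulSoup

# Hypothetical markup with four list items.
html = "<ul><li>1</li><li>2</li><li>3</li><li>4</li></ul>"
bs = BeautifulSoup(html, "html.parser")

# Only the first two matches in document order are returned.
first_two = [str(li.string) for li in bs.find_all("li", limit=2)]
print(first_two)  # ['1', '2']

# find() returns just the first match (or None) instead of a list.
first = bs.find("li")
print(first.string)  # 1
```

When only one result is needed, find() is usually more convenient than find_all() with limit=1, because it returns the element directly rather than a one-item list.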

(5) The recursive argument

When find_all() is called on a tag, BeautifulSoup searches all of that tag's descendants. To search only the tag's direct children, pass recursive=False.
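
The effect of recursive=False is easiest to see on nested markup. The document below is an assumption made up for this sketch.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one <p> directly inside <div>, one nested deeper.
html = "<div><p>direct</p><section><p>nested</p></section></div>"
bs = BeautifulSoup(html, "html.parser")
div = bs.div

# Default: search every descendant of the div.
all_p = [p.get_text() for p in div.find_all("p")]

# recursive=False: look only at the div's direct children.
direct_p = [p.get_text() for p in div.find_all("p", recursive=False)]

print(all_p)     # ['direct', 'nested']
print(direct_p)  # ['direct']
```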

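0x05 CSS Selectors

Elements can also be filtered using CSS selector syntax:

# Tag selector
print(bs.select('a'))
# Class selector
print(bs.select('.css'))
# ID selector
print(bs.select('#css'))
# Attribute selector
print(bs.select('a[class="css"]'))
# Iterate over the results
for tag in bs.select('a'):
    print(tag.get_text())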
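
Selectors can also be combined. The sketch below is not from the original article: it uses made-up markup to contrast the descendant and child combinators, and it uses select_one(), which returns the first match instead of a list.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one link directly under the div, one nested in a <p>.
html = '<div id="nav"><a class="css" href="/a">A</a><p><a href="/b">B</a></p></div>'
bs = BeautifulSoup(html, "html.parser")

# Descendant combinator: any <a> anywhere inside the div.
descendants = [a.get_text() for a in bs.select("div a")]

# Child combinator: only <a> tags that are direct children of the div.
children = [a.get_text() for a in bs.select("div > a")]

# Attribute selector with an exact value.
by_href = [a.get_text() for a in bs.select('a[href="/b"]')]

# select_one() returns the first matching tag, or None if nothing matches.
first = bs.select_one("#nav a.css")

print(descendants)       # ['A', 'B']
print(children)          # ['A']
print(by_href)           # ['B']
print(first.get_text())  # A
```

select() always returns a list, which is empty when nothing matches, so the result can be iterated over without a None check.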
