A Quick Summary of Using BeautifulSoup in Python

This article is a repost. 2017-07-16 00:34, python

1. Installation

pip install beautifulsoup4

2. Importing it in your code

from bs4 import BeautifulSoup

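Parser	Usage	Advantages	Disadvantages
Python standard library	BeautifulSoup(markup, "html.parser")	Built into Python; decent speed; lenient with malformed documents	Less lenient in versions before Python 2.7.3 / 3.2.2
lxml HTML parser	BeautifulSoup(markup, "lxml")	Very fast; lenient with malformed documents	Requires an external C library
lxml XML parser	BeautifulSoup(markup, ["lxml", "xml"]) or BeautifulSoup(markup, "xml")	Very fast; the only XML parser supported	Requires an external C library
html5lib	BeautifulSoup(markup, "html5lib")	Most lenient; parses pages the way a web browser does; produces valid HTML5	Very slow; requires an external Python package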
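
The table can be turned into a small fallback routine, sketched below under a few assumptions: the helper name make_soup and its preference order are hypothetical and not part of Beautiful Soup, while FeatureNotFound is the exception bs4 raises when the requested parser is not installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred=("lxml", "html.parser")):
    """Hypothetical helper: return (soup, parser_name) for the first parser that is available."""
    for parser in preferred:
        try:
            return BeautifulSoup(markup, parser), parser
        except FeatureNotFound:
            continue  # this parser is not installed; try the next one
    raise RuntimeError("none of the requested parsers is installed")


soup, used = make_soup("<p>hello</p>")
print(used, soup.p.string)
```

Since html.parser ships with Python, keeping it last in the list means the helper should always find something to use.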

import requests

r = requests.get('http://www.baidu.com/')
soup = BeautifulSoup(r.text, 'html.parser')

soup = BeautifulSoup(open('index.html'), 'html.parser')  # parse a local file; naming the parser avoids a warning

print(soup.prettify())  # pretty-print the HTML
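
The two ways of building a soup shown above (from a string and from an open file) can be tried without network access. This is only a sketch: the HTML snippet is made up, and a temporary file stands in for index.html.

```python
import os
import tempfile

from bs4 import BeautifulSoup

# Made-up sample document for illustration.
html = "<html><head><title>Demo</title></head><body><p>Hello</p></body></html>"

# Parse directly from a string.
soup_from_string = BeautifulSoup(html, "html.parser")

# Parse from a file handle; the temporary file stands in for index.html.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "index.html")
    with open(path, "w", encoding="utf-8") as f:
        f.write(html)
    with open(path, encoding="utf-8") as f:
        soup_from_file = BeautifulSoup(f, "html.parser")

print(soup_from_file.prettify())
```

Using a with block closes the file once parsing is done, which the one-line open() form does not do.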

Beautiful Soup converts a complex HTML document into a tree structure in which every node is a Python object:

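soup.head
soup.a
# returns the first tag with that name
soup.head.name  # the tag's name; here it is 'head'
soup.head.attrs  # the tag's attributes, returned as a dict
soup.head['class']  # value of head's class attribute (a list, because class can hold several values); raises KeyError if the attribute is missing
soup.head['class'] = 'newclass'  # set head's class attribute to 'newclass'
del soup.head['class']  # delete head's class attribute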
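
A runnable version of the attribute operations, using a made-up a tag instead of head so that there are attributes to inspect:

```python
from bs4 import BeautifulSoup

# Made-up sample markup for illustration.
html = '<a class="sister big" href="http://example.com/elsie" id="link1">Elsie</a>'
soup = BeautifulSoup(html, "html.parser")
tag = soup.a

name = tag.name              # 'a'
classes = tag["class"]       # class is multi-valued, so this is a list
href = tag.get("href")       # .get() returns None instead of raising KeyError

tag["class"] = "newclass"    # overwrite the attribute
after_set = str(tag)

del tag["class"]             # remove the attribute
has_class = tag.has_attr("class")

print(name, classes, href, after_set, has_class)
```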

soup.head.string  # the text inside the tag, returned as a NavigableString; None if the tag has more than one child
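
The behaviour of .string depends on how many children the tag has. The sketch below, on a made-up fragment, contrasts it with get_text(), which joins all nested strings:

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

soup = BeautifulSoup("<p>one</p><div>two<b>three</b></div>", "html.parser")

single = soup.p.string          # <p> has exactly one child, a string
mixed = soup.div.string         # <div> has two children, so .string is None
all_text = soup.div.get_text()  # concatenates every string below the tag

print(repr(single), mixed, all_text)
```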

6. Traversal

soup.body.contents[0]  # the first child of body; contents is a list
for child in soup.body.children:
    print(child.string)     # children yields the same direct children as contents, but as an iterator that you loop over

for child in soup.body.descendants:
    print(child.string)    # recursively visits every node below body at all levels, in document order (depth-first)

for string in soup.body.strings:
    print(repr(string))    # strings yields every text node; stripped_strings is used the same way but strips surrounding whitespace and skips whitespace-only strings
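
To make the difference between these traversal attributes concrete, here is a small sketch on a made-up fragment. Tags are shown by name and text nodes by their content, so the depth-first order of descendants is visible:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html = "<body><p>A</p><div><span> B </span></div></body>"
soup = BeautifulSoup(html, "html.parser")

# Direct children only.
direct = [child.name for child in soup.body.children]

# Every node below body, in document order.
deep = [node.name if isinstance(node, Tag) else str(node)
        for node in soup.body.descendants]

raw = list(soup.body.strings)             # text nodes as they appear
clean = list(soup.body.stripped_strings)  # whitespace stripped

print(direct, deep, raw, clean)
```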

soup.body.parent  # the parent node
for parent in soup.head.title.string.parents:
    print(parent.name)    # walks up the chain of ancestors; prints title, head, html, [document]
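
A minimal check of the ancestor chain on a made-up document. The last entry, [document], is the name of the BeautifulSoup object at the root of the tree:

```python
from bs4 import BeautifulSoup

html = "<html><head><title>T</title></head><body></body></html>"
soup = BeautifulSoup(html, "html.parser")

# Start from the text node inside <title> and walk upwards.
ancestors = [parent.name for parent in soup.head.title.string.parents]
print(ancestors)
```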

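.next_sibling   # the next sibling node
.previous_sibling  # the previous sibling node
.next_siblings  # iterates over all following siblings
.previous_siblings  # iterates over all preceding siblings
.next_element    # the next node in parse order, regardless of level
.previous_element    # the previous node in parse order, regardless of level
.next_elements     # iterates forwards over all following nodes, regardless of level
.previous_elements   # iterates backwards over all preceding nodes, regardless of level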
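
The sibling and element attributes are easy to confuse, so here is a sketch on a made-up fragment. The markup deliberately has no whitespace between tags; in real pages the whitespace between tags is itself a text node and will show up as a sibling.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

soup = BeautifulSoup("<p><a>x</a><b>y</b><i>z</i></p>", "html.parser")

nxt = soup.b.next_sibling.name        # sibling after <b>
prev = soup.b.previous_sibling.name   # sibling before <b>
last = soup.i.next_sibling            # <i> is the last child, so there is none
following = [t.name for t in soup.a.next_siblings]

# next_element follows parse order, so it descends into <a> first.
next_el = soup.a.next_element
parse_order = [e.name if isinstance(e, Tag) else str(e)
               for e in soup.a.next_elements]

print(nxt, prev, last, following, repr(next_el), parse_order)
```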

7. Searching for tags

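find_all(name, attrs, recursive, text, limit, **kwargs)
# Examples:
# (1) The name argument
soup.find_all('a')  # find all a tags
soup.body.div.find_all('a')  # find all a tags inside the first div under body

import re
for tag in soup.find_all(re.compile('^b')):
    print(tag.name)      # regular expression: all tags whose name starts with b

soup.find_all(['a','b'])  # list: returns all a tags and all b tags

soup.find_all(True)    # True matches every tag, so all tags are returned


# A custom filter function selects the tags that satisfy your own condition
def has_class_but_no_id(tag):
    return tag.has_attr('class') and not tag.has_attr('id')
soup.find_all(has_class_but_no_id)  # returns all tags that have a class attribute but no id attribute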
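
The filter types accepted by the name argument can be compared side by side. The markup below is made up for illustration; results come back in document order.

```python
import re

from bs4 import BeautifulSoup

html = ('<html><body><b>bold</b>'
        '<a class="sister" id="link1" href="/elsie">Elsie</a>'
        '<p class="story">text</p></body></html>')
soup = BeautifulSoup(html, "html.parser")


def has_class_but_no_id(tag):
    return tag.has_attr("class") and not tag.has_attr("id")


by_regex = [t.name for t in soup.find_all(re.compile("^b"))]
by_list = [t.name for t in soup.find_all(["a", "b"])]
everything = [t.name for t in soup.find_all(True)]
custom = [t.name for t in soup.find_all(has_class_but_no_id)]

print(by_regex, by_list, everything, custom)
```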


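# (2) The attrs argument: search by tag attributes
soup.find_all(id='nd2')  # returns all tags whose id attribute equals nd2
soup.find_all(href=re.compile("elsie"), id='link1')  # several conditions at once; regular expressions are allowed
soup.find_all("a", class_="sister")  # when an attribute name is a Python keyword, such as class, you cannot write class='sister'; append an underscore instead: class_='sister'
soup.find_all(attrs={"data-foo": "value"})
# HTML5 attributes such as data-foo cannot be written as soup.find_all(data-foo='value'), because Python identifiers cannot contain hyphens. Pass them as a dict through the attrs argument instead; every attribute search can be written this way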
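
A runnable sketch of the attribute searches, on made-up markup:

```python
import re

from bs4 import BeautifulSoup

html = ('<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>'
        '<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>'
        '<div data-foo="value">d</div>')
soup = BeautifulSoup(html, "html.parser")

by_id = soup.find_all(id="link2")
combined = soup.find_all(href=re.compile("elsie"), id="link1")
by_class = soup.find_all("a", class_="sister")
by_data = soup.find_all(attrs={"data-foo": "value"})

print(by_id, combined, by_class, by_data)
```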

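# (3) The text argument
soup.find_all(text="Tillie")  # finds strings in the document equal to "Tillie"; like name, it accepts lists, regular expressions, and so on. Beautiful Soup 4.4.0 and later call this argument string, and text is the older name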
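
The sketch below searches by string content on made-up markup. It assumes Beautiful Soup 4.4.0 or later and therefore uses the string keyword; on older versions the same calls would be written with text.

```python
import re

from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>Elsie</p><p>Lacie</p><p>Tillie</p>", "html.parser")

# Without a tag name, the results are the matching strings themselves.
exact = soup.find_all(string="Tillie")
several = soup.find_all(string=["Elsie", "Tillie"])
pattern = soup.find_all(string=re.compile("ie$"))

# With a tag name, the results are tags whose .string matches.
tags = soup.find_all("p", string="Tillie")

print(exact, several, pattern, tags)
```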

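# (4) The limit argument
soup.find_all('a', limit=2)  # returns the first two a tags in the document; saves resources on large documents

# (5) The recursive argument
soup.head.find_all("title", recursive=False)
# searches only the direct children of head; the default is recursive=True, which searches all descendants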
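
A short sketch of limit and recursive on a made-up document. Searching from html with recursive=False finds nothing, because title is a grandchild of html rather than a direct child:

```python
from bs4 import BeautifulSoup

html = ("<html><head><title>T</title></head>"
        "<body><a>1</a><a>2</a><a>3</a></body></html>")
soup = BeautifulSoup(html, "html.parser")

first_two = soup.find_all("a", limit=2)                   # stops after two matches
deep = soup.html.find_all("title")                        # searches all descendants
shallow = soup.html.find_all("title", recursive=False)    # direct children of <html> only
direct = soup.head.find_all("title", recursive=False)     # <title> is a direct child of <head>

print(first_two, deep, shallow, direct)
```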

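find(name, attrs, recursive, text, **kwargs)
# Used exactly like find_all; the difference is that find returns only the first match (or None if nothing matches), whereas find_all returns a list that you iterate over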
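
The difference in return values matters most when nothing matches, as this sketch on made-up markup shows:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p id="a">one</p><p id="b">two</p>', "html.parser")

first = soup.find("p")             # a single Tag
all_p = soup.find_all("p")         # a list of Tags
missing_one = soup.find("table")       # no match: None
missing_all = soup.find_all("table")   # no match: empty list

print(first, all_p, missing_one, missing_all)
```

Because find can return None, chaining an attribute access directly onto it (for example soup.find("table").name) raises AttributeError when there is no match, so it is worth checking the result first.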

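# The methods below take the same arguments as find_all(); only the differences are listed

find_parents()  find_parent()
# find_all() and find() search only the descendants of the current node (children, grandchildren, and so on). find_parents() and find_parent() search the ancestors of the current node instead, using the same matching rules as an ordinary tag search

find_next_siblings()  find_next_sibling()
# These two methods use .next_siblings to iterate over the siblings that come after the current tag. find_next_siblings() returns all following siblings that match; find_next_sibling() returns only the first following sibling that matches

find_previous_siblings()  find_previous_sibling()
# These two methods use .previous_siblings to iterate over the siblings that come before the current tag. find_previous_siblings() returns all preceding siblings that match; find_previous_sibling() returns the first preceding sibling that matches

find_all_next()  find_next()
# These two methods use .next_elements to iterate over the tags and strings that come after the current tag. find_all_next() returns all matching nodes; find_next() returns the first matching node

find_all_previous()  find_previous()
# These two methods use .previous_elements to iterate over the tags and strings that come before the current node. find_all_previous() returns all matching nodes; find_previous() returns the first matching node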
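
The directional search methods can be compared from a single starting tag. The fragment and its id values below are made up for illustration:

```python
from bs4 import BeautifulSoup

html = ('<div id="outer"><p id="p1">one</p>'
        '<p id="p2">two</p><p id="p3">three</p></div>')
soup = BeautifulSoup(html, "html.parser")
p1 = soup.find(id="p1")
p2 = soup.find(id="p2")

parent_id = p2.find_parent("div")["id"]              # search upwards
next_id = p2.find_next_sibling("p")["id"]            # siblings after p2
prev_id = p2.find_previous_sibling("p")["id"]        # siblings before p2
later = [t["id"] for t in p1.find_next_siblings("p")]

# The element-based methods follow parse order rather than tree level,
# so find_previous can also reach the enclosing <div>.
after_id = p2.find_next("p")["id"]
before_ids = [t["id"] for t in p2.find_all_previous("p")]
enclosing_id = p2.find_previous("div")["id"]

print(parent_id, next_id, prev_id, later, after_id, before_ids, enclosing_id)
```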
