I. Using the BeautifulSoup Library and Its Parameters
1. Introduction to Beautiful Soup
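Put simply, Beautiful Soup is a Python library whose main job is extracting data from web pages. The official description reads: Beautiful Soup provides simple, Pythonic functions for navigating, searching, and modifying the parse tree.
It is a toolbox that parses a document and hands you the data you want to extract. Because it is simple, a complete application takes very little code.
Beautiful Soup automatically converts incoming documents to Unicode and outgoing documents to UTF-8. You do not need to think about encodings unless the document does not declare one and Beautiful Soup cannot detect it; in that case you only need to state the original encoding.
Beautiful Soup has become as capable a Python parsing tool as lxml and html5lib, letting users choose between different parsing strategies or raw speed.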
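As a minimal sketch of stating the original encoding: the snippet below feeds Beautiful Soup GBK-encoded bytes that carry no encoding declaration and names the encoding through the from_encoding argument. The sample markup and the choice of GBK are assumptions made for illustration, and html.parser is used so the sketch does not depend on a third-party parser.

```python
from bs4 import BeautifulSoup

# Hypothetical input: GBK-encoded bytes with no <meta charset> declaration.
raw = '<p>你好世界</p>'.encode('gbk')

# State the original encoding instead of letting Beautiful Soup guess.
soup = BeautifulSoup(raw, 'html.parser', from_encoding='gbk')

text = soup.p.string                # decoded to a Unicode str
utf8_bytes = soup.encode('utf-8')   # serialize the document as UTF-8 bytes

print(text)
print(soup.original_encoding)       # the encoding Beautiful Soup used to decode
```

When the document declares its own encoding, the from_encoding argument is usually unnecessary.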
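2. Common parsers
Beautiful Soup supports the HTML parser in Python's standard library as well as several third-party parsers. If no third-party parser is installed, the standard-library parser (html.parser) is used.
The lxml parser is more powerful and faster, so installing it is recommended.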
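The examples in this article pass 'lxml' explicitly, which fails when lxml is not installed. A possible way to handle that, sketched below, is to try lxml first and fall back to html.parser. The helper name make_soup is hypothetical and not part of Beautiful Soup.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup):
    """Parse with lxml when it is installed, otherwise use html.parser."""
    try:
        return BeautifulSoup(markup, 'lxml')
    except FeatureNotFound:
        # Raised when the requested parser is not available.
        return BeautifulSoup(markup, 'html.parser')


soup = make_soup('<p class="title"><b>hello</b></p>')
print(soup.b.string)
```

Different parsers can repair broken markup differently, so results on malformed HTML may vary between the two.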

3. Basic usage
    from bs4 import BeautifulSoup

    html = '''
    <html><head><title>哈哈哈哈</title></head>
    <body>
    <p class="title"><b>The Dormouse's story</b></p>
    <p class="story">Once upon a 威威time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1">你好</a>,
    <a href="http://example.com/lacie" class="sister" id="link2">天是</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">草泥</a>;
    and they lived at the bottom of a well.</p>
    <p class="story">...</p>
    '''

    soup = BeautifulSoup(html, 'lxml')  # create the BeautifulSoup object
    print(soup.prettify())              # pretty-printed HTML
    print(soup.title)                   # the whole <title> tag, including its contents
    print(soup.title.name)              # the name of the tag
    print(soup.title.string)            # the text inside the tag (same as soup.title.text here)
    print(soup.title.parent.name)       # the name of the parent tag
    print(soup.p)                       # the first p tag
    print(soup.p["class"])              # the class attribute of the first p tag (a list)
    print(soup.a)                       # the first a tag
    print(soup.find_all('a'))           # all a tags
    print(soup.find(id='link3'))        # the tag whose id is link3
    print(soup.p.attrs)                 # all attributes of the first p tag
    print(soup.p.attrs['class'])        # the class attribute of the first p tag
    print(soup.find_all('p', class_='title'))  # p tags whose class is title

    # The code below extracts every link and then all of the text
    for link in soup.find_all('a'):
        print(link.get('href'))         # the link target
    print(soup.get_text())              # all text in the document
4. Tag selectors
    from bs4 import BeautifulSoup

    html = '''
    <html>
    <head>
    <title>哈哈哈哈</title>
    </head>
    <body>
    <p class="title"><b>哈哈哈哈哈哈哈哈哈</b></p>
    <p class="story">Once upon a 威威time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1">你好</a>,
    <a href="http://example.com/lacie" class="sister" id="link2">天是</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">草泥</a>;
    and they lived at the bottom of a well.</p>
    <p class="story">666666666666666666666</p>
    '''

    soup = BeautifulSoup(html, 'lxml')  # create the BeautifulSoup object
    print(soup.prettify())
    print(soup.title)        # <title>哈哈哈哈</title>
    print(type(soup.title))  # <class 'bs4.element.Tag'>
    print(soup.head)         # <head><title>哈哈哈哈</title></head>
    print(soup.p)            # <p class="title"><b>哈哈哈哈哈哈哈哈哈</b></p>
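Writing soup.<tag name> returns that tag together with its contents.
One thing to watch for: if the document contains several tags with that name, only the first one is returned. For example, soup.p returns only the first p tag even though the document contains several.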
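The first-match behaviour described above can be illustrated with a small sketch. The three-paragraph markup is made up for the example, and html.parser is used only to avoid a third-party dependency.

```python
from bs4 import BeautifulSoup

html = '<p class="title">first</p><p class="story">second</p><p class="story">third</p>'
soup = BeautifulSoup(html, 'html.parser')

first_p = soup.p             # attribute access returns only the first <p>
all_p = soup.find_all('p')   # find_all returns every <p>, in document order
missing = soup.table         # a tag that does not exist gives None

print(first_p.string, len(all_p), missing)
```

Because a missing tag yields None rather than an error, chaining such as soup.table.tr fails with AttributeError, so checking for None first is a reasonable habit.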
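5. Getting the tag name
    # Continues from the example in section 4 (html and the import are defined there)
    soup = BeautifulSoup(html, 'lxml')  # create the BeautifulSoup object
    print(soup.title.name)  # title
soup.title.name returns the name of the tag, which here is title.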
6. Getting attributes
    # Getting started with BeautifulSoup
    # https://www.cnblogs.com/felixwang2/p/8711746.html
    from bs4 import BeautifulSoup

    html = '''
    <html><head><title>哈哈哈哈</title></head>
    <body>
    <p class="title" name="AA"><b>The Dormouse's story</b></p>
    <p class="story">Once upon a 威威time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1">11111111111</a>,
    <a href="http://example.com/lacie" class="sister" id="link2">222222222222</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">333333333333</a>;
    and they lived at the bottom of a well.</p>
    <p class="story">...</p>
    '''

    soup = BeautifulSoup(html, 'lxml')  # create the BeautifulSoup object
    print(soup.p.attrs['name'])  # AA
    print(soup.p['name'])        # AA
Both forms return the value of the p tag's name attribute.
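Both forms above raise KeyError when the attribute is absent. The sketch below, using a made-up one-line document, contrasts them with get(), which returns None instead, and shows that class comes back as a list.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p class="title big" name="AA">text</p>', 'html.parser')
p = soup.p

name_value = p['name']   # raises KeyError if the attribute is missing
maybe_id = p.get('id')   # returns None instead of raising
classes = p['class']     # class is multi-valued, so the result is a list

print(name_value, maybe_id, classes)
```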
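7. Getting content
    # Continues from the example in section 6 (html and the import are defined there)
    soup = BeautifulSoup(html, 'lxml')  # create the BeautifulSoup object
    print(soup.p.string)  # the text of the first p tag
    # The Dormouse's story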
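One caveat worth sketching: .string only returns text when the tag has a single child (possibly nested), and returns None when the tag mixes several children. The two-paragraph markup below is an assumption made for the illustration.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<p id="one"><b>only child</b></p><p id="two">Hello <b>world</b></p>',
    'html.parser',
)
one = soup.find(id='one')
two = soup.find(id='two')

single = one.string      # descends through the single <b> child
mixed = two.string       # several children, so the result is None
joined = two.get_text()  # concatenates all nested text

print(repr(single), repr(mixed), repr(joined))
```

When a tag may contain mixed content, get_text() is the safer choice.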
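8. Nested selection
    # Continues from the example in section 6 (html and the import are defined there)
    soup = BeautifulSoup(html, 'lxml')  # create the BeautifulSoup object
    # Tag access can be chained directly, as below
    print(soup.head.title.string)  # 哈哈哈哈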
9. Child and descendant nodes (using contents and children)
Using contents
    from bs4 import BeautifulSoup

    html = """
    <html>
        <head>
            <title>The Dormouse's story</title>
        </head>
        <body>
            <p class="story">
                Once upon a time there were three little sisters; and their names were
                <a href="http://example.com/elsie" class="sister" id="link1">
                    <span>Elsie</span>
                </a>
                <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
                and
                <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
                and they lived at the bottom of a well.
            </p>
            <p class="story">...</p>
    """

    soup = BeautifulSoup(html, 'lxml')
    print(soup.p.contents)  # everything inside the p tag, each part stored as one list item

Output (the exact whitespace in the strings depends on the indentation of the HTML):
    ['\n Once upon a time there were three little sisters; and their names were\n ', <a class="sister" href="http://example.com/elsie" id="link1"> <span>Elsie</span> </a>, '\n', <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, '\n and\n ', <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>, '\n and they lived at the bottom of a well.\n ']
Using children
    from bs4 import BeautifulSoup

    html = """
    <html>
        <head>
            <title>The Dormouse's story</title>
        </head>
        <body>
            <p class="story">
                Once upon a time there were three little sisters; and their names were
                <a href="http://example.com/elsie" class="sister" id="link1">
                    <span>Elsie</span>
                </a>
                <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
                and
                <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
                and they lived at the bottom of a well.
            </p>
            <p class="story">...</p>
    """

    soup = BeautifulSoup(html, 'lxml')
    print(soup.p.children)  # an iterator object, not the nodes themselves
    for i, child in enumerate(soup.p.children):
        print(i, child)
    # children yields the same nodes as contents, but it is an iterator rather
    # than a list, so the nodes have to be read by looping over it
    print(soup.descendants)  # a generator over all descendant nodes; loop over it to read them
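To make the difference between children and descendants concrete, here is a small sketch on a made-up fragment: children stops at the direct children, while descendants walks the whole subtree, text nodes included.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p>one <a href="#"><span>two</span></a></p>', 'html.parser')

children = list(soup.p.children)        # direct children only
descendants = list(soup.p.descendants)  # children, grandchildren, and so on

# Text nodes have no tag name, so None is used for them here.
names = [getattr(node, 'name', None) for node in descendants]

print(len(children), len(descendants))
print(names)
```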
10. Parent and ancestor nodes
    from bs4 import BeautifulSoup

    html = """
    <html>
        <head>
            <title>The Dormouse's story</title>
        </head>
        <body>
            <p class="story">
                Once upon a time there were three little sisters; and their names were
                <a href="http://example.com/elsie" class="sister" id="link1">
                    <span>Elsie</span>
                </a>
                <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
                and
                <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
                and they lived at the bottom of a well.
            </p>
            <p class="story">...</p>
    """

    soup = BeautifulSoup(html, 'lxml')
    print(soup.a.parent)  # the parent node of the first a tag
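soup.a.parent returns the parent node of the first a tag.
list(enumerate(soup.a.parents)) returns all ancestor nodes as a list of (index, node) pairs: first the parent of the a tag, then that parent's parent, and so on up to the whole document.
The last element is the BeautifulSoup object itself and the second-to-last is the html tag, so both of them print as the entire document.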
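A compact way to see the ancestor chain is to print only the tag names, as in the sketch below. The markup is a reduced stand-in for the example above, and html.parser is used so the tree contains exactly the tags written.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<html><body><p><a href="#">x</a></p></body></html>', 'html.parser'
)

parent_name = soup.a.parent.name
# parents is a generator that walks upward until the document root.
ancestor_names = [node.name for node in soup.a.parents]

print(parent_name)
print(ancestor_names)
```

The final name, [document], belongs to the BeautifulSoup object, which is why the last two entries of the list print the same full document.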
11. Sibling nodes
    from bs4 import BeautifulSoup

    html = """
    <html>
        <head>
            <title>The Dormouse's story</title>
        </head>
        <body>
            <p class="story">
                Once upon a time there were three little sisters; and their names were
                <a href="http://example.com/elsie" class="sister" id="link1">
                    <span>Elsie</span>
                </a>
                <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
                and
                <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
                and they lived at the bottom of a well.
            </p>
            <p class="story">...</p>
    """

    soup = BeautifulSoup(html, 'lxml')
    print(soup.a.next_siblings)      # generator over all following sibling nodes
    print(soup.a.previous_siblings)  # generator over all preceding sibling nodes
    print(soup.a.next_sibling)       # the next sibling node (may be a text node)
    print(soup.a.previous_sibling)   # the previous sibling node (may be a text node)
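Sibling attributes return nodes, not only tags, so the text between two tags counts as a sibling. The sketch below, on a made-up fragment, contrasts next_sibling with find_next_sibling(), which skips to the next matching tag.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<p><a id="link1">A</a> and <a id="link2">B</a></p>', 'html.parser'
)
first = soup.a

raw_next = first.next_sibling            # the text node ' and '
next_tag = first.find_next_sibling('a')  # skips text, returns the next <a>
no_prev = first.previous_sibling         # nothing before the first <a>, so None
later = list(first.next_siblings)        # the generator turned into a list

print(repr(raw_next), next_tag['id'], no_prev, len(later))
```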
12. Standard selectors: find_all(name, attrs, recursive, text, **kwargs) searches the document by tag name, attributes, or text content
find_all
    from bs4 import BeautifulSoup

    html = '''
    <div class="panel">
        <div class="panel-heading">
            <h4>Hello</h4>
        </div>
        <div class="panel-body">
            <ul class="list" id="list-1">
                <li class="element">Foo</li>
                <li class="element">Bar</li>
                <li class="element">Jay</li>
            </ul>
            <ul class="list list-small" id="list-2">
                <li class="element">Foo</li>
                <li class="element">Bar</li>
            </ul>
        </div>
    </div>
    '''

    soup = BeautifulSoup(html, 'lxml')
    print(soup.find_all('ul'))           # all ul tags
    print(type(soup.find_all('ul')[0]))  # the type of the first ul tag
    # find_all calls can be nested, for example to get every li inside each ul
    for ul in soup.find_all('ul'):
        print(ul.find_all('li'))
attrs
    from bs4 import BeautifulSoup

    html = '''
    <div class="panel">
        <div class="panel-heading">
            <h4>Hello</h4>
        </div>
        <div class="panel-body">
            <ul class="list" id="list-1" name="elements">
                <li class="element">Foo</li>
                <li class="element">Bar</li>
                <li class="element">Jay</li>
            </ul>
            <ul class="list list-small" id="list-2">
                <li class="element">Foo</li>
                <li class="element">Bar</li>
            </ul>
        </div>
    </div>
    '''

    soup = BeautifulSoup(html, 'lxml')
    print(soup.find_all(attrs={'id': 'list-1'}))      # tags whose id is list-1
    print(soup.find_all(attrs={'name': 'elements'}))  # tags whose name attribute is elements
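Note: attrs accepts a dictionary of attribute names and values. class is a special case, because class is a reserved word in Python and cannot be used as a keyword argument.
To search by class, either use the class_ keyword, as in soup.find_all(class_='element'), or put the plain attribute name in the dictionary, as in soup.find_all(attrs={'class': 'element'}).
Common attributes such as id can be passed directly as keyword arguments without attrs, for example soup.find_all(id='list-1').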
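The options described in the note can be compared side by side. The sketch below uses a shortened version of the list markup and html.parser; both choices are assumptions made to keep the example small.

```python
from bs4 import BeautifulSoup

html = '''
<ul class="list" id="list-1"><li class="element">Foo</li></ul>
<ul class="list list-small" id="list-2"><li class="element">Bar</li></ul>
'''
soup = BeautifulSoup(html, 'html.parser')

by_keyword = soup.find_all(class_='element')         # class_ as a keyword argument
by_dict = soup.find_all(attrs={'class': 'element'})  # plain 'class' inside attrs
by_id = soup.find_all(id='list-1')                   # id needs no attrs dictionary
lists = soup.find_all('ul', class_='list')           # matches any one of a tag's classes

print([li.string for li in by_keyword])
print(len(by_dict), len(by_id), len(lists))
```

Note that class_='list' also matches the second ul, whose class attribute is "list list-small", because a class search matches any single class of the tag.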
text
    from bs4 import BeautifulSoup

    html = '''
    <div class="panel">
        <div class="panel-heading">
            <h4>Hello</h4>
        </div>
        <div class="panel-body">
            <ul class="list" id="list-1">
                <li class="element">Foo111</li>
                <li class="element">Bar</li>
                <li class="element">Jay</li>
            </ul>
            <ul class="list list-small" id="list-2">
                <li class="element">Foo111</li>
                <li class="element">Bar</li>
            </ul>
        </div>
    </div>
    '''

    soup = BeautifulSoup(html, 'lxml')
    # Find every text node equal to "Foo111"; the result is a list of strings, not tags.
    # Newer Beautiful Soup versions call this argument string; text is the older name.
    print(soup.find_all(text='Foo111'))
    # ['Foo111', 'Foo111']
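Text matching is exact by default, which is easy to trip over. The sketch below assumes a Beautiful Soup version that accepts the string argument (the newer name for text) and shows exact matching, a regular expression, and combining a tag name with a text filter.

```python
import re

from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<ul><li>Foo111</li><li>Bar</li><li>Foo111</li></ul>', 'html.parser'
)

exact = soup.find_all(string='Foo111')               # the whole string must match
partial = soup.find_all(string='Foo')                # no text node equals 'Foo'
pattern = soup.find_all(string=re.compile('^Foo'))   # regular expression match
tags = soup.find_all('li', string='Bar')             # <li> tags whose text is 'Bar'

print(exact, partial, pattern, tags)
```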
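13. find
find(name, attrs, recursive, text, **kwargs)
find returns the first matching element.
Other methods with similar usage:
find_parents() returns all ancestor nodes; find_parent() returns the direct parent node.
find_next_siblings() returns all following sibling nodes; find_next_sibling() returns the first following sibling node.
find_previous_siblings() returns all preceding sibling nodes; find_previous_sibling() returns the first preceding sibling node.
find_all_next() returns all matching nodes after the current node; find_next() returns the first matching node after it.
find_all_previous() returns all matching nodes before the current node; find_previous() returns the first matching node before it.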
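A short sketch of several of these methods on a made-up fragment follows. It also shows that find() returns None when nothing matches, which is worth checking before using the result.

```python
from bs4 import BeautifulSoup

html = '<div id="box"><h4>Hello</h4><ul><li>Foo</li><li>Bar</li><li>Jay</li></ul></div>'
soup = BeautifulSoup(html, 'html.parser')

first_li = soup.find('li')                  # first match only
nothing = soup.find('table')                # no match gives None, not an error
box = first_li.find_parent('div')           # nearest enclosing <div>
after = first_li.find_next_siblings('li')   # the <li> tags that follow
before = first_li.find_previous('h4')       # nearest <h4> earlier in the document

print(first_li.string, nothing, box['id'])
print([li.string for li in after], before.string)
```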
14. CSS selectors
    from bs4 import BeautifulSoup

    html = '''
    <div class="panel">
        <div class="panel-heading">
            <h4>Hello</h4>
        </div>
        <div class="panel-body">
            <ul class="list" id="list-1">
                <li class="element">Foo</li>
                <li class="element">Bar</li>
                <li class="element">Jay</li>
            </ul>
            <ul class="list list-small" id="list-2">
                <li class="element">Foo</li>
                <li class="element">Bar</li>
            </ul>
        </div>
    </div>
    '''

    soup = BeautifulSoup(html, 'lxml')
    print(soup.select('.panel .panel-heading'))  # by nested class
    print(soup.select('ul li'))                  # li tags inside ul tags
    print(soup.select('#list-2 .element'))       # class element inside id list-2
    print(type(soup.select('ul')[0]))            # each result is a Tag
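Passing a CSS selector directly to select() is enough to make a selection. Readers familiar with front-end development probably know CSS well, and the syntax here is the same:
.name selects by class
#name selects by id
tag1,tag2 finds all tag1 and all tag2 elements
tag1 tag2 finds all tag2 elements inside tag1
[attr] finds all tags that have the given attribute
[attr=value] finds tags whose attribute has that value; for example, [target=_blank] finds all tags with target=_blank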
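Each selector form listed above can be tried in a small sketch. The two a tags and their target attribute are hypothetical additions to the sample markup, included only so the attribute selectors have something to match.

```python
from bs4 import BeautifulSoup

html = '''
<div class="panel">
  <h4>Hello</h4>
  <ul class="list" id="list-1"><li class="element">Foo</li></ul>
  <a href="/a" target="_blank">A</a>
  <a href="/b">B</a>
</div>
'''
soup = BeautifulSoup(html, 'html.parser')

by_class = soup.select('.panel .element')   # class inside class
by_id = soup.select('#list-1 li')           # li inside the element with id list-1
grouped = soup.select('h4, li')             # all h4 and all li tags
has_target = soup.select('a[target]')       # a tags that have a target attribute
blank = soup.select('a[target="_blank"]')   # a tags with target equal to _blank
first_link = soup.select_one('a')           # first match only, or None

print(len(by_class), by_id[0].string, [t.name for t in grouped])
print(len(has_target), blank[0]['href'], first_link['href'])
```

select() always returns a list, while select_one() returns a single tag, which mirrors the relationship between find_all() and find().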
15. Getting text (get_text() returns the text content)
    from bs4 import BeautifulSoup

    html = '''
    <div class="panel">
        <div class="panel-heading">
            <h4>Hello</h4>
        </div>
        <div class="panel-body">
            <ul class="list" id="list-1">
                <li class="element">Foo</li>
                <li class="element">Bar</li>
                <li class="element">Jay</li>
            </ul>
            <ul class="list list-small" id="list-2">
                <li class="element">Foo</li>
                <li class="element">Bar</li>
            </ul>
        </div>
    </div>
    '''

    soup = BeautifulSoup(html, 'lxml')
    for li in soup.select('li'):
        print(li.get_text())
    # Foo
    # Bar
    # Jay
    # Foo
    # Bar
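When get_text() is called on a container rather than a leaf tag, the pieces of text run together. A possible remedy, sketched below on a made-up fragment, is the separator and strip arguments, or the stripped_strings generator when the pieces are wanted individually.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<div><h4>Hello</h4><ul><li>Foo</li><li>Bar</li></ul></div>', 'html.parser'
)

plain = soup.get_text()                            # pieces joined with nothing between
spaced = soup.get_text(separator=' ', strip=True)  # stripped pieces joined by a space
pieces = list(soup.stripped_strings)               # each stripped piece on its own

print(repr(plain))
print(repr(spaced))
print(pieces)
```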
16. Getting attributes (use [attribute name] or attrs[attribute name])
    from bs4 import BeautifulSoup

    html = '''
    <div class="panel">
        <div class="panel-heading">
            <h4>Hello</h4>
        </div>
        <div class="panel-body">
            <ul class="list" id="list-1">
                <li class="element">Foo</li>
                <li class="element">Bar</li>
                <li class="element">Jay</li>
            </ul>
            <ul class="list list-small" id="list-2">
                <li class="element">Foo</li>
                <li class="element">Bar</li>
            </ul>
        </div>
    </div>
    '''

    soup = BeautifulSoup(html, 'lxml')
    for ul in soup.select('ul'):
        print(ul['id'])        # subscript access
        print(ul.attrs['id'])  # the same value through the attrs dictionary