Summary based on the official documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html#find-all
Sample code:
html_doc = """
<html>
<head><title>The Dormouse's story <!--Hey, buddy. Want to buy a used parser?-->
<a><!--Hey, buddy. Want to buy a used parser?--></a></title>
</head>
<body>
<p class="title">
<b>The Dormouse's story</b>
<a><!--Hey, buddy. Want to buy a used parser?--></a>
</p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1 link4">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.
</p>
<p class="story">...</p>
"""
1. Quick operations:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')
# The outputs in the comments below are those of the official documentation's sample,
# in which <title> holds only text and the first link has id="link1".
soup.title == soup.find('title')
# <title>The Dormouse's story</title>
soup.title.name
# 'title'
soup.title.string == soup.title.text == soup.title.get_text()
# "The Dormouse's story"
soup.title.parent.name
# 'head'
soup.p == soup.find('p')  # dot access returns only the first matching tag beneath the current one
# <p class="title"><b>The Dormouse's story</b></p>
soup.p['class']
# ['title']  (class is multi-valued, so a list is returned)
soup.a == soup.find('a')
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
soup.find_all('a')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
soup.find_all(['a','b'])  # all <a> tags and all <b> tags
soup.find_all(id=["link1","link2"])  # all tags whose id is link1 or link2
soup.find(id="link3")
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
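2. Beautiful Soup has four kinds of objects:
1. BeautifulSoup: the parsed document as a whole
2. Tag: an HTML/XML tag
3. NavigableString: the text inside a tag (Comment is a subclass of it, so it may also be comment content)
4. Comment: a comment inside a tag; comment only, no regular text
Tag attributes are read and modified exactly like dictionary entries.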
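The four object types and the dictionary-style attribute access can be checked with a short sketch like the one below. The markup and the variable names are made up for illustration; only the bs4 classes and the Tag item access are real API.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

# Hypothetical markup, chosen so that each object type appears once.
soup = BeautifulSoup('<p id="intro" class="lead">Hi<!--note--></p>', 'html.parser')
tag = soup.p
text_node, comment_node = tag.contents

kinds = [type(soup).__name__, type(tag).__name__,
         type(text_node).__name__, type(comment_node).__name__]

# A Comment is a special kind of NavigableString.
comment_is_string = isinstance(comment_node, NavigableString)

# Attributes behave like dictionary entries.
original_id = tag['id']      # read
tag['id'] = 'summary'        # update
tag['lang'] = 'en'           # add
del tag['class']             # delete
missing = tag.get('title')   # .get() returns None instead of raising KeyError

print(kinds, tag.attrs, missing)
```

Using .get() is the safer choice when an attribute may be absent, because tag['title'] would raise KeyError here.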
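Multi-valued attributes in HTML (this does not apply to XML):
An attribute is multi-valued when a single attribute name can hold several values. Such an attribute is always returned as a list, even when it holds only one value.
Examples: class, rel, rev, accept-charset, headers, accesskey
All other attributes are single-valued: even if the value contains several space-separated tokens, a single string is returned.
soup.a['class']
# ['sister']
id_soup = BeautifulSoup('<p id="my id"></p>', 'html.parser')
id_soup.p['id']
# 'my id'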
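A minimal sketch of the list-versus-string behaviour, using a made-up snippet of markup (the variable names are illustrative only):

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one tag with two classes, one tag with a single class.
soup = BeautifulSoup(
    '<p class="body strikeout" id="my id">x</p><a class="sister">y</a>',
    'html.parser')

class_many = soup.p['class']   # multi-valued attribute with two values
class_one = soup.a['class']    # still a list, even with one value
id_value = soup.p['id']        # single-valued: the space is kept in one string

# Assigning a list writes the values back separated by spaces.
soup.a['class'] = ['sister', 'first']
rendered = str(soup.a)

print(class_many, class_one, repr(id_value), rendered)
```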
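3. Getting the content of a tag:
string: returns the text or the comment of a tag that has a single child node (one or the other; if the tag contains both, or several children, the result is None)
strings: a generator over the text of all descendants (comments excluded)
stripped_strings: like strings, but with surrounding whitespace stripped and whitespace-only strings skipped (comments excluded)
text: text only; concatenates the text of all descendants
get_text(): same as text; text only, concatenated across all descendants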
string:
markup = "<b><!--Hey, buddy. Want to buy a used parser?--></b>"
soup = BeautifulSoup(markup, 'html.parser')
comm = soup.b.string
print(comm)  # Hey, buddy. Want to buy a used parser?
print(type(comm))  # <class 'bs4.element.Comment'>
strings:
soup = BeautifulSoup(html_doc, 'html.parser')  # re-parse the sample; soup was rebound to markup above
head_tag = soup.body
for s in head_tag.strings:
    print(repr(s))
Result:
'\n'
"The Dormouse's story"
'\n'
'Once upon a time there were three little sisters; and their names were\n '
'Elsie'
',\n '
'Lacie'
' and\n '
'Tillie'
';\n and they lived at the bottom of a well.\n '
'\n'
'...'
'\n'
stripped_strings:
head_tag = soup.body
for s in head_tag.stripped_strings:
    print(repr(s))
Result:
"The Dormouse's story"
'Once upon a time there were three little sisters; and their names were'
'Elsie'
','
'Lacie'
'and'
'Tillie'
';\n and they lived at the bottom of a well.'
'...'
text:
soup = BeautifulSoup(html_doc, 'html.parser')
head_tag = soup.body
print(head_tag.text)
Result:
The Dormouse's story
Once upon a time there were three little sisters; and their names were
 Elsie,
 Lacie and
 Tillie;
 and they lived at the bottom of a well.
...
The same value printed with repr(), which makes the newlines visible:
soup = BeautifulSoup(html_doc, 'html.parser')
head_tag = soup.body
print(repr(head_tag.text))
Result:
"\nThe Dormouse's story\nOnce upon a time there were three little sisters; and their names were\n Elsie,\n Lacie and\n Tillie;\n and they lived at the bottom of a well.\n \n...\n"
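To compare the five accessors side by side, here is a small sketch on a made-up paragraph that mixes text, a child tag and a comment. The separator and strip arguments of get_text() are part of the library; the markup and variable names are assumptions for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: text, a nested tag, more text, and a comment.
soup = BeautifulSoup('<p> Hello <b>big</b> world <!--aside--></p>', 'html.parser')
p = soup.p

single = soup.b.string                 # <b> has exactly one child, so .string works
mixed = p.string                       # <p> has several children, so .string is None
all_strings = list(p.strings)          # raw text nodes, comment skipped
stripped = list(p.stripped_strings)    # same, with surrounding whitespace removed
joined = p.get_text('|', strip=True)   # strip each piece, then join with a separator

print(single, mixed, all_strings, stripped, joined)
```

Passing a separator to get_text() is handy when adjacent tags would otherwise run their text together.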
4. Child nodes:
.contents: returns a tag's direct children as a list
.children: returns a tag's direct children as a generator
soup = BeautifulSoup(html_doc, 'html.parser')
head_tag = soup.head
print(head_tag)
print(head_tag.contents)
print(head_tag.contents[0])
print(head_tag.contents[0].contents)
for ch in head_tag.children:
    print(ch)
Result (for the official sample document):
<head><title>The Dormouse's story</title></head>
[<title>The Dormouse's story</title>]
<title>The Dormouse's story</title>
["The Dormouse's story"]
<title>The Dormouse's story</title>
5. Descendant nodes:
.descendants: returns all of a tag's descendants (children, grandchildren, and so on) as a generator, not a list
for ch in head_tag.descendants:
    print(ch)
Result:
<title>The Dormouse's story</title>
The Dormouse's story
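The difference between direct children and all descendants is easier to see on a nested fragment. The sketch below uses made-up markup and variable names; it only relies on .contents, .children and .descendants as described above.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical markup with two levels of nesting.
soup = BeautifulSoup('<div><p>one <b>two</b></p><p>three</p></div>', 'html.parser')
div = soup.div

# Direct children only: the two <p> tags.
direct = [child.name for child in div.contents]

# .children yields the same nodes as .contents, lazily.
same_nodes = list(div.children) == div.contents

# .descendants walks the whole subtree in document order, text nodes included.
walk = [node.name if isinstance(node, Tag) else str(node)
        for node in div.descendants]

print(direct, same_nodes, walk)
```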
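6. Parent (.parent): every bs4 object has a parent, whether it is a Tag or a NavigableString; a plain Python str (such as the value of .text) does not.
soup = BeautifulSoup(html_doc, 'html.parser')
tag_title = soup.b  # the <b> tag
print(tag_title.parent)  # parent of <b>: the <p> tag
print(type(tag_title.string))  # type of the text in <b> (.string is None if the tag also contains a comment)
# <class 'bs4.element.NavigableString'>
print(tag_title.string.parent)  # parent of the text in <b>: the <b> tag
print(type(tag_title.text))  # .text is a plain str, which has no bs4 attributes such as .parent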
7. All ancestors (.parents): iterates over every ancestor of an element, from the nearest one up to the document
soup = BeautifulSoup(html_doc, 'html.parser')
link = soup.a
for parent in link.parents:
    print(parent.name)
Result:
p
body
html
[document]
8. Previous and next nodes (nodes, not tags; text is a node too): previous_sibling, next_sibling
9. Iterating over all sibling nodes with a generator: next_siblings, previous_siblings
for sib in soup.a.next_siblings:
    print(sib)
    print("---------")
Result:
,
---------
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
---------
 and
---------
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
---------
;
and they lived at the bottom of a well.
---------
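Because text counts as a node, .next_sibling often returns punctuation or whitespace rather than the next tag. The following sketch, on a made-up one-line paragraph, contrasts the node-level attribute with the tag-level search method; the variable names are illustrative.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical markup: three links separated by plain text.
soup = BeautifulSoup(
    '<p><a id="link1">Elsie</a>, <a id="link2">Lacie</a> and <a id="link3">Tillie</a></p>',
    'html.parser')
first = soup.a

next_node = first.next_sibling                   # the text node ', ', not a tag
next_tag_id = first.find_next_sibling('a')['id'] # skips text, returns the next <a>
no_previous = first.previous_sibling             # first child, so there is none

# Keep only the tags while iterating over the generator.
following_ids = [s['id'] for s in first.next_siblings if isinstance(s, Tag)]

print(repr(next_node), next_tag_id, no_previous, following_ids)
```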
10. Searching the tree
Filters:
1. A string
2. A regular expression
3. A list
4. True
5. A function
html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were</p>
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.
<p class="story">...</p>
</body>
"""
from bs4 import BeautifulSoup
import re
soup = BeautifulSoup(html_doc, 'html.parser')
soup.find_all("a")  # string filter: tags named exactly "a"
soup.find_all(re.compile("^b"))  # regex filter: tag names starting with "b"
soup.find_all(re.compile("a"))  # regex filter: tag names containing "a"
soup.find_all(re.compile("l$"))  # regex filter: tag names ending with "l"
soup.find_all(["a", "b"])  # list filter: tags named "a" or "b"
soup.find_all(True)  # True: every tag, but no text nodes
def has_class_no_id(tag):
    # keep tags that have a class attribute but no id attribute
    return tag.has_attr("class") and not tag.has_attr("id")
soup.find_all(has_class_no_id)  # function filter
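To see what each of the five filter types actually selects, the sketch below runs them on a much smaller, made-up document and collects only the tag names. The helper names() is hypothetical and exists only to keep the output short.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical, shortened document.
html = ('<html><head><title>T</title></head><body>'
        '<p class="title"><b>x</b></p>'
        '<a class="sister" id="link1">E</a></body></html>')
soup = BeautifulSoup(html, 'html.parser')

def names(tags):
    """Illustrative helper: reduce a result list to its tag names."""
    return [t.name for t in tags]

def has_class_no_id(tag):
    # Function filters receive a Tag and return True to keep it.
    return tag.has_attr('class') and not tag.has_attr('id')

by_string = names(soup.find_all('a'))
by_regex = names(soup.find_all(re.compile('^b')))
by_list = names(soup.find_all(['title', 'b']))
by_true = names(soup.find_all(True))
by_function = names(soup.find_all(has_class_no_id))

print(by_string, by_regex, by_list, by_true, by_function)
```

Note that the regular expression is applied with search semantics, so an unanchored pattern such as "a" matches any tag name that merely contains that letter.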
11. The find methods:
Syntax:
# find_all( name , attrs , recursive , text , **kwargs )
# name: the tag name to search for (a string, regular expression, list, function, or True)
# attrs: the tag's attributes
# recursive: whether to search all descendants (True, the default) or only direct children (False)
# text: search by text content (called string in Beautiful Soup 4.4.0 and later)
# **kwargs: any other keyword argument, treated as an attribute filter
Special cases:
data-foo="value": a hyphen is not allowed in a Python keyword argument name, so this must be written as attrs={"data-foo": "value"}
class="value": class is a reserved word in Python, so write class_="value" or attrs={"class": "value"}
from bs4 import BeautifulSoup
import re
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.find_all('p', 'title'))  # <p> tags with class="title"
soup.find_all('title')  # all <title> tags, as a list
soup.find_all(attrs={"class":"sister"})  # all tags whose class is sister, as a list
soup.find_all(id='link2')  # all tags whose id is link2
soup.find_all(href=re.compile("elsie"))  # all tags whose href contains "elsie"
soup.find_all(id=True)  # all tags that have an id attribute
soup.find_all(id="link1", href=re.compile('elsie'))  # id is link1 and href contains "elsie"
Searching by class:
soup = BeautifulSoup(html_doc, 'html.parser')
css_soup = BeautifulSoup('<p class="body strikeout"></p>', 'html.parser')
css_soup.find_all("p", class_="body")  # class is multi-valued; matching any one value is enough
css_soup.find_all("p", class_="strikeout")
css_soup.find_all("p", class_="body strikeout")  # exact match against the full class string
# The text argument can be a string, a regular expression, a list, a function, or True
soup.find_all("a", text="Elsie")  # <a> tags whose text is "Elsie"
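The two special cases (hyphenated attribute names and class) and the exact-string behaviour of class matching can be verified with a sketch like this one. The markup is invented for the example; only find_all, attrs and class_ come from the library.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one div with a data-* attribute and two classes, one plain div.
html = ('<div data-foo="value" class="body strikeout">one</div>'
        '<div class="body">two</div>')
soup = BeautifulSoup(html, 'html.parser')

# data-foo cannot be a keyword argument, so it goes through attrs.
by_data = soup.find_all(attrs={'data-foo': 'value'})

# class is reserved in Python: use class_ or attrs.
by_class_kw = soup.find_all('div', class_='body')
by_class_attrs = soup.find_all('div', attrs={'class': 'body'})

# A multi-word value is compared against the attribute string as written,
# so the order of the class names matters.
exact = soup.find_all('div', class_='body strikeout')
reordered = soup.find_all('div', class_='strikeout body')

print(len(by_data), len(by_class_kw), len(by_class_attrs), len(exact), len(reordered))
```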
12. Searching the parents:
find_parents( name , attrs , recursive , text , **kwargs )
find_parent( name , attrs , recursive , text , **kwargs )
html_doc = """<html>
<head>
<title>The Dormouse's story</title>
</head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were</p>
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<p>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
</p>
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.
<p class="story">...</p>
</body>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')
a_string = soup.find(text="Lacie")  # the text node "Lacie"
type(a_string), a_string  # <class 'bs4.element.NavigableString'> Lacie
a_parent = a_string.find_parent()  # the nearest ancestor of a_string (here the <a> tag)
a_parent = a_string.find_parent("p")  # the nearest <p> ancestor of a_string
a_parents = a_string.find_parents()  # all ancestors of a_string
a_parents = a_string.find_parents("a")  # all <a> ancestors of a_string
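A compact sketch of the upward search, on an invented fragment. It uses the string= keyword, which is the newer name for text= (an assumption about the installed version: Beautiful Soup 4.4.0 or later).

```python
from bs4 import BeautifulSoup

# Hypothetical markup: a link nested inside a paragraph inside a div.
soup = BeautifulSoup('<div><p><a id="link2">Lacie</a> and</p></div>', 'html.parser')

a_string = soup.find(string='Lacie')          # start from the text node
nearest = a_string.find_parent().name         # closest ancestor of any kind
nearest_p = a_string.find_parent('p').name    # closest <p> ancestor
ancestors = [t.name for t in a_string.find_parents()]  # innermost first
no_table = a_string.find_parent('table')      # no such ancestor

print(nearest, nearest_p, ancestors, no_table)
```

find_parent() returns None when nothing matches, so chained access such as .find_parent('table').name should be guarded.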
13. Following siblings:
find_next_siblings( name , attrs , recursive , text , **kwargs )
find_next_sibling( name , attrs , recursive , text , **kwargs )
html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were</p>
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<b href="http://example.com/elsie" class="sister" id="link1">Elsie</b>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.
<p class="story">...</p>
</body>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')
first_link = soup.a  # the first <a> tag
a_sibling = first_link.find_next_sibling()  # the first following sibling tag
a_sibling = first_link.find_next_sibling("a")  # the first following sibling that is an <a>
a_siblings = first_link.find_next_siblings()  # all following sibling tags
a_siblings = first_link.find_next_siblings("a")  # all following siblings that are <a> tags
14. Preceding siblings:
find_previous_siblings( name , attrs , recursive , text , **kwargs )
find_previous_sibling( name , attrs , recursive , text , **kwargs )
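The backward sibling search mirrors the forward one. A minimal sketch on made-up markup, with illustrative variable names:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: three sibling links inside one paragraph.
soup = BeautifulSoup(
    '<p><a id="link1">Elsie</a>, <a id="link2">Lacie</a> and <a id="link3">Tillie</a></p>',
    'html.parser')
last_link = soup.find(id='link3')

# Nearest preceding sibling that is an <a>.
previous_id = last_link.find_previous_sibling('a')['id']

# All preceding <a> siblings, nearest first (reverse document order).
previous_ids = [a['id'] for a in last_link.find_previous_siblings('a')]

print(previous_id, previous_ids)
```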
15. Following nodes:
find_all_next( name , attrs , recursive , text , **kwargs )
find_next( name , attrs , recursive , text , **kwargs )
html_doc = """<html>
<head>
<title>The Dormouse's story</title>
</head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were</p>
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<p>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
</p>
<p>
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
</p>
and they lived at the bottom of a well.
<p class="story">...</p>
</body>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')
a_string = soup.find(text="Lacie")
a_next = a_string.find_next()  # the first tag that comes after a_string in the document
a_next = a_string.find_next('a')  # the first <a> tag that comes after a_string
a_nexts = a_string.find_all_next()  # all tags that come after a_string
a_nexts = a_string.find_all_next('a')  # all <a> tags that come after a_string
16. Preceding nodes:
find_all_previous( name , attrs , recursive , text , **kwargs )
find_previous( name , attrs , recursive , text , **kwargs )
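Unlike the sibling methods, these two walk backwards through everything that was parsed earlier in the document, which includes the element's own ancestors. A short sketch on invented markup:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: a title paragraph followed by a paragraph containing a link.
soup = BeautifulSoup(
    '<p class="title"><b>T</b></p><p class="story"><a id="link1">Elsie</a></p>',
    'html.parser')
link = soup.a

# The nearest earlier <p> is the one that contains the link itself.
nearest_p = link.find_previous('p')['class']

# All earlier <p> tags, nearest first.
earlier_p = [p['class'][0] for p in link.find_all_previous('p')]

# Tags in other branches of the tree are reachable too.
earlier_b = link.find_previous('b').string

print(nearest_p, earlier_p, earlier_b)
```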
17. Parsing only part of a document:
Parsing an entire document just to find its <a> tags wastes memory and time. It is faster to ignore everything that is not an <a> tag from the start. The SoupStrainer class lets you choose which parts of the document are parsed, so only the content matched by the SoupStrainer is parsed rather than the whole document. Create a SoupStrainer and pass it to the BeautifulSoup constructor as the parse_only argument.
SoupStrainer takes the same kind of arguments as the search methods: name, attrs, text, **kwargs
html_doc = """<html>
<head>
<title>The Dormouse's story</title>
</head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
</p>
and they lived at the bottom of a well.
<p class="story">...</p>
</body>
"""
from bs4 import SoupStrainer
a_tags = SoupStrainer('a')  # all <a> tags
id_tags = SoupStrainer(id="link2")  # tags whose id is link2
def is_short_string(string):
    return len(string) < 10  # True when the string is shorter than 10 characters
short_string = SoupStrainer(text=is_short_string)  # text nodes that satisfy the condition
from bs4 import BeautifulSoup
# prettify() returns the parsed result as a formatted string
soup = BeautifulSoup(html_doc, 'html.parser', parse_only=a_tags).prettify()
soup = BeautifulSoup(html_doc, 'html.parser', parse_only=id_tags).prettify()
soup = BeautifulSoup(html_doc, 'html.parser', parse_only=short_string).prettify()
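The effect of parse_only is easiest to see by parsing the same input twice, once with and once without a strainer. This is a sketch on a shortened, made-up document; the variable names are illustrative.

```python
from bs4 import BeautifulSoup, SoupStrainer

# Hypothetical, shortened document.
html = ('<html><body><p class="story">Once '
        '<a id="link1">Elsie</a>, <a id="link2">Lacie</a></p>'
        '<p>end</p></body></html>')

only_links = SoupStrainer('a')
partial = BeautifulSoup(html, 'html.parser', parse_only=only_links)
full = BeautifulSoup(html, 'html.parser')

link_ids = [a['id'] for a in partial.find_all('a')]   # the links survive
paragraphs_partial = len(partial.find_all('p'))       # everything else was never built
paragraphs_full = len(full.find_all('p'))

print(link_ids, paragraphs_partial, paragraphs_full)
```

Since the strained tree no longer contains the surrounding tags, navigation such as .parent or find_parent('p') from a link will not find the original paragraph.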
One last approach (a plain regular expression, without Beautiful Soup):
patel = re.compile('(?<=(original_tel = ")).*(?=("))')
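The notes do not say what this pattern is applied to. As a sketch, assuming it is meant to pull the quoted value that follows original_tel = " out of a line of page source, it could be used as below. The input strings and the lazy variant are assumptions added for illustration.

```python
import re

# Pattern from the notes: lookbehind for the prefix, lookahead for a closing quote.
patel = re.compile('(?<=(original_tel = ")).*(?=("))')

# Hypothetical input line.
line = 'var original_tel = "13800000000";'
tel = patel.search(line).group(0)

# With several quoted values on one line, the greedy .* runs to the last quote.
two_values = 'original_tel = "111"; other = "222"'
greedy = patel.search(two_values).group(0)

# Assumed variant: a lazy .*? stops at the first closing quote instead.
lazy_pattern = re.compile(r'(?<=original_tel = ").*?(?=")')
lazy = lazy_pattern.search(two_values).group(0)

print(tel, greedy, lazy)
```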