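BeautifulSoup basic usage and the find selectors

Reposted article (see the original post) 2018-10-07 10:49 3975 Tag: web crawler

　　This summary is based on the official documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html#find-all

Sample code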

html_doc = """
<html>
    <head><title>The Dormouse's story <!--Hey, buddy. Want to buy a used parser?-->
    <a><!--Hey, buddy. Want to buy a used parser?--></a></title>
    </head>
<body>
    <p class="title">
        <b>The Dormouse's story</b>
        <a><!--Hey, buddy. Want to buy a used parser?--></a>
    </p>
    <p class="story">Once upon a time there were three little sisters; and their names were
        <a href="http://example.com/elsie" class="sister" id="link1 link4">Elsie</a>,
        <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
        <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
        and they lived at the bottom of a well.
    </p>
    <p class="story">...</p>
"""

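　　1. Quick operations:

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')
# The outputs shown in this section are those of the official documentation's
# sample, where <title> holds only text and the first link has id="link1".
# "==" below means that both expressions give the same result.

soup.title  == soup.find('title')
# <title>The Dormouse's story</title>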

soup.title.name
# u'title'

soup.title.string  == soup.title.text  == soup.title.get_text()
# u'The Dormouse's story'

soup.title.parent.name
# u'head'

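soup.p   == soup.find('p')  # dot access returns only the first matching tag under the current tag
# <p class="title"><b>The Dormouse's story</b></p>

soup.p['class']
# ['title']  (class is a multi-valued attribute, so a list is returned)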

soup.a  == soup.find('a')
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

soup.find_all('a')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

soup.find_all(['a','b'])  # find all <a> tags and all <b> tags
soup.find_all(id=["link1","link2"])  # find all tags whose id is link1 or link2

soup.find(id="link3")
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
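
The quick operations above can be tried end to end with a small runnable sketch. The markup is a shortened version of the documentation's sample, and the names html and quick_soup are assumptions made for this illustration only.

```python
from bs4 import BeautifulSoup

# Shortened "three sisters" sample (an assumption for this sketch).
html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
</p>
</body></html>
"""
quick_soup = BeautifulSoup(html, "html.parser")

title_text = quick_soup.title.string                     # text of the first <title>
parent_name = quick_soup.title.parent.name               # name of the enclosing tag
first_p_class = quick_soup.p["class"]                    # class comes back as a list
link_ids = [a["id"] for a in quick_soup.find_all("a")]   # ids of every <a>
by_id = quick_soup.find(id="link2")                      # first tag with this id

print(title_text, parent_name, first_p_class, link_ids, by_id.get_text())
```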

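　　2. Beautiful Soup has four kinds of objects:

　　　　1. BeautifulSoup

　　　　2. Tag: an HTML tag

　　　　3. NavigableString: the text inside a tag (Comment is a subclass of it, so it can also hold comment content)

　　　　4. Comment: a comment inside a tag; comment only, no body text

　　Tag attributes are read and modified just like a dictionary.

　　Multi-valued HTML attributes (this does not apply to XML):

　　　　A multi-valued attribute is one attribute name that can hold several values. Its value is returned as a list even when it holds only one value,

　　　　e.g. class, rel, rev, accept-charset, headers, accesskey

　　　　All other attributes are single-valued: even if the value contains several space-separated parts, a single string is returned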

soup.a['class']  # ['sister']


id_soup = BeautifulSoup('<p id="my id"></p>', 'html.parser')
id_soup.p['id']  # 'my id'
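
As a rough illustration of the dictionary-style access and the multi-valued rule described above, the sketch below uses a made-up one-tag fragment; attr_soup and link are names chosen here, not part of the original post.

```python
from bs4 import BeautifulSoup

attr_soup = BeautifulSoup(
    '<a class="sister big" id="my id" href="/x">Elsie</a>', "html.parser"
)
link = attr_soup.a

classes = link["class"]       # multi-valued attribute: returned as a list
link_id = link["id"]          # single-valued attribute: one string, space kept
missing = link.get("title")   # .get() gives None for an absent attribute

link["rel"] = "nofollow"      # set an attribute like a dict entry
del link["href"]              # delete an attribute like a dict entry
has_href = link.has_attr("href")

print(classes, repr(link_id), missing, has_href)
```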

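　　3. Getting the content of a tag:

　　　　string: returns the text or the comment of a tag that has a single child (one or the other; if the tag contains both, the result is None)

　　　　strings: a generator over the text of all descendants (comments excluded)

　　　　stripped_strings: a generator over the text of all descendants (comments excluded; blank lines are dropped and surrounding whitespace is stripped)

　　　　text: returns text only; the text of several child tags is joined together

　　　　get_text(): returns text only; the text of several child tags is joined together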
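
The five accessors listed above can be compared side by side on one small fragment. This is only a sketch: the markup and the names text_soup and a_tag are assumptions for illustration, and the separator and strip arguments of get_text() are shown as an optional extra.

```python
from bs4 import BeautifulSoup

markup = '<a href="http://example.com/">\nI linked to <i>example.com</i>\n</a>'
text_soup = BeautifulSoup(markup, "html.parser")
a_tag = text_soup.a

single = a_tag.string                       # None: <a> has more than one child
all_strings = list(a_tag.strings)           # every text node, whitespace kept
stripped = list(a_tag.stripped_strings)     # whitespace-only nodes dropped, ends stripped
plain = a_tag.text                          # all text nodes concatenated
joined = a_tag.get_text("|", strip=True)    # stripped pieces joined with a separator

print(single, all_strings, stripped, repr(plain), joined)
```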

　　string:

markup = "<b><!--Hey, buddy. Want to buy a used parser?--></b>"
soup = BeautifulSoup(markup, 'html.parser')
comm = soup.b.string
print(comm)  # Hey, buddy. Want to buy a used parser?
print(type(comm))  #<class 'bs4.element.Comment'>
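
To make the "one or the other" rule for .string concrete, the sketch below contrasts a tag holding only a comment with a tag holding text plus a comment. The fragments and variable names are invented for this example.

```python
from bs4 import BeautifulSoup, Comment

only_comment = BeautifulSoup("<b><!--just a comment--></b>", "html.parser").b
mixed = BeautifulSoup("<b>text<!--and a comment--></b>", "html.parser").b

comment_node = only_comment.string   # the single child is a Comment
mixed_string = mixed.string          # None: the tag has two children
mixed_text = mixed.get_text()        # comments are left out of the text

print(type(comment_node).__name__, mixed_string, repr(mixed_text))
```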

　　strings:

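soup = BeautifulSoup(html_doc, 'html.parser')  # re-parse html_doc; soup was rebound to the <b> markup above
body_tag = soup.body
for s in body_tag.strings:
    print(repr(s))

Result: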
'\n'
"The Dormouse's story"
'\n'
'Once upon a time there were three little sisters; and their names were\n        '
'Elsie'
',\n        '
'Lacie'
' and\n        '
'Tillie'
';\n        and they lived at the bottom of a well.\n    '
'\n'
'...'
'\n'

　　stripped_strings:

body_tag = soup.body
for s in body_tag.stripped_strings:
    print(repr(s))

Result:
"The Dormouse's story"
'Once upon a time there were three little sisters; and their names were'
'Elsie'
','
'Lacie'
'and'
'Tillie'
';\n        and they lived at the bottom of a well.'
'...'

　　text:

soup = BeautifulSoup(html_doc, 'html.parser')
body_tag = soup.body
print(body_tag.text)

Result:
The Dormouse's story
Once upon a time there were three little sisters; and their names were
        Elsie,
        Lacie and
        Tillie;
        and they lived at the bottom of a well.
    
...

soup = BeautifulSoup(html_doc, 'html.parser')
body_tag = soup.body
print(repr(body_tag.text))

Result:
"\nThe Dormouse's story\nOnce upon a time there were three little sisters; and their names were\n        Elsie,\n        Lacie and\n        Tillie;\n        and they lived at the bottom of a well.\n    \n...\n"

　　4. Getting child nodes:

　　　　.contents: returns the direct children of a node as a list

　　　　.children: returns the direct children of a node as an iterator

soup = BeautifulSoup(html_doc, 'html.parser')
head_tag = soup.head
print(head_tag)
print(head_tag.contents)
print(head_tag.contents[0])
print(head_tag.contents[0].contents)

for ch in head_tag.children:
    print(ch)

Result:
<head><title>The Dormouse's story</title></head>
[<title>The Dormouse's story</title>]
<title>The Dormouse's story</title>
["The Dormouse's story"]
<title>The Dormouse's story</title>

　　5. Getting descendant nodes:

　　　　　.descendants: a generator that yields every descendant of a tag (children, grandchildren, and so on), not a list

for ch in head_tag.descendants:
    print(ch)

Result:
<title>The Dormouse's story</title>
The Dormouse's story
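
The difference between direct children and all descendants can be checked with a short sketch. The one-line markup and the names tree and head are assumptions for this illustration.

```python
from bs4 import BeautifulSoup

tree = BeautifulSoup("<head><title>The Dormouse's story</title></head>", "html.parser")
head = tree.head

direct = head.contents                  # list of direct children
children = list(head.children)          # same nodes, produced lazily
descendants = list(head.descendants)    # children plus everything below them

print(len(direct), len(children), len(descendants))
```

Here <head> has one child (the <title> tag) but two descendants, because the text inside <title> is a node as well.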

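　　6. Parent tag (.parent): any bs4 object, whether a Tag or a NavigableString, can reach its parent through .parent; a plain Python str (for example the value of .text) cannot

soup = BeautifulSoup(html_doc, 'html.parser')
tag_title = soup.b  # the <b> tag
print(tag_title.parent)  # parent of the <b> tag: <p>
print(type(tag_title.string))  # type of the text in <b> (.string is None if the tag also holds a comment): <class 'bs4.element.NavigableString'>
print(tag_title.string.parent)  # parent of the text in <b>: the <b> tag itself
print(type(tag_title.text))  # .text is a plain str, which has no bs4 attributes such as .parent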

　　7. Ancestors (.parents): iterate over all ancestors of an element, from the nearest outward

soup = BeautifulSoup(html_doc, 'html.parser')
link = soup.a
for parent in link.parents:
    print(parent.name)

Result:

p
body
html
[document]

　　8. Previous and next nodes (nodes, not tags: text is a node too): previous_sibling, next_sibling
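
A minimal sketch of why "node" and "tag" differ here: in the made-up fragment below, the node right after the first link is the text ", " and not the second link. The names sib_soup, first and second are assumptions for this example.

```python
from bs4 import BeautifulSoup

sib_soup = BeautifulSoup(
    '<p><a id="link1">Elsie</a>, <a id="link2">Lacie</a></p>', "html.parser"
)
first = sib_soup.a

after_first = first.next_sibling         # the text node ", "
second = after_first.next_sibling        # the second <a> tag
before_first = first.previous_sibling    # None: nothing precedes the first <a>

print(repr(after_first), second, before_first)
```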

　　9. Iterating over all sibling nodes with a generator (next_siblings, previous_siblings)

for sib in soup.a.next_siblings:
    print(sib)
    print("---------")

Result:
-------------
,
        
---------
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
---------


---------
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
---------
;
        and they lived at the bottom of a well.
    
---------

　　10. Searching the tree

　　　　Filters:

　　　　　　1. String

　　　　　　2. Regular expression

　　　　　　3. List

　　　　　　4. True

　　　　　　5. Function

html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were</p>
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.

<p class="story">...</p>
</body>
"""
from bs4 import BeautifulSoup
import re
soup = BeautifulSoup(html_doc, 'html.parser')
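soup.find_all("a")  # string: tags named exactly "a"
soup.find_all(re.compile("^b"))  # regular expression: tag names starting with "b"
soup.find_all(re.compile("a"))  # regular expression: tag names containing "a"
soup.find_all(re.compile("l$"))  # regular expression: tag names ending with "l"
soup.find_all(["a", "b"])  # list: tags named "a" or "b"
soup.find_all(True)  # True: every tag

def has_class_no_id(tag):
    # True for tags that have a class attribute but no id attribute
    return tag.has_attr("class") and not tag.has_attr("id")

soup.find_all(has_class_no_id)  # function: tags for which it returns True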
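
To see what each kind of filter actually selects, the sketch below collects only the tag names from each search. The compact markup and the helper names() are assumptions introduced for this illustration.

```python
import re
from bs4 import BeautifulSoup

doc = ('<html><head><title>T</title></head><body>'
       '<p class="title"><b>B</b></p><a class="sister" id="link1">A</a>'
       '</body></html>')
filter_soup = BeautifulSoup(doc, "html.parser")

def names(tags):
    """Return just the tag names, in document order."""
    return [tag.name for tag in tags]

starts_with_b = names(filter_soup.find_all(re.compile("^b")))   # regular expression
ends_with_l = names(filter_soup.find_all(re.compile("l$")))     # regular expression
a_or_b = names(filter_soup.find_all(["a", "b"]))                # list
everything = names(filter_soup.find_all(True))                  # True
class_no_id = names(filter_soup.find_all(
    lambda tag: tag.has_attr("class") and not tag.has_attr("id")))  # function

print(starts_with_b, ends_with_l, a_or_b, everything, class_no_id)
```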

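　　11. The find selectors:

　　　　Syntax:

　　　　# find_all( name , attrs , recursive , text , **kwargs )
　　　　#  name: the tag name to look for
　　　　#  attrs: the tag's attributes
　　　　#  recursive: whether to search all descendants (True, the default) or only direct children (False)
　　　　#  text: search by text content (newer Beautiful Soup releases call this argument string)
　　　　# **kwargs: any other keyword argument, treated as an attribute filter

　　Special cases:
　　　　data-foo="value": a hyphen cannot appear in a Python keyword-argument name, so this must be written as attrs={"data-foo":"value"},

　　　　class="value": class is a Python reserved word, so write class_="value" or attrs={"class":"value"}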
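
The two special cases can be sketched as follows, assuming a made-up fragment with one data-foo attribute; special and the result names are chosen for this example only.

```python
from bs4 import BeautifulSoup

special = BeautifulSoup(
    '<div data-foo="value" class="box">one</div><div class="box">two</div>',
    "html.parser",
)

# data-foo is not a valid Python identifier, so it goes through attrs.
by_data = [t.get_text() for t in special.find_all(attrs={"data-foo": "value"})]

# class is a reserved word: use class_ or the attrs dictionary.
by_class_kw = [t.get_text() for t in special.find_all("div", class_="box")]
by_class_attrs = [t.get_text() for t in special.find_all("div", attrs={"class": "box"})]

print(by_data, by_class_kw, by_class_attrs)
```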

from bs4 import BeautifulSoup
import re
html_doc = """
<html><head><title>The Dormouse's story</title></head>

<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

# find_all( name , attrs , recursive , text , **kwargs )
#  name: the tag name to look for (string, regular expression, list, function, or True)
#  attrs: the tag's attributes
#  recursive: whether to search all descendants or only direct children
#  text: search by text content
# **kwargs: any other keyword argument, treated as an attribute filter
soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.find_all('p', 'title'))  # <p> tags with class="title"
soup.find_all('title')  # list of all <title> tags
soup.find_all(attrs={"class":"sister"})  # list of all tags whose class is sister
soup.find_all(id='link2')  # all tags whose id is link2
soup.find_all(href=re.compile("elsie"))  # all tags whose href contains "elsie"
soup.find_all(id=True)  # all tags that have an id attribute
soup.find_all(id="link1", href=re.compile('elsie'))  # id is link1 and href contains "elsie"

Searching by class

soup = BeautifulSoup(html_doc, 'html.parser')
css_soup = BeautifulSoup('<p class="body strikeout"></p>', 'html.parser')
css_soup.find_all("p", class_="body")  # multi-valued class: matching any one value is enough
css_soup.find_all("p", class_="strikeout")
css_soup.find_all("p", class_="body strikeout")  # exact match against the full attribute string
# the text argument accepts a string, a regular expression, a list, a function, or True
soup.find_all("a", text="Elsie")  # <a> tags whose text is "Elsie"
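
One consequence of the exact-match rule is that the order of class names matters when the whole string is given. The sketch below shows this, and also one possible order-independent check written as a function filter; that filter is an illustration chosen here, not something taken from the original post.

```python
from bs4 import BeautifulSoup

css_demo = BeautifulSoup('<p class="body strikeout"></p>', "html.parser")

one_value = css_demo.find_all("p", class_="body")            # matches on a single class
exact = css_demo.find_all("p", class_="body strikeout")      # same order as in the markup
reordered = css_demo.find_all("p", class_="strikeout body")  # different order: no match

# Order-independent alternative: require both classes via a function filter.
both_any_order = css_demo.find_all(
    lambda tag: tag.name == "p"
    and {"body", "strikeout"} <= set(tag.get("class", []))
)

print(len(one_value), len(exact), len(reordered), len(both_any_order))
```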

　　12. Searching the ancestors:

　　　　find_parents( name , attrs , recursive , text , **kwargs )

　　　　find_parent( name , attrs , recursive , text , **kwargs )

html_doc = """<html>
    <head>
        <title>The Dormouse's story</title>
    </head>
<body>
    <p class="title"><b>The Dormouse's story</b></p>
    <p class="story">Once upon a time there were three little sisters; and their names were</p>
    <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
    <p>
        <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    </p>
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.
    <p class="story">...</p>
</body>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')
a_string = soup.find(text="Lacie")  # the text node whose content is "Lacie"
type(a_string), a_string  # <class 'bs4.element.NavigableString'> Lacie
a_parent = a_string.find_parent()  # nearest ancestor of a_string (the enclosing <a>)
a_parent = a_string.find_parent("p")  # nearest ancestor that is a <p>
a_parents = a_string.find_parents()  # all ancestors of a_string
a_parents = a_string.find_parents("a")  # all ancestors that are <a> tags
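
A compact, checkable version of the ancestor searches might look like this. The markup is a reduced stand-in for the sample above, the variable names are invented, and string= is used in place of text= on the assumption that a reasonably recent Beautiful Soup release is installed.

```python
from bs4 import BeautifulSoup

parent_soup = BeautifulSoup(
    '<html><body><p class="story"><a id="link2">Lacie</a></p></body></html>',
    "html.parser",
)
lacie = parent_soup.find(string="Lacie")    # the text node itself

nearest = lacie.find_parent()               # closest enclosing tag
nearest_p = lacie.find_parent("p")          # closest enclosing <p>
ancestor_names = [tag.name for tag in lacie.find_parents()]  # nearest first

print(nearest.name, nearest_p["class"], ancestor_names)
```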

　　13. Following siblings:

　　　　find_next_siblings( name , attrs , recursive , text , **kwargs )

　　　　find_next_sibling( name , attrs , recursive , text , **kwargs )

html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
    <p class="title"><b>The Dormouse's story</b></p>
    <p class="story">Once upon a time there were three little sisters; and their names were</p>
    <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
    <b href="http://example.com/elsie" class="sister" id="link1">Elsie</b>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
        and they lived at the bottom of a well.
    <p class="story">...</p>
</body>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')
first_link = soup.a  # the first <a> tag
a_sibling = first_link.find_next_sibling()  # the first sibling tag after it
a_sibling = first_link.find_next_sibling("a")  # the first <a> sibling after it
a_siblings = first_link.find_next_siblings()  # all sibling tags after it
a_siblings = first_link.find_next_siblings("a")  # all <a> siblings after it
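
The sketch below shows, on an invented single-paragraph fragment, that find_next_sibling() skips text nodes and returns tags only, whereas the plain next_sibling attribute returns whatever node comes next.

```python
from bs4 import BeautifulSoup

sibling_soup = BeautifulSoup(
    '<p><a id="link1">Elsie</a>, <b>bold</b> '
    '<a id="link2">Lacie</a> and <a id="link3">Tillie</a></p>',
    "html.parser",
)
first_a = sibling_soup.a

raw_next = first_a.next_sibling                     # the text node ", "
next_tag = first_a.find_next_sibling()              # first following tag: <b>
next_a = first_a.find_next_sibling("a")             # first following <a>
later_ids = [t["id"] for t in first_a.find_next_siblings("a")]

print(repr(raw_next), next_tag.name, next_a["id"], later_ids)
```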

　　14. Preceding siblings:

　　　　find_previous_siblings( name , attrs , recursive , text , **kwargs )

　　　　find_previous_sibling( name , attrs , recursive , text , **kwargs )
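
The original gives no example for the preceding-sibling methods, so here is a small sketch under the same assumptions as before (invented markup and names). Note that results come back nearest first, that is, in reverse document order.

```python
from bs4 import BeautifulSoup

prev_soup = BeautifulSoup(
    '<p><a id="link1">Elsie</a>, <a id="link2">Lacie</a> and '
    '<a id="link3">Tillie</a></p>',
    "html.parser",
)
last_a = prev_soup.find(id="link3")

prev_a = last_a.find_previous_sibling("a")          # closest earlier <a>
earlier_ids = [t["id"] for t in last_a.find_previous_siblings("a")]
nothing_before = prev_soup.a.find_previous_sibling()  # first <a> has no earlier sibling tag

print(prev_a["id"], earlier_ids, nothing_before)
```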

　　15. Following nodes:

　　　　find_all_next( name , attrs , recursive , text , **kwargs )

　　　　find_next( name , attrs , recursive , text , **kwargs )

html_doc = """<html>
    <head>
        <title>The Dormouse's story</title>
    </head>
<body>
    <p class="title"><b>The Dormouse's story</b></p>
    <p class="story">Once upon a time there were three little sisters; and their names were</p>
    <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
    <p>
        <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    </p>
    <p>
        <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    </p>
        and they lived at the bottom of a well.
    <p class="story">...</p>
</body>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')
a_string = soup.find(text="Lacie")
a_next = a_string.find_next()  # the first tag that comes after a_string in document order
a_next = a_string.find_next('a')  # the first <a> tag after a_string
a_nexts = a_string.find_all_next()  # all tags after a_string
a_nexts = a_string.find_all_next('a')  # all <a> tags after a_string
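
The following sketch contrasts find_next(), which walks forward through the whole document, with find_next_sibling(), which stays on the same level. The fragment and names are assumptions for illustration, and string= is used in place of text=.

```python
from bs4 import BeautifulSoup

next_soup = BeautifulSoup(
    '<p><a id="link2">Lacie</a> and <a id="link3">Tillie</a></p>'
    '<p class="story">...</p>',
    "html.parser",
)
lacie_text = next_soup.find(string="Lacie")

next_tag = lacie_text.find_next()                   # first tag after the text, anywhere
following_names = [t.name for t in lacie_text.find_all_next()]
sibling_of_text = lacie_text.find_next_sibling()    # the text has no sibling inside <a>

print(next_tag["id"], following_names, sibling_of_text)
```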

　　16. Preceding nodes:

　　　　find_all_previous( name , attrs , recursive , text , **kwargs )

　　　　find_previous( name , attrs , recursive , text , **kwargs )
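
No example accompanies the preceding-node methods in the original, so this is a hedged sketch on invented markup. Because these methods walk backwards through the document, the paragraph that contains the starting link is also found, and results are listed nearest first.

```python
from bs4 import BeautifulSoup

back_soup = BeautifulSoup(
    '<p class="title">T</p>'
    '<p class="story"><a id="link2">Lacie</a><a id="link3">Tillie</a></p>',
    "html.parser",
)
tillie = back_soup.find(id="link3")

prev_link = tillie.find_previous("a")               # closest earlier <a>
prev_paragraphs = [p["class"][0] for p in tillie.find_all_previous("p")]

print(prev_link["id"], prev_paragraphs)
```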

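　　17. Parsing only part of a document:

If all you want is the <a> tags in a document, parsing the whole document wastes memory and time. The fastest approach is to ignore everything except the <a> tags from the start. The SoupStrainer class lets you choose which parts of the incoming document are parsed, so only the content matching the SoupStrainer is parsed rather than the entire document. Create a SoupStrainer object and pass it to the BeautifulSoup constructor as the parse_only argument.

　　SoupStrainer arguments: name , attrs , recursive , text , **kwargs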

html_doc = """<html>
    <head>
        <title>The Dormouse's story</title>
    </head>
<body>
    <p class="title"><b>The Dormouse's story</b></p>
    <p class="story">Once upon a time there were three little sisters; and their names were
        <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
        <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
        <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    </p>
        and they lived at the bottom of a well.
    <p class="story">...</p>
</body>
"""
from bs4 import SoupStrainer
a_tags = SoupStrainer('a')  # only <a> tags
id_tags = SoupStrainer(id="link2")  # only tags with id=link2
def is_short_string(string):
    return len(string) < 10  # True when the string is shorter than 10 characters
short_string = SoupStrainer(text=is_short_string)  # only text nodes that satisfy the condition

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser', parse_only=a_tags).prettify()
soup = BeautifulSoup(html_doc, 'html.parser', parse_only=id_tags).prettify()
soup = BeautifulSoup(html_doc, 'html.parser', parse_only=short_string).prettify()
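
To confirm that parse_only really leaves everything else out of the tree, the sketch below parses an invented page with an <a>-only strainer and inspects what survives. It assumes the html.parser backend; the documentation notes that this feature does not work with html5lib.

```python
from bs4 import BeautifulSoup, SoupStrainer

page = ('<html><head><title>T</title></head><body>'
        '<p><a id="link1">Elsie</a> and <a id="link2">Lacie</a></p>'
        '</body></html>')

only_links = SoupStrainer("a")                       # keep <a> tags only
partial = BeautifulSoup(page, "html.parser", parse_only=only_links)

kept = [tag.name for tag in partial.find_all(True)]  # every tag in the partial tree
kept_text = [a.get_text() for a in partial.find_all("a")]
title = partial.find("title")                        # was never parsed into the tree

print(kept, kept_text, title)
```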

href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html#css">attrs</a> , <a class="reference internal" href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html#recursive">recursive</a> , <a class="reference internal" href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html#text">text</a> , <a class="reference internal" href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html#keyword">**kwargs</a> ) 　　17、解析部分文檔：　　　　如果僅僅因為想要查找文檔中的<a>標簽而將整片文檔進行解析,實在是浪費內存和時間.最快的方法是從一開始就把<a>標簽以外的東西都忽略掉. <tt class="docutils literal">SoupStrainer</tt> 類可以定義文檔的某段內容,這樣搜索文檔時就不必先解析整篇文檔,只會解析在 <tt class="docutils literal">SoupStrainer</tt> 中定義過的文檔. 創建一個 <tt class="docutils literal">SoupStrainer</tt> 對象並作為 <tt class="docutils literal">parse_only</tt> 參數給 <tt class="docutils literal">BeautifulSoup</tt> 的構造方法即可。<tt class="docutils literal">　　SoupStrainer</tt> 類參數：<a class="reference internal" href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html#id32">name</a> , <a class="reference internal" href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html#css">attrs</a> , <a class="reference internal" href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html#recursive">recursive</a> , <a class="reference internal" href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html#text">text</a> , <a class="reference internal" href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html#keyword">**kwargs</a><div class="cnblogs_code"><div class="cnblogs_code_toolbar"><a href="javascript:void(0);" onclick="copyCnblogsCode(this)" title="復制代碼"><img src="//common.cnblogs.com/images/copycode.gif" alt="復制代碼"></a></div><pre>html_doc = """<html> <head> <title>The Dormouse's story</title> </head><body> class="title">The Dormouse's story class="story">Once upon a time there were three little sisters; and their names were <a href="http://example.com/elsie" class="sister" 
id="link1">Elsie</a>, <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>; and they lived at the bottom of a well. class="story">...</body>"""from bs4 import SoupStrainera_tags = SoupStrainer('a') # 所有a標簽id_tags = SoupStrainer(id="link2") # id=link2的標簽def is_short_string(string): return len(string) < 10 # string長度小於10，返回Trueshort_string = SoupStrainer(text=is_short_string) # 符合條件的文本
from bs4 import BeautifulSoupsoup = BeautifulSoup(html_doc, 'html.parser', parse_only=a_tags).prettify()soup = BeautifulSoup(html_doc, 'html.parser', parse_only=id_tags).prettify()soup = BeautifulSoup(html_doc, 'html.parser', parse_only=short_string).prettify()</pre><div class="cnblogs_code_toolbar"><a href="javascript:void(0);" onclick="copyCnblogsCode(this)" title="復制代碼"><img src="//common.cnblogs.com/images/copycode.gif" alt="復制代碼"></a></div></div> </div>
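To make the effect of parse_only in section 17 concrete, here is a minimal sketch that parses the same sample document twice, once in full and once through a SoupStrainer, and compares what survives. The variable names (only_links, full_soup, link_soup) are illustrative and not part of the original article; the sample document is the standard "three sisters" text used throughout.

```python
from bs4 import BeautifulSoup, SoupStrainer

html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body></html>"""

# Hypothetical names for illustration: a strainer that keeps only <a> tags.
only_links = SoupStrainer("a")

full_soup = BeautifulSoup(html_doc, "html.parser")
link_soup = BeautifulSoup(html_doc, "html.parser", parse_only=only_links)

# The strained soup still contains every <a> tag ...
link_ids = [a["id"] for a in link_soup.find_all("a")]
print(link_ids)  # ['link1', 'link2', 'link3']

# ... but nothing outside them, such as <title> or <p>.
print(full_soup.title is not None)  # True
print(link_soup.title is not None)  # False
```

Because the filtering happens while parsing, the discarded tags never become part of the tree, so navigating from a kept tag to its original parent is not possible in the strained soup.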

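Sections 14 and 16 list the backward-looking methods without an example. The following is a small sketch, assuming the standard sample document and starting from the last link; the variable names are illustrative only.

```python
from bs4 import BeautifulSoup

html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body></html>"""

soup = BeautifulSoup(html_doc, "html.parser")
last_link = soup.find("a", id="link3")

# Section 14: siblings that come before this tag, nearest first.
prev_sibling = last_link.find_previous_sibling("a")
prev_sibling_ids = [a["id"] for a in last_link.find_previous_siblings("a")]
print(prev_sibling["id"])   # link2
print(prev_sibling_ids)     # ['link2', 'link1']

# Section 16: anything parsed earlier in the document, which includes
# ancestors, because a parent's opening tag precedes its children.
prev_title = last_link.find_previous("title")
prev_paragraph_classes = [p["class"] for p in last_link.find_all_previous("p")]
print(prev_title.string)        # The Dormouse's story
print(prev_paragraph_classes)   # [['story'], ['title']]
```

The sibling methods stay on the same level of the tree, while find_previous and find_all_previous walk backward through everything that was parsed before the starting node.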