Python Web Scraping Basics: Web Page Parsing (2) BeautifulSoup

Reposted article, 2021-12-21 10:03. Series: Getting Started with Python Web Scraping

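Web page parsing means extracting the information you need (new URLs, data, and so on) from a web page.

Commonly used tools for parsing web pages include re (regular expressions), BeautifulSoup, lxml, parsel, and requests-html.

This post covers only BeautifulSoup; the others will be covered in later posts.

Official documentation: Beautiful Soup 4.4.0 documentation (Beautiful Soup 4.2.0 Chinese documentation) and Beautiful Soup Documentation (Beautiful Soup 4.9.0 documentation, crummy.com)

Introduction:

Beautiful Soup is a Python library for extracting data from HTML (web pages) or XML files.

Beautiful Soup 3 is no longer being developed; Beautiful Soup 4 is recommended for current projects. When porting to BS4, the main change is the import location (from bs4 import BeautifulSoup).

Example: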

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""



from bs4 import BeautifulSoup

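# Parsing this markup with BeautifulSoup returns a BeautifulSoup object.
# Printing it shows the parsed document with the missing closing tags filled in;
# soup.prettify() gives the same document with standard indentation.
soup = BeautifulSoup(html_doc, 'html.parser')
print(soup)

Output: the document is printed back with the missing </body> and </html> closing tags added.

Beautiful Soup has to be installed first, for example with pip install beautifulsoup4 (reference: Python安裝Bs4及使用方法_python_腳本之家 (jb51.net)).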

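Beautiful Soup converts a complex HTML document into a tree in which every node is a Python object. All of these objects fall into four types: Tag, NavigableString, BeautifulSoup, and Comment.

I. Node objects

1. Tag

Roughly speaking (this is not a rigorous definition), an HTML tag is a pair such as <name> </name> in an HTML file, where name is the tag name. A Tag object lets you locate that element and read information from it.

For example: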

from bs4 import BeautifulSoup

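# Build a BeautifulSoup object
soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', 'html.parser')

# Print the HTML with standard indentation
print(soup.prettify())

# Get the <b> tag from the soup
tag = soup.b

print('------------\n\n')
# Data type of the tag
print(type(tag))

Output: the indented <b> element, the separator line, and finally <class 'bs4.element.Tag'>.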

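Some important properties of Tag are described below.

1.1 Name

The name of the tag; in <name> </name>, name is the tag name.

Usage: tag.name

1.2 Attributes

The other attributes of the tag.

In HTML, attributes are assigned inside the opening tag, for example <name class="1">. Here class is the attribute and 1 is its value.

Attributes can be added, modified, and deleted (del tag['attribute']).

Usage: tag['attribute'] returns the attribute value, and tag.attrs returns a dictionary of all attributes and their values.

1.3 Multi-valued attributes

An attribute of a tag can hold several values. In Beautiful Soup, a multi-valued attribute such as class is returned as a list.

Example: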

from bs4 import BeautifulSoup

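# Build a BeautifulSoup object
soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', 'html.parser')

# Get the <b> tag from the soup
tag = soup.b

print('Tag name:', tag.name)

print('Attribute value:', tag['class'])

print('Attributes and values:', tag.attrs)

# Modify an attribute value
tag['class'] = 'modified'

# Add an attribute
tag['new'] = 'newly added'

# Check the result
print('After modifying and adding:', tag.attrs)

Output: the tag name is b, tag['class'] is the list ['boldest'], tag.attrs is {'class': ['boldest']}, and after the changes tag.attrs is {'class': 'modified', 'new': 'newly added'}.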
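
The list behaviour of multi-valued attributes is easiest to see when class really holds two values. The following is a minimal sketch; the <p> markup and the variable names are made up for illustration and are not part of the original example.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: class holds two values, id holds one value with a space
soup = BeautifulSoup('<p class="body strikeout" id="my id">text</p>', 'html.parser')
p = soup.p

classes = p['class']    # class is multi-valued, so this is a list
element_id = p['id']    # id is not multi-valued, so this stays one string

del p['id']                     # delete an attribute
id_after_delete = p.get('id')   # .get() returns None for a missing attribute

print(classes, repr(element_id), id_after_delete)
```

Using .get() instead of p['id'] avoids a KeyError when the attribute may be absent.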

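2. NavigableString

A NavigableString holds the text inside a tag: in <name> Text </name>, it is Text.

Usage: tag.string

To convert it to a plain Python string: unicode_string = str(tag.string) (in Python 2 this was unicode(tag.string)).

The string inside a tag cannot be edited in place, but it can be replaced with another string using replace_with(): tag.string.replace_with("new content")

Code: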

from bs4 import BeautifulSoup

# Build a BeautifulSoup object
soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', 'html.parser')

# Get the <b> tag from the soup
tag = soup.b

print('Content:', tag.string)

Output: Content: Extremely bold
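
The conversion and replacement described above can be sketched as follows. This assumes Python 3, where str() takes the place of unicode(); the variable names are illustrative only.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', 'html.parser')
tag = soup.b

text = tag.string
is_navigable = isinstance(text, NavigableString)  # the text node is a NavigableString
plain = str(text)                                 # convert to an ordinary Python string

# The string cannot be edited in place, but it can be swapped for another one
tag.string.replace_with("No longer bold")
result = str(tag)

print(is_navigable, plain, result)
```

Converting to a plain str is useful when the text should outlive the parse tree, since a NavigableString keeps a reference to the whole tree.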

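3. BeautifulSoup

The BeautifulSoup object represents the whole document. Most of the time it can be treated as a Tag object.

Because the BeautifulSoup object is not a real HTML or XML tag, it has no tag name or attributes of its own.

It is still sometimes handy to look at its .name, so the BeautifulSoup object is given the special .name value "[document]".

Usage: soup = BeautifulSoup(markup, 'html.parser')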
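
A short sketch of the special .name value, reusing the <b> markup from the earlier examples (the variable names are illustrative):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', 'html.parser')

document_name = soup.name   # special name of the BeautifulSoup object
tag_name = soup.b.name      # name of a real tag, for comparison

# The soup can be searched like a Tag
first_bold = soup.find('b')

print(document_name, tag_name, first_bold)
```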

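4. Comment

A Comment object is a special type of NavigableString; its output does not include the comment markers.

HTML comment syntax: <!-- comment text -->, where <!-- and --> are the comment markers.

from bs4 import BeautifulSoup

text = '''
<b><!--Hey, buddy. Want to buy a used parser?--></b>
'''

# Build a BeautifulSoup object
soup = BeautifulSoup(text, 'html.parser')

# Get the content of the <b> tag
string = soup.b.string
print(string)

# Check its data type
print(type(string))

Output: Hey, buddy. Want to buy a used parser? followed by <class 'bs4.element.Comment'>.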

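II. Navigating the tree

Tags in HTML can be nested, so there are properties that describe the relationships between nodes.

1. Child nodes

A tag's children: .contents or .children

A tag's .contents property returns its children as a list; for example, soup.contents[0].name is the name of the first child.

The BeautifulSoup object itself has children; the <html> tag, for example, is a child of the BeautifulSoup object.

Strings have no .contents property, because a string cannot have children.

The .children generator lets you loop over a tag's children:

for child in tag.children:
    print(child)

2. Descendants: the children of the children, and so on

The .descendants property iterates recursively over all of a tag's descendants:

for child in tag.descendants:
    print(child)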
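
The difference between direct children and descendants can be sketched on a tiny document. The markup below is a cut-down version of the example document and the variable names are illustrative.

```python
from bs4 import BeautifulSoup

markup = "<html><head><title>The Dormouse's story</title></head></html>"
soup = BeautifulSoup(markup, 'html.parser')
head = soup.head

top_level_name = soup.contents[0].name          # <html> is a child of the soup
children = head.contents                        # direct children only, as a list
child_names = [child.name for child in head.children]

# Descendants also include the string inside <title>
descendants = list(head.descendants)

print(top_level_name, child_names, len(children), len(descendants))
```

Here <head> has one child (the <title> tag) but two descendants (the <title> tag and the string it contains).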

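2.1 Getting the text content

.string

If a tag has only one child, .string on the tag gives the same result as .string on that single child.

If a tag contains more than one child, it is not clear which child .string should refer to, so .string is None.

That leads to:

.strings and .stripped_strings

If a tag contains several strings, loop over them with .strings:

for string in soup.strings:
    print(repr(string))

The strings may contain a lot of extra spaces or blank lines; .stripped_strings removes the extra whitespace:

for string in soup.stripped_strings:
    print(repr(string))

Strings that consist entirely of whitespace are skipped, and whitespace at the start and end of each string is removed.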
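
The three ways of reading text can be compared side by side. This is a small sketch on made-up markup with deliberately messy whitespace; the names are illustrative.

```python
from bs4 import BeautifulSoup

markup = ('<p class="story">  Once upon a time <a id="link1">Elsie</a>\n'
          '  and <a id="link2">Lacie</a>  </p>')
soup = BeautifulSoup(markup, 'html.parser')
p = soup.p

whole = p.string              # several children, so .string is None
single = soup.a.string        # one child, so .string is that string

raw = list(p.strings)                 # every string, whitespace included
clean = list(p.stripped_strings)      # stripped, whitespace-only strings skipped

# A tag whose only child is another tag passes .string through
nested = BeautifulSoup('<p><b>bold</b></p>', 'html.parser').p.string

print(whole, single, raw, clean, nested)
```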

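3. Parent nodes

.parent

Use the .parent property to get an element's parent.

The .parent of the BeautifulSoup object is None.

.parents

An element's .parents property walks up through all of its ancestors, from the direct parent to the BeautifulSoup object. It returns a generator.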
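
A minimal sketch of walking upwards from a link, using a small made-up document (names are illustrative):

```python
from bs4 import BeautifulSoup

markup = '<html><body><p><a id="link1">Elsie</a></p></body></html>'
soup = BeautifulSoup(markup, 'html.parser')
link = soup.a

direct_parent = link.parent.name                   # the enclosing <p>
ancestors = [node.name for node in link.parents]   # all the way up to the soup
string_parent = link.string.parent                 # a string's parent is its tag
soup_parent = soup.parent                          # the soup has no parent

print(direct_parent, ancestors, soup_parent)
```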

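4. Sibling nodes (nodes at the same level, which must share the same parent)

.previous_sibling and .next_sibling (the node immediately before and after a tag)

The .next_siblings and .previous_siblings properties iterate over the siblings of the current node:

for sibling in soup.a.next_siblings:
    print(repr(sibling))

for sibling in soup.find(id="link3").previous_siblings:
    print(repr(sibling))

5. Going back and forth

.next_element and .previous_element

The .next_element property points to whatever was parsed immediately afterwards (a string or a tag). It may be the same as .next_sibling, but usually it is different.

The .previous_element property is the exact opposite of .next_element: it points to whatever was parsed immediately before the current object.

.next_elements and .previous_elements

The .next_elements and .previous_elements iterators move forward or backward through the document in the order it was parsed.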
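
The difference between .next_sibling and .next_element is easy to miss, so here is a small sketch on a two-link paragraph (markup and names are made up for illustration):

```python
from bs4 import BeautifulSoup

markup = '<p><a id="link1">Elsie</a>, <a id="link2">Lacie</a></p>'
soup = BeautifulSoup(markup, 'html.parser')
first = soup.find(id="link1")
second = soup.find(id="link2")

# The sibling is the next node at the same level: the ", " string
sibling_after_first = first.next_sibling

# The next element is whatever was parsed next: the string inside the link
element_after_first = first.next_element

# Walking backwards from the second link reaches the ", " string first
element_before_second = second.previous_element

# Everything parsed after the first link, in parse order
following = [str(element) for element in first.next_elements]

print(repr(sibling_after_first), repr(element_after_first), following)
```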

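III. Search methods

1. find_all()

Purpose: searches all tag descendants of the current tag and checks whether each one matches the filter.

Syntax: soup.find_all(name, attrs, recursive, text, limit, **kwargs)

Parameters:

1.1 name

The name argument finds all tags whose name is name.

Its value can be a string, a regular expression, a list, a function, or True.

True matches any value: it finds every tag but does not return string nodes.

1.2 attrs

Some attributes cannot be used as keyword arguments in a search, for example the HTML5 data-* attributes, because names such as data-foo are not valid Python identifiers.

You can still search for them by passing a dictionary to the attrs argument of find_all():

soup = BeautifulSoup('<div data-foo="value">foo!</div>', 'html.parser')
soup.find_all(attrs={"data-foo": "value"})

Ordinary attributes and values can be put in the dictionary as well.

For example: attrs={"id": True}. The same goes for class, which is written as "class" (not class_) inside the dictionary.

1.3 recursive

When you call find_all() on a tag, Beautiful Soup examines all of its descendants. To search only the tag's direct children, pass recursive=False.

The default is recursive=True.

1.4 text

The text argument searches the strings in the document instead of the tags. Like name, it accepts a string, a regular expression, a list, or True. Since Beautiful Soup 4.4.0 this argument is also available under the name string.

It matches against tag content: in <a> text </a> the text part is what gets matched, and when text is used on its own, the matching strings are what is returned.

It is usually combined with a tag name to pin down a tag: soup.find_all("a", text="Elsie")

1.5 limit

The limit argument caps the number of results returned.

For example, with soup.find_all("a", limit=2) the search stops as soon as two <a> tags have been found.

1.6 kwargs

A keyword argument whose name is not one of the built-in parameter names is treated as a filter on the tag attribute of that name.

If you pass an argument named id, Beautiful Soup filters on each tag's id attribute.

For example: soup.find_all(id="link1"). With id=True, every tag that has an id attribute is returned.

If you pass an href argument, Beautiful Soup filters on each tag's href attribute:

soup.find_all(href=re.compile("elsie"))

class is a reserved word in Python, so it cannot be used as a keyword argument directly (it is a syntax error); use class_ instead to filter on each tag's class attribute.

For example: soup.find_all(class_="story")

The attribute value can be a string, a regular expression, a list, or True.

id=True matches any tag that has an id attribute; href and class_ work the same way.

Passing several keyword arguments filters on several attributes at once:

For example: soup.find_all(href=True, id=True)

The related methods below take the same parameters, with the same effect.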
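
The parameters above can be tried out together on a shortened version of the example document. This is a sketch rather than a complete tour; it uses the string keyword, which assumes Beautiful Soup 4.4.0 or later, and all variable names are illustrative.

```python
import re
from bs4 import BeautifulSoup

html_doc = """
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters:
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>.</p>
"""
soup = BeautifulSoup(html_doc, 'html.parser')

by_name = soup.find_all("a")                            # name as a string
by_list = soup.find_all(["a", "b"])                     # name as a list
by_regex = soup.find_all(re.compile("^b"))              # name as a regular expression
by_id = soup.find_all(id="link1")                       # keyword argument filter
by_href = soup.find_all(href=re.compile("elsie"))       # regex on an attribute value
by_class = soup.find_all(class_="sister")               # class needs the trailing underscore
by_attrs = soup.find_all(attrs={"class": "story"})      # the same idea via the attrs dict
limited = soup.find_all("a", limit=2)                   # stop after two matches
direct = soup.find_all("a", recursive=False)            # <a> is not a direct child of the soup
by_string = soup.find_all("a", string="Elsie")          # tag name combined with its text

# data-* attributes have to go through attrs
data_soup = BeautifulSoup('<div data-foo="value">foo!</div>', 'html.parser')
by_data = data_soup.find_all(attrs={"data-foo": "value"})

print(len(by_name), [t.name for t in by_list], len(limited), direct)
```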

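2. Calling a tag like find_all()

A BeautifulSoup object or a Tag object can be called like a function; this is equivalent to calling find_all() on it.

These two lines are equivalent:

soup.find_all("a")
soup("a")

So are these two:

soup.title.find_all(string=True)
soup.title(string=True)

3. find()

find(name, attrs, recursive, text, **kwargs)

find_all() returns every tag in the document that matches, as a list. find() returns only the first match, as a single object rather than a list, or None if nothing matches.

Because there is only one result, find() has no limit parameter.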
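
The difference in return types matters most when nothing matches. A minimal sketch on made-up markup (names are illustrative):

```python
from bs4 import BeautifulSoup

markup = '<p><a id="link1">Elsie</a><a id="link2">Lacie</a></p>'
soup = BeautifulSoup(markup, 'html.parser')

first_link = soup.find("a")              # a single Tag
one_in_list = soup.find_all("a", limit=1)  # a list with one Tag
called = soup("a")                       # same as soup.find_all("a")

missing_one = soup.find("table")         # None when nothing matches
missing_all = soup.find_all("table")     # empty list when nothing matches

print(first_link, one_in_list, missing_one, missing_all)
```

Because find() can return None, it is worth checking the result before accessing attributes on it.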

4. find_parents() and find_parent()

Tip: in PyCharm, place the cursor on a function and press Ctrl+Q to view its documentation.

find_parent(name, attrs, **kwargs)

find_parents(name, attrs, limit, **kwargs)

Remember: find_all() and find() search only the descendants of the current node (children, grandchildren, and so on).

find_parents() and find_parent() search the ancestors of the current node instead; the filters work the same way as in an ordinary tag search.

5. find_next_siblings() and find_next_sibling()

find_next_sibling(name, attrs, text, **kwargs)

find_next_siblings(name, attrs, text, limit, **kwargs)

These iterate over the siblings that come after the current tag. find_next_siblings() returns all of the following siblings that match, and find_next_sibling() returns only the first following sibling that matches.

6. find_previous_siblings() and find_previous_sibling()

find_previous_siblings(name, attrs, text, limit, **kwargs)

find_previous_sibling(name, attrs, text, **kwargs)

These iterate over the siblings that come before the current tag. find_previous_siblings() returns all of the preceding siblings that match, and find_previous_sibling() returns the first preceding sibling that matches.

7. find_all_next() and find_next()

find_all_next(name, attrs, text, limit, **kwargs)

find_next(name, attrs, text, **kwargs)

These iterate over the tags and strings that come after the current tag. find_all_next() returns all matching nodes, and find_next() returns the first matching node.

8. find_all_previous() and find_previous()

find_all_previous(name, attrs, text, limit, **kwargs)

find_previous(name, attrs, text, **kwargs)

These iterate over the tags and strings that come before the current node. find_all_previous() returns all matching nodes, and find_previous() returns the first matching node.
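
The directional search methods from items 4 to 8 can be compared from a single starting point. The sketch below starts at the middle link of a small made-up document; the markup and names are illustrative.

```python
from bs4 import BeautifulSoup

markup = ('<html><body><p class="story"><a id="link1">Elsie</a>, '
          '<a id="link2">Lacie</a> and <a id="link3">Tillie</a></p>'
          '<p class="end">...</p></body></html>')
soup = BeautifulSoup(markup, 'html.parser')
link2 = soup.find(id="link2")

# Upwards: ancestors of the current node
parent_p = link2.find_parent("p")
some_parents = [t.name for t in link2.find_parents(["p", "body"])]

# Sideways: siblings under the same parent
next_link = link2.find_next_sibling("a")
later_links = [t['id'] for t in link2.find_next_siblings("a")]
previous_link = link2.find_previous_sibling("a")

# In parse order: anything parsed after or before the current node
next_p = link2.find_next("p")
earlier_links = [t['id'] for t in link2.find_all_previous("a")]
enclosing_p = link2.find_previous("p")
later_strings = [str(s) for s in link2.find_all_next(string=True)]

print(some_parents, later_links, earlier_links, later_strings)
```

Note that find_previous("p") returns the enclosing paragraph, because the paragraph's opening tag was parsed before the link inside it.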

IV. CSS selectors

Beautiful Soup supports most CSS selectors.

Pass a selector string to the .select() method of a Tag or BeautifulSoup object to find tags using CSS selector syntax:

For example: soup.select("title")

1. Search down through nested tags (separated by spaces):

For example: soup.select("body a"), soup.select("html head title")

2. Find direct children of a tag (separated by >):

For example: soup.select("head > title")

soup.select("p > a:nth-of-type(3)")  # the third <a> child of a <p>

soup.select("p > #link1")  # a direct child matched by its id

3. Find following siblings (separated by ~):

For example: soup.select("#link1 ~ .sister")

4. Find by CSS class name (.classname):

soup.select(".sister")

soup.select("[class~=sister]")

5. Find by id:

soup.select("#link1")

soup.select("a#link2")

6. Match any of several selectors at once:

soup.select("#link1,#link2")

7. Find by the presence of an attribute:

soup.select('a[href]')

8. Find by attribute value:

soup.select('a[href="http://example.com/elsie"]')  # value is exactly this

soup.select('a[href^="http://example.com/"]')  # value starts with this

soup.select('a[href$="tillie"]')  # value ends with this

soup.select('a[href*=".com/el"]')  # value contains this

9. Find by language code:

multilingual_markup = """
<p lang="en">Hello</p>
<p lang="en-us">Howdy, y'all</p>
<p lang="en-gb">Pip-pip, old fruit</p>
<p lang="fr">Bonjour mes amis</p>
"""

multilingual_soup = BeautifulSoup(multilingual_markup, 'html.parser')

multilingual_soup.select('p[lang|=en]')

10. Return only the first matching element

For example: soup.select_one(".sister")
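
Several of the selectors listed above can be checked against one small document. This is a sketch that assumes a Beautiful Soup version with select() support installed; the helper ids_for() and the markup are made up for illustration.

```python
from bs4 import BeautifulSoup

html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
<p lang="en-us">Howdy</p><p lang="fr">Bonjour</p>
</body></html>"""
soup = BeautifulSoup(html_doc, 'html.parser')


def ids_for(selector):
    """Hypothetical helper: ids of all elements matching a CSS selector."""
    return [tag.get("id") for tag in soup.select(selector)]


title_text = soup.select("head > title")[0].string   # direct child
nested = ids_for("body a")                           # any depth below <body>
third = ids_for("p > a:nth-of-type(3)")              # third <a> child of a <p>
siblings = ids_for("#link1 ~ .sister")               # following siblings
either = ids_for("#link1,#link2")                    # either selector
ends_with = ids_for('a[href$="tillie"]')             # attribute value suffix
contains = ids_for('a[href*=".com/el"]')             # attribute value substring
english = [p.string for p in soup.select('p[lang|=en]')]  # language prefix
first_sister = soup.select_one(".sister")            # first match only

print(nested, third, siblings, english)
```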

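V. Modifying the tree

1. Changing tag names and attributes

You can rename a tag, change the value of an attribute, and add or delete attributes.

For example:

soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', 'html.parser')
tag = soup.b

# Rename the tag
tag.name = "blockquote"

# Change the class
tag['class'] = 'verybold'

# Add an attribute
tag['id'] = 1

# Delete attributes
del tag['class']
del tag['id']

2. Modifying .string

Assigning to a tag's .string replaces the tag's current content with the new string:

For example: tag.string = "New Text."

Note: if the tag contains other tags, assigning to .string overwrites all of its existing content, including the child tags.

3. append()

Tag.append() adds content to the end of a tag:

soup = BeautifulSoup("<a>Foo</a>", 'html.parser')
soup.a.append("Bar")

print(soup)

Output: <a>FooBar</a>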

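4. NavigableString() and .new_tag()

from bs4 import BeautifulSoup, NavigableString

soup = BeautifulSoup("<b></b>", 'html.parser')
tag = soup.b
tag.append("Hello")

new_string = NavigableString(" there")
tag.append(new_string)

print(tag)

Output: <b>Hello there</b>

To create a comment, or any other subclass of NavigableString, pass that class as the second argument to soup.new_string() (calling the class constructor directly also works):

# Continuing from the code above
from bs4 import Comment

new_comment = soup.new_string("Nice to see you.", Comment)
tag.append(new_comment)

print(tag)

Output: <b>Hello there<!--Nice to see you.--></b>

The best way to create a new tag is to call the factory method BeautifulSoup.new_tag():

For example: new_tag = soup.new_tag("a", href="http://www.example.com")

The first argument, the tag name, is required; the other arguments are optional.

5. insert()

Tag.insert() is like Tag.append(), except that the new element is not necessarily added to the end of the parent's .contents; it is inserted at the position you specify.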
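
A minimal sketch of insert(), using a short made-up link; the position argument is an index into the tag's .contents list:

```python
from bs4 import BeautifulSoup

markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup, 'html.parser')
tag = soup.a

# Position 1 is between the leading string and the <i> tag
tag.insert(1, "but did not endorse ")

contents = [str(node) for node in tag.contents]
result = str(tag)

print(contents)
print(result)
```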

6. insert_before() and insert_after()

These insert content immediately before or after the current tag or text node.
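
Both methods can be sketched on a one-word document. The markup and names below are illustrative; a new tag is created with soup.new_tag() and placed relative to existing nodes.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<b>leave</b>", 'html.parser')

# Build a new <i> tag and put it before the existing string
new_tag = soup.new_tag("i")
new_tag.string = "Don't"
soup.b.string.insert_before(new_tag)
after_insert_before = str(soup.b)

# Put a plain string right after the new <i> tag
soup.b.i.insert_after(" ever ")
after_insert_after = str(soup.b)

print(after_insert_before)
print(after_insert_after)
```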

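7. clear()

Tag.clear() removes the contents of the current tag.

8. extract()

PageElement.extract() removes the current tag from the tree and returns it as the result of the call.

9. decompose()

Tag.decompose() removes the current node from the tree and destroys it completely.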
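
The three removal methods in items 7 to 9 differ mainly in what remains afterwards. A sketch on the same made-up link, parsed afresh for each method (the helper fresh_soup() is hypothetical):

```python
from bs4 import BeautifulSoup

markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'


def fresh_soup():
    """Hypothetical helper: parse the sample markup again."""
    return BeautifulSoup(markup, 'html.parser')


# clear(): the tag stays, its contents go
soup = fresh_soup()
soup.a.clear()
after_clear = str(soup.a)

# extract(): the tag leaves the tree but is returned and still usable
soup = fresh_soup()
extracted = soup.i.extract()
after_extract = str(soup.a)

# decompose(): the tag leaves the tree and is destroyed
soup = fresh_soup()
soup.i.decompose()
after_decompose = str(soup.a)

print(after_clear, after_extract, str(extracted), after_decompose)
```

extract() is the one to use when the removed element is needed elsewhere; decompose() is for content that should simply be discarded.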

10. replace_with()

PageElement.replace_with() removes a piece of content from the tree and puts a new tag or text node in its place.
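
A short sketch of swapping one tag for another; the markup and the replacement text are made up for illustration.

```python
from bs4 import BeautifulSoup

markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup, 'html.parser')

# Build the replacement tag
new_tag = soup.new_tag("b")
new_tag.string = "example.net"

# replace_with() returns the element that was replaced
old_tag = soup.a.i.replace_with(new_tag)
result = str(soup.a)

print(result, str(old_tag))
```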

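11. wrap()

PageElement.wrap() wraps the specified element in the tag you give it and returns the new wrapper.

12. unwrap()

Tag.unwrap() is the opposite of wrap(): it replaces a tag with whatever is inside that tag, removing the tag itself but keeping its contents. It is often used to strip out markup.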

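VI. Output

1. Pretty-printing

prettify() formats the Beautiful Soup tree as a Unicode string, with each XML/HTML tag and each string on its own line.

Both the BeautifulSoup object and any of its tag nodes can call prettify():

For example: print(soup.prettify())

2. Compact output

If you just want the string and do not care about formatting, call Python's str() on a BeautifulSoup object or a Tag object (unicode() in Python 2).

3. Output formatting

Beautiful Soup converts HTML entities in the input, such as "&ldquo;", into the corresponding Unicode characters.

If you then convert the document to a string, those characters are written out as Unicode characters (UTF-8 when encoded to bytes); the original HTML entities are not restored.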
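
The output options above can be compared on two small inputs. This sketch assumes Python 3, where str() gives text and .encode() gives bytes; the markup and names are illustrative.

```python
from bs4 import BeautifulSoup

markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup, 'html.parser')

pretty = soup.prettify()   # one tag or string per line, indented
compact = str(soup)        # no extra formatting

# Entities in the input become Unicode characters in the tree
entity_soup = BeautifulSoup('&ldquo;Dammit!&rdquo; he said.', 'html.parser')
as_text = str(entity_soup)              # curly quotes, not &ldquo; / &rdquo;
as_bytes = entity_soup.encode('utf8')   # the same characters as UTF-8 bytes

print(pretty)
print(compact)
print(as_text)
```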

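4. get_text()

If you only want the text inside a tag, get_text() returns all the text in the tag, including the text inside its descendants, as a single Unicode string.

You can pass a string to be used as the separator between pieces of text:

For example: soup.get_text("|")

You can also strip whitespace from the beginning and end of each piece of text:

For example: soup.get_text("|", strip=True)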
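
The effect of the separator and strip arguments can be sketched on a link whose text is surrounded by newlines (the markup is made up for illustration):

```python
from bs4 import BeautifulSoup

markup = '<a href="http://example.com/">\nI linked to <i>example.com</i>\n</a>'
soup = BeautifulSoup(markup, 'html.parser')

plain = soup.get_text()                      # all text, whitespace kept
separated = soup.get_text("|")               # "|" between the pieces of text
stripped = soup.get_text("|", strip=True)    # each piece stripped, empty pieces dropped

print(repr(plain), repr(separated), repr(stripped))
```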

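VII. Parsers

The second argument of the BeautifulSoup constructor can specify either of two things:

The type of document to parse: currently "html", "xml", and "html5" are supported.
The parser to use: currently "lxml", "html5lib", and "html.parser" are supported.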
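
A minimal sketch of choosing a parser explicitly. Only the built-in "html.parser" is exercised here, because "lxml" and "html5lib" are separate packages that have to be installed first; the broken markup is made up for illustration.

```python
from bs4 import BeautifulSoup

broken = "<a></p>"   # invalid HTML: a stray closing </p>

# Python's built-in parser drops the stray </p> and closes <a>
soup = BeautifulSoup(broken, "html.parser")
repaired = str(soup)

# Other parsers may repair the same input differently, for example:
#   BeautifulSoup(broken, "lxml")      # requires the lxml package
#   BeautifulSoup(broken, "html5lib")  # requires the html5lib package

print(repaired)
```

Naming the parser explicitly keeps results the same across machines, since different parsers can build different trees from invalid markup.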
