Web Scraping Basics [3]: A Brief Introduction to BeautifulSoup4

This article is a repost. 2017-11-17 22:25. Tags: web scraping / Python

Quick start with BeautifulSoup

First, create an HTML document to parse. This one is taken from the official documentation:

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

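To parse this markup, import BeautifulSoup. You can optionally print the result with standard indentation:

from bs4 import BeautifulSoup  # import the BeautifulSoup class
# The constructor accepts a string or an open file handle. A common pattern is to
# fetch the page with the requests library first and then parse the response text.
# Specify the parser explicitly: 'html.parser' is built in, and 'lxml' is faster
# but must be installed separately.
soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.prettify())  # print the parsed document with standard indentation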

<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    Elsie
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>
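
The comment in the code above says the constructor accepts either a string or a file handle. A minimal sketch of both, assuming an in-memory io.StringIO object stands in for a real open file (the markup and variable names here are illustrative, not from the original):

```python
import io
from bs4 import BeautifulSoup

markup = "<html><head><title>Demo</title></head><body><p>Hello</p></body></html>"

# Parse directly from a string.
soup_from_string = BeautifulSoup(markup, "html.parser")

# Parse from a file-like object; StringIO stands in for an opened HTML file.
soup_from_file = BeautifulSoup(io.StringIO(markup), "html.parser")

print(soup_from_string.title.string)  # Demo
print(soup_from_file.p.string)        # Hello
```

Both calls should build the same tree, so which one to use depends only on where the markup comes from.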

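# A few simple ways to navigate the parsed structure:
print(soup.title)  # the <title> tag of the document
print(soup.title.name)  # the name of that tag
print(soup.title.string)  # the text inside <title>
print(soup.title.parent.name)  # the name of the parent of <title>, i.e. head
print(soup.p)  # the first <p> tag in the document
print(soup.p['class'])  # the class attribute of the first <p> tag
print(soup.a)  # the first <a> tag in the document
print(soup.find_all('a'))  # every <a> tag in the document, as a list
print(soup.find(id='link3'))  # the tag whose id attribute is link3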

<title>The Dormouse's story</title>
title
The Dormouse's story
head
<p class="title"><b>The Dormouse's story</b></p>
['title']
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

for link in soup.find_all('a'):
    print(link.get('href'))  # the href attribute of each <a> tag

http://example.com/elsie
http://example.com/lacie
http://example.com/tillie

# print(soup.get_text())
print(soup.text)  # both forms return all the text in the document

The Dormouse's story

The Dormouse's story
Once upon a time there were three little sisters; and their names were
Elsie,
Lacie and
Tillie;
and they lived at the bottom of a well.
...
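
The raw text above keeps the line breaks of the source markup. As a small sketch of how get_text() can tidy this up, the example below passes a separator and strip=True; the sample markup and variable names are assumptions made for illustration:

```python
from bs4 import BeautifulSoup

snippet = '<p class="story">Once upon a time: <a href="#e">Elsie</a>, <a href="#l">Lacie</a></p>'
soup = BeautifulSoup(snippet, "html.parser")

# Default behaviour: concatenate every string exactly as it appears.
plain = soup.get_text()

# Join the strings with a separator and strip whitespace around each piece.
joined = soup.get_text("|", strip=True)

print(plain)   # Once upon a time: Elsie, Lacie
print(joined)  # Once upon a time:|Elsie|,|Lacie
```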

Kinds of objects

An HTML document is parsed into many nodes, and these generally fall into four kinds of objects: Tag, NavigableString, BeautifulSoup, and Comment.
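
To make the four kinds concrete, here is a short sketch that parses one small fragment and prints the type of each object. The fragment itself is made up for illustration:

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, Tag

soup = BeautifulSoup("<p>text<!-- a note --></p>", "html.parser")
tag = soup.p
# The <p> tag has two children: a plain string and a comment.
text, comment = tag.contents

print(type(soup).__name__)     # BeautifulSoup
print(type(tag).__name__)      # Tag
print(type(text).__name__)     # NavigableString
print(type(comment).__name__)  # Comment
```

Comment is a subclass of NavigableString, so code that handles strings generically will also see comments unless it checks for them.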

Tag

A Tag object corresponds to a tag in the HTML document.
This section covers the two main features of a Tag: its name and its attributes.

soup = BeautifulSoup('<b class="boldest">Extremely bold</b>','html.parser')
tag = soup.b
type(tag)

bs4.element.Tag

# Name
# A tag's name is available as .name.
# Changing it is reflected in any markup generated from the current BeautifulSoup object.
print(tag.name)
tag.name = 'blockquote'
print(tag)

b
<blockquote class="boldest">Extremely bold</blockquote>

# Attributes
# Read a single attribute with tag['attr_name'];
# tag.attrs returns all attributes as a dictionary.
print(tag['class'])
print(tag.attrs)
# Attributes can be added, modified, and deleted just like dictionary entries.
tag['class'] = 'verybold'
tag['id'] = 1
print(tag)
# Delete an attribute with del.
del tag['id']

['boldest']
{'class': ['boldest']}
<blockquote class="verybold" id="1">Extremely bold</blockquote>
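
Because attributes behave like dictionary entries, reading a missing attribute with square brackets raises KeyError, while .get() returns None or a default. A minimal sketch, using a made-up <a> tag:

```python
from bs4 import BeautifulSoup

tag = BeautifulSoup('<a href="/home" class="nav">Home</a>', "html.parser").a

present = tag.get("href")             # attribute exists
absent = tag.get("title")             # attribute missing: None
fallback = tag.get("title", "n/a")    # attribute missing, default supplied
has_class = tag.has_attr("class")

try:
    tag["title"]                      # dictionary-style access to a missing key
except KeyError:
    missing_raises = True
else:
    missing_raises = False

print(present, absent, fallback, has_class, missing_raises)
```

Using .get() is usually the safer choice when scraping pages where an attribute may not be present on every tag.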

# Some attributes, such as class, can hold several values; Beautiful Soup returns them as a list.
css_soup = BeautifulSoup('<p class="body strikeout"></p>','html.parser')
print(css_soup.p['class'])

['body', 'strikeout']

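# When a tag is converted back to a string, the values of a multi-valued attribute are joined into one.
rel_soup = BeautifulSoup('<p>Back to the <a rel="index">homepage</a></p>', 'html.parser')
rel_soup.a['rel'] = ['index', 'contents']
print(rel_soup.p)
# A document parsed as XML has no multi-valued attributes.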

<p>Back to the <a rel="index contents">homepage</a></p>

NavigableString

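A string that can be navigated.
Strings are usually contained in tags; Beautiful Soup wraps the text inside a tag in the NavigableString class.

print(type(tag.string))  # the text inside the tag, i.e. the string in <tag>string</tag>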

<class 'bs4.element.NavigableString'>

# The string inside a tag cannot be edited in place, but it can be replaced with another string.
tag.string.replace_with('No longer bold')
print(tag)

<blockquote class="verybold">No longer bold</blockquote>

Navigating the document tree

We reuse the earlier html_doc to show how to move from one part of the document to another.

soup=BeautifulSoup(html_doc,'html.parser')

Child nodes

A tag may contain strings and other tags; all of these are children of that tag.

# To get the first tag with a given name, use it as an attribute: .tag_name
print(soup.a)
# To get every tag with that name, use find_all('tag_name')
print(soup.find_all('a'))
# .contents returns a tag's children as a list
head_tag = soup.head
print(head_tag.contents)
# .children is an iterator over a tag's direct children
for child in head_tag.children:
    print(child)
# .descendants iterates recursively over all of a tag's descendants
for child in head_tag.descendants:
    print(child)
# .string: if a tag has a single NavigableString child, .string returns that child.
# If a tag has several children, it is ambiguous which one .string should refer to, so it is None.

# If a tag contains several strings, iterate over them with .strings; the results may
# include extra spaces and blank lines.
# .stripped_strings strips surrounding whitespace and skips whitespace-only strings.
for string in soup.stripped_strings:
    print(repr(string))
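
The .string rules described in the comments are easier to see on a small fragment. The markup below is a made-up example, not part of html_doc:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    "<div><h1><b>Title</b></h1><p>one <i>two</i> three</p></div>", "html.parser"
)

single = soup.b.string         # <b> has exactly one string child
passed_up = soup.h1.string     # <h1> has one child tag, so its .string is used
ambiguous = soup.p.string      # <p> has several children: None
pieces = list(soup.p.strings)
clean = list(soup.p.stripped_strings)

print(single, passed_up, ambiguous)  # Title Title None
print(pieces)                        # ['one ', 'two', ' three']
print(clean)                         # ['one', 'two', 'three']
```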

Parent nodes

Every tag and every string has a parent: the tag that contains it.

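# .parent returns the parent of an element, for example:
title_tag = soup.title
print(title_tag.parent)
# The string inside <title> also has a parent: the <title> tag itself.
# .parents iterates over all ancestors of an element, up to the root of the document.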

<head><title>The Dormouse's story</title></head>
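
The .parents iterator mentioned in the comments has no example above. A brief sketch on a small made-up document, collecting the name of each ancestor of a link:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    "<html><body><p><a href='#'>link</a></p></body></html>", "html.parser"
)
link = soup.a

# Walk from the <a> tag up to the BeautifulSoup object at the root.
ancestors = [parent.name for parent in link.parents]
string_parent = link.string.parent.name

print(ancestors)      # ['p', 'body', 'html', '[document]']
print(string_parent)  # a
```

The last entry, '[document]', is the name of the BeautifulSoup object itself, whose own .parent is None.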

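Sibling nodes

When a document is pretty-printed, siblings appear at the same indentation level.
The .next_sibling and .previous_sibling attributes return the adjacent sibling nodes.
The .next_siblings and .previous_siblings attributes iterate over all following or preceding siblings of the current node.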
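
A short sketch of the sibling attributes, assuming a compact fragment with no whitespace between tags (in real pages the nearest sibling is often a whitespace string rather than a tag):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<a><b>text1</b><c>text2</c><d>text3</d></a>", "html.parser")

after_b = soup.b.next_sibling        # the <c> tag
before_c = soup.c.previous_sibling   # the <b> tag
before_b = soup.b.previous_sibling   # <b> is the first child: None
following = [sib.name for sib in soup.b.next_siblings]

print(after_b)    # <c>text2</c>
print(before_c)   # <b>text1</b>
print(before_b)   # None
print(following)  # ['c', 'd']
```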

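Going back and forth

The .next_element and .previous_element attributes point to the object that was parsed immediately after or before the current one.
The .next_elements and .previous_elements attributes are iterators (not lists) over everything parsed after or before the current object.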
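
The difference between .next_element and .next_sibling is that the former follows parse order and so descends into a tag's contents first. A minimal sketch on a made-up fragment:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

soup = BeautifulSoup("<p><a>one</a><b>two</b></p>", "html.parser")
first = soup.a

sibling = first.next_sibling   # the next node at the same level: <b>two</b>
element = first.next_element   # the next parsed object: the string inside <a>

# Describe everything parsed after <a>: tag names for tags, text for strings.
described = [el.name if isinstance(el, Tag) else str(el) for el in first.next_elements]

print(repr(sibling))
print(repr(element))
print(described)
```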

Searching the document tree

Finding tags in the parsed document

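# 1. Pass a string
soup.find_all('b')  # find all <b> tags
# 2. Pass a regular expression
import re
for tag in soup.find_all(re.compile('^b')):
    print(tag.name)
# 3. Pass a list
soup.find_all(['a', 'b'])  # find all <a> and <b> tags
# 4. Pass True, which matches every tag (but no strings)
# 5. If no other filter fits, define a function that takes an element as its only
#    argument and returns True when the element matches.
def has_class_but_no_id(tag):
    return tag.has_attr('class') and not tag.has_attr('id')
# Pass the function to find_all()
soup.find_all(has_class_but_no_id)
# When a function is used to filter on an attribute, its argument is the value of
# that attribute, not the whole tag.
def not_lacie(href):
    return href and not re.compile('lacie').search(href)
soup.find_all(href=not_lacie)  # tags that have an href not matching the pattern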
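
The filters above are shown without their results. The sketch below runs the True filter and the two function filters against a shortened, made-up version of the story markup so that the results are easy to check:

```python
import re
from bs4 import BeautifulSoup

html = """<p class="title"><b>Story</b></p>
<p class="story"><a href="http://example.com/elsie" id="link1">Elsie</a>
<a href="http://example.com/lacie" id="link2">Lacie</a></p>"""
soup = BeautifulSoup(html, "html.parser")

# True matches every tag in document order, but no strings.
all_names = [tag.name for tag in soup.find_all(True)]

def has_class_but_no_id(tag):
    return tag.has_attr("class") and not tag.has_attr("id")

class_no_id = [tag["class"] for tag in soup.find_all(has_class_but_no_id)]

def not_lacie(href):
    # href is None for tags without the attribute, so those are rejected too.
    return href and not re.compile("lacie").search(href)

kept = [a["id"] for a in soup.find_all(href=not_lacie)]

print(all_names)    # ['p', 'b', 'p', 'a', 'a']
print(class_no_id)  # [['title'], ['story']]
print(kept)         # ['link1']
```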

The find_all() method

find_all(name, attrs, recursive, string, limit, **kwargs)
Searches all descendants of the current tag and returns those that match the given filters.

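# name argument: find every tag called name; strings are ignored.
soup.find_all('title')
# Keyword arguments: any argument that find_all() does not recognize as one of its own
# is treated as a filter on the tag attribute with that name.
soup.find_all(id='link2')
soup.find_all(href=re.compile('elsie'))
# Several keyword arguments filter on several attributes at once:
soup.find_all(href=re.compile('elsie'), id='link1')
# Some attribute names, such as HTML5 data-* attributes, cannot be used as keyword
# arguments; pass them in a dictionary through attrs instead:
data_soup = BeautifulSoup('<div data-foo="value">foo!</div>', 'html.parser')
data_soup.find_all(attrs={'data-foo': 'value'})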

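Searching by CSS class

# Since Beautiful Soup 4.1, the class_ keyword argument searches for tags with a given CSS class:
soup.find_all('a', class_='sister')
# class_ accepts the same kinds of filters as other arguments, for example a regular expression
soup.find_all(class_=re.compile('itl'))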

The string argument

soup.find_all(string='Elsie')
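
With only string given, find_all() returns the matching strings rather than tags; combined with a tag name, it returns tags whose .string matches. A small sketch on made-up markup:

```python
import re
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    "<p><a>Elsie</a>, <a>Lacie</a> and <a>Tillie</a></p>", "html.parser"
)

exact = soup.find_all(string="Elsie")               # strings equal to 'Elsie'
by_regex = soup.find_all(string=re.compile("ie$"))  # strings ending in 'ie'
tags = soup.find_all("a", string="Lacie")           # <a> tags whose text is 'Lacie'

print(exact)     # ['Elsie']
print(by_regex)  # ['Elsie', 'Lacie', 'Tillie']
print(tags)      # [<a>Lacie</a>]
```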

The limit argument

Limits the number of results returned.
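
For instance, a sketch that stops after the first two matches (the list markup is made up for illustration):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<ul><li>a</li><li>b</li><li>c</li></ul>", "html.parser")

# The search stops as soon as two matching tags have been found.
first_two = soup.find_all("li", limit=2)

print([li.string for li in first_two])  # ['a', 'b']
```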

The recursive argument

To search only the direct children of a tag, pass recursive=False.
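
A minimal sketch of the difference, using a small made-up document in which <title> is a grandchild of <html> rather than a direct child:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<html><head><title>T</title></head></html>", "html.parser")

deep = soup.html.find_all("title")                      # searches all descendants
shallow = soup.html.find_all("title", recursive=False)  # direct children only

print(deep)     # [<title>T</title>]
print(shallow)  # []
```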

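Calling a tag is like calling find_all()

Every tag object, and the BeautifulSoup object itself, can be called like a function; the call behaves the same as find_all().

soup.find_all('a')
soup('a')  # these two lines are equivalent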

The find() method

Takes the same arguments as find_all(), but returns only the first match, or None if nothing matches.
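
A short sketch contrasting the two methods, including what each returns when nothing matches (the markup is made up for illustration):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p><a id='x'>1</a><a id='y'>2</a></p>", "html.parser")

first = soup.find("a")                    # a single Tag
same = soup.find_all("a", limit=1)[0]     # equivalent, but via a list
missing = soup.find("table")              # no match: None
empty = soup.find_all("table")            # no match: empty list

print(first["id"], same["id"])  # x x
print(missing, empty)           # None []
```

Since find() can return None, chained access such as soup.find('table').find('tr') should be guarded when the tag may be absent.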

CSS selectors: the select() method

soup.select('title')  # select the <title> tag
soup.select('p:nth-of-type(3)')

# Find tags nested at any depth under other tags
soup.select('body a')  # <a> tags anywhere inside <body>
# Find tags that are direct children of another tag:
soup.select('head > title')
# Find by id:
soup.select('#link1')
# Find by class:
soup.select('.sister')
soup.select('[class~=sister]')
# Find by the presence of an attribute:
soup.select('a[href]')
# Find by attribute value:
soup.select('a[href="http://example.com/elsie"]')
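
As a sketch of how these selectors behave in practice, the example below runs a few of them against a shortened, made-up version of the story markup. It also uses select_one(), which is not covered above; it returns the first match or None instead of a list:

```python
from bs4 import BeautifulSoup

html = """<html><head><title>Story</title></head><body>
<p class="story"><a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a></p>
</body></html>"""
soup = BeautifulSoup(html, "html.parser")

# Combine class and direct-child selectors.
ids = [a["id"] for a in soup.select("p.story > a.sister")]
# select_one() returns a single tag, or None when nothing matches.
second_text = soup.select_one("#link2").get_text()
no_table = soup.select_one("table")
# Attribute value that ends with a given suffix.
elsie_links = [a["href"] for a in soup.select('a[href$="elsie"]')]

print(ids)          # ['link1', 'link2']
print(second_text)  # Lacie
print(no_table)     # None
print(elsie_links)  # ['http://example.com/elsie']
```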

