Getting Started with Python Web Scraping: BeautifulSoup's find, find_all, and select Methods

Reposted article, 2019-07-26 22:39. Tags: web scraping / python / find_all / select / BeautifulSoup

from bs4 import BeautifulSoup

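lxml: parse the HTML with the lxml parser, e.g. BeautifulSoup(html, 'lxml')  # Note: html5lib is the most lenient parser
find: returns the first matching tag
find_all: returns all matching tags as a list
limit: caps the number of tags returned
attrs: passes the tag attributes to match as a dictionary
string: returns the tag's text as a string when the tag has exactly one child string (directly, or through a single nested tag); otherwise returns None
strings: returns a generator over every string inside the tag
stripped_strings: returns a generator over every string inside the tag, with surrounding whitespace stripped and whitespace-only strings skipped
get_text: returns all the text inside the tag as a single string
contents and children both give a tag's direct children, including strings. contents returns a list; children returns an iterator.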
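To make the difference between contents and children concrete, here is a minimal sketch. The HTML snippet and the variable names are made up for illustration, and the built-in 'html.parser' is used so that no extra parser needs to be installed.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical snippet: one string child followed by two <p> children.
demo = BeautifulSoup("<div>text<p>one</p><p>two</p></div>", "html.parser")
div = demo.find("div")

contents = div.contents          # a list of the direct children, strings included
children = list(div.children)    # children is an iterator, so materialize it to compare

# Keep only the children that are tags (the leading string is skipped).
child_tags = [child.name for child in children if isinstance(child, Tag)]

print(len(contents))
print(child_tags)
```

Both attributes stop at direct children: the strings nested inside the p tags are not part of either result.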

The select method is very similar to find_all, except that it takes a CSS selector.
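As a rough illustration of that similarity, the sketch below runs the same query both ways on a made-up snippet (the markup and names are assumptions, not part of the original example). It also uses select_one, which relates to select the way find relates to find_all.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet, parsed with the built-in parser.
demo = BeautifulSoup("<ul><li class='x'>a</li><li>b</li></ul>", "html.parser")

by_find_all = demo.find_all("li", class_="x")   # keyword-style filter
by_select = demo.select("li.x")                 # the same filter as a CSS selector

same = list(by_find_all) == list(by_select)     # both return the same tag
first = demo.select_one("li")                   # first match only, like find

print(same)
print(first.get_text())
```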

The rest of this article walks through practical examples:

1. Define an HTML snippet and parse it with BeautifulSoup using lxml

from bs4 import BeautifulSoup
html = '''
<table>
<tr class='a1'>
    <td>Job title</td>
    <td>Job category</td>
    <td>Time</td>
</tr>
<tr class='a1'>
    <td><a id='test' class='test' target='_blank' href='https://www.baidu.com/'>Job 1</a></td>
    <td>Category 1</td>
    <td>Time 1</td>
</tr>
<tr class='a2'>
    <td><a id='test' class='test' target='_blank' href='https://www.baidu.com/'>Job 2</a></td>
    <td>Category 2</td>
    <td>Time 2</td>
</tr>
<tr class='a3'>
    <td><a id='test' class='test' target='_blank' href='https://www.baidu.com/'>Job 3</a></td>
    <td>Category 3</td>
    <td>Time 3</td>
</tr>
</table>
<div>
This is a div
<p>
<!-- This is a comment -->
</p>
</div>
'''
soup = BeautifulSoup(html, 'lxml')  # parse the HTML

------------------------------------------------------------ find_all --------------------------------------------------------------------------

2. Get all tr tags

find returns the first matching tag; find_all returns all matching tags as a list.

trs = soup.find_all('tr')  # returns a list
for n, tr in enumerate(trs, 1):
    print('tr tag #{}:'.format(n))
    print(tr)
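One practical consequence of the find / find_all difference is how each behaves when nothing matches. The sketch below uses a made-up table (an assumption for illustration): find gives None, while find_all gives an empty list.

```python
from bs4 import BeautifulSoup

# Hypothetical two-row table, parsed with the built-in parser.
demo = BeautifulSoup(
    "<table><tr><td>1</td></tr><tr><td>2</td></tr></table>", "html.parser"
)

first_tr = demo.find("tr")          # the first match
all_trs = demo.find_all("tr")       # every match

missing_one = demo.find("th")       # no match: None
missing_all = demo.find_all("th")   # no match: empty list

if missing_one is None:
    print("no th tag found")
print(len(all_trs), len(missing_all))
```

Checking for None before using the result of find avoids an AttributeError on pages where the tag is absent.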

3. Get the second tr tag

limit caps the number of tags returned.

second_tr = soup.find_all('tr', limit=2)[1]  # limit=2 stops after two matches; [1] takes the second one
print(second_tr)
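Indexing the result directly raises an IndexError when fewer tags exist than expected. A slightly more defensive variant might look like the sketch below; the table and the variable names are hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical three-row table, parsed with the built-in parser.
demo = BeautifulSoup(
    "<table><tr><td>1</td></tr><tr><td>2</td></tr><tr><td>3</td></tr></table>",
    "html.parser",
)

first_two = demo.find_all("tr", limit=2)                # at most two results
second = first_two[1] if len(first_two) > 1 else None   # guard against short results

only_first = demo.find_all("tr", limit=1)               # one-element list; find returns the tag itself

print(len(first_two))
print(second.get_text() if second is not None else "no second row")
```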

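4. Get the tr tags with class='a1'

  a. Method 1: class_ (the trailing underscore is needed because class is a reserved word in Python)

trs = soup.find_all('tr', class_='a1')
for n, tr in enumerate(trs, 1):
    print("tr tag #{} with class='a1':".format(n))
    print(tr)

  b. Method 2: attrs, which takes the tag attributes as a dictionary

trs = soup.find_all('tr', attrs={'class': 'a1'})
for n, tr in enumerate(trs, 1):
    print("tr tag #{} with class='a1':".format(n))
    print(tr)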
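Because class is a multi-valued attribute, class_ matches a tag when the given name is any one of its classes. The sketch below shows this on a made-up snippet; the class name highlight is an assumption for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical rows: the first one carries two classes.
demo = BeautifulSoup(
    "<table><tr class='a1 highlight'></tr><tr class='a2'></tr></table>",
    "html.parser",
)

by_class = demo.find_all("tr", class_="a1")   # matches although the tag has a second class
classes = by_class[0]["class"]                # the class attribute is returned as a list

print(len(by_class))
print(classes)
```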

5. Extract all a tags with id='test' and class='test'

  Method 1: class_

alist = soup.find_all('a', id='test', class_='test')
for n, a in enumerate(alist, 1):
    print("a tag #{} with id='test' and class='test':".format(n))
    print(a)

  Method 2: attrs

alist = soup.find_all('a', attrs={'id': 'test', 'class': 'test'})
for n, a in enumerate(alist, 1):
    print("a tag #{} with id='test' and class='test':".format(n))
    print(a)
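The attrs dictionary is the only option for attribute names that are not valid Python keyword arguments, such as names containing a hyphen. A minimal sketch, assuming a made-up data-id attribute that does not appear in the original example:

```python
from bs4 import BeautifulSoup

# Hypothetical snippet with a data-* attribute.
demo = BeautifulSoup(
    "<a data-id='7' href='#'>x</a><a data-id='8' href='#'>y</a>", "html.parser"
)

# data-id=... cannot be written as a keyword argument, so pass it through attrs.
by_data = demo.find_all("a", attrs={"data-id": "7"})

print([a.get_text() for a in by_data])
```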

6. Get the href attribute of every a tag

alist = soup.find_all('a')

# Method 1: index the tag like a dictionary
for a in alist:
    href = a['href']
    print(href)

# Method 2: go through the attrs dictionary
for a in alist:
    href = a.attrs['href']
    print(href)
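Both forms above raise a KeyError if an a tag has no href. On real pages that is common, so a tolerant variant could look like the sketch below (the markup is made up for illustration): get returns None for a missing attribute, and href=True restricts the search to tags that have one.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: the second link has no href attribute.
demo = BeautifulSoup(
    "<a href='https://www.baidu.com/'>with</a><a>without</a>", "html.parser"
)

hrefs = [a.get("href") for a in demo.find_all("a")]           # None where href is missing
only_links = [a["href"] for a in demo.find_all("a", href=True)]  # skip tags without href

print(hrefs)
print(only_links)
```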

7. Get all the job information (all text)

string returns the text directly inside a tag as a string.

Note: the first tr is the header row, so it is skipped; extraction starts from the second tr.

trs = soup.find_all('tr')[1:]  # skip the header row
jobs = []
for tr in trs:
    job = {}
    tds = tr.find_all('td')
    job['td1'] = tds[0].string  # string gives the text of the td
    job['td2'] = tds[1].string
    job['td3'] = tds[2].string
    jobs.append(job)
print(jobs)
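The first td in each row works with string only because it holds a single a tag that in turn holds a single string. When a tag has several children, string is None. The sketch below shows both cases on a made-up snippet, with get_text as a fallback.

```python
from bs4 import BeautifulSoup

# Hypothetical cells: one with a single nested tag, one with mixed content.
demo = BeautifulSoup(
    "<table><tr><td><a>Job 1</a></td><td>New <b>York</b></td></tr></table>",
    "html.parser",
)
single, mixed = demo.find_all("td")

single_text = single.string      # text is reachable through the single child chain
mixed_text = mixed.string        # several children, so this is None
fallback = mixed.get_text()      # get_text concatenates all nested strings

print(single_text, mixed_text, fallback)
```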

8. Get all the strings inside a tag

strings returns a generator over every string inside the tag.

trs = soup.find_all('tr')[1:]
for tr in trs:
    infos = list(tr.strings)  # every string, including newlines and spaces
    print(infos)

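9. Get all the non-blank strings

stripped_strings returns a generator over every string inside the tag, with whitespace stripped.

trs = soup.find_all('tr')[1:]
for tr in trs:
    infos = list(tr.stripped_strings)  # non-blank strings only, without newlines or padding spaces
    print(infos)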
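To see the two generators side by side, the following sketch runs both on one small made-up row (the markup is an assumption for illustration).

```python
from bs4 import BeautifulSoup

# Hypothetical row with indentation and padded text.
demo = BeautifulSoup(
    "<table><tr>\n  <td> Job 1 </td>\n  <td>Category 1</td>\n</tr></table>",
    "html.parser",
)
tr = demo.find("tr")

raw = list(tr.strings)             # keeps whitespace-only strings and padding
clean = list(tr.stripped_strings)  # drops blank strings and strips the rest

print(raw)
print(clean)
```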

# Use stripped_strings to collect all the job information
trs = soup.find_all('tr')[1:]  # skip the header row
jobs = []
for tr in trs:
    job = {}
    infos = list(tr.stripped_strings)
    job['Job'] = infos[0]
    job['Category'] = infos[1]
    job['Time'] = infos[2]
    jobs.append(job)
print(jobs)
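Instead of hard-coding the dictionary keys, the header row itself can supply them. This is one possible variation rather than the original author's approach, sketched on a made-up two-column table.

```python
from bs4 import BeautifulSoup

# Hypothetical table: a header row followed by data rows.
demo = BeautifulSoup(
    "<table>"
    "<tr><td>Title</td><td>Category</td></tr>"
    "<tr><td><a>Job 1</a></td><td>Category 1</td></tr>"
    "<tr><td><a>Job 2</a></td><td>Category 2</td></tr>"
    "</table>",
    "html.parser",
)
rows = demo.find_all("tr")

headers = list(rows[0].stripped_strings)   # keys come from the first row
jobs = [dict(zip(headers, row.stripped_strings)) for row in rows[1:]]

print(jobs)
```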

10. Get job information with get_text

get_text returns all the text inside a tag as a single string.

tr = soup.find_all('tr')[1]  # the second row, i.e. the first job
text = tr.get_text()  # returns a string
print(text)
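The plain get_text call keeps the newlines between cells. get_text also accepts a separator and a strip flag, which gives a tidier one-line result; the sketch below uses a made-up row and a "|" separator chosen only for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical row with newlines between the cells.
demo = BeautifulSoup(
    "<table><tr>\n<td>Job 1</td>\n<td>Category 1</td>\n</tr></table>",
    "html.parser",
)
tr = demo.find("tr")

plain = tr.get_text()                   # newlines between cells are kept
joined = tr.get_text("|", strip=True)   # strip each piece, then join with the separator

print(repr(plain))
print(joined)
```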

------------------------------------------------------------ select --------------------------------------------------------------------------

11. Get all tr tags

trs = soup.select('tr')
for tr in trs:
    print('tr tag:', tr)

12. Get the second tr tag

second_tr = soup.select('tr')[1]
print(second_tr)

13. Get all tr tags with class="a1"

# Method 1:
trs = soup.select('tr.a1')  # class selector
for tr in trs:
    print(tr)

# Method 2:
trs = soup.select('tr[class="a1"]')  # attribute selector on class
for tr in trs:
    print(tr)
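The two selectors agree on this page, but they are not interchangeable in general: following CSS semantics, tr.a1 matches any tr that has a1 among its classes, while tr[class="a1"] compares against the whole attribute value. A small sketch on made-up rows (the class name odd is an assumption):

```python
from bs4 import BeautifulSoup

# Hypothetical rows: the second one has two classes.
demo = BeautifulSoup(
    "<table><tr class='a1'></tr><tr class='a1 odd'></tr></table>", "html.parser"
)

by_class = demo.select("tr.a1")            # a1 is one of the classes
by_attr = demo.select('tr[class="a1"]')    # class attribute is exactly "a1"
by_word = demo.select('tr[class~="a1"]')   # a1 appears as a whitespace-separated word

print(len(by_class), len(by_attr), len(by_word))
```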

14. Extract the href attribute of every a tag

# Method 1:
alist = soup.select('a')
for a in alist:
    print(a['href'])

# Method 2:
alist = soup.select('a')
for a in alist:
    print(a.attrs['href'])
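With select, the check for a missing href can be moved into the selector itself. The sketch below, on a made-up snippet, uses the a[href] attribute selector so that only links carrying an href are returned.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: the second link has no href attribute.
demo = BeautifulSoup(
    "<a href='https://www.baidu.com/'>one</a><a>two</a>", "html.parser"
)

linked = demo.select("a[href]")       # only a tags that have an href attribute
hrefs = [a["href"] for a in linked]   # safe to index: every match has href

print(hrefs)
```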

15. Get all the job information

trs = soup.select('tr')  # includes the header row
for tr in trs:
    print(list(tr.stripped_strings))
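CSS combinators allow a more targeted query than walking each row. As a sketch on a made-up table, the child combinator below picks out just the job links, which also skips the header row because it contains no a tag.

```python
from bs4 import BeautifulSoup

# Hypothetical table, parsed with the built-in parser (which adds no tbody).
demo = BeautifulSoup(
    "<table>"
    "<tr><td>Title</td></tr>"
    "<tr><td><a href='#'>Job 1</a></td></tr>"
    "<tr><td><a href='#'>Job 2</a></td></tr>"
    "</table>",
    "html.parser",
)

# tr > td > a: an a tag that is a direct child of a td that is a direct child of a tr.
titles = [a.get_text() for a in demo.select("tr > td > a")]

print(titles)
```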

Corrections and additions are welcome!
