BeautifulSoup4: using the find_all and get methods to extract information

This article is a repost. 2017-02-12 01:23 python-BeautifulSoup4

Chinese documentation
HTML source of the example page from the official tutorial:

<html>
    <head>
        <title>Page title</title>
    </head>
    <body>
        <p id="firstpara" align="center">
        This is paragraph<b>one</b>.
        </p>
        <p id="secondpara" align="blah">
        This is paragraph<b>two</b>.
        </p>
     </body>
</html>

Parameters of the find method and what they mean

find(name=None, attrs={}, recursive=True, text=None, **kwargs)
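As a rough illustration of the signature above, the sketch below runs find and find_all against the tutorial page shown earlier. The variable names are illustrative only, and "html.parser" is just one possible parser choice.

```python
from bs4 import BeautifulSoup

html = """
<html>
    <head>
        <title>Page title</title>
    </head>
    <body>
        <p id="firstpara" align="center">
        This is paragraph<b>one</b>.
        </p>
        <p id="secondpara" align="blah">
        This is paragraph<b>two</b>.
        </p>
     </body>
</html>
"""

soup = BeautifulSoup(html, "html.parser")

first_p = soup.find('p')        # find returns only the first match
all_p = soup.find_all('p')      # find_all returns every match as a list
missing = soup.find('table')    # find returns None when nothing matches

# recursive=False restricts the search to direct children of the tag.
# The <p> tags are children of <body>, not of <html>.
under_html = soup.html.find_all('p', recursive=False)
under_body = soup.body.find_all('p', recursive=False)

print(first_p['id'])                       # firstpara
print(len(all_p), missing)                 # 2 None
print(len(under_html), len(under_body))    # 0 2
```

Because find returns None on a miss, it is usually worth checking the result before reading attributes from it.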

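1. Searching by tag name:

find(tagname)          # find the tag named tagname, e.g. find('head')
find(list)             # find a tag whose name is in the list, e.g. find(['head', 'body'])
find(dict)             # find a tag whose name is a key of the dict, e.g. find({'head': True, 'body': True}); this is a BeautifulSoup 3 form, and the bs4 documentation lists a list for this purpose
find(re.compile(''))   # find a tag whose name matches the regex, e.g. find(re.compile('^p')) matches names starting with p
find(function)         # find a tag for which the function returns True, e.g. find(lambda tag: len(tag.name) == 1) matches one-character tag names
find(True)             # match any tag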
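The name filters listed above can be tried out on the tutorial page. The following is a minimal sketch, assuming the bs4 behaviour where a function filter receives the Tag object; find_all is used instead of find so that every match is visible, and the variable names are illustrative.

```python
import re
from bs4 import BeautifulSoup

html = """
<html>
    <head>
        <title>Page title</title>
    </head>
    <body>
        <p id="firstpara" align="center">
        This is paragraph<b>one</b>.
        </p>
        <p id="secondpara" align="blah">
        This is paragraph<b>two</b>.
        </p>
     </body>
</html>
"""

soup = BeautifulSoup(html, "html.parser")

# A plain string matches one tag name exactly.
by_string = soup.find('head').name

# A list matches any of the listed names, in document order.
by_list = [t.name for t in soup.find_all(['title', 'b'])]

# A compiled regex is searched against each tag name.
by_regex = [t.name for t in soup.find_all(re.compile('^p'))]

# A function is called with each tag and keeps those returning True.
by_function = [t.name for t in soup.find_all(lambda tag: len(tag.name) == 1)]

# True matches every tag in the document.
every_tag = [t.name for t in soup.find_all(True)]

print(by_string)      # head
print(by_list)        # ['title', 'b', 'b']
print(by_regex)       # ['p', 'p']
print(by_function)    # ['p', 'b', 'p', 'b']
print(len(every_tag)) # 8
```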

2. Searching by attrs (attributes):

find(id='xxx')                                        # find the tag whose id attribute is xxx
find(attrs={'id': re.compile('xxx'), 'align': 'xxx'}) # id matches the regex and align is xxx
find(attrs={'id': True, 'align': None})               # has an id attribute but no align attribute
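The attribute filters can be checked against the same tutorial page. This is a sketch under a few assumptions: the regex 'para$' and the variable names are made up for illustration, and the "has id but no align" case is written with Tag.has_attr inside a function filter, which is an alternative spelling rather than the None form listed above.

```python
import re
from bs4 import BeautifulSoup

html = """
<html>
    <head>
        <title>Page title</title>
    </head>
    <body>
        <p id="firstpara" align="center">
        This is paragraph<b>one</b>.
        </p>
        <p id="secondpara" align="blah">
        This is paragraph<b>two</b>.
        </p>
     </body>
</html>
"""

soup = BeautifulSoup(html, "html.parser")

# Keyword form: exact match on one attribute.
first = soup.find(id='firstpara')

# attrs dict: every condition must hold (id matches the regex AND align == 'blah').
regex_and_value = soup.find_all(attrs={'id': re.compile('para$'), 'align': 'blah'})

# True matches any value, so this selects every tag that has an id at all.
any_id = soup.find_all(attrs={'id': True})

# Function filter: tags with an id but without align (none on this page).
id_without_align = soup.find_all(
    lambda tag: tag.has_attr('id') and not tag.has_attr('align')
)

print(first['align'])                          # center
print([t['id'] for t in regex_and_value])      # ['secondpara']
print([t['id'] for t in any_id])               # ['firstpara', 'secondpara']
print(len(id_without_align))                   # 0
```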

Using BeautifulSoup4 to scrape book IDs from Douban

The code is as follows:

import requests
from bs4 import BeautifulSoup as bs

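# Start from one listing URL under Douban's '編程' (programming) tag and collect book IDs
url = 'https://book.douban.com/tag/編程?start=20&type=T'
res = requests.get(url)  # send the request
#print(res.encoding)    # shows the encoding that requests detected for the page
#res.encoding = 'utf-8'   # use with the line above: set the encoding explicitly if the text comes out garbled
html = res.text
#print(html)

IDs = []
soup = bs(html, "html.parser")     # build a BeautifulSoup object from the page source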
items = soup.find_all('a',attrs={'class':'nbg'})
#print(items)

for i in items:
    idl = i.get('href')
    #print(idl)
    id = idl.split('/')[4]
    print(id)
    IDs.append(id)
print('Number of book IDs collected on this page: %d' % len(IDs))

The first part fetches the page source using the requests module.
The second part parses the page with BeautifulSoup to extract the information we need.
- ```
soup  = bs(html,"html.parser")
```
  This line creates a variable that holds the page source after BeautifulSoup has parsed it.
- ```
items = soup.find_all('a',attrs={'class':'nbg'})
```
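  This line searches for a tags. A page contains many a tags and we do not want all of them, so we add a condition: only a tags that also have the attribute class='nbg' are kept. items is a list.
- All attribute conditions go into the attrs dictionary. When an attribute's value is not fixed, use 'attribute name': True to match any value.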
- ```
for i in items:
    idl = i.get('href')
```
  This gets the value of the 'href' attribute from each a tag that met the condition.
- ```
id = idl.split('/')[4]
```
  The 'href' value is a link, but we only need the ID, so we split the link on '/' and take the ID segment.
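The steps above can be sketched without any network access. In the example below, sample_html, the book IDs, and the helper extract_ids are all made up for illustration; the markup only imitates the a class="nbg" links that the walkthrough relies on, and the real Douban page may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical markup imitating a listing page; the IDs are invented.
sample_html = """
<ul>
  <li><a class="nbg" href="https://book.douban.com/subject/1000001/">Book A</a></li>
  <li><a class="nbg" href="https://book.douban.com/subject/1000002/">Book B</a></li>
  <li><a class="other" href="https://book.douban.com/tag/">All tags</a></li>
  <li><a class="nbg">Entry without a link</a></li>
</ul>
"""


def extract_ids(html):
    """Collect the fifth '/'-separated segment of each a.nbg link."""
    soup = BeautifulSoup(html, "html.parser")
    ids = []
    for a in soup.find_all('a', attrs={'class': 'nbg'}):
        href = a.get('href')      # get returns None if the attribute is missing
        if href is None:
            continue
        # 'https://book.douban.com/subject/1000001/'.split('/') gives
        # ['https:', '', 'book.douban.com', 'subject', '1000001', '']
        parts = href.split('/')
        if len(parts) > 4 and parts[4]:
            ids.append(parts[4])
    return ids


book_ids = extract_ids(sample_html)
print(book_ids)   # ['1000001', '1000002']
```

Using get rather than indexing with a['href'] avoids a KeyError on tags that lack the attribute, and the length check guards against links that are shorter than expected.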
For a complete crawler example, see: 智聯招聘 (Zhaopin) crawler
For BeautifulSoup's select selector method, see the crawler example: 前程無憂 (51job) crawler
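For comparison, the same a tags can be picked out with select and a CSS selector. This is only a sketch on made-up markup (sample_html and its IDs are invented), intended to show that 'a.nbg' selects the same tags as find_all('a', attrs={'class': 'nbg'}).

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the IDs are invented for illustration.
sample_html = """
<ul>
  <li><a class="nbg" href="https://book.douban.com/subject/1000001/">Book A</a></li>
  <li><a class="nbg" href="https://book.douban.com/subject/1000002/">Book B</a></li>
  <li><a class="other" href="https://book.douban.com/tag/">All tags</a></li>
  <li><a class="nbg">Entry without a link</a></li>
</ul>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# find_all with an attrs dictionary, as in the walkthrough.
via_find_all = [a.get('href') for a in soup.find_all('a', attrs={'class': 'nbg'})]

# select with the CSS class selector; same tags, same order.
via_select = [a.get('href') for a in soup.select('a.nbg')]

# The CSS attribute selector [href] also drops tags that have no href.
with_href = [a['href'] for a in soup.select('a.nbg[href]')]

print(via_find_all == via_select)   # True
print(len(with_href))               # 2
```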
