Learning Python Modules: bs4

Reposted article. 2018-02-07 18:21. Category: web crawlers.

1. Installing bs4

I am on Ubuntu 14.04, where apt-get is all that is needed:

sudo apt-get install python-bs4

2. Installing a parser

Beautiful Soup supports the HTML parser in Python's standard library as well as several third-party parsers, one of which is lxml.

sudo apt-get install python-lxml

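3. How to use it

Pass a document to the BeautifulSoup constructor to get a document object. The argument can be either a string or an open file handle.

from bs4 import BeautifulSoup
# Parse from an open file handle
soup = BeautifulSoup(open("index.html"))
# Parse from a string
soup = BeautifulSoup("<html>data</html>")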
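The constructor also accepts the name of a parser as its second argument. The following is a minimal sketch, assuming the standard-library parser "html.parser" (so nothing beyond bs4 has to be installed) and a made-up markup string. io.StringIO stands in for an open file so that the example runs without an index.html on disk.

```python
import io

from bs4 import BeautifulSoup

# Hypothetical markup, used only for illustration.
markup = "<html><body><p>data</p></body></html>"

# Naming the parser explicitly avoids depending on whichever parser
# happens to be installed; "html.parser" ships with Python.
soup_from_string = BeautifulSoup(markup, "html.parser")

# Any object with a read() method is treated like an open file handle.
soup_from_file = BeautifulSoup(io.StringIO(markup), "html.parser")

text_from_string = soup_from_string.p.get_text()
text_from_file = soup_from_file.p.get_text()
print(text_from_string)
print(text_from_file)
```

Both calls build the same tree, so the two printed values are identical.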

4. Kinds of objects

Beautiful Soup turns a complex HTML document into a tree of Python objects. Every node is one of four kinds: Tag, NavigableString, BeautifulSoup, or Comment.

Tag

A Tag object corresponds to a tag in the original XML or HTML document:

soup = BeautifulSoup('<b class="boldest">Extremely bold</b>')
tag = soup.b
type(tag)
# <class 'bs4.element.Tag'>

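Every tag has a name, available as .name:

tag.name
# u'b'

A tag can have any number of attributes. Read one by treating the tag like a dictionary, or get them all from .attrs. Because class can hold several values, Beautiful Soup returns it as a list:

tag['class']
# [u'boldest']

tag.attrs
# {u'class': [u'boldest']}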
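To make the difference between single-valued and multi-valued attributes concrete, here is a small sketch. The id="title" attribute is an assumption added for illustration and is not part of the example above; "html.parser" is likewise an assumed parser choice.

```python
from bs4 import BeautifulSoup

# The id attribute is hypothetical, added only to show a single-valued attribute.
soup = BeautifulSoup('<b class="boldest" id="title">Extremely bold</b>', "html.parser")
tag = soup.b

# A single-valued attribute such as id comes back as a plain string.
tag_id = tag["id"]

# class can hold several values, so it comes back as a list.
tag_classes = tag["class"]

# get() returns None for a missing attribute instead of raising KeyError.
missing = tag.get("href")

print(tag_id)
print(tag_classes)
print(missing)
```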

NavigableString

Strings are usually contained inside a tag. Beautiful Soup wraps that text in a NavigableString:

tag.string
# u'Extremely bold'
type(tag.string)
# <class 'bs4.element.NavigableString'>

BeautifulSoup

The BeautifulSoup object represents the whole document.

soup
# <html><body><b class="boldest">Extremely bold</b></body></html>
type(soup)
# <class 'bs4.BeautifulSoup'>

Comment

A Comment object usually represents a comment in the document. It is a special kind of NavigableString.
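The article gives no example for this type, so here is a minimal sketch. The markup and the comment text are made up, and "html.parser" is an assumed parser choice.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

# Hypothetical markup whose only content is an HTML comment.
soup = BeautifulSoup("<b><!--a hidden note--></b>", "html.parser")

# .string returns the single child of <b>, which here is a Comment.
comment = soup.b.string
is_comment = isinstance(comment, Comment)

print(type(comment))
print(comment)
```

Checking isinstance(node, Comment) is a simple way to skip comments when collecting the visible text of a page.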

5. Navigating the tree

Tag names

A tag can be reached by using its name as an attribute, and the lookups can be chained.

soup.head
# <head><title>The Dormouse's story</title></head>
soup.title
# <title>The Dormouse's story</title>
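The document these examples query is not shown in this article. The outputs are consistent with the "three sisters" sample used in the Beautiful Soup documentation, so the sketch below assumes that document (the name html_doc is my own choice) and the standard-library "html.parser". It also shows the chained lookups mentioned above.

```python
from bs4 import BeautifulSoup

# Assumed sample document; the article itself never defines it.
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body></html>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# Each attribute lookup returns a Tag, so lookups can be chained.
title_text = soup.head.title.string
first_bold = soup.body.b

print(title_text)
print(first_bold)
```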

Attribute access returns only the first tag with that name:

soup.a
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

To get all <a> tags, use find_all():

soup.find_all('a')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

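6. Searching the tree

The two most important search methods in Beautiful Soup are find() and find_all(). find() returns the first match, while find_all() returns a list of every match.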
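The two methods differ mainly in what they return, especially when nothing matches. A short sketch on a made-up fragment, assuming "html.parser":

```python
from bs4 import BeautifulSoup

# Hypothetical fragment with two links and no tables.
markup = '<p><a id="link1">Elsie</a><a id="link2">Lacie</a></p>'
soup = BeautifulSoup(markup, "html.parser")

# find_all() returns a list of every match, or an empty list.
all_links = soup.find_all("a")
no_tables = soup.find_all("table")

# find() returns only the first match, or None.
first_link = soup.find("a")
no_table = soup.find("table")

print(len(all_links))
print(first_link["id"])
print(no_table)
```

Because find() can return None, check its result before reading attributes from it.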

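Filters

The simplest filter is a string, which matches tags with exactly that name:

soup.find_all('b')
# [<b>The Dormouse's story</b>]

Pass a regular expression to match tag names against a pattern:

import re
for tag in soup.find_all(re.compile("^b")):
    print(tag.name)
# body
# b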

Pass a list to match any item in it:

soup.find_all(["a", "b"])
# [<b>The Dormouse's story</b>,
#  <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

If none of these filters fits, you can define your own function that takes a tag and returns True when it matches.
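As an illustration of a function filter, the sketch below keeps tags that have a class attribute but no id. The markup and the function name has_class_but_no_id are chosen for this example, and "html.parser" is an assumed parser.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: only the first <p> has a class and no id.
markup = (
    '<p class="title"><b>The story</b></p>'
    '<p class="story" id="intro">Once upon a time</p>'
    '<a id="link1">Elsie</a>'
)
soup = BeautifulSoup(markup, "html.parser")


def has_class_but_no_id(tag):
    """Return True for tags that define class but not id."""
    return tag.has_attr("class") and not tag.has_attr("id")


# find_all() calls the function once per tag and keeps the tags
# for which it returns True.
matches = soup.find_all(has_class_but_no_id)
names = [tag.name for tag in matches]
print(names)
```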

find_all()

find_all( name , attrs , recursive , text , **kwargs )
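The signature lists attrs and recursive, which the text does not describe further. Here is a small sketch of both, assuming a made-up document with a hypothetical data-role attribute and the "html.parser" parser.

```python
from bs4 import BeautifulSoup

# Hypothetical document used only for this example.
markup = (
    "<html><head><title>Story</title></head>"
    '<body><div data-role="note">hi</div></body></html>'
)
soup = BeautifulSoup(markup, "html.parser")

# attrs takes a dictionary, which helps when the attribute name
# (here data-role) is not a valid Python keyword argument.
notes = soup.find_all("div", attrs={"data-role": "note"})

# By default the search covers all descendants of <html>.
deep = soup.html.find_all("title")

# recursive=False looks at direct children only; <title> sits
# inside <head>, so nothing is found.
shallow = soup.html.find_all("title", recursive=False)

print(len(notes))
print(len(deep))
print(len(shallow))
```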

The name argument

The name argument finds every tag whose name is name, for example title, head, body, or p.

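Keyword arguments

Any keyword argument that is not one of the built-in search arguments is treated as a filter on the tag attribute of that name. For example, passing an argument named id makes Beautiful Soup filter on each tag's "id" attribute.

soup.find_all(id='link2')
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]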

Passing href makes Beautiful Soup filter on each tag's "href" attribute:

soup.find_all(href=re.compile("elsie"))
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

When searching on a named attribute, the value can be a string, a regular expression, a list, or True.

The following example finds every tag in the tree that has an id attribute, whatever its value:

soup.find_all(id=True)
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

Several keyword arguments can be combined to filter on more than one attribute at once:

soup.find_all(href=re.compile("elsie"), id='link1')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

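Searching by CSS class

class is a reserved word in Python, so Beautiful Soup uses the keyword argument class_ instead.

Like other keyword arguments, class_ accepts different kinds of filters: a string, a regular expression, a function, or True.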
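A short sketch of class_ with three of those filter types. The markup is made up for the example (including the extra class "big"), and "html.parser" is an assumed parser.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical fragment; the second link carries two classes.
markup = (
    '<p class="title">T</p>'
    '<a class="sister" id="link1">Elsie</a>'
    '<a class="sister big" id="link2">Lacie</a>'
    '<a id="link3">Tillie</a>'
)
soup = BeautifulSoup(markup, "html.parser")

# A string matches any tag that has that class among its classes.
sisters = soup.find_all("a", class_="sister")
sister_ids = [tag["id"] for tag in sisters]

# A regular expression is tested against the class values.
by_regex = soup.find_all(class_=re.compile("^ti"))
regex_names = [tag.name for tag in by_regex]

# True matches every tag that has a class attribute at all.
any_class = soup.find_all(class_=True)

print(sister_ids)
print(regex_names)
print(len(any_class))
```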

The text argument

The text argument searches the strings in the document instead of the tags. Like name, it accepts a string, a regular expression, a list, or True.
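The sketch below searches strings on a made-up fragment. Note one assumption: Beautiful Soup 4.4.0 and later accept this argument under the name string, and the example uses that spelling; on the older version installed above, text is the name to use. "html.parser" is also an assumed parser choice.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical fragment used only for this example.
markup = "<p><a>Elsie</a>, <a>Lacie</a> and <a>Tillie</a></p>"
soup = BeautifulSoup(markup, "html.parser")

# Assumes bs4 >= 4.4.0, where the argument is spelled string.
# An exact string returns the matching strings, not the tags.
exact = [str(s) for s in soup.find_all(string="Elsie")]

# A list matches any of its items; results keep document order.
from_list = [str(s) for s in soup.find_all(string=["Tillie", "Elsie"])]

# A regular expression is searched within each string.
by_regex = [str(s) for s in soup.find_all(string=re.compile("ie$"))]

# Combined with a tag name, the tags themselves are returned.
tags = soup.find_all("a", string="Elsie")

print(exact)
print(from_list)
print(by_regex)
print(tags)
```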

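Calling a tag is like calling find_all()

find_all() is probably the most used search method in Beautiful Soup, so it has a shortcut. A BeautifulSoup object or a Tag object can be called like a function, and the result is the same as calling find_all() on that object. These two lines are equivalent:

soup.find_all("a")
soup("a")

So are these two:

soup.title.find_all(text=True)
soup.title(text=True)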

CSS selectors

Beautiful Soup supports most CSS selectors [6]. Pass a selector string to the .select() method of a Tag or BeautifulSoup object to find tags using CSS selector syntax:

soup.select("title")
# [<title>The Dormouse's story</title>]
soup.select("p:nth-of-type(3)")
# [<p class="story">...</p>]

Find tags nested beneath other tags, at any depth:

soup.select("body a")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
soup.select("html head title")
# [<title>The Dormouse's story</title>]

Find tags that are direct children of another tag [6]:

soup.select("head > title")
# [<title>The Dormouse's story</title>]
soup.select("p > a")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
soup.select("p > a:nth-of-type(2)")
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
soup.select("p > #link1")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
soup.select("body > a")
# []

Find the siblings of a tag:

soup.select("#link1 ~ .sister")
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
soup.select("#link1 + .sister")
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

Find tags by CSS class name:

soup.select(".sister")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
soup.select("[class~=sister]")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

Find tags by id:

soup.select("#link1")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
soup.select("a#link2")
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

Find tags by whether an attribute is present:

soup.select('a[href]')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

Find tags by attribute value:

soup.select('a[href="http://example.com/elsie"]')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
soup.select('a[href^="http://example.com/"]')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
soup.select('a[href$="tillie"]')
# [<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
soup.select('a[href*=".com/el"]')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
