HTML Parsing Methods in Python

Reposted article. 2019-05-17 15:03. Category: Python

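I. The powerful BeautifulSoup: BeautifulSoup is a Python library for extracting data from HTML or XML files. Working with the parser of your choice, it provides idiomatic ways to navigate, search, and modify a document. In day-to-day Python work, its search and extraction features see the most use; the modification features are rarely needed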

1. Install BeautifulSoup

pip3 install beautifulsoup4

2. Install lxml, a third-party HTML parser

pip3 install lxml

3. Install html5lib, an HTML parser implemented in pure Python

pip3 install html5lib

II. Using BeautifulSoup:

1. Import the bs4 library

from bs4 import BeautifulSoup  # import the BeautifulSoup class from bs4

2. Create a string containing HTML code

html_str = """
<html><head><title>The Dormouse's story</title></head>
<body>
The Dormouse's story
Once upon a time there were three little sisters; and their names were
"""

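3. Create a BeautifulSoup object

(1) Directly from a string

soup = BeautifulSoup(html_str, 'lxml')  # 'lxml' is the parser here; 'html.parser' also works

print(soup.prettify())  # print the contents of the soup object, indented

(2) From an existing file

with open('/home/index.html') as f:
    soup = BeautifulSoup(f, features='html.parser')  # 'html.parser' is the parser here; 'lxml' also works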
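Both ways of creating a soup object can be sketched as follows. The sample markup and the temporary file are hypothetical stand-ins, and 'html.parser' is used only because it ships with Python and needs no extra install.

```python
import os
import tempfile

from bs4 import BeautifulSoup

# Hypothetical sample markup, not taken from the article.
html_str = "<html><head><title>Demo page</title></head><body><p>Hello</p></body></html>"

# (1) Build the soup directly from a string.
soup = BeautifulSoup(html_str, "html.parser")
title_text = soup.title.string

# (2) Build the soup from a file; a temporary file stands in for a real page.
with tempfile.NamedTemporaryFile("w", suffix=".html", delete=False, encoding="utf-8") as tmp:
    tmp.write(html_str)
    path = tmp.name

with open(path, encoding="utf-8") as f:
    soup_from_file = BeautifulSoup(f, "html.parser")
os.remove(path)

file_title_text = soup_from_file.title.string
print(title_text, file_title_text)
```

Opening the file in a with block closes the handle as soon as parsing is done.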

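4. Kinds of BeautifulSoup objects: BeautifulSoup converts a complex HTML document into a tree in which every node is a Python object

(1) BeautifulSoup: represents the entire document. Most of the time it can be treated as a special kind of Tag. Because the BeautifulSoup object does not correspond to a real HTML or XML tag, it has no attributes of its own, and its .name is the special value '[document]'

(2) Tag: corresponds to a tag in the original XML or HTML document; in plain terms, a markup element

For example (each expression returns the first matching tag):

Extract title: print(soup.title)

Extract a: print(soup.a)

Extract p: print(soup.p)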

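A Tag has two important properties: name and attributes. Every Tag has a name, which is read through .name

print(soup.title.name)

Tag attributes are read and written the same way as dictionary entries

For example, for a p tag that has a class attribute and contains the text Hello World:

print(soup.p['class'])

Attributes can also be reached with dot access; .attrs returns all of a Tag's attributes as a dictionary

print(soup.p.attrs)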
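A minimal sketch of name and attribute access, assuming a hypothetical one-tag document (the id value and variable names are made up for illustration):

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup.
html_str = '<p class="title" id="greeting">Hello World</p>'
soup = BeautifulSoup(html_str, "html.parser")

tag = soup.p
tag_name = tag.name          # the tag's name
tag_classes = tag["class"]   # class is multi-valued, so this is a list
tag_id = tag["id"]           # single-valued attributes come back as strings
all_attrs = tag.attrs        # every attribute, as a dictionary
missing = tag.get("href")    # .get() returns None instead of raising KeyError

print(tag_name, tag_classes, tag_id, all_attrs, missing)
```

Using .get() is the safer choice when an attribute may be absent, since tag["href"] would raise KeyError here.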

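(3) NavigableString: the text inside a tag, obtained with .string

BeautifulSoup wraps the strings inside a Tag in the NavigableString class. A NavigableString behaves like a Python Unicode string, and in Python 3 it can be converted to a plain string with str() (unicode() exists only in Python 2)

For example: u_string = str(soup.p.string)

(4) Comment: a special kind of NavigableString that holds an HTML comment. .string returns a comment's content without the comment markers, so if you do not know what a tag's .string holds, extracted data can get mixed up. It therefore helps to check the type when extracting strings:

from bs4.element import Comment

if isinstance(soup.a.string, Comment):
    print(soup.a.string)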
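The type check can be sketched like this, assuming a hypothetical document in which one tag holds ordinary text and another holds only a comment:

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

# Hypothetical sample markup: <p> holds text, <a> holds only a comment.
html_str = '<p>plain text</p><a href="#"><!-- hidden note --></a>'
soup = BeautifulSoup(html_str, "html.parser")

p_string = soup.p.string
a_string = soup.a.string

p_is_comment = isinstance(p_string, Comment)
a_is_comment = isinstance(a_string, Comment)

plain = str(p_string)          # convert NavigableString to a plain str
comment_body = str(a_string)   # comment text, without the <!-- --> markers

# Keep only real text, skipping comments.
visible = [str(s) for s in (p_string, a_string) if not isinstance(s, Comment)]
print(visible)
```

isinstance is used instead of comparing type() so that the check still holds for any subclass of Comment.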

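5. Traversing the document

(1) Child nodes:

A. Direct children can be reached through .contents and .children

.contents ----> returns the Tag's children as a list

print(soup.head.contents)

.children ----> returns an iterator for looping over the Tag's children

for child in soup.head.children:
    print(child)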
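A short sketch of the two accessors, using hypothetical list markup. The second document shows that whitespace between tags is itself a child node, which is easy to overlook:

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup with no whitespace between tags.
soup = BeautifulSoup("<ul><li>one</li><li>two</li></ul>", "html.parser")

items = soup.ul.contents                                  # a list of direct children
child_names = [child.name for child in soup.ul.children]  # iterate the same children

# The same kind of markup with newlines: the newlines become string children too.
soup2 = BeautifulSoup("<ul>\n<li>one</li>\n</ul>", "html.parser")
children_with_whitespace = soup2.ul.contents

print(len(items), child_names, len(children_with_whitespace))
```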

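B. Getting the content of child nodes

.string ----> if the tag contains no other tags, returns its text; if the tag contains exactly one child tag, returns the innermost text; if the tag has several children, so that it cannot tell which one .string should refer to, returns None

.strings ----> mainly used when a Tag contains several strings; it can be looped over

for s in soup.strings:
    print(repr(s))

.stripped_strings ----> same as .strings, but strips leading and trailing whitespace and skips strings that are only spaces or blank lines

for s in soup.stripped_strings:
    print(repr(s))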
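The three cases for .string, together with .strings and .stripped_strings, can be sketched on a hypothetical fragment:

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup: the first <p> has a single child tag,
# the second <p> has two children (a string and an <i> tag).
html_str = "<div><p><b>only child</b></p><p>first <i>second</i></p></div>"
soup = BeautifulSoup(html_str, "html.parser")

first_p, second_p = soup.find_all("p")

single = first_p.string       # descends through the lone <b> to its text
ambiguous = second_p.string   # None, because there is more than one child

all_strings = list(second_p.strings)            # whitespace preserved
stripped = list(second_p.stripped_strings)      # whitespace trimmed

print(repr(single), ambiguous, all_strings, stripped)
```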

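(2) Parent nodes

A. The .parent attribute returns the parent of an element, for example:

print(soup.title.parent)

B. The .parents attribute walks upward through all of an element's ancestors

for parent in soup.a.parents:
    if parent is None:
        print(parent)
    else:
        print(parent.name)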
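A minimal sketch of .parent and .parents on a hypothetical document. With 'html.parser' no extra tags are inserted, so the ancestor chain is exactly what the markup contains, ending at the BeautifulSoup object itself:

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup.
html_str = "<html><body><p><a href='#'>link</a></p></body></html>"
soup = BeautifulSoup(html_str, "html.parser")

direct_parent = soup.a.parent.name
ancestor_names = [parent.name for parent in soup.a.parents]

print(direct_parent)
print(ancestor_names)
```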

(3) Sibling nodes

.next_sibling ----> gets the next sibling of this node

.previous_sibling ----> gets the previous sibling of this node
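Sibling access can be sketched as follows. The markup is hypothetical and deliberately has no whitespace between tags; in real pages the sibling of a tag is often a whitespace string rather than another tag.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup: three tags sharing one parent.
html_str = "<p><b>one</b><i>two</i><u>three</u></p>"
soup = BeautifulSoup(html_str, "html.parser")

middle = soup.i
next_name = middle.next_sibling.name
prev_name = middle.previous_sibling.name

# The first child has no previous sibling, and the last has no next one.
before_first = soup.b.previous_sibling
after_last = soup.u.next_sibling

print(prev_name, next_name, before_first, after_last)
```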

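(4) Preceding and following nodes

.next_elements ----> iterates over all nodes that were parsed after this node

.previous_elements ----> iterates over all nodes that were parsed before this node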
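Unlike siblings, these iterators follow parse order, so they step into and out of tags. A small sketch on hypothetical markup, labelling each node by its tag name or, for strings, by its text:

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup.
html_str = "<p><b>one</b><i>two</i></p>"
soup = BeautifulSoup(html_str, "html.parser")


def label(node):
    """Return the tag name for tags, or the text for strings."""
    return node.name if node.name else str(node)


# Everything parsed after <b>: its own text, then <i>, then the text of <i>.
after_b = [label(node) for node in soup.b.next_elements]

# Everything parsed before <i>, walking backwards; it starts with the
# text of <b>, then <b> itself, then the enclosing <p>.
before_i = [label(node) for node in soup.i.previous_elements]

print(after_b)
print(before_i[:3])
```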

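6. Searching the document tree

(1) find_all(name, attrs, recursive, text, **kwargs)

A. name parameter: finds tags whose name is name

print(soup.find_all('b'))

B. text parameter: searches the string content of the document instead of tags (since Beautiful Soup 4.4.0 this parameter is named string; text is the older name)

C. recursive parameter: by default find_all() searches all descendants of the current Tag; to search only its direct children, set this parameter to False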
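The parameters above can be sketched together on a hypothetical fragment. The example uses string=, the name the text argument has had since Beautiful Soup 4.4.0, so it assumes a reasonably recent install.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup: one <b> directly under <div>, one nested in <p>.
html_str = """
<div id="outer">
  <b>top-level bold</b>
  <p class="story">A <b>nested bold</b> word and a <a href="http://example.com/one">link</a>.</p>
</div>
"""
soup = BeautifulSoup(html_str, "html.parser")
outer = soup.div

all_bold = outer.find_all("b")                      # name: searches all descendants
direct_bold = outer.find_all("b", recursive=False)  # recursive=False: direct children only
story_paragraphs = soup.find_all("p", attrs={"class": "story"})  # attrs: filter by attribute
exact_text = soup.find_all(string="nested bold")    # string: match text nodes, not tags

print(len(all_bold), len(direct_bold), len(story_paragraphs), exact_text)
```

find_all() always returns a list, so an unsuccessful search yields an empty list rather than None.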

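7. CSS selectors: use the soup.select() function, which returns a list of matching tags

(1) Find by tag name

print(soup.select("title"))

(2) Find by the Tag's class attribute value

print(soup.select(".sister"))

(3) Find by the Tag's id attribute value

print(soup.select("#sister"))

(4) Find by whether an attribute exists

print(soup.select("a[href]"))

(5) Find by attribute value

print(soup.select('a[href="http://exam.com"]'))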
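The five selector forms can be tried on a small hypothetical document. The link text, ids, and example.com URLs below are made up for illustration, and select() relies on the soupsieve package that recent beautifulsoup4 releases install as a dependency.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup.
html_str = """
<html><head><title>Demo</title></head><body>
<a class="sister" id="link1" href="http://example.com/elsie">Elsie</a>
<a class="sister" id="link2" href="http://example.com/lacie">Lacie</a>
<a id="no-href">Anchor without href</a>
</body></html>
"""
soup = BeautifulSoup(html_str, "html.parser")

titles = soup.select("title")        # by tag name
sisters = soup.select(".sister")     # by class
by_id = soup.select("#link2")        # by id
with_href = soup.select("a[href]")   # by attribute presence
exact = soup.select('a[href="http://example.com/elsie"]')  # by attribute value

print([tag.get_text() for tag in sisters])
print(by_id[0].get_text(), len(with_href), exact[0]["id"])
```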
