A Quick Guide to BeautifulSoup, a Data-Parsing Module

This article is a repost. 2019-06-27 15:24 Tag: web crawlers

I. Preparing the environment

1. Prepare the test page test.html

<html>
<head>
    <title>
        The Dormouse's story
    </title>
</head>
<body>
<p class="title">
    <b>
        The Dormouse's story
    </b>
</p>
<p class="story">
    Once upon a time there were three little sisters; and their names were
    <a class="sister" href="http://example.com/elsie" id="link1">
        Elsie
    </a>
    ,
    <a class="sister" href="http://example.com/lacie" id="link2">
        Lacie
    </a>
    and
    <a class="sister" href="http://example.com/tillie" id="link3">
        Tillie
    </a>
    ; and they lived at the bottom of a well.
</p>
<p class="story">
    ...
</p>
</body>
</html>

test.html

2. Install the required modules

pip install beautifulsoup4
pip install lxml
pip install requests
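
The examples in this article pass 'lxml' as the parser, and lxml is a separate package from Beautiful Soup. As a minimal sketch (not part of the original article), the following check confirms that bs4 imports and falls back to Python's built-in html.parser when lxml is missing. The variable names `has_lxml` and `parser` are illustrative.

```python
import importlib.util

import bs4

print("beautifulsoup4 version:", bs4.__version__)

# The 'lxml' parser is provided by a separate package, so check for it explicitly.
has_lxml = importlib.util.find_spec("lxml") is not None
print("lxml available:", has_lxml)

# html.parser ships with Python and can serve as a fallback parser.
parser = "lxml" if has_lxml else "html.parser"
soup = bs4.BeautifulSoup("<p>ok</p>", parser)
print(soup.p.string)
```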

II. BeautifulSoup syntax

1. Instantiating a BeautifulSoup object

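from bs4 import BeautifulSoup
# Instantiate a BeautifulSoup object
# 1. From a local HTML file. General form:
#    soup = BeautifulSoup(open('path/to/local/file'), 'lxml')
# For example, with the test.html prepared above:
with open('test.html', mode='r', encoding='utf-8') as f:
    soup = BeautifulSoup(f, 'lxml')
print(soup.a)   # print the first <a> tag, including its attributes and contents

# 2. From HTML data obtained through requests.get or some other way. General form:
#    soup = BeautifulSoup(markup, 'lxml')   # markup is a str or bytes object
# For example, with page data fetched by requests:
import requests
page_html = requests.get(url='http://www.baidu.com').text
soup = BeautifulSoup(page_html, 'lxml')
print(soup.a)   # print the first <a> tag, including its attributes and contents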
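
To try the second form without network access, here is a small sketch that builds the object from an in-memory str and from bytes. The HTML snippet and the names `html_text`, `html_bytes`, `soup_from_str` and `soup_from_bytes` are made up for illustration, and html.parser is used only so that the sketch runs without lxml.

```python
from bs4 import BeautifulSoup

# A tiny, hypothetical page held in memory instead of being downloaded.
html_text = (
    '<html><body>'
    '<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>'
    '</body></html>'
)
html_bytes = html_text.encode("utf-8")

# BeautifulSoup accepts both str and bytes as markup.
soup_from_str = BeautifulSoup(html_text, "html.parser")
soup_from_bytes = BeautifulSoup(html_bytes, "html.parser")

print(soup_from_str.a)            # the first <a> tag
print(soup_from_bytes.a["href"])  # http://example.com/elsie
```

With requests, `response.text` is a str and `response.content` is bytes, so either can be passed as the markup.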

2. Using the instantiated object to get tags, tag contents, and tag attributes (demonstrated here with the test.html prepared above).

from bs4 import BeautifulSoup

with open('test.html', mode='r', encoding='utf-8') as f:
    soup = BeautifulSoup(f, 'lxml')
print(soup.title)             # print the whole <title> tag
print(soup.a)                 # print the whole first <a> tag
print(soup.a.attrs)           # print all attributes of the <a> tag as a dict
print(soup.a.attrs['href'])   # print the value of the <a> tag's href attribute
print(soup.a['href'])         # shorthand for the line above


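# Print the text inside the <a> tag
print(soup.a.string)
print(soup.a.text)
print(soup.a.get_text())
# Note: if the <a> tag has more than one child (for example, text plus a nested tag), soup.a.string returns None,
# whereas soup.a.text and soup.a.get_text() return the text of the <a> tag and all of its descendants.
# Note: soup.tagName only locates the first tagName tag and then stops matching.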

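soup.find('a')                                            # like soup.tagName, matches only the first occurrence; unlike it, find can combine a tag name with attributes
print(soup.find('a', {'class': 'sister', 'id': 'link2'}))  # locate by tag name and attributes

soup.find_all('a')  # used the same way as find, but returns a list of all matches, so it is not demonstrated here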

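# Locating with CSS selectors
# Common selectors: tag selector (a), class selector (.), id selector (#), hierarchy selectors
soup.select('a')              # locate all <a> tags by tag name
print(soup.select('.sister')) # locate by the class name sister
print(soup.select('#link1'))  # locate by id
print(soup.select('p>a'))     # locate all <a> tags that are direct children of a <p> tag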
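
The notes above on .string, find/find_all and hierarchy selectors can be checked with a short sketch. The HTML below is a made-up fragment (not test.html) in which one link is nested inside a span, and html.parser is used only so that the sketch runs without lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: link1 is a direct child of <p>, link2 is nested in a <span>.
html = """
<div id="box">
  <p class="story">Hello <b>bold</b> world
    <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
    <span><a class="sister" href="http://example.com/lacie" id="link2">Lacie</a></span>
  </p>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# .string is None when a tag has more than one child;
# get_text() collects the text of the tag and all of its descendants.
print(soup.p.string)                      # None
print(soup.b.string)                      # bold
print(soup.p.get_text(" ", strip=True))   # Hello bold world Elsie Lacie

# find returns the first match (or None); find_all returns a list of all matches.
first = soup.find("a", {"class": "sister"})
all_links = soup.find_all("a", {"class": "sister"})
missing = soup.find("a", {"id": "no-such-id"})
print(first["id"], [a["id"] for a in all_links], missing)

# 'p > a' matches only direct children; 'p a' matches any descendant.
child_ids = [a["id"] for a in soup.select("p > a")]
descendant_ids = [a["id"] for a in soup.select("p a")]
print(child_ids)        # ['link1']
print(descendant_ids)   # ['link1', 'link2']
```

Because find can return None, it is worth checking the result before indexing into it when the tag may be absent.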
