How to Use BeautifulSoup4

This article is a repost. 2019-03-28 16:24, Crawler Notes

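BeautifulSoup is a Python library for extracting data from HTML and XML documents. It supports navigating, searching, and modifying the parsed document.

Official documentation: "https://beautifulsoup.readthedocs.io/zh_CN/v4.4.0/"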

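A comparison of several common extraction tools:
- Regular expressions: very fast, complex to use, nothing to install
- beautifulsoup: slower, simple to use, easy to install
- lxml: faster, simple to use, slightly harder to install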
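To make the "complex to use" versus "simple to use" contrast concrete, here is a minimal sketch that pulls the same title out of a page with a regular expression and with BeautifulSoup. The HTML string is a made-up sample, and the standard-library "html.parser" backend is used so that only bs4 itself needs to be installed. It says nothing about speed.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical HTML snippet, used only for illustration.
html = "<html><head><title>Demo page</title></head><body><p>Hello</p></body></html>"

# Regular expression: no extra dependency, but the pattern is tied to the exact markup.
match = re.search(r"<title>(.*?)</title>", html, re.S)
title_from_regex = match.group(1) if match else None

# BeautifulSoup: parse once, then navigate the tree by tag name.
soup = BeautifulSoup(html, "html.parser")
title_from_bs4 = soup.title.string

print(title_from_regex)
print(title_from_bs4)
```

The regular expression breaks as soon as the tag gains an attribute or changes case, while the parsed tree keeps working.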

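The four main objects:

1.Tag
- Corresponds to a tag in the HTML
- Can be accessed as soup.tag_name, which returns the first tag with that name
- A tag has two important attributes: name and attrs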

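from urllib import request
from bs4 import BeautifulSoup

url = "http://www.baidu.com"

# Fetch the page and parse it with the lxml parser (requires the lxml package)
rsp = request.urlopen(url)
cnt = rsp.read()
soup = BeautifulSoup(cnt, "lxml")

# Print the whole document, re-indented
cnt = soup.prettify()
print(cnt)
print("=="*10)
# soup.link is the first <link> tag in the document
print(soup.link)
print(soup.link.name)
print(soup.link.attrs)
print(soup.link.attrs['type'])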
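The listing above depends on the live page, so its output can change or fail if the page has no <link> tag. The following sketch shows the same Tag accesses on a small inline document. The HTML string is a made-up sample, and "html.parser" is used here instead of lxml only to avoid the extra dependency.

```python
from bs4 import BeautifulSoup

# Hypothetical inline document, used only for illustration.
html = (
    '<html><head>'
    '<link rel="stylesheet" type="text/css" href="style.css"/>'
    '</head><body></body></html>'
)
soup = BeautifulSoup(html, "html.parser")

link = soup.link           # first <link> tag, or None if the document has none
print(link.name)           # the tag's name
print(link.attrs)          # the tag's attributes as a dict
print(link.get("type"))    # .get() returns None for a missing attribute
print(link.get("media"))   # instead of raising KeyError like link.attrs['media']
```

Using link.get(...) rather than link.attrs[...] is a defensive choice for scraped pages, where attributes are often missing.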

2.NavigableString
- Corresponds to the text content inside a tag
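A short sketch of what a NavigableString looks like in practice, using a made-up one-line document. It behaves like a Python str but remains attached to the parse tree, so converting it with str() gives a plain copy.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

# Hypothetical snippet, used only for illustration.
soup = BeautifulSoup("<p>Hello, <b>world</b></p>", "html.parser")

text = soup.b.string                 # the text inside <b>
print(type(text).__name__)
print(isinstance(text, str))         # NavigableString is a subclass of str

plain = str(text)                    # plain str, detached from the tree
print(plain)

# <p> contains two children (a string and a tag), so .string is None here
print(soup.p.string)
```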

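3.BeautifulSoup
- Represents the content of a whole document; in most cases it can be treated like a Tag object
- Conventionally stored in a variable named soup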
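As a small illustration of "can be treated like a Tag", the sketch below calls Tag-style navigation and search directly on the BeautifulSoup object. The document string is a made-up sample.

```python
from bs4 import BeautifulSoup

# Hypothetical document, used only for illustration.
html = "<html><head><title>Demo</title></head><body><p>Hi</p></body></html>"
soup = BeautifulSoup(html, "html.parser")

# The object does not correspond to a real tag, so it gets a special name
print(soup.name)

# Navigation and searching work the same way as on a Tag
title_text = soup.title.string
para_text = soup.find("p").get_text()
print(title_text)
print(para_text)
```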

4.Comment
- A special type of NavigableString; when printed, its content does not include the comment delimiters
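The following sketch, on a made-up snippet, shows the behavior described above: the Comment object prints without the <!-- and --> delimiters, while the enclosing tag still renders them. Checking the type with isinstance is one way to avoid mistaking a comment for ordinary text.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

# Hypothetical snippet, used only for illustration.
soup = BeautifulSoup("<b><!--Hey, this is a comment--></b>", "html.parser")

node = soup.b.string
print(type(node).__name__)
print(node)       # comment text only, without the delimiters
print(soup.b)     # the enclosing tag still renders the delimiters

# Distinguish comments from ordinary text before using the value
is_comment = isinstance(node, Comment)
print(is_comment)
```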

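Ways to traverse the document tree:
- contents: returns a tag's direct children as a list
- children: returns a tag's direct children as an iterator
- descendants: returns a generator over all descendant nodes (children, grandchildren, and so on)
- string: returns the tag's text when the tag contains exactly one string (directly, or nested inside a single child tag); otherwise returns None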

from urllib import request
from bs4 import BeautifulSoup

url = "http://www.baidu.com"

# Fetch and parse the page
rsp = request.urlopen(url)
cnt = rsp.read()
soup = BeautifulSoup(cnt, "lxml")

# Walk the direct children of <head>; text nodes have name None and are skipped
for node in soup.head.contents:
    if node.name == 'meta':
        print(node)
    if node.name == 'title':
        print(node.string)

#<meta content="text/html;charset=utf-8" http-equiv="content-type"/>
#<meta content="IE=Edge" http-equiv="X-UA-Compatible"/>
#<meta content="always" name="referrer"/>
#<meta content="#2932e1" name="theme-color"/>
#百度一下，你就知道
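For a version that does not depend on the live page, the sketch below compares the four traversal attributes on a small made-up document. The HTML is written without whitespace between tags so that no extra text nodes appear among the children.

```python
from bs4 import BeautifulSoup

# Hypothetical document, used only for illustration.
html = (
    "<html><head><title>Demo</title>"
    "<meta name='referrer' content='always'/></head>"
    "<body></body></html>"
)
soup = BeautifulSoup(html, "html.parser")
head = soup.head

# contents: direct children as a list
child_names = [child.name for child in head.contents]
print(child_names)

# children: the same nodes, but as an iterator
iter_names = [child.name for child in head.children]

# descendants: every nested node, including the text inside <title>
descendants = list(head.descendants)
print(len(descendants))

# string: defined for <title> (one text child), None for <head> (two children)
title_text = head.title.string
print(title_text)
print(head.string)
```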

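Ways to search the document tree:
Use find_all(name, attrs, recursive, text, **kwargs)
- name: matches on tag name; accepts a string, a regular expression, or a list
- kwargs: keyword arguments, used to filter on attributes
- text: matches on the text of a tag (since Beautiful Soup 4.4.0 this argument is named string; text is the older name)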

from urllib import request
from bs4 import BeautifulSoup
import re

url = 'http://www.baidu.com'

# Fetch and parse the page
rsp = request.urlopen(url)
content = rsp.read()
soup = BeautifulSoup(content, 'lxml')

# Tags whose name starts with "me" and whose content attribute equals "always"
tags = soup.find_all(re.compile('^me'), content="always")
for tag in tags:
    print(tag)

#<meta content="always" name="referrer"/>
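The sketch below runs each kind of find_all filter described above against a small made-up document, so the results are reproducible offline. It assumes Beautiful Soup 4.4.0 or later, where the text filter is spelled string.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical document, used only for illustration.
html = """
<html><head>
<title>Demo</title>
<meta name="referrer" content="always"/>
<meta name="theme-color" content="#2932e1"/>
</head><body>
<a id="home" href="/">Home</a>
<a id="news" href="/news">News</a>
<p>Home</p>
</body></html>
"""
soup = BeautifulSoup(html, "html.parser")

by_name = soup.find_all("a")                         # name as a string
by_regex = soup.find_all(re.compile("^me"))          # name as a regular expression
by_list = soup.find_all(["title", "p"])              # name as a list
by_kwarg = soup.find_all("meta", content="always")   # attribute via keyword argument
# "name" is already the first parameter of find_all, so an attribute
# called name has to be passed through the attrs dict instead
by_attrs = soup.find_all(attrs={"name": "theme-color"})
by_text = soup.find_all("a", string="Home")          # tag name plus text filter

for label, result in [("name", by_name), ("regex", by_regex), ("list", by_list),
                      ("kwarg", by_kwarg), ("attrs", by_attrs), ("text", by_text)]:
    print(label, result)
```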

Using CSS selectors:
- soup.select returns a list
- By tag name: soup.select("title")
- By class name: soup.select(".content")
- By id: soup.select("#name_id")
- Combined: soup.select("div #input_content")
- By attribute: soup.select("img[class='photo']")
- Getting a tag's text: tag.get_text()

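from urllib import request
from bs4 import BeautifulSoup

url = 'http://www.baidu.com'

# Fetch and parse the page
rsp = request.urlopen(url)
content = rsp.read()
soup = BeautifulSoup(content, 'lxml')

# select() always returns a list, even for a single match
titles = soup.select("title")
print(titles[0])

print("==" * 12)
# Attribute selector: <meta> tags whose content attribute is "always"
metas = soup.select("meta[content='always']")
print(metas[0])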

#<title>百度一下，你就知道</title>
#========================
#<meta content="always" name="referrer"/>
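The sketch below tries every selector form from the list above on a small made-up document whose class names and ids were chosen to match those selectors. It assumes a bs4 installation where select() is available; recent releases rely on the soupsieve package, which is normally installed together with bs4.

```python
from bs4 import BeautifulSoup

# Hypothetical document; ids and classes are chosen to match the selectors listed above.
html = """
<html><head><title>Demo</title></head><body>
<div class="content"><input id="input_content" type="text"/></div>
<p id="name_id" class="content">Hello <b>world</b></p>
<img class="photo" src="a.png"/>
</body></html>
"""
soup = BeautifulSoup(html, "html.parser")

titles = soup.select("title")                    # by tag name
by_class = soup.select(".content")               # by class name
by_id = soup.select("#name_id")                  # by id
combined = soup.select("div #input_content")     # descendant combination
by_attr = soup.select("img[class='photo']")      # by attribute value

# get_text() is a method; it joins all text inside the tag
text = by_id[0].get_text()

print(titles[0])
print([tag.name for tag in by_class])
print(combined[0])
print(by_attr[0])
print(text)
```

Because select() returns a list even for a single match, indexing with [0] raises IndexError when nothing matches; soup.select_one(...) returns None in that case.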
