Python: Data Parsing (bs4 / XPath)

Reposted article. Original published 2020-04-20 21:07. Category: Python

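I have recently been working through tutorial videos on Bilibili. This post collects some usage examples of data parsing with bs4 and XPath.

bs4 parsing

Installing the environment: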

pip install bs4
pip install lxml

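How bs4 parsing works (the workflow)

Instantiate a BeautifulSoup object and load the data to be parsed into it

Option 1: BeautifulSoup(f, 'lxml'): parse an HTML file stored locally (f is an open file object)
Option 2: BeautifulSoup(page_text, 'lxml'): parse page data requested from the internet

Call the methods and attributes of the BeautifulSoup object to locate tags and extract data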

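Locating tags

soup.tagName: returns the first occurrence of the tagName tag
Attribute lookup: soup.find('tagName', attrName='value')
find_all (older alias: findAll) takes the same arguments as find, but find returns only the first match while find_all returns a list of every match
Selector lookup: select('selector') takes a CSS selector and returns a list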
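As a quick illustration of these lookups, here is a minimal sketch run against a small inline HTML snippet. The markup, class names and ids are invented for this example and are not the contents of test.html.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, invented for this sketch
html = """
<div class="song">
  <a id="feng" href="/a">first</a>
  <a href="/b">second</a>
</div>
<div class="tang"><ul><li><a href="/c">third</a></li></ul></div>
"""
soup = BeautifulSoup(html, 'lxml')

first_div = soup.div                          # first <div> in the document
song = soup.find('div', class_='song')        # a single Tag, or None if nothing matches
no_match = soup.find('div', class_='nope')    # None
links = soup.find_all('a')                    # list of every <a> tag
by_id = soup.select('#feng')                  # CSS id selector, always returns a list
nested = soup.select('.tang > ul > li > a')   # > steps down exactly one level

print(first_div['class'], len(links), by_id[0]['href'], nested[0].string)
```

Because find can return None, it is worth checking its result before indexing into it or reading .text when scraping pages whose layout may change.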

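Extracting data

Extracting the text inside a tag

.string: the tag's own direct text content (None when the tag has more than one child node)
.text: all text content inside the tag, including the text of nested tags

Extracting data stored in a tag's attributes

tagName['attrName']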
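The difference between .string and .text is easiest to see on a tag that mixes text with nested tags. The sketch below uses made-up markup. The .get call is an addition that is not in the original notes: it is the standard way to read an attribute that may be absent.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, invented for this sketch
html = ('<div class="song"><p>plain text</p>'
        '<p>mixed <b>bold</b> text</p><a href="/x">link</a></div>')
soup = BeautifulSoup(html, 'lxml')
paragraphs = soup.find_all('p')

simple = paragraphs[0].string      # the only child is a string, so it is returned
mixed = paragraphs[1].string       # several child nodes, so .string is None
mixed_text = paragraphs[1].text    # text of the tag and of everything nested in it
href = soup.a['href']              # raises KeyError if the attribute is missing
missing = soup.a.get('title')      # returns None instead of raising

print(simple, mixed, mixed_text, href, missing)
```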

import requests
from bs4 import BeautifulSoup

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.163 Safari/537.36'
}

# Parse a local HTML file
with open('./test.html', 'r') as f:
    soup = BeautifulSoup(f, 'lxml')

soup.div  # tag lookup: the first div tag
# Attribute lookup: the div tag whose class attribute is song. class is a
# reserved word in Python, so bs4 takes class_ with a trailing underscore;
# for id, write id='xxx' directly.
soup.find('div', class_='song')
soup.find_all('a', id='feng')
soup.select('#feng')  # locate the a tag with an id selector
soup.select('.song')  # locate tags whose class is song
# Hierarchy selectors
soup.select('.tang > ul > li > a')  # > means exactly one level
soup.select('.tang a')  # a space means any number of levels
soup.p.string  # direct text content of the p tag
soup.div.text  # all text content inside the div tag
soup.a['href']  # value of the a tag's href attribute

# Use bs4 to scrape the full text of Romance of the Three Kingdoms:
# http://www.shicimingju.com/book/sanguoyanyi.html
# Parse the chapter titles and detail-page URLs from the index page
url = 'http://www.shicimingju.com/book/sanguoyanyi.html'
page_text = requests.get(url, headers=headers).text  # source of the index page
# Data parsing (chapter title, detail-page URL)
soup = BeautifulSoup(page_text, 'lxml')
a_list = soup.select('.book-mulu > ul > li > a')
with open('./sanguoyanyi.txt', 'w', encoding='utf-8') as f:
    for a in a_list:
        title = a.string
        url_detail = 'http://www.shicimingju.com' + a['href']
        # Request the detail page and parse out the chapter text
        page_text_detail = requests.get(url_detail, headers=headers).text
        detail_soup = BeautifulSoup(page_text_detail, 'lxml')
        content = detail_soup.find('div', class_='chapter_content').text
        f.write(title + ':' + content + '\n')
        print(title, 'downloaded')

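XPath parsing

How XPath parsing works

Instantiate an etree object and load the data to be parsed into it

Parsing an HTML document stored locally: etree.parse('fileName', etree.HTMLParser()). Without the HTMLParser argument, etree.parse uses the strict XML parser, which rejects HTML that is not well-formed XML
Parsing HTML data scraped from the web: etree.HTML(page_text)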
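A minimal sketch of both loading routes follows. The markup and the file name demo.html are made up for this example. It also shows why the HTMLParser argument matters: the unclosed br tag is fine as HTML but is rejected by the default XML parser.

```python
import os
import tempfile

from lxml import etree

# Hypothetical page source; <br> is never closed, so this is not valid XML
page_text = '<html><body><p>hello<br>world</p></body></html>'

# Route 1: a string already in memory
tree_from_string = etree.HTML(page_text)

# Route 2: a file on disk, parsed with an explicit HTMLParser
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'demo.html')
    with open(path, 'w', encoding='utf-8') as f:
        f.write(page_text)
    tree_from_file = etree.parse(path, etree.HTMLParser())
    # The default (XML) parser refuses the same file
    try:
        etree.parse(path)
        strict_ok = True
    except etree.XMLSyntaxError:
        strict_ok = False

texts_a = tree_from_string.xpath('//p/text()')
texts_b = tree_from_file.xpath('//p/text()')
print(texts_a, texts_b, strict_ok)
```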

Call the xpath method of the etree object with different XPath expressions to locate tags and extract data

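Locating tags

Leading /: the path must start at the root tag and locate the target tag level by level
Non-leading /: one level of hierarchy
Leading //: the target tag can be located from any position in the document
Non-leading //: any number of levels
Attribute lookup: //tagName[@attrName='value']
Index lookup: //tagName[index], where index starts at 1, not 0
Fuzzy matching:

//div[contains(@class,'ng')] locates div tags whose class attribute contains ng
//div[starts-with(@class,'ta')] locates div tags whose class attribute value starts with ta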
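The locating rules above can be tried out on a small inline document. This is only a sketch, and the markup is invented for the example.

```python
from lxml import etree

# Hypothetical markup, invented for this sketch
html = """
<html><body>
  <div class="song"><a id="feng" href="/a">first</a></div>
  <div class="tang"><ul><li>one</li><li>two</li></ul></div>
</body></html>
"""
tree = etree.HTML(html)

from_root = tree.xpath('/html/body/div')     # leading /: walk down from the root
anywhere = tree.xpath('//li')                # leading //: match at any depth
by_attr = tree.xpath('//div[@class="song"]')
first_li = tree.xpath('//li[1]')             # indexes start at 1
contains_ng = tree.xpath('//div[contains(@class,"ng")]')    # "song" and "tang" both contain "ng"
starts_ta = tree.xpath('//div[starts-with(@class,"ta")]')   # only "tang" starts with "ta"

print(len(from_root), len(anywhere), first_li[0].text, len(contains_ng), len(starts_ta))
```

Note that //li[1] selects every li that is the first li under its own parent, so on a page with several lists it returns one element per list.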

Extracting data

Extracting the text inside a tag

/text(): direct text content
//text(): all text content

Extracting attribute values

tagName/@attrName
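One detail worth seeing in action is that xpath always returns a list here, even for a single attribute value. The sketch below uses made-up markup to compare /text(), //text() and /@attrName.

```python
from lxml import etree

# Hypothetical markup, invented for this sketch
html = '<div class="song"><p>outer <b>inner</b></p><a id="feng" href="/a">link</a></div>'
tree = etree.HTML(html)

direct = tree.xpath('//div[@class="song"]/p/text()')    # text directly under <p> only
all_text = tree.xpath('//div[@class="song"]//text()')   # text at every depth under the div
hrefs = tree.xpath('//a[@id="feng"]/@href')             # a list of attribute values

joined = ''.join(all_text)                  # join the pieces into a single string
first_href = hrefs[0] if hrefs else None    # guard against an empty result
print(direct, all_text, joined, first_href)
```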

import os

import requests
from lxml import etree

# headers is the User-Agent dict defined in the bs4 example above

# test.html is HTML rather than strict XML, so parse it with an HTMLParser
tree = etree.parse('./test.html', etree.HTMLParser())
tree.xpath('/html/head')  # locate head starting from the root tag; returns a list of <Element> objects
tree.xpath('//head')  # locate every head tag in the document
tree.xpath('//div[@class="song"]')  # div tags whose class is song
tree.xpath('//li[1]')  # li tags that are the first li under their parent
tree.xpath('//a[@id="feng"]/text()')  # direct text of the a tag whose id is feng
tree.xpath('//div[@class="song"]//text()')  # all text inside the div whose class is song
tree.xpath('//a[@id="feng"]/@href')  # href value of the a tag whose id is feng

# Scrape images and their names, and save them locally
# First page: http://pic.netbian.com/4kmeinv/
# Other pages: http://pic.netbian.com/4kmeinv/index_2.html
dir_name = 'imgLibs'
if not os.path.exists(dir_name):
    os.mkdir(dir_name)
url = 'http://pic.netbian.com/4kmeinv/index_%d.html'
for page in range(1, 6):
    if page == 1:
        new_url = 'http://pic.netbian.com/4kmeinv/'
    else:
        new_url = url % page  # URL of every page after the first
    response = requests.get(new_url, headers=headers)
    response.encoding = 'gbk'
    page_text = response.text
    tree = etree.HTML(page_text)
    # Parse the image URL and the image name
    li_list = tree.xpath('//div[@class="slist"]/ul/li')  # global parse
    for li in li_list:
        # Local parse: ./ refers to the element that xpath is called on
        img_src = 'http://pic.netbian.com' + li.xpath('./a/img/@src')[0]
        img_name = li.xpath('./a/img/@alt')[0] + '.jpg'
        img_data = requests.get(img_src, headers=headers).content
        file_path = os.path.join(dir_name, img_name)
        with open(file_path, 'wb') as f:
            f.write(img_data)
        print(img_name, 'downloaded')

# Making an XPath expression more general
url = 'https://www.aqistudy.cn/historydata/'
page_text = requests.get(url, headers=headers).text
tree = etree.HTML(page_text)
hot_cities = tree.xpath('//div[@class="bottom"]/ul/li/a/text()')
all_cities = tree.xpath('//div[@class="bottom"]/ul/div[2]/li/a/text()')
# The two expressions above can be merged into one with the | union operator
cities = tree.xpath('//div[@class="bottom"]/ul/li/a/text() | //div[@class="bottom"]/ul/div[2]/li/a/text()')
print(cities)

Extracting a specific part of a page with its tags intact

Use bs4: once bs4 has located a tag, what it returns is the tag itself, markup included
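A minimal sketch of what that looks like in practice, using made-up markup: converting the located tag to a string gives the fragment with its markup, decode_contents gives only the inner markup, and .text strips the tags entirely.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, invented for this sketch
html = '<div class="song"><p>some <b>bold</b> text</p></div>'
soup = BeautifulSoup(html, 'lxml')
tag = soup.find('div', class_='song')

outer_html = str(tag)                # the located tag together with its own markup
inner_html = tag.decode_contents()   # only the markup inside the tag
plain = tag.text                     # text with every tag stripped

print(outer_html)
print(inner_html)
print(plain)
```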

Reference: https://www.bilibili.com/video/BV1tE411F7do (P7~P8)
