Web Scraping: Data Parsing, the Core Technique of Crawlers

Reposted article (see the original) 2019-12-08 23:12 · Tag: web scraping

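7. Data parsing

1. Concept

What is data parsing, and what is it used for?

Concept: data parsing extracts a specified part of the data from a larger set of data.
Data cleaning, by contrast, removes data rather than extracting it.
Purpose: it is what implements a focused crawler, one that collects only the targeted content of a page.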

General principle of data parsing

Where can the data shown on an HTML page be stored?
	Between tags (as text)
	In tag attributes

2. Steps for extracting data

Locate the tag
Take its text or one of its attributes

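3. Using regular expressions

Requirement: crawl a website

1. Crawling image data

(.*?) captures the value matched inside the parentheses.
.*? matches any number of repetitions, but uses the fewest repetitions that still let the whole match succeed (non-greedy matching).
# For example:
a.*?b matches the shortest string that starts with a and ends with b. Applied to aabab, it matches aab and ab.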
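The aabab example can be checked directly with the standard re module. This is a minimal sketch; the variable names and the `<b>` sample string are illustrative assumptions, not part of the original notes.

```python
import re

text = 'aabab'

# Non-greedy: each match stops at the first b that lets the match succeed.
lazy = re.findall('a.*?b', text)

# Greedy: the match runs to the last b it can reach.
greedy = re.findall('a.*b', text)

# With parentheses, findall returns only the captured group.
captured = re.findall('<b>(.*?)</b>', '<b>one</b><b>two</b>')

print(lazy, greedy, captured)
```

The non-greedy form is the one used for scraping, because it keeps each match inside a single tag instead of swallowing everything up to the last closing tag on the page.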

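2. re.S and re.M for multi-line data

re.S: single-line (DOTALL) mode. The dot also matches the newline \n, so the whole string is matched as if it were on one line.
re.M: multi-line (MULTILINE) mode. ^ and $ match at the start and end of every line, not only of the whole string.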
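The difference between the two flags is easiest to see side by side. The sketch below uses two made-up strings (assumptions for illustration) and runs each pattern with and without its flag.

```python
import re

block = '<div>\nhello\n</div>'

# Without re.S the dot stops at \n, so the pattern cannot cross lines.
without_s = re.findall('<div>(.*?)</div>', block)
# With re.S the dot also matches \n.
with_s = re.findall('<div>(.*?)</div>', block, re.S)

lines = 'one\ntwo'

# Without re.M, ^ and $ anchor only to the whole string.
without_m = re.findall(r'^\w+$', lines)
# With re.M, ^ and $ anchor to every line.
with_m = re.findall(r'^\w+$', lines, re.M)

print(without_s, with_s, without_m, with_m)
```

Page source almost always contains newlines inside the block being matched, which is why the crawling examples in this section pass re.S.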

Use re.findall together with re.S or re.M.
Build the save path by string concatenation:
#imgPath = dirName+'/'+imgName
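The same path can be built with the standard library instead of manual concatenation. This is only a sketch: the helper name `image_path` is hypothetical, and the URL is the sample image address used in this section.

```python
import os
from urllib.parse import urlsplit

def image_path(dir_name, src):
    """Return dir_name joined with the file name taken from the URL src."""
    # urlsplit(...).path drops any query string before the name is extracted.
    name = os.path.basename(urlsplit(src).path)
    return os.path.join(dir_name, name)

path = image_path('imgLibs', 'http://duanziwang.com/usr/uploads/2019/02/3334500855.jpg')
print(path)
```

os.path.join picks the separator of the current operating system, which the hard-coded '/' does not.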

Example

Data parsing with regular expressions

Requirement: crawl the image data from http://duanziwang.com/category/%E6%90%9E%E7%AC%91%E5%9B%BE/
How to crawl image (binary) data
Method 1 (requests)

# Method 1: use the requests module
import requests
headers = {
    'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36'
}
url = 'http://duanziwang.com/usr/uploads/2019/02/3334500855.jpg'
pic_data = requests.get(url=url,headers=headers).content # .content returns the response body as bytes
with open('1.jpg','wb') as fp:
    fp.write(pic_data)

Method 2 (urllib)

Using urllib

import urllib.request
urllib.request.urlretrieve(url=src,filename=imgPath)
# url: address of the image; filename: local path the image is saved to

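Crawling a single page

# Implementation: crawl one page of data
import os
import re
import urllib.request
import requests

# headers is the User-Agent dict defined in Method 1
url = 'http://duanziwang.com/category/%E6%90%9E%E7%AC%91%E5%9B%BE/'
page_text = requests.get(url,headers=headers).text # page source

# create a folder for the images
dirName = 'imgLibs'
if not os.path.exists(dirName):
    os.mkdir(dirName)

# data parsing: the address of every image
ex = '<article.*?<img src="(.*?)" alt=.*?</article>'
img_src_list = re.findall(ex,page_text,re.S) # re.S is required here: each <article> block spans several lines

for src in img_src_list:
    imgName = src.split('/')[-1]
    imgPath = dirName+'/'+imgName
    urllib.request.urlretrieve(url=src,filename=imgPath)
    print(imgName,'downloaded successfully!')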
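The parsing step can be tried offline before any request is sent. The sketch below applies the same regular expression to a small hand-written HTML fragment; the fragment and its example.com addresses are assumptions that only imitate the structure of the target page.

```python
import re

sample_html = '''
<article class="post">
  <h2>Post one</h2>
  <img src="http://example.com/uploads/a.jpg" alt="first">
</article>
<article class="post">
  <h2>Post two</h2>
  <img src="http://example.com/uploads/b.jpg" alt="second">
</article>
'''

ex = '<article.*?<img src="(.*?)" alt=.*?</article>'

# With re.S the pattern can run from <article to <img across line breaks.
with_flag = re.findall(ex, sample_html, re.S)
# Without re.S nothing matches, because <article and <img are on different lines.
without_flag = re.findall(ex, sample_html)

# The file name is the last path segment of each address.
names = [src.split('/')[-1] for src in with_flag]
print(with_flag, without_flag, names)
```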

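Crawling all pages

import os
import re
import requests
import urllib.request
# Crawl the whole site: the image data on every page
# Implementation (headers is the User-Agent dict defined in Method 1)

# A generic URL template; it must not be modified
# The category name is 搞笑图, the same path that appears percent-encoded in the single-page URL
url = 'http://duanziwang.com/category/搞笑图/%d/'

# create the folder once, before the loop
dirName = 'imgLibs'
if not os.path.exists(dirName):
    os.mkdir(dirName)

for page in range(1,4):
    new_url = url % page
    page_text = requests.get(new_url,headers=headers).text # page source

    # data parsing: the address of every image
    ex = '<article.*?<img src="(.*?)" alt=.*?</article>'
    # re.S is required here: each <article> block spans several lines
    img_src_list = re.findall(ex,page_text,re.S)

    for src in img_src_list:
        imgName = src.split('/')[-1] # file name: last segment of the URL
        imgPath = dirName+'/'+imgName # local path where the image is stored
        urllib.request.urlretrieve(url=src,filename=imgPath)
        print(imgName,'downloaded successfully!')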
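The two spellings of the category URL in this section are the same address: one is percent-encoded, the other is not. A minimal sketch with urllib.parse shows the relationship; the variable names are illustrative, and the page path scheme is taken from the URL template above.

```python
from urllib.parse import quote, unquote

category = '搞笑图'

# Percent-encode the UTF-8 bytes of the category name.
encoded = quote(category)

# str.format is used here because the % signs in the encoded text
# would clash with %-style formatting.
url_template = 'http://duanziwang.com/category/' + encoded + '/{page}/'
urls = [url_template.format(page=p) for p in range(1, 4)]

print(encoded)
print(urls[0])
print(unquote(encoded))
```

requests encodes non-ASCII characters in a URL by itself, so either spelling can be passed to requests.get.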

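4. bs4

The return value is the corresponding part of the HTML page (a tag).

1. Installing the environment:

- pip install bs4
- pip install lxml

2. How parsing works

- Instantiate a BeautifulSoup object and load the page source to be parsed into it
- Call the object's methods and attributes to locate tags and extract text data

3. Ways to instantiate

How to instantiate a BeautifulSoup object:

Local

BeautifulSoup(fp,'lxml'): loads the contents of a local file into the object for parsing

Network

BeautifulSoup(page_text,'lxml'): loads data requested from the internet into the object for parsing

fp: the file object returned by the open function
soup: the object for the current page; every tag on the page is reached through it
find: locates a tag by attribute, and can also locate by tag name alone
keyword_=value: when the attribute name is a Python keyword, append an underscore, e.g. class_ (class declares a class)
attrName=value: when the attribute name is not a keyword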

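Getting text

string: returns only the text that sits directly inside the tag
text: returns all the text inside the tag, including the text of nested tags

Getting attributes

tag['attrName']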
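The difference between string and text shows up as soon as a tag contains nested tags. The sketch below parses a small inline HTML string; the markup is an assumption for illustration, and it uses Python's built-in 'html.parser' so that it runs without lxml installed.

```python
from bs4 import BeautifulSoup

html = ('<div class="tang"><p>hello <b>world</b></p>'
        '<a id="feng" href="http://example.com/feng">feng</a></div>')
soup = BeautifulSoup(html, 'html.parser')

p_tag = soup.find('p')
a_tag = soup.find('a', id='feng')

direct = a_tag.string    # the tag holds a single piece of text, so it is returned
mixed = p_tag.string     # the tag holds text plus a nested tag, so string is None
all_text = p_tag.text    # text collects the nested text as well
href = a_tag['href']     # attribute access works like a dict lookup

print(direct, mixed, all_text, href)
```

Because string can be None, text is usually the safer choice when a tag may contain nested markup.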

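Common usage

from bs4 import BeautifulSoup
fp = open('./test.html','r',encoding='utf-8')
soup = BeautifulSoup(fp,'lxml')
soup.p                                  # the first p tag
soup.find('div',class_='tang')          # the first div with class tang
soup.find('a',id='feng')                # the first a tag with id feng
soup.find_all('div',class_='tang')      # every div with class tang (a list)
soup.select('#feng')                    # CSS id selector (a list)
soup.select('.tang > ul > li')          # > means one level
soup.select('.tang li')                 # a space means any number of levels
tag = soup.title
tag.text                                # all text inside the tag
li_list = soup.select('.tang > ul > li')
li_list[6].text                         # text of the seventh li
div_tag = soup.find('div',class_='tang')
div_tag.text
a_tag = soup.select('#feng')[0]
a_tag['href']                           # value of the href attribute
fp.close()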
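Since ./test.html is not included with these notes, the locating calls can be tried on an inline document instead. The markup below is an assumed stand-in that reuses the tang and feng names; 'html.parser' is used so the sketch does not depend on lxml.

```python
from bs4 import BeautifulSoup

html = '''
<div class="tang">
  <ul>
    <li><a href="/a">A</a></li>
    <li><a href="/b">B</a>
      <ol><li>nested</li></ol>
    </li>
  </ul>
</div>
<div class="song"><a id="feng" href="/feng">Feng</a></div>
'''
soup = BeautifulSoup(html, 'html.parser')

first_div = soup.find('div', class_='tang')   # find returns a single tag
all_divs = soup.find_all('div')               # find_all returns a list
direct_li = soup.select('.tang > ul > li')    # > follows exactly one level
any_li = soup.select('.tang li')              # a space follows any number of levels
feng = soup.select('#feng')[0]                # select always returns a list

print(len(all_divs), len(direct_li), len(any_li), feng['href'])
```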

Example

Requirement: crawl the full text of the novel at http://www.shicimingju.com/book/sanguoyanyi.html
Analysis:
- Home page: parse the chapter titles and the URLs of the detail pages
- Detail page: parse the chapter content

import requests
from bs4 import BeautifulSoup
# headers is the User-Agent dict defined earlier
# fetch the page data of the home page
main_url = 'http://www.shicimingju.com/book/sanguoyanyi.html'
page_text = requests.get(main_url,headers=headers).text

fp = open('./sanguo.txt','a',encoding='utf-8')

# parse the chapter titles and the detail-page URLs
soup = BeautifulSoup(page_text,'lxml')
a_list = soup.select('.book-mulu > ul > li > a')
for a in a_list:
    title = a.string # chapter title
    detail_url = 'http://www.shicimingju.com'+a['href']

    # fetch the source of the detail page
    detail_page_text = requests.get(url=detail_url,headers=headers).text
    # parse the chapter content
    detail_soup = BeautifulSoup(detail_page_text,'lxml')
    div_tag = detail_soup.find('div',class_="chapter_content")
    content = div_tag.text # chapter content

    fp.write(title+':'+content+'\n')
    print(title,'downloaded successfully!')
fp.close()

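5. xpath (interview focus)

XPath indexes start at 1, not 0; zero-based indexing, as in Python lists, exists because it maps directly onto memory addressing
Printing a result shows the element objects
xpath always returns a list
# Never put a tbody tag in an xpath expression!! Browsers add tbody when they render a table, but it is often missing from the raw page source.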

5.1 (Extension) Supplementary reference

The examples below use this XML document.

<?xml version="1.0" encoding="ISO-8859-1"?>

<bookstore>

<book>
  <title lang="eng">Harry Potter</title>
  <price>29.99</price>
</book>

<book>
  <title lang="eng">Learning XML</title>
  <price>39.95</price>
</book>

</bookstore>

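5.1.1 Selecting nodes

XPath uses path expressions to select nodes in an XML document. A node is selected by following a path, or steps.

The most useful path expressions are listed below:

Expression	Description
nodename	Selects all nodes with the name nodename.
/	Selects from the root node.
//	Selects nodes in the document that match the selection, starting from the current node, no matter where they are.
.	Selects the current node.
..	Selects the parent of the current node.
@	Selects attributes.

Examples

The table below lists some path expressions and their results:

Path expression	Result
bookstore	Selects all nodes with the name bookstore.
/bookstore	Selects the root element bookstore. Note: a path that starts with a slash ( / ) always represents an absolute path to an element.
bookstore/book	Selects all book elements that are children of bookstore.
//book	Selects all book elements, no matter where they are in the document.
bookstore//book	Selects all book elements that are descendants of bookstore, no matter where they are under bookstore.
//@lang	Selects all attributes named lang.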
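These path expressions can be run against the bookstore document with lxml. The sketch below is one way to do it, assuming lxml is installed; the document is passed as bytes because it carries its own encoding declaration.

```python
from lxml import etree

xml = b'''<?xml version="1.0" encoding="ISO-8859-1"?>
<bookstore>
<book>
  <title lang="eng">Harry Potter</title>
  <price>29.99</price>
</book>
<book>
  <title lang="eng">Learning XML</title>
  <price>39.95</price>
</book>
</bookstore>'''

root = etree.fromstring(xml)

root_elems = root.xpath('/bookstore')             # absolute path to the root element
books = root.xpath('/bookstore/book')             # book children of bookstore
titles = root.xpath('//book/title/text()')        # title text anywhere under a book
langs = root.xpath('//@lang')                     # every lang attribute value
parent_tag = root.xpath('//title/..')[0].tag      # .. steps up to the parent

print(len(books), titles, langs, parent_tag)
```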

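5.1.2 Predicates

Predicates are used to find a specific node, or a node that contains a specific value.

Predicates are written inside square brackets.

Examples

The table below lists some path expressions with predicates and their results:

Path expression	Result
/bookstore/book[1]	Selects the first book element that is a child of bookstore.
/bookstore/book[last()]	Selects the last book element that is a child of bookstore.
//title[@lang]	Selects all title elements that have an attribute named lang.
//title[@lang='eng']	Selects all title elements whose lang attribute has the value eng.
/bookstore/book[price>35.00]	Selects all book elements of bookstore whose price element has a value greater than 35.00.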

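Selecting unknown nodes

XPath wildcards can be used to select unknown XML elements.

Wildcard	Description
*	Matches any element node.
@*	Matches any attribute node.
node()	Matches a node of any kind.

Examples

The table below lists some path expressions and their results:

Path expression	Result
/bookstore/*	Selects all child elements of the bookstore element.
//*	Selects all elements in the document.
//title[@*]	Selects all title elements that have at least one attribute.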

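5.1.3 Selecting several paths (the | operator)

Using the "|" operator in a path expression selects several paths at once.

Examples

The table below lists some path expressions and their results:

Path expression	Result
//book/title | //book/price	Selects all title and price elements of book elements.
//title | //price	Selects all title and price elements in the document.
/bookstore/book/title | //price	Selects all title elements of the book elements of bookstore, and all price elements in the document.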
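Predicates, wildcards and the | operator can be checked the same way on the bookstore document. This is a sketch assuming lxml is installed; the predicate expressions are standard XPath 1.0 and the variable names are illustrative.

```python
from lxml import etree

xml = b'''<?xml version="1.0" encoding="ISO-8859-1"?>
<bookstore>
<book>
  <title lang="eng">Harry Potter</title>
  <price>29.99</price>
</book>
<book>
  <title lang="eng">Learning XML</title>
  <price>39.95</price>
</book>
</bookstore>'''
root = etree.fromstring(xml)

# Predicates: position, last() and a comparison on a child element.
first_title = root.xpath('/bookstore/book[1]/title/text()')
last_title = root.xpath('/bookstore/book[last()]/title/text()')
expensive = root.xpath('/bookstore/book[price>35.00]/title/text()')

# Wildcards: any child element, and any title that has an attribute.
children = [el.tag for el in root.xpath('/bookstore/*')]
with_attr = root.xpath('//title[@*]')

# Union: nodes from both paths, returned in document order.
union = [el.tag for el in root.xpath('//book/title | //book/price')]

print(first_title, last_title, expensive, children, len(with_attr), union)
```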

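5.2 Getting started with xpath

1. Installing the environment:

pip install lxml

2. How parsing works

Instantiate an etree object and load the data to be parsed into it
Call the object's xpath method with different xpath expressions to locate tags and extract text data

3. Instantiating the etree object

etree.parse('filePath'): loads local data into etree
etree.HTML(page_text): loads data from the internet into the object

4. Other notes

All tags in an HTML document follow a tree structure, which makes traversing and finding (locating) nodes efficient
etree stores the document in the same tree structure
The return value of the xpath method is always a list

5. Parameters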

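#1. Tag location
    Leftmost /: the xpath expression must start locating from the root tag
    Non-leftmost /: one level
    Leftmost //: locate the tag from any position (commonly used)
    Non-leftmost //: multiple levels
    //tagName: locates all tagName tags
    Attribute location: //tagName[@attrName="value"] # tags whose attrName attribute equals "value"
    Index location: //tagName[index], the index starts at 1
#2. Fuzzy matching:
//div[contains(@class, "ng")]
//div[starts-with(@class, "ta")]
#3. Getting text
/text(): the direct text content; the list usually has a single element
//text(): all text content, including nested tags; the list may have several elements
#4. Getting attributes
/@attrName
#5. Logical operators:
    # find the a tags whose href attribute is empty and whose class attribute is du
    //a[@href="" and @class="du"]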
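Each of the expression types listed above can be tried on a small inline page. The HTML below is an assumed stand-in for test.html (the real file is not part of these notes); it reuses the tang, feng and du names so the expressions stay the same.

```python
from lxml import etree

html = '''
<html><body>
<div class="tang"><ul>
  <li><a href="" class="du">Du Mu</a></li>
  <li><a id="feng" href="http://example.com/feng">Feng <b>bold</b></a></li>
</ul></div>
<div class="song"><p>Song text</p></div>
</body></html>
'''
tree = etree.HTML(html)

# Fuzzy matching on the class attribute.
contains_ng = tree.xpath('//div[contains(@class, "ng")]/@class')
starts_ta = tree.xpath('//div[starts-with(@class, "ta")]/@class')

# /text() takes direct text only, //text() also takes nested text.
direct_text = tree.xpath('//a[@id="feng"]/text()')
all_text = tree.xpath('//a[@id="feng"]//text()')

# Attribute value, logical and, and a 1-based index.
href = tree.xpath('//a[@id="feng"]/@href')
empty_href = tree.xpath('//a[@href="" and @class="du"]/text()')
second_li = tree.xpath('//li[2]/a/@id')

print(contains_ng, starts_ta, direct_text, all_text, href, empty_href, second_li)
```

Every call returns a list, so a single value is always read with an index such as [0].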

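from lxml import etree
# Build an etree object from a local file
# etree.parse uses the XML parser by default; if the file is not well-formed XML,
# use etree.parse('./test.html', etree.HTMLParser()) instead
tree = etree.parse('./test.html')

tree.xpath('/html/head/meta')            # absolute path from the root tag
tree.xpath('/html//meta')                # meta at any depth under html
tree.xpath('//meta')                     # meta anywhere in the document
tree.xpath('//div')                      # every div
tree.xpath('//div[@class="tang"]')       # attribute location
tree.xpath('//li[1]')                    # index location, starting at 1
tree.xpath('//a[@id="feng"]/text()')[0]  # direct text of the tag
tree.xpath('//div[2]//text()')           # all text under the tag
tree.xpath('//a[@id="feng"]/@href')      # attribute value

// is relative location: it stands in for any number of levels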

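6. Instantiating the object

url = 'URL of the page to crawl'
page_text = requests.get(url=url,headers=headers).text # take the text of the response
# data parsing
tree = etree.HTML(page_text)

6.1. From a local file

etree.parse('filePath'): loads local data into etree

6.2. From the network

etree.HTML(page_text): loads data from the internet into the object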

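7. Global parsing

7.1 Key points

tree.xpath('//a[@id="feng"]/text()')[0]
tree.xpath('//div[2]//text()')
tree.xpath('//a[@id="feng"]/@href')

7.2 Example usage

li_list = tree.xpath('//div[@class="box-bd"]/ul/li')

8. Local parsing: ./ is the current tag

8.1 Key points

./ means the current tag
./a[2] means the second a tag directly under the current tag
./@href is an attribute of the current tag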
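Local parsing can be tried without a network request. The sketch below uses a hand-written fragment (an assumption that only imitates a room list) and also shows the usual pitfall: inside a loop, an expression that starts with // still searches the whole document.

```python
from lxml import etree

html = '''
<div class="box-bd"><ul>
  <li><a href="/room/1">img</a><a href="/room/1">Room one</a><span><i>100</i></span></li>
  <li><a href="/room/2">img</a><a href="/room/2">Room two</a><span><i>200</i></span></li>
</ul></div>
'''
tree = etree.HTML(html)

rows = []
for li in tree.xpath('//div[@class="box-bd"]/ul/li'):
    # ./ starts from the li that xpath was called on.
    title = li.xpath('./a[2]/text()')[0]
    hot = li.xpath('./span/i/text()')[0]
    url = li.xpath('./a[1]/@href')[0]
    rows.append((title, hot, url))

# A leading // ignores the li and searches the whole document;
# .// keeps the search inside the li.
global_count = len(li.xpath('//a'))
local_count = len(li.xpath('.//a'))

print(rows, global_count, local_count)
```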

8.2 Example usage

import requests
from lxml import etree
# headers is the User-Agent dict defined earlier
url = 'https://www.huya.com/g/lol'
page_text = requests.get(url=url,headers=headers).text
# data parsing
tree = etree.HTML(page_text)
li_list = tree.xpath('//div[@class="box-bd"]/ul/li')
for li in li_list:
    # local parsing: parse the specified content under one tag
    # the leftmost ./ in a local xpath expression is the tag that xpath was called on
    title = li.xpath('./a[2]/text()')[0]
    hot = li.xpath('./span/span[2]/i[2]/text()')[0]
    detail_url = li.xpath('./a[1]/@href')[0]
    print(title,hot,detail_url)

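9. The pipe operator

9.1 Key points

| (pipe) selects several paths
1. Several expressions can be evaluated in one call
2. A node is selected if it matches the expression on either side of the pipe
It makes an xpath expression more robust across different page layouts and collects everything in one query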
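As a sketch of the layout case, the fragment below keeps city names in two differently structured blocks and collects both with one expression. The markup and class names are assumptions for illustration, not the structure of a real site.

```python
from lxml import etree

html = '''
<div class="hot"><ul><li><a>Beijing</a></li></ul></div>
<div class="all"><ul><li><a>Anshan</a></li><li><a>Beijing</a></li></ul></div>
'''
tree = etree.HTML(html)

# One call covers both blocks; results come back in document order.
cities = tree.xpath('//div[@class="hot"]//a/text() | //div[@class="all"]//a/text()')

# The same name can appear in both blocks, so deduplicate if needed.
unique_cities = sorted(set(cities))

print(cities, unique_cities)
```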

9.2 Example

import requests
from lxml import etree
headers = {
    'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36'
}
url = 'https://www.aqistudy.cn/historydata/'
page_text = requests.get(url,headers=headers).text

tree = etree.HTML(page_text)
# hot_cities = tree.xpath('//div[@class="bottom"]/ul/li/a/text()')
# one expression covers both layouts: all cities (left of the pipe) and hot cities (right of the pipe)
all_cities = tree.xpath('//div[@class="bottom"]/ul/div[2]/li/a/text() | //div[@class="bottom"]/ul/li/a/text()')
all_cities

5.3 xpath example

Requirement: crawl and parse the room name, popularity and detail-page URL of the live streams on Huya

import requests
from lxml import etree
headers = {
    'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36'
}
url = 'https://www.huya.com/g/lol'
page_text = requests.get(url=url,headers=headers).text
# data parsing
tree = etree.HTML(page_text)
li_list = tree.xpath('//div[@class="box-bd"]/ul/li')
for li in li_list:
    # local parsing: parse the specified content under one tag
    # the leftmost ./ in a local xpath expression is the tag that xpath was called on
    title = li.xpath('./a[2]/text()')[0]
    hot = li.xpath('./span/span[2]/i[2]/text()')[0]
    detail_url = li.xpath('./a[1]/@href')[0]
    print(title,hot,detail_url)

Audio

Using Baidu AI

1. Register a Baidu AI account
Installation
If pip is installed, run pip install baidu-aip.
If setuptools is installed, run python setup.py install.

from aip import AipSpeech

""" Your APPID, AK and SK """
APP_ID = 'your App ID'
API_KEY = 'your Api Key'
SECRET_KEY = 'your Secret Key'

client = AipSpeech(APP_ID, API_KEY, SECRET_KEY)

result  = client.synthesis('你好百度', 'zh', 1, {
    'vol': 5,
})

# On success the call returns the audio as bytes; on failure it returns a dict containing an error code
if not isinstance(result, dict): # result is the audio data
    with open('audio.mp3', 'wb') as f:
        f.write(result)

Embedding audio in HTML

<audio controls>
  <source src="horse.ogg" type="audio/ogg">
  <source src="horse.mp3" type="audio/mpeg">
</audio>

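Crawling videos

Dynamic loading

1. Until the video has been loaded, the page shows only an image in its place

2. A request sent by JavaScript turns the image into the video

3. Use a regular expression (re) to match the page source and pin down exactly the specified attribute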
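A minimal sketch of step 3, assuming the video address is assigned to a JavaScript variable inside a script block of the page source. The variable name srcUrl, the markup and the example.com addresses are all hypothetical; a real page has to be inspected to find where its address is stored.

```python
import re

# Hypothetical page source: the img is the placeholder,
# the script holds the real video address.
page_text = '''
<div id="poster"><img src="http://example.com/cover.jpg"></div>
<script>
var contId="1", srcUrl="http://example.com/video/demo.mp4", vdoUrl=srcUrl;
</script>
'''

# Tag-based parsers see the script body as plain text,
# so a regular expression is used to pull the value out.
ex = 'srcUrl="(.*?)"'
video_urls = re.findall(ex, page_text)

print(video_urls)
```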

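Comparison of the common parsing methods

bs4 tag location: the return value is always the located tag

bs4 parameters

Tag location: the return value is always the located tag
soup.tagName: locates the first tagName tag; returns a single tag
Attribute location: soup.find('tagName',attrName='value'); returns a single tag
find_all('tagName',attrName='value'): returns several tags (a list)
Selector location: select('selector'); also returns a list
Hierarchy selectors:
Greater-than sign (>): one level
Space: multiple levels
Getting text
string: returns only the text that sits directly inside the tag
text: returns all the text inside the tag
Getting attributes
tag['attrName']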

xpath: the return value is always a list

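	re	xpath	bs4
Installation	built in	third party	third party
Syntax	regular expressions	path matching	object oriented
Difficulty of use	hard	fairly hard	easy
Performance (processing speed)	highest	medium	lowest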

9. Summary of encoding issues

1. Set the encoding of the whole response

import requests
url = 'https://www.sogou.com/web?query=人民幣'
response = requests.get(url)
# change the encoding used to decode the response data
response.encoding = 'utf-8'
page_text = response.text
with open('./人民幣.html','w',encoding='utf-8') as fp:
    fp.write(page_text)

2. Re-decode a single value

 img_name.encode('iso-8859-1').decode('gbk')
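The encode and decode round trip works because ISO-8859-1 maps every byte to exactly one character, so no information is lost when GBK bytes are wrongly decoded with it. The sketch below simulates the garbling and then repairs it; the sample string is an assumption for illustration.

```python
original = '美女'

# Simulate the problem: GBK bytes decoded with the wrong codec.
garbled = original.encode('gbk').decode('iso-8859-1')

# Repair: turn the characters back into the original bytes,
# then decode them with the codec the page really uses.
repaired = garbled.encode('iso-8859-1').decode('gbk')

print(repr(garbled), repaired)
```

This repair only applies when the text was decoded as ISO-8859-1; if a lossy decode replaced bytes with placeholder characters, the original bytes cannot be recovered.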

Using xpath

import requests
from lxml import etree
# headers is the User-Agent dict defined earlier
# URL template
url = 'http://pic.netbian.com/4kmeinv/index_%d.html'
for page in range(1,11):
    new_url = url % page # the template only covers pages after the first
    if page == 1:
        new_url = 'http://pic.netbian.com/4kmeinv/'
    page_text = requests.get(new_url,headers=headers).text
    tree = etree.HTML(page_text)
    li_list = tree.xpath('//*[@id="main"]/div[3]/ul/li')
    for li in li_list:
        img_name = li.xpath('./a/img/@alt')[0]+'.jpg'
        img_name = img_name.encode('iso-8859-1').decode('gbk') # repair the garbled name
        img_src = 'http://pic.netbian.com'+li.xpath('./a/img/@src')[0]
        print(img_name,img_src)
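The first-page special case can be isolated in a small helper so the loop body stays simple. This is only a sketch; the function name page_url is hypothetical, and the two URL forms are the ones used in the code above.

```python
def page_url(page):
    """Return the list-page URL for a 1-based page number."""
    if page == 1:
        # The first page has no index_N suffix.
        return 'http://pic.netbian.com/4kmeinv/'
    return 'http://pic.netbian.com/4kmeinv/index_%d.html' % page

urls = [page_url(p) for p in range(1, 4)]
print(urls)
```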

Niubaba: viewing videos
