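I. Design of the Topic-Focused Web Crawler
1. Crawler name: BILIBILI anime (bangumi) popularity and ranking
2. Content crawled and data characteristics: for each anime on the ranking page, the title, total play count, and overall score; the code below also collects the danmaku (bullet-comment) count and the favorite count
3. Design overview: analyze the source of the BILIBILI ranking page to locate the required data, crawl and organize it, and draw conclusions from the result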
II. Structure of the Target Page
1. Page structure and characteristics: the data we need is inside the red box
It is located at the following paths:
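2. Technical difficulty
The required values ("play count", "danmaku count", and "favorite count") all sit under the same class, "data-box", so a single query returns them interleaved and they have to be separated afterwards.
Solution: use bs4's attribute-filtered lookup, for example [print(soup.find('meta', attrs={'name':'viewport'})) # returns the first tag whose attributes match], to fetch the elements, then split the results by field. In the code below, find_all returns every "data-box" element in page order, and a slice with a step of 3 picks out each field.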
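A minimal sketch of that splitting step, run on a hand-written HTML fragment rather than the live page. The fragment, its `rank-item` class, and the numbers in it are made up for illustration; only the `data-box` class comes from the code in this report.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: two ranking entries, each with three "data-box" spans
# in the order play count, danmaku count, favorite count.
html = """
<ul>
  <li class="rank-item">
    <span class="data-box">1.2億</span>
    <span class="data-box">35.6萬</span>
    <span class="data-box">880.1萬</span>
  </li>
  <li class="rank-item">
    <span class="data-box">9000.5萬</span>
    <span class="data-box">12.3萬</span>
    <span class="data-box">410萬</span>
  </li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")

# find_all returns the matching tags in document order, so the three
# fields come back interleaved: play, danmaku, favorite, play, ...
values = [tag.get_text(strip=True) for tag in soup.find_all(attrs={"class": "data-box"})]

# A slice with step 3 de-interleaves them.
play = values[0::3]
danmu = values[1::3]
fav = values[2::3]

print(play)
print(danmu)
print(fav)
```

The slicing only works while every entry carries exactly three `data-box` elements. If the page ever omits one, the columns shift, so a check such as `len(values) % 3 == 0` is a cheap safeguard.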
III. Crawler Program Design
1. Data crawling and collection: the code is as follows
def gettitle(soup):  # get the titles
    tags = soup.find_all(attrs={'class': 'title'})
    data = []
    for tag in tags:
        data.append(tag.text)
    return data
def getscore(soup):  # get the overall scores
    tags = soup.find_all(attrs={'class': 'pts'})
    data = []
    for tag in tags:
        data.append(tag.text)
    return data
def getplay(soup):  # get the play counts
    tags = soup.find_all(attrs={'class': 'data-box'})
    data = []
    for tag in tags:
        data.append(tag.text)
    play = data[0:300:3]  # every third value, starting at the first
    return play
def getdanmu(soup):  # get the danmaku counts
    tags = soup.find_all(attrs={'class': 'data-box'})
    data = []
    for tag in tags:
        data.append(tag.text)
    danmu = data[1:300:3]  # every third value, starting at the second
    return danmu
def getfav(soup):  # get the favorite counts
    tags = soup.find_all(attrs={'class': 'data-box'})
    data = []
    for tag in tags:
        data.append(tag.text)
    fav = data[2:300:3]  # every third value, starting at the third
    return fav
Label each field and save the result as a CSV file
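title = gettitle(soup)
danmu = getdanmu(soup)
play = getplay(soup)
score = getscore(soup)
fav = getfav(soup)
# One row per field, then transpose, so fields of unequal length are padded instead of raising
df = pd.DataFrame.from_dict({'rank': range(1, 51), 'title': title, 'danmu': danmu, 'play': play, 'score': score, 'fav': fav},
                            orient='index')
df = df.T
df.to_csv('D:/bilibilidata.csv', index=False)  # index=False keeps the file to the six data columns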
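One likely reason for building the frame with `orient='index'` and then transposing is that the scraped lists may not all have the same length. The sketch below, on made-up data, shows how that construction pads the shorter field where a direct `pd.DataFrame(...)` call would raise.

```python
import pandas as pd

# Made-up columns of unequal length: one title is missing.
columns = {
    "rank": [1, 2, 3],
    "title": ["A", "B"],
}

# pd.DataFrame(columns) would raise because the lengths differ.
# Building one row per field and transposing pads the short field instead.
df = pd.DataFrame.from_dict(columns, orient="index").T

print(df)
print(df["title"].isna().sum())
```

Padding keeps the script running but hides the mismatch, so comparing `len(title)`, `len(play)`, and so on before saving makes a silent misalignment easier to catch.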
2. Data cleaning and processing
The crawled values use different units, so convert them to a common unit
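import pandas as pd
import matplotlib.pyplot as plt
filename = 'D:/bilibilidata.csv'  # the file written by the crawling step
colnames = ["ranking", "title", "danmu", "play", "score", "fav"]
plt.rcParams['font.sans-serif'] = ['SimHei']  # font with CJK glyphs, needed only when chart text is Chinese
data = pd.read_csv(filename, skiprows=1, names=colnames)
# Convert strings such as "1.2億" or "35.6萬" to plain numbers and write them back
for col in ("play", "danmu", "fav"):
    converted = []
    for i in data[col]:
        i = str(i).strip()
        if i.endswith("億"):  # 億 = 10^8
            converted.append(float(i[:-1]) * 1e8)
        elif i.endswith("萬"):  # 萬 = 10^4
            converted.append(float(i[:-1]) * 1e4)
        else:
            converted.append(float(i))
    data[col] = converted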
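The unit conversion can also be written as a small standalone function, which is easier to test than a loop over a DataFrame column. This is only a sketch: `parse_count` and `UNIT_MULTIPLIERS` are hypothetical names, and the simplified-character suffixes are included on the assumption that the page may serve them.

```python
# Multipliers for the unit suffixes. The simplified forms are an assumption,
# added in case the page serves them instead of the traditional ones.
UNIT_MULTIPLIERS = {"億": 1e8, "亿": 1e8, "萬": 1e4, "万": 1e4}

def parse_count(text):
    """Convert a count such as '1.2億' or '35.6萬' to a plain float."""
    text = str(text).strip().replace(",", "")
    if text and text[-1] in UNIT_MULTIPLIERS:
        return float(text[:-1]) * UNIT_MULTIPLIERS[text[-1]]
    return float(text)

samples = ["1.2億", "35.6萬", " 9527 ", "410万"]
parsed = [parse_count(s) for s in samples]
print(parsed)
```

Using `float` instead of `eval` means a malformed cell raises a `ValueError` rather than executing whatever text was scraped.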
Remove duplicate rows
data = data.drop_duplicates()  # duplicated() only flags duplicates; drop_duplicates() removes them
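Data visualization
(1) Correlation between the variables
print(data.select_dtypes("number").corr())  # correlation of the numeric columns only
print(data.describe())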
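By default, `corr()` reports the Pearson correlation coefficient for each pair of columns. The sketch below computes that coefficient by hand on made-up numbers, mainly to show how to read its sign when one of the variables is a rank.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Made-up numbers: rank 1 is the best position, and play counts fall as the rank number grows.
ranking = [1, 2, 3, 4, 5]
play = [500, 400, 300, 200, 100]

r = pearson(ranking, play)
print(round(r, 3))
```

Because rank 1 is the top position, a negative coefficient between the rank number and the play count means that better-ranked titles have more plays.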
(2) Data visualization
Overall score vs. rank bar chart
plt.figure(dpi=240)
ranking = data.ranking
score = data.score
plt.bar(ranking, score, color=(0, 0, 0.8, 0.6))  # RGBA: semi-transparent blue
plt.title("Overall score by rank")
plt.xlabel("Rank")
plt.ylabel("Overall score")
plt.show()
Danmaku count vs. rank bar chart
plt.figure(dpi=240)
ranking = data.ranking
danmu = data.danmu
plt.bar(ranking, danmu, color=(0, 0, 0.8, 0.6))
plt.title("Danmaku count by rank")
plt.xlabel("Rank")
plt.ylabel("Danmaku count")
plt.show()

Play count vs. rank bar chart
plt.figure(dpi=240)
ranking = data.ranking
play = data.play
plt.bar(ranking, play, color=(0, 0, 0.8, 0.6))
plt.title("Play count by rank")
plt.xlabel("Rank")
plt.ylabel("Play count")  # the y-axis shows play counts, not danmaku counts
plt.show()
(The remaining charts are omitted.)
3. Building the regression equation
ranking = data.ranking
play = data.play
danmu = data.danmu
score = data.score
def ft(p, x):  # quadratic model y = a*x^2 + b*x + c
    a, b, c = p
    return a * (x ** 2) + (b * x) + c
def er_ft(p, x, y):  # residual between the model and the observed values
    return ft(p, x) - y
play_np = np.array(play)
score_np = np.array(score)
danmu_np = np.array(danmu)
ranking_np = np.array(ranking)
p0 = np.array([2.0, 3.0, 4.0])  # initial guess for a, b, c
plt.figure(dpi=240)
plt.scatter(ranking, play, label='Sample data', color=(0, 0, 0.8, 0.8))
P = leastsq(er_ft, p0, args=(ranking_np, play_np))
a, b, c = P[0]  # P[0] holds the fitted parameters
x = np.linspace(ranking_np.min(), ranking_np.max(), 1000)  # draw the curve only over the observed ranks
y = a * (x ** 2) + (b * x) + c
plt.plot(x, y, color="green", label="Fitted curve", linewidth=2)
plt.title('Play count: scatter plot and fitted curve')
plt.xlabel("Rank")
plt.ylabel("Play count")
plt.legend()
plt.grid()
plt.show()
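For example
(1) Regression curve and scatter plot of rank versus play count
(2) Regression curve and scatter plot of rank versus danmaku count
(3) Regression curve and scatter plot of overall score versus rank
The above covers the data visualization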
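One way to sanity-check a quadratic fit like the one in section III.3 is to run it on synthetic data whose coefficients are known. The sketch below does that with `numpy.polyfit`, which solves the same least-squares problem for polynomial models, rather than with the `leastsq` call used in this report. The coefficients and data are made up for illustration.

```python
import numpy as np

# Made-up "true" coefficients for y = a*x^2 + b*x + c
a_true, b_true, c_true = 0.5, -3.0, 20.0

x = np.arange(1, 51, dtype=float)          # ranks 1..50
y = a_true * x**2 + b_true * x + c_true    # noiseless synthetic data

# polyfit returns coefficients from the highest power down: [a, b, c]
a, b, c = np.polyfit(x, y, 2)

# Residuals of the fitted curve at the sample points
residuals = (a * x**2 + b * x + c) - y
print(a, b, c)
print(np.abs(residuals).max())
```

If the recovered coefficients match the ones used to generate the data, the model and residual functions are wired up correctly, and any poor fit on the real data is a property of the data rather than of the code.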
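IV. Conclusions
1. From the analysis and visualization of the data, the following conclusions can be drawn:
(1) An anime's overall ranking is positively correlated with its play count, danmaku count, and favorite count: titles with higher counts tend to rank higher
(2) The play count, danmaku count, and favorite count together appear to determine an anime's overall score
2. Summary
The main difficulties in this programming task were crawling the web page data and then classifying, organizing, and cleaning it. After this exercise I am more proficient with several Python libraries and have a deeper understanding of web page structure.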
V. Source Code
part1 Data crawling
import pandas as pd
import requests
from bs4 import BeautifulSoup
response = requests.get('https://www.bilibili.com/ranking/bangumi/13/0/7', timeout=10)
response.raise_for_status()  # stop early on an HTTP error
html = response.text
soup = BeautifulSoup(html, 'lxml')
def gettitle(soup):  # get the titles
    tags = soup.find_all(attrs={'class': 'title'})
    data = []
    for tag in tags:
        data.append(tag.text)
    return data
def getscore(soup):  # get the overall scores
    tags = soup.find_all(attrs={'class': 'pts'})
    data = []
    for tag in tags:
        data.append(tag.text)
    return data
def getplay(soup):  # get the play counts
    tags = soup.find_all(attrs={'class': 'data-box'})
    data = []
    for tag in tags:
        data.append(tag.text)
    play = data[0:300:3]  # every third value, starting at the first
    return play
def getdanmu(soup):  # get the danmaku counts
    tags = soup.find_all(attrs={'class': 'data-box'})
    data = []
    for tag in tags:
        data.append(tag.text)
    danmu = data[1:300:3]  # every third value, starting at the second
    return danmu
def getfav(soup):  # get the favorite counts
    tags = soup.find_all(attrs={'class': 'data-box'})
    data = []
    for tag in tags:
        data.append(tag.text)
    fav = data[2:300:3]  # every third value, starting at the third
    return fav
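title = gettitle(soup)
danmu = getdanmu(soup)
play = getplay(soup)
score = getscore(soup)
fav = getfav(soup)
# One row per field, then transpose, so fields of unequal length are padded instead of raising
df = pd.DataFrame.from_dict({'rank': range(1, 51), 'title': title, 'danmu': danmu, 'play': play, 'score': score, 'fav': fav},
                            orient='index')
df = df.T
df.to_csv('D:/bilibilidata.csv', index=False)  # index=False keeps the file to the six data columns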
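part2 Data visualization (only one example is included because of length)
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from scipy.optimize import leastsq
filename = 'D:/bilibilidata.csv'  # the file written by part1
colnames = ["ranking", "title", "danmu", "play", "score", "fav"]
data = pd.read_csv(filename, skiprows=1, names=colnames)
# Unify the units as in section III.2 so that the columns are numeric
for col in ("play", "danmu"):
    converted = []
    for i in data[col]:
        i = str(i).strip()
        if i.endswith("億"):  # 億 = 10^8
            converted.append(float(i[:-1]) * 1e8)
        elif i.endswith("萬"):  # 萬 = 10^4
            converted.append(float(i[:-1]) * 1e4)
        else:
            converted.append(float(i))
    data[col] = converted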
ranking = data.ranking
play = data.play
danmu = data.danmu
score = data.score  # the column is named "score" in colnames, not "pts"
plt.figure(dpi=240)
plt.bar(ranking, score, color=(0, 0, 0.8, 0.6))  # RGBA: semi-transparent blue
plt.title("Overall score by rank")
plt.xlabel("Rank")
plt.ylabel("Overall score")
plt.show()
def ft(p, x):  # quadratic model y = a*x^2 + b*x + c
    a, b, c = p
    return a * (x ** 2) + (b * x) + c
def er_ft(p, x, y):  # residual between the model and the observed values
    return ft(p, x) - y
play_np = np.array(play)
score_np = np.array(score)
danmu_np = np.array(danmu)
ranking_np = np.array(ranking)
p0 = np.array([2.0, 3.0, 4.0])  # initial guess for a, b, c
plt.figure(dpi=240)
plt.scatter(ranking, play, label='Sample data', color=(0, 0, 0.8, 0.8))
P = leastsq(er_ft, p0, args=(ranking_np, play_np))
a, b, c = P[0]  # P[0] holds the fitted parameters
x = np.linspace(ranking_np.min(), ranking_np.max(), 1000)  # draw the curve only over the observed ranks
y = a * (x ** 2) + (b * x) + c
plt.plot(x, y, color="green", label="Fitted curve", linewidth=2)
plt.title('Play count: scatter plot and fitted curve')
plt.xlabel("Rank")
plt.ylabel("Play count")
plt.legend()
plt.grid()
plt.show()