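Note: this post is original work. Please credit the source when reposting.
An aside: after the Spring Festival I came back to campus with nothing to do and felt rusty and unmotivated; call it "post-Spring-Festival syndrome". I got detailed NBA data from the Kesci website, and I also wanted to systematically review what I have learned in half a year of studying data mining, so I decided to write a data analysis report on Kevin Durant's playing style (I am a Durant fan). Learning while having fun, you could say. Enough talk, let's get to work.
1 Durant VS Who?
To talk about Durant's characteristics we need comparisons; otherwise how would we know what makes him distinctive? I picked the comparison players on three criteria. First, position: small forwards and guards, since Durant is a small forward who also plays some guard. Second, roughly the same era, give or take a few years (Kobe, for example). Third, players who can fairly be called superstars. I ended up with the following players: Kobe Bryant, LeBron James, Stephen Curry, Russell Westbrook, Paul George, Carmelo Anthony, James Harden, Chris Paul, and Kawhi Leonard. I leave out rising stars and older legends: statistics mean different things in different eras, and rising stars have too little data for a comparison to be worthwhile. The selection is of course not perfect; it is my own subjective choice (haha...).
2 Data
Data source: https://www.kesci.com/apps/home/dataset/599a6e66c8d2787da4d1e21d/document
3 The first cut into the data
The playoffs are the best stage for superstars. They have given us so many classic moments, and the moments we still talk about are the ones that earned them their honors. So I am going to start the analysis with the playoffs (yes, I am being that willful).
3.1 First, let's see what the playoff data contains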
>>> import pandas as pd
>>> # Raw string, so the backslashes in the Windows path are not read as escape sequences
>>> data_player_playoff = pd.read_csv(r'E:\Python\Program\NBA_Data\data\player_playoff.csv')
>>> data_player_playoff.head()
                 球員     賽季   球隊 結果             比分  時間     投籃  命中  出手   三分  ...  \
0  Kelenna Azubuike  11-12  DAL  L    OKC95-79DAL   5  0.333   1   3  1.0  ...
1  Kelenna Azubuike  06-07  GSW  L  UTA115-101GSW   1    NaN   0   0  NaN  ...
2  Kelenna Azubuike  06-07  GSW  W  UTA105-125GSW   3  0.000   0   1  NaN  ...
3  Kelenna Azubuike  06-07  GSW  W   DAL86-111GSW   2  1.000   1   1  NaN  ...
4  Kelenna Azubuike  06-07  GSW  L  DAL118-112GSW   0    NaN   0   0  NaN  ...

   罰球出手  籃板  前場  后場  助攻  搶斷  蓋帽  失誤  犯規  得分
0     0   1   1   0   0   1   0   1   0   3
1     0   0   0   0   0   0   0   0   0   0
2     0   0   0   0   0   0   0   0   1   0
3     0   0   0   0   0   0   0   0   0   2
4     0   0   0   0   0   0   0   0   0   0

[5 rows x 24 columns]
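head(n) is a DataFrame method (not a top-level pandas function): it returns the first n rows, 5 by default. tail(n) does the same for the last n rows.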
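As a quick illustration of the two methods, here is a minimal sketch on a toy frame. The frame `demo` and its values are made up for the example and are not taken from the playoff table.

```python
import pandas as pd

# Toy frame standing in for the playoff table (values are invented).
demo = pd.DataFrame({
    "player": list("ABCDEFG"),
    "player_score": [3, 0, 0, 2, 0, 10, 7],
})

first_five = demo.head()    # no argument: first 5 rows
first_two = demo.head(2)    # first 2 rows
last_three = demo.tail(3)   # last 3 rows

print(first_two)
print(last_three)
```

Both methods return a new DataFrame and keep the original row labels, so the index of `last_three` starts at 4 rather than 0.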
3.2 Basic information about the data
>>> data_player_playoff.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 49743 entries, 0 to 49742
Data columns (total 24 columns):
球員      49615 non-null object
賽季      49743 non-null object
球隊      49743 non-null object
結果      49743 non-null object
比分      49743 non-null object
時間      49743 non-null int64
投籃      45767 non-null float64
命中      49743 non-null int64
出手      49743 non-null int64
三分      24748 non-null float64
三分命中    49743 non-null int64
三分出手    49743 non-null int64
罰球      29751 non-null float64
罰球命中    49743 non-null int64
罰球出手    49743 non-null int64
籃板      49743 non-null int64
前場      49743 non-null int64
后場      49743 non-null int64
助攻      49743 non-null int64
搶斷      49743 non-null int64
蓋帽      49743 non-null int64
失誤      49743 non-null int64
犯規      49743 non-null int64
得分      49743 non-null int64
dtypes: float64(3), int64(16), object(5)
memory usage: 9.1+ MB
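The three percentage columns are the only numeric ones with missing values. Judging from the first five rows printed earlier, a percentage is NaN exactly when the player had no attempts in that game. The sketch below reproduces that pattern with the makes and attempts from those five rows, using the English column names adopted in section 3.3 (`hit`, `shot`, `shoot`); that the rule holds for the whole file is my assumption, not something the output proves.

```python
import numpy as np
import pandas as pd

# Makes and attempts copied from the five rows shown by head().
games = pd.DataFrame({
    "hit":  [1, 0, 0, 1, 0],
    "shot": [3, 0, 1, 1, 0],
})

# Field-goal percentage; zero attempts become NaN instead of a division by zero.
games["shoot"] = games["hit"] / games["shot"].replace(0, np.nan)

missing_per_column = games.isna().sum()      # NaN count per column
no_attempt_games = games[games["shot"] == 0]

print(missing_per_column)
```

`isna().sum()` gives the same information as the non-null counts in `info()`, just from the other direction.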
3.3 Chinese column names get in the way of later processing, so rename the columns
>>> data_player_playoff.columns = ['player','season','team','result','team_score','time','shoot','hit','shot','three_pts','three_pts_hit','three_pts_shot','free_throw','free_throw_hit','free_throw_shot','backboard','front_court','back_court','assists','steals','block_shot','errors','foul','player_score']
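Assigning a full list to `columns` depends on getting all 24 names in exactly the right order. A mapping passed to `rename` is an alternative that ties each new name to its old one. The sketch below uses only three of the columns, and the row of data is invented for illustration.

```python
import pandas as pd

# One made-up row with three of the original column names.
raw = pd.DataFrame({"球員": ["Kevin Durant"], "賽季": ["16-17"], "得分": [38]})

# Old name -> new name; columns not listed in the mapping are left untouched.
column_map = {"球員": "player", "賽季": "season", "得分": "player_score"}
renamed = raw.rename(columns=column_map)

print(list(renamed.columns))
```

`rename` returns a new frame, so `raw` keeps its Chinese column names unless the result is assigned back.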
3.4 Select the rows for Durant, Kobe, James, Curry, Westbrook, George, Anthony, Harden, Paul, and Leonard
>>> kd_data_off = data_player_playoff[data_player_playoff.player == 'Kevin Durant']
>>> jh_data_off = data_player_playoff[data_player_playoff.player == 'James Harden']
>>> kb_data_off = data_player_playoff[data_player_playoff.player == 'Kobe Bryant']
>>> lj_data_off = data_player_playoff[data_player_playoff.player == 'LeBron James']
>>> kl_data_off = data_player_playoff[data_player_playoff.player == 'Kawhi Leonard']
>>> sc_data_off = data_player_playoff[data_player_playoff.player == 'Stephen Curry']
>>> rw_data_off = data_player_playoff[data_player_playoff.player == 'Russell Westbrook']
>>> pg_data_off = data_player_playoff[data_player_playoff.player == 'Paul George']
>>> ca_data_off = data_player_playoff[data_player_playoff.player == 'Carmelo Anthony']
>>> cp_data_off = data_player_playoff[data_player_playoff.player == 'Chris Paul']
>>> super_data_off = pd.DataFrame()
>>> super_data_off = pd.concat([kd_data_off, kb_data_off, jh_data_off, lj_data_off, sc_data_off, kl_data_off, cp_data_off, rw_data_off, pg_data_off, ca_data_off])
>>> super_data_off.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1087 entries, 9721 to 904
Data columns (total 24 columns):
player             1087 non-null object
season             1087 non-null object
team               1087 non-null object
result             1087 non-null object
team_score         1087 non-null object
time               1087 non-null int64
shoot              1085 non-null float64
hit                1087 non-null int64
shot               1087 non-null int64
three_pts          1059 non-null float64
three_pts_hit      1087 non-null int64
three_pts_shot     1087 non-null int64
free_throw         1015 non-null float64
free_throw_hit     1087 non-null int64
free_throw_shot    1087 non-null int64
backboard          1087 non-null int64
front_court        1087 non-null int64
back_court         1087 non-null int64
assists            1087 non-null int64
steals             1087 non-null int64
block_shot         1087 non-null int64
errors             1087 non-null int64
foul               1087 non-null int64
player_score       1087 non-null int64
dtypes: float64(3), int64(16), object(5)
memory usage: 212.3+ KB
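The ten separate filters followed by `concat` can also be written as a single membership test with `Series.isin`. This is a sketch of that alternative on a toy frame with invented scores, not the code used to produce the output above. One difference worth knowing: `isin` keeps the rows in their original file order, whereas the `concat` version groups them player by player.

```python
import pandas as pd

# Hypothetical shortlist; the post uses ten names.
stars = ["Kevin Durant", "Kobe Bryant", "James Harden"]

# Toy stand-in for data_player_playoff (scores are invented).
all_games = pd.DataFrame({
    "player": ["Kevin Durant", "Kelenna Azubuike", "Kobe Bryant",
               "Kevin Durant", "James Harden"],
    "player_score": [30, 3, 25, 28, 20],
})

# Keep only rows whose player is on the shortlist, then renumber the index.
selected = all_games[all_games["player"].isin(stars)].reset_index(drop=True)

print(selected)
```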
3.5 Save the data for these ten players to its own file
>>> super_data_off.to_csv('super_star_playoff.csv', index=False)
4 Data analysis
4.1 First, how many playoff games has each of them played?
>>> super_data_off.player.value_counts()
Kobe Bryant          220
LeBron James         217
Kevin Durant         106
James Harden          88
Kawhi Leonard         87
Russell Westbrook     87
Chris Paul            76
Stephen Curry         75
Carmelo Anthony       66
Paul George           65
Name: player, dtype: int64
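This shows the dominance of James reaching the Finals year after year: he trails Kobe by only three games and will pass him this year, and he probably has several more Finals runs left. Durant is far behind James in games played; my guess is that he will finish close to Kobe's total.
4.2 Plain and simple: look at their playoff scoring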
>>> super_data_off.groupby('player').player_score.describe()
                   count       mean        std   min   25%   50%    75%   max
player
Carmelo Anthony     66.0  25.651515   8.471658   2.0  21.0  25.0  31.00  42.0
Chris Paul          76.0  21.434211   7.691269   4.0  16.0  21.5  27.00  35.0
James Harden        88.0  20.681818  10.485398   0.0  13.0  19.0  28.00  45.0
Kawhi Leonard       87.0  16.459770   8.428640   2.0  11.0  16.0  21.00  43.0
Kevin Durant       106.0  28.754717   6.979987  10.0  25.0  29.0  33.75  41.0
Kobe Bryant        220.0  25.636364   9.856715   0.0  20.0  26.0  32.00  50.0
LeBron James       217.0  28.400922   7.826865   7.0  23.0  28.0  33.00  49.0
Paul George         65.0  18.984615   9.299685   2.0  12.0  19.0  26.00  39.0
Russell Westbrook   87.0  25.275862   8.187753   7.0  19.0  26.0  30.00  51.0
Stephen Curry       75.0  26.200000   8.109054   6.0  21.5  26.0  32.50  44.0
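This shows that Durant is an elite scorer: he has the highest mean (28.75) and the smallest standard deviation (6.98) of the ten, an early hint that he is rock steady.
Here comes the bar chart of scoring averages, so hold on tight.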
# coding: utf-8
import matplotlib as mpl
import matplotlib.pyplot as plt
import pandas as pd

# Font settings; only needed when plot labels contain Chinese characters
mpl.rcParams['font.sans-serif'] = ['FangSong']   # default font
mpl.rcParams['axes.unicode_minus'] = False       # keep the minus sign '-' from showing as a box in saved figures

super_data_off = pd.read_csv('super_star_playoff.csv')
# Mean playoff points per game for each player (index is sorted alphabetically by name)
super_off_mean_score = super_data_off.groupby('player')['player_score'].mean()
print(super_off_mean_score.index)
# Display names, in the same alphabetical order as the groupby index
super_name = ['Anthony', 'Paul', 'Harden', 'Leonard', 'Durant', 'Kobe', 'James', 'George', 'Westbrook', 'Curry']

# Draw the bar chart
plt.bar(range(len(super_off_mean_score)), super_off_mean_score, align='center')
plt.ylabel('Points')
plt.title('Playoff scoring averages of the stars')
plt.xticks(range(len(super_off_mean_score)), super_name)
plt.ylim(15, 35)
# Write each value above its bar
for x, y in enumerate(super_off_mean_score):
    plt.text(x, y + 1, '%s' % round(y, 2), ha='center')
plt.show()

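By scoring average, Durant and James form the top tier; Anthony, Kobe, Westbrook, and Curry the second; Paul, Harden, Leonard, and George the third. Harden's number should rise noticeably this year, since he started his playoff career as a sixth man. Durant's four scoring titles were no accident: as a scorer he really is one of the league's superstars.
Next, the trend of each star's playoff scoring average season by season.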
# (name in the data, display name) for the six players plotted here
trend_players = [('Kevin Durant', 'Durant'), ('LeBron James', 'James'),
                 ('Kobe Bryant', 'Kobe'), ('Russell Westbrook', 'Westbrook'),
                 ('Stephen Curry', 'Curry'), ('Carmelo Anthony', 'Anthony')]
plt.figure()
for i, (name, label) in enumerate(trend_players):
    # Mean points per playoff game in each season
    season_score = (super_data_off[super_data_off.player == name]
                    .groupby('season')['player_score'].mean())
    if name == 'Kobe Bryant':
        # Season strings sort alphabetically, so Kobe's four 1990s seasons
        # end up last; move them back to the front.
        season_score = pd.concat([season_score.iloc[-4:], season_score.iloc[:-4]])
    plt.subplot(321 + i)
    plt.title(label + ': playoff scoring average by season', color='red')
    plt.ylabel('Points')
    plt.xticks(range(len(season_score)), season_score.index)
    plt.plot(list(season_score), 'k', list(season_score), 'bo')
    # Write each value next to its point
    for x, y in enumerate(season_score):
        plt.text(x, y + 0.2, '%s' % round(y, 2), ha='center')
plt.show()

Now use pie charts to look at the distribution of their scoring.
super_name_E = ['Kevin Durant', 'LeBron James', 'Kobe Bryant', 'Russell Westbrook', 'Stephen Curry', 'Carmelo Anthony']
super_name_C = ['Durant', 'James', 'Kobe', 'Westbrook', 'Curry', 'Anthony']   # display names
plt.figure(facecolor='bisque')
colors = ['red', 'yellow', 'peru', 'springgreen']
player_labels = ['under 20', '20-29', '30-39', '40+']
explode = [0, 0.1, 0, 0]   # pull out the 20-29 slice
for i in range(len(super_name_E)):
    score = super_data_off[super_data_off.player == super_name_E[i]].player_score
    n_games = float(len(score))
    # Share of games in each scoring range; boolean masks replace the inner merges
    player_score_range = [
        (score < 20).sum() / n_games,
        ((score >= 20) & (score < 30)).sum() / n_games,
        ((score >= 30) & (score < 40)).sum() / n_games,
        (score >= 40).sum() / n_games,
    ]
    plt.subplot(231 + i)
    plt.title(super_name_C[i] + ' scoring distribution', color='blue')
    plt.pie(player_score_range, labels=player_labels, colors=colors, labeldistance=1.1,
            autopct='%.01f%%', shadow=False, startangle=90, pctdistance=0.8, explode=explode)
    plt.axis('equal')
plt.show()

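The pie charts show that Durant and James are far ahead in scoring consistency: their games are concentrated between 20 and 40 points, which account for about 80% of the total. They not only score a lot, they do it very reliably. James has the largest share of 40+ games, followed by Curry and Durant. Taken together, this suggests Durant is the steadiest scorer of the group. Rock steady indeed! To look at consistency in numbers, here is a bar chart of the standard deviation of their scores: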
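The four scoring ranges used in the pie charts can also be produced with `pd.cut`, which assigns each score to a labelled bin in one call. This is a minimal sketch on ten made-up scores; the bin edges are my choice to match the under 20 / 20-29 / 30-39 / 40+ split and rely on scores being whole numbers.

```python
import pandas as pd

# Ten invented single-game scores.
scores = pd.Series([12, 19, 20, 25, 29, 30, 39, 40, 45, 33])

# pd.cut uses right-closed intervals by default: (-1, 19], (19, 29], (29, 39], (39, inf]
bins = [-1, 19, 29, 39, float("inf")]
labels = ["under 20", "20-29", "30-39", "40+"]
bucket = pd.cut(scores, bins=bins, labels=labels)

# Fraction of games in each range, in bin order.
share = bucket.value_counts(normalize=True).sort_index()

print(share)
```

The resulting `share` can be passed directly to `plt.pie`, with `share.index` as the labels.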
# Standard deviation of points per game for each player
std = super_data_off.groupby('player')['player_score'].std()
# Durant is fifth in alphabetical order; draw his bar in blue
color = ['red', 'red', 'red', 'red', 'blue', 'red', 'red', 'red', 'red', 'red']
print(std)
plt.barh(range(10), std, align='center', color=color, alpha=0.8)
plt.xlabel('Standard deviation', color='blue')
plt.ylabel('Player', color='blue')
plt.yticks(range(len(super_name)), super_name)
plt.xlim(6, 11)
# Write each value at the end of its bar
for x, y in enumerate(std):
    plt.text(y + 0.1, x, '%s' % round(y, 2), va='center')
plt.show()

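The bar chart of standard deviations makes Durant's consistency clear: his is the smallest of the ten at 6.98 (the smaller the standard deviation, the less the scores vary from game to game).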
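A raw standard deviation favors low scorers, because a player who averages 16 has less room to vary than one who averages 28. One common way to account for that is the coefficient of variation, std divided by mean. The sketch below applies it to the mean and std values from the `describe()` output in section 4.2; using this measure is my suggestion and not part of the original analysis.

```python
import pandas as pd

# Mean and std of playoff points per game, copied from the describe() table.
summary = pd.DataFrame(
    {
        "mean": [25.651515, 21.434211, 20.681818, 16.459770, 28.754717,
                 25.636364, 28.400922, 18.984615, 25.275862, 26.200000],
        "std": [8.471658, 7.691269, 10.485398, 8.428640, 6.979987,
                9.856715, 7.826865, 9.299685, 8.187753, 8.109054],
    },
    index=["Carmelo Anthony", "Chris Paul", "James Harden", "Kawhi Leonard",
           "Kevin Durant", "Kobe Bryant", "LeBron James", "Paul George",
           "Russell Westbrook", "Stephen Curry"],
)

# Coefficient of variation: spread relative to the scoring level, smallest first.
cv = (summary["std"] / summary["mean"]).sort_values()

print(cv.round(3))
```

Durant still comes out first on this relative measure, so the conclusion does not depend on which of the two is used.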
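4.3 Shooting style and efficiency
When evaluating a player, where he shoots from and how often he makes it are important indicators. Players can be grouped into sharpshooters, three-point shooters, mid-range specialists, and slashers who attack the paint; and then there are the masters of drawing fouls, such as Harden.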
import numpy as np

super_name_E = ['Carmelo Anthony', 'Chris Paul', 'James Harden', 'Kawhi Leonard', 'Kevin Durant', 'Kobe Bryant', 'LeBron James', 'Paul George', 'Russell Westbrook', 'Stephen Curry']
bar_width = 0.25
# Per-player mean of the per-game percentages (games with NaN are skipped)
shoot = super_data_off.groupby('player')['shoot'].mean()
three_pts = super_data_off.groupby('player')['three_pts'].mean()
free_throw = super_data_off.groupby('player')['free_throw'].mean()
# Three bars per player, side by side
plt.bar(np.arange(10), shoot, align='center', label='Field-goal %', color='red', width=bar_width)
plt.bar(np.arange(10) + bar_width, three_pts, align='center', color='blue', label='Three-point %', width=bar_width)
plt.bar(np.arange(10) + 2 * bar_width, free_throw, align='center', color='green', label='Free-throw %', width=bar_width)
# Write each value above its bar
for x, y in enumerate(shoot):
    plt.text(x, y + 0.01, '%s' % round(y, 2), ha='center')
for x, y in enumerate(three_pts):
    plt.text(x + bar_width, y + 0.01, '%s' % round(y, 2), ha='center')
for x, y in enumerate(free_throw):
    plt.text(x + 2 * bar_width, y + 0.01, '%s' % round(y, 2), ha='center')
plt.legend()
plt.ylim(0.3, 1.0)
plt.title('Shooting percentages by player')
plt.xlabel('Player')
plt.xticks(np.arange(10) + bar_width, super_name)
plt.ylabel('Percentage')
plt.show()

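The highest field-goal, three-point, and free-throw percentages belong to Leonard, Curry, and Curry respectively, which shows how strong Curry is as a shooter. Durant ranks third in all three, a sign of how well-rounded his scoring is.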
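One caveat about the percentages: the code averages the per-game percentage column, which weights a 1-for-1 night the same as a 9-for-20 night. The conventional percentage is total makes divided by total attempts. The sketch below shows how the two can differ, on invented numbers for two hypothetical players A and B.

```python
import pandas as pd

# Invented box scores: makes and attempts per game.
games = pd.DataFrame({
    "player": ["A", "A", "A", "B", "B"],
    "hit":    [1, 9, 0, 5, 5],
    "shot":   [1, 20, 0, 10, 10],
})
games["shoot"] = games["hit"] / games["shot"]   # 0/0 becomes NaN

# Method used in the post: mean of per-game percentages (NaN games skipped).
per_game_mean = games.groupby("player")["shoot"].mean()

# Alternative: total makes over total attempts.
totals = games.groupby("player")[["hit", "shot"]].sum()
overall = totals["hit"] / totals["shot"]

print(pd.DataFrame({"per_game_mean": per_game_mean, "overall": overall}))
```

For player A the two numbers are far apart because of the single 1-for-1 game; for player B, whose attempts are the same every night, they agree. With ten high-volume shooters the gap is likely small, but the rankings for close values could shift.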
import numpy as np

super_name_E = ['Carmelo Anthony', 'Chris Paul', 'James Harden', 'Kawhi Leonard', 'Kevin Durant', 'Kobe Bryant', 'LeBron James', 'Paul George', 'Russell Westbrook', 'Stephen Curry']
bar_width = 0.25
# Career playoff totals per player
three_pts = super_data_off.groupby('player')['three_pts_hit'].sum()
free_throw_pts = super_data_off.groupby('player')['free_throw_hit'].sum()
sum_pts = super_data_off.groupby('player')['player_score'].sum()
# Share of total points from threes (3 each) and free throws (1 each); twos are the remainder
three_pts_rate = np.array(three_pts) * 3.0 / np.array(sum_pts)
free_throw_pts_rate = np.array(free_throw_pts) * 1.0 / np.array(sum_pts)
two_pts_rate = 1.0 - three_pts_rate - free_throw_pts_rate
print(two_pts_rate)
# Three bars per player, side by side
plt.bar(np.arange(10), two_pts_rate, align='center', label='Share from two-pointers', color='red', width=bar_width)
plt.bar(np.arange(10) + bar_width, three_pts_rate, align='center', color='blue', label='Share from three-pointers', width=bar_width)
plt.bar(np.arange(10) + 2 * bar_width, free_throw_pts_rate, align='center', color='green', label='Share from free throws', width=bar_width)
# Write each value above its bar
for x, y in enumerate(two_pts_rate):
    plt.text(x, y + 0.01, '%s' % round(y, 2), ha='center')
for x, y in enumerate(three_pts_rate):
    plt.text(x + bar_width, y + 0.01, '%s' % round(y, 2), ha='center')
for x, y in enumerate(free_throw_pts_rate):
    plt.text(x + 2 * bar_width, y + 0.01, '%s' % round(y, 2), ha='center')
plt.legend()
plt.title('How each player gets his points')
plt.xlabel('Player')
plt.xticks(np.arange(10) + bar_width, super_name)
plt.ylabel('Share of points')
plt.show()

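As the chart shows, the largest shares of points from two-pointers, three-pointers, and free throws belong to Anthony and Kobe, Curry, and Harden respectively. That matches intuition: Anthony's signature mid-range jumper, Kobe's fadeaway, Curry's outrageous threes, and Harden's mastery at the line (he is not called the king of drawing contact for nothing). James's two-point share is also high, which has everything to do with his physical gifts. Durant is middle-of-the-road in all three while keeping his mid-range game, which again points to the variety and completeness of his offense.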
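The share calculation rests on the identity points = 2 x two-pointers made + 3 x three-pointers made + free throws made. The sketch below spells that out in a small helper, `point_shares`, which is a hypothetical name of mine. It assumes `hit` counts all made field goals, threes included, which is consistent with the first row of the data (one make, a three, for 3 points).

```python
def point_shares(hit, three_pts_hit, free_throw_hit):
    """Split a points total into two-pointer, three-pointer and free-throw shares."""
    two_pts_hit = hit - three_pts_hit            # made twos = all makes minus made threes
    total = 2 * two_pts_hit + 3 * three_pts_hit + free_throw_hit
    if total == 0:                               # scoreless line: avoid dividing by zero
        return {"two": 0.0, "three": 0.0, "free_throw": 0.0}
    return {
        "two": 2 * two_pts_hit / total,
        "three": 3 * three_pts_hit / total,
        "free_throw": free_throw_hit / total,
    }


# Invented line: 10 made field goals (3 of them threes) and 5 free throws = 28 points.
shares = point_shares(hit=10, three_pts_hit=3, free_throw_hit=5)
print(shares)
```

Computing the two-point share directly, instead of as 1 minus the other two, also acts as a sanity check on the data: if the three shares do not add up to 1 against the recorded score, some column is off.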
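4.4 Defensive data
A star's ability does not show only on offense; defense is an important measure too. Greats such as Jordan, Kobe, and James were all regulars on the All-Defensive Team, so here are the numbers for both ends of the floor.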
import seaborn as sns
import numpy as np

player_adavance = pd.read_csv('super_star_advance_data.csv')
player_labels = ['Rebound %', 'Assist %', 'Steal %', 'Block %', 'Turnover %', 'Usage %', 'Win shares', 'PER']
# Per-player mean of each advanced metric
player_data = player_adavance[['player', 'total_rebound_rate', 'assist_rate', 'steals_rate', 'cap_rate', 'error_rate',
                               'usage_rate', 'ws', 'per']].groupby('player').mean()
# Scale the six rate columns to percentages; leave win shares and PER unchanged
num = [100, 100, 100, 100, 100, 100, 1, 1]
np_num = np.array(player_data) * np.array(num)
plt.title('Heatmap of ability at both ends of the floor')
sns.heatmap(np_num, annot=True, xticklabels=player_labels, yticklabels=super_name, cmap='YlGnBu')
plt.show()

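On rebounding, the small forwards are all about the same, with rebound rates around 11. Among the guards the best rebounder is Westbrook, who after all averaged a triple-double last season (only the second player in history to do so). The highest assist rate is Paul's, of course, followed by Westbrook and James. Durant's assist rate is fairly ordinary, though decent for a small forward. In steal rate Paul and Leonard have a clear edge, which shows the effect of Leonard's suffocating defense. The highest block rate is Durant's, where his length shows up clearly; this season his shot blocking has gone up another level and ranks in the league's top five (Durant the center, haha). Turnover rates are higher for guards than for forwards, and the highest is Westbrook's. Westbrook also has the highest usage rate, followed by James, so both control the ball a great deal; Leonard is only at 22 (Coach Popovich's team-basketball system really keeps things in check). James leads in win shares, since his teams are built around him and he is now carrying one on his back; Paul is second, as the brain of his team; Durant is third, which fits his reputation as a cold-blooded scorer. The top three in efficiency rating are the same as in win shares. LeBron really is that good; you have to give him that.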
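One thing to keep in mind when reading the heatmap: all eight metrics share one color scale, so columns with large values (usage, PER) look dark and columns with small values (steal and block rate) look uniformly pale, whatever the differences between players. Scaling each column to the range 0 to 1 before coloring is one way around that. The sketch below does the scaling on invented numbers for three hypothetical players; it is a suggestion, not how the figure above was produced.

```python
import pandas as pd

# Invented metrics on very different scales.
metrics = pd.DataFrame(
    {
        "usage_rate": [30.0, 22.0, 34.0],
        "cap_rate": [2.5, 1.5, 0.5],
        "per": [25.0, 23.0, 24.0],
    },
    index=["Player A", "Player B", "Player C"],
)

# Min-max scaling per column: the lowest value maps to 0, the highest to 1.
col_min = metrics.min()
col_range = metrics.max() - col_min
scaled = (metrics - col_min) / col_range

print(scaled.round(2))
```

With seaborn, `scaled` could then drive the colors while the annotations still show the raw values, for example `sns.heatmap(scaled, annot=metrics, fmt='.1f')`; a column in which every player has the same value would need special handling, since its range is zero.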
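5 Summary
The data leads to the following conclusions. On offense, Durant is the strongest of the group, which shows in his high scoring, his consistency, his varied ways of scoring, and his efficiency. On defense, he is a good shot blocker. As a playmaker who ties the team together, he is still clearly behind James. His defense has been getting better and better over the last two years, and I hope he makes the All-Defensive Team this season. All of this is close to what I already thought of Durant from watching him, so you could say the data confirms the subjective impression. That is it for the playoff analysis. It was also a systematic run through pandas, numpy, seaborn, and matplotlib, and a warm-up for the new semester. I am not going to analyze the regular-season data for now; I will get to it whenever the mood strikes.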
