A Data-Based Analysis of Kevin Durant's Technical Characteristics (with a Python Implementation Walkthrough)



Note: this post is original work; please credit the source when reposting.

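An aside: after the Spring Festival I came back to school with nothing to do and felt rusty and unmotivated; call it "post-Spring-Festival syndrome". I had picked up detailed NBA data from the Kesci website, and I also wanted to systematically review my half year of studying data mining, so I decided to write a data analysis report on Kevin Durant's technical characteristics (I am a Durant fan). Learning while having fun, so to speak. Enough talk, let's get to work.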

1 Durant vs. Who?

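Describing Durant's characteristics requires a comparison; otherwise there is no way to tell what sets him apart. I chose the comparison players on three criteria. First, position: small forwards and guards, since Durant is a small forward who also plays some guard. Second, roughly the same era, give or take a few years (Kobe, for example). Third, players who can be called superstars. The final list is Kobe Bryant, LeBron James, Stephen Curry, Russell Westbrook, Paul George, Carmelo Anthony, James Harden, Chris Paul, and Kawhi Leonard. I leave out rising stars and older generations: statistics mean different things in different eras, and rising stars have too little data for a comparison to be worthwhile. The selection is not perfect; it is my own subjective choice (haha...).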

2 Data

Data source: https://www.kesci.com/apps/home/dataset/599a6e66c8d2787da4d1e21d/document

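3 The first cut at the data

The playoffs are the best stage for superstars. They have given us countless classic moments, and the moments we still talk about are the ones in which they earned their honors. So I will start the analysis with the playoffs (yes, I am being willful).

3.1 First, let's see what the playoff data contains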

>>> import pandas as pd
>>> data_player_playoff = pd.read_csv(r'E:\Python\Program\NBA_Data\data\player_playoff.csv')
>>> data_player_playoff.head()
                球員     賽季   球隊 結果             比分  時間     投籃  命中  出手   三分 ...  \
0  Kelenna Azubuike  11-12  DAL  L    OKC95-79DAL   5  0.333   1   3  1.0 ...   
1  Kelenna Azubuike  06-07  GSW  L  UTA115-101GSW   1    NaN   0   0  NaN ...   
2  Kelenna Azubuike  06-07  GSW  W  UTA105-125GSW   3  0.000   0   1  NaN ...   
3  Kelenna Azubuike  06-07  GSW  W   DAL86-111GSW   2  1.000   1   1  NaN ...   
4  Kelenna Azubuike  06-07  GSW  L  DAL118-112GSW   0    NaN   0   0  NaN ...   

   罰球出手  籃板  前場  后場  助攻  搶斷  蓋帽  失誤  犯規  得分  
0     0   1   1   0   0   1   0   1   0   3  
1     0   0   0   0   0   0   0   0   0   0  
2     0   0   0   0   0   0   0   0   1   0  
3     0   0   0   0   0   0   0   0   0   2  
4     0   0   0   0   0   0   0   0   0   0  

[5 rows x 24 columns]

DataFrame.head(n) returns the first n rows of the data (5 by default), and DataFrame.tail(n) returns the last n rows.
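As a minimal sketch on a made-up frame (the column names and values below are illustrative and not taken from the dataset), head and tail behave like this:

```python
import pandas as pd

# Toy frame standing in for the playoff table (values are made up).
toy = pd.DataFrame({"player": list("abcdefg"), "points": [3, 0, 0, 2, 0, 7, 5]})

first_rows = toy.head()     # default n=5: rows 0..4
first_two = toy.head(2)     # rows 0..1
last_three = toy.tail(3)    # rows 4..6

print(len(first_rows), len(first_two), list(last_three["points"]))
```

Both are methods of the DataFrame itself, not functions of the pd module.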

3.2 Basic information about the data

>>> data_player_playoff.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 49743 entries, 0 to 49742
Data columns (total 24 columns):
球員      49615 non-null object
賽季      49743 non-null object
球隊      49743 non-null object
結果      49743 non-null object
比分      49743 non-null object
時間      49743 non-null int64
投籃      45767 non-null float64
命中      49743 non-null int64
出手      49743 non-null int64
三分      24748 non-null float64
三分命中    49743 non-null int64
三分出手    49743 non-null int64
罰球      29751 non-null float64
罰球命中    49743 non-null int64
罰球出手    49743 non-null int64
籃板      49743 non-null int64
前場      49743 non-null int64
后場      49743 non-null int64
助攻      49743 non-null int64
搶斷      49743 non-null int64
蓋帽      49743 non-null int64
失誤      49743 non-null int64
犯規      49743 non-null int64
得分      49743 non-null int64
dtypes: float64(3), int64(16), object(5)
memory usage: 9.1+ MB

3.3 The Chinese column names make later processing awkward, so rename the columns

>>> data_player_playoff.columns = ['player','season','team','result','team_score','time','shoot','hit','shot','three_pts','three_pts_hit','three_pts_shot','free_throw','free_throw_hit','free_throw_shot','backboard','front_court','back_court','assists','steals','block_shot','errors','foul','player_score']
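Assigning a full list relies on the 24 names being in exactly the right order. A possible alternative, sketched here on a hypothetical three-column slice of the table, is an explicit old-to-new mapping passed to DataFrame.rename; the toy values are made up.

```python
import pandas as pd

# Hypothetical three-column slice of the table; the real file has 24 columns.
toy = pd.DataFrame({"球員": ["Kevin Durant"], "賽季": ["16-17"], "得分": [38]})

# Mapping old -> new names; only a subset is shown here.
column_map = {"球員": "player", "賽季": "season", "得分": "player_score"}
renamed = toy.rename(columns=column_map)

print(list(renamed.columns))
```

With a mapping, a column that is missing or reordered in the file cannot silently receive the wrong name.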

3.4 Select the data for Durant, Kobe, James, Curry, Westbrook, George, Anthony, Harden, Paul, and Leonard

>>> kd_data_off = data_player_playoff[data_player_playoff.player == 'Kevin Durant']
>>> jh_data_off = data_player_playoff[data_player_playoff.player == 'James Harden']
>>> kb_data_off = data_player_playoff[data_player_playoff.player == 'Kobe Bryant']
>>> lj_data_off = data_player_playoff[data_player_playoff.player == 'LeBron James']
>>> kl_data_off = data_player_playoff[data_player_playoff.player == 'Kawhi Leonard']
>>> sc_data_off = data_player_playoff[data_player_playoff.player == 'Stephen Curry']
>>> rw_data_off = data_player_playoff[data_player_playoff.player == 'Russell Westbrook']
>>> pg_data_off = data_player_playoff[data_player_playoff.player == 'Paul George']
>>> ca_data_off = data_player_playoff[data_player_playoff.player == 'Carmelo Anthony']
>>> cp_data_off = data_player_playoff[data_player_playoff.player == 'Chris Paul']
>>> super_data_off = pd.concat([kd_data_off, kb_data_off, jh_data_off, lj_data_off, sc_data_off, kl_data_off, cp_data_off, rw_data_off, pg_data_off, ca_data_off])
>>> super_data_off .info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1087 entries, 9721 to 904
Data columns (total 24 columns):
player             1087 non-null object
season             1087 non-null object
team               1087 non-null object
result             1087 non-null object
team_score         1087 non-null object
time               1087 non-null int64
shoot              1085 non-null float64
hit                1087 non-null int64
shot               1087 non-null int64
three_pts          1059 non-null float64
three_pts_hit      1087 non-null int64
three_pts_shot     1087 non-null int64
free_throw         1015 non-null float64
free_throw_hit     1087 non-null int64
free_throw_shot    1087 non-null int64
backboard          1087 non-null int64
front_court        1087 non-null int64
back_court         1087 non-null int64
assists            1087 non-null int64
steals             1087 non-null int64
block_shot         1087 non-null int64
errors             1087 non-null int64
foul               1087 non-null int64
player_score       1087 non-null int64
dtypes: float64(3), int64(16), object(5)
memory usage: 212.3+ KB
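The ten separate filters followed by a concat could also be written as a single boolean mask with Series.isin. The sketch below uses a made-up four-row table and only two of the ten names; note that it keeps the rows in file order rather than in the concat order used above.

```python
import pandas as pd

# Made-up rows standing in for the renamed playoff table.
toy = pd.DataFrame({
    "player": ["Kevin Durant", "Kelenna Azubuike", "Kobe Bryant", "Kevin Durant"],
    "player_score": [30, 3, 25, 28],
})
stars = ["Kevin Durant", "Kobe Bryant"]

# One boolean mask instead of one filter per player followed by concat.
selected = toy[toy["player"].isin(stars)]
print(selected["player"].value_counts().to_dict())
```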

3.5 Save the ten players' data to a separate file

>>> super_data_off .to_csv('super_star_playoff.csv',index = False )

4 Data analysis

4.1 First, how many playoff games each has played

>>> super_data_off.player.value_counts()
Kobe Bryant          220
LeBron James         217
Kevin Durant         106
James Harden          88
Kawhi Leonard         87
Russell Westbrook     87
Chris Paul            76
Stephen Curry         75
Carmelo Anthony       66
Paul George           65
Name: player, dtype: int64

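This shows how dominant James's yearly Finals runs are: he trails Kobe by only three games and will pass him this year, with probably several more Finals trips to come. Durant is well behind James in games played; my guess is that he will finish with about as many as Kobe.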

4.2 Plain and simple: look straight at their playoff scoring

>>> super_data_off.groupby('player').player_score.describe()
                   count       mean        std   min   25%   50%    75%   max
player                                                                       
Carmelo Anthony     66.0  25.651515   8.471658   2.0  21.0  25.0  31.00  42.0
Chris Paul          76.0  21.434211   7.691269   4.0  16.0  21.5  27.00  35.0
James Harden        88.0  20.681818  10.485398   0.0  13.0  19.0  28.00  45.0
Kawhi Leonard       87.0  16.459770   8.428640   2.0  11.0  16.0  21.00  43.0
Kevin Durant       106.0  28.754717   6.979987  10.0  25.0  29.0  33.75  41.0
Kobe Bryant        220.0  25.636364   9.856715   0.0  20.0  26.0  32.00  50.0
LeBron James       217.0  28.400922   7.826865   7.0  23.0  28.0  33.00  49.0
Paul George         65.0  18.984615   9.299685   2.0  12.0  19.0  26.00  39.0
Russell Westbrook   87.0  25.275862   8.187753   7.0  19.0  26.0  30.00  51.0
Stephen Curry       75.0  26.200000   8.109054   6.0  21.5  26.0  32.50  44.0

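This shows that Durant is an elite scorer, and it already hints at how rock-steady he is: he has the highest mean and the lowest standard deviation of the ten.

Here comes the bar chart of scoring averages, hold tight.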

#coding:utf-8

import matplotlib.pyplot as plt
import pandas as pd
# Work around garbled Chinese characters in the figure
import matplotlib as mpl
mpl.rcParams['font.sans-serif'] = ['FangSong'] # default font that has Chinese glyphs
mpl.rcParams['axes.unicode_minus'] = False # keep the minus sign '-' from rendering as a box in saved figures

super_data_off = pd.read_csv('super_star_playoff.csv')
kd_off_score = super_data_off[super_data_off.player == 'Kevin Durant'].player_score.describe()
# Select the column before aggregating so the text columns are not averaged
super_off_mean_score = super_data_off.groupby('player')['player_score'].mean()
labels = [u'場數',u'均分',u'標准差',u'最小值','25%','50%','75%',u'最大值']
print(super_off_mean_score.index)
# Chinese names in the same alphabetical order as the groupby index printed above
super_name = [u'安東尼',u'保羅',u'哈登',u'倫納德',u'杜蘭特',u'科比',u'詹姆斯',u'喬治',u'威少',u'庫里']
# Plot
plt.bar(range(len(super_off_mean_score )),super_off_mean_score ,align = 'center')

plt.ylabel(u'得分')
plt.title(u'巨星季后賽得分數據對比')
#plt.xticks(range(len(labels)),labels)
plt.xticks(range(len(super_off_mean_score )),super_name)
plt.ylim(15,35)
for x,y in enumerate (super_off_mean_score ):
    plt.text (x, y+1, '%s' % round(y, 2) , ha = 'center')
plt.show()

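In terms of scoring, Durant and James form one tier; Anthony, Kobe, Westbrook, and Curry a second; and Paul, Harden, Leonard, and George a third. Harden's number should rise noticeably this year, since his playoff career began in a sixth-man role. Durant's four scoring titles were not handed to him: as a scorer he really is one of the league's megastars.

Next, the trend of each star's playoff scoring average by season.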

season_kd_score = super_data_off[super_data_off.player == 'Kevin Durant'].groupby('season')['player_score'].mean()
plt.figure()
plt.subplot(321)
plt.title(u'杜蘭特季后賽平均得分',color = 'red')
#plt.xlabel(u'賽季')
plt.ylabel(u'得分')
plt.plot(season_kd_score,'k',season_kd_score,'bo')
for x,y in enumerate (season_kd_score ):
    plt.text (x,y+0.2, '%s' % round(y,2), ha = 'center')


season_lj_score = super_data_off[super_data_off.player == 'LeBron James'].groupby('season')['player_score'].mean()
plt.subplot(322)
plt.title(u'詹姆斯季后賽平均得分',color = 'red')
#plt.xlabel(u'賽季')
plt.ylabel(u'得分')
plt.plot(season_lj_score ,'k',season_lj_score ,'bo')
for x,y in enumerate (season_lj_score ):
    plt.text (x,y+0.2, '%s' % round(y,2), ha = 'center')

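season_kb_score = super_data_off[super_data_off.player == 'Kobe Bryant'].groupby('season')['player_score'].mean()
# Season labels sort as strings, so the four 1990s seasons ('96-97' ... '99-00')
# land after the 2000s ones; move those last four entries to the front.
a = season_kb_score[0:-4]
b = season_kb_score[-4:]
season_kb_score = pd.concat([b,a])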
plt.subplot(323)
plt.title(u'科比賽季后賽平均得分',color = 'red')
#plt.xlabel(u'賽季')
plt.ylabel(u'得分')
plt.xticks(range(len(season_kb_score )),season_kb_score.index)
plt.plot(list(season_kb_score) ,'k',list(season_kb_score),'bo')
for x,y in enumerate (season_kb_score ):
    plt.text (x,y+0.2, '%s' % round(y,2), ha = 'center')

season_rw_score = super_data_off[super_data_off.player == 'Russell Westbrook'].groupby('season')['player_score'].mean()
plt.subplot(324)
plt.title(u'威少賽季后賽平均得分',color = 'red')
#plt.xlabel(u'賽季')
plt.ylabel(u'得分')
plt.plot(season_rw_score ,'k',season_rw_score ,'bo')
for x,y in enumerate (season_rw_score ):
    plt.text (x,y+0.2, '%s' % round(y,2), ha = 'center')

season_sc_score = super_data_off[super_data_off.player == 'Stephen Curry'].groupby('season')['player_score'].mean()
plt.subplot(325)
plt.title(u'庫里賽季后賽平均得分',color = 'red')
#plt.xlabel(u'賽季')
plt.ylabel(u'得分')
plt.plot(season_sc_score ,'k',season_sc_score ,'bo')
for x,y in enumerate (season_sc_score ):
    plt.text (x,y+0.2, '%s' % round(y,2), ha = 'center')

season_ca_score = super_data_off[super_data_off.player == 'Carmelo Anthony'].groupby('season')['player_score'].mean()
plt.subplot(326)
plt.title(u'安東尼賽季后賽平均得分',color = 'red')
#plt.xlabel(u'賽季')
plt.ylabel(u'得分')
plt.plot(season_ca_score ,'k',season_ca_score ,'bo')
for x,y in enumerate (season_ca_score ):
    plt.text (x,y+0.2, '%s' % round(y,2), ha = 'center')

plt.show()
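The Kobe block above reorders its seasons by slicing off the last four index entries by hand, which only works for a player with exactly four pre-2000 seasons. A more general option might be to sort the season labels by their starting year. This is only a sketch: the helper name season_start_year and the cutoff of 50 for deciding the century are assumptions, and the averages are made up.

```python
import pandas as pd

def season_start_year(season):
    """Map a label like '96-97' or '07-08' to its starting year.

    Assumption: two-digit years of 50 or more belong to the 1900s.
    """
    yy = int(season.split("-")[0])
    return 1900 + yy if yy >= 50 else 2000 + yy

# Made-up per-season averages, indexed the way a string sort would order them.
means = pd.Series([21.0, 30.1, 8.2, 19.5], index=["00-01", "07-08", "96-97", "99-00"])

# Reorder the Series chronologically instead of alphabetically.
ordered = means.loc[sorted(means.index, key=season_start_year)]
print(list(ordered.index))
```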

Next, pie charts to look at the distribution of their single-game scores.

super_name_E = ['Kevin Durant','LeBron James','Kobe Bryant','Russell Westbrook','Stephen Curry','Carmelo Anthony']
super_name_C = [u'杜蘭特',u'詹姆斯',u'科比',u'威少',u'庫里',u'安東尼']
plt.figure(facecolor= 'bisque')
colors = ['red', 'yellow', 'peru', 'springgreen']
for i in range(len(super_name_E)):
    player_labels = [u'20分以下',u'20~29分',u'30~39分',u'40分以上']
    explode = [0,0.1,0,0] # pull out the 20~29 slice
    player_off_score_range = super_data_off[super_data_off.player == super_name_E[i]]
    score = player_off_score_range.player_score
    n_games = float(len(player_off_score_range))
    # Share of this player's games that fall in each scoring range
    player_score_range = [
        (score < 20).sum() / n_games,
        ((score >= 20) & (score < 30)).sum() / n_games,
        ((score >= 30) & (score < 40)).sum() / n_games,
        (score >= 40).sum() / n_games,
    ]
    plt.subplot(231 + i)
    plt.title(super_name_C [i] + u'得分分布', color='blue')
    plt.pie(player_score_range, labels=player_labels, colors=colors, labeldistance=1.1,
            autopct='%.01f%%', shadow=False, startangle=90, pctdistance=0.8, explode=explode)
    plt.axis('equal')
plt.show()
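The range counting in the loop above can also be expressed with pandas.cut, which assigns each score to a labelled bin. The following is a minimal sketch on made-up scores; the English bin labels are placeholders, not the labels used in the figure.

```python
import pandas as pd

# Made-up single-game scores for one player.
scores = pd.Series([12, 19, 20, 29, 30, 39, 40, 51])

# Left-closed bins: [0, 20), [20, 30), [30, 40), [40, inf)
bins = [0, 20, 30, 40, float("inf")]
labels = ["<20", "20-29", "30-39", "40+"]
binned = pd.cut(scores, bins=bins, labels=labels, right=False)

# Fraction of games in each bin, in label order (ready to pass to plt.pie).
shares = {label: float((binned == label).mean()) for label in labels}
print(shares)
```

Passing right=False makes each bin include its lower edge, so a 20-point game lands in "20-29" and a 40-point game in "40+", matching the ranges in the pie labels.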

 

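The pie charts show that Durant and James are far ahead of the rest in scoring consistency: their games fall mainly between 20 and 40 points, which accounts for roughly 80% of all their games. They not only score a lot but do so very consistently. The highest share of 40+ games belongs to James, followed by Curry and Durant. Indirectly, this suggests that Durant is the steadiest scorer of the group. Rock-steady indeed! To look at consistency numerically, here is a bar chart of the standard deviation of their scores: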

std = super_data_off.groupby('player')['player_score'].std()
# Durant (index 4 in alphabetical order) is drawn in blue, everyone else in red
color = ['red','red','red','red','blue','red','red','red','red','red',]
print(std)
plt.barh(range(10), std, align = 'center',color = color ,alpha = 0.8)
plt.xlabel(u'標准差',color = 'blue')
plt.ylabel(u'球員', color = 'blue')
plt.yticks(range(len(super_name )),super_name)
plt.xlim(6,11)
for x,y in enumerate (std):
    plt.text(y + 0.1, x, '%s' % round(y,2), va = 'center')
plt.show()

The bar chart of standard deviations makes Durant's consistency clear (the smaller the standard deviation, the steadier the scoring).
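To illustrate why the standard deviation is a reasonable proxy for consistency, here is a small sketch with two made-up scoring logs that share the same mean but differ in spread. The numbers are invented for the example and do not come from the dataset.

```python
import pandas as pd

# Two made-up scoring logs with the same mean (28) but different spread.
steady = pd.Series([26, 28, 30, 27, 29])
streaky = pd.Series([10, 45, 28, 40, 17])

# pandas uses the sample standard deviation (ddof=1) by default.
print(steady.mean(), streaky.mean())
print(round(steady.std(), 2), round(streaky.std(), 2))
```

Both players average 28 points, so the mean alone cannot tell them apart; the standard deviation is what separates them.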

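4.3 Shooting style and efficiency

When evaluating a player, where he shoots from and how efficiently he shoots are important indicators. Players can be grouped into sharpshooters, three-point specialists, mid-range masters, and players who attack the paint (good drivers); there are also experts at drawing fouls, such as Harden.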

super_name_E = [u'Carmelo Anthony', u'Chris Paul', u'James Harden', u'Kawhi Leonard', u'Kevin Durant', u'Kobe Bryant',
                u'LeBron James', u'Paul George', u'Russell Westbrook', u'Stephen Curry']
bar_width = 0.25
import numpy as np
# Mean of the per-game percentages (games with no attempts are NaN and are skipped)
shoot = super_data_off.groupby('player')['shoot'].mean()
three_pts = super_data_off.groupby('player')['three_pts'].mean()
free_throw = super_data_off.groupby('player')['free_throw'].mean()
plt.bar(np.arange(10),shoot,align = 'center',label = u'投籃命中率',color = 'red',width = bar_width )
plt.bar(np.arange(10)+ bar_width, three_pts ,align = 'center',color = 'blue',label = u'三分命中率',width = bar_width )
plt.bar(np.arange(10)+ 2*bar_width, free_throw  ,align = 'center',color = 'green',label = u'罰球命中率',width = bar_width )
for x,y in enumerate (shoot):
    plt.text(x, y+0.01, '%s' % round(y,2), ha = 'center')
for x,y in enumerate (three_pts ):
    plt.text(x+bar_width , y+0.01, '%s' % round(y,2), ha = 'center')
for x,y in enumerate (free_throw):
    plt.text(x+2*bar_width , y+0.01, '%s' % round(y,2), ha = 'center')
plt.legend ()
plt.ylim(0.3,1.0)
plt.title(u'球員的命中率的對比')
plt.xlabel(u'球員')
plt.xticks(np.arange(10)+bar_width  ,super_name)
plt.ylabel(u'命中率')
plt.show()

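The leaders in field-goal percentage, three-point percentage, and free-throw percentage are Leonard, Curry, and Curry, respectively, which shows how formidable Curry's three-point shooting is. Durant ranks third in all three, which speaks to how well-rounded his scoring is.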
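One caveat worth noting: the percentages plotted above are means of the per-game percentages, which weight a 1-for-2 game the same as a 9-for-20 game. The usual career percentage is total makes divided by total attempts. The sketch below, on a made-up two-game log for a hypothetical player "A", shows that the two can differ; it illustrates the distinction and is not a correction of the figures above.

```python
import pandas as pd

# Made-up two-game log for a hypothetical player "A".
toy = pd.DataFrame({
    "player": ["A", "A"],
    "hit": [1, 9],     # made field goals
    "shot": [2, 20],   # attempted field goals
})
toy["shoot"] = toy["hit"] / toy["shot"]   # per-game percentage: 0.50 and 0.45

# Mean of per-game percentages (what the plot above uses).
per_game_mean = toy.groupby("player")["shoot"].mean()

# Aggregate percentage: total makes over total attempts.
totals = toy.groupby("player")[["hit", "shot"]].sum()
aggregate = totals["hit"] / totals["shot"]

print(round(per_game_mean["A"], 3), round(aggregate["A"], 3))
```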

super_name_E = [u'Carmelo Anthony', u'Chris Paul', u'James Harden', u'Kawhi Leonard', u'Kevin Durant', u'Kobe Bryant',
                u'LeBron James', u'Paul George', u'Russell Westbrook', u'Stephen Curry']
bar_width = 0.25
import numpy as np
three_pts = super_data_off.groupby('player').sum()['three_pts_hit']
free_throw_pts = super_data_off.groupby('player').sum()['free_throw_hit']
sum_pts = super_data_off.groupby('player').sum()['player_score']
three_pts_rate = np.array(list(three_pts ))*3.0 /np.array(list(sum_pts ))
free_throw_pts_rate = np.array(list(free_throw_pts ))*1.0/np.array(list(sum_pts ))
two_pts_rate = 1.0 - three_pts_rate - free_throw_pts_rate
print(two_pts_rate)
plt.bar(np.arange(10),two_pts_rate ,align = 'center',label = u'二分球得分占比',color = 'red',width = bar_width )
plt.bar(np.arange(10)+ bar_width, three_pts_rate ,align = 'center',color = 'blue',label = u'三分球得分占比',width = bar_width )
plt.bar(np.arange(10)+ 2*bar_width, free_throw_pts_rate   ,align = 'center',color = 'green',label = u'罰球得分占比',width = bar_width )
for x,y in enumerate (two_pts_rate):
    plt.text(x, y+0.01, '%s' % round(y,2), ha = 'center')
for x,y in enumerate (three_pts_rate ):
    plt.text(x+bar_width , y+0.01, '%s' % round(y,2), ha = 'center')
for x,y in enumerate (free_throw_pts_rate):
    plt.text(x+2*bar_width , y+0.01, '%s' % round(y,2), ha = 'center')
plt.legend ()
plt.title(u'球員的得分方式的對比')
plt.xlabel(u'球員')
plt.xticks(np.arange(10)+bar_width  ,super_name)
plt.ylabel(u'占比率')
plt.show()

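The highest shares of points from two-pointers, three-pointers, and free throws belong to Anthony and Kobe, Curry, and Harden, respectively. This matches intuition: Anthony's signature mid-range jumper, Kobe's fadeaway, Curry's outrageous threes, and Harden's mastery at the free-throw line (he is not called the king of drawing fouls for nothing). James's two-point share is also high, which has everything to do with his physical gifts. Durant is middle-of-the-pack in all three while retaining his mid-range game, which again reflects the variety and all-around nature of his offense.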
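The shares above follow from simple arithmetic: each made three is worth 3 points, each made free throw 1 point, and whatever remains of the total came from two-pointers. A small worked sketch with made-up totals (the function name scoring_shares is hypothetical):

```python
def scoring_shares(points, threes_made, free_throws_made):
    """Split total points into shares from twos, threes, and free throws."""
    three_share = 3.0 * threes_made / points
    ft_share = 1.0 * free_throws_made / points
    two_share = 1.0 - three_share - ft_share   # the remainder came from two-pointers
    return two_share, three_share, ft_share

# Made-up totals: 100 points from 10 threes and 20 free throws.
two, three, ft = scoring_shares(100, 10, 20)
print(round(two, 2), round(three, 2), round(ft, 2))
```

Here 30 points come from threes and 20 from free throws, leaving 50 from two-pointers.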

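4.4 Defensive data

A star's ability shows not only on offense; defense is an important indicator as well. Jordan, Kobe, and James were all regulars on the All-Defensive teams, so here are the players' numbers at both ends of the floor.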

import seaborn as sns
import numpy as np
player_adavance = pd.read_csv('super_star_advance_data.csv')
player_labels = [ u'籃板率', u'助攻率', u'搶斷率', u'蓋帽率',u'失誤率', u'使用率', u'勝利貢獻值',  u'霍格林效率值']
player_data = player_adavance[['player','total_rebound_rate','assist_rate','steals_rate','cap_rate','error_rate',
                             'usage_rate','ws','per']] .groupby('player').mean()
num = [100,100,100,100,100,100,1,1]
np_num = np.array(player_data)*np.array(num)
plt.title(u'球員攻防兩端的能力熱力圖')
sns.heatmap(np_num , annot=True,xticklabels= player_labels ,yticklabels=super_name  ,cmap='YlGnBu')
plt.show()
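The multiplication by num in the code above relies on NumPy broadcasting: a one-dimensional array with one factor per column is applied to every row, turning the six rate columns into percentages while leaving the last two columns unchanged. A minimal sketch with made-up numbers and only three columns:

```python
import numpy as np

# Two made-up players x three metrics: two rates stored as fractions, one raw score.
values = np.array([[0.11, 0.20, 25.0],
                   [0.07, 0.45, 24.0]])
scale = np.array([100, 100, 1])   # percent for the rates, unchanged for the score

scaled = values * scale            # scale is broadcast across every row
print(scaled)
```

Putting the columns on comparable scales matters for the heatmap, because a single colormap is shared by all cells.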

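The small forwards have similar rebound rates, all around 11, while the best rebounder among the guards is Westbrook, who after all averaged a triple-double last season (only the second player in history to do so). The highest assist rate naturally belongs to Paul, followed by Westbrook and James; Durant's assist rate is fairly ordinary, though decent for a small forward. Paul and Leonard have a clear edge in steal rate, which reflects Leonard's suffocating on-ball defense. Durant has the highest block rate, where his physical advantages show clearly; this season his shot-blocking has gone up another level and ranks in the league's top five ("Durant the center", haha). Guards have higher turnover rates than forwards, and Westbrook's is the highest.

Westbrook also has the highest usage rate, followed by James, so both handle the ball a great deal; Leonard's is only 22 (Coach Popovich's grip on team basketball really is strong). James has the highest win shares, as his teams have always been built around him and he is now carrying the team by himself; Paul is second, being the brain of his team; Durant is third, which fits his reputation as a killer. The top three in PER are the same as in win shares. James really is that good; you have to give him credit.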

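5 Summary

The data supports the following conclusions. On offense, Durant is the strongest of the group, as shown by his high scoring, strong consistency, varied scoring methods, and high scoring efficiency. On defense, Durant is good at blocking shots, but as a playmaker who ties a team together he still trails James clearly. His defense has kept improving over the past two years, and I hope he makes an All-Defensive team this season. These numbers are close to the everyday impression of Durant, so the data can be said to confirm the subjective view. That wraps up the playoff analysis; it also served as a systematic review of the pandas, numpy, seaborn, and matplotlib modules and as a warm-up for the new semester. I will not analyze the regular-season data for now; maybe later, when the mood strikes.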
