Kaggle competition write-up: NFL Big Data Bowl (Part 1)

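Competition overview

In this year's edition of the annual NFL Big Data Bowl, the goal is to predict how many yards a play will gain from a static snapshot of the players on both teams, and to express that prediction as a probability distribution;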

Competition link

https://www.kaggle.com/c/nfl-big-data-bowl-2020

Notebook link. The code is public, so you can copy it and run it directly

https://www.kaggle.com/holoong9291/nfl-big-data-bowl

GitHub repository link. It has more of my notes on the thinking behind the work and the problems I ran into along the way

https://github.com/NemoHoHaloAi/Competition/tree/master/kaggle/Top61%-0.01404-zzz-NFL-Big-Data-Bowl

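Some basic concepts

American football: the offense tries to move the ball by running and passing until it reaches the opponent's end zone, which is a touchdown; the defense does the opposite, doing everything it can to stop the advance and, where possible, take the ball away;
Field: 120 yards long (109.728 m, including the two end zones) and 53 1/3 yards wide (48.768 m), for a perimeter of 316.992 m;
Players: 22 on the field in total, 11 on offense and 11 on defense; the offense has the ball;
Downs: the offense gets four attempts to advance at least ten yards;
Offense: its job is to use those four attempts to gain 10 yards or score a touchdown; gaining the 10 yards earns a fresh set of four attempts, otherwise possession is handed over;
Defense: the opposite, stopping the advance as far as possible; forcing a turnover is even better, since possession changes immediately;
handoff: the ball is handed directly to a runner (not a forward pass);
snap: the action that puts the ball in play at the start of each down;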
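QB: quarterback, usually the player who receives the ball after the snap and the centre of the pocket. There are also dual-threat QBs who both run and pass, such as Lamar Jackson; the archetypal classic pocket QB today is Tom Brady of the New England Patriots (NE);
RB: running back, who usually sprints and tries to shake off defenders after the snap, taking the ball from his own QB (by handoff or pass) and then running as far as possible;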

Yard-line diagram of the field

A typical pre-snap alignment

Data fields and plot analysis

row

Field information:

GameId - a unique game identifier
PlayId - a unique play identifier
Team - home or away
X - player position along the long axis of the field. See figure below.
Y - player position along the short axis of the field. See figure below.
S - speed in yards/second
A - acceleration in yards/second^2
Dis - distance traveled from prior time point, in yards
Orientation - orientation of player (deg), i.e. the direction the player is facing
Dir - angle of player motion (deg), i.e. the direction the player is moving
NflId - a unique identifier of the player
DisplayName - player's name
JerseyNumber - jersey number
Season - year of the season
YardLine - the yard line of the line of scrimmage
Quarter - game quarter (1-5, 5 == overtime)
GameClock - time on the game clock
PossessionTeam - team with possession
Down - the down (1-4), i.e. which of the four attempts this play is
Distance - yards needed for a first down
FieldPosition - which side of the field the play is happening on
HomeScoreBeforePlay - home team score before play started
VisitorScoreBeforePlay - visitor team score before play started
NflIdRusher - the NflId of the rushing player
OffenseFormation - offense formation
OffensePersonnel - offensive team positional grouping
DefendersInTheBox - number of defenders lined up near the line of scrimmage, spanning the width of the offensive line
DefensePersonnel - defensive team positional grouping
PlayDirection - direction the play is headed
TimeHandoff - UTC time of the handoff
TimeSnap - UTC time of the snap
Yards - the yardage gained on the play (you are predicting this); this is the target
PlayerHeight - player height (ft-in)
PlayerWeight - player weight (lbs)
PlayerBirthDate - birth date (mm/dd/yyyy), from which age can be derived
PlayerCollegeName - where the player attended college
Position - the player's position (the specific role on the field that they typically play)
HomeTeamAbbr - home team abbreviation
VisitorTeamAbbr - visitor team abbreviation
Week - week into the season
Stadium - stadium where the game is being played
Location - city where the game is being played
StadiumType - description of the stadium environment
Turf - description of the field surface
GameWeather - description of the game weather
Temperature - temperature (deg F)
Humidity - humidity
WindSpeed - wind speed in miles/hour
WindDirection - wind direction

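Problem definition

This is a regression problem whose target is the yards gained, but the final result has to be converted into a conditional probability distribution over the possible yard values;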

Evaluation Function

Continuous Ranked Probability Score (CRPS);
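As a rough illustration of the metric, here is a minimal sketch in plain Python. It assumes, as I understand the competition's setup, that each prediction is a cumulative distribution over the 199 integer yard values from -99 to 99, and that it is compared against a step function that jumps from 0 to 1 at the actual yardage. The function name crps and its parameters are my own, not part of any library.

```python
def crps(pred_cdfs, actual_yards, min_yard=-99, max_yard=99):
    """Mean squared gap between predicted CDFs and the true step functions.

    pred_cdfs:    list of lists, one CDF per play, P(Yards <= n) for each n
    actual_yards: list of the yards actually gained on each play
    """
    yard_values = range(min_yard, max_yard + 1)  # 199 values by default
    total = 0.0
    for cdf, actual in zip(pred_cdfs, actual_yards):
        for p, n in zip(cdf, yard_values):
            step = 1.0 if n >= actual else 0.0  # Heaviside step H(n - actual)
            total += (p - step) ** 2
    return total / (len(yard_values) * len(actual_yards))


# A prediction that puts all its mass exactly on the true value scores 0.
perfect = [1.0 if n >= 3 else 0.0 for n in range(-99, 100)]
perfect_score = crps([perfect], [3])
```

Lower is better: the score only reaches 0 when every predicted CDF is the exact step function of the outcome.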

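Project walkthrough

Defining a class that converts model output into a probability distribution

The competition does not ask for a single yardage value but for the probability distribution over yardage, that is, the probability of every possible number of yards on a play, so a conversion class is needed for this step.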
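The author's own conversion class is not reproduced here. The following is only a minimal sketch of one way such a conversion could look, assuming the output is a cumulative distribution over the yard values -99 to 99. The class name YardsToCdf and the sigma parameter are hypothetical: with sigma set to 0 the point prediction becomes a hard step, and with a positive sigma the probability is spread around the prediction with a Gaussian CDF.

```python
import math


class YardsToCdf:
    """Turn point predictions of yards into cumulative distributions (hypothetical helper)."""

    def __init__(self, min_yard=-99, max_yard=99, sigma=0.0):
        self.yard_values = list(range(min_yard, max_yard + 1))
        self.sigma = sigma  # assumed spread of the prediction error, in yards

    def transform_one(self, predicted_yards):
        if self.sigma <= 0:
            # Hard step: all probability mass sits on the predicted value.
            return [1.0 if n >= predicted_yards else 0.0 for n in self.yard_values]
        # Gaussian CDF centred on the prediction: P(Yards <= n).
        denom = self.sigma * math.sqrt(2.0)
        return [0.5 * (1.0 + math.erf((n - predicted_yards) / denom))
                for n in self.yard_values]

    def transform(self, predictions):
        return [self.transform_one(p) for p in predictions]


step_cdf = YardsToCdf().transform_one(3)
smooth_cdf = YardsToCdf(sigma=2.0).transform_one(3)
```

A smoothed distribution usually scores better than a hard step under a squared-error metric on the CDF, because a near miss is then penalized less; the right amount of smoothing would have to be tuned on validation data.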

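Handling missing values

In the training data the missing-value situation is not severe; the affected fields are covered one by one below.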
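A quick way to see which fields are affected is to count nulls per column. This is a minimal sketch with pandas; the helper name missing_summary and the tiny demo frame are made up for illustration, and the real input would be the competition's training file.

```python
import pandas as pd


def missing_summary(df):
    """Return the columns that have missing values, with counts and ratios."""
    counts = df.isnull().sum()
    counts = counts[counts > 0].sort_values(ascending=False)
    return pd.DataFrame({"missing": counts, "ratio": counts / len(df)})


# Made-up rows, only to show the shape of the output.
demo_missing = pd.DataFrame({
    "Temperature": [63.0, None, 70.0],
    "FieldPosition": ["NE", None, None],
    "Yards": [8, 3, 5],
})
summary = missing_summary(demo_missing)
print(summary)
```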

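Missing values are handled differently depending on the type of field:

Weather-related fields: weather is continuous over time, so forward fill is a reasonable choice;
StadiumType: strictly speaking the right fix is to look each stadium up through Baidu, Google and so on, but Baidu returns very little NFL information and I could not find it on Google either, so I fill with the most frequent value;
FieldPosition: this case differs from the two above. Analysis of the data shows it is missing when the ball is snapped at midfield, where neither half of the field can be named, so I fill it with a special value, "Middle";
OffenseFormation: the offensive formation, missing in only 5 rows, filled with the most frequent value;
DefendersInTheBox: the number of defenders near the line of scrimmage. From looking at the data, it can be filled in from the team, the opponent, and the defensive personnel;
Orientation (the angle the player is facing) and Dir (the angle the player is moving in): only one row is missing, and that player did take the field normally, so this looks like a technical recording gap; filling with the mean is enough;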
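The rules above can be sketched with pandas roughly as follows. Several details are assumptions of mine rather than the author's code: which columns count as weather-related, the use of a group median for DefendersInTheBox, and grouping only by PossessionTeam and DefensePersonnel (the opponent is left out because it would first have to be derived from the team columns). The function name fill_missing and the demo rows are made up.

```python
import pandas as pd


def fill_missing(df):
    """Apply the per-column fill rules described above; returns a new frame."""
    df = df.copy()
    # Weather-like columns (assumed set): carry the previous observation forward.
    # Note that a missing value in the very first row stays missing.
    weather_cols = ["GameWeather", "Temperature", "Humidity"]
    df[weather_cols] = df[weather_cols].ffill()
    # Categorical columns: use the most frequent value.
    for col in ["StadiumType", "OffenseFormation"]:
        df[col] = df[col].fillna(df[col].mode().iloc[0])
    # Plays snapped at midfield have no side of the field.
    df["FieldPosition"] = df["FieldPosition"].fillna("Middle")
    # DefendersInTheBox: median within similar plays, then the overall median.
    group_cols = ["PossessionTeam", "DefensePersonnel"]
    group_median = df.groupby(group_cols)["DefendersInTheBox"].transform("median")
    overall_median = df["DefendersInTheBox"].median()
    df["DefendersInTheBox"] = (df["DefendersInTheBox"]
                               .fillna(group_median)
                               .fillna(overall_median))
    # Angles: fill the rare gap with the column mean.
    for col in ["Orientation", "Dir"]:
        df[col] = df[col].fillna(df[col].mean())
    return df


# Made-up rows for illustration only.
demo_fill = pd.DataFrame({
    "GameWeather": ["Sunny", None, "Rain"],
    "Temperature": [60.0, None, 55.0],
    "Humidity": [50.0, None, 70.0],
    "StadiumType": ["Outdoor", "Outdoor", None],
    "OffenseFormation": ["SHOTGUN", None, "SHOTGUN"],
    "FieldPosition": ["NE", None, "KC"],
    "PossessionTeam": ["NE", "NE", "KC"],
    "DefensePersonnel": ["4 DL, 3 LB, 4 DB"] * 3,
    "DefendersInTheBox": [7.0, None, 8.0],
    "Orientation": [90.0, None, 270.0],
    "Dir": [100.0, 200.0, None],
})
filled = fill_missing(demo_fill)
```

One caveat on the angle columns: a plain mean of angles in degrees ignores wrap-around at 360, which is acceptable for a single missing row but would not be a good choice if many values were missing.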

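Handling anomalies, duplicates, and similar issues

StadiumType: different names are used for the same meaning; these need to be tidied up and normalized so they do not mislead the model;
PossessionTeam is sometimes neither HomeTeamAbbr nor VisitorTeamAbbr; this happens in 120 games;
Turf: the field-surface values need cleaning up;
Location: this field also has different values with the same meaning, which need to be normalized;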
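A minimal sketch of this kind of cleanup with pandas is shown below. Both mapping tables are hypothetical examples: in practice they would be built by inspecting the distinct values of each column. The team mapping assumes that the mismatch comes from the columns using different abbreviations for the same team, which should be confirmed by looking at the mismatching rows. The helper names are my own.

```python
import pandas as pd

# Hypothetical alias tables, for illustration only.
STADIUM_TYPE_MAP = {"outdoors": "outdoor", "oudoor": "outdoor", "outside": "outdoor",
                    "indoors": "indoor", "dome": "indoor"}
TEAM_ABBR_MAP = {"ARZ": "ARI", "BLT": "BAL", "CLV": "CLE", "HST": "HOU"}


def normalize_text(series, mapping):
    """Trim, lower-case, and collapse known aliases onto one canonical value."""
    cleaned = series.str.strip().str.lower()
    return cleaned.replace(mapping)


def possession_mismatch(df):
    """Rows whose PossessionTeam matches neither team abbreviation."""
    return df[(df["PossessionTeam"] != df["HomeTeamAbbr"])
              & (df["PossessionTeam"] != df["VisitorTeamAbbr"])]


# Made-up rows for illustration only.
demo_clean = pd.DataFrame({
    "StadiumType": ["Outdoors", " outdoor", "Dome"],
    "PossessionTeam": ["BLT", "NE", "KC"],
    "HomeTeamAbbr": ["BAL", "NE", "KC"],
    "VisitorTeamAbbr": ["NE", "BAL", "DEN"],
})
mismatches_before = len(possession_mismatch(demo_clean))
demo_clean["PossessionTeam"] = demo_clean["PossessionTeam"].replace(TEAM_ABBR_MAP)
demo_clean["StadiumType"] = normalize_text(demo_clean["StadiumType"], STADIUM_TYPE_MAP)
mismatches_after = len(possession_mismatch(demo_clean))
```

The same normalize_text helper could be reused for Turf and Location with their own alias tables.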

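EDA: exploratory data analysis

Below is a matplotlib plot of several offensive and defensive plays from one game; the black triangle is the QB, red is the offense, and light blue is the defense:

Each play's alignment and the whole course of the drive are clearly visible. I also wrote up notes on an NFL game, Patriots vs. Ravens, a head-to-head meeting of a veteran QB and a young one; it was a great game and is worth reading alongside the plot;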

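Feature engineering

I do not know American football very well myself (I strongly recommend the film The Blind Side), so the feature engineering is not especially good, and the Top 61% result reflects that. I still think it is a useful reference, so below I go through the purpose of each new feature and my reasoning;

WindSpeed, WindDirection: intuitively these should have little effect on the play. Some passers may prefer a tailwind or a headwind, but the effect should be very small, so I drop them;
PlayerHeight: converted to a numeric height; height is undoubtedly relevant to the play;
PlayerBirthDate: birth date converted to age, which can indicate whether a player is at his physical peak;
Time from snap to handoff (TimeHandoff - TimeSnap): I think the length of this interval partly determines the choice of tactics, and tactics certainly affect the yards gained;
Elapsed game time (15 - GameClock + Quarter*15): how long the game has been going strongly affects the players' stamina and the choice of tactics. Because Quarter starts at 1, this is the elapsed minutes plus a constant 15-minute offset, which does not affect a model;
Position_XX: counts how many players of each position are on the field for the current play, which is also closely tied to the choice of tactics;
Goal zone: the line of scrimmage is at or inside the opponent's 10-yard line, so a touchdown is no more than 10 yards away. Tactics usually change in this situation, for both the defense and the offense;
First-down danger: this is my own definition. When the offense has only one attempt left and still needs more than 5 yards, that is, Down is 4 and Distance is greater than 5, I consider the first down to be in danger, and possession is likely to be lost;
Yards to touchdown: the defense usually adapts its strategy to this distance, tending to be more conservative when it is long and more aggressive when it is short;
The remaining object-typed features are label encoded;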
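The features above can be sketched with pandas roughly as follows. This is not the author's code. It assumes that PlayerHeight looks like "6-2", that GameClock looks like "14:30:00" (minutes:seconds:00), that the timestamps are ISO-8601 UTC strings, and that the line of scrimmage is in the offense's own half when FieldPosition equals PossessionTeam. The new column names and the demo rows are made up.

```python
import pandas as pd


def add_features(df):
    """Add the derived features described above; returns a new frame."""
    df = df.copy()
    # Height "feet-inches" -> total inches.
    height = df["PlayerHeight"].str.split("-", expand=True).astype(int)
    df["PlayerHeightInches"] = height[0] * 12 + height[1]
    # Age in years at the handoff, and seconds from snap to handoff.
    handoff = pd.to_datetime(df["TimeHandoff"], utc=True)
    snap = pd.to_datetime(df["TimeSnap"], utc=True)
    birth = pd.to_datetime(df["PlayerBirthDate"], format="%m/%d/%Y", utc=True)
    df["PlayerAge"] = (handoff - birth).dt.days / 365.25
    df["SnapToHandoff"] = (handoff - snap).dt.total_seconds()
    # Elapsed game time, following the formula 15 - GameClock + Quarter*15.
    clock = df["GameClock"].str.split(":", expand=True).astype(int)
    clock_minutes = clock[0] + clock[1] / 60
    df["GameElapsed"] = 15 - clock_minutes + df["Quarter"] * 15
    # Yards left to the opponent's end zone (assumed rule, see lead-in).
    own_half = df["FieldPosition"] == df["PossessionTeam"]
    df["YardsToEndZone"] = df["YardLine"].where(~own_half, 100 - df["YardLine"])
    df["GoalZone"] = (df["YardsToEndZone"] <= 10).astype(int)
    df["FirstDownDanger"] = ((df["Down"] == 4) & (df["Distance"] > 5)).astype(int)
    # Position_XX: number of players of each position on the field per play.
    counts = pd.crosstab(df["PlayId"], df["Position"]).add_prefix("Position_")
    return df.join(counts, on="PlayId")


# Made-up rows for illustration only: two players from the same play.
demo_feat = pd.DataFrame({
    "PlayId": [1, 1],
    "Position": ["QB", "RB"],
    "PlayerHeight": ["6-2", "5-11"],
    "PlayerBirthDate": ["05/01/1990", "10/15/1995"],
    "TimeSnap": ["2019-11-04T01:20:00.000Z"] * 2,
    "TimeHandoff": ["2019-11-04T01:20:02.000Z"] * 2,
    "GameClock": ["14:30:00"] * 2,
    "Quarter": [1, 1],
    "FieldPosition": ["BAL", "BAL"],
    "PossessionTeam": ["NE", "NE"],
    "YardLine": [8, 8],
    "Down": [4, 4],
    "Distance": [6, 6],
})
features = add_features(demo_feat)
```

The remaining object-typed columns could then be label encoded, for example with `df[col].astype("category").cat.codes`, although categories seen only at prediction time would need separate handling.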