Learning Data Mining with Python (《Python數據挖掘入門與實踐》): Predicting the Winning Team with a Decision Tree


Data sources: 1. 2013-14 NBA Schedule and Results

              2. 2013 NBA season standings

Reference book: 《Python數據挖掘入門與實踐》 (Learning Data Mining with Python)

 

1. Loading the dataset:

Load the dataset with pandas. It has 1319 rows and 8 columns. Look at the first 5 rows and check whether any rows are duplicated.

    # Use a decision tree to predict the winning team
    import time
    start = time.perf_counter()  # time.clock() was removed in Python 3.8

    # Load the dataset
    import pandas as pd
    file_name = r'D:\datasets\NBA_2014_games.csv'
    data = pd.read_csv(file_name)
    print(data.head())  # first 5 rows
    #               Date Unnamed: 1       Visitor/Neutral  PTS         Home/Neutral \ ...
    # 0  Tue Oct 29 2013  Box Score         Orlando Magic   87       Indiana Pacers
    # 1  Tue Oct 29 2013  Box Score  Los Angeles Clippers  103   Los Angeles Lakers
    # 2  Tue Oct 29 2013  Box Score         Chicago Bulls   95           Miami Heat
    # 3  Wed Oct 30 2013  Box Score         Brooklyn Nets   94  Cleveland Cavaliers
    # 4  Wed Oct 30 2013  Box Score         Atlanta Hawks  109     Dallas Mavericks
    print(data.shape)  # (1319, 8)
    print(data[data.duplicated()])  # Empty DataFrame: no duplicated rows
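The duplicate check above relies on `DataFrame.duplicated()`, which marks every row that repeats an earlier one. A minimal sketch on made-up data (the `games` frame below is hypothetical, not the NBA file) shows what the mask looks like when a duplicate does exist:

```python
import pandas as pd

# Toy schedule with one repeated row (made-up data, not the NBA file).
games = pd.DataFrame({
    'Visitor': ['A', 'B', 'A'],
    'Home': ['C', 'D', 'C'],
    'VisitorPts': [90, 101, 90],
    'HomePts': [95, 99, 95],
})

dup_mask = games.duplicated()      # True for each repeat of an earlier row
print(games[dup_mask])             # only the repeated row (index 2)

deduped = games.drop_duplicates()  # keeps the first occurrence of each row
print(deduped.shape)               # (2, 4)
```

An empty result from `data[data.duplicated()]`, as in the NBA data, means nothing needs to be dropped.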

Cleaning the dataset: 1. The first column holds dates as strings, so parse it into a date type. 2. Rename the column headers.

    # Fix the date column and the headers
    data = pd.read_csv(file_name, parse_dates=['Date'])  # parse_dates converts the Date column
    data.columns = ['Date', 'Score Type', 'Visitor Team', 'VisitorPts',
                    'Home Team', 'HomePts', 'OT?', 'Notes']  # rename the headers
    print(data.head())
    #         Date Score Type          Visitor Team  VisitorPts \ ...
    # 0 2013-10-29  Box Score         Orlando Magic          87
    # 1 2013-10-29  Box Score  Los Angeles Clippers         103
    # 2 2013-10-29  Box Score         Chicago Bulls          95
    # 3 2013-10-30  Box Score         Brooklyn Nets          94
    # 4 2013-10-30  Box Score         Atlanta Hawks         109
    print('-----')
    # print(data.iloc[1])  # print the 2nd row (.ix was removed in pandas 1.0)
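If `parse_dates` does not recognise the date layout in a given pandas version, the conversion can also be done explicitly. The sketch below assumes the `Tue Oct 29 2013` layout shown in the output above and uses a small in-memory CSV (the `csv_text` content is made up for illustration):

```python
import io
import pandas as pd

# Two rows in the same date layout as the NBA file (illustrative content).
csv_text = """Date,Visitor,PTS
Tue Oct 29 2013,Orlando Magic,87
Wed Oct 30 2013,Brooklyn Nets,94
"""
raw = pd.read_csv(io.StringIO(csv_text))

# Explicit format: abbreviated weekday, abbreviated month, day, year.
raw['Date'] = pd.to_datetime(raw['Date'], format='%a %b %d %Y')

print(pd.api.types.is_datetime64_any_dtype(raw['Date']))  # True
print(raw['Date'].dt.day_name().tolist())                 # ['Tuesday', 'Wednesday']
```

Once the column is a real datetime type, sorting by date and date arithmetic behave as expected.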

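Extracting new features: derive features from the existing data. First decide on the class. A basketball game ends in a win or a loss, unlike football, which can also end in a draw. 1 stands for a win and 0 for a loss; in the code below the class is whether the home team won.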

    # Extract new features
    # Find the winning team
    data['HomeWin'] = data['VisitorPts'] < data['HomePts']
    y_true = data['HomeWin'].values
    print(y_true[:5])  # [ True True True True True], a numpy array
    # print(data.head())

    # Create 2 new features: whether each of the two teams won its previous game
    # Dictionary holding each team's last result
    from collections import defaultdict
    won_last = defaultdict(int)
    # These two lines are not in the book; the columns must exist before the loop fills them
    data['HomeLastWin'] = False
    data['VisitorLastWin'] = False
    for index, row in data.iterrows():
        home_team = row['Home Team']
        visitor_team = row['Visitor Team']
        # Write the feature values in place (.ix was removed in pandas 1.0, so use .loc)
        data.loc[index, 'HomeLastWin'] = bool(won_last[home_team])
        data.loc[index, 'VisitorLastWin'] = bool(won_last[visitor_team])
        # Record this game's result for each team's next game
        won_last[home_team] = row['HomeWin']
        won_last[visitor_team] = not row['HomeWin']
    print('----')
    # print(data.loc[20:25])
    #              Home Team  HomePts  OT? Notes  HomeWin  HomeLastWin  VisitorLastWin
    # 20      Boston Celtics       98  NaN   NaN    False        False           False
    # 21       Brooklyn Nets      101  NaN   NaN     True        False           False
    # 22   Charlotte Bobcats       90  NaN   NaN     True        False            True
    # 23      Denver Nuggets       98  NaN   NaN    False        False           False
    # 24     Houston Rockets      113  NaN   NaN     True         True            True
    # 25  Los Angeles Lakers       85  NaN   NaN    False        False            True
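To see how the "won last game" features come out, it can help to trace the same idea on three made-up games. This is only a sketch: the `toy` frame and its team names are hypothetical, and it collects the values in lists and assigns the columns once instead of writing row by row.

```python
from collections import defaultdict
import pandas as pd

# Three made-up games, in chronological order.
toy = pd.DataFrame({
    'Visitor Team': ['A', 'B', 'C'],
    'Home Team':    ['B', 'C', 'B'],
    'HomeWin':      [True, False, True],
})

won_last = defaultdict(bool)  # a team with no earlier game defaults to False
home_last, visitor_last = [], []
for home, visitor, home_win in zip(toy['Home Team'], toy['Visitor Team'], toy['HomeWin']):
    # Read the previous results first, then update them with this game.
    home_last.append(won_last[home])
    visitor_last.append(won_last[visitor])
    won_last[home] = bool(home_win)
    won_last[visitor] = not home_win

toy['HomeLastWin'] = home_last
toy['VisitorLastWin'] = visitor_last
print(toy)
```

Reading before updating matters: if the dictionary were updated first, each row would leak its own result into its features.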

Some practice code: how defaultdict and iterrows() work

    won_last['jj'] = 12
    # When a missing key is looked up, defaultdict returns a default value instead of raising KeyError
    dd = won_last['Indiana Pacers']
    print(dd)  # 0
    print(won_last)  # defaultdict(<class 'int'>, {'Indiana Pacers': 0, 'jj': 12}); key order may vary by Python version

    dataset = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
    print(dataset)
    for index, row in dataset.iterrows():
        print(index)  # 0, 1, 2: the row labels
        print(row)    # all the elements of the 1st, 2nd and 3rd rows in turn

2. Using a decision tree

For how decision trees work, see:

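The decision tree is used here as is, without any deliberate parameter tuning. Perhaps the author wanted to compare how useful the different features are.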

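Building effective features from a dataset (feature engineering) is the hard part of data mining. Good features directly affect accuracy, and can matter even more than choosing the right algorithm.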

    # Use a decision tree
    from sklearn.tree import DecisionTreeClassifier
    clf = DecisionTreeClassifier(random_state=14)  # fix the random seed for reproducibility ... yet my results still differed
    X_previousWins = data[['HomeLastWin', 'VisitorLastWin']].values  # the 2 new features are the input

    from sklearn.model_selection import cross_val_score  # mean score under cross-validation
    import numpy as np
    scores = cross_val_score(clf, X_previousWins, y_true, scoring='accuracy')
    mean_score = np.mean(scores) * 100
    print('the accuracy is %0.2f' % mean_score + '%')  # the accuracy is 57.47%
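`cross_val_score` returns one score per fold, and the number of folds depends on the `cv` argument. scikit-learn 0.22 changed the default from 3 folds to 5, which is one possible reason why a fixed `random_state` still gives numbers that differ from a book written against an older release. A minimal sketch on synthetic data (the arrays below are made up, not the NBA features) with `cv` set explicitly:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic data: the label copies feature 0, flipped about 20% of the time.
rng = np.random.RandomState(14)
X = rng.randint(0, 2, size=(300, 2))
flip = rng.rand(300) < 0.2
y = np.where(flip, 1 - X[:, 0], X[:, 0])

clf = DecisionTreeClassifier(random_state=14)
scores = cross_val_score(clf, X, y, scoring='accuracy', cv=5)  # cv set explicitly
print(scores)         # one accuracy value per fold
print(scores.mean())  # the figure usually reported
```

Passing `cv` explicitly keeps the evaluation protocol the same across scikit-learn versions.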

Using another dataset: the 2013 NBA standings

    # Read the 2013 team standings
    file_name2 = r'D:\datasets\NBA_2013_stangdings.csv'
    standings = pd.read_csv(file_name2)
    # print(standings.head())
    #    Rk                   Team Overall  Home   Road      E      W     A     C \ ...
    # 0   1             Miami Heat   66-16  37-4  29-12  41-11   25-5  14-4  12-6
    # 1   2  Oklahoma City Thunder   60-22  34-7  26-15   21-9  39-13   7-3   8-2
    # 2   3      San Antonio Spurs   58-24  35-6  23-18   25-5  33-19   8-2   9-1
    # 3   4         Denver Nuggets   57-25  38-3  19-22  19-11  38-14   5-5  10-0
    # 4   5   Los Angeles Clippers   56-26  32-9  24-17   21-9  35-17   7-3   8-2
    # print(standings.shape)  # (30, 24): there are 30 teams

Create a new feature: whether the home team ranks higher than its opponent. Then fit the model with the 3 features created so far.

    # Create a new feature: whether the home team ranks higher than its opponent
    data['HomeTeamRanksHigher'] = 0
    for index, row in data.iterrows():
        home_team = row['Home Team']
        visitor_team = row['Visitor Team']
        if home_team == 'New Orleans Pelicans':  # the team changed its name
            home_team = 'New Orleans Hornets'
        elif visitor_team == 'New Orleans Pelicans':
            visitor_team = 'New Orleans Hornets'
        # Compare the rankings and update the feature
        home_rank = standings[standings['Team'] == home_team]['Rk'].values[0]
        visitor_rank = standings[standings['Team'] == visitor_team]['Rk'].values[0]
        # Rk 1 is the best team, so a smaller Rk means a higher rank.
        # (The original compared with '>', which inverts the feature; for a
        # binary feature this should not change what the tree can learn.)
        data.loc[index, 'HomeTeamRanksHigher'] = int(home_rank < visitor_rank)

    X_homehigher = data[['HomeLastWin', 'VisitorLastWin', 'HomeTeamRanksHigher']].values
    # clf1 = DecisionTreeClassifier(random_state=14)
    # scores = cross_val_score(clf1, X_homehigher, y_true, scoring='accuracy')
    # mean_score1 = np.mean(scores) * 100
    # print('the new accuracy is %.2f' % mean_score1 + '%')  # the new accuracy is 59.67%
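Filtering `standings` twice for every game works, but the same lookup can be expressed as a mapping from team name to rank. The sketch below is one possible alternative on made-up games; it uses the first three rows of the standings shown above, treats a smaller `Rk` as a higher rank, and the names `rank_of`, `renamed` and `games` are hypothetical.

```python
import pandas as pd

# First three rows of the standings shown above, plus two made-up games.
standings = pd.DataFrame({
    'Rk': [1, 2, 3],
    'Team': ['Miami Heat', 'Oklahoma City Thunder', 'San Antonio Spurs'],
})
games = pd.DataFrame({
    'Home Team': ['Miami Heat', 'San Antonio Spurs'],
    'Visitor Team': ['Oklahoma City Thunder', 'Miami Heat'],
})

rank_of = standings.set_index('Team')['Rk']                # Series: team name -> rank
renamed = {'New Orleans Pelicans': 'New Orleans Hornets'}  # name used in the 2013 standings

home_rank = games['Home Team'].replace(renamed).map(rank_of)
visitor_rank = games['Visitor Team'].replace(renamed).map(rank_of)

# A smaller Rk is a better rank, so "ranks higher" means home_rank < visitor_rank.
games['HomeTeamRanksHigher'] = (home_rank < visitor_rank).astype(int)
print(games)
```

A team missing from the standings would map to NaN, and the comparison would then silently yield 0, so checking `home_rank.isna().any()` on real data is a sensible extra step.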

Create another feature: the result of the previous meeting between the two teams.

    # Another new feature: who won the last time these two teams met
    last_match_winner = defaultdict(int)
    data['HomeTeamWonLast'] = 0
    for index, row in data.iterrows():
        home_team = row['Home Team']
        visitor_team = row['Visitor Team']
        teams = tuple(sorted([home_team, visitor_team]))  # same key whichever side is at home
        data.loc[index, 'HomeTeamWonLast'] = 1 if last_match_winner[teams] == home_team else 0
        winner = home_team if row['HomeWin'] else visitor_team
        last_match_winner[teams] = winner

    X_lastwinner = data[['HomeTeamWonLast', 'HomeTeamRanksHigher']].values
    # clf2 = DecisionTreeClassifier(random_state=14)
    # scores = cross_val_score(clf2, X_lastwinner, y_true, scoring='accuracy')
    # mean_score2 = np.mean(scores) * 100
    # print('the accuracy is %.2f' % mean_score2 + '%')  # the accuracy is 57.85%
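The key step above is the sorted tuple: it gives the pair of teams one dictionary key no matter which of them plays at home. A small sketch with made-up games (the tuples in `games` are hypothetical) traces the feature by hand:

```python
# Each made-up game is (home, visitor, home_win), in chronological order.
games = [('B', 'A', True), ('A', 'B', True), ('A', 'B', False)]

last_winner = {}  # sorted team pair -> winner of their previous meeting
feature = []
for home, visitor, home_win in games:
    pair = tuple(sorted([home, visitor]))  # ('A', 'B') for every game here
    # 1 only if the pair has met before and the current home team won that meeting
    feature.append(int(last_winner.get(pair) == home))
    last_winner[pair] = home if home_win else visitor

print(feature)  # [0, 0, 1]
```

In the first game there is no earlier meeting, in the second the previous winner (B) is now the visitor, and only in the third did the home team (A) win the previous meeting.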

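Next, see whether the decision tree can still learn an effective model when it is given a lot more data: feed in the teams themselves, after encoding them.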

For more on encoding, see

    # Use the LabelEncoder transformer to turn the string team names into integers
    from sklearn.preprocessing import LabelEncoder
    encoding = LabelEncoder()
    encoding.fit(data['Home Team'].values)  # learn the name -> integer mapping from the home teams
    home_teams = encoding.transform(data['Home Team'].values)
    visitor_teams = encoding.transform(data['Visitor Team'].values)
    X_teams = np.vstack([home_teams, visitor_teams]).T

    from sklearn.preprocessing import OneHotEncoder
    onehot = OneHotEncoder()
    # .toarray() returns a plain ndarray; recent scikit-learn rejects the np.matrix from .todense()
    X_teams_expanded = onehot.fit_transform(X_teams).toarray()
    clf3 = DecisionTreeClassifier(random_state=14)
    # scores = cross_val_score(clf3, X_teams_expanded, y_true, scoring='accuracy')
    # mean_score3 = np.mean(scores) * 100
    # print('the accuracy is %.2f' % mean_score3 + '%')  # the accuracy is 59.52%
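The reason for the second step is that integer labels imply an order (team 2 "greater than" team 1) that means nothing for team names; one-hot encoding replaces each integer with its own 0/1 column. A minimal sketch with three made-up short team names shows the shapes involved:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Made-up short names; each row is one game (home, visitor).
home = np.array(['Heat', 'Spurs', 'Nets'])
visitor = np.array(['Spurs', 'Nets', 'Heat'])

# Fit on all names so home and visitor share one mapping (sorted alphabetically).
encoder = LabelEncoder().fit(np.concatenate([home, visitor]))
X_ids = np.vstack([encoder.transform(home), encoder.transform(visitor)]).T
print(X_ids)  # [[0 2] [2 1] [1 0]]: Heat=0, Nets=1, Spurs=2

# One column per (feature, team) combination: 2 features x 3 teams = 6 columns.
X_onehot = OneHotEncoder().fit_transform(X_ids).toarray()
print(X_onehot.shape)  # (3, 6)
```

With 30 NBA teams the same step turns the 2 integer columns into 60 binary columns.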

3. Using a random forest

A random forest is an ensemble learning algorithm.

    print('----rf-----')
    # Predict with a random forest
    from sklearn.ensemble import RandomForestClassifier
    # rf = RandomForestClassifier(random_state=14, n_jobs=-1)  # the trees' parameters would be worth tuning
    # rf_scores = cross_val_score(rf, X_teams, y_true, scoring='accuracy')
    # mean_rf_score = np.mean(rf_scores) * 100
    # print('the randforestclassifier accuracy is %.2f' % mean_rf_score + '%')
    # the randforestclassifier accuracy is 58.38%

    # Use a few more features
    print('using more features')
    X_all = np.hstack([X_homehigher, X_teams])
    # rf_clf2 = RandomForestClassifier(random_state=14, n_jobs=-1)
    # rf_scores2 = cross_val_score(rf_clf2, X_all, y_true, scoring='accuracy')
    # mean_rf_score2 = np.mean(rf_scores2) * 100
    # print('the accuracy is %.2f' % mean_rf_score2 + '%')  # the accuracy is 57.62%
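A random forest trains many decision trees on bootstrap samples with random feature subsets and combines their predictions. To compare a single tree with a forest under the same cross-validation, a sketch like the following can be used. It runs on synthetic data from `make_classification`, so the numbers say nothing about the NBA task, and which model scores higher is not guaranteed.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification problem (stand-in for real features).
X, y = make_classification(n_samples=400, n_features=10, n_informative=4,
                           random_state=14)

tree = DecisionTreeClassifier(random_state=14)
forest = RandomForestClassifier(n_estimators=100, random_state=14)

# Same data and the same 5 folds for both models.
tree_scores = cross_val_score(tree, X, y, scoring='accuracy', cv=5)
forest_scores = cross_val_score(forest, X, y, scoring='accuracy', cv=5)

print('single tree  : %.3f' % tree_scores.mean())
print('random forest: %.3f' % forest_scores.mean())
```

Comparing both models on identical folds keeps the comparison about the model rather than about the split.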

Use grid search to find the best model, and inspect the parameters it used.

    # Tune the parameters with grid search
    from sklearn.model_selection import GridSearchCV
    param_grid = {
        'max_features': [2, 3, 'sqrt'],  # 'auto' (the same as 'sqrt' for classifiers) was removed in scikit-learn 1.3
        'n_estimators': [100, 110, 120],
        'criterion': ['gini', 'entropy'],
        'min_samples_leaf': [2, 4, 6],
    }
    clf = RandomForestClassifier(random_state=14, n_jobs=-1)
    grid = GridSearchCV(clf, param_grid)
    grid.fit(X_all, y_true)
    score = grid.best_score_ * 100
    print('the accuracy is %.2f' % score + '%')
    something = str(grid.best_estimator_)
    print(something)          # the best model found by the grid search
    print(grid.best_params_)  # the best parameter combination
    # the accuracy is 62.02%
    # RandomForestClassifier(bootstrap=True, class_weight=None, criterion='entropy',
    #             max_depth=None, max_features=3, max_leaf_nodes=None,
    #             min_impurity_decrease=0.0, min_impurity_split=None,
    #             min_samples_leaf=2, min_samples_split=2,
    #             min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=-1,
    #             oob_score=False, random_state=14, verbose=0, warm_start=False)
    # {'n_estimators': 100, 'criterion': 'entropy', 'max_features': 3, 'min_samples_leaf': 2}
    # time taken: 117.93s

    end = time.perf_counter()
    elapsed = end - start  # do not reuse the name `time`, which would shadow the module
    print('time taken: %.2f' % elapsed + 's')
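Besides `best_score_` and `best_params_`, the fitted search keeps the score of every parameter combination in `cv_results_`, which shows how close the runners-up were. A minimal sketch on synthetic data with a deliberately tiny grid (the data and grid values are made up so that it runs quickly):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Small synthetic problem and a 2 x 2 grid, so only 4 combinations are tried.
X, y = make_classification(n_samples=200, n_features=6, n_informative=3,
                           random_state=14)
param_grid = {'n_estimators': [10, 20], 'min_samples_leaf': [2, 4]}

grid = GridSearchCV(RandomForestClassifier(random_state=14), param_grid,
                    scoring='accuracy', cv=3)
grid.fit(X, y)

# One row per parameter combination, best first.
results = pd.DataFrame(grid.cv_results_)[
    ['params', 'mean_test_score', 'rank_test_score']]
print(results.sort_values('rank_test_score'))
print(grid.best_params_)
```

The search fits one model per combination per fold (plus a final refit), which is why the full grid above, with 54 combinations of forests of 100 or more trees, takes minutes rather than seconds.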

 

