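Concept
Random forest (RandomForest): a random forest is a classifier made up of multiple decision trees; the class it outputs is the mode of the classes output by the individual trees.
Advantages: works with both discrete and continuous attributes; on very large data sets it largely avoids overfitting; on high-dimensional data it does not run into feature-selection difficulties; it is simple to implement, fast to train, and well suited to distributed computation.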
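To make the "mode of the individual trees" idea concrete, here is a minimal sketch of the voting step only. The function name majority_vote and the class labels are hypothetical, chosen for illustration; no trees are trained here. Note also that scikit-learn's RandomForestClassifier combines its trees by averaging their predicted class probabilities rather than by a hard vote, so this shows the textbook definition, not that library's internals.

```python
from collections import Counter


def majority_vote(tree_predictions):
    """Return the most common class label among the per-tree predictions.

    Ties go to the label that appears first in the list.
    """
    return Counter(tree_predictions).most_common(1)[0][0]


# Hypothetical predictions from five trees for a single sample
votes = ["Plans", "Does not plan", "Plans", "Plans", "Does not plan"]
forest_prediction = majority_vote(votes)
print(forest_prediction)
```

Three of the five hypothetical trees agree, so their label becomes the forest's prediction.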

    import pandas

    data = pandas.read_csv("D:\\PDM\\5.3\\data.csv")

    # Convert the text columns to categorical type before one-hot encoding
    dummyColumns = ["Gender", "ParentEncouragement"]

    for column in dummyColumns:
        data[column] = data[column].astype('category')

    # One-hot encode; drop_first=True keeps one column per two-level attribute
    dummiesData = pandas.get_dummies(
        data,
        columns=dummyColumns,
        prefix=dummyColumns,
        prefix_sep="=",
        drop_first=True
    )
    print(dummiesData.columns)

    # Feature columns
    fData = dummiesData[[
        'ParentIncome', 'IQ', 'Gender=Male',
        'ParentEncouragement=Not Encouraged'
    ]]

    # Target column
    tData = dummiesData["CollegePlans"]

    from sklearn.tree import DecisionTreeClassifier
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    # Decision tree with default parameters, 10-fold cross-validation
    dtModel = DecisionTreeClassifier()

    dtScores = cross_val_score(
        dtModel,
        fData, tData, cv=10
    )

    print(dtScores.mean())

    # Random forest with default parameters, 10-fold cross-validation
    rfcModel = RandomForestClassifier()

    rfcScores = cross_val_score(
        rfcModel,
        fData, tData, cv=10
    )

    print(rfcScores.mean())
Decision tree score:
Random forest score:
Without any tuning, the random forest scores higher than the decision tree model.
Tuning: set max_leaf_nodes=8

    # Tune both models
    dtModel = DecisionTreeClassifier(max_leaf_nodes=8)

    dtScores = cross_val_score(
        dtModel,
        fData, tData, cv=10
    )

    print(dtScores.mean())

    rfcModel = RandomForestClassifier(max_leaf_nodes=8)

    rfcScores = cross_val_score(
        rfcModel,
        fData, tData, cv=10
    )

    print(rfcScores.mean())
Decision tree score:
Random forest score:
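The value 8 is one possible setting for max_leaf_nodes. As a rough sketch of how several candidates could be compared with the same cross-validation approach, the example below loops over a few values. Everything specific in it is an assumption for illustration: the data is synthetic (generated with make_classification, since data.csv is not available here), and the candidate list, n_estimators=20, cv=5 and random_state=0 are arbitrary choices made to keep the run short and repeatable.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (assumption): 4 features, binary target
X, y = make_classification(
    n_samples=300, n_features=4, n_informative=3,
    n_redundant=0, random_state=0
)

# Illustrative candidates; None means the number of leaves is not limited
candidates = [4, 8, 16, None]
meanScores = {}

for maxLeafNodes in candidates:
    model = RandomForestClassifier(
        n_estimators=20, max_leaf_nodes=maxLeafNodes, random_state=0
    )
    # Mean accuracy over 5 cross-validation folds
    meanScores[maxLeafNodes] = cross_val_score(model, X, y, cv=5).mean()

for maxLeafNodes, score in meanScores.items():
    print(maxLeafNodes, round(score, 3))
```

On real data the same loop would take fData and tData in place of X and y; the best value depends on the data set, so the scores from the synthetic data say nothing about the college plans data.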