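The random forest (RandomForest) algorithm exposes an attribute called feature_importances_, i.e. feature importance. Here I try to explain what it actually means.
From the official site and other material, RF can produce two kinds of feature importance: Variable importance and Gini importance. Both are feature importances; they differ only in how they are computed.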
Variable importance
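Pick a feature M, perturb feature M with noise in all OOB (out-of-bag) samples, and measure the model's accuracy on the OOB samples again. How much the accuracy drops compared with the unperturbed data indicates how important the feature is.
If a feature matters a lot for classification, then once its values are no longer accurate the test results suffer noticeably; an unimportant feature can be disturbed by noise with almost no effect on the results. That is the simple idea behind Variable importance.
[On "adding noise": the official site's wording is "randomly permute the values of variable m in the oob cases". Permute here means shuffling: the values of variable m are randomly reordered across the OOB cases, so each case ends up with a value taken from some other case. This breaks the link between m and the label while keeping the distribution of m unchanged; it is not the same as adding white noise to the data.]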
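To make the shuffling idea concrete, here is a minimal sketch. It makes several assumptions that are not from the official site: it uses scikit-learn, synthetic data from make_classification, and a held-out test set in place of the per-tree OOB samples, so it only approximates Breiman's procedure.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic data (an assumption for illustration). With shuffle=False the
# first 3 columns are informative and the remaining 5 are pure noise.
X, y = make_classification(n_samples=1000, n_features=8, n_informative=3,
                           n_redundant=0, shuffle=False, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(X_train, y_train)
baseline = rf.score(X_test, y_test)  # accuracy on untouched data

rng = np.random.default_rng(0)
perm_importance = np.zeros(X.shape[1])
for m in range(X.shape[1]):
    X_perm = X_test.copy()
    # Shuffle only column m; every other column stays as it was.
    X_perm[:, m] = rng.permutation(X_perm[:, m])
    # Importance of feature m = drop in accuracy after shuffling it.
    perm_importance[m] = baseline - rf.score(X_perm, y_test)

for m, drop in enumerate(perm_importance):
    print(f"feature {m}: accuracy drop = {drop:.3f}")
```

Shuffling a noise column should leave the accuracy roughly where it was, while shuffling an informative column should lower it.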
Gini importance
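Pick a feature M and, over every tree in the RF, add up the decrease in Gini index (that is, the decrease in impurity) at each split node that splits on M. That sum is the importance of M.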
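As a rough sketch of that bookkeeping, the snippet below walks a single scikit-learn decision tree and accumulates the impurity decrease per feature. The iris dataset and the choice of one tree instead of a whole forest are my assumptions for illustration; the decrease at each node is weighted by the number of samples reaching it, which is how scikit-learn does it.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(criterion="gini", random_state=0).fit(X, y)
t = clf.tree_
n = t.weighted_n_node_samples  # (weighted) number of samples at each node

gini_decrease = np.zeros(X.shape[1])
for node in range(t.node_count):
    left, right = t.children_left[node], t.children_right[node]
    if left == -1:  # leaf node: no split, nothing to add
        continue
    # Sample-weighted impurity of the parent minus that of its two children.
    decrease = (n[node] * t.impurity[node]
                - n[left] * t.impurity[left]
                - n[right] * t.impurity[right])
    gini_decrease[t.feature[node]] += decrease

# Normalize so the importances sum to 1.
gini_importance = gini_decrease / gini_decrease.sum()
print(gini_importance)
print(clf.feature_importances_)
```

The two printed vectors should agree, which suggests the tree's own feature_importances_ is this same normalized impurity decrease.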
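Comparing the two, the former is more expensive to compute, while the latter only needs to accumulate statistics while the DTs are being built. According to the sklearn documentation for the feature_importances_ attribute, sklearn ranks features by Gini importance (impurity-based importance), and it normalizes the Gini importances so that they sum to 1, which gives the final feature_importances_ output.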
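A quick way to check the normalization is sketched below. The iris dataset and the hyperparameters are assumptions for illustration; the comparison against the per-tree average reflects how scikit-learn behaves in the versions I am aware of, not something stated on the Random Forests site.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

data = load_iris()
rf = RandomForestClassifier(n_estimators=50, random_state=0)
rf.fit(data.data, data.target)

importances = rf.feature_importances_
print("sum of importances:", importances.sum())

# Average the per-tree importances, then renormalize to sum to 1.
per_tree = np.array([tree.feature_importances_ for tree in rf.estimators_])
averaged = per_tree.mean(axis=0)
averaged /= averaged.sum()

# Rank features from most to least important.
ranking = np.argsort(importances)[::-1]
for idx in ranking:
    print(f"{data.feature_names[idx]:<20s} {importances[idx]:.3f}")
```

Here averaged should match rf.feature_importances_, and the importances add up to 1.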
References:
RandomForest official site: https://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm
Variable importance
The variable importances are critical. The run computing importances is done by switching imp =0 to imp =1 in the above parameter list. The output has four columns:
gene number; the raw importance score; the z-score obtained by dividing the raw score by its standard error; the significance level.
The highest 25 gene importances are listed sorted by their z-scores. To get the output on a disk file, put impout =1, and give a name to the corresponding output file. If impout is put equal to 2 the results are written to screen and you will see a display similar to that immediately below:
gene number   raw score   z-score   significance
667           1.414       1.069     0.143
689           1.259       0.961     0.168
666           1.112       0.903     0.183
668           1.031       0.849     0.198
682           0.820       0.803     0.211
878           0.649       0.736     0.231
1080          0.514       0.729     0.233
1104          0.514       0.718     0.237
879           0.591       0.713     0.238
895           0.519       0.685     0.247
3621          0.552       0.684     0.247
3529          0.650       0.683     0.247
3404          0.453       0.661     0.254
623           0.286       0.655     0.256
3617          0.498       0.654     0.257
650           0.505       0.650     0.258
645           0.380       0.644     0.260
3616          0.497       0.636     0.262
938           0.421       0.635     0.263
915           0.426       0.631     0.264
669           0.484       0.626     0.266
663           0.550       0.625     0.266
723           0.334       0.610     0.271
685           0.405       0.605     0.272
3631          0.402       0.603     0.273
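The excerpt does not say how the significance level is derived from the z-score. The listed values look consistent with a one-sided (upper-tail) standard normal probability, so the sketch below checks that reading against a few rows of the table. Treat this as my assumption rather than the site's stated method; upper_tail_p is a hypothetical helper name.

```python
import math

def upper_tail_p(z):
    """P(Z > z) for a standard normal Z (one-sided tail probability)."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))

# (gene number, z-score, listed significance) copied from the table above.
rows = [(667, 1.069, 0.143), (689, 0.961, 0.168), (666, 0.903, 0.183)]
for gene, z, listed in rows:
    print(gene, round(upper_tail_p(z), 3), listed)
```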
Using important variables
Another useful option is to do an automatic rerun using only those variables that were most important in the original run. Say we want to use only the 15 most important variables found in the first run in the second run. Then in the options change mdim2nd=0 to mdim2nd=15 , keep imp=1 and compile. Directing output to screen, you will see the same output as above for the first run plus the following output for the second run. Then the importances are output for the 15 variables used in the 2nd run.
gene number   raw score   z-score   significance
3621          6.235       2.753     0.003
1104          6.059       2.709     0.003
3529          5.671       2.568     0.005
666           7.837       2.389     0.008
3631          4.657       2.363     0.009
667           7.005       2.275     0.011
668           6.828       2.255     0.012
689           6.637       2.182     0.015
878           4.733       2.169     0.015
682           4.305       1.817     0.035
644           2.710       1.563     0.059
879           1.750       1.283     0.100
686           1.937       1.261     0.104
1080          0.927       0.906     0.183
623           0.564       0.847     0.199
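The same two-pass idea can be sketched in scikit-learn. This is not the original Fortran program or its mdim2nd option: the synthetic data is an assumption, and the ranking uses sklearn's impurity-based feature_importances_ rather than the permutation-based z-scores shown above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the gene data (an assumption for illustration).
X, y = make_classification(n_samples=500, n_features=100, n_informative=10,
                           random_state=0)

# First run: fit on all variables and rank them by importance.
first = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
k = 15
top_k = np.argsort(first.feature_importances_)[::-1][:k]

# Second run: refit using only the k most important variables.
second = RandomForestClassifier(n_estimators=200, random_state=0)
second.fit(X[:, top_k], y)

# Importances from the second run, mapped back to the original column index.
pairs = sorted(zip(top_k, second.feature_importances_), key=lambda p: -p[1])
for col, imp in pairs:
    print(col, round(imp, 3))
```

As in the output above, the importances in the second run are recomputed over the reduced variable set, so they are not directly comparable with those from the first run.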
Variable interactions
Another option is looking at interactions between variables. If variable m1 is correlated with variable m2 then a split on m1 will decrease the probability of a nearby split on m2. The distance between splits on any two variables is compared with their theoretical difference if the variables were independent. The latter is subtracted from the former; a large resulting value is an indication of a repulsive interaction. To get this output, change interact =0 to interact=1 leaving imp =1 and mdim2nd =10.
The output consists of a code list: telling us the numbers of the genes corresponding to id. 1-10. The interactions are rounded to the closest integer and given in the matrix following two column list that tells which gene number is number 1 in the table, etc.
id   1   2   3   4   5   6   7   8   9  10
1    0  13   2   4   8  -7   3  -1  -7  -2
2   13   0  11  14  11   6   3  -1   6   1
3    2  11   0   6   7  -4   3   1   1  -2
4    4  14   6   0  11  -2   1  -2   2  -4
5    8  11   7  11   0  -1   3   1  -8   1
6   -7   6  -4  -2  -1   0   7   6  -6  -1
7    3   3   3   1   3   7   0  24  -1  -1
8   -1  -1   1  -2   1   6  24   0  -2  -3
9   -7   6   1   2  -8  -6  -1  -2   0  -5
10  -2   1  -2  -4   1  -1  -1  -3  -5   0
There are large interactions between gene 2 and genes 1,3,4,5 and between 7 and 8.