0. Background
In machine learning, the quality of an algorithm is usually judged with different metrics depending on the scenario. Commonly used metrics include:
- [x] Accuracy;
- [x] PR (Precision / Recall);
- [x] F measure;
- [ ] MCC (Matthews correlation coefficient);
- [ ] BM;
- [ ] MK;
- [ ] Gini coefficient;
- [x] ROC;
- [ ] Z score;
- [x] AUC ;
- [ ] Cost Curve;
- [ ] BLEU;
- [ ] METEOR;
- [ ] Brier score;
- [ ] NIST (metric);
- [ ] ROUGE (metric);
- [ ] Sørensen–Dice coefficient;
- [ ] Uncertainty coefficient, aka Proficiency;
- [ ] Word error rate (WER);
A very important binary-classification confusion matrix, taken from the wiki, is used to explain the content that follows.
Figure 0.1: confusion matrix from the wiki
Figure 0.1 shows the wiki's confusion matrix for binary classification, together with the various metrics derived from it. In it:
- true condition: the columns give the actual class; predicted condition: the rows give the predicted class;
- actual positives = true positive + false negative; actual negatives = false positive + true negative;
- predicted positives = true positive + false positive; predicted negatives = false negative + true negative.
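These relations can be checked directly in code. Below is a minimal sketch (NumPy assumed; the toy arrays y_true and y_pred are made up for illustration):

```python
# Minimal sketch of the binary confusion matrix in Figure 0.1 (toy data).
import numpy as np

y_true = np.array([1, 1, 1, 0, 0, 0, 0, 1, 0, 0])  # actual condition
y_pred = np.array([1, 0, 1, 0, 1, 0, 0, 1, 0, 1])  # predicted condition

tp = np.sum((y_true == 1) & (y_pred == 1))  # true positive
fn = np.sum((y_true == 1) & (y_pred == 0))  # false negative
fp = np.sum((y_true == 0) & (y_pred == 1))  # false positive
tn = np.sum((y_true == 0) & (y_pred == 0))  # true negative

# The row/column sums match the relations listed above.
assert tp + fn == np.sum(y_true == 1)  # actual positives
assert fp + tn == np.sum(y_true == 0)  # actual negatives
assert tp + fp == np.sum(y_pred == 1)  # predicted positives
assert fn + tn == np.sum(y_pred == 0)  # predicted negatives
```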
1. Meaning of the different metrics
1.1 Accuracy & Precision / Recall
As shown in Figure 0.1:
- accuracy (ACC in Figure 0.1): the most commonly used metric, \(\frac{\text{number of correctly predicted samples}}{\text{total number of samples}}\);
- Precision (PPV in Figure 0.1): the fraction of the predicted positives that are predicted correctly, \(\frac{true\, positive}{\text{predicted positives}}\);
- Recall (TPR in Figure 0.1): the fraction of the actual positives that are predicted correctly, \(\frac{true\, positive}{\text{actual positives}}\) (see the sketch after this list).
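A minimal sketch of these three quantities, computed from the four confusion-matrix counts (the numbers below are the toy counts from the confusion-matrix sketch above):

```python
# Minimal sketch of ACC, PPV (precision) and TPR (recall) from the four counts.
def accuracy(tp, fp, fn, tn):
    return (tp + tn) / (tp + fp + fn + tn)  # correctly predicted / all samples

def precision(tp, fp):
    return tp / (tp + fp)                   # TP / predicted positives

def recall(tp, fn):
    return tp / (tp + fn)                   # TP / actual positives

print(accuracy(3, 2, 1, 4), precision(3, 2), recall(3, 1))  # 0.7 0.6 0.75
```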
1.2 F measure & G measure
1.2.1 F measure
The traditional F measure (balanced F score, the \(F_1\) score) is the harmonic mean (a particular kind of mean in mathematics) of precision and recall:
\[F_1 = 2\cdot\frac{\text{precision}\cdot\text{recall}}{\text{precision}+\text{recall}}\]
where:
- an F score of 0 is the worst case: precision or recall (or both) is close to 0, and the model is poor;
- an F score of 1 is the best case: the closer both precision and recall are to 1, the better the model.
ps: the F1 score is also known as the Sørensen–Dice coefficient, or Dice similarity coefficient (DSC).
The expression above can be written in a more general form:
\[F_\beta = (1+\beta^2)\cdot\frac{\text{precision}\cdot\text{recall}}{\beta^2\cdot\text{precision}+\text{recall}}\]
where \(F_2\) and \(F_{0.5}\) are, besides \(F_1\), the two most commonly used F measures:
- with \(\beta=2\), recall has a larger influence than precision;
- with \(\beta=0.5\), precision has a larger influence than recall (see the sketch after this list).
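A minimal sketch of \(F_\beta\), evaluated at the three common \(\beta\) values (the precision/recall values are the toy numbers from section 1.1):

```python
# Minimal sketch of the generalized F measure F_beta.
def f_beta(precision, recall, beta=1.0):
    if precision == 0 and recall == 0:
        return 0.0                                    # worst case, F score of 0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

p, r = 0.6, 0.75
print(f_beta(p, r, beta=1.0))  # F1   ~0.667, the harmonic mean of p and r
print(f_beta(p, r, beta=2.0))  # F2   ~0.714, pulled towards recall (0.75)
print(f_beta(p, r, beta=0.5))  # F0.5 ~0.625, pulled towards precision (0.6)
```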
If the F measure is written in terms of the type I error (FP) and type II error (FN) of Figure 0.1, it becomes:
\[F_\beta = \frac{(1+\beta^2)\cdot TP}{(1+\beta^2)\cdot TP + \beta^2\cdot FN + FP}\]
1.2.2 G measure
Whereas the F measure is a harmonic mean, the G measure is a geometric mean of precision and recall, also known as the Fowlkes–Mallows index:
\[G = \sqrt{\text{precision}\cdot\text{recall}}\]
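A minimal sketch of the G measure, again on the toy precision/recall values; for comparison, the harmonic-mean \(F_1\) on the same values is about 0.667:

```python
# Minimal sketch of the G measure (geometric mean of precision and recall).
import math

def g_measure(precision, recall):
    return math.sqrt(precision * recall)

print(g_measure(0.6, 0.75))  # ~0.671, slightly above F1 (~0.667) on the same values
```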
1.3 PR Curve
For binary classification, the curve obtained by putting Recall on the x-axis and Precision on the y-axis; see Figure 2.1.1.
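A minimal sketch of plotting a PR curve, assuming scikit-learn and matplotlib are available; y_score stands for a classifier's predicted scores on toy data:

```python
# Minimal sketch: PR curve from predicted scores (toy data).
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import precision_recall_curve

y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.55, 0.9])

precision, recall, _ = precision_recall_curve(y_true, y_score)
plt.plot(recall, precision)  # Recall on the x-axis, Precision on the y-axis
plt.xlabel("Recall")
plt.ylabel("Precision")
plt.title("PR curve")
plt.show()
```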
1.4 Cost Curve
1.5 ROC
For binary classification, the curve obtained by putting FPR (Figure 0.1) on the x-axis and TPR (Figure 0.1, i.e. Recall) on the y-axis; see Figure 2.1.1.
AUC: Area under curve, i.e. the area under the (ROC) curve.
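A minimal sketch of plotting a ROC curve and computing its AUC, again assuming scikit-learn and matplotlib and reusing the toy scores from the PR sketch:

```python
# Minimal sketch: ROC curve and AUC from predicted scores (toy data).
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc

y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.55, 0.9])

fpr, tpr, _ = roc_curve(y_true, y_score)  # FPR on the x-axis, TPR (Recall) on the y-axis
roc_auc = auc(fpr, tpr)                   # area under the ROC curve
plt.plot(fpr, tpr, label=f"AUC = {roc_auc:.2f}")
plt.xlabel("FPR")
plt.ylabel("TPR")
plt.legend()
plt.show()
```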
2. Relationships between the metrics
As early as 1998, Provost et al. argued that simply using accuracy to evaluate algorithms is not enough, because accuracy can be high while the algorithm is in fact relatively poor; they recommended using ROC to evaluate algorithms instead.
2.1 Relationship between PRC and ROC
When the class sizes differ greatly, the ROC curve does not describe an algorithm's performance well. Suppose the negative class in a binary problem is much larger than the positive class: even a large change in FP (Figure 0.1) barely shows up in FPR, the ROC's x-axis. Precision, however, compares FP against TP rather than against TN, so it reacts strongly to large changes in FP and can capture how performance is affected when negatives far outnumber positives. For this reason Jesse Davis (following earlier work) used the PRC instead of the ROC to describe algorithm performance. An important difference between the two curves is how they look visually, as shown in Figure 2.1.1.
Figure 2.1.1: PR and ROC curves
Figure 2.1.1 shows the ROC and PRC obtained with the same two algorithms on the same highly imbalanced dataset. In a ROC plot, an algorithm is better the closer its curve bends towards the upper-left corner (tracing lower-left → upper-left → upper-right); in a PRC plot, it is better the closer its curve bends towards the upper-right corner (tracing upper-left → upper-right → lower-right). The PRC gives a much clearer impression that algorithm 2 is better than algorithm 1, whereas in the ROC, although algorithm 2 has the larger AUC, the overall impression is that both algorithms are already good and differ only slightly. The PRC therefore not only magnifies the difference between the two algorithms, it also shows that both still have plenty of room for improvement.
For any dataset (i.e. a fixed number of positive and negative samples) and a given algorithm, the PRC and ROC contain the same set of points, so the two curves are equivalent in that sense; in fact, Davis and Goadrich show that one algorithm dominates another in ROC space if and only if it also dominates it in PR space.
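To make the sensitivity argument above concrete, here is a toy calculation (all numbers made up) with 100 positives and 10,000 negatives: growing FP tenfold barely moves FPR but collapses precision.

```python
# Toy illustration: FPR vs precision sensitivity under heavy class imbalance.
pos, neg, tp = 100, 10_000, 80

for fp in (100, 1_000):
    fpr = fp / neg              # FP compared against all actual negatives
    precision = tp / (tp + fp)  # FP compared against TP, not TN
    print(f"FP={fp:5d}  FPR={fpr:.2f}  precision={precision:.2f}")
# FP=  100  FPR=0.01  precision=0.44
# FP= 1000  FPR=0.10  precision=0.07
```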
2.2 Relationship between ROC and CC (cost curves)
If the class sizes are heavily skewed, the ROC curve may be overly optimistic about an algorithm's performance. Drummond et al. recommend using cost curves (CC) instead of ROC for algorithm evaluation.
2.3 Discussion of AUC
References:
- [ROC plotting] introduction-to-auc-and-roc
- [F1] wiki.F1_score
- [ROC] J. A. Hanley and B. J. McNeil. The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology, 1982.
- [ROC] Hanley J A, McNeil B J. A method of comparing the areas under receiver operating characteristic curves derived from the same cases[J]. Radiology, 1983, 148(3): 839-843.
- [ROC] McNeil B J, Hanley J A. Statistical approaches to the analysis of receiver operating characteristic (ROC) curves[J]. Medical decision making, 1984, 4(2): 137-150.
- [ROC] DeLong E R, DeLong D M, Clarke-Pearson D L. Comparing the areas under two or more correlated receiver operating characteristic curves: a nonparametric approach[J]. Biometrics, 1988: 837-845.
- [ROC] Bradley, A. (1997). The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recognition, 30, 1145–1159.
- [ACC&ROC] Provost, F., Fawcett, T., & Kohavi, R. (1998). The case against accuracy estimation for comparing induction algorithms. Proceedings of the 15th International Conference on Machine Learning (pp. 445–453). Morgan Kaufmann, San Francisco, CA.
- [ROC&CC] Chris Drummond and Robert C. Holte, ‘Explicitly representing expected cost: An alternative to roc representation’, in Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 198–207, (2000).
- [ROC] wiki.Receiver_operating_characteristic
- [AUC] Cortes, C., & Mohri, M. (2003). AUC optimization vs. error rate minimization. Neural Information Processing Systems 15 (NIPS). MIT Press
- [ROC&CC] Drummond, C., & Holte, R. C. (2004). What ROC curves can't do (and cost curves can). ROCAI (pp. 19–26).
- [ROC] Zhang, Jun; Mueller, Shane T. (2005). "A note on ROC analysis and non-parametric estimate of sensitivity". Psychometrika. 70: 203–212.
- [ROC] Fan J, Upadhye S, Worster A. Understanding receiver operating characteristic (ROC) curves[J]. Canadian Journal of Emergency Medicine, 2006, 8(1): 19-20.
- [ROC] Fawcett, Tom (2006). An Introduction to ROC Analysis. Pattern Recognition Letters. 27 (8): 861–874.
- [PR&ROC] The Relationship Between Precision-Recall and ROC Curves, Jesse Davis and Mark Goadrich, ICML 2006.
- [ROC] Brown C D, Davis H T. Receiver operating characteristics curves and related decision measures: A tutorial[J]. Chemometrics and Intelligent Laboratory Systems, 2006, 80(1): 24-38.
- [ROC] Weng C G, Poon J. A new evaluation measure for imbalanced datasets[C]//Proceedings of the 7th Australasian Data Mining Conference-Volume 87. Australian Computer Society, Inc., 2008: 27-32.
- [ROC] Powers, David M W (2011). Evaluation: From Precision, Recall and F-Measure to ROC, Informedness, Markedness & Correlation . Journal of Machine Learning Technologies. 2 (1): 37–63.
- [ROC] Flach P A. ROC analysis[M]//Encyclopedia of machine learning. Springer US, 2011: 869-875.
- [ROC] Hernandez-Orallo, J. (2013). "ROC curves for regression". Pattern Recognition. 46 (12): 3395–3411 .
- [ROC] Using the Receiver Operating Characteristic (ROC) curve to analyze a classification model: A final note of historical interest. Department of Mathematics, University of Utah. Retrieved May 25, 2017.
- [CC] Drummond C, Holte R C. Cost curves: An improved method for visualizing classifier performance[J]. Machine learning, 2006, 65(1): 95-130.