Python vs. Machine Learning (Python大戰機器學習): Model Evaluation, Selection, and Validation


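1. Loss functions and risk functions

(1) Loss function: common choices are the 0-1 loss, the absolute loss, the squared loss, and the logarithmic loss.

(2) Risk function: the expectation of the loss function. Empirical risk: the average loss of the model over the data set T.

  By the law of large numbers, the empirical risk converges to the risk function as the sample size N tends to ∞.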
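As a minimal sketch of the definitions above, the four losses and the empirical risk can be written in a few lines of plain Python. The function names (`loss_zero_one`, `empirical_risk`, and so on) and the toy data are illustrative choices, not part of any library.

```python
import math

def loss_zero_one(y, y_hat):
    """0-1 loss: 1 for a wrong prediction, 0 for a correct one."""
    return 0.0 if y == y_hat else 1.0

def loss_absolute(y, y_hat):
    """Absolute loss |y - y_hat|."""
    return abs(y - y_hat)

def loss_squared(y, y_hat):
    """Squared loss (y - y_hat)^2."""
    return (y - y_hat) ** 2

def loss_log(p_true_class):
    """Log loss -log P(y|x), given the probability assigned to the true label."""
    return -math.log(p_true_class)

def empirical_risk(loss, targets, predictions):
    """Average loss over the N samples of the data set T."""
    return sum(loss(y, y_hat) for y, y_hat in zip(targets, predictions)) / len(targets)

# Classification: 6 of the 10 predictions are wrong.
y_true = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [0, 0, 0, 1, 1, 1, 1, 1, 0, 0]
risk_01 = empirical_risk(loss_zero_one, y_true, y_pred)      # 0.6

# Regression (made-up numbers).
r_true = [3.0, -0.5, 2.0, 7.0]
r_pred = [2.5, 0.0, 2.0, 8.0]
risk_abs = empirical_risk(loss_absolute, r_true, r_pred)     # 0.5
risk_sq = empirical_risk(loss_squared, r_true, r_pred)       # 0.375

# Log loss: probability the model gave to the true class of each sample.
p_true = [0.9, 0.8, 0.7, 0.7, 0.8, 0.9]
risk_log = sum(loss_log(p) for p in p_true) / len(p_true)

print(risk_01, risk_abs, risk_sq, round(risk_log, 4))
```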

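2. Model evaluation methods

(1) Training error and test error

  Training error: the average loss of the model over the training set.

  Test error: the average loss of the model over the test set. It reflects how well the learning method predicts unseen test data.

(2) Generalization error: measures how well the learned model predicts unseen data; the smaller it is, the more effective the model. It is defined as the expected risk of the learned model.

(3) Overfitting: the model predicts known data well but unseen data poorly. The cause is that peculiarities of the training samples are mistaken for general properties of all potential samples, which degrades generalization. A common remedy is regularization, which implements the structural risk minimization strategy.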
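To make overfitting and regularization concrete, here is a small NumPy sketch. It assumes a made-up toy problem (noisy samples of sin(2πx)), a degree-9 polynomial model, and an L2 (ridge) penalty as the regularizer; none of these choices come from the text above. Structural risk is taken here as mean squared error plus `lam * ||w||^2`.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable

def make_data(n):
    """Noisy samples of sin(2*pi*x) on [0, 1] (a made-up toy problem)."""
    x = rng.uniform(0.0, 1.0, size=n)
    y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=n)
    return x, y

def features(x, degree):
    """Polynomial features 1, x, x^2, ..., x^degree."""
    return np.vander(x, degree + 1, increasing=True)

def fit(x, y, degree, lam):
    """Minimise mean squared error + lam * ||w||^2; lam = 0 is plain least squares."""
    A = features(x, degree)
    if lam == 0:
        w, *_ = np.linalg.lstsq(A, y, rcond=None)
        return w
    n, d = A.shape
    # Setting the gradient to zero gives (A^T A + n * lam * I) w = A^T y.
    return np.linalg.solve(A.T @ A + lam * n * np.eye(d), A.T @ y)

def mse(w, x, y):
    """Average squared loss of the polynomial with weights w."""
    return float(np.mean((features(x, len(w) - 1) @ w - y) ** 2))

x_train, y_train = make_data(10)
x_test, y_test = make_data(200)

results = {}
for name, lam in [("no regularization", 0.0), ("ridge, lam=1e-3", 1e-3)]:
    w = fit(x_train, y_train, degree=9, lam=lam)
    results[name] = (mse(w, x_train, y_train), mse(w, x_test, y_test), float(np.linalg.norm(w)))
    print(name, "train=%.4f test=%.4f |w|=%.1f" % results[name])
```

The unregularized fit passes through all ten training points, so its training error is essentially zero, and its test error is typically far larger. The penalty trades a slightly higher training error for much smaller weights.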

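3. Model evaluation

(1) Hold-out: split the data directly into three mutually exclusive parts, train the model on the training set, select the model on the validation set, and use the error on the test set as the estimate of the generalization error.

(2) Cross-validation (S-fold cross-validation): randomly split the data into S disjoint subsets of equal size, train the model on S-1 of them, and test it on the remaining one. Repeat this for each of the S possible choices and take the mean of the test errors.

(3) Leave-one-out: hold out a single sample as the test set (S-fold cross-validation with S = N). Its drawback is that the computation becomes too expensive when the data set is large.

(4) Bootstrap: randomly draw one sample from T, put a copy into the sampling set TS, and return the sample to T. After N such random draws, TS contains N samples. TS is used as the training set and T-TS as the test set.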
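The bootstrap procedure above can be sketched with NumPy as follows. `bootstrap_split` is a hypothetical helper name; it works on sample indices rather than on the samples themselves. Because each draw misses a given sample with probability 1 - 1/N, roughly (1 - 1/N)^N ≈ 1/e ≈ 36.8% of T ends up in T-TS when N is large.

```python
import numpy as np

def bootstrap_split(n_samples, rng):
    """Draw N indices with replacement (TS); indices never drawn form T - TS."""
    train_idx = rng.integers(0, n_samples, size=n_samples)
    test_idx = np.setdiff1d(np.arange(n_samples), train_idx)
    return train_idx, test_idx

rng = np.random.default_rng(0)
n = 10_000
train_idx, test_idx = bootstrap_split(n, rng)

oob_fraction = len(test_idx) / n  # share of T that was never sampled
print("training samples:", len(train_idx))
print("left-out fraction: %.3f" % oob_fraction)
```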

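4. Performance metrics

(1) Test accuracy and test error rate

(2) Confusion matrix

  Precision: P=TP/(TP+FP), the proportion of true positives among all samples predicted as positive.

  Recall: R=TP/(TP+FN), the proportion of actual positives that the classifier finds.

  Different problems emphasize different criteria. A recommender system cares more about precision (the share of recommended items the user is actually interested in); a medical diagnosis system cares more about recall (the share of diseases that are detected).

  F1 is the harmonic mean of P and R: 2/F1=1/P+1/R, i.e. F1=2PR/(P+R).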
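The formulas above can be checked by hand with a short plain-Python sketch. `confusion_counts` and `precision_recall_f1` are illustrative helper names, and returning 0.0 when a denominator is zero is a convention assumed here, not something the text specifies.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (TP, FP, FN, TN) for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    return tp, fp, fn, tn

def precision_recall_f1(tp, fp, fn):
    """P = TP/(TP+FP), R = TP/(TP+FN), F1 = 2PR/(P+R); 0.0 if a denominator is zero."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

y_true = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [0, 0, 1, 1, 0, 0, 0, 0, 0, 0]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)        # (2, 0, 3, 5)
precision, recall, f1 = precision_recall_f1(tp, fp, fn)  # P = 1.0, R = 0.4
print(tp, fp, fn, tn)
print("P=%.2f R=%.2f F1=%.2f" % (precision, recall, f1))  # P=1.00 R=0.40 F1=0.57
```

This classifier never raises a false alarm (P = 1.0) but finds only two of the five positives (R = 0.4), which is the trade-off the recommender and diagnosis examples describe.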

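5. ROC curve

  True positive rate: TPR=TP/(TP+FN)

  False positive rate: FPR=FP/(TN+FP), the proportion of negative instances that the classifier wrongly labels as positive, among all negative instances.

  Plotting the true positive rate on the vertical axis against the false positive rate on the horizontal axis gives the ROC curve. In the ROC plot, the diagonal corresponds to a random-guessing model and the point (0,1) corresponds to the ideal model. In general, the closer the ROC curve is to the point (0,1), the better.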
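One way to see where the curve comes from is to sweep the decision threshold over the classifier's scores and record an (FPR, TPR) point at each step. The sketch below does this with NumPy. `roc_points` and `auc` are illustrative helpers written for this example (scikit-learn's `roc_curve` and `roc_auc_score` are the library equivalents), and the scores are made-up toy values.

```python
import numpy as np

def roc_points(y_true, scores):
    """Sweep the threshold from high to low and collect (FPR, TPR) pairs."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores)
    n_pos = np.sum(y_true == 1)
    n_neg = np.sum(y_true == 0)
    fpr, tpr = [0.0], [0.0]  # threshold above every score: nothing predicted positive
    for thr in np.sort(np.unique(scores))[::-1]:
        pred_pos = scores >= thr
        tpr.append(np.sum(pred_pos & (y_true == 1)) / n_pos)  # TP / (TP + FN)
        fpr.append(np.sum(pred_pos & (y_true == 0)) / n_neg)  # FP / (TN + FP)
    return np.array(fpr), np.array(tpr)

def auc(fpr, tpr):
    """Area under the curve by the trapezoidal rule."""
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2))

y_true = [0, 0, 1, 1]

# An imperfect ranking: one negative scores higher than one positive.
fpr, tpr = roc_points(y_true, [0.1, 0.4, 0.35, 0.8])
area = auc(fpr, tpr)            # 0.75

# A perfect ranking: the curve passes through the ideal point (0, 1).
fpr_best, tpr_best = roc_points(y_true, [0.1, 0.2, 0.8, 0.9])
area_best = auc(fpr_best, tpr_best)  # 1.0

print(list(zip(fpr.tolist(), tpr.tolist())), area)
print(list(zip(fpr_best.tolist(), tpr_best.tolist())), area_best)
```

A random-guessing model would trace the diagonal, which has area 0.5.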

6. Bias-variance decomposition
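The post gives no detail for this item. As a reminder only, the textbook form of the decomposition for squared loss is given below. The notation is an assumption made here: $f(x;D)$ is the model trained on data set $D$, $\bar f(x)=\mathbb{E}_D[f(x;D)]$ is its average prediction, $y$ is the true label, and $y_D$ is the observed label, whose noise is assumed to have zero mean and to be independent of the trained model.

```latex
\mathbb{E}_D\!\left[\bigl(f(x;D)-y_D\bigr)^2\right]
  = \underbrace{\bigl(\bar f(x)-y\bigr)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}_D\!\left[\bigl(f(x;D)-\bar f(x)\bigr)^2\right]}_{\text{variance}}
  + \underbrace{\mathbb{E}_D\!\left[\bigl(y_D-y\bigr)^2\right]}_{\text{noise}}
```

Bias measures how far the average prediction is from the truth, variance measures how much the prediction changes with the training set, and noise is the floor that no model can go below.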

The code is as follows:

from sklearn.metrics import zero_one_loss,log_loss
from sklearn.model_selection import train_test_split,KFold,StratifiedKFold,LeaveOneOut,cross_val_score
from sklearn.datasets import load_digits,load_iris
from sklearn.svm import LinearSVC,SVC
from sklearn.metrics import accuracy_score,precision_score,recall_score,f1_score,fbeta_score,classification_report
from sklearn.metrics import confusion_matrix,precision_recall_curve,roc_curve,roc_auc_score
from sklearn.metrics import mean_absolute_error,mean_squared_error
from sklearn.multiclass import OneVsRestClassifier
from sklearn.model_selection import validation_curve,learning_curve,GridSearchCV,RandomizedSearchCV
import matplotlib.pyplot as plt
from sklearn.preprocessing import label_binarize
from sklearn.linear_model import LogisticRegression
import numpy as np
# zero_one_loss: fraction (normalize=True) or count (normalize=False) of misclassified samples
y_true=[1,1,1,1,1,0,0,0,0,0]
y_pred=[0,0,0,1,1,1,1,1,0,0]
print("zero_one_loss<fraction>:",zero_one_loss(y_true,y_pred,normalize=True))
print("zero_one_loss<num>:",zero_one_loss(y_true,y_pred,normalize=False))

# log_loss: each row of y_pred holds the predicted probabilities for classes 0 and 1
y_true=[1,1,1,0,0,0]
y_pred=[[0.1,0.9],
        [0.2,0.8],
        [0.3,0.7],
        [0.7,0.3],
        [0.8,0.2],
        [0.9,0.1]
        ]
print("log_loss<average>:",log_loss(y_true,y_pred,normalize=True))
print("log_loss<total>:",log_loss(y_true,y_pred,normalize=False))

# train_test_split: plain random split, then a split stratified on the labels
X=[
    [1,2,3,4],
    [11,12,13,14],
    [21,22,23,24],
    [31,32,33,34],
    [41,42,43,44],
    [51,52,53,54],
    [61,62,63,64],
    [71,72,73,74]
]
Y=[1,1,0,0,1,1,0,0]
X_train,X_test,Y_train,Y_test=train_test_split(X,Y,test_size=0.4,random_state=0)
print("X_train=",X_train)
print("X_test=",X_test)
print("Y_train=",Y_train)
print("Y_test=",Y_test)
# stratify=Y keeps the class proportions the same in the training and test sets
X_train,X_test,Y_train,Y_test=train_test_split(X,Y,test_size=0.4,random_state=0,stratify=Y)
print("X_train=",X_train)
print("X_test=",X_test)
print("Y_train=",Y_train)
print("Y_test=",Y_test)

# KFold
X=np.array([
    [1,2,3,4],
    [11,12,13,14],
    [21,22,23,24],
    [31,32,33,34],
    [41,42,43,44],
    [51,52,53,54],
    [61,62,63,64],
    [71,72,73,74],
    [81,82,83,84]
])
Y=np.array([1,1,0,0,1,1,0,0,1])

# random_state only has an effect when shuffle=True; recent scikit-learn versions
# raise an error if it is set together with shuffle=False
folder=KFold(n_splits=3,shuffle=False)
for train_index,test_index in folder.split(X,Y):
    print("Train Index:",train_index)
    print("Test Index:",test_index)
    print("X_train:",X[train_index])
    print("X_test:",X[test_index])
    print("")

shuffle_folder=KFold(n_splits=3,random_state=0,shuffle=True)
for train_index,test_index in shuffle_folder.split(X,Y):
    print("Train Index:",train_index)
    print("Test Index:",test_index)
    print("X_train:",X[train_index])
    print("X_test:",X[test_index])
    print("")

# StratifiedKFold: usage mirrors KFold above, so the loop is not repeated
stratified_folder=StratifiedKFold(n_splits=4,shuffle=False)

# LeaveOneOut: takes no constructor arguments; iterate with loo.split(X)
loo=LeaveOneOut()

# cross_val_score
digits=load_digits()
X=digits.data
Y=digits.target

result=cross_val_score(LinearSVC(),X,Y,cv=10)
print("Cross Val Score is:",result)

# accuracy_score (usage only); normalize=False returns the number of correct predictions
# accuracy_score(y_true,y_pred,normalize=True)

# precision_score (usage only)
# precision_score(y_true,y_pred)

# recall_score (usage only)
# recall_score(y_true,y_pred)

# f1_score (usage only)
# f1_score(y_true,y_pred)

# fbeta_score (usage only); num_beta is a placeholder for the chosen beta
# fbeta_score(y_true,y_pred,beta=num_beta)

# classification_report
y_true=[1,1,1,1,1,0,0,0,0,0]
y_pred=[0,0,1,1,0,0,0,0,0,0]
print("Classification Report:\n",classification_report(y_true,y_pred,target_names=["class_0","class_1"]))

# confusion_matrix (usage only)
# confusion_matrix(y_true,y_pred,labels=[0,1])

# precision_recall_curve: one P-R curve per iris class (one-vs-rest)
iris=load_iris()
X=iris.data
Y=iris.target
# print(X,'\n',Y)
Y=label_binarize(Y,classes=[0,1,2])
n_classes=Y.shape[1]
# print(n_classes,'\n',Y)
np.random.seed(0)
n_samples,n_features=X.shape
# print(n_samples,'\n',n_features)
# append 200*n_features columns of random noise to make the problem harder
X=np.c_[X,np.random.randn(n_samples,200*n_features)]
x_train,x_test,y_train,y_test=train_test_split(X,Y,test_size=0.5,random_state=0)
clf=OneVsRestClassifier(SVC(kernel='linear',probability=True,random_state=0))
y_score=clf.fit(x_train,y_train).decision_function(x_test)
# print(y_score)
fig=plt.figure()
ax=fig.add_subplot(1,1,1)
precision=dict()
recall=dict()
for i in range(n_classes):
    precision[i],recall[i],_=precision_recall_curve(y_test[:,i],y_score[:,i])
    ax.plot(recall[i],precision[i],label="target=%s"%i)
ax.set_xlabel("Recall Score")
ax.set_ylabel("Precision Score")
ax.set_title("P-R")
ax.legend(loc="best")
ax.set_xlim(0,1.1)
ax.set_ylim(0,1.1)
ax.grid()
plt.show()

# roc_curve, roc_auc_score (usage only)
# roc_curve(y_test,y_score)
# roc_auc_score(y_test,y_score)

# mean_absolute_error (usage only)
# mean_absolute_error(y_true,y_pred)

# mean_squared_error (usage only)
# mean_squared_error(y_true,y_pred)

# validation_curve (usage only)
# validation_curve(LinearSVC(),X,Y,param_name="C",param_range=np.logspace(-2,2),cv=10,scoring="accuracy")

# learning_curve (usage only)
# train_size=np.linspace(0.1,1.0,endpoint=True,dtype='float')
# learning_curve(LinearSVC(),X,Y,cv=10,scoring="accuracy",train_sizes=train_size)

# GridSearchCV
digits=load_digits()
x_train,x_test,y_train,y_test=train_test_split(digits.data,digits.target,test_size=0.25,random_state=0,stratify=digits.target)
# note: recent scikit-learn releases deprecate LogisticRegression's multi_class parameter;
# drop it from the grids below if your version warns about it or rejects it
tuned_parameters=[{'penalty':['l1','l2'],'C':[0.01,0.05,0.1,0.5,1,5,10,50,100],'solver':['liblinear'],'multi_class':['ovr']},
                  {'penalty':['l2'],'C':[0.01,0.05,0.1,0.5,1,5,10,50,100],'solver':['lbfgs'],'multi_class':['ovr','multinomial']},
                  ]
clf=GridSearchCV(LogisticRegression(tol=1e-6),tuned_parameters,cv=10)
clf.fit(x_train,y_train)
print("Best parameters set found:",clf.best_params_)
print("Grid scores:")
# cv_results_ replaces the grid_scores_ attribute of older scikit-learn versions
results=clf.cv_results_
for params,mean_score,std_score in zip(results["params"],results["mean_test_score"],results["std_test_score"]):
    print("\t%0.3f(+/-%0.03f) for %s"%(mean_score,std_score*2,params))

print("Optimized Score:",clf.score(x_test,y_test))
print("Detailed classification report:")
y_true,y_pred=y_test,clf.predict(x_test)
print(classification_report(y_true,y_pred))

# RandomizedSearchCV (usage only)
# RandomizedSearchCV(LogisticRegression(penalty='l2',solver='lbfgs',tol=1e-6),tuned_parameters,cv=10,scoring='accuracy',n_iter=100)