2 Model Evaluation and Selection
2.1 Evaluation Methods
2.1.1 Training Set and Test Set
Example 1: The Iris dataset
The Iris dataset is a classic dataset. It contains 150 records in 3 classes, 50 records per class. Each record has 4 features: sepal length, sepal width, petal length, and petal width. From these 4 features we can predict which of the three species a flower belongs to: iris-setosa, iris-versicolour, or iris-virginica.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

filename = 'iris.csv'
names = ['sepal length', 'sepal width', 'petal length', 'petal width', 'species']
data = pd.read_csv(filename, names=names)
array = data.values
X = array[:, :-1]
Y = array[:, -1]
test_size = 0.3
seed = 4
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=test_size, random_state=seed)
model = LogisticRegression(max_iter=200)  # raise max_iter so the lbfgs solver converges
model.fit(X_train, Y_train)
result = model.score(X_test, Y_test)
print('Evaluation result: %.2f' % (result * 100))
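A single random split can leave the classes unevenly represented. A minimal sketch (using `load_iris` from `sklearn.datasets` instead of the CSV above, so it is self-contained) of passing `stratify=Y` to `train_test_split`, which preserves the 50/50/50 class balance in both splits:

```python
from collections import Counter
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, Y = load_iris(return_X_y=True)
# stratify=Y keeps the three classes in equal proportion in train and test
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.3, random_state=4, stratify=Y)
print(sorted(Counter(Y_train).items()))  # 35 samples per class
print(sorted(Counter(Y_test).items()))   # 15 samples per class
```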
2.1.2 K-Fold Cross-Validation
import pandas as pd
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

filename = 'iris.csv'
names = ['sepal length', 'sepal width', 'petal length', 'petal width', 'species']
data = pd.read_csv(filename, names=names)
array = data.values
X = array[:, :-1]
Y = array[:, -1]
seed = 7
num_folds = 10
# shuffle=True is required when passing random_state to KFold
kfold = KFold(n_splits=num_folds, shuffle=True, random_state=seed)
model = LogisticRegression(max_iter=200)
result = cross_val_score(model, X, Y, cv=kfold)
print(result)
print('Evaluation result: %.2f' % (result.mean() * 100))
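To see what the `KFold` object actually does, a small sketch on ten toy samples: each of the 5 folds holds out 2 samples for testing, and every sample appears in a test fold exactly once.

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(10).reshape(10, 1)  # 10 toy samples
kfold = KFold(n_splits=5, shuffle=True, random_state=7)
for i, (train_idx, test_idx) in enumerate(kfold.split(X)):
    print('fold %d: train=%s test=%s' % (i, train_idx, test_idx))
```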
2.1.3 Leave-One-Out Cross-Validation
import pandas as pd
from sklearn.model_selection import LeaveOneOut
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

filename = 'iris.csv'
names = ['sepal length', 'sepal width', 'petal length', 'petal width', 'species']
data = pd.read_csv(filename, names=names)
array = data.values
X = array[:, :-1]
Y = array[:, -1]
loocv = LeaveOneOut()  # no random seed needed: LOOCV is deterministic
model = LogisticRegression(max_iter=200)
result = cross_val_score(model, X, Y, cv=loocv)
print(result)
print('Evaluation result: %.2f' % (result.mean() * 100))
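Leave-one-out is simply k-fold with k equal to the number of samples: each fold tests on exactly one record and trains on all the rest, which is why it needs no random seed but can be expensive on large datasets. A sketch on six toy samples:

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut

X = np.arange(6).reshape(6, 1)  # 6 toy samples
loocv = LeaveOneOut()
print(loocv.get_n_splits(X))  # one split per sample: 6
for train_idx, test_idx in loocv.split(X):
    print('train=%s test=%s' % (train_idx, test_idx))
```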
2.2 Performance Metrics for Classification
Accuracy, confusion matrix, precision, recall, AUC, F1-score
Computing these metrics with sklearn.metrics:
from sklearn.metrics import accuracy_score, precision_score, recall_score, \
    f1_score, roc_auc_score, confusion_matrix

y_true = [1, 0, 0, 1]
y_predict = [1, 0, 1, 0]
y_score = [0.1, 0.4, 0.35, 0.8]
print(accuracy_score(y_true, y_predict))
print(precision_score(y_true, y_predict))
print(recall_score(y_true, y_predict))
print(f1_score(y_true, y_predict))
print(roc_auc_score(y_true, y_score))
print(confusion_matrix(y_true, y_predict))
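To make these numbers concrete, the same metrics can be computed by hand from the four counts TP, TN, FP, FN for the toy labels above, and checked against sklearn's functions:

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true    = [1, 0, 0, 1]
y_predict = [1, 0, 1, 0]

tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_predict))
tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_predict))
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_predict))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_predict))

accuracy  = (tp + tn) / len(y_true)
precision = tp / (tp + fp)
recall    = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

# agrees with sklearn on the same inputs
print(accuracy == accuracy_score(y_true, y_predict))    # True
print(precision == precision_score(y_true, y_predict))  # True
print(recall == recall_score(y_true, y_predict))        # True
print(f1 == f1_score(y_true, y_predict))                # True
```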
Example 2: The Pima Indians Diabetes dataset
The goal is to diagnostically predict whether a patient has diabetes based on the diagnostic measurements in the dataset (768 records in total). Each record has 8 attributes:
[1] Pregnancies: number of pregnancies
[2] Glucose: plasma glucose concentration
[3] BloodPressure: diastolic blood pressure (mm Hg)
[4] SkinThickness: skin fold thickness (mm)
[5] Insulin: 2-hour serum insulin (mu U/ml)
[6] BMI: body mass index (weight in kg / (height in m)^2)
[7] DiabetesPedigreeFunction: diabetes pedigree function
[8] Age: age (years)
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report

filename = 'pima_data.csv'
names = ['preg', 'plas', 'blood', 'skin', 'insulin', 'bmi', 'pedi', 'age', 'class']
data = pd.read_csv(filename, names=names)
array = data.values
X = array[:, :-1]
Y = array[:, -1]
test_size = 0.3
seed = 4
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=test_size, random_state=seed)
model = LogisticRegression(max_iter=1000)  # raise max_iter so the solver converges on this data
model.fit(X_train, Y_train)
y_predict = model.predict(X_test)
y_true = Y_test
print(confusion_matrix(y_true, y_predict))
print(classification_report(y_true, y_predict))
Output:
[[138  14]
 [ 30  49]]
             precision    recall  f1-score   support

        0.0       0.82      0.91      0.86       152
        1.0       0.78      0.62      0.69        79

avg / total       0.81      0.81      0.80       231
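The per-class numbers in this report come straight from the confusion matrix. Recomputing them by hand (rows of the matrix are true classes, columns are predicted classes):

```python
cm = [[138, 14],
      [30, 49]]
# precision: correct predictions of a class / all predictions of that class (column)
prec0 = cm[0][0] / (cm[0][0] + cm[1][0])   # 138/168
prec1 = cm[1][1] / (cm[1][1] + cm[0][1])   # 49/63
# recall: correct predictions of a class / all true members of that class (row)
rec0 = cm[0][0] / (cm[0][0] + cm[0][1])    # 138/152
rec1 = cm[1][1] / (cm[1][1] + cm[1][0])    # 49/79
print(round(prec0, 2), round(rec0, 2))  # 0.82 0.91
print(round(prec1, 2), round(rec1, 2))  # 0.78 0.62
```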
The same report for the (multi-class) Iris dataset:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report

filename = 'iris.csv'
names = ['sepal length', 'sepal width', 'petal length', 'petal width', 'species']
data = pd.read_csv(filename, names=names)
array = data.values
X = array[:, :-1]
Y = array[:, -1]
test_size = 0.3
seed = 4
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=test_size, random_state=seed)
model = LogisticRegression(max_iter=200)
model.fit(X_train, Y_train)
y_predict = model.predict(X_test)
y_true = Y_test
print(confusion_matrix(y_true, y_predict))
print(classification_report(y_true, y_predict))
Output:
[[21  0  0]
 [ 0  8  2]
 [ 0  1 13]]
                 precision    recall  f1-score   support

    Iris-setosa       1.00      1.00      1.00        21
Iris-versicolor       0.89      0.80      0.84        10
 Iris-virginica       0.87      0.93      0.90        14

    avg / total       0.93      0.93      0.93        45
2.3 Performance Metrics for Regression
Mean absolute error (MAE)
Mean squared error (MSE)
Coefficient of determination (R^2)
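For predictions $\hat{y}_i$ against true values $y_i$ over $n$ samples, these three metrics are defined as:

```latex
\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\lvert \hat{y}_i - y_i \rvert,\qquad
\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}(\hat{y}_i - y_i)^2,\qquad
R^2 = 1 - \frac{\sum_{i}(\hat{y}_i - y_i)^2}{\sum_{i}(y_i - \bar{y})^2}
```

RMSE is simply $\sqrt{\mathrm{MSE}}$, and $\bar{y}$ is the mean of the true values, so $R^2$ compares the model's squared error against that of always predicting the mean.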
import numpy as np

def mse_score(y_predict, y_true):
    mse = np.mean((y_predict - y_true) ** 2)
    return mse

def rmse_score(y_predict, y_true):
    rmse = np.sqrt(np.mean((y_predict - y_true) ** 2))
    return rmse

def mae_score(y_predict, y_true):
    mae = np.mean(np.abs(y_predict - y_true))
    return mae

def r2_score(y_predict, y_true):
    r2 = 1 - mse_score(y_predict, y_true) / np.var(y_true)
    return r2

y_true = np.array([0.7, 0.2, 1.8, 0.4, 1.4])
y_predict = np.array([0.7, -0.8, 3.8, 0.9, 2.9])
print(rmse_score(y_predict, y_true))
print(mse_score(y_predict, y_true))
print(mae_score(y_predict, y_true))
print(r2_score(y_predict, y_true))
Example 3: The Boston Housing dataset
The Boston housing dataset contains 506 records; each record consists of 13 numeric features of a house together with its target price. (Note: load_boston was removed in scikit-learn 1.2, so the listing below requires an older scikit-learn.)
import numpy as np
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

boston = load_boston()
X = boston['data']
Y = boston['target']
test_size = 0.3
seed = 4
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=test_size, random_state=seed)
model = LinearRegression()
model.fit(X_train, Y_train)
Y_predict = model.predict(X_test)
print(Y_predict - Y_test)  # residuals on the test set
mae = mean_absolute_error(Y_test, Y_predict)
rmse = np.sqrt(mean_squared_error(Y_test, Y_predict))  # square root of MSE gives RMSE
r2 = r2_score(Y_test, Y_predict)
print("MAE:%.2f" % mae)
print("RMSE:%.2f" % rmse)
print("R2:%.2f" % r2)
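The regression metrics can also be estimated with k-fold cross-validation rather than a single split. A sketch, using `load_diabetes` as a stand-in dataset since `load_boston` is no longer shipped with scikit-learn >= 1.2: loss metrics are exposed under negated names such as `'neg_mean_squared_error'` so that higher scores are always better, and must be negated back for reporting.

```python
import numpy as np
from sklearn.datasets import load_diabetes  # stand-in: load_boston was removed in scikit-learn 1.2
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

X, Y = load_diabetes(return_X_y=True)
kfold = KFold(n_splits=10, shuffle=True, random_state=7)
scores = cross_val_score(LinearRegression(), X, Y, cv=kfold,
                         scoring='neg_mean_squared_error')
# scores are negative MSE values, one per fold; negate the mean to report MSE
print('MSE: %.2f (+/- %.2f)' % (-scores.mean(), scores.std()))
```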
