'''
Cross-validation for classification:
    Splitting a dataset is a random process. If a random split happens to
    concentrate some unusual group of samples in one subset, the predictions
    of the resulting model become hard to trust. The remedy is to validate
    several times: divide all samples into n equal folds, train the model on
    a different set of training folds each time, and report the metric score
    on the corresponding held-out fold.

    sklearn provides an API for cross-validation:
        import sklearn.model_selection as ms
        ms.cross_val_score(model, inputs, outputs, cv=number_of_folds, scoring=metric_name) -> array of scores

    Cross-validation metrics:
        1. Accuracy (accuracy): correctly classified samples / total samples
        2. Precision (precision_weighted): for each class, correctly predicted samples / samples predicted as that class
        3. Recall (recall_weighted): for each class, correctly predicted samples / samples that actually belong to that class
        4. F1 score (f1_weighted): 2 x precision x recall / (precision + recall)

    In each round of cross-validation, precision, recall or F1 is computed for
    every class, and the per-class values are averaged (weighted by the number
    of true samples in each class for the *_weighted variants) to give the
    score for that round. The scores of all rounds are then returned to the
    caller as an array.
'''
import numpy as np
import matplotlib.pyplot as mp
import sklearn.naive_bayes as nb
import sklearn.model_selection as ms

data = np.loadtxt('./ml_data/multiple1.txt', delimiter=',', unpack=False, dtype='f8')
print(data.shape)
x = np.array(data[:, :-1])
y = np.array(data[:, -1])

# Split into training and test sets: train on the training set,
# then evaluate on the test set and plot the test samples
train_x, test_x, train_y, test_y = ms.train_test_split(x, y, test_size=0.25, random_state=7)

# Run 5-fold cross-validation on the training set; train the final model only if the scores look good
model = nb.GaussianNB()
# Accuracy
score = ms.cross_val_score(model, train_x, train_y, cv=5, scoring='accuracy')
print('accuracy score=', score)
print('accuracy mean=', score.mean())

# Precision
score = ms.cross_val_score(model, train_x, train_y, cv=5, scoring='precision_weighted')
print('precision_weighted score=', score)
print('precision_weighted mean=', score.mean())

# Recall
score = ms.cross_val_score(model, train_x, train_y, cv=5, scoring='recall_weighted')
print('recall_weighted score=', score)
print('recall_weighted mean=', score.mean())

# F1 score
score = ms.cross_val_score(model, train_x, train_y, cv=5, scoring='f1_weighted')
print('f1_weighted score=', score)
print('f1_weighted mean=', score.mean())

# Train the naive Bayes model and classify the test set
model.fit(train_x, train_y)
pred_test_y = model.predict(test_x)
# Compare the predictions with the true labels to compute the accuracy
# (correctly predicted samples / total test samples)
ac = (test_y == pred_test_y).sum() / test_y.size
print('prediction accuracy ac=', ac)

# Build a grid over the feature space to draw the classification boundary
l, r = x[:, 0].min() - 1, x[:, 0].max() + 1
b, t = x[:, 1].min() - 1, x[:, 1].max() + 1
n = 500
grid_x, grid_y = np.meshgrid(np.linspace(l, r, n), np.linspace(b, t, n))
bg_x = np.column_stack((grid_x.ravel(), grid_y.ravel()))
bg_y = model.predict(bg_x)
grid_z = bg_y.reshape(grid_x.shape)

# Plot
mp.figure('NB Classification', facecolor='lightgray')
mp.title('NB Classification', fontsize=16)
mp.xlabel('X', fontsize=14)
mp.ylabel('Y', fontsize=14)
mp.tick_params(labelsize=10)
mp.pcolormesh(grid_x, grid_y, grid_z, cmap='gray')
mp.scatter(test_x[:, 0], test_x[:, 1], s=80, c=test_y, cmap='jet', label='Samples')
mp.legend()
mp.show()

'''
Output:
(400, 3)
accuracy score= [1. 1. 1. 1. 0.98305085]
accuracy mean= 0.9966101694915255
precision_weighted score= [1. 1. 1. 1. 0.98411017]
precision_weighted mean= 0.996822033898305
recall_weighted score= [1. 1. 1. 1. 0.98305085]
recall_weighted mean= 0.9966101694915255
f1_weighted score= [1. 1. 1. 1. 0.98303199]
f1_weighted mean= 0.9966063988235516
prediction accuracy ac= 0.99
'''
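To make the per-fold scoring described in the notes more concrete, the sketch below repeats by hand roughly what `cross_val_score(..., cv=5, scoring='f1_weighted')` does for a classifier: split with stratified folds, fit a fresh copy of the model on each training part, compute F1 per class on the held-out part, and average the per-class values weighted by class size. It is only an illustration under stated assumptions: the data comes from `make_blobs` as a stand-in for `multiple1.txt`, and names such as `manual_scores` and `library_scores` are hypothetical, not part of the original script.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_blobs
from sklearn.metrics import precision_recall_fscore_support
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB

# Assumption: synthetic 3-class, 2-feature data stands in for multiple1.txt
X, y = make_blobs(n_samples=300, centers=3, cluster_std=2.5, random_state=7)

model = GaussianNB()
# For a classifier, an integer cv is treated as unshuffled stratified folds
folds = StratifiedKFold(n_splits=5)

manual_scores = []
for train_idx, test_idx in folds.split(X, y):
    # Fit a fresh, unfitted copy of the model on this round's training folds
    fold_model = clone(model).fit(X[train_idx], y[train_idx])
    pred = fold_model.predict(X[test_idx])
    # Per-class precision, recall, F1 and support (true samples per class)
    precision, recall, f1, support = precision_recall_fscore_support(y[test_idx], pred)
    # "weighted" average: per-class F1 weighted by each class's support
    manual_scores.append(np.sum(f1 * support) / np.sum(support))
manual_scores = np.array(manual_scores)

# The same quantity computed by the library in a single call
library_scores = cross_val_score(model, X, y, cv=5, scoring='f1_weighted')

print('manual  f1_weighted scores =', manual_scores)
print('library f1_weighted scores =', library_scores)
print('scores match:', np.allclose(manual_scores, library_scores))
```

Swapping `f1` for `precision` or `recall` in the weighted sum gives the hand-computed counterparts of `precision_weighted` and `recall_weighted`.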