Machine Learning Project in Practice: Credit Card Fraud Detection (Part 2)


6. Confusion matrix:

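A confusion matrix is laid out on two axes, each taking the values 0 and 1. The x-axis holds the predicted label and the y-axis holds the true label. Each cell counts the samples with that combination of true and predicted label, so the matrix shows where the predictions diverge from the truth, and the evaluation metrics for the current model can be computed from it.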
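As a minimal sketch of that layout, the snippet below counts the four cells by hand for a handful of made-up labels (the lists `y_true` and `y_pred` and the helper `confusion_counts` are illustrative, not from the credit card data or the original code). Rows index the true class and columns the predicted class, which is also the convention scikit-learn's `confusion_matrix` follows.

```python
import numpy as np


def confusion_counts(y_true, y_pred, n_classes=2):
    """Return a matrix where rows are true labels and columns are predicted labels."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm


# Made-up labels for illustration only
y_true = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred = [0, 1, 0, 1, 0, 1, 1, 0]

cm = confusion_counts(y_true, y_pred)
print(cm)
# cm[0, 0] = true negatives, cm[0, 1] = false positives
# cm[1, 0] = false negatives, cm[1, 1] = true positives
```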

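Here the accuracy is (136+138)/(136+13+9+138), about 92.6%. Recall was defined earlier as recall = TP/(TP+FN); in the matrix, TP is the cell where the true and predicted labels are both 1, and FN is the cell where the true label is 1 but the predicted label is 0.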
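The arithmetic can be checked in a few lines. The cell layout `[[136, 13], [9, 138]]` (rows = true label, columns = predicted label) is an assumption inferred from the order of the terms in the accuracy expression and from the recall of about 93% reported for this test set; only the four counts themselves come from the text.

```python
# Assumed layout: rows = true label, columns = predicted label
cm = [[136, 13],   # true 0: TN, FP
      [9, 138]]    # true 1: FN, TP

tn, fp = cm[0]
fn, tp = cm[1]

accuracy = (tn + tp) / (tn + fp + fn + tp)
recall = tp / (tp + fn)

print(f"accuracy = {accuracy:.4f}")  # accuracy = 0.9257
print(f"recall   = {recall:.4f}")    # recall   = 0.9388
```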

  

The function below plots a confusion matrix:

import itertools

import matplotlib.pyplot as plt
import numpy as np


def plot_confusion_matrix(cm,
                          classes,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    # Draw the confusion matrix as a heat map with the count written in each cell
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=0)
    plt.yticks(tick_marks, classes)

    # Use white text on dark cells and black text on light cells
    thresh = cm.max() / 2
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j,
                 i,
                 cm[i, j],
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.tight_layout()
    plt.ylabel("True label")
    plt.xlabel("Predicted label")

Next, using the best C value found above, plot the confusion matrix for the undersampled dataset.

import itertools

lr = LogisticRegression(C=best_c, penalty='l1', solver='liblinear')
lr.fit(X_train_undersample, y_train_undersample.values.ravel())
y_pred_undersample = lr.predict(X_test_undersample.values)

# Compute confusion matrix
cnf_matrix = confusion_matrix(y_test_undersample, y_pred_undersample)
np.set_printoptions(precision=2)

print("Recall metric in the testing dataset:",
      cnf_matrix[1, 1] / (cnf_matrix[1, 0] + cnf_matrix[1, 1]))

# Plot non-normalized confusion matrix
class_names = [0, 1]
plt.figure()
plot_confusion_matrix(cnf_matrix,
                      classes=class_names,
                      title='Confusion matrix')
plt.show()

  

Recall reaches about 93%. However, the test set used here is itself undersampled, so it covers only a small fraction of the available data.

Next, evaluate on the test set from the original split:

lr = LogisticRegression(C=best_c, penalty='l1', solver='liblinear')
lr.fit(X_train_undersample, y_train_undersample.values.ravel())
y_pred = lr.predict(X_test.values)

# Compute confusion matrix
cnf_matrix = confusion_matrix(y_test, y_pred)
np.set_printoptions(precision=2)

print("Recall metric in the testing dataset:",
      cnf_matrix[1, 1] / (cnf_matrix[1, 0] + cnf_matrix[1, 1]))

# Plot non-normalized confusion matrix
class_names = [0, 1]
plt.figure()
plot_confusion_matrix(cnf_matrix,
                      classes=class_names,
                      title="Confusion matrix")
plt.show()

  

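This time the test set contains more than 80,000 samples. The result is acceptable, but more than 10,000 samples are misclassified, which is rather a lot.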

Now let's build the model directly on the original dataset and see what happens to recall when the class distribution is imbalanced.

best_c = printing_Kfold_scores(X_train, y_train)
-------------------------------------------
C parameter: 0.01
-------------------------------------------

Iteration  0 : recall score =  0.4925373134328358
Iteration  1 : recall score =  0.6027397260273972
Iteration  2 : recall score =  0.6833333333333333
Iteration  3 : recall score =  0.5692307692307692
Iteration  4 : recall score =  0.45

Mean recall score  0.5595682284048672

-------------------------------------------
C parameter: 0.1
-------------------------------------------

Iteration  0 : recall score =  0.5671641791044776
Iteration  1 : recall score =  0.6164383561643836
Iteration  2 : recall score =  0.6833333333333333
Iteration  3 : recall score =  0.5846153846153846
Iteration  4 : recall score =  0.525

Mean recall score  0.5953102506435158

-------------------------------------------
C parameter: 1
-------------------------------------------

Iteration  0 : recall score =  0.5522388059701493
Iteration  1 : recall score =  0.6164383561643836
Iteration  2 : recall score =  0.7166666666666667
Iteration  3 : recall score =  0.6153846153846154
Iteration  4 : recall score =  0.5625

Mean recall score  0.612645688837163

-------------------------------------------
C parameter: 10
-------------------------------------------

Iteration  0 : recall score =  0.5522388059701493
Iteration  1 : recall score =  0.6164383561643836
Iteration  2 : recall score =  0.7333333333333333
Iteration  3 : recall score =  0.6153846153846154
Iteration  4 : recall score =  0.575

Mean recall score  0.6184790221704963

-------------------------------------------
C parameter: 100
-------------------------------------------

Iteration  0 : recall score =  0.5522388059701493
Iteration  1 : recall score =  0.6164383561643836
Iteration  2 : recall score =  0.7333333333333333
Iteration  3 : recall score =  0.6153846153846154
Iteration  4 : recall score =  0.575

Mean recall score  0.6184790221704963

*********************************************************************************
Best model to choose from cross validation is with C parameter 10.0
*********************************************************************************

As the output shows, recall stays at roughly 60%.
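The selection rule behind this output amounts to averaging the per-fold recall for each C and keeping the C with the highest mean. The sketch below reproduces that step from the fold scores printed above; it does not refit any model, and the name `fold_recalls` is introduced here for illustration. Breaking the tie between C = 10 and C = 100 in favor of the first maximum is an assumption that happens to agree with the printed result.

```python
from statistics import mean

# Per-fold recall scores copied from the cross-validation output above
fold_recalls = {
    0.01: [0.4925373134328358, 0.6027397260273972, 0.6833333333333333,
           0.5692307692307692, 0.45],
    0.1: [0.5671641791044776, 0.6164383561643836, 0.6833333333333333,
          0.5846153846153846, 0.525],
    1: [0.5522388059701493, 0.6164383561643836, 0.7166666666666667,
        0.6153846153846154, 0.5625],
    10: [0.5522388059701493, 0.6164383561643836, 0.7333333333333333,
         0.6153846153846154, 0.575],
    100: [0.5522388059701493, 0.6164383561643836, 0.7333333333333333,
          0.6153846153846154, 0.575],
}

mean_recall = {c: mean(scores) for c, scores in fold_recalls.items()}

# max() returns the first maximum it meets, so ties go to the smaller C
best_c = max(mean_recall, key=mean_recall.get)

for c, score in mean_recall.items():
    print(f"C = {c}: mean recall = {score:.4f}")
print("Best C:", best_c)
```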

Plot the confusion matrix:

lr = LogisticRegression(C=best_c, penalty='l1', solver='liblinear')
lr.fit(X_train, y_train.values.ravel())
# This model is trained on the full, imbalanced training set,
# so the predictions are named y_pred rather than y_pred_undersample
y_pred = lr.predict(X_test.values)

# Compute confusion matrix
cnf_matrix = confusion_matrix(y_test, y_pred)
np.set_printoptions(precision=2)

print("Recall metric in the testing dataset:",
      cnf_matrix[1, 1] / (cnf_matrix[1, 0] + cnf_matrix[1, 1]))

# Plot non-normalized confusion matrix
class_names = [0, 1]
plt.figure()
plot_confusion_matrix(cnf_matrix,
                      classes=class_names,
                      title='Confusion matrix')
plt.show()

  

As we can see, fitting a model directly on imbalanced data does not give good results.
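One reason imbalance is a problem is that a model can look accurate while missing every fraud. The toy numbers below are hypothetical (1,000 transactions, 2 of them fraudulent), not the real dataset; they only illustrate why this walkthrough tracks recall rather than accuracy alone.

```python
# Hypothetical toy data: 998 normal transactions (0) and 2 frauds (1)
y_true = [0] * 998 + [1] * 2

# A degenerate "model" that always predicts the majority class
y_pred = [0] * len(y_true)

tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)

accuracy = correct / len(y_true)
recall = tp / (tp + fn)

print(f"accuracy = {accuracy:.3f}, recall = {recall:.3f}")
# accuracy = 0.998, recall = 0.000
```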

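In the logistic regression model covered earlier, a sample is assigned to class 1 by default when its predicted probability exceeds 0.5. That raises a question: can we vary this threshold and find out which value gives the best final result?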
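As a small illustration of what classifying at 0.5 means, the sketch below applies the sigmoid to a few made-up linear scores and compares the labels produced by different cut-offs. The scores are invented; with a fitted scikit-learn model the probabilities would come from `predict_proba` instead.

```python
import numpy as np


def sigmoid(z):
    """Map a linear score to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))


# Made-up linear scores (w.x + b) for five samples
scores = np.array([-2.0, -0.5, 0.1, 0.8, 3.0])
proba = sigmoid(scores)

for threshold in (0.3, 0.5, 0.7):
    labels = (proba > threshold).astype(int)
    print(threshold, labels)
```

A cut-off of 0.5 on the probability is the same as a cut-off of 0 on the linear score; lowering the threshold labels more samples as class 1, and raising it labels fewer.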

lr = LogisticRegression(C=0.01, penalty='l1', solver='liblinear')
lr.fit(X_train_undersample, y_train_undersample.values.ravel())
# Predicted class probabilities; column 1 is the probability of class 1
y_pred_undersample_proba = lr.predict_proba(X_test_undersample.values)

thresholds = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]  # candidate thresholds
plt.figure(figsize=(10, 10))

# One subplot per threshold, laid out on a 3x3 grid
for j, threshold in enumerate(thresholds, start=1):
    y_test_predictions_high_recall = y_pred_undersample_proba[:, 1] > threshold
    plt.subplot(3, 3, j)

    # Compute confusion matrix
    cnf_matrix = confusion_matrix(y_test_undersample,
                                  y_test_predictions_high_recall)
    np.set_printoptions(precision=2)

    print("Recall metric in the testing dataset:",
          cnf_matrix[1, 1] / (cnf_matrix[1, 0] + cnf_matrix[1, 1]))

    # Plot non-normalized confusion matrix
    class_names = [0, 1]
    plot_confusion_matrix(cnf_matrix,
                          classes=class_names,
                          title='Threshold > %s' % threshold)

plt.show()

Recall metric in the testing dataset: 1.0
Recall metric in the testing dataset: 1.0
Recall metric in the testing dataset: 1.0
Recall metric in the testing dataset: 0.9795918367346939
Recall metric in the testing dataset: 0.9387755102040817
Recall metric in the testing dataset: 0.891156462585034
Recall metric in the testing dataset: 0.8367346938775511
Recall metric in the testing dataset: 0.7687074829931972
Recall metric in the testing dataset: 0.5850340136054422

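The plots show what the confusion matrix looks like at each threshold. Weighing accuracy, recall, and the number of misclassified samples together, thresholds of 0.5 and 0.6 give good results for this model.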
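To weigh recall against false alarms more systematically, one option is to tabulate both for each threshold. The helper below is a hypothetical sketch (the name `sweep_thresholds` and the toy arrays are not from the original code); with the variables used above it could presumably be called with `y_test_undersample.values.ravel()` and `y_pred_undersample_proba[:, 1]`.

```python
import numpy as np


def sweep_thresholds(y_true, proba, thresholds):
    """Return (threshold, recall, precision, false_positives) for each threshold."""
    y_true = np.asarray(y_true)
    proba = np.asarray(proba)
    rows = []
    for t in thresholds:
        y_hat = proba > t
        tp = int(np.sum(y_hat & (y_true == 1)))
        fp = int(np.sum(y_hat & (y_true == 0)))
        fn = int(np.sum(~y_hat & (y_true == 1)))
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        rows.append((t, recall, precision, fp))
    return rows


# Toy data for illustration only
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
proba = [0.05, 0.30, 0.55, 0.20, 0.45, 0.65, 0.85, 0.95]

results = sweep_thresholds(y_true, proba, [0.1, 0.5, 0.9])
for t, recall, precision, fp in results:
    print(f"threshold={t:.1f} recall={recall:.2f} "
          f"precision={precision:.2f} false positives={fp}")
```

On the toy data, the lowest threshold catches every positive at the cost of three false positives, while the highest threshold produces no false positives but misses most positives.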

7. Oversampling

Oversampling (the SMOTE algorithm):

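(1) For each sample x in the minority class, compute its Euclidean distance to every other sample in the minority class and find its k nearest neighbors.
(2) Choose a sampling rate N based on the class imbalance ratio. For each minority sample x, randomly pick several samples from its k nearest neighbors; call a chosen neighbor xn.
(3) For each randomly chosen neighbor xn, build a new sample from xn and the original sample x with the following formula, where rand(0, 1) is a random number between 0 and 1:

    x_new = x + rand(0, 1) * (xn - x)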
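The three steps can be sketched in plain NumPy as follows. This is a simplified illustration, not the imbalanced-learn implementation: the function name `smote_sketch`, its parameters, and the toy data are assumptions, and the brute-force distance computation would be too slow for the full dataset.

```python
import numpy as np


def smote_sketch(X_min, n_new_per_sample=1, k=5, seed=0):
    """Create synthetic minority samples by interpolating towards nearest neighbours."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    n = len(X_min)
    k = min(k, n - 1)

    # Step 1: pairwise Euclidean distances within the minority class
    diffs = X_min[:, None, :] - X_min[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    np.fill_diagonal(dists, np.inf)  # a sample is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]

    synthetic = []
    for i in range(n):
        # Step 2: randomly pick neighbours of sample i
        chosen = rng.choice(neighbours[i], size=n_new_per_sample, replace=True)
        for j in chosen:
            # Step 3: move a random fraction of the way from x towards xn
            gap = rng.random()
            synthetic.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.array(synthetic)


# Toy minority class: four points in 2-D
X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
X_new = smote_sketch(X_min, n_new_per_sample=2, k=2)
print(X_new.shape)  # (8, 2)
```

Every synthetic point lies on the segment between a minority sample and one of its neighbors, so the new samples stay inside the region the minority class already occupies.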

  

Import the required Python libraries

import pandas as pd
from imblearn.over_sampling import SMOTE  # pip install imbalanced-learn
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

Get the feature and label data

credit_cards = pd.read_csv('creditcard.csv')

columns = credit_cards.columns
# The labels are in the last column ('Class'). Simply remove it to obtain features columns
features_columns = columns.delete(len(columns) - 1)

features = credit_cards[features_columns]
labels = credit_cards['Class']

Split into training and test sets

features_train, features_test, labels_train, labels_test = train_test_split(
    features, labels, test_size=0.2, random_state=0)

Use SMOTE to build the oversampled dataset

oversampler = SMOTE(random_state=0)
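# "os" stands for oversampled; fit_resample replaces fit_sample,
# which newer imbalanced-learn releases no longer provide
os_features, os_labels = oversampler.fit_resample(features_train, labels_train)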

Check the size of the oversampled dataset

len(os_labels[os_labels==1])
227454

Next, run cross-validation and fit a logistic regression model on the oversampled dataset

os_features = pd.DataFrame(os_features)
os_labels = pd.DataFrame(os_labels)
best_c = printing_Kfold_scores(os_features, os_labels)
-------------------------------------------
C parameter: 0.01
-------------------------------------------

Iteration  0 : recall score =  0.8903225806451613
Iteration  1 : recall score =  0.8947368421052632
Iteration  2 : recall score =  0.9687728228394379
Iteration  3 : recall score =  0.9578813158791396
Iteration  4 : recall score =  0.958167089831943

Mean recall score  0.933976130260189

-------------------------------------------
C parameter: 0.1
-------------------------------------------

Iteration  0 : recall score =  0.8903225806451613
Iteration  1 : recall score =  0.8947368421052632
Iteration  2 : recall score =  0.9703884032311608
Iteration  3 : recall score =  0.9593981160901727
Iteration  4 : recall score =  0.9605082379837548

Mean recall score  0.9350708360111024

-------------------------------------------
C parameter: 1
-------------------------------------------

Iteration  0 : recall score =  0.8903225806451613
Iteration  1 : recall score =  0.8947368421052632
Iteration  2 : recall score =  0.9704105344694036
Iteration  3 : recall score =  0.9585847594552709
Iteration  4 : recall score =  0.9595410030665743

Mean recall score  0.9347191439483347

-------------------------------------------
C parameter: 10
-------------------------------------------

Iteration  0 : recall score =  0.8903225806451613
Iteration  1 : recall score =  0.8947368421052632
Iteration  2 : recall score =  0.9705433218988603
Iteration  3 : recall score =  0.9601894901133203
Iteration  4 : recall score =  0.9604862553720007

Mean recall score  0.9352556980269211

-------------------------------------------
C parameter: 100
-------------------------------------------

Iteration  0 : recall score =  0.8903225806451613
Iteration  1 : recall score =  0.8947368421052632
Iteration  2 : recall score =  0.9703220095164324
Iteration  3 : recall score =  0.9604093162308613
Iteration  4 : recall score =  0.9607170727954188

Mean recall score  0.9353015642586275

*********************************************************************************
Best model to choose from cross validation is with C parameter 100.0
*********************************************************************************

Now look at the confusion matrix

lr = LogisticRegression(C=best_c, penalty='l1', solver='liblinear')
lr.fit(os_features, os_labels.values.ravel())
y_pred = lr.predict(features_test.values)

# Compute confusion matrix
cnf_matrix = confusion_matrix(labels_test, y_pred)
np.set_printoptions(precision=2)

print("Recall metric in the testing dataset:",
      cnf_matrix[1, 1] / (cnf_matrix[1, 0] + cnf_matrix[1, 1]))

# Plot non-normalized confusion matrix
class_names = [0, 1]
plt.figure()
plot_confusion_matrix(cnf_matrix,
                      classes=class_names,
                      title='Confusion matrix')
plt.show()

  

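Based on everything so far, weighing accuracy, recall, and the number of misclassified samples together, oversampling works somewhat better than undersampling.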

8. Summary:

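For imbalanced data, the more of the data you can use, the better. Undersampling produces a large number of misclassifications, and this is inherent to the approach: because class 0 is cut down to be as scarce as class 1, the model behaves as if the two classes were equally rare in the original data, which drives the number of misclassified samples up. In this case study oversampling gave the better result: recall is slightly lower, but overall performance is good.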

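Workflow:

(1) First inspect the data and check whether the classes are balanced; if they are not, some remedy is needed. (The data in this case is already clean, so no other preprocessing was required and it could be used as-is. In many cases, usable feature data is not available directly.)
(2) Standardize the data so that the feature values vary over a smaller range, then select the data.
(3) Set up the confusion matrix and the model evaluation metrics, then choose parameters by cross-validation.
(4) Compare the predicted probability against a threshold to obtain the final prediction. Different thresholds can change the result considerably.
(5) The SMOTE algorithm.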
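Step (2), standardization, rescales a feature to zero mean and unit variance so that large raw values do not dominate the fit. A minimal NumPy sketch with made-up amounts follows; the values are illustrative only, and scikit-learn's `StandardScaler` applies the same formula to each column.

```python
import numpy as np

# Made-up transaction amounts for illustration
amount = np.array([1.0, 5.0, 20.0, 150.0, 2000.0])

# z-score: subtract the mean, divide by the (population) standard deviation
amount_std = (amount - amount.mean()) / amount.std()

print(amount_std.round(3))
print(round(float(amount_std.mean()), 6), round(float(amount_std.std()), 6))
```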

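Through this credit card fraud detection case study, we covered approaches to imbalanced class distributions, cross-validation, regularization penalties, the confusion matrix, and model evaluation methods.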

 

