Machine Learning: Credit Card Fraud Detection in Practice

Reposted article, 2020-05-25 15:26

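I. Research topic and background:

1. Research topic:

Use historical credit card data to build a machine learning anti-fraud model that predicts how likely a new credit card transaction is to be fraudulent.

2. Background:

The dataset contains credit card transactions made by European cardholders in September 2013. It covers transactions that occurred over two days, in which 492 of the 284807 transactions were fraudulent. The dataset is highly imbalanced: the positive class (fraud) accounts for 0.172% of all transactions. Because of confidentiality issues, the original features and further background information about the data cannot be provided; instead, features V1, V2, ... V28 are the principal components obtained with PCA. The only features that have not been transformed with PCA are 'Time' and 'Amount'. The feature 'Time' contains the number of seconds elapsed between each transaction and the first transaction in the dataset. The feature 'Class' is the response variable: it takes the value 1 in case of fraud and 0 otherwise.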
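
The imbalance figure quoted above can be checked with a few lines of arithmetic. This is a minimal sketch that uses only the two counts given in the description; the variable names are illustrative and not part of the original article.

```python
# Counts quoted in the dataset description above
total_transactions = 284807
fraud_transactions = 492

# Share of fraudulent (positive) transactions, as a percentage
fraud_pct = fraud_transactions / total_transactions * 100
print(f"Fraud share: {fraud_pct:.4f}%")

# How many normal transactions there are for each fraudulent one
normal_per_fraud = (total_transactions - fraud_transactions) / fraud_transactions
print(f"Normal transactions per fraud: {normal_per_fraud:.0f}")
```

With a ratio this skewed, a model can look accurate while missing almost every fraud, which is why the later sections focus on resampling and on recall.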

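II. Data

1. Data source: https://www.kesci.com/home/dataset/5b56a592fc7e9000103c0442/files

2. Libraries used:

Numpy: a scientific computing library, used mainly for matrix operations. Not sure where matrices come in? Think of it this way: our data is made up of rows (samples) and columns (features), so the data itself is a matrix.
Pandas: a data analysis and processing library. People often say that handling data in Python is easy, but what makes it easy? With pandas, even quite complex operations can often be done in a single line of code.
Matplotlib: a visualization library. Whether you are analyzing or modeling, a good memory is not enough; results and intermediate steps need to be shown visually.
Scikit-Learn: a very practical machine learning library. It contains essentially every basic machine learning algorithm you are likely to need, and beyond that it offers many preprocessing and evaluation modules to explore.

III. Problem statement:

IV. Data preprocessing: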

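1. Read and inspect the data. Because this is data from a competition website, the original author applied PCA dimensionality reduction to the raw data for confidentiality and also anonymized the column names, so there is no need to dwell on the meaning of each column during the analysis. The data has also already gone through a series of preprocessing steps, so it is fairly clean and tidy. A Class value of 0 is normal behavior and 1 is fraud.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

data = pd.read_csv(r"X:\Users\orange\Desktop\邏輯回歸-信用卡欺詐檢測\creditcard.csv")
data.head()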

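2. Inspect the data: the output above shows that the data is structured, so no feature extraction or conversion is needed. However, the feature Amount is on a different scale from the other features, so it needs feature scaling.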

data.describe().T

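3. Difference in volume between normal and anomalous data

In the output above, the Class label gives the category of each record: 0 is normal data and 1 is fraudulent data.

The task here is fraud detection on credit card data. The dataset contains both normal records and problematic ones, and in general the problematic records make up only a very small share.

The bar chart below shows the difference in volume between normal and anomalous data directly.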

count_classes = data['Class'].value_counts().sort_index()
count_classes.plot(kind='bar')  # pandas can draw simple charts directly
# Bar chart of the two classes
plt.title("Fraud class histogram")
plt.xlabel("Class")
# Number of records in each class
plt.ylabel("Frequency")

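4. Preprocessing: standardizing the data

From the output we can see that there are about 280,000 normal samples (class 0), while the anomalous samples (class 1) are so few that they are hard to make out in the chart. They do exist, but there are only a few hundred of them.

Values in the Amount column vary over a very wide range. In machine learning the features should not differ too much in scale, so Amount needs to be preprocessed by standardizing it.

The Time column is of little use on its own, and the Amount column is replaced by its standardized version, so both columns are dropped.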

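# Preprocessing: standardize the data
from sklearn.preprocessing import StandardScaler
# reshape(-1, 1): -1 lets NumPy infer the number of rows; .values is needed to get a NumPy array from the Series
# Add the standardized values as a new feature column
data['normAmount'] = StandardScaler().fit_transform(data['Amount'].values.reshape(-1, 1))
data = data.drop(['Time', 'Amount'], axis=1)
data.head()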
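
To make the standardization step concrete, the sketch below compares `StandardScaler` with the formula z = (x - mean) / std computed by hand. The amounts are made-up values for illustration only, not rows from the dataset.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical transaction amounts, for illustration only
amounts = np.array([1.0, 10.0, 100.0, 1000.0, 25000.0])

# StandardScaler expects a 2-D array: reshape(-1, 1) makes a single column
# and lets NumPy infer the number of rows
scaled = StandardScaler().fit_transform(amounts.reshape(-1, 1)).ravel()

# The same result by hand: z = (x - mean) / std
# (StandardScaler uses the population standard deviation, i.e. ddof=0)
manual = (amounts - amounts.mean()) / amounts.std()

print(np.allclose(scaled, manual))
print(scaled)
```

After scaling, the column has mean 0 and standard deviation 1, which puts it on a scale comparable to the PCA components V1 to V28.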

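V. Handling the imbalanced sample distribution

1. Undersampling strategy

As noted above, the numbers of normal and anomalous records in the dataset differ enormously. For this kind of class imbalance there are generally two strategies:

(1) Undersampling: the earlier counts show about 280,000 samples of class 0 and only a few hundred of class 1. Reduce class 0 to a few hundred samples as well, so that both classes are equally small.
(2) Oversampling: the earlier counts show about 280,000 samples of class 0 and only a few hundred of class 1. Since class 0 is large and class 1 is small, generate additional class 1 samples until there are as many of them as class 0 samples.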
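
The article goes on to implement only undersampling. For comparison, here is a minimal sketch of the simplest form of oversampling on a tiny synthetic frame. Note the assumptions: it duplicates existing minority rows at random rather than generating new synthetic points (as methods such as SMOTE do), and the frame `toy` and its values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Tiny synthetic frame standing in for the real data (values are made up)
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "V1": rng.normal(size=20),
    "Class": [0] * 17 + [1] * 3,
})

minority = toy[toy["Class"] == 1]
majority = toy[toy["Class"] == 0]

# Random oversampling: draw minority rows with replacement until
# there are as many of them as majority rows
minority_upsampled = minority.sample(n=len(majority), replace=True, random_state=0)

over_sample_data = pd.concat([majority, minority_upsampled])
print(over_sample_data["Class"].value_counts())
```

In practice oversampling is usually applied to the training split only, so that duplicated rows do not leak into the test set.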

We start with the undersampling strategy.

X = data.loc[:, data.columns != 'Class']
y = data.loc[:, data.columns == 'Class']

# Number of data points in the minority (fraud) class
number_records_fraud = len(data[data.Class == 1])
fraud_indices = np.array(data[data.Class == 1].index)

# Index labels of the normal class
normal_indices = data[data.Class == 0].index

# From the normal indices, randomly pick number_records_fraud of them without replacement
random_normal_indices = np.random.choice(normal_indices, number_records_fraud, replace = False)
random_normal_indices = np.array(random_normal_indices)

# Concatenate the two sets of indices
under_sample_indices = np.concatenate([fraud_indices,random_normal_indices])

# Build the undersampled dataset (the indices are labels, so select with .loc)
under_sample_data = data.loc[under_sample_indices,:]

X_undersample = under_sample_data.loc[:, under_sample_data.columns != 'Class']
y_undersample = under_sample_data.loc[:, under_sample_data.columns == 'Class']

# Show the class proportions
print("Percentage of normal transactions: ", len(under_sample_data[under_sample_data.Class == 0])/len(under_sample_data))
print("Percentage of fraud transactions: ", len(under_sample_data[under_sample_data.Class == 1])/len(under_sample_data))
print("Total number of transactions in resampled data: ", len(under_sample_data))

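2. Splitting the data

After undersampling, normal and fraudulent records each make up 50% of the data, and the total number of samples is only a small fraction of the original.

Next, both the original dataset and the undersampled dataset are split into training and test sets.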

from sklearn.model_selection import train_test_split

# Whole dataset
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size = 0.3, random_state = 0)

print("Number transactions train dataset: ", len(X_train))
print("Number transactions test dataset: ", len(X_test))
print("Total number of transactions: ", len(X_train)+len(X_test))

# Undersampled dataset
X_train_undersample, X_test_undersample, y_train_undersample, y_test_undersample = train_test_split(X_undersample
                                                                                                   ,y_undersample
                                                                                                   ,test_size = 0.3
                                                                                                   ,random_state = 0)
print("")
print("Number transactions train dataset: ", len(X_train_undersample))
print("Number transactions test dataset: ", len(X_test_undersample))
print("Total number of transactions: ", len(X_train_undersample)+len(X_test_undersample))

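VI. Cross-validation

Suppose we have a dataset called data. When building a machine learning model, we usually start by splitting the data, for example taking the first 80% as the training set and the remaining 20% as the test set. The 80% is used to build the model and the 20% is used to test it. So the first step is to split the data into a training set and a test set, and this step is mandatory. The second step is to divide the training set evenly again, for example into three equal parts, called parts 1, 2 and 3.

Whatever kind of model we build, it comes with many parameters that have to be chosen. Is a larger value better for a given parameter, or a smaller one? Rules of thumb cannot answer that reliably. The way to settle on a value is cross-validation.

So what is cross-validation?

Round 1: train the model on parts 1 and 2 together, then use part 3 to validate the model with the current weights. Part 3 is the validation set; the validation set is carved out of the training set and is used to judge whether the model is good or bad.
Round 2: train the model on parts 1 and 3 together, then use part 2 to validate the model with the current weights.
Round 3: train the model on parts 2 and 3 together, then use part 1 to validate the model with the current weights.

Running only one round is risky. If we only did round 1, validation part 3 might happen to be easier than average, which would make the model look better than it is. Conversely, the data may contain erroneous values and outliers; if those poor-quality samples end up in the validation set, the model looks worse than it is. We want an estimate that is neither too high nor too low, so we run several rounds and take the average. Here parts 1, 2 and 3 each serve as the validation set once, and each round yields an evaluation score. Adding the three scores together and dividing by 3 gives a rough estimate of how well the model performs.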
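
The three rounds described above can be sketched as follows. This is a minimal illustration on synthetic data generated with `make_classification`, not the credit card data; the names `X_toy`, `y_toy` and `fold_recalls` are hypothetical, and recall is used as the score because that is the metric the article relies on later.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import KFold

# Synthetic training set standing in for the real one
X_toy, y_toy = make_classification(n_samples=300, n_features=5, random_state=0)

fold = KFold(n_splits=3, shuffle=False)
fold_recalls = []
for train_idx, val_idx in fold.split(X_toy):
    # Fit on two of the three parts, validate on the remaining one
    model = LogisticRegression(C=1.0)
    model.fit(X_toy[train_idx], y_toy[train_idx])
    y_val_pred = model.predict(X_toy[val_idx])
    fold_recalls.append(recall_score(y_toy[val_idx], y_val_pred))

# The reported score is the average over the three validation parts
mean_recall = np.mean(fold_recalls)
print(fold_recalls, mean_recall)
```

Each part is used for validation exactly once, so an unusually easy or unusually noisy part affects only one of the three scores being averaged.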

# Recall = TP / (TP + FN)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.metrics import confusion_matrix,recall_score,classification_report

def printing_Kfold_scores(x_train_data,y_train_data):
    fold = KFold(n_splits=5,shuffle=False)

    # Candidate values for the regularization parameter C
    c_param_range = [0.01,0.1,1,10,100]

    results_table = pd.DataFrame(index = range(len(c_param_range)), columns = ['C_parameter','Mean recall score'])
    results_table['C_parameter'] = c_param_range

    # fold.split yields two index arrays per fold: indices[0] for training, indices[1] for validation
    j = 0
    for c_param in c_param_range:
        print('-------------------------------------------')
        print('C parameter: ', c_param)
        print('-------------------------------------------')
        print('')

        recall_accs = []
        for iteration, indices in enumerate(fold.split(x_train_data),start=1):

            # Logistic regression with the current C; the l1 penalty needs a solver that supports it
            lr = LogisticRegression(C = c_param, penalty = 'l1', solver = 'liblinear')

            # Fit the model on the training portion of this fold (indices[0]).
            # The portion given by indices[1] is held out as the validation part.
            lr.fit(x_train_data.iloc[indices[0],:],y_train_data.iloc[indices[0],:].values.ravel())

            # Predict on the validation portion of the training data
            y_pred_undersample = lr.predict(x_train_data.iloc[indices[1],:])

            # Compute recall and append it to the list of scores for the current C
            recall_acc = recall_score(y_train_data.iloc[indices[1],:].values,y_pred_undersample)
            recall_accs.append(recall_acc)
            print('Iteration ', iteration,': recall score = ', recall_acc)

        # The mean of these recall scores is the metric we keep for this C
        results_table.loc[j,'Mean recall score'] = np.mean(recall_accs)
        j += 1
        print('')
        print('Mean recall score ', np.mean(recall_accs))
        print('')

    # The column starts out with object dtype, so cast it before taking idxmax
    best_c = results_table.loc[results_table['Mean recall score'].astype('float64').idxmax()]['C_parameter']

    # Finally, report which of the candidate C values performed best
    print('*********************************************************************************')
    print('Best model to choose from cross validation is with C parameter = ', best_c)
    print('*********************************************************************************')

    return best_c

Call the function above on the undersampled dataset:
best_c = printing_Kfold_scores(X_train_undersample,y_train_undersample)
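
As an aside, a search over C like the one above can also be expressed with scikit-learn's `GridSearchCV`. The sketch below is an alternative formulation, not the article's method: it runs on synthetic data, and with an integer `cv` and a classifier it uses stratified folds, so its scores need not match a plain `KFold` loop exactly.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic data standing in for X_train_undersample / y_train_undersample
X_toy, y_toy = make_classification(
    n_samples=400, n_features=8, weights=[0.7, 0.3], random_state=0
)

param_grid = {"C": [0.01, 0.1, 1, 10, 100]}
search = GridSearchCV(
    LogisticRegression(penalty="l1", solver="liblinear"),
    param_grid,
    scoring="recall",  # rank candidates by mean recall across the folds
    cv=5,
)
search.fit(X_toy, y_toy)

print(search.best_params_["C"], search.best_score_)
```

The hand-written loop has the advantage of printing the recall of every fold, which makes the variation between folds visible.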

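VII. Building the confusion matrix

Models are usually judged by accuracy, the fraction of predictions that are correct. But think about what our goal actually is: we want to detect the anomalous samples. To take another example, suppose a hospital asks us to find the people with cancer among 1000 patients. Say 990 of them do not have cancer and only 10 do, and those 10 are the ones we need to find. If we measure accuracy, then even a model that misses all 10 still scores 990/1000, that is 99%, and yet the model is worthless. This point matters a great deal: different evaluation methods give different answers, so the evaluation method must be chosen to fit the nature of the problem.

By the same reasoning, we use recall to judge the model here, that is, how many of the anomalous samples we actually detected, which was our goal from the start. The results are usually presented as a confusion matrix.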
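
The hospital example can be worked through in code. This is a minimal sketch with made-up labels that match the numbers in the example (990 healthy, 10 with cancer) and a deliberately useless model that predicts "healthy" for everyone.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

# 990 healthy patients (0) and 10 with cancer (1), as in the example above
y_true = np.array([0] * 990 + [1] * 10)

# A useless "model" that predicts everyone is healthy
y_pred = np.zeros_like(y_true)

accuracy = accuracy_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
print(accuracy, recall)

# Rows are true labels, columns are predicted labels:
# [[TN, FP],
#  [FN, TP]]
cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()
print(cm)
print(tp / (tp + fn))  # recall recomputed from the matrix: TP / (TP + FN)
```

Accuracy comes out at 99% while recall is 0, since none of the 10 positive cases is found. The same bottom-row calculation, `cm[1,1] / (cm[1,0] + cm[1,1])`, is what the article's code uses to report recall.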

def plot_confusion_matrix(cm, classes,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    """
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=0)
    plt.yticks(tick_marks, classes)

    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, cm[i, j],
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')

import itertools
# Train on the undersampled training set and evaluate on the undersampled test set
lr = LogisticRegression(C = best_c, penalty = 'l1', solver = 'liblinear')
lr.fit(X_train_undersample,y_train_undersample.values.ravel())
y_pred_undersample = lr.predict(X_test_undersample)

# Compute confusion matrix
cnf_matrix = confusion_matrix(y_test_undersample,y_pred_undersample)
np.set_printoptions(precision=2)

print("Recall metric in the testing dataset: ", cnf_matrix[1,1]/(cnf_matrix[1,0]+cnf_matrix[1,1]))

# Plot non-normalized confusion matrix
class_names = [0,1]
plt.figure()
plot_confusion_matrix(cnf_matrix
                      , classes=class_names
                      , title='Confusion matrix')
plt.show()

# Same model trained on the undersampled data, now evaluated on the full test set
lr = LogisticRegression(C = best_c, penalty = 'l1', solver = 'liblinear')
lr.fit(X_train_undersample,y_train_undersample.values.ravel())
y_pred = lr.predict(X_test)

# Compute confusion matrix
cnf_matrix = confusion_matrix(y_test,y_pred)
np.set_printoptions(precision=2)

print("Recall metric in the testing dataset: ", cnf_matrix[1,1]/(cnf_matrix[1,0]+cnf_matrix[1,1]))

# Plot non-normalized confusion matrix
class_names = [0,1]
plt.figure()
plot_confusion_matrix(cnf_matrix
                      , classes=class_names
                      , title='Confusion matrix')
plt.show()

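Next, call the same function on the full training set (without undersampling), then train and evaluate on the full data: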

best_c = printing_Kfold_scores(X_train,y_train)

lr = LogisticRegression(C = best_c, penalty = 'l1', solver = 'liblinear')
lr.fit(X_train,y_train.values.ravel())
y_pred_undersample = lr.predict(X_test)

# Compute confusion matrix
cnf_matrix = confusion_matrix(y_test,y_pred_undersample)
np.set_printoptions(precision=2)

print("Recall metric in the testing dataset: ", cnf_matrix[1,1]/(cnf_matrix[1,0]+cnf_matrix[1,1]))

# Plot non-normalized confusion matrix
class_names = [0,1]
plt.figure()
plot_confusion_matrix(cnf_matrix
                      , classes=class_names
                      , title='Confusion matrix')
plt.show()
