Handling Class Imbalance in Machine Learning with Undersampling

Reposted article. 2018-05-22 20:35. Tags: preprocessing / machine learning

Class imbalance refers to a classification task in which the number of training examples differs greatly between classes.

There are three common approaches: 1. undersampling, 2. oversampling, 3. threshold moving.
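Of the three, threshold moving is the only one that leaves the training data untouched, so it is easy to show in a few lines. The sketch below assumes one common form of the idea, which the text above does not spell out: predict positive when the predicted odds p / (1 - p) exceed the training-set ratio of positives to negatives, rather than 1. The function names are hypothetical.

```python
def moved_threshold(n_pos, n_neg):
    """Decision threshold implied by the training class counts.

    Predicting positive when p / (1 - p) > n_pos / n_neg is equivalent to
    predicting positive when p > n_pos / (n_pos + n_neg).
    """
    return n_pos / (n_pos + n_neg)


def predict_with_moved_threshold(probas, n_pos, n_neg):
    """Label each predicted positive-class probability using the moved threshold."""
    threshold = moved_threshold(n_pos, n_neg)
    return [1 if p > threshold else 0 for p in probas]


# Illustrative counts (assumed): 4 positives for every 96 negatives.
labels = predict_with_moved_threshold([0.03, 0.05, 0.5], n_pos=4, n_neg=96)
```

With a 4% positive rate the threshold drops from 0.5 to 0.04, so a prediction of 0.05 is already labelled positive.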

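In the project I have been working on these past few days, the target is positive with probability under 4%, and there is plenty of data, so I used undersampling: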

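Undersampling removes some negative examples so that the numbers of positive and negative examples are close, and then trains on the result. The basic algorithm is as follows: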

from sklearn.utils import shuffle


def undersampling(train, desired_apriori):
    """Randomly drop target == 0 rows so that target == 1 rows make up
    roughly `desired_apriori` of the returned data frame."""
    # Get the indices per target value
    idx_0 = train[train.target == 0].index
    idx_1 = train[train.target == 1].index
    # Get original number of records per target value
    nb_0 = len(idx_0)
    nb_1 = len(idx_1)
    # Calculate the undersampling rate and resulting number of records with target=0
    undersampling_rate = ((1 - desired_apriori) * nb_1) / (nb_0 * desired_apriori)
    # Cap at nb_0: sampling without replacement cannot return more rows than exist
    undersampled_nb_0 = min(nb_0, int(undersampling_rate * nb_0))
    print('Rate to undersample records with target=0: {}'.format(undersampling_rate))
    print('Number of records with target=0 after undersampling: {}'.format(undersampled_nb_0))
    # Randomly select records with target=0 to get at the desired a priori
    undersampled_idx = shuffle(idx_0, n_samples=undersampled_nb_0)
    # Construct list with remaining indices
    idx_list = list(undersampled_idx) + list(idx_1)
    # Return the undersampled data frame with a fresh 0..n-1 index
    train = train.loc[idx_list].reset_index(drop=True)

    return train
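The rate formula follows from asking how many negatives k to keep so that positives make up a share p of the result: nb_1 / (nb_1 + k) = p gives k = nb_1 * (1 - p) / p, and the rate is k / nb_0. A small check of that arithmetic on plain counts, with illustrative numbers that are assumed here and not taken from the project:

```python
def undersampled_negative_count(nb_0, nb_1, desired_apriori):
    """Mirror the rate arithmetic of undersampling() on plain counts."""
    undersampling_rate = ((1 - desired_apriori) * nb_1) / (nb_0 * desired_apriori)
    return undersampling_rate, int(undersampling_rate * nb_0)


# Illustrative counts (assumed): 400 positives out of 10,000 rows, i.e. 4%.
nb_0, nb_1 = 9600, 400
rate, kept_0 = undersampled_negative_count(nb_0, nb_1, desired_apriori=0.25)

# Share of positives once only `kept_0` negatives remain.
positive_share = nb_1 / (nb_1 + kept_0)
```

Here one negative in eight is kept (1,200 of 9,600), and the 400 positives then account for a quarter of the 1,600 remaining rows.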

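Because this was written for a specific project, the class being undersampled is the negative class (target == 0). Using it elsewhere requires some changes, for example to the target column name and to which label is treated as the majority.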
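One way to make those changes is to pass the majority label in explicitly. The following is a dependency-free sketch that works on a plain sequence of labels and returns the row positions to keep; the function name, the `majority_label` parameter, and the `seed` parameter are hypothetical and not part of the original code.

```python
import random


def undersample_indices(labels, majority_label, desired_apriori, seed=None):
    """Return sorted row positions to keep so that non-majority rows make up
    roughly `desired_apriori` of the result. Only majority rows are dropped."""
    majority = [i for i, y in enumerate(labels) if y == majority_label]
    others = [i for i, y in enumerate(labels) if y != majority_label]
    # Majority rows needed for the desired share, capped at what is available.
    n_keep = min(len(majority),
                 int(len(others) * (1 - desired_apriori) / desired_apriori))
    rng = random.Random(seed)
    kept = rng.sample(majority, n_keep)  # sampling without replacement
    return sorted(kept + others)


# Toy labels where the majority class is 1 rather than 0.
labels = [1] * 90 + [0] * 10
keep = undersample_indices(labels, majority_label=1, desired_apriori=0.25, seed=0)
```

With a data frame whose index runs 0..n-1, the returned positions could be passed to `train.iloc[keep]` to obtain the undersampled rows.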

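If undersampling discards negative examples at random, it may lose important information. To address this, Zhou Zhihua's lab proposed the undersampling algorithm EasyEnsemble: using an ensemble-learning mechanism, the negative examples are split into several sets, each used by a different learner. Every individual learner therefore sees undersampled data, yet taken together the learners do not lose important information. The method needs only a small change on top of the basic undersampling function: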

def easyensemble(df, desired_apriori, n_subsets=10):
    train_resample = []
    for _ in range(n_subsets):
        sel_train = undersampling(df, desired_apriori)
        train_resample.append(sel_train)
    return train_resample
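The function above only produces the subsets; the ensemble step still has to train one learner per subset and combine them. A minimal sketch of that step follows. It is a simplification: the paper trains an AdaBoost ensemble on each subset, whereas this sketch averages the scores of a generic per-subset learner. The names `fit_per_subset`, `ensemble_score`, and `fit_midpoint` are hypothetical, and the toy learner exists only to make the example runnable.

```python
from statistics import mean


def fit_per_subset(subsets, fit_fn):
    """Train one scorer per undersampled subset.

    `fit_fn` takes one subset and returns a callable mapping a row to a score.
    """
    return [fit_fn(subset) for subset in subsets]


def ensemble_score(scorers, row):
    """Average the scores of all per-subset scorers for a single row."""
    return mean(scorer(row) for scorer in scorers)


def fit_midpoint(subset):
    """Toy learner: score 1.0 if x exceeds the midpoint of the two class means."""
    pos = [x for x, y in subset if y == 1]
    neg = [x for x, y in subset if y == 0]
    cut = (mean(pos) + mean(neg)) / 2
    return lambda x: 1.0 if x > cut else 0.0


# Each subset is a list of (x, y) pairs: same positives, different negatives.
subsets = [
    [(0.0, 0), (1.0, 0), (4.0, 1), (5.0, 1)],  # midpoint 2.5
    [(1.0, 0), (2.0, 0), (4.0, 1), (5.0, 1)],  # midpoint 3.0
    [(2.0, 0), (3.0, 0), (4.0, 1), (5.0, 1)],  # midpoint 3.5
]
scorers = fit_per_subset(subsets, fit_midpoint)
borderline = ensemble_score(scorers, 3.2)
```

A borderline row such as x = 3.2 is scored positive by two of the three learners, so the averaged score reflects negatives that any single undersampled subset would have missed. With scikit-learn, `fit_fn` could wrap a classifier's `fit` and `predict_proba` in the same way.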

For a closer look, the algorithm is described in the original paper, Exploratory Undersampling for Class-Imbalance Learning (figure: algorithm listing from the paper).

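PS: When running cross-validation (CV) on class-imbalanced data, keep in mind that the target distribution of the classification problem is heavily skewed. If you use the functions in sklearn for cross-validation, it is advisable to use stratified sampling as implemented in StratifiedKFold and StratifiedShuffleSplit, so that the relative class frequencies are approximately preserved in every training and validation fold.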
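A short sketch of the stratified split, assuming scikit-learn and NumPy are installed. The toy labels are made up to mirror a 4% positive rate, and the features are placeholders.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy labels (assumed): 8 positives out of 200 rows, i.e. 4%.
y = np.array([1] * 8 + [0] * 192)
X = np.zeros((len(y), 1))  # placeholder features; folds depend only on y

skf = StratifiedKFold(n_splits=4, shuffle=True, random_state=0)

fold_positive_counts = []
fold_sizes = []
for train_idx, valid_idx in skf.split(X, y):
    # Count positives that landed in this validation fold.
    fold_positive_counts.append(int(y[valid_idx].sum()))
    fold_sizes.append(len(valid_idx))
```

Every validation fold receives the same number of positives, whereas an unstratified split of data this skewed could leave a fold with almost none.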
