Imbalanced Classification in Practice: Training and Evaluating Classification Models on the Adult Income Dataset

Reposted article. Published 2020-03-24 13:27. Category: Machine Learning

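Many binary classification tasks do not have an equal number of examples in each class; the class distribution is imbalanced.

A popular example is the adult income dataset, which uses personal details such as relationship status and education level to predict an adult's income level, that is, whether the person earns more than $50,000 per year. There are noticeably more cases with incomes at or below $50,000 than above it, so the class distribution is moderately imbalanced.
Many of the algorithms used for imbalanced classification can be applied to this dataset.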

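In this tutorial, you will discover how to develop and evaluate a classification model for the imbalanced adult income dataset.

After completing this tutorial, you will know:

How to load and explore the dataset, and how the findings can inform data preparation and model selection.
How to systematically evaluate machine learning models with a robust test harness.
How to fit a final model and use it to predict class labels for specific cases.

The tutorial on imbalanced classification of adult income proceeds as follows: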

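Tutorial Overview

This tutorial is divided into five parts:

Adult Income Dataset
Explore the Dataset
Baseline Model and Performance Evaluation
Evaluate Models
Make Predictions on New Data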

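Adult Income Dataset

In this tutorial, we will use a standard imbalanced machine learning dataset known as the "Adult Income" dataset, or simply the "adult" dataset.
The dataset is credited to Ronny Kohavi and Barry Becker. It was drawn from 1994 United States Census Bureau data and uses personal details such as education level to predict whether an individual earns more or less than $50,000 per year.
The dataset provides 14 input variables that are a mixture of categorical, ordinal, and numerical data types. The complete list of variables is as follows: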

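Age.
Workclass.
Final weight.
Education.
Number of years of education.
Marital status.
Occupation.
Relationship.
Race.
Sex.
Capital gain.
Capital loss.
Hours per week.
Native country.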

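The dataset has 48,842 rows in total. Of these, 3,620 contain missing values, marked with a question mark (?), which leaves 45,222 complete rows.
There are two class labels, '>50K' and '<=50K', so this is a binary classification task. The classes are imbalanced, with the '<=50K' class making up the larger share.
Because the imbalance is not severe and both class labels are equally important, this tutorial uses classification accuracy, or equivalently classification error, to report model performance on this dataset.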

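Explore the Dataset

The adult dataset is a widely used standard machine learning dataset, used to explore and demonstrate many machine learning algorithms, both general ones and those designed specifically for imbalanced classification.

First, download the dataset and save it in your current working directory with the name "adult-all.csv".

Next, let's take a look at the data. The first few lines of the file look as follows: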

39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K
...

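We can see that the input variables are a mixture of numerical, categorical, and ordinal data types. The categorical variables need to be binary or one-hot encoded. Note also that the target variable is represented as strings. For binary classification it needs to be label encoded as 0/1, with 0 for the majority class and 1 for the minority class. Missing values are marked with ?; these values can be imputed, or the affected rows can simply be dropped from the dataset.

The dataset can be loaded as a DataFrame with the pandas read_csv() function, specifying the file name, that there is no header row, and the marker used for missing values (here ?, which is then parsed as NaN):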

# define the dataset location
filename = 'adult-all.csv'
# load the csv file as a data frame
dataframe = read_csv(filename, header=None, na_values='?')

Once the dataset is loaded, we remove the rows with missing values and summarize the number of rows and columns:

# drop rows with missing
dataframe = dataframe.dropna()
# summarize the shape of the dataset
print(dataframe.shape)
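
The text above mentions imputing missing values as an alternative to dropping rows. The following is a minimal sketch of both options on a few made-up rows. The four-column layout and the choice of filling each column with its most frequent value are illustrative assumptions, not part of this tutorial's pipeline.

```python
from io import StringIO
from pandas import read_csv

# a few made-up rows in the same style as the dataset; '?' marks a missing value
csv_text = (
    "39,State-gov,Bachelors,<=50K\n"
    "50,?,Bachelors,<=50K\n"
    "38,Private,HS-grad,>50K\n"
    "28,Private,?,<=50K\n"
)
df = read_csv(StringIO(csv_text), header=None, na_values='?')

# option 1: drop every row that contains a missing value
dropped = df.dropna()
# option 2 (assumed strategy): fill each column with its most frequent value
imputed = df.fillna(df.mode().iloc[0])

print(dropped.shape)
print(imputed)
```

Dropping keeps only the two complete rows here, whereas imputing keeps all four rows at the cost of introducing guessed values.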

We can then summarize the class distribution using a Counter object:

# summarize the class distribution
target = dataframe.values[:,-1]
counter = Counter(target)
for k,v in counter.items():
    per = v / len(target) * 100
    print('Class=%s, Count=%d, Percentage=%.3f%%' % (k, v, per))

Tying this together, the complete example of loading and summarizing the dataset is listed below:

# load and summarize the dataset
from pandas import read_csv
from collections import Counter
# define the dataset location
filename = 'adult-all.csv'
# load the csv file as a data frame
dataframe = read_csv(filename, header=None, na_values='?')
# drop rows with missing
dataframe = dataframe.dropna()
# summarize the shape of the dataset
print(dataframe.shape)
# summarize the class distribution
target = dataframe.values[:,-1]
counter = Counter(target)
for k,v in counter.items():
    per = v / len(target) * 100
    print('Class=%s, Count=%d, Percentage=%.3f%%' % (k, v, per))

Running the example produces the following output:

(45222, 15)
Class=<=50K, Count=34014, Percentage=75.216%
Class=>50K, Count=11208, Percentage=24.784%

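The example first loads the dataset and confirms the number of rows and columns: 45,222 rows without missing values and 15 columns (14 input variables and one target variable). The class distribution is then summarized, confirming the imbalance: about 75 percent of the rows belong to the majority class (<=50K) and about 25 percent to the minority class (>50K).

Creating histograms gives a more direct view of how the numerical input variables are distributed.
First, we select the columns with numerical data types by calling select_dtypes() on the DataFrame.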

...
# select columns with numerical data types
num_ix = df.select_dtypes(include=['int64', 'float64']).columns
# select a subset of the dataframe with the chosen columns
subset = df[num_ix]

Then we plot them with the matplotlib plotting library:

# create histograms of numeric input variables
from pandas import read_csv
from matplotlib import pyplot
# define the dataset location
filename = 'adult-all.csv'
# load the csv file as a data frame
df = read_csv(filename, header=None, na_values='?')
# drop rows with missing
df = df.dropna()
# select columns with numerical data types
num_ix = df.select_dtypes(include=['int64', 'float64']).columns
# select a subset of the dataframe with the chosen columns
subset = df[num_ix]
# create a histogram plot of each numeric variable
subset.hist()
pyplot.show()

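Running the example creates one histogram for each of the six numerical input variables in the dataset.

We can see that the variables have different distributions: some look Gaussian, others exponential or discrete. Their ranges also differ considerably. Depending on the algorithm, scaling the variables to a common range is often useful, and a power transform may help with the skewed ones.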
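
As a rough illustration of the two preparation steps just mentioned, the sketch below rescales one made-up, right-skewed column to the range [0, 1] with MinMaxScaler and applies a Yeo-Johnson power transform with PowerTransformer. The values are invented for the example, and the tutorial's own pipeline only uses MinMaxScaler.

```python
from numpy import array
from sklearn.preprocessing import MinMaxScaler, PowerTransformer

# made-up, right-skewed values standing in for one numerical column
values = array([1, 2, 2, 3, 3, 4, 5, 8, 13, 40], dtype=float).reshape(-1, 1)

# rescale the column to the range [0, 1]
scaled = MinMaxScaler().fit_transform(values)

# power transform to make the distribution more Gaussian-like;
# by default the result is also standardized to zero mean and unit variance
transformed = PowerTransformer(method='yeo-johnson').fit_transform(values)

print('scaled range:', scaled.min(), scaled.max())
print('transformed mean and std:', transformed.mean(), transformed.std())
```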

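Baseline Model and Performance Evaluation

k-fold cross-validation gives a good estimate of model performance. Here we evaluate candidate models with repeated stratified k-fold cross-validation with k=10, which means each fold contains about 45,222/10, or roughly 4,522, examples. Stratified means each fold keeps the same class mixture as the whole dataset, that is, about 75 percent to 25 percent for the majority and minority classes. Repeated means the evaluation is run several times to guard against fluke results and to better capture the variance of the chosen model; we use three repeats. A single model is therefore fit and evaluated 10 × 3 = 30 times, and the mean and standard deviation of these runs are reported.

This procedure can be implemented with the RepeatedStratifiedKFold class from scikit-learn.
The code is as follows: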

# evaluate a model
def evaluate_model(X, y, model):
    # define evaluation procedure
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
    # evaluate model
    scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
    return scores

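The evaluate_model() function takes the loaded dataset and a defined model, evaluates the model using repeated stratified k-fold cross-validation, and returns a list of accuracy scores.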
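
To make the fold arithmetic concrete, the sketch below runs the same cross-validation configuration over synthetic labels with a 75/25 class split and counts the resulting splits. The label array and its size are assumptions chosen so that the folds divide evenly; the inputs are placeholders because only the labels drive the stratification.

```python
from numpy import array, zeros
from sklearn.model_selection import RepeatedStratifiedKFold

# synthetic labels with a 75/25 class split (sizes chosen so folds divide evenly)
y = array([0] * 300 + [1] * 100)
# placeholder inputs; only the labels matter for stratification
X = zeros((len(y), 1))

cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
fold_sizes, minority_fractions = [], []
for train_ix, test_ix in cv.split(X, y):
    fold_sizes.append(len(test_ix))
    minority_fractions.append(y[test_ix].mean())

# 10 folds x 3 repeats = 30 train/test splits
print('splits:', len(fold_sizes))
print('test fold sizes:', set(fold_sizes))
print('minority fraction per fold:', set(minority_fractions))
```

Every test fold holds one tenth of the data and preserves the 25 percent minority share, which is what "stratified" promises.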

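How do we produce X and y? We can define a function that loads the dataset, label encodes the target column, and returns the inputs and target along with the indices of the categorical and numerical columns. The code is as follows: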

# load the dataset
def load_dataset(full_path):
    # load the dataset as a DataFrame
    dataframe = read_csv(full_path, header=None, na_values='?')
    # drop rows with missing
    dataframe = dataframe.dropna()
    # split into inputs and outputs
    last_ix = len(dataframe.columns) - 1
    X, y = dataframe.drop(last_ix, axis=1), dataframe[last_ix]
    # select categorical and numerical features
    cat_ix = X.select_dtypes(include=['object', 'bool']).columns
    num_ix = X.select_dtypes(include=['int64', 'float64']).columns
    # label encode the target variable to have the classes 0 and 1
    y = LabelEncoder().fit_transform(y)
    return X.values, y, cat_ix, num_ix

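With these pieces in place, we can use this test harness to evaluate models on the dataset.

To give the models a point of comparison, we can establish a baseline using the DummyClassifier class from scikit-learn, here configured to always predict the most frequent class. The code is as follows: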

# define the reference model
model = DummyClassifier(strategy='most_frequent')
# evaluate the model
scores = evaluate_model(X, y, model)
# summarize performance
print('Mean Accuracy: %.3f (%.3f)' % (mean(scores), std(scores)))

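Tying this together, the complete example of loading the dataset, evaluating the baseline model, and reporting its performance is listed below: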

# test harness and baseline model evaluation for the adult dataset
from collections import Counter
from numpy import mean
from numpy import std
from numpy import hstack
from pandas import read_csv
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.dummy import DummyClassifier

# load the dataset
def load_dataset(full_path):
    # load the dataset as a DataFrame
    dataframe = read_csv(full_path, header=None, na_values='?')
    # drop rows with missing
    dataframe = dataframe.dropna()
    # split into inputs and outputs
    last_ix = len(dataframe.columns) - 1
    X, y = dataframe.drop(last_ix, axis=1), dataframe[last_ix]
    # select categorical and numerical features
    cat_ix = X.select_dtypes(include=['object', 'bool']).columns
    num_ix = X.select_dtypes(include=['int64', 'float64']).columns
    # label encode the target variable to have the classes 0 and 1
    y = LabelEncoder().fit_transform(y)
    return X.values, y, cat_ix, num_ix

# evaluate a model
def evaluate_model(X, y, model):
    # define evaluation procedure
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
    # evaluate model
    scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
    return scores

# define the location of the dataset
full_path = 'adult-all.csv'
# load the dataset
X, y, cat_ix, num_ix = load_dataset(full_path)
# summarize the loaded dataset
print(X.shape, y.shape, Counter(y))
# define the reference model
model = DummyClassifier(strategy='most_frequent')
# evaluate the model
scores = evaluate_model(X, y, model)
# summarize performance
print('Mean Accuracy: %.3f (%.3f)' % (mean(scores), std(scores)))

Running the example produces the following output:

(45222, 14) (45222,) Counter({0: 34014, 1: 11208})
Mean Accuracy: 0.752 (0.000)

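The example first loads and prepares the data, then classifies with DummyClassifier and evaluates it with RepeatedStratifiedKFold. The baseline achieves an accuracy of about 75.2 percent. This score is a lower bound on model performance: any model with a mean accuracy above about 75.2 percent can be considered to have skill, whereas one that scores below it generally does not.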
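
The baseline figure can be checked by hand: a classifier that always predicts the majority class is correct exactly as often as that class occurs. A small worked check using the class counts reported earlier:

```python
# class counts reported for the complete rows of the dataset
majority, minority = 34014, 11208

# always predicting the majority class is right for every majority row
baseline_accuracy = majority / (majority + minority)
print('Majority-class accuracy: %.3f' % baseline_accuracy)  # prints 0.752
```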

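Evaluate Models

In the previous section, we saw that the baseline already reaches about 75.2 percent accuracy, but there is plenty of room for improvement.
In this section, we use the test harness from the previous section to evaluate a range of algorithms on the same dataset.
The goal is to demonstrate how to work through the problem systematically, and to show some of the algorithms designed for imbalanced classification problems.

Evaluate Machine Learning Algorithms

Here we select a set of nonlinear algorithms to evaluate:

Decision Tree (CART)
Support Vector Machine (SVM)
Bagged Decision Trees (BAG)
Random Forest (RF)
Gradient Boosting Machine (GBM)

We define each model in turn and add it to a list, together with a short name, so that the results can be reported and plotted later. The code is as follows: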

# define models to test
def get_models():
    models, names = list(), list()
    # CART
    models.append(DecisionTreeClassifier())
    names.append('CART')
    # SVM
    models.append(SVC(gamma='scale'))
    names.append('SVM')
    # Bagging
    models.append(BaggingClassifier(n_estimators=100))
    names.append('BAG')
    # RF
    models.append(RandomForestClassifier(n_estimators=100))
    names.append('RF')
    # GBM
    models.append(GradientBoostingClassifier(n_estimators=100))
    names.append('GBM')
    return models, names

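Each algorithm is used mostly with its default hyperparameters. The categorical input variables are one-hot encoded and the numerical input variables are normalized with MinMaxScaler. Specifically, a ColumnTransformer applies OneHotEncoder to the categorical columns and MinMaxScaler to the numerical columns, and a Pipeline chains this transformer with the model so that the transforms are fit only on the training folds during cross-validation. The code is as follows: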

...
# define steps
steps = [('c',OneHotEncoder(handle_unknown='ignore'),cat_ix), ('n',MinMaxScaler(),num_ix)]
# one hot encode categorical, normalize numerical
ct = ColumnTransformer(steps)
# wrap the model in a pipeline
pipeline = Pipeline(steps=[('t',ct),('m',models[i])])
# evaluate the model and store results
scores = evaluate_model(X, y, pipeline)
# summarize performance
print('>%s %.3f (%.3f)' % (names[i], mean(scores), std(scores)))
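
To see what the ColumnTransformer produces on its own, the sketch below applies the same two transforms to three made-up rows with one categorical and two numerical columns. The rows and the column indices are assumptions for illustration only.

```python
from numpy import array
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, MinMaxScaler

# three made-up rows: age, workclass, hours per week
X = array([[39, 'State-gov', 40],
           [50, 'Private', 13],
           [38, 'Private', 40]], dtype=object)
cat_ix, num_ix = [1], [0, 2]  # assumed column positions for this toy array

ct = ColumnTransformer([('c', OneHotEncoder(handle_unknown='ignore'), cat_ix),
                        ('n', MinMaxScaler(), num_ix)])
Xt = ct.fit_transform(X)
# the result may be sparse or dense depending on its density; make it dense
Xt = Xt.toarray() if hasattr(Xt, 'toarray') else Xt

# two one-hot columns (one per workclass value) followed by two scaled columns
print(Xt.shape)
print(Xt)
```

The one-hot columns come first because the categorical transformer is listed first, so the original column order is not preserved in the output.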

We can also compare the results visually with a box and whisker plot:

...
# plot the results
pyplot.boxplot(results, labels=names, showmeans=True)
pyplot.show()

Tying this together, the complete example of comparing the algorithms is listed below:

# spot check machine learning algorithms on the adult imbalanced dataset
from numpy import mean
from numpy import std
from pandas import read_csv
from matplotlib import pyplot
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import MinMaxScaler
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import BaggingClassifier

# load the dataset
def load_dataset(full_path):
    # load the dataset as a DataFrame
    dataframe = read_csv(full_path, header=None, na_values='?')
    # drop rows with missing
    dataframe = dataframe.dropna()
    # split into inputs and outputs
    last_ix = len(dataframe.columns) - 1
    X, y = dataframe.drop(last_ix, axis=1), dataframe[last_ix]
    # select categorical and numerical features
    cat_ix = X.select_dtypes(include=['object', 'bool']).columns
    num_ix = X.select_dtypes(include=['int64', 'float64']).columns
    # label encode the target variable to have the classes 0 and 1
    y = LabelEncoder().fit_transform(y)
    return X.values, y, cat_ix, num_ix

# evaluate a model
def evaluate_model(X, y, model):
    # define evaluation procedure
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
    # evaluate model
    scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
    return scores

# define models to test
def get_models():
    models, names = list(), list()
    # CART
    models.append(DecisionTreeClassifier())
    names.append('CART')
    # SVM
    models.append(SVC(gamma='scale'))
    names.append('SVM')
    # Bagging
    models.append(BaggingClassifier(n_estimators=100))
    names.append('BAG')
    # RF
    models.append(RandomForestClassifier(n_estimators=100))
    names.append('RF')
    # GBM
    models.append(GradientBoostingClassifier(n_estimators=100))
    names.append('GBM')
    return models, names

# define the location of the dataset
full_path = 'adult-all.csv'
# load the dataset
X, y, cat_ix, num_ix = load_dataset(full_path)
# define models
models, names = get_models()
results = list()
# evaluate each model
for i in range(len(models)):
    # define steps
    steps = [('c',OneHotEncoder(handle_unknown='ignore'),cat_ix), ('n',MinMaxScaler(),num_ix)]
    # one hot encode categorical, normalize numerical
    ct = ColumnTransformer(steps)
    # wrap the model in a pipeline
    pipeline = Pipeline(steps=[('t',ct),('m',models[i])])
    # evaluate the model and store results
    scores = evaluate_model(X, y, pipeline)
    results.append(scores)
    # summarize performance
    print('>%s %.3f (%.3f)' % (names[i], mean(scores), std(scores)))
# plot the results
pyplot.boxplot(results, labels=names, showmeans=True)
pyplot.show()

Running the example produces the following output:

>CART 0.812 (0.005)
>SVM 0.837 (0.005)
>BAG 0.852 (0.004)
>RF 0.849 (0.004)
>GBM 0.863 (0.004)

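We can see that all of the chosen algorithms have skill, achieving a classification accuracy above 75.2 percent. GBM performs best, with an accuracy of about 86.3 percent, roughly 11 percentage points above the baseline. In the box and whisker plot, there are a few outliers (the circles on the plot), but every result for every algorithm sits above the 75 percent baseline. The distribution for each algorithm also looks compact, with the median and mean roughly aligned, which suggests the models are quite stable on this dataset. Overall accuracy is not the whole story, however: on an imbalanced dataset it is also worth checking how accurately the minority class is classified.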
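
The gap between overall accuracy and minority-class performance can be sketched with made-up predictions. In the example below, the labels and predictions are invented for illustration and are not outputs of any model in this tutorial; recall on class 1 measures the share of minority cases that are correctly identified.

```python
from sklearn.metrics import accuracy_score, recall_score

# made-up ground truth with a 75/25 class split
y_true = [0] * 75 + [1] * 25
# made-up predictions: 70 of 75 majority cases and 15 of 25 minority cases correct
y_pred = [0] * 70 + [1] * 5 + [1] * 15 + [0] * 10

accuracy = accuracy_score(y_true, y_pred)
minority_recall = recall_score(y_true, y_pred, pos_label=1)

print('Accuracy: %.2f' % accuracy)
print('Minority-class recall: %.2f' % minority_recall)
```

Here an accuracy of 0.85 coexists with a minority-class recall of only 0.60, which is why a single accuracy figure can hide weak performance on the smaller class.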

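Make Predictions on New Data

In this section, we use the GradientBoostingClassifier model to make predictions on new data. Fitting the model requires a ColumnTransformer that encodes the categorical input variables and scales the numerical input variables, wrapped with the model in a Pipeline so that these transforms are applied to the training data before the model is fit. The code is as follows: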

...
# define model to evaluate
model = GradientBoostingClassifier(n_estimators=100)
# one hot encode categorical, normalize numerical
ct = ColumnTransformer([('c',OneHotEncoder(),cat_ix), ('n',MinMaxScaler(),num_ix)])
# transform the inputs, then fit the model
pipeline = Pipeline(steps=[('t',ct), ('m',model)])

With the pipeline defined, we can fit it on the entire training dataset:

...
# fit the model
pipeline.fit(X, y)

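Once fit, the pipeline can make a prediction for a new row of data by calling predict(). This returns the encoded class label: 0 for "<=50K" or 1 for ">50K":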

...
# define a row of data
row = [...]
# make prediction
yhat = pipeline.predict([row])
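
Because the target was label encoded, predict() returns 0 or 1 and not the original strings. One way to map predictions back is to keep the fitted LabelEncoder and call inverse_transform(), as in the minimal sketch below. Note that load_dataset() in this tutorial does not return its encoder, so keeping one around is an assumed change, and the yhat values here are stand-ins for real pipeline output.

```python
from sklearn.preprocessing import LabelEncoder

# fit an encoder on the two class strings (LabelEncoder sorts classes,
# so '<=50K' becomes 0 and '>50K' becomes 1)
encoder = LabelEncoder()
encoder.fit(['<=50K', '>50K', '<=50K'])
print(list(encoder.classes_))

# stand-in for the integer labels returned by pipeline.predict(...)
yhat = [0, 1, 0]
labels = encoder.inverse_transform(yhat)
print(list(labels))
```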

The complete example of fitting a GradientBoostingClassifier and making predictions is listed below:

# fit a model and make predictions on the adult dataset
from pandas import read_csv
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import MinMaxScaler
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.pipeline import Pipeline

# load the dataset
def load_dataset(full_path):
    # load the dataset as a DataFrame
    dataframe = read_csv(full_path, header=None, na_values='?')
    # drop rows with missing
    dataframe = dataframe.dropna()
    # split into inputs and outputs
    last_ix = len(dataframe.columns) - 1
    X, y = dataframe.drop(last_ix, axis=1), dataframe[last_ix]
    # select categorical and numerical features
    cat_ix = X.select_dtypes(include=['object', 'bool']).columns
    num_ix = X.select_dtypes(include=['int64', 'float64']).columns
    # label encode the target variable to have the classes 0 and 1
    y = LabelEncoder().fit_transform(y)
    return X.values, y, cat_ix, num_ix

# define the location of the dataset
full_path = 'adult-all.csv'
# load the dataset
X, y, cat_ix, num_ix = load_dataset(full_path)
# define model to evaluate
model = GradientBoostingClassifier(n_estimators=100)
# one hot encode categorical, normalize numerical
ct = ColumnTransformer([('c',OneHotEncoder(),cat_ix), ('n',MinMaxScaler(),num_ix)])
# transform the inputs, then fit the model
pipeline = Pipeline(steps=[('t',ct), ('m',model)])
# fit the model
pipeline.fit(X, y)
# evaluate on some <=50K cases (known class 0)
print('<=50K cases:')
data = [[24, 'Private', 161198, 'Bachelors', 13, 'Never-married', 'Prof-specialty', 'Not-in-family', 'White', 'Male', 0, 0, 25, 'United-States'],
    [23, 'Private', 214542, 'Some-college', 10, 'Never-married', 'Farming-fishing', 'Own-child', 'White', 'Male', 0, 0, 40, 'United-States'],
    [38, 'Private', 309122, '10th', 6, 'Divorced', 'Machine-op-inspct', 'Not-in-family', 'White', 'Female', 0, 0, 40, 'United-States']]
for row in data:
    # make prediction
    yhat = pipeline.predict([row])
    # get the label
    label = yhat[0]
    # summarize
    print('>Predicted=%d (expected 0)' % (label))
# evaluate on some >50K cases (known class 1)
print('>50K cases:')
data = [[55, 'Local-gov', 107308, 'Masters', 14, 'Married-civ-spouse', 'Prof-specialty', 'Husband', 'White', 'Male', 0, 0, 40, 'United-States'],
    [53, 'Self-emp-not-inc', 145419, '1st-4th', 2, 'Married-civ-spouse', 'Exec-managerial', 'Husband', 'White', 'Male', 7688, 0, 67, 'Italy'],
    [44, 'Local-gov', 193425, 'Masters', 14, 'Married-civ-spouse', 'Prof-specialty', 'Wife', 'White', 'Female', 4386, 0, 40, 'United-States']]
for row in data:
    # make prediction
    yhat = pipeline.predict([row])
    # get the label
    label = yhat[0]
    # summarize
    print('>Predicted=%d (expected 1)' % (label))

Running the example produces the following output:

<=50K cases:
>Predicted=0 (expected 0)
>Predicted=0 (expected 0)
>Predicted=0 (expected 0)
>50K cases:
>Predicted=1 (expected 1)
>Predicted=1 (expected 1)
>Predicted=1 (expected 1)

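The example first fits the model on the training dataset and then makes predictions for the selected rows. In all six cases the predicted label matches the expected label, as we would hope.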

Original article: https://imba.deephub.ai/p/cdf00c806d6f11ea90cd05de3860c663
