Python: the SMOTE algorithm, a way to generate new samples when classes are imbalanced

Reposted article, 2018-03-09 17:26. Category: machine learning.


The simplest route is to use an existing Python library:

imbalanced-learn

imbalanced-learn is a python package offering a number of re-sampling techniques commonly used in datasets showing strong between-class imbalance. It is compatible with scikit-learn and is part of scikit-learn-contrib projects.

---------------------

http://contrib.scikit-learn.org/imbalanced-learn/stable/auto_examples/over-sampling/plot_smote.html#sphx-glr-auto-examples-over-sampling-plot-smote-py

http://contrib.scikit-learn.org/imbalanced-learn/stable/over_sampling.html#from-random-over-sampling-to-smote-and-adasyn (introductory guide)

SMOTE

An illustration of the SMOTE method and its variant.

[Figure: the original set and the resampled set produced by each SMOTE variant]

 
# Authors: Fernando Nogueira
#          Christos Aridas
#          Guillaume Lemaitre <g.lemaitre58@gmail.com>
# License: MIT
#
# Note: this example targets the imbalanced-learn release current when the
# article was written. Later releases replaced fit_sample with fit_resample
# and the kind argument with the BorderlineSMOTE and SVMSMOTE classes.

import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA

from imblearn.over_sampling import SMOTE

print(__doc__)


def plot_resampling(ax, X, y, title):
    c0 = ax.scatter(X[y == 0, 0], X[y == 0, 1], label="Class #0", alpha=0.5)
    c1 = ax.scatter(X[y == 1, 0], X[y == 1, 1], label="Class #1", alpha=0.5)
    ax.set_title(title)
    ax.spines['top'].set_visible(False)
    ax.spines['right'].set_visible(False)
    ax.get_xaxis().tick_bottom()
    ax.get_yaxis().tick_left()
    ax.spines['left'].set_position(('outward', 10))
    ax.spines['bottom'].set_position(('outward', 10))
    ax.set_xlim([-6, 8])
    ax.set_ylim([-6, 6])

    return c0, c1


# Generate the dataset
X, y = make_classification(n_classes=2, class_sep=2, weights=[0.3, 0.7],
                           n_informative=3, n_redundant=1, flip_y=0,
                           n_features=20, n_clusters_per_class=1,
                           n_samples=80, random_state=10)

# Instantiate a PCA object for the sake of easy visualisation
pca = PCA(n_components=2)
# Fit and transform X to visualise inside a 2D feature space
X_vis = pca.fit_transform(X)

# Apply regular SMOTE and its variants
kind = ['regular', 'borderline1', 'borderline2', 'svm']
sm = [SMOTE(kind=k) for k in kind]
X_resampled = []
y_resampled = []
X_res_vis = []
for method in sm:
    X_res, y_res = method.fit_sample(X, y)
    X_resampled.append(X_res)
    y_resampled.append(y_res)
    X_res_vis.append(pca.transform(X_res))

# Six subplots, unpack the axes array immediately
f, ((ax1, ax2), (ax3, ax4), (ax5, ax6)) = plt.subplots(3, 2)
# Remove axis for second plot
ax2.axis('off')
ax_res = [ax3, ax4, ax5, ax6]

c0, c1 = plot_resampling(ax1, X_vis, y, 'Original set')
for i in range(len(kind)):
    plot_resampling(ax_res[i], X_res_vis[i], y_resampled[i],
                    'SMOTE {}'.format(kind[i]))

ax2.legend((c0, c1), ('Class #0', 'Class #1'), loc='center',
           ncol=1, labelspacing=0.)
plt.tight_layout()
plt.show()


Python: the SMOTE algorithm

Source: https://www.jianshu.com/p/ecbc924860af

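Before looking at SMOTE, let us review the methods commonly used when positive and negative samples are imbalanced:

Sampling
The usual options are oversampling, undersampling and combined sampling.
Oversampling: pad out the class that has fewer samples.
Undersampling: shrink the class that has more samples.
Combined sampling: fix a target size N and apply oversampling and undersampling together, so that the positive and negative samples add up to N.

These methods have obvious drawbacks: undersampling throws away information, and oversampling by duplication makes the minority samples collinear.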
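As a rough illustration of the two basic sampling strategies, here is a minimal sketch using only the standard library. The function names `random_oversample` and `random_undersample` and the toy data are assumptions made for this example, not part of any library mentioned in the article.

```python
import random
from collections import Counter


def random_oversample(X, y, seed=0):
    """Duplicate randomly chosen rows until every class matches the largest one."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, count in counts.items():
        rows = [x for x, lab in zip(X, y) if lab == label]
        for _ in range(target - count):
            X_out.append(rng.choice(rows))  # an exact copy of an existing row
            y_out.append(label)
    return X_out, y_out


def random_undersample(X, y, seed=0):
    """Keep a random subset of each class, sized to the smallest class."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = min(counts.values())
    X_out, y_out = [], []
    for label in counts:
        rows = [x for x, lab in zip(X, y) if lab == label]
        for x in rng.sample(rows, target):  # the remaining rows are discarded
            X_out.append(x)
            y_out.append(label)
    return X_out, y_out


# Toy data (hypothetical): 8 samples of class 0, 2 samples of class 1
X = [[i, i * 2] for i in range(10)]
y = [0] * 8 + [1] * 2

X_over, y_over = random_oversample(X, y)
X_under, y_under = random_undersample(X, y)
print(Counter(y_over), Counter(y_under))
```

The oversampled set contains only exact copies of the two minority rows, and the undersampled set drops six of the eight majority rows, which is where the two drawbacks come from.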

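Weight adjustment
Typical forms are the weight or weight matrix parameters that many algorithms expose.
The idea is to change the relative weight of the inputs, for example the way boosting reweights the full sample set on each iteration, or the weights set up front in logistic regression.

The drawback is that there is no principled way to choose a suitable weight ratio, so it takes repeated trial and error.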
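One common starting point for the weight ratio is to weight each class inversely to its frequency. The sketch below assumes the formula n_samples / (n_classes * class_count); the function name `balanced_class_weights` and the 67:1 toy labels are hypothetical choices for this example.

```python
from collections import Counter


def balanced_class_weights(y):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(y)
    n_samples = len(y)
    n_classes = len(counts)
    return {label: n_samples / (n_classes * count)
            for label, count in counts.items()}


# Hypothetical labels with a 67:1 imbalance
y = [0] * 67 + [1] * 1
weights = balanced_class_weights(y)
print(weights)
```

With these weights the single minority sample counts 67 times as much as each majority sample. This is only a default; the trial and error described above may still be needed.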

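Kernel modification
Change the kernel function to offset the problems caused by class imbalance.

This applies only in limited settings, requires costly background knowledge, is expensive to tune, and amounts to black-box optimisation.

Model correction
Use the existing minority-class data and an algorithm to explore its structure and judge whether the data follow some regularity.
For example, if a linear fit shows that the minority samples lie along a line, new points can be generated from the fitted linear model.

In practice such regularities are hard to find, so this approach is difficult.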

SMOTE (Synthetic Minority Over-sampling Technique) is an oversampling algorithm proposed by Chawla et al. in 2002. To some extent it avoids the problems above.

The algorithm is introduced below:

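Figure: distribution of positive and negative samples

The blue samples clearly far outnumber the red ones. A classifier trained in the usual way may simply ignore the red samples and optimise accuracy on the blue samples only, so red samples need to be added to balance the dataset.

The idea behind SMOTE is simple. First, randomly pick n minority-class samples, as in the figure below.

Figure: the minority samples chosen as starting points for expansion

Then, for each of them, find its m nearest minority-class samples, as in the figure below.

Next, pick any one of those m nearest minority-class neighbours,

and pick a random point on the line segment between the two points. That point is the new synthetic sample.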
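The three steps above can be sketched in a few lines of standard-library Python (3.8 or later, for `math.dist`). This is a minimal sketch under stated assumptions, not the imbalanced-learn implementation: the function name `smote_sketch`, its parameters and the toy points are all hypothetical.

```python
import math
import random


def smote_sketch(minority, n_seeds, m_neighbours, seed=0):
    """Generate one synthetic point per randomly chosen minority seed."""
    rng = random.Random(seed)
    seeds = rng.sample(minority, n_seeds)           # step 1: pick n seeds
    synthetic = []
    for p in seeds:
        # step 2: the m nearest minority samples, excluding the seed itself
        others = [q for q in minority if q is not p]
        others.sort(key=lambda q: math.dist(p, q))
        neighbour = rng.choice(others[:m_neighbours])
        # step 3: a random point on the segment between seed and neighbour
        gap = rng.random()
        synthetic.append([pi + gap * (qi - pi) for pi, qi in zip(p, neighbour)])
    return synthetic


# Hypothetical 2D minority samples
minority = [[1.0, 1.0], [1.5, 2.0], [2.0, 1.2], [2.5, 2.4], [3.0, 1.8]]
new_points = smote_sketch(minority, n_seeds=3, m_neighbours=2)
print(new_points)
```

Because every new point is a convex combination of two existing minority points, it always falls inside the region the minority class already occupies, rather than being an exact copy of an existing row.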

In R this is straightforward because a ready-made package exists. A brief introduction:

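rm(list=ls())
install.packages("DMwR", dependencies=T)
library(DMwR)  # load the package that provides SMOTE
newdata = SMOTE(formula, data, perc.over=, perc.under=)
# formula: declares the dependent and independent variables
# perc.over: controls how much oversampling is done
# perc.under: controls how much undersampling is done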

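Comparison of the result:

At a glance it looks as if the minority class has simply been drawn over several times.
SMOTE here is prepackaged, so you just call it; there is nothing special about it.

I wanted to practise the Python I had just learned, so I wrote the procedure out in Python: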

# -*- coding: utf-8 -*-
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from numpy import *
import matplotlib.pyplot as plt

# Read the data
data = pd.read_table('C:/Users/17031877/Desktop/supermarket_second_man_clothes_train.txt', low_memory=False)
# Simple preprocessing: keep the label and three feature columns, drop rows with missing values
test_date = pd.concat([data['label'], data.iloc[:, 7:10]], axis=1)
test_date = test_date.dropna(how='any')

The data look roughly like this:

test_date.head()
Out[25]: 
   label  max_date_diff  max_pay  cnt_time
0      0           23.0  43068.0        15
1      0           10.0   1899.0         2
2      0          146.0   3299.0        21
3      0           30.0  31959.0        35
4      0            3.0  24165.0        98
test_date['label'][test_date['label']==0].count()/test_date['label'][test_date['label']==1].count()
Out[37]: 67

label is the class label. The ratio of label=0 to label=1 samples is 67:1, so the label=1 data need to be expanded.

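# Select the target (minority) class
aimed_date = test_date[test_date['label'] == 1]
# Randomly pick the minority samples that act as expansion centres
index = pd.DataFrame(aimed_date.index).sample(frac=0.1, random_state=1)
index.columns = ['id']
number = len(index)
# Select the chosen rows by index label (.loc replaces .ix, which newer pandas no longer provides)
aimed_date_new = aimed_date.loc[index.values.ravel(), :]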

10% of all minority samples are randomly chosen as the centre points for expansion.


# Standardise the variables (the centres and the full minority set are each scaled by their own scaler)
sc = StandardScaler().fit(aimed_date_new)
aimed_date_new = pd.DataFrame(sc.transform(aimed_date_new))
sc1 = StandardScaler().fit(aimed_date)
aimed_date = pd.DataFrame(sc1.transform(aimed_date))


# Euclidean distance over the four columns
def dist(a, b):
    a = array(a)
    b = array(b)
    d = ((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2 + (a[2] - b[2]) ** 2 + (a[3] - b[3]) ** 2) ** 0.5
    return d

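This step defines how distance is computed. In any algorithm that involves distance, the variables need to be standardised to remove the effect of their units, which also speeds up the computation.
Euclidean distance is used here. For other ways of measuring distance, see:
"An introduction to the theory of various distance and similarity measures"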
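To see why standardisation matters, the sketch below compares Euclidean distance before and after scaling, using the three feature columns (max_date_diff, max_pay, cnt_time) of the first three rows shown earlier. The helper names `standardise` and `euclidean` are assumptions for this example. It uses the population standard deviation, which is also what scikit-learn's StandardScaler uses.

```python
import math
import statistics


def standardise(rows):
    """Scale each column to zero mean and unit (population) standard deviation."""
    columns = list(zip(*rows))
    means = [statistics.fmean(c) for c in columns]
    # Guard against constant columns, whose standard deviation is zero
    stds = [statistics.pstdev(c) or 1.0 for c in columns]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]


def euclidean(a, b):
    """Euclidean distance for vectors of any (equal) length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# max_date_diff, max_pay, cnt_time for the first three rows of the sample output
rows = [[23.0, 43068.0, 15.0],
        [10.0, 1899.0, 2.0],
        [146.0, 3299.0, 21.0]]

raw_distance = euclidean(rows[0], rows[1])
scaled = standardise(rows)
scaled_distance = euclidean(scaled[0], scaled[1])
print(raw_distance, scaled_distance)
```

Before scaling, the distance is almost entirely the difference in max_pay; after scaling, all three columns contribute on a comparable footing.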

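# Count the samples on each side of the distance computation
row_l1 = aimed_date_new.iloc[:, 0].count()
row_l2 = aimed_date.iloc[:, 0].count()
a = zeros((row_l1, row_l2))
a = pd.DataFrame(a)
# Compute the distance matrix: one row per centre, one column per minority sample
for i in range(row_l1):
    for j in range(row_l2):
        d = dist(aimed_date_new.iloc[i, :], aimed_date.iloc[j, :])
        a.iloc[i, j] = d  # positional assignment (.ix is not available in newer pandas)
# Smallest distance in each row, i.e. for each centre
b = a.T.apply(lambda x: x.min())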

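Calling the distance function above produces a distance matrix; b then holds the smallest distance in each row, that is, for each centre point.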
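A compact way to picture the distance matrix and the nearest-neighbour lookup is the sketch below. It is an illustration under assumptions (hypothetical points and helper names), and it differs from the code above in one deliberate respect: it excludes each centre's own column, since a centre drawn from the same set would otherwise be its own nearest point at distance zero.

```python
import math


def distance_matrix(centres, points):
    """Rows are centres, columns are points, entries are Euclidean distances."""
    return [[math.dist(c, p) for p in points] for c in centres]


def nearest_other(row, self_index):
    """Column index of the smallest distance in row, ignoring the centre itself."""
    candidates = [(d, j) for j, d in enumerate(row) if j != self_index]
    return min(candidates)[1]


# Hypothetical minority points; points 0 and 3 act as centres
points = [[0.0, 0.0], [1.0, 0.0], [0.0, 3.0], [5.0, 5.0]]
centre_ids = [0, 3]
centres = [points[i] for i in centre_ids]

a = distance_matrix(centres, points)
nearest = [nearest_other(a[k], centre_ids[k]) for k in range(len(centres))]
print(nearest)
```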

# Find the positions of the nearest same-class points
h = []
z = []
for i in range(number):
    for j in range(len(a.iloc[i, :])):
        ai = a.iloc[i, j]
        bi = b[i]
        if ai == bi:
            h.append(i)
            z.append(j)
        else:
            continue
# Collect the matching rows as columns; the placeholder column is dropped at the end
new_point = [0, 0, 0, 0]
new_point = pd.DataFrame(new_point)
for i in range(len(h)):
    index_a = z[i]
    new = aimed_date.iloc[index_a, :]
    new_point = pd.concat([new, new_point], axis=1)
new_point = new_point.iloc[:, range(len(new_point.columns) - 1)]

Once the positions are known, the actual rows are looked up by position in the original dataset.

import random

# One random factor in [0, 1] per neighbour
r1 = []
for i in range(len(new_point.columns)):
    r1.append(random.uniform(0, 1))
new_point_last = []
new_point_last = pd.DataFrame(new_point_last)
# Compute the new points: new_x = old_x + rand() * (append_x - old_x)
for i in range(len(new_point.columns)):
    new_x = (new_point.iloc[1:4, i] - aimed_date_new.iloc[number - 1 - i, 1:4]) * r1[i] + aimed_date_new.iloc[number - 1 - i, 1:4]
    new_point_last = pd.concat([new_point_last, new_x], axis=1)
print(new_point_last)

Finally, the new points are computed with the SMOTE formula new_x = old_x + rand() * (append_x - old_x). That concludes the Python exercise.
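A tiny worked example of the formula may help. The function name `interpolate` and the numbers are hypothetical; the variable names old_x and append_x follow the formula in the text.

```python
def interpolate(old_x, append_x, r):
    """new_x = old_x + r * (append_x - old_x), applied per coordinate."""
    return [o + r * (a - o) for o, a in zip(old_x, append_x)]


old_x = [1.0, 2.0, 3.0]      # the chosen centre point
append_x = [3.0, 6.0, 7.0]   # its selected nearest neighbour

start = interpolate(old_x, append_x, 0.0)     # r = 0 gives old_x
midpoint = interpolate(old_x, append_x, 0.5)  # r = 0.5 gives the midpoint
end = interpolate(old_x, append_x, 1.0)       # r = 1 gives append_x
print(start, midpoint, end)
```

Any r strictly between 0 and 1 gives a point strictly between the two samples, so the random factor only decides how far along the segment the new point sits.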

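In fact, building on this result, we can combine SMOTE with Tomek links to make an integrated data-expansion algorithm. The idea is as follows.
Suppose the algorithm above has produced two new data points, shown as cyan squares:

We treat each newly generated cyan point and its nearest non-cyan sample as a Tomek link pair, such as the cyan and blue points boxed in the figure below.

We can then define a rule:
Take the new point as the centre and the Tomek link distance as the radius to mark out a region. If (number of minority samples) / (number of majority samples) inside the region is below a minimum threshold, the new point is a "junk point" and should be discarded or regenerated with SMOTE. If the ratio is at or above the threshold, keep the point and add it to the initial minority sample set from which SMOTE draws.
So the new cyan point on the left is discarded and only the new point on the right is kept, as shown below: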
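The rule above can be sketched as a small filter. This is one possible reading of the rule, not a reference implementation: the function name `keep_synthetic`, the threshold of 0.5, the toy points and the choice to keep a point when the region contains no majority samples at all are assumptions made for this example.

```python
import math


def keep_synthetic(new_point, minority, majority, threshold):
    """Apply the minority/majority ratio rule inside the Tomek-link radius."""
    # Radius: distance from the new point to its nearest original sample
    radius = min(math.dist(new_point, p) for p in minority + majority)
    n_min = sum(math.dist(new_point, p) <= radius for p in minority)
    n_maj = sum(math.dist(new_point, p) <= radius for p in majority)
    if n_maj == 0:
        return True  # assumption: a region with no majority samples is safe
    return n_min / n_maj >= threshold


# Hypothetical original samples and two synthetic candidates
minority = [[4.0, 4.0], [5.0, 4.5]]
majority = [[0.0, 0.0], [1.0, 0.5], [0.5, 1.0]]
left = [1.0, 1.0]    # lands among majority samples
right = [4.5, 4.2]   # lands among minority samples

keep_left = keep_synthetic(left, minority, majority, threshold=0.5)
keep_right = keep_synthetic(right, minority, majority, threshold=0.5)
print(keep_left, keep_right)
```

Here the left candidate is rejected because its nearest originals are majority samples, and the right candidate is kept.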

References:

https://www.jair.org/media/953/live-953-2037-jair.pdf
https://github.com/fmfn/UnbalancedDataset
Batista, G. E., Bazzan, A. L., & Monard, M. C. (2003, December). Balancing Training Data for Automated Annotation of Keywords: a Case Study. In WOB (pp. 10-18).
Batista, G. E., Prati, R. C., & Monard, M. C. (2004). A study of the behavior of several methods for balancing machine learning training data. ACM Sigkdd Explorations Newsletter, 6(1), 20-29.
