【IJCAI-2018】Search Advertising - Imbalanced Data
I am not good at competitions, nor at feature engineering or parameter tuning, and I have no server for parallel training. Everyone's baseline beats my model. I am writing this post mainly to share my understanding of the data and the rough framework I have been thinking through, in the hope that it offers at least a little inspiration or help.
As someone with no experience, no track record, and no teammates - whose feature engineering stops at dummy variables, whose dimensionality reduction stops at PCA, whose models stop at LR and SVM, whose tuning stops at CV, and whose ensembling stops at averaging - my main contribution to any competition is inflating the denominator. So when I saw people sharing their baselines on the forum, I was genuinely thrilled; watching them construct all kinds of ingenious features that actually improved the model's logloss, I was deeply impressed. Being neither clever nor especially meticulous, I went all in on "borrowing": I copied a baseline from the forum and ran it on my machine. Wow - 0.084 on the first run, only 0.004 behind the top score of 0.080. Surely I was about to storm the leaderboard! Riding that excitement I dug into models and code with my own BGM playing and wind at my heels, as if the whole world were mine. But... when I went to look at which samples were predicted as 1, I was stunned: not one! Out of my 18,000 test rows, the model did not predict a single 1. Not one.

And that is the problem. The clever among you probably knew about the class-imbalance problem in CTR data long ago, but dull as I am, I had not even noticed!
End of rant.
So for imbalanced data like this binary CTR prediction task, how should we handle it?
The negative-to-positive ratio is severely imbalanced, roughly 50:1. Predicting directly on such data yields extremely low recall on the minority class.
The reason is that traditional learning methods aim to minimize the overall classification error and treat all samples identically, so the classifier ends up accurate on the majority class and very inaccurate on the minority class. In the 50:1 CTR example, an algorithm that predicts everything as the majority class still reaches about 98% accuracy (50/51), which is why traditional algorithms have serious limitations on imbalanced data. Their predictions favor the majority: the minority is small, and with all samples weighted equally, the cost of missing it is tiny, so the model simply sides with the majority.
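To see this accuracy trap concretely, here is a toy sketch (synthetic 50:1 labels, not the competition data): a model that always predicts the majority class scores about 98% accuracy yet 0% recall on the minority class.

import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical 50:1 labels: 5000 negatives, 100 positives.
y_true = np.array([0] * 5000 + [1] * 100)
y_pred = np.zeros_like(y_true)   # "always predict the majority class"

print(accuracy_score(y_true, y_pred))  # ~0.98, looks great
print(recall_score(y_true, y_pred))    # 0.0, the minority is never found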
Solutions fall into two main categories.
The first approaches the problem from the data side, mainly via resampling: since the samples are imbalanced, we can sample by some strategy to make the training data more balanced. Resampling methods include over-sampling, under-sampling, and combinations of the two: over-sampling increases the number of minority samples, while under-sampling decreases the number of majority samples.
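As a minimal sketch (assuming the imbalanced-learn package with its fit_resample API; synthetic data stands in for the CTR features):

from sklearn.datasets import make_classification
from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler

# Synthetic data with ~50:1 class skew, standing in for the CTR features.
X, y = make_classification(n_samples=5100, weights=[50 / 51], random_state=0)

# Over-sampling: replicate minority rows until both classes are equal in size.
X_over, y_over = RandomOverSampler(random_state=0).fit_resample(X, y)

# Under-sampling: drop majority rows until both classes are equal in size.
X_under, y_under = RandomUnderSampler(random_state=0).fit_resample(X, y)

print(len(y), len(y_over), len(y_under))  # original vs. grown vs. shrunk

The attached script's model_baseline3 uses the ensemble variant of this idea: BalancedBaggingClassifier under-samples the majority class inside each bagging iteration.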
The second approaches it from the algorithm side: account for the different costs of different misclassifications and optimize the algorithm accordingly, so that it still performs well on imbalanced data. In other words, rewrite the cost function to assign a large cost to misclassifying minority labels.
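A minimal cost-sensitive sketch with LightGBM's sklearn API (the 50x weight simply mirrors the rough 50:1 ratio; it is an illustrative, untuned choice):

import lightgbm as lgb
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=5100, weights=[50 / 51], random_state=0)

# Misclassifying a rare positive now costs ~50x as much as a negative.
clf = lgb.LGBMClassifier(class_weight={0: 1.0, 1: 50.0}, random_state=0)
clf.fit(X, y)

print(clf.predict(X).sum())  # typically predicts some positives now

The attached script takes the mirrored route: instead of up-weighting positives, it passes a sample_weight that down-weights negatives to 0.02.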
PS: The attached script below compares the baselines on logloss and AUC. It is runnable and does not hit memory errors (the csv is read in chunks).

# -*- coding: utf-8 -*-
"""
Created on Wed Apr 4 10:53:58 2018
@author : HaiyanJiang
@email : jianghaiyan.cn@gmail.com

what does the doc do?
some ideas of improving the accuracy of imbalanced data classification.
data characteristics:
    imbalanced data.
the models:
    model_baseline : lgb
    model_baseline2 : another lgb
    model_baseline3 : bagging

Other Notes:
Besides the basic features, the data also include the user's click-count
statistics within the current hour and the current day, as well as the
current hour itself:
    'context_day', 'context_hour',
    'user_query_day', 'user_query_hour', 'user_query_day_hour',
non_feat = [
    'instance_id', 'user_id', 'context_id', 'item_category_list',
    'item_property_list', 'predict_category_property',
    'context_timestamp', 'TagTime', 'context_day'
    ]
"""

import time
import itertools

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import lightgbm as lgb

from sklearn.metrics import log_loss
from sklearn.metrics import confusion_matrix
from sklearn.metrics import auc, roc_curve
from sklearn.ensemble import BaggingClassifier
from imblearn.ensemble import BalancedBaggingClassifier

# Note: the original 'from scipy import interp' has been removed from recent
# SciPy releases; np.interp is the drop-in replacement used below.


def read_bigcsv(filename, **kw):
    """Read a large csv in chunks so that it does not raise a MemoryError."""
    with open(filename) as rf:
        reader = pd.read_csv(rf, **kw, iterator=True)
        chunkSize = 100000
        chunks = []
        while True:
            try:
                chunk = reader.get_chunk(chunkSize)
                chunks.append(chunk)
            except StopIteration:
                print("Iteration is stopped.")
                break
    df = pd.concat(chunks, axis=0, join='outer', ignore_index=True)
    return df


def timestamp2datetime(value):
    value = time.localtime(value)
    dt = time.strftime('%Y-%m-%d %H:%M:%S', value)
    return dt


'''
Exploration notes:
from matplotlib import pyplot as plt
tt = data['context_timestamp']
plt.plot(tt)
# The timestamps are not fully sorted; there is some disorder. For an
# online model, the data must be sorted by time first.
# aa = data[data['user_id'] == 24779788309075]
aa = data_train[data_train.duplicated(subset=None, keep='first')]
bb = data_train[data_train.duplicated(subset=None, keep='last')]
cc = data_train[data_train.duplicated(subset=None, keep=False)]

a2 = pd.DataFrame(train_id)[pd.DataFrame(train_id).duplicated(keep=False)]
b2 = train_id[train_id.duplicated(keep='last')]
c2 = train_id[train_id.duplicated(keep=False)]

c2 = data_train[data_train.duplicated(subset=None, keep=False)]

Verified: 'instance_id' has duplicates, e.g.
a3 = Xdata[Xdata['instance_id'] == 1037061371711078396]
'''


def convert_timestamp(data):
    '''
    1. convert timestamp to datetime.
    2. no sort, no reindex.
    data.duplicated(subset=None, keep='first')
    TagTime from-to is ('2018-09-18 00:00:01', '2018-09-24 23:59:47')
    'user_query_day', 'user_query_day_hour', 'hour',
    np.corrcoef(data['user_query_day'], data['user_query_hour'])
    np.corrcoef(data['user_query_hour'], data['user_query_day_hour'])
    np.corrcoef(data['user_query_day'], data['user_query_day_hour'])
    '''
    data['TagTime'] = data['context_timestamp'].apply(timestamp2datetime)
    # data['TagTime'][0], data['TagTime'][len(data) - 1]
    data['context_day'] = data['TagTime'].apply(lambda x: int(x[8:10]))
    data['context_hour'] = data['TagTime'].apply(lambda x: int(x[11:13]))
    # per-user click counts by day, by hour, and by (day, hour)
    query_day = data.groupby(['user_id', 'context_day']).size(
        ).reset_index().rename(columns={0: 'user_query_day'})
    data = pd.merge(data, query_day, 'left', on=['user_id', 'context_day'])
    query_hour = data.groupby(['user_id', 'context_hour']).size(
        ).reset_index().rename(columns={0: 'user_query_hour'})
    data = pd.merge(data, query_hour, 'left', on=['user_id', 'context_hour'])
    query_day_hour = data.groupby(
        by=['user_id', 'context_day', 'context_hour']).size(
        ).reset_index().rename(columns={0: 'user_query_day_hour'})
    data = pd.merge(data, query_day_hour, 'left',
                    on=['user_id', 'context_day', 'context_hour'])
    return data


def plot_confusion_matrix(cm, classes, normalize=False,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting 'normalize=True'.
    """
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        print("Normalized confusion matrix")
    else:
        print('Confusion matrix, without normalization')
    print(cm)
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)
    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt),
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")
    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')


def data_baseline():
    filename = '../round1_ijcai_18_data/round1_ijcai_18_train_20180301.txt'
    data = read_bigcsv(filename, sep=' ')
    # data = pd.read_csv(filename, sep=' ')
    data.drop_duplicates(inplace=True)
    data.reset_index(drop=True, inplace=True)  # very important
    data = convert_timestamp(data)
    train = data.loc[data['context_day'] < 24]  # days 18,19,20,21,22,23
    test = data.loc[data['context_day'] == 24]  # day 24 as validation for now
    features = [
        'item_id', 'item_brand_id', 'item_city_id', 'item_price_level',
        'item_sales_level', 'item_collected_level', 'item_pv_level',
        'user_gender_id', 'user_age_level', 'user_occupation_id',
        'user_star_level', 'context_page_id', 'shop_id',
        'shop_review_num_level', 'shop_review_positive_rate',
        'shop_star_level', 'shop_score_service',
        'shop_score_delivery', 'shop_score_description',
        'user_query_day', 'user_query_day_hour', 'context_hour',
        ]
    x_train = train[features]
    x_test = test[features]
    y_train = train['is_trade']
    y_test = test['is_trade']
    return x_train, x_test, y_train, y_test


# x_train, x_test, y_train, y_test = data_baseline()


def model_baseline(x_train, y_train, x_test, y_test):
    cat_names = [
        'item_price_level',
        'item_sales_level',
        'item_collected_level',
        'item_pv_level',
        'user_gender_id',
        'user_age_level',
        'user_occupation_id',
        'user_star_level',
        'context_page_id',
        'shop_review_num_level',
        'shop_star_level',
        ]
    print("begin train...")
    kw_lgb = dict(num_leaves=63, max_depth=7, n_estimators=80, random_state=6)
    clf = lgb.LGBMClassifier(**kw_lgb)
    clf.fit(x_train, y_train, categorical_feature=cat_names)
    prob = clf.predict_proba(x_test)[:, 1]
    predict_score = [float('%.2f' % x) for x in prob]
    loss_val = log_loss(y_test, predict_score)
    # print(loss_val)  # 0.0848226750637
    fpr, tpr, thresholds = roc_curve(y_test, predict_score)
    mean_fpr = np.linspace(0, 1, 100)
    mean_tpr = np.interp(mean_fpr, fpr, tpr)
    x_auc = auc(fpr, tpr)
    fig = plt.figure('fig1')
    ax = fig.add_subplot(1, 1, 1)
    name = 'base_lgb'
    plt.plot(mean_fpr, mean_tpr, linestyle='--',
             label='{} (area = %0.2f, logloss = %0.2f)'.format(name) %
             (x_auc, loss_val), lw=2)
    y_pred = clf.predict(x_test)
    cm1 = plt.figure()
    cm = confusion_matrix(y_test, y_pred)
    plot_confusion_matrix(cm, classes=[0, 1], title='Confusion matrix base1')
    # retrain with sample weights set according to the labels
    clf = lgb.LGBMClassifier(**kw_lgb)
    clf.fit(x_train, y_train,
            sample_weight=[1 if y == 1 else 0.02 for y in y_train],
            categorical_feature=cat_names)
    prob = clf.predict_proba(x_test)[:, 1]
    predict_score = [float('%.2f' % x) for x in prob]
    loss_val = log_loss(y_test, predict_score)
    fpr, tpr, thresholds = roc_curve(y_test, predict_score)
    mean_fpr = np.linspace(0, 1, 100)
    mean_tpr = np.interp(mean_fpr, fpr, tpr)
    x_auc = auc(fpr, tpr)
    name = 'base_lgb_weighted'
    plt.figure('fig1')  # switch back to the ROC figure
    plt.plot(
        mean_fpr, mean_tpr, linestyle='--',
        label='{} (area = %0.2f, logloss = %0.2f)'.format(name) %
        (x_auc, loss_val), lw=2)
    y_pred = clf.predict(x_test)
    cm2 = plt.figure()
    cm = confusion_matrix(y_test, y_pred)
    plot_confusion_matrix(cm, classes=[0, 1],
                          title='Confusion matrix basemodel')
    plt.figure('fig1')  # switch back to the ROC figure
    plt.plot([0, 1], [0, 1], linestyle='--', lw=2, color='k', label='Luck')
    # make nice plotting
    ax.spines['top'].set_visible(False)
    ax.spines['right'].set_visible(False)
    ax.get_xaxis().tick_bottom()
    ax.get_yaxis().tick_left()
    ax.spines['left'].set_position(('outward', 10))
    ax.spines['bottom'].set_position(('outward', 10))
    plt.xlim([0, 1])
    plt.ylim([0, 1])
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('Receiver Operating Characteristic')
    plt.legend(loc="lower right")
    plt.show()
    return cm1, cm2, fig


def model_baseline3(x_train, y_train, x_test, y_test):
    # plain bagging vs. bagging with per-estimator random under-sampling
    bagging = BaggingClassifier(random_state=0)
    balanced_bagging = BalancedBaggingClassifier(random_state=0)
    bagging.fit(x_train, y_train)
    balanced_bagging.fit(x_train, y_train)
    prob = bagging.predict_proba(x_test)[:, 1]
    predict_score = [float('%.2f' % x) for x in prob]
    loss_val = log_loss(y_test, predict_score)
    fpr, tpr, thresholds = roc_curve(y_test, predict_score)
    mean_fpr = np.linspace(0, 1, 100)
    mean_tpr = np.interp(mean_fpr, fpr, tpr)
    x_auc = auc(fpr, tpr)
    fig = plt.figure('Bagging')
    ax = fig.add_subplot(1, 1, 1)
    name = 'base_Bagging'
    plt.plot(mean_fpr, mean_tpr, linestyle='--',
             label='{} (area = %0.2f, logloss = %0.2f)'.format(name) %
             (x_auc, loss_val), lw=2)
    y_pred_bagging = bagging.predict(x_test)
    cm_bagging = confusion_matrix(y_test, y_pred_bagging)
    cm1 = plt.figure()
    plot_confusion_matrix(cm_bagging,
                          classes=[0, 1],
                          title='Confusion matrix of BaggingClassifier')
    # balanced_bagging
    prob = balanced_bagging.predict_proba(x_test)[:, 1]
    predict_score = [float('%.2f' % x) for x in prob]
    loss_val = log_loss(y_test, predict_score)
    fpr, tpr, thresholds = roc_curve(y_test, predict_score)
    mean_fpr = np.linspace(0, 1, 100)
    mean_tpr = np.interp(mean_fpr, fpr, tpr)
    x_auc = auc(fpr, tpr)
    plt.figure('Bagging')  # switch back to the ROC figure
    name = 'base_Balanced_Bagging'
    plt.plot(
        mean_fpr, mean_tpr, linestyle='--',
        label='{} (area = %0.2f, logloss = %0.2f)'.format(name) %
        (x_auc, loss_val), lw=2)
    y_pred_balanced_bagging = balanced_bagging.predict(x_test)
    cm_balanced_bagging = confusion_matrix(y_test, y_pred_balanced_bagging)
    cm2 = plt.figure()
    plot_confusion_matrix(cm_balanced_bagging,
                          classes=[0, 1],
                          title='Confusion matrix of BalancedBagging')
    plt.figure('Bagging')  # switch back to the ROC figure
    plt.plot([0, 1], [0, 1], linestyle='--', lw=2, color='k', label='Luck')
    # make nice plotting
    ax.spines['top'].set_visible(False)
    ax.spines['right'].set_visible(False)
    ax.get_xaxis().tick_bottom()
    ax.get_yaxis().tick_left()
    ax.spines['left'].set_position(('outward', 10))
    ax.spines['bottom'].set_position(('outward', 10))
    plt.xlim([0, 1])
    plt.ylim([0, 1])
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('Receiver Operating Characteristic')
    plt.legend(loc="lower right")
    plt.show()
    return cm1, cm2, fig


def model_baseline2(x_train, y_train, x_test, y_test):
    params = {
        'task': 'train',
        'boosting_type': 'gbdt',
        'objective': 'multiclass',
        'num_class': 2,
        'verbose': 0,
        'metric': 'multi_logloss',  # 'logloss' is not a valid metric name
        'max_bin': 255,
        'max_depth': 7,
        'learning_rate': 0.3,
        'nthread': 4,
        'num_leaves': 63,
        'feature_fraction': 0.8,
        # the original params also set 'n_estimators': 85, an alias of
        # num_boost_round; only one of the two aliases is kept here
        'num_boost_round': 160,
    }
    lgb_train = lgb.Dataset(x_train, label=y_train)
    lgb_eval = lgb.Dataset(x_test, label=y_test, reference=lgb_train)
    print("begin train...")
    bst = lgb.train(params, lgb_train, valid_sets=lgb_eval)
    prob = bst.predict(x_test)[:, 1]
    predict_score = [float('%.2f' % x) for x in prob]
    loss_val = log_loss(y_test, predict_score)
    y_pred = [1 if x > 0.5 else 0 for x in predict_score]
    fpr, tpr, thresholds = roc_curve(y_test, predict_score)
    x_auc = auc(fpr, tpr)
    mean_fpr = np.linspace(0, 1, 100)
    mean_tpr = np.interp(mean_fpr, fpr, tpr)
    fig = plt.figure('weighted')
    ax = fig.add_subplot(1, 1, 1)
    name = 'base_lgb'
    plt.plot(mean_fpr, mean_tpr, linestyle='--',
             label='{} (area = %0.2f, logloss = %0.2f)'.format(name) %
             (x_auc, loss_val), lw=2)
    cm1 = plt.figure()
    cm = confusion_matrix(y_test, y_pred)
    plot_confusion_matrix(cm, classes=[0, 1],
                          title='Confusion matrix basemodel')
    # retrain with sample weights set according to the labels
    lgb_train = lgb.Dataset(
        x_train, label=y_train,
        weight=[1 if y == 1 else 0.02 for y in y_train])
    lgb_eval = lgb.Dataset(
        x_test, label=y_test, reference=lgb_train,
        weight=[1 if y == 1 else 0.02 for y in y_test])
    bst = lgb.train(params, lgb_train, valid_sets=lgb_eval)
    prob = bst.predict(x_test)[:, 1]
    predict_score = [float('%.2f' % x) for x in prob]
    loss_val = log_loss(y_test, predict_score)
    y_pred = [1 if x > 0.5 else 0 for x in predict_score]
    fpr, tpr, thresholds = roc_curve(y_test, predict_score)
    mean_fpr = np.linspace(0, 1, 100)
    mean_tpr = np.interp(mean_fpr, fpr, tpr)
    x_auc = auc(fpr, tpr)
    plt.figure('weighted')  # switch back to the ROC figure
    name = 'base_lgb_weighted'
    plt.plot(
        mean_fpr, mean_tpr, linestyle='--',
        label='{} (area = %0.2f, logloss = %0.2f)'.format(name) %
        (x_auc, loss_val), lw=2)
    cm2 = plt.figure()
    cm = confusion_matrix(y_test, y_pred)
    plot_confusion_matrix(cm, classes=[0, 1],
                          title='Confusion matrix basemodel')
    plt.figure('weighted')  # switch back to the ROC figure
    plt.plot([0, 1], [0, 1], linestyle='--', lw=2, color='k', label='Luck')
    # make nice plotting
    ax.spines['top'].set_visible(False)
    ax.spines['right'].set_visible(False)
    ax.get_xaxis().tick_bottom()
    ax.get_yaxis().tick_left()
    ax.spines['left'].set_position(('outward', 10))
    ax.spines['bottom'].set_position(('outward', 10))
    plt.xlim([0, 1])
    plt.ylim([0, 1])
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('Receiver Operating Characteristic')
    plt.legend(loc="lower right")
    plt.show()
    return cm1, cm2, fig


'''
Notes:
1. logloss VS AUC
Although the baseline logloss = 0.0819 is indeed very small, the confusion
matrix shows that the model tends to put every sample into the majority
class (y_pred is all 0); adding label weights seems to make it slightly
better.

The AUC is only 0.64~0.67. Such a low AUC seems surprising - why?
Because the labels are extremely imbalanced: positives are only about 2%
of the data, roughly 50:1.
AUC measures classification performance on imbalanced data more fairly,
so selecting features by AUC may well give better results.
This is only offered as a rough direction for improvement.
2. handling imbalanced data:
    1. resampling, over- or under-:
       over- is increasing # of minority, under- is decreasing # of majority.
    2. revalue the loss function by giving a large loss to misclassifying
       the minority labels.
'''


if __name__ == "__main__":
    x_train, x_test, y_train, y_test = data_baseline()
    cm11, cm12, fig1 = model_baseline(x_train, y_train, x_test, y_test)
    cm21, cm22, fig2 = model_baseline2(x_train, y_train, x_test, y_test)
    cm31, cm32, fig3 = model_baseline3(x_train, y_train, x_test, y_test)

    fig1.savefig('./base_lgb_weighted.jpg', format='jpg')
    cm11.savefig('./Confusion matrix1.jpg', format='jpg')
    cm12.savefig('./Confusion matrix2.jpg', format='jpg')
