[IJCAI-2018] Search Advertising: Imbalanced Data

I'm not good at competitions, not good at constructing features, not good at tuning hyperparameters, and I have no server to parallelize on. Everyone's baseline beats my model. I'm writing this post mainly to share my understanding of the data and the rough framework I've been thinking through, in the hope that it offers at least a little inspiration or help.

 

For someone like me with no experience, no track record, and no teammates, whose feature engineering stops at dummy variables, whose dimensionality reduction is PCA, whose models are LR and SVM, whose tuning is CV, and whose ensembling is a plain average, my role in every competition is to pad out the denominator. So when I saw people sharing their baselines on the forum, I was genuinely thrilled, and when I watched them construct all sorts of magical features that actually improved the model's logloss, I was full of admiration. Since I have neither a clever mind nor sufficient patience, I went all-in on "borrowing": I copied a baseline from the forum and ran it on my machine. Wow, impressive, 0.084 on the first try, only 0.004 behind the leaderboard's best of 0.080. Clearly I was on my way up the leaderboard! Energized, I ground away hunting for models and hammering out code, feeling like I had my own background music and wind at my heels, like the whole world belonged to me. But... when I went to look at which samples were predicted as 1, I was stunned: no one! Out of my 18,000 test samples, the model did not predict a single 1. Not one.

[Figure: confusion matrix of the baseline model; every test sample is predicted as class 0.]

And that's the problem. The cleverer among you probably knew all about the class-imbalance issue in CTR data long ago, but dim as I am, I hadn't even noticed! (In hindsight it's obvious: with only about 2% positives, the predicted probabilities all sit well below the default 0.5 decision threshold, so thresholding at 0.5 yields no positive predictions at all.)

OK, rant over~

 

So for imbalanced data, such as the binary classification in this CTR task, how should we handle it?

The positive and negative samples are severely imbalanced: negatives outnumber positives by roughly 50:1. If you make predictions directly on this data, recall on the minority class will be extremely low.

 

This happens because traditional learning methods aim to minimize overall classification error and treat every sample equally, so the resulting classifier has high accuracy on the majority class and very low accuracy on the minority class. In our CTR example with a 50:1 negative-to-positive ratio, a model that predicts everything as the majority class still reaches about 98% accuracy (50/51). Traditional learning algorithms are therefore badly limited on imbalanced datasets: since the minority class is small and every sample carries equal weight, the cost of missing the minority is tiny, and the learned model simply favors the majority. A small demo follows.
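To make the point concrete, here is a minimal, self-contained sketch (my own illustration, not part of the attached competition code): on a synthetic 50:1 dataset, a "classifier" that always predicts the majority class scores about 98% accuracy while its recall on the minority class is exactly zero.

    import numpy as np
    from sklearn.metrics import accuracy_score, recall_score

    rng = np.random.RandomState(0)
    # ~50:1 negatives to positives, mimicking the CTR label ratio
    y_true = (rng.rand(51000) < 1 / 51).astype(int)
    y_pred = np.zeros_like(y_true)  # "always predict the majority class"

    print(accuracy_score(y_true, y_pred))  # ~0.98, looks great
    print(recall_score(y_true, y_pred))    # 0.0, useless on the minority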

 

The solutions fall into two main families.

The first family works at the data level, mainly through sampling: since our samples are imbalanced, we can sample by some strategy to make the data more balanced. Resampling methods include over-sampling, under-sampling, and combinations of the two: over-sampling increases the number of minority samples, while under-sampling decreases the number of majority samples. See the sketch below.
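As a sketch of this data-level approach, here is how the imbalanced-learn package (the attached code already imports it for BalancedBaggingClassifier) can rebalance a training set; the toy dataset is a stand-in for the real CTR data, and note that fit_resample was called fit_sample in imblearn versions before 0.4.

    from collections import Counter
    from sklearn.datasets import make_classification
    from imblearn.over_sampling import RandomOverSampler
    from imblearn.under_sampling import RandomUnderSampler

    # A toy 50:1 dataset standing in for the CTR training set.
    X, y = make_classification(n_samples=51000, weights=[50 / 51],
                               random_state=0)

    ros = RandomOverSampler(random_state=0)   # over-: replicate minority rows
    X_over, y_over = ros.fit_resample(X, y)

    rus = RandomUnderSampler(random_state=0)  # under-: drop majority rows
    X_under, y_under = rus.fit_resample(X, y)

    # Both resampled sets are now balanced at 1:1.
    print(Counter(y), Counter(y_over), Counter(y_under))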

 

The second family works at the algorithm level: it optimizes the algorithm by accounting for the different costs of different misclassifications, so that it still performs well on imbalanced data. Concretely, rewrite the cost function to assign a large cost to misclassifying the minority label, as in the sketch below.
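This is the approach the attached code takes with LightGBM's sample_weight: majority-class rows are down-weighted to roughly the class ratio (0.02 ≈ 1/50), so a minority example costs about 50x as much to misclassify. A minimal sketch of the idea, with a toy dataset standing in for the real x_train and y_train:

    import lightgbm as lgb
    from sklearn.datasets import make_classification

    # Stand-in for the CTR data; in the attached code these come
    # from data_baseline().
    x_train, y_train = make_classification(
        n_samples=51000, weights=[50 / 51], random_state=0)

    # Mirror the attached code: weight 1 for the minority (y == 1),
    # 0.02 for the majority, i.e. a 50:1 misclassification cost ratio.
    weights = [1.0 if y == 1 else 0.02 for y in y_train]

    clf = lgb.LGBMClassifier(num_leaves=63, max_depth=7, n_estimators=80)
    clf.fit(x_train, y_train, sample_weight=weights)

    # Alternative: let LightGBM derive weights from class frequencies.
    # clf = lgb.LGBMClassifier(class_weight='balanced')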

PS: The attachment contains Python code comparing the models on both logloss and AUC. It runs as-is and won't hit a memory error.

[Figure: ROC curves comparing base_lgb and base_lgb_weighted, saved as base_lgb_weighted.jpg by the code below.]

 

# -*- coding: utf-8 -*-
"""
Created on Wed Apr 4 10:53:58 2018
@author : HaiyanJiang
@email : jianghaiyan.cn@gmail.com

What does this doc do?
Some ideas for improving the accuracy of imbalanced-data classification.

Data characteristics:
    imbalanced data.
The models:
    model_baseline  : lgb
    model_baseline2 : another lgb
    model_baseline3 : bagging

Other notes:
Besides the basic features, the data also includes the user's click-count
statistics within the current hour and within the current day, as well as
the current hour itself:
    'context_day', 'context_hour',
    'user_query_day', 'user_query_hour', 'user_query_day_hour',
non_feat = [
    'instance_id', 'user_id', 'context_id', 'item_category_list',
    'item_property_list', 'predict_category_property',
    'context_timestamp', 'TagTime', 'context_day'
    ]
"""

import time
import itertools

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import lightgbm as lgb
from sklearn.metrics import log_loss
from sklearn.metrics import confusion_matrix
from sklearn.metrics import auc, roc_curve
from sklearn.ensemble import BaggingClassifier
from imblearn.ensemble import BalancedBaggingClassifier

# Note: 'from scipy import interp' was an alias of np.interp and has been
# removed from recent SciPy releases, so np.interp is used below instead.


def read_bigcsv(filename, **kw):
    """Read a large csv in chunks to avoid a memory error."""
    with open(filename) as rf:
        reader = pd.read_csv(rf, **kw, iterator=True)
        chunkSize = 100000
        chunks = []
        while True:
            try:
                chunk = reader.get_chunk(chunkSize)
                chunks.append(chunk)
            except StopIteration:
                print("Iteration is stopped.")
                break
    df = pd.concat(chunks, axis=0, join='outer', ignore_index=True)
    return df


def timestamp2datetime(value):
    value = time.localtime(value)
    dt = time.strftime('%Y-%m-%d %H:%M:%S', value)
    return dt


'''
Exploration notes:

from matplotlib import pyplot as plt
tt = data['context_timestamp']
plt.plot(tt)
# The timestamps are not fully sorted; there is some misordering. If you
# build an online model, be sure to sort by time first.
# aa = data[data['user_id'] == 24779788309075]
aa = data_train[data_train.duplicated(subset=None, keep='first')]
bb = data_train[data_train.duplicated(subset=None, keep='last')]
cc = data_train[data_train.duplicated(subset=None, keep=False)]

a2 = pd.DataFrame(train_id)[pd.DataFrame(train_id).duplicated(keep=False)]
b2 = train_id[train_id.duplicated(keep='last')]
c2 = train_id[train_id.duplicated(keep=False)]

c2 = data_train[data_train.duplicated(subset=None, keep=False)]

# Verified: 'instance_id' has duplicates, e.g.
a3 = Xdata[Xdata['instance_id'] == 1037061371711078396]
'''


def convert_timestamp(data):
    '''
    1. convert timestamp to datetime.
    2. no sort, no reindex.
    data.duplicated(subset=None, keep='first')
    TagTime from-to is ('2018-09-18 00:00:01', '2018-09-24 23:59:47')
    'user_query_day', 'user_query_day_hour', 'hour',
    np.corrcoef(data['user_query_day'], data['user_query_hour'])
    np.corrcoef(data['user_query_hour'], data['user_query_day_hour'])
    np.corrcoef(data['user_query_day'], data['user_query_day_hour'])
    '''
    data['TagTime'] = data['context_timestamp'].apply(timestamp2datetime)
    data['context_day'] = data['TagTime'].apply(lambda x: int(x[8:10]))
    data['context_hour'] = data['TagTime'].apply(lambda x: int(x[11:13]))
    # per-user click counts by day, by hour, and by (day, hour)
    query_day = data.groupby(['user_id', 'context_day']).size(
        ).reset_index().rename(columns={0: 'user_query_day'})
    data = pd.merge(data, query_day, 'left', on=['user_id', 'context_day'])
    query_hour = data.groupby(['user_id', 'context_hour']).size(
        ).reset_index().rename(columns={0: 'user_query_hour'})
    data = pd.merge(data, query_hour, 'left', on=['user_id', 'context_hour'])
    query_day_hour = data.groupby(
        by=['user_id', 'context_day', 'context_hour']).size(
        ).reset_index().rename(columns={0: 'user_query_day_hour'})
    data = pd.merge(data, query_day_hour, 'left',
                    on=['user_id', 'context_day', 'context_hour'])
    return data


def plot_confusion_matrix(cm, classes, normalize=False,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    """
    Print and plot the confusion matrix.
    Normalization can be applied by setting 'normalize=True'.
    """
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        print("Normalized confusion matrix")
    else:
        print('Confusion matrix, without normalization')
    print(cm)
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)
    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt),
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")
    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')


def data_baseline():
    filename = '../round1_ijcai_18_data/round1_ijcai_18_train_20180301.txt'
    data = read_bigcsv(filename, sep=' ')
    # data = pd.read_csv(filename, sep=' ')
    data.drop_duplicates(inplace=True)
    data.reset_index(drop=True, inplace=True)  # very important
    data = convert_timestamp(data)
    train = data.loc[data['context_day'] < 24]   # days 18..23
    test = data.loc[data['context_day'] == 24]   # day 24 as validation for now
    features = [
        'item_id', 'item_brand_id', 'item_city_id', 'item_price_level',
        'item_sales_level', 'item_collected_level', 'item_pv_level',
        'user_gender_id', 'user_age_level', 'user_occupation_id',
        'user_star_level', 'context_page_id', 'shop_id',
        'shop_review_num_level', 'shop_review_positive_rate',
        'shop_star_level', 'shop_score_service',
        'shop_score_delivery', 'shop_score_description',
        'user_query_day', 'user_query_day_hour', 'context_hour',
        ]
    x_train = train[features]
    x_test = test[features]
    y_train = train['is_trade']
    y_test = test['is_trade']
    return x_train, x_test, y_train, y_test


# x_train, x_test, y_train, y_test = data_baseline()


def model_baseline(x_train, y_train, x_test, y_test):
    cat_names = [
        'item_price_level',
        'item_sales_level',
        'item_collected_level',
        'item_pv_level',
        'user_gender_id',
        'user_age_level',
        'user_occupation_id',
        'user_star_level',
        'context_page_id',
        'shop_review_num_level',
        'shop_star_level',
        ]
    print("begin train...")
    kw_lgb = dict(num_leaves=63, max_depth=7, n_estimators=80, random_state=6)
    clf = lgb.LGBMClassifier(**kw_lgb)
    clf.fit(x_train, y_train, categorical_feature=cat_names)
    prob = clf.predict_proba(x_test)[:, 1]
    predict_score = [float('%.2f' % x) for x in prob]
    loss_val = log_loss(y_test, predict_score)
    # print(loss_val)  # 0.0848226750637
    fpr, tpr, thresholds = roc_curve(y_test, predict_score)
    mean_fpr = np.linspace(0, 1, 100)
    mean_tpr = np.interp(mean_fpr, fpr, tpr)
    x_auc = auc(fpr, tpr)
    fig = plt.figure('fig1')
    ax = fig.add_subplot(1, 1, 1)
    name = 'base_lgb'
    plt.plot(mean_fpr, mean_tpr, linestyle='--',
             label='{} (area = %0.2f, logloss = %0.2f)'.format(name) %
             (x_auc, loss_val), lw=2)
    y_pred = clf.predict(x_test)
    cm1 = plt.figure()
    cm = confusion_matrix(y_test, y_pred)
    plot_confusion_matrix(cm, classes=[0, 1], title='Confusion matrix base1')
    # refit with per-sample weights derived from the labels
    clf = lgb.LGBMClassifier(**kw_lgb)
    clf.fit(x_train, y_train,
            sample_weight=[1 if y == 1 else 0.02 for y in y_train],
            categorical_feature=cat_names)
    prob = clf.predict_proba(x_test)[:, 1]
    predict_score = [float('%.2f' % x) for x in prob]
    loss_val = log_loss(y_test, predict_score)
    fpr, tpr, thresholds = roc_curve(y_test, predict_score)
    mean_fpr = np.linspace(0, 1, 100)
    mean_tpr = np.interp(mean_fpr, fpr, tpr)
    x_auc = auc(fpr, tpr)
    name = 'base_lgb_weighted'
    plt.figure('fig1')  # switch back to the ROC figure
    plt.plot(
        mean_fpr, mean_tpr, linestyle='--',
        label='{} (area = %0.2f, logloss = %0.2f)'.format(name) %
        (x_auc, loss_val), lw=2)
    y_pred = clf.predict(x_test)
    cm2 = plt.figure()
    cm = confusion_matrix(y_test, y_pred)
    plot_confusion_matrix(cm, classes=[0, 1],
                          title='Confusion matrix basemodel')
    plt.figure('fig1')  # switch back to the ROC figure
    plt.plot([0, 1], [0, 1], linestyle='--', lw=2, color='k', label='Luck')
    # make nice plotting
    ax.spines['top'].set_visible(False)
    ax.spines['right'].set_visible(False)
    ax.get_xaxis().tick_bottom()
    ax.get_yaxis().tick_left()
    ax.spines['left'].set_position(('outward', 10))
    ax.spines['bottom'].set_position(('outward', 10))
    plt.xlim([0, 1])
    plt.ylim([0, 1])
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('Receiver Operating Characteristic')
    plt.legend(loc="lower right")
    plt.show()
    return cm1, cm2, fig


def model_baseline3(x_train, y_train, x_test, y_test):
    bagging = BaggingClassifier(random_state=0)
    balanced_bagging = BalancedBaggingClassifier(random_state=0)
    bagging.fit(x_train, y_train)
    balanced_bagging.fit(x_train, y_train)
    prob = bagging.predict_proba(x_test)[:, 1]
    predict_score = [float('%.2f' % x) for x in prob]
    loss_val = log_loss(y_test, predict_score)
    fpr, tpr, thresholds = roc_curve(y_test, predict_score)
    mean_fpr = np.linspace(0, 1, 100)
    mean_tpr = np.interp(mean_fpr, fpr, tpr)
    x_auc = auc(fpr, tpr)
    fig = plt.figure('Bagging')
    ax = fig.add_subplot(1, 1, 1)
    name = 'base_Bagging'
    plt.plot(mean_fpr, mean_tpr, linestyle='--',
             label='{} (area = %0.2f, logloss = %0.2f)'.format(name) %
             (x_auc, loss_val), lw=2)
    y_pred_bagging = bagging.predict(x_test)
    cm_bagging = confusion_matrix(y_test, y_pred_bagging)
    cm1 = plt.figure()
    plot_confusion_matrix(cm_bagging,
                          classes=[0, 1],
                          title='Confusion matrix of BaggingClassifier')
    # balanced_bagging
    prob = balanced_bagging.predict_proba(x_test)[:, 1]
    predict_score = [float('%.2f' % x) for x in prob]
    loss_val = log_loss(y_test, predict_score)
    fpr, tpr, thresholds = roc_curve(y_test, predict_score)
    mean_fpr = np.linspace(0, 1, 100)
    mean_tpr = np.interp(mean_fpr, fpr, tpr)
    x_auc = auc(fpr, tpr)
    plt.figure('Bagging')  # switch back to the ROC figure
    name = 'base_Balanced_Bagging'
    plt.plot(
        mean_fpr, mean_tpr, linestyle='--',
        label='{} (area = %0.2f, logloss = %0.2f)'.format(name) %
        (x_auc, loss_val), lw=2)
    y_pred_balanced_bagging = balanced_bagging.predict(x_test)
    cm_balanced_bagging = confusion_matrix(y_test, y_pred_balanced_bagging)
    cm2 = plt.figure()
    plot_confusion_matrix(cm_balanced_bagging,
                          classes=[0, 1],
                          title='Confusion matrix of BalancedBagging')
    plt.figure('Bagging')  # switch back to the ROC figure
    plt.plot([0, 1], [0, 1], linestyle='--', lw=2, color='k', label='Luck')
    # make nice plotting
    ax.spines['top'].set_visible(False)
    ax.spines['right'].set_visible(False)
    ax.get_xaxis().tick_bottom()
    ax.get_yaxis().tick_left()
    ax.spines['left'].set_position(('outward', 10))
    ax.spines['bottom'].set_position(('outward', 10))
    plt.xlim([0, 1])
    plt.ylim([0, 1])
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('Receiver Operating Characteristic')
    plt.legend(loc="lower right")
    plt.show()
    return cm1, cm2, fig


def model_baseline2(x_train, y_train, x_test, y_test):
    params = {
        'task': 'train',
        'boosting_type': 'gbdt',
        'objective': 'multiclass',
        'num_class': 2,
        'verbose': 0,
        'metric': 'multi_logloss',  # 'logloss' is not a valid LightGBM alias
        'max_bin': 255,
        'max_depth': 7,
        'learning_rate': 0.3,
        'nthread': 4,
        'num_leaves': 63,
        'feature_fraction': 0.8,
        # 'n_estimators' removed: it aliases num_boost_round and conflicted
        'num_boost_round': 160,
    }
    lgb_train = lgb.Dataset(x_train, label=y_train)
    lgb_eval = lgb.Dataset(x_test, label=y_test, reference=lgb_train)
    print("begin train...")
    bst = lgb.train(params, lgb_train, valid_sets=lgb_eval)
    prob = bst.predict(x_test)[:, 1]
    predict_score = [float('%.2f' % x) for x in prob]
    loss_val = log_loss(y_test, predict_score)
    y_pred = [1 if x > 0.5 else 0 for x in predict_score]
    fpr, tpr, thresholds = roc_curve(y_test, predict_score)
    x_auc = auc(fpr, tpr)
    mean_fpr = np.linspace(0, 1, 100)
    mean_tpr = np.interp(mean_fpr, fpr, tpr)
    fig = plt.figure('weighted')
    ax = fig.add_subplot(1, 1, 1)
    name = 'base_lgb'
    plt.plot(mean_fpr, mean_tpr, linestyle='--',
             label='{} (area = %0.2f, logloss = %0.2f)'.format(name) %
             (x_auc, loss_val), lw=2)
    cm1 = plt.figure()
    cm = confusion_matrix(y_test, y_pred)
    plot_confusion_matrix(cm, classes=[0, 1],
                          title='Confusion matrix basemodel')
    # retrain with per-sample weights derived from the labels
    lgb_train = lgb.Dataset(
        x_train, label=y_train,
        weight=[1 if y == 1 else 0.02 for y in y_train])
    lgb_eval = lgb.Dataset(
        x_test, label=y_test, reference=lgb_train,
        weight=[1 if y == 1 else 0.02 for y in y_test])
    bst = lgb.train(params, lgb_train, valid_sets=lgb_eval)
    prob = bst.predict(x_test)[:, 1]
    predict_score = [float('%.2f' % x) for x in prob]
    loss_val = log_loss(y_test, predict_score)
    y_pred = [1 if x > 0.5 else 0 for x in predict_score]
    fpr, tpr, thresholds = roc_curve(y_test, predict_score)
    mean_fpr = np.linspace(0, 1, 100)
    mean_tpr = np.interp(mean_fpr, fpr, tpr)
    x_auc = auc(fpr, tpr)
    plt.figure('weighted')  # switch back to the ROC figure
    name = 'base_lgb_weighted'
    plt.plot(
        mean_fpr, mean_tpr, linestyle='--',
        label='{} (area = %0.2f, logloss = %0.2f)'.format(name) %
        (x_auc, loss_val), lw=2)
    cm2 = plt.figure()
    cm = confusion_matrix(y_test, y_pred)
    plot_confusion_matrix(cm, classes=[0, 1],
                          title='Confusion matrix basemodel weighted')
    plt.figure('weighted')  # switch back to the ROC figure
    plt.plot([0, 1], [0, 1], linestyle='--', lw=2, color='k', label='Luck')
    # make nice plotting
    ax.spines['top'].set_visible(False)
    ax.spines['right'].set_visible(False)
    ax.get_xaxis().tick_bottom()
    ax.get_yaxis().tick_left()
    ax.spines['left'].set_position(('outward', 10))
    ax.spines['bottom'].set_position(('outward', 10))
    plt.xlim([0, 1])
    plt.ylim([0, 1])
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('Receiver Operating Characteristic')
    plt.legend(loc="lower right")
    plt.show()
    return cm1, cm2, fig


'''
1. logloss VS AUC
Although the baseline's logloss = 0.0819 is indeed very small, the confusion
matrix shows y_pred is all 0: the model puts everything into the majority
class. With sample weights it gets slightly better.

Yet the AUC is only 0.64~0.67. Why is the AUC so low when the logloss looks
so good? Because the labels are extremely imbalanced: positives are only
about 2% of the data, i.e. 50:1. AUC is a fairer measure of classification
performance on imbalanced data, so selecting features by AUC may well give
better results. This is only meant as a rough direction for improvement.

2. handling imbalanced data:
1. resampling, over- or under-:
   over- increases the number of minority samples,
   under- decreases the number of majority samples.
2. reweight the loss function by assigning a large loss to misclassifying
   the minority labels.
'''


if __name__ == "__main__":
    x_train, x_test, y_train, y_test = data_baseline()
    cm11, cm12, fig1 = model_baseline(x_train, y_train, x_test, y_test)
    cm21, cm22, fig2 = model_baseline2(x_train, y_train, x_test, y_test)
    cm31, cm32, fig3 = model_baseline3(x_train, y_train, x_test, y_test)

    fig1.savefig('./base_lgb_weighted.jpg', format='jpg')
    cm11.savefig('./Confusion matrix1.jpg', format='jpg')
    cm12.savefig('./Confusion matrix2.jpg', format='jpg')

 

