【IJCAI-2018】Search Advertising - Imbalanced Data
I am not good at competitions, nor at feature engineering or parameter tuning, and I have no server for parallel training. Everyone's baseline beats my model. I am writing this post mainly to share my understanding of the data and the rough framework I have been thinking through, in the hope that it offers at least a little inspiration or help.
As someone with no experience, no track record, and no teammates - whose feature engineering stops at dummy variables, whose dimensionality reduction stops at PCA, whose models stop at LR and SVM, whose tuning stops at CV, and whose ensembling stops at averaging - my main contribution to any competition is inflating the denominator. So when I saw people sharing their baselines on the forum, I was genuinely thrilled; watching them construct all kinds of ingenious features that actually improved the model's logloss, I was deeply impressed. Being neither clever nor especially meticulous, I went all in on "borrowing": I copied a baseline from the forum and ran it on my machine. Wow - 0.084 on the first run, only 0.004 behind the top score of 0.080. Surely I was about to storm the leaderboard! Riding that excitement I dug into models and code with my own BGM playing and wind at my heels, as if the whole world were mine. But... when I went to look at which samples were predicted as 1, I was stunned: not one! Out of my 18,000 test rows, the model did not predict a single 1. Not one.

And that is the problem. The clever among you probably knew about the class-imbalance problem in CTR data long ago, but dull as I am, I had not even noticed!
End of rant.
So for imbalanced data like this binary CTR prediction task, how should we handle it?
The negative-to-positive ratio is severely imbalanced, roughly 50:1. Predicting directly on such data yields extremely low recall on the minority class.
The reason is that traditional learning methods aim to minimize the overall classification error and treat all samples identically, so the classifier ends up accurate on the majority class and very inaccurate on the minority class. In the 50:1 CTR example, an algorithm that predicts everything as the majority class still reaches about 98% accuracy (50/51), which is why traditional algorithms have serious limitations on imbalanced data. Their predictions favor the majority: the minority is small, and with all samples weighted equally, the cost of missing it is tiny, so the model simply sides with the majority.
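To see this accuracy trap concretely, here is a toy sketch (synthetic 50:1 labels, not the competition data): a model that always predicts the majority class scores about 98% accuracy yet 0% recall on the minority class.

import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical 50:1 labels: 5000 negatives, 100 positives.
y_true = np.array([0] * 5000 + [1] * 100)
y_pred = np.zeros_like(y_true)   # "always predict the majority class"

print(accuracy_score(y_true, y_pred))  # ~0.98, looks great
print(recall_score(y_true, y_pred))    # 0.0, the minority is never found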
Solutions fall into two main categories.
The first approaches the problem from the data side, mainly via resampling: since the samples are imbalanced, we can sample by some strategy to make the training data more balanced. Resampling methods include over-sampling, under-sampling, and combinations of the two: over-sampling increases the number of minority samples, while under-sampling decreases the number of majority samples.
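As a minimal sketch (assuming the imbalanced-learn package with its fit_resample API; synthetic data stands in for the CTR features):

from sklearn.datasets import make_classification
from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler

# Synthetic data with ~50:1 class skew, standing in for the CTR features.
X, y = make_classification(n_samples=5100, weights=[50 / 51], random_state=0)

# Over-sampling: replicate minority rows until both classes are equal in size.
X_over, y_over = RandomOverSampler(random_state=0).fit_resample(X, y)

# Under-sampling: drop majority rows until both classes are equal in size.
X_under, y_under = RandomUnderSampler(random_state=0).fit_resample(X, y)

print(len(y), len(y_over), len(y_under))  # original vs. grown vs. shrunk

The attached script's model_baseline3 uses the ensemble variant of this idea: BalancedBaggingClassifier under-samples the majority class inside each bagging iteration.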
The second approaches it from the algorithm side: account for the different costs of different misclassifications and optimize the algorithm accordingly, so that it still performs well on imbalanced data. In other words, rewrite the cost function to assign a large cost to misclassifying minority labels.
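A minimal cost-sensitive sketch with LightGBM's sklearn API (the 50x weight simply mirrors the rough 50:1 ratio; it is an illustrative, untuned choice):

import lightgbm as lgb
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=5100, weights=[50 / 51], random_state=0)

# Misclassifying a rare positive now costs ~50x as much as a negative.
clf = lgb.LGBMClassifier(class_weight={0: 1.0, 1: 50.0}, random_state=0)
clf.fit(X, y)

print(clf.predict(X).sum())  # typically predicts some positives now

The attached script takes the mirrored route: instead of up-weighting positives, it passes a sample_weight that down-weights negatives to 0.02.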
PS: The attached script below compares the baselines on logloss and AUC. It is runnable and does not hit memory errors (the csv is read in chunks).

# -*- coding: utf-8 -*-
"""
Created on Wed Apr 4 10:53:58 2018
@author : HaiyanJiang
@email : jianghaiyan.cn@gmail.com

what does the doc do?
some ideas of improving the accuracy of imbalanced data classification.
data characteristics:
    imbalanced data.
the models:
    model_baseline : lgb
    model_baseline2 : another lgb
    model_baseline3 : bagging

Other Notes:
Besides the basic features, the data also include the user's click-count
statistics within the current hour and the current day, as well as the
current hour itself:
    'context_day', 'context_hour',
    'user_query_day', 'user_query_hour', 'user_query_day_hour',
non_feat = [
    'instance_id', 'user_id', 'context_id', 'item_category_list',
    'item_property_list', 'predict_category_property',
    'context_timestamp', 'TagTime', 'context_day'
    ]
"""

import time
import itertools

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import lightgbm as lgb

from sklearn.metrics import log_loss
from sklearn.metrics import confusion_matrix
from sklearn.metrics import auc, roc_curve
from sklearn.ensemble import BaggingClassifier
from imblearn.ensemble import BalancedBaggingClassifier

# Note: the original 'from scipy import interp' has been removed from recent
# SciPy releases; np.interp is the drop-in replacement used below.


def read_bigcsv(filename, **kw):
    """Read a large csv in chunks so that it does not raise a MemoryError."""
    with open(filename) as rf:
        reader = pd.read_csv(rf, **kw, iterator=True)
        chunkSize = 100000
        chunks = []
        while True:
            try:
                chunk = reader.get_chunk(chunkSize)
                chunks.append(chunk)
            except StopIteration:
                print("Iteration is stopped.")
                break
    df = pd.concat(chunks, axis=0, join='outer', ignore_index=True)
    return df


def timestamp2datetime(value):
    value = time.localtime(value)
    dt = time.strftime('%Y-%m-%d %H:%M:%S', value)
    return dt


'''
Exploration notes:
from matplotlib import pyplot as plt
tt = data['context_timestamp']
plt.plot(tt)
# The timestamps are not fully sorted; there is some disorder. For an
# online model, the data must be sorted by time first.
# aa = data[data['user_id'] == 24779788309075]
aa = data_train[data_train.duplicated(subset=None, keep='first')]
bb = data_train[data_train.duplicated(subset=None, keep='last')]
cc = data_train[data_train.duplicated(subset=None, keep=False)]

a2 = pd.DataFrame(train_id)[pd.DataFrame(train_id).duplicated(keep=False)]
b2 = train_id[train_id.duplicated(keep='last')]
c2 = train_id[train_id.duplicated(keep=False)]

c2 = data_train[data_train.duplicated(subset=None, keep=False)]

Verified: 'instance_id' has duplicates, e.g.
a3 = Xdata[Xdata['instance_id'] == 1037061371711078396]
'''


def convert_timestamp(data):
    '''
    1. convert timestamp to datetime.
    2. no sort, no reindex.
    data.duplicated(subset=None, keep='first')
    TagTime from-to is ('2018-09-18 00:00:01', '2018-09-24 23:59:47')
    'user_query_day', 'user_query_day_hour', 'hour',
    np.corrcoef(data['user_query_day'], data['user_query_hour'])
    np.corrcoef(data['user_query_hour'], data['user_query_day_hour'])
    np.corrcoef(data['user_query_day'], data['user_query_day_hour'])
    '''
    data['TagTime'] = data['context_timestamp'].apply(timestamp2datetime)
    # data['TagTime'][0], data['TagTime'][len(data) - 1]
    data['context_day'] = data['TagTime'].apply(lambda x: int(x[8:10]))
    data['context_hour'] = data['TagTime'].apply(lambda x: int(x[11:13]))
    # per-user click counts by day, by hour, and by (day, hour)
    query_day = data.groupby(['user_id', 'context_day']).size(
        ).reset_index().rename(columns={0: 'user_query_day'})
    data = pd.merge(data, query_day, 'left', on=['user_id', 'context_day'])
    query_hour = data.groupby(['user_id', 'context_hour']).size(
        ).reset_index().rename(columns={0: 'user_query_hour'})
    data = pd.merge(data, query_hour, 'left', on=['user_id', 'context_hour'])
    query_day_hour = data.groupby(
        by=['user_id', 'context_day', 'context_hour']).size(
        ).reset_index().rename(columns={0: 'user_query_day_hour'})
    data = pd.merge(data, query_day_hour, 'left',
                    on=['user_id', 'context_day', 'context_hour'])
    return data


def plot_confusion_matrix(cm, classes, normalize=False,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting 'normalize=True'.
    """
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        print("Normalized confusion matrix")
    else:
        print('Confusion matrix, without normalization')
    print(cm)
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)
    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt),
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")
    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')


def data_baseline():
    filename = '../round1_ijcai_18_data/round1_ijcai_18_train_20180301.txt'
    data = read_bigcsv(filename, sep=' ')
    # data = pd.read_csv(filename, sep=' ')
    data.drop_duplicates(inplace=True)
    data.reset_index(drop=True, inplace=True)  # very important
    data = convert_timestamp(data)
    train = data.loc[data['context_day'] < 24]  # days 18,19,20,21,22,23
    test = data.loc[data['context_day'] == 24]  # day 24 as validation for now
    features = [
        'item_id', 'item_brand_id', 'item_city_id', 'item_price_level',
        'item_sales_level', 'item_collected_level', 'item_pv_level',
        'user_gender_id', 'user_age_level', 'user_occupation_id',
        'user_star_level', 'context_page_id', 'shop_id',
        'shop_review_num_level', 'shop_review_positive_rate',
        'shop_star_level', 'shop_score_service',
        'shop_score_delivery', 'shop_score_description',
        'user_query_day', 'user_query_day_hour', 'context_hour',
        ]
    x_train = train[features]
    x_test = test[features]
    y_train = train['is_trade']
    y_test = test['is_trade']
    return x_train, x_test, y_train, y_test


# x_train, x_test, y_train, y_test = data_baseline()


def model_baseline(x_train, y_train, x_test, y_test):
    cat_names = [
        'item_price_level',
        'item_sales_level',
        'item_collected_level',
        'item_pv_level',
        'user_gender_id',
        'user_age_level',
        'user_occupation_id',
        'user_star_level',
        'context_page_id',
        'shop_review_num_level',
        'shop_star_level',
        ]
    print("begin train...")
    kw_lgb = dict(num_leaves=63, max_depth=7, n_estimators=80, random_state=6)
    clf = lgb.LGBMClassifier(**kw_lgb)
    clf.fit(x_train, y_train, categorical_feature=cat_names)
    prob = clf.predict_proba(x_test)[:, 1]
    predict_score = [float('%.2f' % x) for x in prob]
    loss_val = log_loss(y_test, predict_score)
    # print(loss_val)  # 0.0848226750637
    fpr, tpr, thresholds = roc_curve(y_test, predict_score)
    mean_fpr = np.linspace(0, 1, 100)
    mean_tpr = np.interp(mean_fpr, fpr, tpr)
    x_auc = auc(fpr, tpr)
    fig = plt.figure('fig1')
    ax = fig.add_subplot(1, 1, 1)
    name = 'base_lgb'
    plt.plot(mean_fpr, mean_tpr, linestyle='--',
             label='{} (area = %0.2f, logloss = %0.2f)'.format(name) %
             (x_auc, loss_val), lw=2)
    y_pred = clf.predict(x_test)
    cm1 = plt.figure()
    cm = confusion_matrix(y_test, y_pred)
    plot_confusion_matrix(cm, classes=[0, 1], title='Confusion matrix base1')
    # retrain with sample weights set according to the labels
    clf = lgb.LGBMClassifier(**kw_lgb)
    clf.fit(x_train, y_train,
            sample_weight=[1 if y == 1 else 0.02 for y in y_train],
            categorical_feature=cat_names)
    prob = clf.predict_proba(x_test)[:, 1]
    predict_score = [float('%.2f' % x) for x in prob]
    loss_val = log_loss(y_test, predict_score)
    fpr, tpr, thresholds = roc_curve(y_test, predict_score)
    mean_fpr = np.linspace(0, 1, 100)
    mean_tpr = np.interp(mean_fpr, fpr, tpr)
    x_auc = auc(fpr, tpr)
    name = 'base_lgb_weighted'
    plt.figure('fig1')  # switch back to the ROC figure
    plt.plot(
        mean_fpr, mean_tpr, linestyle='--',
        label='{} (area = %0.2f, logloss = %0.2f)'.format(name) %
        (x_auc, loss_val), lw=2)
    y_pred = clf.predict(x_test)
    cm2 = plt.figure()
    cm = confusion_matrix(y_test, y_pred)
    plot_confusion_matrix(cm, classes=[0, 1],
                          title='Confusion matrix basemodel')
    plt.figure('fig1')  # switch back to the ROC figure
    plt.plot([0, 1], [0, 1], linestyle='--', lw=2, color='k', label='Luck')
    # make nice plotting
    ax.spines['top'].set_visible(False)
    ax.spines['right'].set_visible(False)
    ax.get_xaxis().tick_bottom()
    ax.get_yaxis().tick_left()
    ax.spines['left'].set_position(('outward', 10))
    ax.spines['bottom'].set_position(('outward', 10))
    plt.xlim([0, 1])
    plt.ylim([0, 1])
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('Receiver Operating Characteristic')
    plt.legend(loc="lower right")
    plt.show()
    return cm1, cm2, fig


def model_baseline3(x_train, y_train, x_test, y_test):
    # plain bagging vs. bagging with per-estimator random under-sampling
    bagging = BaggingClassifier(random_state=0)
    balanced_bagging = BalancedBaggingClassifier(random_state=0)
    bagging.fit(x_train, y_train)
    balanced_bagging.fit(x_train, y_train)
    prob = bagging.predict_proba(x_test)[:, 1]
    predict_score = [float('%.2f' % x) for x in prob]
    loss_val = log_loss(y_test, predict_score)
    fpr, tpr, thresholds = roc_curve(y_test, predict_score)
    mean_fpr = np.linspace(0, 1, 100)
    mean_tpr = np.interp(mean_fpr, fpr, tpr)
    x_auc = auc(fpr, tpr)
    fig = plt.figure('Bagging')
    ax = fig.add_subplot(1, 1, 1)
    name = 'base_Bagging'
    plt.plot(mean_fpr, mean_tpr, linestyle='--',
             label='{} (area = %0.2f, logloss = %0.2f)'.format(name) %
             (x_auc, loss_val), lw=2)
    y_pred_bagging = bagging.predict(x_test)
    cm_bagging = confusion_matrix(y_test, y_pred_bagging)
    cm1 = plt.figure()
    plot_confusion_matrix(cm_bagging,
                          classes=[0, 1],
                          title='Confusion matrix of BaggingClassifier')
    # balanced_bagging
    prob = balanced_bagging.predict_proba(x_test)[:, 1]
    predict_score = [float('%.2f' % x) for x in prob]
    loss_val = log_loss(y_test, predict_score)
    fpr, tpr, thresholds = roc_curve(y_test, predict_score)
    mean_fpr = np.linspace(0, 1, 100)
    mean_tpr = np.interp(mean_fpr, fpr, tpr)
    x_auc = auc(fpr, tpr)
    plt.figure('Bagging')  # switch back to the ROC figure
    name = 'base_Balanced_Bagging'
    plt.plot(
        mean_fpr, mean_tpr, linestyle='--',
        label='{} (area = %0.2f, logloss = %0.2f)'.format(name) %
        (x_auc, loss_val), lw=2)
    y_pred_balanced_bagging = balanced_bagging.predict(x_test)
    cm_balanced_bagging = confusion_matrix(y_test, y_pred_balanced_bagging)
    cm2 = plt.figure()
    plot_confusion_matrix(cm_balanced_bagging,
                          classes=[0, 1],
                          title='Confusion matrix of BalancedBagging')
    plt.figure('Bagging')  # switch back to the ROC figure
    plt.plot([0, 1], [0, 1], linestyle='--', lw=2, color='k', label='Luck')
    # make nice plotting
    ax.spines['top'].set_visible(False)
    ax.spines['right'].set_visible(False)
    ax.get_xaxis().tick_bottom()
    ax.get_yaxis().tick_left()
    ax.spines['left'].set_position(('outward', 10))
    ax.spines['bottom'].set_position(('outward', 10))
    plt.xlim([0, 1])
    plt.ylim([0, 1])
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('Receiver Operating Characteristic')
    plt.legend(loc="lower right")
    plt.show()
    return cm1, cm2, fig


def model_baseline2(x_train, y_train, x_test, y_test):
    params = {
        'task': 'train',
        'boosting_type': 'gbdt',
        'objective': 'multiclass',
        'num_class': 2,
        'verbose': 0,
        'metric': 'multi_logloss',  # 'logloss' is not a valid metric name
        'max_bin': 255,
        'max_depth': 7,
        'learning_rate': 0.3,
        'nthread': 4,
        'num_leaves': 63,
        'feature_fraction': 0.8,
        # the original params also set 'n_estimators': 85, an alias of
        # num_boost_round; only one of the two aliases is kept here
        'num_boost_round': 160,
    }
    lgb_train = lgb.Dataset(x_train, label=y_train)
    lgb_eval = lgb.Dataset(x_test, label=y_test, reference=lgb_train)
    print("begin train...")
    bst = lgb.train(params, lgb_train, valid_sets=lgb_eval)
    prob = bst.predict(x_test)[:, 1]
    predict_score = [float('%.2f' % x) for x in prob]
    loss_val = log_loss(y_test, predict_score)
    y_pred = [1 if x > 0.5 else 0 for x in predict_score]
    fpr, tpr, thresholds = roc_curve(y_test, predict_score)
    x_auc = auc(fpr, tpr)
    mean_fpr = np.linspace(0, 1, 100)
    mean_tpr = np.interp(mean_fpr, fpr, tpr)
    fig = plt.figure('weighted')
    ax = fig.add_subplot(1, 1, 1)
    name = 'base_lgb'
    plt.plot(mean_fpr, mean_tpr, linestyle='--',
             label='{} (area = %0.2f, logloss = %0.2f)'.format(name) %
             (x_auc, loss_val), lw=2)
    cm1 = plt.figure()
    cm = confusion_matrix(y_test, y_pred)
    plot_confusion_matrix(cm, classes=[0, 1],
                          title='Confusion matrix basemodel')
    # retrain with sample weights set according to the labels
    lgb_train = lgb.Dataset(
        x_train, label=y_train,
        weight=[1 if y == 1 else 0.02 for y in y_train])
    lgb_eval = lgb.Dataset(
        x_test, label=y_test, reference=lgb_train,
        weight=[1 if y == 1 else 0.02 for y in y_test])
    bst = lgb.train(params, lgb_train, valid_sets=lgb_eval)
    prob = bst.predict(x_test)[:, 1]
    predict_score = [float('%.2f' % x) for x in prob]
    loss_val = log_loss(y_test, predict_score)
    y_pred = [1 if x > 0.5 else 0 for x in predict_score]
    fpr, tpr, thresholds = roc_curve(y_test, predict_score)
    mean_fpr = np.linspace(0, 1, 100)
    mean_tpr = np.interp(mean_fpr, fpr, tpr)
    x_auc = auc(fpr, tpr)
    plt.figure('weighted')  # switch back to the ROC figure
    name = 'base_lgb_weighted'
    plt.plot(
        mean_fpr, mean_tpr, linestyle='--',
        label='{} (area = %0.2f, logloss = %0.2f)'.format(name) %
        (x_auc, loss_val), lw=2)
    cm2 = plt.figure()
    cm = confusion_matrix(y_test, y_pred)
    plot_confusion_matrix(cm, classes=[0, 1],
                          title='Confusion matrix basemodel')
    plt.figure('weighted')  # switch back to the ROC figure
    plt.plot([0, 1], [0, 1], linestyle='--', lw=2, color='k', label='Luck')
    # make nice plotting
    ax.spines['top'].set_visible(False)
    ax.spines['right'].set_visible(False)
    ax.get_xaxis().tick_bottom()
    ax.get_yaxis().tick_left()
    ax.spines['left'].set_position(('outward', 10))
    ax.spines['bottom'].set_position(('outward', 10))
    plt.xlim([0, 1])
    plt.ylim([0, 1])
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('Receiver Operating Characteristic')
    plt.legend(loc="lower right")
    plt.show()
    return cm1, cm2, fig


'''
Notes:
1. logloss VS AUC
Although the baseline logloss = 0.0819 is indeed very small, the confusion
matrix shows that the model tends to put every sample into the majority
class (y_pred is all 0); adding label weights seems to make it slightly
better.

The AUC is only 0.64~0.67. Such a low AUC seems surprising - why?
Because the labels are extremely imbalanced: positives are only about 2%
of the data, roughly 50:1.
AUC measures classification performance on imbalanced data more fairly,
so selecting features by AUC may well give better results.
This is only offered as a rough direction for improvement.
2. handling imbalanced data:
    1. resampling, over- or under-:
       over- is increasing # of minority, under- is decreasing # of majority.
    2. revalue the loss function by giving a large loss to misclassifying
       the minority labels.
'''


if __name__ == "__main__":
    x_train, x_test, y_train, y_test = data_baseline()
    cm11, cm12, fig1 = model_baseline(x_train, y_train, x_test, y_test)
    cm21, cm22, fig2 = model_baseline2(x_train, y_train, x_test, y_test)
    cm31, cm32, fig3 = model_baseline3(x_train, y_train, x_test, y_test)

    fig1.savefig('./base_lgb_weighted.jpg', format='jpg')
    cm11.savefig('./Confusion matrix1.jpg', format='jpg')
    cm12.savefig('./Confusion matrix2.jpg', format='jpg')
