Variable Binning in Data Analysis: German Credit Dataset

This article is a repost (2021-10-15 17:53, data analysis).
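I recently read Chapter 6, "Variable Binning Methods", of the book 《Python金融大數據風控建模實戰：基於機器學習》 (China Machine Press). This post summarizes the chapter's main points and walks through the code in detail.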
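I. Key points
1. Variable binning is a feature-engineering method intended to improve a variable's interpretability and predictive power. It is used mainly for continuous variables, but discrete variables whose values are sparse (many categories with few samples each) should also be binned.
2. Benefits of variable binning:
(1) Reduces the influence of outliers and makes the model more stable.
(2) Missing values take part in binning as a special bin of their own, which avoids the uncertainty of imputing them.
(3) Improves the interpretability of the variable.
(4) Introduces nonlinearity: once the bins are encoded, a linear model can fit a nonlinear relationship between the variable and the target.
(5) Can improve the model's predictive performance.
3. Limitations of variable binning:
(1) All samples in the same bin are treated as homogeneous, so differences within a bin are lost.
(2) It needs the support of expert experience.
4. Points to watch when binning:
(1) Do not use too many bins.
(2) Do not use too few bins.
(3) The bins should satisfy the monotonicity requirement after binning.
5. Variable binning workflow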
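As a rough illustration of the points above, and not the book's algorithm, the following sketch bins a continuous variable into equal-width intervals with `pd.cut`, keeps missing values in a bin of their own, and checks whether the bad rate is monotonic across the non-missing bins. The toy data, the bin count, and the helper names `bad_rate_by_bin` and `is_monotonic` are assumptions made for this example.

```python
import numpy as np
import pandas as pd


def bad_rate_by_bin(x, y, n_bins=4):
    """Equal-width binning of x; missing values go to a separate bin coded -1."""
    # labels=False returns bin codes 0..n_bins-1, and NaN where x is missing
    codes = pd.cut(x, bins=n_bins, labels=False)
    codes = codes.fillna(-1).astype(int)
    stats = (pd.DataFrame({"bin": codes, "bad": y})
             .groupby("bin")["bad"]
             .agg(total="count", bad="sum"))
    stats["bad_rate"] = stats["bad"] / stats["total"]
    return stats


def is_monotonic(rates):
    """True if the sequence never changes direction (ties are allowed)."""
    diffs = np.diff(np.asarray(rates, dtype=float))
    return bool(np.all(diffs >= 0) or np.all(diffs <= 0))


# Toy data (hypothetical): y = 1 marks a bad customer
x = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, np.nan, np.nan])
y = pd.Series([0, 0, 0, 1, 0, 1, 1, 1, 1, 0])

stats = bad_rate_by_bin(x, y)
print(stats)
# Monotonicity is checked on the non-missing bins only
print(is_monotonic(stats.loc[stats.index >= 0, "bad_rate"]))
```

Equal-width bins are only a starting point here; the listing in the next part starts from the same kind of initialization and then merges and splits bins using a statistical criterion.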
II. Code implementation
The data is again the German credit dataset. For a description of the dataset and how to obtain it, see the earlier post "Data cleaning and preprocessing - German credit datasets".
# -*- coding: utf-8 -*-
"""
Chapter 6: variable binning methods
    1: Chi-merge (chi-square binning)
    2: IV (optimal information-value binning)
    3: Information entropy (tree-based binning)
"""
import os
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
import warnings
warnings.filterwarnings("ignore")  # suppress warnings


def data_read(data_path, file_name):
    # The file is whitespace-delimited and has no header row
    # (sep=r'\s+' replaces the deprecated delim_whitespace=True)
    df = pd.read_csv(os.path.join(data_path, file_name),
                     sep=r'\s+',
                     header=None)
    # Rename the variables
    columns = [
        'status_account', 'duration', 'credit_history', 'purpose', 'amount',
        'svaing_account', 'present_emp', 'income_rate', 'personal_status',
        'other_debtors', 'residence_info', 'property', 'age', 'inst_plans',
        'housing', 'num_credits', 'job', 'dependents', 'telephone',
        'foreign_worker', 'target'
    ]
    df.columns = columns
    # Recode the label from 1/2 to 0/1: 0 = good customer, 1 = bad customer
    df.target = df.target - 1
    # Split into data_train and data_test. The binning rules are learned on
    # the training set and then applied unchanged to the test set.
    data_train, data_test = train_test_split(df,
                                             test_size=0.2,
                                             random_state=0,
                                             stratify=df.target)
    return data_train, data_test


def cal_advantage(temp, point, method, flag='sel'):
    """
    Compute the metric value for the current split point.
    # Parameters
    temp: binning result from the previous step, pandas DataFrame
    point: split point used to divide the bins
    method: binning method, 1: chi-merge, 2: IV, 3: information entropy
    flag: 'sel' evaluates a candidate binary split, 'gain' evaluates the current binning
    """
    if flag == 'sel':
        # Used to choose the best split point; this is a binary split only
        bin_num = 2
        # np.empty returns an uninitialized array (arbitrary contents),
        # so every cell is filled in below
        good_bad_matrix = np.empty((bin_num, 3))
        for ii in range(bin_num):
            if ii == 0:
                df_temp_1 = temp[temp['bin_raw'] <= point]
            else:
                df_temp_1 = temp[temp['bin_raw'] > point]
            # Count the good, bad, and total samples in each bin
            good_bad_matrix[ii][0] = df_temp_1['good'].sum()
            good_bad_matrix[ii][1] = df_temp_1['bad'].sum()
            good_bad_matrix[ii][2] = df_temp_1['total'].sum()

    elif flag == 'gain':
        # Used to evaluate the binning after a split: each time a bin is
        # added, the metric is recomputed over all current bins
        bin_num = int(temp['bin'].max())
        good_bad_matrix = np.empty((bin_num, 3))
        for ii in range(bin_num):
            df_temp_1 = temp[temp['bin'] == (ii + 1)]
            good_bad_matrix[ii][0] = df_temp_1['good'].sum()
            good_bad_matrix[ii][1] = df_temp_1['bad'].sum()
            good_bad_matrix[ii][2] = df_temp_1['total'].sum()

    # Good, bad, and total counts over the whole sample
    total_matrix = np.empty(3)
    total_matrix[0] = temp.good.sum()
    total_matrix[1] = temp.bad.sum()
    total_matrix[2] = temp.total.sum()

    # Chi-merge binning
    if method == 1:
        X2 = 0
        # i indexes the bins
        for i in range(bin_num):
            # j = 0 is the good class, j = 1 is the bad class
            for j in range(2):
                # Expected count: overall class share * number of samples in the bin
                expect = (total_matrix[j] / total_matrix[2]) * good_bad_matrix[i][2]
                # Squared difference between observed and expected, divided by expected
                X2 = X2 + (good_bad_matrix[i][j] - expect)**2 / expect
        M_value = X2
    # IV binning
    elif method == 2:
        if pd.isnull(total_matrix[0]) or pd.isnull(total_matrix[1]) or total_matrix[0] == 0 or total_matrix[1] == 0:
            M_value = np.nan
        else:
            IV = 0
            for i in range(bin_num):
                # Bin's share of all bad samples minus its share of all good samples
                weight = good_bad_matrix[i][1] / total_matrix[1] - good_bad_matrix[i][0] / total_matrix[0]
                # The log term is ln((bad_i / bad_total) / (good_i / good_total)), i.e. the
                # textbook formula with the fraction rearranged
                IV = IV + weight * np.log((good_bad_matrix[i][1] * total_matrix[0]) / (good_bad_matrix[i][0] * total_matrix[1]))
            M_value = IV
    # Information-entropy binning
    elif method == 3:
        # Total entropy
        entropy_total = 0
        for j in range(2):
            weight = (total_matrix[j] / total_matrix[2])
            entropy_total = entropy_total - weight * (np.log(weight))

        # Conditional entropy
        entropy_cond = 0
        for i in range(bin_num):
            entropy_temp = 0
            for j in range(2):
                entropy_temp = entropy_temp - \
                    ((good_bad_matrix[i][j] / good_bad_matrix[i][2]) * np.log(good_bad_matrix[i][j] / good_bad_matrix[i][2]))
            entropy_cond = entropy_cond + good_bad_matrix[i][2] / total_matrix[2] * entropy_temp

        # Normalized information gain
        M_value = 1 - (entropy_cond / entropy_total)
    # Other methods (e.g. Best-KS) are not implemented; fail clearly instead of
    # returning an undefined M_value
    else:
        raise ValueError("method must be 1 (chi-merge), 2 (IV) or 3 (entropy)")
    return M_value


def best_split(df_temp0, method, bin_num):
    """
    Find the best split point within one candidate bin and perform one split.
    Helper for select_split_point.
    # Parameters
    df_temp0: result of the previous binning step, pandas DataFrame
    method: binning method, 1: chi-merge, 2: IV, 3: information entropy
    bin_num: bin number; the binary split is searched inside this bin
    # Returns
    The best split result within this bin, pandas DataFrame
    """
    df_temp0 = df_temp0.sort_values(by=['bin', 'bad_rate'])
    point_len = len(df_temp0[df_temp0['bin'] == bin_num])  # number of candidates
    bestValue = 0
    bestI = 1
    # Use every candidate as a split point and compute the metric
    for i in range(1, point_len):
        value = cal_advantage(df_temp0, i, method, flag='sel')
        # Keep the largest value
        if bestValue < value:
            bestValue = value
            bestI = i
    # Create the new variable 'split' according to bestI (adds one column)
    df_temp0['split'] = np.where(df_temp0['bin_raw'] <= bestI, 1, 0)
    # Drop the old bin_raw column
    df_temp0 = df_temp0.drop('bin_raw', axis=1)
    # Re-sort (ascending by default)
    newbinDS = df_temp0.sort_values(by=['split', 'bad_rate'])
    # Rebuild bin_raw inside each half
    newbinDS_0 = newbinDS[newbinDS['split'] == 0].copy()
    newbinDS_1 = newbinDS[newbinDS['split'] == 1].copy()
    newbinDS_0['bin_raw'] = range(1, len(newbinDS_0) + 1)
    newbinDS_1['bin_raw'] = range(1, len(newbinDS_1) + 1)
    newbinDS = pd.concat([newbinDS_0, newbinDS_1], axis=0)
    return newbinDS


def select_split_point(temp_bin, method):
    """
    Binary-tree splitting: pick the best split among the candidates and compute
    the metric after the split. Helper for cont_var_bin and disc_var_bin.
    # Parameters
    temp_bin: current binning result, pandas DataFrame
    method: binning method, 1: chi-merge, 2: IV, 3: information entropy
    # Returns
    (new binning result as a pandas DataFrame, metric value rounded to 4 decimals)
    """
    # sort_values works like ORDER BY in SQL: it sorts the rows by the given columns
    temp_bin = temp_bin.sort_values(by=['bin', 'bad_rate'])
    # Current largest bin number
    max_num = max(temp_bin['bin'])
    temp_main = dict()
    bin_i_value = []
    for i in range(1, max_num + 1):
        # Rows that belong to bin i
        df_temp = temp_bin[temp_bin['bin'] == i]
        # A bin can only be split if it holds more than one row
        if df_temp.shape[0] > 1:
            # Split bin i
            temp_split = best_split(df_temp, method, i)
            # Update the bin numbers: np.where(condition, x, y) returns x where the
            # condition holds and y elsewhere. Rows with split == 1 move to a new
            # bin, so the previously identical 'bin' values are now told apart.
            temp_split['bin'] = np.where(temp_split['split'] == 1, max_num + 1, temp_split['bin'])
            # Take the rows with bin != i and combine them with the split result
            temp_main[i] = temp_bin[temp_bin['bin'] != i]
            # temp_split has one more column ('split') than temp_main[i];
            # the missing values become NaN after concatenation
            temp_main[i] = pd.concat([temp_main[i], temp_split], axis=0, sort=False)
            # Metric of the new grouping
            value = cal_advantage(temp_main[i], 0, method, flag='gain')
            newdata = [i, value]
            bin_i_value.append(newdata)
    # Only one of the candidate splits is kept: the one with the largest metric
    bin_i_value.sort(key=lambda x: x[1], reverse=True)
    binNum = bin_i_value[0][0]
    newBins = temp_main[binNum].drop('split', axis=1)
    return newBins.sort_values(by=['bin', 'bad_rate']), round(bin_i_value[0][1], 4)


def init_equal_bin(x, bin_rate):
    """
    Initialize equal-width bins. Helper for cont_var_bin.
    # Parameters
    x: values of the variable to bin, pandas Series
    bin_rate: the number of initial bins is 1 / bin_rate
    # Returns
    The initial bins, pandas DataFrame
    """
    # Outlier handling: the equal-width grid only spans roughly the 5th to the
    # 95th percentile; the two outer bins are extended to -inf and inf
    # np.percentile computes a percentile of the data
    if len(x[x > np.percentile(x, 95)]) > 0 and len(np.unique(x)) >= 30:
        var_up = min(x[x > np.percentile(x, 95)])
    else:
        var_up = max(x)
    if len(x[x < np.percentile(x, 5)]) > 0:
        var_low = max(x[x < np.percentile(x, 5)])
    else:
        var_low = min(x)

    # Number of initial bins
    bin_num = int(1 / bin_rate)
    # Bin width
    dist_bin = (var_up - var_low) / bin_num
    bin_up = []
    bin_low = []
    for i in range(1, bin_num + 1):
        if i == 1:
            bin_up.append(var_low + i * dist_bin)
            bin_low.append(-np.inf)
        elif i == bin_num:
            bin_up.append(np.inf)
            bin_low.append(var_low + (i - 1) * dist_bin)
        else:
            bin_up.append(var_low + i * dist_bin)
            bin_low.append(var_low + (i - 1) * dist_bin)
    result = pd.DataFrame({'bin_up': bin_up, 'bin_low': bin_low})
    # Name the index of the result
    result.index.name = 'bin_num'
    return result


def limit_min_sample(temp_cont, bin_min_num_0):
    """
    Binning constraint: a bin holding bin_min_num_0 samples or fewer is merged
    into a neighbour. Helper for cont_var_bin.
    # Parameters
    temp_cont: result of the initial binning, pandas DataFrame
    bin_min_num_0: minimum-sample threshold per bin
    # Returns
    The merged binning result, pandas DataFrame
    """
    for i in temp_cont.index:
        # Current row
        rowdata = temp_cont.loc[i, :]
        if i == temp_cont.index.max():
            # The last bin merges into the one before it
            ix = temp_cont[temp_cont.index < i].index.max()
        else:
            # Otherwise merge into the next bin (smallest index greater than i)
            ix = temp_cont[temp_cont.index > i].index.min()
        # Merge when the bin's total does not exceed the threshold
        if rowdata['total'] <= bin_min_num_0:
            # Merge with the neighbour: add the counts of row i to row ix
            temp_cont.loc[ix, 'bad'] = temp_cont.loc[ix, 'bad'] + rowdata['bad']
            temp_cont.loc[ix, 'good'] = temp_cont.loc[ix, 'good'] + rowdata['good']
            temp_cont.loc[ix, 'total'] = temp_cont.loc[ix, 'total'] + rowdata['total']
            # Keep the outer boundary of the merged interval
            if i < temp_cont.index.max():
                temp_cont.loc[ix, 'bin_low'] = rowdata['bin_low']
            else:
                temp_cont.loc[ix, 'bin_up'] = rowdata['bin_up']
            temp_cont = temp_cont.drop(i, axis=0)
    return temp_cont.sort_values(by='bad_rate')


def cont_var_bin_map(x, bin_init):
    """
    Map raw values to bins according to a binning result.
    Used for both the training set and the test set.
    """
    temp = x.copy()
    for i in bin_init.index:
        bin_up = bin_init['bin_up'][i]
        bin_low = bin_init['bin_low'][i]
        # A bin whose bounds are NaN is the missing-value bin
        if pd.isnull(bin_up) or pd.isnull(bin_low):
            temp[pd.isnull(temp)] = i
        else:
            # Boolean Series marking values with lower < x <= upper
            index = (x > bin_low) & (x <= bin_up)
            temp[index] = i
    # Rename the Series
    temp.name = temp.name + "_BIN"
    return temp


def merge_bin(sub, i):
    """
    Merge the intervals and sample counts that fall into the same bin.
    # Parameters
    sub: subset of the binning result, pandas DataFrame, e.g. the rows with bin=1
    i: bin number
    # Returns
    One-row DataFrame with the merged result
    """
    total = sub['total'].sum()
    # First row gives the lower bound, last row the upper bound
    first = sub.iloc[0, :]
    last = sub.iloc[-1, :]

    lower = first['bin_low']
    upper = last['bin_up']
    # DataFrame.append was removed in pandas 2.0, so build the row directly
    df = pd.DataFrame([[i, lower, upper, total]],
                      columns=['bin', 'bin_low', 'bin_up', 'total'])
    return df


# --------------------- Binning for continuous variables -------------------- #
def cont_var_bin(x,
                 y,
                 method,
                 mmin=5,
                 mmax=10,
                 bin_rate=0.01,
                 stop_limit=0.1,
                 bin_min_num=20):
    """
    # Parameters
    x: data to bin, pandas Series
    y: label variable
    method: binning method, 1: chi-merge, 2: IV, 3: information entropy
    mmin: minimum number of bins; if the bins left after the minimum-sample merge
          number mmin or fewer, they are returned as they are, without splitting
    mmax: maximum number of bins; if the bins left after the merge number mmax or
          fewer, mmax is set to that number minus 1
    bin_rate: equal-width initialization parameter; the number of initial bins is
          1/bin_rate, equally spaced between the (outlier-trimmed) minimum and maximum
    stop_limit: early-stopping threshold; splitting stops once the relative gain
          is no longer significant
    bin_min_num: minimum number of samples per bin
    # Returns
    (binning result as a pandas DataFrame, list of gain values, list of gain rates)
    """
    # Put missing values aside
    df_na = pd.DataFrame({'x': x[pd.isnull(x)], 'y': y[pd.isnull(x)]})
    y = y[~pd.isnull(x)]
    x = x[~pd.isnull(x)]

    # Initial equal-width bins, without any constraint on samples per bin yet
    # Returns bin_num (index), bin_up, bin_low, e.g. a 100 x 2 DataFrame
    bin_init = init_equal_bin(x, bin_rate)

    # Map the raw values to the initial bins (a Series, e.g. shape (771,))
    bin_map = cont_var_bin_map(x, bin_init)

    # Combine the Series into a DataFrame; axis=1 concatenates column-wise
    df_temp = pd.concat([x, y, bin_map], axis=1)
    # Frequency of good and bad samples in each bin
    df_temp_1 = pd.crosstab(index=df_temp[bin_map.name], columns=y)
    # rename needs to know whether rows or columns are renamed; here the columns
    df_temp_1.rename(columns=dict(zip([0, 1], ['good', 'bad'])), inplace=True)

    # Total number of samples in each bin. groupby(...).count() gives the same
    # count in every column, so only the first column is kept.
    df_temp_2 = pd.DataFrame(df_temp.groupby(bin_map.name).count().iloc[:, 0])
    df_temp_2.columns = ['total']
    # Left join on the index, like a SQL LEFT OUTER JOIN: every non-empty bin is
    # kept and receives its bounds from bin_init
    df_temp_all = pd.merge(pd.concat([df_temp_1, df_temp_2], axis=1), bin_init, left_index=True, right_index=True, how='left')

    pd.set_option('display.max_rows', None)     # show all rows
    pd.set_option('display.max_columns', None)  # show all columns

    # Make the bounds contiguous: empty initial bins are absent from
    # df_temp_all, so each lower bound is set to the previous upper bound.
    # .loc[row, column] is used instead of chained assignment, which may not
    # write through to the DataFrame.
    for j in range(df_temp_all.shape[0] - 1):
        prev_idx = df_temp_all.index[j]
        next_idx = df_temp_all.index[j + 1]
        if df_temp_all.loc[next_idx, 'bin_low'] != df_temp_all.loc[prev_idx, 'bin_up']:
            df_temp_all.loc[next_idx, 'bin_low'] = df_temp_all.loc[prev_idx, 'bin_up']

    # For discrete variables this column holds the bad rate. For continuous
    # variables it holds the index of the initial (non-empty) bin, which keeps
    # the bins in value order when sorting.
    df_temp_all['bad_rate'] = df_temp_all.index
    # Columns now: good, bad, total, bin_up, bin_low, bad_rate (e.g. 97 rows)
    # Merge bins that violate the minimum-sample constraint
    df_temp_all = limit_min_sample(df_temp_all, bin_min_num)
    # Same columns after merging (e.g. 27 rows)

    # Compare the number of merged bins with the configured maximum
    if mmax >= df_temp_all.shape[0]:
        mmax = df_temp_all.shape[0] - 1
    if mmin >= df_temp_all.shape[0]:
        gain_value_save0 = 0
        gain_rate_save0 = 0
        df_temp_all['bin'] = np.linspace(1, df_temp_all.shape[0], df_temp_all.shape[0], dtype=int)
        data = df_temp_all[['bin_low', 'bin_up', 'total', 'bin']].copy()
        data.index = data['bin']
    else:
        # Start with every row in bin 1
        df_temp_all['bin'] = 1
        df_temp_all['bin_raw'] = range(1, len(df_temp_all) + 1)
        df_temp_all['var'] = df_temp_all.index  # number of the initial bin
        # Columns now: good, bad, total, bin_up, bin_low, bad_rate, bin, bin_raw, var
        gain_1 = 1e-10
        gain_rate_save0 = []
        gain_value_save0 = []
        # Constraint: maximum number of bins
        for i in range(1, mmax):
            df_temp_all, gain_2 = select_split_point(df_temp_all, method=method)
            gain_rate = gain_2 / gain_1 - 1   # relative gain
            gain_value_save0.append(np.round(gain_2, 4))
            if i == 1:
                gain_rate_save0.append(0.5)
            else:
                gain_rate_save0.append(np.round(gain_rate, 4))
            gain_1 = gain_2
            # Stop early only once the bin count lies between mmin and mmax
            if df_temp_all.bin.max() >= mmin and df_temp_all.bin.max() <= mmax:
                if gain_rate <= stop_limit or pd.isnull(gain_rate):
                    break

        df_temp_all = df_temp_all.rename(columns={'var': 'oldbin'})
        temp_Map1 = df_temp_all.drop(['good', 'bad', 'bad_rate', 'bin_raw'], axis=1)
        temp_Map1 = temp_Map1.sort_values(by=['bin', 'oldbin'])

        # Get the new lower bound, upper bound, bin number, and total for each bin.
        # Rows are collected in a list because DataFrame.append was removed in pandas 2.0.
        rows = []
        for i in temp_Map1['bin'].unique():
            # Bounds of this new bin
            sub_Map = temp_Map1[temp_Map1['bin'] == i]
            rows.append(merge_bin(sub_Map, i))
        data = pd.concat(rows, ignore_index=True)

        # Renumber the bins in order of their lower bound
        data = data.sort_values(by='bin_low')
        data = data.drop('bin', axis=1)
        mmax = df_temp_all.bin.max()
        data['bin'] = range(1, mmax + 1)
        data.index = data['bin']

    # Add the missing values as a separate bin
    if len(df_na) > 0:
        row_num = data.shape[0] + 1
        data.loc[row_num, 'bin_low'] = np.nan
        data.loc[row_num, 'bin_up'] = np.nan
        data.loc[row_num, 'total'] = df_na.shape[0]
        data.loc[row_num, 'bin'] = data.bin.max() + 1
    return data, gain_value_save0, gain_rate_save0


def cal_bin_value(x, y, bin_min_num_0=10):
    """
    Initialize the bins by category and merge categories that do not meet the
    minimum sample count.
    # Parameters
    x: discrete variable to bin, pandas Series
    y: label variable
    bin_min_num_0: minimum-sample threshold per bin
    # Returns
    Per-category statistics, pandas DataFrame
    """
    # Count the samples with y = 0 and y = 1 for each category of x
    df_temp = pd.crosstab(index=x, columns=y, margins=False)
    df_temp.rename(columns=dict(zip([0, 1], ['good', 'bad'])), inplace=True)
    # DataFrame.assign returns a new object with the new columns added to the
    # original ones; existing columns that are reassigned are overwritten
    df_temp = df_temp.assign(total=lambda d: d['good'] + d['bad'],
                             bin=1,
                             var_name=df_temp.index).assign(bad_rate=lambda d: d['bad'] / d['total'])

    # Sort by bad rate
    df_temp = df_temp.sort_values(by='bad_rate')
    df_temp = df_temp.reset_index(drop=True)
    # Merge categories that do not meet the minimum sample count
    for i in df_temp.index:
        # Current row
        rowdata = df_temp.loc[i, :]
        if i == df_temp.index.max():
            # The last bin merges into the one before it; ix is the merge target
            ix = df_temp[df_temp.index < i].index.max()
        else:
            # Otherwise merge into the next bin (smallest index greater than i)
            ix = df_temp[df_temp.index > i].index.min()
        # Merge when any of good, bad, or total does not exceed bin_min_num_0
        if any(rowdata[['good', 'bad', 'total']] <= bin_min_num_0):
            # Merge with the neighbouring bin
            df_temp.loc[ix, 'bad'] = df_temp.loc[ix, 'bad'] + rowdata['bad']
            df_temp.loc[ix, 'good'] = df_temp.loc[ix, 'good'] + rowdata['good']
            df_temp.loc[ix, 'total'] = df_temp.loc[ix, 'total'] + rowdata['total']
            df_temp.loc[ix, 'bad_rate'] = df_temp.loc[ix, 'bad'] / df_temp.loc[ix, 'total']
            # Merge the category names as well, joined with '%'
            df_temp.loc[ix, 'var_name'] = str(rowdata['var_name']) + '%' + str(df_temp.loc[ix, 'var_name'])

            df_temp = df_temp.drop(i, axis=0)  # drop the merged row

    # Number the remaining categories in bad-rate order
    df_temp['bin_raw'] = range(1, df_temp.shape[0] + 1)
    df_temp = df_temp.reset_index(drop=True)
    return df_temp


def disc_var_bin(x,
                 y,
                 method=1,
                 mmin=3,
                 mmax=8,
                 stop_limit=0.1,
                 bin_min_num=20):
    """
    Binning for discrete variables. If the variable is very sparse, it is better
    to encode it first and then bin it as a continuous variable.
    # Parameters
    x: data to bin, pandas Series
    y: label variable
    method: binning method, 1: chi-merge, 2: IV, 3: information entropy
    mmin: minimum number of bins; if the number of initial bins is mmin or fewer,
          mmin is set to 2, i.e. at least 2 bins. If even two bins cannot satisfy
          the minimum-sample constraint and only 1 bin remains, the variable is dropped
    mmax: maximum number of bins; if the number of initial bins is mmax or fewer,
          mmax is set to that number minus 1
    stop_limit: early-stopping threshold; splitting stops once the relative gain
          is no longer significant
    bin_min_num: minimum number of samples per bin
    # Returns
    (binning result as a pandas DataFrame, gain values, gain rates, variables to drop)
    """
    del_key = []
    # Put missing values aside
    df_na = pd.DataFrame({'x': x[pd.isnull(x)], 'y': y[pd.isnull(x)]})
    y = y[~pd.isnull(x)]
    x = x[~pd.isnull(x)]
    # Type conversion: numeric categories become strings so that merged category
    # names can be joined with '%'. np.issubdtype checks the dtype hierarchy;
    # np.integer / np.floating replace np.int_ / np.float_ (the latter was
    # removed in NumPy 2.0).
    if np.issubdtype(x.dtype, np.integer):
        x = x.astype('float').astype('str')
    if np.issubdtype(x.dtype, np.floating):
        x = x.astype('str')

    # Bin by category and get the statistics of each bin
    temp_cont = cal_bin_value(x, y, bin_min_num)

    # With 5 or fewer distinct values (after removing missing values) no
    # further binning is done
    if len(x.unique()) > 5:
        # Compare the number of merged bins with the configured maximum
        if mmax >= temp_cont.shape[0]:
            mmax = temp_cont.shape[0] - 1
        if mmin >= temp_cont.shape[0]:
            mmin = 2
            mmax = temp_cont.shape[0] - 1
        if mmax == 1:
            print('Variable {0} has only 1 bin after merging; it is dropped'.format(x.name))
            del_key.append(x.name)

        gain_1 = 1e-10
        gain_value_save0 = []
        gain_rate_save0 = []
        for i in range(1, mmax):
            temp_cont, gain_2 = select_split_point(temp_cont, method=method)
            gain_rate = gain_2 / gain_1 - 1   # relative gain
            gain_value_save0.append(np.round(gain_2, 4))
            if i == 1:
                gain_rate_save0.append(0.5)
            else:
                gain_rate_save0.append(np.round(gain_rate, 4))
            gain_1 = gain_2
            if temp_cont.bin.max() >= mmin and temp_cont.bin.max() <= mmax:
                if gain_rate <= stop_limit:
                    break

        # temp_cont has e.g. shape (6, 7) here
        temp_cont = temp_cont.rename(columns={'var': x.name})
        # and e.g. shape (6, 3) after dropping the working columns
        temp_cont = temp_cont.drop(['good', 'bad', 'bin_raw', 'bad_rate'], axis=1)
    else:
        temp_cont.bin = temp_cont.bin_raw
        temp_cont = temp_cont[['total', 'bin', 'var_name']]
        gain_value_save0 = []
        gain_rate_save0 = []
        del_key = []

    # Add the missing values as a separate bin
    if len(df_na) > 0:
        index_1 = temp_cont.shape[0] + 1
        temp_cont.loc[index_1, 'total'] = df_na.shape[0]
        temp_cont.loc[index_1, 'bin'] = temp_cont.bin.max() + 1
        temp_cont.loc[index_1, 'var_name'] = 'NA'
    temp_cont = temp_cont.reset_index(drop=True)
    if temp_cont.shape[0] == 1:
        del_key.append(x.name)
    return temp_cont.sort_values(by='bin'), gain_value_save0, gain_rate_save0, del_key


def disc_var_bin_map(x, bin_map):
    """
    Map the raw values of a discrete variable to bins using a binning result.
    # Parameters
    x: discrete variable to map, pandas Series
    bin_map: binning result used as the mapping table, pandas DataFrame
    # Returns
    The mapped Series
    """
    # Work on a copy so the caller's data is not modified
    x = x.copy()
    not_null = ~pd.isnull(x)
    xx = x[not_null]
    # Type conversion: numeric categories are converted to strings in the same
    # way as in disc_var_bin, so the keys match the names stored in var_name
    if np.issubdtype(xx.dtype, np.integer):
        x = x.astype(object)
        x[not_null] = xx.astype('float').astype('str')
    elif np.issubdtype(xx.dtype, np.floating):
        x = x.astype(object)
        x[not_null] = xx.astype('str')
    d = dict()
    for i in bin_map.index:
        for j in bin_map.loc[i, 'var_name'].split('%'):
            if j != 'NA':
                d[j] = bin_map.loc[i, 'bin']

    # Series.map looks up every value in the dictionary (or passes it to the
    # function) and returns the mapped value; unknown keys become NaN
    new_x = x.map(d)

    # Missing values are mapped to the 'NA' bin, if there is one
    if sum(pd.isnull(new_x)) > 0:
        index_1 = bin_map.index[bin_map.var_name == 'NA']
        if len(index_1) > 0:
            # Assign the bin number as a scalar so it applies to every missing entry
            new_x[pd.isnull(new_x)] = bin_map.loc[index_1, 'bin'].iloc[0]
    new_x.name = x.name + '_BIN'

    return new_x


if __name__ == '__main__':

    path = os.getcwd()
    data_path = os.path.join(path, 'data')
    file_name = 'german.csv'
    # Read the data
    data_train, data_test = data_read(data_path, file_name)
    # Work on an independent copy so the assignments below modify data_train itself
    data_train = data_train.copy()
    print("data_train.shape = ", data_train.shape)
    print("data_test.shape = ", data_test.shape)

    dict_cont_bin = {}
    cont_name = ['duration', 'amount', 'income_rate', 'residence_info', 'age', 'num_credits', 'dependents']

    # ------------------------ Continuous variables -------------------------- #
    # Inject missing values to exercise the missing-value bin. The column is cast
    # to float first so that NaN fits its dtype, and .iloc replaces the chained
    # assignment data_train.amount[1:30] = np.nan.
    data_train['amount'] = data_train['amount'].astype(float)
    data_train.iloc[1:30, data_train.columns.get_loc('amount')] = np.nan
    cont_params = dict(mmin=4, mmax=10, bin_rate=0.01, stop_limit=0.1, bin_min_num=20)
    # Note: the input is a single variable each time
    data_test1, gain_value_save1, gain_rate_save1 = cont_var_bin(
        data_train.amount, data_train.target, method=1, **cont_params)

    data_test2, gain_value_save2, gain_rate_save2 = cont_var_bin(
        data_train.amount, data_train.target, method=2, **cont_params)

    data_test3, gain_value_save3, gain_rate_save3 = cont_var_bin(
        data_train.amount, data_train.target, method=3, **cont_params)

    # Bin all continuous variables in a batch and store each result in a dict
    for i in cont_name:
        dict_cont_bin[i], gain_value_save, gain_rate_save = cont_var_bin(
            data_train[i], data_train.target, method=1, **cont_params)

    # Training data: map the continuous variables to their bins
    df_cont_bin_train = pd.DataFrame()
    for i in dict_cont_bin.keys():
        print("dict_cont_bin.keys = ", i)
        df_cont_bin_train = pd.concat([
            df_cont_bin_train,
            cont_var_bin_map(data_train[i], dict_cont_bin[i])], axis=1)

    # ---------------------- Discrete variables ---------------------- #
    data_train.iloc[1:30, data_train.columns.get_loc('purpose')] = np.nan
    disc_params = dict(mmin=4, mmax=10, stop_limit=0.1, bin_min_num=10)
    data_disc_test1, gain_value_save1, gain_rate_save1, del_key = disc_var_bin(
        data_train.purpose, data_train.target, method=1, **disc_params)

    data_disc_test2, gain_value_save2, gain_rate_save2, del_key = disc_var_bin(
        data_train.purpose, data_train.target, method=2, **disc_params)

    data_disc_test3, gain_value_save3, gain_rate_save3, del_key = disc_var_bin(
        data_train.purpose, data_train.target, method=3, **disc_params)

    pd.set_option('display.max_rows', 60)
    pd.set_option('display.max_columns', 0)
    dict_disc_bin = {}
    del_key = []
    # Find the discrete variables
    disc_name = [x for x in data_train.columns if x not in cont_name]
    disc_name.remove('target')
    for i in disc_name:
        print("disc_name = ", i)
        dict_disc_bin[i], gain_value_save, gain_rate_save, del_key_1 = disc_var_bin(
            data_train[i],
            data_train.target,
            method=1,
            mmin=3,
            mmax=8,
            stop_limit=0.1,
            bin_min_num=5)
        if len(del_key_1) > 0:
            del_key.extend(del_key_1)
    # Remove variables that ended up with a single bin
    # (pop with a default tolerates a name that was reported twice)
    for j in del_key:
        dict_disc_bin.pop(j, None)

    # Training data: map the discrete variables to their bins
    df_disc_bin_train = pd.DataFrame()
    for i in dict_disc_bin.keys():
        print("Mapping discrete variable: ", i)
        df_disc_bin_train = pd.concat([
            df_disc_bin_train,
            disc_var_bin_map(data_train[i], dict_disc_bin[i])], axis=1)

    # Test data: map the continuous variables with the bins learned on the training set
    df_cont_bin_test = pd.DataFrame()
    for i in dict_cont_bin.keys():
        df_cont_bin_test = pd.concat([
            df_cont_bin_test,
            cont_var_bin_map(data_test[i], dict_cont_bin[i])], axis=1)
    # Test data: map the discrete variables
    df_disc_bin_test = pd.DataFrame()
    for i in dict_disc_bin.keys():
        df_disc_bin_test = pd.concat([
            df_disc_bin_test,
            disc_var_bin_map(data_test[i], dict_disc_bin[i])], axis=1)
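To make the three criteria in `cal_advantage` easier to check by hand, here is a small sketch that recomputes the chi-square statistic, the IV, and the normalized information gain with NumPy for a hand-made table of good and bad counts. The function name `split_metrics` and the counts are assumptions for illustration and do not come from the book. Unlike the listing, this sketch treats 0·log 0 as 0 in the entropy, so a pure bin does not produce NaN there.

```python
import numpy as np


def split_metrics(good, bad):
    """Return (chi-square, IV, normalized information gain) for per-bin counts."""
    good = np.asarray(good, dtype=float)
    bad = np.asarray(bad, dtype=float)
    total = good + bad
    G, B, N = good.sum(), bad.sum(), total.sum()

    # Chi-square: sum over bins and classes of (observed - expected)^2 / expected
    exp_good = total * G / N
    exp_bad = total * B / N
    chi2 = ((good - exp_good) ** 2 / exp_good).sum() + ((bad - exp_bad) ** 2 / exp_bad).sum()

    # IV: (share of bad - share of good) * ln(share of bad / share of good)
    bad_share = bad / B
    good_share = good / G
    iv = ((bad_share - good_share) * np.log(bad_share / good_share)).sum()

    def entropy(p):
        p = np.asarray(p, dtype=float)
        p = p[p > 0]  # convention: 0 * log(0) = 0
        return -(p * np.log(p)).sum()

    # Normalized information gain: 1 - H(y | bin) / H(y)
    h_total = entropy([G / N, B / N])
    h_cond = sum(t / N * entropy([g / t, b / t]) for g, b, t in zip(good, bad, total))
    gain = 1 - h_cond / h_total
    return chi2, iv, gain


# Two bins (hypothetical counts): the first mostly good, the second mostly bad
chi2, iv, gain = split_metrics(good=[40, 10], bad=[10, 40])
print(round(chi2, 4), round(iv, 4), round(gain, 4))
```

With equal class totals of 50 each, every expected count is 25, so the chi-square statistic is 4 × 15² / 25 = 36, which is a quick way to confirm the formula.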