Current research on personal credit risk assessment, both in China and abroad, trains a model on borrowers' historical behaviour (multi-dimensional features from historical data together with whether the loan defaulted), and then uses the trained model to judge whether a new applicant has both the ability and the willingness to repay, i.e. to predict whether the applicant will default. There are two main families of methods: traditional credit risk assessment, represented by the Logistic regression model, and newer machine-learning approaches such as support vector machines, neural networks, and decision trees.
This article applies the "credit scorecard based on Logistic regression" approach to Lending Club's loan business: a scorecard model is built to predict whether a borrower will default, with the goal of minimising risk. The idea is to discretise the model variables, encode each bin with its WOE value, and then fit a logistic regression, i.e. a generalised linear model for a binary target. By convention, the target variable is coded 1 for a defaulted borrower and 0 for a normal borrower.

- Raw data source: https://www.kaggle.com/wendykan/lending-club-loan-data/kernels
- Original time span: 2007-2018
- Original data dimensions: 2.26 million rows × 145 columns
- Definition of default in this project: 16 or more days overdue (d_loan = ["Late (16-30 days)", "Late (31-120 days)", "Charged Off", "Default", "Does not meet the credit policy. Status:Charged Off"]); see the label-construction sketch after this list
- Modelling window: because the full data set is large and spans a long period, only the 2016 and 2017 data are used for the subsequent modelling (877,986 rows × 145 columns).
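As a reference for the default definition above, here is a compact equivalent of the label construction in the appendix code, which builds the same `y_label` column through an intermediate `loan_condition` field:

```python
# Loans whose status falls in the default list are labelled 1, all others 0
d_loan = ["Late (16-30 days)", "Late (31-120 days)", "Charged Off", "Default",
          "Does not meet the credit policy. Status:Charged Off"]
df['y_label'] = df['loan_status'].isin(d_loan).astype(int)  # 1 = default, 0 = normal loan
```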
Preliminary analysis of Lending Club's business:
- Initial analysis of Lending Club's 2007-2018 lending business: https://www.cnblogs.com/Ray-0808/p/12618551.html
- Good/bad loan analysis of Lending Club's 2007-2018 lending business: https://www.cnblogs.com/Ray-0808/p/12668594.html
1. Data Cleaning
1.1 Dropping variables
- Drop variables with a missing rate above 25% (44 variables)
- Drop variables that take only a single value or are highly homogeneous (17 variables)
- Drop some variables that are useless for prediction, such as post-loan fields, as shown in the figure below

1.2 Dropping records
- Drop records missing more than 25% of their feature information (1 record)
- Drop records containing abnormal values
1.3 Filling missing values
- Treat missing values as their own category: fill missing string-type values with 'UnKnown'
- Fill missing numeric values with the mode
Data size after cleaning: 877,985 rows × 71 columns. A condensed sketch of these cleaning steps follows.
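The sketch below uses the thresholds just listed; it assumes the labelled 2016-2017 pickle produced in the appendix, and omits the hand-picked list of post-loan variables that the appendix drops in addition:

```python
import pandas as pd

df = pd.read_pickle(folder_data + "loan_1617_preprocessing.pkl")  # labelled 2016-2017 loans

# 1.1 Drop columns with more than 25% missing values
na_rate = df.isna().mean()
df = df.drop(columns=na_rate[na_rate > 0.25].index)

# 1.1 Drop near-constant columns (a single value covers more than 90% of records)
near_constant = [c for c in df.columns if (df[c] == df[c].mode()[0]).mean() > 0.9]
df = df.drop(columns=near_constant)

# 1.2 Drop records missing more than 25% of the remaining features
df = df[df.isna().mean(axis=1) <= 0.25]

# 1.3 Fill missing values: 'UnKnown' for string columns, the mode for numeric columns
obj_cols = df.select_dtypes(include='object').columns
num_cols = df.select_dtypes(exclude='object').columns
df[obj_cols] = df[obj_cols].fillna('UnKnown')
df[num_cols] = df[num_cols].fillna(df[num_cols].mode().iloc[0])
```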
2. Data Binning
2.1 Binning methods
Binning methods can be divided into supervised and unsupervised binning. Supervised binning includes split-based and merge-based binning; unsupervised binning includes equal-frequency, equal-width, and clustering-based binning. A quick illustration of the unsupervised methods follows.
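The unsupervised variants are one-liners in pandas; the column name below is only an example:

```python
import pandas as pd

# Equal-frequency binning: each bin holds roughly the same number of records
df['int_rate_eqfreq'] = pd.qcut(df['int_rate'], q=10, duplicates='drop')
# Equal-width binning: each bin covers an interval of the same width
df['int_rate_eqwidth'] = pd.cut(df['int_rate'], bins=10)
```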
This project uses chi-square binning (ChiMerge). Its basic idea is that if two adjacent intervals have similar class distributions, they are merged; otherwise they are kept separate.
Procedure (a simplified sketch of the merge loop is given after this list):
- Input: the maximum number of bins n
- Initialisation
  - Sort continuous values in ascending order; for discrete values, first convert each value to its bad-customer rate and then sort in ascending order;
  - To reduce computation, variables with more than a threshold number of distinct values (100 is recommended) are coarsely pre-binned with equal-frequency binning.
  - If there are missing values, they form a separate bin.
- Merging intervals
  - Compute the chi-square value for every pair of adjacent intervals;
  - Merge the pair of intervals with the smallest chi-square value;
  - Repeat the two steps above until the number of bins is no greater than n.
- Post-processing
  - Merge any bin whose bad-customer rate is 0 or 1 (a bin must not contain only good or only bad customers).
  - Remove bins in which a single bin holds more than 95% of the samples after binning.
- Output: the binned data and the bin boundaries.
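Below is a simplified sketch of the core merge loop described above; it leaves out the equal-frequency pre-binning, special-value handling, and post-processing, all of which are implemented in the `ChiMerge` function in the appendix:

```python
import numpy as np
import pandas as pd

def chi2_stat(counts):
    """Chi-square statistic for a table of [bad, good] counts, one row per interval."""
    counts = np.asarray(counts, dtype=float)
    total = counts.sum()
    row_tot = counts.sum(axis=1, keepdims=True)        # per-interval totals
    col_tot = counts.sum(axis=0, keepdims=True)        # overall bad / good totals
    expected = row_tot * col_tot / total
    expected[expected == 0] = 1e-10                    # avoid division by zero
    return ((counts - expected) ** 2 / expected).sum()

def chi_merge(df, col, target, max_bins=5):
    """Merge adjacent intervals of `col` with the smallest chi-square value
    until at most `max_bins` intervals remain; return the right-edge cut points."""
    grouped = (df.groupby(col)[target]
                 .agg(bad='sum', total='count')
                 .reset_index())
    grouped['good'] = grouped['total'] - grouped['bad']
    # Start with one interval per distinct value: [values, bad count, good count]
    intervals = [[[v], b, g] for v, b, g in
                 zip(grouped[col], grouped['bad'], grouped['good'])]
    while len(intervals) > max_bins:
        chis = [chi2_stat([intervals[i][1:], intervals[i + 1][1:]])
                for i in range(len(intervals) - 1)]
        k = int(np.argmin(chis))                       # most similar adjacent pair
        merged = [intervals[k][0] + intervals[k + 1][0],
                  intervals[k][1] + intervals[k + 1][1],
                  intervals[k][2] + intervals[k + 1][2]]
        intervals[k:k + 2] = [merged]
    return [max(iv[0]) for iv in intervals[:-1]]       # cut points (right edges)
```

For example, `chi_merge(train_set, 'int_rate', 'y_label', max_bins=5)` returns at most four cut points; values are then assigned to bins by comparing against these cut points, with anything above the last cut point falling into the final bin.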
2.2 Data splitting in this project
Because chi-square binning uses the actual default label ("y_label"), the data set must be split before binning to prevent data leakage. Since the data are imbalanced (defaulted samples are far fewer than normal samples), stratified sampling is used so that the class distribution stays consistent across the splits.
```python
# Stratified sampling to keep the class distribution consistent across splits
from sklearn.model_selection import StratifiedShuffleSplit

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in split.split(data, data.loc[:, 'y_label']):
    train_set = data.iloc[train_index, :]
    test_set = data.iloc[test_index, :]
```
2.3 Binning workflow in this project
- Categorical variables
  - With few distinct values:
    - if every category contains both good and bad samples, no binning is needed;
    - if some category contains only good or only bad samples, it must be merged with another category.
  - With many distinct values: first apply bad-rate encoding (replace each category value with its default rate, illustrated after this subsection), then bin it like a continuous variable.
- Continuous variables
  - Check for special values that need their own bin.
  - Check the number of initial distinct values; if it is too large, coarse-bin first (for example into 100 bins).
  - Apply chi-square binning.
  - Check whether any bins hold identical values; if so, merge them.
  - Check whether the bad rate is monotone across bins after binning; if not, merge bins until it is.
This step involves a fair amount of code, which is collected in the appendix.
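As a quick illustration of the bad-rate encoding step for high-cardinality categorical variables (column names are taken from the appendix code, which additionally treats 'UnKnown' as a special value):

```python
# Replace each category with the default rate observed on the training set
bad_rate = train_data.groupby('purpose')['y_label'].mean()
train_data['purpose_BR_encoding'] = train_data['purpose'].map(bad_rate)
test_data['purpose_BR_encoding'] = test_data['purpose'].map(bad_rate)  # reuse the train-derived mapping
```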
3. WOE Encoding
After binning, each bin of every variable must be encoded before the variables can be fed into the logistic regression model.
WOE stands for "Weight of Evidence". The WOE value of every bin of a variable is computed, and the variable is then encoded by mapping each bin to its WOE value (a short calculation sketch follows the list below). Unlike one-hot encoding, this does not create many new variables, so it avoids sparse matrices and the curse of dimensionality.
WOE encoding effectively maps the binned features from a space that is not linearly separable into one that is approximately linearly separable:
- it can improve the model's predictive performance;
- it puts all independent variables on the same scale;
- the WOE value reflects the contribution of each value of the independent variable;
- it makes it easy to assign a score to each bin of a variable;
- after conversion to a continuous variable, correlations between variables are easier to analyse;
- compared with one-hot encoding, it keeps each variable whole while avoiding sparse matrices and the curse of dimensionality.
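Consistent with the `CalcWOE` helper in the appendix, which defines WOE as ln(good% / bad%), here is a minimal sketch of the per-bin WOE and IV calculation, assuming a binned column and the `y_label` target (every bin is assumed to contain both good and bad samples, which the binning step guarantees):

```python
import numpy as np

def calc_woe_iv(df, bin_col, target):
    """Per-bin WOE values and the variable's IV; target = 1 marks a defaulted loan."""
    grouped = df.groupby(bin_col)[target].agg(bad='sum', total='count')
    grouped['good'] = grouped['total'] - grouped['bad']
    grouped['bad_pcnt'] = grouped['bad'] / grouped['bad'].sum()      # share of all bad loans
    grouped['good_pcnt'] = grouped['good'] / grouped['good'].sum()   # share of all good loans
    grouped['WOE'] = np.log(grouped['good_pcnt'] / grouped['bad_pcnt'])
    iv = ((grouped['good_pcnt'] - grouped['bad_pcnt']) * grouped['WOE']).sum()
    return grouped['WOE'].to_dict(), iv
```

The returned dictionary is used exactly like `WOE_dict[var]` in the appendix: each bin label of the variable is mapped to its WOE value to produce the corresponding `_WOE` column.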

For example, the "int_rate" variable:

4. Variable Selection
4.1 Single-variable selection
- Selection by IV (Information Value), which plays a role similar to information gain or the Gini index
- Stepwise selection, adding or removing variables one step at a time
- Selection by feature importance: random forest, GBDT, ...
- Selection by LASSO (L1) regularisation
This article tried two of these methods for comparison (IV-based selection and random-forest importance) and ultimately used the first.

```python
# 4.1 Single-variable selection --> keep variables with IV > 0.02
high_IV = {k: v for k, v in IV_dict.items() if v >= 0.02}
high_IV_dict_sorted = sorted(high_IV.items(), key=lambda x: x[1], reverse=False)
high_IV_values = [i[1] for i in high_IV_dict_sorted]
high_IV_names = [i[0] for i in high_IV_dict_sorted]

feat_select_1 = []
for i in high_IV_names:
    feat_select_1.append(i + '_WOE')
print("After filtering with IV > 0.02, {} variables remain".format(len(feat_select_1)))
print("They are:", feat_select_1)

fig_1, ax = plt.subplots(1, 1)
plt.barh(range(len(high_IV_values)), high_IV_values)
plt.title('feature IV')
plt.yticks(range(len(high_IV_names)), high_IV_names)
plt.savefig(folder_data + "feature_IV.png", dpi=1000, bbox_inches='tight')  # high dpi so the figure is not blurry or clipped
# plt.show()
```
```python
# 4.2 Single-variable selection --> random forest feature importance
X = train_data[WOE_feat]
X = np.matrix(X)
y = train_data['y_label']
y = np.array(y)

RFC = RandomForestClassifier()
RFC_Model = RFC.fit(X, y)

features_rfc = train_data[WOE_feat].columns
RF_feat_importance = {features_rfc[i]: RFC_Model.feature_importances_[i] for i in range(len(features_rfc))}
RF_feat_importance_sorted = sorted(RF_feat_importance.items(), key=lambda x: x[1], reverse=False)
RF_feat_values = [i[1] for i in RF_feat_importance_sorted]
RF_feat_names = [i[0] for i in RF_feat_importance_sorted]

print("After random-forest single-variable selection, {} variables remain".format(len(RF_feat_names)))
print("They are:", RF_feat_names)

fig_2, ax = plt.subplots(1, 1)
plt.barh(range(len(RF_feat_values)), RF_feat_values)
plt.title('Random forest feature importance')
plt.yticks(range(len(RF_feat_names)), RF_feat_names)
plt.show()
```
4.2 Variable correlation
- Pairwise correlation: linear correlation between two variables, measured by the Pearson correlation coefficient
- Multicollinearity: one variable can be expressed as a linear combination of the others; checked with the VIF (variance inflation factor). A condensed sketch of both checks follows.
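The sketch below follows the same logic and names (`high_IV`, `train_data`, the `_WOE` columns) as the appendix code: for any pair with |r| > 0.7 only the higher-IV variable is kept, and the surviving set is then checked with the variance inflation factor:

```python
import numpy as np
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Pairwise filter: among highly correlated pairs, keep the variable with the higher IV
woe_cols = [name + '_WOE' for name in high_IV]
corr = train_data[woe_cols].corr()
ordered = sorted(high_IV.items(), key=lambda kv: kv[1], reverse=True)   # highest IV first
dropped = set()
for i, (name_i, iv_i) in enumerate(ordered):
    if name_i in dropped:
        continue
    for name_j, iv_j in ordered[i + 1:]:
        if name_j in dropped:
            continue
        if abs(corr.loc[name_i + '_WOE', name_j + '_WOE']) > 0.7:
            dropped.add(name_j)                                          # drop the lower-IV one
feat_select_2 = [name + '_WOE' for name, _ in ordered if name not in dropped]

# Multicollinearity check: a common rule of thumb flags VIF values above 10
X = np.asarray(train_data[feat_select_2], dtype=float)
vif = [variance_inflation_factor(X, i) for i in range(X.shape[1])]
print("max VIF:", max(vif))
```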
Heat map of the pairwise linear correlations between variables:

5. Modelling and Evaluation
5.1 Modelling
The logistic regression model can be fitted with libraries such as sklearn or statsmodels.
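For reference, the fitting step as performed in the appendix with statsmodels (`feat_select_3` is the variable list that survives the screening in section 4; the appendix afterwards also drops variables whose p-values are not significant):

```python
import statsmodels.api as sm

X = train_data[feat_select_3].copy()
X['intercept'] = 1.0                       # sm.Logit does not add an intercept automatically
y = train_data['y_label']

LR = sm.Logit(y, X).fit()
print(LR.summary())                        # coefficients and p-values per variable

# Predict default probabilities on the test set with the same design-matrix layout
X_test = test_data[feat_select_3].copy()
X_test['intercept'] = 1.0
test_data['predict'] = LR.predict(X_test)
```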

5.2 Evaluation

| | Predicted 1 | Predicted 0 |
| --- | --- | --- |
| Actual 1 | TP (true positive) | FN (false negative) |
| Actual 0 | FP (false positive) | TN (true negative) |
- True positive rate (TPR): the fraction of actual positives predicted as positive, TPR = TP / (TP + FN)
- False positive rate (FPR): the fraction of actual negatives predicted as positive, FPR = FP / (FP + TN)
- ROC curve: the curve with FPR on the horizontal axis and TPR on the vertical axis
- AUC: the area under the ROC curve; equivalently, the probability that, for a randomly chosen positive sample and a randomly chosen negative sample, the classifier assigns the positive one a higher score of being positive than the negative one
- KS: KS = max(TPR - FPR); a value above 0.2 indicates reasonable discriminatory power (a short computation sketch follows)
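A minimal way to compute both metrics from the test-set predicted probabilities; the appendix's `CalcKsAuc` implements the same idea and additionally draws the KS and ROC curves:

```python
from sklearn.metrics import roc_curve, roc_auc_score

fpr, tpr, _ = roc_curve(test_data['y_label'], test_data['predict'])
ks = (tpr - fpr).max()                     # KS = max(TPR - FPR) over all thresholds
auc = roc_auc_score(test_data['y_label'], test_data['predict'])
print('KS = {:.3f}, AUC = {:.3f}'.format(ks, auc))
```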

6. References
- Logistic regression: https://www.cnblogs.com/biyeymyhjob/archive/2012/07/18/2595410.html
- Feature importance, WOE, IV, bad rate: https://www.cnblogs.com/Allen-rg/p/11507970.html
- Random forest: https://www.cnblogs.com/maybe2030/p/4585705.html#top
- L1 and L2 regularisation: https://blog.csdn.net/jinping_shi/article/details/52433975
- Building a financial scorecard with logistic regression: https://zhuanlan.zhihu.com/p/36539125
- Modelling with Lending Club data: https://zhuanlan.zhihu.com/p/21550547
Appendix:
1 # coding:utf-8 2 3 import pandas as pd 4 import numpy as np 5 import time 6 import matplotlib.pyplot as plt 7 import seaborn as sns 8 from statsmodels.stats.outliers_influence import variance_inflation_factor 9 import pickle 10 from sklearn.model_selection import train_test_split 11 import statsmodels.api as sm 12 from sklearn.ensemble import RandomForestClassifier 13 import CustomFunction as cf 14 from sklearn.model_selection import StratifiedShuffleSplit 15 16 # 顯示設置 17 plt.rcParams['font.sans-serif'] = ['SimHei'] # 用來正常顯示中文標簽 18 plt.rcParams['axes.unicode_minus'] = False # 用來正常顯示負號,#有中文出現的情況,需要u'內容' 19 pd.set_option('display.max_columns', 1000) 20 pd.set_option('display.width', 1000) 21 pd.set_option('display.max_colwidth', 1000) 22 pd.set_option('display.max_row', 1000) 23 24 # 程序開始時間計算 25 start_time = time.process_time() 26 27 # 存放數據文件路徑 28 folder_data = r'E:/3.Python_Project/LENGING_CLUB/data/' 29 30 # ---------------------------------------------------------------------------------------------------------------------- 31 # 0. 讀取數據 , 轉換數據格式 32 # ---------------------------------------------------------------------------------------------------------------------- 33 # loan = pd.read_csv('loan.csv', low_memory=False) 34 # df = loan.copy() 35 # 36 # # 將 ‘issue_d’ str形式的時間日期轉換 37 # dt_series = pd.to_datetime(df['issue_d']) 38 # df['year'] = dt_series.dt.year 39 # df['month'] = dt_series.dt.month 40 # 41 # # 取 2016-2017 年度的數據進行后續計算 42 # print(df['year'].unique()) 43 # df = df[df['year'].isin([2016, 2017])] 44 # print(df['year'].unique()) 45 # 46 # # 根據貸款狀態 ‘loan_status’ 定義違約標簽 47 # d_loan = ["Late (16-30 days)", "Late (31-120 days)", "Charged Off", "Default", 48 # "Does not meet the credit policy. Status:Charged Off"] 49 # df['loan_condition'] = np.nan 50 # lst = [df] 51 # 52 # 53 # def loan_condition(status): 54 # if status in d_loan: 55 # return 'd Loan' 56 # else: 57 # return 'c Loan' 58 # 59 # 60 # df['loan_condition'] = df['loan_status'].apply(loan_condition) 61 # df['y_label'] = np.nan 62 # for col in lst: 63 # col.loc[df['loan_condition'] == 'c Loan', 'y_label'] = 0 # 正常貸款 64 # col.loc[df['loan_condition'] == 'd Loan', 'y_label'] = 1 # 違約貸款 65 # df['y_label'] = df['y_label'].astype(int) 66 # df.to_pickle(folder_data + "loan_1617_preprocessing.pkl") 67 68 # ---------------------------------------------------------------------------------------------------------------------- 69 # 1. 
數據清洗 70 # ---------------------------------------------------------------------------------------------------------------------- 71 # df = pd.read_pickle(folder_data+"loan_1617_preprocessing.pkl") 72 # 73 # # 1.1 刪除變量 74 # df_na = df.isna().sum()/df.shape[0] # 缺失率 75 # # 刪去缺失率大於 25% 變量 76 # del_var = [] 77 # var_big_na = list(df_na[df_na>0.25].index ) # 缺失值超過 25% 的變量 78 # del_var.append(var_big_na) 79 # df.drop(var_big_na,axis=1,inplace=True) 80 # df_na.drop(var_big_na,axis=0,inplace=True) 81 # # 刪去特征只有一個的變量,同一性很大的變量 82 # var_same=[] 83 # for i in df.columns: 84 # mode_value = df[i].mode()[0] 85 # mode_rate = len(df[i][df[i]==mode_value]) / df.shape[0] 86 # if mode_rate > 0.9: 87 # var_same.append(i) 88 # df.drop(i,axis=1,inplace=True) 89 # del_var.append(var_same) 90 # # 刪去自選的一些無用特特征 91 # var_no_use = ['sub_grade', 'emp_title', 'addr_state', 'zip_code', 'loan_condition', 'loan_status', 92 # 'last_pymnt_d', 'last_credit_pull_d', 'last_pymnt_amnt', 'earliest_cr_line', 93 # 'initial_list_status', 'title', 'issue_d', 'mths_since_recent_inq'] 94 # del_var.append(var_no_use) 95 # df.drop(var_no_use,axis=1,inplace=True) 96 # 97 # # 1.2 刪除記錄 98 # df_na_row = df.isna().sum(axis=1)/df.shape[1] 99 # del_row = list(df_na_row[df_na_row>0.25].index ) # 信息缺失超過 25% 的記錄 100 # df.drop(del_row,axis=0,inplace=True) 101 # 102 # # 1.3 缺失值填充 103 # # 查看變量類型是否正確,並填充缺失值 104 # object_columns = df.select_dtypes(include=["object"]).columns 105 # numeric_columns = df.select_dtypes(exclude=["object"]).columns 106 # # print(df[object_columns].head()) 107 # # print(df[numeric_columns].head()) 108 # 109 # # 將字符型缺失數據填充 'UnKnown' 110 # df[object_columns]=df[object_columns].fillna('UnKnown') 111 # 112 # # 將數值型缺失數據用眾數填充 113 # for i in numeric_columns: 114 # mode_value = df[i].mode()[0] 115 # df[i] = df[i].fillna(mode_value) 116 # 117 # print("數據清洗后有{}個變量".format(df.shape[1]-3)) 118 # print("數據清洗后有{}條記錄".format(df.shape[0])) 119 # df.to_pickle(folder_data+"loan_1617_clean.pkl") 120 121 # ---------------------------------------------------------------------------------------------------------------------- 122 # 2. 
數據分箱(ChiMerge) 123 # ---------------------------------------------------------------------------------------------------------------------- 124 # 分箱之后: 125 # (1)不超過5箱 126 # (2)Bad Rate單調 127 # (3)每箱同時包含好壞樣本 128 # (4)特殊值如'UnKnown'單獨成一箱 129 # ---------------------------------------------------------------------------------------------------------------------- 130 loan_clean = pd.read_pickle(folder_data + "loan_1617_clean.pkl") 131 # print(loan_clean.head()) 132 # print(loan_clean.isnull().sum()) 133 loan_clean = loan_clean[loan_clean['dti'] != -1] 134 data = loan_clean.loc[loan_clean['term'] == ' 36 months'].copy() 135 data.drop(['year', 'month', 'term'], axis=1, inplace=True) 136 137 num_features = list(data.select_dtypes(exclude=["object"]).columns) # 數值變量 138 cat_features = list(data.select_dtypes(include=["object"]).columns) # 類型變量 139 # print(num_features) 140 # print(cat_features) 141 # 142 # 划分數據集 143 # train_data, test_data = train_test_split(data, test_size=0.4) 144 # 分層抽樣, 避免不均 145 split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42) 146 for train_index, test_index in split.split(data, data.loc[:, 'y_label']): 147 train_set = data.iloc[train_index, :] 148 test_set = data.iloc[test_index, :] 149 150 # 有幾個字段的值比較少,剔除掉 151 train_set = train_set[~(train_set['home_ownership'] == 'NONE')] 152 train_set = train_set[~(train_set['purpose'].isin(['wedding', 'educational']))] 153 train_data = train_set.copy() 154 test_set = test_set[~(test_set['home_ownership'].isin(['NONE']))] 155 test_set = test_set[~(test_set['purpose'].isin(['wedding', 'educational']))] 156 test_data = test_set.copy() 157 158 # 2.1 處理類型變量 159 cat_more_5_feat = [] # 類型變量中超過 5 種取值的變量 160 cat_less_5_feat = [] # 類型變量中超過 5 種取值的變量 161 # a) 檢查類型變量中哪些變量取值超過5 162 for var in cat_features: 163 value_counts = data[var].nunique() 164 if value_counts > 5: 165 cat_more_5_feat.append(var) 166 else: 167 cat_less_5_feat.append(var) 168 169 # b) 對於類型變量值分類小於5的變量,其每種類別應該包含好壞兩種樣本,如果只有其中一種樣本則需要合並 170 merge_bin_dict = {} # 存放需要合並的變量,以及合並方法 171 for var in cat_less_5_feat: 172 bin_bad_rate = cf.BinBadRate(train_data, var, 'y_label')[0] 173 if min(bin_bad_rate.values()) == 0: 174 print('{} 類型變量需要合並箱'.format(var)) 175 combine_bin = cf.MergeBin(train_data, var, 'y_label') 176 merge_bin_dict[var] = combine_bin 177 pass 178 if max(bin_bad_rate.values()) == 1: 179 print('{} 類型變量需要合並箱'.format(var)) 180 combine_bin = cf.MergeBin(train_data, var, 'y_label', direction='good') 181 merge_bin_dict[var] = combine_bin 182 # 183 # print(merge_bin_dict) 184 # 185 # # c) 對於類型變量值分類大於5的變量,使用 badrate 編碼,放入連續型變量 186 # for var in cat_more_5_feat: 187 # if 'UnKnown' in list(data[var].unique()): 188 # encoding_df = train_data.loc[train_data[var] != 'UnKnown'].copy() 189 # else: 190 # encoding_df = train_data.copy() 191 # bin_bad_rate = cf.BinBadRate(encoding_df, var, 'y_label')[0] 192 # if 'UnKnown' in list(data[var].unique()): 193 # bin_bad_rate['UnKnown'] = -1 194 # new_var = var + '_BR_encoding' 195 # num_features.append(new_var) 196 # train_data[new_var] = train_data[var].map(lambda x: bin_bad_rate[x]) 197 # test_data[new_var] = test_data[var].map(lambda x: bin_bad_rate[x]) 198 # 199 # # 2.2 對連續變量進行卡方分箱(chimerge) 200 # bin_feat = [] 201 # for var in num_features: 202 # print("請稍等,正在處理 {} 字段".format(var)) 203 # if -1 not in train_data[var].unique(): 204 # max_bin_num = 5 # 設置最大分箱數 205 # split_points = cf.ChiMerge(train_data, var, 'y_label', max_bin_num=max_bin_num) 206 # train_data[var + '_bin'] = train_data[var].map(lambda x: cf.AssignBin(x, 
split_points)) 207 # monotone = cf.BadRateMonotone(train_data, var + '_bin', 'y_label') 208 # while not monotone: 209 # max_bin_num -= 1 210 # split_points = cf.ChiMerge(train_data, var, 'y_label', max_bin_num=max_bin_num) 211 # train_data[var + '_bin'] = train_data[var].map(lambda x: cf.AssignBin(x, split_points)) 212 # monotone = cf.BadRateMonotone(train_data, var + '_bin', 'y_label') 213 # # if max_bin_num == 2: 214 # # # 當分箱數為2時,必然單調,退出 215 # # break 216 # new_var = var + '_bin' 217 # bin_feat.append(new_var) 218 # test_data[var + '_bin'] = test_data[var].map(lambda x: cf.AssignBin(x, split_points)) 219 # else: 220 # max_bin_num = 5 # 設置最大分箱數 221 # split_points = cf.ChiMerge(train_data, var, 'y_label', special_attribute=[-1], max_bin_num=max_bin_num) 222 # train_data[var + '_bin'] = train_data[var].map(lambda x: cf.AssignBin(x, split_points, special_attribute=[-1])) 223 # monotone = cf.BadRateMonotone(train_data, var + '_bin', 'y_label', special_attribute=[-1]) 224 # while not monotone: 225 # max_bin_num -= 1 226 # split_points = cf.ChiMerge(train_data, var, 'y_label', special_attribute=[-1], max_bin_num=max_bin_num) 227 # train_data[var + '_bin'] = train_data[var].map(lambda x: cf.AssignBin(x, split_points, 228 # special_attribute=[-1])) 229 # monotone = cf.BadRateMonotone(train_data, var + '_bin', 'y_label', special_attribute=[-1]) 230 # new_var = var + '_bin' 231 # bin_feat.append(new_var) 232 # test_data[var + '_bin'] = test_data[var].map(lambda x: cf.AssignBin(x, split_points, special_attribute=[-1])) 233 # 234 # train_data.to_pickle(folder_data + 'train_data_1617.pkl') 235 # test_data.to_pickle(folder_data + 'test_data_1617.pkl') 236 # 237 # bin_feat.sort() 238 # print(bin_feat) 239 # print(train_data.head()) 240 241 # ---------------------------------------------------------------------------------------------------------------------- 242 # 3. WOE 編碼,計算 IV 243 # ---------------------------------------------------------------------------------------------------------------------- 244 train_data = pd.read_pickle(folder_data + 'train_data_1617.pkl') 245 test_data = pd.read_pickle(folder_data + 'test_data_1617.pkl') 246 247 # print(train_data.head()) 248 # print(test_data.head()) 249 250 num_features.remove('y_label') 251 for var in cat_more_5_feat: 252 new_var = var + '_BR_encoding' 253 num_features.append(new_var) 254 bin_feat = [] # bin_feat 分箱后變量 255 for var in num_features: 256 var_bin = var + '_bin' 257 bin_feat.append(var_bin) 258 bin_feat = bin_feat + cat_less_5_feat # 少於5個不同值的類型變量在本例中不用合並箱,直接使用原分箱即可 259 bin_feat.sort() 260 261 print("經過分箱后,有 {} 個變量".format(len(bin_feat))) 262 print("這些變量是:", bin_feat) 263 264 # 計算每個變量分箱后的 WOE 和 IV 值 265 WOE_dict = {} 266 IV_dict = {} 267 for var in bin_feat: 268 WOE_IV = cf.CalcWOE(train_data, var, 'y_label') 269 WOE_dict[var] = WOE_IV['WOE'] 270 IV_dict[var] = WOE_IV['IV'] 271 272 # WOE 編碼 273 WOE_feat = [] 274 for var in IV_dict.keys(): 275 var_WOE = var + '_WOE' 276 train_data[var_WOE] = train_data[var].map(WOE_dict[var]) 277 test_data[var_WOE] = test_data[var].map(WOE_dict[var]) 278 WOE_feat.append(var_WOE) 279 280 # print(train_data[WOE_feat].head()) 281 282 # ---------------------------------------------------------------------------------------------------------------------- 283 # 4. 
變量選擇 284 # ---------------------------------------------------------------------------------------------------------------------- 285 # 4.1 單變量選擇 --> 按 IV 值挑選變量,IV > 0.02 286 high_IV = {k: v for k, v in IV_dict.items() if v >= 0.02} 287 high_IV_dict_sorted = sorted(high_IV.items(), key=lambda x: x[1], reverse=False) 288 high_IV_values = [i[1] for i in high_IV_dict_sorted] 289 high_IV_names = [i[0] for i in high_IV_dict_sorted] 290 291 feat_select_1 = [] 292 for i in high_IV_names: 293 feat_select_1.append(i + '_WOE') 294 print("經過 IV>0.02 挑選后,有 {} 個變量".format(len(feat_select_1))) 295 print("這些變量是:", feat_select_1) 296 297 fig_1, ax = plt.subplots(1, 1, figsize=(10, 10)) 298 plt.barh(range(len(high_IV_values[0:19])), high_IV_values[0:19], facecolor='royalblue') 299 plt.title('feature IV') 300 plt.yticks(range(len(high_IV_names[0:19])), high_IV_names[0:19]) 301 plt.savefig(folder_data + "feature_IV.png", dpi=1000, bbox_inches='tight') # 解決圖片不清晰,不完整的問題 302 # plt.show() 303 304 # 4.2 單變量選擇 --> 隨機森林法選擇變量 305 # X = train_data[WOE_feat] 306 # X = np.matrix(X) 307 # y = train_data['y_label'] 308 # y = np.array(y) 309 # 310 # RFC = RandomForestClassifier() 311 # RFC_Model = RFC.fit(X, y) 312 # 313 # features_rfc = train_data[WOE_feat].columns 314 # RF_feat_importance = {features_rfc[i]: RFC_Model.feature_importances_[i] for i in range(len(features_rfc))} 315 # RF_feat_importance_sorted = sorted(RF_feat_importance.items(), key=lambda x: x[1], reverse=False) 316 # RF_feat_values = [i[1] for i in RF_feat_importance_sorted] 317 # RF_feat_names = [i[0] for i in RF_feat_importance_sorted] 318 # 319 # print("經過隨機森林法單變量挑選后,有 {} 個變量".format(len(RF_feat_names))) 320 # print("這些變量是:", RF_feat_names) 321 # 322 # fig_2, ax = plt.subplots(1, 1) 323 # plt.barh(range(len(RF_feat_values)), RF_feat_values) 324 # plt.title('隨機森林特征重要性排行') 325 # plt.yticks(range(len(RF_feat_names)), RF_feat_names) 326 # plt.show() 327 328 # 4.3 兩兩變量之間的線性相關性檢測 329 # 計算相關系數矩陣,並畫出熱力圖進行數據可視化 330 train_data_WOE = train_data[feat_select_1] 331 corr = train_data_WOE.corr() 332 # print(corr) 333 334 fig, ax = plt.subplots(figsize=(16, 16)) 335 sns.heatmap(corr, mask=np.zeros_like(corr, dtype=np.bool), cmap=sns.diverging_palette(220, 10, as_cmap=True), 336 square=True, ax=ax) 337 plt.savefig(folder_data + "相關性熱力圖.png", dpi=1000, bbox_inches='tight') # 解決圖片不清晰,不完整的問題 338 # plt.show() 339 340 # 1,將候選變量按照IV進行降序排列 341 # 2,計算第i和第i+1的變量的線性相關系數 342 # 3,對於系數超過閾值的兩個變量,剔除IV較低的一個 343 deleted_index = [] 344 cnt_vars = len(high_IV_dict_sorted) 345 for i in range(cnt_vars): 346 if i in deleted_index: 347 continue 348 x1 = high_IV_dict_sorted[i][0] + "_WOE" 349 for j in range(cnt_vars): 350 if i == j or j in deleted_index: 351 continue 352 y1 = high_IV_dict_sorted[j][0] + "_WOE" 353 roh = corr.loc[x1, y1] 354 if abs(roh) > 0.7: 355 x1_IV = high_IV_dict_sorted[i][1] 356 y1_IV = high_IV_dict_sorted[j][1] 357 if x1_IV > y1_IV: 358 deleted_index.append(j) 359 else: 360 deleted_index.append(i) 361 362 feat_select_2 = [high_IV_dict_sorted[i][0] + "_WOE" for i in range(cnt_vars) if i not in deleted_index] 363 364 print("經過兩兩共線性挑選后,有 {} 個變量".format(len(feat_select_2))) 365 print("這些變量是:", feat_select_2) 366 367 # 4.4 多重共線性分析 368 x = np.matrix(train_data[feat_select_2]) 369 VIF_list = [variance_inflation_factor(x, i) for i in range(x.shape[1])] 370 max_VIF = max(VIF_list) 371 print(max_VIF) 372 # 最大的VIF是 2.862878891263925,因此這一步認為沒有多重共線性 373 feat_select_3 = feat_select_2 374 375 print("經過多重共線性檢查后,有 {} 個變量".format(len(feat_select_3))) 376 print("這些變量是:", feat_select_3) 
377 378 # ---------------------------------------------------------------------------------------------------------------------- 379 # 5. 邏輯回歸模型 380 # ---------------------------------------------------------------------------------------------------------------------- 381 x = train_data[feat_select_3].copy() 382 y = train_data['y_label'] 383 x['intercept'] = 1.0 384 385 LR = sm.Logit(y, x).fit() 386 summary = LR.summary() 387 print(summary) 388 p_values = LR.pvalues 389 p_values_dict = p_values.to_dict() 390 print(p_values_dict) 391 392 # 有些變量對因變量的影響不顯著 (p_value > 0.5 or 0.1 ......) ,去除他們 393 feat_select_4 = feat_select_3 394 large_p_values_dict = {k: v for k, v in p_values_dict.items() if v >= 0.1} 395 large_p_values_dict = sorted(large_p_values_dict.items(), key=lambda d: d[1], reverse=True) 396 while len(large_p_values_dict) > 0 and len(feat_select_4) > 0: 397 varMaxP = large_p_values_dict[0][0] 398 if varMaxP == 'intercept': 399 break 400 feat_select_4.remove(varMaxP) 401 y = train_data['y_label'] 402 x = train_data[feat_select_4] 403 x['intercept'] = [1] * x.shape[0] 404 405 LR = sm.Logit(y, x).fit() 406 p_values_dict = LR.pvalues 407 p_values_dict = p_values_dict.to_dict() 408 large_p_values_dict = {k: v for k, v in p_values_dict.items() if v >= 0.1} 409 large_p_values_dict = sorted(large_p_values_dict.items(), key=lambda d: d[1], reverse=True) 410 411 print("經過 p 值檢驗后,有 {} 個變量".format(len(feat_select_4))) 412 print("這些變量是:", feat_select_4) 413 414 # ---------------------------------------------------------------------------------------------------------------------- 415 # 6.對測試值進行預測,並且計算 KS 值 和 AUC 值 416 # ---------------------------------------------------------------------------------------------------------------------- 417 summary = LR.summary() 418 print(summary) 419 420 x_test = test_data[feat_select_4] 421 x_test['intercept'] = [1] * x_test.shape[0] 422 test_data['predict'] = LR.predict(x_test) 423 424 print(test_data[['predict', 'y_label']]) 425 KS, AUC = cf.CalcKsAuc(test_data, 'predict', 'y_label', draw=1) 426 print('normalLR: KS {}, AUC {}'.format(KS, AUC)) 427 428 end_time = time.process_time() 429 print("程序運行了 %s 秒" % (end_time - start_time))
1 import pandas as pd 2 import numpy as np 3 import matplotlib.pyplot as plt 4 from sklearn.metrics import roc_auc_score 5 6 plt.rcParams['font.sans-serif'] = ['SimHei'] # 用來正常顯示中文標簽 7 plt.rcParams['axes.unicode_minus'] = False # 用來正常顯示負號,#有中文出現的情況,需要u'內容' 8 9 folder_data = r'E:/3.Python_Project/LENGING_CLUB/data/' 10 11 # 分箱壞賬率計算 12 def BinBadRate(df, col, target, grantRateIndicator=0): 13 ''' 14 :param df: 需要計算好壞比率的數據集 15 :param col: 需要計算好壞比率的特征 16 :param target: 好壞標簽 17 :param grantRateIndicator: 1返回總體的壞樣本率,0不返回 18 :return: 每箱壞樣本率字典,每箱壞樣本df,以及總體的壞樣本率(當grantRateIndicator==1時) 19 ''' 20 data_df = df[[col, target]].copy() 21 total = data_df.shape[0] 22 group_df = data_df.groupby([col]).aggregate({target: ['sum', 'count']}).reset_index() 23 group_df.columns = [col, 'bad_num', 'all_num'] 24 group_df['total_num'] = total 25 group_df['bad_rate'] = group_df['bad_num'] / group_df['all_num'] 26 bad_rate_dict = dict(zip(group_df[col], group_df['bad_rate'])) # 每箱壞樣本率字典 27 overall_bad_rate = group_df['bad_num'].sum() / total # 總體壞樣本率 28 if grantRateIndicator == 0: 29 return (bad_rate_dict, group_df) 30 return (bad_rate_dict, group_df, overall_bad_rate) 31 32 33 # 合並全為0或全為1的箱 34 def MergeBin(df, col, target, direction='bad'): 35 ''' 36 :param df: 包含檢驗0%或者100%壞樣本率 37 :param col: 分箱后的變量或者類別型變量。檢驗其中是否有一組或者多組沒有壞樣本或者沒有好樣本。如果是,則需要進行合並 38 :param target: 目標變量,0、1表示好、壞 39 :return: 返回每個變量不同值的分箱方案,是個字典 40 ''' 41 regroup = BinBadRate(df, col, target)[1] 42 if direction == 'bad': 43 # 如果是合並0壞樣本率的組,則跟最小的非0壞樣本率的組進行合並 44 regroup = regroup.sort_values(by='bad_rate') 45 else: 46 # 如果是合並0好樣本樣本率的組,則跟最小的非0好樣本率的組進行合並 47 regroup = regroup.sort_values(by='bad_rate', ascending=False) 48 49 regroup.index = range(regroup.shape[0]) 50 col_regroup = [[i] for i in regroup[col]] 51 del_index = [] 52 53 for i in range(regroup.shape[0] - 1): 54 col_regroup[i + 1] = col_regroup[i] + col_regroup[i + 1] 55 del_index.append(i) 56 if direction == 'bad': 57 if regroup['bad_rate'][i + 1] > 0: 58 break 59 else: 60 if regroup['bad_rate'][i + 1] < 1: 61 break 62 col_regroup2 = [col_regroup[i] for i in range(len(col_regroup)) if i not in del_index] 63 newGroup = {} 64 for i in range(len(col_regroup2)): 65 for g2 in col_regroup2[i]: 66 newGroup[g2] = 'Bin ' + str(i) 67 return newGroup 68 69 70 # badrate 編碼 71 def BadRateEncoding(df, col, target): 72 ''' 73 :param df: dataframe containing feature and target 74 :param col: the feature that needs to be encoded with bad rate, usually categorical type 75 :param target: good/bad indicator 76 :return: 按照違約率 map 各字段的不同值,各自段不同值的違約率字典 77 ''' 78 bad_rate_dict, group_df = BinBadRate(df, col, target, grantRateIndicator=0) 79 bad_rate_encoded_df = df[col].map(lambda x: bad_rate_dict[x]) 80 return {'encoded_df': bad_rate_encoded_df, 'bad_rate_dict': bad_rate_dict} 81 82 83 # 對數據使用等頻粗分數據,得到切分數據 84 def SplitData(df, col, num_of_split, special_attribute=[]): 85 ''' 86 :param df: 按照col排序后的數據集 87 :param col: 待分箱的變量 88 :param numOfSplit: 切分的組別數 89 :param special_attribute: 在切分數據集的時候,某些特殊值需要排除在外 90 :return: col中“切分點“的值列表 91 ''' 92 df2 = df.copy() 93 if special_attribute != []: 94 df2 = df.loc[~df[col].isin(special_attribute)] 95 N = df2.shape[0] 96 n = int(N / num_of_split) 97 split_point_index = [i * n for i in range(1, num_of_split)] 98 raw_values = sorted(list(df2[col])) 99 split_point = [raw_values[i] for i in split_point_index] 100 split_point = sorted(list(set(split_point))) 101 return split_point # col中“切分點“右邊第一個值 102 103 104 # 按切分點分箱 105 def AssignBin(x, split_points, special_attribute=[]): 106 ''' 107 
:param x: 某個變量的某個取值 108 :param split_points: 上述變量的分箱結果,用切分點表示 109 :param special_attribute: 不參與分箱的特殊取值 110 :return: 分箱后的對應的第幾個箱,從0開始 111 for example, if split_points = [10,20,30], if x = 7, return Bin 0. If x = 35, return Bin 3 112 ''' 113 bin_num = len(split_points) + 1 114 if x in special_attribute: 115 i = special_attribute.index(x) + 1 116 return 'Bin_{}'.format(0 - i) 117 elif x <= split_points[0]: 118 return 'Bin_0' 119 elif x > split_points[-1]: 120 return 'Bin_{}'.format(bin_num - 1) 121 else: 122 for i in range(0, bin_num - 1): 123 if split_points[i] < x <= split_points[i + 1]: 124 return 'Bin_{}'.format(i + 1) 125 126 127 # 按切分點,將值進行映射,每箱里的值映射的是每箱切分點右側點的值 128 def AssignGroup(x, split_points): 129 ''' 130 :param x: 某個變量的某個取值 131 :param split_points: 切分點列表 132 :return: x在分箱結果下的映射 133 ''' 134 N = len(split_points) 135 if x <= min(split_points): 136 return min(split_points) 137 elif x > max(split_points): 138 return 10e10 139 else: 140 for i in range(N - 1): 141 if split_points[i] < x <= split_points[i + 1]: 142 return split_points[i + 1] 143 144 145 # 計算卡方值 146 def Chi2(df, all_col, bad_col): 147 ''' 148 :param df: 包含全部樣本總計與壞樣本總計的數據框 149 :param all_col: 全部樣本的個數 150 :param bad_col: 壞樣本的個數 151 :return: 卡方值 152 ''' 153 df2 = df.copy() 154 bad_rate = sum(df2[bad_col]) * 1.0 / sum(df2[all_col]) # 總體壞樣本率 155 # 當全部樣本只有好或者壞樣本時,卡方值為0 156 if bad_rate in [0, 1]: 157 return 0 158 df2['good'] = df2.apply(lambda x: x[all_col] - x[bad_col], axis=1) 159 good_rate = sum(df2['good']) * 1.0 / sum(df2[all_col]) # 總體好樣本率 160 # 期望壞(好)樣本個數=全部樣本個數*平均壞(好)樣本占比 161 162 df2['bad_expected'] = df[all_col].apply(lambda x: x * bad_rate) 163 df2['good_expected'] = df[all_col].apply(lambda x: x * good_rate) 164 bad_combined = zip(df2['bad_expected'], df2[bad_col]) 165 good_combined = zip(df2['good_expected'], df2['good']) 166 bad_chi = [(i[0] - i[1]) ** 2 / i[0] for i in bad_combined] 167 good_chi = [(i[0] - i[1]) ** 2 / i[0] for i in good_combined] 168 chi2 = sum(bad_chi) + sum(good_chi) 169 return chi2 170 171 172 # 卡方分箱 173 def ChiMerge(df, col, target, max_bin_num=5, special_attribute=[], min_bin_threshold=0): 174 ''' 175 :param df: 包含目標變量與分箱屬性的數據框 176 :param col: 需要分箱的屬性 177 :param target: 目標變量,取值0或1 178 :param max_bin_num: 最大分箱數。如果原始屬性的取值個數低於該參數,不執行這段函數 179 :param special_attribute: 不參與分箱的屬性取值 180 :param min_bin_threshold:分箱后,箱內記錄數最少的數量控制要求 181 :return: 返回分箱切分點列表 182 ''' 183 value_list = sorted(list(set(df[col]))) # 變量的值從小達到排列列表 184 N_distinct = len(value_list) 185 if N_distinct <= max_bin_num: # 如果原始屬性的取值個數低於max_interval,不執行這段函數 186 print("{} 變量值的種類個數小於最大分箱數".format(col)) 187 return value_list[:-1] 188 else: 189 if len(special_attribute) >= 1: 190 df1 = df.loc[df[col].isin(special_attribute)] 191 df2 = df.loc[~df[col].isin(special_attribute)] # 去掉special_attribute后的df 192 else: 193 df2 = df.copy() 194 N_distinct = len(list(set(df2[col]))) 195 196 # 步驟一: 粗分箱,通過col對數據集進行分組,求出每組的總樣本數與壞樣本數 197 if N_distinct > 100: 198 split_x = SplitData(df2, col, 100, special_attribute=special_attribute) 199 df2['temp'] = df2.loc[:, col].map(lambda x: AssignGroup(x, split_x)) 200 # Assgingroup函數:每一行的數值和切分點做對比,返回原值在切分后的映射, 201 # 經過map以后,生成該特征的值對象的“分箱”后的值 202 else: 203 df2['temp'] = df2[col] 204 # 總體bad rate將被用來計算expected bad count 205 (binBadRate, regroup, overallRate) = BinBadRate(df2, 'temp', target, grantRateIndicator=1) 206 207 # 首先,每個單獨的屬性值將被分為單獨的一組 208 # 對屬性值進行排序,然后兩兩組別進行合並 209 value_list = sorted(list(set(df2['temp']))) 210 current_bin = [[i] for i in value_list] # 把每個箱的值打包成[[],[]]的形式 211 212 # 步驟二:建立循環,不斷合並最優的相鄰兩個組別,直到: 
213 # 1,最終分裂出來的分箱數<=預設的最大分箱數 214 # 2,每箱的占比不低於預設值(可選) 215 # 3,每箱同時包含好壞樣本 216 # 如果有特殊屬性,那么最終分裂出來的分箱數=預設的最大分箱數-特殊屬性的個數 217 218 plan_bin_num = max_bin_num - len(special_attribute) 219 while (len(current_bin) > plan_bin_num): # 終止條件: 當前分箱數=預設的分箱數 220 # 每次循環時, 計算合並相鄰組別后的卡方值。具有最小卡方值的合並方案,是最優方案 221 chi2_list = [] 222 for k in range(len(current_bin) - 1): 223 temp_group = current_bin[k] + current_bin[k + 1] 224 df2b = regroup.loc[regroup['temp'].isin(temp_group)] 225 chi2 = Chi2(df2b, 'all_num', 'bad_num') 226 chi2_list.append(chi2) 227 # 把 current_bin 的值改成類似的值改成類似從[[1][2],[3]]到[[1,2],[3]] 228 best_combined = chi2_list.index(min(chi2_list)) # 找到卡方值最小的進行合並 229 current_bin[best_combined] = current_bin[best_combined] + current_bin[best_combined + 1] 230 current_bin.remove(current_bin[best_combined + 1]) 231 232 current_bin = [sorted(i) for i in current_bin] 233 split_points = [max(i) for i in current_bin[:-1]] # 卡方分箱后的切分點 234 235 # 檢查是否有箱沒有好或者壞樣本。如果有,需要跟相鄰的箱進行合並,直到每箱同時包含好壞樣本 236 # print("程序檢查 split_points ",split_points) 237 temp_bins = df2['temp'].apply( 238 lambda x: AssignBin(x, split_points, special_attribute=special_attribute)) # 每個原始箱對應卡方分箱后的箱號 239 df2['temp_bin'] = temp_bins 240 (bin_bad_rate, regroup) = BinBadRate(df2, 'temp_bin', target) 241 [min_bad_rate, max_bad_rate] = [min(bin_bad_rate.values()), max(bin_bad_rate.values())] 242 243 while min_bad_rate == 0 or max_bad_rate == 1: 244 # 找出全部為好/壞樣本的箱 245 bin_0_1 = regroup.loc[regroup['bad_rate'].isin([0, 1]), 'temp_bin'].tolist() 246 bin = bin_0_1[0] 247 248 # 如果是最后一箱,則需要和上一個箱進行合並,也就意味着分裂點split_points中的最后一個需要移除 249 if bin == max(regroup.temp_bin): 250 split_points = split_points[:-1] 251 # 如果是第一箱,則需要和下一個箱進行合並,也就意味着分裂點split_points中的第一個需要移除 252 elif bin == min(regroup.temp_bin): 253 split_points = split_points[1:] 254 # 如果是中間的某一箱,則需要和前后中的一個箱進行合並,依據是較小的卡方值 255 else: 256 current_index = list(regroup.temp_bin).index(bin) 257 # 和前一箱進行合並,並且計算卡方值 258 prev_index = list(regroup.temp_bin)[current_index - 1] 259 df3 = df2.loc[df2['temp_bin'].isin([prev_index, bin])] 260 (bin_bad_rate, df2b) = BinBadRate(df3, 'temp_bin', target) 261 chi2_1 = Chi2(df2b, 'all_num', 'bad_num') 262 263 later_index = list(regroup.temp_bin)[current_index + 1] 264 df3b = df2.loc[df2['temp_bin'].isin([later_index, bin])] 265 (binBadRate, df2b) = BinBadRate(df3b, 'temp_bin', target) 266 chi2_2 = Chi2(df2b, 'all_num', 'bad_num') 267 # 和后一箱進行合並,並且計算卡方值 268 if chi2_1 < chi2_2: 269 split_points.remove(split_points[current_index - 1]) 270 else: 271 split_points.remove(split_points[current_index]) 272 273 # 完成合並之后,需要再次計算新的分箱准則下,每箱是否同時包含好壞樣本 274 temp_bins = df2['temp'].apply(lambda x: AssignBin(x, split_points, special_attribute=special_attribute)) 275 df2['temp_bin'] = temp_bins 276 (bin_bad_rate, regroup) = BinBadRate(df2, 'temp_bin', target) 277 [min_bad_rate, max_bad_rate] = [min(bin_bad_rate.values()), max(bin_bad_rate.values())] 278 279 # 需要檢查分箱后的箱數含量最小占比 280 if min_bin_threshold > 0: 281 temp_bins = df2['temp'].apply(lambda x: AssignBin(x, split_points, special_attribute=special_attribute)) 282 df2['temp_bin'] = temp_bins 283 bin_counts = temp_bins.bin_counts().to_frame() 284 bin_counts['pcnt'] = bin_counts['temp'].apply(lambda x: x * 1.0 / sum(bin_counts['temp'])) 285 bin_counts = bin_counts.sort_index() 286 min_pcnt = min(bin_counts['pcnt']) 287 while min_pcnt < min_bin_threshold and len(split_points) > 2: 288 # 找出占比最小的箱 289 index_of_min_pcnt = bin_counts[bin_counts['pcnt'] == min_pcnt].index.tolist()[0] 290 # 如果占比最小的箱是最后一箱,則需要和上一個箱進行合並,也就意味着分裂點split_points中的最后一個需要移除 291 if 
index_of_min_pcnt == max(bin_counts.index): 292 split_points = split_points[:-1] 293 # 如果占比最小的箱是第一箱,則需要和下一個箱進行合並,也就意味着分裂點split_points中的第一個需要移除 294 elif index_of_min_pcnt == min(bin_counts.index): 295 split_points = split_points[1:] 296 # 如果占比最小的箱是中間的某一箱,則需要和前后中的一個箱進行合並,依據是較小的卡方值 297 else: 298 # 和前一箱進行合並,並且計算卡方值 299 current_index = list(bin_counts.index).index(index_of_min_pcnt) 300 prev_index = list(bin_counts.index)[current_index - 1] 301 df3 = df2.loc[df2['temp_bin'].isin([prev_index, index_of_min_pcnt])] 302 (bin_bad_rate, df2b) = BinBadRate(df3, 'temp_bin', target) 303 chi2_1 = Chi2(df2b, 'all_num', 'bad_num') 304 305 # 和后一箱進行合並,並且計算卡方值 306 later_index = list(bin_counts.index)[current_index + 1] 307 df3b = df2.loc[df2['temp_bin'].isin([later_index, index_of_min_pcnt])] 308 (bin_bad_rate, df2b) = BinBadRate(df3b, 'temp_bin', target) 309 chi2_2 = Chi2(df2b, 'all_num', 'bad_num') 310 311 if chi2_1 < chi2_2: 312 split_points.remove(split_points[current_index - 1]) 313 else: 314 split_points.remove(split_points[current_index]) 315 316 temp_bins = df2['temp'].apply(lambda x: AssignBin(x, split_points, special_attribute=special_attribute)) 317 df2['temp_bin'] = temp_bins 318 bin_counts = temp_bins.bin_counts().to_frame() 319 bin_counts['pcnt'] = bin_counts['temp'].apply(lambda x: x * 1.0 / sum(bin_counts['temp'])) 320 bin_counts = bin_counts.sort_index() 321 min_pcnt = min(bin_counts['pcnt']) 322 323 split_points = special_attribute + split_points 324 325 return split_points 326 327 328 # 判斷某變量的壞樣本率是否單調 329 def BadRateMonotone(df, col, target, special_attribute=[]): 330 ''' 331 :param df: 包含檢驗壞樣本率的變量,和目標變量 332 :param col: 需要檢驗壞樣本率的變量 333 :param target: 目標變量,0、1表示好、壞 334 :param special_attribute: 不參與檢驗的特殊值 335 :return: 壞樣本率單調與否 336 ''' 337 338 special_bin = ['Bin_{}'.format(i) for i in special_attribute] # 填充未知變量 Unknown 所在分箱 339 df2 = df.loc[~df[col].isin(special_bin)] 340 if len(set(df2[col])) <= 2: 341 return True 342 bin_bad_rate, regroup = BinBadRate(df2, col, target) 343 bad_rate = [bin_bad_rate[i] for i in sorted(bin_bad_rate)] 344 # 嚴格單調遞增 345 if all(bad_rate[i] < bad_rate[i + 1] for i in range(len(bad_rate) - 1)): 346 return True 347 # 嚴格單調遞減 348 elif all(bad_rate[i] > bad_rate[i + 1] for i in range(len(bad_rate) - 1)): 349 return True 350 else: 351 return False 352 353 354 # 計算變量的 WOE 值 和 IV 值 355 def CalcWOE(df, col, target): 356 ''' 357 :param df: 包含需要計算WOE的變量和目標變量 358 :param col: 需要計算WOE、IV的變量,必須是分箱后的變量,或者不需要分箱的類別型變量 359 :param target: 目標變量,0、1表示好、壞 360 :return: 返回WOE字典和IV值 361 ''' 362 total = df.groupby([col])[target].count() 363 total = pd.DataFrame({'total': total}) 364 bad = df.groupby([col])[target].sum() 365 bad = pd.DataFrame({'bad': bad}) 366 regroup = total.merge(bad, left_index=True, right_index=True, how='left') 367 regroup.reset_index(level=0, inplace=True) 368 N = sum(regroup['total']) 369 B = sum(regroup['bad']) 370 regroup['good'] = regroup['total'] - regroup['bad'] 371 G = N - B 372 regroup['bad_pcnt'] = regroup['bad'].map(lambda x: x * 1.0 / B) 373 regroup['good_pcnt'] = regroup['good'].map(lambda x: x * 1.0 / G) 374 regroup['WOE'] = regroup.apply(lambda x: np.log(x.good_pcnt * 1.0 / x.bad_pcnt), axis=1) 375 WOE_dict = regroup[[col, 'WOE']].set_index(col).to_dict(orient='index') 376 for k, v in WOE_dict.items(): 377 WOE_dict[k] = v['WOE'] 378 IV = regroup.apply(lambda x: (x.good_pcnt - x.bad_pcnt) * np.log(x.good_pcnt * 1.0 / x.bad_pcnt), axis=1) 379 IV = sum(IV) 380 return {"WOE": WOE_dict, 'IV': IV} 381 382 383 # 計算 KS 值 384 def CalcKsAuc(data, var_col, y_col, 
draw=True): 385 ''' 386 :param df: 包含目標變量與預測值的數據集 387 :param score: 得分或者概率 388 :param target: 目標變量 389 :return: KS值 和 AUC值 390 ''' 391 temp_df = pd.crosstab(data[var_col], data[y_col]) 392 KS_df = temp_df.cumsum(axis=0) / temp_df.sum() 393 KS_df['KS'] = abs(KS_df[0] - KS_df[1]) 394 KS_df.columns = ['TPR', 'FPR', 'KS'] 395 KS = KS_df['KS'].max() 396 KS_df = KS_df.reset_index() 397 AUC = roc_auc_score(data[y_col], data[var_col]) 398 399 if draw: 400 fig, ax = plt.subplots(1, 2,figsize=(20,10)) 401 KS_down = float(KS_df.loc[KS_df['KS'] == KS, 'FPR']) 402 KS_up = float(KS_df.loc[KS_df['KS'] == KS, 'TPR']) 403 KS_x = float(KS_df.loc[KS_df['KS'] == KS, 'predict']) 404 ax[0].plot(KS_df['predict'], KS_df['FPR'], 'royalblue') 405 ax[0].plot(KS_df['predict'], KS_df['TPR'], 'darkorange') 406 ax[0].plot(KS_df['predict'], KS_df['KS'], 'mediumseagreen') 407 ax[0].plot([KS_x, KS_x], [KS_down, KS_up], '--r') 408 ax[0].text(0.2, 0.1, 'max(KS) = {}'.format(KS)) 409 ax[0].set_title("KS 曲線", fontsize=14) 410 ax[0].set_xlabel("", fontsize=12) 411 ax[0].set_ylabel("", fontsize=12) 412 ax[0].legend() 413 414 ax[1].plot(KS_df['FPR'], KS_df['TPR'], 'royalblue') 415 ax[1].plot([0, 1], [0, 1], '--r') 416 ax[1].set_title("ROC 曲線", fontsize=14) 417 ax[1].fill_between(KS_df['FPR'], 0, KS_df['TPR'], facecolor='mediumseagreen', alpha=0.2) 418 ax[1].text(0.6, 0.1, 'AUC = {}'.format(AUC)) 419 ax[1].set_xlabel("False positive rate", fontsize=12) 420 ax[1].set_ylabel("True positive rate", fontsize=12) 421 plt.savefig(folder_data + "KS&AUC.png", dpi=1000, bbox_inches='tight') # 解決圖片不清晰,不完整的問題 422 # plt.show() 423 return KS, AUC
