Current research on personal credit risk assessment, both in China and abroad, trains a model on borrowers' historical behaviour (multi-dimensional features from historical data together with whether the loan defaulted), and then uses the trained model to judge whether a new applicant has both the ability and the willingness to repay, i.e. to predict whether the applicant will default. There are two main families of methods: traditional credit risk assessment, represented by the Logistic regression model, and newer machine-learning approaches such as support vector machines, neural networks, and decision trees.
This article applies the "credit scorecard based on Logistic regression" approach to Lending Club's loan business: a scorecard model is built to predict whether a borrower will default, with the goal of minimising risk. The idea is to discretise the model variables, encode each bin with its WOE value, and then fit a logistic regression, i.e. a generalised linear model for a binary target. By convention, the target variable is coded 1 for a defaulted borrower and 0 for a normal borrower.

- Raw data source: https://www.kaggle.com/wendykan/lending-club-loan-data/kernels
- Original time span: 2007-2018
- Original data dimensions: 2.26 million rows × 145 columns
- Definition of default in this project: 16 or more days overdue (d_loan = ["Late (16-30 days)", "Late (31-120 days)", "Charged Off", "Default", "Does not meet the credit policy. Status:Charged Off"]); see the label-construction sketch after this list
- Modelling window: because the full data set is large and spans a long period, only the 2016 and 2017 data are used for the subsequent modelling (877,986 rows × 145 columns).
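As a reference for the default definition above, here is a compact equivalent of the label construction in the appendix code, which builds the same `y_label` column through an intermediate `loan_condition` field:

```python
# Loans whose status falls in the default list are labelled 1, all others 0
d_loan = ["Late (16-30 days)", "Late (31-120 days)", "Charged Off", "Default",
          "Does not meet the credit policy. Status:Charged Off"]
df['y_label'] = df['loan_status'].isin(d_loan).astype(int)  # 1 = default, 0 = normal loan
```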
Preliminary analysis of Lending Club's business:
- Initial analysis of Lending Club's 2007-2018 lending business: https://www.cnblogs.com/Ray-0808/p/12618551.html
- Good/bad loan analysis of Lending Club's 2007-2018 lending business: https://www.cnblogs.com/Ray-0808/p/12668594.html
1. Data Cleaning
1.1 Dropping variables
- Drop variables with a missing rate above 25% (44 variables)
- Drop variables that take only a single value or are highly homogeneous (17 variables)
- Drop some variables that are useless for prediction, such as post-loan fields, as shown in the figure below

1.2 Dropping records
- Drop records missing more than 25% of their feature information (1 record)
- Drop records containing abnormal values
1.3 Filling missing values
- Treat missing values as their own category: fill missing string-type values with 'UnKnown'
- Fill missing numeric values with the mode
Data size after cleaning: 877,985 rows × 71 columns. A condensed sketch of these cleaning steps follows.
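The sketch below uses the thresholds just listed; it assumes the labelled 2016-2017 pickle produced in the appendix, and omits the hand-picked list of post-loan variables that the appendix drops in addition:

```python
import pandas as pd

df = pd.read_pickle(folder_data + "loan_1617_preprocessing.pkl")  # labelled 2016-2017 loans

# 1.1 Drop columns with more than 25% missing values
na_rate = df.isna().mean()
df = df.drop(columns=na_rate[na_rate > 0.25].index)

# 1.1 Drop near-constant columns (a single value covers more than 90% of records)
near_constant = [c for c in df.columns if (df[c] == df[c].mode()[0]).mean() > 0.9]
df = df.drop(columns=near_constant)

# 1.2 Drop records missing more than 25% of the remaining features
df = df[df.isna().mean(axis=1) <= 0.25]

# 1.3 Fill missing values: 'UnKnown' for string columns, the mode for numeric columns
obj_cols = df.select_dtypes(include='object').columns
num_cols = df.select_dtypes(exclude='object').columns
df[obj_cols] = df[obj_cols].fillna('UnKnown')
df[num_cols] = df[num_cols].fillna(df[num_cols].mode().iloc[0])
```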
2. Data Binning
2.1 Binning methods
Binning methods can be divided into supervised and unsupervised binning. Supervised binning includes split-based and merge-based binning; unsupervised binning includes equal-frequency, equal-width, and clustering-based binning. A quick illustration of the unsupervised methods follows.
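The unsupervised variants are one-liners in pandas; the column name below is only an example:

```python
import pandas as pd

# Equal-frequency binning: each bin holds roughly the same number of records
df['int_rate_eqfreq'] = pd.qcut(df['int_rate'], q=10, duplicates='drop')
# Equal-width binning: each bin covers an interval of the same width
df['int_rate_eqwidth'] = pd.cut(df['int_rate'], bins=10)
```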
This project uses chi-square binning (ChiMerge). Its basic idea is that if two adjacent intervals have similar class distributions, they are merged; otherwise they are kept separate.
Procedure (a simplified sketch of the merge loop is given after this list):
- Input: the maximum number of bins n
- Initialisation
  - Sort continuous values in ascending order; for discrete values, first convert each value to its bad-customer rate and then sort in ascending order;
  - To reduce computation, variables with more than a threshold number of distinct values (100 is recommended) are coarsely pre-binned with equal-frequency binning.
  - If there are missing values, they form a separate bin.
- Merging intervals
  - Compute the chi-square value for every pair of adjacent intervals;
  - Merge the pair of intervals with the smallest chi-square value;
  - Repeat the two steps above until the number of bins is no greater than n.
- Post-processing
  - Merge any bin whose bad-customer rate is 0 or 1 (a bin must not contain only good or only bad customers).
  - Remove bins in which a single bin holds more than 95% of the samples after binning.
- Output: the binned data and the bin boundaries.
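Below is a simplified sketch of the core merge loop described above; it leaves out the equal-frequency pre-binning, special-value handling, and post-processing, all of which are implemented in the `ChiMerge` function in the appendix:

```python
import numpy as np
import pandas as pd

def chi2_stat(counts):
    """Chi-square statistic for a table of [bad, good] counts, one row per interval."""
    counts = np.asarray(counts, dtype=float)
    total = counts.sum()
    row_tot = counts.sum(axis=1, keepdims=True)        # per-interval totals
    col_tot = counts.sum(axis=0, keepdims=True)        # overall bad / good totals
    expected = row_tot * col_tot / total
    expected[expected == 0] = 1e-10                    # avoid division by zero
    return ((counts - expected) ** 2 / expected).sum()

def chi_merge(df, col, target, max_bins=5):
    """Merge adjacent intervals of `col` with the smallest chi-square value
    until at most `max_bins` intervals remain; return the right-edge cut points."""
    grouped = (df.groupby(col)[target]
                 .agg(bad='sum', total='count')
                 .reset_index())
    grouped['good'] = grouped['total'] - grouped['bad']
    # Start with one interval per distinct value: [values, bad count, good count]
    intervals = [[[v], b, g] for v, b, g in
                 zip(grouped[col], grouped['bad'], grouped['good'])]
    while len(intervals) > max_bins:
        chis = [chi2_stat([intervals[i][1:], intervals[i + 1][1:]])
                for i in range(len(intervals) - 1)]
        k = int(np.argmin(chis))                       # most similar adjacent pair
        merged = [intervals[k][0] + intervals[k + 1][0],
                  intervals[k][1] + intervals[k + 1][1],
                  intervals[k][2] + intervals[k + 1][2]]
        intervals[k:k + 2] = [merged]
    return [max(iv[0]) for iv in intervals[:-1]]       # cut points (right edges)
```

For example, `chi_merge(train_set, 'int_rate', 'y_label', max_bins=5)` returns at most four cut points; values are then assigned to bins by comparing against these cut points, with anything above the last cut point falling into the final bin.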
2.2 Data splitting in this project
Because chi-square binning uses the actual default label ("y_label"), the data set must be split before binning to prevent data leakage. Since the data are imbalanced (defaulted samples are far fewer than normal samples), stratified sampling is used so that the class distribution stays consistent across the splits.
```python
# Stratified sampling to keep the class distribution consistent across splits
from sklearn.model_selection import StratifiedShuffleSplit

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in split.split(data, data.loc[:, 'y_label']):
    train_set = data.iloc[train_index, :]
    test_set = data.iloc[test_index, :]
```
2.3 Binning workflow in this project
- Categorical variables
  - With few distinct values:
    - if every category contains both good and bad samples, no binning is needed;
    - if some category contains only good or only bad samples, it must be merged with another category.
  - With many distinct values: first apply bad-rate encoding (replace each category value with its default rate, illustrated after this subsection), then bin it like a continuous variable.
- Continuous variables
  - Check for special values that need their own bin.
  - Check the number of initial distinct values; if it is too large, coarse-bin first (for example into 100 bins).
  - Apply chi-square binning.
  - Check whether any bins hold identical values; if so, merge them.
  - Check whether the bad rate is monotone across bins after binning; if not, merge bins until it is.
This step involves a fair amount of code, which is collected in the appendix.
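As a quick illustration of the bad-rate encoding step for high-cardinality categorical variables (column names are taken from the appendix code, which additionally treats 'UnKnown' as a special value):

```python
# Replace each category with the default rate observed on the training set
bad_rate = train_data.groupby('purpose')['y_label'].mean()
train_data['purpose_BR_encoding'] = train_data['purpose'].map(bad_rate)
test_data['purpose_BR_encoding'] = test_data['purpose'].map(bad_rate)  # reuse the train-derived mapping
```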
3. WOE Encoding
After binning, each bin of every variable must be encoded before the variables can be fed into the logistic regression model.
WOE stands for "Weight of Evidence". The WOE value of every bin of a variable is computed, and the variable is then encoded by mapping each bin to its WOE value (a short calculation sketch follows the list below). Unlike one-hot encoding, this does not create many new variables, so it avoids sparse matrices and the curse of dimensionality.
WOE encoding effectively maps the binned features from a space that is not linearly separable into one that is approximately linearly separable:
- it can improve the model's predictive performance;
- it puts all independent variables on the same scale;
- the WOE value reflects the contribution of each value of the independent variable;
- it makes it easy to assign a score to each bin of a variable;
- after conversion to a continuous variable, correlations between variables are easier to analyse;
- compared with one-hot encoding, it keeps each variable whole while avoiding sparse matrices and the curse of dimensionality.
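Consistent with the `CalcWOE` helper in the appendix, which defines WOE as ln(good% / bad%), here is a minimal sketch of the per-bin WOE and IV calculation, assuming a binned column and the `y_label` target (every bin is assumed to contain both good and bad samples, which the binning step guarantees):

```python
import numpy as np

def calc_woe_iv(df, bin_col, target):
    """Per-bin WOE values and the variable's IV; target = 1 marks a defaulted loan."""
    grouped = df.groupby(bin_col)[target].agg(bad='sum', total='count')
    grouped['good'] = grouped['total'] - grouped['bad']
    grouped['bad_pcnt'] = grouped['bad'] / grouped['bad'].sum()      # share of all bad loans
    grouped['good_pcnt'] = grouped['good'] / grouped['good'].sum()   # share of all good loans
    grouped['WOE'] = np.log(grouped['good_pcnt'] / grouped['bad_pcnt'])
    iv = ((grouped['good_pcnt'] - grouped['bad_pcnt']) * grouped['WOE']).sum()
    return grouped['WOE'].to_dict(), iv
```

The returned dictionary is used exactly like `WOE_dict[var]` in the appendix: each bin label of the variable is mapped to its WOE value to produce the corresponding `_WOE` column.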

For example, the "int_rate" variable:

4. Variable Selection
4.1 Single-variable selection
- Selection by IV (Information Value), which plays a role similar to information gain or the Gini index
- Stepwise selection, adding or removing variables one step at a time
- Selection by feature importance: random forest, GBDT, ...
- Selection by LASSO (L1) regularisation
This article tried two of these methods for comparison (IV-based selection and random-forest importance) and ultimately used the first.

```python
# 4.1 Single-variable selection --> keep variables with IV > 0.02
high_IV = {k: v for k, v in IV_dict.items() if v >= 0.02}
high_IV_dict_sorted = sorted(high_IV.items(), key=lambda x: x[1], reverse=False)
high_IV_values = [i[1] for i in high_IV_dict_sorted]
high_IV_names = [i[0] for i in high_IV_dict_sorted]

feat_select_1 = []
for i in high_IV_names:
    feat_select_1.append(i + '_WOE')
print("After filtering with IV > 0.02, {} variables remain".format(len(feat_select_1)))
print("They are:", feat_select_1)

fig_1, ax = plt.subplots(1, 1)
plt.barh(range(len(high_IV_values)), high_IV_values)
plt.title('feature IV')
plt.yticks(range(len(high_IV_names)), high_IV_names)
plt.savefig(folder_data + "feature_IV.png", dpi=1000, bbox_inches='tight')  # high dpi so the figure is not blurry or clipped
# plt.show()
```
```python
# 4.2 Single-variable selection --> random forest feature importance
X = train_data[WOE_feat]
X = np.matrix(X)
y = train_data['y_label']
y = np.array(y)

RFC = RandomForestClassifier()
RFC_Model = RFC.fit(X, y)

features_rfc = train_data[WOE_feat].columns
RF_feat_importance = {features_rfc[i]: RFC_Model.feature_importances_[i] for i in range(len(features_rfc))}
RF_feat_importance_sorted = sorted(RF_feat_importance.items(), key=lambda x: x[1], reverse=False)
RF_feat_values = [i[1] for i in RF_feat_importance_sorted]
RF_feat_names = [i[0] for i in RF_feat_importance_sorted]

print("After random-forest single-variable selection, {} variables remain".format(len(RF_feat_names)))
print("They are:", RF_feat_names)

fig_2, ax = plt.subplots(1, 1)
plt.barh(range(len(RF_feat_values)), RF_feat_values)
plt.title('Random forest feature importance')
plt.yticks(range(len(RF_feat_names)), RF_feat_names)
plt.show()
```
4.2 Variable correlation
- Pairwise correlation: linear correlation between two variables, measured by the Pearson correlation coefficient
- Multicollinearity: one variable can be expressed as a linear combination of the others; checked with the VIF (variance inflation factor). A condensed sketch of both checks follows.
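The sketch below follows the same logic and names (`high_IV`, `train_data`, the `_WOE` columns) as the appendix code: for any pair with |r| > 0.7 only the higher-IV variable is kept, and the surviving set is then checked with the variance inflation factor:

```python
import numpy as np
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Pairwise filter: among highly correlated pairs, keep the variable with the higher IV
woe_cols = [name + '_WOE' for name in high_IV]
corr = train_data[woe_cols].corr()
ordered = sorted(high_IV.items(), key=lambda kv: kv[1], reverse=True)   # highest IV first
dropped = set()
for i, (name_i, iv_i) in enumerate(ordered):
    if name_i in dropped:
        continue
    for name_j, iv_j in ordered[i + 1:]:
        if name_j in dropped:
            continue
        if abs(corr.loc[name_i + '_WOE', name_j + '_WOE']) > 0.7:
            dropped.add(name_j)                                          # drop the lower-IV one
feat_select_2 = [name + '_WOE' for name, _ in ordered if name not in dropped]

# Multicollinearity check: a common rule of thumb flags VIF values above 10
X = np.asarray(train_data[feat_select_2], dtype=float)
vif = [variance_inflation_factor(X, i) for i in range(X.shape[1])]
print("max VIF:", max(vif))
```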
Heat map of the pairwise linear correlations between variables:

5. Modelling and Evaluation
5.1 Modelling
The logistic regression model can be fitted with libraries such as sklearn or statsmodels.
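For reference, the fitting step as performed in the appendix with statsmodels (`feat_select_3` is the variable list that survives the screening in section 4; the appendix afterwards also drops variables whose p-values are not significant):

```python
import statsmodels.api as sm

X = train_data[feat_select_3].copy()
X['intercept'] = 1.0                       # sm.Logit does not add an intercept automatically
y = train_data['y_label']

LR = sm.Logit(y, X).fit()
print(LR.summary())                        # coefficients and p-values per variable

# Predict default probabilities on the test set with the same design-matrix layout
X_test = test_data[feat_select_3].copy()
X_test['intercept'] = 1.0
test_data['predict'] = LR.predict(X_test)
```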

5.2 Evaluation

| | Predicted 1 | Predicted 0 |
| --- | --- | --- |
| Actual 1 | TP (true positive) | FN (false negative) |
| Actual 0 | FP (false positive) | TN (true negative) |
- True positive rate (TPR): the fraction of actual positives predicted as positive, TPR = TP / (TP + FN)
- False positive rate (FPR): the fraction of actual negatives predicted as positive, FPR = FP / (FP + TN)
- ROC curve: the curve with FPR on the horizontal axis and TPR on the vertical axis
- AUC: the area under the ROC curve; equivalently, the probability that, for a randomly chosen positive sample and a randomly chosen negative sample, the classifier assigns the positive one a higher score of being positive than the negative one
- KS: KS = max(TPR - FPR); a value above 0.2 indicates reasonable discriminatory power (a short computation sketch follows)
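A minimal way to compute both metrics from the test-set predicted probabilities; the appendix's `CalcKsAuc` implements the same idea and additionally draws the KS and ROC curves:

```python
from sklearn.metrics import roc_curve, roc_auc_score

fpr, tpr, _ = roc_curve(test_data['y_label'], test_data['predict'])
ks = (tpr - fpr).max()                     # KS = max(TPR - FPR) over all thresholds
auc = roc_auc_score(test_data['y_label'], test_data['predict'])
print('KS = {:.3f}, AUC = {:.3f}'.format(ks, auc))
```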

6. References
- Logistic regression: https://www.cnblogs.com/biyeymyhjob/archive/2012/07/18/2595410.html
- Feature importance, WOE, IV, bad rate: https://www.cnblogs.com/Allen-rg/p/11507970.html
- Random forest: https://www.cnblogs.com/maybe2030/p/4585705.html#top
- L1 and L2 regularisation: https://blog.csdn.net/jinping_shi/article/details/52433975
- Building a financial scorecard with logistic regression: https://zhuanlan.zhihu.com/p/36539125
- Modelling with Lending Club data: https://zhuanlan.zhihu.com/p/21550547
Appendix:
1 # coding:utf-8 2 3 import pandas as pd 4 import numpy as np 5 import time 6 import matplotlib.pyplot as plt 7 import seaborn as sns 8 from statsmodels.stats.outliers_influence import variance_inflation_factor 9 import pickle 10 from sklearn.model_selection import train_test_split 11 import statsmodels.api as sm 12 from sklearn.ensemble import RandomForestClassifier 13 import CustomFunction as cf 14 from sklearn.model_selection import StratifiedShuffleSplit 15 16 # 顯示設置 17 plt.rcParams['font.sans-serif'] = ['SimHei'] # 用來正常顯示中文標簽 18 plt.rcParams['axes.unicode_minus'] = False # 用來正常顯示負號,#有中文出現的情況,需要u'內容' 19 pd.set_option('display.max_columns', 1000) 20 pd.set_option('display.width', 1000) 21 pd.set_option('display.max_colwidth', 1000) 22 pd.set_option('display.max_row', 1000) 23 24 # 程序開始時間計算 25 start_time = time.process_time() 26 27 # 存放數據文件路徑 28 folder_data = r'E:/3.Python_Project/LENGING_CLUB/data/' 29 30 # ---------------------------------------------------------------------------------------------------------------------- 31 # 0. 讀取數據 , 轉換數據格式 32 # ---------------------------------------------------------------------------------------------------------------------- 33 # loan = pd.read_csv('loan.csv', low_memory=False) 34 # df = loan.copy() 35 # 36 # # 將 ‘issue_d’ str形式的時間日期轉換 37 # dt_series = pd.to_datetime(df['issue_d']) 38 # df['year'] = dt_series.dt.year 39 # df['month'] = dt_series.dt.month 40 # 41 # # 取 2016-2017 年度的數據進行后續計算 42 # print(df['year'].unique()) 43 # df = df[df['year'].isin([2016, 2017])] 44 # print(df['year'].unique()) 45 # 46 # # 根據貸款狀態 ‘loan_status’ 定義違約標簽 47 # d_loan = ["Late (16-30 days)", "Late (31-120 days)", "Charged Off", "Default", 48 # "Does not meet the credit policy. Status:Charged Off"] 49 # df['loan_condition'] = np.nan 50 # lst = [df] 51 # 52 # 53 # def loan_condition(status): 54 # if status in d_loan: 55 # return 'd Loan' 56 # else: 57 # return 'c Loan' 58 # 59 # 60 # df['loan_condition'] = df['loan_status'].apply(loan_condition) 61 # df['y_label'] = np.nan 62 # for col in lst: 63 # col.loc[df['loan_condition'] == 'c Loan', 'y_label'] = 0 # 正常貸款 64 # col.loc[df['loan_condition'] == 'd Loan', 'y_label'] = 1 # 違約貸款 65 # df['y_label'] = df['y_label'].astype(int) 66 # df.to_pickle(folder_data + "loan_1617_preprocessing.pkl") 67 68 # ---------------------------------------------------------------------------------------------------------------------- 69 # 1. 
數據清洗 70 # ---------------------------------------------------------------------------------------------------------------------- 71 # df = pd.read_pickle(folder_data+"loan_1617_preprocessing.pkl") 72 # 73 # # 1.1 刪除變量 74 # df_na = df.isna().sum()/df.shape[0] # 缺失率 75 # # 刪去缺失率大於 25% 變量 76 # del_var = [] 77 # var_big_na = list(df_na[df_na>0.25].index ) # 缺失值超過 25% 的變量 78 # del_var.append(var_big_na) 79 # df.drop(var_big_na,axis=1,inplace=True) 80 # df_na.drop(var_big_na,axis=0,inplace=True) 81 # # 刪去特征只有一個的變量,同一性很大的變量 82 # var_same=[] 83 # for i in df.columns: 84 # mode_value = df[i].mode()[0] 85 # mode_rate = len(df[i][df[i]==mode_value]) / df.shape[0] 86 # if mode_rate > 0.9: 87 # var_same.append(i) 88 # df.drop(i,axis=1,inplace=True) 89 # del_var.append(var_same) 90 # # 刪去自選的一些無用特特征 91 # var_no_use = ['sub_grade', 'emp_title', 'addr_state', 'zip_code', 'loan_condition', 'loan_status', 92 # 'last_pymnt_d', 'last_credit_pull_d', 'last_pymnt_amnt', 'earliest_cr_line', 93 # 'initial_list_status', 'title', 'issue_d', 'mths_since_recent_inq'] 94 # del_var.append(var_no_use) 95 # df.drop(var_no_use,axis=1,inplace=True) 96 # 97 # # 1.2 刪除記錄 98 # df_na_row = df.isna().sum(axis=1)/df.shape[1] 99 # del_row = list(df_na_row[df_na_row>0.25].index ) # 信息缺失超過 25% 的記錄 100 # df.drop(del_row,axis=0,inplace=True) 101 # 102 # # 1.3 缺失值填充 103 # # 查看變量類型是否正確,並填充缺失值 104 # object_columns = df.select_dtypes(include=["object"]).columns 105 # numeric_columns = df.select_dtypes(exclude=["object"]).columns 106 # # print(df[object_columns].head()) 107 # # print(df[numeric_columns].head()) 108 # 109 # # 將字符型缺失數據填充 'UnKnown' 110 # df[object_columns]=df[object_columns].fillna('UnKnown') 111 # 112 # # 將數值型缺失數據用眾數填充 113 # for i in numeric_columns: 114 # mode_value = df[i].mode()[0] 115 # df[i] = df[i].fillna(mode_value) 116 # 117 # print("數據清洗后有{}個變量".format(df.shape[1]-3)) 118 # print("數據清洗后有{}條記錄".format(df.shape[0])) 119 # df.to_pickle(folder_data+"loan_1617_clean.pkl") 120 121 # ---------------------------------------------------------------------------------------------------------------------- 122 # 2. 
數據分箱(ChiMerge) 123 # ---------------------------------------------------------------------------------------------------------------------- 124 # 分箱之后: 125 # (1)不超過5箱 126 # (2)Bad Rate單調 127 # (3)每箱同時包含好壞樣本 128 # (4)特殊值如'UnKnown'單獨成一箱 129 # ---------------------------------------------------------------------------------------------------------------------- 130 loan_clean = pd.read_pickle(folder_data + "loan_1617_clean.pkl") 131 # print(loan_clean.head()) 132 # print(loan_clean.isnull().sum()) 133 loan_clean = loan_clean[loan_clean['dti'] != -1] 134 data = loan_clean.loc[loan_clean['term'] == ' 36 months'].copy() 135 data.drop(['year', 'month', 'term'], axis=1, inplace=True) 136 137 num_features = list(data.select_dtypes(exclude=["object"]).columns) # 數值變量 138 cat_features = list(data.select_dtypes(include=["object"]).columns) # 類型變量 139 # print(num_features) 140 # print(cat_features) 141 # 142 # 划分數據集 143 # train_data, test_data = train_test_split(data, test_size=0.4) 144 # 分層抽樣, 避免不均 145 split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42) 146 for train_index, test_index in split.split(data, data.loc[:, 'y_label']): 147 train_set = data.iloc[train_index, :] 148 test_set = data.iloc[test_index, :] 149 150 # 有幾個字段的值比較少,剔除掉 151 train_set = train_set[~(train_set['home_ownership'] == 'NONE')] 152 train_set = train_set[~(train_set['purpose'].isin(['wedding', 'educational']))] 153 train_data = train_set.copy() 154 test_set = test_set[~(test_set['home_ownership'].isin(['NONE']))] 155 test_set = test_set[~(test_set['purpose'].isin(['wedding', 'educational']))] 156 test_data = test_set.copy() 157 158 # 2.1 處理類型變量 159 cat_more_5_feat = [] # 類型變量中超過 5 種取值的變量 160 cat_less_5_feat = [] # 類型變量中超過 5 種取值的變量 161 # a) 檢查類型變量中哪些變量取值超過5 162 for var in cat_features: 163 value_counts = data[var].nunique() 164 if value_counts > 5: 165 cat_more_5_feat.append(var) 166 else: 167 cat_less_5_feat.append(var) 168 169 # b) 對於類型變量值分類小於5的變量,其每種類別應該包含好壞兩種樣本,如果只有其中一種樣本則需要合並 170 merge_bin_dict = {} # 存放需要合並的變量,以及合並方法 171 for var in cat_less_5_feat: 172 bin_bad_rate = cf.BinBadRate(train_data, var, 'y_label')[0] 173 if min(bin_bad_rate.values()) == 0: 174 print('{} 類型變量需要合並箱'.format(var)) 175 combine_bin = cf.MergeBin(train_data, var, 'y_label') 176 merge_bin_dict[var] = combine_bin 177 pass 178 if max(bin_bad_rate.values()) == 1: 179 print('{} 類型變量需要合並箱'.format(var)) 180 combine_bin = cf.MergeBin(train_data, var, 'y_label', direction='good') 181 merge_bin_dict[var] = combine_bin 182 # 183 # print(merge_bin_dict) 184 # 185 # # c) 對於類型變量值分類大於5的變量,使用 badrate 編碼,放入連續型變量 186 # for var in cat_more_5_feat: 187 # if 'UnKnown' in list(data[var].unique()): 188 # encoding_df = train_data.loc[train_data[var] != 'UnKnown'].copy() 189 # else: 190 # encoding_df = train_data.copy() 191 # bin_bad_rate = cf.BinBadRate(encoding_df, var, 'y_label')[0] 192 # if 'UnKnown' in list(data[var].unique()): 193 # bin_bad_rate['UnKnown'] = -1 194 # new_var = var + '_BR_encoding' 195 # num_features.append(new_var) 196 # train_data[new_var] = train_data[var].map(lambda x: bin_bad_rate[x]) 197 # test_data[new_var] = test_data[var].map(lambda x: bin_bad_rate[x]) 198 # 199 # # 2.2 對連續變量進行卡方分箱(chimerge) 200 # bin_feat = [] 201 # for var in num_features: 202 # print("請稍等,正在處理 {} 字段".format(var)) 203 # if -1 not in train_data[var].unique(): 204 # max_bin_num = 5 # 設置最大分箱數 205 # split_points = cf.ChiMerge(train_data, var, 'y_label', max_bin_num=max_bin_num) 206 # train_data[var + '_bin'] = train_data[var].map(lambda x: cf.AssignBin(x, 
split_points)) 207 # monotone = cf.BadRateMonotone(train_data, var + '_bin', 'y_label') 208 # while not monotone: 209 # max_bin_num -= 1 210 # split_points = cf.ChiMerge(train_data, var, 'y_label', max_bin_num=max_bin_num) 211 # train_data[var + '_bin'] = train_data[var].map(lambda x: cf.AssignBin(x, split_points)) 212 # monotone = cf.BadRateMonotone(train_data, var + '_bin', 'y_label') 213 # # if max_bin_num == 2: 214 # # # 當分箱數為2時,必然單調,退出 215 # # break 216 # new_var = var + '_bin' 217 # bin_feat.append(new_var) 218 # test_data[var + '_bin'] = test_data[var].map(lambda x: cf.AssignBin(x, split_points)) 219 # else: 220 # max_bin_num = 5 # 設置最大分箱數 221 # split_points = cf.ChiMerge(train_data, var, 'y_label', special_attribute=[-1], max_bin_num=max_bin_num) 222 # train_data[var + '_bin'] = train_data[var].map(lambda x: cf.AssignBin(x, split_points, special_attribute=[-1])) 223 # monotone = cf.BadRateMonotone(train_data, var + '_bin', 'y_label', special_attribute=[-1]) 224 # while not monotone: 225 # max_bin_num -= 1 226 # split_points = cf.ChiMerge(train_data, var, 'y_label', special_attribute=[-1], max_bin_num=max_bin_num) 227 # train_data[var + '_bin'] = train_data[var].map(lambda x: cf.AssignBin(x, split_points, 228 # special_attribute=[-1])) 229 # monotone = cf.BadRateMonotone(train_data, var + '_bin', 'y_label', special_attribute=[-1]) 230 # new_var = var + '_bin' 231 # bin_feat.append(new_var) 232 # test_data[var + '_bin'] = test_data[var].map(lambda x: cf.AssignBin(x, split_points, special_attribute=[-1])) 233 # 234 # train_data.to_pickle(folder_data + 'train_data_1617.pkl') 235 # test_data.to_pickle(folder_data + 'test_data_1617.pkl') 236 # 237 # bin_feat.sort() 238 # print(bin_feat) 239 # print(train_data.head()) 240 241 # ---------------------------------------------------------------------------------------------------------------------- 242 # 3. WOE 編碼,計算 IV 243 # ---------------------------------------------------------------------------------------------------------------------- 244 train_data = pd.read_pickle(folder_data + 'train_data_1617.pkl') 245 test_data = pd.read_pickle(folder_data + 'test_data_1617.pkl') 246 247 # print(train_data.head()) 248 # print(test_data.head()) 249 250 num_features.remove('y_label') 251 for var in cat_more_5_feat: 252 new_var = var + '_BR_encoding' 253 num_features.append(new_var) 254 bin_feat = [] # bin_feat 分箱后變量 255 for var in num_features: 256 var_bin = var + '_bin' 257 bin_feat.append(var_bin) 258 bin_feat = bin_feat + cat_less_5_feat # 少於5個不同值的類型變量在本例中不用合並箱,直接使用原分箱即可 259 bin_feat.sort() 260 261 print("經過分箱后,有 {} 個變量".format(len(bin_feat))) 262 print("這些變量是:", bin_feat) 263 264 # 計算每個變量分箱后的 WOE 和 IV 值 265 WOE_dict = {} 266 IV_dict = {} 267 for var in bin_feat: 268 WOE_IV = cf.CalcWOE(train_data, var, 'y_label') 269 WOE_dict[var] = WOE_IV['WOE'] 270 IV_dict[var] = WOE_IV['IV'] 271 272 # WOE 編碼 273 WOE_feat = [] 274 for var in IV_dict.keys(): 275 var_WOE = var + '_WOE' 276 train_data[var_WOE] = train_data[var].map(WOE_dict[var]) 277 test_data[var_WOE] = test_data[var].map(WOE_dict[var]) 278 WOE_feat.append(var_WOE) 279 280 # print(train_data[WOE_feat].head()) 281 282 # ---------------------------------------------------------------------------------------------------------------------- 283 # 4. 
變量選擇 284 # ---------------------------------------------------------------------------------------------------------------------- 285 # 4.1 單變量選擇 --> 按 IV 值挑選變量,IV > 0.02 286 high_IV = {k: v for k, v in IV_dict.items() if v >= 0.02} 287 high_IV_dict_sorted = sorted(high_IV.items(), key=lambda x: x[1], reverse=False) 288 high_IV_values = [i[1] for i in high_IV_dict_sorted] 289 high_IV_names = [i[0] for i in high_IV_dict_sorted] 290 291 feat_select_1 = [] 292 for i in high_IV_names: 293 feat_select_1.append(i + '_WOE') 294 print("經過 IV>0.02 挑選后,有 {} 個變量".format(len(feat_select_1))) 295 print("這些變量是:", feat_select_1) 296 297 fig_1, ax = plt.subplots(1, 1, figsize=(10, 10)) 298 plt.barh(range(len(high_IV_values[0:19])), high_IV_values[0:19], facecolor='royalblue') 299 plt.title('feature IV') 300 plt.yticks(range(len(high_IV_names[0:19])), high_IV_names[0:19]) 301 plt.savefig(folder_data + "feature_IV.png", dpi=1000, bbox_inches='tight') # 解決圖片不清晰,不完整的問題 302 # plt.show() 303 304 # 4.2 單變量選擇 --> 隨機森林法選擇變量 305 # X = train_data[WOE_feat] 306 # X = np.matrix(X) 307 # y = train_data['y_label'] 308 # y = np.array(y) 309 # 310 # RFC = RandomForestClassifier() 311 # RFC_Model = RFC.fit(X, y) 312 # 313 # features_rfc = train_data[WOE_feat].columns 314 # RF_feat_importance = {features_rfc[i]: RFC_Model.feature_importances_[i] for i in range(len(features_rfc))} 315 # RF_feat_importance_sorted = sorted(RF_feat_importance.items(), key=lambda x: x[1], reverse=False) 316 # RF_feat_values = [i[1] for i in RF_feat_importance_sorted] 317 # RF_feat_names = [i[0] for i in RF_feat_importance_sorted] 318 # 319 # print("經過隨機森林法單變量挑選后,有 {} 個變量".format(len(RF_feat_names))) 320 # print("這些變量是:", RF_feat_names) 321 # 322 # fig_2, ax = plt.subplots(1, 1) 323 # plt.barh(range(len(RF_feat_values)), RF_feat_values) 324 # plt.title('隨機森林特征重要性排行') 325 # plt.yticks(range(len(RF_feat_names)), RF_feat_names) 326 # plt.show() 327 328 # 4.3 兩兩變量之間的線性相關性檢測 329 # 計算相關系數矩陣,並畫出熱力圖進行數據可視化 330 train_data_WOE = train_data[feat_select_1] 331 corr = train_data_WOE.corr() 332 # print(corr) 333 334 fig, ax = plt.subplots(figsize=(16, 16)) 335 sns.heatmap(corr, mask=np.zeros_like(corr, dtype=np.bool), cmap=sns.diverging_palette(220, 10, as_cmap=True), 336 square=True, ax=ax) 337 plt.savefig(folder_data + "相關性熱力圖.png", dpi=1000, bbox_inches='tight') # 解決圖片不清晰,不完整的問題 338 # plt.show() 339 340 # 1,將候選變量按照IV進行降序排列 341 # 2,計算第i和第i+1的變量的線性相關系數 342 # 3,對於系數超過閾值的兩個變量,剔除IV較低的一個 343 deleted_index = [] 344 cnt_vars = len(high_IV_dict_sorted) 345 for i in range(cnt_vars): 346 if i in deleted_index: 347 continue 348 x1 = high_IV_dict_sorted[i][0] + "_WOE" 349 for j in range(cnt_vars): 350 if i == j or j in deleted_index: 351 continue 352 y1 = high_IV_dict_sorted[j][0] + "_WOE" 353 roh = corr.loc[x1, y1] 354 if abs(roh) > 0.7: 355 x1_IV = high_IV_dict_sorted[i][1] 356 y1_IV = high_IV_dict_sorted[j][1] 357 if x1_IV > y1_IV: 358 deleted_index.append(j) 359 else: 360 deleted_index.append(i) 361 362 feat_select_2 = [high_IV_dict_sorted[i][0] + "_WOE" for i in range(cnt_vars) if i not in deleted_index] 363 364 print("經過兩兩共線性挑選后,有 {} 個變量".format(len(feat_select_2))) 365 print("這些變量是:", feat_select_2) 366 367 # 4.4 多重共線性分析 368 x = np.matrix(train_data[feat_select_2]) 369 VIF_list = [variance_inflation_factor(x, i) for i in range(x.shape[1])] 370 max_VIF = max(VIF_list) 371 print(max_VIF) 372 # 最大的VIF是 2.862878891263925,因此這一步認為沒有多重共線性 373 feat_select_3 = feat_select_2 374 375 print("經過多重共線性檢查后,有 {} 個變量".format(len(feat_select_3))) 376 print("這些變量是:", feat_select_3) 
377 378 # ---------------------------------------------------------------------------------------------------------------------- 379 # 5. 邏輯回歸模型 380 # ---------------------------------------------------------------------------------------------------------------------- 381 x = train_data[feat_select_3].copy() 382 y = train_data['y_label'] 383 x['intercept'] = 1.0 384 385 LR = sm.Logit(y, x).fit() 386 summary = LR.summary() 387 print(summary) 388 p_values = LR.pvalues 389 p_values_dict = p_values.to_dict() 390 print(p_values_dict) 391 392 # 有些變量對因變量的影響不顯著 (p_value > 0.5 or 0.1 ......) ,去除他們 393 feat_select_4 = feat_select_3 394 large_p_values_dict = {k: v for k, v in p_values_dict.items() if v >= 0.1} 395 large_p_values_dict = sorted(large_p_values_dict.items(), key=lambda d: d[1], reverse=True) 396 while len(large_p_values_dict) > 0 and len(feat_select_4) > 0: 397 varMaxP = large_p_values_dict[0][0] 398 if varMaxP == 'intercept': 399 break 400 feat_select_4.remove(varMaxP) 401 y = train_data['y_label'] 402 x = train_data[feat_select_4] 403 x['intercept'] = [1] * x.shape[0] 404 405 LR = sm.Logit(y, x).fit() 406 p_values_dict = LR.pvalues 407 p_values_dict = p_values_dict.to_dict() 408 large_p_values_dict = {k: v for k, v in p_values_dict.items() if v >= 0.1} 409 large_p_values_dict = sorted(large_p_values_dict.items(), key=lambda d: d[1], reverse=True) 410 411 print("經過 p 值檢驗后,有 {} 個變量".format(len(feat_select_4))) 412 print("這些變量是:", feat_select_4) 413 414 # ---------------------------------------------------------------------------------------------------------------------- 415 # 6.對測試值進行預測,並且計算 KS 值 和 AUC 值 416 # ---------------------------------------------------------------------------------------------------------------------- 417 summary = LR.summary() 418 print(summary) 419 420 x_test = test_data[feat_select_4] 421 x_test['intercept'] = [1] * x_test.shape[0] 422 test_data['predict'] = LR.predict(x_test) 423 424 print(test_data[['predict', 'y_label']]) 425 KS, AUC = cf.CalcKsAuc(test_data, 'predict', 'y_label', draw=1) 426 print('normalLR: KS {}, AUC {}'.format(KS, AUC)) 427 428 end_time = time.process_time() 429 print("程序運行了 %s 秒" % (end_time - start_time))
1 import pandas as pd 2 import numpy as np 3 import matplotlib.pyplot as plt 4 from sklearn.metrics import roc_auc_score 5 6 plt.rcParams['font.sans-serif'] = ['SimHei'] # 用來正常顯示中文標簽 7 plt.rcParams['axes.unicode_minus'] = False # 用來正常顯示負號,#有中文出現的情況,需要u'內容' 8 9 folder_data = r'E:/3.Python_Project/LENGING_CLUB/data/' 10 11 # 分箱壞賬率計算 12 def BinBadRate(df, col, target, grantRateIndicator=0): 13 ''' 14 :param df: 需要計算好壞比率的數據集 15 :param col: 需要計算好壞比率的特征 16 :param target: 好壞標簽 17 :param grantRateIndicator: 1返回總體的壞樣本率,0不返回 18 :return: 每箱壞樣本率字典,每箱壞樣本df,以及總體的壞樣本率(當grantRateIndicator==1時) 19 ''' 20 data_df = df[[col, target]].copy() 21 total = data_df.shape[0] 22 group_df = data_df.groupby([col]).aggregate({target: ['sum', 'count']}).reset_index() 23 group_df.columns = [col, 'bad_num', 'all_num'] 24 group_df['total_num'] = total 25 group_df['bad_rate'] = group_df['bad_num'] / group_df['all_num'] 26 bad_rate_dict = dict(zip(group_df[col], group_df['bad_rate'])) # 每箱壞樣本率字典 27 overall_bad_rate = group_df['bad_num'].sum() / total # 總體壞樣本率 28 if grantRateIndicator == 0: 29 return (bad_rate_dict, group_df) 30 return (bad_rate_dict, group_df, overall_bad_rate) 31 32 33 # 合並全為0或全為1的箱 34 def MergeBin(df, col, target, direction='bad'): 35 ''' 36 :param df: 包含檢驗0%或者100%壞樣本率 37 :param col: 分箱后的變量或者類別型變量。檢驗其中是否有一組或者多組沒有壞樣本或者沒有好樣本。如果是,則需要進行合並 38 :param target: 目標變量,0、1表示好、壞 39 :return: 返回每個變量不同值的分箱方案,是個字典 40 ''' 41 regroup = BinBadRate(df, col, target)[1] 42 if direction == 'bad': 43 # 如果是合並0壞樣本率的組,則跟最小的非0壞樣本率的組進行合並 44 regroup = regroup.sort_values(by='bad_rate') 45 else: 46 # 如果是合並0好樣本樣本率的組,則跟最小的非0好樣本率的組進行合並 47 regroup = regroup.sort_values(by='bad_rate', ascending=False) 48 49 regroup.index = range(regroup.shape[0]) 50 col_regroup = [[i] for i in regroup[col]] 51 del_index = [] 52 53 for i in range(regroup.shape[0] - 1): 54 col_regroup[i + 1] = col_regroup[i] + col_regroup[i + 1] 55 del_index.append(i) 56 if direction == 'bad': 57 if regroup['bad_rate'][i + 1] > 0: 58 break 59 else: 60 if regroup['bad_rate'][i + 1] < 1: 61 break 62 col_regroup2 = [col_regroup[i] for i in range(len(col_regroup)) if i not in del_index] 63 newGroup = {} 64 for i in range(len(col_regroup2)): 65 for g2 in col_regroup2[i]: 66 newGroup[g2] = 'Bin ' + str(i) 67 return newGroup 68 69 70 # badrate 編碼 71 def BadRateEncoding(df, col, target): 72 ''' 73 :param df: dataframe containing feature and target 74 :param col: the feature that needs to be encoded with bad rate, usually categorical type 75 :param target: good/bad indicator 76 :return: 按照違約率 map 各字段的不同值,各自段不同值的違約率字典 77 ''' 78 bad_rate_dict, group_df = BinBadRate(df, col, target, grantRateIndicator=0) 79 bad_rate_encoded_df = df[col].map(lambda x: bad_rate_dict[x]) 80 return {'encoded_df': bad_rate_encoded_df, 'bad_rate_dict': bad_rate_dict} 81 82 83 # 對數據使用等頻粗分數據,得到切分數據 84 def SplitData(df, col, num_of_split, special_attribute=[]): 85 ''' 86 :param df: 按照col排序后的數據集 87 :param col: 待分箱的變量 88 :param numOfSplit: 切分的組別數 89 :param special_attribute: 在切分數據集的時候,某些特殊值需要排除在外 90 :return: col中“切分點“的值列表 91 ''' 92 df2 = df.copy() 93 if special_attribute != []: 94 df2 = df.loc[~df[col].isin(special_attribute)] 95 N = df2.shape[0] 96 n = int(N / num_of_split) 97 split_point_index = [i * n for i in range(1, num_of_split)] 98 raw_values = sorted(list(df2[col])) 99 split_point = [raw_values[i] for i in split_point_index] 100 split_point = sorted(list(set(split_point))) 101 return split_point # col中“切分點“右邊第一個值 102 103 104 # 按切分點分箱 105 def AssignBin(x, split_points, special_attribute=[]): 106 ''' 107 
:param x: 某個變量的某個取值 108 :param split_points: 上述變量的分箱結果,用切分點表示 109 :param special_attribute: 不參與分箱的特殊取值 110 :return: 分箱后的對應的第幾個箱,從0開始 111 for example, if split_points = [10,20,30], if x = 7, return Bin 0. If x = 35, return Bin 3 112 ''' 113 bin_num = len(split_points) + 1 114 if x in special_attribute: 115 i = special_attribute.index(x) + 1 116 return 'Bin_{}'.format(0 - i) 117 elif x <= split_points[0]: 118 return 'Bin_0' 119 elif x > split_points[-1]: 120 return 'Bin_{}'.format(bin_num - 1) 121 else: 122 for i in range(0, bin_num - 1): 123 if split_points[i] < x <= split_points[i + 1]: 124 return 'Bin_{}'.format(i + 1) 125 126 127 # 按切分點,將值進行映射,每箱里的值映射的是每箱切分點右側點的值 128 def AssignGroup(x, split_points): 129 ''' 130 :param x: 某個變量的某個取值 131 :param split_points: 切分點列表 132 :return: x在分箱結果下的映射 133 ''' 134 N = len(split_points) 135 if x <= min(split_points): 136 return min(split_points) 137 elif x > max(split_points): 138 return 10e10 139 else: 140 for i in range(N - 1): 141 if split_points[i] < x <= split_points[i + 1]: 142 return split_points[i + 1] 143 144 145 # 計算卡方值 146 def Chi2(df, all_col, bad_col): 147 ''' 148 :param df: 包含全部樣本總計與壞樣本總計的數據框 149 :param all_col: 全部樣本的個數 150 :param bad_col: 壞樣本的個數 151 :return: 卡方值 152 ''' 153 df2 = df.copy() 154 bad_rate = sum(df2[bad_col]) * 1.0 / sum(df2[all_col]) # 總體壞樣本率 155 # 當全部樣本只有好或者壞樣本時,卡方值為0 156 if bad_rate in [0, 1]: 157 return 0 158 df2['good'] = df2.apply(lambda x: x[all_col] - x[bad_col], axis=1) 159 good_rate = sum(df2['good']) * 1.0 / sum(df2[all_col]) # 總體好樣本率 160 # 期望壞(好)樣本個數=全部樣本個數*平均壞(好)樣本占比 161 162 df2['bad_expected'] = df[all_col].apply(lambda x: x * bad_rate) 163 df2['good_expected'] = df[all_col].apply(lambda x: x * good_rate) 164 bad_combined = zip(df2['bad_expected'], df2[bad_col]) 165 good_combined = zip(df2['good_expected'], df2['good']) 166 bad_chi = [(i[0] - i[1]) ** 2 / i[0] for i in bad_combined] 167 good_chi = [(i[0] - i[1]) ** 2 / i[0] for i in good_combined] 168 chi2 = sum(bad_chi) + sum(good_chi) 169 return chi2 170 171 172 # 卡方分箱 173 def ChiMerge(df, col, target, max_bin_num=5, special_attribute=[], min_bin_threshold=0): 174 ''' 175 :param df: 包含目標變量與分箱屬性的數據框 176 :param col: 需要分箱的屬性 177 :param target: 目標變量,取值0或1 178 :param max_bin_num: 最大分箱數。如果原始屬性的取值個數低於該參數,不執行這段函數 179 :param special_attribute: 不參與分箱的屬性取值 180 :param min_bin_threshold:分箱后,箱內記錄數最少的數量控制要求 181 :return: 返回分箱切分點列表 182 ''' 183 value_list = sorted(list(set(df[col]))) # 變量的值從小達到排列列表 184 N_distinct = len(value_list) 185 if N_distinct <= max_bin_num: # 如果原始屬性的取值個數低於max_interval,不執行這段函數 186 print("{} 變量值的種類個數小於最大分箱數".format(col)) 187 return value_list[:-1] 188 else: 189 if len(special_attribute) >= 1: 190 df1 = df.loc[df[col].isin(special_attribute)] 191 df2 = df.loc[~df[col].isin(special_attribute)] # 去掉special_attribute后的df 192 else: 193 df2 = df.copy() 194 N_distinct = len(list(set(df2[col]))) 195 196 # 步驟一: 粗分箱,通過col對數據集進行分組,求出每組的總樣本數與壞樣本數 197 if N_distinct > 100: 198 split_x = SplitData(df2, col, 100, special_attribute=special_attribute) 199 df2['temp'] = df2.loc[:, col].map(lambda x: AssignGroup(x, split_x)) 200 # Assgingroup函數:每一行的數值和切分點做對比,返回原值在切分后的映射, 201 # 經過map以后,生成該特征的值對象的“分箱”后的值 202 else: 203 df2['temp'] = df2[col] 204 # 總體bad rate將被用來計算expected bad count 205 (binBadRate, regroup, overallRate) = BinBadRate(df2, 'temp', target, grantRateIndicator=1) 206 207 # 首先,每個單獨的屬性值將被分為單獨的一組 208 # 對屬性值進行排序,然后兩兩組別進行合並 209 value_list = sorted(list(set(df2['temp']))) 210 current_bin = [[i] for i in value_list] # 把每個箱的值打包成[[],[]]的形式 211 212 # 步驟二:建立循環,不斷合並最優的相鄰兩個組別,直到: 
213 # 1,最終分裂出來的分箱數<=預設的最大分箱數 214 # 2,每箱的占比不低於預設值(可選) 215 # 3,每箱同時包含好壞樣本 216 # 如果有特殊屬性,那么最終分裂出來的分箱數=預設的最大分箱數-特殊屬性的個數 217 218 plan_bin_num = max_bin_num - len(special_attribute) 219 while (len(current_bin) > plan_bin_num): # 終止條件: 當前分箱數=預設的分箱數 220 # 每次循環時, 計算合並相鄰組別后的卡方值。具有最小卡方值的合並方案,是最優方案 221 chi2_list = [] 222 for k in range(len(current_bin) - 1): 223 temp_group = current_bin[k] + current_bin[k + 1] 224 df2b = regroup.loc[regroup['temp'].isin(temp_group)] 225 chi2 = Chi2(df2b, 'all_num', 'bad_num') 226 chi2_list.append(chi2) 227 # 把 current_bin 的值改成類似的值改成類似從[[1][2],[3]]到[[1,2],[3]] 228 best_combined = chi2_list.index(min(chi2_list)) # 找到卡方值最小的進行合並 229 current_bin[best_combined] = current_bin[best_combined] + current_bin[best_combined + 1] 230 current_bin.remove(current_bin[best_combined + 1]) 231 232 current_bin = [sorted(i) for i in current_bin] 233 split_points = [max(i) for i in current_bin[:-1]] # 卡方分箱后的切分點 234 235 # 檢查是否有箱沒有好或者壞樣本。如果有,需要跟相鄰的箱進行合並,直到每箱同時包含好壞樣本 236 # print("程序檢查 split_points ",split_points) 237 temp_bins = df2['temp'].apply( 238 lambda x: AssignBin(x, split_points, special_attribute=special_attribute)) # 每個原始箱對應卡方分箱后的箱號 239 df2['temp_bin'] = temp_bins 240 (bin_bad_rate, regroup) = BinBadRate(df2, 'temp_bin', target) 241 [min_bad_rate, max_bad_rate] = [min(bin_bad_rate.values()), max(bin_bad_rate.values())] 242 243 while min_bad_rate == 0 or max_bad_rate == 1: 244 # 找出全部為好/壞樣本的箱 245 bin_0_1 = regroup.loc[regroup['bad_rate'].isin([0, 1]), 'temp_bin'].tolist() 246 bin = bin_0_1[0] 247 248 # 如果是最后一箱,則需要和上一個箱進行合並,也就意味着分裂點split_points中的最后一個需要移除 249 if bin == max(regroup.temp_bin): 250 split_points = split_points[:-1] 251 # 如果是第一箱,則需要和下一個箱進行合並,也就意味着分裂點split_points中的第一個需要移除 252 elif bin == min(regroup.temp_bin): 253 split_points = split_points[1:] 254 # 如果是中間的某一箱,則需要和前后中的一個箱進行合並,依據是較小的卡方值 255 else: 256 current_index = list(regroup.temp_bin).index(bin) 257 # 和前一箱進行合並,並且計算卡方值 258 prev_index = list(regroup.temp_bin)[current_index - 1] 259 df3 = df2.loc[df2['temp_bin'].isin([prev_index, bin])] 260 (bin_bad_rate, df2b) = BinBadRate(df3, 'temp_bin', target) 261 chi2_1 = Chi2(df2b, 'all_num', 'bad_num') 262 263 later_index = list(regroup.temp_bin)[current_index + 1] 264 df3b = df2.loc[df2['temp_bin'].isin([later_index, bin])] 265 (binBadRate, df2b) = BinBadRate(df3b, 'temp_bin', target) 266 chi2_2 = Chi2(df2b, 'all_num', 'bad_num') 267 # 和后一箱進行合並,並且計算卡方值 268 if chi2_1 < chi2_2: 269 split_points.remove(split_points[current_index - 1]) 270 else: 271 split_points.remove(split_points[current_index]) 272 273 # 完成合並之后,需要再次計算新的分箱准則下,每箱是否同時包含好壞樣本 274 temp_bins = df2['temp'].apply(lambda x: AssignBin(x, split_points, special_attribute=special_attribute)) 275 df2['temp_bin'] = temp_bins 276 (bin_bad_rate, regroup) = BinBadRate(df2, 'temp_bin', target) 277 [min_bad_rate, max_bad_rate] = [min(bin_bad_rate.values()), max(bin_bad_rate.values())] 278 279 # 需要檢查分箱后的箱數含量最小占比 280 if min_bin_threshold > 0: 281 temp_bins = df2['temp'].apply(lambda x: AssignBin(x, split_points, special_attribute=special_attribute)) 282 df2['temp_bin'] = temp_bins 283 bin_counts = temp_bins.bin_counts().to_frame() 284 bin_counts['pcnt'] = bin_counts['temp'].apply(lambda x: x * 1.0 / sum(bin_counts['temp'])) 285 bin_counts = bin_counts.sort_index() 286 min_pcnt = min(bin_counts['pcnt']) 287 while min_pcnt < min_bin_threshold and len(split_points) > 2: 288 # 找出占比最小的箱 289 index_of_min_pcnt = bin_counts[bin_counts['pcnt'] == min_pcnt].index.tolist()[0] 290 # 如果占比最小的箱是最后一箱,則需要和上一個箱進行合並,也就意味着分裂點split_points中的最后一個需要移除 291 if 
index_of_min_pcnt == max(bin_counts.index): 292 split_points = split_points[:-1] 293 # 如果占比最小的箱是第一箱,則需要和下一個箱進行合並,也就意味着分裂點split_points中的第一個需要移除 294 elif index_of_min_pcnt == min(bin_counts.index): 295 split_points = split_points[1:] 296 # 如果占比最小的箱是中間的某一箱,則需要和前后中的一個箱進行合並,依據是較小的卡方值 297 else: 298 # 和前一箱進行合並,並且計算卡方值 299 current_index = list(bin_counts.index).index(index_of_min_pcnt) 300 prev_index = list(bin_counts.index)[current_index - 1] 301 df3 = df2.loc[df2['temp_bin'].isin([prev_index, index_of_min_pcnt])] 302 (bin_bad_rate, df2b) = BinBadRate(df3, 'temp_bin', target) 303 chi2_1 = Chi2(df2b, 'all_num', 'bad_num') 304 305 # 和后一箱進行合並,並且計算卡方值 306 later_index = list(bin_counts.index)[current_index + 1] 307 df3b = df2.loc[df2['temp_bin'].isin([later_index, index_of_min_pcnt])] 308 (bin_bad_rate, df2b) = BinBadRate(df3b, 'temp_bin', target) 309 chi2_2 = Chi2(df2b, 'all_num', 'bad_num') 310 311 if chi2_1 < chi2_2: 312 split_points.remove(split_points[current_index - 1]) 313 else: 314 split_points.remove(split_points[current_index]) 315 316 temp_bins = df2['temp'].apply(lambda x: AssignBin(x, split_points, special_attribute=special_attribute)) 317 df2['temp_bin'] = temp_bins 318 bin_counts = temp_bins.bin_counts().to_frame() 319 bin_counts['pcnt'] = bin_counts['temp'].apply(lambda x: x * 1.0 / sum(bin_counts['temp'])) 320 bin_counts = bin_counts.sort_index() 321 min_pcnt = min(bin_counts['pcnt']) 322 323 split_points = special_attribute + split_points 324 325 return split_points 326 327 328 # 判斷某變量的壞樣本率是否單調 329 def BadRateMonotone(df, col, target, special_attribute=[]): 330 ''' 331 :param df: 包含檢驗壞樣本率的變量,和目標變量 332 :param col: 需要檢驗壞樣本率的變量 333 :param target: 目標變量,0、1表示好、壞 334 :param special_attribute: 不參與檢驗的特殊值 335 :return: 壞樣本率單調與否 336 ''' 337 338 special_bin = ['Bin_{}'.format(i) for i in special_attribute] # 填充未知變量 Unknown 所在分箱 339 df2 = df.loc[~df[col].isin(special_bin)] 340 if len(set(df2[col])) <= 2: 341 return True 342 bin_bad_rate, regroup = BinBadRate(df2, col, target) 343 bad_rate = [bin_bad_rate[i] for i in sorted(bin_bad_rate)] 344 # 嚴格單調遞增 345 if all(bad_rate[i] < bad_rate[i + 1] for i in range(len(bad_rate) - 1)): 346 return True 347 # 嚴格單調遞減 348 elif all(bad_rate[i] > bad_rate[i + 1] for i in range(len(bad_rate) - 1)): 349 return True 350 else: 351 return False 352 353 354 # 計算變量的 WOE 值 和 IV 值 355 def CalcWOE(df, col, target): 356 ''' 357 :param df: 包含需要計算WOE的變量和目標變量 358 :param col: 需要計算WOE、IV的變量,必須是分箱后的變量,或者不需要分箱的類別型變量 359 :param target: 目標變量,0、1表示好、壞 360 :return: 返回WOE字典和IV值 361 ''' 362 total = df.groupby([col])[target].count() 363 total = pd.DataFrame({'total': total}) 364 bad = df.groupby([col])[target].sum() 365 bad = pd.DataFrame({'bad': bad}) 366 regroup = total.merge(bad, left_index=True, right_index=True, how='left') 367 regroup.reset_index(level=0, inplace=True) 368 N = sum(regroup['total']) 369 B = sum(regroup['bad']) 370 regroup['good'] = regroup['total'] - regroup['bad'] 371 G = N - B 372 regroup['bad_pcnt'] = regroup['bad'].map(lambda x: x * 1.0 / B) 373 regroup['good_pcnt'] = regroup['good'].map(lambda x: x * 1.0 / G) 374 regroup['WOE'] = regroup.apply(lambda x: np.log(x.good_pcnt * 1.0 / x.bad_pcnt), axis=1) 375 WOE_dict = regroup[[col, 'WOE']].set_index(col).to_dict(orient='index') 376 for k, v in WOE_dict.items(): 377 WOE_dict[k] = v['WOE'] 378 IV = regroup.apply(lambda x: (x.good_pcnt - x.bad_pcnt) * np.log(x.good_pcnt * 1.0 / x.bad_pcnt), axis=1) 379 IV = sum(IV) 380 return {"WOE": WOE_dict, 'IV': IV} 381 382 383 # 計算 KS 值 384 def CalcKsAuc(data, var_col, y_col, 
draw=True): 385 ''' 386 :param df: 包含目標變量與預測值的數據集 387 :param score: 得分或者概率 388 :param target: 目標變量 389 :return: KS值 和 AUC值 390 ''' 391 temp_df = pd.crosstab(data[var_col], data[y_col]) 392 KS_df = temp_df.cumsum(axis=0) / temp_df.sum() 393 KS_df['KS'] = abs(KS_df[0] - KS_df[1]) 394 KS_df.columns = ['TPR', 'FPR', 'KS'] 395 KS = KS_df['KS'].max() 396 KS_df = KS_df.reset_index() 397 AUC = roc_auc_score(data[y_col], data[var_col]) 398 399 if draw: 400 fig, ax = plt.subplots(1, 2,figsize=(20,10)) 401 KS_down = float(KS_df.loc[KS_df['KS'] == KS, 'FPR']) 402 KS_up = float(KS_df.loc[KS_df['KS'] == KS, 'TPR']) 403 KS_x = float(KS_df.loc[KS_df['KS'] == KS, 'predict']) 404 ax[0].plot(KS_df['predict'], KS_df['FPR'], 'royalblue') 405 ax[0].plot(KS_df['predict'], KS_df['TPR'], 'darkorange') 406 ax[0].plot(KS_df['predict'], KS_df['KS'], 'mediumseagreen') 407 ax[0].plot([KS_x, KS_x], [KS_down, KS_up], '--r') 408 ax[0].text(0.2, 0.1, 'max(KS) = {}'.format(KS)) 409 ax[0].set_title("KS 曲線", fontsize=14) 410 ax[0].set_xlabel("", fontsize=12) 411 ax[0].set_ylabel("", fontsize=12) 412 ax[0].legend() 413 414 ax[1].plot(KS_df['FPR'], KS_df['TPR'], 'royalblue') 415 ax[1].plot([0, 1], [0, 1], '--r') 416 ax[1].set_title("ROC 曲線", fontsize=14) 417 ax[1].fill_between(KS_df['FPR'], 0, KS_df['TPR'], facecolor='mediumseagreen', alpha=0.2) 418 ax[1].text(0.6, 0.1, 'AUC = {}'.format(AUC)) 419 ax[1].set_xlabel("False positive rate", fontsize=12) 420 ax[1].set_ylabel("True positive rate", fontsize=12) 421 plt.savefig(folder_data + "KS&AUC.png", dpi=1000, bbox_inches='tight') # 解決圖片不清晰,不完整的問題 422 # plt.show() 423 return KS, AUC
