toad: a standardized scorecard library

Reposted article, 2020-11-23 19:54. Category: financial indicators

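toad is a standardized scorecard library that was incubated inside the risk-control team at Houben Financial (厚本金融), later open-sourced, and has been maintained ever since. It is full-featured, robust and fast, reported issues are fixed quickly, and it is well liked by practitioners. If you do not yet have standardized credit-scoring tools or enterprise-grade custom scripts, toad should save you a great deal of time.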

GitHub home page: https://github.com/amphibian-dev/toad

Documentation: https://toad.readthedocs.io/

Tutorial: https://toad.readthedocs.io/en/latest/tutorial.html

Wheel downloads: https://pypi.org/simple/toad/

Original article: https://zhuanlan.zhihu.com/p/90354450

1. Load the required modules

import pandas as pd
from sklearn.metrics import roc_auc_score,roc_curve,auc  # common classification metrics
from sklearn.model_selection import train_test_split # data splitting
from sklearn.linear_model import LogisticRegression # logistic regression
from sklearn.model_selection import GridSearchCV as gscv # grid search
from sklearn.neighbors import KNeighborsClassifier # k-nearest neighbours
import numpy as np
import glob
import math
import xgboost as xgb
import toad

# pc is a helper module of the author's that is never imported in this post;
# obj_info appears to list the members of a module
pc.obj_info(toad)

ObjInfo object of :
    Modules: ['c_utils', 'detector', 'metrics', 'scorecard', 'selection', 'stats', 'transform', 'utils', 'version']

    Classes/objects: ['ScoreCard']

    Functions/methods: ['F1', 'KS', 'KS_bucket', 'VIF', 'WOE', 'detect', 'entropy', 'gini', 'quality', 'select']

    Attributes: ['ChiMerge', 'DTMerge', 'IV', 'KMeansMerge', 'QuantileMerge', 'StepMerge', 'VERSION', 'entropy_cond', 'gini_cond', 'merge']

2. Load the data

# Load the data
path = "D:/風控模型/data/"

data_all = pd.read_csv(path+"data.txt",engine='python',index_col=False)
data_all_woe = pd.read_csv(path+"ccard_all_woe.txt",engine='python',index_col=False)

# Columns that do not take part in training
ex_lis = ['uid','obs_mth','ovd_dt','samp_type','weight',
          'af30_status','submit_time','bad_ind']

# Columns that take part in training
ft_lis = list(data_all.columns)
for i in ex_lis:
    ft_lis.remove(i)

3. Split into training and test sets

# Training set and out-of-time validation set
dev = data_all[(data_all['samp_type'] == 'dev') | 
        (data_all['samp_type'] == 'val') | 
        (data_all['samp_type'] == 'off1') ]
off = data_all[data_all['samp_type'] == 'off2']

4. EDA

Exploratory data analysis. toad.detector.detect handles numeric and string columns in one pass.

a = toad.detector.detect(data_all)
a.head(8)
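The columns that toad.detector.detect returns are not shown in this post. As a rough idea of the kind of per-column summary an EDA step produces, here is a minimal pure-Python sketch. The helper `describe_column` and the toy table are made up for illustration and are not part of toad.

```python
def describe_column(values):
    """Summarise one column given as a list; None marks a missing value."""
    present = [v for v in values if v is not None]
    summary = {
        "size": len(values),
        "missing": 1 - len(present) / len(values),  # missing rate
        "unique": len(set(present)),
    }
    # Numeric columns get extra statistics; string columns do not.
    if present and all(isinstance(v, (int, float)) for v in present):
        summary["mean"] = sum(present) / len(present)
        summary["min"], summary["max"] = min(present), max(present)
    return summary

# Toy table with one numeric and one string column (made-up data).
table = {"age": [25, 40, None, 31], "city": ["A", "B", "A", None]}
report = {name: describe_column(col) for name, col in table.items()}
for name, summary in report.items():
    print(name, summary)
```

The missing rate computed here is the quantity that the empty threshold in the feature-selection step is compared against.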

pc.obj_info(toad.detector)

ObjInfo object of :
    Modules: ['pd']

    Functions/methods: ['countBlank', 'detect', 'getDescribe', 'getTopValues', 'isNumeric']

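5. Feature selection

empty: upper limit on the missing rate; features above it are dropped
iv: information value (IV); features below this threshold are dropped
corr: when the correlation coefficient between features exceeds this threshold, the feature with the lower IV is dropped
return_drop: whether to also return the names of the dropped features
exclude: names of variables that are exempt from selection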
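To make the iv threshold more concrete, the sketch below computes WOE and IV for a single binned feature from per-bin bad/good counts. It assumes the convention WOE = ln(bad share / good share) and that no bin has a zero count; `iv_from_counts` and the counts are hypothetical, and this is not toad's own implementation.

```python
import math

def iv_from_counts(bad_counts, good_counts):
    """Return (woe_per_bin, iv) from per-bin bad/good counts (hypothetical helper)."""
    total_bad = sum(bad_counts)
    total_good = sum(good_counts)
    woes, iv = [], 0.0
    for bad, good in zip(bad_counts, good_counts):
        bad_share = bad / total_bad      # share of all bads that fall in this bin
        good_share = good / total_good   # share of all goods that fall in this bin
        woe = math.log(bad_share / good_share)  # assumes non-zero counts
        woes.append(woe)
        iv += (bad_share - good_share) * woe
    return woes, iv

# Made-up counts for a feature split into three bins.
woes, iv = iv_from_counts(bad_counts=[10, 30, 60], good_counts=[400, 350, 250])
print("WOE per bin:", woes)
print("IV:", iv, "-> kept" if iv >= 0.02 else "-> dropped")
```

Each term of the sum is non-negative, so IV grows with how differently goods and bads are distributed across the bins.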

pc.obj_info(toad.selection)

ObjInfo object of :
    Modules: ['np', 'pd', 'stats']

    Classes/objects: ['StatsModel']

    Functions/methods: ['AIC', 'AUC', 'BIC', 'KS', 'MSE', 'VIF', 'drop_corr', 'drop_empty', 'drop_iv', 'drop_var', 'drop_vif', 'select', 'split_target', 'stepwise', 'to_ndarray', 'unpack_tuple']

    Attributes: ['INTERCEPT_COLS', 'IV']

dev_slct1, drop_lst= toad.selection.select(dev,dev['bad_ind'], empty = 0.7, 
                                           iv = 0.02, corr = 0.7, return_drop=True, exclude=ex_lis)
print("keep:",dev_slct1.shape[1],
      "drop empty:",len(drop_lst['empty']),
      "drop iv:",len(drop_lst['iv']),
      "drop corr:",len(drop_lst['corr']))

keep: 584

drop empty: 637

drop iv: 1961

drop corr: 2043

dev_slct2, drop_lst= toad.selection.select(dev_slct1,dev_slct1['bad_ind'], empty = 0.6, 
                                           iv = 0.02, corr = 0.7, return_drop=True, exclude=ex_lis)
print("keep:",dev_slct2.shape[1],
      "drop empty:",len(drop_lst['empty']),
      "drop iv:",len(drop_lst['iv']),
      "drop corr:",len(drop_lst['corr']))

keep: 560

drop empty: 24

drop iv: 0

drop corr: 0

help(toad.selection.select)

Help on function select in module toad.selection:

select(frame, target='target', empty=0.9, iv=0.02, corr=0.7, return_drop=False, exclude=None)
    select features by rate of empty, iv and correlation
    
    Args:
        frame (DataFrame)
        target (str): target's name in dataframe
        empty (number): drop the features which empty num is greater than threshold. if threshold is float, it will be use as percentage
        iv (float): drop the features whose IV is less than threshold
        corr (float): drop features that has the smallest IV in each groups which correlation is greater than threshold
        return_drop (bool): if need to return features' name who has been dropped
        exclude (array-like): list of feature name that will not be dropped
    
    Returns:
        DataFrame: selected dataframe
        dict: list of dropped feature names in each step
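The corr rule described in the help text can be pictured with a simplified greedy filter: walk through the features from highest to lowest IV and drop any later feature that correlates too strongly with one already kept. This is only a sketch under that assumption; toad's actual grouping logic may differ, and `pearson`, `drop_correlated` and the data are invented for the example.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

def drop_correlated(columns, iv, threshold=0.7):
    """columns: name -> values, iv: name -> IV. Returns (kept, dropped)."""
    names = sorted(columns, key=lambda name: iv[name], reverse=True)
    dropped = set()
    for i, keep in enumerate(names):
        if keep in dropped:
            continue
        for other in names[i + 1:]:
            if other in dropped:
                continue
            if abs(pearson(columns[keep], columns[other])) > threshold:
                dropped.add(other)  # lower IV of the correlated pair
    return [n for n in names if n not in dropped], sorted(dropped)

# Made-up features: b is almost a multiple of a, c is unrelated.
columns = {"a": [1, 2, 3, 4, 5], "b": [2, 4, 6, 8, 11], "c": [5, 1, 4, 2, 3]}
iv = {"a": 0.30, "b": 0.10, "c": 0.05}
kept, dropped = drop_correlated(columns, iv, threshold=0.7)
print(kept, dropped)  # ['a', 'c'] ['b']
```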

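6. Binning

First find the split points for the bins.
The available methods (method) for finding split points are: 'chi','dt','quantile','step','kmeans'
Then use those split points to perform coarse binning.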
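For the 'chi' method, ChiMerge-style binning repeatedly merges the pair of adjacent bins whose good/bad distributions are most alike, as measured by a chi-square statistic. The sketch below shows only that statistic and the choice of which pair to merge next; the function name and counts are hypothetical, and the stopping rules that a full implementation needs (such as min_samples) are left out.

```python
def chi2_adjacent(bin_a, bin_b):
    """Chi-square statistic for two adjacent bins, each given as (bad, good) counts."""
    col_totals = [bin_a[0] + bin_b[0], bin_a[1] + bin_b[1]]
    total = sum(col_totals)
    chi2 = 0.0
    for row in (bin_a, bin_b):
        row_total = sum(row)
        for observed, col_total in zip(row, col_totals):
            expected = row_total * col_total / total
            if expected > 0:
                chi2 += (observed - expected) ** 2 / expected
    return chi2

# Made-up (bad, good) counts for three adjacent bins.
bins = [(5, 95), (6, 94), (30, 70)]
scores = [chi2_adjacent(bins[i], bins[i + 1]) for i in range(len(bins) - 1)]
merge_at = scores.index(min(scores))  # the most similar neighbours are merged first
print(scores, "-> merge bins", merge_at, "and", merge_at + 1)
```

Here the first two bins have nearly identical bad rates, so they produce the smallest statistic and would be merged first.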

pc.obj_info(toad.transform)

ObjInfo object of :
    Modules: ['copy', 'math', 'np', 'pd']

    Classes/objects: ['BinsMixin', 'Combiner', 'GBDTTransformer', 'GradientBoostingClassifier', 'OneHotEncoder', 'RulesMixin', 'Transformer', 'TransformerMixin', 'WOETransformer', 'frame_exclude', 'select_dtypes']

    Functions/methods: ['WOE', 'bin_by_splits', 'np_count', 'probability', 'split_target', 'to_ndarray', 'wraps']

    Attributes: ['merge']

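# Fit the combiner to find the split points
combiner = toad.transform.Combiner()
combiner.fit(dev_slct2,dev_slct2['bad_ind'],method='chi',min_samples = 0.05,
             exclude=ex_lis)
# Export the split points of every feature
bins = combiner.export()

# Apply the binning using those split points
dev_slct3 = combiner.transform(dev_slct2)
off3 = combiner.transform(off[dev_slct2.columns])

# Inspect the result with a plot; x is one of the binned columns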
from toad.plot import  bin_plot,badrate_plot
bin_plot(dev_slct3,x='p_ovpromise_6mth',target='bad_ind')
bin_plot(off3,x='p_ovpromise_6mth',target='bad_ind')

pc.obj_info(toad.plot)

ObjInfo object of :
    Modules: ['np', 'pd']

    Functions/methods: ['AUC', 'add_annotate', 'add_text', 'badrate_plot', 'bin_plot', 'corr_plot', 'generate_str', 'proportion_plot', 'reset_ylim', 'roc_plot', 'unpack_tuple']

    Attributes: ['HEATMAP_CMAP', 'IV', 'MAX_STYLE', 'tadpole']

The last two bins are not monotonic

# Inspect this feature's split points: [0.0, 24.0, 60.0, 100.0]
bins['p_ovpromise_6mth']

Merge the last two bins

adj_bin = {'p_ovpromise_6mth':  [0.0, 24.0, 60.0]}
combiner.set_rules(adj_bin)
dev_slct3 = combiner.transform(dev_slct2)
off3 = combiner.transform(off[dev_slct2.columns])

bin_plot(dev_slct3,x='p_ovpromise_6mth',target='bad_ind')
bin_plot(off3,x='p_ovpromise_6mth',target='bad_ind')
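To see why removing the split point 100.0 merges the last two bins, here is a small sketch that assigns values to bins from a list of split points. It assumes each split point is the closed left edge of the next bin; `assign_bin` is a hypothetical helper, not part of toad, and the sample values are invented.

```python
from bisect import bisect_right

def assign_bin(value, splits):
    """Index of the bin holding value; bin i covers [splits[i-1], splits[i])."""
    return bisect_right(splits, value)

values = [-1.0, 10.0, 30.0, 75.0, 150.0]
before = [assign_bin(v, [0.0, 24.0, 60.0, 100.0]) for v in values]
after = [assign_bin(v, [0.0, 24.0, 60.0]) for v in values]
print(before)  # [0, 1, 2, 3, 4]
print(after)   # [0, 1, 2, 3, 3]
```

With four split points there are five bins; dropping the last split point leaves four, and everything from 60.0 upward lands in the same final bin.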

Check whether the feature's bad-rate curves cross between the different datasets

data = pd.concat([dev_slct3,off3],join='inner') 
badrate_plot(data, x='samp_type', target='bad_ind', by='p_ovpromise_6mth')

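The curves do not cross, so this feature's bins need no further merging. For reasons of space, the fine-tuning of every other feature is not shown. Next comes the WOE mapping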

#
t=toad.transform.WOETransformer()
dev_slct2_woe = t.fit_transform(dev_slct3,dev_slct3['bad_ind'], exclude=ex_lis)
off_woe = t.transform(off3[dev_slct3.columns])

data = pd.concat([dev_slct2_woe,off_woe])

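Select features by stability: compute the PSI between the training set and the out-of-time validation set, and drop every feature whose PSI is greater than 0.05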

#(41199, 476)
psi_df = toad.metrics.PSI(dev_slct2_woe, off_woe).sort_values(0)
psi_df = psi_df.reset_index()
psi_df = psi_df.rename(columns = {'index' : 'feature',0:'psi'})

psi005 = list(psi_df[psi_df.psi<0.05].feature)
for i in ex_lis:
    if i in psi005:
        pass
    else:
       psi005.append(i) 
data = data[psi005]  
dev_woe_psi = dev_slct2_woe[psi005]
off_woe_psi = off_woe[psi005]
print(data.shape)
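The PSI used for this filter compares how a feature (or score) is distributed across bins in two samples. A minimal sketch of the usual formula follows, assuming both samples have already been binned into the same bins and no bin is empty; the helper name and the counts are made up and this is not toad's implementation.

```python
import math

def psi(expected_counts, actual_counts):
    """Population stability index between two binned distributions."""
    expected_total = sum(expected_counts)
    actual_total = sum(actual_counts)
    value = 0.0
    for expected, actual in zip(expected_counts, actual_counts):
        e_share = expected / expected_total
        a_share = actual / actual_total
        value += (a_share - e_share) * math.log(a_share / e_share)
    return value

dev_counts = [300, 400, 300]      # made-up bin counts on the training set
stable_off = [280, 410, 310]      # out-of-time sample with a similar distribution
shifted_off = [100, 300, 600]     # out-of-time sample that has drifted

print("stable :", psi(dev_counts, stable_off))
print("shifted:", psi(dev_counts, shifted_off))
```

Under the 0.05 cut-off used above, the first feature would be kept and the second dropped.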

Inspect this module

pc.obj_info(toad.metrics)

ObjInfo object of :
    Modules: ['np', 'pd']

    Classes/objects: ['Combiner']

    Functions/methods: ['AIC', 'AUC', 'BIC', 'F1', 'KS', 'KS_bucket', 'KS_by_col', 'MSE', 'PSI', 'SSE', 'bin_by_splits', 'f1_score', 'feature_splits', 'iter_df', 'ks_2samp', 'matrix', 'roc_auc_score', 'roc_curve', 'unpack_tuple']

    Attributes: ['merge']

Binning tends to strengthen the collinearity between variables, so the features are filtered by correlation once more

#
dev_woe_psi2, drop_lst= toad.selection.select(dev_woe_psi,dev_woe_psi['bad_ind'], empty = 0.6, 
                                           iv = 0.02, corr = 0.5, return_drop=True, exclude=ex_lis)
print("keep:",dev_woe_psi2.shape[1],
      "drop empty:",len(drop_lst['empty']),
      "drop iv:",len(drop_lst['iv']),
      "drop corr:",len(drop_lst['corr']))

keep: 85

drop empty: 0

drop iv: 56

drop corr: 335

Next, run stepwise regression for the final round of feature selection. Test criterion (criterion):

'aic'
'bic'

7. Estimator used for the test (estimator)

'ols': LinearRegression,
'lr': LogisticRegression,
'lasso': Lasso,
'ridge': Ridge,
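The 'aic' and 'bic' criteria both trade goodness of fit against the number of parameters, and stepwise selection keeps a candidate feature only if the criterion improves. The sketch below uses the textbook definitions with a log-likelihood as input; how toad derives the likelihood for each estimator is not covered in this post, and the numbers are invented.

```python
import math

def aic(log_likelihood, n_params):
    """Akaike information criterion: lower is better."""
    return 2 * n_params - 2 * log_likelihood

def bic(log_likelihood, n_params, n_samples):
    """Bayesian information criterion: penalises parameters more as n grows."""
    return n_params * math.log(n_samples) - 2 * log_likelihood

# Hypothetical step: adding a sixth feature lifts the log-likelihood only slightly.
current = aic(log_likelihood=-1200.0, n_params=5)
candidate = aic(log_likelihood=-1199.5, n_params=6)
print(current, candidate)  # 2410.0 2411.0
print("add feature" if candidate < current else "reject feature")
```

In this example the gain in fit does not pay for the extra parameter, so a forward step would reject the feature.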

#(41199, 33)
dev_woe_psi_stp = toad.selection.stepwise(dev_woe_psi2,
                                          dev_woe_psi2['bad_ind'],
                                          exclude = ex_lis,
                                          direction = 'both', 
                                          criterion = 'aic',
                                          estimator = 'ols',
                                          intercept = False)

off_woe_psi_stp = off_woe_psi[dev_woe_psi_stp.columns]

data = pd.concat([dev_woe_psi_stp,off_woe_psi_stp])
data.shape

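Next, define a logistic regression that will be trained in both directions, plus XGBoost as a check model

# Define the logistic regression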
def lr_model(x,y,offx,offy,C):
    model = LogisticRegression(C=C,class_weight='balanced')    
    model.fit(x,y)
    
    y_pred = model.predict_proba(x)[:,1]
    fpr_dev,tpr_dev,_ = roc_curve(y,y_pred)
    train_ks = abs(fpr_dev - tpr_dev).max()
    print('train_ks : ',train_ks)
    
    y_pred = model.predict_proba(offx)[:,1]
    fpr_off,tpr_off,_ = roc_curve(offy,y_pred)
    off_ks = abs(fpr_off - tpr_off).max()
    print('off_ks : ',off_ks)
    
    
    from matplotlib import pyplot as plt
    plt.plot(fpr_dev,tpr_dev,label = 'train')
    plt.plot(fpr_off,tpr_off,label = 'off')
    plt.plot([0,1],[0,1],'k--')
    plt.xlabel('False positive rate')
    plt.ylabel('True positive rate')
    plt.title('ROC Curve')
    plt.legend(loc = 'best')
    plt.show()

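# Define XGBoost as an auxiliary check on whether feature crossing is necessary
def xgb_model(x,y,offx,offy):
    # class_weight is not an XGBClassifier parameter, so it is left out here;
    # scale_pos_weight is XGBoost's setting for class imbalance
    model = xgb.XGBClassifier(learning_rate=0.05,
                              n_estimators=400,
                              max_depth=3,
                              min_child_weight=1,
                              subsample=1,
                              objective="binary:logistic",
                              nthread=-1,
                              scale_pos_weight=1,
                              random_state=1,
                              n_jobs=-1,
                              reg_lambda=300)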
    model.fit(x,y)
    
    print('>>>>>>>>>')
    y_pred = model.predict_proba(x)[:,1]
    fpr_dev,tpr_dev,_ = roc_curve(y,y_pred)
    train_ks = abs(fpr_dev - tpr_dev).max()
    print('train_ks : ',train_ks)
    
    y_pred = model.predict_proba(offx)[:,1]
    fpr_off,tpr_off,_ = roc_curve(offy,y_pred)
    off_ks = abs(fpr_off - tpr_off).max()
    print('off_ks : ',off_ks)
    
    
    from matplotlib import pyplot as plt
    plt.plot(fpr_dev,tpr_dev,label = 'train')
    plt.plot(fpr_off,tpr_off,label = 'off')
    plt.plot([0,1],[0,1],'k--')
    plt.xlabel('False positive rate')
    plt.ylabel('True positive rate')
    plt.title('ROC Curve')
    plt.legend(loc = 'best')
    plt.show()
# Model training
def c_train(data,dep='bg_result_compensate',exclude=None):
    from sklearn.preprocessing import StandardScaler
    std_scaler = StandardScaler()
    # Feature names
    lis = list(data.columns)
    
    # exclude defaults to None, so fall back to an empty list
    for i in exclude or []:
        lis.remove(i)

    # Note: this standardizes the feature columns of data in place
    data[lis] = std_scaler.fit_transform(data[lis])

    devv = data[(data['samp_type']=='dev') | (data['samp_type']=='val')]
    offf = data[(data['samp_type']=='off1') | (data['samp_type']=='off2') ]
    
    x,y = devv[lis],devv[dep]
    offx,offy = offf[lis],offf[dep]

    # Logistic regression, forward (train on dev, evaluate on off)
    lr_model(x,y,offx,offy,0.1)

    # Logistic regression, reverse (train on off, evaluate on dev)
    lr_model(offx,offy,x,y,0.1)
    
    # XGBoost, forward
    xgb_model(x,y,offx,offy)

    # XGBoost, reverse
    xgb_model(offx,offy,x,y)

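After fine-grained binning, if the XGBoost model's KS is clearly higher than the LR model's, feature crossing is worthwhile, and you should go back to feature engineering to derive crossed features. If the two models' KS values are close, feature crossing brings no obvious gain. The KS of the reverse model (trained on the out-of-time set, evaluated on the training set) indicates the best result the model could plausibly reach. If the reverse model performs poorly on the training set, the out-of-time validation set has an unusual distribution and the data should be re-split.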

#
c_train(data,dep='bad_ind',exclude=ex_lis)
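The KS values being compared are the largest gap between the cumulative share of bads and the cumulative share of goods as the score threshold moves, which is what abs(fpr - tpr).max() computes from the ROC curve. A dependency-free sketch of the same quantity, with invented scores and labels (1 = bad):

```python
def ks_statistic(scores, labels):
    """Max gap between cumulative bad and good shares, scanning scores high to low."""
    pairs = sorted(zip(scores, labels), key=lambda p: p[0], reverse=True)
    total_bad = sum(labels)
    total_good = len(labels) - total_bad
    cum_bad = cum_good = 0
    ks = 0.0
    i = 0
    while i < len(pairs):
        j = i
        # Samples with tied scores cross the threshold together.
        while j < len(pairs) and pairs[j][0] == pairs[i][0]:
            if pairs[j][1] == 1:
                cum_bad += 1
            else:
                cum_good += 1
            j += 1
        ks = max(ks, abs(cum_bad / total_bad - cum_good / total_good))
        i = j
    return ks

scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [1, 1, 0, 1, 0, 0, 1, 0]
print(ks_statistic(scores, labels))  # 0.5
```

A model that separates the classes perfectly reaches a KS of 1.0, while a model that ranks at random stays close to 0.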

8. Train the scorecard model

# Model training
dep = 'bad_ind'
lis = list(data.columns)
for i in ex_lis:
    lis.remove(i)
    
devv = data[(data['samp_type']=='dev') | (data['samp_type']=='val')]
offf = data[(data['samp_type']=='off1') | (data['samp_type']=='off2') ]

x,y = devv[lis],devv[dep]
offx,offy = offf[lis],offf[dep]  
   
lr = LogisticRegression()
lr.fit(x,y)

Compute the F1 score, KS and AUC

from toad.metrics import KS, F1, AUC

prob_dev = lr.predict_proba(x)[:,1]

print('Training set')
print('F1:', F1(prob_dev,y))
print('KS:', KS(prob_dev,y))
print('AUC:', AUC(prob_dev,y))

prob_off = lr.predict_proba(offx)[:,1]

print('Out-of-time set')
print('F1:', F1(prob_off,offy))
print('KS:', KS(prob_off,offy))
print('AUC:', AUC(prob_off,offy))

Training set

F1: 0.30815569972196477

KS: 0.2819389063516508

AUC: 0.6908879633467695

Out-of-time set

F1: 0.2848354792560801

KS: 0.23181102640650808

AUC: 0.6522823050763138

Compute the model PSI and the feature PSI to assess stability from two angles

print('Model PSI:',toad.metrics.PSI(prob_dev,prob_off))
print('Feature PSI:','\n',toad.metrics.PSI(x,offx).sort_values(0))

Model PSI: 0.022260098554531284

Feature PSI:

Generate the model's KS report

off_bucket = toad.metrics.KS_bucket(prob_off,offy,bucket=10,method='quantile')
off_bucket
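The exact columns of the KS_bucket output are not reproduced in this post. As an illustration of what a quantile bucket report contains, the sketch below sorts samples by score, cuts them into equal-size buckets, and records each bucket's score range, bad rate and cumulative KS. The function and the data are hypothetical, and it assumes there are at least as many samples as buckets.

```python
def ks_buckets(scores, labels, n_buckets=5):
    """Equal-size buckets ordered by score, with bad rate and cumulative KS."""
    pairs = sorted(zip(scores, labels))
    total_bad = sum(labels)
    total_good = len(labels) - total_bad
    size = len(pairs) / n_buckets
    rows, cum_bad, cum_good = [], 0, 0
    for b in range(n_buckets):
        chunk = pairs[round(b * size):round((b + 1) * size)]
        bads = sum(label for _, label in chunk)
        cum_bad += bads
        cum_good += len(chunk) - bads
        rows.append({
            "min": chunk[0][0],
            "max": chunk[-1][0],
            "bad_rate": bads / len(chunk),
            "ks": abs(cum_bad / total_bad - cum_good / total_good),
        })
    return rows

# Made-up predicted probabilities and labels (1 = bad).
scores = [0.05, 0.15, 0.25, 0.35, 0.45, 0.55, 0.65, 0.75, 0.85, 0.95]
labels = [0, 0, 0, 0, 1, 0, 1, 0, 1, 1]
rows = ks_buckets(scores, labels, n_buckets=5)
for row in rows:
    print(row)
```

A useful report shows the bad rate rising steadily from the lowest-score bucket to the highest, and the model's overall KS is the largest value in the ks column.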

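Generate the scorecard. ScoreCard accepts all of the model's parameters, as well as the base score and pdo (points to double the odds) used for FICO-style score calibration. I have always called pdo the "step size"... orz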

pc.obj_info(toad.scorecard)

ObjInfo object of :
    Modules: ['np', 'pd', 're']

    Classes/objects: ['BaseEstimator', 'BinsMixin', 'Combiner', 'LogisticRegression', 'RulesMixin', 'ScoreCard', 'WOETransformer']

    Functions/methods: ['bin_by_splits', 'read_json', 'save_json', 'to_ndarray']

    Attributes: ['FACTOR_EMPTY', 'FACTOR_UNKNOWN', 'NUMBER_EMPTY', 'NUMBER_INF']

from toad.scorecard import ScoreCard
card = ScoreCard(combiner = combiner, transer  = t,class_weight = 'balanced',C=0.1,base_score = 600,base_odds = 35 ,pdo = 60,rate = 2)
card.fit(x,y)
final_card = card.export(to_frame = True)
final_card.head(8)
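The roles of base_score, base_odds, pdo and rate can be illustrated with the standard scorecard scaling formula. The sketch assumes that odds are good:bad and that the score rises with the odds; whether toad uses exactly this orientation should be checked against its documentation, and `make_scaler` is a hypothetical helper.

```python
import math

def make_scaler(base_score=600, base_odds=35, pdo=60, rate=2):
    """Return a function that maps a bad probability to a score."""
    factor = pdo / math.log(rate)                          # points per unit of log-odds
    offset = base_score - factor * math.log(base_odds)     # anchors base_odds at base_score

    def to_score(prob_bad):
        odds = (1 - prob_bad) / prob_bad  # good:bad odds (assumed orientation)
        return offset + factor * math.log(odds)

    return to_score

to_score = make_scaler()
print(to_score(1 / 36))  # odds 35:1, about 600
print(to_score(1 / 71))  # odds 70:1, about 660
```

With these settings a customer at 35:1 odds scores 600, and every doubling of the odds (rate = 2) adds pdo = 60 points.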
