Main text

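After building a credit score model, we usually need to compute the following metrics:
    ●      Discrimination metrics:
        ○      AUC
        ○      KS
        ○      GINI
    ●      Stability metric:
        ○      PSI
    ●      Score distribution:
        ○      Share of total users
        ○      Bad-user rate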

1. Generating the sample

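Note that the data is synthetic, not real. Because label and score are drawn independently, the discrimination metrics computed below will sit close to their no-skill values (AUC near 0.5).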

import numpy as np
import pandas as pd

n_sample = 1000

# Build synthetic data with 4 main fields: user_id, label (1 = bad user), score, term
df_score = pd.DataFrame({
    'user_id': [u for u in range(n_sample)],
    'label':np.random.randint(2, size=n_sample),
    'score': 900*np.random.random(size=n_sample),
    'term': 20201+np.random.randint(5, size=n_sample)
})

Count the total users, bad users, and bad rate per term:

# Per term: total users, bad users, and bad rate
df_score.groupby('term').agg(total=('label', 'count'), 
                             bad=('label', 'sum'), 
                             bad_rate=('label', 'mean'))

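Note the named-aggregation form of agg after groupby: each keyword is the name of an output column, and its value is a (source column, aggregation) tuple.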
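To make the named-aggregation form concrete, here is a minimal sketch on a five-row frame. The frame df_demo and its values are made up for this illustration; only the agg call mirrors the one used on df_score.

```python
import pandas as pd

# Tiny hand-made frame (illustrative values only, not the article's data)
df_demo = pd.DataFrame({
    'term':  [20201, 20201, 20201, 20202, 20202],
    'label': [1, 0, 0, 1, 1],
})

# Named aggregation: output_column=(input_column, aggregation)
summary = df_demo.groupby('term').agg(
    total=('label', 'count'),    # number of users in the term
    bad=('label', 'sum'),        # label == 1 marks a bad user
    bad_rate=('label', 'mean'),  # equals bad / total for a 0/1 label
)
print(summary)
```

Because label is 0/1, its sum is the number of bad users and its mean is the bad rate, which is why one column can feed all three outputs.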

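2. Computing AUC, KS, and GINI

The AUC from sklearn is adjusted slightly here: if AUC < 0.5, the function returns 1 - AUC. This ignores the direction of the score, that is, whether a higher score means more risk or less.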

#KS,GINI,AUC

from sklearn.metrics import roc_auc_score, roc_curve

#auc
def get_auc(ytrue, yprob):
    auc = roc_auc_score(ytrue, yprob)
    if auc < 0.5:
        auc = 1 - auc
    return auc

#ks
def get_ks(ytrue, yprob):
    fpr, tpr, thr = roc_curve(ytrue, yprob)
    ks = max(abs(tpr - fpr))
    return ks

# gini = 2 * auc - 1  (for example, an AUC of 0.80 gives a Gini of 0.60)
def get_gini(ytrue, yprob):
    auc = get_auc(ytrue, yprob)
    gini = 2 * auc - 1
    return gini
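As a sanity check on what these three functions measure, the sketch below recomputes the same quantities by hand in plain Python on eight made-up users. The helper names pairwise_auc and ks_statistic are hypothetical, and pair counting and threshold scanning are teaching shortcuts, not how sklearn computes the values internally.

```python
def pairwise_auc(labels, scores):
    """Share of (bad, good) pairs where the bad user has the higher score; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def ks_statistic(labels, scores):
    """Largest |TPR - FPR| over all thresholds of the form score >= thr."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    best = 0.0
    for thr in sorted(set(scores)):
        tpr = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= thr) / n_pos
        fpr = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= thr) / n_neg
        best = max(best, abs(tpr - fpr))
    return best


# label 1 = bad user; in this made-up sample bad users mostly hold the low scores
labels = [1, 1, 1, 0, 0, 0, 0, 1]
scores = [300, 420, 510, 480, 650, 700, 820, 560]

raw_auc = pairwise_auc(labels, scores)
folded_auc = max(raw_auc, 1 - raw_auc)  # same effect as the "if auc < 0.5" branch
gini = 2 * folded_auc - 1
ks = ks_statistic(labels, scores)
print(raw_auc, folded_auc, gini, ks)
```

On this sample the raw AUC is 0.125, because only 2 of the 16 (bad, good) pairs have the bad user scoring higher. Folding turns it into 0.875, so Gini = 2 × 0.875 - 1 = 0.75, and KS is also 0.75.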

# Compute AUC, KS, and GINI per term: score acts as the prediction and label as the ground truth, so sklearn can be used directly
df_metrics = pd.DataFrame({
    'auc': df_score.groupby('term').apply(lambda x: get_auc(x['label'], x['score'])),
    'ks': df_score.groupby('term').apply(lambda x: get_ks(x['label'], x['score'])),
    'gini': df_score.groupby('term').apply(lambda x: get_gini(x['label'], x['score']))
})

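The result is a DataFrame of these metrics, indexed by term.

Note how groupby.apply is used here: the lambda receives the sub-DataFrame of each term and returns a scalar, so each call yields a Series indexed by term.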
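The three separate groupby.apply calls can also be folded into one pass, because apply stacks a returned pd.Series into columns. The sketch below shows that pattern on made-up data. The helpers rank_auc, ks_no_ties, and term_metrics are hypothetical pure-pandas stand-ins for the sklearn-based functions, and ks_no_ties assumes there are no tied scores.

```python
import pandas as pd


def rank_auc(labels, scores):
    """AUC from the rank-sum identity (ties receive average ranks)."""
    n_pos = int((labels == 1).sum())
    n_neg = len(labels) - n_pos
    rank_sum = scores.rank()[labels == 1].sum()
    return (rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def ks_no_ties(labels, scores):
    """Max gap between cumulative bad and good shares; assumes no tied scores."""
    y = labels.loc[scores.sort_values(ascending=False).index]
    tpr = (y == 1).cumsum() / (y == 1).sum()
    fpr = (y == 0).cumsum() / (y == 0).sum()
    return (tpr - fpr).abs().max()


def term_metrics(group):
    """Return every metric for one term as a Series; apply turns the keys into columns."""
    auc = rank_auc(group['label'], group['score'])
    auc = max(auc, 1 - auc)  # direction-agnostic fold, as in get_auc
    return pd.Series({'auc': auc,
                      'ks': ks_no_ties(group['label'], group['score']),
                      'gini': 2 * auc - 1})


# Made-up data: two terms with four users each
df_demo = pd.DataFrame({
    'term':  [20201] * 4 + [20202] * 4,
    'label': [1, 1, 0, 0, 1, 0, 1, 0],
    'score': [300, 400, 600, 700, 300, 400, 500, 600],
})

# Selecting the columns first keeps the grouping key out of the applied frame
df_demo_metrics = df_demo.groupby('term')[['label', 'score']].apply(term_metrics)
print(df_demo_metrics)
```

In the first term the score separates good and bad users perfectly, so all three metrics are 1. In the second term they drop to 0.75 (AUC), 0.5 (KS), and 0.5 (Gini).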

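3. PSI: model stability

This takes two steps:

First, bucket the randomly generated credit scores into fixed score intervals.
Then, compute PSI on the buckets: use pivot_table to lay the data out by term and compute the share of users in each bucket for every term.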
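One detail of the bucketing step is worth checking. By default pd.cut builds right-closed intervals, so a score of exactly 0 falls outside (0, 500] and becomes NaN. With scores drawn from 900*np.random.random this is practically never hit, but real score tables can contain boundary values. A minimal sketch with made-up scores:

```python
import pandas as pd

bins = [0, 500, 700, 800, 900]
scores = pd.Series([0.0, 350.0, 500.0, 500.1, 899.9])  # illustrative values only

# Default: intervals are (left, right], so 0.0 is not covered by any bin
default_bins = pd.cut(scores, bins)

# include_lowest=True makes the first interval also take the left edge
closed_bins = pd.cut(scores, bins, include_lowest=True)

print(default_bins.tolist())
print(closed_bins.tolist())
```

A score of exactly 500 lands in the first bucket and 500.1 in the second, which matters when the cut points are business thresholds.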

# PSI measures stability: a small PSI means two distributions (e.g. training vs. test set, or two consecutive terms) differ little

df_score['score_bin'] = pd.cut(df_score['score'], [0, 500, 700, 800, 900])

df_total = pd.pivot_table(df_score, 
                          values='user_id', 
                          index='score_bin', 
                          columns=['term'], 
                          aggfunc="count", 
                          margins=True)
df_ratio = df_total.div(df_total.iloc[-1, :], axis=1)

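The pivot table holds the number of users per score bucket (rows) and term (columns), plus an All margin row and column.

The argument passed to div is the last row of that pivot table (the All row), so every column is divided by its own total. The result is the share of users in each bucket for every term.

Next, compute PSI from these shares and put it back into the table: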

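eps = np.finfo(np.float32).eps  # small constant so an empty bucket cannot cause division by zero or log(0)
lst_psi = list()
# The first term has no previous term to compare with, so start at column 1;
# the last column is the 'All' margin, so stop before it
for idx in range(1, len(df_ratio.columns)-1):
    # Rows 0:-1 are the score buckets (the last row is the 'All' margin)
    last, cur = df_ratio.iloc[0:-1, idx-1]+eps, df_ratio.iloc[0:-1, idx]+eps
    psi = ((cur-last) * np.log(cur / last)).sum()
    lst_psi.append(psi)
# DataFrame.append was removed in pandas 2.0, so add the PSI row with pd.concat
psi_row = pd.Series([np.nan]+lst_psi+[np.nan],
                    index=df_ratio.columns,
                    name='psi')
df_psi = pd.concat([df_ratio, psi_row.to_frame().T])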

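Each PSI value is computed as:

sum((cur-last) * np.log(cur / last)), where last is the bucket distribution of the previous term (the baseline) and cur is that of the current term. The expression is symmetric in cur and last, so swapping the two does not change the value.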
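To check the formula by hand, here is a minimal plain-Python sketch on two made-up three-bucket distributions. The function name psi, its eps default, and the share values are assumptions for this illustration only.

```python
import math


def psi(last, cur, eps=1e-7):
    """PSI between two lists of bucket shares; eps guards against empty buckets."""
    total = 0.0
    for p_last, p_cur in zip(last, cur):
        p_last, p_cur = p_last + eps, p_cur + eps
        total += (p_cur - p_last) * math.log(p_cur / p_last)
    return total


last_share = [0.5, 0.3, 0.2]  # previous term (baseline), made-up shares
cur_share = [0.4, 0.3, 0.3]   # current term, made-up shares

value = psi(last_share, cur_share)
print(round(value, 4))
```

The middle bucket contributes nothing because its share did not move. The other two contribute 0.1 × ln(0.5 / 0.4) and 0.1 × ln(0.3 / 0.2), for a total of about 0.063. A commonly quoted rule of thumb treats PSI below 0.1 as stable and above 0.25 as a significant shift, but the cut-offs are conventions rather than part of the formula.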

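4. Score distribution

Here we compute the distribution of all users and the bad-user rate across score buckets. The user distribution was already obtained while computing PSI: it is the df_ratio above.

So we follow the same recipe on the bad users alone, then divide by the totals to get the bad-user rate.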

df_total = pd.pivot_table(df_score, 
                          values='user_id', 
                          index='score_bin', 
                          columns=['term'], 
                          aggfunc="count", 
                          margins=True)
df_ratio = df_total.div(df_total.iloc[-1, :], axis=1)

df_bad = pd.pivot_table(df_score[df_score['label']==1], 
                        values='user_id', 
                        index='score_bin', 
                        columns=['term'], 
                        aggfunc="count", 
                        margins=True)
df_bad_rate = df_bad/df_total
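One caveat with dividing two pivot tables: a bucket that has no bad users in some term is missing or empty in df_bad, so the division gives NaN there instead of 0. The sketch below, on made-up data with hypothetical string buckets 'low' and 'high' standing in for score_bin, compares that route with taking the mean of the 0/1 label, which yields the bad rate directly.

```python
import pandas as pd

# Made-up data: the 'high' bucket has no bad users at all
df_demo = pd.DataFrame({
    'term':      [20201, 20201, 20201, 20201, 20202, 20202, 20202, 20202],
    'score_bin': ['low', 'low', 'high', 'high', 'low', 'low', 'high', 'high'],
    'label':     [1, 0, 0, 0, 1, 1, 0, 0],
})

# Route 1: count of bad users divided by count of all users
total = pd.pivot_table(df_demo, values='label', index='score_bin',
                       columns='term', aggfunc='count')
bad = pd.pivot_table(df_demo[df_demo['label'] == 1], values='label',
                     index='score_bin', columns='term', aggfunc='count')
rate_raw = bad / total                 # 'high' row is NaN: no bad users to count
rate_by_division = rate_raw.fillna(0)  # a bucket without bad users has rate 0

# Route 2: the mean of a 0/1 label is the bad rate
rate_by_mean = pd.pivot_table(df_demo, values='label', index='score_bin',
                              columns='term', aggfunc='mean')

print(rate_by_division)
print(rate_by_mean)
```

Both routes agree once the NaN cells are filled with 0. The mean-based route needs one pivot table instead of two.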

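The distribution of all users can be drawn as a stacked bar chart and the bad-user rate as a line chart, using pandas plotting with a seaborn palette as the colormap.

# Plotting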

import seaborn as sns
import matplotlib.pyplot as plt

colormap = sns.diverging_palette(130, 20, as_cmap=True)
df_ratio.drop('All').T.plot(kind='bar', stacked=True, colormap=colormap)
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)

colormap = sns.diverging_palette(130, 20, as_cmap=True)
df_bad_rate.drop('All').T.plot(kind='line', colormap=colormap)
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)



Six core metrics for risk-control models (with code)
