Credit Card Scoring Model (Part 5)

Reposted article, dated 2021-11-29 17:21. Tag: kaggle

I have recently been exploring XGBoost hyperparameter tuning, and a few problems have come up:

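1. Tuning approach. There are many tuning recipes online, but nearly all of them tune one parameter at a time. That is a greedy algorithm and only reaches a local optimum: the parameters interact, so combining the per-parameter optima does not necessarily give the best overall setting.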

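2. I usually fix a few parameters at set values, for example tree depth = 3, minimum leaf size = 5% of the samples, and scale_pos_weight chosen from the good/bad ratio, and then use train to find the best number of trees. This is also time-consuming, is not automated, and overfits easily.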

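3. What "best" means. A model is judged first on stability: the KS values on the training set, the test set and the OOT (out-of-time) set must not differ by more than 4% or 5%. In addition, the OOT KS must not fall far below the best achievable value (for example, after trying several modelling approaches on this data, we found the best OOT result to be around 0.35). The goal is to tune within these constraints and to automate the process.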
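Points 1 and 3 above can be combined into a small sketch: sample whole parameter combinations jointly instead of one parameter at a time, and keep only the combinations that pass the stability rule. This is only an illustration under stated assumptions. The names `is_acceptable`, `random_search` and `fake_evaluate` are hypothetical, the `max_drop=0.03` tolerance is an assumed value (the text only says "not far from the best"), and `fake_evaluate` is a placeholder for "train XGBoost and compute KS on each set".

```python
import random

def is_acceptable(ks, max_gap=0.05, best_oot=0.35, max_drop=0.03):
    """ks: dict holding the 'train', 'test' and 'oot' KS values.

    max_gap and best_oot follow the text; max_drop is an assumed tolerance.
    """
    values = [ks['train'], ks['test'], ks['oot']]
    stable = max(values) - min(values) <= max_gap      # KS spread across the sets
    close_to_best = best_oot - ks['oot'] <= max_drop   # OOT KS near the known best
    return stable and close_to_best

def random_search(evaluate, space, n_trials=20, seed=0, **rule):
    """Draw parameter combinations jointly and keep those that pass the rule."""
    rng = random.Random(seed)
    accepted = []
    for _ in range(n_trials):
        params = {name: rng.choice(choices) for name, choices in space.items()}
        ks = evaluate(params)
        if is_acceptable(ks, **rule):
            accepted.append((ks['oot'], params, ks))
    accepted.sort(key=lambda item: item[0], reverse=True)  # best OOT KS first
    return accepted

def fake_evaluate(params):
    # Placeholder: pretends that deeper trees widen the train/OOT gap.
    gap = 0.01 * params['max_depth']
    return {'train': 0.34 + gap, 'test': 0.34, 'oot': 0.335}

space = {'max_depth': [2, 3, 4, 8], 'eta': [0.05, 0.1, 0.2]}
results = random_search(fake_evaluate, space, n_trials=30, seed=1)
```

In real use, `evaluate` would train a model with the sampled parameters and return the three KS values. Random search is just one way to search jointly; grid search or Bayesian optimisation would slot into the same loop.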

As before, the data is the give-me-some-credit dataset: https://www.kaggle.com/brycecf/give-me-some-credit-dataset?select=cs-test.csv

Apart from filling missing values with -9999, the data is left essentially unprocessed.

#%% import modules
import pandas as pd
import numpy as np
from scipy import stats
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
plt.rc("font",family="SimHei",size="12")  # work around Chinese characters not rendering in plots

#%% load data
train=pd.read_csv('cs-training.csv')

train.shape  #(150000, 12)
train.pop('Unnamed: 0')
train.columns

train = train.fillna(-9999)

from sklearn.model_selection import train_test_split

import warnings
warnings.filterwarnings("ignore")

# OOT set (here a random, stratified 30% hold-out)
from sklearn.model_selection import train_test_split
train_x, oot_x, train_y, oot_y = train_test_split(train.drop(columns='SeriousDlqin2yrs'), \
                                                    train.SeriousDlqin2yrs, test_size=0.3, random_state=2021,stratify=train.SeriousDlqin2yrs)


# training set and test set
from sklearn.model_selection import train_test_split
train_x, test_x, train_y, test_y = train_test_split(train_x, \
                                                    train_y, test_size=0.3, random_state=2021,stratify=train_y)
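The two successive 30% splits above leave three sets of different sizes. The short sketch below works out the arithmetic. `split_sizes` is a hypothetical helper, and it assumes the held-out portion is rounded up at each step.

```python
import math

def split_sizes(n_rows, oot_frac=0.3, test_frac=0.3):
    """Row counts after holding out OOT first, then a test set from the rest."""
    n_oot = math.ceil(n_rows * oot_frac)   # first split: OOT hold-out
    n_dev = n_rows - n_oot                 # rows left for train + test
    n_test = math.ceil(n_dev * test_frac)  # second split: test set
    n_train = n_dev - n_test
    return {'train': n_train, 'test': n_test, 'oot': n_oot}

sizes = split_sizes(150000)
shares = {name: n / 150000 for name, n in sizes.items()}
print(sizes, shares)
```

For the 150,000 rows here this gives 73,500 training rows, 31,500 test rows and 45,000 OOT rows, or roughly 49% / 21% / 30% of the data.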

import xgboost as xgb
from xgboost import plot_importance
d_train = xgb.DMatrix(train_x, label=train_y, feature_names=list(train_x.columns))
d_valid = xgb.DMatrix(test_x, label=test_y, feature_names=list(train_x.columns))
watchlist = [(d_train,'train'),(d_valid,'valid')]
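# parameter settings (before tuning)
params={
        'eta':0.2,                        # learning rate (step-size shrinkage), range 0~1; usually ends up at 0.01~0.2
        'max_depth':3,                    # tree depth, usually 3-10; too deep overfits, too shallow underfits
        'min_child_weight':230,           # minimum sum of instance weights in a child; larger values help prevent overfitting
        'gamma':0.4,                      # minimum loss reduction needed to split; larger is more conservative, typically around 0.1-0.2
        'subsample':0.8,                  # row sampling ratio
        'colsample_bytree':0.8 ,          # default 1, range 0~1; feature sampling ratio per tree
        'lambda':0.8,                     # L2 regularization
        'alpha':0.6,                      # L1 regularization
        'n_estimators':500,               # not a native xgb.train parameter; the number of rounds comes from the xgb.train call
        'booster':'gbtree',               # tree-based booster
        'objective':'binary:logistic',    # logistic regression for binary classification; outputs a probability
        'nthread':6,                      # number of threads; all available ones are used if unset
        'scale_pos_weight':10,            # default 1; values above 1 up-weight the positive class to handle class imbalance
        'seed':1234,                      # random seed
        'verbosity':0,                    # 0 = silent; newer XGBoost uses this in place of the deprecated 'silent' flag
        'eval_metric':'auc'               # evaluation metric
}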
bst = xgb.train(params, d_train,1000,watchlist,early_stopping_rounds=100, verbose_eval=5)   # at most 1000 boosting rounds

tree_nums = bst.best_iteration + 1   # best_iteration is 0-based; newer XGBoost releases no longer provide best_ntree_limit
print('best number of trees: %s, best iteration: %s, auc: %s' %(tree_nums,bst.best_iteration,bst.best_score))

bst = xgb.train(params, d_train,tree_nums,watchlist,early_stopping_rounds=1000, verbose_eval=10) # retrain with the best number of rounds
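The parameters above set scale_pos_weight by hand. A commonly suggested starting point is the ratio of negative to positive samples in the training labels. The sketch below computes that ratio. `suggested_scale_pos_weight` is a hypothetical helper, not part of XGBoost, and the result is only a starting value to be checked against the KS criteria.

```python
def suggested_scale_pos_weight(labels):
    """Ratio of negatives to positives, a common starting heuristic."""
    labels = list(labels)
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    if n_pos == 0:
        raise ValueError("labels contain no positive samples")
    return n_neg / n_pos

# Toy labels: 90 good accounts and 10 bad ones.
toy_labels = [0] * 90 + [1] * 10
print(suggested_scale_pos_weight(toy_labels))
```

With the data in this article, the call would be `suggested_scale_pos_weight(train_y)`.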

Prediction

train_p =  bst.predict(xgb.DMatrix(train_x))
test_p =  bst.predict(xgb.DMatrix(test_x))
oot_p = bst.predict(xgb.DMatrix(oot_x))

Plotting the ROC curve with KS

from sklearn.metrics import roc_curve,auc
def plot_roc(p1, p,string):
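    '''
    Purpose: plot the ROC curve of a binary classifier and report its AUC and KS.
    Arguments:
    p1: true labels (0/1)
    p: predicted probabilities of the positive class
    string: title of the figure
    Returns: the matplotlib figure containing the ROC curve
    '''
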
    fpr, tpr, p_threshold = roc_curve(p1, p,
                                              drop_intermediate=False,
                                              pos_label=1)
    df = pd.DataFrame({'fpr': fpr, 'tpr': tpr, 'p': p_threshold})
    df.loc[0, 'p'] = max(p)

    ks = (df['tpr'] - df['fpr']).max()
    roc_auc = auc(fpr, tpr)

    fig = plt.figure(figsize=(2.8, 2.8), dpi=140)
    ax = fig.add_subplot(111)

    ax.plot(fpr, tpr, color='darkorange', lw=2,
            label='ROC curve\nAUC = %0.4f\nK-S = %0.4f' % (roc_auc, ks)
            )
    ax.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')

    ax.set_xlim([0.0, 1.0])
    ax.set_ylim([0.0, 1.05])
    ax.set_xlabel('False Positive Rate')
    ax.set_ylabel('True Positive Rate')
    ax.set_title(string)
    ax.legend(loc="lower right")
    plt.close()
    return fig
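The KS value reported by plot_roc is the largest gap between the true positive rate and the false positive rate over all score thresholds, KS = max(TPR - FPR). As a cross-check that needs no plotting, the same quantity can be computed directly. `ks_score` below is a hypothetical helper written in plain Python, a sketch rather than a replacement for the function above.

```python
def ks_score(y_true, y_prob):
    """KS statistic: max over thresholds of (TPR - FPR)."""
    pairs = sorted(zip(y_prob, y_true), key=lambda t: t[0], reverse=True)
    n_pos = sum(1 for _, y in pairs if y == 1)
    n_neg = len(pairs) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("both classes must be present")
    tp = fp = 0
    best = 0.0
    i = 0
    while i < len(pairs):
        score = pairs[i][0]
        # Consume every sample tied at this score before measuring the gap,
        # so that ties are treated as a single threshold.
        while i < len(pairs) and pairs[i][0] == score:
            if pairs[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        best = max(best, tp / n_pos - fp / n_neg)
    return best

print(ks_score([0, 0, 1, 1], [0.1, 0.2, 0.8, 0.9]))
```

Calling `ks_score(oot_y, oot_p)` should agree with the K-S figure shown in the plot legend, up to rounding.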

Training set

plot_roc(train_y, train_p,'Training set ROC Curve')  # training set

Test set

plot_roc(test_y, test_p,'Test set ROC Curve')

OOT

plot_roc(oot_y, oot_p,'OOT set ROC Curve')

A model like this is ready to deploy: the KS values differ by only about 2%.
