Lending Club: Building a Loan Default Prediction Model

Reposted article (link to the original below). Posted 2019-04-03 11:26. Tags: LendingClub / model / default / risk control


https://blog.csdn.net/arsenal0435/article/details/80446829 (link to the original article)

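1. The problem this project addresses
This project applies machine learning to loan data from the P2P platform Lending Club to build a loan default prediction model. The model predicts whether a new loan applicant will default, which in turn informs the decision on whether to lend.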

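2. Modeling approach
The workflow for this project is: scenario analysis, data preprocessing, feature engineering, and model training, followed by evaluation on a test set.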

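3. Scenario analysis
When an applicant applies for a loan, the Lending Club platform has the customer fill out a loan application form, online or offline, to collect basic information. This includes the applicant's age, gender, marital status, education, loan amount, and assets. Typically the platform also draws on information from third parties such as credit bureaus or FICO. A predictive model is fitted on these attributes (this project uses logistic regression rather than linear regression, because the target is binary), and the platform uses its predictions to judge whether a loan application is likely to default and therefore whether to issue the loan.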

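1) First, the scenario is to train a model on users' historical behavior (the multidimensional features in the historical data, plus whether each loan's status is a default) and then use that model to assess whether a new applicant "has the ability to repay and the willingness to repay", predicting whether the applicant will default. This is a supervised learning scenario because both the features and the loan status (the target column) are known. Deciding whether an applicant defaults is a binary classification problem that a classification algorithm can handle; logistic regression (Logistic Regression) is chosen here.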

2) Inspecting the dataset shows that some of the data is semi-structured and requires feature abstraction.

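The business scenario can be summarized as follows:

Learn from historical records and predict whether a loan will default: a supervised learning scenario, using the logistic regression (Logistic Regression) algorithm.
The data is semi-structured and requires feature abstraction.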
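
As a rough illustration of how logistic regression turns applicant features into a default prediction, here is a minimal sketch. The weights, bias, and feature values are made up for illustration (they are not fitted on the Lending Club data), and `predict_default` is a hypothetical helper name.

```python
import math

def sigmoid(z: float) -> float:
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_default(features, weights, bias, threshold=0.5):
    """Return (probability, label) for one applicant; label 1 means predicted default."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    p = sigmoid(z)
    return p, int(p >= threshold)

# Hypothetical weights for two standardized features (illustrative values only)
weights = [0.8, -0.5]
bias = -1.0
prob, label = predict_default([1.2, 0.3], weights, bias)
print(prob, label)
```

Training a logistic regression amounts to choosing the weights and bias so that these probabilities match the observed loan outcomes as closely as possible.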

4. Data Preprocessing
The dataset comes from Lending Club Statistics, specifically the lending records on the Lending Club platform for the first quarter of 2018.
Data preview

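Check the proportion of missing values in each column

import numpy as np
import pandas as pd

# `data` is the raw Lending Club DataFrame (the loading step is not shown in the original)
check_null = data.isnull().sum().sort_values(ascending=False)/float(len(data))
print(check_null[check_null > 0.2]) # show attributes with more than 20% missing values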

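The output above shows that the attributes with many missing values in this dataset, such as id, member_id, and url, contribute little to the model's predictions, so we simply drop these uninformative, mostly missing attributes. Where missingness is itself meaningful for an attribute, we additionally need to distinguish whether the attribute is a numeric or a categorical variable.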

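thresh_count = int(len(data)*0.4) # threshold: minimum number of non-missing values a column must have
data = data.dropna(thresh=thresh_count, axis=1) # drop columns with fewer than thresh_count non-missing values (more than 60% missing)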
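
The `thresh` argument is easy to misread: it counts the non-missing values a column needs in order to be kept, not the number of missing values that triggers a drop. A small sketch on a made-up five-row frame (the column names are hypothetical) shows the behavior:

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical columns; not Lending Club data
toy = pd.DataFrame({
    "mostly_missing": [np.nan, np.nan, np.nan, np.nan, 1.0],  # 80% missing
    "half_missing": [1.0, np.nan, 3.0, np.nan, 5.0],          # 40% missing
    "complete": [1, 2, 3, 4, 5],
})

# Share of missing values per column, largest first
missing_ratio = toy.isnull().mean().sort_values(ascending=False)

# Keep only columns with at least 40% non-missing values
min_non_null = int(len(toy) * 0.4)
kept = toy.dropna(thresh=min_non_null, axis=1)

print(missing_ratio)
print(list(kept.columns))
```

Only the column with a single non-missing value falls below the threshold of 2 and is dropped.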
Then write the processed data to a CSV file

data.to_csv('loans_2018q1_ml.csv', index = False)
loans = pd.read_csv('loans_2018q1_ml.csv')
loans.dtypes.value_counts() # count columns by data type

loans.shape
(107866, 103)

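Constant-value handling
If nearly all observations of a variable take the same value, that feature (input variable) cannot help distinguish the target event. The code below removes the columns that contain only a single distinct value.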

loans = loans.loc[:,loans.apply(pd.Series.nunique) != 1]
loans.shape
(107866, 96)
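
To see what the `nunique` filter does, here is a minimal sketch on a made-up frame (the values are invented for illustration):

```python
import pandas as pd

# Toy frame; values are illustrative only
toy = pd.DataFrame({
    "policy_code": [1, 1, 1, 1],            # constant column, carries no information
    "loan_amnt": [1000, 2000, 1000, 5000],
    "term": ["36", "60", "36", "36"],
})

n_unique = toy.nunique()               # distinct non-missing values per column
filtered = toy.loc[:, n_unique != 1]   # keep columns that are not constant

print(n_unique.to_dict())
print(list(filtered.columns))
```

Note that `nunique` ignores missing values by default, so a column that is entirely missing has zero distinct values and is not removed by this filter.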

Missing-value handling: categorical variables
objectColumns = loans.select_dtypes(include=["object"]).columns
loans[objectColumns].isnull().sum().sort_values(ascending=False)

loans[objectColumns]

loans['int_rate'] = loans['int_rate'].str.rstrip('%').astype('float')
loans['revol_util'] = loans['revol_util'].str.rstrip('%').astype('float')
objectColumns = loans.select_dtypes(include=["object"]).columns
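We can use the missingno library to quickly assess how much data is missing.

import missingno as msno

msno.matrix(loans[objectColumns]) # visualize missing values

The plot shows at a glance that the variables "last_pymnt_d", "emp_title", and "emp_length" have many missing values.

Here we fill them with 'Unknown' for now.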

objectColumns = loans.select_dtypes(include=["object"]).columns
loans[objectColumns] = loans[objectColumns].fillna("Unknown")
Missing-value handling: numeric variables
numColumns = loans.select_dtypes(include=[np.number]).columns

pd.set_option('display.max_columns', len(numColumns))
loans[numColumns].tail()

loans.drop([107864, 107865], inplace =True)
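Here we use scikit-learn's SimpleImputer with strategy set to most_frequent, which fills missing values with each column's mode.
from sklearn.impute import SimpleImputer # replaces preprocessing.Imputer, which was removed in scikit-learn 0.22

imr = SimpleImputer(missing_values=np.nan, strategy='most_frequent') # imputes column by column
imr = imr.fit(loans[numColumns])
loans[numColumns] = imr.transform(loans[numColumns])
The missing values have now all been handled.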
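
Mode imputation itself is simple enough to show with plain pandas. The sketch below uses `DataFrame.mode` and `fillna` instead of the scikit-learn imputer, on a made-up frame with invented values:

```python
import numpy as np
import pandas as pd

# Toy numeric frame with gaps; values are illustrative only
toy = pd.DataFrame({
    "dti": [10.0, np.nan, 10.0, 25.0],
    "open_acc": [5.0, 7.0, np.nan, 7.0],
})

# mode() returns the most frequent value(s) per column; take the first row
modes = toy.mode().iloc[0]
filled = toy.fillna(modes)

print(modes.to_dict())
print(filled)
```

Mode imputation keeps every filled value inside the set of values already observed, but it can distort continuous columns where the mode is not representative; median imputation is a common alternative for those.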

Data filtering
print(objectColumns)

Drop the attributes listed above that are redundant or irrelevant to building the prediction model.

drop_list = ['sub_grade', 'emp_title', 'issue_d', 'title', 'zip_code', 'addr_state', 'earliest_cr_line',
'initial_list_status', 'last_pymnt_d', 'next_pymnt_d', 'last_credit_pull_d', 'disbursement_method']

loans.drop(drop_list, axis=1, inplace=True)
loans.select_dtypes(include = ['object']).shape
(107866, 8)

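5. Feature Engineering
Feature derivation
On the Lending Club platform, "installment" is the monthly installment amount of the loan. We divide 'annual_inc' by 12 to obtain the applicant's monthly income, then divide "installment" (monthly debt payment) by 'annual_inc'/12 (monthly income) to create a new feature, 'installment_feat'. It represents the share of monthly income that goes to loan repayment: the larger the value, the heavier the borrower's repayment burden and the more likely a default. The code adds 1 to 'annual_inc' to avoid division by zero.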
loans['installment_feat'] = loans['installment'] / ((loans['annual_inc']+1) / 12)
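
A worked example with made-up numbers may help make the ratio concrete. `installment_ratio` is a hypothetical helper that mirrors the formula above:

```python
def installment_ratio(installment: float, annual_inc: float) -> float:
    """Monthly payment divided by monthly income; +1 guards against zero income."""
    monthly_income = (annual_inc + 1) / 12
    return installment / monthly_income

# Illustrative applicant: 300 per month owed, 60,000 per year earned
ratio = installment_ratio(300.0, 60000.0)

# Edge case: reported income of zero no longer raises ZeroDivisionError
ratio_zero_income = installment_ratio(300.0, 0.0)

print(ratio, ratio_zero_income)
```

For the first applicant the ratio is close to 0.06, meaning roughly 6% of monthly income goes to the installment. The zero-income case produces a very large value rather than an error, which is worth keeping in mind before scaling.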
Feature Abstraction
def coding(col, codeDict):
    # return a copy of col with every key in codeDict replaced by its value
    colCoded = pd.Series(col, copy=True)
    for key, value in codeDict.items():
        colCoded.replace(key, value, inplace=True)
    return colCoded

# Encode loan_status as default = 1, normal = 0:

loans["loan_status"] = coding(loans["loan_status"], {'Current':0,'Issued':0,'Fully Paid':0,'In Grace Period':1,'Late (31-120 days)':1,'Late (16-30 days)':1,'Charged Off':1})

print('\nAfter Coding:')

loans["loan_status"].value_counts()

Loan status visualization

loans.select_dtypes(include=["object"]).head()

First, we abstract the variables "emp_length" and "grade" into numeric features.

# mapping for ordinal features
mapping_dict = {
    "emp_length": {
        "10+ years": 10,
        "9 years": 9,
        "8 years": 8,
        "7 years": 7,
        "6 years": 6,
        "5 years": 5,
        "4 years": 4,
        "3 years": 3,
        "2 years": 2,
        "1 year": 1,
        "< 1 year": 0,
        "Unknown": 0
    },
    "grade":{
        "A": 1,
        "B": 2,
        "C": 3,
        "D": 4,
        "E": 5,
        "F": 6,
        "G": 7
    }
}

loans = loans.replace(mapping_dict)
loans[['emp_length','grade']].head()

Next, one-hot encode the remaining features.

n_columns = ["home_ownership", "verification_status", "application_type","purpose", "term"]
dummy_df = pd.get_dummies(loans[n_columns]) # one-hot encode with get_dummies
loans = pd.concat([loans, dummy_df], axis=1) # with axis=1, concat aligns on the row index and joins the two tables' columns side by side
Then remove the original attributes.

loans = loans.drop(n_columns, axis=1)
loans.info()
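
The encode, concatenate, and drop steps can be seen end to end on a made-up frame (the values are invented for illustration):

```python
import pandas as pd

# Toy frame; values are illustrative only
toy = pd.DataFrame({
    "home_ownership": ["RENT", "OWN", "MORTGAGE", "RENT"],
    "loan_amnt": [1000, 2000, 3000, 4000],
})

# One indicator column per category, named <column>_<category>
dummies = pd.get_dummies(toy[["home_ownership"]])

# Attach the indicators and drop the original text column
encoded = pd.concat([toy.drop(columns=["home_ownership"]), dummies], axis=1)

print(list(encoded.columns))
```

Each row has exactly one indicator set, so the indicators for a variable always sum to 1. That makes one of them redundant, which is why some of these dummy columns can later be removed as redundant features; passing `drop_first=True` to `get_dummies` is another way to avoid the redundancy.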

All variables of type object have now been converted.

col = loans.select_dtypes(include=['int64','float64']).columns
col = col.drop('loan_status') # exclude the target variable

loans_ml_df = loans.copy() # copy the data into loans_ml_df
Feature Scaling
    We use standardization, via the StandardScaler class in scikit-learn's preprocessing module.
from sklearn.preprocessing import StandardScaler

sc = StandardScaler() # initialize the scaler
loans_ml_df[col] = sc.fit_transform(loans_ml_df[col]) # standardize the data
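
Standardization rescales each column to zero mean and unit variance. A minimal NumPy sketch of the same arithmetic, on made-up loan amounts:

```python
import numpy as np

# Illustrative values only
x = np.array([1000.0, 2000.0, 3000.0, 4000.0])

# z = (x - mean) / std, using the population standard deviation (ddof=0),
# which is the convention StandardScaler follows
z = (x - x.mean()) / x.std()

print(z)
print(z.mean(), z.std())
```

After scaling, features measured in very different units (loan amounts, ratios, counts) contribute on a comparable scale, which matters for regularized logistic regression.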
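Feature Selection
    Purpose: first, prioritize features that are strongly related to the target; second, removing irrelevant features makes the learning task easier.
# build the feature variables X and the target variable Y
x_feature = list(loans_ml_df.columns)
x_feature.remove('loan_status')
x_val = loans_ml_df[x_feature]
y_val = loans_ml_df['loan_status']
len(x_feature) # number of features in the initial set
103
    First, select the features most relevant to the target variable. We use a wrapper method here: recursive feature elimination (Recursive Feature Elimination, RFE), a brute-force approach that repeatedly fits the model and discards the weakest features until 30 remain. This is the first dimensionality reduction, taking the independent variables from 103 down to 30.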
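from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# build the logistic regression classifier
model = LogisticRegression()
# build the recursive feature elimination selector
rfe = RFE(model, n_features_to_select=30) # recursively select 30 features
rfe = rfe.fit(x_val, y_val)
# print the selection results
print(rfe.n_features_)
print(rfe.estimator_ )
print(rfe.support_)
print(rfe.ranking_) # a ranking of 1 means the feature was selected; any other value means it was not

col_filter = x_val.columns[rfe.support_] # use the boolean mask to keep the variables that survive the first reduction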
col_filter

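Filter

Building on the first reduction, we use a Pearson correlation heatmap to find redundant features and remove them; the heatmap can also guide which features to select next.

import matplotlib.pyplot as plt
import seaborn as sns

colormap = plt.cm.viridis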
plt.figure(figsize=(12,12))
plt.title('Pearson Correlation of Features', y=1.05, size=15)
sns.heatmap(loans_ml_df[col_filter].corr(),linewidths=0.1,vmax=1.0, square=True, cmap=colormap, linecolor='white', annot=True)

drop_col = ['funded_amnt', 'funded_amnt_inv', 'out_prncp', 'out_prncp_inv', 'total_pymnt_inv', 'total_rec_prncp',
'num_actv_rev_tl', 'num_rev_tl_bal_gt_0', 'home_ownership_RENT', 'application_type_Joint App',
'term_ 60 months', 'purpose_debt_consolidation', 'verification_status_Source Verified', 'home_ownership_OWN',
'verification_status_Verified',]
col_new = col_filter.drop(drop_col) # remove redundant features

len(col_new) # the feature subset shrinks from 30 variables to 15
15
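
The redundant features above were picked by reading the heatmap. As a sketch of how the same idea could be automated, the snippet below flags any feature whose absolute Pearson correlation with an earlier feature exceeds a cutoff. The 0.9 cutoff and the synthetic data are assumptions for illustration, not values from the original analysis.

```python
import numpy as np
import pandas as pd

# Synthetic data: funded_amnt is built as a near-duplicate of loan_amnt
rng = np.random.default_rng(0)
base = rng.normal(size=200)
toy = pd.DataFrame({
    "loan_amnt": base,
    "funded_amnt": base + rng.normal(scale=0.01, size=200),
    "dti": rng.normal(size=200),
})

corr = toy.corr().abs()

# Keep only the upper triangle so each pair is inspected once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Flag a column if it is highly correlated with any column before it
to_drop = [c for c in upper.columns if (upper[c] > 0.9).any()]
reduced = toy.drop(columns=to_drop)

print(to_drop)
print(list(reduced.columns))
```

Which member of a correlated pair to keep is still a judgment call; this sketch simply keeps whichever column comes first.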

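Embedded
Next we need a sound assessment and ranking of feature weights. Ranking features by importance reveals which variables matter most, which makes learning easier and ultimately helps optimize the model. Here we use a random forest to judge feature importance, implemented through the feature_importances_ attribute of scikit-learn's RandomForestClassifier.
from sklearn.ensemble import RandomForestClassifier

names = loans_ml_df[col_new].columns
clf = RandomForestClassifier(n_estimators=10, random_state=123) # build the random forest classifier
clf.fit(x_val[col_new], y_val) # fit the independent variables against the dependent variable
for feature in zip(names, clf.feature_importances_):
    print(feature)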

plt.style.use('ggplot')

## visualize feature importances ##
importances = clf.feature_importances_
feat_names = names
indices = np.argsort(importances)[::-1]
fig = plt.figure(figsize=(20,6))
plt.title("Feature importances by RandomForestClassifier")
plt.bar(range(len(indices)), importances[indices], color='lightblue', align="center")
plt.step(range(len(indices)), np.cumsum(importances[indices]), where='mid', label='Cumulative')
plt.xticks(range(len(indices)), feat_names[indices], rotation='vertical',fontsize=14)
plt.xlim([-1, len(indices)])
plt.show()

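# The figure below ranks the features by their relative importance within the feature subset; the importances are normalized so that they sum to 1.0.
# From the figure we can conclude that, according to the tree-based computation, the most discriminative feature in the subset is "total_pymnt".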

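6. Model Training
Handling class imbalance
As noted earlier, the two classes of the target variable "loan_status", normal and default, differ greatly in size, which makes learning harder for the model. We address the imbalance with oversampling, specifically SMOTE (Synthetic Minority Oversampling Technique). The basic idea of SMOTE is: using a nearest-neighbor algorithm, find the K nearest neighbors of each minority-class sample, randomly pick N of those neighbors, and interpolate linearly at a random point between the sample and each picked neighbor to construct new minority samples. The new samples are then combined with the original data to form a new training set.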

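from imblearn.over_sampling import SMOTE

# build the independent and dependent variables
X = loans_ml_df[col_new]
y = loans_ml_df["loan_status"]

n_sample = y.shape[0]
n_pos_sample = y[y == 0].shape[0] # "positive" here means normal loans (label 0)
n_neg_sample = y[y == 1].shape[0] # "negative" here means defaults (label 1)
print('Samples: {}; positive: {:.2%}; negative: {:.2%}'.format(n_sample,
                                                   n_pos_sample / n_sample,
                                                   n_neg_sample / n_sample))
print('Number of features:', X.shape[1])

# handle the imbalanced data
sm = SMOTE(random_state=42) # oversampling method
X, y = sm.fit_resample(X, y) # named fit_sample in older imbalanced-learn releases
print('After balancing the classes with SMOTE')
n_sample = y.shape[0]
n_pos_sample = y[y == 0].shape[0]
n_neg_sample = y[y == 1].shape[0]
print('Samples: {}; positive: {:.2%}; negative: {:.2%}'.format(n_sample,
                                                   n_pos_sample / n_sample,
                                                   n_neg_sample / n_sample))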
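
The interpolation step at the heart of SMOTE can be sketched in a few lines of NumPy. This is a simplified illustration of the idea described above, not the imbalanced-learn implementation; `smote_sketch` and its parameters are hypothetical names.

```python
import numpy as np

def smote_sketch(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between minority samples
    and their k nearest minority-class neighbors."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    n = len(X_min)
    k = min(k, n - 1)

    # Pairwise Euclidean distances; a point is never its own neighbor
    dist = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)
    neighbors = np.argsort(dist, axis=1)[:, :k]

    synthetic = np.empty((n_new, X_min.shape[1]))
    for i in range(n_new):
        base = rng.integers(n)                  # pick a minority sample
        nb = neighbors[base, rng.integers(k)]   # pick one of its neighbors
        gap = rng.random()                      # random position on the segment
        synthetic[i] = X_min[base] + gap * (X_min[nb] - X_min[base])
    return synthetic

# Four illustrative minority points at the corners of the unit square
X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
new_points = smote_sketch(X_min, n_new=6, k=2)
print(new_points)
```

Every synthetic point lies on a line segment between two real minority samples, so the new points stay inside the region the minority class already occupies.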

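Building and training the classifier
In this project the data is divided into three parts: a training set (training set), a validation set (validation set), and a test set (test set). The model learns on the training set, its parameters are tuned on the validation set, and the test set is used at the end to evaluate performance. In the code below, 30% of the data is held out as the test set, and the validation splits come from 10-fold cross-validation on the remaining 70%.

For tuning we use grid search (grid search): we define a set of candidate values for each parameter, grid search exhaustively tries every combination, and the best-scoring setting under the chosen metric is selected.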

from sklearn.model_selection import GridSearchCV, train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 0) # a fixed random_state gives the same split on every run
# build the parameter grid
param_grid = {'C': [0.01,0.1, 1, 10, 100, 1000,],
              'penalty': [ 'l1', 'l2']}
# C: Inverse of regularization strength; must be a positive float. Like in support vector machines, smaller values specify stronger regularization.

grid_search = GridSearchCV(LogisticRegression(solver='liblinear'), param_grid, cv=10) # liblinear supports both the l1 and l2 penalties; cv sets 10 folds
grid_search.fit(X_train, y_train) # fit on the training set

print("Best parameters: {}".format(grid_search.best_params_))
print("Best cross-validation score: {:.5f}".format(grid_search.best_score_))

print("Best estimator:\n{}".format(grid_search.best_estimator_)) # best_estimator_ returns the model with all of its parameters (including the best ones)
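
To get a feel for what "exhaustive" means here, the combinations in this grid can be enumerated directly with the standard library. This sketch only counts the candidates and fits; it does not train anything.

```python
from itertools import product

param_grid = {
    "C": [0.01, 0.1, 1, 10, 100, 1000],
    "penalty": ["l1", "l2"],
}

# Cartesian product of all candidate values, one dict per combination
names = list(param_grid)
candidates = [dict(zip(names, values)) for values in product(*param_grid.values())]

n_folds = 10
n_fits = len(candidates) * n_folds  # cross-validation fits, excluding the final refit

print(len(candidates), n_fits)
print(candidates[0], candidates[-1])
```

The cost grows multiplicatively with every parameter added to the grid, which is why grids are usually kept small or replaced by randomized search for larger spaces.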

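Now evaluate the trained and tuned model on the test set.

from sklearn.metrics import accuracy_score, classification_report, roc_auc_score

y_pred = grid_search.predict(X_test)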
print("Test set accuracy score: {:.5f}".format(accuracy_score(y_test, y_pred,)))
Test set accuracy score: 0.66064

print(classification_report(y_test, y_pred))

roc_auc = roc_auc_score(y_test, y_pred) # AUC computed here from hard 0/1 predictions
print("Area under the ROC curve : %f" % roc_auc)
Area under the ROC curve : 0.660654
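
A likely reason the accuracy and the AUC above are almost identical: when AUC is computed from hard 0/1 predictions instead of predicted probabilities, it reduces to the average of the true positive rate and the true negative rate, which on balanced classes is essentially the accuracy. The pure-Python sketch below illustrates this on made-up labels and scores; `roc_auc` is a hypothetical helper, not the scikit-learn function.

```python
def roc_auc(y_true, scores):
    """AUC as the probability that a random positive outranks a random negative
    (ties count as one half)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Illustrative labels and predicted default probabilities
y_true = [0, 0, 0, 1, 1, 1]
proba = [0.1, 0.4, 0.6, 0.45, 0.7, 0.9]
labels = [int(p >= 0.5) for p in proba]

auc_from_proba = roc_auc(y_true, proba)
auc_from_labels = roc_auc(y_true, labels)

# Balanced accuracy: mean of true positive rate and true negative rate
tpr = sum(1 for y, l in zip(y_true, labels) if y == 1 and l == 1) / y_true.count(1)
tnr = sum(1 for y, l in zip(y_true, labels) if y == 0 and l == 0) / y_true.count(0)

print(auc_from_proba, auc_from_labels, (tpr + tnr) / 2)
```

With scikit-learn, the ranking-based AUC would normally be computed from `grid_search.predict_proba(X_test)[:, 1]` rather than from `y_pred`.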

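Summary
The final result is not very satisfactory. In practice, feature binning is also needed, along with computing IV values and WOE encoding. The model evaluation has shortcomings too. These lessons provide useful experience for future work.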
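
For readers unfamiliar with the WOE and IV steps mentioned above, here is a minimal sketch under one common convention: WOE is the log of the share of good loans over the share of bad loans in each bin, and IV is the sum over bins of (good share minus bad share) times WOE. Some references flip the ratio. The bin counts are made up, and `woe_iv` is a hypothetical helper.

```python
import math

def woe_iv(bins):
    """bins: list of (n_good, n_bad) counts per bin.
    Returns (list of WOE values, total IV). Assumes no count is zero."""
    total_good = sum(g for g, _ in bins)
    total_bad = sum(b for _, b in bins)
    woes, iv = [], 0.0
    for g, b in bins:
        good_share = g / total_good
        bad_share = b / total_bad
        woe = math.log(good_share / bad_share)
        woes.append(woe)
        iv += (good_share - bad_share) * woe
    return woes, iv

# Illustrative counts for one feature split into three bins
woes, iv = woe_iv([(400, 20), (350, 30), (250, 50)])

# A feature whose bins all have the same good/bad mix carries no information
_, iv_flat = woe_iv([(100, 10), (200, 20)])

print(woes, iv, iv_flat)
```

Bins with zero goods or zero bads need smoothing or merging before this calculation, since the logarithm is undefined there.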
