This series contains my notes for the DataWhale study group, written from the perspective of a beginner with a weak foundation in statistics and machine learning theory. It summarizes and organizes the group's learning materials and other online resources, and may be revised as my understanding improves. Given my limited level, mistakes are inevitable, and readers are very welcome to point them out. Please contact me before reposting. Thank you.
Core Competition Approach
Understanding the Problem
- Competition objective and data
The goal of this competition is a typical classification problem: using more than 1.2 million loan records provided by a lending platform, build a model that predicts whether a new, unseen user will default. The training set train.csv contains 800,000 records with 47 columns of information related to the users' loan credit, 15 of which are anonymized features; some values may be erroneous or missing and therefore require analysis and preprocessing. One of the columns, isDefault, is the label indicating whether the user defaulted (1 for default, 0 otherwise). The test set testA.csv, with 200,000 records, is used for offline validation; apart from lacking the isDefault label, its features are similar to those of the training set. The 47 feature fields and their meanings are shown in the figure below:
The data format of the training-set features and of the final submission is shown in the figure below (note that the submission consists of the ids of the users to be predicted and the corresponding default probabilities between 0 and 1):
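Since the figure is not reproduced here, the small sketch below illustrates the expected submission layout. The column names id and isDefault match the sample submission used later in this post; the probability values are made up purely for illustration.
import pandas as pd
# Hypothetical illustration of the submission format: one row per test-set user,
# with the predicted probability of default (a value between 0 and 1).
submission = pd.DataFrame({
    'id': [800000, 800001, 800002],      # ids of the users to be predicted (made-up values)
    'isDefault': [0.12, 0.87, 0.05],     # made-up probabilities, for illustration only
})
submission.to_csv('sample_sub.csv', index=False)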
- Common evaluation metrics in financial risk prediction
- The ROC (Receiver Operating Characteristic) curve is commonly used to evaluate classification models. In the earlier news text classification competition we introduced the basic quantities for evaluating a classifier: depending on the combination of a sample's true class and the class predicted by the model, it falls into one of four cases, true positive (TP, a positive sample correctly predicted as positive), false positive (FP, a negative sample wrongly predicted as positive), true negative (TN, a negative sample correctly predicted as negative), and false negative (FN, a positive sample wrongly predicted as negative). Here we additionally need the false positive rate (FPR) and true positive rate (TPR), computed as FPR = FP / (FP + TN) and TPR = TP / (TP + FN).
The ROC curve lives in a two-dimensional space with FPR on the x-axis and TPR on the y-axis (both axes range from 0 to 1). To draw it, we first choose a series of decision thresholds from high to low for turning the model's outputs into class labels (since the model outputs a probability between 0 and 1 for each test sample, a threshold is needed to decide whether the label is 0 or 1: a sample whose predicted probability is above the threshold is assigned to class 1, otherwise to class 0). Each threshold yields a different assignment of predicted labels and hence one pair of FPR and TPR values. Plotting the (FPR, TPR) points obtained at the different thresholds and connecting them gives the ROC curve.
# ROC curve example
from sklearn.metrics import roc_curve
import matplotlib.pyplot as plt
import numpy as np
y_pred = np.random.rand(10)
y_true = np.array([0, 1, 1, 0, 1, 0, 1, 1, 0, 1])
FPR,TPR,thresholds=roc_curve(y_true, y_pred, pos_label=1)
plt.title('ROC')
plt.plot(FPR, TPR,'b')
plt.plot([0,1],[0,1],'r--')
plt.ylabel('TPR')
plt.xlabel('FPR')
plt.show()
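To make the connection to the definitions above explicit, here is a minimal sketch (reusing the same made-up y_true and y_pred arrays from the example above) that computes TP/FP/TN/FN and the resulting TPR/FPR at a single, arbitrarily chosen threshold of 0.5; sweeping this threshold over a grid of values is exactly what roc_curve does for us.
# Minimal sketch: compute TPR and FPR by hand at one threshold (0.5 is an arbitrary choice)
threshold = 0.5
y_label = (y_pred >= threshold).astype(int)   # samples above the threshold are predicted as class 1
TP = np.sum((y_label == 1) & (y_true == 1))
FP = np.sum((y_label == 1) & (y_true == 0))
TN = np.sum((y_label == 0) & (y_true == 0))
FN = np.sum((y_label == 0) & (y_true == 1))
print('TPR =', TP / (TP + FN), 'FPR =', FP / (FP + TN))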
- AUC (Area Under Curve) is the area under the ROC curve, i.e. the area enclosed in the ROC plot by the ROC curve and the x-axis. Since the ROC curve normally lies above the diagonal y = x, AUC takes values between 0.5 and 1 (the full square spanned by the FPR and TPR axes has area 1). The closer AUC is to 1, the more accurate the model's predictions are considered to be, and vice versa, as shown in the figure below:
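In code, AUC does not need to be integrated by hand; scikit-learn provides roc_auc_score. A minimal sketch, reusing the y_true and y_pred arrays from the ROC example above:
from sklearn.metrics import roc_auc_score
AUC = roc_auc_score(y_true, y_pred)   # equivalent to the area under the ROC curve plotted above
print('AUC =', AUC)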
- The KS (Kolmogorov-Smirnov) curve also uses TPR and FPR, but plots both on the y-axis, with the x-axis being n thresholds evenly spaced between 0 and 1. Each threshold yields one TPR value and one FPR value, so TPR and FPR can each be drawn as a curve over the thresholds. The TPR curve usually lies above the FPR curve, and the maximum gap between the two is the KS value. As a rule of thumb, KS > 20% indicates a reasonably good model, while KS > 75% means the separation is suspiciously strong and the model may have a problem.
# KS value computation example
from sklearn.metrics import roc_curve
import numpy as np
y_pred = np.random.rand(10)
y_true = np.array([0, 1, 1, 0, 1, 0, 1, 1, 0, 1])
FPR,TPR,thresholds=roc_curve(y_true, y_pred)
KS=abs(FPR-TPR).max()
print('KS value =', KS)
'''
>> KS value = 0.5833333333333334
'''
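The snippet above only computes the KS statistic itself. A minimal sketch of the KS curve described earlier (TPR and FPR both plotted against the decision threshold, with KS being the largest vertical gap), reusing the FPR, TPR and thresholds arrays returned by the roc_curve call above:
import matplotlib.pyplot as plt
# thresholds returned by roc_curve are in decreasing order and the first entry is a sentinel above 1, so it is dropped
plt.plot(thresholds[1:], TPR[1:], label='TPR')
plt.plot(thresholds[1:], FPR[1:], label='FPR')
plt.xlabel('threshold')
plt.ylabel('rate')
plt.legend()
plt.title('KS curve')
plt.show()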
Data Analysis and Preprocessing
- Reading and inspecting the data
# -*- coding: UTF-8 -*-
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import datetime
import warnings
import pandas.util.testing as tm
warnings.filterwarnings('ignore')
data_train = pd.read_csv('D://人工智能資料//數據比賽//貸款風險預測//train.csv') # read the training set
# passing nrows=n to pd.read_csv reads only the first n rows of the file (useful for large files); chunksize=m reads the data as an iterable in which each chunk of m rows is returned as a DataFrame
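# A minimal sketch of the chunked reading mentioned above (the path is the same training file;
# chunksize=100000 is an arbitrary choice for illustration): each iteration yields a DataFrame
# of at most 100000 rows, which helps when the file does not fit into memory at once.
# chunks = pd.read_csv('D://人工智能資料//數據比賽//貸款風險預測//train.csv', chunksize=100000)
# for chunk in chunks:
#     print(chunk.shape)  # process each chunk here instead of loading everything at once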
data_test_a = pd.read_csv('D://人工智能資料//數據比賽//貸款風險預測//testA.csv') # read the test set
print(data_test_a.shape)
'''
(200000, 48)
'''
print(data_train.shape)
'''
(800000, 47)
'''
print(data_train.columns)
'''
Index(['id', 'loanAmnt', 'term', 'interestRate', 'installment', 'grade',
'subGrade', 'employmentTitle', 'employmentLength', 'homeOwnership',
'annualIncome', 'verificationStatus', 'issueDate', 'isDefault',
'purpose', 'postCode', 'regionCode', 'dti', 'delinquency_2years',
'ficoRangeLow', 'ficoRangeHigh', 'openAcc', 'pubRec',
'pubRecBankruptcies', 'revolBal', 'revolUtil', 'totalAcc',
'initialListStatus', 'applicationType', 'earliesCreditLine', 'title',
'policyCode', 'n0', 'n1', 'n2', 'n2.1', 'n4', 'n5', 'n6', 'n7', 'n8',
'n9', 'n10', 'n11', 'n12', 'n13', 'n14'],
dtype='object')
'''
print(data_train.info()) # data type of each column
'''
# only part of the output is shown below; the actual call returns the full information
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 800000 entries, 0 to 799999
Data columns (total 47 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 id 800000 non-null int64
1 loanAmnt 800000 non-null float64
2 term 800000 non-null int64
3 interestRate 800000 non-null float64
4 installment 800000 non-null float64
5 grade 800000 non-null object
6 subGrade 800000 non-null object
7 employmentTitle 799999 non-null float64
8 employmentLength 753201 non-null object
9 homeOwnership 800000 non-null int64
...
44 n12 759730 non-null float64
45 n13 759730 non-null float64
46 n14 759730 non-null float64
dtypes: float64(33), int64(9), object(5)
memory usage: 286.9+ MB
None
'''
print(data_train.describe()) # basic statistics; calling data_train.describe() directly gives a scrollable table
'''
id loanAmnt term interestRate \
count 800000.000000 800000.000000 800000.000000 800000.000000
mean 399999.500000 14416.818875 3.482745 13.238391
std 230940.252015 8716.086178 0.855832 4.765757
min 0.000000 500.000000 3.000000 5.310000
25% 199999.750000 8000.000000 3.000000 9.750000
50% 399999.500000 12000.000000 3.000000 12.740000
75% 599999.250000 20000.000000 3.000000 15.990000
max 799999.000000 40000.000000 5.000000 30.990000
...
n11 n12 n13 n14
count 730248.000000 759730.000000 759730.000000 759730.000000
mean 0.000815 0.003384 0.089366 2.178606
std 0.030075 0.062041 0.509069 1.844377
min 0.000000 0.000000 0.000000 0.000000
25% 0.000000 0.000000 0.000000 1.000000
50% 0.000000 0.000000 0.000000 2.000000
75% 0.000000 0.000000 0.000000 3.000000
max 4.000000 4.000000 39.000000 30.000000
'''
print(data_train.head(3).append(data_train.tail(3))) # look at the first three and last three rows of the raw data
'''
id loanAmnt term interestRate installment grade subGrade \
0 0 35000.0 5 19.52 917.97 E E2
1 1 18000.0 5 18.49 461.90 D D2
2 2 12000.0 5 16.99 298.17 D D3
799997 799997 6000.0 3 13.33 203.12 C C3
799998 799998 19200.0 3 6.92 592.14 A A4
799999 799999 9000.0 3 11.06 294.91 B B3
employmentTitle employmentLength homeOwnership ... n5 n6 \
0 320.0 2 years 2 ... 9.0 8.0
1 219843.0 5 years 0 ... NaN NaN
2 31698.0 8 years 0 ... 0.0 21.0
799997 2582.0 10+ years 1 ... 4.0 26.0
799998 151.0 10+ years 0 ... 10.0 6.0
799999 13.0 5 years 0 ... 3.0 4.0
n7 n8 n9 n10 n11 n12 n13 n14
0 4.0 12.0 2.0 7.0 0.0 0.0 0.0 2.0
1 NaN NaN NaN 13.0 NaN NaN NaN NaN
2 4.0 5.0 3.0 11.0 0.0 0.0 0.0 4.0
799997 4.0 10.0 4.0 5.0 0.0 0.0 1.0 4.0
799998 12.0 22.0 8.0 16.0 0.0 0.0 0.0 5.0
799999 4.0 8.0 3.0 7.0 0.0 0.0 0.0 2.0
[6 rows x 47 columns]
'''
- Data analysis
print(f'There are {data_train.isnull().any().sum()} columns in train dataset with missing values.') # df.isnull().any() indicates for each column whether it contains missing values (True if it does, False otherwise); adding .sum() counts those columns
'''
There are 22 columns in train dataset with missing values.
'''
have_null_fea_dict = (data_train.isnull().sum()/len(data_train)).to_dict() # used to find feature columns whose missing rate exceeds a given value
fea_null_moreThanHalf = {}
for key,value in have_null_fea_dict.items():
    if value > 0.05: # record the columns whose missing rate is above 5% together with their missing rate
        fea_null_moreThanHalf[key] = value
print(fea_null_moreThanHalf)
missing = data_train.isnull().sum()/len(data_train)
missing = missing[missing > 0]
missing.sort_values(inplace=True)
missing.plot.bar() # plot the distribution of missing rates
'''
{'employmentLength': 0.05849875, 'n0': 0.0503375, 'n1': 0.0503375, 'n2': 0.0503375, 'n2.1': 0.0503375, 'n5': 0.0503375, 'n6': 0.0503375, 'n7': 0.0503375, 'n8': 0.05033875, 'n9': 0.0503375, 'n11': 0.08719, 'n12': 0.0503375, 'n13': 0.0503375, 'n14': 0.0503375}
'''
one_value_fea = [col for col in data_train.columns if data_train[col].nunique() == 1] # check whether any feature column in the training or test set takes only a single value
one_value_fea_test = [col for col in data_test_a.columns if data_test_a[col].nunique() == 1]
print(one_value_fea)
print(one_value_fea_test)
'''
['policyCode']
['policyCode']
'''
numerical_fea = list(data_train.select_dtypes(exclude=['object']).columns) # select the numerical features
category_fea = list(filter(lambda x: x not in numerical_fea,list(data_train.columns))) # select the categorical features
def get_numerical_serial_fea(data,feas):
    # split the numerical features into continuous ones and discrete ones (at most 10 distinct values)
    numerical_serial_fea = []
    numerical_noserial_fea = []
    for fea in feas:
        temp = data[fea].nunique()
        if temp <= 10:
            numerical_noserial_fea.append(fea)
            continue
        numerical_serial_fea.append(fea)
    return numerical_serial_fea,numerical_noserial_fea
numerical_serial_fea,numerical_noserial_fea = get_numerical_serial_fea(data_train,numerical_fea)
print(numerical_serial_fea) # which variables are continuous
'''
['id', 'loanAmnt', 'interestRate', 'installment', 'employmentTitle', 'annualIncome', 'purpose', 'postCode', 'regionCode', 'dti', 'delinquency_2years', 'ficoRangeLow', 'ficoRangeHigh', 'openAcc', 'pubRec', 'pubRecBankruptcies', 'revolBal', 'revolUtil', 'totalAcc', 'title', 'n0', 'n1', 'n2', 'n2.1', 'n4', 'n5', 'n6', 'n7', 'n8', 'n9', 'n10', 'n13', 'n14']
'''
print(numerical_noserial_fea) # which variables are discrete
'''
['term', 'homeOwnership', 'verificationStatus', 'isDefault', 'initialListStatus', 'applicationType', 'policyCode', 'n11', 'n12']
'''
print(data_train['homeOwnership'].value_counts()) # distribution of the values of one discrete variable
'''
0 395732
1 317660
2 86309
3 185
5 81
4 33
Name: homeOwnership, dtype: int64
'''
f = pd.melt(data_train, value_vars=numerical_serial_fea) # keep only the continuous numerical variables
g = sns.FacetGrid(f, col="variable", col_wrap=2, sharex=False, sharey=False) # plot histograms of their distributions (all continuous variables are drawn; only two are shown below); note that the input must be a DataFrame or Series
g = g.map(sns.distplot, "value")
# the distribution of a single feature can also be plotted on its own, e.g. to check whether the variable (or its log) roughly follows a normal distribution
plt.figure(figsize=(16,12))
plt.suptitle('Transaction Values Distribution', fontsize=22)
plt.subplot(221)
sub_plot_1 = sns.distplot(data_train['loanAmnt'])
sub_plot_1.set_title("loanAmnt Distribuition", fontsize=18)
sub_plot_1.set_xlabel("")
sub_plot_1.set_ylabel("Probability", fontsize=15)
plt.subplot(222)
sub_plot_2 = sns.distplot(np.log(data_train['loanAmnt']))
sub_plot_2.set_title("loanAmnt (Log) Distribuition", fontsize=18)
sub_plot_2.set_xlabel("")
sub_plot_2.set_ylabel("Probability", fontsize=15)
print(category_fea) # which variables are categorical
'''
['grade', 'subGrade', 'employmentLength', 'issueDate', 'earliesCreditLine']
'''
print(data_train['employmentLength'].value_counts()) # value distribution of one of the categorical variables
'''
10+ years 262753
2 years 72358
< 1 year 64237
3 years 64152
1 year 52489
5 years 50102
4 years 47985
6 years 37254
8 years 36192
7 years 35407
9 years 30272
Name: employmentLength, dtype: int64
'''
plt.figure(figsize=(8, 8))
sns.barplot(data_train["employmentLength"].value_counts(dropna=False)[:20],
data_train["employmentLength"].value_counts(dropna=False).keys()[:20])
plt.show()
train_loan_fr = data_train.loc[data_train['isDefault'] == 1]
train_loan_nofr = data_train.loc[data_train['isDefault'] == 0] # take the subsets of samples whose label 'isDefault' equals 1 and 0 respectively
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 8))
train_loan_fr.groupby('employmentLength')['employmentLength'].count().plot(kind='barh', ax=ax1, title='Count of employmentLength fraud')
train_loan_nofr.groupby('employmentLength')['employmentLength'].count().plot(kind='barh', ax=ax2, title='Count of employmentLength non-fraud') # count and plot the value (category) distribution of this categorical feature under the two label values
plt.show()
import pandas_profiling # generate a data report
pfr = pandas_profiling.ProfileReport(data_train)
pfr.to_file("./example.html")
- Feature preprocessing
- Handling missing values: the first step is to deal with missing feature values so that the data can be fed into the later models in a proper form.
# data_train = data_train.fillna(0) # replace missing values with a given constant
# data_train = data_train.fillna(axis=0,method='ffill') # axis=0 fills along the index (down each column), axis=1 along the columns; 'ffill'/'bfill' fill with the previous/next valid value in that direction; limit=n caps the number of consecutive missing values filled
# data_train[numerical_fea] = data_train[numerical_fea].fillna(data_train[numerical_fea].median()) # fill numerical features with the median
data_train[numerical_fea] = data_train[numerical_fea].fillna(data_train[numerical_fea].mean()) # fill numerical features with the mean
data_test_a[category_fea] = data_test_a[category_fea].fillna(data_train[category_fea].mode().iloc[0]) # fill categorical features with the training-set mode (.iloc[0] takes the first mode row, since .mode() returns a DataFrame)
data_train.isnull().sum() # check the missing-value counts at this point (output not shown here)
- Handling time formats: convert date information stored as irregular strings into a standardized timestamp representation
for data in [data_train, data_test_a]:
    data['issueDate'] = pd.to_datetime(data['issueDate'],format='%Y-%m-%d') # the format argument tells pandas how to parse the input strings; same below
    startdate = datetime.datetime.strptime('2007-06-01', '%Y-%m-%d')
    # construct a time feature: days elapsed since the start date
    data['issueDateDT'] = data['issueDate'].apply(lambda x: x-startdate).dt.days
print(type(data['issueDate'][0]))
'''
<class 'pandas._libs.tslibs.timestamps.Timestamp'> the column's type has changed from string to a pandas Timestamp
'''
- Converting categorical features to numerical ones: a column such as 'employmentLength' takes values like '1 year' and '2 years', which cannot be fed into a model directly and need to be converted to numerical labels
def employmentLength_to_int(s): # helper that extracts the number of years and converts it to an integer
    if pd.isnull(s):
        return s
    else:
        return np.int8(s.split()[0])
for data in [data_train, data_test_a]:
    data['employmentLength'].replace(to_replace='10+ years', value='10 years', inplace=True) # values such as '10+ years' are first normalized to the '10 years' format so they can be handled uniformly; same for the next line
    data['employmentLength'].replace('< 1 year', '0 years', inplace=True)
    data['employmentLength'] = data['employmentLength'].apply(employmentLength_to_int) # .apply passes each value to the function and replaces it with the return value
data['employmentLength'].value_counts(dropna=False).sort_index()
'''
0.0 15989
1.0 13182
2.0 18207
3.0 16011
4.0 11833
5.0 12543
6.0 9328
7.0 8823
8.0 8976
9.0 7594
10.0 65772
NaN 11742
Name: employmentLength, dtype: int64
'''
# the 'earliesCreditLine' column is handled in the same way
data_train['earliesCreditLine'].sample(5)
'''
519915 Sep-2002
564368 Dec-1996
768209 May-2004
453092 Nov-1995
763866 Sep-2000
Name: earliesCreditLine, dtype: object
'''
for data in [data_train, data_test_a]:
    data['earliesCreditLine'] = data['earliesCreditLine'].apply(lambda s: int(s[-4:]))
# a feature such as 'grade', whose values have a natural ordering, can be mapped to consecutive integers
for data in [data_train, data_test_a]:
    data['grade'] = data['grade'].map({'A':1,'B':2,'C':3,'D':4,'E':5,'F':6,'G':7})
# purely categorical features with more than two categories that are not high-dimensional and sparse can be one-hot encoded
# note: reassigning the loop variable data inside a for loop would not modify the original DataFrames, so the two sets are encoded explicitly here
data_train = pd.get_dummies(data_train, columns=['subGrade', 'homeOwnership', 'verificationStatus', 'purpose', 'regionCode'], drop_first=True)
data_test_a = pd.get_dummies(data_test_a, columns=['subGrade', 'homeOwnership', 'verificationStatus', 'purpose', 'regionCode'], drop_first=True)
- Handling outliers: try to determine whether an outlying value is an accidental error or represents some rare but genuine phenomenon; in the first case the sample can simply be dropped, in the second case it has to be taken into account by the model.
# common approach: flag outliers with the standard-deviation rule, then inspect how the outliers relate to the label
def find_outliers_by_3segama(data,fea): # 3-sigma rule: in a normal distribution roughly 68% of the values lie within one standard deviation of the mean, about 95% within two, and about 99.7% within three
    data_std = np.std(data[fea])
    data_mean = np.mean(data[fea])
    outliers_cut_off = data_std * 3
    lower_rule = data_mean - outliers_cut_off
    upper_rule = data_mean + outliers_cut_off
    data[fea+'_outliers'] = data[fea].apply(lambda x: 'outlier' if x > upper_rule or x < lower_rule else 'normal')
    return data
data_train = data_train.copy()
for fea in numerical_fea:
    data_train = find_outliers_by_3segama(data_train,fea)
    print(data_train[fea+'_outliers'].value_counts())
    print(data_train.groupby(fea+'_outliers')['isDefault'].sum())
    print('*'*10)
# the results are not shown here, but the analysis indicates that the label distribution among the outliers does not differ noticeably from the overall distribution, so these outliers can be treated as accidental errors and removed
# remove the outliers
for fea in numerical_fea:
    data_train = data_train[data_train[fea+'_outliers']=='normal']
    data_train = data_train.reset_index(drop=True)
- Data binning: binning blurs and discretizes feature values, which reduces the complexity of a variable, dampens the influence of noise, and makes the model more stable. It can be applied both to continuous variables and to discrete variables with many distinct values. The key is to find suitable cut points: try to make sure every bin contains a reasonable amount of data (no empty bins), and avoid bins in which all samples share the same label. A few of the simplest approaches are shown below, followed by a small sketch with hand-picked cut points.
# Method 1: fixed-width binning
# integer division maps the values into evenly spaced bins, each covering a range of 1000 (loanAmnt / 1000)
data['loanAmnt_bin1'] = np.floor_divide(data['loanAmnt'], 1000)
# Method 2: exponential-width binning
## a logarithm maps the values into bins whose widths grow exponentially
data['loanAmnt_bin2'] = np.floor(np.log10(data['loanAmnt']))
# Method 3: quantile binning
data['loanAmnt_bin3'] = pd.qcut(data['loanAmnt'], 10, labels=False)
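As mentioned above, the key to binning is choosing sensible cut points. A minimal sketch with explicitly chosen, fixed cut points via pd.cut; the bin edges below are arbitrary illustrative values, not edges tuned for this dataset:
# Method 4 (sketch): binning with hand-picked cut points
bins = [0, 5000, 10000, 20000, 30000, 40000]   # illustrative edges only
data['loanAmnt_bin4'] = pd.cut(data['loanAmnt'], bins=bins, labels=False)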
Model Building and Tuning
- Preparing the data
After the data analysis and preprocessing described above, we can prepare the data used to train and test the models. The relevant part of the baseline code is shown below (the data loading and feature preprocessing steps are omitted):
features = [f for f in data.columns if f not in ['id','issueDate','isDefault']] # every column except these three is kept as a model feature
# to make feature processing easier the baseline concatenates the two datasets with data = pd.concat([train, testA], axis=0, ignore_index=True); here they are split apart again
train = data[data.isDefault.notnull()].reset_index(drop=True)
test = data[data.isDefault.isnull()].reset_index(drop=True)
x_train = train[features] # training input: the feature columns of the training set
x_test = test[features] # test input: the feature columns of the test set
y_train = train['isDefault'] # the label column of the training samples
- Building the models
What is called model building here really just means defining a function that wraps the workflow and parameters of the machine learning models; since we only call ready-made libraries, we do not have to implement the algorithms ourselves. The baseline uses three common tree-based ensemble models.
import lightgbm as lgb
import xgboost as xgb
from sklearn.model_selection import KFold   # used for the cross-validation splits below
from sklearn.metrics import roc_auc_score   # used to score each fold
# from catboost import CatBoostRegressor # installing the catboost package kept failing on my machine, so it is not used
def cv_model(clf, train_x, train_y, test_x, clf_name):
    folds = 5
    seed = 2020
    kf = KFold(n_splits=folds, shuffle=True, random_state=seed)
    train = np.zeros(train_x.shape[0])
    test = np.zeros(test_x.shape[0])
    cv_scores = []
    for i, (train_index, valid_index) in enumerate(kf.split(train_x, train_y)):
        print('************************************ {} ************************************'.format(str(i+1)))
        trn_x, trn_y, val_x, val_y = train_x.iloc[train_index], train_y[train_index], train_x.iloc[valid_index], train_y[valid_index]
        if clf_name == "lgb":
            train_matrix = clf.Dataset(trn_x, label=trn_y)
            valid_matrix = clf.Dataset(val_x, label=val_y)
            params = {
                'boosting_type': 'gbdt',
                'objective': 'binary',
                'metric': 'auc',
                'min_child_weight': 5,
                'num_leaves': 2 ** 5,
                'lambda_l2': 10,
                'feature_fraction': 0.8,
                'bagging_fraction': 0.8,
                'bagging_freq': 4,
                'learning_rate': 0.1,
                'seed': 2020,
                'nthread': 28,
                'n_jobs': 24,
                'silent': True,
                'verbose': -1,
            }
            model = clf.train(params, train_matrix, 50000, valid_sets=[train_matrix, valid_matrix], verbose_eval=200, early_stopping_rounds=200)
            val_pred = model.predict(val_x, num_iteration=model.best_iteration)
            test_pred = model.predict(test_x, num_iteration=model.best_iteration)
            # print(list(sorted(zip(features, model.feature_importance("gain")), key=lambda x: x[1], reverse=True))[:20])
        if clf_name == "xgb":
            train_matrix = clf.DMatrix(trn_x, label=trn_y)
            valid_matrix = clf.DMatrix(val_x, label=val_y)
            test_matrix = clf.DMatrix(test_x)
            params = {'booster': 'gbtree',
                      'objective': 'binary:logistic',
                      'eval_metric': 'auc',
                      'gamma': 1,
                      'min_child_weight': 1.5,
                      'max_depth': 5,
                      'lambda': 10,
                      'subsample': 0.7,
                      'colsample_bytree': 0.7,
                      'colsample_bylevel': 0.7,
                      'eta': 0.04,
                      'tree_method': 'exact',
                      'seed': 2020,
                      'nthread': 36,
                      "silent": True,
                      }
            watchlist = [(train_matrix, 'train'), (valid_matrix, 'eval')]
            model = clf.train(params, train_matrix, num_boost_round=50000, evals=watchlist, verbose_eval=200, early_stopping_rounds=200)
            val_pred = model.predict(valid_matrix, ntree_limit=model.best_ntree_limit)
            test_pred = model.predict(test_matrix, ntree_limit=model.best_ntree_limit)
        '''
        if clf_name == "cat":
            params = {'learning_rate': 0.05, 'depth': 5, 'l2_leaf_reg': 10, 'bootstrap_type': 'Bernoulli',
                      'od_type': 'Iter', 'od_wait': 50, 'random_seed': 11, 'allow_writing_files': False}
            model = clf(iterations=20000, **params)
            model.fit(trn_x, trn_y, eval_set=(val_x, val_y),
                      cat_features=[], use_best_model=True, verbose=500)
            val_pred = model.predict(val_x)
            test_pred = model.predict(test_x)
        '''
        # the lines below must run for every model, so they are kept outside the commented-out catboost branch
        train[valid_index] = val_pred
        test += test_pred / kf.n_splits # accumulate the averaged test-set predictions over the folds
        cv_scores.append(roc_auc_score(val_y, val_pred))
        print(cv_scores)
    print("%s_score_list:" % clf_name, cv_scores)
    print("%s_score_mean:" % clf_name, np.mean(cv_scores))
    print("%s_score_std:" % clf_name, np.std(cv_scores))
    return train, test
- Model training and prediction
# these three wrappers essentially just put the model name into the function name
def lgb_model(x_train, y_train, x_test):
    lgb_train, lgb_test = cv_model(lgb, x_train, y_train, x_test, "lgb")
    return lgb_train, lgb_test
def xgb_model(x_train, y_train, x_test):
    xgb_train, xgb_test = cv_model(xgb, x_train, y_train, x_test, "xgb")
    return xgb_train, xgb_test
'''
def cat_model(x_train, y_train, x_test):
    cat_train, cat_test = cv_model(CatBoostRegressor, x_train, y_train, x_test, "cat")
    return cat_train, cat_test
'''
lgb_train, lgb_test = lgb_model(x_train, y_train, x_test) # train the lgb model (only part of the training log is shown below)
'''
************************************ 1 ************************************
Training until validation scores don't improve for 200 rounds
[200] training's auc: 0.742781 valid_1's auc: 0.730238
[400] training's auc: 0.755391 valid_1's auc: 0.731421
[600] training's auc: 0.766076 valid_1's auc: 0.731637
[800] training's auc: 0.776276 valid_1's auc: 0.731616
[1000] training's auc: 0.785706 valid_1's auc: 0.731626
Early stopping, best iteration is:
[850] training's auc: 0.778637 valid_1's auc: 0.731771
************************************ 2 ************************************
Training until validation scores don't improve for 200 rounds
[200] training's auc: 0.743829 valid_1's auc: 0.726629
[400] training's auc: 0.756563 valid_1's auc: 0.728084
[600] training's auc: 0.767445 valid_1's auc: 0.728527
[800] training's auc: 0.777538 valid_1's auc: 0.728466
Early stopping, best iteration is:
[652] training's auc: 0.770397 valid_1's auc: 0.728656
************************************ 3 ************************************
...
'''
xgb_train, xgb_test = xgb_model(x_train, y_train, x_test) # train the xgb model (output omitted)
rh_test = lgb_test*0.5 + xgb_test*0.5 # blending the two models' predictions may help improve the result
data_test_a['isDefault'] = rh_test
data_test_a[['id','isDefault']].to_csv('test_sub.csv', index=False) # write the submission file (the test DataFrame is called data_test_a in this post, testA in the baseline)
Model Fusion
Model fusion is usually one of the key techniques for improving the score in the later stage of a competition. The basic idea is to aggregate the predictions of several individual models so that the combined prediction is better than any single one.
Commonly used fusion methods fall into the following categories:
- Averaging
Simply add (or add with weights) the outputs of the different models, provided they have the same format, and take the average. This is the simplest approach, but it sometimes works quite well, as in the previous section:
rh_test = lgb_test*0.3 + xgb_test*0.7
- Voting
For classification problems, several different models can be used to predict the label of each test sample, and the final prediction for the test set is decided by a vote over those predictions.
from xgboost import XGBClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
clf1 = LogisticRegression(random_state=1)
clf2 = RandomForestClassifier(random_state=1)
clf3 = XGBClassifier(learning_rate=0.1, n_estimators=150, max_depth=4, min_child_weight=2, subsample=0.7,objective='binary:logistic')
vclf = VotingClassifier(estimators=[('lr', clf1), ('rf', clf2), ('xgb', clf3)]) # simple (hard) voting
# vclf = VotingClassifier(estimators=[('lr', clf1), ('rf', clf2), ('xgb', clf3)], voting='soft', weights=[2, 1, 1]) # weighted soft voting
vclf = vclf.fit(x_train, y_train)
print(vclf.predict(x_test))
- Stacking
The core idea of stacking is to build two layers of models to obtain more accurate predictions. Suppose the first layer consists of five models of different types; then, as illustrated in the figure below, something like five-fold validation is used to obtain each model's out-of-fold predictions on the training set, and these predictions are combined and used as the training data of the second-layer model (note that the predictions become new feature columns, so the second-layer model does carry a risk of overfitting). The test set likewise gets new feature columns from the combined predictions of the first-layer models, and the second-layer model then produces the final predictions.
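To make the out-of-fold idea described above concrete, here is a minimal hand-rolled sketch for a single first-layer model; the function name get_oof_feature is my own, and the inputs are assumed to be numpy arrays. (The mlxtend StackingClassifier used in the example below fits its first-layer models on the full training set; its cross-validated variant is StackingCVClassifier.)
import numpy as np
from sklearn.model_selection import KFold
def get_oof_feature(model, X_train, y_train, X_test, n_splits=5):
    # model: any classifier with fit/predict_proba; returns one meta-feature column for the
    # training set (out-of-fold predictions) and one for the test set (average over the fold models)
    oof_train = np.zeros(X_train.shape[0])
    oof_test = np.zeros(X_test.shape[0])
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=2020)
    for trn_idx, val_idx in kf.split(X_train):
        model.fit(X_train[trn_idx], y_train[trn_idx])
        oof_train[val_idx] = model.predict_proba(X_train[val_idx])[:, 1]
        oof_test += model.predict_proba(X_test)[:, 1] / n_splits
    return oof_train, oof_test
# stacking several models just means calling this once per model and using the returned columns as the second-layer inputs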
import warnings
warnings.filterwarnings('ignore')
import itertools
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from mlxtend.classifier import StackingClassifier
from sklearn.model_selection import cross_val_score, train_test_split
from mlxtend.plotting import plot_learning_curves
from mlxtend.plotting import plot_decision_regions
# the iris dataset bundled with scikit-learn is used as an example
iris = datasets.load_iris()
X, y = iris.data[:, 1:3], iris.target
clf1 = KNeighborsClassifier(n_neighbors=1)
clf2 = RandomForestClassifier(random_state=1)
clf3 = GaussianNB()
lr = LogisticRegression()
sclf = StackingClassifier(classifiers=[clf1, clf2, clf3],
meta_classifier=lr)
label = ['KNN', 'Random Forest', 'Naive Bayes', 'Stacking Classifier']
clf_list = [clf1, clf2, clf3, sclf]
fig = plt.figure(figsize=(10,8))
gs = gridspec.GridSpec(2, 2)
grid = itertools.product([0,1],repeat=2)
clf_cv_mean = []
clf_cv_std = []
for clf, label, grd in zip(clf_list, label, grid):
    scores = cross_val_score(clf, X, y, cv=5, scoring='accuracy')
    print("Accuracy: %.2f (+/- %.2f) [%s]" % (scores.mean(), scores.std(), label))
    clf_cv_mean.append(scores.mean())
    clf_cv_std.append(scores.std())
    clf.fit(X, y)
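The plotting helpers imported above (itertools, gridspec, plot_decision_regions) are not actually used in the loop as written. A minimal sketch of how they are typically combined, following the mlxtend documentation example this snippet appears to be adapted from, is shown below; the classifiers were already fitted in the loop above, and a fresh label list and grid iterator are used because the originals were consumed there.
for clf, lab, grd in zip(clf_list,
                         ['KNN', 'Random Forest', 'Naive Bayes', 'Stacking Classifier'],
                         itertools.product([0, 1], repeat=2)):
    ax = plt.subplot(gs[grd[0], grd[1]])        # place each panel on the 2x2 grid defined above
    plot_decision_regions(X=X, y=y, clf=clf)    # draw the decision regions of the fitted classifier
    plt.title(lab)
plt.show()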
- Blending
Blending follows the same idea as stacking; the key difference is that each model holds out a fixed part of the training set as its prediction target, and those held-out predictions, kept in order, become the new features of the second-layer training set, instead of the new feature columns being obtained by running n-fold validation for each model and merging the results.
# the iris dataset bundled with scikit-learn is used as an example again
from sklearn.ensemble import ExtraTreesClassifier, GradientBoostingClassifier # these two base learners were not imported above
from sklearn.metrics import roc_auc_score
data_0 = iris.data
data = data_0[:100,:]
target_0 = iris.target
target = target_0[:100]
# base learners of the first layer
clfs = [LogisticRegression(),
        RandomForestClassifier(),
        ExtraTreesClassifier(),
        GradientBoostingClassifier()]
# hold out part of the data as the test set
X, X_predict, y, y_predict = train_test_split(data, target, test_size=0.3, random_state=914)
# split the training data into two parts, d1 and d2
X_d1, X_d2, y_d1, y_d2 = train_test_split(X, y, test_size=0.5, random_state=914)
dataset_d1 = np.zeros((X_d2.shape[0], len(clfs)))
dataset_d2 = np.zeros((X_predict.shape[0], len(clfs)))
for j, clf in enumerate(clfs):
    # train each base model in turn
    clf.fit(X_d1, y_d1)
    y_submission = clf.predict_proba(X_d2)[:, 1]
    dataset_d1[:, j] = y_submission
    # for the test set, the predictions of these k models are used directly as new features
    dataset_d2[:, j] = clf.predict_proba(X_predict)[:, 1]
    print("val auc Score: %f" % roc_auc_score(y_predict, dataset_d2[:, j]))
# the second-layer model used for fusion
clf = GradientBoostingClassifier()
clf.fit(dataset_d1, y_d2)
y_submission = clf.predict_proba(dataset_d2)[:, 1]
print("Val auc Score of Blending: %f" % (roc_auc_score(y_predict, y_submission)))
References:
- https://github.com/datawhalechina/team-learning-data-mining/tree/master/FinancialRiskControl (Datawhale study group materials)
- https://tianchi.aliyun.com/competition/entrance/531830/introduction (Tianchi beginner competition on financial risk control: loan default prediction)
- https://blog.csdn.net/qq_30992103/article/details/99730059 (notes on the ROC curve)
- https://blog.csdn.net/anshuai_aw1/article/details/82498557 (principles and code of stacking & blending for model fusion)
- https://blog.csdn.net/wuzhongqiang/article/details/105012739 (beginner's data mining series, part 6: a summary of model fusion techniques and result deployment)