寫在前面:本文為本人所做數據分析關於信用評分卡的習作,使用的是一個多年前kaggle的一個數據集,所以已經有人做過相關的分析。正在學習增強中,水平有限,文中不當之處望各位多多指點。
一、 數據介紹
| SeriousDlqin2yrs |
是否有超過90天的逾期 |
| RevolvingUtilizationOfUnsecuredLines |
循環貸款無抵押額度 |
| Age |
借款人年齡 |
| NumberOfTime30-59DaysPastDueNotWorse |
有逾期30-59天,但在過去2年沒有更糟的情況出現的次數 |
| DebtRatio |
每月債務支付,贍養費,生活費用除以月總收入 |
| MonthlyIncome |
月收入 |
| NumberOfOpenCreditLinesAndLoans |
開放式貸款的數量(如汽車貸款或抵押貸款等分期付款)和信用額度(如信用卡) |
| NumberOfTimes90DaysLate |
借款人逾期90天或以上的次數 |
| NumberRealEstateLoansOrLines |
按揭及房地產貸款數目,包括房屋凈值信貸額度。 |
| NumberOfTime60-89DaysPastDueNotWorse |
借款人逾期60-89天的次數,但過去兩年更糟的情況出現 |
| NumberOfDependents |
不包括自己在內的家屬(配偶,子女等)數量。 |
二、 數據獲取和整合
1. 數據獲取
adress1<-"E:\\project\\pfk\\cs-training.csv"
train1<-read.table(adress1,header=TRUE,sep=",",row.names="id")
2. 變量名整理
由於變量名都太長了,進行重新命名
names(train1)[1:11]<-c("y","x1","x2","x3","x4","x5","x6","x7","x8","x9","x10")
3. 缺失值分析及處理
> md.pattern(train1)
y x1 x2 x3 x4 x6 x7 x8 x9 x10 x5
120269 1 1 1 1 1 1 1 1 1 1 1 0
25807 1 1 1 1 1 1 1 1 1 1 0 1
3924 1 1 1 1 1 1 1 1 1 0 0 2
0 0 0 0 0 0 0 0 0 3924 29731 33655
> matrixplot(train1)

存在缺失x5(月收入)的情況,x5(月收入)與x10(家庭成員情況)同時缺失的情況。
由於缺失值數量較多,不宜采用直接刪除的方法,從缺失的位置可以看出,x10的缺失只出現在x5也缺失的情況下,為非隨機缺失,也不宜直接刪除。這里選擇對缺失值進行填補,使用k臨近方法進行插值。此處缺失值x10(家庭成員數量)為整數,將k設置為奇數,采用中位數進行插值。
> train2<-knnImputation(train1[,2:11],k=11, scale =T,meth = "median", distData =NULL)
缺失值在數據中顯示為NA,但是在使用knnImputation函數的時候將distData設置為NA反而會出現Not sufficient complete cases for computing neighbors.錯誤,不進行指定反而可以正常運行。迷
4. 異常值的分析處理
(1) 單變量異常檢測
> summary(train2)
x1 x2 x3 x4 x5
Min. : 0.00 Min. : 0.0 Min. : 0.000 Min. : 0.0 Min. : 0
1st Qu.: 0.03 1st Qu.: 41.0 1st Qu.: 0.000 1st Qu.: 0.2 1st Qu.: 2500
Median : 0.15 Median : 52.0 Median : 0.000 Median : 0.4 Median : 4500
Mean : 6.05 Mean : 52.3 Mean : 0.421 Mean : 353.0 Mean : 5658
3rd Qu.: 0.56 3rd Qu.: 63.0 3rd Qu.: 0.000 3rd Qu.: 0.9 3rd Qu.: 7416
Max. :50708.00 Max. :109.0 Max. :98.000 Max. :329664.0 Max. :3008750
x6 x7 x8 x9 x10
Min. : 0.000 Min. : 0.000 Min. : 0.000 Min. : 0.0000 Min. : 0.0000
1st Qu.: 5.000 1st Qu.: 0.000 1st Qu.: 0.000 1st Qu.: 0.0000 1st Qu.: 0.0000
Median : 8.000 Median : 0.000 Median : 1.000 Median : 0.0000 Median : 0.0000
Mean : 8.453 Mean : 0.266 Mean : 1.018 Mean : 0.2404 Mean : 0.7451
3rd Qu.:11.000 3rd Qu.: 0.000 3rd Qu.: 2.000 3rd Qu.: 0.0000 3rd Qu.: 1.0000
Max. :58.000 Max. :98.000 Max. :54.000 Max. :98.0000 Max. :20.0000
根據變量的定義
X1是已經使用的信用額度與總的信用額度之比,正常情況下在(0,1)之間,但是數據出現了50708這樣的異常值,綜合考慮現實中可能會出現超額的情況和銀行對貸款應有的風險管理,定義正常的數據范圍為(0,1.5),超過1.5的為異常數據,予以刪除。
X2(年齡)有0的情況 ,0為異常值。
par(mfrow=c(1,3))
boxplot(train1$x3,xlab="x3")
boxplot(train1$x7,xlab="x7")
boxplot(train1$x9,xlab="x9")

X3,x7,x9在大於80位置,均有異常值
其他變量沒有異常點或者無法通過經驗對數據范圍進行限定,暫不做處理。
處理方法:
> train3<-train21[which(train21$x2!=0&train21$x3<=80&train21$x7<80&train21$x9<80&train21$x1<=1.5),]
(2) 多變量異常值檢測
多變量異常值檢測,使用LOF檢測異常值。在此處可采用多個異常值檢驗模型進行檢驗,后根據多個模型的結論投票最終決定一個值是否為異常值,由於水平有限暫時做不了。
a <-names(train3)%in%c("y")
train31<-train3[!a]
train31<-scale(train31)
yc <- lofactor(train31, k=5)
train32<-cbind(train3,yc)
plot(density(train32[which(train32$yc>0&train32$yc<3),]$yc))

定義大於1.5的部分為異常值。
#對異常值進行篩選
train33<-na.omit(train32[which(train32$yc>0&train32$yc<1.5),])
train33<-train33[,1:11]

三、 數據探索(EDA)與變量描述性統計
> stat.desc(train33,basic=TRUE,desc=TRUE,norm=FALSE,p=0.95)

1. 單變量分析
X1:
p<-ggplot(train33,aes(x=x1))#
p+geom_histogram(bins = 50,aes(fill=factor(y)),position = "stack")+scale_color_manual(values = c("red4","blue4"))
p<-ggplot(train33,aes(x=x1))#
p+geom_histogram(bins = 50,aes(fill=factor(y)),position = "fill")+scale_color_manual(values = c("red4","blue4"))


評析:循環貸款無抵押額度在人數分布上集中於0與1兩個點附近,比率越大,人數分布越少;違約概率與循環貸款無抵押額度在(0,1)區間內呈現比較標准的線性正相關,而大於1后呈現出非常高的違約率。
X2:
p<-ggplot(train33,aes(x=x2))#
p+geom_histogram(binwidth = 2,aes(fill=factor(y)),position = "stack")+scale_color_manual(values = c("red4","blue4"))
p<-ggplot(train33,aes(x=x2))#
p+geom_histogram(binwidth = 2,aes(fill=factor(y)),position = "fill")+scale_color_manual(values = c("red4","blue4"))


評析:年紀的人數分布大致呈現正態分布,年紀98歲以前呈現出隨着年紀的增加,違約率下降,而到了98歲之后呈現異常的違約率增加的現象。
X3:
p<-ggplot(train33,aes(x=x3))
p+geom_histogram(binwidth = 1,aes(fill=factor(y)),position = "stack")+scale_color_manual(values = c("red4","blue4"))
p<-ggplot(train33,aes(x=x3))
p+geom_histogram(binwidth = 1,aes(fill=factor(y)),position = "fill")+scale_color_manual(values = c("red4","blue4"))


評析:35-59天違約的次數集中在0次,有過本類違約的人是較少的,隨着違約次數的增加,違約率先增加,然后在6-7次出出現違約率減少,而在10次以后出現了違約率猛然上升的現象。
X4:
未發現規律性
X5:
p<-ggplot(train33[which(train33$x5<100000&train33$x1>0),],aes(x=x5))
p+geom_histogram(bins = 50,aes(fill=factor(y)),position = "stack")+scale_color_manual(values = c("red4","blue4"))
p<-ggplot(train33[which(train33$x5<100000&train33$x1>0),],aes(x=x5))
p+geom_histogram(bins = 50,aes(fill=factor(y)),position = "fill")+scale_color_manual(values = c("red4","blue4"))


評析:人群主要分布在10000月收入以下,在10000月收入以下,違約率呈現收入越高違約率越低的趨勢,但是在更高月收入的數據當真呈現混亂的現象,初步判斷是少部分顧客將月收入錯填為年收入而產生的混亂。
X6:
p<-ggplot(train33,aes(x=x6))
p+geom_histogram(binwidth =4,aes(fill=factor(y)),position = "stack")+scale_color_manual(values = c("red4","blue4"))
p<-ggplot(train33,aes(x=x6))
p+geom_histogram(binwidth = 4,aes(fill=factor(y)),position = "fill")+scale_color_manual(values = c("red4","blue4"))


評析:人群在不同的信貸數量下呈現正態的分布,違約率呈現出持有較少的信貸數量和較多信貸數量的客戶有較高的違約概率,而持有較多數量信貸的客戶比持有較少的信貸數量的客戶的違約率更高。
X7:
p<-ggplot(train33,aes(x=x7))
p+geom_histogram(binwidth = 2,aes(fill=factor(y)),position = "stack")+scale_color_manual(values = c("red4","blue4"))
p<-ggplot(train33,aes(x=x7))
p+geom_histogram(binwidth = 2,aes(fill=factor(y)),position = "fill")+scale_color_manual(values = c("red4","blue4"))


評析:x7與x3類似
X8:
p<-ggplot(train33,aes(x=x8))
p+geom_histogram(binwidth = 4,aes(fill=factor(y)),position = "stack")+scale_color_manual(values = c("red4","blue4"))
p<-ggplot(train33,aes(x=x8))
p+geom_histogram(binwidth = 4,aes(fill=factor(y)),position = "fill")+scale_color_manual(values = c("red4","blue4"))


評析:隨着貸款數量的增加,人數越來越少,大多數的人的貸款數量在4個以下,隨着貸款數量的增加,違約率也呈現着增加的趨勢。
X9:
與x3,x7類似
X10:
p<-ggplot(train33,aes(x=x10))
p+geom_histogram(binwidth = 1,aes(fill=factor(y)),position = "stack")+scale_color_manual(values = c("red4","blue4"))
p<-ggplot(train33,aes(x=x10))
p+geom_histogram(binwidth = 1,aes(fill=factor(y)),position = "fill")+scale_color_manual(values = c("red4","blue4"))


評析:家庭成員越多,客戶越少,單身的客戶占了大半;隨着家庭成員的增多,違約率變化不大,當家庭成員在8個以上時,違約率低。
2. 變量間的相關性分析
library(corrplot)
cor1<-cor(train33)
corrplot(cor1)
geom_tile(aes(fill=value))
corrplot(corr=cor1, method = "number",addCoef.col = "grey")

評析:各變量間相關性小,初步判斷符合logistic回歸的要求。
四、 模型開發
1. 數據切分
library(caret)
set.seed(1111)
inTrain <- createDataPartition(y=train33$y,p=0.5,list=FALSE)
training <- train33[inTrain, ]
testing <- train33[-inTrain, ]
summary(testing)
2. 建立模型
glm1<-glm(y~.,family = binomial(link = "logit"),data = training)
summary(glm1)
Call:
glm(formula = y ~ ., family = binomial(link = "logit"), data = training)
Deviance Residuals:
Min 1Q Median 3Q Max
-4.2404 -0.3241 -0.2230 -0.1725 3.2137
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.310338803 0.080252123 -41.249 < 0.0000000000000002 ***
x1 1.958402994 0.048926358 40.028 < 0.0000000000000002 ***
x2 -0.019645433 0.001371451 -14.325 < 0.0000000000000002 ***
x3 0.431973025 0.016448532 26.262 < 0.0000000000000002 ***
x4 -0.000062736 0.000017252 -3.636 0.000276 ***
x5 -0.000028903 0.000004622 -6.253 0.000000000402647570 ***
x6 0.030681183 0.003769002 8.140 0.000000000000000394 ***
x7 0.696080668 0.024405712 28.521 < 0.0000000000000002 ***
x8 0.128121160 0.015487068 8.273 < 0.0000000000000002 ***
x9 0.591574598 0.032721639 18.079 < 0.0000000000000002 ***
x10 0.049479163 0.014527829 3.406 0.000660 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 35353 on 73821 degrees of freedom
Residual deviance: 27127 on 73811 degrees of freedom
AIC: 27149
Number of Fisher Scoring iterations: 6
模型變量都通過檢驗。
3. 模型評估
使用AUC值對模型進行評估。
library(pROC)
library(e1071)
pre <- predict(glm1,testing)
modelroc <- roc(testing$y,pre)
plot(modelroc, print.auc=TRUE, auc.polygon=TRUE, grid=c(0.1, 0.2),grid.col=c("green", "red"), max.auc.polygon=TRUE,auc.polygon.col="skyblue", print.thres=TRUE)

最優點FPR=1-TNR=0.771,TPR=0.773,AUC值為0.851,說明模型預測效果不錯。
五、 評分卡實施
1. 數據分箱
數據分箱方法分為手動(等距分箱和等寬分箱)與自動分箱(基於決策樹算法的最優分箱)。本例采用最優分箱與手動分箱相結合。
library(smbinning)
#該列數據小數位數較長,全部納入計算會出現內存不足。
xx1<-round(train33$x1,4)
a <-names(train3)%in%c("x1")
train341<-train33[!a]
train341<-cbind(train341,xx1)
smbin1<-smbinning(df=train341, y="y", x="xx1", p = 0.05)
smbin1$bands
smbinning.plot(smbin1,option="WoE",sub="xx1")
> smbin1$bands
[1] 0.0000 0.1318 0.2177 0.3008 0.3852 0.4949 0.6980 0.9032 1.4995

smbin2<-smbinning(df=train33, y="y", x="x2", p = 0.05)
smbin2$bands
smbinning.plot(smbin2,option="WoE",sub="x2")
> smbin2$bands
[1] 21 29 36 42 47 52 55 59 62 67 105

smbin3<-smbinning(df=train33, y="y", x="x3", p = 0.05)
smbin3$bands
smbinning.plot(smbin3,option="WoE",sub="x3")
> smbin3$bands
[1] 0 0 1 13

xx4<-round(train33$x4,4)
a <-names(train3)%in%c("x4")
train344<-train33[!a]
train344<-cbind(train344,xx4)
smbin4<-smbinning(df=train344, y="y", x="xx4", p = 0.05)
smbin4$bands
smbinning.plot(smbin4,option="WoE",sub="xx4")
> smbin4$bands
[1] 0.0000 0.0192 0.1371 0.4232 0.5041 0.6536 3.9723
[8] 329664.0000

smbin5<-smbinning(df=train33, y="y", x="x5", p = 0.05)
smbin5$bands
smbinning.plot(smbin5,option="WoE",sub="x5")
> smbin5$bands
[1] 0 391 3331 4833 6643 835040

smbin6<-smbinning(df=train33, y="y", x="x6", p = 0.05)
smbin6$bands
smbinning.plot(smbin6,option="WoE",sub="x6")
> smbin6$bands
[1] 0 2 3 13 57

smbin7<-smbinning(df=train33, y="y", x="x7", p = 0.01)
smbin7$bands
smbinning.plot(smbin7,option="WoE",sub="x7")
> smbin7$bands
[1] 0 0 1 17

smbin10<-smbinning(df=train33, y="y", x="x10", p = 0.05)
smbin10$bands
smbinning.plot(smbin10,option="WoE",sub="x10")
> smbin10$bands
[1] 0 0 1 2 10

在最優分箱的過程中,x8,x9數據不滿足smbinning函數的輸入數據要求,需要進行手東的分箱。
getwoe<-function(xx,a,b)
{
good<-nrow(train33[which(train33$y==0&train33[,xx]>=a&train33[,xx]<b),])
bad<-nrow(train33[which(train33$y==1&train33[,xx]>=a&train33[,xx]<b),])
Tgood<-nrow(train33[which(train33$y==0),])
Tbad<-nrow(train33[which(train33$y==1),])
woe<-log((bad*Tgood)/(good*Tbad))
return(woe)
}
table(train33$x8[duplicated(train33$x8)])
barplot(c(getwoe(9,0,1),getwoe(9,1,2),getwoe(9,2,3),getwoe(9,3,4),getwoe(9,4,5),getwoe(9,5,6),getwoe(9,6,7),getwoe(9,7,70)))
c(getwoe(9,0,1),getwoe(9,1,2),getwoe(9,2,3),getwoe(9,3,4),getwoe(9,4,5),getwoe(9,5,6),getwoe(9,6,7),getwoe(9,7,70))

第二項與第三項水平相當,可以合並。
barplot(c(getwoe(9,0,1),getwoe(9,1,3),getwoe(9,3,4),getwoe(9,4,5),getwoe(9,5,6),getwoe(9,6,7),getwoe(9,7,70)))

分區為:[0,2,3,4,5,6]
[1] 0.2162265 -0.2093599 0.0216662 0.3377388 0.6708343 0.9103778 1.2689269
table(train33$x9[duplicated(train33$x9)])
barplot(c(getwoe(10,0,1),getwoe(10,1,2),getwoe(10,2,3),getwoe(10,3,4),getwoe(10,4,5),getwoe(10,5,6),getwoe(10,6,7),getwoe(10,7,70)))
c(getwoe(10,0,1),getwoe(10,1,2),getwoe(10,2,3),getwoe(10,3,4),getwoe(10,4,5),getwoe(10,5,6),getwoe(10,6,7),getwoe(10,7,70))

第三到第六相當合並
barplot(c(getwoe(10,0,1),getwoe(10,1,2),getwoe(10,2,6),getwoe(10,6,7),getwoe(10,7,70)))
c(getwoe(10,0,1),getwoe(10,1,2),getwoe(10,2,6),getwoe(10,6,7),getwoe(10,7,70))

分區為:[0,1,5,6]
[1] -0.272719 1.848474 2.766116 3.764645 2.666032
2. 評分卡刻度
主要工作為:
(1) 指定某個特定比率的預期分值;
(2)指定翻倍比率的分值

設定1:20比率時分數為600分,違約率翻倍時減少的分數為20分。
B=20/ln(2)
B=28.8539
A=600+28.8539*ln(0.05)
A=513.5614
評分卡刻度:

3. 生成評分卡
| base |
|
609 |
| RevolvingUtilizationOfUnsecuredLines |
<= 0.13 |
38 |
|
|
<= 0.22 |
22 |
|
|
<= 0.30 |
16 |
|
|
<= 0.39 |
7 |
|
|
<= 0.50 |
0 |
|
|
<= 0.70 |
-12 |
|
|
<= 0.90 |
-26 |
|
|
> 0.90 |
-41 |
| Age |
<= 29 |
-17 |
|
|
<= 36 |
-14 |
|
|
<= 42 |
-10 |
|
|
<= 47 |
-7 |
|
|
<= 52 |
-4 |
|
|
<= 55 |
-1 |
|
|
<= 59 |
7 |
|
|
<= 62 |
10 |
|
|
<= 67 |
21 |
|
|
> 67 |
32 |
| NumberOfTime30-59DaysPastDueNotWorse |
<= 0 |
15 |
|
|
<= 1 |
-26 |
|
|
> 1 |
-54 |
| DebtRatio |
<= 0.02 |
14 |
|
|
<= 0.14 |
-1 |
|
|
<= 0.42 |
4 |
|
|
<= 0.50 |
-3 |
|
|
<= 0.65 |
-10 |
|
|
<= 3.97 |
-18 |
|
|
> 3.97 |
6 |
| MonthlyIncome |
<= 391 |
14 |
|
|
<= 3331 |
-10 |
|
|
<= 4833 |
-6 |
|
|
<= 6643 |
-1 |
|
|
> 6643 |
9 |
| NumberOfOpenCreditLinesAndLoans |
<= 2 |
-19 |
|
|
<= 3 |
-4 |
|
|
<= 13 |
4 |
|
|
> 13 |
-2 |
| NumberOfTimes90DaysLate |
<= 0 |
11 |
|
|
<= 1 |
-57 |
|
|
> 1 |
-83 |
| NumberRealEstateLoansOrLines |
<= 0 |
-6 |
|
|
<= 2 |
6 |
|
|
<= 3 |
-1 |
|
|
<= 4 |
-10 |
|
|
<= 5 |
-19 |
|
|
<= 6 |
-26 |
|
|
> 6 |
-37 |
| NumberOfTime60-89DaysPastDueNotWorse |
<= 0 |
-8 |
|
|
<= 1 |
-53 |
|
|
<= 5 |
-80 |
|
|
<= 6 |
-109 |
|
|
> 6 |
-77 |
| NumberOfDependents |
<= 0 |
5 |
|
|
<= 1 |
-3 |
|
|
<= 2 |
-6 |
|
|
> 2 |
-11 |
4. 分值分配
建立一個用戶數據自動進行打分的模塊:
owoe<-function(x,n)#除x8,x9的woe
{
woe1<-((get(paste("smbin",n,sep="")))$ivtable)[table(c((get(paste("smbin",n,sep=""))$cuts<x),"TRUE"))[["TRUE"]],13]
return(woe1)
}
owoe1<-function(x,n)#x8,x9的woe
{
woe2<-(get(paste("c",n,sep="")))[table(c((get(paste("smbin",n,sep=""))<x),"TRUE"))[["TRUE"]]]
return(woe2)
}
twoe=c()
for (i in 1:150000)
{ twoe1<-(513.5614-28.8539*(-3.310338803))-28.8539*(owoe(train2[i,1],1)+owoe(train2[i,2],2)+owoe(train2[i,3],3)+owoe(train2[i,4],4)+owoe(train2[i,5],5)+owoe(train2[1,6],6)+owoe(train2[i,7],7)+owoe1(train2[i,7],8)+owoe1(train2[i,7],9)+owoe(train2[i,10],10))
twoe<-c(twoe,twoe1)
}
JG<-cbind(train2,twoe)

最后對評分結果與違約率關系進行可視化展現
p<-ggplot(JG,aes(x=twoe))
p+geom_histogram(binwidth =20,aes(fill=factor(y)),position = "stack")+scale_color_manual(values = c("red4","blue4"))
p<-ggplot(JG,aes(x=twoe))
p+geom_histogram(binwidth =20,aes(fill=factor(y)),position = "fill")+scale_color_manual(values = c("red4","blue4"))

由於低分段人數較少,導致實際結果比率不穩定,但是總體上對不同違約比率的人群做了較好的划分。
六、 番外
在完成評分卡之余,把數據來源的kaggle競賽結果提交后,有84.5%的預測正確,還不錯。
#導入數據
adress1<-"E:\\project\\pfk\\cs-test.csv"
text1<-read.table(adress1,header=TRUE,sep=",",row.names="id")
#重命名列
names(text1)[1:11]<-c("y","x1","x2","x3","x4","x5","x6","x7","x8","x9","x10")
#缺失值信息
md.pattern(text1)
#缺失值圖
matrixplot(text1)
#填補缺失值
text2<-knnImputation(text1[,2:11],k=11, scale =T,meth = "median", distData =NULL)
#建立模型
glm2<-glm(y~.,family = binomial(link = "logit"),data = train33)
summary(glm2)
#輸出預測
library(pROC)
library(e1071)
pre1 <- predict(glm2,text2)
pre1 <- exp(pre1)/(1+exp(pre1))
text4<-cbind(text2,pre1)
adress2<-"E:\\project\\pfk\\jg.csv"
write.csv(pre1,file=adress2,quote=T,row.name=T)

