Linear Model Selection and Regularization
This post is part of my series of reading notes on An Introduction to Statistical Learning with Applications in R, written as a personal study summary and in the hope of exchanging ideas with other readers.
The book is a concise, R-flavored companion to The Elements of Statistical Learning, pairing brief introductions to the algorithms with their R implementations; the R implementations are what interest me most.
[Please cite the source when reposting]: http://www.cnblogs.com/runner-ljt/
Ljt. Stay true to the original aspiration; face the future without fear.
As a beginner my understanding is limited, and comments and corrections are welcome.
This chapter extends ordinary linear regression. A common problem in ordinary least squares regression is a large number of predictors: a high-dimensional predictor space means the sample size is relatively small by comparison and the predictors tend to be collinear, which ultimately causes the regression model to overfit, lowers its prediction accuracy, and weakens its interpretability. Choosing a sensible set of predictors not only reduces model complexity but also guards effectively against overfitting.
This chapter covers three common families of variable-selection methods:
(1) Subset selection: choose a suitable subset of the predictors according to a model-quality criterion; the main methods are best subset selection and stepwise selection.
(2) Shrinkage/regularization: add a constraint to the ordinary least squares fit on all variables so that some regression coefficients are shrunk toward zero (with the lasso, exactly to zero), thereby selecting predictors; the main methods are ridge regression and the lasso.
(3) Dimension reduction: use projections to form uncorrelated combinations of the predictors; the main methods are principal components regression (PCR) and partial least squares (PLS).
The first two approaches are essentially the same: both reduce dimension by choosing a suitable subset of the full set of variables. The difference is that (1) separates selection from assessment: each subset is one candidate, and a criterion decides which candidate is best; whereas (2) merges selection and assessment into a single step by adding a penalty to the full least squares problem, so variable selection happens inside the fitting itself and the result is a trade-off between the least squares fit and variable selection. Both (1) and (2) pick a subset of the original variables; (3) takes a different route: instead of using the original variables directly, it projects them onto new variables, which both reduces the dimension and makes the new variables mutually uncorrelated.
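Only subset selection is demonstrated in the labs below; as a taste of approaches (2) and (3), here is a minimal sketch, assuming the glmnet and pls packages (neither is used elsewhere in this post):

library(ISLR)
library(glmnet)   # ridge regression and the lasso
library(pls)      # PCR and PLS
d <- na.omit(Hitters)
x <- model.matrix(Salary ~ ., d)[, -1]   # predictor matrix, intercept column dropped
y <- d$Salary
ridge <- cv.glmnet(x, y, alpha = 0)      # alpha = 0: ridge, coefficients shrink toward 0
lasso <- cv.glmnet(x, y, alpha = 1)      # alpha = 1: lasso, some coefficients hit exactly 0
coef(lasso, s = "lambda.min")            # the surviving (non-zero) variables
pcr.fit <- pcr(Salary ~ ., data = d, scale = TRUE, validation = "CV")    # principal components regression
pls.fit <- plsr(Salary ~ ., data = d, scale = TRUE, validation = "CV")   # partial least squares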
Subset Selection
Best subset selection treats every non-empty subset of the full set of predictors as a candidate and uses a criterion to pick the best one; with p predictors there are 2^p regression models to assess (see the toy example below), so the cost becomes prohibitive even for moderately large p.
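To make the 2^p blow-up concrete, here is a toy exhaustive search (a hypothetical illustration on the built-in mtcars data, not from the book); even p = 3 already yields 2^3 - 1 = 7 non-empty candidate subsets:

data(mtcars)
vars <- c("wt", "hp", "disp")
# enumerate every non-empty subset of the predictors
subsets <- unlist(lapply(seq_along(vars),
                         function(k) combn(vars, k, simplify = FALSE)),
                  recursive = FALSE)
length(subsets)   # 7 candidate models for p = 3
fits <- lapply(subsets, function(v) lm(reformulate(v, response = "mpg"), data = mtcars))
sapply(fits, function(f) summary(f)$adj.r.squared)   # score each candidate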
Stepwise selection examines the predictors one at a time, adding good ones and dropping poor ones; the main variants are forward, backward, and hybrid stepwise selection (a sketch with base R's step() follows). Forward and backward selection differ mainly in whether the starting model is the null model or the full model, which gives them different ranges of applicability: when the predictor dimension is high (more predictors than samples, p > n), forward selection still works, while backward selection does not, because its first step is to fit the full model, which is only well defined when the sample size exceeds the number of predictors. Forward selection only ever adds variables and backward selection only ever removes them, while hybrid selection both adds and removes; since a newly added variable may make an already selected variable no longer significant, the hybrid method is generally preferred.
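Base R's step() function implements all three variants (scored by AIC); a minimal sketch on the Hitters data used later in this post:

library(ISLR)
d <- na.omit(Hitters)
null <- lm(Salary ~ 1, data = d)    # the empty model
full <- lm(Salary ~ ., data = d)    # the full model
fwd  <- step(null, scope = formula(full), direction = "forward",  trace = 0)  # add only
bwd  <- step(full, direction = "backward", trace = 0)                         # remove only
both <- step(null, scope = formula(full), direction = "both", trace = 0)      # hybrid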
Model Selection Criteria
Common measures of fit include the residual sum of squares (SSE) and the multiple correlation coefficient R. But as the predictor subset grows, the residual sum of squares can only decrease and R² can only increase, creating the illusion that the more predictors you select, the better the fit. So the residual sum of squares, the multiple correlation coefficient, and R² cannot serve as criteria for selecting predictors.
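A quick simulation (a hypothetical illustration) shows the problem: R² never decreases as predictors are added, even when every added predictor is pure noise, while the adjusted version can go down.

set.seed(1)
n <- 50
d <- data.frame(y = rnorm(n), x1 = rnorm(n),
                matrix(rnorm(n * 5), n, 5))   # x1 plus five pure-noise predictors
fits <- lapply(1:6, function(p) lm(y ~ ., data = d[, 1:(p + 1), drop = FALSE]))
sapply(fits, function(f) summary(f)$r.squared)       # monotonically non-decreasing
sapply(fits, function(f) summary(f)$adj.r.squared)   # can decrease as junk is added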
Below are several common model-selection criteria, approached from different angles (a short computational sketch follows the list):
(1) Fit: the adjusted R², Ra² = 1 - (1 - R²)(n - 1)/(n - p - 1); larger values indicate a better model.
(2) Maximum likelihood: the AIC criterion; smaller AIC indicates a better model.
(3) Prediction: Mallows' Cp statistic; smaller Cp indicates a better model.
(4) Test error: split the samples into a training set and a test set; the model with the smallest validation-set or cross-validation test error is best.
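These criteria can be computed directly for any single candidate model; a minimal sketch, assuming the Hitters data used in the lab below, with Cp taken in the book's form Cp = (RSS + 2 d σ̂²)/n, where d is the number of predictors and σ̂² is estimated from the full model:

library(ISLR)
d <- na.omit(Hitters)
fit <- lm(Salary ~ AtBat + Hits + Walks + CRBI + Division + PutOuts, data = d)
summary(fit)$adj.r.squared   # (1) adjusted R^2: larger is better
AIC(fit)                     # (2) AIC: smaller is better
BIC(fit)                     #     BIC: smaller is better
sigma2 <- summary(lm(Salary ~ ., data = d))$sigma^2   # sigma^2 from the full model
(sum(resid(fit)^2) + 2 * 6 * sigma2) / nrow(d)        # (3) Cp with d = 6 predictors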
Model selection based on the adjusted R², Cp, and AIC/BIC criteria
regsubsets(formula, data, nvmax=8, method=c('exhaustive','backward','forward','seqrep'))
formula: the linear regression formula relating the response Y to the predictors X
nvmax: the maximum number of predictors allowed in a subset
na.omit(data): drops the samples that contain NA
plot(x, scale=c('r2','adjr2','Cp','bic'))    -----plot.regsubsets
x: an object of class regsubsets
scale: the statistic (criterion) by which the plot is ordered
> library(ISLR)
> library(leaps)
> names(Hitters)
 [1] "AtBat"     "Hits"      "HmRun"     "Runs"      "RBI"       "Walks"    
 [7] "Years"     "CAtBat"    "CHits"     "CHmRun"    "CRuns"     "CRBI"     
[13] "CWalks"    "League"    "Division"  "PutOuts"   "Assists"   "Errors"   
[19] "Salary"    "NewLeague"
> dim(Hitters)
[1] 322  20
> # drop the samples that contain NA
> data<-na.omit(Hitters)
> dim(data)
[1] 263  20
> # best subset selection
> regfit.full<-regsubsets(Salary~.,data,nvmax=15)
> summary(regfit.full)
Subset selection object
Call: regsubsets.formula(Salary ~ ., data, nvmax = 15)
19 Variables  (and intercept)
           Forced in Forced out
AtBat          FALSE      FALSE
Hits           FALSE      FALSE
HmRun          FALSE      FALSE
Runs           FALSE      FALSE
RBI            FALSE      FALSE
Walks          FALSE      FALSE
Years          FALSE      FALSE
CAtBat         FALSE      FALSE
CHits          FALSE      FALSE
CHmRun         FALSE      FALSE
CRuns          FALSE      FALSE
CRBI           FALSE      FALSE
CWalks         FALSE      FALSE
LeagueN        FALSE      FALSE
DivisionW      FALSE      FALSE
PutOuts        FALSE      FALSE
Assists        FALSE      FALSE
Errors         FALSE      FALSE
NewLeagueN     FALSE      FALSE
1 subsets of each size up to 15
Selection Algorithm: exhaustive
          AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits CHmRun CRuns CRBI CWalks LeagueN DivisionW PutOuts Assists Errors NewLeagueN
1  ( 1 )  " "   " "  " "   " "  " " " "   " "   " "    " "   " "    " "   "*"  " "    " "     " "       " "     " "     " "    " "
2  ( 1 )  " "   "*"  " "   " "  " " " "   " "   " "    " "   " "    " "   "*"  " "    " "     " "       " "     " "     " "    " "
3  ( 1 )  " "   "*"  " "   " "  " " " "   " "   " "    " "   " "    " "   "*"  " "    " "     " "       "*"     " "     " "    " "
4  ( 1 )  " "   "*"  " "   " "  " " " "   " "   " "    " "   " "    " "   "*"  " "    " "     "*"       "*"     " "     " "    " "
5  ( 1 )  "*"   "*"  " "   " "  " " " "   " "   " "    " "   " "    " "   "*"  " "    " "     "*"       "*"     " "     " "    " "
6  ( 1 )  "*"   "*"  " "   " "  " " "*"   " "   " "    " "   " "    " "   "*"  " "    " "     "*"       "*"     " "     " "    " "
7  ( 1 )  " "   "*"  " "   " "  " " "*"   " "   "*"    "*"   "*"    " "   " "  " "    " "     "*"       "*"     " "     " "    " "
8  ( 1 )  "*"   "*"  " "   " "  " " "*"   " "   " "    " "   "*"    "*"   " "  "*"    " "     "*"       "*"     " "     " "    " "
9  ( 1 )  "*"   "*"  " "   " "  " " "*"   " "   "*"    " "   " "    "*"   "*"  "*"    " "     "*"       "*"     " "     " "    " "
10 ( 1 )  "*"   "*"  " "   " "  " " "*"   " "   "*"    " "   " "    "*"   "*"  "*"    " "     "*"       "*"     "*"     " "    " "
11 ( 1 )  "*"   "*"  " "   " "  " " "*"   " "   "*"    " "   " "    "*"   "*"  "*"    "*"     "*"       "*"     "*"     " "    " "
12 ( 1 )  "*"   "*"  " "   "*"  " " "*"   " "   "*"    " "   " "    "*"   "*"  "*"    "*"     "*"       "*"     "*"     " "    " "
13 ( 1 )  "*"   "*"  " "   "*"  " " "*"   " "   "*"    " "   " "    "*"   "*"  "*"    "*"     "*"       "*"     "*"     "*"    " "
14 ( 1 )  "*"   "*"  "*"   "*"  " " "*"   " "   "*"    " "   " "    "*"   "*"  "*"    "*"     "*"       "*"     "*"     "*"    " "
15 ( 1 )  "*"   "*"  "*"   "*"  " " "*"   " "   "*"    "*"   " "    "*"   "*"  "*"    "*"     "*"       "*"     "*"     "*"    " "
> # pick the best predictor subset
> regsumm<-summary(regfit.full)
> names(regsumm)
[1] "which"  "rsq"    "rss"    "adjr2"  "cp"     "bic"    "outmat" "obj"
> regsumm$rsq
 [1] 0.3214501 0.4252237 0.4514294 0.4754067 0.4908036 0.5087146 0.5141227 0.5285569 0.5346124 0.5404950 0.5426153 0.5436302 0.5444570
[14] 0.5452164 0.5454692
> regsumm$adjr2
 [1] 0.3188503 0.4208024 0.4450753 0.4672734 0.4808971 0.4972001 0.5007849 0.5137083 0.5180572 0.5222606 0.5225706 0.5217245 0.5206736
[14] 0.5195431 0.5178661
> regsumm$cp
 [1] 104.281319  50.723090  38.693127  27.856220  21.613011  14.023870
 [7]  13.128474   7.400719   6.158685   5.009317   5.874113   7.330766
[13]   8.888112  10.481576  12.346193
> which.min(regsumm$cp)
[1] 10
> regsumm$bic
 [1]  -90.84637 -128.92622 -135.62693 -141.80892 -144.07143 -147.91690 -145.25594 -147.61525 -145.44316 -143.21651 -138.86077
[12] -133.87283 -128.77759 -123.64420 -118.21832
> which.max(regsumm$rsq)
[1] 15
> which.max(regsumm$adjr2)
[1] 11
> which.min(regsumm$bic)
[1] 6
> # plots ordered by each criterion
> plot(regfit.full,scale='r2')
> plot(regfit.full,scale='adjr2')
> plot(regfit.full,scale='Cp')
> plot(regfit.full,scale='bic')
> # regression coefficients of the best (6-variable, by BIC) model
> coef(regfit.full,6)
 (Intercept)        AtBat         Hits        Walks         CRBI    DivisionW      PutOuts 
  91.5117981   -1.8685892    7.6043976    3.6976468    0.6430169 -122.9515338    0.2643076 
> # forward stepwise selection
> regfit.fwd<-regsubsets(Salary~.,data,nvmax=15,method='forward')
> # backward stepwise selection
> regfit.bwd<-regsubsets(Salary~.,data,nvmax=15,method='backward')
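Forward and backward stepwise can disagree with best subset selection (and with each other) at the same subset size; a quick check is to compare their coefficients at a fixed size, say 7 variables (output omitted here):

coef(regfit.full,7)
coef(regfit.fwd,7)
coef(regfit.bwd,7)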
Model selection based on the validation-set and cross-validation criteria
Validation Set Approach
First, run the variable-selection procedure on the training samples to obtain the best predictor set of each size, i.e. a sequence of candidate models. Then compute each model's MSE on the test samples and pick the model with the smallest MSE, which identifies the best predictor set. Finally, rerun the variable-selection procedure on the full sample to obtain the regression coefficients of the chosen model (the coefficients of the final model must be estimated on the full sample).
> library(ISLR)
> library(leaps)
> Hitters<-na.omit(Hitters)
> set.seed(1)
> train<-sample(c(T,F),nrow(Hitters),rep=T)
> test<-!train
> # subset selection on the training samples only
> regfit.best<-regsubsets(Salary~.,data=Hitters[train,],nvmax=19)
> # build the X matrix of the test samples
> test.mat<-model.matrix(Salary~.,data=Hitters[test,])
> # compute the test-sample MSE of each candidate model
> val.errors<-rep(NA,19)
> for(i in 1:19){
+   coefi<-coef(regfit.best,i)
+   pred<-test.mat[,names(coefi)]%*%coefi
+   val.errors[i]<-mean((Hitters$Salary[test]-pred)^2)
+ }
> which.min(val.errors)
[1] 10
> # once the best model size is known, refit on the full sample for the coefficients
> regfit.best<-regsubsets(Salary~.,data=Hitters,nvmax=19)
> coef(regfit.best,10)
 (Intercept)        AtBat         Hits        Walks       CAtBat        CRuns         CRBI       CWalks    DivisionW      PutOuts      Assists 
 162.5354420   -2.1686501    6.9180175    5.7732246   -0.1300798    1.4082490    0.7743122   -0.8308264 -112.3800575    0.2973726    0.2831680 
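The prediction step inside the loop can be factored into a reusable helper; the book's lab defines a predict method for regsubsets objects along these lines:

predict.regsubsets <- function(object, newdata, id, ...) {
  form <- as.formula(object$call[[2]])   # recover the formula from the regsubsets call
  mat <- model.matrix(form, newdata)     # design matrix for the new data
  coefi <- coef(object, id = id)         # coefficients of the id-variable model
  mat[, names(coefi)] %*% coefi
}
# usage: pred <- predict(regfit.best, Hitters[test,], id = 10)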
K-Fold Cross-Validation
> # 10-fold cross-validation
> k<-10
> set.seed(2)
> # assign each sample to one of the 10 folds
> folds<-sample(1:k,nrow(Hitters),replace=T)
> # matrix to store the 10-fold cross-validation MSEs
> cv.errors<-matrix(NA,k,19,dimnames=list(NULL,paste(1:19)))
> for(i in 1:k){
+   best.fit<-regsubsets(Salary~.,data=Hitters[folds!=i,],nvmax=19)
+   test.mat<-model.matrix(Salary~.,data=Hitters[folds==i,])
+   for(j in 1:19){
+     coefi<-coef(best.fit,j)
+     pred<-test.mat[,names(coefi)]%*%coefi
+     cv.errors[i,j]<-mean((Hitters$Salary[folds==i]-pred)^2)
+   }
+ }
> # average cross-validated MSE of the model with j predictors
> mean.cv.errors<-apply(cv.errors,2,mean)
> which.min(mean.cv.errors)
14 
14 
> # refit on the full sample to get the coefficients of the best model
> reg.best<-regsubsets(Salary~.,data=Hitters,nvmax=19)
> coef(reg.best,14)
 (Intercept)        AtBat         Hits        HmRun         Runs        Walks       CAtBat        CRuns         CRBI       CWalks      LeagueN 
 144.6793182   -2.0883279    7.6436454    2.3406524   -2.3580478    6.1794713   -0.1488074    1.5931621    0.7170767   -0.8565844   44.2352269 
   DivisionW      PutOuts      Assists       Errors 
-112.8079905    0.2876182    0.3677311   -3.1271251 
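A common follow-up, as in the book's lab, is to plot the mean cross-validated MSE against model size to see how flat the curve is around its minimum:

plot(mean.cv.errors, type = 'b', xlab = 'Number of predictors', ylab = 'Mean CV MSE')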