R語言實戰讀書筆記(十三)廣義線性模型

本文轉載自查看原文 2015-12-04 16:44 5345 讀書筆記/ R語言/ R語言實戰/ 廣義線性模型

# 婚外情數據集
data(Affairs, package = "AER")
summary(Affairs)
table(Affairs$affairs)

# 用二值變量，是或否
Affairs$ynaffair[Affairs$affairs > 0] <- 1
Affairs$ynaffair[Affairs$affairs == 0] <- 0
Affairs$ynaffair <- factor(Affairs$ynaffair, levels = c(0, 1), labels = c("No", "Yes"))
table(Affairs$ynaffair)

#用glm
fit.full <- glm(ynaffair ~ gender + age + yearsmarried + children + religiousness + education + occupation + rating, data = Affairs, family = binomial())
summary(fit.full)

Deviance Residuals:
Min 1Q Median 3Q Max
-1.5713 -0.7499 -0.5690 -0.2539 2.5191

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.37726 0.88776 1.551 0.120807
gendermale 0.28029 0.23909 1.172 0.241083
age -0.04426 0.01825 -2.425 0.015301 *
yearsmarried 0.09477 0.03221 2.942 0.003262 **
childrenyes 0.39767 0.29151 1.364 0.172508
religiousness -0.32472 0.08975 -3.618 0.000297 ***
education 0.02105 0.05051 0.417 0.676851
occupation 0.03092 0.07178 0.431 0.666630
rating -0.46845 0.09091 -5.153 2.56e-07 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 675.38 on 600 degrees of freedom
Residual deviance: 609.51 on 592 degrees of freedom
AIC: 627.51

Number of Fisher Scoring iterations: 4

#結果是有些變量不顯著把它們都去掉

fit.reduced <- glm(ynaffair ~ age + yearsmarried + 
religiousness + rating, data = Affairs, family = binomial())
summary(fit.reduced)

Call:
glm(formula = ynaffair ~ age + yearsmarried + religiousness +
rating, family = binomial(), data = Affairs)

Deviance Residuals:
Min 1Q Median 3Q Max
-1.6278 -0.7550 -0.5701 -0.2624 2.3998

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.93083 0.61032 3.164 0.001558 **
age -0.03527 0.01736 -2.032 0.042127 *
yearsmarried 0.10062 0.02921 3.445 0.000571 ***
religiousness -0.32902 0.08945 -3.678 0.000235 ***
rating -0.46136 0.08884 -5.193 2.06e-07 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 675.38 on 600 degrees of freedom
Residual deviance: 615.36 on 596 degrees of freedom
AIC: 625.36

Number of Fisher Scoring iterations: 4

#如上，發現每個系數都非常顯著。這是兩模型嵌套，可以使用anova進行比較，對於廣義線性回歸，可用卡方檢驗。

anova(fit.reduced, fit.full, test = "Chisq")

Analysis of Deviance Table

Model 1: ynaffair ~ age + yearsmarried + religiousness + rating
Model 2: ynaffair ~ gender + age + yearsmarried + children + religiousness +
education + occupation + rating
Resid. Df Resid. Dev Df Deviance Pr(>Chi)
1 596 615.36
2 592 609.51 4 5.8474 0.2108

#卡方值不顯著(0.21)，表明新模型與包含了全部變量的模型擬合程度一樣好。這可以確認添4個被去掉的變量不會顯著提高方程的預測精度，可以依據更簡單的模型進行解釋。

13.2.1 解釋模型參數

看回歸系數：當其他預測變量不變時，一單位預測變量的變化可引起的響應變量對數優勢比的變化。

coef(fit.reduced)

(Intercept) 　　　　age 　　　　　　yearsmarried 　　　religiousness 　　rating
1.93083017 　　 -0.03527112 　　 0.10062274 　　 -0.32902386 　　 -0.46136144

由於對數優勢比解釋性差，可對結果進行指數化

exp(coef(fit.reduced))

(Intercept) age 　　　　yearsmarried 　　religiousness 　　rating
6.8952321 0.9653437 1.1058594 　　 0.7196258 　　　 0.6304248

可以看到婚齡增加一年，婚外情的優勢比將乘以1.106（其他不變），年齡增加一歲，婚外情優勢比乘以0.965。

因此，隨着婚齡增加和年齡、宗教信仰與婚姻評分的降低，婚外情優勢比將上升。

因為預測變量不能等於0，截距項在此處沒有特定含義。

還可以使用confint函數獲取置信區間。

exp(confint(fit.reduced))

Waiting for profiling to be done...
2.5 % 97.5 %
(Intercept) 　　 2.1255764 　　23.3506030
age 　　　　　 0.9323342 　　0.9981470
yearsmarried 　 1.0448584 　　1.1718250
religiousness 　 0.6026782 　　0.8562807
rating 　　　　 0.5286586 　　0.7493370

13.2.2 評價預測變量對結果概率的影響

testdata <- data.frame(rating = c(1, 2, 3, 4, 5), 
    age = mean(Affairs$age), yearsmarried = mean(Affairs$yearsmarried), 
    religiousness = mean(Affairs$religiousness))
testdata$prob <- predict(fit.reduced, newdata = testdata, 
    type = "response")
testdata

rating 　　age 　　　　yearsmarried 　　　religiousness 　　　　prob
1 1 　　32.48752 　　 8.177696 　　　　 3.116473 　　　　0.5302296
2 2 　　32.48752 　　 8.177696 　　　　 3.116473 　　　　0.4157377
3 3 　　32.48752 　　 8.177696 　　　　 3.116473 　　　　0.3096712
4 4 　　32.48752 　　 8.177696 　　　　 3.116473 　　　　0.2204547
5 5 　　32.48752 　　 8.177696 　　　　 3.116473 　　　　0.1513079

從結果看出，當婚姻評分從1（很不幸福）變成5（非常幸福），婚外情概率從0.53降低到0.15（假定年齡等因素不變），看看年齡影響

testdata <- data.frame(rating = mean(Affairs$rating), 
    age = seq(17, 57, 10), yearsmarried = mean(Affairs$yearsmarried), 
    religiousness = mean(Affairs$religiousness))
testdata$prob <- predict(fit.reduced, newdata = testdata, 
    type = "response")
testdata

rating 　　age 　　　　yearsmarried 　　religiousness 　　prob
1 　　　　3.93178 17 8.177696 　　　 3.116473 　　　　0.3350834
2 　　　　3.93178 27 8.177696 　　　　3.116473 　　　　0.2615373
3 　　　　3.93178 37 8.177696 　　　　3.116473 　　　　0.1992953
4 　　　　3.93178 47 8.177696 　　　　3.116473 　　　　0.1488796
5 　　　　3.93178 57 8.177696 　　　　3.116473 　　　　0.1094738

從結果看出，當其他變量不變，年齡從17增加到57時，婚外情概率將從0.34降低到0.11，利用該方法，可探究每一個預測變量對結果概率的影響。

13.2.3 過度離勢

過度離勢：觀測到的響應變量方差大於期望的二項分布的方式。這會導致奇異的標准誤檢驗和不精確的顯著性檢驗。

若出現過度離勢，還可以用glm擬合Logistic回歸，但此時需要將二項分布改為類二項分布。

檢查方法：比較二項分布模型和殘差偏差與殘差自由度，如果比值比1大很多，便可認為存在過度離勢。

本例中，比值=615.36/596=1.02，沒有過度離勢。

還可以對過度離勢進行檢驗。為此，需要擬合模型兩次，第一次使用family=binomial，第二次使用family=quasibinomial，假設第一次glm返回對象記為fit，第二次返回對象記為fit.od，用pchisq，提供的p值可對零假設（比值為1）與備擇假設（比值不等於1）進行檢驗。

fit <- glm(ynaffair ~ age + yearsmarried + religiousness + 
    rating, family = binomial(), data = Affairs)
fit.od <- glm(ynaffair ~ age + yearsmarried + religiousness + 
    rating, family = quasibinomial(), data = Affairs)
pchisq(summary(fit.od)$dispersion * fit$df.residual, 
    fit$df.residual, lower = F)

結果值為0.34不顯著，因此不存在過度離勢。

13.3 泊松回歸

data(breslow.dat, package = "robust")
names(breslow.dat)
summary(breslow.dat[c(6, 7, 8, 10)])

opar <- par(no.readonly = TRUE)
par(mfrow = c(1, 2))
attach(breslow.dat)
hist(sumY, breaks = 20, xlab = "Seizure Count", main = "Distribution of Seizures")
boxplot(sumY ~ Trt, xlab = "Treatment", main = "Group Comparisons")
par(opar)

從圖上可以看到因變量的偏倚特性及可能的離群點。葯物治療下發病數變小，方差也變小。與標准最小二乘回歸不同，泊松回歸並不關注方差異質量性。

fit <- glm(sumY ~ Base + Age + Trt, data = breslow.dat, 
    family = poisson())
summary(fit)

結果非常顯著

13.3.1 解釋模型參數

coef(fit)
exp(coef(fit))

(Intercept) 　　Base 　　　　　　Age 　　　　　　Trtprogabide
1.94882593 　　 0.02265174 　　 0.02274013 　　 -0.15270095

年齡回歸參數為0.0227，表明保持其他預測變量不變，年齡增加一歲，癲閑發病數的對數均值將增加0.03。

指數化系數中，年齡增加一歲，發病數乘以1.023。Trt變化，發病數*0.86。也就是說，保持發病數和年齡不變，服葯組相對於安慰劑組發病數降低了20%。

13.3.2 過度離勢

比值=559.44/55=10.17，大於1，說明存在過度離勢。

也可以用以下方法檢驗過度離勢。

library(qcc)
qcc.overdispersion.test(breslow.dat$sumY, type = "poisson")

P值為0，存在過度離勢。

# fit model with quasipoisson
fit.od <- glm(sumY ~ Base + Age + Trt, data = breslow.dat,
family = quasipoisson())
summary(fit.od)

過度離勢的原因：

1.遺漏了某個重要的預測變量

2.可能因為事件相關。泊松分布中，每個事件都被認為是獨立發生的。就是對隨便一位病人，每一次發病概率都認為是相互獨立，但這並不成立。一個病人發生了40次病，第一次概率與第40次概率肯定不相同。

3.在縱向數據分析中，重復測量的數據由於內存群聚特性可導致過度離勢。這里不討論。

免責聲明！

本站轉載的文章為個人學習借鑒使用，本站對版權不負任何法律責任。如果侵犯了您的隱私權益，請聯系本站郵箱yoyou2525@163.com刪除。

猜您在找 [讀書筆記] R語言實戰（十三）廣義線性模型 R語言實戰（八）廣義線性模型 R語言-廣義線性模型 R語言實戰讀書筆記(八)回歸廣義線性模型廣義線性模型廣義線性模型《Go語言實戰》讀書筆記廣義線性模型 R--glm函數廣義線性模型（GLM）