Abstract: notes kept purely to record my process of learning R.
Contents:
Descriptive statistics; t-tests; data transformation; analysis of variance; chi-square tests; regression analysis and model diagnostics; survival analysis; Cox regression
A note before the main text: this is the final installment on the basics. The notes are taken from a 醫學方 course and are intended only as a record of learning R.
Main text:
Descriptive statistics
- How to generate a "Table 1": use table() to summarise frequencies quickly
  - Build a two-way contingency table: table(row variable, column variable)
> table(tips$sex,tips$smoker)
No Yes
Female 54 33
Male 97 60
  - addmargins(): compute marginal totals for a table built with table()
> table(esoph$agegp,esoph$ncases)
0 1 2 3 4 5 6 8 9 17
25-34 14 1 0 0 0 0 0 0 0 0
35-44 10 2 2 1 0 0 0 0 0 0
45-54 3 2 2 2 3 2 2 0 0 0
55-64 0 0 2 4 3 2 2 1 2 0
65-74 1 4 2 2 2 2 1 0 0 1
75+ 1 7 3 0 0 0 0 0 0 0
> tt <- table(esoph$agegp,esoph$ncases)
> addmargins(tt,margin = c(1,2)) # margin: 1 = rows, 2 = columns
0 1 2 3 4 5 6 8 9 17 Sum
25-34 14 1 0 0 0 0 0 0 0 0 15
35-44 10 2 2 1 0 0 0 0 0 0 15
45-54 3 2 2 2 3 2 2 0 0 0 16
55-64 0 0 2 4 3 2 2 1 2 0 16
65-74 1 4 2 2 2 2 1 0 0 1 15
75+ 1 7 3 0 0 0 0 0 0 0 11
Sum 29 16 11 9 8 6 5 1 2 1 88
- xtabs(): build tables from a data set, optionally restricted to a subset
> hightip <- tips[,'tip'] > mean(tips[,'tip'] )
> as.data.frame(xtabs(~tips$sex + hightip,subset = tips$smoker=='No'))
# as.data.frame() converts the result into a data frame; the variables after ~ play the same role as the arguments of table(), i.e. the classifying variables; a numeric count variable can be placed on the left of ~ to be summed within cells (see example 2); the subset argument restricts which rows are used
tips.sex hightip Freq
1 Female FALSE 31
2 Male FALSE 49
3 Female TRUE 23
4 Male TRUE 48
> as.data.frame(xtabs(~tips$sex + hightip,subset = tips$day %in% c('Sun','Sat')))
tips.sex hightip Freq
1 Female FALSE 21
2 Male FALSE 53
3 Female TRUE 25
4 Male TRUE 64
Example 2:
> xtabs(ncontrols~agegp + alcgp,data = esoph)
alcgp
agegp 0-39g/day 40-79 80-119 120+
25-34 61 45 5 5
35-44 89 80 20 10
45-54 78 81 39 15
55-64 89 84 43 26
65-74 71 53 29 8
75+ 27 12 2 3
> addmargins(xtabs(ncontrols~agegp + alcgp,data = esoph),margin = c(1,2))
alcgp
agegp 0-39g/day 40-79 80-119 120+ Sum
25-34 61 45 5 5 116
35-44 89 80 20 10 199
45-54 78 81 39 15 213
55-64 89 84 43 26 242
65-74 71 53 29 8 161
75+ 27 12 2 3 44
Sum 415 355 138 67 975
# cbind(numeric, numeric) on the left of ~ tabulates both count variables at once
> xtabs(cbind(ncases,ncontrols)~agegp + alcgp,data = esoph)
, , = ncases
alcgp
agegp 0-39g/day 40-79 80-119 120+
25-34 0 0 0 1
35-44 1 4 0 4
45-54 1 20 12 13
55-64 12 22 24 18
65-74 11 25 13 6
75+ 4 4 2 3
, , = ncontrols
alcgp
agegp 0-39g/day 40-79 80-119 120+
25-34 61 45 5 5
35-44 89 80 20 10
45-54 78 81 39 15
55-64 89 84 43 26
65-74 71 53 29 8
75+ 27 12 2 3
- ftable(): flat contingency tables; it accepts an xtabs object and rearranges it
> ftable(xtabs(cbind(ncases,ncontrols)~agegp +alcgp,data = esoph))
ncases ncontrols
agegp alcgp
25-34 0-39g/day 0 61
40-79 0 45
80-119 0 5
120+ 1 5
35-44 0-39g/day 1 89
40-79 4 80
80-119 0 20
120+ 4 10
45-54 0-39g/day 1 78
40-79 20 81
80-119 12 39
120+ 13 15
55-64 0-39g/day 12 89
40-79 22 84
80-119 24 43
120+ 18 26
65-74 0-39g/day 11 71
40-79 25 53
80-119 13 29
120+ 6 8
75+ 0-39g/day 4 27
40-79 4 12
80-119 2 2
120+ 3 3
- Data summaries can also be produced with functions covered earlier (see the sketch after this list), e.g.:
  - summary(data)
  - describe(data) (from the psych package)
  - describe() for data frames (from the Hmisc package)
  - the apply() family, etc.
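A minimal sketch of these summary functions, assuming the tips data set comes from the reshape2 package (as used above) and that the psych and Hmisc packages are installed:
library(reshape2)                                # provides the tips data set
data(tips)
summary(tips)                                    # base-R summary of every column
psych::describe(tips[, c('total_bill','tip')])   # mean, sd, skew, kurtosis, ...
Hmisc::describe(tips)                            # frequencies and quantiles per variable
sapply(tips[, c('total_bill','tip')], mean)      # apply-family example: column means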
t-tests
- Assumptions: the data are normally distributed and the group variances are homogeneous (if they are not, the t' test, i.e. Welch's test, can be used).
- Examples:
  - Generate random data and test it for normality with shapiro.test():
> data1 <- sample(1:100,50)
>
> #檢驗正態性 shapiro.test()函數
> shapiro.test(data1)
Shapiro-Wilk normality test
data: data1
W = 0.96424, p-value = 0.1338 # cannot reject the null hypothesis: the data are consistent with a normal distribution
Besides shapiro.test(), the following five functions (from the nortest package) also test for normality:
library(nortest)
lillie.test(data1)
ad.test(data1)
cvm.test(data1)
pearson.test(data1)
sf.test(data1)
For example:
> lillie.test(data1)
Lilliefors (Kolmogorov-Smirnov) normality test
data: data1
D = 0.064492, p-value = 0.8693
> ad.test(data1)
Anderson-Darling normality test
data: data1
A = 0.40918, p-value = 0.333
> cvm.test(data1)
Cramer-von Mises normality test
data: data1
W = 0.054212, p-value = 0.4437
> pearson.test(data1)
Pearson chi-square normality test
data: data1
P = 4, p-value = 0.7798
> sf.test(data1)
Shapiro-Francia normality test
data: data1
W = 0.97496, p-value = 0.3075
  - Generate normally distributed data with rnorm():
> # generate normally distributed data
> data3 <- rnorm(100,3,5)
> data4 <- rnorm(200,3.4,8)
>
> shapiro.test(data3)
Shapiro-Wilk normality test
data: data3
W = 0.99369, p-value = 0.9258
>
> shapiro.test(data4)
Shapiro-Wilk normality test
data: data4
W = 0.98946, p-value = 0.149
  - Test homogeneity of variances with var.test() (only for two groups):
> var.test(data3,data4)
F test to compare two variances
data: data3 and data4
F = 0.40348, num df = 99, denom df = 199, p-value = 1.088e-06
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
0.2893355 0.5739700
sample estimates:
ratio of variances
0.4034794 # p < 0.05, so the variances are not homogeneous
  - Run t-tests with t.test():
  - Two-sample t-test of two means:
> t.test(data3,data4,var.equal = F) # with unequal variances set var.equal = FALSE (the default); Welch's t' test is then used instead of the ordinary t-test
Welch Two Sample t-test
data: data3 and data4
t = -1.3681, df = 281.41, p-value = 0.1724 # no statistically significant difference between the two means
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-2.587004 0.465494
sample estimates:
mean of x mean of y
2.468549 3.529304
  - One-sample t-test: sample mean vs. a population mean
> # compare the sample mean with the population mean
> t.test(data3,mu =3.2)
One Sample t-test
data: data3
t = -1.4117, df = 99, p-value = 0.1612 # no statistically significant difference between the sample and the population mean
alternative hypothesis: true mean is not equal to 3.2
95 percent confidence interval:
1.440424 3.496675
sample estimates:
mean of x
2.468549
  - Paired t-test:
data3 <- rnorm(200,3,5)
> data4 <- rnorm(200,3.4,5)
>
> # paired data
> # run a paired t-test
> t.test(data3,data4,paired = TRUE)
Paired t-test
data: data3 and data4
t = 0.59079, df = 199, p-value = 0.5553 # p > 0.05, no significant difference
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-0.6946177 1.2888581
sample estimates:
mean of the differences
0.2971202
Data transformation
- What to do when the data are not normally distributed: transformations such as log(), sqrt(), or 1/x sometimes help.
- runif() generates uniformly distributed data:
> mydata <- runif(100,min = 1,max = 2)
> shapiro.test(mydata)
Shapiro-Wilk normality test
data: mydata
W = 0.93149, p-value = 6.05e-05
> shapiro.test(log(mydata))
Shapiro-Wilk normality test
data: log(mydata)
W = 0.93082, p-value = 5.54e-05
> shapiro.test(sqrt(mydata))
Shapiro-Wilk normality test
data: sqrt(mydata)
W = 0.93236, p-value = 6.794e-05
> shapiro.test(1/mydata)
Shapiro-Wilk normality test
data: 1/mydata
W = 0.9195, p-value = 1.325e-05
# none of these transformations works here.
- Box-Cox transformation on a model formula: load the MASS package and use boxcox(); the key is to find the lambda value and then transform accordingly.
library(MASS)
bc <- boxcox(Volume ~ log(Height) + log(Girth),data = trees,
lambda = seq(-0.25,0.25,length =10))
# Volume is the variable to transform; log(Height) and log(Girth) are the predictors used for the estimate; trees is the data set; the lambda argument gives the grid over which the optimum is searched
lambda <- bc$x[which.max(bc$y)] # the lambda that maximises the profile log-likelihood
  - Transform the data with (x^lambda - 1)/lambda (for lambda not equal to 0); the new Volume_bc is then approximately normal.
> Volume_bc <- (trees$Volume^lambda-1)/lambda
> shapiro.test(Volume_bc)
Shapiro-Wilk normality test
data: Volume_bc
W = 0.96431, p-value = 0.3776
  - Use qqnorm(Volume_bc) and qqline(Volume_bc) to inspect a Q-Q plot and check normality, as in the sketch below.
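A minimal sketch of the Q-Q check (assuming Volume_bc computed above):
qqnorm(Volume_bc) # quantile-quantile plot against the normal distribution
qqline(Volume_bc) # reference line; points lying close to it suggest approximate normality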
- Box-Cox transformation for a single numeric vector: load the forecast package and use BoxCox.lambda():
> library(forecast)
> BoxCox.lambda(trees$Volume)
[1] -0.4954451 # the estimated lambda; close to -0.5, so use 1/sqrt(x)
Rules of thumb:
  - lambda close to -1: use 1/x
  - lambda close to -0.5: use 1/sqrt(x)
  - lambda close to 0: use ln(x) / log(x)
  - lambda close to 0.5: use sqrt(x)
  - lambda close to 1: no transformation needed
Complete example:
> BoxCox.lambda(trees$Volume)
[1] -0.4954451
> new_volum <- 1/sqrt(trees$Volume)
> shapiro.test(new_volum)
Shapiro-Wilk normality test
data: new_volum
W = 0.94523, p-value = 0.1152
- Use powerTransform() from the car package to estimate lambda, then re-check normality:
> library(car)
> powerTransform(trees$Volume)
Estimated transformation parameter
trees$Volume
-0.07476608
> new_volum <- trees$Volume^-0.07476608
> shapiro.test(new_volum)
Shapiro-Wilk normality test
data: new_volum
W = 0.96428, p-value = 0.3768
Analysis of variance (ANOVA)
- Used to compare means across multiple groups. For two groups the t-test and ANOVA are equivalent, but for three or more groups the t-test does not apply and ANOVA is needed.
- Assumptions to satisfy:
  - Normality
  - Homogeneity of variance
  - Independence
- Testing homogeneity of variance
  - Install the multcomp package and load its cholesterol data set (it contains a response variable and a trt grouping variable).
  - Homogeneity tests:
    - Built into base R: bartlett.test()
> bartlett.test(response~trt,data = cholesterol)
Bartlett test of homogeneity of variances
data: response by trt
Bartlett's K-squared = 0.57975, df = 4, p-value =
0.9653
> # p = 0.9653, the variances are homogeneous
>
> # normality test
> shapiro.test(cholesterol$response)
Shapiro-Wilk normality test
data: cholesterol$response
W = 0.97722, p-value = 0.4417
    - Built into base R: fligner.test(); its test principle differs from bartlett.test(), but the formula interface is the same
> # homogeneity of variance test
> fligner.test(response~trt,data = cholesterol)
Fligner-Killeen test of homogeneity of variances
data: response by trt
Fligner-Killeen:med chi-squared = 0.74277, df = 4,
p-value = 0.946
    - ncvTest() from the car package:
> ncvTest(lm(response~trt,data = cholesterol))
Non-constant Variance Score Test
Variance formula: ~ fitted.values
Chisquare = 0.1338923 Df = 1 p = 0.71443
    - leveneTest() from the car package:
> leveneTest(response~trt,data = cholesterol)
Levene's Test for Homogeneity of Variance (center = median)
Df F value Pr(>F)
group 4 0.0755 0.9893
45
- Running the ANOVA
  - One-way ANOVA with aov():
> fit <- aov(response~trt,data = cholesterol)
> fit
Call:
aov(formula = response ~ trt, data = cholesterol)
Terms:
trt Residuals
Sum of Squares 1351.3690 468.7504
Deg. of Freedom 4 45
Residual standard error: 3.227488
Estimated effects may be unbalanced
> summary(fit)
Df Sum Sq Mean Sq F value Pr(>F)
trt 4 1351.4 337.8 32.43 9.82e-13 ***
Residuals 45 468.8 10.4
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
# visualise the group means with plotmeans() from the gplots package
library(gplots)
plotmeans(response~trt,data = cholesterol)
  - One-way ANOVA with oneway.test():
> oneway.test(response~trt,data = cholesterol)
One-way analysis of means (not assuming equal
variances)
data: response and trt
F = 30.709, num df = 4.00, denom df = 22.46, p-value =
8.227e-09
> oneway.test(response~trt,data = cholesterol,var.equal = T)
One-way analysis of means
data: response and trt
F = 32.433, num df = 4, denom df = 45, p-value =
9.819e-13
  - Pairwise comparisons with TukeyHSD():
> TukeyHSD(fit)
Tukey multiple comparisons of means
95% family-wise confidence level
Fit: aov(formula = response ~ trt, data = cholesterol)
$trt
diff lwr upr p adj
2times-1time 3.44300 -0.6582817 7.544282 0.1380949
4times-1time 6.59281 2.4915283 10.694092 0.0003542
drugD-1time 9.57920 5.4779183 13.680482 0.0000003
drugE-1time 15.16555 11.0642683 19.266832 0.0000000
4times-2times 3.14981 -0.9514717 7.251092 0.2050382
drugD-2times 6.13620 2.0349183 10.237482 0.0009611
drugE-2times 11.72255 7.6212683 15.823832 0.0000000
drugD-4times 2.98639 -1.1148917 7.087672 0.2512446
drugE-4times 8.57274 4.4714583 12.674022 0.0000037
drugE-drugD 5.58635 1.4850683 9.687632 0.0030633
# visualise
plot(TukeyHSD(fit))
  - Multi-factor (two-way) ANOVA with aov():
> data('ToothGrowth')
> head(ToothGrowth)
len supp dose
1 4.2 VC 0.5
2 11.5 VC 0.5
3 7.3 VC 0.5
4 5.8 VC 0.5
5 6.4 VC 0.5
6 10.0 VC 0.5
> levels(ToothGrowth$supp)
[1] "OJ" "VC"
> levels(ToothGrowth$dose)
NULL
> levels(as.factor(ToothGrowth$dose))
[1] "0.5" "1" "2"
> table(ToothGrowth$supp,ToothGrowth$dose)
0.5 1 2
OJ 10 10 10
VC 10 10 10
# formula syntax; pay attention to the interaction
> fangcha <- aov (len~supp+dose,data = ToothGrowth)
> summary(fangcha)
Df Sum Sq Mean Sq F value Pr(>F)
supp 1 205.4 205.4 11.45 0.0013 **
dose 1 2224.3 2224.3 123.99 6.31e-16 ***
Residuals 57 1022.6 17.9
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
# with the interaction term
> fangcha <- aov (len~supp*dose,data = ToothGrowth)
> summary(fangcha)
Df Sum Sq Mean Sq F value Pr(>F)
supp 1 205.4 205.4 12.317 0.000894 ***
dose 1 2224.3 2224.3 133.415 < 2e-16 ***
supp:dose 1 88.9 88.9 5.333 0.024631 * # the interaction must not be ignored
Residuals 56 933.6 16.7
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
# visualise with an interaction plot
>interaction.plot(ToothGrowth$dose,ToothGrowth$supp,ToothGrowth$len,type = 'b',
+ pch = c(1,10),col = c('red','green'))
  - One-way analysis of covariance (ANCOVA) with aov():
> data('litter')
> head(litter)
dose weight gesttime number
1 0 28.05 22.5 15
2 0 33.33 22.5 14
3 0 36.37 22.0 14
4 0 35.52 22.0 13
5 0 36.77 21.5 15
6 0 29.60 23.0 5
> facn <- aov(weight~gesttime + dose + gesttime:dose,data = litter)
> summary(facn)
Df Sum Sq Mean Sq F value Pr(>F)
gesttime 1 134.3 134.30 8.289 0.00537 ** # the covariate does have an effect; how do we remove it?
dose 3 137.1 45.71 2.821 0.04556 *
gesttime:dose 3 81.9 27.29 1.684 0.17889
Residuals 66 1069.4 16.20
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
  - Remove the influence of the covariate: load the effects package and use its effect() function to get adjusted means:
> effect('dose',facn)
NOTE: dose is not a high-order term in the model
dose effect
dose
0 5 50 500
32.30722 28.50625 30.61078 29.29005
# Formula expressions for common research designs (a reference sketch follows below)
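A reference sketch of standard aov() formula syntax for common designs (y is the outcome, A and B are factors, x is a covariate, Subject identifies the subjects):
y ~ A                     # one-way ANOVA
y ~ x + A                 # one-way ANCOVA with covariate x
y ~ A + B                 # two-way ANOVA, main effects only
y ~ A * B                 # two-way ANOVA with interaction (A + B + A:B)
y ~ B + A                 # randomised block design, B as the block factor
y ~ A + Error(Subject/A)  # one-way within-subjects (repeated measures) ANOVA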
Chi-square tests
- Tests for categorical variables.
- Types:
  - Goodness-of-fit test: chisq.test()
    (tests whether the sample distribution is consistent with a known population distribution)
> men <- c(11,120,60,45)
> women <- c(20,102,39,30)
> df <- as.data.frame(rbind(men,women))
> colnames(df) <- c('AB','O','A','B')
> chisq.test(men)
Chi-squared test for given probabilities
data: men
X-squared = 105.46, df = 3, p-value < 2.2e-16
> chisq.test(men,p = c(0.1,0.5,0.2,0.2)) # p gives the blood-type proportions in the general population
Chi-squared test for given probabilities
data: men
X-squared = 10.335, df = 3, p-value = 0.01592
  - Chi-square test of homogeneity: compares whether the category proportions are the same across the levels of another variable.
> chisq.test(df) # tests the association between the row and column variables
Pearson's Chi-squared test
data: df
X-squared = 6.8607, df = 3, p-value = 0.07647 # the blood-type distributions of men and women are consistent
  - Chi-square test of independence:
> chisq.test(df)
Pearson's Chi-squared test
data: df
X-squared = 6.8607, df = 3, p-value = 0.07647 # the row and column variables are not associated: blood type is independent of gender
  - CMH test: a stratified test over a three-dimensional table (unordered row variable, ordered column variable), used to check whether a stratifying factor acts as a confounder
# use mantelhaen.test()
> Rabbits <-
+ array(c(0,0,6,5,
+ 3,0,3,6,
+ 6,2,0,4,
+ 5,6,1,0,
+ 2,5,0,0),
+ dim = c(2,2,5),
+ dimnames = list(
+ Delay = c('None','1.5h'),
+ Response = c('Cured','Died'),
+ Penicillin.Level = c('1/8','1/4','1/2','1','4')))
> Rabbits
, , Penicillin.Level = 1/8
Response
Delay Cured Died
None 0 6
1.5h 0 5
, , Penicillin.Level = 1/4
Response
Delay Cured Died
None 3 3
1.5h 0 6
, , Penicillin.Level = 1/2
Response
Delay Cured Died
None 6 0
1.5h 2 4
, , Penicillin.Level = 1
Response
Delay Cured Died
None 5 1
1.5h 6 0
, , Penicillin.Level = 4
Response
Delay Cured Died
None 2 0
1.5h 5 0
> mantelhaen.test(Rabbits)
Mantel-Haenszel chi-squared test with continuity
correction
data: Rabbits
Mantel-Haenszel X-squared = 3.9286, df = 1, p-value =
0.04747
alternative hypothesis: true common odds ratio is not equal to 1
95 percent confidence interval:
1.026713 47.725133
sample estimates:
common odds ratio
7 # this lets us judge whether the penicillin level acts as a confounder
  - CMH test for ordered categorical (ordinal) data:
# again using mantelhaen.test()
> Satisfaction <-
+ as.table(array(c(1,2,0,0,3,3,1,2,
+ 11,17,8,4,2,3,5,2,
+ 1,0,0,0,1,3,0,1,
+ 2,5,7,9,1,1,3,6),
+ dim = c(4,4,2),
+ dimnames =
+ list(Income =
+ c('<5000','5000-15000',
+ '15000-25000','>25000'),
+ 'Job Satisfaction' =
+ c('V_D','L_S','M_S','V_S'),
+ Gender = c('Female','Male'))))
> Satisfaction
, , Gender = Female
Job Satisfaction
Income V_D L_S M_S V_S
<5000 1 3 11 2
5000-15000 2 3 17 3
15000-25000 0 1 8 5
>25000 0 2 4 2
, , Gender = Male
Job Satisfaction
Income V_D L_S M_S V_S
<5000 1 1 2 1
5000-15000 0 3 5 1
15000-25000 0 0 7 3
>25000 0 1 9 6
> mantelhaen.test(Satisfaction)
Cochran-Mantel-Haenszel test
data: Satisfaction
Cochran-Mantel-Haenszel M^2 = 10.2, df = 9, p-value =
0.3345 # income level is not associated with job satisfaction
  - Chi-square test for a paired 2x2 table: mcnemar.test()
> paired <- as.table(matrix(c(157,24,69,18),nrow = 2,dimnames = list(case =c('A','B'),control = c('A','B'))))
> paired
control
case A B
A 157 69
B 24 18
> mcnemar.test(paired)
McNemar's Chi-squared test with continuity correction
data: paired
McNemar's chi-squared = 20.817, df = 1, p-value =
5.053e-06
> chisq.test(paired)
Pearson's Chi-squared test with Yates' continuity
correction
data: paired
X-squared = 1.9244, df = 1, p-value = 0.1654
Regression analysis and model diagnostics
Prerequisites: check whether linear regression is appropriate and whether the normality assumption holds.
Linear regression only requires the dependent variable to be continuous; a categorical dependent variable calls for logistic regression. Whether the independent variables are continuous or categorical does not matter.
- Simple regression: lm(dependent variable ~ independent variable)
  - x as a continuous variable
> x<- seq(1,5,len =100)
> noise <- rnorm(n=100,mean = 0,sd = 1)
> beta0 = 1
> beta1 <-2
> y <- beta0 + beta1*x + noise
> plot(y~x) # check visually whether a linear fit looks reasonable
> model <- lm(y~x)
> summary(model)
Call:
lm(formula = y ~ x)
Residuals:
Min 1Q Median 3Q Max
-2.49732 -0.64794 0.00215 0.75355 3.06579
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.77938 0.28733 2.712 0.00789 **
x 2.05561 0.08927 23.027 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.041 on 98 degrees of freedom
Multiple R-squared: 0.844, Adjusted R-squared: 0.8424 # how much of the variance the model explains
F-statistic: 530.3 on 1 and 98 DF, p-value: < 2.2e-16 # overall significance of the model
  - x as a categorical variable
> x <- factor(rep(c(0,1,2),each = 20))
> y <- c(rnorm(20,0,1),rnorm(20,1,1),rnorm(20,2,1))
> model <- lm(y~x)
> summary(model)
Call:
lm(formula = y ~ x)
Residuals:
Min 1Q Median 3Q Max
-2.50566 -0.82826 0.01137 0.87966 2.41338
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.05338 0.24331 0.219 0.827
x1 0.81708 0.34409 2.375 0.021 *
x2 1.99903 0.34409 5.810 2.94e-07 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.088 on 57 degrees of freedom
Multiple R-squared: 0.3745, Adjusted R-squared: 0.3525
F-statistic: 17.06 on 2 and 57 DF, p-value: 1.558e-06
> exp( 1.99903) # note: exponentiating a coefficient gives an odds ratio only in logistic regression (see below); in this linear model the coefficient 1.99903 is simply the mean difference between group 2 and the reference group
[1] 7.381892
- Model diagnostics
  - Non-normality: check the residuals with shapiro.test()
> # test whether the residuals are normally distributed
> data(LMdata,package = 'rinds')
> model <- lm(y~x,data = LMdata$NonL)
> # extract the residuals
>
> res1 <- residuals(model)
> shapiro.test(res1)
Shapiro-Wilk normality test
data: res1
W = 0.93524, p-value = 1e-04 # the residuals are not normal; the model needs to be refitted
  - Non-linearity
> # 2. non-linearity
> model2 <- lm(y~x,data = LMdata$NonL)
> plot(model2)
# is the relationship between y and x^2 linear?
model2 <- lm(y~x+ I(x^2),data = LMdata$NonL)
summary(model2)
# drop x from the model
> model3 <- update(model2,y~.-x)
> summary(model3)
Call:
lm(formula = y ~ 1, data = LMdata$NonL)
Residuals:
Min 1Q Median 3Q Max
-19.456 -12.907 -2.464 10.725 28.697
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 21.829 1.429 15.28 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 14.29 on 99 degrees of freedom
> model2 <- lm(y~x+ I(x^2),data = LMdata$NonL)
> summary(model2)
Call:
lm(formula = y ~ x + I(x^2), data = LMdata$NonL)
Residuals:
Min 1Q Median 3Q Max
-2.32348 -0.60702 0.00701 0.56018 2.27346
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.98702 0.62216 1.586 0.116
x 0.11085 0.45405 0.244 0.808
I(x^2) 1.97966 0.07456 26.552 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.907 on 97 degrees of freedom
Multiple R-squared: 0.9961, Adjusted R-squared: 0.996
F-statistic: 1.224e+04 on 2 and 97 DF, p-value: < 2.2e-16
> model3 <- update(model2,y~.-x)
> summary(model3)
Call:
lm(formula = y ~ I(x^2), data = LMdata$NonL)
Residuals:
Min 1Q Median 3Q Max
-2.34289 -0.60692 0.01722 0.58186 2.29601
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.13378 0.15963 7.103 1.97e-10 ***
I(x^2) 1.99760 0.01271 157.195 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.9026 on 98 degrees of freedom
Multiple R-squared: 0.996, Adjusted R-squared: 0.996
F-statistic: 2.471e+04 on 1 and 98 DF, p-value: < 2.2e-16
> AIC(model,model2,model3)
df AIC
model 3 478.4558
model2 4 269.2121
model3 3 267.2736 # the fit improves at each step (lower AIC)
plot(model3$residuals~LMdata$NonL$x) # the residuals now scatter around 0
  - Heteroscedasticity: the concern is how the noise affects the model
model4 <- lm(y~x,data = LMdata$Hetero)
plot(model4$residuals~LMdata$Hetero$x)
# remedy: weighted least squares, giving different weights to different sample points
> model5 <- lm(y~x,data = LMdata$Hetero,weights = 1/x^2)
> summary(model5)
Call:
lm(formula = y ~ x, data = LMdata$Hetero, weights = 1/x^2)
Weighted Residuals:
Min 1Q Median 3Q Max
-2.25132 -0.68409 -0.03924 0.63997 2.61098
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.2611 0.5139 2.454 0.0159 *
x 1.8973 0.2317 8.190 9.98e-13 ***
---
Signif. codes:
0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.025 on 98 degrees of freedom
Multiple R-squared: 0.4063, Adjusted R-squared: 0.4003
F-statistic: 67.07 on 1 and 98 DF, p-value: 9.977e-13
> summary(model4)
Call:
lm(formula = y ~ x, data = LMdata$Hetero)
Residuals:
Min 1Q Median 3Q Max
-8.3371 -1.6383 -0.1533 1.4075 12.3423
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.6175 1.0019 1.615 0.11
x 1.7671 0.3113 5.677 1.4e-07 ***
---
Signif. codes:
0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 3.63 on 98 degrees of freedom
Multiple R-squared: 0.2475, Adjusted R-squared: 0.2398
F-statistic: 32.23 on 1 and 98 DF, p-value: 1.396e-07
# remedy: generalised (iteratively reweighted) least squares with gls() from the nlme package; try different weights formulas and keep the fit with the smallest AIC
> library(nlme)
> glsmodel <- gls(y~x,weights = varFixed(~x),data = LMdata$Hetero)
> summary(glsmodel)
Generalized least squares fit by REML
Model: y ~ x
Data: LMdata$Hetero
AIC BIC logLik
517.5286 525.2835 -255.7643
Variance function:
Structure: fixed weights
Formula: ~x
Coefficients:
Value Std.Error t-value p-value
(Intercept) 1.421324 0.7091780 2.004184 0.0478
x 1.832543 0.2603653 7.038351 0.0000
Correlation:
(Intr)
x -0.908
Standardized residuals:
Min Q1 Med Q3
-2.17451000 -0.55322766 -0.03454023 0.51060224
Max
2.93969900
Residual standard error: 1.890127
Degrees of freedom: 100 total; 98 residual
  - Autocorrelation: use the Durbin-Watson test, dwtest() from the lmtest package
> library(lmtest)
> model6 <- lm(y~x,data = LMdata$AC)
> dwtest(model6)
Durbin-Watson test
data: model6
DW = 0.65556, p-value = 2.683e-12 # autocorrelation is present, so ordinary least squares cannot be used; use gls() for generalised least squares and set its correlation argument to corAR1()
alternative hypothesis: true autocorrelation is greater than 0
> glsmodel2 <- gls(y~x,correlation = corAR1(),data = LMdata$AC)
> summary(glsmodel2)
Generalized least squares fit by REML
Model: y ~ x
Data: LMdata$AC
AIC BIC logLik
268.7617 279.1016 -130.3809
Correlation Structure: AR(1)
Formula: ~1
Parameter estimate(s):
Phi
0.7041665
Coefficients:
Value Std.Error t-value p-value
(Intercept) 1.335059 0.7792019 1.713367 0.0898
x 2.072936 0.2405513 8.617438 0.0000
Correlation:
(Intr)
x -0.926
Standardized residuals:
Min Q1 Med Q3
-1.667086282 -0.571601900 0.001724114 0.568360354
Max
2.306177988
Residual standard error: 1.25329
Degrees of freedom: 100 total; 98 residual
> summary(model6)
Call:
lm(formula = y ~ x, data = LMdata$AC)
Residuals:
Min 1Q Median 3Q Max
-2.10131 -0.72894 -0.03329 0.66138 2.77461
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.2626 0.3303 3.822 0.000232 ***
x 2.1118 0.1026 20.578 < 2e-16 ***
---
Signif. codes:
0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.197 on 98 degrees of freedom
Multiple R-squared: 0.8121, Adjusted R-squared: 0.8101
F-statistic: 423.4 on 1 and 98 DF, p-value: < 2.2e-16
  - Unusual observations: outliers, leverage points, and influential points
model7 <- lm(y~x,data = LMdata$Outlier)
plot(y~x,data = LMdata$Outlier)
abline(model7) # draw the regression line
# judge influence with influencePlot() from the car package: the larger the circle, the greater that point's influence on the model; such points are dropped with update() before refitting (see the sketch below)
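# A sketch of how the influential point could be identified (assuming the car package is installed):
library(car)
influencePlot(model7) # bubble plot of studentized residuals vs. leverage; circle size reflects Cook's distance
outlierTest(model7) # Bonferroni-adjusted test of the largest studentized residual
cooks.distance(model7) # per-observation influence values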
model8 <- update(model7,y~x,subset = -32,data = LMdata$Outlier)
plot(y~x,data = LMdata$Outlier) # compare the fit before and after removing the point
abline(model7,col = 'red') # original fit
abline(model8,col = 'green') # fit without observation 32
  - Detecting multicollinearity: linear relationships among the independent variables
# x1, x2 and x3 are individually non-significant, yet the overall F-test and R-squared are highly significant; this suggests x1, x2 and x3 are correlated with one another rather than each being related to y, so the model cannot be interpreted as is. Confirm with vif() from the car package, which computes variance inflation factors
> model9 <- lm(y~x1+x2+x3,data = LMdata$Mult)
> summary(model9)
Call:
lm(formula = y ~ x1 + x2 + x3, data = LMdata$Mult)
Residuals:
Min 1Q Median 3Q Max
-1.29100 -0.26369 0.00141 0.32529 0.91182
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.3848 0.5813 0.662 0.5095
x1 7.2022 4.8552 1.483 0.1412
x2 -14.0916 12.1385 -1.161 0.2486
x3 8.2312 4.8559 1.695 0.0933 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.499 on 96 degrees of freedom
Multiple R-squared: 0.9984, Adjusted R-squared: 0.9984
F-statistic: 2.057e+04 on 3 and 96 DF, p-value: < 2.2e-16
# vif() computes variance inflation factors; as a rule of thumb, values above 10 indicate collinearity
> vif(model9)
x1 x2 x3
7560.819 214990.752 222630.742
# use stepwise selection with step() to decide which of the collinear predictors to drop
> model10 <- step(model9)
Start: AIC=-135.11
y ~ x1 + x2 + x3
Df Sum of Sq RSS AIC
- x2 1 0.33560 24.241 -135.71
<none> 23.905 -135.11
- x1 1 0.54795 24.453 -134.84
- x3 1 0.71550 24.621 -134.16
Step: AIC=-135.71
y ~ x1 + x3
Df Sum of Sq RSS AIC
<none> 24.2 -135.71
- x1 1 189.2 213.4 79.81
- x3 1 15276.4 15300.7 507.05
> summary(model10)
Call:
lm(formula = y ~ x1 + x3, data = LMdata$Mult)
Residuals:
Min 1Q Median 3Q Max
-1.24046 -0.27519 -0.02751 0.32824 0.91344
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.38158 0.58230 0.655 0.514
x1 1.56614 0.05692 27.514 <2e-16 ***
x3 2.59393 0.01049 247.241 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.4999 on 97 degrees of freedom
Multiple R-squared: 0.9984, Adjusted R-squared: 0.9984
F-statistic: 3.074e+04 on 2 and 97 DF, p-value: < 2.2e-16
- Logistic regression
Difference from linear regression: y is a categorical variable.
  - Use glm(); the plasma data set comes from the HSAUR2 package
> library(HSAUR2)
> data('plasma')
> head(plasma)
fibrinogen globulin ESR
1 2.52 38 ESR < 20
2 2.56 31 ESR < 20
3 2.19 33 ESR < 20
4 2.18 31 ESR < 20
5 3.41 37 ESR < 20
6 2.46 36 ESR < 20
> fit1 <- glm(ESR~fibrinogen+globulin,data = plasma,
+ family = binomial()) # family = binomial() for a binary outcome
> summary(fit1)
Call:
glm(formula = ESR ~ fibrinogen + globulin, family = binomial(),
data = plasma)
Deviance Residuals:
Min 1Q Median 3Q Max
-0.9683 -0.6122 -0.3458 -0.2116 2.2636
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -12.7921 5.7963 -2.207 0.0273 *
fibrinogen 1.9104 0.9710 1.967 0.0491 *
globulin 0.1558 0.1195 1.303 0.1925
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 30.885 on 31 degrees of freedom
Residual deviance: 22.971 on 29 degrees of freedom
AIC: 28.971
Number of Fisher Scoring iterations: 5
> exp(coef(fit1)['fibrinogen'])
fibrinogen
6.755579
> exp(1.9104)
[1] 6.755791 # the odds ratio (OR)
> exp(confint(fit1,parm = 'fibrinogen'))
Waiting for profiling to be done...
2.5 % 97.5 % # 95% confidence interval
1.404131 73.000836
Survival analysis and Cox regression
Survival analysis
  - Demonstrated with the glioma data set from the coin package; the analysis uses the survival package, and survfit() produces the survival curves.
> library(coin); library(survival)
> data('glioma',package = 'coin')
> g3 <- subset(glioma,histology =='Grade3')
> fit <- survfit(Surv(time,event)~group,data = g3) # note the formula syntax
> plot(fit,lty = c(2,3),col = c(2,1))
> legend('bottomright',legend = c('Control','Treatment'),lty = c(2,1),col = c(2,1))
  - Use survdiff() to test the survival difference between the two groups
> survdiff(Surv(time,event)~group,data = g3) # same formula interface
Call:
survdiff(formula = Surv(time, event) ~ group, data = g3)
N Observed Expected (O-E)^2/E (O-E)^2/V
group=Control 6 4 1.49 4.23 6.06
group=RIT 11 2 4.51 1.40 6.06
Chisq= 6.1 on 1 degrees of freedom, p= 0.0138
  - Use logrank_test() from the coin package for an exact log-rank test of the group difference
> logrank_test(Surv(time,event)~group,data = g3,distribution='exact')
# note the formula syntax; distribution = 'exact' requests the exact distribution
Exact Two-Sample Logrank Test
data: Surv(time, event) by group (Control, RIT)
Z = -2.1711, p-value = 0.02877
alternative hypothesis: true theta is not equal to 1
# test whether control and treatment differ, stratified by histology
logrank_test(Surv(time,event)~group|histology,data = glioma,distribution = approximate(B =1000))
Approximative Two-Sample Logrank Test
data: Surv(time, event) by
group (Control, RIT)
stratified by histology
Z = -3.6704, p-value < 2.2e-16
alternative hypothesis: true theta is not equal to 1
Cox regression
  - Using the GBSG2 data set (from the TH.data package)
> install.packages('TH.data')
> data('GBSG2',package = 'TH.data')
> head(GBSG2)
horTh age menostat tsize tgrade pnodes progrec estrec time cens
1 no 70 Post 21 II 3 48 66 1814 1
2 yes 56 Post 12 II 7 61 77 2018 1
3 yes 58 Post 35 II 9 52 271 712 1
4 yes 59 Post 17 II 4 60 29 1807 1
5 no 73 Post 35 II 1 26 65 772 1
6 no 32 Pre 57 III 24 0 13 448 1
> plot(survfit(Surv(time,cens)~horTh,data = GBSG2),lty = c(2,1),col = c(2,1), mark.time = TRUE) # mark.time = TRUE marks the censoring times on the curves
> legend('bottomright',legend = c('yes','no'),lty = c(2,1),col = c(2,1))
# this draws the survival curves
  - Fit the Cox model with coxph() from the survival package
> coxreg <- coxph(Surv(time,cens)~.,data = GBSG2)
> summary(coxreg)
Call:
coxph(formula = Surv(time, cens) ~ ., data = GBSG2)
n= 686, number of events= 299
coef exp(coef) se(coef) z Pr(>|z|)
horThyes -0.3462784 0.7073155 0.1290747 -2.683 0.007301 **
age -0.0094592 0.9905854 0.0093006 -1.017 0.309126
menostatPost 0.2584448 1.2949147 0.1834765 1.409 0.158954
tsize 0.0077961 1.0078266 0.0039390 1.979 0.047794 *
tgrade.L 0.5512988 1.7355056 0.1898441 2.904 0.003685 **
tgrade.Q -0.2010905 0.8178384 0.1219654 -1.649 0.099199 .
pnodes 0.0487886 1.0499984 0.0074471 6.551 5.7e-11 ***
progrec -0.0022172 0.9977852 0.0005735 -3.866 0.000111 ***
estrec 0.0001973 1.0001973 0.0004504 0.438 0.661307
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
exp(coef) exp(-coef) lower .95 upper .95
horThyes 0.7073 1.4138 0.5492 0.9109
age 0.9906 1.0095 0.9727 1.0088
menostatPost 1.2949 0.7723 0.9038 1.8553
tsize 1.0078 0.9922 1.0001 1.0156
tgrade.L 1.7355 0.5762 1.1963 2.5178
tgrade.Q 0.8178 1.2227 0.6439 1.0387
pnodes 1.0500 0.9524 1.0348 1.0654
progrec 0.9978 1.0022 0.9967 0.9989
estrec 1.0002 0.9998 0.9993 1.0011
Concordance= 0.692 (se = 0.018 )
Rsquare= 0.142 (max possible= 0.995 )
Likelihood ratio test= 104.8 on 9 df, p=0
Wald test = 114.8 on 9 df, p=0
Score (logrank) test = 120.7 on 9 df, p=0
# draw a conditional inference tree
install.packages('party')
library(party)
tree <- ctree(Surv(time,cens)~.,data = GBSG2)
plot(tree)