Abstract: notes kept purely to record my process of learning R.
Contents:
Descriptive statistics; t-tests; data transformation; analysis of variance; chi-square tests; regression analysis and model diagnostics; survival analysis; Cox regression
A note before the main text: this is the final installment on the basics. The notes are taken from a 醫學方 course and are intended only as a record of learning R.
Main text:
Descriptive statistics
- How to generate a "Table 1": use table() to summarise frequencies quickly
  - Build a two-way contingency table: table(row variable, column variable)
> table(tips$sex,tips$smoker)
No Yes
Female 54 33
Male 97 60
  - addmargins(): compute marginal totals for a table built with table()
> table(esoph$agegp,esoph$ncases)
0 1 2 3 4 5 6 8 9 17
25-34 14 1 0 0 0 0 0 0 0 0
35-44 10 2 2 1 0 0 0 0 0 0
45-54 3 2 2 2 3 2 2 0 0 0
55-64 0 0 2 4 3 2 2 1 2 0
65-74 1 4 2 2 2 2 1 0 0 1
75+ 1 7 3 0 0 0 0 0 0 0
> tt <- table(esoph$agegp,esoph$ncases)
> addmargins(tt,margin = c(1,2)) # margin: 1 = rows, 2 = columns
0 1 2 3 4 5 6 8 9 17 Sum
25-34 14 1 0 0 0 0 0 0 0 0 15
35-44 10 2 2 1 0 0 0 0 0 0 15
45-54 3 2 2 2 3 2 2 0 0 0 16
55-64 0 0 2 4 3 2 2 1 2 0 16
65-74 1 4 2 2 2 2 1 0 0 1 15
75+ 1 7 3 0 0 0 0 0 0 0 11
Sum 29 16 11 9 8 6 5 1 2 1 88
- xtabs(): build tables from a data set, optionally restricted to a subset
> hightip <- tips[,'tip'] > mean(tips[,'tip'] )
> as.data.frame(xtabs(~tips$sex + hightip,subset = tips$smoker=='No'))
# as.data.frame() converts the result into a data frame; the variables after ~ play the same role as the arguments of table(), i.e. the classifying variables; a numeric count variable can be placed on the left of ~ to be summed within cells (see example 2); the subset argument restricts which rows are used
tips.sex hightip Freq
1 Female FALSE 31
2 Male FALSE 49
3 Female TRUE 23
4 Male TRUE 48
> as.data.frame(xtabs(~tips$sex + hightip,subset = tips$day %in% c('Sun','Sat')))
tips.sex hightip Freq
1 Female FALSE 21
2 Male FALSE 53
3 Female TRUE 25
4 Male TRUE 64
Example 2:
> xtabs(ncontrols~agegp + alcgp,data = esoph)
alcgp
agegp 0-39g/day 40-79 80-119 120+
25-34 61 45 5 5
35-44 89 80 20 10
45-54 78 81 39 15
55-64 89 84 43 26
65-74 71 53 29 8
75+ 27 12 2 3
> addmargins(xtabs(ncontrols~agegp + alcgp,data = esoph),margin = c(1,2))
alcgp
agegp 0-39g/day 40-79 80-119 120+ Sum
25-34 61 45 5 5 116
35-44 89 80 20 10 199
45-54 78 81 39 15 213
55-64 89 84 43 26 242
65-74 71 53 29 8 161
75+ 27 12 2 3 44
Sum 415 355 138 67 975
# cbind(numeric, numeric) on the left of ~ tabulates both count variables at once
> xtabs(cbind(ncases,ncontrols)~agegp + alcgp,data = esoph)
, , = ncases
alcgp
agegp 0-39g/day 40-79 80-119 120+
25-34 0 0 0 1
35-44 1 4 0 4
45-54 1 20 12 13
55-64 12 22 24 18
65-74 11 25 13 6
75+ 4 4 2 3
, , = ncontrols
alcgp
agegp 0-39g/day 40-79 80-119 120+
25-34 61 45 5 5
35-44 89 80 20 10
45-54 78 81 39 15
55-64 89 84 43 26
65-74 71 53 29 8
75+ 27 12 2 3
- ftable(): flat contingency tables; it accepts an xtabs object and rearranges it
> ftable(xtabs(cbind(ncases,ncontrols)~agegp +alcgp,data = esoph))
ncases ncontrols
agegp alcgp
25-34 0-39g/day 0 61
40-79 0 45
80-119 0 5
120+ 1 5
35-44 0-39g/day 1 89
40-79 4 80
80-119 0 20
120+ 4 10
45-54 0-39g/day 1 78
40-79 20 81
80-119 12 39
120+ 13 15
55-64 0-39g/day 12 89
40-79 22 84
80-119 24 43
120+ 18 26
65-74 0-39g/day 11 71
40-79 25 53
80-119 13 29
120+ 6 8
75+ 0-39g/day 4 27
40-79 4 12
80-119 2 2
120+ 3 3
- Data summaries can also be produced with functions covered earlier (see the sketch after this list), e.g.:
  - summary(data)
  - describe(data) (from the psych package)
  - describe() for data frames (from the Hmisc package)
  - the apply() family, etc.
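A minimal sketch of these summary functions, assuming the tips data set comes from the reshape2 package (as used above) and that the psych and Hmisc packages are installed:
library(reshape2)                                # provides the tips data set
data(tips)
summary(tips)                                    # base-R summary of every column
psych::describe(tips[, c('total_bill','tip')])   # mean, sd, skew, kurtosis, ...
Hmisc::describe(tips)                            # frequencies and quantiles per variable
sapply(tips[, c('total_bill','tip')], mean)      # apply-family example: column means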
t-tests
- Assumptions: the data are normally distributed and the group variances are homogeneous (if they are not, the t' test, i.e. Welch's test, can be used).
- Examples:
  - Generate random data and test it for normality with shapiro.test():
> data1 <- sample(1:100,50)
>
> #檢驗正態性 shapiro.test()函數
> shapiro.test(data1)
Shapiro-Wilk normality test
data: data1
W = 0.96424, p-value = 0.1338 # cannot reject the null hypothesis: the data are consistent with a normal distribution
Besides shapiro.test(), the following five functions (from the nortest package) also test for normality:
library(nortest)
lillie.test(data1)
ad.test(data1)
cvm.test(data1)
pearson.test(data1)
sf.test(data1)
For example:
> lillie.test(data1)
Lilliefors (Kolmogorov-Smirnov) normality test
data: data1
D = 0.064492, p-value = 0.8693
> ad.test(data1)
Anderson-Darling normality test
data: data1
A = 0.40918, p-value = 0.333
> cvm.test(data1)
Cramer-von Mises normality test
data: data1
W = 0.054212, p-value = 0.4437
> pearson.test(data1)
Pearson chi-square normality test
data: data1
P = 4, p-value = 0.7798
> sf.test(data1)
Shapiro-Francia normality test
data: data1
W = 0.97496, p-value = 0.3075
  - Generate normally distributed data with rnorm():
> # generate normally distributed data
> data3 <- rnorm(100,3,5)
> data4 <- rnorm(200,3.4,8)
>
> shapiro.test(data3)
Shapiro-Wilk normality test
data: data3
W = 0.99369, p-value = 0.9258
>
> shapiro.test(data4)
Shapiro-Wilk normality test
data: data4
W = 0.98946, p-value = 0.149
  - Test homogeneity of variances with var.test() (only for two groups):
> var.test(data3,data4)
F test to compare two variances
data: data3 and data4
F = 0.40348, num df = 99, denom df = 199, p-value = 1.088e-06
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
0.2893355 0.5739700
sample estimates:
ratio of variances
0.4034794 # p < 0.05, so the variances are not homogeneous
  - Run t-tests with t.test():
  - Two-sample t-test of two means:
> t.test(data3,data4,var.equal = F) # with unequal variances set var.equal = FALSE (the default); Welch's t' test is then used instead of the ordinary t-test
Welch Two Sample t-test
data: data3 and data4
t = -1.3681, df = 281.41, p-value = 0.1724 # no statistically significant difference between the two means
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-2.587004 0.465494
sample estimates:
mean of x mean of y
2.468549 3.529304
  - One-sample t-test: sample mean vs. a population mean
> # compare the sample mean with the population mean
> t.test(data3,mu =3.2)
One Sample t-test
data: data3
t = -1.4117, df = 99, p-value = 0.1612 # no statistically significant difference between the sample and the population mean
alternative hypothesis: true mean is not equal to 3.2
95 percent confidence interval:
1.440424 3.496675
sample estimates:
mean of x
2.468549
  - Paired t-test:
data3 <- rnorm(200,3,5)
> data4 <- rnorm(200,3.4,5)
>
> # paired data
> # run a paired t-test
> t.test(data3,data4,paired = TRUE)
Paired t-test
data: data3 and data4
t = 0.59079, df = 199, p-value = 0.5553 # p > 0.05, no significant difference
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-0.6946177 1.2888581
sample estimates:
mean of the differences
0.2971202
Data transformation
- What to do when the data are not normally distributed: transformations such as log(), sqrt(), or 1/x sometimes help.
- runif() generates uniformly distributed data:
> mydata <- runif(100,min = 1,max = 2)
> shapiro.test(mydata)
Shapiro-Wilk normality test
data: mydata
W = 0.93149, p-value = 6.05e-05
> shapiro.test(log(mydata))
Shapiro-Wilk normality test
data: log(mydata)
W = 0.93082, p-value = 5.54e-05
> shapiro.test(sqrt(mydata))
Shapiro-Wilk normality test
data: sqrt(mydata)
W = 0.93236, p-value = 6.794e-05
> shapiro.test(1/mydata)
Shapiro-Wilk normality test
data: 1/mydata
W = 0.9195, p-value = 1.325e-05
# none of these transformations works here.
- Box-Cox transformation on a model formula: load the MASS package and use boxcox(); the key is to find the lambda value and then transform accordingly.
library(MASS)
bc <- boxcox(Volume ~ log(Height) + log(Girth),data = trees,
lambda = seq(-0.25,0.25,length =10))
# Volume is the variable to transform; log(Height) and log(Girth) are the predictors used for the estimate; trees is the data set; the lambda argument gives the grid over which the optimum is searched
lambda <- bc$x[which.max(bc$y)] # the lambda that maximises the profile log-likelihood
  - Transform the data with (x^lambda - 1)/lambda (for lambda not equal to 0); the new Volume_bc is then approximately normal.
> Volume_bc <- (trees$Volume^lambda-1)/lambda
> shapiro.test(Volume_bc)
Shapiro-Wilk normality test
data: Volume_bc
W = 0.96431, p-value = 0.3776
  - Use qqnorm(Volume_bc) and qqline(Volume_bc) to inspect a Q-Q plot and check normality, as in the sketch below.
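A minimal sketch of the Q-Q check (assuming Volume_bc computed above):
qqnorm(Volume_bc) # quantile-quantile plot against the normal distribution
qqline(Volume_bc) # reference line; points lying close to it suggest approximate normality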
- Box-Cox transformation for a single numeric vector: load the forecast package and use BoxCox.lambda():
> library(forecast)
> BoxCox.lambda(trees$Volume)
[1] -0.4954451 # the estimated lambda; close to -0.5, so use 1/sqrt(x)
Rules of thumb:
  - lambda close to -1: use 1/x
  - lambda close to -0.5: use 1/sqrt(x)
  - lambda close to 0: use ln(x) / log(x)
  - lambda close to 0.5: use sqrt(x)
  - lambda close to 1: no transformation needed
Complete example:
> BoxCox.lambda(trees$Volume)
[1] -0.4954451
> new_volum <- 1/sqrt(trees$Volume)
> shapiro.test(new_volum)
Shapiro-Wilk normality test
data: new_volum
W = 0.94523, p-value = 0.1152
- Use powerTransform() from the car package to estimate lambda, then re-check normality:
> library(car)
> powerTransform(trees$Volume)
Estimated transformation parameter
trees$Volume
-0.07476608
> new_volum <- trees$Volume^-0.07476608
> shapiro.test(new_volum)
Shapiro-Wilk normality test
data: new_volum
W = 0.96428, p-value = 0.3768
Analysis of variance (ANOVA)
- Used to compare means across multiple groups. For two groups the t-test and ANOVA are equivalent, but for three or more groups the t-test does not apply and ANOVA is needed.
- Assumptions to satisfy:
  - Normality
  - Homogeneity of variance
  - Independence
- Testing homogeneity of variance
  - Install the multcomp package and load its cholesterol data set (it contains a response variable and a trt grouping variable).
  - Homogeneity tests:
    - Built into base R: bartlett.test()
> bartlett.test(response~trt,data = cholesterol)
Bartlett test of homogeneity of variances
data: response by trt
Bartlett's K-squared = 0.57975, df = 4, p-value =
0.9653
> # p = 0.9653, the variances are homogeneous
>
> # normality test
> shapiro.test(cholesterol$response)
Shapiro-Wilk normality test
data: cholesterol$response
W = 0.97722, p-value = 0.4417
    - Built into base R: fligner.test(); its test principle differs from bartlett.test(), but the formula interface is the same
> # homogeneity of variance test
> fligner.test(response~trt,data = cholesterol)
Fligner-Killeen test of homogeneity of variances
data: response by trt
Fligner-Killeen:med chi-squared = 0.74277, df = 4,
p-value = 0.946
    - ncvTest() from the car package:
> ncvTest(lm(response~trt,data = cholesterol))
Non-constant Variance Score Test
Variance formula: ~ fitted.values
Chisquare = 0.1338923 Df = 1 p = 0.71443
    - leveneTest() from the car package:
> leveneTest(response~trt,data = cholesterol)
Levene's Test for Homogeneity of Variance (center = median)
Df F value Pr(>F)
group 4 0.0755 0.9893
45
- Running the ANOVA
  - One-way ANOVA with aov():
> fit <- aov(response~trt,data = cholesterol)
> fit
Call:
aov(formula = response ~ trt, data = cholesterol)
Terms:
trt Residuals
Sum of Squares 1351.3690 468.7504
Deg. of Freedom 4 45
Residual standard error: 3.227488
Estimated effects may be unbalanced
> summary(fit)
Df Sum Sq Mean Sq F value Pr(>F)
trt 4 1351.4 337.8 32.43 9.82e-13 ***
Residuals 45 468.8 10.4
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
# visualise the group means with plotmeans() from the gplots package
library(gplots)
plotmeans(response~trt,data = cholesterol)
  - One-way ANOVA with oneway.test():
> oneway.test(response~trt,data = cholesterol)
One-way analysis of means (not assuming equal
variances)
data: response and trt
F = 30.709, num df = 4.00, denom df = 22.46, p-value =
8.227e-09
> oneway.test(response~trt,data = cholesterol,var.equal = T)
One-way analysis of means
data: response and trt
F = 32.433, num df = 4, denom df = 45, p-value =
9.819e-13
  - Pairwise comparisons with TukeyHSD():
> TukeyHSD(fit)
Tukey multiple comparisons of means
95% family-wise confidence level
Fit: aov(formula = response ~ trt, data = cholesterol)
$trt
diff lwr upr p adj
2times-1time 3.44300 -0.6582817 7.544282 0.1380949
4times-1time 6.59281 2.4915283 10.694092 0.0003542
drugD-1time 9.57920 5.4779183 13.680482 0.0000003
drugE-1time 15.16555 11.0642683 19.266832 0.0000000
4times-2times 3.14981 -0.9514717 7.251092 0.2050382
drugD-2times 6.13620 2.0349183 10.237482 0.0009611
drugE-2times 11.72255 7.6212683 15.823832 0.0000000
drugD-4times 2.98639 -1.1148917 7.087672 0.2512446
drugE-4times 8.57274 4.4714583 12.674022 0.0000037
drugE-drugD 5.58635 1.4850683 9.687632 0.0030633
# visualise
plot(TukeyHSD(fit))
  - Multi-factor (two-way) ANOVA with aov():
> data('ToothGrowth')
> head(ToothGrowth)
len supp dose
1 4.2 VC 0.5
2 11.5 VC 0.5
3 7.3 VC 0.5
4 5.8 VC 0.5
5 6.4 VC 0.5
6 10.0 VC 0.5
> levels(ToothGrowth$supp)
[1] "OJ" "VC"
> levels(ToothGrowth$dose)
NULL
> levels(as.factor(ToothGrowth$dose))
[1] "0.5" "1" "2"
> table(ToothGrowth$supp,ToothGrowth$dose)
0.5 1 2
OJ 10 10 10
VC 10 10 10
# formula syntax; pay attention to the interaction
> fangcha <- aov (len~supp+dose,data = ToothGrowth)
> summary(fangcha)
Df Sum Sq Mean Sq F value Pr(>F)
supp 1 205.4 205.4 11.45 0.0013 **
dose 1 2224.3 2224.3 123.99 6.31e-16 ***
Residuals 57 1022.6 17.9
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
# with the interaction term
> fangcha <- aov (len~supp*dose,data = ToothGrowth)
> summary(fangcha)
Df Sum Sq Mean Sq F value Pr(>F)
supp 1 205.4 205.4 12.317 0.000894 ***
dose 1 2224.3 2224.3 133.415 < 2e-16 ***
supp:dose 1 88.9 88.9 5.333 0.024631 * # the interaction must not be ignored
Residuals 56 933.6 16.7
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
# visualise with an interaction plot
>interaction.plot(ToothGrowth$dose,ToothGrowth$supp,ToothGrowth$len,type = 'b',
+ pch = c(1,10),col = c('red','green'))
  - One-way analysis of covariance (ANCOVA) with aov():
> data('litter')
> head(litter)
dose weight gesttime number
1 0 28.05 22.5 15
2 0 33.33 22.5 14
3 0 36.37 22.0 14
4 0 35.52 22.0 13
5 0 36.77 21.5 15
6 0 29.60 23.0 5
> facn <- aov(weight~gesttime + dose + gesttime:dose,data = litter)
> summary(facn)
Df Sum Sq Mean Sq F value Pr(>F)
gesttime 1 134.3 134.30 8.289 0.00537 ** # the covariate does have an effect; how do we remove it?
dose 3 137.1 45.71 2.821 0.04556 *
gesttime:dose 3 81.9 27.29 1.684 0.17889
Residuals 66 1069.4 16.20
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
  - Remove the influence of the covariate: load the effects package and use its effect() function to get adjusted means:
> effect('dose',facn)
NOTE: dose is not a high-order term in the model
dose effect
dose
0 5 50 500
32.30722 28.50625 30.61078 29.29005
# Formula expressions for common research designs (a reference sketch follows below)
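A reference sketch of standard aov() formula syntax for common designs (y is the outcome, A and B are factors, x is a covariate, Subject identifies the subjects):
y ~ A                     # one-way ANOVA
y ~ x + A                 # one-way ANCOVA with covariate x
y ~ A + B                 # two-way ANOVA, main effects only
y ~ A * B                 # two-way ANOVA with interaction (A + B + A:B)
y ~ B + A                 # randomised block design, B as the block factor
y ~ A + Error(Subject/A)  # one-way within-subjects (repeated measures) ANOVA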
Chi-square tests
- Tests for categorical variables.
- Types:
  - Goodness-of-fit test: chisq.test()
    (tests whether the sample distribution is consistent with a known population distribution)
> men <- c(11,120,60,45)
> women <- c(20,102,39,30)
> df <- as.data.frame(rbind(men,women))
> colnames(df) <- c('AB','O','A','B')
> chisq.test(men)
Chi-squared test for given probabilities
data: men
X-squared = 105.46, df = 3, p-value < 2.2e-16
> chisq.test(men,p = c(0.1,0.5,0.2,0.2)) # p gives the blood-type proportions in the general population
Chi-squared test for given probabilities
data: men
X-squared = 10.335, df = 3, p-value = 0.01592
  - Chi-square test of homogeneity: compares whether the category proportions are the same across the levels of another variable.
> chisq.test(df) # tests the association between the row and column variables
Pearson's Chi-squared test
data: df
X-squared = 6.8607, df = 3, p-value = 0.07647 # the blood-type distributions of men and women are consistent
  - Chi-square test of independence:
> chisq.test(df)
Pearson's Chi-squared test
data: df
X-squared = 6.8607, df = 3, p-value = 0.07647 # the row and column variables are not associated: blood type is independent of gender
  - CMH test: a stratified test over a three-dimensional table (unordered row variable, ordered column variable), used to check whether a stratifying factor acts as a confounder
# use mantelhaen.test()
> Rabbits <-
+ array(c(0,0,6,5,
+ 3,0,3,6,
+ 6,2,0,4,
+ 5,6,1,0,
+ 2,5,0,0),
+ dim = c(2,2,5),
+ dimnames = list(
+ Delay = c('None','1.5h'),
+ Response = c('Cured','Died'),
+ Penicillin.Level = c('1/8','1/4','1/2','1','4')))
> Rabbits
, , Penicillin.Level = 1/8
Response
Delay Cured Died
None 0 6
1.5h 0 5
, , Penicillin.Level = 1/4
Response
Delay Cured Died
None 3 3
1.5h 0 6
, , Penicillin.Level = 1/2
Response
Delay Cured Died
None 6 0
1.5h 2 4
, , Penicillin.Level = 1
Response
Delay Cured Died
None 5 1
1.5h 6 0
, , Penicillin.Level = 4
Response
Delay Cured Died
None 2 0
1.5h 5 0
> mantelhaen.test(Rabbits)
Mantel-Haenszel chi-squared test with continuity
correction
data: Rabbits
Mantel-Haenszel X-squared = 3.9286, df = 1, p-value =
0.04747
alternative hypothesis: true common odds ratio is not equal to 1
95 percent confidence interval:
1.026713 47.725133
sample estimates:
common odds ratio
7 # this lets us judge whether the penicillin level acts as a confounder
  - CMH test for ordered categorical (ordinal) data:
# again using mantelhaen.test()
> Satisfaction <-
+ as.table(array(c(1,2,0,0,3,3,1,2,
+ 11,17,8,4,2,3,5,2,
+ 1,0,0,0,1,3,0,1,
+ 2,5,7,9,1,1,3,6),
+ dim = c(4,4,2),
+ dimnames =
+ list(Income =
+ c('<5000','5000-15000',
+ '15000-25000','>25000'),
+ 'Job Satisfaction' =
+ c('V_D','L_S','M_S','V_S'),
+ Gender = c('Female','Male'))))
> Satisfaction
, , Gender = Female
Job Satisfaction
Income V_D L_S M_S V_S
<5000 1 3 11 2
5000-15000 2 3 17 3
15000-25000 0 1 8 5
>25000 0 2 4 2
, , Gender = Male
Job Satisfaction
Income V_D L_S M_S V_S
<5000 1 1 2 1
5000-15000 0 3 5 1
15000-25000 0 0 7 3
>25000 0 1 9 6
> mantelhaen.test(Satisfaction)
Cochran-Mantel-Haenszel test
data: Satisfaction
Cochran-Mantel-Haenszel M^2 = 10.2, df = 9, p-value =
0.3345 # income level is not associated with job satisfaction
  - Chi-square test for a paired 2x2 table: mcnemar.test()
> paired <- as.table(matrix(c(157,24,69,18),nrow = 2,dimnames = list(case =c('A','B'),control = c('A','B'))))
> paired
control
case A B
A 157 69
B 24 18
> mcnemar.test(paired)
McNemar's Chi-squared test with continuity correction
data: paired
McNemar's chi-squared = 20.817, df = 1, p-value =
5.053e-06
> chisq.test(paired)
Pearson's Chi-squared test with Yates' continuity
correction
data: paired
X-squared = 1.9244, df = 1, p-value = 0.1654
Regression analysis and model diagnostics
Prerequisites: check whether linear regression is appropriate and whether the normality assumption holds.
Linear regression only requires the dependent variable to be continuous; a categorical dependent variable calls for logistic regression. Whether the independent variables are continuous or categorical does not matter.
- Simple regression: lm(dependent variable ~ independent variable)
  - x as a continuous variable
> x<- seq(1,5,len =100)
> noise <- rnorm(n=100,mean = 0,sd = 1)
> beta0 = 1
> beta1 <-2
> y <- beta0 + beta1*x + noise
> plot(y~x) # check visually whether a linear fit looks reasonable
> model <- lm(y~x)
> summary(model)
Call:
lm(formula = y ~ x)
Residuals:
Min 1Q Median 3Q Max
-2.49732 -0.64794 0.00215 0.75355 3.06579
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.77938 0.28733 2.712 0.00789 **
x 2.05561 0.08927 23.027 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.041 on 98 degrees of freedom
Multiple R-squared: 0.844, Adjusted R-squared: 0.8424 # how much of the variance the model explains
F-statistic: 530.3 on 1 and 98 DF, p-value: < 2.2e-16 # overall significance of the model
  - x as a categorical variable
> x <- factor(rep(c(0,1,2),each = 20))
> y <- c(rnorm(20,0,1),rnorm(20,1,1),rnorm(20,2,1))
> model <- lm(y~x)
> summary(model)
Call:
lm(formula = y ~ x)
Residuals:
Min 1Q Median 3Q Max
-2.50566 -0.82826 0.01137 0.87966 2.41338
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.05338 0.24331 0.219 0.827
x1 0.81708 0.34409 2.375 0.021 *
x2 1.99903 0.34409 5.810 2.94e-07 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.088 on 57 degrees of freedom
Multiple R-squared: 0.3745, Adjusted R-squared: 0.3525
F-statistic: 17.06 on 2 and 57 DF, p-value: 1.558e-06
> exp( 1.99903) # note: exponentiating a coefficient gives an odds ratio only in logistic regression (see below); in this linear model the coefficient 1.99903 is simply the mean difference between group 2 and the reference group
[1] 7.381892
- Model diagnostics
  - Non-normality: check the residuals with shapiro.test()
> # test whether the residuals are normally distributed
> data(LMdata,package = 'rinds')
> model <- lm(y~x,data = LMdata$NonL)
> # extract the residuals
>
> res1 <- residuals(model)
> shapiro.test(res1)
Shapiro-Wilk normality test
data: res1
W = 0.93524, p-value = 1e-04 # the residuals are not normal; the model needs to be refitted
  - Non-linearity
> # 2. non-linearity
> model2 <- lm(y~x,data = LMdata$NonL)
> plot(model2)
# is the relationship between y and x^2 linear?
model2 <- lm(y~x+ I(x^2),data = LMdata$NonL)
summary(model2)
# drop x from the model
> model3 <- update(model2,y~.-x)
> summary(model3)
Call:
lm(formula = y ~ 1, data = LMdata$NonL)
Residuals:
Min 1Q Median 3Q Max
-19.456 -12.907 -2.464 10.725 28.697
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 21.829 1.429 15.28 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 14.29 on 99 degrees of freedom
> model2 <- lm(y~x+ I(x^2),data = LMdata$NonL)
> summary(model2)
Call:
lm(formula = y ~ x + I(x^2), data = LMdata$NonL)
Residuals:
Min 1Q Median 3Q Max
-2.32348 -0.60702 0.00701 0.56018 2.27346
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.98702 0.62216 1.586 0.116
x 0.11085 0.45405 0.244 0.808
I(x^2) 1.97966 0.07456 26.552 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.907 on 97 degrees of freedom
Multiple R-squared: 0.9961, Adjusted R-squared: 0.996
F-statistic: 1.224e+04 on 2 and 97 DF, p-value: < 2.2e-16
> model3 <- update(model2,y~.-x)
> summary(model3)
Call:
lm(formula = y ~ I(x^2), data = LMdata$NonL)
Residuals:
Min 1Q Median 3Q Max
-2.34289 -0.60692 0.01722 0.58186 2.29601
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.13378 0.15963 7.103 1.97e-10 ***
I(x^2) 1.99760 0.01271 157.195 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.9026 on 98 degrees of freedom
Multiple R-squared: 0.996, Adjusted R-squared: 0.996
F-statistic: 2.471e+04 on 1 and 98 DF, p-value: < 2.2e-16
> AIC(model,model2,model3)
df AIC
model 3 478.4558
model2 4 269.2121
model3 3 267.2736 # the fit improves at each step (lower AIC)
plot(model3$residuals~LMdata$NonL$x) # the residuals now scatter around 0
  - Heteroscedasticity: the concern is how the noise affects the model
model4 <- lm(y~x,data = LMdata$Hetero)
plot(model4$residuals~LMdata$Hetero$x)
# remedy: weighted least squares, giving different weights to different sample points
> model5 <- lm(y~x,data = LMdata$Hetero,weights = 1/x^2)
> summary(model5)
Call:
lm(formula = y ~ x, data = LMdata$Hetero, weights = 1/x^2)
Weighted Residuals:
Min 1Q Median 3Q Max
-2.25132 -0.68409 -0.03924 0.63997 2.61098
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.2611 0.5139 2.454 0.0159 *
x 1.8973 0.2317 8.190 9.98e-13 ***
---
Signif. codes:
0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.025 on 98 degrees of freedom
Multiple R-squared: 0.4063, Adjusted R-squared: 0.4003
F-statistic: 67.07 on 1 and 98 DF, p-value: 9.977e-13
> summary(model4)
Call:
lm(formula = y ~ x, data = LMdata$Hetero)
Residuals:
Min 1Q Median 3Q Max
-8.3371 -1.6383 -0.1533 1.4075 12.3423
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.6175 1.0019 1.615 0.11
x 1.7671 0.3113 5.677 1.4e-07 ***
---
Signif. codes:
0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 3.63 on 98 degrees of freedom
Multiple R-squared: 0.2475, Adjusted R-squared: 0.2398
F-statistic: 32.23 on 1 and 98 DF, p-value: 1.396e-07
# remedy: generalised (iteratively reweighted) least squares with gls() from the nlme package; try different weights formulas and keep the fit with the smallest AIC
> library(nlme)
> glsmodel <- gls(y~x,weights = varFixed(~x),data = LMdata$Hetero)
> summary(glsmodel)
Generalized least squares fit by REML
Model: y ~ x
Data: LMdata$Hetero
AIC BIC logLik
517.5286 525.2835 -255.7643
Variance function:
Structure: fixed weights
Formula: ~x
Coefficients:
Value Std.Error t-value p-value
(Intercept) 1.421324 0.7091780 2.004184 0.0478
x 1.832543 0.2603653 7.038351 0.0000
Correlation:
(Intr)
x -0.908
Standardized residuals:
Min Q1 Med Q3
-2.17451000 -0.55322766 -0.03454023 0.51060224
Max
2.93969900
Residual standard error: 1.890127
Degrees of freedom: 100 total; 98 residual
  - Autocorrelation: use the Durbin-Watson test, dwtest() from the lmtest package
> library(lmtest)
> model6 <- lm(y~x,data = LMdata$AC)
> dwtest(model6)
Durbin-Watson test
data: model6
DW = 0.65556, p-value = 2.683e-12 # autocorrelation is present, so ordinary least squares cannot be used; use gls() for generalised least squares and set its correlation argument to corAR1()
alternative hypothesis: true autocorrelation is greater than 0
> glsmodel2 <- gls(y~x,correlation = corAR1(),data = LMdata$AC)
> summary(glsmodel2)
Generalized least squares fit by REML
Model: y ~ x
Data: LMdata$AC
AIC BIC logLik
268.7617 279.1016 -130.3809
Correlation Structure: AR(1)
Formula: ~1
Parameter estimate(s):
Phi
0.7041665
Coefficients:
Value Std.Error t-value p-value
(Intercept) 1.335059 0.7792019 1.713367 0.0898
x 2.072936 0.2405513 8.617438 0.0000
Correlation:
(Intr)
x -0.926
Standardized residuals:
Min Q1 Med Q3
-1.667086282 -0.571601900 0.001724114 0.568360354
Max
2.306177988
Residual standard error: 1.25329
Degrees of freedom: 100 total; 98 residual
> summary(model6)
Call:
lm(formula = y ~ x, data = LMdata$AC)
Residuals:
Min 1Q Median 3Q Max
-2.10131 -0.72894 -0.03329 0.66138 2.77461
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.2626 0.3303 3.822 0.000232 ***
x 2.1118 0.1026 20.578 < 2e-16 ***
---
Signif. codes:
0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.197 on 98 degrees of freedom
Multiple R-squared: 0.8121, Adjusted R-squared: 0.8101
F-statistic: 423.4 on 1 and 98 DF, p-value: < 2.2e-16
  - Unusual observations: outliers, leverage points, and influential points
model7 <- lm(y~x,data = LMdata$Outlier)
plot(y~x,data = LMdata$Outlier)
abline(model7) # draw the regression line
# judge influence with influencePlot() from the car package: the larger the circle, the greater that point's influence on the model; such points are dropped with update() before refitting (see the sketch below)
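# A sketch of how the influential point could be identified (assuming the car package is installed):
library(car)
influencePlot(model7) # bubble plot of studentized residuals vs. leverage; circle size reflects Cook's distance
outlierTest(model7) # Bonferroni-adjusted test of the largest studentized residual
cooks.distance(model7) # per-observation influence values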
model8 <- update(model7,y~x,subset = -32,data = LMdata$Outlier)
plot(y~x,data = LMdata$Outlier) # compare the fit before and after removing the point
abline(model7,col = 'red') # original fit
abline(model8,col = 'green') # fit without observation 32
  - Detecting multicollinearity: linear relationships among the independent variables
# x1, x2 and x3 are individually non-significant, yet the overall F-test and R-squared are highly significant; this suggests x1, x2 and x3 are correlated with one another rather than each being related to y, so the model cannot be interpreted as is. Confirm with vif() from the car package, which computes variance inflation factors
> model9 <- lm(y~x1+x2+x3,data = LMdata$Mult)
> summary(model9)
Call:
lm(formula = y ~ x1 + x2 + x3, data = LMdata$Mult)
Residuals:
Min 1Q Median 3Q Max
-1.29100 -0.26369 0.00141 0.32529 0.91182
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.3848 0.5813 0.662 0.5095
x1 7.2022 4.8552 1.483 0.1412
x2 -14.0916 12.1385 -1.161 0.2486
x3 8.2312 4.8559 1.695 0.0933 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.499 on 96 degrees of freedom
Multiple R-squared: 0.9984, Adjusted R-squared: 0.9984
F-statistic: 2.057e+04 on 3 and 96 DF, p-value: < 2.2e-16
# vif() computes variance inflation factors; as a rule of thumb, values above 10 indicate collinearity
> vif(model9)
x1 x2 x3
7560.819 214990.752 222630.742
# use stepwise selection with step() to decide which of the collinear predictors to drop
> model10 <- step(model9)
Start: AIC=-135.11
y ~ x1 + x2 + x3
Df Sum of Sq RSS AIC
- x2 1 0.33560 24.241 -135.71
<none> 23.905 -135.11
- x1 1 0.54795 24.453 -134.84
- x3 1 0.71550 24.621 -134.16
Step: AIC=-135.71
y ~ x1 + x3
Df Sum of Sq RSS AIC
<none> 24.2 -135.71
- x1 1 189.2 213.4 79.81
- x3 1 15276.4 15300.7 507.05
> summary(model10)
Call:
lm(formula = y ~ x1 + x3, data = LMdata$Mult)
Residuals:
Min 1Q Median 3Q Max
-1.24046 -0.27519 -0.02751 0.32824 0.91344
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.38158 0.58230 0.655 0.514
x1 1.56614 0.05692 27.514 <2e-16 ***
x3 2.59393 0.01049 247.241 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.4999 on 97 degrees of freedom
Multiple R-squared: 0.9984, Adjusted R-squared: 0.9984
F-statistic: 3.074e+04 on 2 and 97 DF, p-value: < 2.2e-16
- Logistic regression
Difference from linear regression: y is a categorical variable.
  - Use glm(); the plasma data set comes from the HSAUR2 package
> library(HSAUR2)
> data('plasma')
> head(plasma)
fibrinogen globulin ESR
1 2.52 38 ESR < 20
2 2.56 31 ESR < 20
3 2.19 33 ESR < 20
4 2.18 31 ESR < 20
5 3.41 37 ESR < 20
6 2.46 36 ESR < 20
> fit1 <- glm(ESR~fibrinogen+globulin,data = plasma,
+ family = binomial()) # family = binomial() for a binary outcome
> summary(fit1)
Call:
glm(formula = ESR ~ fibrinogen + globulin, family = binomial(),
data = plasma)
Deviance Residuals:
Min 1Q Median 3Q Max
-0.9683 -0.6122 -0.3458 -0.2116 2.2636
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -12.7921 5.7963 -2.207 0.0273 *
fibrinogen 1.9104 0.9710 1.967 0.0491 *
globulin 0.1558 0.1195 1.303 0.1925
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 30.885 on 31 degrees of freedom
Residual deviance: 22.971 on 29 degrees of freedom
AIC: 28.971
Number of Fisher Scoring iterations: 5
> exp(coef(fit1)['fibrinogen'])
fibrinogen
6.755579
> exp(1.9104)
[1] 6.755791 # the odds ratio (OR)
> exp(confint(fit1,parm = 'fibrinogen'))
Waiting for profiling to be done...
2.5 % 97.5 % # 95% confidence interval
1.404131 73.000836
Survival analysis and Cox regression
Survival analysis
  - Demonstrated with the glioma data set from the coin package; the analysis uses the survival package, and survfit() produces the survival curves.
> library(coin); library(survival)
> data('glioma',package = 'coin')
> g3 <- subset(glioma,histology =='Grade3')
> fit <- survfit(Surv(time,event)~group,data = g3) # note the formula syntax
> plot(fit,lty = c(2,3),col = c(2,1))
> legend('bottomright',legend = c('Control','Treatment'),lty = c(2,1),col = c(2,1))
  - Use survdiff() to test the survival difference between the two groups
> survdiff(Surv(time,event)~group,data = g3) # same formula interface
Call:
survdiff(formula = Surv(time, event) ~ group, data = g3)
N Observed Expected (O-E)^2/E (O-E)^2/V
group=Control 6 4 1.49 4.23 6.06
group=RIT 11 2 4.51 1.40 6.06
Chisq= 6.1 on 1 degrees of freedom, p= 0.0138
  - Use logrank_test() from the coin package for an exact log-rank test of the group difference
> logrank_test(Surv(time,event)~group,data = g3,distribution='exact')
# note the formula syntax; distribution = 'exact' requests the exact distribution
Exact Two-Sample Logrank Test
data: Surv(time, event) by group (Control, RIT)
Z = -2.1711, p-value = 0.02877
alternative hypothesis: true theta is not equal to 1
# test whether control and treatment differ, stratified by histology
logrank_test(Surv(time,event)~group|histology,data = glioma,distribution = approximate(B =1000))
Approximative Two-Sample Logrank Test
data: Surv(time, event) by
group (Control, RIT)
stratified by histology
Z = -3.6704, p-value < 2.2e-16
alternative hypothesis: true theta is not equal to 1
Cox regression
  - Using the GBSG2 data set (from the TH.data package)
> install.packages('TH.data')
> data('GBSG2',package = 'TH.data')
> head(GBSG2)
horTh age menostat tsize tgrade pnodes progrec estrec time cens
1 no 70 Post 21 II 3 48 66 1814 1
2 yes 56 Post 12 II 7 61 77 2018 1
3 yes 58 Post 35 II 9 52 271 712 1
4 yes 59 Post 17 II 4 60 29 1807 1
5 no 73 Post 35 II 1 26 65 772 1
6 no 32 Pre 57 III 24 0 13 448 1
> plot(survfit(Surv(time,cens)~horTh,data = GBSG2),lty = c(2,1),col = c(2,1), mark.time = TRUE) # mark.time = TRUE marks the censoring times on the curves
> legend('bottomright',legend = c('yes','no'),lty = c(2,1),col = c(2,1))
# this draws the survival curves
  - Fit the Cox model with coxph() from the survival package
> coxreg <- coxph(Surv(time,cens)~.,data = GBSG2)
> summary(coxreg)
Call:
coxph(formula = Surv(time, cens) ~ ., data = GBSG2)
n= 686, number of events= 299
coef exp(coef) se(coef) z Pr(>|z|)
horThyes -0.3462784 0.7073155 0.1290747 -2.683 0.007301 **
age -0.0094592 0.9905854 0.0093006 -1.017 0.309126
menostatPost 0.2584448 1.2949147 0.1834765 1.409 0.158954
tsize 0.0077961 1.0078266 0.0039390 1.979 0.047794 *
tgrade.L 0.5512988 1.7355056 0.1898441 2.904 0.003685 **
tgrade.Q -0.2010905 0.8178384 0.1219654 -1.649 0.099199 .
pnodes 0.0487886 1.0499984 0.0074471 6.551 5.7e-11 ***
progrec -0.0022172 0.9977852 0.0005735 -3.866 0.000111 ***
estrec 0.0001973 1.0001973 0.0004504 0.438 0.661307
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
exp(coef) exp(-coef) lower .95 upper .95
horThyes 0.7073 1.4138 0.5492 0.9109
age 0.9906 1.0095 0.9727 1.0088
menostatPost 1.2949 0.7723 0.9038 1.8553
tsize 1.0078 0.9922 1.0001 1.0156
tgrade.L 1.7355 0.5762 1.1963 2.5178
tgrade.Q 0.8178 1.2227 0.6439 1.0387
pnodes 1.0500 0.9524 1.0348 1.0654
progrec 0.9978 1.0022 0.9967 0.9989
estrec 1.0002 0.9998 0.9993 1.0011
Concordance= 0.692 (se = 0.018 )
Rsquare= 0.142 (max possible= 0.995 )
Likelihood ratio test= 104.8 on 9 df, p=0
Wald test = 114.8 on 9 df, p=0
Score (logrank) test = 120.7 on 9 df, p=0
# draw a conditional inference tree
install.packages('party')
library(party)
tree <- ctree(Surv(time,cens)~.,data = GBSG2)
plot(tree)