R語言學習筆記之十


摘要: 僅用於記錄R語言學習過程:

內容提要:

描述性統計;t檢驗;數據轉換;方差分析;卡方檢驗;回歸分析與模型診斷;生存分析;COX回歸

寫在正文前的話,關於基礎知識,此篇為終結篇,筆記來自醫學方的課程,僅用於學習R的過程。

正文:

  描述性統計

n  如何去生成table1 用table()函數,快速匯總頻數

u  生成四格表:table(行名,列名)

> table(tips$sex,tips$smoker)

      No Yes

Female 54  33

      Male   97  60

 

addmargins()函數:對生成的table表格進行計算

> table(esoph$agegp,esoph$ncases)

      

         0  1  2  3  4  5  6  8  9 17

  25-34 14  1  0  0  0  0  0  0  0  0

  35-44 10  2  2  1  0  0  0  0  0  0

  45-54  3  2  2  2  3  2  2  0  0  0

  55-64  0  0  2  4  3  2  2  1  2  0

  65-74  1  4  2  2  2  2  1  0  0  1

  75+    1  7  3  0  0  0  0  0  0  0

> tt <- table(esoph$agegp,esoph$ncases)

> addmargins(tt,margin = c(1,2))  # margin 1表示行,2表示列

      

         0  1  2  3  4  5  6  8  9 17 Sum

  25-34 14  1  0  0  0  0  0  0  0  0  15

  35-44 10  2  2  1  0  0  0  0  0  0  15

  45-54  3  2  2  2  3  2  2  0  0  0  16

  55-64  0  0  2  4  3  2  2  1  2  0  16

  65-74  1  4  2  2  2  2  1  0  0  1  15

  75+    1  7  3  0  0  0  0  0  0  0  11

      Sum   29 16 11  9  8  6  5  1  2  1  88

n  xtabs()函數:在數據集中取子集,生成表格

> hightip <- tips[,'tip'] > mean(tips[,'tip'] )

> as.data.frame(xtabs(~tips$sex + hightip,subset = tips$smoker=='No'))

#as.data.frame轉換成數據框;~后面的公式類似table括號中的內容,為分類變量;~左邊需添加的是連續型變量;有一個子集subset可進行提取

  tips.sex hightip Freq

1   Female   FALSE   31

2     Male   FALSE   49

3   Female    TRUE   23

4     Male    TRUE   48

 

> as.data.frame(xtabs(~tips$sex + hightip,subset = tips$day %in% c('Sun','Sat')))

  tips.sex hightip Freq

1   Female   FALSE   21

2     Male   FALSE   53

3   Female    TRUE   25

4     Male    TRUE   64

 

示例2:

> xtabs(ncontrols~agegp + alcgp,data = esoph)

       alcgp

agegp   0-39g/day 40-79 80-119 120+

  25-34        61    45      5    5

  35-44        89    80     20   10

  45-54        78    81     39   15

  55-64        89    84     43   26

  65-74        71    53     29    8

  75+          27    12      2    3

> addmargins(xtabs(ncontrols~agegp + alcgp,data = esoph),margin = c(1,2))

       alcgp

agegp   0-39g/day 40-79 80-119 120+ Sum

  25-34        61    45      5    5 116

  35-44        89    80     20   10 199

  45-54        78    81     39   15 213

  55-64        89    84     43   26 242

  65-74        71    53     29    8 161

  75+          27    12      2    3  44

  Sum         415   355    138   67 975

 

#cbind(數值型,數值型)

> xtabs(cbind(ncases,ncontrols)~agegp + alcgp,data = esoph)

, ,  = ncases

 

       alcgp

agegp   0-39g/day 40-79 80-119 120+

  25-34         0     0      0    1

  35-44         1     4      0    4

  45-54         1    20     12   13

  55-64        12    22     24   18

  65-74        11    25     13    6

  75+           4     4      2    3

 

, ,  = ncontrols

 

       alcgp

agegp   0-39g/day 40-79 80-119 120+

  25-34        61    45      5    5

  35-44        89    80     20   10

  45-54        78    81     39   15

  55-64        89    84     43   26

  65-74        71    53     29    8

  75+          27    12      2    3

n  ftable()函數:扁平表格,接受xtabs對象,進行調整

> ftable(xtabs(cbind(ncases,ncontrols)~agegp +alcgp,data = esoph))

                 ncases ncontrols

agegp alcgp                     

25-34 0-39g/day       0        61

      40-79           0        45

      80-119          0         5

      120+            1         5

35-44 0-39g/day       1        89

      40-79           4        80

      80-119          0        20

      120+            4        10

45-54 0-39g/day       1        78

      40-79          20        81

      80-119         12        39

      120+           13        15

55-64 0-39g/day      12        89

      40-79          22        84

      80-119         24        43

      120+           18        26

65-74 0-39g/day      11        71

      40-79          25        53

      80-119         13        29

      120+            6         8

75+   0-39g/day       4        27

      40-79           4        12

      80-119          2         2

      120+            3         3

n  數據匯總可結合前面學習的函數,如:

u  summary(數據集)函數

u  describe(數據集)函數(psych包里的)

u  describe.data.frame(數據集)函數 (Hmisc包里的)

u  apply()家族等

  t檢驗

n  假設檢驗:服從正態分布方差齊性(如果不齊,可以用t’檢驗),

n  示例:

u  生成隨機數據,並檢驗是否符合正態分布:(shapiro.test()函數)

> data1 <- sample(1:100,50)

>

> #檢驗正態性  shapiro.test()函數

> shapiro.test(data1)

 

       Shapiro-Wilk normality test

 

data:  data1

W = 0.96424, p-value = 0.1338  #不能拒絕原假設,服從正態分布

除此之外,還有以下5個函數可用於檢驗是否符合正態分布:

library(nortest)

lillie.test(data1)

ad.test(data1)

cvm.test(data1)

pearson.test(data1)

sf.test(data1)

如:

> lillie.test(data1)

 

       Lilliefors (Kolmogorov-Smirnov) normality test

 

data:  data1

D = 0.064492, p-value = 0.8693

 

> ad.test(data1)

 

       Anderson-Darling normality test

 

data:  data1

A = 0.40918, p-value = 0.333

 

> cvm.test(data1)

 

       Cramer-von Mises normality test

 

data:  data1

W = 0.054212, p-value = 0.4437

 

> pearson.test(data1)

 

       Pearson chi-square normality test

 

data:  data1

P = 4, p-value = 0.7798

 

> sf.test(data1)

 

       Shapiro-Francia normality test

 

data:  data1

W = 0.97496, p-value = 0.3075

生成服從正態分布的數據:rnorm()函數

> #生成服從正態分布的數據

> data3 <- rnorm(100,3,5)

> data4 <- rnorm(200,3.4,8)

>

> shapiro.test(data3)

 

       Shapiro-Wilk normality test

 

data:  data3

W = 0.99369, p-value = 0.9258

 

>

> shapiro.test(data4)

 

       Shapiro-Wilk normality test

 

data:  data4

W = 0.98946, p-value = 0.149

檢驗方差齊性:var.test()函數:僅適用於兩組

> var.test(data3,data4)

 

       F test to compare two variances

 

data:  data3 and data4

F = 0.40348, num df = 99, denom df = 199, p-value = 1.088e-06

alternative hypothesis: true ratio of variances is not equal to 1

95 percent confidence interval:

 0.2893355 0.5739700

sample estimates:

ratio of variances

         0.4034794   #p值小於0.05,說明方差不齊性

u  進行t檢驗函數:t.test()函數

兩組均數t檢驗

> t.test(data3,data4,var.equal = F)   #方差不齊則var.equal設置為F,默認也為FALSE,若其設置為TRUE,如為FALSE,則會使用t’檢驗

 

       Welch Two Sample t-test

 

data:  data3 and data4

t = -1.3681, df = 281.41, p-value = 0.1724   #說明兩組均數無顯著統計學差異

alternative hypothesis: true difference in means is not equal to 0

95 percent confidence interval:

 -2.587004  0.465494

sample estimates:

mean of x mean of y

      2.468549  3.529304

樣本均數與總體均數t檢驗

> #樣本均數與總體均數的比較

> t.test(data3,mu =3.2)

 

       One Sample t-test

 

data:  data3

t = -1.4117, df = 99, p-value = 0.1612  #樣本與總體無統計學差異

alternative hypothesis: true mean is not equal to 3.2

95 percent confidence interval:

 1.440424 3.496675

sample estimates:

mean of x

 2.468549

配對數據的t檢驗

data3 <- rnorm(200,3,5)

> data4 <- rnorm(200,3.4,5)

>

> #配對數據

> #進行配對t檢驗

> t.test(data3,data4,paired = TRUE)

 

       Paired t-test

 

data:  data3 and data4

t = 0.59079, df = 199, p-value = 0.5553  #p大於0.05,說明無顯著差異

alternative hypothesis: true difference in means is not equal to 0

95 percent confidence interval:

 -0.6946177  1.2888581

sample estimates:

mean of the differences

              0.2971202

 

  數據變換

n  數據不滿足正態分布時該如何處理:有時可采用取log()、sqrt()、1/x等等方式,

n  runif()平均分布函數

> mydata <- runif(100,min = 1,max = 2)

> shapiro.test(mydata)

 

       Shapiro-Wilk normality test

 

data:  mydata

W = 0.93149, p-value = 6.05e-05

 

> shapiro.test(log(mydata))

 

       Shapiro-Wilk normality test

 

data:  log(mydata)

W = 0.93082, p-value = 5.54e-05

 

> shapiro.test(sqrt(mydata))

 

       Shapiro-Wilk normality test

 

data:  sqrt(mydata)

W = 0.93236, p-value = 6.794e-05

 

> shapiro.test(1/mydata)

 

       Shapiro-Wilk normality test

 

data:  1/mydata

W = 0.9195, p-value = 1.325e-05

#幾種方式均不可行。

boxcox轉換---對公式的,加載MASS包,運用里面的boxcox()函數,#boxcox()轉換,核心為找到lambda的值進行相應的轉換

bc <- boxcox(Volume ~ log(Height) + log(Girth),data = trees,

             lambda = seq(-0.25,0.25,length =10))

#Volume為擬操作的變量,Height和Girth為用此兩個數據進行估計,但要取log,trees為數據集,lambda后面的公式為取最佳值

u  用公式:(x^lambda-1)/lambda  進行數據轉換(lambda 不等於1),新的Volume_bc就服從正態分布了。

> Volume_bc <- (trees$Volume^lambda-1)/lambda

> shapiro.test(Volume_bc)

 

       Shapiro-Wilk normality test

 

data:  Volume_bc

W = 0.96431, p-value = 0.3776

u  可用qqnorm(Volume_bc)和qqline(Volume_bc)c查看圖,是否符合正態分布

boxcox轉換---對數值的,加載forecast包,利用包中的BoxCox.lambda()函數

> BoxCox.lambda(trees$Volume)

[1] -0.4954451   #lambda的值  ,采用1/sqrt(x)

分以下幾種情況:

ü  lambda接近-1時    1/x

ü             -0.5     1/sqrt(x)

ü              0       ln(x) 或log(x)

ü              0.5      sqrt()

ü               1       不用變換了

完整示例:

> BoxCox.lambda(trees$Volume)

[1] -0.4954451

> new_volum <- 1/sqrt(trees$Volume)

> shapiro.test(new_volum)

 

       Shapiro-Wilk normality test

 

data:  new_volum

W = 0.94523, p-value = 0.1152

利用car包中powerTransform得到lambda值,再進行正態分布分析

> library(car)

> powerTransform(trees$Volume)

Estimated transformation parameter

trees$Volume

 -0.07476608

> new_volum <- trees$Volume^-0.07476608

> shapiro.test(new_volum)

 

       Shapiro-Wilk normality test

 

data:  new_volum

W = 0.96428, p-value = 0.3768

  方差分析

n  用於多組均值的比較。兩組均值的t檢驗與方差分析等價,但是對於>=3組的均數比較,t檢驗不適用,需用方差分析。

n  假設檢驗,需滿足的條件:

u  正態性

u  方差齊性

u  獨立性

n  方差齊性的檢驗

u  安裝multcomp包,加載cholesterol數據集(里面包含response組和trt組)

u  方差齊性的檢驗:

l  R語言的內置包:bartlett.test()

> bartlett.test(response~trt,data = cholesterol)

 

       Bartlett test of homogeneity of variances

 

data:  response by trt

Bartlett's K-squared = 0.57975, df = 4, p-value =

0.9653

 

> #p = 0.9653   方差是齊性的

>

> #正態性檢驗

> shapiro.test(cholesterol$response)

 

       Shapiro-Wilk normality test

 

data:  cholesterol$response

W = 0.97722, p-value = 0.4417

l  R語言的內置包:fligner.test() 與bartlett.test()檢驗原理不同,但公式一樣

> #方差齊性檢驗

> fligner.test(response~trt,data = cholesterol)

 

       Fligner-Killeen test of homogeneity of variances

 

data:  response by trt

Fligner-Killeen:med chi-squared = 0.74277, df = 4,

p-value = 0.946

car包中的ncvTest()函數:

> ncvTest(lm(response~trt,data = cholesterol))

Non-constant Variance Score Test

Variance formula: ~ fitted.values

Chisquare = 0.1338923    Df = 1     p = 0.71443

car包中的leveneTest()函數:

> leveneTest(response~trt,data = cholesterol)

Levene's Test for Homogeneity of Variance (center = median)

      Df F value Pr(>F)

group  4  0.0755 0.9893

      45

 

n  方差分析

單因素方差分析aov()函數

> aov(response~trt,data = cholesterol)

Call:

   aov(formula = response ~ trt, data = cholesterol)

 

Terms:

                      trt Residuals

Sum of Squares  1351.3690  468.7504

Deg. of Freedom         4        45

 

Residual standard error: 3.227488

Estimated effects may be unbalanced

> summary(fit)

            Df Sum Sq Mean Sq F value   Pr(>F)   

trt          4 1351.4   337.8   32.43 9.82e-13 ***

Residuals   45  468.8    10.4                    

---

Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

#用gplots包中的plotmeans可視化

plotmeans(response~trt,data = cholesterol)

單因素方差分析:oneway.test()函數

> oneway.test(response~trt,data = cholesterol)

 

       One-way analysis of means (not assuming equal

       variances)

 

data:  response and trt

F = 30.709, num df = 4.00, denom df = 22.46, p-value =

8.227e-09

 

> oneway.test(response~trt,data = cholesterol,var.equal = T)

 

       One-way analysis of means

 

data:  response and trt

F = 32.433, num df = 4, denom df = 45, p-value =

9.819e-13

兩兩比較:TukeyHSD()函數

> TukeyHSD(fit)

  Tukey multiple comparisons of means

    95% family-wise confidence level

 

Fit: aov(formula = response ~ trt, data = cholesterol)

 

$trt

                  diff        lwr       upr     p adj

2times-1time   3.44300 -0.6582817  7.544282 0.1380949

4times-1time   6.59281  2.4915283 10.694092 0.0003542

drugD-1time    9.57920  5.4779183 13.680482 0.0000003

drugE-1time   15.16555 11.0642683 19.266832 0.0000000

4times-2times  3.14981 -0.9514717  7.251092 0.2050382

drugD-2times   6.13620  2.0349183 10.237482 0.0009611

drugE-2times  11.72255  7.6212683 15.823832 0.0000000

drugD-4times   2.98639 -1.1148917  7.087672 0.2512446

drugE-4times   8.57274  4.4714583 12.674022 0.0000037

drugE-drugD    5.58635  1.4850683  9.687632 0.0030633

#可視化

plot(TukeyHSD(fit))

多因素方差分析:aov()函數

> data('ToothGrowth')

> head('ToothGrowth')

len supp dose

1  4.2   VC  0.5

2 11.5   VC  0.5

3  7.3   VC  0.5

4  5.8   VC  0.5

5  6.4   VC  0.5

6 10.0   VC  0.5

> levels(ToothGrowth$supp)

[1] "OJ" "VC"

> levels(ToothGrowth$dose)

NULL

> levels(as.factor(ToothGrowth$dose))

[1] "0.5" "1"   "2" 

> table(ToothGrowth$supp,ToothGrowth$dose)

   

     0.5  1  2

  OJ  10 10 10

  VC  10 10 10

 

#公式的寫法   關注交互作用

 

> fangcha <- aov (len~supp+dose,data = ToothGrowth)

> summary(fangcha)

            Df Sum Sq Mean Sq F value   Pr(>F)   

supp         1  205.4   205.4   11.45   0.0013 **

dose         1 2224.3  2224.3  123.99 6.31e-16 ***

Residuals   57 1022.6    17.9                    

---

Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

#交互作用

> fangcha <- aov (len~supp*dose,data = ToothGrowth)

> summary(fangcha)

            Df Sum Sq Mean Sq F value   Pr(>F)   

supp         1  205.4   205.4  12.317 0.000894 ***

dose         1 2224.3  2224.3 133.415  < 2e-16 ***

supp:dose    1   88.9    88.9   5.333 0.024631 *    #交互作用不能忽視

Residuals   56  933.6    16.7                    

---

Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

#可視化  交互作用圖

>interaction.plot(ToothGrowth$dose,ToothGrowth$supp,ToothGrowth$len,type = 'b',

+                  pch = c(1,10),col = c('red','green'))

單因素協方差分析:aov()函數

> data('litter')

> head(litter)

  dose weight gesttime number

1    0  28.05     22.5     15

2    0  33.33     22.5     14

3    0  36.37     22.0     14

4    0  35.52     22.0     13

5    0  36.77     21.5     15

6    0  29.60     23.0      5

 

> facn <- aov(weight~gesttime + dose + gesttime:dose,data = litter)

> summary(facn)

              Df Sum Sq Mean Sq F value  Pr(>F)  

gesttime       1  134.3  134.30   8.289 0.00537 **  #協變量確實有影響,該如何去除?

dose           3  137.1   45.71   2.821 0.04556 *

gesttime:dose  3   81.9   27.29   1.684 0.17889  

Residuals     66 1069.4   16.20                  

---

Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

去除協變量的影響:加載effects包,用其中的effect()函數

> effect('dose',facn)

NOTE: dose is not a high-order term in the model

 

 dose effect

dose

       0        5       50      500

32.30722 28.50625 30.61078 29.29005

 

#常見研究設計的表達式

 

 

  卡方檢驗

n  對分類變量的檢驗方法

n  分類:

u  擬合優度檢驗:用chisq.test()函數

(針對樣本數據的分布與已知總體分布是否一致)

> men <- c(11,120,60,45)

> women <- c(20,102,39,30)

> df <- as.data.frame(rbind(men,women))

> colnames(df) <- c('AB','O','A','B')

> chisq.test(men)

 

       Chi-squared test for given probabilities

 

data:  men

X-squared = 105.46, df = 3, p-value < 2.2e-16

 

> chisq.test(men,p = c(0.1,0.5,0.2,0.2))  #p 中為總體人群中各血型的比例

 

       Chi-squared test for given probabilities

 

data:  men

X-squared = 10.335, df = 3, p-value = 0.01592

u  卡方齊性檢驗:用於比較不同分類水平下,各個類型的比例是否一致。

> chisq.test(df)  #行變量與列變量的相關性檢驗

 

       Pearson's Chi-squared test

 

data:  df

X-squared = 6.8607, df = 3, p-value = 0.07647 #男女不同血型的分布一致

u  卡方獨立性檢驗:

> chisq.test(df)

 

       Pearson's Chi-squared test

 

data:  df

X-squared = 6.8607, df = 3, p-value = 0.07647 # 行變量和列變量無關聯,血型分布和性別無關

u  CMH檢驗:分層檢驗 三維,行變量為無序,列變量為有序數據;判斷是否有混雜因素

#采用mantelhaen.test()函數

> Rabbits <-

+   array(c(0,0,6,5,

+           3,0,3,6,

+           6,2,0,4,

+           5,6,1,0,

+           2,5,0,0),

+         dim = c(2,2,5),

+         dimnames = list(

+           Delay = c('None','1.5h'),

+           Response = c('Cured','Died'),

+           Penicillin.Level = c('1/8','1/4','1/2','1','4')))

> Rabbits

, , Penicillin.Level = 1/8

 

      Response

Delay  Cured Died

  None     0    6

  1.5h     0    5

 

, , Penicillin.Level = 1/4

 

      Response

Delay  Cured Died

  None     3    3

  1.5h     0    6

 

, , Penicillin.Level = 1/2

 

      Response

Delay  Cured Died

  None     6    0

  1.5h     2    4

 

, , Penicillin.Level = 1

 

      Response

Delay  Cured Died

  None     5    1

  1.5h     6    0

 

, , Penicillin.Level = 4

 

      Response

Delay  Cured Died

  None     2    0

  1.5h     5    0

 

> mantelhaen.test(Rabbits)

 

       Mantel-Haenszel chi-squared test with continuity

       correction

 

data:  Rabbits

Mantel-Haenszel X-squared = 3.9286, df = 1, p-value =

0.04747

alternative hypothesis: true common odds ratio is not equal to 1

95 percent confidence interval:

  1.026713 47.725133

sample estimates:

common odds ratio

                7    #可以判定盤尼西林是否為混雜因素

u  CMH檢驗:有序分類的卡方檢驗:連續型變量

#采用mantelhaen.test()函數

> Satisfaction <-

+   as.table(array(c(1,2,0,0,3,3,1,2,

+                    11,17,8,4,2,3,5,2,

+                    1,0,0,0,1,3,0,1,

+                    2,5,7,9,1,1,3,6),

+                  dim = c(4,4,2),

+                  dimnames =

+                    list(Income =

+                           c('<5000','5000-15000',

+                             '15000-25000','>25000'),

+                         'Job Satisfaction' =

+                           c('V_D','L_S','M_S','V_S'),

+                         Gender = c('Female','Male'))))

> Satisfaction

, , Gender = Female

 

             Job Satisfaction

Income        V_D L_S M_S V_S

  <5000         1   3  11   2

  5000-15000    2   3  17   3

  15000-25000   0   1   8   5

  >25000        0   2   4   2

 

, , Gender = Male

 

             Job Satisfaction

Income        V_D L_S M_S V_S

  <5000         1   1   2   1

  5000-15000    0   3   5   1

  15000-25000   0   0   7   3

  >25000        0   1   9   6

 

> mantelhaen.test(Satisfaction)

 

       Cochran-Mantel-Haenszel test

 

data:  Satisfaction

Cochran-Mantel-Haenszel M^2 = 10.2, df = 9, p-value =

0.3345   #工資水平與滿意度無關

u  配對四格表的卡方檢驗:采用mcnemar.test()函數

> paired <- as.table(matrix(c(157,24,69,18),nrow = 2,dimnames = list(case =c('A','B'),control = c('A','B'))))

> paired

    control

case   A   B

   A 157  69

   B  24  18

> mcnemar.test(paired)

 

       McNemar's Chi-squared test with continuity correction

 

data:  paired

McNemar's chi-squared = 20.817, df = 1, p-value =

5.053e-06   

> chisq.test(paired)

 

       Pearson's Chi-squared test with Yates' continuity

       correction

 

data:  paired

X-squared = 1.9244, df = 1, p-value = 0.1654

u  的

  回歸分析與模型診斷

前提:是否適合做線性回歸,是否滿足正態分布

只要因變量是連續型變量即可做線性回歸,因變量是分類變量需要做Logistic回歸;自變量是連續型或者分類變量無影響

n  一元回歸分析:lm(因變量~自變量)

x為連續型變量

> x<- seq(1,5,len =100)

> noise <- rnorm(n=100,mean = 0,sd = 1)

> beta0 = 1

> beta1 <-2

> y <- beta0 + beta1*x + noise

> plot(y~x)   #看一下是否適合做線性回歸

> model <- lm(y~x)

> summary(model)

 

Call:

lm(formula = y ~ x)

 

Residuals:

     Min       1Q   Median       3Q      Max

-2.49732 -0.64794  0.00215  0.75355  3.06579

 

Coefficients:

            Estimate Std. Error t value Pr(>|t|)   

(Intercept)  0.77938    0.28733   2.712  0.00789 **

x            2.05561    0.08927  23.027  < 2e-16 *** 

---

Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

 

Residual standard error: 1.041 on 98 degrees of freedom

Multiple R-squared:  0.844,    Adjusted R-squared:  0.8424 #模型可解釋度

F-statistic: 530.3 on 1 and 98 DF,  p-value: < 2.2e-16  #模型可靠性

x為分類變量

> x <- factor(rep(c(0,1,2),each = 20))

> y <- c(rnorm(20,0,1),rnorm(20,1,1),rnorm(20,2,1))

> model <- lm(y~x)

> summary(model)

 

Call:

lm(formula = y ~ x)

 

Residuals:

     Min       1Q   Median       3Q      Max

-2.50566 -0.82826  0.01137  0.87966  2.41338

 

Coefficients:

            Estimate Std. Error t value Pr(>|t|)    

(Intercept)  0.05338    0.24331   0.219    0.827   

x1           0.81708    0.34409   2.375    0.021 * 

x2           1.99903    0.34409   5.810 2.94e-07 ***

---

Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

 

Residual standard error: 1.088 on 57 degrees of freedom

Multiple R-squared:  0.3745,      Adjusted R-squared:  0.3525

F-statistic: 17.06 on 2 and 57 DF,  p-value: 1.558e-06

 

> exp( 1.99903)  #轉化成OR值

[1] 7.381892

n  模型診斷

非正態性:shapiro.test()判斷

> #檢驗殘差是否服從正態分布

> data(LMdata,package = 'rinds')

> model <- lm(y~x,data = LMdata$NonL)

> #找殘差

>

> res1 <- residuals(model)

> shapiro.test(res1)   #殘差不符合正態分布,需要重新做

 

       Shapiro-Wilk normality test

 

data:  res1

W = 0.93524, p-value = 1e-04   #殘差不符合正態分布,需要重新做

u  非線性

> #2.非線性

> model2 <- lm(y~x,data = LMdata$NonL)

> plot(model2)

#y 與x2的關系是否成線性

model2 <- lm(y~x+ I(x^2),data = LMdata$NonL)

summary(model2)

#踢掉x

> model3 <- update(model2,y~.-x)

> summary(model3)

 

Call:

lm(formula = y ~ 1, data = LMdata$NonL)

 

Residuals:

    Min      1Q  Median      3Q     Max

-19.456 -12.907  -2.464  10.725  28.697

 

Coefficients:

            Estimate Std. Error t value Pr(>|t|)   

(Intercept)   21.829      1.429   15.28   <2e-16 ***

---

Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

 

Residual standard error: 14.29 on 99 degrees of freedom

> model2 <- lm(y~x+ I(x^2),data = LMdata$NonL)

> summary(model2)

Call:

lm(formula = y ~ x + I(x^2), data = LMdata$NonL)

 

Residuals:

     Min       1Q   Median       3Q      Max

-2.32348 -0.60702  0.00701  0.56018  2.27346

 

Coefficients:

            Estimate Std. Error t value Pr(>|t|)   

(Intercept)  0.98702    0.62216   1.586    0.116   

x            0.11085    0.45405   0.244    0.808   

I(x^2)       1.97966    0.07456  26.552   <2e-16 ***

---

Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

 

Residual standard error: 0.907 on 97 degrees of freedom

Multiple R-squared:  0.9961,      Adjusted R-squared:  0.996

F-statistic: 1.224e+04 on 2 and 97 DF,  p-value: < 2.2e-16

 

> model3 <- update(model2,y~.-x)

> summary(model3)

 

Call:

lm(formula = y ~ I(x^2), data = LMdata$NonL)

 

Residuals:

     Min       1Q   Median       3Q      Max

-2.34289 -0.60692  0.01722  0.58186  2.29601

 

Coefficients:

            Estimate Std. Error t value Pr(>|t|)   

(Intercept)  1.13378    0.15963   7.103 1.97e-10 ***

I(x^2)       1.99760    0.01271 157.195  < 2e-16 ***

---

Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

 

Residual standard error: 0.9026 on 98 degrees of freedom

Multiple R-squared:  0.996,       Adjusted R-squared:  0.996

F-statistic: 2.471e+04 on 1 and 98 DF,  p-value: < 2.2e-16

 

> AIC(model,model2,model3)

       df      AIC

model   3 478.4558

model2  4 269.2121

model3  3 267.2736  #擬合效果越來越好

 

plot(model3$residuals~LMdata$NonL)  #在0左右分布

u  異方差:考慮的是噪聲對模型的影響

model4 <- lm(y~x,data = LMdata$Hetero)

plot(model4$residuals~LMdata$Hetero$x)

#處理方法:使用加權最小二乘法:對於不同的樣本點給予不同的權重

> summary(model5)

 

Call:

lm(formula = y ~ x, data = LMdata$Hetero, weights = 1/x^2)

 

Weighted Residuals:

     Min       1Q   Median       3Q      Max

-2.25132 -0.68409 -0.03924  0.63997  2.61098

 

Coefficients:

            Estimate Std. Error t value Pr(>|t|)   

(Intercept)   1.2611     0.5139   2.454   0.0159 * 

x             1.8973     0.2317   8.190 9.98e-13 ***

---

Signif. codes: 

0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

 

Residual standard error: 1.025 on 98 degrees of freedom

Multiple R-squared:  0.4063,      Adjusted R-squared:  0.4003

F-statistic: 67.07 on 1 and 98 DF,  p-value: 9.977e-13

 

> summary(model4)

 

Call:

lm(formula = y ~ x, data = LMdata$Hetero)

 

Residuals:

    Min      1Q  Median      3Q     Max

-8.3371 -1.6383 -0.1533  1.4075 12.3423

 

Coefficients:

            Estimate Std. Error t value Pr(>|t|)   

(Intercept)   1.6175     1.0019   1.615     0.11   

x             1.7671     0.3113   5.677  1.4e-07 ***

---

Signif. codes: 

0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

 

Residual standard error: 3.63 on 98 degrees of freedom

Multiple R-squared:  0.2475,      Adjusted R-squared:  0.2398

F-statistic: 32.23 on 1 and 98 DF,  p-value: 1.396e-07

 

#處理方法:使用迭代重復加權最小二乘法  采用nlme包中的gls()函數,找到最合適的weights公式,得到最小AIC

> glsmodel <- gls(y~x,weights = varFixed(~x),data = LMdata$Hetero)

> summary(glsmodel)

Generalized least squares fit by REML

  Model: y ~ x

  Data: LMdata$Hetero

       AIC      BIC    logLik

  517.5286 525.2835 -255.7643

 

Variance function:

 Structure: fixed weights

 Formula: ~x

 

Coefficients:

               Value Std.Error  t-value p-value

(Intercept) 1.421324 0.7091780 2.004184  0.0478

x           1.832543 0.2603653 7.038351  0.0000

 

 Correlation:

  (Intr)

x -0.908

 

Standardized residuals:

        Min          Q1         Med          Q3

-2.17451000 -0.55322766 -0.03454023  0.51060224

        Max

 2.93969900

 

Residual standard error: 1.890127

Degrees of freedom: 100 total; 98 residual

u  自相關:利用lmtest包中的DW檢驗函數:dwtest()函數

> model6 <- lm(y~x,data = LMdata$AC)

> dwtest(model6)

 

       Durbin-Watson test

 

data:  model6

DW = 0.65556, p-value = 2.683e-12  #存在自相關,不能使用最小二乘法,可使用gls()函數,實現廣義最小二乘法,注意gls中的correlation 參數,設置為corAR1()

alternative hypothesis: true autocorrelation is greater than 0

 

> glsmodel2 <- gls(y~x,correlation = corAR1(),data = LMdata$AC)

> summary(glsmodel2)

Generalized least squares fit by REML

  Model: y ~ x

  Data: LMdata$AC

       AIC      BIC    logLik

  268.7617 279.1016 -130.3809

 

Correlation Structure: AR(1)

 Formula: ~1

 Parameter estimate(s):

      Phi

0.7041665

 

Coefficients:

               Value Std.Error  t-value p-value

(Intercept) 1.335059 0.7792019 1.713367  0.0898

x           2.072936 0.2405513 8.617438  0.0000

 

 Correlation:

  (Intr)

x -0.926

 

Standardized residuals:

         Min           Q1          Med           Q3

-1.667086282 -0.571601900  0.001724114  0.568360354

         Max

 2.306177988

 

Residual standard error: 1.25329

Degrees of freedom: 100 total; 98 residual

> summary(model6)

 

Call:

lm(formula = y ~ x, data = LMdata$AC)

 

Residuals:

     Min       1Q   Median       3Q      Max

-2.10131 -0.72894 -0.03329  0.66138  2.77461

 

Coefficients:

            Estimate Std. Error t value Pr(>|t|)   

(Intercept)   1.2626     0.3303   3.822 0.000232 ***

x             2.1118     0.1026  20.578  < 2e-16 ***

---

Signif. codes: 

0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

 

Residual standard error: 1.197 on 98 degrees of freedom

Multiple R-squared:  0.8121,      Adjusted R-squared:  0.8101

F-statistic: 423.4 on 1 and 98 DF,  p-value: < 2.2e-16

u  異常值:離群點,杠桿點、高影響點

model7 <- lm(y~x,data = LMdata$Outlier)

plot(y~x,data = LMdata$Outlier)

abline(model7)  #畫出回歸直線

#利用car包中的infuencePlot()函數進行判斷,圓圈越大,對模型影響越大,做線性回歸模型時需踢掉該點,用update函數

model8 <- update(model7,y~x,subset = -32,data = LMdata$Outlier)

plot(y~x,data = LMdata$Outlier)   #比較去除前后的差異

abline(model7,col = 'red')   #比較去除前后的差異

abline(model8,col = 'green')  #比較去除前后的差異

u  多重共線性的判斷:自變量之間存在線性關系

#x1,x2,x3與y之間p值無顯著差異,但是總體上的p值和R值非常顯著,說明x1,x2,x3之間可能是存在相關性的,但與y沒有相關性,不能進行回歸分析,需要用函數進行檢驗確認:vif()函數計算方差膨脹因子

> model9 <- lm(y~x1+x2+x3,data = LMdata$Mult)

> summary(model9)

 

Call:

lm(formula = y ~ x1 + x2 + x3, data = LMdata$Mult)

 

Residuals:

     Min       1Q   Median       3Q      Max

-1.29100 -0.26369  0.00141  0.32529  0.91182

 

Coefficients:

            Estimate Std. Error t value Pr(>|t|) 

(Intercept)   0.3848     0.5813   0.662   0.5095 

x1            7.2022     4.8552   1.483   0.1412  

x2          -14.0916    12.1385  -1.161   0.2486 

x3            8.2312     4.8559   1.695   0.0933 .

---

Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

 

Residual standard error: 0.499 on 96 degrees of freedom

Multiple R-squared:  0.9984,      Adjusted R-squared:  0.9984

F-statistic: 2.057e+04 on 3 and 96 DF,  p-value: < 2.2e-16

#vif計算方差膨脹因子,一般而言,大於10則認為存在共線性

> vif(model9)

        x1         x2         x3

  7560.819 214990.752 222630.742

#運用自動回歸判斷是x中的哪些變量存在共線性:step()函數

> model10 <- step(model9)

Start:  AIC=-135.11

y ~ x1 + x2 + x3

 

       Df Sum of Sq    RSS     AIC

- x2    1   0.33560 24.241 -135.71

<none>              23.905 -135.11

- x1    1   0.54795 24.453 -134.84

- x3    1   0.71550 24.621 -134.16

 

Step:  AIC=-135.71

y ~ x1 + x3

 

       Df Sum of Sq     RSS     AIC

<none>                 24.2 -135.71

- x1    1     189.2   213.4   79.81

- x3    1   15276.4 15300.7  507.05

> summary(model10)

 

Call:

lm(formula = y ~ x1 + x3, data = LMdata$Mult)

 

Residuals:

     Min       1Q   Median       3Q      Max

-1.24046 -0.27519 -0.02751  0.32824  0.91344

 

Coefficients:

            Estimate Std. Error t value Pr(>|t|)   

(Intercept)  0.38158    0.58230   0.655    0.514   

x1           1.56614    0.05692  27.514   <2e-16 ***

x3           2.59393    0.01049 247.241   <2e-16 ***

---

Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

 

Residual standard error: 0.4999 on 97 degrees of freedom

Multiple R-squared:  0.9984,      Adjusted R-squared:  0.9984

F-statistic: 3.074e+04 on 2 and 97 DF,  p-value: < 2.2e-16

 

n  Logistic回歸

與線性回歸的區別:y為分類變量。

u  使用glm()函數:下載HSAUR2包的數據集plasma

> library(HSAUR2)

> data('plasma')

> head(plasma)

  fibrinogen globulin      ESR

1       2.52       38 ESR < 20

2       2.56       31 ESR < 20

3       2.19       33 ESR < 20

4       2.18       31 ESR < 20

5       3.41       37 ESR < 20

6       2.46       36 ESR < 20

> fit1 <- glm(ESR~fibrinogen+globulin,data = plasma,

+             family = binomial())  #family選擇二分類

> summary(fit1)

 

Call:

glm(formula = ESR ~ fibrinogen + globulin, family = binomial(),

    data = plasma)

 

Deviance Residuals:

    Min       1Q   Median       3Q      Max 

-0.9683  -0.6122  -0.3458  -0.2116   2.2636 

 

Coefficients:

            Estimate Std. Error z value Pr(>|z|) 

(Intercept) -12.7921     5.7963  -2.207   0.0273 *

fibrinogen    1.9104     0.9710   1.967   0.0491 *

globulin      0.1558     0.1195   1.303   0.1925 

---

Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

 

(Dispersion parameter for binomial family taken to be 1)

 

    Null deviance: 30.885  on 31  degrees of freedom

Residual deviance: 22.971  on 29  degrees of freedom

AIC: 28.971

 

Number of Fisher Scoring iterations: 5

 

> exp(coef(fit1)['fibrinogen'])

fibrinogen

  6.755579

> exp(1.9104)

[1] 6.755791   #ORR值

> exp(confint(fit1,parm = 'fibrinogen'))

Waiting for profiling to be done...

    2.5 %    97.5 %   #95%的可信區間

    1.404131 73.000836

n  生存分析與COX回歸

生存分析

u  用coin包中的glima數據集演示,用survival包進行生存分析,利用其中的survfit()函數做出生存曲線圖

> g3 <- subset(glioma,histology =='Grade3')

> fit <- survfit(Surv(time,event)~group,data = g3)  #注意公式的寫法

> plot(fit,lty = c(2,3),col = c(2,1))

> legend('bottomright',legend = c('Control','Treatment'),lty = c(2,1),col = c(2,1))

用survdiff()函數計算兩組間生存差異

> survdiff(Surv(time,event)~group,data = g3)  #接受傳入的形式

Call:

survdiff(formula = Surv(time, event) ~ group, data = g3)

 

               N Observed Expected (O-E)^2/E (O-E)^2/V

group=Control  6        4     1.49      4.23      6.06

group=RIT     11        2     4.51      1.40      6.06

 

      Chisq= 6.1  on 1 degrees of freedom, p= 0.0138

u  用survdiff()函數計算兩組間生存差異

> logrank_test(Surv(time,event)~group,data = g3,distribution='exact')

# 注意公式的寫法,選擇精確分布

       Exact Two-Sample Logrank Test

 

data:  Surv(time, event) by group (Control, RIT)

Z = -2.1711, p-value = 0.02877

alternative hypothesis: true theta is not equal to 1

#討論不同histology的對照組與治療組是否有差異

logrank_test(Surv(time,event)~group|histology,data = glioma,distribution = approximate(B =1000))

 

       Approximative Two-Sample Logrank Test

 

data:  Surv(time, event) by

        group (Control, RIT)

        stratified by histology

Z = -3.6704, p-value < 2.2e-16

alternative hypothesis: true theta is not equal to 1

              COX回歸

u  用GBSG2數據集

> install.packages('TH.data')

Error in install.packages : Updating loaded packages

> data('GBSG2',package = 'TH.data')

> head(GBSG2)

  horTh age menostat tsize tgrade pnodes progrec estrec time cens

1    no  70     Post    21     II      3      48     66 1814    1

2   yes  56     Post    12     II      7      61     77 2018    1

3   yes  58     Post    35     II      9      52    271  712    1

4   yes  59     Post    17     II      4      60     29 1807    1

5    no  73     Post    35     II      1      26     65  772    1

6    no  32      Pre    57    III     24       0     13  448    1

> plot(survfit(Surv(time,cens)~horTh,data = GBSG2),lty = c(2,1),col = c(2,1), mark.Time = T)  #如果需要標記生存時間,在后面加上mark.Time = T

> legend('bottomright',legend = c('yes','no'),lty = c(2,1),col = c(2,1))

#畫出了生存曲線圖

u  生存分析:使用survival包中的coxph()函數

> coxreg <- coxph(Surv(time,cens)~.,data = GBSG2)

> summary(coxreg)

Call:

coxph(formula = Surv(time, cens) ~ ., data = GBSG2)

 

  n= 686, number of events= 299

 

                   coef  exp(coef)   se(coef)      z Pr(>|z|)   

horThyes     -0.3462784  0.7073155  0.1290747 -2.683 0.007301 **

age          -0.0094592  0.9905854  0.0093006 -1.017 0.309126   

menostatPost  0.2584448  1.2949147  0.1834765  1.409 0.158954   

tsize         0.0077961  1.0078266  0.0039390  1.979 0.047794 * 

tgrade.L      0.5512988  1.7355056  0.1898441  2.904 0.003685 **

tgrade.Q     -0.2010905  0.8178384  0.1219654 -1.649 0.099199 . 

pnodes        0.0487886  1.0499984  0.0074471  6.551  5.7e-11 ***

progrec      -0.0022172  0.9977852  0.0005735 -3.866 0.000111 ***

estrec        0.0001973  1.0001973  0.0004504  0.438 0.661307   

---

Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

 

             exp(coef) exp(-coef) lower .95 upper .95

horThyes        0.7073     1.4138    0.5492    0.9109

age             0.9906     1.0095    0.9727    1.0088

menostatPost    1.2949     0.7723    0.9038    1.8553

tsize           1.0078     0.9922    1.0001    1.0156

tgrade.L        1.7355     0.5762    1.1963    2.5178

tgrade.Q        0.8178     1.2227    0.6439    1.0387

pnodes          1.0500     0.9524    1.0348    1.0654

progrec         0.9978     1.0022    0.9967    0.9989

estrec          1.0002     0.9998    0.9993    1.0011

 

Concordance= 0.692  (se = 0.018 )

Rsquare= 0.142   (max possible= 0.995 )

Likelihood ratio test= 104.8  on 9 df,   p=0

Wald test            = 114.8  on 9 df,   p=0

Score (logrank) test = 120.7  on 9 df,   p=0

       #畫樹

install.packages('party')

tree <- ctree(Surv(time,cens)~.,data = GBSG2)

plot(tree)

 


免責聲明!

本站轉載的文章為個人學習借鑒使用,本站對版權不負任何法律責任。如果侵犯了您的隱私權益,請聯系本站郵箱yoyou2525@163.com刪除。



 
粵ICP備18138465號   © 2018-2025 CODEPRJ.COM