1.兩獨樣本參數的非參數檢驗
1.1.Welcoxon秩和檢驗
先將兩樣本看成是單一樣本(混合樣本)然后由小到大排列觀察值統一編秩。如果原假設兩個獨立樣本來自相同的總體為真,那么秩將大約均勻分布在兩個樣本中,即小的、中等的、大的秩值應該大約被均勻分在兩個樣本中。如果備選假設兩個獨立樣本來自不相同的總體為真,那么其中一個樣本將會有更多的小秩值,這樣就會得到一個較小的秩和;另一個樣本將會有更多的大秩值,因此就會得到一個較大的秩和。
R:wilcox.test
##################獨立樣本的曼-惠特尼U檢驗 Forest<-read.table(file="ForestData.txt",header=TRUE,sep=" ") Forest$month<-factor(Forest$month,levels=c("jan","feb","mar","apr","may","jun","jul","aug","sep","oct","nov","dec")) Tmp<-subset(Forest,Forest$month=="jan" | Forest$month=="aug") wilcox.test(temp~month,data=Tmp)
Wilcoxon rank sum test with continuity correction
data: temp by month
W = 2, p-value = 0.01653
alternative hypothesis: true location shift is not equal to 0
1.2.K-S檢驗
##################獨立樣本的K-S檢驗 x1<-subset(Forest,Forest$month=="jan") x2<-subset(Forest,Forest$month=="aug") ks.test(x1$temp,x2$temp)
Two-sample Kolmogorov-Smirnov test
data: x1$temp and x2$temp
D = 0.99457, p-value = 0.03992
alternative hypothesis: two-sided
1.3.兩配對樣本分布
###############配對樣本的Wilcoxon符號秩檢驗 ReportCard<-read.table(file="ReportCard.txt",header=TRUE,sep=" ") ReportCard<-na.omit(ReportCard) wilcox.test(ReportCard$chi,ReportCard$math,paired=TRUE) sum(outer(ReportCard$chi,ReportCard$math,"-")<0) sum(outer(ReportCard$math,ReportCard$chi,"-")<0)
Wilcoxon signed rank test with continuity correction
data: ReportCard$chi and ReportCard$math
V = 1695.5, p-value = 8.021e-11
alternative hypothesis: true location shift is not equal to 0
>
> sum(outer(ReportCard$chi,ReportCard$math,"-")<0)
[1] 332
> sum(outer(ReportCard$math,ReportCard$chi,"-")<0)
[1] 3026
2.兩樣本均值置換檢驗
我們在實驗中經常會因為各種問題(時間、經費、人力、物力)得到一些小樣本結果,如果我們想知道這些小樣本結果的總體是什么樣子的,就需要用到置換檢驗。
Permutation test 置換檢驗是Fisher於20世紀30年代提出的一種基於大量計算(computationally intensive),利用樣本數據的全(或隨機)排列,進行統計推斷的方法,因其對總體分布自由,應用較為廣泛,特別適用於總體分布未知的小樣本資料,以及某些難以用常規方法分析資料的假設檢驗問題。在具體使用上它和Bootstrap Methods類似,通過對樣本進行順序上的置換,重新計算統計檢驗量,構造經驗分布,然后在此基礎上求出P-value進行推斷。
2.1.概述
參數也可以是中位數等
2.2R程序
oneway_test()
Forest<-read.table(file="ForestData.txt",header=TRUE,sep=" ") Forest$month<-factor(Forest$month,levels=c("jan","feb","mar","apr","may","jun","jul","aug","sep","oct","nov","dec")) Tmp<-subset(Forest,Forest$month=="jan" | Forest$month=="aug") t.test(temp~month,data=Tmp,paired=FALSE,var.equal=TRUE) Tmp$month<-as.vector(Tmp$month) Tmp$month<-as.factor(Tmp$month) oneway_test(temp~month,data=Tmp,distribution="exact") oneway_test(temp~month,data=Tmp,distribution="asymptotic") oneway_test(temp~month,data=Tmp,distribution=approximate(B=1000))
Two Sample t-test
data: temp by month
t = -4.8063, df = 184, p-value = 3.184e-06
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-23.106033 -9.657011
sample estimates:
mean in group jan mean in group aug
5.25000 21.63152
Exact Two-Sample Fisher-Pitman Permutation Test
data: temp by month (aug, jan)
Z = 4.5426, p-value = 0.0001744
alternative hypothesis: true mu is not equal to 0
Asymptotic Two-Sample Fisher-Pitman Permutation Test
data: temp by month (aug, jan)
Z = 4.5426, p-value = 5.557e-06
alternative hypothesis: true mu is not equal to 0
Approximative Two-Sample Fisher-Pitman Permutation Test
data: temp by month (aug, jan)
Z = 4.5426, p-value < 2.2e-16
alternative hypothesis: true mu is not equal to 0
2.3相關系數置換檢驗
spearsman_test
對學生成績,基於數學和物理成績的spearsman相關系數進行置換檢驗
ReportCard<-read.table(file="ReportCard.txt",header=TRUE,sep=" ") Tmp<-ReportCard[complete.cases(ReportCard),] cor.test(Tmp[,5],Tmp[,7],alternative="two.side",method="spearman") #是讓你的模擬能夠可重復出現,因為很多時候我們需要取隨機數,但這段代碼再跑一次的時候,結果就不一樣 #了,如果需要重復出現的模擬結果的話,就可以用set.seed()。在調試程序或者做展示的時候,結果的可重#復性是很重要的. 12345是種子數 set.seed(12345) spearman_test(math~phy,data=Tmp,distribution=approximate(B=1000))
sample estimates:
rho
0.7651233
Approximative Spearman Correlation Test
data: math by phy
Z = 5.7766, p-value < 2.2e-16
alternative hypothesis: true rho is not equal to 0
2.4卡方分布置換檢驗
對於學生的成績,在性別和平均分等級列聯表上,采用置換檢驗,看性別與平均分兩個變量是否是獨立的
Tmp<-ReportCard[complete.cases(ReportCard),] CrossTable<-table(Tmp[,c(2,12)]) #編制性別和平均分等級的列聯表 chisq.test(CrossTable,correct=FALSE) chisq_test(sex~avScore,data=Tmp,distribution="asymptotic") set.seed(12345) chisq_test(sex~avScore,data=Tmp,distribution=approximate(B=1000))
> CrossTable
avScore
sex B C D E
F 2 13 10 3
M 2 11 12 5
Pearson's Chi-squared test
data: CrossTable
X-squared = 0.78045, df = 3, p-value = 0.8541
Asymptotic Pearson Chi-Squared Test
data: sex by avScore (B, C, D, E)
chi-squared = 0.78045, df = 3, p-value = 0.8541
Approximative Pearson Chi-Squared Test
data: sex by avScore (B, C, D, E)
chi-squared = 0.78045, p-value = 0.922
原假設:有關,不應拒絕原假設。
2.5兩配對樣本置換檢驗
wilcoxsign_test
ReportCard<-read.table(file="ReportCard.txt",header=TRUE,sep=" ") ReportCard<-na.omit(ReportCard) wilcox.test(ReportCard$chi,ReportCard$math,paired=TRUE) wilcoxsign_test(chi~math,data=ReportCard,distribution="asymptotic")
Wilcoxon signed rank test with continuity correction
data: ReportCard$chi and ReportCard$math
V = 1695.5, p-value = 8.021e-11
alternative hypothesis: true location shift is not equal to 0
Asymptotic Wilcoxon-Pratt Signed-Rank Test
data: y by x (pos, neg)
stratified by block
Z = 6.5041, p-value = 7.817e-11
alternative hypothesis: true mu is not equal to 0
量結論一致
3.兩樣本均值差的自舉檢驗
3.1概述
兩樣本均值的置換檢驗可以檢驗出兩個總體的均值是否存在顯著差異,但對總體均值差的置信區間估計比較困難。置信區間的估計,是以樣本均值差的抽樣分布已知且對稱為前提的,若無法保證這個前提,則可采用自舉發進行檢驗。
3.2.R實現
1.編寫用戶自定義函數
例如,對兩樣本均值的自舉法檢驗:分別計算兩個樣本的均值並返回
DiffMean<-function(DataSet,indices){ ReSample<-DataSet[indices,]#從Dataset中抽取indices決定的觀測形成自舉樣本 diff<-tapply(ReSample[,1],INDEX=as.factor(ReSample[,2]),FUN=mean) #表示以自舉樣本第2列分組標識,分別計算自舉樣本第1列的均值。 return(diff[1]-diff[2]) } #第一列是待檢驗變量,第二列為觀測來自總體的標識。indices包括了n個元素的隨機位置向量,它是從DataSet #中抽取觀測以形成自舉樣本的依據。
2.調用boot函數實現自舉法檢驗
library("boot") Forest<-read.table(file="ForestData.txt",header=TRUE,sep=" ") Forest$month<-factor(Forest$month,levels=c("jan","feb","mar","apr","may","jun","jul","aug","sep","oct","nov","dec")) Tmp<-subset(Forest,Forest$month=="jan" | Forest$month=="aug") Tmp<-cbind(Tmp$temp,Tmp$month) set.seed(12345) BootObject<-boot(data=Tmp,statistic=DiffMean,R=20) #調用自定義函數,自舉重復次數20。
Call:
boot(data = Tmp, statistic = DiffMean, R = 20)
Bootstrap Statistics :
original bias std. error
t1* -16.38152 -0.07459533 0.2012279
BootObject:t是從自舉樣本中得到的M個統計量。
3.獲得計算結果
BootObject$t0 mean(BootObject$t,na.rm=TRUE) print(BootObject) plot(BootObject) boot.ci(BootObject,conf=0.95,type=c("norm","perc"))
CALL :
boot.ci(boot.out = BootObject, conf = 0.95, type = c("norm",
"perc"))
Intervals :
Level Normal Percentile
95% (-16.70, -15.91 ) (-16.85, -16.06 )
Calculations and Intervals on Original Scale
基於自舉樣本的樣本均值差不服從正態分布,因此不適合采納根據正態分布確定的置信區間。