Kolmogorov–Smirnov test (K-S test)


The K-S test uses sample data to infer whether the population the sample was drawn from follows a given theoretical distribution. It is a goodness-of-fit test, suited to exploring the distribution of continuous random variables.

Kolmogorov–Smirnov test

Kolmogorov–Smirnov statistic

Empirical distribution function:

For n independent and identically distributed (i.i.d.) ordered observations X_i, the empirical distribution function F_n is defined as

F_n(x) = \frac{1}{n} \sum_{i=1}^{n} I_{(-\infty, x]}(X_i)

where I_{(-\infty, x]} is the indicator function, equal to 1 if X_i \le x and 0 otherwise.


Given the empirical distribution function F_n(x) of the sample X_i and a hypothesized theoretical distribution F(x), the Kolmogorov–Smirnov statistic is defined as

D_n = \sup_x |F_n(x) - F(x)|

Here \sup_x is the supremum of the set of distances. By the Glivenko–Cantelli theorem, if the X_i follow the theoretical distribution F(x), then D_n converges to 0 almost surely as n tends to infinity. Kolmogorov strengthened this result by effectively providing its rate of convergence; Donsker's theorem provides an even stronger result.
In practice, the statistic requires a relatively large number of data points (compared with other goodness-of-fit criteria such as the Anderson–Darling test statistic) to properly reject the null hypothesis.
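As a concrete illustration of the definitions above, here is a minimal pure-Python sketch (not part of the original article; the sample and the uniform null are my choices) that computes D_n against a hypothesized CDF:

```python
import math
import random

def ks_statistic(sample, cdf):
    """One-sample K-S statistic D_n = sup_x |F_n(x) - F(x)|.

    F_n jumps at the order statistics, so the supremum is attained at a
    sample point: compare F(x_(i)) with both i/n and (i - 1)/n.
    """
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs, start=1):
        fx = cdf(x)
        d = max(d, abs(i / n - fx), abs(fx - (i - 1) / n))
    return d

random.seed(0)
sample = [random.random() for _ in range(1000)]
uniform_cdf = lambda x: min(max(x, 0.0), 1.0)  # CDF of U(0,1)
print(ks_statistic(sample, uniform_cdf))  # small, since the data really are U(0,1)
```

Note that evaluating |F_n(x) - F(x)| only at the sample points would miss the supremum: just before each jump of F_n the deviation can be larger, hence the comparison against both i/n and (i - 1)/n.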

Kolmogorov distribution

Preliminaries:

(1) Independent-increment process

As the name suggests, a stochastic process whose increments are mutually independent. Formally: for any times t_0 < t_1 < \dots < t_k, the increments X(t_1) - X(t_0), \dots, X(t_k) - X(t_{k-1}) are independent random variables.

(2) Wiener process

Roughly, a mathematical formalization of Brownian motion. Formally, W(t) is a Wiener process if W(0) = 0, W has independent increments, W(t) - W(s) \sim N(0, t - s) for s < t, and W has continuous sample paths.

(3) Brownian bridge

A special Wiener process: a Wiener process on the interval [0, T], conditioned on W_T = 0.

(Figure: the red and green curves are both sample paths of a Brownian bridge.)
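Such paths are easy to simulate: the standard construction B(t) = W(t) - t·W(1) pins a Wiener process on [0, 1] to zero at both endpoints. A sketch (illustrative, not from the source; the step count and seed are arbitrary):

```python
import math
import random

def brownian_bridge(n_steps, rng):
    """Simulate a Brownian bridge on [0, 1] via B(t) = W(t) - t * W(1)."""
    dt = 1.0 / n_steps
    w = [0.0]
    for _ in range(n_steps):
        w.append(w[-1] + rng.gauss(0.0, math.sqrt(dt)))  # Wiener increments
    w1 = w[-1]
    return [wi - (i * dt) * w1 for i, wi in enumerate(w)]

path = brownian_bridge(1000, random.Random(42))
print(path[0], path[-1])  # both endpoints are pinned at (essentially) 0
```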

Kolmogorov distribution

The Kolmogorov distribution is the distribution of the random variable

K = \sup_{t \in [0,1]} |B(t)|

i.e., the distribution of the supremum of the absolute value of a Brownian bridge B(t).

Its cumulative distribution function can be written as

\Pr(K \le x) = 1 - 2\sum_{k=1}^{\infty} (-1)^{k-1} e^{-2k^2x^2} = \frac{\sqrt{2\pi}}{x}\sum_{k=1}^{\infty} e^{-(2k-1)^2\pi^2/(8x^2)}

which can also be expressed by the Jacobi theta function \vartheta_{01}(z = 0; \tau = 2ix^2/\pi). Both the form of the Kolmogorov–Smirnov test statistic and its asymptotic distribution under the null hypothesis were published by Andrey Kolmogorov,[3] while a table of the distribution was published by Nikolai Smirnov.[4] Recurrence relations for the distribution of the test statistic in finite samples are available.[3]
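The alternating series converges very quickly, so the Kolmogorov CDF is easy to evaluate numerically. A small sketch (illustrative; the truncation at 100 terms is far more than needed):

```python
import math

def kolmogorov_cdf(x, terms=100):
    """Pr(K <= x) = 1 - 2 * sum_{k>=1} (-1)^(k-1) * exp(-2 * k^2 * x^2)."""
    if x <= 0:
        return 0.0
    s = sum((-1) ** (k - 1) * math.exp(-2 * k * k * x * x)
            for k in range(1, terms + 1))
    return 1.0 - 2.0 * s

print(kolmogorov_cdf(1.36))  # ~0.95: 1.36 is the familiar 5% critical value
```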

單樣本Kolmogorov Goodness-of-Fit Test  

The one-sample K-S test checks whether the sample data points follow a given theoretical distribution.

We start from the null hypothesis H_0 that the sample comes from the hypothesized distribution F(x). If the theoretical distribution is continuous (only the continuous case is considered here), then under H_0

\sqrt{n}\, D_n \xrightarrow{d} K \quad (n \to \infty)

That is, as the number of sample points tends to infinity, \sqrt{n} D_n converges in distribution to the Kolmogorov distribution, regardless of the specific form of F. This result is also known as Kolmogorov's theorem.

When n is finite, this limit is not very accurate as an approximation to the exact cdf of K: even when n = 1000 the corresponding maximum error is about 0.9%; this error increases to 2.6% when n = 100 and to a totally unacceptable 7% when n = 10.

Improving accuracy via a correction:

However, the very simple expedient of replacing x by

x + \frac{1}{6\sqrt{n}} + \frac{x - 1}{4n}

in the argument of the Jacobi theta function reduces these errors to 0.003%, 0.027%, and 0.27% respectively; such accuracy would usually be considered more than adequate for all practical applications.[5]
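The effect of the correction can be checked by Monte Carlo: simulate the finite-sample distribution of \sqrt{n} D_n under a uniform null and compare the plain and corrected asymptotic approximations. An illustrative sketch (n = 10, 20 000 replications, and x = 1.0 are my choices; the simulation itself carries roughly ±0.3% error):

```python
import math
import random

def kolmogorov_cdf(x, terms=100):
    """Asymptotic Kolmogorov CDF via its alternating series."""
    if x <= 0:
        return 0.0
    s = sum((-1) ** (k - 1) * math.exp(-2 * k * k * x * x)
            for k in range(1, terms + 1))
    return 1.0 - 2.0 * s

def ks_statistic_uniform(sample):
    """D_n against the U(0,1) null, where F(x) = x."""
    xs = sorted(sample)
    n = len(xs)
    return max(max((i + 1) / n - x, x - i / n) for i, x in enumerate(xs))

rng = random.Random(0)
n, reps, x = 10, 20000, 1.0
hits = sum(
    math.sqrt(n) * ks_statistic_uniform([rng.random() for _ in range(n)]) <= x
    for _ in range(reps)
)
empirical = hits / reps                   # Monte Carlo estimate of Pr(sqrt(n) * D_n <= x)
plain = kolmogorov_cdf(x)                 # raw asymptotic approximation
corrected = kolmogorov_cdf(x + 1 / (6 * math.sqrt(n)) + (x - 1) / (4 * n))
print(empirical, plain, corrected)  # the corrected value lies much closer to empirical
```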

A goodness-of-fit test, the Kolmogorov–Smirnov test, can be constructed using the critical values of the Kolmogorov distribution.

The test is asymptotically valid as n \to \infty.

At level \alpha, the null hypothesis is rejected if \sqrt{n}\, D_n > K_\alpha, where K_\alpha is given by

\Pr(K \le K_\alpha) = 1 - \alpha

The asymptotic statistical power of this test is 1.
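Putting the pieces together, the decision rule \sqrt{n} D_n > K_\alpha can be sketched end to end. The following is illustrative only (a uniform null, K_\alpha obtained by inverting the series by bisection, and the two synthetic samples are all my choices):

```python
import math
import random

def kolmogorov_cdf(x, terms=100):
    """Pr(K <= x) via the alternating series of the Kolmogorov distribution."""
    if x <= 0:
        return 0.0
    s = sum((-1) ** (k - 1) * math.exp(-2 * k * k * x * x)
            for k in range(1, terms + 1))
    return 1.0 - 2.0 * s

def k_alpha(alpha):
    """Solve Pr(K <= K_alpha) = 1 - alpha by bisection."""
    lo, hi = 0.0, 5.0
    for _ in range(100):
        mid = (lo + hi) / 2.0
        if kolmogorov_cdf(mid) < 1.0 - alpha:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

def ks_statistic(sample, cdf):
    """One-sample K-S statistic D_n = sup_x |F_n(x) - F(x)|."""
    xs = sorted(sample)
    n = len(xs)
    return max(max(abs((i + 1) / n - cdf(x)), abs(cdf(x) - i / n))
               for i, x in enumerate(xs))

random.seed(1)
n = 500
uniform_cdf = lambda x: min(max(x, 0.0), 1.0)
good = [random.random() for _ in range(n)]      # truly U(0,1)
bad = [random.random() ** 2 for _ in range(n)]  # CDF sqrt(x), not U(0,1)
crit = k_alpha(0.05)                            # ~1.358
for name, data in (("good", good), ("bad", bad)):
    d = ks_statistic(data, uniform_cdf)
    verdict = "reject H0" if math.sqrt(n) * d > crit else "fail to reject H0"
    print(name, verdict)  # "bad" is rejected; "good" typically is not
```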

Test with estimated parameters

If either the form or the parameters of F(x) are determined from the data X_i, the critical values determined in this way are invalid (per Wikipedia).

In such cases, Monte Carlo or other methods may be required, although tables have been prepared for some cases.

 

As reference [3] notes, the Kolmogorov test applies only when the hypothesized distribution function is completely specified, i.e., when it contains no parameters that must be estimated from the sample. Otherwise, the test becomes conservative.

 

Details of the required modifications to the test statistic and of the critical values for the normal distribution and the exponential distribution have been published,[10] and later publications also include the Gumbel distribution.[11] The Lilliefors test represents a special case of this for the normal distribution. A logarithm transformation may help to overcome cases where the data do not seem to fit the assumption that they came from the normal distribution.

When parameters are estimated, the question arises which estimation method should be used. Usually this would be maximum likelihood, but, e.g., for the normal distribution the MLE of sigma has a large bias. Using a moment fit or KS minimization instead has a large impact on the critical values, and also some impact on test power. If we must decide, for Student-t data with df = 2, whether the data could be normal, then an ML estimate based on H_0 (data are normal, so the sample standard deviation sets the scale) would give a much larger KS distance than a fit with minimum KS. In this case we should reject H_0, which is often what happens with MLE, because the sample standard deviation can be very large for t(2) data; with KS minimization, however, the resulting KS distance may still be too low to reject H_0. In the Student-t case, a modified KS test with a KS estimate instead of the MLE actually makes the KS test slightly worse; in other cases, however, such a modified KS test leads to slightly better test power.
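The inflation of the critical value can be seen directly with a Monte Carlo calibration in the spirit of the Lilliefors test. The sketch below is illustrative (n = 50, 2 000 null replications, and the 5% level are all my choices): estimating mu and sigma from the sample shrinks the null distribution of D_n, so the calibrated critical value falls well below the fully-specified one.

```python
import math
import random

def normal_cdf(x, mu=0.0, sigma=1.0):
    """CDF of N(mu, sigma^2) via the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def ks_statistic(sample, cdf):
    """One-sample K-S statistic D_n = sup_x |F_n(x) - F(x)|."""
    xs = sorted(sample)
    n = len(xs)
    return max(max(abs((i + 1) / n - cdf(x)), abs(cdf(x) - i / n))
               for i, x in enumerate(xs))

def fitted_normal_ks(sample):
    """K-S distance against a normal whose mu and sigma are estimated from the sample."""
    n = len(sample)
    mu = sum(sample) / n
    sigma = math.sqrt(sum((v - mu) ** 2 for v in sample) / (n - 1))
    return ks_statistic(sample, lambda x: normal_cdf(x, mu, sigma))

rng = random.Random(7)
n, reps = 50, 2000
null_stats = sorted(
    fitted_normal_ks([rng.gauss(0.0, 1.0) for _ in range(n)]) for _ in range(reps)
)
mc_crit = null_stats[int(0.95 * reps)]  # calibrated 5% critical value
naive_crit = 1.358 / math.sqrt(n)       # valid only for a fully specified F
print(mc_crit, naive_crit)  # mc_crit is noticeably smaller than naive_crit
```

Using the naive critical value with estimated parameters would therefore make the test markedly conservative, as the text above notes.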

Discrete and mixed null distribution

Two-sample Kolmogorov–Smirnov test (the Smirnov test)

• Two samples: do they come from the same population with a specific (underlying) distribution, or do the two datasets differ significantly?

The Kolmogorov–Smirnov test can also be used to test whether two underlying one-dimensional probability distributions differ.

The Smirnov statistic is

D_{n,m} = \sup_x |F_{1,n}(x) - F_{2,m}(x)|

where F_{1,n} and F_{2,m} are the empirical distribution functions of the first and the second sample respectively, and \sup is the supremum function.

For large samples, the null hypothesis is rejected at level \alpha if

D_{n,m} > c(\alpha)\sqrt{\frac{n + m}{nm}}

where n and m are the sizes of the first and second samples respectively. For the most common levels of \alpha, the value of c(\alpha) is given in the table below:

\alpha    | 0.10 | 0.05 | 0.025 | 0.01 | 0.005 | 0.001
c(\alpha) | 1.22 | 1.36 | 1.48  | 1.63 | 1.73  | 1.95

In general, one may take

c(\alpha) = \sqrt{-\tfrac{1}{2}\ln\tfrac{\alpha}{2}}

Note that the two-sample test checks whether the two data samples come from the same distribution; it does not specify what that common distribution is (e.g., whether it is normal). Again, tables of critical values have been published.
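A self-contained two-sample sketch (illustrative; the normal samples and the 0.7 mean shift are my choices) that evaluates both ECDFs at the pooled sample points, where the supremum is attained, and applies the large-sample rejection rule:

```python
import bisect
import math
import random

def ks_2samp_statistic(x, y):
    """D_{n,m} = sup_t |F_{1,n}(t) - F_{2,m}(t)|, attained at a pooled data point."""
    xs, ys = sorted(x), sorted(y)
    def ecdf(sorted_s, t):
        # fraction of points <= t, via binary search
        return bisect.bisect_right(sorted_s, t) / len(sorted_s)
    return max(abs(ecdf(xs, t) - ecdf(ys, t)) for t in xs + ys)

def c_alpha(alpha):
    """Common large-sample approximation c(alpha) = sqrt(-ln(alpha/2) / 2)."""
    return math.sqrt(-0.5 * math.log(alpha / 2.0))

rng = random.Random(3)
a = [rng.gauss(0.0, 1.0) for _ in range(300)]
b = [rng.gauss(0.7, 1.0) for _ in range(250)]  # mean shifted by 0.7
n, m = len(a), len(b)
d = ks_2samp_statistic(a, b)
threshold = c_alpha(0.05) * math.sqrt((n + m) / (n * m))
print(d, threshold, d > threshold)  # the shift is detected: d exceeds the threshold
```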

A shortcoming of the Kolmogorov–Smirnov test is that it is not very powerful, because it is devised to be sensitive to all possible types of differences between two distribution functions. [19] and [20] showed evidence that the Cucconi test, originally proposed for simultaneously comparing location and scale, is much more powerful than the Kolmogorov–Smirnov test when comparing two distribution functions.

The Kolmogorov–Smirnov statistic in more than one dimension

 

 

References:

[1] https://blog.csdn.net/qq_41679006/article/details/80977113

[2] https://en.wikipedia.org/wiki/Kolmogorov%E2%80%93Smirnov_test

[3] Conover, W. J. (1980). Practical Nonparametric Statistics.

 

