Tests for Normality (Python implementation)


 
 
Contents:
1. Shapiro-Wilk test
   - for small sample sizes (below 50)
2. normaltest
   - also for sample sizes below 50; it uses the D'Agostino–Pearson omnibus test and requires more than 20 observations per sample
3. Lilliefors test
   - for intermediate sample numbers, the Lilliefors-test is good since the original Kolmogorov-Smirnov-test is unreliable when mean and std of the distribution are not known.
4. Kolmogorov-Smirnov (K-S) test
   - the Kolmogorov-Smirnov test should only be used for large sample numbers (>300)
 
 
 

Latest version of the code

 
# -*- coding: utf-8 -*-
'''
Author: Toby
QQ: 231469242, all rights reserved, no commercial use
WeChat public account: pythonEducation
'''
import scipy.stats as stats
# additional packages
from statsmodels.stats.diagnostic import lilliefors

group1 = [2, 3, 7, 2, 6]
group2 = [10, 8, 7, 5, 10]
group3 = [10, 13, 14, 13, 15]
list_groups = [group1, group2, group3]
list_total = group1 + group2 + group3


# normality test: pick the method according to the sample size
def check_normality(testData):
    # for 20 < n < 50, use normaltest to check normality
    if 20 < len(testData) < 50:
        p_value = stats.normaltest(testData)[1]
        if p_value < 0.05:
            print("use normaltest")
            print("data are not normal distributed")
            return False
        else:
            print("use normaltest")
            print("data are normal distributed")
            return True

    # for n < 50 (reached here only when n <= 20), use the Shapiro-Wilk test
    if len(testData) < 50:
        p_value = stats.shapiro(testData)[1]
        if p_value < 0.05:
            print("use shapiro:")
            print("data are not normal distributed")
            return False
        else:
            print("use shapiro:")
            print("data are normal distributed")
            return True

    # for 50 <= n <= 300, use the Lilliefors test
    if 50 <= len(testData) <= 300:
        p_value = lilliefors(testData)[1]
        if p_value < 0.05:
            print("use lilliefors:")
            print("data are not normal distributed")
            return False
        else:
            print("use lilliefors:")
            print("data are normal distributed")
            return True

    # for n > 300, use the Kolmogorov-Smirnov test
    if len(testData) > 300:
        p_value = stats.kstest(testData, 'norm')[1]
        if p_value < 0.05:
            print("use kstest:")
            print("data are not normal distributed")
            return False
        else:
            print("use kstest:")
            print("data are normal distributed")
            return True


# run the normality check on each sample group
def NormalTest(list_groups):
    for group in list_groups:
        # normality check of the current group
        status = check_normality(group)
        if status == False:
            return False


# check all sample groups for normality
NormalTest(list_groups)
The conclusions drawn from P-P plots and Q-Q plots are very similar: if the data follow a normal distribution, the plotted points cling closely to the line y = x.

In all three cases the results are similar: if the two distributions being compared
are similar, the points will approximately lie on the line y = x. If the distributions
are linearly related, the points will approximately lie on a line, but not necessarily
on the line y = x (Fig. 7.1).
In Python, a probability plot can be generated with the command
stats.probplot(data, plot=plt)

 

https://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.stats.probplot.html

 
a) Probability-Plots

Used to visually assess a distribution by plotting quantiles against each other to compare probability distributions. The sample quantiles come from your raw sample data; the theoretical quantiles come from the reference (e.g., normal) distribution.

In statistics different tools are available for the visual assessment of distributions.
A number of graphical methods exist for comparing two probability distributions by plotting their quantiles, or closely related parameters, against each other:
 
 
# -*- coding: utf-8 -*-
import numpy as np
import pylab
import scipy.stats as stats

# 100 random draws from a normal distribution with mean 20 and std 5
measurements = np.random.normal(loc=20, scale=5, size=100)
# probability plot against the normal distribution
stats.probplot(measurements, dist="norm", plot=pylab)
pylab.show()

 

Fig. 7.1: Probability plot, to check a sample for normality.

Since the 100 points were randomly generated from a normal distribution, we test their normality. The probability plot shows the 100 points falling close to the line y = x, so the data exhibit good normality.
Q-Q plot (quantile-quantile plot)
 

http://baike.baidu.com/link?url=o9Z7vr6VdvGAtTRO3RYxQbVu56U_XDaSdibPeVcidMJQ7B6LcAUBHcIro4tLf5BSI5Pu-59W4SPNZ-zRFJ8_FgL3dxJLaUdY0JiB2xUmqie

 

A Q-Q plot is used to visually check whether a set of data comes from a given distribution, or whether two data sets come from the same (family of) distribution(s). In teaching and in software it is most often used to check whether data come from a normal distribution.
# -*- coding: utf-8 -*-
import numpy as np
import statsmodels.api as sm
import pylab

# 1000 random draws from the standard normal distribution
test = np.random.normal(0, 1, 1000)

# Q-Q plot against the normal distribution, with the 45-degree reference line
sm.qqplot(test, line='45')
pylab.show()

The Q-Q plot shows the 1000 points falling close to the line y = x, so the data exhibit good normality.

 
 
To check whether generated chi-square data (shown in the original article's figure) follow a normal distribution: in the probability plot many points do not fall near the line y = x, so normality is poor; R² is only 0.796.
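As a minimal sketch of that check (the article's original chi-square data are not shown, so the degrees of freedom and seed below are my assumptions): stats.probplot returns the fit's correlation coefficient r, and r**2 is the R² quoted above.

# Sketch: probability plot for chi-square data; parameters are assumed
import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt

np.random.seed(0)                    # assumed seed, for reproducibility
x = stats.chi2.rvs(df=2, size=100)   # assumed df=2, size=100

(osm, osr), (slope, intercept, r) = stats.probplot(x, dist="norm", plot=plt)
print("R^2 =", r**2)                 # well below 1 for skewed data -> poor normality
plt.show()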
# -*- coding: utf-8 -*-
# author: 231469242@qq.com
# WeChat public account: pythonEducation
from scipy import stats
import matplotlib.pyplot as plt
import numpy as np

nsample = 100
np.random.seed(7654321)

# A t distribution with small degrees of freedom:
# df=3 with 100 samples; with such a small df, normality is poor
ax1 = plt.subplot(221)
x = stats.t.rvs(3, size=nsample)
res = stats.probplot(x, plot=plt)

# A t distribution with larger degrees of freedom:
# with a large df the data are close to normally distributed
ax2 = plt.subplot(222)
x2 = stats.t.rvs(25, size=nsample)
res1 = stats.probplot(x2, plot=plt)

# A mixture of two normal distributions with broadcasting:
ax3 = plt.subplot(223)
x3 = stats.norm.rvs(loc=[0, 5], scale=[1, 1.5], size=(nsample // 2, 2)).ravel()
res = stats.probplot(x3, plot=plt)

# A standard normal distribution: the probability plot shows good normality
ax4 = plt.subplot(224)
x4 = stats.norm.rvs(loc=0, scale=1, size=nsample)
res = stats.probplot(x4, plot=plt)

# Produce a new figure with a loggamma distribution,
# using the dist and sparams keywords:
fig = plt.figure()
ax = fig.add_subplot(111)
x = stats.loggamma.rvs(c=2.5, size=500)
stats.probplot(x, dist=stats.loggamma, sparams=(2.5,), plot=ax)
ax.set_title("Probplot for loggamma dist with shape parameter 2.5")
plt.show()
Omnibus tests
 
Download the code from GitHub:
https://github.com/thomas-haslwanter/statsintro_python/tree/master/ISP/Code_Quantlets/07_CheckNormality_CalcSamplesize/checkNormality
 
 
In tests for normality, different challenges can arise: sometimes only a few samples
may be available, while other times one may have many data points, but some extremely
outlying values. To cope with these different situations, different tests for normality
have been developed. These tests to evaluate normality (or similarity to some
specific distribution) can be broadly divided into two categories:
1. Tests based on comparison ("best fit") with a given distribution, often specified
in terms of its CDF. Examples are the Kolmogorov–Smirnov test, the Lilliefors
test, the Anderson–Darling test, the Cramer–von Mises criterion, as well as the
Shapiro–Wilk and Shapiro–Francia tests.
2. Tests based on descriptive statistics of the sample. Examples are the skewness
test, the kurtosis test, the D'Agostino–Pearson omnibus test, or the Jarque–Bera
test.
For example, the Lilliefors test, which is based on the Kolmogorov–Smirnov
test, quantifies a distance between the empirical distribution function of the sample
and the cumulative distribution function of the reference distribution (Fig. 7.3),
or between the empirical distribution functions of two samples. (The original
Kolmogorov–Smirnov test should not be used if the number of samples is below ca. 300.)
The Shapiro–Wilk W test, which depends on the covariance matrix between the
order statistics of the observations, can also be used with fewer than 50 samples, and has been
recommended by Altman (1999) and by Ghasemi and Zahediasl (2012).
The Python command stats.normaltest(x) uses the D'Agostino–Pearson
omnibus test. This test combines a skewness and kurtosis test to produce a single,
global "omnibus" statistic.
 
# -*- coding: utf-8 -*-
# bug report: 231469242@qq.com
# WeChat public account: pythonEducation
'''
Graphical and quantitative check, if a given distribution is normal.
- For small sample-numbers (<50), you should use the Shapiro-Wilk test or the "normaltest"
- for intermediate sample numbers, the Lilliefors-test is good since the original
  Kolmogorov-Smirnov-test is unreliable when mean and std of the distribution are not known.
- the Kolmogorov-Smirnov test should only be used for large sample numbers (>300)
'''
# Copyright(c) 2015, Thomas Haslwanter. All rights reserved, under the CC BY-SA 4.0 International License

# Import standard packages
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats
import pandas as pd

# additional packages
from statsmodels.stats.diagnostic import lilliefors


def check_normality():
    '''Check if the distribution is normal.'''
    # Set the parameters
    numData = 1000
    myMean = 0
    mySD = 3

    # To get reproducible values, I provide a seed value
    np.random.seed(1234)

    # Generate and show random data
    data = stats.norm.rvs(myMean, mySD, size=numData)
    fewData = data[:100]
    plt.hist(data)
    plt.show()

    # --- >>> START stats <<< ---
    # Graphical test: if the data lie on a line, they are pretty much
    # normally distributed
    _ = stats.probplot(data, plot=plt)
    plt.show()

    pVals = pd.Series(dtype=float)
    pFewVals = pd.Series(dtype=float)

    # The scipy normaltest is based on D'Agostino and Pearson's test that
    # combines skew and kurtosis to produce an omnibus test of normality.
    _, pVals['Omnibus'] = stats.normaltest(data)
    _, pFewVals['Omnibus'] = stats.normaltest(fewData)

    # Shapiro-Wilk test
    _, pVals['Shapiro-Wilk'] = stats.shapiro(data)
    _, pFewVals['Shapiro-Wilk'] = stats.shapiro(fewData)

    # Or you can check for normality with the Lilliefors test
    _, pVals['Lilliefors'] = lilliefors(data)
    _, pFewVals['Lilliefors'] = lilliefors(fewData)

    # Alternatively with the original Kolmogorov-Smirnov test
    _, pVals['Kolmogorov-Smirnov'] = stats.kstest(
        (data - np.mean(data)) / np.std(data, ddof=1), 'norm')
    _, pFewVals['Kolmogorov-Smirnov'] = stats.kstest(
        (fewData - np.mean(fewData)) / np.std(fewData, ddof=1), 'norm')

    print('p-values for all {0} data points: ----------------'.format(len(data)))
    print(pVals)
    print('p-values for the first 100 data points: ----------------')
    print(pFewVals)

    if pVals['Omnibus'] > 0.05:
        print('Data are normally distributed')
    # --- >>> STOP stats <<< ---

    return pVals['Kolmogorov-Smirnov']


if __name__ == '__main__':
    p = check_normality()
    print(p)

normaltest

 
https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.normaltest.html
 
http://stackoverflow.com/questions/42036907/scipy-stats-normaltest-to-test-the-normality-of-numpy-random-normal
 
 
scipy.stats.normaltest uses the D'Agostino–Pearson omnibus test. It returns (statistic, p-value), where the statistic is the sum of the squared z-scores of the skewness test and the kurtosis test.

The sample size must be at least 20, otherwise you get:
UserWarning: kurtosistest only valid for n>=20
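A small verification sketch (my own, not from the article): the normaltest statistic equals the squared skewtest z-score plus the squared kurtosistest z-score.

# Verify: normaltest statistic = skewtest z^2 + kurtosistest z^2
import numpy as np
from scipy import stats

np.random.seed(42)
x = np.random.normal(size=500)

s, _ = stats.skewtest(x)       # z-score of the skewness test
k, _ = stats.kurtosistest(x)   # z-score of the kurtosis test
stat, p = stats.normaltest(x)
print(stat, s**2 + k**2)       # the two statistics agree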
 

If my understanding is correct, it indicates how likely it is that the input data follow a normal distribution. I had expected that all the p-values generated by the above code would be very close to 1.

Your understanding is incorrect, I'm afraid. The p-value is the probability of getting a result that is at least as extreme as the observation under the null hypothesis (i.e., under the assumption that the data actually are normally distributed). It does not need to be close to 1. Usually, p-values greater than 0.05 are considered not significant, which means that normality has not been disproved by the test.

As pointed out by Victor Chubukov, you can get low p-values simply by chance, even if the data really are normally distributed.

Statistical hypothesis testing is rather complex and can appear somewhat counterintuitive. If you need to know more details, Cross Validated is the place to get more detailed answers.
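To illustrate that last point with a quick simulation (my own sketch): even when the data really are normal, about 5% of runs yield p < 0.05 purely by chance.

# Count how often normaltest "rejects" truly normal data at alpha = 0.05
import numpy as np
from scipy import stats

np.random.seed(0)
n_runs = 1000
false_alarms = sum(stats.normaltest(np.random.normal(size=200))[1] < 0.05
                   for _ in range(n_runs))
print(false_alarms / n_runs)   # close to the nominal 0.05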

 
 
# -*- coding: utf-8 -*-
'''
The sample size must be at least 20, otherwise:
UserWarning: kurtosistest only valid for n>=20
'''
import numpy
from scipy import stats

# 1000 draws from the standard normal distribution
d = numpy.random.normal(size=1000)
n = stats.normaltest(d)
print(n)

 

 
 
 
# -*- coding: utf-8 -*-
import numpy
from scipy import stats

# run normaltest on ten independent samples of 50,000 normal draws
for i in range(10):
    d = numpy.random.normal(size=50000)
    n = stats.normaltest(d)
    print(n)

H0: the sample follows a normal distribution.

All p-values are greater than 0.05, so H0 is not rejected.
 
 
 
 
 
 
Shapiro-Wilk
https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.shapiro.html

For sample sizes below 50, use the Shapiro-Wilk algorithm to check normality:
- For small sample-numbers (<50), you should use the Shapiro-Wilk test or the "normaltest"




# -*- coding: utf-8 -*-
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

# 49 draws from a normal distribution with mean 5 and std 3
x = stats.norm.rvs(loc=5, scale=3, size=49)
print(stats.shapiro(x))
'''
The p-value is greater than 0.05, so H0 holds: the data are normally distributed.
Out[9]: (0.9735164046287537, 0.3322194814682007)
'''
plt.hist(x)
plt.show()
           

 







Lilliefors test

Suitable for intermediate sample sizes:
- for intermediate sample numbers, the Lilliefors-test is good since the original Kolmogorov-Smirnov-test is unreliable when mean and std of the distribution are not known.

In statistics, the Lilliefors test, named after Hubert Lilliefors, professor of statistics at George Washington University, is a normality test based on the Kolmogorov–Smirnov test. It is used to test the null hypothesis that data come from a normally distributed population, when the null hypothesis does not specify which normal distribution; i.e., it does not specify the expected value and variance of the distribution.

statsmodels.stats.diagnostic.lilliefors

http://www.statsmodels.org/stable/generated/statsmodels.stats.diagnostic.lilliefors.html

statsmodels.stats.diagnostic.lilliefors(x, pvalmethod='approx')

Lilliefors test for normality: a Kolmogorov-Smirnov test with estimated mean and variance.

Parameters:

x : array_like, 1d
    The data series (sample).

pvalmethod : 'approx' or 'table'
    'approx' uses the approximation formula of Dalal and Wilkinson, valid for p-values < 0.1. If the p-value is larger than 0.1, the result of 'table' is returned. 'table' uses the table from Dalal and Wilkinson, which is available for p-values between 0.001 and 0.2, and the formula of Lilliefors for large n (n > 900). Values in the table are linearly interpolated; values outside the range are returned as bounds, 0.2 for large and 0.001 for small p-values.

Returns:

ksstat : float
    Kolmogorov-Smirnov test statistic with estimated mean and variance.

pvalue : float
    If the p-value is lower than some threshold, e.g. 0.05, then we can reject the null hypothesis that the sample comes from a normal distribution.

Notes:

The reported power to distinguish normal from some other distributions is lower than with the Anderson-Darling test. Could be vectorized.
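A short usage sketch of the two pvalmethod options described above (the example data are my own):

# Compare the two p-value methods of the Lilliefors test
import numpy as np
from statsmodels.stats.diagnostic import lilliefors

np.random.seed(5)
x = np.random.normal(size=100)
print(lilliefors(x, pvalmethod='table'))    # table from Dalal and Wilkinson
print(lilliefors(x, pvalmethod='approx'))   # approximation formula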

 
# -*- coding: utf-8 -*-
'''
Lilliefors test on 200 draws from a standard normal distribution.
'''
import numpy
from statsmodels.stats.diagnostic import lilliefors

d = numpy.random.normal(size=200)
n = lilliefors(d)
print(n)
'''
(0.047470987201221337, 0.3052490552871156)
'''
 
         

 






Kolmogorov-Smirnov (K-S) test

https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.kstest.html#scipy.stats.kstest

- the Kolmogorov-Smirnov test should only be used for large sample numbers (>300)



# -*- coding: utf-8 -*-
'''
One-sample Kolmogorov-Smirnov test on 1000 draws
from a standard normal distribution.
'''
import numpy
from scipy import stats

d = numpy.random.normal(size=1000)
n = stats.kstest(d, 'norm')
print(n)
'''
KstestResult(statistic=0.028620435047503723, pvalue=0.38131540630243177)
'''
 
         

 




http://jingyan.baidu.com/article/86112f135cf84c27379787cb.html

The K-S test is named after the two Soviet mathematicians Kolmogorov and Smirnov; it is a goodness-of-fit test. By analyzing the difference between two distributions, the K-S test judges whether the observed sample comes from a population with the specified distribution.

  • Data entry

First, import the data to be analyzed into SPSS, as shown in the original screenshot.

  • Step 1

Click "Analyze", then choose "Nonparametric Tests" and, under "Legacy Dialogs", select "1-Sample K-S", as shown.

  • Step 2

Here we only test "height" and "weight", so move these two variables into the "Test Variable List", as shown.

  • Step 3

Then click "Options", select "Descriptive", and click "Continue", as shown.

  • Result analysis

Click "OK" to obtain the results; a Python sketch of the same check follows below.

Since the two-sided significance values for both height and weight are less than 0.10, we reject the null hypothesis, i.e., we conclude that the junior high school students' heights and weights are not normally distributed.
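For readers without SPSS, here is a rough Python analogue of that 1-sample K-S check (the article's height/weight data are not provided, so the data below are hypothetical):

# Hypothetical stand-ins for the article's height/weight columns
import numpy as np
from scipy import stats

np.random.seed(1)
height = 155 + 2 * stats.chi2.rvs(df=3, size=60)   # hypothetical, right-skewed (cm)
weight = 40 + 3 * stats.chi2.rvs(df=4, size=60)    # hypothetical, right-skewed (kg)

for name, x in [("height", height), ("weight", weight)]:
    z = (x - x.mean()) / x.std(ddof=1)   # standardize, then compare with N(0, 1)
    stat, p = stats.kstest(z, "norm")
    print(name, "D =", round(stat, 3), "p =", round(p, 3))
# A two-sided p below 0.10 rejects normality, matching the article's conclusion.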



http://www.cnblogs.com/sddai/p/5737408.html
 

The Kolmogorov-Smirnov test (Колмогоров-Смирнов test) is based on the cumulative distribution function, and is used to test whether two empirical distributions differ, or whether one empirical distribution differs from an ideal (reference) distribution.

When doing cumulative probability statistics (as in the figure in the original article), how do you know whether groups differ significantly? One might first think of one-way ANOVA or a two-tailed test. These are in fact not appropriate; it is better to use the Kolmogorov-Smirnov test to analyze whether a variable follows a given distribution, or whether two groups differ significantly.

The principle of the Kolmogorov-Smirnov test is to find the maximum distance (D), which is why it is often simply called the D method; it suits large samples. The KS test checks if two independent distributions are similar or different, by generating cumulative probability plots for two distributions and finding the distance along the y-axis for given x values between the two curves. From all the distances calculated for each x value, the maximum distance is searched, as the sketch below shows.
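A minimal sketch (my own illustration) of that computation for the one-sample case, compared against the statistic scipy's kstest reports:

# Compute the K-S distance D by hand as the largest gap between the
# empirical CDF and the reference normal CDF
import numpy as np
from scipy import stats

np.random.seed(2)
x = np.sort(np.random.normal(size=300))
n = len(x)
cdf = stats.norm.cdf(x)
d_plus = np.max(np.arange(1, n + 1) / n - cdf)   # ECDF just after each point
d_minus = np.max(cdf - np.arange(0, n) / n)      # ECDF just before each point
D = max(d_plus, d_minus)
print(D, stats.kstest(x, 'norm').statistic)      # the two values agree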

How do you interpret the result? This maximum distance or maximum difference is then plugged into the KS probability function to calculate the probability value. The lower the probability value, the less likely it is that the two distributions are similar. Conversely, the closer the value is to 1, the more similar the two distributions are. Extreme cases: if the p-value is 1, the two data sets are essentially identical; if the p-value approaches 0, the two groups differ greatly.

There is a website where you can run this test online; you only need to enter the data: http://www.physics.csbsju.edu/stats/KS-test.n.plot_form.html

Of course, more software supports this test, such as SPSS, SAS, MiniAnalysis, and Clampfit10.

Decide from the software's output whether there is a significant difference: if D_max > D_0.05, the difference is considered significant. Rule of thumb: D_0.05 is approximately 1.36/SQRT(N) and D_0.01 is approximately 1.64/SQRT(N), where SQRT is the square root and N the sample size. The most accurate approach is still to consult a K-S table, but most software, such as Clampfit and MiniAnalysis, reports a p-value directly; compare it against alpha = 0.05 to decide whether there is a difference.
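For example, plugging a sample size of N = 100 into those rules of thumb:

# Rule-of-thumb critical values for the K-S distance D
import math

N = 100
print(1.36 / math.sqrt(N))   # approximate D at the 0.05 level: 0.136
print(1.64 / math.sqrt(N))   # approximate D at the 0.01 level: 0.164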

   


Categories:

1. Single-sample Kolmogorov-Smirnov goodness-of-fit hypothesis test.

Use the Kolmogorov-Smirnov test to analyze whether a variable follows a given distribution; the distributions that can be tested include the normal, uniform, Poisson, and exponential distributions. The MATLAB command is:

>> H = KSTEST(X,CDF,ALPHA,TAIL) % X is the sample to be tested; CDF is optional: if omitted, the test defaults to the standard normal distribution;

If CDF is given as a two-column matrix, the first column holds the possible values of x and the second column the corresponding values of the hypothesized cumulative distribution function G(x). ALPHA is the significance level (default 0.05). TAIL is the type of test (default 'unequal'); 'larger' and 'smaller' are also available.

If H = 1, the null hypothesis is rejected; if H = 0, the null hypothesis is not rejected (at the alpha level).

For example,

x = -2:1:4
x =
  -2  -1   0   1   2   3   4

[h,p,k,c] = kstest(x,[],0.05,0)
h =
   0
p =
   0.13632
k =
   0.41277
c =
   0.48342

The test fails to reject the null hypothesis that the values come from a standard normal distribution.
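The same check in Python (a sketch for comparison; scipy's stats.kstest plays the role of MATLAB's kstest here):

# One-sample K-S test of x = -2..4 against the standard normal
import numpy as np
from scipy import stats

x = np.arange(-2, 5)         # the same data as the MATLAB example
res = stats.kstest(x, 'norm')
print(res)                   # statistic about 0.413, as in the MATLAB output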

 

2. Two-sample Kolmogorov-Smirnov test

Tests whether two data vectors come from the same distribution.

>> [h,p,ks2stat] = kstest2(x1,x2,alpha,tail)

% x1 and x2 are both vectors; ALPHA is the significance level (default 0.05); TAIL is the type of test (default 'unequal').

For example,
x = -1:1:5
y = randn(20,1);
[h,p,k] = kstest2(x,y)
h =
     0
p =
    0.0774
k =
    0.5214         

 

Kolmogorov–Smirnov test (K–S test)

Translating the Wikipedia entry would be cumbersome and might distort its meaning, so here is the original explanation:

In statistics, the Kolmogorov–Smirnov test (K–S test) is a form of minimum distance estimation used as a nonparametric test of equality of one-dimensional probability distributions, used to compare a sample with a reference probability distribution (one-sample K–S test), or to compare two samples (two-sample K–S test). The Kolmogorov–Smirnov statistic quantifies a distance between the empirical distribution function of the sample and the cumulative distribution function of the reference distribution, or between the empirical distribution functions of two samples. The null distribution of this statistic is calculated under the null hypothesis that the samples are drawn from the same distribution (in the two-sample case) or that the sample is drawn from the reference distribution (in the one-sample case). In each case, the distributions considered under the null hypothesis are continuous distributions but are otherwise unrestricted.

The two-sample KS test is one of the most useful and general nonparametric methods for comparing two samples, as it is sensitive to differences in both location and shape of the empirical cumulative distribution functions of the two samples.

The Kolmogorov–Smirnov test can be modified to serve as a goodness of fit test. In the special case of testing for normality of the distribution, samples are standardized and compared with a standard normal distribution. This is equivalent to setting the mean and variance of the reference distribution equal to the sample estimates, and it is known that using the sample to modify the null hypothesis reduces the power of the test. Correcting for this bias leads to the Lilliefors test. However, even Lilliefors' modification is less powerful than the Shapiro–Wilk test or Anderson–Darling test for testing normality.[1]

 
The K-S test is simpler and more convenient than the chi-square test, but when used to assess normality the sample size should be greater than 300.
 
 
 
 
 