SciPy - 正態性與 KS 檢驗與 boxcox 正態化

本文轉載自查看原文 2020-04-11 14:28 1101 Python 數據分析模塊

假設檢驗的基本思想

若對總體的某個假設是真實的，那么不利於或者不能支持這一假設的事件A在一次試驗中是幾乎不可能發生的；如果事件A真的發生了，則有理由懷疑這一假設的真實性，從而拒絕該假設；

假設檢驗實質上是對原假設是否正確進行檢驗，因此檢驗過程中要使原假設得到維護，使之不輕易被拒絕；否定原假設必須有充分的理由。同時，當原假設被接受時，也只能認為否定該假設的根據不充分，而不是認為它絕對正確

ks 檢驗

ks 檢驗分為單樣本和兩樣本檢驗；

單樣本檢驗用於檢驗一個數據的觀測分布是否符合某種理論分布；

兩樣本檢驗用於檢驗兩個樣本是否屬於同一分布，ks 檢驗是兩樣本檢驗最有用且最常用的非參數方法之一；

ks 檢驗不僅能檢驗正態分布，還能檢驗其他分布；

def kstest(rvs, cdf, args=(), N=20, alternative='two-sided', mode='approx'):
    """
    Parameters
    ----------
    rvs : str, array_like, or callable
        If a string, it should be the name of a distribution in `scipy.stats`.
        If an array, it should be a 1-D array of observations of random
        variables.
        If a callable, it should be a function to generate random variables;
        it is required to have a keyword argument `size`.
    Returns
    -------
    statistic : float
        KS test statistic, either D, D+ or D-.
    pvalue :  float
        One-tailed or two-tailed p-value.
    """

rvs：待檢驗的數據,1-D 數組

cdf：待檢驗的分布，如 norm 正態檢驗

alternative：默認為雙尾檢驗，可以設置為‘less’或‘greater’作單尾檢驗

statistic：統計結果

pvalue：p 值越大，越支持原假設，一般會和指定顯著水平 5% 比較，大於 5%,支持原假設；【支持原假設無法否定原假設，不代表原假設絕對正確】

單樣本檢驗示例

######### 非正態 #########
x1 = np.linspace(-15, 15, 9)
print(kstest(x1, 'norm'))
# KstestResult(statistic=0.4443560271592436, pvalue=0.03885014270517116)  <0.05

######### 正態 #########
np.random.seed(1000)
x2 = np.random.randn(100)
print(kstest(x2, 'norm'))
# KstestResult(statistic=0.06029093862878099, pvalue=0.8604070909241421)   >0.05

#### 同一個數據(服從正態分布)，不同的參數有截然不同的檢測結果，說明 ks 檢測 正態性 比較麻煩
x3 = np.random.normal(100, 0.01, 1000)
print(kstest(x3, 'norm'))
# KstestResult(statistic=1.0, pvalue=0.0)           <0.05
print(kstest(x3, 'norm', alternative='greater'))
# KstestResult(statistic=0.0, pvalue=1.0)           >0.05

兩樣本檢驗示例

import numpy as np
from scipy.stats import ks_2samp, kstest

beta=np.random.beta(7, 5, 1000)
norm=np.random.normal(0, 1, 1000)
print(ks_2samp(beta, norm))
# Ks_2sampResult(statistic=0.578, pvalue=7.844864182954565e-155)

# p-value比指定的顯著水平（假設為5%）小，則我們完全可以拒絕假設：beta和norm不服從同一分布

shapiro 正態檢驗

專門做正態檢驗的模塊

shapiro 不適合做樣本數＞5000的正態性檢驗，檢驗結果的P值可能不准確

def shapiro(x):
    """
    Parameters
    ----------
    x : array_like
        Array of sample data.

    Returns
    -------
    W : float
        The test statistic.
    p-value : float
        The p-value for the hypothesis test.
    """

示例

######### 非正態 #########
x = np.random.rand(100)
print(stats.shapiro(x))
# (0.9387611746788025, 0.00016213695926126093)   <0.05

######### 正態 #########
x1 = stats.norm.rvs(loc=5, scale=3, size=100)
print(stats.shapiro(x1))
# (0.9818177223205566, 0.18372832238674164)     >0.05

np.random.seed(1000)
x2 = np.random.randn(100)
print(stats.shapiro(x2))
# (0.9930729269981384, 0.89242023229599)        >0.05

#### 同樣的數據，ks 檢測 非正態，shapiro 檢測 正態
x3 = np.random.normal(100, 0.01, 1000)
print(stats.shapiro(x3))
# (0.9976787567138672, 0.17260634899139404)     >0.05

normaltest

也是專門做正態檢測的模塊

def normaltest(a, axis=0, nan_policy='propagate'):
    """
    Parameters
    ----------
    a : array_like
        The array containing the sample to be tested.
    Returns
    -------
    statistic : float or array
    pvalue : float or array
       A 2-sided chi squared probability for the hypothesis test.
    """

示例

######### 非正態 #########
x = np.random.rand(100)
print(normaltest(x))
# NormaltestResult(statistic=17.409796250892015, pvalue=0.00016577184786199797) <0.05

######### 正態 #########
x1 = np.random.randn(10, 20)
print(normaltest(x1, axis=None))
#NormaltestResult(statistic=1.8245865103063612, pvalue=0.4016021909152733)  >0.05

np.random.seed(1000)
x2 = np.random.randn(100)
print(normaltest(x2))
# NormaltestResult(statistic=1.2417269613653144, pvalue=0.5374801334521462)       >0.05


#### 同樣的數據，ks 檢測 非正態，shapiro 檢測 正態, normaltest 檢測 正態
x3 = np.random.normal(100, 0.01, 1000)
print(normaltest(x3))
# NormaltestResult(statistic=3.3633106484154944, pvalue=0.18606572188584633)    >0.05

anderson

增強版的 ks 檢測

def anderson(x, dist='norm'):
    """
    Anderson-Darling test for data coming from a particular distribution.
    x : array_like
        Array of sample data.
    dist : {'norm','expon','logistic','gumbel','gumbel_l', gumbel_r',
        'extreme1'}, optional
        the type of distribution to test against.  The default is 'norm'
        and 'extreme1', 'gumbel_l' and 'gumbel' are synonyms.

    Returns
    -------
    statistic : float
        The Anderson-Darling test statistic.
    critical_values : list
        The critical values for this distribution.
    significance_level : list
        The significance levels for the corresponding critical values
        in percents.  The function returns critical values for a
        differing set of significance levels depending on the
        distribution that is being tested against.
    """

anderson 有三個輸出值，第一個為統計數，第二個為評判值，第三個為顯著性水平，

評判值與顯著性水平對應，

對於正態性檢驗，顯著性水平為：15%, 10%, 5%, 2.5%, 1%

# If the returned statistic is larger than these critical values then for the corresponding significance level,
# the null hypothesis that the data come from the chosen distribution can be rejected.

示例

######### 非正態 #########
x1 = np.random.rand(100)
print(anderson(x1))
# AndersonResult(statistic=2.297910919361925, critical_values=array([0.555, 0.632, 0.759, 0.885, 1.053]), significance_level=array([15. , 10. ,  5. ,  2.5,  1. ]))
### 統計數 大於 評判值，拒絕假設，非正態

######### 正態 #########
x2 = np.random.randn(100)
print(anderson(x2))
# AndersonResult(statistic=0.35148092757619054, critical_values=array([0.555, 0.632, 0.759, 0.885, 1.053]), significance_level=array([15. , 10. ,  5. ,  2.5,  1. ]))
### 統計數 小於 評判值，正態

#### 同樣的數據，ks 檢測 非正態，shapiro 檢測 正態
np.random.seed(1000)
x3 = np.random.normal(100, 0.01, 1000)
print(anderson(x3))
# AndersonResult(statistic=0.201867508676969, critical_values=array([0.574, 0.653, 0.784, 0.914, 1.088]), significance_level=array([15. , 10. ,  5. ,  2.5,  1. ]))
### 統計數 小於 評判值，正態

Q-Q 圖

QQ 圖功能很多，本文僅介紹如何使用它進行正態性檢測

def probplot(x, sparams=(), dist='norm', fit=True, plot=None, rvalue=False)

示例

import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as st

fig, axs = plt.subplots(3, 1)

######### 非正態 #########
x1 = np.random.randint(1, 100, 100)
st.probplot(x1, plot=axs[0])

######### 正態 #########
np.random.seed(1000)
x2 = np.random.randn(100)
st.probplot(x2, plot=axs[1])

#### ks 檢驗非正態，看看 qq 圖
x3 = np.random.normal(100, 0.01, 1000)
st.probplot(x3, plot=axs[2])

plt.show()

輸出：圖 1 為非正態，圖 2 3 為正態

紅色線條表示正態分布，藍色線條表示樣本數據，藍色越接近紅色參考線，說明越符合預期分布（這是是正態分布）

正態化處理

在做回歸分析時，經常需要把非正態數據轉換成正態數據，方法如下

在一些情況下（P值<0.003）上述方法很難實現正態化處理，此時可使用 BOX-COX 轉換，但是當P值>0.003時兩種方法均可，優先考慮普通的平方變換

boxcox 示例

from scipy.stats import kstest, shapiro
from scipy.stats import boxcox
import matplotlib.pyplot as plt
from scipy import stats

fig, axs = plt.subplots(2, 1)
np.random.seed(12345)

x = np.random.normal(100, 20, 200) + np.random.normal(40, 5, 200)
x = np.log2(x)

print(shapiro(x))
# (0.9857848882675171, 0.04184681549668312)     ### 非正態
stats.probplot(x, plot=axs[0])

x2, _ = boxcox(x)
print(shapiro(x2))
# (0.9948143362998962, 0.7225888967514038)      ### 正態
stats.probplot(x2, plot=axs[1])
plt.show()