Analysis of Variance (ANOVA): Python implementation





Tukey and other multiple-comparison tests readily throw errors when the data structures are inconsistent, for example:

TypeError: Cannot cast array data from dtype('S11') to dtype('float64') according to the rule 'safe'
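This usually means the measurement column was read in as text (dtype 'S11') rather than as floats. A minimal sketch of the usual fix, assuming a layout like the sample.xlsx file used later in this post (numeric StressReduction values grouped by Treatment):

import pandas as pd
from statsmodels.stats.multicomp import pairwise_tukeyhsd

df = pd.read_excel('sample.xlsx')
# force the value column to float; rows that cannot be parsed become NaN and are dropped
df['StressReduction'] = pd.to_numeric(df['StressReduction'], errors='coerce')
df = df.dropna(subset=['StressReduction'])
print(pairwise_tukeyhsd(df['StressReduction'], df['Treatment']).summary())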

 

 

 

 

Origin

Analysis of variance was first introduced by the British statistician R. A. Fisher to interpret experimental data.

 

ANOVA vs. the t-test

The H0 of ANOVA is that all the sample means are equal; if H1 turns out to hold, we only know that the (sample) means are not all the same.

The null hypothesis in a one-way ANOVA is that the means of all the samples are
the same. So if a one-way ANOVA yields a significant result, we only know that
they are not the same.

When a factor has many levels and we compare them pairwise with t-tests, each individual test has a Type I error probability of α, so repeated testing inflates the overall error: by the time all comparisons are done, the probability of committing at least one Type I error is larger than α. At the same time, as the number of tests grows, so does the chance that some difference appears purely by accident. ANOVA avoids this problem: it considers all the sample data at once and decides with a single test whether the population means are all equal, which both avoids the accumulated error probability and is more efficient. A small numeric illustration follows.
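For m independent tests, each run at α = 0.05, the probability of at least one false positive (the family-wise error rate) is 1 - (1 - α)^m and grows quickly with m:

# family-wise error rate for m independent tests at alpha = 0.05
alpha = 0.05
for m in (1, 3, 6, 10):
    print(m, round(1 - (1 - alpha) ** m, 3))
# prints approximately: 1 0.05, 3 0.143, 6 0.265, 10 0.401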

 

Overall analysis procedure

 

Terminology

Analysis of variance (ANOVA): a statistical method for analysing the effect of a categorical independent variable on a numerical dependent variable.

One-way analysis of variance: ANOVA that studies the effect of a single categorical independent variable on the numerical dependent variable.

Two-way analysis of variance: ANOVA that studies the effect of two independent variables on the numerical dependent variable. It comes in two forms: one that considers only the main effects, and one that also considers the interaction effect.

Factor: the object being tested, i.e. the independent variable.

Treatment: also called a level; one of the values a factor can take.

Treatment error: the error in the observations caused by the different treatments of a factor.

Sum of squares for total (SST): the sum of squares reflecting the total error of all observations.

Treatment sum of squares (SSA): the sum of squares reflecting the treatment error, also called the between-group sum of squares.

Sum of squares of error (SSE): the sum of squares reflecting the random error, also called the within-group sum of squares.

Mean square (MS): also called the variance; a sum of squares divided by its degrees of freedom (a short numeric example follows after this list).

Main effect: the effect of a single factor, on its own, on the dependent variable.

Interaction: the additional effect on the dependent variable produced jointly by one factor together with another.

Two-factor with replication: two-way ANOVA that takes the interaction into account.

Multiple comparisons (multiple tests): used to find which treatments differ significantly, i.e. which pairs of means are unequal. There are many multiple-comparison methods; Fisher proposed the least significant difference (LSD), but LSD does not control the overall error rate exactly. If the data come from the same sample, compare the pairs with paired t-tests; if they come from different samples, use independent t-tests.

An omnibus (combined) approach can also be used.

Post-hoc analysis: analysis carried out after the experiment, looking at the data for effects that were not specified beforehand.
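A minimal numeric sketch of how these sums of squares fit together (SST = SSA + SSE, MS = SS / df), using the same three toy groups as the hand-rolled ANOVA further below:

import numpy as np

groups = [np.array([2, 3, 7, 2, 6]),
          np.array([10, 8, 7, 5, 10]),
          np.array([10, 13, 14, 13, 15])]
total = np.concatenate(groups)

sst = ((total - total.mean()) ** 2).sum()                           # total sum of squares
sse = sum(((g - g.mean()) ** 2).sum() for g in groups)              # within-group (error) sum of squares
ssa = sum(len(g) * (g.mean() - total.mean()) ** 2 for g in groups)  # between-group (treatment) sum of squares

df_a = len(groups) - 1             # treatment degrees of freedom
df_e = len(total) - len(groups)    # error degrees of freedom
msa, mse = ssa / df_a, sse / df_e  # mean squares

print(sst, ssa + sse)   # SST equals SSA + SSE (about 257.3 = 203.3 + 54.0)
print(msa / mse)        # F statistic, about 22.59 (same value as the hand-rolled script below)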

 

Basic assumptions of ANOVA

1. Normality

The population under each treatment should follow a normal distribution.

2. Homogeneity of variance

The populations must have equal variances.

3. Independence

Each sample must consist of independent observations drawn under its own treatment.

ANOVA is demanding about independence: if independence is violated, the results of the analysis are usually strongly affected.

The requirements on normality and homogeneity of variance are more relaxed: moderate departures from normality or slightly unequal variances do not affect the result much. A quick way to screen these assumptions in Python is sketched below.
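A minimal sketch of such a screening step, on the same toy groups used later in this post (Shapiro-Wilk for normality, since the groups are small, and Levene for equal variances):

import scipy.stats as stats

group1 = [2, 3, 7, 2, 6]
group2 = [10, 8, 7, 5, 10]
group3 = [10, 13, 14, 13, 15]

# Shapiro-Wilk normality test for each group (suitable for small samples)
for name, g in (('group1', group1), ('group2', group2), ('group3', group3)):
    w, p = stats.shapiro(g)
    print(name, 'Shapiro-Wilk p = %.3f' % p)

# Levene test for homogeneity of variance across the groups
w, p = stats.levene(group1, group2, group3)
print('Levene p = %.3f' % p)
# p > 0.05 in these tests gives no evidence against normality / equal variances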

 

 

 

One-way ANOVA

 

Data source

https://github.com/thomas-haslwanter/statsintro_python/tree/master/ISP/Code_Quantlets/08_TestsMeanValues/anovaOneway

# -*- coding: utf-8 -*-
'''
Name of QuantLet: ISP_anovaOneway
Published in:  An Introduction to Statistics with Python

Description: 'Analysis of Variance (ANOVA)
    - Levene test
    - ANOVA - oneway
    - Do a simple one-way ANOVA, using statsmodels
    - Show how the ANOVA can be done by hand.
    - For the comparison of two groups, a one-way ANOVA is equivalent to
      a T-test: t^2 = F'

Keywords: anova, levene test, one-way anova, t-test

See also: 'ISP_anovaTwoway, ISP_kruskalWallis,
    ISP_multipleTesting, ISP_oneGroup, ISP_twoGroups' 

Author: Thomas Haslwanter 

Submitted: October 31, 2015 

Datafile:  altman_910.txt, galton.csv

Example: anova_annotated.png

 
Analysis of Variance (ANOVA)
- Levene test
- ANOVA - oneway
- Do a simple one-way ANOVA, using statsmodels
- Show how the ANOVA can be done by hand.
- For the comparison of two groups, a one-way ANOVA is equivalent to
  a T-test: t^2 = F
'''

# Import standard packages
import numpy as np
import scipy.stats as stats
import pandas as pd

# additional packages
from statsmodels.formula.api import ols
#ANOVA table for one or more fitted linear models.
#anova_lm performs the analysis of variance for one or more fitted linear models
from statsmodels.stats.anova import anova_lm


#One-way ANOVA: check whether scipy.stats and statsmodels give the same result
def anova_oneway():
    ''' One-way ANOVA: test if results from 3 groups are equal.
    
    Twenty-two patients undergoing cardiac bypass surgery were randomized to one of three ventilation groups:
    
    Group I: Patients received 50% nitrous oxide and 50% oxygen mixture continuously for 24 h.
    Group II: Patients received a 50% nitrous oxide and 50% oxygen mixture only during the operation.
    Group III: Patients received no nitrous oxide but received 35-50% oxygen for 24 h.
    
    The data show red cell folate levels for the three groups after 24h' ventilation.
    
    '''
    
    # Get the data
    print('One-way ANOVA: -----------------')
    inFile = 'altman_910.txt'
    data = np.genfromtxt(inFile, delimiter=',')
    
    # Sort them into groups, according to column 1
    group1 = data[data[:,1]==1,0]
    group2 = data[data[:,1]==2,0]
    group3 = data[data[:,1]==3,0]
    
    # --- >>> START stats <<< ---
    # First, check if the variances are equal, with the "Levene"-test
    (W,p) = stats.levene(group1, group2, group3)
    if p<0.05:
        print(('Warning: the p-value of the Levene test is <0.05: p={0}'.format(p)))
    
    # Do the one-way ANOVA
    F_statistic, pVal = stats.f_oneway(group1, group2, group3)
    # --- >>> STOP stats <<< ---
    
    # Print the results
    print('Data from Altman 910:')
    print((F_statistic, pVal))
    if pVal < 0.05:
        print('One of the groups is significantly different.')
        
    # Elegant alternative implementation, with pandas & statsmodels
    df = pd.DataFrame(data, columns=['value', 'treatment'])    
    #without the capital C() the fit fails: C() marks 'treatment' as a categorical variable
    model = ols('value ~ C(treatment)', df).fit()
    anovaResults = anova_lm(model)
    print(anovaResults)
    
    # Check if the two results are equal. If they are, there is no output
    #decimal=3 is the number of decimal places compared; if the two values agree there is no output, otherwise an AssertionError is raised
    np.testing.assert_almost_equal(F_statistic, anovaResults['F'][0],decimal=3)
    
    return (F_statistic, pVal) # should be (3.711335988266943, 0.043589334959179327)


#----------------------------------------------------------------------
#For two groups, check that the independent t-test and the F-test agree
def show_teqf():
    """Shows the equivalence of t-test and f-test, for comparing two groups"""
    
    # Get the data
    data = pd.read_csv('galton.csv')
    
    # First, calculate the F- and the T-values, ...
    F_statistic, pVal = stats.f_oneway(data['father'], data['mother'])
    t_val, pVal_t = stats.ttest_ind(data['father'], data['mother'])
    
    # ... and show that t**2 = F
    print('\nT^2 == F: ------------------------------------------')
    print(('From the t-test we get t^2={0:5.3f}, and from the F-test F={1:5.3f}'.format(t_val**2, F_statistic)))
    
    # numeric test
    np.testing.assert_almost_equal(t_val**2, F_statistic,decimal=3)
    
    return F_statistic


# ---------------------------------------------------------------
def anova_statsmodels():
    ''' do the ANOVA with a function '''
    
    # Get the data
    data = pd.read_csv('galton.csv')
    #sex is a categorical variable; C(sex) treats it as such in the formula (C is short for category)
    anova_results = anova_lm(ols('height~C(sex)', data).fit())
    print('\nANOVA with "statsmodels" ------------------------------')
    print(anova_results)
    
    return anova_results['F'][0]


#----------------------------------------------------------------------
#ANOVA computed by hand
def anova_byHand():
    """ Calculate the ANOVA by hand. While you would normally not do that, this function shows
    how the underlying values can be calculated.
    """

     # Get the data
    inFile = 'altman_910.txt'
    data = np.genfromtxt(inFile, delimiter=',')

    # Convert them to pandas-forman and group them by their group value
    df = pd.DataFrame(data, columns=['values', 'group'])
    groups = df.groupby('group')

    # The "total sum-square" is the squared deviation from the mean
    ss_total = np.sum((df['values']-df['values'].mean())**2)
    
    # Calculate ss_treatment and  ss_error
    (ss_treatments, ss_error) = (0, 0)
    for val, group in groups:
        ss_error += sum((group['values'] - group['values'].mean())**2)
        ss_treatments += len(group) * (group['values'].mean() - df['values'].mean())**2

    df_groups = len(groups)-1
    df_residuals = len(data)-len(groups)
    F = (ss_treatments/df_groups) / (ss_error/df_residuals)
    df = stats.f(df_groups,df_residuals)
    p = df.sf(F)

    print(('ANOVA-Results: F = {0}, and p<{1}'.format(F, p)))
    
    return (F, p)
    

if __name__ == '__main__':
    anova_oneway()
    anova_byHand()
    show_teqf()
    anova_statsmodels()    
    input('Done!')


One-way ANOVA written out by hand, with LSD multiple comparisons added

# -*- coding: utf-8 -*-
# author: 231469242@qq.com
# One-way ANOVA computed by hand, with LSD multiple comparisons

import scipy,math
from scipy.stats import f
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats
# additional packages
from statsmodels.stats.diagnostic import lillifors
#multiple comparisons
from statsmodels.sandbox.stats.multicomp import multipletests
#for building pairwise combinations
import itertools

group1=[2,3,7,2,6]
group2=[10,8,7,5,10]
group3=[10,13,14,13,15]
'''
alternative test data:
group1=[2,2,3,4,4,5,3,4,4,4]
group2=[4,4,3,5,4,1,1,2,3,3]
group3=[1,2,2,2,3,2,3,1,3,1]
'''
list_groups=[group1,group2,group3]
list_total=group1+group2+group3
a=0.05

#squared error within one group, also known as random error
def SE(group):
    se=0
    mean1=np.mean(group)
    for i in group:
        error=i-mean1
        se+=error**2
    return se
'''
>>> SE(group1)
22.0
>>> SE(group2)
18.0
>>> SE(group3)
14.0
'''

#SSE: sum of squares within groups (random error)
def SSE(list_groups):
    sse=0
    for group in list_groups:
        se=SE(group)
        sse+=se
    return sse

#SST: total sum of squares
def SST(list_total):
    sst=0
    mean1=np.mean(list_total)
    for i in list_total:
        error=i-mean1
        sst+=error**2
    return sst

#SSA: between-group (treatment) sum of squares, formula 1: ssa = sst - sse
def SSA(list_groups,list_total):
    sse=SSE(list_groups)
    sst=SST(list_total)
    ssa=sst-sse
    return ssa

#SSA: between-group (treatment) sum of squares, computed directly
def SSA1(list_groups,list_total):
    mean_total=np.mean(list_total)
    ssa=0
    for group in list_groups:
        group_mean=np.mean(group)
        distance=(mean_total-group_mean)**2
        ssa+=distance
    ssa=ssa*5          # 5 observations per group
    return ssa

#MSA: treatment mean square
def MSA(list_groups,list_total):
    ssa=SSA(list_groups,list_total)
    msa=ssa/(len(list_groups)-1)*1.0
    return msa

#MSE: error mean square
def MSE(list_groups,list_total):
    sse=SSE(list_groups)
    mse=sse/(len(list_total)-1*len(list_groups))*1.0
    return mse

#F score
def F(list_groups,list_total):
    msa=MSA(list_groups,list_total)
    mse=MSE(list_groups,list_total)
    ratio=msa/mse*1.0
    return ratio
'''
>>> F(list_groups,list_total)
22.592592592592595
'''

#LSD (least significant difference)
def LSD(list_groups,list_total,sample1,sample2):
    mean1=np.mean(sample1)
    mean2=np.mean(sample2)
    distance=abs(mean1-mean2)
    print("distance:",distance)
    #degrees of freedom for the t test
    df=len(list_total)-1*len(list_groups)
    mse=MSE(list_groups,list_total)
    print("MSE:",mse)
    t_value=stats.t(df).isf(a/2)
    print("t value:",t_value)
    lsd=t_value*math.sqrt(mse*(1.0/len(sample1)+1.0/len(sample2)))
    print("LSD:",lsd)
    if distance<lsd:
        print("no significant difference between:",sample1,sample2)
    else:
        print("there is significant difference between:",sample1,sample2)

#normality test
def check_normality(testData):
    #20 < sample size < 50: use the normaltest algorithm
    if 20<len(testData)<50:
        p_value=stats.normaltest(testData)[1]
        if p_value<0.05:
            print("use normaltest")
            print("data are not normal distributed")
            return False
        else:
            print("use normaltest")
            print("data are normal distributed")
            return True

    #sample size < 50: use the Shapiro-Wilk algorithm
    if len(testData)<50:
        p_value=stats.shapiro(testData)[1]
        if p_value<0.05:
            print("use shapiro:")
            print("data are not normal distributed")
            return False
        else:
            print("use shapiro:")
            print("data are normal distributed")
            return True

    if 300>=len(testData)>=50:
        p_value=lillifors(testData)[1]
        if p_value<0.05:
            print("use lillifors:")
            print("data are not normal distributed")
            return False
        else:
            print("use lillifors:")
            print("data are normal distributed")
            return True

    if len(testData)>300:
        p_value=stats.kstest(testData,'norm')[1]
        if p_value<0.05:
            print("use kstest:")
            print("data are not normal distributed")
            return False
        else:
            print("use kstest:")
            print("data are normal distributed")
            return True

#normality test for all groups
def NormalTest(list_groups):
    for group in list_groups:
        status=check_normality(group)
        if status==False:
            return False

#build the pairwise combinations of the groups
def Combination(list_groups):
    combination=[]
    for i in range(1,len(list_groups)+1):
        iter=itertools.combinations(list_groups,i)
        combination.append(list(iter))
    #drop the first and the last element (the single groups and the full triple), keeping the pairs
    return combination[1:-1][0]
'''
Out[57]: [[([2, 3, 7, 2, 6], [10, 8, 7, 5, 10]), ([2, 3, 7, 2, 6], [10, 13, 14, 13, 15]), ([10, 8, 7, 5, 10], [10, 13, 14, 13, 15])]]
'''

#multiple comparisons
def Multiple_test(list_groups):
    print("multiple test----------------------------------------------")
    combination=Combination(list_groups)
    for pair in combination:
        LSD(list_groups,list_total,pair[0],pair[1])


#normality test for every group
print("Normality test:-----------------------------------")
NormalTest(list_groups)

#homogeneity of variances
scipy.stats.levene(group1,group2,group3)
'''
H0 holds: the variances of the three groups show no significant difference
Out[9]: LeveneResult(statistic=0.24561403508771934, pvalue=0.7860617221429711)
'''
print("result--------------------------------------------------")
f_score=F(list_groups,list_total)
print("F score:",f_score)
#sf is the survival function
probability=f.sf(f_score,2,12)
'''
Out[28]: 8.5385924542746692e-05
'''
if probability<0.05:
    print("there is significance, H1 wins")
    #multiple comparisons
    print("Multiple test")
    Multiple_test(list_groups)

 

 Multiple comparisons: comprehensive version

https://github.com/thomas-haslwanter/statsintro_python/tree/master/ISP/Code_Quantlets/08_TestsMeanValues/multipleTesting

 

Name of QuantLet: ISP_multipleTesting

Published in: An Introduction to Statistics with Python

Description: 'Multiple testing
    This script provides an example where three treatments are compared. It
    first performs a one-way ANOVA, to see if there is a difference between the
    groups. Then it performs multiple comparisons, to check which of the groups
    are different.
    This dataset is taken from an R tutorial, and contains a hypothetical sample
    of 30 participants who are divided into three stress reduction treatment
    groups (mental, physical, and medical). The values are represented on a
    scale that ranges from 1 to 5. This dataset can be conceptualized as a
    comparison between three stress treatment programs, one using mental
    methods, one using physical training, and one using medication. The values
    represent how effective the treatment programs were at reducing
    participants' stress levels, with higher numbers indicating higher
    effectiveness.
    Taken from an example by Josef Perktold (http://jpktd.blogspot.co.at/)'

Keywords: 'post hoc test, anova, tukey's HSD test, Holm test'

See also: 'ISP_anovaOneway, ISP_anovaTwoway, ISP_kruskalWallis, ISP_oneGroup, ISP_twoGroups'

Author: Thomas Haslwanter

Submitted: October 31, 2015

Example: multComp.png

 

 

 

 

# -*- coding: utf-8 -*-
 
# Import standard packages
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
import pandas as pd
import os
 
# additional packages
import sys
sys.path.append(os.path.join('..', '..', 'Utilities'))
 
try:
# Import formatting commands if directory "Utilities" is available
    from ISP_mystyle import showData
     
except ImportError:
# Ensure correct performance otherwise
    def showData(*options):
        plt.show()
        return
 
# Other required packages
from statsmodels.stats.multicomp import (pairwise_tukeyhsd,
                                         MultiComparison)
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm
from statsmodels.stats.libqsturng import psturng
from statsmodels.stats.diagnostic import lillifors   # needed by check_normality below
import variance_check
 
#name of the Excel data file
excel="sample.xlsx"
#read the data
df=pd.read_excel(excel)
#extract each treatment group as a list
group_mental=list(df.StressReduction[(df.Treatment=="mental")])
group_physical=list(df.StressReduction[(df.Treatment=="physical")])
group_medical=list(df.StressReduction[(df.Treatment=="medical")])

#data preprocessing: drop NaN and -inf entries
groupMental_removeNan=[x for x in group_mental if str(x) != 'nan' and str(x)!= '-inf']
groupPhysical_removeNan=[x for x in group_physical if str(x) != 'nan' and str(x)!= '-inf']
groupMedical_removeNan=[x for x in group_medical if str(x) != 'nan' and str(x)!= '-inf']

list_groups=[groupMental_removeNan,groupPhysical_removeNan,groupMedical_removeNan]
list_total=groupMental_removeNan+groupPhysical_removeNan+groupMedical_removeNan
#preliminary checks: normality, homogeneity of variance, equal group sizes
normality=variance_check.NormalTest(list_groups)   
leveneResult=variance_check.Levene_test(list_groups[0],list_groups[1],list_groups[2])
equal_lenth=variance_check.Equal_lenth(list_groups)  






#if the groups have different sizes this raises an error, so the data need preprocessing
def doAnova(df):
    #one-way ANOVA
    model = ols('StressReduction ~ C(Treatment)',df).fit()
    #raises an error if the group sizes differ
    anovaResults =  anova_lm(model)
    print(anovaResults)
    if anovaResults['PR(>F)'][0] < 0.05:
        print('One of the groups is different.')

 
def doTukey(data, multiComp,equal_lenth):   
    '''Do a pairwise comparison, and show the confidence intervals'''
    if equal_lenth==False:
        print"warning:the length of groups are not equal"
        return False
    
    
    print((multiComp.tukeyhsd().summary()))
     
    # Calculate the p-values:
    res2 = pairwise_tukeyhsd(data['StressReduction'], data['Treatment'])
    
    #the code below draws the comparison plot
    df = pd.DataFrame(data)
    numData = len(df)
    numTreatments = len(df.Treatment.unique())
    dof = numData - numTreatments
     
    # Show the group names
    print((multiComp.groupsunique))
     
    # Generate a print -------------------
     
    # Get the data
    xvals = np.arange(3)
    res2 = pairwise_tukeyhsd(data['StressReduction'], data['Treatment'])
    errors = np.ravel(np.diff(res2.confint)/2)
     
    # Plot them
    plt.plot(xvals, res2.meandiffs, 'o')
    plt.errorbar(xvals, res2.meandiffs, yerr=errors, fmt='o')
     
    # Put on labels
    pair_labels = multiComp.groupsunique[np.column_stack(res2._multicomp.pairindices)]
    plt.xticks(xvals, pair_labels)
     
    # Format the plot
    xlim = -0.5, 2.5
    plt.hlines(0, *xlim)
    plt.xlim(*xlim)
    plt.title('Multiple Comparison of Means - Tukey HSD, FWER=0.05' +
              '\n Pairwise Mean Differences')         
     
    # Save to outfile, and show the data
    outFile = 'multComp.png'
    showData(outFile)
     
def Holm_Bonferroni(multiComp):
    ''' Instead of the Tukey's test, we can do pairwise t-test
    by splitting alpha=0.05 across the comparisons, i.e. correcting alpha down to a smaller per-test value.'''
     
    # First, with the "Holm" correction
    rtp = multiComp.allpairtest(stats.ttest_rel, method='Holm')
    print((rtp[0]))
     
    # and then with the Bonferroni correction
    print((multiComp.allpairtest(stats.ttest_rel, method='b')[0]))
     
    # Any value, for testing the program for correct execution
    checkVal = rtp[1][0][0,0]
    return checkVal
    
    
#normality test
def check_normality(testData):

    #20 < sample size < 50: use the normaltest algorithm
    if 20<len(testData)<50:
        p_value=stats.normaltest(testData)[1]
        if p_value<0.05:
            print("use normaltest")
            print("data are not normal distributed")
            return False
        else:
            print("use normaltest")
            print("data are normal distributed")
            return True

    #sample size < 50: use the Shapiro-Wilk algorithm
    if len(testData)<50:
        p_value=stats.shapiro(testData)[1]
        if p_value<0.05:
            print("use shapiro:")
            print("data are not normal distributed")
            return False
        else:
            print("use shapiro:")
            print("data are normal distributed")
            return True

    if 300>=len(testData)>=50:
        p_value=lillifors(testData)[1]
        if p_value<0.05:
            print("use lillifors:")
            print("data are not normal distributed")
            return False
        else:
            print("use lillifors:")
            print("data are normal distributed")
            return True

    if len(testData)>300:
        p_value=stats.kstest(testData,'norm')[1]
        if p_value<0.05:
            print("use kstest:")
            print("data are not normal distributed")
            return False
        else:
            print("use kstest:")
            print("data are normal distributed")
            return True


#normality test for all groups
def NormalTest(list_groups):
    for group in list_groups:
        #check each group
        status=check_normality(group)
        if status==False:
            return False
     
def main():
    # Do a one-way ANOVA
    # note: this function has a problem (it fails when the group sizes differ) and needs correcting
    doAnova(df)
    #stats.f_oneway(groupMental_removeNan,groupPhysical_removeNan,groupMedical_removeNan)
    #Then, do the multiple testing
    multiComp = MultiComparison(df['StressReduction'], df['Treatment'])
    
    doTukey(df, multiComp,equal_lenth)    # Tukey's HSD test
    checkVal = Holm_Bonferroni(multiComp)  # Alternatives to Tukey's HSD test
     
    return checkVal     # this is only for regression testing of the program
     
if __name__ == '__main__':
    main()
 

 

 variance_check.py

# -*- coding: utf-8 -*-
'''
Helper checks used before ANOVA:
homogeneity-of-variance test,
normality test,
equal-group-size test
'''
import scipy,math
from scipy.stats import f
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats
# additional packages
from statsmodels.stats.diagnostic import lillifors
#multiple comparisons
from statsmodels.sandbox.stats.multicomp import multipletests
#for building pairwise combinations
import itertools
'''
#test data
group1=[2,3,7,2,6]
group2=[10,8,7,5,10]
group3=[10,13,14,13,15]
list_groups=[group1,group2,group3]
list_total=group1+group2+group3
'''
a=0.05

#normality test
def check_normality(testData):

    #20 < sample size < 50: use the normaltest algorithm
    if 20<len(testData)<50:
        p_value=stats.normaltest(testData)[1]
        if p_value<0.05:
            print("use normaltest")
            print("data are not normal distributed")
            return False
        else:
            print("use normaltest")
            print("data are normal distributed")
            return True

    #sample size < 50: use the Shapiro-Wilk algorithm
    if len(testData)<50:
        p_value=stats.shapiro(testData)[1]
        if p_value<0.05:
            print("use shapiro:")
            print("data are not normal distributed")
            return False
        else:
            print("use shapiro:")
            print("data are normal distributed")
            return True

    if 300>=len(testData)>=50:
        p_value=lillifors(testData)[1]
        if p_value<0.05:
            print("use lillifors:")
            print("data are not normal distributed")
            return False
        else:
            print("use lillifors:")
            print("data are normal distributed")
            return True

    if len(testData)>300:
        p_value=stats.kstest(testData,'norm')[1]
        if p_value<0.05:
            print("use kstest:")
            print("data are not normal distributed")
            return False
        else:
            print("use kstest:")
            print("data are normal distributed")
            return True


#normality test for all groups
def NormalTest(list_groups):
    for group in list_groups:
        #check each group
        status=check_normality(group)
        if status==False:
            return False
    return True
             
#build the pairwise combinations of the groups
def Combination(list_groups):
    combination= []
    for i in range(1,len(list_groups)+1):
        iter = itertools.combinations(list_groups,i)
        combination.append(list(iter))
    #drop the first and the last element (the single groups and the full triple), keeping the pairs
    return combination[1:-1][0]
'''
Out[57]:
[[([2, 3, 7, 2, 6], [10, 8, 7, 5, 10]),
  ([2, 3, 7, 2, 6], [10, 13, 14, 13, 15]),
  ([10, 8, 7, 5, 10], [10, 13, 14, 13, 15])]]
'''       


#test for homogeneity of variance
def Levene_test(group1,group2,group3):
    leveneResult=scipy.stats.levene(group1,group2,group3)
    p=leveneResult[1]
    print("levene test:")
    if p<0.05:
        print("variances of groups are not equal")
        return False
    else:
        print("variances of groups are equal")
        return True
          
'''
H0 holds: the variances of the three groups show no significant difference
Out[9]: LeveneResult(statistic=0.24561403508771934, pvalue=0.7860617221429711)
'''

#check whether the groups have equal sizes; if not, Tukey-type methods are not appropriate
#note: this function is imperfect - it cannot fully handle the exclusion of NaN values
def Equal_lenth(list_groups):
    list1=list_groups[0]
    list2=list_groups[1]
    list3=list_groups[2]
    
    list1_removeNan=[x for x in list1 if str(x) != 'nan' and str(x)!= '-inf']
    list2_removeNan=[x for x in list2 if str(x) != 'nan' and str(x)!= '-inf']
    list3_removeNan=[x for x in list3 if str(x) != 'nan' and str(x)!= '-inf']
    
    len1=len(list1_removeNan)
    len2=len(list2_removeNan)
    len3=len(list3_removeNan)
    if len1==len2==len3:
        return True
    else:
        return False


'''
#returns True or False
normality=NormalTest(list_groups)   
leveneResult=Levene_test(list_groups[0],list_groups[1],list_groups[2])  
'''

     
               

 

 

 

 Levene test

 As the name suggests, the Levene test checks whether the groups have homogeneous (i.e. equal) variances, so that they can be regarded as coming from populations with the same variance.

 

 http://www.itl.nist.gov/div898/handbook/eda/section3/eda35a.htm

 

https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.levene.html

 

 

The Levene test checks whether all input samples come from populations with equal variances.

 

# -*- coding: utf-8 -*-
import scipy
from scipy.stats import f
import numpy as np 

group1=[2,3,7,2,6]
group2=[10,8,7,5,10]
group3=[10,13,14,13,15]


#test for homogeneity of variance
scipy.stats.levene(group1,group2,group3)
'''
H0 holds: the variances of the three groups show no significant difference
Out[9]: LeveneResult(statistic=0.24561403508771934, pvalue=0.7860617221429711)
'''

 

 

Source code of scipy.stats.levene

def levene(*args, **kwds):
    """
    Perform Levene test for equal variances.
    The Levene test tests the null hypothesis that all input samples
    are from populations with equal variances.  Levene's test is an
    alternative to Bartlett's test `bartlett` in the case where
    there are significant deviations from normality.
    Parameters
    ----------
    sample1, sample2, ... : array_like
        The sample data, possibly with different lengths
    center : {'mean', 'median', 'trimmed'}, optional
        Which function of the data to use in the test.  The default
        is 'median'.
    proportiontocut : float, optional
        When `center` is 'trimmed', this gives the proportion of data points
        to cut from each end. (See `scipy.stats.trim_mean`.)
        Default is 0.05.
    Returns
    -------
    statistic : float
        The test statistic.
    pvalue : float
        The p-value for the test.
    Notes
    -----
    Three variations of Levene's test are possible.  The possibilities
    and their recommended usages are:
      * 'median' : Recommended for skewed (non-normal) distributions>
      * 'mean' : Recommended for symmetric, moderate-tailed distributions.
      * 'trimmed' : Recommended for heavy-tailed distributions.
    References
    ----------
    .. [1]  http://www.itl.nist.gov/div898/handbook/eda/section3/eda35a.htm
    .. [2]   Levene, H. (1960). In Contributions to Probability and Statistics:
               Essays in Honor of Harold Hotelling, I. Olkin et al. eds.,
               Stanford University Press, pp. 278-292.
    .. [3]  Brown, M. B. and Forsythe, A. B. (1974), Journal of the American
              Statistical Association, 69, 364-367
    """
    # Handle keyword arguments.
    center = 'median'
    proportiontocut = 0.05
    for kw, value in kwds.items():
        if kw not in ['center', 'proportiontocut']:
            raise TypeError("levene() got an unexpected keyword "
                            "argument '%s'" % kw)
        if kw == 'center':
            center = value
        else:
            proportiontocut = value

    k = len(args)
    if k < 2:
        raise ValueError("Must enter at least two input sample vectors.")
    Ni = zeros(k)
    Yci = zeros(k, 'd')

    if center not in ['mean', 'median', 'trimmed']:
        raise ValueError("Keyword argument <center> must be 'mean', 'median'"
                        " or 'trimmed'.")

    if center == 'median':
        func = lambda x: np.median(x, axis=0)
    elif center == 'mean':
        func = lambda x: np.mean(x, axis=0)
    else:  # center == 'trimmed'
        args = tuple(stats.trimboth(np.sort(arg), proportiontocut)
                     for arg in args)
        func = lambda x: np.mean(x, axis=0)

    for j in range(k):
        Ni[j] = len(args[j])
        Yci[j] = func(args[j])
    Ntot = np.sum(Ni, axis=0)

    # compute Zij's
    Zij = [None] * k
    for i in range(k):
        Zij[i] = abs(asarray(args[i]) - Yci[i])

    # compute Zbari
    Zbari = zeros(k, 'd')
    Zbar = 0.0
    for i in range(k):
        Zbari[i] = np.mean(Zij[i], axis=0)
        Zbar += Zbari[i] * Ni[i]

    Zbar /= Ntot
    numer = (Ntot - k) * np.sum(Ni * (Zbari - Zbar)**2, axis=0)

    # compute denom_variance
    dvar = 0.0
    for i in range(k):
        dvar += np.sum((Zij[i] - Zbari[i])**2, axis=0)

    denom = (k - 1.0) * dvar

    W = numer / denom
    pval = distributions.f.sf(W, k-1, Ntot-k)  # 1 - cdf
    return LeveneResult(W, pval)


 Two-way ANOVA

 

# -*- coding: utf-8 -*-
'''
Name of QuantLet: ISP_anovaTwoway

Published in:  An Introduction to Statistics with Python

Description: 'Two-way Analysis of Variance (ANOVA)
    The model is formulated using the <patsy> formula description. This is very
    similar to the way models are expressed in R.'

Keywords: plot, fitting, two-way anova

See also: 'ISP_anovaOneway, ISP_kruskalWallis,
    ISP_multipleTesting, ISP_oneGroup, ISP_twoGroups' 

Author: Thomas Haslwanter 

Submitted: October 31, 2015 

Datafile: altman_12_6.txt 

Two-way Analysis of Variance (ANOVA)
The model is formulated using the "patsy" formula description. This is very
similar to the way models are expressed in R.
'''

# Copyright(c) 2015, Thomas Haslwanter. All rights reserved, under the CC BY-SA 4.0 International License

# Import standard packages
import numpy as np
import pandas as pd

# additional packages
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm

inFile = 'altman_12_6.txt'


def anova_interaction(inFile):
    '''ANOVA with interaction: Measurement of fetal head circumference,
    by four observers in three fetuses, from a study investigating the
    reproducibility of ultrasonic fetal head circumference data.
    '''
    
    # Get the data
    
    data = np.genfromtxt(inFile, delimiter=',')
    
    # Bring them in DataFrame-format
    df = pd.DataFrame(data, columns=['hs', 'fetus', 'observer'])
    
    # --- >>> START stats <<< ---
    # Determine the ANOVA with interaction
    formula = 'hs ~ C(fetus) + C(observer) + C(fetus):C(observer)'
    lm = ols(formula, df).fit()
    anovaResults = anova_lm(lm)
    # --- >>> STOP stats <<< ---
    print(anovaResults)

    return  anovaResults['F'][0]
                              
if __name__ == '__main__':
    anova_interaction(inFile)

 

In words: While—as expected—different fetuses show highly significant differences
in their head size (p < 0.001), the choice of observer also has a significant
effect (p < 0.05). However, no individual observer was significantly off with any
individual fetus (p > 0.05).
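For comparison, dropping the interaction term leaves a main-effects-only two-way model. A minimal sketch with the same altman_12_6.txt data and patsy formulas as the script above:

import numpy as np
import pandas as pd
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm

data = np.genfromtxt('altman_12_6.txt', delimiter=',')
df = pd.DataFrame(data, columns=['hs', 'fetus', 'observer'])

# main effects only
print(anova_lm(ols('hs ~ C(fetus) + C(observer)', df).fit()))

# main effects plus the fetus:observer interaction (as in anova_interaction above)
print(anova_lm(ols('hs ~ C(fetus) + C(observer) + C(fetus):C(observer)', df).fit()))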

 

Data source

https://github.com/thomas-haslwanter/statsintro_python/blob/master/ISP/Code_Quantlets/08_TestsMeanValues/anovaTwoway/altman_12_6.txt


How to run a one-way ANOVA in SPSS

With quantitative data from three groups, we first need to check normality and homogeneity of variance. The normality check is the same as introduced earlier; the homogeneity-of-variance check differs slightly. The detailed steps are below.

http://jingyan.baidu.com/article/ca2d939d009bc8eb6c31cee4.html

  1. Step 1: Enter the data in the SPSS Data View (the same as for the t-test earlier), then choose [Analyze] → [Compare Means] → [One-Way ANOVA].

  2. Step 2: In the One-Way ANOVA dialog that opens, move [group] into [Factor] and [value] into [Dependent List].

  3. Step 3: Click [Options], tick [Homogeneity of variance test], click [Continue] and then OK; the homogeneity-of-variance result will appear in the output.

  4. Step 4: In the output, the Levene test has a significance of 0.382, i.e. P > 0.05, so the difference is not statistically significant and the variances can be treated as equal.


Multiple Comparisons

Multiple comparisons: used to find which treatments differ significantly, i.e. at the level of individual treatment pairs, which pairs of means are unequal. There are many multiple-comparison methods; Fisher proposed the least significant difference (LSD).

The null hypothesis in a one-way ANOVA is that the means of all the samples are
the same. So if a one-way ANOVA yields a significant result, we only know that
they are not the same.
However, often we are not just interested in the joint hypothesis if all samples are
the same, but we would also like to know for which pairs of samples the hypothesis
of equal values is rejected. In this case we conduct several tests at the same time,
one test for each pair of samples. (Typically, this is done with t-tests.)
These tests are sometimes referred to as post-hoc analysis. In the design and
analysis of experiments, a post hoc analysis (from Latin post hoc, “after this”)
consists of looking at the data—after the experiment has concluded—for patterns
that were not specified beforehand. Here this is the case, because the null hypothesis
of the ANOVA is that there is no difference between the groups.
This results, as a consequence, in a multiple testing problem: since we perform
multiple comparison tests, we should compensate for the risk of getting a significant
result, even if our null hypothesis is true. This can be done by correcting the p-values
to account for this. We have a number of options to do so:
• Tukey HSD
• Bonferroni correction
• Holm correction
• ... and others (a short correction sketch follows below)
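As a sketch of the correction step itself, statsmodels' multipletests adjusts a set of p-values for multiple testing (the raw p-values below are made-up placeholders for three pairwise comparisons):

from statsmodels.stats.multitest import multipletests

raw_p = [0.01, 0.04, 0.30]   # hypothetical pairwise p-values
for method in ('bonferroni', 'holm'):
    reject, p_corrected, _, _ = multipletests(raw_p, alpha=0.05, method=method)
    print(method, p_corrected, reject)

Tukey's HSD itself is available as pairwise_tukeyhsd in statsmodels.stats.multicomp, which the multiple-testing script above already uses.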

 
