Dataset URL: http://jse.amstat.org/datasets/normtemp.dat.txt
Dataset description: three columns in total: body temperature, gender, and heart rate.
# Code
from scipy import stats as st
import matplotlib as mpl
import matplotlib.pyplot as plt
import pandas as pd

# Use a CJK-capable font so non-ASCII labels and minus signs render correctly
mpl.rcParams['font.sans-serif'] = ['SimHei']
mpl.rcParams['axes.unicode_minus'] = False

# Load the data (whitespace-separated, no header row)
data = pd.read_csv('http://jse.amstat.org/datasets/normtemp.dat.txt',
                   sep=r'\s+', header=None,
                   names=['temperature', 'Gender', 'Heart rate'])

# Summary statistics
data['temperature'].describe()
Output:
count 130.000000
mean 98.249231
std 0.733183
min 96.300000
25% 97.800000
50% 98.300000
75% 98.700000
max 100.800000
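# Four ways to check normality

# 1. Shapiro-Wilk test: is body temperature normally distributed?
print(st.shapiro(data['temperature']))
# (0.9865769743919373, 0.2331680953502655)
# The second value is the p-value; it is greater than 0.05.

# 2. normaltest (D'Agostino-Pearson test)
print(st.normaltest(data['temperature'], axis=None))
# NormaltestResult(statistic=2.703801433319236, pvalue=0.2587479863488212)
# The second value is the p-value; it is greater than 0.05.

# 3. Kolmogorov-Smirnov test against a normal with the sample mean and std
u = data['temperature'].mean()
std = data['temperature'].std()
print(st.kstest(data['temperature'], 'norm', (u, std)))
# KstestResult(statistic=0.06472685044046644, pvalue=0.645030731743997)
# The second value is the p-value; it is greater than 0.05.
# Note: because u and std are estimated from the same sample, this p-value is only approximate and tends to be too large.

# 4. Anderson-Darling test
print(st.anderson(data['temperature']))
# AndersonResult(statistic=0.5201038826714353, critical_values=array([0.56 , 0.637, 0.765, 0.892, 1.061]), significance_level=array([15. , 10. , 5. , 2.5, 1. ]))
# The significance levels are [15, 10, 5, 2.5, 1] percent. The statistic is smaller than every critical value,
# so the test cannot reject normality at any of these levels. The data are consistent with a normal distribution;
# failing to reject does not prove that the distribution is normal.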
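The first three tests share the same decision rule: compare the p-value with a chosen significance level. The following is a minimal sketch of that rule, using the p-values printed above and assuming a significance level of 0.05; the `decide` helper is a hypothetical name introduced for illustration only.

```python
# Significance level assumed in the comments above
ALPHA = 0.05

# p-values copied from the printed results of the first three tests
p_values = {
    "shapiro": 0.2331680953502655,
    "normaltest": 0.2587479863488212,
    "kstest": 0.645030731743997,
}


def decide(p_value, alpha=ALPHA):
    """Hypothetical helper: apply the usual p-value decision rule."""
    if p_value < alpha:
        return "reject normality"
    return "fail to reject normality"


verdicts = {name: decide(p) for name, p in p_values.items()}
for name, verdict in verdicts.items():
    print(f"{name}: p = {p_values[name]:.4f} -> {verdict}")
```

All three tests fail to reject at this level, which means the sample shows no significant departure from normality rather than confirming it.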
Notes on the anderson method:
https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.anderson.html#scipy.stats.anderson
Significance levels at which critical values are reported, by distribution:
normal/exponential: 15%, 10%, 5%, 2.5%, 1%
logistic: 25%, 10%, 5%, 2.5%, 1%, 0.5%
Gumbel: 25%, 10%, 5%, 2.5%, 1%
If the returned statistic is larger than these critical values then for the corresponding significance level,
the null hypothesis that the data come from the chosen distribution can be rejected.
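The rule quoted above can be applied level by level. This is a minimal sketch using the statistic and critical values from the AndersonResult printed earlier; the `anderson_decisions` helper is a hypothetical name, not part of scipy.

```python
# Values copied from the AndersonResult printed earlier
statistic = 0.5201038826714353
critical_values = [0.56, 0.637, 0.765, 0.892, 1.061]
significance_levels = [15.0, 10.0, 5.0, 2.5, 1.0]  # in percent


def anderson_decisions(statistic, critical_values, significance_levels):
    """Hypothetical helper: return (level, critical value, reject?) per level.

    Normality is rejected at a level only when the statistic exceeds
    the critical value for that level.
    """
    return [
        (level, cv, statistic > cv)
        for level, cv in zip(significance_levels, critical_values)
    ]


decisions = anderson_decisions(statistic, critical_values, significance_levels)
for level, cv, reject in decisions:
    outcome = "reject" if reject else "cannot reject"
    print(f"{level}% level: statistic {statistic:.3f} vs critical {cv} -> {outcome}")
```

With real data, the same helper could be fed `result.statistic`, `result.critical_values`, and `result.significance_level` from the object returned by `st.anderson`.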
# Plotting
x = data['temperature']
x = x.sort_values()
# Fit a normal distribution to the sample and draw its density curve
loc, scale = st.norm.fit(x)
plt.plot(x, st.norm.pdf(x, loc, scale), 'b-', label='norm')
plt.legend()
plt.show()
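The fitted density curve shows the shape of the model but not how closely the sample follows it. A Q-Q comparison is one common complement. The sketch below uses `scipy.stats.probplot` on synthetic data: the seeded sample is an assumption standing in for `data['temperature']`, drawn with the mean and standard deviation reported in the summary statistics, so that the example runs without network access.

```python
import numpy as np
from scipy import stats as st

# Synthetic stand-in for data['temperature'] (assumption, not the real data):
# 130 draws from a normal distribution with the reported mean and std.
rng = np.random.default_rng(0)
sample = rng.normal(loc=98.249231, scale=0.733183, size=130)

# probplot returns the theoretical quantiles, the ordered sample values,
# and a least-squares line (slope, intercept, correlation r) through them.
(theoretical_q, ordered_sample), (slope, intercept, r) = st.probplot(sample, dist="norm")

# An r close to 1 means the points lie close to a straight line,
# which is what a normally distributed sample should produce.
print(f"slope={slope:.3f}, intercept={intercept:.3f}, r={r:.4f}")
```

To draw the plot itself, the same call accepts a plotting target, for example `st.probplot(sample, dist="norm", plot=plt)` followed by `plt.show()`.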