Python Data Analysis: Data Distributions

Reposted article, dated 2019-08-12 00:35.

Source: https://blog.csdn.net/YEPAO01/article/details/99197487

I. Examining how the data is distributed

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

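# Read the source data.
# Column names: 體溫 = body temperature (°F), 性別 = sex (1 = male, 2 = female), 心率 = heart rate
df = pd.read_csv('http://jse.amstat.org/datasets/normtemp.dat.txt', header=None, sep=r'\s+', names=['體溫','性別','心率'])
df.head()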

# Option 1: download the file locally first, then read it
import requests

resp = requests.get("http://jse.amstat.org/datasets/normtemp.dat.txt")
resp.encoding = "utf-8"
with open("normtemp.dat.txt", "w", encoding="utf-8") as f:
    f.write(resp.text)
df = pd.read_csv("normtemp.dat.txt", header=None, sep=r"\s+")
df.columns = ['體溫','性別','心率']
df.head()

# Option 2: read directly from the URL without downloading
columns = ['體溫','性別','心率']
df = pd.read_csv("http://jse.amstat.org/datasets/normtemp.dat.txt", header=None, sep=r"\s+", names=columns)

# Summary statistics for each column
df.describe()

Drawing a scatter plot

# Scatter plot of body temperature by sex (1 = male, 2 = female)
fig = plt.figure(figsize=(16,5))
df1 = df[df["性別"]==1]
plt.scatter(df1.index, df1["體溫"], c="r", label="male")
df2 = df[df["性別"]==2]
plt.scatter(df2.index, df2["體溫"], c="b", label="female")
plt.legend()
plt.ylabel("body temperature")
plt.xlabel("row index")
plt.grid()

Bar chart

# Bar chart of body temperature, one bar per row
df_tw = df["體溫"]  # body temperature column
x = np.arange(len(df_tw))
y = df_tw.values
plt.bar(x, y)

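Plotting a histogram to see how body temperature is distributed

# Histogram with a kernel density estimate drawn on a secondary y-axis
df_tw.hist(bins=20, alpha=0.5)
df_tw.plot(kind='kde', secondary_y=True)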

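Counting how often each temperature occurs

# Count the occurrences of each temperature value
df_tm = df["體溫"]  # body temperature column
df_tm01 = df_tm.value_counts()  # count occurrences
df_tm01.sort_index(inplace=True)  # sort by temperature
print(df_tm01.head())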


96.3    1
96.4    1
96.7    2
96.8    1
96.9    1
Name: 體溫, dtype: int64

plt.scatter(df_tm01.index,df_tm01.values)

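Testing for normality

Method 1: scipy.stats.normaltest(a, axis=0)
Parameters: a - the data to test; axis - an integer or None. If None, the whole input is treated as a single data set. The default is 0, meaning the test is computed along axis 0.
Returns: k2 - the statistic s^2 + k^2, where s is the z-score returned by skewtest and k is the z-score returned by kurtosistest; p-value - the p-value of the test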

import scipy.stats
scipy.stats.normaltest(df_tm)


NormaltestResult(statistic=2.703801433319236, pvalue=0.2587479863488212)

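The p-value (0.26) is greater than 0.05, so the hypothesis that the data is normally distributed is not rejected.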
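The description of method 1 says the statistic is s^2 + k^2. The sketch below checks that relationship numerically. It uses a synthetic sample (an assumption for illustration, not the temperature data), and the names `sample`, `s`, `k` and `k2` are chosen here only to mirror the text.

```python
import numpy as np
from scipy import stats

# Synthetic sample, assumed for illustration only (not the temperature data)
rng = np.random.default_rng(0)
sample = rng.normal(loc=98.2, scale=0.7, size=130)

s, _ = stats.skewtest(sample)        # z-score of the skewness test
k, _ = stats.kurtosistest(sample)    # z-score of the kurtosis test
k2, p_value = stats.normaltest(sample)

# The normaltest statistic should equal s^2 + k^2 up to rounding
print("s^2 + k^2            =", s**2 + k**2)
print("normaltest statistic =", k2)
print("p-value              =", p_value)
```

With the real data, `sample` would be replaced by the body temperature column.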

Method 2: Shapiro-Wilk test
Function: scipy.stats.shapiro(x)
Official documentation: SciPy v1.1.0 Reference Guide
Parameters: x - the data to test
Returns: W - the test statistic; p-value - the p-value of the test

scipy.stats.shapiro(df_tm.values)

(0.9865770936012268, 0.233174666762352)

The p-value 0.23 is greater than 0.05, so normality is not rejected; the data is consistent with a normal distribution.

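Method 3: scipy.stats.kstest

Function: scipy.stats.kstest(rvs, cdf, args=(), N=20, alternative='two-sided', mode='approx')
Official documentation: SciPy v0.14.0 Reference Guide
Parameters: rvs - the data to test, given as a string or an array;
cdf - the distribution to test against; here it is set to 'norm', i.e. a normality test;
args - the distribution parameters, here the sample mean and standard deviation;
alternative - one-sided or two-sided test; the default is 'two-sided'
Returns: D - the Kolmogorov-Smirnov test statistic; p-value - the p-value of the test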

u = df_tm.mean()
std = df_tm.std()
scipy.stats.kstest(df_tm.values,'norm',args=(u,std))

KstestResult(statistic=0.06472685044046644, pvalue=0.6450307317439967)
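The call above passes the sample mean and standard deviation through `args`. As a rough illustration of what that does, the sketch below shows that testing against norm with `args=(u, std)` gives the same statistic as standardizing the data first and testing against the standard normal. The sample is synthetic (an assumption, not the temperature data).

```python
import numpy as np
from scipy import stats

# Synthetic sample, assumed for illustration only
rng = np.random.default_rng(1)
sample = rng.normal(loc=98.2, scale=0.7, size=130)

u = sample.mean()
std = sample.std(ddof=1)  # ddof=1 matches the default of pandas Series.std()

# Test against a normal distribution with the estimated mean and std
direct = stats.kstest(sample, 'norm', args=(u, std))

# Equivalent: standardize first, then test against the standard normal
standardized = stats.kstest((sample - u) / std, 'norm')

print(direct.statistic, standardized.statistic)
```

Because the mean and standard deviation are estimated from the same data being tested, the resulting p-value tends to be conservative, so it is best read as a rough check.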

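Method 4: Anderson-Darling test
Function: scipy.stats.anderson(x, dist='norm')
This test is a refinement of the Kolmogorov-Smirnov test (scipy.stats.kstest) and can test against several distributions, including the normal, exponential, logistic and Gumbel distributions. The default is norm, i.e. a normality test.
Official documentation: SciPy v1.1.0 Reference Guide
Parameters: x - the data to test; dist - the type of distribution to test against
Returns: statistic - the test statistic; critical_values - the critical values; significance_level - the significance levels (in percent) that correspond to the critical values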

scipy.stats.anderson(df_tm.values,dist="norm")


AndersonResult(statistic=0.5201038826714353, critical_values=array([0.56 , 0.637, 0.765, 0.892, 1.061]), significance_level=array([15. , 10. ,  5. ,  2.5,  1. ]))

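Conclusion: the p-values of the first three tests are all greater than 5%, so normality is not rejected and body temperature can be treated as normally distributed. The fourth method does not return a p-value; instead, its statistic (0.520) is compared with the critical values, and since it is below even the 15% critical value (0.56), normality is not rejected by that test either.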
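Since the Anderson-Darling result has no p-value, it is read by comparing the statistic with each critical value: normality is rejected at a given significance level only if the statistic exceeds the corresponding critical value. A minimal sketch, using the numbers copied from the output above (the helper name `anderson_decisions` is hypothetical):

```python
# Values copied from the AndersonResult output above
statistic = 0.5201038826714353
critical_values = [0.56, 0.637, 0.765, 0.892, 1.061]
significance_levels = [15.0, 10.0, 5.0, 2.5, 1.0]  # in percent


def anderson_decisions(statistic, critical_values, significance_levels):
    """Return (level, critical value, reject?) for each significance level.

    Normality is rejected at a level when the statistic exceeds
    the critical value for that level.
    """
    return [
        (level, crit, statistic > crit)
        for level, crit in zip(significance_levels, critical_values)
    ]


decisions = anderson_decisions(statistic, critical_values, significance_levels)
for level, crit, reject in decisions:
    verdict = "reject" if reject else "do not reject"
    print(f"{level}% level: critical value {crit}, {verdict} normality")
```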

Using a box plot to check for outliers

# Box plot
df_tm.plot.box(vert=False, grid=True)

Finding the specific outliers

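# Upper quartile
q3 = df_tm.quantile(q=0.75)
# Lower quartile
q1 = df_tm.quantile(q=0.25)
# Interquartile range
iqr = q3 - q1
print("Upper quartile: {}\nLower quartile: {}\nInterquartile range: {}".format(q3, q1, iqr))
# Outliers lie more than 1.5 * IQR beyond either quartile
df_tm_01 = df_tm[(df_tm > q3 + 1.5*iqr) | (df_tm < q1 - 1.5*iqr)]
print("Outliers:\n{}".format(df_tm_01))

Upper quartile: 98.7
Lower quartile: 97.8
Interquartile range: 0.9000000000000057
Outliers:
0       96.3
65      96.4
129    100.8
Name: 體溫, dtype: float64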

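Computing the correlation between body temperature and heart rate with Python
Three correlation coefficients are commonly used in statistics. For each of them, the larger the absolute value, the stronger the correlation:

pearson
kendall
spearman

Correlation coefficient    Strength
0.8-1.0                    very strong
0.6-0.8                    strong
0.4-0.6                    moderate
0.2-0.4                    weak
0.0-0.2                    very weak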

# Pearson correlation coefficient
df["體溫"].corr(df["心率"], method='pearson')
0.24328483580230698

# Spearman correlation coefficient
df["體溫"].corr(df["心率"], method='spearman')
0.265460363879611

# Kendall correlation coefficient
df["體溫"].corr(df["心率"], method='kendall')
0.17673221630037853

Or:

df = df[["體溫","心率"]]
print(df.corr(method='pearson'),"\n")
print(df.corr(method='spearman'),"\n")
print(df.corr(method='kendall'),"\n")

          體溫        心率
體溫  1.000000  0.243285
心率  0.243285  1.000000 

         體溫       心率
體溫  1.00000  0.26546
心率  0.26546  1.00000 

          體溫        心率
體溫  1.000000  0.176732
心率  0.176732  1.000000
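The strength table can be applied to the three coefficients reported above. The sketch below is one way to do that; the function name `correlation_strength` is hypothetical, and treating each lower bound as inclusive is an assumption, since the table's ranges share their endpoints.

```python
def correlation_strength(r):
    """Map a correlation coefficient to the strength labels in the table.

    Assumption: each range includes its lower bound (e.g. 0.2 counts as weak).
    """
    a = abs(r)
    if a > 1:
        raise ValueError("a correlation coefficient must lie in [-1, 1]")
    if a >= 0.8:
        return "very strong"
    if a >= 0.6:
        return "strong"
    if a >= 0.4:
        return "moderate"
    if a >= 0.2:
        return "weak"
    return "very weak"


# Coefficients copied from the results above
coefficients = {
    "pearson": 0.24328483580230698,
    "spearman": 0.265460363879611,
    "kendall": 0.17673221630037853,
}
labels = {name: correlation_strength(r) for name, r in coefficients.items()}
for name, label in labels.items():
    print(f"{name}: {coefficients[name]:.3f} ({label})")
```

By this table, body temperature and heart rate are at most weakly correlated under all three measures.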

fig = plt.figure(figsize=(16,5))
plt.scatter(df.index, df["體溫"])
plt.scatter(df.index, df["心率"])

Reference: https://blog.csdn.net/cyan_soul/article/details/81236124

II. Working with probability distributions in Python

Reference: https://www.cnblogs.com/pinking/p/7898313.html

# Binomial distribution
from scipy.stats import binom

# Geometric distribution
from scipy.stats import geom

# Poisson distribution
from scipy.stats import poisson

# Uniform distribution
from scipy.stats import uniform

# Exponential distribution
from scipy.stats import expon

# Normal distribution
from scipy.stats import norm
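The imports above only name the distributions. As a brief sketch of how they might be used, each object exposes methods such as pmf (discrete) or cdf (continuous) and rvs for random sampling. The parameter values below are arbitrary choices for illustration, not taken from the article.

```python
import math
from scipy.stats import binom, geom, poisson, uniform, expon, norm

# Discrete distributions: probability mass at a point
p_binom = binom.pmf(3, n=10, p=0.5)   # P(3 successes in 10 trials with p = 0.5)
p_geom = geom.pmf(3, p=0.5)           # P(first success occurs on trial 3)
p_poisson = poisson.pmf(2, mu=3)      # P(2 events when the mean rate is 3)

# Continuous distributions: cumulative probability up to a point
c_uniform = uniform.cdf(0.25, loc=0, scale=1)  # uniform on [0, 1]
c_expon = expon.cdf(1, scale=2)                # exponential with mean 2
c_norm = norm.cdf(0)                           # standard normal

# Random samples; random_state makes the draw reproducible
samples = norm.rvs(loc=0, scale=1, size=5, random_state=0)

print(p_binom, p_geom, p_poisson)
print(c_uniform, c_expon, c_norm)
print(samples)
```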
