1. Introduction
data.describe()
This call conveniently outputs summary statistics for the data.
It also accepts parameters for finer control:
DataFrame.describe(percentiles=[0.1,0.2,0.5,0.75],
include=None,
exclude=None)
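Parameters:
percentiles -- numbers between 0 and 1; the percentiles to report (default: 0.25, 0.5, 0.75)
include -- data types to include in the result
exclude -- data types to omit from the result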
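As a rough sketch of how the three parameters behave, the snippet below calls describe() three ways on a small made-up DataFrame. The frame `toy` and its column names are assumptions for illustration only and do not come from the examples in this post.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, used only for illustration
toy = pd.DataFrame({"group": ["a", "b", "a", "b"],
                    "value": [1.0, 2.0, 3.0, 4.0]})

# Default call: numeric columns only, with the 25%, 50% and 75% percentiles
default_stats = toy.describe()

# Custom percentiles, restricted to numeric columns
custom_stats = toy.describe(percentiles=[0.1, 0.9], include=[np.number])

# Leave out numeric columns, so only the non-numeric ones are summarized
non_numeric_stats = toy.describe(exclude=[np.number])

print(default_stats)
print(custom_stats)
print(non_numeric_stats)
```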
2. Hands-on
- Default statistics
import pandas as pd
import numpy as np
series = pd.Series(np.random.randn(100))
series.describe()
'''
count 100.000000 number of non-null values
mean -0.049944 mean
std 0.967943 standard deviation
min -2.692278 minimum
25% -0.717809 25th percentile
50% -0.061116 median
75% 0.682023 75th percentile
max 1.825730 maximum
dtype: float64
'''
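To make the meaning of each row concrete, here is a minimal sketch that rebuilds the default summary by hand and compares it with describe(). The seeded generator is an assumption added for reproducibility, so the numbers will differ from the output shown above.

```python
import numpy as np
import pandas as pd

# Seeded generator (an assumption, so the run is reproducible)
rng = np.random.default_rng(0)
s = pd.Series(rng.standard_normal(100))

summary = s.describe()

# The same statistics computed one by one
manual = pd.Series({
    "count": s.count(),          # non-null values only
    "mean": s.mean(),
    "std": s.std(),              # sample standard deviation (ddof=1)
    "min": s.min(),
    "25%": s.quantile(0.25),
    "50%": s.median(),
    "75%": s.quantile(0.75),
    "max": s.max(),
})

print(np.allclose(summary.to_numpy(dtype=float), manual.to_numpy(dtype=float)))
```

Note that std here is the sample standard deviation, which is the default of Series.std().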
- The percentiles parameter
series.describe(percentiles=[0.05,0.25,0.3,0.7,0.8])
'''
count 100.000000
mean -0.049944
std 0.967943
min -2.692278
5% -1.617615
25% -0.717809
30% -0.574646
50% -0.061116
70% 0.543954
80% 0.776378
max 1.825730
dtype: float64
'''
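The following sketch shows how the requested percentiles turn into index labels and how each value matches Series.quantile(). The series of integers 1 to 100 is a made-up example, chosen so the values are easy to check by hand.

```python
import pandas as pd

# Hypothetical data: the integers 1..100
s = pd.Series(range(1, 101))

summary = s.describe(percentiles=[0.05, 0.3, 0.8])

# Each requested percentile appears in the index as a percentage label
labels = [label for label in summary.index if label.endswith("%")]
print(labels)

# The reported value agrees with Series.quantile for the same fraction
print(summary["5%"], s.quantile(0.05))
```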
- The include parameter
df = pd.DataFrame({"class":["語文","語文","語文","語文","語文","數學","數學","數學","數學","數學"],
"name":["小明","小蘇","小周","小孫","小王","小明","小蘇","小周","小孫","小王"],
"score":[137,125,125,115,115,80,111,130,130,140]})
df
# By default, only the numeric columns are summarized
df.describe()
df.describe(include=[np.number])
'''
score
count 10.000000
mean 120.800000
std 17.203359
min 80.000000
25% 115.000000
50% 125.000000
75% 130.000000
max 140.000000
'''
# Summarize the categorical (object-typed) columns
df.describe(include=['O'])
df.describe(include=[object])
'''
class name
count 10 10 number of non-null values
unique 2 5 number of distinct values
top 數學 小孫 most frequent value
freq 5 2 frequency of the most frequent value
'''
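The four rows of the non-numeric summary can be cross-checked against nunique() and value_counts(). This is a small sketch on a made-up series named `fruit`, chosen so that the most frequent value is unambiguous.

```python
import pandas as pd

# Hypothetical data with a clear most-frequent value
fruit = pd.Series(["apple", "apple", "pear", "apple", "pear", "plum"],
                  name="fruit")

summary = fruit.describe()
counts = fruit.value_counts()

print(summary["unique"], fruit.nunique())   # number of distinct values
print(summary["top"], counts.index[0])      # most frequent value
print(summary["freq"], counts.iloc[0])      # how often it occurs
```

When several values tie for the highest count, as with the two classes in the DataFrame above, the pandas documentation says the reported top is chosen arbitrarily among them, so it should not be relied on.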
# 'all' summarizes every column
df.describe(include='all')
'''
class name score
count 10 10 10.000000
unique 2 5 NaN
top 數學 小孫 NaN
freq 5 2 NaN
mean NaN NaN 120.800000
std NaN NaN 17.203359
min NaN NaN 80.000000
25% NaN NaN 115.000000
50% NaN NaN 125.000000
75% NaN NaN 130.000000
max NaN NaN 140.000000
'''
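The NaN entries in the combined table mark statistics that do not apply to a column's type. A minimal sketch of this, using a made-up frame `mixed` rather than the data above:

```python
import pandas as pd

# Hypothetical frame with one text column and one numeric column
mixed = pd.DataFrame({"label": ["x", "y", "x"], "score": [1, 2, 3]})

full = mixed.describe(include="all")

# mean is undefined for the text column, unique is not reported for the numeric one
print(pd.isna(full.loc["mean", "label"]))
print(pd.isna(full.loc["unique", "score"]))
print(full.loc["count", "score"])
```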
- The exclude parameter
# Leave out columns of the given data type
df.describe(exclude='O')
'''
score
count 10.000000
mean 120.800000
std 17.203359
min 80.000000
25% 115.000000
50% 125.000000
75% 130.000000
max 140.000000
'''
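One way to think about exclude is as a column filter applied before summarizing. The sketch below compares it with an explicit select_dtypes() call on a made-up frame `df_demo`. The equivalence is what this small case shows, not a statement about how pandas implements it.

```python
import numpy as np
import pandas as pd

# Hypothetical frame mixing text and numeric columns
df_demo = pd.DataFrame({"label": ["x", "y", "x"],
                        "score": [1, 2, 3],
                        "ratio": [0.1, 0.2, 0.3]})

# Drop the numeric columns through describe() itself
via_exclude = df_demo.describe(exclude=[np.number])

# Filter the columns first, then summarize what is left
via_select = df_demo.select_dtypes(exclude=[np.number]).describe()

print(via_exclude.equals(via_select))
```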