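1. Problem statement
Print basic information about the dataset, such as the maximum, minimum, and mean.
Count the missing values in each variable and the number of samples affected.
Use a box plot to identify outliers.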
2. Maximum, minimum, and mean
Maximum:
import pandas as pd

data = pd.read_csv("C:\\Users\\Administrator\\Desktop\\catering_sale.csv")
# Take the maximum of the numeric column directly from the data.
print(data.max(numeric_only=True))
Output (销量 is the sales column):
销量 9106.44
dtype: float64
Minimum:
import pandas as pd

data = pd.read_csv("C:\\Users\\Administrator\\Desktop\\catering_sale.csv")
# Take the minimum of the numeric column directly from the data.
print(data.min(numeric_only=True))
Output:
销量 22.0
dtype: float64
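Mean:
import pandas as pd

data = pd.read_csv("C:\\Users\\Administrator\\Desktop\\catering_sale.csv")
# Take the mean of the numeric column directly from the data.
print(data.mean(numeric_only=True))
Note that data.describe().mean() is not equivalent. It averages the eight summary rows (count, mean, std, min, the three quartiles, and max) rather than the sales values. For this file it prints 销量 2621.079309, which is not the mean of the sales column.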
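To make the difference between a column mean and the mean of a describe() table concrete, here is a minimal sketch. The toy DataFrame and its column name `sales` are made up for illustration and are not taken from catering_sale.csv.

```python
import pandas as pd

# Hypothetical toy data; the values and the column name are illustrative only.
toy = pd.DataFrame({"sales": [10.0, 20.0, 30.0, 40.0]})

# Mean of the four data values.
column_mean = toy["sales"].mean()

# Mean of the eight summary rows produced by describe()
# (count, mean, std, min, 25%, 50%, 75%, max), which is a different quantity.
summary_mean = toy.describe()["sales"].mean()

print(column_mean)   # 25.0
print(summary_mean)  # differs from the column mean
```

The same reasoning applies to max() and min(): they happen to agree with the data here only as long as the row count lies between the smallest and largest values, so reading statistics from the data itself is the safer habit.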
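3. Number of missing values
import pandas as pd

data = pd.read_csv("C:\\Users\\Administrator\\Desktop\\catering_sale.csv")
# isnull() flags every missing cell; sum() counts the flags in each column.
data2 = data.isnull().sum()
print(data2)
Output (日期 is the date column, 销量 is the sales column):
日期 0
销量 1
dtype: int64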
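The count above is per variable (column). Counting the samples (rows) that contain at least one missing value takes one extra step. The sketch below uses a made-up four-row DataFrame; the names `date` and `sales` and all values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with two missing sales values.
toy = pd.DataFrame({
    "date": ["2015/3/1", "2015/2/28", "2015/2/27", "2015/2/26"],
    "sales": [51.0, np.nan, 3442.1, np.nan],
})

# Missing cells in each column (per variable).
missing_per_column = toy.isnull().sum()

# Rows that contain at least one missing cell (per sample).
rows_with_missing = int(toy.isnull().any(axis=1).sum())

# Rows with no missing cell at all.
complete_rows = len(toy) - rows_with_missing

print(missing_per_column)
print(rows_with_missing, complete_rows)
```

When only one column has gaps, as in the output above, the per-column count and the per-row count coincide; they diverge once several columns have missing cells in different rows.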
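4. Identifying outliers with a box plot
import matplotlib.pyplot as plt

plt.figure()
plt.rcParams['font.sans-serif'] = ['SimHei']   # font that can render the Chinese column labels
plt.rcParams['axes.unicode_minus'] = False     # keep minus signs readable with that font

p = data.boxplot(return_type='dict')           # draw the box plot; data is the DataFrame loaded earlier
x = p['fliers'][0].get_xdata()                 # x positions of the outlier markers
y = p['fliers'][0].get_ydata()                 # outlier values
y.sort()                                       # sort in place so neighbouring outliers can be compared

# Label every outlier. The closer an outlier is to the previous one,
# the further left its label is shifted, which reduces overlapping text.
for i in range(len(x)):
    if i > 0:
        plt.annotate(y[i], xy=(x[i], y[i]),
                     xytext=(x[i] + 0.05 - 0.8 / (y[i] - y[i - 1]), y[i]))
    else:
        plt.annotate(y[i], xy=(x[i], y[i]), xytext=(x[i] + 0.08, y[i]))

plt.show()
Output: [Figure: box plot of the 销量 column with the outlier values annotated]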
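The points a box plot draws as outliers can also be listed numerically. With matplotlib's default whisker setting, a point is an outlier if it lies more than 1.5 times the interquartile range below the first quartile or above the third quartile. The sketch below applies that rule to a made-up series: the two extreme values match the minimum and maximum reported earlier, and the remaining values are hypothetical.

```python
import pandas as pd

# Hypothetical sample; only 22.0 and 9106.44 come from the results shown earlier.
sales = pd.Series([22.0, 2400.0, 2500.0, 2600.0, 2700.0, 2800.0, 2900.0, 9106.44])

q1 = sales.quantile(0.25)        # first quartile
q3 = sales.quantile(0.75)        # third quartile
iqr = q3 - q1                    # interquartile range

lower = q1 - 1.5 * iqr           # lower fence
upper = q3 + 1.5 * iqr           # upper fence

# Values outside the fences are the points a default box plot marks as fliers.
outliers = sales[(sales < lower) | (sales > upper)]
print(outliers.tolist())
```

Listing the values this way is convenient when the outliers need to be filtered or inspected afterwards rather than only viewed in a figure.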
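5. Complete code
import pandas as pd
import matplotlib.pyplot as plt  # plotting library

data = pd.read_csv("C:\\Users\\Administrator\\Desktop\\catering_sale.csv")

# Basic statistics, computed on the data itself rather than on describe().
print(data.max(numeric_only=True))
print(data.min(numeric_only=True))
print(data.mean(numeric_only=True))

# Missing values in each column.
data2 = data.isnull().sum()
print(data2)

# Box plot with annotated outliers.
plt.figure()
plt.rcParams['font.sans-serif'] = ['SimHei']   # font that can render the Chinese column labels
plt.rcParams['axes.unicode_minus'] = False     # keep minus signs readable with that font
p = data.boxplot(return_type='dict')           # draw the box plot
x = p['fliers'][0].get_xdata()                 # x positions of the outlier markers
y = p['fliers'][0].get_ydata()                 # outlier values
y.sort()
for i in range(len(x)):
    if i > 0:
        plt.annotate(y[i], xy=(x[i], y[i]),
                     xytext=(x[i] + 0.05 - 0.8 / (y[i] - y[i - 1]), y[i]))
    else:
        plt.annotate(y[i], xy=(x[i], y[i]), xytext=(x[i] + 0.08, y[i]))
plt.show()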