Learning Matplotlib: drawing box plots (boxplot) with matplotlib

Reposted article, originally published 2018-08-20 21:01. Tags: Matplotlib, Seaborn, boxplot, box plot

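A box plot uses the quartiles of a data set to show how the data is distributed: where its center lies, how spread out the values are, and whether there are any outliers.

Sort the data in ascending order and split it into four equal parts. The first quartile (Q1), second quartile (Q2) and third quartile (Q3) are the values at the 25%, 50% and 75% positions of the sorted data.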

I-------------I o I-------------I o I-------------I o I-------------I
                Q1                Q2                Q3
        (lower quartile)      (median)      (upper quartile)

Interquartile range (IQR) = upper quartile (Q3) - lower quartile (Q1)
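As a quick check of the definitions above, here is a minimal sketch that computes the three quartiles and the IQR with numpy. The array `values` is a made-up sample used only for illustration, and `np.percentile` interpolates linearly by default, so other quartile conventions can give slightly different numbers.

```python
import numpy as np

# Hypothetical sample, used only for illustration
values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9])

# 25th, 50th and 75th percentiles (linear interpolation by default)
q1, q2, q3 = np.percentile(values, [25, 50, 75])

# Interquartile range: upper quartile minus lower quartile
iqr = q3 - q1

print(q1, q2, q3, iqr)  # 3.0 5.0 7.0 4.0
```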

A box plot has two parts: the box and the whiskers. The box covers the data from the first quartile to the third quartile, and the whiskers show the range of the data.

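From top to bottom, the horizontal lines of a box plot mark: the upper limit of the data (usually the largest value not above Q3+1.5*IQR), the third quartile (Q3), the second quartile (the median), the first quartile (Q1), and the lower limit of the data (usually the smallest value not below Q1-1.5*IQR). Sometimes there are also individual circles beyond the upper and lower limits; these represent outliers.

(Note: if no value lies beyond Q3+1.5*IQR or Q1-1.5*IQR, the whiskers simply end at the maximum and minimum of the data.)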
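The whisker and outlier rule can be sketched by hand and compared with what matplotlib computes. The sample `values` below is hypothetical (30 is placed there on purpose as an outlier), and the comparison assumes matplotlib's default whisker factor of 1.5, using the helper `matplotlib.cbook.boxplot_stats`.

```python
import numpy as np
from matplotlib import cbook

# Hypothetical sample; 30 is a deliberate outlier
values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 30])

q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1

# Limits beyond which a point counts as an outlier
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

# Whiskers end at the most extreme data points inside the limits
inside = values[(values >= lower_limit) & (values <= upper_limit)]
whisker_low, whisker_high = inside.min(), inside.max()

# Everything outside the limits is drawn as an individual point
outliers = values[(values < lower_limit) | (values > upper_limit)]

# The same quantities as computed by matplotlib (default whis=1.5)
stats = cbook.boxplot_stats(values)[0]
print(whisker_low, whisker_high, outliers)
print(stats["whislo"], stats["whishi"], stats["fliers"])
```

Both print statements should report the same whisker ends and the same outlier, which shows that the whiskers stop at actual data points rather than at the computed limits themselves.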

[Figure: the meaning of each part of a box plot, from https://datavizcatalogue.com/methods/box_plot.html]

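Below, we practice plotting with data from Jake VanderPlas's book "Python Data Science Handbook".

Data URL: https://raw.githubusercontent.com/jakevdp/data-CDCbirths/master/births.csv

This data file was already used in "Learning Matplotlib: drawing line charts (line chart) with matplotlib", so the cleaned data is used directly here: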

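import pandas as pd

# Load the raw births data
birth = pd.read_csv("https://raw.githubusercontent.com/jakevdp/data-CDCbirths/master/births.csv")

# Keep the first 15067 rows, as in the line chart article; .copy() avoids
# pandas' SettingWithCopyWarning on the column assignments below
birth = birth.iloc[:15067].copy()
birth["day"] = birth["day"].astype(int)

# Build a date column; invalid dates become NaT and are dropped
birth["date"] = pd.to_datetime({"year": birth["year"], "month": birth["month"], "day": birth["day"]}, errors="coerce")
birth = birth[birth["date"].notnull()]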

These are the first 5 rows of the cleaned data:

       year  month  day gender  births       date
0      1969      1    1      F    4046 1969-01-01
1      1969      1    1      M    4440 1969-01-01
2      1969      1    2      F    4454 1969-01-02
3      1969      1    2      M    4548 1969-01-02
4      1969      1    3      F    4548 1969-01-03

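The data gives the number of male and female births in the United States for each day from 1969 to 1988.

Let's draw box plots to compare the distribution of daily male and female births in 1986, 1987 and 1988.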

Box plot: ax.boxplot(x)
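Before moving to the real data, a minimal sketch of the call may help. The two samples here are randomly generated placeholders, not the births data. Passing a list of arrays draws one box per array, and the call returns a dictionary of the line artists it created.

```python
import numpy as np
from matplotlib import pyplot as plt

# Placeholder data: two random samples with different center and spread
rng = np.random.default_rng(0)
sample_a = rng.normal(loc=0.0, scale=1.0, size=200)
sample_b = rng.normal(loc=1.0, scale=2.0, size=200)

fig, ax = plt.subplots()

# One box per array in the list; boxes are placed at x = 1, 2 by default
result = ax.boxplot([sample_a, sample_b])
ax.set_xticklabels(["sample_a", "sample_b"])

# The returned dict groups the drawn lines by role
print(sorted(result.keys()))
```

Add plt.show() at the end to display the figure. The keys of `result` (such as "boxes", "medians", "whiskers" and "fliers") give access to the individual lines if they need restyling afterwards.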

The complete code is as follows:

import numpy as np
import pandas as pd
from matplotlib import pyplot as plt

birth = pd.read_csv("https://raw.githubusercontent.com/jakevdp/data-CDCbirths/master/births.csv")

# Same cleaning as before; .copy() avoids pandas' SettingWithCopyWarning
birth = birth.iloc[:15067].copy()
birth["day"] = birth["day"].astype(int)

birth["date"] = pd.to_datetime({"year": birth["year"], "month": birth["month"], "day": birth["day"]}, errors="coerce")
birth = birth[birth["date"].notnull()]

# Extract the male and female births for 1986-1988 and convert them to numpy arrays
birth1986_female = np.array(birth.births[(birth["year"] == 1986) & (birth["gender"] == "F")])
birth1986_male = np.array(birth.births[(birth["year"] == 1986) & (birth["gender"] == "M")])
birth1987_female = np.array(birth.births[(birth["year"] == 1987) & (birth["gender"] == "F")])
birth1987_male = np.array(birth.births[(birth["year"] == 1987) & (birth["gender"] == "M")])
birth1988_female = np.array(birth.births[(birth["year"] == 1988) & (birth["gender"] == "F")])
birth1988_male = np.array(birth.births[(birth["year"] == 1988) & (birth["gender"] == "M")])

# Several box plots are drawn at once, so the arrays go into one list
data = [birth1986_female, birth1986_male, birth1987_female, birth1987_male, birth1988_female, birth1988_male]

fig, ax = plt.subplots()
ax.boxplot(data, positions=[0, 0.6, 1.5, 2.1, 3, 3.6])  # positions sets where each box is drawn
ax.set_xticklabels(["1986\nfemale", "1986\nmale", "1987\nfemale", "1987\nmale", "1988\nfemale", "1988\nmale"])  # x-axis tick labels

plt.show()

[Figure: box plots of daily female and male births for 1986, 1987 and 1988]

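The plot shows that in all three years the median number of daily male births is higher than the female median. The boxes are all about the same height, so the spread of the data is similar across groups. The boxes are not symmetric about the median line: the median sits above the center of the box, that is, closer to Q3 than to Q1, which indicates a left-skewed distribution. Finally, there are no outliers in the data.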
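The reading of skewness from the position of the median inside the box can also be expressed as a number. The sketch below uses the quartile (Bowley) skewness coefficient, which is not part of the original article, and a made-up sample rather than the births data. A negative value means the median is closer to Q3, which matches the left-skewed case described above.

```python
import numpy as np

def quartile_skewness(values):
    """Quartile (Bowley) skewness: (Q3 + Q1 - 2*Q2) / (Q3 - Q1)."""
    q1, q2, q3 = np.percentile(values, [25, 50, 75])
    return (q3 + q1 - 2 * q2) / (q3 - q1)

# Hypothetical samples, used only for illustration
left_skewed = np.array([1, 5, 7, 8, 9, 9, 10, 10, 10])  # long tail toward small values
symmetric = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9])

print(quartile_skewness(left_skewed))  # negative: median closer to Q3
print(quartile_skewness(symmetric))    # zero: median centered in the box
```

Applied to each of the six arrays in the `data` list from the complete code, the same function would give a rough numerical counterpart to what the boxes show visually.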

Box plots can also be drawn horizontally: just add the argument vert=False to the boxplot call. [Figure: the same box plots drawn horizontally]
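A minimal sketch of the horizontal variant follows, using randomly generated placeholder samples instead of the births data. Note that the group labels move to the y-axis. As far as I know, newer Matplotlib releases (3.10 and later) also accept orientation="horizontal" and may emit a deprecation warning for vert, so treat the exact parameter choice as version dependent.

```python
import numpy as np
from matplotlib import pyplot as plt

# Placeholder data: two random samples
rng = np.random.default_rng(0)
samples = [rng.normal(size=100), rng.normal(loc=1.0, size=100)]

fig, ax = plt.subplots()

# vert=False lays the boxes out horizontally
result = ax.boxplot(samples, vert=False)

# With horizontal boxes the group labels belong on the y-axis
ax.set_yticklabels(["group 1", "group 2"])

# In a horizontal box plot each median is a vertical line segment
median_line = result["medians"][0]
print(median_line.get_xdata(), median_line.get_ydata())
```

Add plt.show() at the end to display the figure.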
