Drawing a Q-Q (quantile-quantile) plot in Python

Reposted article. 2016-08-11 22:47. Tags: python, image processing

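A Q-Q plot is mainly used to answer these questions:

1. Do the two datasets come from the same distribution?
   PS: a KS test also works here. scipy.stats.ks_2samp in Python returns the KS statistic (the maximum difference between the two empirical distributions) and a p-value, which you can use to decide.
2. Do the two datasets have the same scale and range?
3. Do the two datasets have a similar distribution shape?

The first two questions can be judged by how far the sample points on the Q-Q plot lie from the reference line; the last one is judged by the slope of the line fitted through the points.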
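The KS test mentioned in the PS can be sketched as follows. This is only an illustration: the three arrays are synthetic data invented for the example, not the data used in this article.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)

# Hypothetical data: the samples do not need to have the same size.
a = rng.normal(loc=0.0, scale=1.0, size=500)
b = rng.normal(loc=0.0, scale=1.0, size=300)   # drawn from the same distribution as a
c = rng.exponential(scale=1.0, size=300)       # drawn from a different distribution

same = ks_2samp(a, b)
diff = ks_2samp(a, c)

# A large statistic with a small p-value suggests different distributions.
print("a vs b: statistic=%.4f, p-value=%.4g" % (same.statistic, same.pvalue))
print("a vs c: statistic=%.4f, p-value=%.4g" % (diff.statistic, diff.pvalue))
```

The KS test only gives a yes/no style answer to the first question; it says nothing about scale or shape, which is where the Q-Q plot helps.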

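So what are the advantages of analyzing distributions with a Q-Q plot? ~~(Whoever gets it right can have it.)~~

- The two datasets may differ in size.
- It can answer the last two questions above, which give deeper, distribution-level information.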

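So how do you draw a Q-Q plot?
Take one dataset as the reference and the other as the sample. For each value in the sample, its percentile within the reference dataset is the x-coordinate on the Q-Q plot, and its percentile within the sample dataset itself is the y-coordinate. A 45-degree reference line is usually drawn on the plot as well. If the two datasets come from the same distribution, the sample points should all fall near the reference line; conversely, the farther the points lie from it, the more likely it is that the two datasets come from different distributions.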
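A minimal sketch of how the two coordinates are computed, using two tiny made-up arrays of different sizes (the values are hypothetical and chosen only so the result is easy to check by hand):

```python
import numpy as np
from scipy.stats import percentileofscore

# Hypothetical data: 8 reference values and 4 sample values.
ref = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
samp = np.array([2.5, 4.5, 6.5, 9.0])

# x: where each sample value ranks inside the reference data (0-100).
x_pct = np.array([percentileofscore(ref, v) for v in samp])
# y: where each sample value ranks inside the sample itself (0-100).
y_pct = np.array([percentileofscore(samp, v) for v in samp])

for v, x, y in zip(samp, x_pct, y_pct):
    print("value=%.1f  x=%.1f  y=%.1f" % (v, x, y))
```

Here the sample is spread over the reference range in the same way as the reference itself, so each point has equal x and y and sits on the 45-degree line.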

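In Python, scipy.stats.percentileofscore makes it easy to compute the percentiles described above, and numpy.polyfit together with the sklearn.linear_model.LinearRegression class can be used to fit a regression line through the sample points.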
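As a small illustration of the fitting step, the sketch below fits a straight line with numpy.polyfit and computes the coefficient of determination by hand. The percentile pairs are hypothetical. LinearRegression.score returns this same R-squared quantity for a fitted linear model.

```python
import numpy as np

# Hypothetical percentile pairs (x: percentile in reference, y: percentile in sample).
x = np.array([5.0, 20.0, 40.0, 60.0, 80.0, 95.0])
y = np.array([10.0, 22.0, 45.0, 63.0, 85.0, 97.0])

# Degree-1 fit: polyfit returns the highest-order coefficient first.
slope, intercept = np.polyfit(x, y, 1)

# R-squared = 1 - (residual sum of squares) / (total sum of squares).
y_hat = slope * x + intercept
ss_res = np.sum((y - y_hat) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

print("y = %.3f x + %.3f, r-square = %.4f" % (slope, intercept, r2))
```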

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.stats import percentileofscore
from sklearn.linear_model import LinearRegression

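# df_samp and df_clu are two DataFrames holding the input data sets.
# Assumption: each has a single column, so flatten it to a 1-D array.
ref = np.asarray(df_clu).ravel()
samp = np.asarray(df_samp).ravel()
ref_id = df_clu.columns[0]
samp_id = df_samp.columns[0]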

# percentile of each sample value within the reference data (x-axis)
samp_pct_x = np.asarray([percentileofscore(ref, x) for x in samp])
# percentile of each sample value within the sample itself (y-axis)
samp_pct_y = np.asarray([percentileofscore(samp, x) for x in samp])
# estimated linear regression model
p = np.polyfit(samp_pct_x, samp_pct_y, 1)
regr = LinearRegression()
model_x = samp_pct_x.reshape(len(samp_pct_x), 1)
model_y = samp_pct_y.reshape(len(samp_pct_y), 1)
regr.fit(model_x, model_y)
r2 = regr.score(model_x, model_y)
# get fit regression line
if p[1] > 0:
    p_function = "y= %s x + %s, r-square = %s" %(str(p[0]), str(p[1]), str(r2))
elif p[1] < 0:
    p_function = "y= %s x - %s, r-square = %s" %(str(p[0]), str(-p[1]), str(r2))
else:
    p_function = "y= %s x, r-square = %s" %(str(p[0]), str(r2))
print("The fitted linear regression model in Q-Q plot using data from enterprises %s and cluster %s is %s" % (str(samp_id), str(ref_id), p_function))

# plot q-q plot
x_ticks = np.arange(0, 100, 20)
y_ticks = np.arange(0, 100, 20)
plt.scatter(x=samp_pct_x, y=samp_pct_y, color='blue')
plt.xlim((0, 100))
plt.ylim((0, 100))
# add fit regression line
plt.plot(samp_pct_x, regr.predict(model_x), color='red', linewidth=2)
# add 45-degree reference line
plt.plot([0, 100], [0, 100], linewidth=2)
plt.text(10, 70, p_function)
plt.xticks(x_ticks, x_ticks)
plt.yticks(y_ticks, y_ticks)
plt.xlabel('cluster quantiles - id: %s' %str(ref_id))
plt.ylabel('sample quantiles - id: %s' %str(samp_id))
plt.title('%s VS %s Q-Q plot' %(str(ref_id), str(samp_id)))
plt.show()

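In the resulting plot for this example, the sample points are sparse in the lower left, dense in the upper right, and shifted upward overall. This indicates that the sample's distribution differs from that of the reference data (the distribution shapes differ), which the KS test confirms with ks-statistic: 0.171464; p_value: 0.000000. However, the slope is about 1 and the upward shift is small, which suggests that the two datasets are close in scale.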

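PS: the method here is meant for cases where the distribution of the data is unknown. If you want to check whether data follow a known distribution, such as the normal distribution, use scipy.stats.probplot instead.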
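For reference, a minimal sketch of the probplot route, using synthetic normal data invented for this example:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Hypothetical sample: 200 draws from a normal distribution (mean 10, std 2).
data = rng.normal(loc=10.0, scale=2.0, size=200)

# osm: theoretical quantiles, osr: ordered sample values.
# slope, intercept, r describe the least-squares line through (osm, osr).
(osm, osr), (slope, intercept, r) = stats.probplot(data, dist="norm")

print("slope=%.3f intercept=%.3f r=%.4f" % (slope, intercept, r))
```

For normally distributed data the points lie close to a straight line, so r is close to 1, and the slope and intercept roughly estimate the standard deviation and the mean. To draw the figure directly, pass plot=plt (with matplotlib.pyplot imported as plt) to probplot and then call plt.show().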
