Quantile Regression and Its Python Source Code

The sky is clear and the breeze is gentle. Idling at home is the perfect time to read. The classic papers resist understanding, and the open-source code has no comments. If nobody else will explain it, I will. Enough chatter, straight to the point. o( ̄︶ ̄)o
To explore the relationship between an independent variable $x$ and a dependent variable $y$, the simplest method is linear regression, which assumes:

$$y_i = \beta_0 + \beta_1 x_i + e_i, \quad i = 1, \dots, n$$
We estimate $\beta_0$ and $\beta_1$ by ordinary least squares (OLS) [1]. To guarantee the reliability of the estimates, we also make an assumption about the residuals $e_i$: they are independent random variables with mean 0 and constant variance. The fitted value is then the conditional mean of the dependent variable $y$ given the independent variable $x$.

If the residuals $e_i$ do not satisfy this assumption, or, more importantly, if we want to know not only the conditional mean of $y$ given $x$ but also the conditional median (more generally, a conditional quantile), then linear regression under OLS can no longer meet our needs. Quantile regression [2] solves these problems. Below I first give a practical example of quantile regression, then outline its theory, and finally analyze the source code of its Python implementation.
1. An Example: Income and Food Expenditure
This example comes from statsmodels: Quantile Regression [3]. We want to explore the relationship between household income and food expenditure; the data come from working-class Belgian households in 1857 (the Engel data). We use the Python package statsmodels to run the quantile regression.
1.1 Preprocessing
```python
%matplotlib inline

import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
import matplotlib.pyplot as plt

# Load the Engel (1857) food-expenditure data shipped with statsmodels
data = sm.datasets.engel.load_pandas().data
data.head()
```
```
       income     foodexp
0  420.157651  255.839425
1  541.411707  310.958667
2  901.157457  485.680014
3  639.080229  402.997356
4  750.875606  495.560775
```
1.2 Median Regression (a special case of quantile regression, q = 0.5)
```python
mod = smf.quantreg('foodexp ~ income', data)
res = mod.fit(q=.5)
print(res.summary())
```
```
                         QuantReg Regression Results
==============================================================================
Dep. Variable:                foodexp   Pseudo R-squared:               0.6206
Model:                       QuantReg   Bandwidth:                       64.51
Method:                 Least Squares   Sparsity:                        209.3
Date:                Mon, 21 Oct 2019   No. Observations:                  235
Time:                        17:46:59   Df Residuals:                      233
                                        Df Model:                            1
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     81.4823     14.634      5.568      0.000      52.649     110.315
income         0.5602      0.013     42.516      0.000       0.534       0.586
==============================================================================

The condition number is large, 2.38e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
```
From the output we can read off the fitted median regression: $\text{foodexp} = 81.4823 + 0.5602\,\text{income}$. But how are these coefficient estimates obtained? What do std err, t, and Pseudo R-squared in the output mean? I will explain shortly.
1.3 Visualizing the Data
We first fit 10 quantile regressions, with the quantile q ranging from 0.05 to 0.95.
```python
quantiles = np.arange(.05, .96, .1)

def fit_model(q):
    res = mod.fit(q=q)
    return [q, res.params['Intercept'], res.params['income']] + \
        res.conf_int().loc['income'].tolist()

models = [fit_model(x) for x in quantiles]
models = pd.DataFrame(models, columns=['q', 'a', 'b', 'lb', 'ub'])

ols = smf.ols('foodexp ~ income', data).fit()
ols_ci = ols.conf_int().loc['income'].tolist()
ols = dict(a=ols.params['Intercept'],
           b=ols.params['income'],
           lb=ols_ci[0],
           ub=ols_ci[1])

print(models)
print(ols)
```
```
      q           a         b        lb        ub
0  0.05  124.880096  0.343361  0.268632  0.418090
1  0.15  111.693660  0.423708  0.382780  0.464636
2  0.25   95.483539  0.474103  0.439900  0.508306
3  0.35  105.841294  0.488901  0.457759  0.520043
4  0.45   81.083647  0.552428  0.525021  0.579835
5  0.55   89.661370  0.565601  0.540955  0.590247
6  0.65   74.033435  0.604576  0.582169  0.626982
7  0.75   62.396584  0.644014  0.622411  0.665617
8  0.85   52.272216  0.677603  0.657383  0.697823
9  0.95   64.103964  0.709069  0.687831  0.730306
{'a': 147.47538852370562, 'b': 0.48517842367692354, 'lb': 0.4568738130184233,
```
Here we have fitted 10 regressions: q is the corresponding quantile, a is the intercept, b is the slope coefficient on income, and lb and ub are the lower and upper bounds of the 95% confidence interval for b.
Now let's plot these 10 regression lines:
```python
x = np.arange(data.income.min(), data.income.max(), 50)
get_y = lambda a, b: a + b * x

fig, ax = plt.subplots(figsize=(8, 6))

for i in range(models.shape[0]):
    y = get_y(models.a[i], models.b[i])
    ax.plot(x, y, linestyle='dotted', color='grey')

y = get_y(ols['a'], ols['b'])

ax.plot(x, y, color='red', label='OLS')
ax.scatter(data.income, data.foodexp, alpha=.2)
ax.set_xlim((240, 3000))
ax.set_ylim((240, 2000))
legend = ax.legend()
ax.set_xlabel('Income', fontsize=16)
ax.set_ylabel('Food expenditure', fontsize=16);
```
[Figure: scatter of income vs. food expenditure, with the ten quantile regression lines (dotted grey) and the OLS line (red)]
In the figure above, the dotted lines are the quantile regression lines and the red line is the linear least-squares (OLS) regression line. Three patterns stand out:
- As income rises, food expenditure rises.
- As income rises, the variation in food expenditure across households widens. The poor have little choice, while the rich can choose their lifestyle: some like to eat expensively, some cheaply. OLS cannot reveal this pattern, because it gives us only a single mean.
- For poor households, the OLS predictions are too high, because a few rich households pull the overall mean up. OLS is sensitive to outliers and is not a robust model.
2. The Theory of Quantile Regression
This part is mathematical statistics; readers who only care about the implementation can skip it. We need to answer the following questions:
- What is a quantile?
- How is a quantile computed?
- What is quantile regression?
- How are the quantile regression coefficients estimated?
- How are the coefficients tested?
- How is goodness of fit assessed?
2.1 Definition of a Quantile

Let $Y$ be a random variable with cumulative distribution function $F_Y(y)$. The $q$-th quantile of $Y$ is

$$y_q = \inf\{y : F_Y(y) \geq q\}, \quad q \in (0, 1)$$

For example, suppose there are 100 people and 95% of them are shorter than 1.9 m; then 1.9 m is the 95% quantile of height.
2.2 Computing a Quantile

$$\tilde{Y}_q = \arg\min_{y_q} \sum_{i=1}^{n} \rho_q(y_i - y_q) = \arg\min_{y_q}\Big[(q-1)\sum_{y_i < y_q}(y_i - y_q) + q\sum_{y_i \geq y_q}(y_i - y_q)\Big]$$

Here $\rho_q(u) = u\,(q - \mathbf{1}\{u < 0\})$ is the check (pinball) loss. By trying different values of $y_q$ and minimizing this loss, the minimizing value is the estimate $\tilde{Y}_q$ of the $q$-th quantile of $Y$.
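As a sanity check on this definition, here is a small sketch of my own (not from the statsmodels source) that estimates a sample quantile by brute-force minimization of the check loss and compares the result with `np.percentile`:

```python
import numpy as np

def check_loss(y, c, q):
    """Sum of rho_q(y_i - c): (q-1)-weighted below c, q-weighted at or above c."""
    u = y - c
    return np.sum(np.where(u < 0, (q - 1) * u, q * u))

rng = np.random.default_rng(0)
y = rng.normal(size=1000)

# Brute-force search over a grid of candidate values c
grid = np.linspace(y.min(), y.max(), 10001)
best = grid[np.argmin([check_loss(y, c, 0.95) for c in grid])]

print(best)                  # minimizer of the check loss
print(np.percentile(y, 95))  # agrees up to the grid resolution
```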
2.3 Quantile Regression

For OLS, we have

$$\tilde{Y}|X = X\tilde{\beta}$$

where $\tilde{\beta}$ is the $\beta$ that minimizes $\sum_{i=1}^{n}(y_i - x_i'\beta)^2$. Analogously, for $q$-quantile regression we have

$$\tilde{Y}_q|X = X\tilde{\beta}_q$$

where $\tilde{\beta}_q$ is the $\beta$ that minimizes

$$\sum_{i=1}^{n}\rho_q(y_i - x_i'\beta) = (q-1)\sum_{y_i < x_i'\beta}(y_i - x_i'\beta) + q\sum_{y_i \geq x_i'\beta}(y_i - x_i'\beta)$$
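This objective is simply the check loss of section 2.2 applied to the regression residuals. As a hypothetical helper (the function name is my own), it could be written as:

```python
import numpy as np

def quantreg_objective(beta, X, y, q):
    """Sum of rho_q(y_i - x_i' beta), the q-quantile regression loss."""
    u = y - X @ beta
    return np.sum(np.where(u < 0, (q - 1) * u, q * u))
```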
2.4 Estimating the Coefficients

Since $\rho_q$ is not differentiable everywhere in $\beta$, we cannot find the minimizer by setting a derivative to zero; we can only approach it iteratively. statsmodels uses Iteratively Reweighted Least Squares (IRLS) [6].

Suppose we seek the $\beta$ that minimizes an $L_p$ norm of the residuals:

$$\arg\min_{\beta}\|Y - X\beta\|_p = \arg\min_{\beta}\sum_{i=1}^{n}|y_i - x_i'\beta|^p$$

Then the value of $\beta$ at step $t+1$ is

$$\beta^{(t+1)} = \arg\min_{\beta}\sum_{i=1}^{n} w_i^{(t)}\,|y_i - x_i'\beta|^2 = (X^T W^{(t)} X)^{-1} X^T W^{(t)} y$$

where $W^{(t)}$ is a diagonal matrix of weights, initialized as $w_i^{(0)} = 1$ and updated at step $t$ as

$$w_i^{(t)} = |y_i - x_i'\beta|^{p-2}$$

Take median regression ($q = 0.5$) as an example. We seek

$$\arg\min_{\beta}\Big[(q-1)\sum_{y_i < x_i'\beta}(y_i - x_i'\beta) + q\sum_{y_i \geq x_i'\beta}(y_i - x_i'\beta)\Big]$$

that is

$$\arg\min_{\beta}\Big[-0.5\sum_{y_i < x_i'\beta}(y_i - x_i'\beta) + 0.5\sum_{y_i \geq x_i'\beta}(y_i - x_i'\beta)\Big] = \arg\min_{\beta}\sum_{i=1}^{n}|y_i - x_i'\beta|$$

which is exactly the $L_1$ norm above, so $p = 1$ and

$$w_i^{(t)} = |y_i - x_i'\beta|^{-1}$$

To avoid a zero denominator, we take $w_i^{(t)} = 1 / \max(\delta, |y_i - x_i'\beta|)$, where $\delta$ is a very small number, e.g. 0.0001.
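Below is a minimal IRLS sketch for the median case, under the assumptions above (weights initialized to 1, $L_1$ weights floored at $\delta$). It illustrates the update rule only; it is not the statsmodels implementation, which also handles general q by reweighting the residuals asymmetrically (see section 3.1):

```python
import numpy as np

def irls_median(X, y, max_iter=1000, p_tol=1e-6, delta=1e-4):
    """IRLS for median regression: argmin_beta sum_i |y_i - x_i' beta|."""
    w = np.ones(len(y))                  # w_i^(0) = 1
    beta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        beta0 = beta
        # Weighted least squares step: (X'WX)^{-1} X'Wy
        beta = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * y))
        resid = y - X @ beta
        # p = 1 weights, floored at delta to avoid division by zero
        w = 1.0 / np.maximum(delta, np.abs(resid))
        if np.max(np.abs(beta - beta0)) < p_tol:
            break
    return beta

# e.g. irls_median(sm.add_constant(data['income']).values,
#                  data['foodexp'].values)
# should roughly reproduce the section 1.2 coefficients (81.48, 0.5602)
```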
2.5 Testing the Coefficients

The iterations of section 2.4 yield the estimate $\tilde{\beta}_q$. To build the t statistic for hypothesis testing, all we still need is an estimate of the variance of $\tilde{\beta}_q$. The asymptotic estimate of the covariance matrix of the quantile regression coefficients is

$$\mathrm{Est.\,Asy.\,Var}[\tilde{\beta}_q] = (X'X)^{-1} X'DX\,(X'X)^{-1}$$

where $D$ is a diagonal matrix with

$$d_i = \left[\frac{q}{f(0)}\right]^2 \ \text{when } e_i > 0, \qquad d_i = \left[\frac{1-q}{f(0)}\right]^2 \ \text{when } e_i \leq 0$$

Here $f(0)$ is the density of the residuals at zero, estimated by

$$\tilde{f}(0) = \frac{1}{n}\sum_{i=1}^{n}\frac{1}{h}K\!\left[\frac{e_i}{h}\right]$$

where $e_i = y_i - x_i'\beta$, $K$ is a kernel function (Epanechnikov, Gaussian, etc.), and $h$ is the bandwidth, chosen as in Stata 12 [5].

The std err in the regression output is the square root of the diagonal of this covariance matrix, and the t statistic equals the coefficient estimate divided by its std err.
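To make these formulas concrete, here is a sketch of the robust covariance computation; the function name is my own, and it assumes the residuals e and the bandwidth h have already been computed (the actual implementation is in section 3.1):

```python
import numpy as np

def robust_cov(X, e, q, h):
    """Est.Asy.Var[beta_q] = (X'X)^{-1} X'DX (X'X)^{-1} via an Epanechnikov KDE."""
    n = len(e)
    kernel = lambda u: 3. / 4 * (1 - u**2) * (np.abs(u) <= 1)   # Epanechnikov
    fhat0 = np.sum(kernel(e / h)) / (n * h)                     # estimate of f(0)
    d = np.where(e > 0, (q / fhat0)**2, ((1 - q) / fhat0)**2)   # diagonal of D
    xtxi = np.linalg.pinv(X.T @ X)
    cov = xtxi @ (X.T * d) @ X @ xtxi
    return cov, np.sqrt(np.diag(cov))                           # covariance, std errs
```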
2.6 Goodness of Fit

For OLS, we use $R^2$ to measure goodness of fit. For $q$-quantile regression, the analogue is

$$\tilde{R}(q) = 1 - \frac{\sum_{i=1}^{n}\rho_q(y_i - x_i'\tilde{\beta}_q)}{\sum_{i=1}^{n}\rho_q(y_i - \tilde{y}_q)}$$

where $\tilde{y}_q$ is the $q$-th quantile of all the observed $y_i$. This is the Pseudo R-squared in the regression output.
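As a quick check, and assuming `res` still holds the q = 0.5 fit from section 1.2, this value can be read straight off the results object:

```python
print(res.prsquared)  # ~0.6206, matching the summary in section 1.2
```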
3. Python Source Code Analysis

The complete source implementing quantile regression lives in statsmodels [5]. It mainly contains two classes, QuantReg and QuantRegResults. QuantReg is the core: it estimates the regression coefficients and computes their covariance matrix. QuantRegResults computes the goodness of fit and assembles the regression results.
3.1 The QuantReg Class
```python
# Imports as in the statsmodels module (abridged here for readability)
import warnings

import numpy as np
from numpy.linalg import pinv
import scipy.stats as stats
from scipy.stats import norm

from statsmodels.tools.tools import chain_dot
from statsmodels.tools.decorators import cache_readonly
from statsmodels.tools.sm_exceptions import (ConvergenceWarning,
                                             IterationLimitWarning)
from statsmodels.regression.linear_model import (RegressionModel,
                                                 RegressionResults,
                                                 RegressionResultsWrapper)


# QuantReg is a subclass of the package's RegressionModel
class QuantReg(RegressionModel):

    # Estimates the regression coefficients and their covariance matrix.
    # q is the quantile; vcov selects the covariance estimator, defaulting to
    # 'robust', i.e. the method of section 2.5. The kernel defaults to 'epa'
    # (Epanechnikov) and the bandwidth rule to 'hsheather'. IRLS runs for at
    # most 1000 iterations by default and stops once the change in beta is
    # below 1e-6.
    def fit(self, q=.5, vcov='robust', kernel='epa', bandwidth='hsheather',
            max_iter=1000, p_tol=1e-6, **kwargs):
        """
        Solve by Iterative Weighted Least Squares

        Parameters
        ----------
        q : float
            Quantile must be between 0 and 1
        vcov : str, method used to calculate the variance-covariance matrix
            of the parameters. Default is ``robust``:

            - robust : heteroskedasticity robust standard errors (as suggested
              in Greene 6th edition)
            - iid : iid errors (as in Stata 12)
        kernel : str, kernel to use in the kernel density estimation for the
            asymptotic covariance matrix:

            - epa: Epanechnikov
            - cos: Cosine
            - gau: Gaussian
            - par: Parzene
        bandwidth : str, Bandwidth selection method in kernel density
            estimation for asymptotic covariance estimate (full
            references in QuantReg docstring):

            - hsheather: Hall-Sheather (1988)
            - bofinger: Bofinger (1975)
            - chamberlain: Chamberlain (1994)
        """
        if q < 0 or q > 1:
            raise Exception('p must be between 0 and 1')

        kern_names = ['biw', 'cos', 'epa', 'gau', 'par']
        if kernel not in kern_names:
            raise Exception("kernel must be one of " + ', '.join(kern_names))
        else:
            kernel = kernels[kernel]

        if bandwidth == 'hsheather':
            bandwidth = hall_sheather
        elif bandwidth == 'bofinger':
            bandwidth = bofinger
        elif bandwidth == 'chamberlain':
            bandwidth = chamberlain
        else:
            raise Exception("bandwidth must be in 'hsheather', 'bofinger', "
                            "'chamberlain'")

        # endog is the observed dependent variable, exog the independent ones
        endog = self.endog
        exog = self.exog
        nobs = self.nobs
        exog_rank = np.linalg.matrix_rank(self.exog)
        self.rank = exog_rank
        self.df_model = float(self.rank - self.k_constant)
        self.df_resid = self.nobs - self.rank

        # IRLS initialization
        n_iter = 0
        xstar = exog
        beta = np.ones(exog_rank)
        # TODO: better start, initial beta is used only for convergence check

        # Note the following does not work yet,
        # the iteration loop always starts with OLS as initial beta
        # if start_params is not None:
        #     if len(start_params) != rank:
        #         raise ValueError('start_params has wrong length')
        #     beta = start_params
        # else:
        #     # start with OLS
        #     beta = np.dot(np.linalg.pinv(exog), endog)

        diff = 10
        cycle = False
        history = dict(params=[], mse=[])

        # IRLS iterations
        while n_iter < max_iter and diff > p_tol and not cycle:
            n_iter += 1
            beta0 = beta
            xtx = np.dot(xstar.T, exog)
            xty = np.dot(xstar.T, endog)
            beta = np.dot(pinv(xtx), xty)
            resid = endog - np.dot(exog, beta)

            mask = np.abs(resid) < .000001
            resid[mask] = ((resid[mask] >= 0) * 2 - 1) * .000001
            resid = np.where(resid < 0, q * resid, (1-q) * resid)
            resid = np.abs(resid)
            # dividing exog by resid (1/resid[:, np.newaxis]) updates the weights W
            xstar = exog / resid[:, np.newaxis]
            diff = np.max(np.abs(beta - beta0))
            history['params'].append(beta)
            history['mse'].append(np.mean(resid*resid))

            # check for a convergence cycle; if one is found, stop iterating early
            if (n_iter >= 300) and (n_iter % 100 == 0):
                # check for convergence circle, should not happen
                for ii in range(2, 10):
                    if np.all(beta == history['params'][-ii]):
                        cycle = True
                        warnings.warn("Convergence cycle detected",
                                      ConvergenceWarning)
                        break

        if n_iter == max_iter:
            warnings.warn("Maximum number of iterations (" + str(max_iter) +
                          ") reached.", IterationLimitWarning)

        # compute the covariance matrix
        e = endog - np.dot(exog, beta)
        # Greene (2008, p.407) writes that Stata 6 uses this bandwidth:
        # h = 0.9 * np.std(e) / (nobs**0.2)
        # Instead, we calculate bandwidth as in Stata 12
        iqre = stats.scoreatpercentile(e, 75) - stats.scoreatpercentile(e, 25)
        h = bandwidth(nobs, q)
        h = min(np.std(endog),
                iqre / 1.34) * (norm.ppf(q + h) - norm.ppf(q - h))

        fhat0 = 1. / (nobs * h) * np.sum(kernel(e / h))

        if vcov == 'robust':
            d = np.where(e > 0, (q/fhat0)**2, ((1-q)/fhat0)**2)
            xtxi = pinv(np.dot(exog.T, exog))
            xtdx = np.dot(exog.T * d[np.newaxis, :], exog)
            vcov = chain_dot(xtxi, xtdx, xtxi)
        elif vcov == 'iid':
            vcov = (1. / fhat0)**2 * q * (1 - q) * pinv(np.dot(exog.T, exog))
        else:
            raise Exception("vcov must be 'robust' or 'iid'")

        # build a QuantRegResults object from the coefficient estimates and
        # covariance matrix, and return the wrapped results
        lfit = QuantRegResults(self, beta, normalized_cov_params=vcov)

        lfit.q = q
        lfit.iterations = n_iter
        lfit.sparsity = 1. / fhat0
        lfit.bandwidth = h
        lfit.history = history

        return RegressionResultsWrapper(lfit)


# kernel function definitions
def _parzen(u):
    z = np.where(np.abs(u) <= .5, 4./3 - 8. * u**2 + 8. * np.abs(u)**3,
                 8. * (1 - np.abs(u))**3 / 3.)
    z[np.abs(u) > 1] = 0
    return z

kernels = {}
kernels['biw'] = lambda u: 15. / 16 * (1 - u**2)**2 * np.where(np.abs(u) <= 1, 1, 0)
kernels['cos'] = lambda u: np.where(np.abs(u) <= .5, 1 + np.cos(2 * np.pi * u), 0)
kernels['epa'] = lambda u: 3. / 4 * (1-u**2) * np.where(np.abs(u) <= 1, 1, 0)
kernels['gau'] = lambda u: norm.pdf(u)
kernels['par'] = _parzen
# kernels['bet'] = lambda u: np.where(np.abs(u) <= 1, .75 * (1 - u) * (1 + u), 0)
# kernels['log'] = lambda u: logistic.pdf(u) * (1 - logistic.pdf(u))
# kernels['tri'] = lambda u: np.where(np.abs(u) <= 1, 1 - np.abs(u), 0)
# kernels['trw'] = lambda u: 35. / 32 * (1 - u**2)**3 * np.where(np.abs(u) <= 1, 1, 0)
# kernels['uni'] = lambda u: 1. / 2 * np.where(np.abs(u) <= 1, 1, 0)


# bandwidth selection rules
def hall_sheather(n, q, alpha=.05):
    z = norm.ppf(q)
    num = 1.5 * norm.pdf(z)**2.
    den = 2. * z**2. + 1.
    h = n**(-1. / 3) * norm.ppf(1. - alpha / 2.)**(2./3) * (num / den)**(1./3)
    return h

def bofinger(n, q):
    num = 9. / 2 * norm.pdf(2 * norm.ppf(q))**4
    den = (2 * norm.ppf(q)**2 + 1)**2
    h = n**(-1. / 5) * (num / den)**(1. / 5)
    return h

def chamberlain(n, q, alpha=.05):
    return norm.ppf(1 - alpha / 2) * np.sqrt(q*(1 - q) / n)
```
3.2 The QuantRegResults Class

Here I only show the code that computes the goodness of fit.
```python
class QuantRegResults(RegressionResults):
    '''Results instance for the QuantReg model'''

    @cache_readonly
    def prsquared(self):
        q = self.q
        endog = self.model.endog
        # e holds the residuals, weighted by the check loss
        e = self.resid
        e = np.where(e < 0, (1 - q) * e, q * e)
        e = np.abs(e)
        # ered is the check loss of the null model: y minus its q-th quantile
        ered = endog - stats.scoreatpercentile(endog, q * 100)
        ered = np.where(ered < 0, (1 - q) * ered, q * ered)
        ered = np.abs(ered)
        return 1 - np.sum(e) / np.sum(ered)
```
4. Summary

Above, I first gave an applied example of quantile regression, then described the theory behind it, and finally analyzed the source code of its Python implementation.

Compared with OLS regression, quantile regression is more involved, but it offers three main advantages:
- It captures the relationship between the independent variables and any quantile of the dependent variable, not just its mean.
- It makes no assumptions about the residuals.
- It is less affected by outliers.
References
1. Ordinary least squares. https://en.wikipedia.org/wiki/Ordinary_least_squares
2. Quantile Regression. http://www.econ.uiuc.edu/~roger/research/rq/rq.pdf
3. Quantile Regression (statsmodels example). https://www.statsmodels.org/dev/examples/notebooks/generated/quantile_regression.html
4. Quantile regression. https://en.wikipedia.org/wiki/Quantile_regression
5. statsmodels source: quantile_regression.py. https://www.statsmodels.org/devel/_modules/statsmodels/regression/quantile_regression.html
6. Iteratively reweighted least squares. https://en.wikipedia.org/wiki/Iteratively_reweighted_least_squares
7. Greene, W. H. (2008). Econometric Analysis. Sixth Edition. International Student Edition.
8. Quantile Regression (IBM SPSS documentation). https://www.ibm.com/support/knowledgecenter/en/SSLVMB_sub/statistics_mainhelp_ddita/spss/regression/idh_quantile.html
