Quantile Regression and Its Python Source Code

The sky is clear and the breeze is gentle. Idling at home is the perfect time to read. The classic papers resist understanding, and the open-source code has no comments. If nobody else will explain it, I will. Enough chatter, straight to the point. o( ̄︶ ̄)o
To explore the relationship between an independent variable $x$ and a dependent variable $y$, the simplest method is linear regression, which assumes:

$$y_i = \beta_0 + \beta_1 x_i + e_i, \quad i = 1, \dots, n$$
We estimate $\beta_0$ and $\beta_1$ by ordinary least squares (OLS) [1]. To guarantee the reliability of the estimates, we also make an assumption about the residuals $e_i$: they are independent random variables with mean 0 and constant variance. The fitted value is then the conditional mean of the dependent variable $y$ given the independent variable $x$.

If the residuals $e_i$ do not satisfy this assumption, or, more importantly, if we want to know not only the conditional mean of $y$ given $x$ but also the conditional median (more generally, a conditional quantile), then linear regression under OLS can no longer meet our needs. Quantile regression [2] solves these problems. Below I first give a practical example of quantile regression, then outline its theory, and finally analyze the source code of its Python implementation.
1. An Example: Income and Food Expenditure
This example comes from statsmodels: Quantile Regression [3]. We want to explore the relationship between household income and food expenditure; the data come from working-class Belgian households in 1857 (the Engel data). We use the Python package statsmodels to run the quantile regression.
1.1 Preprocessing
```python
%matplotlib inline

import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
import matplotlib.pyplot as plt

# Load the Engel (1857) food-expenditure data shipped with statsmodels
data = sm.datasets.engel.load_pandas().data
data.head()
```
```
       income     foodexp
0  420.157651  255.839425
1  541.411707  310.958667
2  901.157457  485.680014
3  639.080229  402.997356
4  750.875606  495.560775
```
1.2 Median Regression (a special case of quantile regression, q = 0.5)
```python
mod = smf.quantreg('foodexp ~ income', data)
res = mod.fit(q=.5)
print(res.summary())
```
```
                         QuantReg Regression Results
==============================================================================
Dep. Variable:                foodexp   Pseudo R-squared:               0.6206
Model:                       QuantReg   Bandwidth:                       64.51
Method:                 Least Squares   Sparsity:                        209.3
Date:                Mon, 21 Oct 2019   No. Observations:                  235
Time:                        17:46:59   Df Residuals:                      233
                                        Df Model:                            1
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     81.4823     14.634      5.568      0.000      52.649     110.315
income         0.5602      0.013     42.516      0.000       0.534       0.586
==============================================================================

The condition number is large, 2.38e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
```
From the output we can read off the fitted median regression: $\text{foodexp} = 81.4823 + 0.5602\,\text{income}$. But how are these coefficient estimates obtained? What do std err, t, and Pseudo R-squared in the output mean? I will explain shortly.
1.3 Visualizing the Data
We first fit 10 quantile regressions, with the quantile q ranging from 0.05 to 0.95.
```python
quantiles = np.arange(.05, .96, .1)

def fit_model(q):
    res = mod.fit(q=q)
    return [q, res.params['Intercept'], res.params['income']] + \
        res.conf_int().loc['income'].tolist()

models = [fit_model(x) for x in quantiles]
models = pd.DataFrame(models, columns=['q', 'a', 'b', 'lb', 'ub'])

ols = smf.ols('foodexp ~ income', data).fit()
ols_ci = ols.conf_int().loc['income'].tolist()
ols = dict(a=ols.params['Intercept'],
           b=ols.params['income'],
           lb=ols_ci[0],
           ub=ols_ci[1])

print(models)
print(ols)
```
```
      q           a         b        lb        ub
0  0.05  124.880096  0.343361  0.268632  0.418090
1  0.15  111.693660  0.423708  0.382780  0.464636
2  0.25   95.483539  0.474103  0.439900  0.508306
3  0.35  105.841294  0.488901  0.457759  0.520043
4  0.45   81.083647  0.552428  0.525021  0.579835
5  0.55   89.661370  0.565601  0.540955  0.590247
6  0.65   74.033435  0.604576  0.582169  0.626982
7  0.75   62.396584  0.644014  0.622411  0.665617
8  0.85   52.272216  0.677603  0.657383  0.697823
9  0.95   64.103964  0.709069  0.687831  0.730306
{'a': 147.47538852370562, 'b': 0.48517842367692354, 'lb': 0.4568738130184233,
```
Here we have fitted 10 regressions: q is the corresponding quantile, a is the intercept, b is the slope coefficient on income, and lb and ub are the lower and upper bounds of the 95% confidence interval for b.
Now let's plot these 10 regression lines:
```python
x = np.arange(data.income.min(), data.income.max(), 50)
get_y = lambda a, b: a + b * x

fig, ax = plt.subplots(figsize=(8, 6))

for i in range(models.shape[0]):
    y = get_y(models.a[i], models.b[i])
    ax.plot(x, y, linestyle='dotted', color='grey')

y = get_y(ols['a'], ols['b'])

ax.plot(x, y, color='red', label='OLS')
ax.scatter(data.income, data.foodexp, alpha=.2)
ax.set_xlim((240, 3000))
ax.set_ylim((240, 2000))
legend = ax.legend()
ax.set_xlabel('Income', fontsize=16)
ax.set_ylabel('Food expenditure', fontsize=16);
```
[Figure: scatter of income vs. food expenditure, with the ten quantile regression lines (dotted grey) and the OLS line (red)]
In the figure above, the dotted lines are the quantile regression lines and the red line is the linear least-squares (OLS) regression line. Three patterns stand out:
- As income rises, food expenditure rises.
- As income rises, the variation in food expenditure across households widens. The poor have little choice, while the rich can choose their lifestyle: some like to eat expensively, some cheaply. OLS cannot reveal this pattern, because it gives us only a single mean.
- For poor households, the OLS predictions are too high, because a few rich households pull the overall mean up. OLS is sensitive to outliers and is not a robust model.
2. The Theory of Quantile Regression
This part is mathematical statistics; readers who only care about the implementation can skip it. We need to answer the following questions:
- What is a quantile?
- How is a quantile computed?
- What is quantile regression?
- How are the quantile regression coefficients estimated?
- How are the coefficients tested?
- How is goodness of fit assessed?
2.1 Definition of a Quantile

Let $Y$ be a random variable with cumulative distribution function $F_Y(y)$. The $q$-th quantile of $Y$ is

$$y_q = \inf\{y : F_Y(y) \geq q\}, \quad q \in (0, 1)$$

For example, suppose there are 100 people and 95% of them are shorter than 1.9 m; then 1.9 m is the 95% quantile of height.
2.2 Computing a Quantile

$$\tilde{Y}_q = \arg\min_{y_q} \sum_{i=1}^{n} \rho_q(y_i - y_q) = \arg\min_{y_q}\Big[(q-1)\sum_{y_i < y_q}(y_i - y_q) + q\sum_{y_i \geq y_q}(y_i - y_q)\Big]$$

Here $\rho_q(u) = u\,(q - \mathbf{1}\{u < 0\})$ is the check (pinball) loss. By trying different values of $y_q$ and minimizing this loss, the minimizing value is the estimate $\tilde{Y}_q$ of the $q$-th quantile of $Y$.
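As a sanity check on this definition, here is a small sketch of my own (not from the statsmodels source) that estimates a sample quantile by brute-force minimization of the check loss and compares the result with `np.percentile`:

```python
import numpy as np

def check_loss(y, c, q):
    """Sum of rho_q(y_i - c): (q-1)-weighted below c, q-weighted at or above c."""
    u = y - c
    return np.sum(np.where(u < 0, (q - 1) * u, q * u))

rng = np.random.default_rng(0)
y = rng.normal(size=1000)

# Brute-force search over a grid of candidate values c
grid = np.linspace(y.min(), y.max(), 10001)
best = grid[np.argmin([check_loss(y, c, 0.95) for c in grid])]

print(best)                  # minimizer of the check loss
print(np.percentile(y, 95))  # agrees up to the grid resolution
```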
2.3 Quantile Regression

For OLS, we have

$$\tilde{Y}|X = X\tilde{\beta}$$

where $\tilde{\beta}$ is the $\beta$ that minimizes $\sum_{i=1}^{n}(y_i - x_i'\beta)^2$. Analogously, for $q$-quantile regression we have

$$\tilde{Y}_q|X = X\tilde{\beta}_q$$

where $\tilde{\beta}_q$ is the $\beta$ that minimizes

$$\sum_{i=1}^{n}\rho_q(y_i - x_i'\beta) = (q-1)\sum_{y_i < x_i'\beta}(y_i - x_i'\beta) + q\sum_{y_i \geq x_i'\beta}(y_i - x_i'\beta)$$
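This objective is simply the check loss of section 2.2 applied to the regression residuals. As a hypothetical helper (the function name is my own), it could be written as:

```python
import numpy as np

def quantreg_objective(beta, X, y, q):
    """Sum of rho_q(y_i - x_i' beta), the q-quantile regression loss."""
    u = y - X @ beta
    return np.sum(np.where(u < 0, (q - 1) * u, q * u))
```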
2.4 Estimating the Coefficients

Since $\rho_q$ is not differentiable everywhere in $\beta$, we cannot find the minimizer by setting a derivative to zero; we can only approach it iteratively. statsmodels uses Iteratively Reweighted Least Squares (IRLS) [6].

Suppose we seek the $\beta$ that minimizes an $L_p$ norm of the residuals:

$$\arg\min_{\beta}\|Y - X\beta\|_p = \arg\min_{\beta}\sum_{i=1}^{n}|y_i - x_i'\beta|^p$$

Then the value of $\beta$ at step $t+1$ is

$$\beta^{(t+1)} = \arg\min_{\beta}\sum_{i=1}^{n} w_i^{(t)}\,|y_i - x_i'\beta|^2 = (X^T W^{(t)} X)^{-1} X^T W^{(t)} y$$

where $W^{(t)}$ is a diagonal matrix of weights, initialized as $w_i^{(0)} = 1$ and updated at step $t$ as

$$w_i^{(t)} = |y_i - x_i'\beta|^{p-2}$$

Take median regression ($q = 0.5$) as an example. We seek

$$\arg\min_{\beta}\Big[(q-1)\sum_{y_i < x_i'\beta}(y_i - x_i'\beta) + q\sum_{y_i \geq x_i'\beta}(y_i - x_i'\beta)\Big]$$

that is

$$\arg\min_{\beta}\Big[-0.5\sum_{y_i < x_i'\beta}(y_i - x_i'\beta) + 0.5\sum_{y_i \geq x_i'\beta}(y_i - x_i'\beta)\Big] = \arg\min_{\beta}\sum_{i=1}^{n}|y_i - x_i'\beta|$$

which is exactly the $L_1$ norm above, so $p = 1$ and

$$w_i^{(t)} = |y_i - x_i'\beta|^{-1}$$

To avoid a zero denominator, we take $w_i^{(t)} = 1 / \max(\delta, |y_i - x_i'\beta|)$, where $\delta$ is a very small number, e.g. 0.0001.
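Below is a minimal IRLS sketch for the median case, under the assumptions above (weights initialized to 1, $L_1$ weights floored at $\delta$). It illustrates the update rule only; it is not the statsmodels implementation, which also handles general q by reweighting the residuals asymmetrically (see section 3.1):

```python
import numpy as np

def irls_median(X, y, max_iter=1000, p_tol=1e-6, delta=1e-4):
    """IRLS for median regression: argmin_beta sum_i |y_i - x_i' beta|."""
    w = np.ones(len(y))                  # w_i^(0) = 1
    beta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        beta0 = beta
        # Weighted least squares step: (X'WX)^{-1} X'Wy
        beta = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * y))
        resid = y - X @ beta
        # p = 1 weights, floored at delta to avoid division by zero
        w = 1.0 / np.maximum(delta, np.abs(resid))
        if np.max(np.abs(beta - beta0)) < p_tol:
            break
    return beta

# e.g. irls_median(sm.add_constant(data['income']).values,
#                  data['foodexp'].values)
# should roughly reproduce the section 1.2 coefficients (81.48, 0.5602)
```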
2.5 Testing the Coefficients

The iterations of section 2.4 yield the estimate $\tilde{\beta}_q$. To build the t statistic for hypothesis testing, all we still need is an estimate of the variance of $\tilde{\beta}_q$. The asymptotic estimate of the covariance matrix of the quantile regression coefficients is

$$\mathrm{Est.\,Asy.\,Var}[\tilde{\beta}_q] = (X'X)^{-1} X'DX\,(X'X)^{-1}$$

where $D$ is a diagonal matrix with

$$d_i = \left[\frac{q}{f(0)}\right]^2 \ \text{when } e_i > 0, \qquad d_i = \left[\frac{1-q}{f(0)}\right]^2 \ \text{when } e_i \leq 0$$

Here $f(0)$ is the density of the residuals at zero, estimated by

$$\tilde{f}(0) = \frac{1}{n}\sum_{i=1}^{n}\frac{1}{h}K\!\left[\frac{e_i}{h}\right]$$

where $e_i = y_i - x_i'\beta$, $K$ is a kernel function (Epanechnikov, Gaussian, etc.), and $h$ is the bandwidth, chosen as in Stata 12 [5].

The std err in the regression output is the square root of the diagonal of this covariance matrix, and the t statistic equals the coefficient estimate divided by its std err.
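To make these formulas concrete, here is a sketch of the robust covariance computation; the function name is my own, and it assumes the residuals e and the bandwidth h have already been computed (the actual implementation is in section 3.1):

```python
import numpy as np

def robust_cov(X, e, q, h):
    """Est.Asy.Var[beta_q] = (X'X)^{-1} X'DX (X'X)^{-1} via an Epanechnikov KDE."""
    n = len(e)
    kernel = lambda u: 3. / 4 * (1 - u**2) * (np.abs(u) <= 1)   # Epanechnikov
    fhat0 = np.sum(kernel(e / h)) / (n * h)                     # estimate of f(0)
    d = np.where(e > 0, (q / fhat0)**2, ((1 - q) / fhat0)**2)   # diagonal of D
    xtxi = np.linalg.pinv(X.T @ X)
    cov = xtxi @ (X.T * d) @ X @ xtxi
    return cov, np.sqrt(np.diag(cov))                           # covariance, std errs
```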
2.6 Goodness of Fit

For OLS, we use $R^2$ to measure goodness of fit. For $q$-quantile regression, the analogue is

$$\tilde{R}(q) = 1 - \frac{\sum_{i=1}^{n}\rho_q(y_i - x_i'\tilde{\beta}_q)}{\sum_{i=1}^{n}\rho_q(y_i - \tilde{y}_q)}$$

where $\tilde{y}_q$ is the $q$-th quantile of all the observed $y_i$. This is the Pseudo R-squared in the regression output.
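As a quick check, and assuming `res` still holds the q = 0.5 fit from section 1.2, this value can be read straight off the results object:

```python
print(res.prsquared)  # ~0.6206, matching the summary in section 1.2
```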
3. Python Source Code Analysis

The complete source implementing quantile regression lives in statsmodels [5]. It mainly contains two classes, QuantReg and QuantRegResults. QuantReg is the core: it estimates the regression coefficients and computes their covariance matrix. QuantRegResults computes the goodness of fit and assembles the regression results.
3.1 The QuantReg Class
```python
# Imports as in the statsmodels module (abridged here for readability)
import warnings

import numpy as np
from numpy.linalg import pinv
import scipy.stats as stats
from scipy.stats import norm

from statsmodels.tools.tools import chain_dot
from statsmodels.tools.decorators import cache_readonly
from statsmodels.tools.sm_exceptions import (ConvergenceWarning,
                                             IterationLimitWarning)
from statsmodels.regression.linear_model import (RegressionModel,
                                                 RegressionResults,
                                                 RegressionResultsWrapper)


# QuantReg is a subclass of the package's RegressionModel
class QuantReg(RegressionModel):

    # Estimates the regression coefficients and their covariance matrix.
    # q is the quantile; vcov selects the covariance estimator, defaulting to
    # 'robust', i.e. the method of section 2.5. The kernel defaults to 'epa'
    # (Epanechnikov) and the bandwidth rule to 'hsheather'. IRLS runs for at
    # most 1000 iterations by default and stops once the change in beta is
    # below 1e-6.
    def fit(self, q=.5, vcov='robust', kernel='epa', bandwidth='hsheather',
            max_iter=1000, p_tol=1e-6, **kwargs):
        """
        Solve by Iterative Weighted Least Squares

        Parameters
        ----------
        q : float
            Quantile must be between 0 and 1
        vcov : str, method used to calculate the variance-covariance matrix
            of the parameters. Default is ``robust``:

            - robust : heteroskedasticity robust standard errors (as suggested
              in Greene 6th edition)
            - iid : iid errors (as in Stata 12)
        kernel : str, kernel to use in the kernel density estimation for the
            asymptotic covariance matrix:

            - epa: Epanechnikov
            - cos: Cosine
            - gau: Gaussian
            - par: Parzene
        bandwidth : str, Bandwidth selection method in kernel density
            estimation for asymptotic covariance estimate (full
            references in QuantReg docstring):

            - hsheather: Hall-Sheather (1988)
            - bofinger: Bofinger (1975)
            - chamberlain: Chamberlain (1994)
        """
        if q < 0 or q > 1:
            raise Exception('p must be between 0 and 1')

        kern_names = ['biw', 'cos', 'epa', 'gau', 'par']
        if kernel not in kern_names:
            raise Exception("kernel must be one of " + ', '.join(kern_names))
        else:
            kernel = kernels[kernel]

        if bandwidth == 'hsheather':
            bandwidth = hall_sheather
        elif bandwidth == 'bofinger':
            bandwidth = bofinger
        elif bandwidth == 'chamberlain':
            bandwidth = chamberlain
        else:
            raise Exception("bandwidth must be in 'hsheather', 'bofinger', "
                            "'chamberlain'")

        # endog is the observed dependent variable, exog the independent ones
        endog = self.endog
        exog = self.exog
        nobs = self.nobs
        exog_rank = np.linalg.matrix_rank(self.exog)
        self.rank = exog_rank
        self.df_model = float(self.rank - self.k_constant)
        self.df_resid = self.nobs - self.rank

        # IRLS initialization
        n_iter = 0
        xstar = exog
        beta = np.ones(exog_rank)
        # TODO: better start, initial beta is used only for convergence check

        # Note the following does not work yet,
        # the iteration loop always starts with OLS as initial beta
        # if start_params is not None:
        #     if len(start_params) != rank:
        #         raise ValueError('start_params has wrong length')
        #     beta = start_params
        # else:
        #     # start with OLS
        #     beta = np.dot(np.linalg.pinv(exog), endog)

        diff = 10
        cycle = False
        history = dict(params=[], mse=[])

        # IRLS iterations
        while n_iter < max_iter and diff > p_tol and not cycle:
            n_iter += 1
            beta0 = beta
            xtx = np.dot(xstar.T, exog)
            xty = np.dot(xstar.T, endog)
            beta = np.dot(pinv(xtx), xty)
            resid = endog - np.dot(exog, beta)

            mask = np.abs(resid) < .000001
            resid[mask] = ((resid[mask] >= 0) * 2 - 1) * .000001
            resid = np.where(resid < 0, q * resid, (1-q) * resid)
            resid = np.abs(resid)
            # dividing exog by resid (1/resid[:, np.newaxis]) updates the weights W
            xstar = exog / resid[:, np.newaxis]
            diff = np.max(np.abs(beta - beta0))
            history['params'].append(beta)
            history['mse'].append(np.mean(resid*resid))

            # check for a convergence cycle; if one is found, stop iterating early
            if (n_iter >= 300) and (n_iter % 100 == 0):
                # check for convergence circle, should not happen
                for ii in range(2, 10):
                    if np.all(beta == history['params'][-ii]):
                        cycle = True
                        warnings.warn("Convergence cycle detected",
                                      ConvergenceWarning)
                        break

        if n_iter == max_iter:
            warnings.warn("Maximum number of iterations (" + str(max_iter) +
                          ") reached.", IterationLimitWarning)

        # compute the covariance matrix
        e = endog - np.dot(exog, beta)
        # Greene (2008, p.407) writes that Stata 6 uses this bandwidth:
        # h = 0.9 * np.std(e) / (nobs**0.2)
        # Instead, we calculate bandwidth as in Stata 12
        iqre = stats.scoreatpercentile(e, 75) - stats.scoreatpercentile(e, 25)
        h = bandwidth(nobs, q)
        h = min(np.std(endog),
                iqre / 1.34) * (norm.ppf(q + h) - norm.ppf(q - h))

        fhat0 = 1. / (nobs * h) * np.sum(kernel(e / h))

        if vcov == 'robust':
            d = np.where(e > 0, (q/fhat0)**2, ((1-q)/fhat0)**2)
            xtxi = pinv(np.dot(exog.T, exog))
            xtdx = np.dot(exog.T * d[np.newaxis, :], exog)
            vcov = chain_dot(xtxi, xtdx, xtxi)
        elif vcov == 'iid':
            vcov = (1. / fhat0)**2 * q * (1 - q) * pinv(np.dot(exog.T, exog))
        else:
            raise Exception("vcov must be 'robust' or 'iid'")

        # build a QuantRegResults object from the coefficient estimates and
        # covariance matrix, and return the wrapped results
        lfit = QuantRegResults(self, beta, normalized_cov_params=vcov)

        lfit.q = q
        lfit.iterations = n_iter
        lfit.sparsity = 1. / fhat0
        lfit.bandwidth = h
        lfit.history = history

        return RegressionResultsWrapper(lfit)


# kernel function definitions
def _parzen(u):
    z = np.where(np.abs(u) <= .5, 4./3 - 8. * u**2 + 8. * np.abs(u)**3,
                 8. * (1 - np.abs(u))**3 / 3.)
    z[np.abs(u) > 1] = 0
    return z

kernels = {}
kernels['biw'] = lambda u: 15. / 16 * (1 - u**2)**2 * np.where(np.abs(u) <= 1, 1, 0)
kernels['cos'] = lambda u: np.where(np.abs(u) <= .5, 1 + np.cos(2 * np.pi * u), 0)
kernels['epa'] = lambda u: 3. / 4 * (1-u**2) * np.where(np.abs(u) <= 1, 1, 0)
kernels['gau'] = lambda u: norm.pdf(u)
kernels['par'] = _parzen
# kernels['bet'] = lambda u: np.where(np.abs(u) <= 1, .75 * (1 - u) * (1 + u), 0)
# kernels['log'] = lambda u: logistic.pdf(u) * (1 - logistic.pdf(u))
# kernels['tri'] = lambda u: np.where(np.abs(u) <= 1, 1 - np.abs(u), 0)
# kernels['trw'] = lambda u: 35. / 32 * (1 - u**2)**3 * np.where(np.abs(u) <= 1, 1, 0)
# kernels['uni'] = lambda u: 1. / 2 * np.where(np.abs(u) <= 1, 1, 0)


# bandwidth selection rules
def hall_sheather(n, q, alpha=.05):
    z = norm.ppf(q)
    num = 1.5 * norm.pdf(z)**2.
    den = 2. * z**2. + 1.
    h = n**(-1. / 3) * norm.ppf(1. - alpha / 2.)**(2./3) * (num / den)**(1./3)
    return h

def bofinger(n, q):
    num = 9. / 2 * norm.pdf(2 * norm.ppf(q))**4
    den = (2 * norm.ppf(q)**2 + 1)**2
    h = n**(-1. / 5) * (num / den)**(1. / 5)
    return h

def chamberlain(n, q, alpha=.05):
    return norm.ppf(1 - alpha / 2) * np.sqrt(q*(1 - q) / n)
```
3.2 The QuantRegResults Class

Here I only show the code that computes the goodness of fit.
```python
class QuantRegResults(RegressionResults):
    '''Results instance for the QuantReg model'''

    @cache_readonly
    def prsquared(self):
        q = self.q
        endog = self.model.endog
        # e holds the residuals, weighted by the check loss
        e = self.resid
        e = np.where(e < 0, (1 - q) * e, q * e)
        e = np.abs(e)
        # ered is the check loss of the null model: y minus its q-th quantile
        ered = endog - stats.scoreatpercentile(endog, q * 100)
        ered = np.where(ered < 0, (1 - q) * ered, q * ered)
        ered = np.abs(ered)
        return 1 - np.sum(e) / np.sum(ered)
```
4. Summary

Above, I first gave an applied example of quantile regression, then described the theory behind it, and finally analyzed the source code of its Python implementation.

Compared with OLS regression, quantile regression is more involved, but it offers three main advantages:
- It captures the relationship between the independent variables and any quantile of the dependent variable, not just its mean.
- It makes no assumptions about the residuals.
- It is less affected by outliers.
References
1. Ordinary least squares. https://en.wikipedia.org/wiki/Ordinary_least_squares
2. Quantile Regression. http://www.econ.uiuc.edu/~roger/research/rq/rq.pdf
3. Quantile Regression (statsmodels example). https://www.statsmodels.org/dev/examples/notebooks/generated/quantile_regression.html
4. Quantile regression. https://en.wikipedia.org/wiki/Quantile_regression
5. statsmodels source: quantile_regression.py. https://www.statsmodels.org/devel/_modules/statsmodels/regression/quantile_regression.html
6. Iteratively reweighted least squares. https://en.wikipedia.org/wiki/Iteratively_reweighted_least_squares
7. Greene, W. H. (2008). Econometric Analysis. Sixth Edition. International Student Edition.
8. Quantile Regression (IBM SPSS documentation). https://www.ibm.com/support/knowledgecenter/en/SSLVMB_sub/statistics_mainhelp_ddita/spss/regression/idh_quantile.html
