線性模型學習筆記

本文轉載自查看原文 2016-06-21 13:16 7032 線性模型/ Python

注：該文是根據開智學堂數據科學入門班的講課內容整理而成，主講人是肖凱老師。

線性模型

主要學習用 statsmodels 模塊進行線性回歸、邏輯回歸和時間序列分析。

線性模型基本概念

多個因素的定量化計算，是線性模型的最主要用途。

\[y=\beta_{0}+\beta_{1}x_{1}+\beta_{2}x_{2}+\epsilon \]

由上式，有兩個因素 \(x_{1}\) 和 \(x_{2}\) 同時影響 y，前面的系數 \(\beta_{1}\) 和 \(\beta_{2}\) 就是這個因素影響的力度大小，可以認為是方向和強度，負的就是負影響，正的就是正影響。\(\beta_{0}\)就是當\(x_{1}\)、\(x_{2}\) 都取 0 時，y 也有個正常的期望值。還有一些因素會影響 y，可以認為是個隨機項，或者噪音，也就是不在方程里考慮，不在思維框架里考慮，但它仍然會影響 y，只是影響比較小，就是 \(\epsilon\)。

線性模型就是衡量不同因素之間的關系，並且把它們定量化。

方程是我們的假設方程。

方程的左邊 y 是目標變量，或依賴變量，英文叫做 dependent variables，是我們希望去解釋的變量，即被解釋變量。

方程的右邊 x 是自變量，或解釋變量，explanatory variables。

通常左邊只有一個 y，右邊有多個 x。

求解的時候跟線性擬合非常相似，要找到一個方程，使誤差最小。求解的方法跟最優化的方法一樣，采用 ordinary least squares，即最小二乘法。

import statsmodels.api as sm # 基本 API
import statsmodels.formula.api as smf # 公式 API
import statsmodels.graphics.api as smg # 圖形界面 API
import patsy # 主要類似 R 語言的公式轉成 statsmodels 可以識別的形式
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from scipy import stats
import seaborn as sns

np.random.seed(123456789)
y = np.array([1, 2, 3, 4, 5])
x1 = np.array([6, 7, 8, 9, 10])
x2 = np.array([11, 12, 13, 14, 15])
X = np.vstack([np.ones(5), x1, x2, x1*x2]).T
print X

[[   1.    6.   11.   66.]
 [   1.    7.   12.   84.]
 [   1.    8.   13.  104.]
 [   1.    9.   14.  126.]
 [   1.   10.   15.  150.]]

X 為構造的矩陣，線性回歸中的自變量，每一個行是一個樣本，每一列可認為是一個變量。全為 1 的變量可認為是截距項，x1*x2 是 x1 和 x2 的交互項。共有 3 個變量影響 y。

把 X 當作自變量，y 當作因變量，可以用 numpy 中最小二乘 np.linalg.lstsq 求對應的系數，跟最優化中的求擬合 opt.minimize 同樣的原理。

beta, res, rank, sval = np.linalg.lstsq(X, y)
print beta

[ -5.55555556e-01   1.88888889e+00  -8.88888889e-01  -1.33226763e-15]

這里使用 statsmodels 用的模塊來做多元線性回歸，結果跟上面差不多。

model = sm.OLS(y, X) # OLS 是普通最小二乘，sm 中還有很多其它方法解決不同的問題
result = model.fit() # 做擬合
print result.params # 對應的各個變量的權重

[ -5.55555556e-01   1.88888889e+00  -8.88888889e-01  -1.22124533e-15]

還可以采取公式的方式。

data = {"y": y, "x1": x1, "x2": x2}
df_data = pd.DataFrame(data)
model = smf.ols("y ~ 1 + x1 + x2 + x1:x2", df_data) # 1 表示截距，默認是有截距，x1:x2 交互項中的冒號是相乘
result = model.fit() # 做擬合
print result.params

Intercept   -5.555556e-01
x1           1.888889e+00
x2          -8.888889e-01
x1:x2       -1.221245e-15
dtype: float64

print(result.summary())

                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       1.000
Model:                            OLS   Adj. R-squared:                  1.000
Method:                 Least Squares   F-statistic:                 4.723e+27
Date:                Sat, 18 Jun 2016   Prob (F-statistic):           2.12e-28
Time:                        00:19:04   Log-Likelihood:                 150.48
No. Observations:                   5   AIC:                            -295.0
Df Residuals:                       2   BIC:                            -296.1
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept     -0.5556   7.42e-14  -7.49e+12      0.000        -0.556    -0.556
x1             1.8889   2.77e-13   6.82e+12      0.000         1.889     1.889
x2            -0.8889   9.43e-14  -9.43e+12      0.000        -0.889    -0.889
x1:x2      -1.221e-15    8.7e-15     -0.140      0.901     -3.86e-14  3.62e-14
==============================================================================
Omnibus:                          nan   Durbin-Watson:                   0.034
Prob(Omnibus):                    nan   Jarque-Bera (JB):                0.319
Skew:                           0.407   Prob(JB):                        0.853
Kurtosis:                       2.069   Cond. No.                     8.37e+17
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The smallest eigenvalue is 8.82e-32. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.

公式構造還有其它的形式，如下所示，毋須多言。

from collections import defaultdict
data = defaultdict(lambda: np.array([1,2,3]))

patsy.dmatrices("y ~ a", data=data)[1].design_info.term_names # design_info.term_names 這些東西的含義不用在意，只看前面的公式表示就行

['Intercept', 'a']

patsy.dmatrices("y ~ 1 + a + b", data=data)[1].design_info.term_names

['Intercept', 'a', 'b']

patsy.dmatrices("y ~ -1 + a + b", data=data)[1].design_info.term_names # -1 表示不要截距項

['a', 'b']

patsy.dmatrices("y ~ a * b", data=data)[1].design_info.term_names # a * b 表示有 a 有 b 有 a*b

['Intercept', 'a', 'b', 'a:b']

patsy.dmatrices("y ~ a * b * c", data=data)[1].design_info.term_names

['Intercept', 'a', 'b', 'a:b', 'c', 'a:c', 'b:c', 'a:b:c']

patsy.dmatrices("y ~ a * b * c - a:b:c", data=data)[1].design_info.term_names # 只需要兩兩相乘，不需要三個相乘

['Intercept', 'a', 'b', 'a:b', 'c', 'a:c', 'b:c']

data = {k: np.array([]) for k in ["y", "a", "b", "c"]}

patsy.dmatrices("y ~ a + b", data=data)[1].design_info.term_names # 這里的運算符 + 並不是表示 a 和 b 要加起來，只是表示有 a 有 b

['Intercept', 'a', 'b']

patsy.dmatrices("y ~ I(a + b)", data=data)[1].design_info.term_names # 如果真要 a+b，I 類似轉義符號

['Intercept', 'I(a + b)']

patsy.dmatrices("y ~ I(a**2)", data=data)[1].design_info.term_names

['Intercept', 'I(a ** 2)']

patsy.dmatrices("y ~ np.log(a) + b", data=data)[1].design_info.term_names # 還可以做數學轉換

['Intercept', 'np.log(a)', 'b']

z = lambda x1, x2: x1+x2

patsy.dmatrices("y ~ z(a, b)", data=data)[1].design_info.term_names # 把 z 放到公式里

['Intercept', 'z(a, b)']

前面的都是數值變量，分類變量怎么在公式里體現呢？

data = {"y": [1, 2, 3], "a": [1, 2, 3]}

patsy.dmatrices("y ~ - 1 + a", data=data, return_type="dataframe")[1]

	a
0	1.0
1	2.0
2	3.0

patsy.dmatrices("y ~ - 1 + C(a)", data=data, return_type="dataframe")[1]

	C(a)[1]	C(a)[2]	C(a)[3]
0	1.0	0.0	0.0
1	0.0	1.0	0.0
2	0.0	0.0	1.0

data = {"y": [1, 2, 3], "a": ["type A", "type B", "type C"]}

patsy.dmatrices("y ~ - 1 + a", data=data, return_type="dataframe")[1]

	a[type A]	a[type B]	a[type C]
0	1.0	0.0	0.0
1	0.0	1.0	0.0
2	0.0	0.0	1.0

patsy.dmatrices("y ~ - 1 + C(a, Poly)", data=data, return_type="dataframe")[1] # C 符號轉成高次函數

	C(a, Poly).Constant	C(a, Poly).Linear	C(a, Poly).Quadratic
0	1.0	-7.071068e-01	0.408248
1	1.0	-5.551115e-17	-0.816497
2	1.0	7.071068e-01	0.408248

pasty 的作用是什么？有哪幾種方法構造線性回歸模型？模型 y = 1 + x_1 + x_2 + x_1 * x_2 用 pasty 怎么表示？

pasty 是把類似 R 語言的公式轉成 statsmodels 可以識別的形式。

可以使用 numpy.linalg.lstsq，或 scipy.optimize.minimize，或 statsmodels.OLS(y, X)，或 statsmodels.formula.api.ols("y ~ 1 + x1 + x2 + x1:x2", df_data)。

模型 y = 1 + x_1 + x_2 + x_1 * x_2 用 pasty 表示方法如下：

from collections import defaultdict
data = defaultdict(lambda: np.array([1,2,3]))
patsy.dmatrices("y ~ x1 * x2", data=data)[1].design_info.term_names

['Intercept', 'x1', 'x2', 'x1:x2']

patsy.dmatrices("y ~ 1 + x1 + x2 + x1:x2", data=data)[1].design_info.term_names

['Intercept', 'x1', 'x2', 'x1:x2']

線性回歸

上面介紹了線性模型基本概念，這里介紹線性回歸。

線性回歸指方程左邊的 y 取實數值，即 (\(-\infty,+\infty\))，如果右邊只有一個 x，就是一元線性回歸，當 x 變化時，y 也隨着變化，x 和 y 呈線性關系，可以用直線表示；如果 x 有多個，可以用超平面表示。線性回歸的線性，指 x 和 y 呈線性關系，回歸指 y 是連續的數值。

看例子：

np.random.seed(123456789)
N = 100
x1 = np.random.randn(N) # 自變量
x2 = np.random.randn(N)
data = pd.DataFrame({"x1": x1, "x2": x2})
def y_true(x1, x2):
    return 1  + 2 * x1 + 3 * x2 + 4 * x1 * x2
data["y_true"] = y_true(x1, x2) # 因變量
e = np.random.randn(N) # 標准正態分布
data["y"] = data["y_true"] + e # 加些噪音，加些擾動，噪音是符合標准正態分布的
data.head()

	x1	x2	y_true	y
0	2.212902	-0.474588	-0.198823	-1.452775
1	2.128398	-1.524772	-12.298805	-12.560965
2	1.841711	-1.939271	-15.420705	-14.715090
3	0.082382	0.345148	2.313945	1.190283
4	0.858964	-0.621523	-1.282107	0.307772

y_true 列是真正的 y，y 列是加了噪音的 y。造了該數據是為了做線性回歸，有兩個自變量，即二元線性回歸。加噪音是為了模仿真實場景，真實場景中自變量和因變量的關系往往不是固定的，是有隨機擾動的。目的是只給你看 y、x1 和 x2，你要反推回 y 和 x1 和 x2 之間的函數關系，假定這個函數關系是線性的。

只有 x1、x2 和 y 三列數據的情況下，先畫個散點圖看下。

fig, axes = plt.subplots(1, 2, figsize=(8, 4))

axes[0].scatter(data["x1"], data["y"])
axes[1].scatter(data["x2"], data["y"])

fig.tight_layout();

大體來看，存在一定的相關性，但由於噪音的存在，相關性不明顯。

使用上面介紹的 smf.ols 做線性回歸。

model = smf.ols("y ~ x1 + x2", data) # 這里沒有寫截距，截距是默認存在的
result = model.fit()
print(result.summary())

                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.380
Model:                            OLS   Adj. R-squared:                  0.367
Method:                 Least Squares   F-statistic:                     29.76
Date:                Sat, 18 Jun 2016   Prob (F-statistic):           8.36e-11
Time:                        07:41:16   Log-Likelihood:                -271.52
No. Observations:                 100   AIC:                             549.0
Df Residuals:                      97   BIC:                             556.9
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept      0.9868      0.382      2.581      0.011         0.228     1.746
x1             1.0810      0.391      2.766      0.007         0.305     1.857
x2             3.0793      0.432      7.134      0.000         2.223     3.936
==============================================================================
Omnibus:                       19.951   Durbin-Watson:                   1.682
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               49.964
Skew:                          -0.660   Prob(JB):                     1.41e-11
Kurtosis:                       6.201   Cond. No.                         1.32
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

上面的 summary 回歸結果主要看幾個地方：R-squared 在統計學里叫判定系數，或決定系數，也稱擬合優度，值在 0 到 1 之間，值越大，表示這個模型擬合的越好，這里 0.38 就擬合的不好；還要看系數對應的值 coef，這里有三個 Intercept、x1 和 x2，std err 是標准誤，還有 t 和 P，這是對每個系數做了個統計推斷，統計推斷的原假設是系數為 0，表示該系數在模型里不用存在，不用理解原理和具體過程，可以直接看 P 值，P 值如果很小，就推斷原假設，即其實系數不為 0，該變量在模型中應該是存在的，如上面的 summary 結果，x1 和 x2 的 P 值都很小，說明這兩個自變量在模型里都是有意義的，都應該存在模型里。有些回歸問題中，P 值比較大，那么對應的變量就可以扔掉。

初學者只關注 summary 結果中的判定系數，各自變量對應的系數值及 P 值即可。

print result.rsquared

0.380253832551

還要看對應的殘差，殘差表示真實值和模型擬合值的距離。這里有 100 個數據，也就有 100 個殘差。

result.resid.shape

(100,)

result.resid.head() # 殘差，result 的 resid 屬性

0    -3.370455
1   -11.153477
2   -11.721319
3    -0.948410
4     0.306215
dtype: float64

理論上殘差應該服從正態分布，可以檢驗下。

z, p = stats.normaltest(result.resid.values) # 正態性檢驗，原假設是數據服從正態性
print p

4.6524990253e-05

p 值很小，拒絕原假設，即殘差不服從正態分布。

除了使用正態性檢驗 stats.normaltest，還可以畫 QQ 圖。QQ 圖里殘差如果排成 45 度的斜線，表示數據是服從理論分布的。

fig, ax = plt.subplots(figsize=(8, 4))
smg.qqplot(result.resid, ax=ax)

fig.tight_layout()

這里殘差數據就不服從正態分布。

現在上面的模型做完之后有幾個不好的地方，一個是殘差不服從理論假定，說明殘差里有些東西，應該放到模型里，結果跑到殘差里來了；第二個是 R-squared 比較低，意味着有很多東西沒法解釋，y 里面應該有很多東西可以用 x 解釋，但這里只有 38% 被 x 解釋，很多東西在殘差里。所以用上面的公式 y ~ x1 + x2 來擬合是有問題的。所以這里改下公式，y 除了是由 x1 和 x2 決定的，還應該加上交互項。

model = smf.ols("y ~ x1 + x2 + x1:x2", data)
result = model.fit()
print(result.summary())

                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.955
Model:                            OLS   Adj. R-squared:                  0.954
Method:                 Least Squares   F-statistic:                     684.5
Date:                Sat, 18 Jun 2016   Prob (F-statistic):           1.21e-64
Time:                        08:31:46   Log-Likelihood:                -140.01
No. Observations:                 100   AIC:                             288.0
Df Residuals:                      96   BIC:                             298.4
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept      0.8706      0.103      8.433      0.000         0.666     1.076
x1             1.9693      0.108     18.160      0.000         1.754     2.185
x2             2.9670      0.117     25.466      0.000         2.736     3.198
x1:x2          3.9440      0.112     35.159      0.000         3.721     4.167
==============================================================================
Omnibus:                        2.950   Durbin-Watson:                   2.072
Prob(Omnibus):                  0.229   Jarque-Bera (JB):                2.734
Skew:                           0.327   Prob(JB):                        0.255
Kurtosis:                       2.521   Cond. No.                         1.38
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

R-squared 變成 0.95，模型非常好，再看幾個自變量對應的值，P 值都很小，表示幾個項都應該在模型里存在。

再來看殘差的正態性檢驗。

z, p = stats.normaltest(result.resid.values)
p

0.22874710482505187

fig, ax = plt.subplots(figsize=(8, 4))
smg.qqplot(result.resid, ax=ax)

fig.tight_layout()

接近 45 度的斜線，可以認為殘差服從理論分布。

這就是線性回歸的例子，可以看到，估計出的幾個自變量系數值 0.8706、1.9693、2.9670 和 3.9440 和真實值 1、2、3、4 還是比較接近的，估計結果很好。估計值與真實值不一樣是因為我們加了噪音在數據中。

可以把這個結果再畫個圖來研究下。可以用上面產生的模型做新的預測。

x = np.linspace(-1, 1, 50) # 建立新的 data
X1, X2 = np.meshgrid(x, x) # 轉成二維
new_data = pd.DataFrame({"x1": X1.ravel(), "x2": X2.ravel()})
y_pred = result.predict(new_data) # 使用上面擬合得到的模型
y_pred.shape

(2500,)

y_pred = y_pred.reshape(50, 50)

fig, axes = plt.subplots(1, 2, figsize=(12, 5), sharey=True)

def plot_y_contour(ax, Y, title):
    c = ax.contourf(X1, X2, Y, 15, cmap=plt.cm.RdBu) # 二元，正好用平面圖表示，y 用顏色表示
    ax.set_xlabel(r"$x_1$", fontsize=20)
    ax.set_ylabel(r"$x_2$", fontsize=20)
    ax.set_title(title)
    cb = fig.colorbar(c, ax=ax)
    cb.set_label(r"$y$", fontsize=20)

plot_y_contour(axes[0], y_true(X1, X2), "true relation") # 真實值
plot_y_contour(axes[1], y_pred, "fitted model") # 擬合模型計算的值

fig.tight_layout()

上面兩圖沒什么區別，表明擬合的很好。

理解了線性回歸，再看個例子。statsmodels 可以讀 R 語言里的數據，比如冰淇淋數據，如果對數據不了解，可以用 print(dataset.doc) 查看數據信息。

dataset = sm.datasets.get_rdataset("Icecream", "Ecdat")
print dataset.title

Ice Cream Consumption

dataset.data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30 entries, 0 to 29
Data columns (total 4 columns):
cons      30 non-null float64
income    30 non-null int64
price     30 non-null float64
temp      30 non-null int64
dtypes: float64(2), int64(2)
memory usage: 1.0 KB

dataset.data.head()

	cons	income	price	temp
0	0.386	78	0.270	41
1	0.374	79	0.282	56
2	0.393	81	0.277	63
3	0.425	80	0.280	68
4	0.406	76	0.272	69

model = smf.ols("cons ~ -1 + price + temp", data=dataset.data) # 建立價格、氣溫和銷量的關系，二元回歸，不要截距項
result = model.fit()
print(result.summary())

                            OLS Regression Results                            
==============================================================================
Dep. Variable:                   cons   R-squared:                       0.986
Model:                            OLS   Adj. R-squared:                  0.985
Method:                 Least Squares   F-statistic:                     1001.
Date:                Sat, 18 Jun 2016   Prob (F-statistic):           9.03e-27
Time:                        09:04:54   Log-Likelihood:                 51.903
No. Observations:                  30   AIC:                            -99.81
Df Residuals:                      28   BIC:                            -97.00
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
price          0.7254      0.093      7.805      0.000         0.535     0.916
temp           0.0032      0.000      6.549      0.000         0.002     0.004
==============================================================================
Omnibus:                        5.350   Durbin-Watson:                   0.637
Prob(Omnibus):                  0.069   Jarque-Bera (JB):                3.675
Skew:                           0.776   Prob(JB):                        0.159
Kurtosis:                       3.729   Cond. No.                         593.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

R-squared 為 0.98，擬合效果不錯。price 和 temp 的 P 值很小，都是很顯著的，表示應該放在模型里。

coef 的解釋為，temp 每增加 1 個單位，對應的 y 應該增加 0.003 個單位，price 每增加 1 個單位，對應的銷售增加 0.72 個單位。

再看下對應的圖形。

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))

smg.plot_fit(result, 0, ax=ax1)
smg.plot_fit(result, 1, ax=ax2)

fig.tight_layout()

從圖中可以看到，溫度增加，銷售增加，趨勢是比較明顯的，但價格變化，銷售並沒怎么變，這就帶來個疑問，為什么價格變化，銷售沒什么變化呢？有個可能是，價格變動的范圍太小了，0.26~0.29，價格變化不大，所以價格變化並沒有引起銷售明顯的變化。還有個疑點，是上面的 coef 中，price 跟跟銷售的關系竟然是正向的，按常理，價格增加銷售應該下降，這是比較奇怪的，可以進一步研究相關數據。

也可用 seaborn 畫變量關系。

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))

sns.regplot("price", "cons", dataset.data, ax=ax1);
sns.regplot("temp", "cons", dataset.data, ax=ax2);

fig.tight_layout()

注意到溫度跟消費呈正向關系，而 price 和 cons 又呈反向關系了，為什么會這樣？因為前面的圖是把消費里面關於溫度的影響給去掉了，也就是 cons 中沒有考慮 temp 的影響，單跟 price 的相關圖，而這里左圖 cons 是包含了 temp 的影響，這點要注意下，這是個單純的散點圖，所以才看到反向影響，price 左邊 cons 高，可能是溫度 temp 高，price 右邊可能是溫度較低。所以，一元回歸，只看一個變量，背后可能隱藏了第三者，多元回歸就是為了去除第三者的影響，散點圖可以做探索時看，但不能做為模型依據，模型依據還是要看 summary 中對應的系數，這些系數拋掉了其它因素的影響。

模型自變量的 P 值比較大說明了什么？模型擬合剩下的殘差有哪些用處？R 平方值是什么意思？qqplot 是什么圖？

P 值較大，說明對應的自變量系數為 0，對於因變量 y 來說就沒什么意義，可以把該自變量刪掉。

殘差表示真實值和模型擬合值的距離。比較兩個模型的擬合效果，可以通過比較它們的殘差平方和的大小來確定，殘差平方和越小的模型，擬合效果越好。

R 平方值在統計學里叫判定系數，或決定系數，也稱擬合優度，值在 0 到 1 之間，值越大，表示這個模型擬合的越好。

qqplot 用於檢驗一個序列是否服從正態分布，用 \(Q-Q'\) 圖與 y=x 比較，若基本吻合則序列服從正態分布，若相差較大則不服從正態分布。

Logistic 回歸

線性回歸的 y 是連續值，Logistic 回歸是要解可能發生事件的概率這樣的值。例如 x 可能是今天的溫度、濕度，y 是下雨的概率。

y 是個概率，就必然在 0 到 1 這個區間，如果用線性回歸來做，值在 (\(-\infty,+\infty\))，所以需要做些轉換。

\[log(p/(1-p))=\beta_{0}+\beta\cdot x \]

\[p=(1+exp(-\beta_{0}-\beta_{1}\cdot x))^{-1} \]

p 是求的概率，p/1-p 稱為機率。

df = sm.datasets.get_rdataset("iris").data # 鳶尾花數據
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
Sepal.Length    150 non-null float64
Sepal.Width     150 non-null float64
Petal.Length    150 non-null float64
Petal.Width     150 non-null float64
Species         150 non-null object
dtypes: float64(4), object(1)
memory usage: 5.9+ KB

df.Species.unique() # 花的種類

array(['setosa', 'versicolor', 'virginica'], dtype=object)

df_subset = df[(df.Species == "versicolor") | (df.Species == "virginica" )].copy() # 取個子集，相當於去掉了 setosa，因為這里只做二分類
df_subset.Species = df_subset.Species.map({"versicolor": 1, "virginica": 0}) # 做個映射，因為做分類時 y 要取數值
df_subset.rename(columns={"Sepal.Length": "Sepal_Length", "Sepal.Width": "Sepal_Width",
                          "Petal.Length": "Petal_Length", "Petal.Width": "Petal_Width"}, inplace=True) # 重命名，因為點號在 python 里有特殊意義
df_subset.head(3)

	Sepal_Length	Sepal_Width	Petal_Length	Petal_Width	Species
50	7.0	3.2	4.7	1.4	1
51	6.4	3.2	4.5	1.5	1
52	6.9	3.1	4.9	1.5	1

model = smf.logit("Species ~ Sepal_Length + Sepal_Width + Petal_Length + Petal_Width", data=df_subset)
result = model.fit() # 內部使用極大似然估計，所以會有結果返回，表示成功收斂

Optimization terminated successfully.
         Current function value: 0.059493
         Iterations 12

print(result.summary())

                           Logit Regression Results                           
==============================================================================
Dep. Variable:                Species   No. Observations:                  100
Model:                          Logit   Df Residuals:                       95
Method:                           MLE   Df Model:                            4
Date:                Sat, 18 Jun 2016   Pseudo R-squ.:                  0.9142
Time:                        10:21:54   Log-Likelihood:                -5.9493
converged:                       True   LL-Null:                       -69.315
                                        LLR p-value:                 1.947e-26
================================================================================
                   coef    std err          z      P>|z|      [95.0% Conf. Int.]
--------------------------------------------------------------------------------
Intercept       42.6378     25.708      1.659      0.097        -7.748    93.024
Sepal_Length     2.4652      2.394      1.030      0.303        -2.228     7.158
Sepal_Width      6.6809      4.480      1.491      0.136        -2.099    15.461
Petal_Length    -9.4294      4.737     -1.990      0.047       -18.714    -0.145
Petal_Width    -18.2861      9.743     -1.877      0.061       -37.381     0.809
================================================================================

summary 跟前面線性回歸的 summary 是類似的，Pseudo R-squ. 稱為偽 R 方，同 R 方擬合優度，這里 0.91 結果不錯。

從 summary 中各自變量結果來看，Petal_Length 和 Petal_Width 的 P 值要顯著些，預測時只需要這兩個變量就行。

對於系數的影響 coef，因為 Logistic 回歸是對線性組合做了映射，所以對系數的解釋就不像線性回歸那么簡單明了。可以用 get_margeff 邊際效能函數來看下，取導數的形式，即 x 變動多少，y 跟着變動多少。

print(result.get_margeff().summary())

        Logit Marginal Effects       
=====================================
Dep. Variable:                Species
Method:                          dydx
At:                           overall
================================================================================
                  dy/dx    std err          z      P>|z|      [95.0% Conf. Int.]
--------------------------------------------------------------------------------
Sepal_Length     0.0445      0.038      1.163      0.245        -0.031     0.120
Sepal_Width      0.1207      0.064      1.891      0.059        -0.004     0.246
Petal_Length    -0.1703      0.057     -2.965      0.003        -0.283    -0.058
Petal_Width     -0.3303      0.110     -2.998      0.003        -0.546    -0.114
================================================================================

可以看到 y 隨 Petal_Length 和 Petal_Width 變化而變化的范圍較大。

model = smf.logit("Species ~ Petal_Length + Petal_Width", data=df_subset)
result = model.fit() # 重新擬合

Optimization terminated successfully.
         Current function value: 0.102818
         Iterations 10

print(result.summary())

                           Logit Regression Results                           
==============================================================================
Dep. Variable:                Species   No. Observations:                  100
Model:                          Logit   Df Residuals:                       97
Method:                           MLE   Df Model:                            2
Date:                Sat, 18 Jun 2016   Pseudo R-squ.:                  0.8517
Time:                        10:22:55   Log-Likelihood:                -10.282
converged:                       True   LL-Null:                       -69.315
                                        LLR p-value:                 2.303e-26
================================================================================
                   coef    std err          z      P>|z|      [95.0% Conf. Int.]
--------------------------------------------------------------------------------
Intercept       45.2723     13.612      3.326      0.001        18.594    71.951
Petal_Length    -5.7545      2.306     -2.496      0.013       -10.274    -1.235
Petal_Width    -10.4467      3.756     -2.782      0.005       -17.808    -3.086
================================================================================

print(result.get_margeff().summary())

        Logit Marginal Effects       
=====================================
Dep. Variable:                Species
Method:                          dydx
At:                           overall
================================================================================
                  dy/dx    std err          z      P>|z|      [95.0% Conf. Int.]
--------------------------------------------------------------------------------
Petal_Length    -0.1736      0.052     -3.347      0.001        -0.275    -0.072
Petal_Width     -0.3151      0.068     -4.608      0.000        -0.449    -0.181
================================================================================

這里有兩個系數，可以放到二維空間來表示，二維空間的模型會存在一個決策邊界，即，當 Petal_Length 或 Petal_Width 大於某個值或小於某個值時，花的種類就不一樣了，花的決策邊界可以由 dy/dx 的值構造而成。

params = result.params
beta0 = -params['Intercept']/params['Petal_Width']
beta1 = -params['Petal_Length']/params['Petal_Width']
df_new = pd.DataFrame({"Petal_Length": np.random.randn(20)*0.5 + 5,
                       "Petal_Width": np.random.randn(20)*0.5 + 1.7})
df_new["P-Species"] = result.predict(df_new)
df_new["P-Species"].head(3)

0    0.995472
1    0.799899
2    0.000033
Name: P-Species, dtype: float64

df_new["Species"] = (df_new["P-Species"] > 0.5).astype(int)
df_new.head()

	Petal_Length	Petal_Width	P-Species	Species
0	4.717684	1.218695	0.995472	1
1	5.280952	1.292013	0.799899	1
2	5.610778	2.230056	0.000033	0
3	4.458715	1.907844	0.421614	0
4	4.822227	1.938929	0.061070	0

fig, ax = plt.subplots(1, 1, figsize=(8, 4)) # 定義畫布

ax.plot(df_subset[df_subset.Species == 0].Petal_Length.values,
        df_subset[df_subset.Species == 0].Petal_Width.values, 's', label='virginica') # 畫真實數據
ax.plot(df_new[df_new.Species == 0].Petal_Length.values,
        df_new[df_new.Species == 0].Petal_Width.values,
        'o', markersize=10, color="steelblue", label='virginica (pred.)') # 畫預測數據

ax.plot(df_subset[df_subset.Species == 1].Petal_Length.values,
        df_subset[df_subset.Species == 1].Petal_Width.values, 's', label='versicolor') # 點圖
ax.plot(df_new[df_new.Species == 1].Petal_Length.values,
        df_new[df_new.Species == 1].Petal_Width.values,
        'o', markersize=10, color="green", label='versicolor (pred.)')

_x = np.array([4.0, 6.1])
ax.plot(_x, beta0 + beta1 * _x, 'k')

ax.set_xlabel('Petal length')
ax.set_ylabel('Petal width')
ax.legend(loc=2)
fig.tight_layout()

邏輯回歸又稱為線性分類器。

什么是邏輯回歸（Logistic Regression）？Logistic 模型中，每個自變量的系數有意義嗎？怎么樣衡量自變量的解釋度？

Logistic 回歸是要解可能發生事件的概率這樣的問題。例如 x 是今天的溫度、濕度，y 是下雨的概率。

因為 Logistic 回歸是對線性組合做了映射，所以對自變量系數的解釋就不像線性回歸那么簡單明了。

看 summary 的 P 值，P 值很小，即表明該自變量影響大。

時間序列預測

時間序列預測是個特別的線性回歸問題，方程左邊是未來，方程右邊是過去，把歷史數據作為 x，把未來作為 y。

df = pd.read_csv("Data/temperature_outdoor_2014.tsv", header=None, delimiter="\t", names=["time", "temp"])
df.time = pd.to_datetime(df.time, unit="s")
df = df.set_index("time").resample("H").mean() # 每一行是每一小時的平均溫度

df_march = df[df.index.month == 3]
df_april = df[df.index.month == 4]
df_march.plot(figsize=(12, 4));

可見溫度數據還是有明顯的規律，有規律才可以建模型。規律體現在什么地方，畫散點圖研究下。

plt.scatter(df_april[1:], df_april[:-1]); # 上一小時溫度和這一小時溫度

滯后項的相關性稱為自相關，自己的歷史數據會影響后來的數據，這種規律稱為自相關。

plt.scatter(df_april[2:], df_april[:-2]); # 滯后兩項

自相關有所減弱。

plt.scatter(df_april[24:], df_april[:-24]); # 昨天和今天相應時間溫度

自相關，越近的數據影響越大。可以畫很多相關圖，相關系數圖。建模時要把這種自相關性去掉，使用差分來去掉，第二張圖就使用了差分，可以看到差分之后，自相關性會有所減弱，第三張圖是差分的差分，第四張圖是三階差分，可以看到，差分階次越高，自相關性就越小。

fig, axes = plt.subplots(1, 4, figsize=(12, 3))
smg.tsa.plot_acf(df_march.temp, lags=72, ax=axes[0]) # 時間序列子模塊，自相關，今天的溫度跟歷史 72 個小時的相關性
smg.tsa.plot_acf(df_march.temp.diff().dropna(), lags=72, ax=axes[1]) # 每個條狀是個相關系數
smg.tsa.plot_acf(df_march.temp.diff().diff().dropna(), lags=72, ax=axes[2])
smg.tsa.plot_acf(df_march.temp.diff().diff().diff().dropna(), lags=72, ax=axes[3])
fig.tight_layout()

sm.tsa.AR 模型是自回歸模型，從圖中可以看到歷史數據和現在數據會自相關，再一個，差分之后，自相關減少了，肯定可以用 AR 模型。如果圖形跟上面的不一樣，就可能不能使用 AR，就比較復雜了，具體可查閱時間序列相關資料。

model = sm.tsa.AR(df_march.temp) # AR 模型就是自回歸模型
result = model.fit(72) #認為過去 72 小時都會影響現在時間點的溫度
sm.stats.durbin_watson(result.resid) # 把殘差拿出來檢驗下

1.998562300635305

殘差是不應該有自回歸的，把殘差拿出來檢驗下，在 2 附近，意味着沒有自回歸。

fig, ax = plt.subplots(1, 1, figsize=(8, 3))
smg.tsa.plot_acf(result.resid, lags=72, ax=ax) # 畫殘差自相關性圖
fig.tight_layout()

可以看出殘差是沒有自相關性的。

下面做預測，預測 4 月份數據。

fig, ax = plt.subplots(1, 1, figsize=(12, 4))
ax.plot(df_march.index.values[-72:], df_march.temp.values[-72:], label="train data")
ax.plot(df_april.index.values[:72], df_april.temp.values[:72], label="actual outcome")
ax.plot(pd.date_range("2014-04-01", "2014-04-4", freq="H").values,
        result.predict("2014-04-01", "2014-04-4"), label="predicted outcome")

ax.legend()
fig.tight_layout()

綠色是 4 月份真實值，紅色是預測值，可以看到預測還是比較准確的。

什么是自相關？時間序列做差分運算會有什么樣的效果？

滯后項的相關性稱為自相關，自己的歷史數據會影響后來的數據，這種規律稱為自相關。

差分之后，自相關性會有所減弱，差分階次越高，自相關性就越小。

補充閱讀

免責聲明！

本站轉載的文章為個人學習借鑒使用，本站對版權不負任何法律責任。如果侵犯了您的隱私權益，請聯系本站郵箱yoyou2525@163.com刪除。

猜您在找 Pytorch線性規划模型學習筆記(一) 《機器學習(周志華)》筆記--線性模型（2）--線性回歸訓練流程、線性回歸的正規方程解、LinearRegression模型、線性回歸方法用於波士頓房價預測【學習筆記】線性基線性基學習筆記機器學習筆記—指數分布簇和廣義線性模型機器學習 —— 概率圖模型（學習：對數線性模型）機器學習中線性模型和非線性的區別雙塔模型學習筆記《機器學習(周志華)》筆記--線性模型（3）--邏輯回歸思想、概率計算、sigmoid 函數、邏輯回歸的損失函數計算高數學習筆記之線性和非線性的區別