線性回歸

本文轉載自查看原文 2016-11-26 19:28 17976 機器學習

一、回歸問題的定義

回歸是監督學習的一個重要問題，回歸用於預測輸入變量和輸出變量之間的關系。回歸模型是表示輸入變量到輸出變量之間映射的函數。回歸問題的學習等價於函數擬合：使用一條函數曲線使其很好的擬合已知函數且很好的預測未知數據。

回歸問題分為模型的學習和預測兩個過程。基於給定的訓練數據集構建一個模型，根據新的輸入數據預測相應的輸出。

回歸問題按照輸入變量的個數可以分為一元回歸和多元回歸；按照輸入變量和輸出變量之間關系的類型，可以分為線性回歸和非線性回歸。

一元回歸：y = ax + b

多元回歸：

二、回歸問題的求解

2.1解析解

2.1.1 最小二乘法

最小二乘法又稱最小平方法，它通過最小化誤差的平方和尋找數據的最佳函數匹配。利用最小二乘法可以簡便地求得未知的數據，並使得這些求得的數據與實際數據之間誤差的平方和為最小。最小二乘法還可用於曲線擬合。

2.1.2利用極大似然估計解釋最小二乘法

現在假設我們有m個樣本，我們假設有：

誤差項是IID，根據中心極限定理，由於誤差項是好多好多相互獨立的因素影響的綜合影響，我們有理由假設其服從高斯分布，又由於可以自己適配theta0，是的誤差項的高斯分布均值為0，所以我們有

所以我們有：

也即：

表示在theta給定的時候，給我一個x，就給你一個y

那么我們可以寫出似然函數：

由極大似然估計的定義，我們需要L(theta)最大，那么我們怎么才能是的這個值最大呢？兩邊取對數對這個表達式進行化簡如下：

需要 l(theta)最大，也即最后一項的后半部分最小，也即：

所以，我們最后由極大似然估計推導得出，我們希望 J(theta) 最小，而這剛好就是最小二乘法做的工作。而回過頭來我們發現，之所以最小二乘法有道理，是因為我們之前假設誤差項服從高斯分布，假如我們假設它服從別的分布，那么最后的目標函數的形式也會相應變化。

好了，上邊我們得到了有極大似然估計或者最小二乘法，我們的模型求解需要最小化目標函數J(theta)，那么我們的theta到底怎么求解呢？有沒有一個解析式可以表示theta?

2.1.3 theta的解析式的求解過程

我們需要最小化目標函數，關心 theta 取什么值的時候，目標函數取得最小值，而目標函數連續，那么 theta 一定為目標函數的駐點，所以我們求導尋找駐點。

求導可得：

最終我們得到參數 theta 的解析式：

關於向量、矩陣求導知識參見http://www.cnblogs.com/futurehau/p/6105236.html

上述最后一步有一些問題，假如 X'X不可逆呢？

我們知道 X'X 是一個辦正定矩陣，所以若X'X不可逆或為了防止過擬合，我們增加lambda擾動，得到

從另一個角度來看，這相當與給我們的線性回歸參數增加一個懲罰因子，這是必要的，我們數據是有干擾的，不正則的話有可能數據對於訓練集擬合的特別好，但是對於新數據的預測誤差很大。

2.1.4正則化

L2-norm: （Ridge回歸）

L1-norm: （Lasso回歸）

　　J(theta) = J + lambda * sum(|theta|)

L1-norm 和 L2-norm都能防止過擬合，一般L2-norm的效果更好一些。L1-norm能夠產生稀疏模型，能夠幫助我們去除某些特征，因此可以用於特征選擇。

L1-norm 和 L2-norm的直觀理解：摘自http://lib.csdn.net/article/machinelearning/42049

今天又看到一個比較好的解釋。可以把加入正則理解為加入約束條件，（類似於逆向拉格朗日）。那么，比如上邊的圖，L2約束就是一個圓，L1約束就是一個方形。那些關於w的圈圈都是等值線，代表了損失時多少，我們現在要求的就是在約束的條件下尋找最小的損失。所以其實就是找約束的圖形和等值線的交點。

L1的缺點：如果有幾個變量相關性比較大，那么它會隨機的選擇某一個。優化：Elastic Net

2.2 梯度下降算法

我們在上邊給出了最小二乘法求解線性回歸的參數theta,實際python 的 numpy庫就是使用的這種方法。

當然了，這需要我們的參數的維度不大，當維度大的時候，使用解析解就不適用了，這里討論梯度下降算法。

2.2.1梯度下降法步驟：

　　初始化theta

　　沿着負梯度方向迭代，更新后的theta使得J(theta)更小。

　　其中α表示學習率

　　一個優化技巧：不同的特征采用不同的學習率 Adagrad

梯度下系那個示意圖如下：

每次迭代找到一個更好的，最后得到一個局部最優解，不一定是全局最優，但是是堪用的。

2.2.2 具體實現

梯度方向：

2.2.2.1 批量梯度下降算法：

由於在線性回歸中，目標函數收斂而且為凸函數，是有一個極值點，所以局部最小值就是全局最小值。

2.2.2.2隨機梯度下降算法：

拿到一個樣本就下降一次。實際中隨機梯度下降算法其實是更適用的。出於一下亮點考慮：

1.由於噪聲的存在，不能保證每個變量都讓目標函數下降，但總的趨勢是下降的。但是另一方面，由於噪聲的存在，隨機梯度下降算法往往有利於跳出局部最小值。

2.流式數據的處理

2.2.2.3 mini-batch

拿到若干個樣本的平均梯度之后在更新梯度方向。

如上圖所示，一個批量梯度下降，一個隨機梯度下降，最終效果其實是相近的。

2.2.2.4 上升一個高度把三種梯度下降算法聯系起來

期望損失：理論上模型關於自變量因變量的平均意義下的損失，學習的目標就是選擇期望損失最小的模型。

經驗風險：模型關於訓練樣本集的平均損失。因為我們不可能得到所有的樣本來計算期望損失，所以我們使用經驗風險來代替期望損失。

那么怎么來處理選擇這些樣本呢？

BGD：我擁有的所有者n個樣本的平均損失

SGD:單個樣本處理

mini-batch：多個樣本處理

三、實際線性回歸時候的數據使用

此處分析不僅僅局限於線性回歸。

實際中可以把數據分為訓練數據和測試數據，然后根據不同模型在測試數據上的表現來選擇模型。

另外一些情況，比如上邊加上正則化之后，我們不能由訓練數據得到lambda，那么我們需要把訓練數據進一步划分為訓練數據和驗證數據。在訓練數據上學習theta和lambda，然后在驗證數據上選擇lambda,然后再在測試數據上驗證選擇不同模型。

實際中采用交叉驗證充分利用數據，例如五折交叉驗證。

四、幾個系數定義說明

對於m個樣本：

某模型的估計值為：

定義：

總平方和 TSS(Total Sum of Squares) : 即樣本偽方差的m倍 Var(Y) = TSS / m

殘差平方和 RSS(Residual Sum of Squares): RSS也記作誤差平方和SSE (Sum of Squares for Error)

可解釋平方和ESS(Explained Sum of Squares) : ESS又稱為回歸平方和SSR(Sum of Squares for Regression)

決定系數：

TSS >= RSS + ESS，在無偏估計的時候取等號。

R^2越大，擬合效果越好。

需要額外說明的是，這里所謂的線性回歸，主要是針對的參數theta,並不是針對x,我們可以對原來的數據進行處理，比如平方得到x^2的數據，然后把這個看作一個影響因素，這樣最終得到的y關於x的圖形就不是線性的，但當然這也是線性回歸。

另外，還有局部加權的線性回歸的概念，這部分內容在SVM中進一步解釋。

五. Linear Regression for binary classfication

考慮到線性分類問題求解不方便，所以可不可以通過線性回歸來求解線性分類問題呢？

兩者的差別主要在於損失函數。

平方損失是0/1損失的一個上限。

所以，使用Linear Regression來求解Linear Regression也是有道理的。

使用 sklearn 庫來進行訓練數據測試數據的划分學習線性回歸庫的調用：

 1 #!/usr/bin/python
 2 # -*- coding:utf-8 -*-
 3 
 4 import numpy as np
 5 import matplotlib.pyplot as plt
 6 import pandas as pd
 7 from sklearn.model_selection import train_test_split
 8 from sklearn.linear_model import Lasso, Ridge
 9 from sklearn.model_selection import GridSearchCV
10 
11 
12 if __name__ == "__main__":
13     # pandas讀入
14     data = pd.read_csv('8.Advertising.csv')    # TV、Radio、Newspaper、Sales
15     x = data[['TV', 'Radio', 'Newspaper']]
16     # x = data[['TV', 'Radio']]
17     y = data['Sales']
18     print x
19     print y
20 
21     x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=1)
22     # print x_train, y_train
23     model = Lasso()
24     # model = Ridge()
25 
26     alpha_can = np.logspace(-3, 2, 10)  #10^(-3) ~ 10^(2) 等比10個數
27     lasso_model = GridSearchCV(model, param_grid={'alpha': alpha_can}, cv=5) #5折交叉驗證
28     lasso_model.fit(x, y)
29     print '驗證參數：\n', lasso_model.best_params_
30 
31     y_hat = lasso_model.predict(np.array(x_test))
32     mse = np.average((y_hat - np.array(y_test)) ** 2)  # Mean Squared Error
33     rmse = np.sqrt(mse)  # Root Mean Squared Error
34     print mse, rmse
35 
36     t = np.arange(len(x_test))
37     plt.plot(t, y_test, 'r-', linewidth=2, label='Test')
38     plt.plot(t, y_hat, 'g-', linewidth=2, label='Predict')
39     plt.legend(loc='upper right')
40     plt.grid()
41     plt.show()

View Code

Advertising 完整版代碼：

  1 #!/usr/bin/python
  2 # -*- coding:utf-8 -*-
  3 
  4 import csv
  5 import numpy as np
  6 import matplotlib.pyplot as plt
  7 import pandas as pd
  8 from sklearn.model_selection import train_test_split
  9 from sklearn.linear_model import LinearRegression
 10 
 11 
 12 if __name__ == "__main__":
 13     path = '8.Advertising.csv'
 14     # # 手寫讀取數據 - 請自行分析，在8.2.Iris代碼中給出類似的例子
 15     # f = file(path)
 16     # x = []
 17     # y = []
 18     # for i, d in enumerate(f):
 19     #     if i == 0: #第一行是標題欄
 20     #         continue
 21     #     d = d.strip() #去除首位空格
 22     #     if not d:
 23     #         continue
 24     #     d = map(float, d.split(',')) #每個數據都變為float
 25     #     x.append(d[1:-1])
 26     #     y.append(d[-1])
 27     # print x
 28     # print y
 29     # x = np.array(x) #顯示的更好看
 30     # y = np.array(y)
 31     # print x
 32     # print y
 33 
 34     # # Python自帶庫
 35     # f = file(path, 'rb')
 36     # print f
 37     # d = csv.reader(f)
 38     # for line in d:
 39     #     print line
 40     # f.close()
 41 
 42     # # numpy讀入
 43     # p = np.loadtxt(path, delimiter=',', skiprows=1)
 44     # print p
 45     # print p.shape
 46     # print p[1,2]
 47     # print type(p[1,2])
 48 
 49     # pandas讀入
 50     data = pd.read_csv(path)    # TV、Radio、Newspaper、Sales
 51     # x = data[['TV', 'Radio', 'Newspaper']]
 52     x = data[['TV', 'Radio']]
 53     y = data['Sales']
 54     # print x
 55     # print y
 56 
 57     # 繪制1
 58     plt.plot(data['TV'], y, 'ro', label='TV')
 59     plt.plot(data['Radio'], y, 'g^', label='Radio')
 60     plt.plot(data['Newspaper'], y, 'mv', label='Newspaer')
 61     plt.legend(loc='lower right')
 62     plt.grid()
 63     plt.show()
 64 
 65     # 繪制2
 66     plt.figure(figsize=(9,12)) #設置圖的大小 寬9inch 高12inch
 67     plt.subplot(311)
 68     plt.plot(data['TV'], y, 'ro')
 69     plt.title('TV')
 70     plt.grid()
 71     plt.subplot(312)
 72     plt.plot(data['Radio'], y, 'g^')
 73     plt.title('Radio')
 74     plt.grid()
 75     plt.subplot(313)
 76     plt.plot(data['Newspaper'], y, 'b*')
 77     plt.title('Newspaper')
 78     plt.grid()
 79     plt.tight_layout() # 緊湊顯示圖片，居中顯示
 80     plt.show()
 81 
 82     x_train, x_test, y_train, y_test = train_test_split(x, y, train_size = 0.75, random_state=1) #random_state 種子
 83     # print x_train, y_train
 84     linreg = LinearRegression()
 85     model = linreg.fit(x_train, y_train)
 86     print model
 87     print linreg.coef_ #系數
 88     print linreg.intercept_ #截距
 89 
 90     y_hat = linreg.predict(np.array(x_test))
 91     mse = np.average((y_hat - np.array(y_test)) ** 2)  # Mean Squared Error
 92     rmse = np.sqrt(mse)  # Root Mean Squared Error
 93     print mse, rmse
 94 
 95     t = np.arange(len(x_test))
 96     plt.plot(t, y_test, 'r-', linewidth=2, label='Test')
 97     plt.plot(t, y_hat, 'g-', linewidth=2, label='Predict')
 98     plt.legend(loc='upper right')
 99     plt.grid()
100     plt.show()

View Code

Advertising 正則化交叉驗證相關代碼：

View Code

線性回歸多項式擬合：

 1 #!/usr/bin/python
 2 # -*- coding:utf-8 -*-
 3 
 4 import numpy as np
 5 from sklearn.linear_model import LinearRegression, RidgeCV
 6 from sklearn.preprocessing import PolynomialFeatures
 7 import matplotlib.pyplot as plt
 8 from sklearn.pipeline import Pipeline
 9 import matplotlib as mpl
10 
11 
12 if __name__ == "__main__":
13     np.random.seed(0) # 指定種子
14     N = 9
15     x = np.linspace(0, 6, N) + np.random.randn(N)
16     x = np.sort(x)
17     y = x**2 - 4*x - 3 + np.random.randn(N)
18     # print x
19     # print y
20     x.shape = -1, 1
21     y.shape = -1, 1
22     # print x
23     # print y
24 
25     model_1 = Pipeline([
26         ('poly', PolynomialFeatures()),
27         ('linear', LinearRegression(fit_intercept=False))])
28     model_2 = Pipeline([
29         ('poly', PolynomialFeatures()),
30         ('linear', RidgeCV(alphas=np.logspace(-3, 2, 100), fit_intercept=False))])
31     models = model_1, model_2
32     mpl.rcParams['font.sans-serif'] = [u'simHei']
33     mpl.rcParams['axes.unicode_minus'] = False
34     np.set_printoptions(suppress=True)
35 
36     plt.figure(figsize=(9, 11), facecolor='w')
37     d_pool = np.arange(1, N, 1)  # 階
38     m = d_pool.size
39     clrs = []  # 顏色
40     for c in np.linspace(16711680, 255, m):
41         clrs.append('#%06x' % c)
42     line_width = np.linspace(5, 2, m)
43     titles = u'線性回歸', u'Ridge回歸'
44     for t in range(2):
45         model = models[t]
46         plt.subplot(2, 1, t+1)
47         plt.plot(x, y, 'ro', ms=10, zorder=N)
48         for i, d in enumerate(d_pool):
49             model.set_params(poly__degree=d)
50             model.fit(x, y)
51             lin = model.get_params('linear')['linear']
52             if t == 0:
53                 print u'%d階，系數為：' % d, lin.coef_.ravel()
54             else:
55                 print u'%d階，alpha=%.6f，系數為：' % (d, lin.alpha_), lin.coef_.ravel()
56             x_hat = np.linspace(x.min(), x.max(), num=100)
57             x_hat.shape = -1, 1
58             y_hat = model.predict(x_hat)
59             s = model.score(x, y)
60             print s, '\n'
61             zorder = N - 1 if (d == 2) else 0
62             plt.plot(x_hat, y_hat, color=clrs[i], lw=line_width[i], label=(u'%d階，score=%.3f' % (d, s)), zorder=zorder)
63         plt.legend(loc='upper left')
64         plt.grid(True)
65         plt.title(titles[t], fontsize=16)
66         plt.xlabel('X', fontsize=14)
67         plt.ylabel('Y', fontsize=14)
68     plt.tight_layout(1, rect=(0, 0, 1, 0.95))
69     plt.suptitle(u'多項式曲線擬合', fontsize=18)
70     plt.show()

View Code

不使用庫：

BGD 與 SGD:

 1 # -*- coding: cp936 -*-
 2 import numpy as np
 3 import matplotlib.pyplot as plt
 4 
 5 
 6 def linear_regression_BGD(x, y, alpha, lamda):
 7     m = np.alen(x)
 8     ones = np.ones(m)
 9     x = np.column_stack((ones, x))
10     n = np.alen(x[0])
11     theta = np.ones(n)
12     x_traverse = np.transpose(x)
13 
14     for i in range(1000):
15         hypothesis = np.dot(x, theta)
16         loss = hypothesis - y
17         cost = np.sum(loss ** 2)
18         print i, cost
19         gradient = np.dot(x_traverse, loss)
20         theta = theta - alpha * gradient
21     return theta
22 
23 def liear_regression_SGD(x, y, alpha, lamda):
24     m = np.alen(x)
25     ones = np.ones(m)
26     x = np.column_stack((ones, x))
27     n = np.alen(x[0])
28     theta = np.ones(n)
29     for j in range(1, m):
30         hypothesis = np.dot(x[j], theta)
31         loss = hypothesis - y[j]
32         gradient = np.dot(loss, x[j])
33         theta = theta - alpha * gradient
34     return theta
35 
36 
37 
38 if __name__ == '__main__':
39     N = 10
40     # x = np.linspace(0, 10, N) + np.random.randn(N)
41     # y =  3 * x + 5 + np.random.randn(N)
42 
43     x = np.linspace(0, 10, N) + np.random.randn(N)
44     y = 4 * x * x + 3 * x + 5 + np.random.randn(N)
45     x_square = x * x
46     x_predict = np.column_stack((x, x_square))
47 
48     # theta = linear_regression_BGD(x_predict, y, 0.00001,0.1) # 批量梯度下降
49     theta = liear_regression_SGD(x_predict, y, 0.0001, 0.1) # 隨機梯度下降
50     plt.plot(x, y, 'ro')
51 
52     x = np.linspace(x.min(), x.max(), 10 * N) # 構建測試數據
53     ones = np.ones(10 * N)
54     # x_predict = np.column_stack((ones, x))x
55     x_test = np.column_stack((ones, x, x * x))
56     y = np.dot(x_test, theta)
57 
58     plt.plot(x, y, 'b-')
59     plt.show()

View Code

Regression代碼：

import numpy as np
import matplotlib.pyplot as plt
def regression(data, alpha, lamda ):
    n = len(data[0]) - 1
    theta = np.zeros(n)
    times = 1
    for i in range(times):
        for d in data:
            x = d[:-1]
            y = d[-1]
            h_theta = np.dot(theta, x) - y
            theta = theta - alpha * h_theta * x + lamda * theta
        #print i,theta
    return theta
def preduceData():
   x = []
   y = []
   for i in range(0, 50):
       x.append(i)
       y.append(3 * x[i] + np.random.random_sample()*3)
   data = np.array([x, y]).T
   theta = regression(data, 0.001, 0.1)
   return x, y, theta
def myplot(x, y, theta):
    plt.figure()
    plt.plot(x, y, 'go')
    plt.plot(x, theta * x, 'r')
    plt.show()
x, y, theta = preduceData()
myplot(x, y, theta)

View Code

免責聲明！

本站轉載的文章為個人學習借鑒使用，本站對版權不負任何法律責任。如果侵犯了您的隱私權益，請聯系本站郵箱yoyou2525@163.com刪除。

猜您在找 線性回歸線性回歸線性回歸關於線性回歸線性回歸線性回歸線性回歸線性回歸線性回歸與邏輯回歸線性回歸和Logistic回歸