Multiple Linear Regression


Multiple linear regression is simply the multivariate generalization of simple linear regression: the input x goes from a single feature to a vector of n features, each feature is multiplied by its own coefficient, and a final bias term θ is added.

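Written out, the prediction for a sample with features x_1, x_2, \ldots, x_n is

    \hat{y} = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \cdots + \theta_n x_n

where \theta_0 is the bias (intercept) and \theta_1, \ldots, \theta_n are the coefficients of the individual features.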

The solution approach is the same as for simple linear regression.

1. The Mathematical Derivation of Multiple Linear Regression

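Following the same idea as simple linear regression, we look for the θ that makes the squared error between predictions and true values as small as possible. A sketch of the standard derivation, using the usual convention of prepending a column of ones to the data matrix (written X_b):

    \hat{y} = X_b \theta, \qquad X_b = \begin{bmatrix} 1 & X \end{bmatrix}, \qquad \theta = (\theta_0, \theta_1, \ldots, \theta_n)^T

    J(\theta) = \sum_{i=1}^{m} \bigl( y^{(i)} - \hat{y}^{(i)} \bigr)^2 = (y - X_b\theta)^T (y - X_b\theta)

Setting the gradient with respect to θ to zero:

    \nabla_\theta J(\theta) = -2\, X_b^T (y - X_b\theta) = 0 \;\Longrightarrow\; X_b^T X_b\, \theta = X_b^T y

    \theta = (X_b^T X_b)^{-1} X_b^T y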

This last expression is known as the Normal Equation solution of multiple linear regression.

Drawback: the time complexity is O(n³) (which can be optimized to roughly O(n^2.4)).

Advantage: the data does not need to be normalized (no feature scaling is required).

2. Implementing Multiple Linear Regression

In the actual program implementation, the parameters are split into two parts: the intercept and the coefficients.

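Concretely, with a column of ones prepended to the data matrix, the full parameter vector is

    \theta = (\theta_0, \theta_1, \ldots, \theta_n)^T

where \theta_0 is the intercept (stored as interception_ below) and (\theta_1, \ldots, \theta_n) are the feature coefficients (stored as coef_ below).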

import numpy as np
from play_ML.multi_linear_regression.metrics import r2_score

class LinearRegression(object):

    def __init__(self):
        """Initialize the multiple linear regression model."""
        self.coef_ = None           # coefficients theta_1 ... theta_n
        self.interception_ = None   # intercept theta_0
        self._theta = None          # full parameter vector theta

    def fit_normal(self, x_train, y_train):
        """Fit with the normal equation: theta = (X_b^T X_b)^{-1} X_b^T y."""
        assert x_train.shape[0] == y_train.shape[0], \
            "the size of x_train must be equal to the size of y_train"

        # prepend a column of ones so the intercept is absorbed into theta
        X = np.hstack([np.ones((len(x_train), 1)), x_train])
        self._theta = np.linalg.inv(X.T.dot(X)).dot(X.T).dot(y_train)

        self.interception_ = self._theta[0]
        self.coef_ = self._theta[1:]

        return self

    def predict(self, x_predict):
        assert self.interception_ is not None and self.coef_ is not None, \
            "must fit before predict"
        assert x_predict.shape[1] == len(self.coef_), \
            "the feature number must be equal to x_train"

        X = np.hstack([np.ones((len(x_predict), 1)), x_predict])
        return X.dot(self._theta)

    def score(self, x_test, y_test):
        """Return the R^2 score on the test set."""
        y_predict = self.predict(x_test)
        return r2_score(y_test, y_predict)

    def __repr__(self):
        return "Multi_Linear_Regression"

Next, use the Boston housing price dataset from sklearn to test the model:

if __name__ == '__main__':
    import numpy as np
    from sklearn import datasets
    from sklearn.model_selection import train_test_split

    boston = datasets.load_boston()
    x = boston.data
    y = boston.target

    X = x[y < 50]
    Y = y[y < 50]

    x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=666)
    reg = LinearRegression()
    reg.fit_normal(x_train, y_train)
    print(reg.coef_)
    print(reg.interception_)
    print(reg.score(x_test, y_test))

Output:

[-1.20354261e-01  3.64423279e-02 -3.61493155e-02  5.12978140e-02
 -1.15775825e+01  3.42740062e+00 -2.32311760e-02 -1.19487594e+00
  2.60101728e-01 -1.40219119e-02 -8.35430488e-01  7.80472852e-03
 -3.80923751e-01]
34.117399723201785
0.8129794056212823

3. Linear Regression in scikit-learn

import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

boston = datasets.load_boston()
x = boston.data
y = boston.target

X = x[y < 50]
Y = y[y < 50]

x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=666)

reg = LinearRegression()
reg.fit(x_train, y_train)
print(reg.coef_)
print(reg.intercept_)
print(reg.score(x_test, y_test))

Output:

array([-1.20354261e-01,  3.64423279e-02, -3.61493155e-02,  5.12978140e-02,
       -1.15775825e+01,  3.42740062e+00, -2.32311760e-02, -1.19487594e+00,
        2.60101728e-01, -1.40219119e-02, -8.35430488e-01,  7.80472852e-03,
       -3.80923751e-01])
34.117399723229845
0.8129794056212809

Comparing the outputs, the results match the hand-written implementation above. That said, the function wrapped in sklearn does differ a little from what we wrote ourselves; this will come up again later.
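One such difference: instead of explicitly inverting X_b^T X_b as fit_normal does above, sklearn's LinearRegression solves the least-squares problem directly with a more numerically stable solver (a scipy least-squares routine). A minimal sketch of that idea with numpy, reusing x_train and y_train from above:

import numpy as np

# same augmentation as before: prepend a column of ones
X_b = np.hstack([np.ones((len(x_train), 1)), x_train])
# solve min ||X_b @ theta - y||^2 directly, without forming (X_b^T X_b)^{-1}
theta, residuals, rank, sv = np.linalg.lstsq(X_b, y_train, rcond=None)
intercept, coef = theta[0], theta[1:]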

4. Multivariate Regression with KNN

import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

boston = datasets.load_boston()
x = boston.data
y = boston.target

X = x[y < 50]
Y = y[y < 50]

x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=666)

knn_reg = KNeighborsRegressor()     # by default, k=5 is used
knn_reg.fit(x_train, y_train)
print(knn_reg.score(x_test, y_test))

Output: 0.5865412198300899

Clearly, hyperparameter tuning is needed:

from sklearn.model_selection import GridSearchCV

param_grid = [
    {
        'weights': ['uniform'],
        'n_neighbors': [i for i in range(1, 11)]
    },
    {
        'weights': ['distance'],
        'n_neighbors': [i for i in range(1, 11)],
        'p': [i for i in range(1, 6)]
    }
]
knn_reg = KNeighborsRegressor()
grid_search = GridSearchCV(knn_reg, param_grid, n_jobs=-1, verbose=2)
grid_search.fit(x_train, y_train)

Output:

Fitting 3 folds for each of 60 candidates, totalling 180 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 6 concurrent workers.
[Parallel(n_jobs=-1)]: Done  30 tasks      | elapsed:    1.6s
[Parallel(n_jobs=-1)]: Done 180 out of 180 | elapsed:    2.0s finished
GridSearchCV(cv='warn', error_score='raise-deprecating',
       estimator=KNeighborsRegressor(algorithm='auto', leaf_size=30, metric='minkowski',
          metric_params=None, n_jobs=None, n_neighbors=5, p=2,
          weights='uniform'),
       fit_params=None, iid='warn', n_jobs=-1,
       param_grid=[{'weights': ['uniform'], 'n_neighbors': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]}, 
{'weights': ['distance'], 'n_neighbors': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10], 'p': [1, 2, 3, 4, 5]}],
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=2)
grid_search.best_params_
grid_search.best_score_
grid_search.best_estimator_
grid_search.best_estimator_.score(x_test, y_test)

Output:

{'n_neighbors': 5, 'p': 1, 'weights': 'distance'}
0.6340477954176972
KNeighborsRegressor(algorithm='auto', leaf_size=30, metric='minkowski',
          metric_params=None, n_jobs=None, n_neighbors=5, p=1,
          weights='distance')
0.7044357727037996

With grid-search cross-validation, the score improves from 0.5865412198300899 (the test-set score before tuning) to a best cross-validated score of 0.6340477954176972, and the best parameter combination is k=5 with distance weighting using the Manhattan distance (p=1). Note, however, that best_score_ is computed by cross-validation on the training folds, which is not the same quantity as the test-set score reported for multiple linear regression, so the two are not directly comparable. Evaluating grid_search.best_estimator_ on the test set instead gives 0.7044357727037996, which still falls somewhat short of multiple linear regression. In general, scores computed in different ways should not be compared casually.
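To make the distinction explicit, the two kinds of score can be reproduced roughly as follows (a sketch; for regressors, cross_val_score uses the estimator's default R² scoring):

from sklearn.model_selection import cross_val_score

# mean R^2 over the training folds -- the same kind of number as grid_search.best_score_
cv_score = cross_val_score(grid_search.best_estimator_, x_train, y_train).mean()
# R^2 on the held-out test set -- the same kind of number as the linear-regression score above
test_score = grid_search.best_estimator_.score(x_test, y_test)
print(cv_score, test_score)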

5. The Interpretability of Linear Regression

import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

boston = datasets.load_boston()
x = boston.data
y = boston.target

X = x[y < 50]
Y = y[y < 50]

x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=666)

reg = LinearRegression()
reg.fit(x_train, y_train)
print(reg.coef_)

Output:

array([-1.20354261e-01,  3.64423279e-02, -3.61493155e-02,  5.12978140e-02,
       -1.15775825e+01,  3.42740062e+00, -2.32311760e-02, -1.19487594e+00,
        2.60101728e-01, -1.40219119e-02, -8.35430488e-01,  7.80472852e-03,
       -3.80923751e-01])

These are the coefficients that linear regression produces for each feature: a positive value indicates a positive correlation, a negative value a negative correlation. Let's sort them.

np.argsort(reg.coef_)

Output: array([ 4, 7, 10, 12, 0, 2, 6, 9, 11, 1, 3, 8, 5], dtype=int64)

First, let's see what the features are:

boston.feature_names

Output: array(['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT'], dtype='<U7')

Then order the feature names according to the sorted coefficients:

boston.feature_names[np.argsort(reg.coef_)]

Output: array(['NOX', 'DIS', 'PTRATIO', 'LSTAT', 'CRIM', 'INDUS', 'AGE', 'TAX', 'B', 'ZN', 'CHAS', 'RAD', 'RM'], dtype='<U7')

What exactly are these features? Check the dataset's DESCR:

print(boston.DESCR)
# Output:
        - CRIM     per capita crime rate by town
        - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
        - INDUS    proportion of non-retail business acres per town
        - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
        - NOX      nitric oxides concentration (parts per 10 million)
        - RM       average number of rooms per dwelling
        - AGE      proportion of owner-occupied units built prior to 1940
        - DIS      weighted distances to five Boston employment centres
        - RAD      index of accessibility to radial highways
        - TAX      full-value property-tax rate per $10,000
        - PTRATIO  pupil-teacher ratio by town
        - B        1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
        - LSTAT    % lower status of the population
        - MEDV     Median value of owner-occupied homes in $1000's

From the descriptions above, the three most negatively correlated features are 'NOX' (nitric oxides concentration), 'DIS' (weighted distance to the Boston employment centres) and 'PTRATIO' (pupil-teacher ratio), while the three most positively correlated are 'RM' (number of rooms), 'RAD' (accessibility to radial highways) and 'CHAS' (whether the tract bounds the Charles River, i.e. a waterfront location). These few observations show that linear regression has good interpretability.
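A quick way to read the ranking together with the actual coefficient values is to pair each feature name with its coefficient and sort (a small sketch reusing reg and boston from above):

# pair feature names with fitted coefficients and sort from most negative to most positive
for name, coef in sorted(zip(boston.feature_names, reg.coef_), key=lambda t: t[1]):
    print("%8s: %+.4f" % (name, coef))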

To be continued~

