多元線性回歸
其實就是把簡單線性回歸向多維推廣:輸入x由單一特征變為含有n個特征的向量,每個特征各乘以一個系數θ1,…,θn,最后再加上截距(偏置)θ0。

求解思路也是跟簡單線性回歸一致。
一、多元線性回歸的數學推導
多元線性回歸的預測模型為:

\hat{y} = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \cdots + \theta_n x_n

為了把截距也納入統一的向量運算,給每個樣本補上一個恆為1的特征 x_0,記 X_b 為在原始數據 X 左側加上一列1后的矩陣,\theta = (\theta_0, \theta_1, \dots, \theta_n)^T,則預測值可以寫成 \hat{y} = X_b \theta。

目標是讓損失函數(殘差平方和)盡可能小:

J(\theta) = (y - X_b \theta)^T (y - X_b \theta)

對 \theta 求導並令導數等於零,可以解得:

\theta = (X_b^T X_b)^{-1} X_b^T y

這個式子就稱為多元線性回歸的正規方程解(Normal Equation)。
問題:時間復雜度高,為 O(n^3)(優化后約為 O(n^{2.4}))
優點:不需要對數據進行歸一化處理
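正規方程用NumPy幾行就能算出來。下面是一個示意性的小例子(數據和變量都是為演示而假設的),在人工生成的數據上直接套用上面的公式,求出的參數應當接近生成數據時用的真實參數:
import numpy as np

# 構造一組帶噪聲的線性數據:y = 4 + 3*x1 - 2*x2 + 噪聲
np.random.seed(0)
m = 100
X = np.random.rand(m, 2)
y = 4 + 3 * X[:, 0] - 2 * X[:, 1] + 0.1 * np.random.randn(m)

# 在左側補一列1,對應截距項theta0
X_b = np.hstack([np.ones((m, 1)), X])

# 正規方程:theta = (X_b^T X_b)^(-1) X_b^T y
theta = np.linalg.inv(X_b.T.dot(X_b)).dot(X_b.T).dot(y)
print(theta)  # 結果應接近 [4, 3, -2]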
二、多元線性回歸的編程實現
在實際的程序實現過程中,參數分為兩部分:截距和系數。

import numpy as np
from play_ML.multi_linear_regression.metrics import r2_score


class LinearRegression(object):

    def __init__(self):
        """初始化多元線性回歸模型"""
        self.coef_ = None            # 各特征對應的系數
        self.interception_ = None    # 截距
        self._theta = None           # 完整的參數向量(截距 + 系數)

    def fit_normal(self, x_train, y_train):
        """用正規方程解訓練模型"""
        assert x_train.shape[0] == y_train.shape[0], \
            "the size of x_train must be equal to the size of y_train"
        # 在x_train左側補一列1,對應截距項
        X = np.hstack([np.ones((len(x_train), 1)), x_train])
        # 正規方程:theta = (X^T X)^(-1) X^T y
        self._theta = np.linalg.inv(X.T.dot(X)).dot(X.T).dot(y_train)
        self.interception_ = self._theta[0]
        self.coef_ = self._theta[1:]
        return self

    def predict(self, x_predict):
        assert self.interception_ is not None and self.coef_ is not None, \
            "must fit before predict"
        assert x_predict.shape[1] == len(self.coef_), \
            "the feature number must be equal to x_train"
        X = np.hstack([np.ones((len(x_predict), 1)), x_predict])
        return X.dot(self._theta)

    def score(self, x_test, y_test):
        y_predict = self.predict(x_test)
        return r2_score(y_test, y_predict)

    def __repr__(self):
        return "Multi_Linear_Regression"
接下來使用sklearn中的波士頓房價數據集進行預測:
if __name__ == '__main__':
    import numpy as np
    from sklearn import datasets
    from sklearn.model_selection import train_test_split

    boston = datasets.load_boston()
    x = boston.data
    y = boston.target
    # 去掉y=50的封頂樣本(數據集中房價在50處被截斷)
    X = x[y < 50]
    Y = y[y < 50]
    x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=666)

    reg = LinearRegression()
    reg.fit_normal(x_train, y_train)
    print(reg.coef_)
    print(reg.interception_)
    print(reg.score(x_test, y_test))
輸出結果:
[-1.20354261e-01 3.64423279e-02 -3.61493155e-02 5.12978140e-02
-1.15775825e+01 3.42740062e+00 -2.32311760e-02 -1.19487594e+00
2.60101728e-01 -1.40219119e-02 -8.35430488e-01 7.80472852e-03
-3.80923751e-01]
34.117399723201785
0.8129794056212823
三、scikit-learn中的線性回歸
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
boston = datasets.load_boston()
x = boston.data
y = boston.target
X = x[y < 50]
Y = y[y < 50]
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=666)
reg = LinearRegression()
reg.fit(x_train, y_train)
print(reg.coef_)
print(reg.intercept_)
print(reg.score(x_test, y_test))
輸出結果:
array([-1.20354261e-01, 3.64423279e-02, -3.61493155e-02, 5.12978140e-02,
-1.15775825e+01, 3.42740062e+00, -2.32311760e-02, -1.19487594e+00,
2.60101728e-01, -1.40219119e-02, -8.35430488e-01, 7.80472852e-03,
-3.80923751e-01])
34.117399723229845
0.8129794056212809
對比輸出結果可以看到,和上面自己實現的結果一致。不過sklearn中封裝的實現和自己寫的還是有一些區別,后面會提到。
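一個常見的區別在於求解方式:工程實現一般不會顯式地計算 (X^T X)^{-1},而是交給數值上更穩定的最小二乘求解器。下面用 np.linalg.lstsq 做一個示意(沿用上面的 x_train、y_train,只是演示這一思路的草圖,並不代表 sklearn 的具體實現):
import numpy as np

# 同樣在左側補一列1,對應截距項
X_b = np.hstack([np.ones((len(x_train), 1)), x_train])

# 用最小二乘求解器代替顯式求逆,解出的theta應與正規方程一致
theta, residuals, rank, sv = np.linalg.lstsq(X_b, y_train, rcond=None)
print(theta[0])   # 截距
print(theta[1:])  # 系數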
四、使用KNN進行多元回歸
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
boston = datasets.load_boston()
x = boston.data
y = boston.target
X = x[y < 50]
Y = y[y < 50]
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=666)
knn_reg = KNeighborsRegressor()  # 默認n_neighbors=5
knn_reg.fit(x_train, y_train)
print(knn_reg.score(x_test, y_test))
輸出結果:0.5865412198300899
顯然,這個結果不夠理想,需要進行超參數搜索(網格搜索+交叉驗證)。
from sklearn.model_selection import GridSearchCV

param_grid = [
    {
        'weights': ['uniform'],
        'n_neighbors': [i for i in range(1, 11)]
    },
    {
        'weights': ['distance'],
        'n_neighbors': [i for i in range(1, 11)],
        'p': [i for i in range(1, 6)]
    }
]
knn_reg = KNeighborsRegressor()
grid_search = GridSearchCV(knn_reg, param_grid, n_jobs=-1, verbose=2)
grid_search.fit(x_train, y_train)
輸出結果:
Fitting 3 folds for each of 60 candidates, totalling 180 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 6 concurrent workers.
[Parallel(n_jobs=-1)]: Done 30 tasks | elapsed: 1.6s
[Parallel(n_jobs=-1)]: Done 180 out of 180 | elapsed: 2.0s finished
GridSearchCV(cv='warn', error_score='raise-deprecating',
estimator=KNeighborsRegressor(algorithm='auto', leaf_size=30, metric='minkowski',
metric_params=None, n_jobs=None, n_neighbors=5, p=2,
weights='uniform'),
fit_params=None, iid='warn', n_jobs=-1,
param_grid=[{'weights': ['uniform'], 'n_neighbors': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]},
{'weights': ['distance'], 'n_neighbors': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10], 'p': [1, 2, 3, 4, 5]}],
pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
scoring=None, verbose=2)
grid_search.best_params_
grid_search.best_score_
grid_search.best_estimator_
grid_search.best_estimator_.score(x_test, y_test)
輸出結果:
{'n_neighbors': 5, 'p': 1, 'weights': 'distance'}
0.6340477954176972
KNeighborsRegressor(algorithm='auto', leaf_size=30, metric='minkowski',
metric_params=None, n_jobs=None, n_neighbors=5, p=1,
weights='distance')
0.7044357727037996
通過網格搜索交叉驗證找到的最優參數組合為 n_neighbors=5、p=1(曼哈頓距離)、weights='distance'(按距離加權)。需要注意的是,best_score_ 給出的 0.6340477954176972 是訓練集上交叉驗證的平均得分,和前面多元線性回歸在測試集上算出的 R² 並不是同一種計算方式,沒有直接可比性。因此改用 grid_search.best_estimator_ 這個回歸器在測試集上計算 score,結果為 0.7044357727037996,相比多元線性回歸的 0.8129794056212823 還是有些差距。很多時候評價指標的計算口徑不同,就不能隨意地拿來比較。
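如果想做一個口徑一致的對比,可以把默認參數的KNN和搜索得到的最優KNN都放在同一個測試集上計算R²(下面的代碼沿用上面的變量,僅作示意):
from sklearn.metrics import r2_score

# 默認參數的KNN與搜索得到的最優KNN,都在同一個測試集上算R^2
best_knn = grid_search.best_estimator_
print(r2_score(y_test, knn_reg.fit(x_train, y_train).predict(x_test)))  # 約0.5865
print(r2_score(y_test, best_knn.predict(x_test)))                       # 約0.7044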
五、線性回歸算法的可解釋性
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
boston = datasets.load_boston()
x = boston.data
y = boston.target
X = x[y < 50]
Y = y[y < 50]
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=666)
reg = LinearRegression()
reg.fit(x_train, y_train)
print(reg.coef_)
輸出結果:
array([-1.20354261e-01, 3.64423279e-02, -3.61493155e-02, 5.12978140e-02,
-1.15775825e+01, 3.42740062e+00, -2.32311760e-02, -1.19487594e+00,
2.60101728e-01, -1.40219119e-02, -8.35430488e-01, 7.80472852e-03,
-3.80923751e-01])
這就是線性回歸求出的各個特征的系數,正數表示該特征與房價正相關,負數表示負相關。我們把這些系數從小到大排序(argsort返回的是排序后的索引)。
np.argsort(reg.coef_)
輸出結果:array([ 4, 7, 10, 12, 0, 2, 6, 9, 11, 1, 3, 8, 5], dtype=int64)
首先我們來先看一下都有哪些特征:
boston.feature_names
輸出結果:array(['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT'], dtype='<U7')
然后根據排序結果將這些特征進行排序:
boston.feature_names[np.argsort(reg.coef_)]
輸出結果:array(['NOX', 'DIS', 'PTRATIO', 'LSTAT', 'CRIM', 'INDUS', 'AGE', 'TAX', 'B', 'ZN', 'CHAS', 'RAD', 'RM'], dtype='<U7')
這些特征究竟是什么?通過DESCR查看:
print(boston.DESCR)
# 輸出結果:
- CRIM per capita crime rate by town
- ZN proportion of residential land zoned for lots over 25,000 sq.ft.
- INDUS proportion of non-retail business acres per town
- CHAS Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
- NOX nitric oxides concentration (parts per 10 million)
- RM average number of rooms per dwelling
- AGE proportion of owner-occupied units built prior to 1940
- DIS weighted distances to five Boston employment centres
- RAD index of accessibility to radial highways
- TAX full-value property-tax rate per $10,000
- PTRATIO pupil-teacher ratio by town
- B 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
- LSTAT % lower status of the population
- MEDV Median value of owner-occupied homes in $1000's
通過上面的描述可以看出,負相關最強的前三個特征分別是'NOX'(一氧化氮濃度)、'DIS'(到波士頓五個就業中心的加權距離)、'PTRATIO'(師生比);正相關最強的前三個特征分別是'RM'(每戶平均房間數)、'RAD'(到放射狀高速公路的可達性)、'CHAS'(是否緊鄰查爾斯河,類似"河景房")。從這些系數就能直觀地看出各特征對房價的影響方向和大小,說明線性回歸具有較好的可解釋性。
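如果想把特征名和對應的系數放在一起查看,可以按下面的方式整理(示意代碼,沿用上面的 reg 和 boston):
import numpy as np

# 按系數從小到大,把特征名和系數一一對應地打印出來
order = np.argsort(reg.coef_)
for name, coef in zip(boston.feature_names[order], reg.coef_[order]):
    print("%-8s %+.4f" % (name, coef))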
未完待續~
