Basic Concepts
1. Difference from simple linear regression
Multiple independent variables (x) instead of one
2. Multiple regression model
y = β0 + β1x1 + β2x2 + ... + βpxp + ε
where β0, β1, β2, ..., βp are the parameters and ε is the error term
3. Multiple regression equation
E(y) = β0 + β1x1 + β2x2 + ... + βpxp
4. Estimated multiple regression equation:
y_hat = b0 + b1x1 + b2x2 + ... + bpxp
A sample is used to compute b0, b1, b2, ..., bp, the point estimates of the parameters β0, β1, β2, ..., βp
5. Estimation procedure (similar to simple linear regression)
6. Estimation method
Minimize the sum of squared residuals (see the objective below)
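In matrix form (a standard formulation: X padded with a leading column of ones, θ stacking the intercept and the coefficients), the criterion is

$$\min_\theta J(\theta) = \sum_{i=1}^{m}\bigl(y^{(i)} - \hat{y}^{(i)}\bigr)^2 = (y - X\theta)^T (y - X\theta)$$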

The computation is similar to that of simple linear regression, but it involves linear algebra and matrix operations.
Derivation
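Expanding the objective gives the terms discussed below (a reconstruction of the standard expansion; the constant y^T y drops out when differentiating):

$$J(\theta) = (y - X\theta)^T (y - X\theta) = \theta^T X^T X \theta - \theta^T X^T y - y^T X \theta + y^T y$$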

In the first term, X^T X is a symmetric matrix, and the term as a whole is a scalar: X is an m*(1+n) matrix and θ is a (1+n)*1 vector, so by the rules of matrix multiplication the product θ^T X^T X θ is 1*1. Differentiating it uses the vector derivative properties given below.
The second term is likewise a scalar by the same matrix-multiplication argument, and its derivative follows from the same properties.
The third term is also a scalar, and its derivative is given by the same properties (see the sketch that follows).
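The vector derivative properties referred to above, applied term by term (a sketch of the standard derivation):

$$\frac{\partial}{\partial\theta}\theta^T A\theta = (A + A^T)\theta, \qquad \frac{\partial}{\partial\theta}\theta^T a = a, \qquad \frac{\partial}{\partial\theta}a^T\theta = a$$

With A = X^T X symmetric, (A + A^T)θ = 2X^T Xθ; setting the gradient to zero,

$$\nabla_\theta J = 2X^T X\theta - 2X^T y = 0 \;\Longrightarrow\; \theta = (X^T X)^{-1} X^T y,$$

which is exactly the normal equation computed by fit_normal in the code below.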

Code Implementation
In [1]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
In [12]:
boston = datasets.load_boston()
X = boston.data
y = boston.target
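Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2, so this cell only runs on older versions; on a newer install the same data can be fetched with, e.g., sklearn.datasets.fetch_openml(name="boston", version=1) (an alternative suggested here, not part of the original notebook).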
In [13]:
X.shape
Out[13]:
(506, 13)
In [14]:
X = X[y < 50.0]
y = y[y < 50.0]
In [15]:
X.shape
Out[15]:
(490, 13)
In [56]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y)
In [57]:
from ml09linearRegression2 import LinearRegression
reg = LinearRegression()
In [58]:
reg.fit_normal(X_train, y_train)
Out[58]:
LinearRegression()
In [59]:
reg.coef_
Out[59]:
In [60]:
reg.intercept_
Out[60]:
In [61]:
reg.score(X_test, y_test)
Out[61]:
The implementation in ml09linearRegression2.py:

import numpy as np
from ml09metrics import r2_score


class LinearRegression:

    def __init__(self):
        """Initialize the Linear Regression model"""
        self.coef_ = None
        self.intercept_ = None
        self._theta = None

    def fit_normal(self, X_train, y_train):
        """Train the Linear Regression model on X_train, y_train via the normal equation"""
        assert X_train.shape[0] == y_train.shape[0], \
            "the size of X_train must be equal to the size of y_train"

        # pad X with a leading column of ones so theta[0] is the intercept
        X_b = np.hstack([np.ones((len(X_train), 1)), X_train])
        # normal equation: theta = (X^T X)^(-1) X^T y
        self._theta = np.linalg.inv(X_b.T.dot(X_b)).dot(X_b.T).dot(y_train)

        self.intercept_ = self._theta[0]
        self.coef_ = self._theta[1:]
        return self

    def predict(self, X_predict):
        """Given a data set X_predict, return a vector of predictions"""
        assert self.intercept_ is not None and self.coef_ is not None, \
            "must fit before predict!"
        assert X_predict.shape[1] == len(self.coef_), \
            "the feature number of X_predict must be equal to X_train"

        X_b = np.hstack([np.ones((len(X_predict), 1)), X_predict])
        return X_b.dot(self._theta)

    def score(self, X_test, y_test):
        """Measure the accuracy of the current model on the test set X_test, y_test"""
        y_predict = self.predict(X_test)
        return r2_score(y_test, y_predict)

    def __repr__(self):
        return "LinearRegression()"
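A design note on fit_normal: explicitly inverting X^T X with np.linalg.inv is costly and fails when features are perfectly collinear (X^T X singular). A minimal alternative sketch, not part of the course code (fit_lstsq is a hypothetical helper), solves the same least-squares problem with NumPy's SVD-based solver:

import numpy as np

def fit_lstsq(X_train, y_train):
    """Sketch: solve min ||X_b @ theta - y||^2 without forming (X^T X)^(-1)."""
    # pad with the intercept column, as in fit_normal
    X_b = np.hstack([np.ones((len(X_train), 1)), X_train])
    # np.linalg.lstsq is SVD-based, so it also copes with singular X^T X
    theta, residuals, rank, singular_values = np.linalg.lstsq(X_b, y_train, rcond=None)
    return theta[0], theta[1:]  # (intercept_, coef_)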
The helper metrics in ml09metrics.py:

import numpy as np
from math import sqrt


def accuracy_score(y_true, y_predict):
    """Compute the classification accuracy between y_true and y_predict"""
    assert len(y_true) == len(y_predict), \
        "the size of y_true must be equal to the size of y_predict"
    return np.sum(y_true == y_predict) / len(y_true)


def mean_squared_error(y_true, y_predict):
    """Compute the MSE between y_true and y_predict"""
    assert len(y_true) == len(y_predict), \
        "the size of y_true must be equal to the size of y_predict"
    return np.sum((y_true - y_predict) ** 2) / len(y_true)


def root_mean_squared_error(y_true, y_predict):
    """Compute the RMSE between y_true and y_predict"""
    return sqrt(mean_squared_error(y_true, y_predict))


def mean_absolute_error(y_true, y_predict):
    """Compute the MAE between y_true and y_predict"""
    assert len(y_true) == len(y_predict), \
        "the size of y_true must be equal to the size of y_predict"
    return np.sum(np.absolute(y_true - y_predict)) / len(y_true)


def r2_score(y_true, y_predict):
    """Compute the R Square between y_true and y_predict"""
    return 1 - mean_squared_error(y_true, y_predict) / np.var(y_true)
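Why r2_score can be written as 1 - MSE/Var: R^2 = 1 - SS_residual/SS_total, and dividing both sums by the sample size m turns SS_residual/m into the MSE and SS_total/m into the variance of y_true. A quick sanity check against scikit-learn's own implementation (illustrative numbers; assumes sklearn.metrics is available):

import numpy as np
from sklearn.metrics import r2_score as sk_r2_score

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_predict = np.array([2.5, 0.0, 2.0, 8.0])

mse = np.mean((y_true - y_predict) ** 2)
assert np.isclose(1 - mse / np.var(y_true), sk_r2_score(y_true, y_predict))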
Multiple Linear Regression with scikit-learn
In [32]:
from sklearn import datasets
In [33]:
boston = datasets.load_boston()
X = boston.data
y = boston.target
In [34]:
X = X[y < 50.0]
y = y[y < 50.0]
In [35]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=666)
In [36]:
from sklearn.linear_model import LinearRegression
In [37]:
lin_reg = LinearRegression()
In [38]:
lin_reg.fit(X_train, y_train)
Out[38]:
In [39]:
lin_reg.coef_
Out[39]:
In [40]:
lin_reg.intercept_
Out[40]:
In [41]:
lin_reg.score(X_test, y_test)
Out[41]:
In [42]:
from sklearn.neighbors import KNeighborsRegressor
knn_reg = KNeighborsRegressor()
In [44]:
knn_reg.fit(X_train, y_train)
Out[44]:
In [45]:
knn_reg.score(X_test, y_test)
Out[45]:
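Note that kNN regression is distance-based, so its score on the unscaled Boston features depends heavily on its hyperparameters; the grid search below tunes them.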
In [46]:
from sklearn.model_selection import GridSearchCV
In [50]:
para_grid = [
    {
        "weights": ["uniform"],
        "n_neighbors": [i for i in range(1, 11)]
    },
    {
        "weights": ["distance"],
        "n_neighbors": [i for i in range(1, 11)],
        "p": [i for i in range(1, 6)]
    }
]
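para_grid defines two separate grids: the second one additionally searches p, the exponent of the Minkowski distance (p=1 is Manhattan, p=2 is Euclidean), which is searched here only in combination with distance weighting.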
In [51]:
knn_reg = KNeighborsRegressor()
# n_jobs sets how many CPU cores to use; -1 means all of them
grid_search = GridSearchCV(knn_reg, para_grid, n_jobs=-1, verbose=1)
grid_search.fit(X_train, y_train)
Out[51]:
In [52]:
grid_search.best_params_
Out[52]:
In [53]:
grid_search.best_score_ # this score is computed with cross-validation
Out[53]:
In [54]:
grid_search.best_estimator_.score(X_test, y_test)
# this uses the same scoring standard (R^2) as the linear regression above
Out[54]:
In [55]:
import numpy as np
np.argsort(lin_reg.coef_)
Out[55]:
In [58]:
boston.feature_names
Out[58]:
In [59]:
boston.feature_names[np.argsort(lin_reg.coef_)]
Out[59]:
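Sorting the coefficients and indexing boston.feature_names with the result ranks the features from the most negative to the most positive influence on the predicted price; this kind of interpretability is one of the main advantages of linear regression.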
In [60]:
print(boston.DESCR)