09 Machine Learning in Action: Multiple Linear Regression


Basic Concepts

1. Difference from simple linear regression
          multiple independent variables (x)
 
2. Multiple regression model
     $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p + \varepsilon$
    where: $\beta_0, \beta_1, \beta_2, \ldots, \beta_p$ are the parameters
                 $\varepsilon$ is the error term
 
3. Multiple regression equation
     $E(y) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p$
 
4. Estimated multiple regression equation:
     $\hat{y} = b_0 + b_1 x_1 + b_2 x_2 + \cdots + b_p x_p$
 
    A sample is used to compute $b_0, b_1, b_2, \ldots, b_p$, the point estimates of $\beta_0, \beta_1, \beta_2, \ldots, \beta_p$.
5. Estimation procedure (similar to simple linear regression)
6. Estimation method
        minimize the sum of squares (see the formula below)
The computation is similar to simple linear regression and involves linear algebra and matrix algebra.
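
Concretely, the sum of squares being minimized is the residual sum of squares over the $m$ training samples (standard least-squares notation, matching the estimated equation above):

    $J(b_0, b_1, \ldots, b_p) = \sum_{i=1}^{m} \left( y^{(i)} - \hat{y}^{(i)} \right)^2 = \sum_{i=1}^{m} \left( y^{(i)} - b_0 - b_1 x_1^{(i)} - \cdots - b_p x_p^{(i)} \right)^2$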

Derivation

In vectorized form, with $X$ the $m \times (1+n)$ matrix of training samples (including a leading column of ones for the intercept) and $\theta$ the $(1+n) \times 1$ parameter vector, the goal is to minimize

    $J(\theta) = (X\theta - y)^T (X\theta - y) = \theta^T X^T X \theta - \theta^T X^T y - y^T X \theta + y^T y$

In the first term, $X^T X$ is a symmetric matrix. The first term is also a scalar: since $X$ is an $m \times (1+n)$ matrix and $\theta$ is a $(1+n) \times 1$ matrix, matrix multiplication shows that $\theta^T X^T X \theta$ has shape $1 \times 1$, i.e. it is a scalar.

Here we use vector differentiation, with the following property:

    $\frac{\partial}{\partial \theta}\left(\theta^T A \theta\right) = (A + A^T)\,\theta = 2A\theta$ for symmetric $A$

The second term is likewise a scalar by matrix multiplication. The vector-differentiation property used is:

    $\frac{\partial}{\partial \theta}\left(\theta^T b\right) = b$

The third term is likewise a scalar by matrix multiplication. The vector-differentiation property used is:

    $\frac{\partial}{\partial \theta}\left(b^T \theta\right) = b$

Applying these properties and setting the gradient to zero, $\nabla_\theta J = 2X^T X \theta - 2X^T y = 0$, gives the normal equation:

    $\theta = (X^T X)^{-1} X^T y$

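As a quick numerical sanity check of the normal equation (a minimal sketch, not from the original post; the toy data and variable names are illustrative):

import numpy as np

# Toy data: 100 samples, 3 features, known coefficients plus a little noise
rng = np.random.RandomState(0)
X = rng.rand(100, 3)
y = X.dot(np.array([2.0, -1.0, 0.5])) + 3.0 + rng.randn(100) * 0.01

# Prepend the column of ones that carries the intercept
X_b = np.hstack([np.ones((len(X), 1)), X])

# Normal equation: theta = (X^T X)^{-1} X^T y
theta = np.linalg.inv(X_b.T.dot(X_b)).dot(X_b.T).dot(y)

# Cross-check against numpy's least-squares solver
theta_lstsq, *_ = np.linalg.lstsq(X_b, y, rcond=None)
print(theta)                              # approximately [3.0, 2.0, -1.0, 0.5]
print(np.allclose(theta, theta_lstsq))    # True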

Code Implementation

In [1]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
In [12]:
boston = datasets.load_boston()
X = boston.data
y = boston.target
In [13]:
X.shape
Out[13]:
(506, 13)
In [14]:
X = X[y < 50.0]
y = y[y < 50.0]
In [15]:
X.shape
Out[15]:
(490, 13)
In [56]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y)
In [57]:
from ml09linearRegression2 import LinearRegression

reg = LinearRegression()
In [58]:
reg.fit_normal(X_train, y_train)
Out[58]:
LinearRegression()
In [59]:
reg.coef_
Out[59]:
array([-1.02165165e-01,  2.90834759e-02, -3.20513846e-02,  3.87701319e-01,
       -1.22357592e+01,  3.55691305e+00, -2.81445439e-02, -1.10019435e+00,
        2.37232297e-01, -1.35143455e-02, -8.66922512e-01,  5.86471407e-03,
       -3.67607741e-01])
In [60]:
reg.intercept_
Out[60]:
34.867232717346994
In [61]:
reg.score(X_test, y_test)
Out[61]:
0.7790737176672187
 
import numpy as np
from ml09metrics import r2_score


class LinearRegression:

    def __init__(self):
        """初始化Linear Regression模型"""
        self.coef_ = None
        self.intercept_ = None
        self._theta = None

    def fit_normal(self, X_train, y_train):
        """根據訓練數據集X_train, y_train訓練Linear Regression模型"""
        assert X_train.shape[0] == y_train.shape[0], \
            "the size of X_train must be equal to the size of y_train"

        X_b = np.hstack([np.ones((len(X_train), 1)), X_train])
        self._theta = np.linalg.inv(X_b.T.dot(X_b)).dot(X_b.T).dot(y_train)

        self.intercept_ = self._theta[0]
        self.coef_ = self._theta[1:]

        return self

    def predict(self, X_predict):
        """給定待預測數據集X_predict,返回表示X_predict的結果向量"""
        assert self.intercept_ is not None and self.coef_ is not None, \
            "must fit before predict!"
        assert X_predict.shape[1] == len(self.coef_), \
            "the feature number of X_predict must be equal to X_train"

        X_b = np.hstack([np.ones((len(X_predict), 1)), X_predict])
        return X_b.dot(self._theta)

    def score(self, X_test, y_test):
        """根據測試數據集 X_test 和 y_test 確定當前模型的准確度"""

        y_predict = self.predict(X_test)
        return r2_score(y_test, y_predict)

    def __repr__(self):
        return "LinearRegression()"

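A side note on fit_normal (my addition, not part of the original post): np.linalg.inv fails or becomes inaccurate when X_b.T.dot(X_b) is singular or ill-conditioned. A more robust option is to solve the least-squares problem directly; fit_normal_lstsq below is a hypothetical standalone variant of the same fit:

import numpy as np

def fit_normal_lstsq(X_train, y_train):
    """Hypothetical alternative to fit_normal: solve the least-squares
    problem directly instead of inverting X_b^T X_b (more numerically
    stable when that matrix is ill-conditioned)."""
    X_b = np.hstack([np.ones((len(X_train), 1)), X_train])
    # lstsq minimizes ||X_b @ theta - y||^2 without forming an explicit inverse
    theta, *_ = np.linalg.lstsq(X_b, y_train, rcond=None)
    return theta[0], theta[1:]  # intercept_, coef_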
 

import numpy as np
from math import sqrt


def accuracy_score(y_true, y_predict):
    """計算y_true和y_predict之間的准確率"""
    assert len(y_true) == len(y_predict), \
        "the size of y_true must be equal to the size of y_predict"

    return np.sum(y_true == y_predict) / len(y_true)


def mean_squared_error(y_true, y_predict):
    """計算y_true和y_predict之間的MSE"""
    assert len(y_true) == len(y_predict), \
        "the size of y_true must be equal to the size of y_predict"

    return np.sum((y_true - y_predict) ** 2) / len(y_true)


def root_mean_squared_error(y_true, y_predict):
    """計算y_true和y_predict之間的RMSE"""

    return sqrt(mean_squared_error(y_true, y_predict))


def mean_absolute_error(y_true, y_predict):
    """計算y_true和y_predict之間的RMSE"""
    assert len(y_true) == len(y_predict), \
        "the size of y_true must be equal to the size of y_predict"

    return np.sum(np.absolute(y_true - y_predict)) / len(y_true)


def r2_score(y_true, y_predict):
    """計算y_true和y_predict之間的R Square"""

    return 1 - mean_squared_error(y_true, y_predict) / np.var(y_true)

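As a quick check (my sketch, not in the original) that the hand-rolled r2_score matches scikit-learn's: 1 - MSE/Var(y_true) is algebraically the same as sklearn's 1 - SS_res/SS_tot.

import numpy as np
from sklearn.metrics import r2_score as sk_r2_score
from ml09metrics import r2_score

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_predict = np.array([2.5, 0.0, 2.0, 8.0])

# Both print the same value
print(r2_score(y_true, y_predict))      # hand-rolled version above
print(sk_r2_score(y_true, y_predict))   # scikit-learn's implementation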
 

Multiple Linear Regression with scikit-learn

In [32]:
from sklearn import datasets
In [33]:
boston = datasets.load_boston()
X = boston.data
y = boston.target
In [34]:
X = X[y < 50.0]
y = y[y < 50.0]
In [35]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=666)
In [36]:
from sklearn.linear_model import LinearRegression
In [37]:
lin_reg = LinearRegression()
In [38]:
lin_reg.fit(X_train, y_train)
Out[38]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False)
In [39]:
lin_reg.coef_
Out[39]:
array([-1.15625837e-01,  3.13179564e-02, -4.35662825e-02, -9.73281610e-02,
       -1.09500653e+01,  3.49898935e+00, -1.41780625e-02, -1.06249020e+00,
        2.46031503e-01, -1.23291876e-02, -8.79440522e-01,  8.31653623e-03,
       -3.98593455e-01])
In [40]:
lin_reg.intercept_
Out[40]:
32.59756158869987
In [41]:
lin_reg.score(X_test, y_test)
Out[41]:
0.8009390227581038
 

KNN Regressor

In [42]:
from sklearn.neighbors import KNeighborsRegressor

knn_reg = KNeighborsRegressor()
In [44]:
knn_reg.fit(X_train, y_train)
Out[44]:
KNeighborsRegressor(algorithm='auto', leaf_size=30, metric='minkowski',
          metric_params=None, n_jobs=None, n_neighbors=5, p=2,
          weights='uniform')
In [45]:
knn_reg.score(X_test, y_test)
Out[45]:
0.602674505080953
 

Grid Search in sklearn

In [46]:
from sklearn.model_selection import GridSearchCV
In [50]:
para_grid = [
    {"weights": ["uniform"],
     "n_neighbors": [i for i in range(1, 11)]
     },
    {"weights": ["distance"],
     "n_neighbors": [i for i in range(1, 11)],
     "p": [i for i in range(1, 6)]
     }
]
In [51]:
knn_reg = KNeighborsRegressor()
# n_jobs sets how many CPU cores to use; -1 means use all of them
grid_search = GridSearchCV(knn_reg, para_grid, n_jobs=-1, verbose=1)
grid_search.fit(X_train, y_train)
 
Fitting 3 folds for each of 60 candidates, totalling 180 fits
 
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  70 tasks      | elapsed:    3.1s
[Parallel(n_jobs=-1)]: Done 180 out of 180 | elapsed:    3.7s finished
C:\Users\Administrator\Envs\MachineLearning\lib\site-packages\sklearn\model_selection\_search.py:841: DeprecationWarning: The default of the `iid` parameter will change from True to False in version 0.22 and will be removed in 0.24. This will change numeric results when test-set sizes are unequal.
  DeprecationWarning)
Out[51]:
GridSearchCV(cv='warn', error_score='raise-deprecating',
       estimator=KNeighborsRegressor(algorithm='auto', leaf_size=30, metric='minkowski',
          metric_params=None, n_jobs=None, n_neighbors=5, p=2,
          weights='uniform'),
       fit_params=None, iid='warn', n_jobs=-1,
       param_grid=[{'weights': ['uniform'], 'n_neighbors': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]}, {'weights': ['distance'], 'n_neighbors': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10], 'p': [1, 2, 3, 4, 5]}],
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=1)
In [52]:
grid_search.best_params_
Out[52]:
{'n_neighbors': 6, 'p': 1, 'weights': 'distance'}
In [53]:
grid_search.best_score_  # this score is computed with cross-validation
Out[53]:
0.6060528490355778
In [54]:
grid_search.best_estimator_.score(X_test, y_test)
# this uses the same scoring criterion (R^2) as linear regression
Out[54]:
0.7353138117643773
 

More Thoughts on Multiple Linear Regression

In [55]:
import numpy as np

np.argsort(lin_reg.coef_)
Out[55]:
array([ 4,  7, 10, 12,  0,  3,  2,  6,  9, 11,  1,  8,  5], dtype=int64)
In [58]:
boston.feature_names
Out[58]:
array(['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD',
       'TAX', 'PTRATIO', 'B', 'LSTAT'], dtype='<U7')
In [59]:
boston.feature_names[np.argsort(lin_reg.coef_)]
Out[59]:
array(['NOX', 'DIS', 'PTRATIO', 'LSTAT', 'CRIM', 'CHAS', 'INDUS', 'AGE',
       'TAX', 'B', 'ZN', 'RAD', 'RM'], dtype='<U7')
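Reading the sorted result: features at the start of the list (NOX) push the predicted price down the most, and features at the end (RM) push it up the most. A small convenience sketch (my addition; it reuses the lin_reg and boston objects above) printing each feature next to its coefficient:

# Pair feature names with their learned coefficients, sorted by value
for name, coef in sorted(zip(boston.feature_names, lin_reg.coef_),
                         key=lambda pair: pair[1]):
    print(f"{name:8s}{coef: .6f}")

Keep in mind that raw coefficients depend on each feature's scale, so this ordering is only a rough indicator of influence.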
In [60]:
print(boston.DESCR)
 
.. _boston_dataset:

Boston house prices dataset
---------------------------

**Data Set Characteristics:**  

    :Number of Instances: 506 

    :Number of Attributes: 13 numeric/categorical predictive. Median Value (attribute 14) is usually the target.

    :Attribute Information (in order):
        - CRIM     per capita crime rate by town
        - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
        - INDUS    proportion of non-retail business acres per town
        - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
        - NOX      nitric oxides concentration (parts per 10 million)
        - RM       average number of rooms per dwelling
        - AGE      proportion of owner-occupied units built prior to 1940
        - DIS      weighted distances to five Boston employment centres
        - RAD      index of accessibility to radial highways
        - TAX      full-value property-tax rate per $10,000
        - PTRATIO  pupil-teacher ratio by town
        - B        1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
        - LSTAT    % lower status of the population
        - MEDV     Median value of owner-occupied homes in $1000's

    :Missing Attribute Values: None

    :Creator: Harrison, D. and Rubinfeld, D.L.

This is a copy of UCI ML housing dataset.
https://archive.ics.uci.edu/ml/machine-learning-databases/housing/


This dataset was taken from the StatLib library which is maintained at Carnegie Mellon University.

The Boston house-price data of Harrison, D. and Rubinfeld, D.L. 'Hedonic
prices and the demand for clean air', J. Environ. Economics & Management,
vol.5, 81-102, 1978.   Used in Belsley, Kuh & Welsch, 'Regression diagnostics
...', Wiley, 1980.   N.B. Various transformations are used in the table on
pages 244-261 of the latter.

The Boston house-price data has been used in many machine learning papers that address regression
problems.   
     
.. topic:: References

   - Belsley, Kuh & Welsch, 'Regression diagnostics: Identifying Influential Data and Sources of Collinearity', Wiley, 1980. 244-261.
   - Quinlan,R. (1993). Combining Instance-Based and Model-Based Learning. In Proceedings on the Tenth International Conference of Machine Learning, 236-243, University of Massachusetts, Amherst. Morgan Kaufmann. 

