Linear Regression in Python


 

Analysis: the relationship between women's height and weight

The dataset comes from The World Almanac and Book of Facts (1975)
and gives the heights and weights of 15 women aged 30 to 39.

1. Linear Regression

# packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import statsmodels.api as sm

1.1 Data Preparation

data = pd.read_csv("women.csv", index_col=0)
X = data["height"]
X = sm.add_constant(X)  # add the intercept column
y = data["weight"]
data.describe()  # descriptive statistics of the data
 
            height      weight
count    15.000000   15.000000
mean     65.000000  136.733333
std       4.472136   15.498694
min      58.000000  115.000000
25%      61.500000  124.500000
50%      65.000000  135.000000
75%      68.500000  148.000000
max      72.000000  164.000000
 
plt.scatter(data["height"],data["weight"])
plt.show()
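The `women.csv` file itself is not included here, but its source, size, and summary statistics match R's built-in `women` dataset, so the frame can be reconstructed inline. A sketch, assuming that correspondence holds:

```python
import pandas as pd

# R's built-in `women` data (The World Almanac and Book of Facts, 1975),
# which women.csv appears to be an export of
women = pd.DataFrame({
    "height": list(range(58, 73)),  # inches
    "weight": [115, 117, 120, 123, 126, 129, 132, 135,
               139, 142, 146, 150, 154, 159, 164],  # pounds
})
print(women["height"].mean(), women["weight"].mean())  # 65.0, ~136.73
```

The means match the `describe()` output above (65.0 and 136.733), which is a quick sanity check that the reconstruction is the same data.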

1.2 Model Fitting

model1 = sm.OLS(y, X)    # ordinary least-squares model
result = model1.fit()    # fit the model
print(result.summary())  # print the regression results
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                 weight   R-squared:                       0.991
Model:                            OLS   Adj. R-squared:                  0.990
Method:                 Least Squares   F-statistic:                     1433.
Date:                Wed, 01 Apr 2020   Prob (F-statistic):           1.09e-14
Time:                        21:40:44   Log-Likelihood:                -26.541
No. Observations:                  15   AIC:                             57.08
Df Residuals:                      13   BIC:                             58.50
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const        -87.5167      5.937    -14.741      0.000    -100.343     -74.691
height         3.4500      0.091     37.855      0.000       3.253       3.647
==============================================================================
Omnibus:                        2.396   Durbin-Watson:                   0.315
Prob(Omnibus):                  0.302   Jarque-Bera (JB):                1.660
Skew:                           0.789   Prob(JB):                        0.436
Kurtosis:                       2.596   Cond. No.                         982.
==============================================================================
# Accessing individual pieces of the regression results:
result.params     # regression coefficients
result.rsquared   # goodness of fit (R-squared)
result.f_pvalue   # p-value of the F statistic
sm.stats.stattools.durbin_watson(result.resid)  # DW statistic: tests residual autocorrelation
sm.stats.stattools.jarque_bera(result.resid)    # JB statistic: tests residual normality; returns (JB, JB p-value, skewness, kurtosis)
(1.6595730644309838, 0.4361423787323849, 0.7893583826332282, 2.596304225738997)
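The coefficients in the summary table can be cross-checked against NumPy's least-squares solver. A sketch, using the same 15 observations written out inline so it runs without the CSV:

```python
import numpy as np

heights = np.arange(58.0, 73.0)
weights = np.array([115, 117, 120, 123, 126, 129, 132, 135,
                    139, 142, 146, 150, 154, 159, 164], dtype=float)

# design matrix [1, height], the same shape sm.add_constant produces above
A = np.column_stack([np.ones_like(heights), heights])
coef, *_ = np.linalg.lstsq(A, weights, rcond=None)
print(coef)  # ≈ [-87.5167, 3.45], matching result.params
```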

1.3 Model Prediction

y_pre = result.predict()
y_pre
array([112.58333333, 116.03333333, 119.48333333, 122.93333333,
       126.38333333, 129.83333333, 133.28333333, 136.73333333,
       140.18333333, 143.63333333, 147.08333333, 150.53333333,
       153.98333333, 157.43333333, 160.88333333])

1.4 Model Evaluation

# visualize the results
plt.rcParams['font.family'] = "SimHei"  # a font that can render CJK characters (used for the original Chinese title)
plt.plot(data["height"], data["weight"], "o")
plt.plot(data["height"], y_pre)
plt.title('Linear regression of weight on height for women')
 

Judging from the plot above, the simple linear regression does not fit well — the data curve away from the straight line — so we turn to polynomial regression.

2. Polynomial Regression

2.1 Data Preparation

data = pd.read_csv("women.csv", index_col=0)
X = data["height"]
y = data["weight"]
X = np.column_stack((X, np.power(X, 2), np.power(X, 3)))  # build cubic polynomial features
X = sm.add_constant(X)  # add the intercept term
X
array([[1.00000e+00, 5.80000e+01, 3.36400e+03, 1.95112e+05],
       [1.00000e+00, 5.90000e+01, 3.48100e+03, 2.05379e+05],
       [1.00000e+00, 6.00000e+01, 3.60000e+03, 2.16000e+05],
       [1.00000e+00, 6.10000e+01, 3.72100e+03, 2.26981e+05],
       [1.00000e+00, 6.20000e+01, 3.84400e+03, 2.38328e+05],
       [1.00000e+00, 6.30000e+01, 3.96900e+03, 2.50047e+05],
       [1.00000e+00, 6.40000e+01, 4.09600e+03, 2.62144e+05],
       [1.00000e+00, 6.50000e+01, 4.22500e+03, 2.74625e+05],
       [1.00000e+00, 6.60000e+01, 4.35600e+03, 2.87496e+05],
       [1.00000e+00, 6.70000e+01, 4.48900e+03, 3.00763e+05],
       [1.00000e+00, 6.80000e+01, 4.62400e+03, 3.14432e+05],
       [1.00000e+00, 6.90000e+01, 4.76100e+03, 3.28509e+05],
       [1.00000e+00, 7.00000e+01, 4.90000e+03, 3.43000e+05],
       [1.00000e+00, 7.10000e+01, 5.04100e+03, 3.57911e+05],
       [1.00000e+00, 7.20000e+01, 5.18400e+03, 3.73248e+05]])
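The same design matrix can be built in one call with `np.vander`, whose `increasing=True` option orders the columns as [1, x, x², x³] — identical to the `column_stack` + `add_constant` construction above:

```python
import numpy as np

x = np.arange(58.0, 73.0)
# Vandermonde matrix with columns [1, x, x**2, x**3]
X_poly = np.vander(x, 4, increasing=True)
print(X_poly.shape)  # (15, 4)
```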

2.2 Model Fitting

model2 = sm.OLS(y,X)
result = model2.fit()
print(result.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                 weight   R-squared:                       1.000
Model:                            OLS   Adj. R-squared:                  1.000
Method:                 Least Squares   F-statistic:                 1.679e+04
Date:                Wed, 01 Apr 2020   Prob (F-statistic):           2.07e-20
Time:                        22:09:27   Log-Likelihood:                 1.3441
No. Observations:                  15   AIC:                             5.312
Df Residuals:                      11   BIC:                             8.144
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const       -896.7476    294.575     -3.044      0.011   -1545.102    -248.393
x1            46.4108     13.655      3.399      0.006      16.356      76.466
x2            -0.7462      0.211     -3.544      0.005      -1.210      -0.283
x3             0.0043      0.001      3.940      0.002       0.002       0.007
==============================================================================
Omnibus:                        0.028   Durbin-Watson:                   2.388
Prob(Omnibus):                  0.986   Jarque-Bera (JB):                0.127
Skew:                           0.049   Prob(JB):                        0.939
Kurtosis:                       2.561   Cond. No.                     1.25e+09
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.25e+09. This might indicate that there are
strong multicollinearity or other numerical problems.

2.3 Model Prediction

y_pre = result.predict()
y_pre
array([114.63856209, 117.40676937, 120.18801264, 123.00780722,
       125.89166846, 128.86511168, 131.95365223, 135.18280543,
       138.57808662, 142.16501113, 145.9690943 , 150.01585147,
       154.33079796, 158.93944911, 163.86732026])

2.4 Model Evaluation

# visualize the results
plt.rcParams['font.family'] = "SimHei"  # a font that can render CJK characters (used for the original Chinese title)
plt.plot(data["height"], data["weight"], "o")
plt.plot(data["height"], y_pre)
plt.title('Polynomial regression of weight on height for women')
 

 

