Using LASSO regression to predict medical charges from multiple factors
Main workflow:
- 1. Import packages
- 2. Import the dataset
- 3. Data preprocessing
- 3.1 Check for missing values
- 3.2 Label encoding & one-hot encoding
- 3.3 Get the features and the target
- 3.4 Split into training and test sets
- 3.5 Feature scaling
- 4. Build LASSO regression models with different hyperparameters
- 4.1 Model 1: Build a LASSO regression model
- 4.1.1 Build the LASSO regression model
- 4.1.2 Get the model expression
- 4.1.3 Predict on the test set
- 4.1.4 Get the model MSE
- 4.2 Model 2: Build a LASSO regression model
- 4.3 Model 3: Build a LASSO regression model
- 4.4 Model 4: Build a LASSO regression model
1. Import packages
In [1]:
# Import packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
2. Import the dataset
In [2]:
# Import the dataset
data = pd.read_csv('insurance.csv')
data.head()
Out[2]:
3. Data preprocessing
3.1 Check for missing values
In [3]:
# Check for missing values
null_df = data.isnull().sum()
null_df
Out[3]:
3.2 Label encoding & one-hot encoding
In [4]:
# Label encoding & one-hot encoding
data = pd.get_dummies(data, drop_first = True)
data.head()
Out[4]:
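To make the encoding step concrete, here is a minimal sketch of what `pd.get_dummies(..., drop_first=True)` does on a toy frame with the same kinds of columns as insurance.csv (the data values here are illustrative, not the actual dataset):

```python
import pandas as pd

# A toy frame with one numeric and two categorical columns
toy = pd.DataFrame({
    'age': [19, 18, 28],
    'sex': ['female', 'male', 'male'],
    'smoker': ['yes', 'no', 'no'],
})

encoded = pd.get_dummies(toy, drop_first=True)

# Each k-level categorical column becomes k-1 indicator columns; the dropped
# level is the implicit baseline, which avoids the dummy-variable trap.
print(encoded.columns.tolist())  # → ['age', 'sex_male', 'smoker_yes']
```

Dropping the first level matters for linear models with an intercept: keeping all k indicators would make the design matrix perfectly collinear.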
3.3 Get the features and the target
In [5]:
# Get the features and the target
y = data['charges'].values
data = data.drop(['charges'], axis = 1)
x = data.values
3.4 Split into training and test sets
In [6]:
# Split into training and test sets
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2, random_state = 1)
print(x_train.shape)
print(x_test.shape)
print(y_train.shape)
print(y_test.shape)
3.5 Feature scaling
In [7]:
# Feature scaling: fit the scalers on the training split only,
# then reuse the learned statistics on the test split
from sklearn.preprocessing import StandardScaler
sc_x = StandardScaler()
x_train = sc_x.fit_transform(x_train)
x_test = sc_x.transform(x_test)
sc_y = StandardScaler()
y_train = np.ravel(sc_y.fit_transform(y_train.reshape(-1, 1)))
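A small sketch (on synthetic numbers, not the insurance data) of why the scaler is fit on the training split only: `transform()` reuses the training mean and standard deviation, so the test split is scaled with statistics the model actually saw during fitting:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

train = np.array([[1.0], [2.0], [3.0]])
test = np.array([[4.0]])

sc = StandardScaler()
train_scaled = sc.fit_transform(train)  # learns mean=2.0, std=sqrt(2/3)
test_scaled = sc.transform(test)        # reuses the training statistics

print(sc.mean_[0])        # → 2.0
print(test_scaled[0, 0])  # → (4 - 2) / sqrt(2/3) ≈ 2.449
```

Calling `fit_transform` on the test split instead would leak test-set statistics into the preprocessing and make the evaluation optimistic.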
4. Build LASSO regression models with different hyperparameters
4.1 Model 1: Build a LASSO regression model
4.1.1 Build the LASSO regression model
In [8]:
# Build LASSO regression models with different hyperparameters
# Model 1: LASSO regression model (alpha = 0.1)
from sklearn.linear_model import Lasso
# The deprecated normalize= argument is omitted (it was removed in
# scikit-learn 1.2); the features were already standardized in step 3.5
regressor = Lasso(alpha = 0.1, fit_intercept = True)
regressor.fit(x_train, y_train)
Out[8]:
4.1.2 Get the model expression
In [9]:
# Get the model expression
print('The fitted equation is:\n Charges = ', end='')
columns = data.columns
coefs = regressor.coef_
for i in range(len(columns)):
    print('%s * %.5f + ' % (columns[i], coefs[i]), end='')
print(regressor.intercept_)
As the fitted equation shows, the coefficients of features such as bmi and children are exactly 0, so LASSO has achieved the goal of dimensionality reduction (feature selection).
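The sparsity observed above is the defining property of the L1 penalty, and it strengthens as alpha grows. A minimal illustration on synthetic data (not insurance.csv), where only the first two of six features actually influence the target:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
# Only features 0 and 1 matter; the other four are pure noise
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

zero_counts = {}
for alpha in (0.01, 0.1, 1.0):
    coefs = Lasso(alpha=alpha).fit(X, y).coef_
    zero_counts[alpha] = int(np.sum(coefs == 0.0))
    print(alpha, zero_counts[alpha])
```

As alpha increases, the noise features are driven exactly to zero while the genuinely informative coefficients survive (shrunk toward zero but nonzero).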
4.1.3 Predict on the test set
In [10]:
# Predict on the test set
y_pred = regressor.predict(x_test)
# Undo the target scaling so y_pred is back on the original scale
# (inverse_transform expects a 2-D array)
y_pred = sc_y.inverse_transform(y_pred.reshape(-1, 1)).ravel()
4.1.4 Get the model MSE
In [11]:
# Get the model MSE
from sklearn.metrics import mean_squared_error
mse_score = mean_squared_error(y_test, y_pred)
print('With alpha=0.1, the MSE of the LASSO regression model is:', format(mse_score, ','))
4.2 Model 2: Build a LASSO regression model
In [12]:
# Model 2: LASSO regression model (alpha = 0.01)
regressor = Lasso(alpha = 0.01, fit_intercept = True)
regressor.fit(x_train, y_train)
Out[12]:
In [13]:
# Get the model expression
print('The fitted equation is:\n Charges = ', end='')
columns = data.columns
coefs = regressor.coef_
for i in range(len(columns)):
    print('%s * %.2f + ' % (columns[i], coefs[i]), end='')
print(regressor.intercept_)
In [14]:
# Predict on the test set
y_pred = regressor.predict(x_test)
# Undo the target scaling so y_pred is back on the original scale
y_pred = sc_y.inverse_transform(y_pred.reshape(-1, 1)).ravel()
In [15]:
# Get the model MSE
mse_score = mean_squared_error(y_test, y_pred)
print('With alpha=0.01, the MSE of the LASSO regression model is:', format(mse_score, ','))
4.3 Model 3: Build a LASSO regression model
In [16]:
# Model 3: LASSO regression model (alpha = 1e-5)
regressor = Lasso(alpha = 1e-5, fit_intercept = True)
regressor.fit(x_train, y_train)
Out[16]:
In [17]:
# Get the model expression
print('The fitted equation is:\n Charges = ', end='')
columns = data.columns
coefs = regressor.coef_
for i in range(len(columns)):
    print('%s * %.2f + ' % (columns[i], coefs[i]), end='')
print(regressor.intercept_)
In [18]:
# Predict on the test set
y_pred = regressor.predict(x_test)
# Undo the target scaling so y_pred is back on the original scale
y_pred = sc_y.inverse_transform(y_pred.reshape(-1, 1)).ravel()
In [19]:
# Get the model MSE
mse_score = mean_squared_error(y_test, y_pred)
print('With alpha=1e-5, the MSE of the LASSO regression model is:', format(mse_score, ','))
4.4 Model 4: Build a LASSO regression model
In [20]:
# Model 4: LASSO regression model (alpha = 1e-9)
regressor = Lasso(alpha = 1e-9, fit_intercept = True)
regressor.fit(x_train, y_train)
Out[20]:
In [21]:
# Get the model expression
print('The fitted equation is:\n Charges = ', end='')
columns = data.columns
coefs = regressor.coef_
for i in range(len(columns)):
    print('%s * %.2f + ' % (columns[i], coefs[i]), end='')
print(regressor.intercept_)
In [22]:
# Predict on the test set
y_pred = regressor.predict(x_test)
# Undo the target scaling so y_pred is back on the original scale
y_pred = sc_y.inverse_transform(y_pred.reshape(-1, 1)).ravel()
In [23]:
# Get the model MSE
mse_score = mean_squared_error(y_test, y_pred)
print('With alpha=1e-9, the MSE of the LASSO regression model is:', format(mse_score, ','))
Conclusion: as the four models above show, different values of the alpha hyperparameter lead to noticeably different LASSO model performance, so alpha should be tuned rather than picked arbitrarily.
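Rather than trying alpha values by hand as above, scikit-learn's `LassoCV` can select alpha by cross-validation over a grid. A sketch on synthetic data (in this notebook one would pass `x_train` and `y_train` instead):

```python
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 5))
y = 2.0 * X[:, 0] + rng.normal(scale=0.5, size=300)

# Search a log-spaced grid of alpha values with 5-fold cross-validation
alphas = np.logspace(-4, 0, 20)
reg = LassoCV(alphas=alphas, cv=5).fit(X, y)
print('chosen alpha:', reg.alpha_)
```

The fitted `reg` can then be used like any other regressor, and `reg.alpha_` reports which grid value minimized the cross-validated error.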
