Predicting Medical Costs from Multiple Factors with LASSO Regression
Workflow:
- 1. Import packages
- 2. Load the dataset
- 3. Data preprocessing
- 3.1 Check for missing values
- 3.2 Label encoding & one-hot encoding
- 3.3 Separate the independent and dependent variables
- 3.4 Split into training and test sets
- 3.5 Feature scaling
- 4. Build LASSO regression models with different hyperparameters
- 4.1 Model 1: LASSO regression (alpha = 0.1)
- 4.1.1 Build the model
- 4.1.2 Print the model expression
- 4.1.3 Predict on the test set
- 4.1.4 Compute the model MSE
- 4.2 Model 2: LASSO regression (alpha = 0.01)
- 4.3 Model 3: LASSO regression (alpha = 1e-5)
- 4.4 Model 4: LASSO regression (alpha = 1e-9)
1. Import packages
In [1]:
# Import packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
2. Load the dataset
In [2]:
# Load the dataset
data = pd.read_csv('insurance.csv')
data.head()
Out[2]:
3. Data preprocessing
3.1 Check for missing values
In [3]:
# Check for missing values
null_df = data.isnull().sum()
null_df
Out[3]:
3.2 Label encoding & one-hot encoding
In [4]:
# Label encoding & one-hot encoding
data = pd.get_dummies(data, drop_first = True)
data.head()
Out[4]:
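As a quick illustration of what `pd.get_dummies(..., drop_first=True)` does, here is a minimal sketch on toy data (hypothetical values, not the insurance dataset): each categorical column becomes dummy columns, with the first category dropped to avoid collinearity, while numeric columns pass through unchanged.

```python
import pandas as pd

# Toy frame: one binary and one two-level categorical column, plus a numeric one
df = pd.DataFrame({
    'sex': ['male', 'female', 'male'],
    'region': ['ne', 'sw', 'ne'],
    'age': [19, 33, 27],
})

# drop_first=True drops one dummy per categorical column:
# 'sex' keeps only 'sex_male', 'region' keeps only 'region_sw'
encoded = pd.get_dummies(df, drop_first=True)
print(encoded.columns.tolist())  # -> ['age', 'sex_male', 'region_sw']
```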
3.3 Separate the independent and dependent variables
In [5]:
# Separate the independent and dependent variables
y = data['charges'].values
data = data.drop(['charges'], axis = 1)
x = data.values
3.4 Split into training and test sets
In [6]:
# Split into training and test sets
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2, random_state = 1)
print(x_train.shape)
print(x_test.shape)
print(y_train.shape)
print(y_test.shape)
3.5 Feature scaling
In [7]:
# Feature scaling
from sklearn.preprocessing import StandardScaler
sc_x = StandardScaler()
x_train = sc_x.fit_transform(x_train)
x_test = sc_x.transform(x_test)
sc_y = StandardScaler()
y_train = np.ravel(sc_y.fit_transform(y_train.reshape(-1, 1)))
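The target scaler `sc_y` fitted here is reused further down to map predictions back to the original charge units. As a quick sanity check of that round trip (toy numbers, not the real charges), `inverse_transform` exactly undoes `fit_transform`:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy "charges" column; StandardScaler expects a 2-D array
vals = np.array([100.0, 250.0, 400.0]).reshape(-1, 1)
sc = StandardScaler()
scaled = sc.fit_transform(vals)            # zero mean, unit variance
restored = sc.inverse_transform(scaled)    # back to the original units
print(restored.ravel())                    # -> [100. 250. 400.]
```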
4. Build LASSO regression models with different hyperparameters
4.1 Model 1: LASSO regression (alpha = 0.1)
4.1.1 Build the model
In [8]:
# Build LASSO regression models with different hyperparameters
# Model 1: LASSO regression with alpha = 0.1
from sklearn.linear_model import Lasso
# Note: the 'normalize' keyword was removed in scikit-learn 1.2; False was its default
regressor = Lasso(alpha = 0.1, fit_intercept = True)
regressor.fit(x_train, y_train)
Out[8]:
4.1.2 Print the model expression
In [9]:
# Print the fitted model expression
print('The model expression is:\n Charges = ', end='')
columns = data.columns
coefs = regressor.coef_
for i in range(len(columns)):
    print('%s * %.5f + ' % (columns[i], coefs[i]), end='')
print(regressor.intercept_)
As the expression shows, the coefficients of features such as bmi and children are exactly 0: LASSO's L1 penalty drops uninformative features, achieving dimensionality reduction.
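The sparsity seen above depends directly on the penalty strength. A minimal sketch on hypothetical synthetic data (not the insurance dataset) shows that larger alpha values drive more coefficients to exactly zero:

```python
import numpy as np
from sklearn.linear_model import Lasso

# Hypothetical synthetic data: only 4 of 10 features truly matter
rng = np.random.RandomState(0)
X = rng.randn(200, 10)
true_coef = np.array([3.0, -2.0, 1.5, 0.0, 0.0, 0.0, 0.0, 0.5, 0.0, 0.0])
y = X @ true_coef + 0.1 * rng.randn(200)

# Count coefficients driven to exactly zero at each penalty strength
zero_counts = {}
for alpha in (1.0, 0.1, 0.001):
    model = Lasso(alpha=alpha).fit(X, y)
    zero_counts[alpha] = int(np.sum(model.coef_ == 0.0))
    print('alpha=%s: %d of 10 coefficients are exactly 0' % (alpha, zero_counts[alpha]))
```

With the strong penalty even the weak-but-real feature (coefficient 0.5) is zeroed, while a tiny alpha keeps nearly every coefficient nonzero, which is the trade-off the four models below explore.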
4.1.3 Predict on the test set
In [10]:
# Predict on the test set
y_pred = regressor.predict(x_test)
# Map y_pred back to the original (pre-scaling) units; inverse_transform needs a 2-D array
y_pred = sc_y.inverse_transform(y_pred.reshape(-1, 1)).ravel()
4.1.4 Compute the model MSE
In [11]:
# Compute the model's MSE
from sklearn.metrics import mean_squared_error
mse_score = mean_squared_error(y_test, y_pred)
print('MSE of the LASSO regression model with alpha = 0.1:', format(mse_score, ','))
4.2 Model 2: LASSO regression (alpha = 0.01)
In [12]:
# Model 2: LASSO regression with alpha = 0.01
regressor = Lasso(alpha = 0.01, fit_intercept = True)
regressor.fit(x_train, y_train)
Out[12]:
In [13]:
# Print the fitted model expression
print('The model expression is:\n Charges = ', end='')
columns = data.columns
coefs = regressor.coef_
for i in range(len(columns)):
    print('%s * %.2f + ' % (columns[i], coefs[i]), end='')
print(regressor.intercept_)
In [14]:
# Predict on the test set
y_pred = regressor.predict(x_test)
# Map y_pred back to the original (pre-scaling) units
y_pred = sc_y.inverse_transform(y_pred.reshape(-1, 1)).ravel()
In [15]:
# Compute the model's MSE
mse_score = mean_squared_error(y_test, y_pred)
print('MSE of the LASSO regression model with alpha = 0.01:', format(mse_score, ','))
4.3 Model 3: LASSO regression (alpha = 1e-5)
In [16]:
# Model 3: LASSO regression with alpha = 1e-5
regressor = Lasso(alpha = 1e-5, fit_intercept = True)
regressor.fit(x_train, y_train)
Out[16]:
In [17]:
# Print the fitted model expression
print('The model expression is:\n Charges = ', end='')
columns = data.columns
coefs = regressor.coef_
for i in range(len(columns)):
    print('%s * %.2f + ' % (columns[i], coefs[i]), end='')
print(regressor.intercept_)
In [18]:
# Predict on the test set
y_pred = regressor.predict(x_test)
# Map y_pred back to the original (pre-scaling) units
y_pred = sc_y.inverse_transform(y_pred.reshape(-1, 1)).ravel()
In [19]:
# Compute the model's MSE
mse_score = mean_squared_error(y_test, y_pred)
print('MSE of the LASSO regression model with alpha = 1e-5:', format(mse_score, ','))
4.4 Model 4: LASSO regression (alpha = 1e-9)
In [20]:
# Model 4: LASSO regression with alpha = 1e-9
regressor = Lasso(alpha = 1e-9, fit_intercept = True)
regressor.fit(x_train, y_train)
Out[20]:
In [21]:
# Print the fitted model expression
print('The model expression is:\n Charges = ', end='')
columns = data.columns
coefs = regressor.coef_
for i in range(len(columns)):
    print('%s * %.2f + ' % (columns[i], coefs[i]), end='')
print(regressor.intercept_)
In [22]:
# Predict on the test set
y_pred = regressor.predict(x_test)
# Map y_pred back to the original (pre-scaling) units
y_pred = sc_y.inverse_transform(y_pred.reshape(-1, 1)).ravel()
In [23]:
# Compute the model's MSE
mse_score = mean_squared_error(y_test, y_pred)
print('MSE of the LASSO regression model with alpha = 1e-9:', format(mse_score, ','))
Conclusion: comparing the four models above shows that the choice of the regularization hyperparameter alpha clearly affects the LASSO model's performance, both in test-set MSE and in how many coefficients are shrunk to exactly zero.
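Rather than trying alpha values by hand as above, scikit-learn's `LassoCV` can select alpha by cross-validation. A minimal sketch on hypothetical synthetic data (the variables here stand in for the scaled training matrix, not the real insurance features):

```python
import numpy as np
from sklearn.linear_model import LassoCV

# Hypothetical synthetic data with a few informative features
rng = np.random.RandomState(1)
X = rng.randn(300, 8)
coef = np.array([2.0, 0.0, -1.0, 0.0, 0.5, 0.0, 0.0, 0.0])
y = X @ coef + 0.2 * rng.randn(300)

# 5-fold cross-validation over a log-spaced grid of candidate alphas
reg = LassoCV(alphas=np.logspace(-4, 1, 30), cv=5)
reg.fit(X, y)
print('alpha selected by cross-validation:', reg.alpha_)
```

The same pattern could be applied to `x_train`/`y_train` above to replace the manual alpha comparison with a principled search.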