1. Objective: build a linear regression model that accounts for interaction effects between the independent variables.
2. Data source and background
2.1 Data source: the data come from a case study used in one of my courses.
2.2 Data background: a company wants to predict how likely customers are to purchase a product based on its selling price and whether a discount is offered.
library(car)
library(ggplot2)
library(jtools)
library(readxl)

luxury <- read_excel("Luxury.xlsx")
str(luxury)
summary(luxury)
3. Application
3.1 Initial data exploration
Plot a scatter plot of the variables:
# get the intuition of the data
ggplot(luxury, aes(x = Selling_Price, y = Purchase_Intent,
                   col = as.factor(Discount_Offered))) +
  geom_point()
The scatter plot shows that when a discount is offered, purchase intent rises as the selling price increases; when no discount is offered, purchase intent rises as the selling price falls. This suggests an interaction effect between selling price and discount.
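As a quick numerical check of this intuition (a minimal sketch; the column names follow the dataset loaded above), one can fit a separate simple regression within each discount group and compare the signs of the price slopes:

# fit Purchase_Intent ~ Selling_Price separately within each discount group;
# opposite slope signs are consistent with an interaction effect
by(luxury, luxury$Discount_Offered,
   function(d) coef(lm(Purchase_Intent ~ Selling_Price, data = d)))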
3.2 Model building
Model 1: linear regression without an interaction term
Step 1: fit the linear regression model
# LM w/ no interactions/moderators
l0 <- lm(Purchase_Intent ~ Selling_Price + Discount_Offered, data = luxury)
summary(l0)
The model output shows that the regression of purchase intent on selling price is not significant.
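A quick way to see this (a sketch using base R; the exact numbers depend on the data) is to check whether the confidence interval for the Selling_Price coefficient contains zero:

# 95% confidence interval for the Selling_Price coefficient;
# an interval containing zero is consistent with a non-significant effect
confint(l0)["Selling_Price", ]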
Step 2: cross-validation and prediction accuracy:
# Create the splitting plan for 3-fold cross validation
set.seed(123)  # set the seed for reproducibility
library(vtreat)
splitPlan <- kWayCrossValidation(nrow(luxury), 3, NULL, NULL)

# get cross-val predictions for the main-effects-only model
luxury$lm.pred <- 0  # initialize the prediction vector
for (i in 1:3) {
  split <- splitPlan[[i]]
  lm_fold <- lm(Purchase_Intent ~ Selling_Price + Discount_Offered,
                data = luxury[split$train, ])
  # predict with the model fitted on this fold's training rows
  luxury$lm.pred[split$app] <- predict(lm_fold, newdata = luxury[split$app, ])
}

library(forecast)
accuracy(luxury$lm.pred, luxury$Purchase_Intent)
Model 2: regression with an interaction term
Step 1: mean-center the predictor. Mean-centering the selling price reduces the collinearity between the main effect and the interaction term, and makes the coefficients easier to interpret: each main effect is then evaluated at the mean of the other predictor.
# mean-center predictors
luxury$Mean_Center_Selling <- luxury$Selling_Price - mean(luxury$Selling_Price)
# alternative 1:
# luxury$Mean_Center_Selling <- c(scale(luxury$Selling_Price, center = TRUE, scale = FALSE))
# alternative 2: preProcess(data, method = 'center')
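As a sanity check (a minimal sketch), the centered variable should average to roughly zero while keeping the original spread:

# the centered predictor should have mean ~0; its spread is unchanged
mean(luxury$Mean_Center_Selling)   # should be ~0
sd(luxury$Mean_Center_Selling)     # same as sd of the original Selling_Price
sd(luxury$Selling_Price)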
Step 2: fit the model
# LM interaction w/ mean centering
l1 <- lm(Purchase_Intent ~ Discount_Offered * Mean_Center_Selling, data = luxury)
summary(l1)
Now all of the predictors are significantly related to the dependent variable, and the R-squared improves markedly.
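One benefit of the mean-centering is interpretability (a sketch; it assumes Discount_Offered is coded 0/1 with 0 meaning no discount): the Discount_Offered coefficient is now the estimated discount effect at the average selling price, and the Mean_Center_Selling coefficient is the price slope when no discount is offered.

# coefficients of the interaction model; names follow the formula above
coef(l1)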
Step 3: F test of whether Model 2 is a significant improvement over Model 1
# F test
anova(l0, l1)
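If the p-value reported by anova() is small (e.g. below 0.05), adding the interaction term significantly improves the fit over Model 1. A minimal sketch for pulling that p-value out programmatically:

# p-value of the F test comparing the nested models l0 and l1
anova(l0, l1)[2, "Pr(>F)"]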
Step 4: draw the interaction plot
Plot 1:
# interaction plot
library(sjPlot)
library(sjmisc)
plot_model(l1, type = "pred", terms = c("Mean_Center_Selling", "Discount_Offered"))
Plot 2:
# alternative: the pequod package
library(pequod)
# Fit moderated linear regression with both residual centering and mean centering methods.
l2 <- lmres(Purchase_Intent ~ Mean_Center_Selling * Discount_Offered, data = luxury)
# Simple slopes analysis for moderated regression
s_slopes <- simpleSlope(l2, pred = "Mean_Center_Selling", mod1 = "Discount_Offered")
# object: an object of class "lmres", i.e. a moderated regression fit
# pred:   name of the predictor variable
# mod1:   name of the first moderator variable
# mod2:   name of the second moderator variable; the default "none" analyses a two-way interaction
# Simple slopes plot
PlotSlope(s_slopes)
Step 5: cross-validation and prediction accuracy
# Get the cross-val predictions for the model with interactions
luxury$int.pred <- 0  # initialize the prediction vector
for (i in 1:3) {
  split <- splitPlan[[i]]
  int <- lm(Purchase_Intent ~ Mean_Center_Selling * Discount_Offered,
            data = luxury[split$train, ])
  luxury$int.pred[split$app] <- predict(int, newdata = luxury[split$app, ])
}
accuracy(luxury$int.pred, luxury$Purchase_Intent)
3.3 Comparing prediction accuracy across models
# compare RMSE
library(tidyr)
library(dplyr)
luxury %>%
  gather(key = modeltype, value = pred, lm.pred, int.pred) %>%
  mutate(residuals = Purchase_Intent - pred) %>%
  group_by(modeltype) %>%
  summarize(rmse = sqrt(mean(residuals^2)))
The results show that Model 2's cross-validated RMSE is only slightly smaller than Model 1's, while its R-squared is markedly higher.
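To back the R-squared comparison with numbers (a minimal sketch; the adjusted R-squared is the fairer comparison, since Model 2 has an extra term):

# R-squared and adjusted R-squared for the two models
summary(l0)$r.squared; summary(l0)$adj.r.squared
summary(l1)$r.squared; summary(l1)$adj.r.squared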