【R語言學習筆記】11. 線性回歸中的交互效應(interaction)


 

1. 目的:構建線性回歸模型並考慮自變量之間的交互效應。

 

2. 數據來源及背景

2.1 數據來源:數據為本人上課的案例數據,

2.2 數據背景:一公司想通過商品銷售價格及是否提供打折來預測顧客購買商品的可能性。

library(car)
library(ggplot2)
library(jtools)

library(readxl)
luxury <- read_excel("Luxury.xlsx")
str(luxury)
summary(luxury)

 

 

 

3. 應用

3.1 數據初期探索

繪制變量間的散點圖

# get the intuition of the data
ggplot(luxury, aes(x = Selling_Price, y = Purchase_Intent, col = as.factor(Discount_Offered))) + geom_point()

 

 

通過散點圖,可以判斷出:提供折扣時,商品價格越高,購買者越容易購買;不提供折扣時,商品價格越低,購買者越容易購買,故存在交互效應(interaction)。

 

3.2 構建模型

模型1:無交互作用的線性回歸模型

step1:構建線性回歸模型

# LM w/no interactions/moderators
l0 <- lm(Purchase_Intent ~ Selling_Price + Discount_Offered, data = luxury)
summary(l0)

 

由模型結果可知,銷售價格對於購買意向的回歸並不顯著。

 

step2:交叉檢驗及模型預測准確性:

# Create the splitting plan for 3-fold cross validation
set.seed(123)  # set the seed for reproducibility
library(vtreat)
splitPlan <- kWayCrossValidation(nrow(luxury), 3, NULL, NULL)

# get cross-val predictions for main-effects only model
luxury$lm.pred <- 0  # initialize the prediction vector
for(i in 1:3) {
  split <- splitPlan[[i]]
  lm <- lm(Purchase_Intent ~ Selling_Price + Discount_Offered, data = luxury[split$train, ])
  luxury$lm.pred[split$app] <- predict(l0, newdata = luxury[split$app, ])
}

library(forecast)
accuracy(luxury$l0.pred, luxury$Purchase_Intent)

 

 

 

模型2:考慮交互作用的回歸模型

step1:對數據做中心化處理

# mean-center predictors#
luxury$Mean_Center_Selling <- luxury$Selling_Price - mean(luxury$Selling_Price)
# alternative1
# luxury$Mean_Center_Selling <- c(scale(luxury$Selling_Price,center = TRUE,scale = FALSE))
# alternative2: preProcess(data, method = 'center')

 

step2:構建模型

#LM Interaction w/mean Centering#
l1 <- lm(Purchase_Intent ~ Discount_Offered * Mean_Center_Selling
       , data = luxury)
summary(l1)

 

此時,全部自變量均與因變量顯著相關,且R-squared顯著提升。

 

step3:F檢驗檢測與模型1相比模型2是否顯著改進

# F test
anova(l0, l1)

 

 

 

step4:繪制interaction plot

plot1

# interaction plot
library(sjPlot)
library(sjmisc)
plot_model(l1, type = "pred", terms = c('Mean_Center_Selling', 'Discount_Offered'))

 

 

 

plot2

# alternative
library(pequod)
# Fit moderated linear regression with both residual centering and mean centering methods. l2 <- lmres(Purchase_Intent ~ Mean_Center_Selling * Discount_Offered, data=luxury)
# Simple slopes analysis for Moderated regression s_slopes <- simpleSlope(l2, pred = "Mean_Center_Selling", mod1 ="Discount_Offered") # object: an object of class "lmres": a moderated regression function. # pred: name of the predictor variable # mod1: name of the first moderator variable # mod2: name of the second moderator variable. Default "none" is used in order to analyzing two way interaction
# Simple slopes plot PlotSlope(s_slopes)

 

 

step5:交叉檢驗及模型預測准確性

# Get the cross-val predictions for the model with interactions
luxury$int.pred <- 0 # initialize the prediction vector
for(i in 1:3) {
  split <- splitPlan[[i]]
  int <- lm(Purchase_Intent ~ Mean_Center_Selling * Discount_Offered, data = luxury[split$train, ])
  luxury$int.pred[split$app] <- predict(int, newdata = luxury[split$app, ])
}

accuracy(luxury$int.pred, luxury$Purchase_Intent)

 

 

 

 

3.3 模型間准確性比較

# compare RMSE
library(tidyr)
library(dplyr)
luxury %>% 
  gather(key = modeltype, value = pred, lm.pred, int.pred) %>%
  mutate(residuals = Purchase_Intent - pred) %>%      
  group_by(modeltype) %>%
  summarize(rmse = sqrt(mean(residuals^2)))

 

根據數據結果可知,模型2的RMSE比模型1的RMSE稍微小一點,但是R-squared顯著提升。

  


免責聲明!

本站轉載的文章為個人學習借鑒使用,本站對版權不負任何法律責任。如果侵犯了您的隱私權益,請聯系本站郵箱yoyou2525@163.com刪除。



 
粵ICP備18138465號   © 2018-2025 CODEPRJ.COM