You can download the Rmd file here to follow along.

Purpose

Today we will review how to run models containing interactions between a continuous and a categorical predictor. We will go over how to specify interaction terms in R, how to interpret the model output, and how to visualize the results.

Be sure to have the following packages loaded:

library(rio) # for importing
library(tidyverse) # for plotting and data wrangling
library(psych) # for calculating descriptives
library(sjPlot) # for plotting interactions
library(emmeans) # for simple slope tests

Research scenario

Today’s dataset was inspired by a recent study by Markowitz & Levine (2021) (the data you will be working with has been simulated). In the study, participants completed a matrix task under time pressure and then self-reported their scores. For each matrix problem they got right, they could earn 25 cents, so it was tempting to cheat and self-report a higher score. Half of the participants shredded their worksheet before self-reporting, and half handed the worksheet to the experimenter before self-reporting. Participants also self-reported their honesty on the HEXACO Personality Inventory (from 1 = extremely low honesty to 5 = extremely high honesty). The researchers hypothesized that personality and situation would interact to predict dishonest behavior.

Import the data

#import data

data <- import("https://raw.githubusercontent.com/uopsych/psy612/master/labs/lab-7/data/cheating_data.csv")

data <- data %>% mutate(condition = factor(condition,
                                  levels = c(0,1),
                                  labels = c("Non-shredder", "Shredder")))
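
It's worth verifying that the recode worked; a quick optional sanity check:

table(data$condition) # counts per level confirm the factor labels applied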

Explore the data

  • Use str() to look at the structure of the data
  • Use head() to look at the first few rows of the data
  • Calculate descriptives for the variables honesty and claimed_solved
  • Calculate descriptives for the variables honesty and claimed_solved, grouped by condition
str(data)
## 'data.frame':    164 obs. of  3 variables:
##  $ condition     : Factor w/ 2 levels "Non-shredder",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ honesty       : num  1.44 3.3 1.83 1.22 2.72 ...
##  $ claimed_solved: int  6 6 4 8 7 5 5 5 7 6 ...
head(data)
data %>% summarise(m_honesty = mean(honesty),
                   m_solved = mean(claimed_solved))
data %>% 
  group_by(condition) %>% 
  summarise(m_honesty = mean(honesty),
            m_solved = mean(claimed_solved))
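
Since psych is loaded for descriptives, you can also get a fuller set of summary statistics (means, SDs, ranges, etc.) in one call — a sketch using describe() and describeBy():

psych::describe(data[, c("honesty", "claimed_solved")])

# the same descriptives, grouped by condition
psych::describeBy(data[, c("honesty", "claimed_solved")], group = data$condition)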

Simple regression

Model A

\[\hat{Claimed}_i = \beta_0 + \beta_1Honesty\] Run this model using lm().

model_a <- lm(data = data, claimed_solved ~ honesty)
summary(model_a)
## 
## Call:
## lm(formula = claimed_solved ~ honesty, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.7943 -0.9189  0.0864  0.8906  3.2935 
## 
## Coefficients:
##             Estimate Std. Error t value             Pr(>|t|)    
## (Intercept)   6.7614     0.3334  20.282 < 0.0000000000000002 ***
## honesty      -0.5019     0.1067  -4.706            0.0000054 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.408 on 162 degrees of freedom
## Multiple R-squared:  0.1202, Adjusted R-squared:  0.1148 
## F-statistic: 22.14 on 1 and 162 DF,  p-value: 0.000005396

Graph in ggplot().

data %>% 
ggplot(aes(x = honesty, y = claimed_solved)) +
  geom_point()+
  geom_smooth(method = "lm", se = FALSE)+
  geom_label(x=2, y = 5.9, label = "all") +
  theme_minimal()
## `geom_smooth()` using formula 'y ~ x'

Multiple regression

Model B

\[\hat{Claimed}_i = \beta_0 + \beta_1Honesty + \beta_2Condition\]

So far in this course, we’ve worked only with additive effects, like the one shown in this model: it estimates the relation between honesty and how many matrices the person claimed to have solved while holding condition constant.

Let’s run this model using lm().

model_b <- lm(data = data, claimed_solved ~ honesty + condition)
summary(model_b)

\[\hat{Claimed} = 6.04 + 1.12Condition - .45Honesty\]

Question: interpret the model coefficients and R^2.

Simple slopes

To calculate simple slopes, you evaluate the regression equation at specific levels of one of your variables. You normally wouldn’t calculate simple slopes for a purely additive model, since the honesty slope is the same at every level of condition; we do it here only so that we can compare these slopes to the interaction model’s simple slopes later.

When condition = 0 (there isn’t a shredder - you could get caught if you cheat)

\[Claimed = 6.04 + 1.12Condition - .45Honesty\] \[Claimed = 6.04 + 1.12(0) - .45Honesty\] \[Claimed = 6.04 - .45Honesty\]

When condition = 1 (there is a shredder - you can’t get caught if you cheat)

\[Claimed = 6.04 + 1.12Condition - .45Honesty\] \[Claimed = 6.04 + 1.12(1) - .45Honesty\] \[Claimed = 7.16 - .45Honesty\]
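
You can verify this algebra by asking the fitted model for predictions directly — a quick sketch, assuming the model_b object fit above:

# predictions at honesty = 0 recover the two intercepts (6.04 and 7.16)
predict(model_b, newdata = expand.grid(honesty = 0,
                                       condition = c("Non-shredder", "Shredder")))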

Question: now calculate the “simple slopes” when honesty = 0, honesty = 2, and honesty = 3

Plotting

data %>% 
  ggplot(aes(x = honesty, y = claimed_solved)) +
  geom_point() +
  geom_abline(slope = -.45, intercept = 6.04, color = "red", size = 1) +  # condition = 0
  geom_abline(slope = -.45, intercept = 7.16, color = "blue", size = 1) + # condition = 1
  geom_label(x = 1.5, y = 5.2, label = "no shredder") +
  geom_label(x = 1.5, y = 6.5, label = "shredder") +
  theme_minimal()

What model do you run if you think that the honesty slope is going to be different under different conditions?

Categorical x continuous interactions

Model C

\[\hat{Claimed}_i = \beta_0 + \beta_1Honesty + \beta_2Condition + \beta_3(Honesty*Condition)\] When the interaction term is added to the model, you allow for the possibility that the slope of one predictor differs at different levels of the other predictor. In this example, we can now account for the possibility that the relationship between honesty and matrices solved differs across levels of the condition variable.
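
One way to see this is to group the Honesty terms:

\[\hat{Claimed}_i = (\beta_0 + \beta_2Condition) + (\beta_1 + \beta_3Condition)Honesty\]

The slope on Honesty now depends on Condition: it equals the honesty coefficient alone when Condition = 0, and the honesty coefficient plus the interaction coefficient when Condition = 1.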

Running an interaction model

To run an interaction model with the function lm(), separate the interacting predictors with an asterisk on the right side of the formula, e.g., lm(Y ~ X*Z). This is equivalent to spelling out all of the terms, e.g., lm(Y ~ X + Z + X:Z).
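If you want to convince yourself these are the same, here is a quick sketch using the lab’s data (the object names are arbitrary):

# both forms expand to the same three terms plus an intercept
mod_star  <- lm(claimed_solved ~ honesty * condition, data = data)
mod_terms <- lm(claimed_solved ~ honesty + condition + honesty:condition, data = data)
all.equal(coef(mod_star), coef(mod_terms)) # TRUE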

model_c <- lm(data = data, claimed_solved ~ honesty*condition)
summary(model_c)
## 
## Call:
## lm(formula = claimed_solved ~ honesty * condition, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.9265 -0.8677  0.0491  1.0581  3.3533 
## 
## Coefficients:
##                           Estimate Std. Error t value             Pr(>|t|)    
## (Intercept)                 7.2708     0.4359  16.681 < 0.0000000000000002 ***
## honesty                    -0.8532     0.1356  -6.294        0.00000000284 ***
## conditionShredder          -1.1772     0.5887  -2.000               0.0472 *  
## honesty:conditionShredder   0.7781     0.1881   4.137        0.00005660703 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.235 on 160 degrees of freedom
## Multiple R-squared:  0.332,  Adjusted R-squared:  0.3194 
## F-statistic:  26.5 on 3 and 160 DF,  p-value: 0.00000000000005695

\[\hat{Claimed} = 7.27 - .85Honesty - 1.18Shredder + .77Honesty*Shredder\]

Question: interpret the model coefficients and R^2.

Simple slopes

Calculating simple slopes

Shredder = 0 (There isn’t a shredder - you could get caught if you cheat)

\[Claimed = 7.27 - .85Honesty - 1.18Shredder + .77Honesty*Shredder\] \[Claimed = 7.27 - .85Honesty - 1.18(0) + .77Honesty*(0)\] \[Claimed = 7.27 - .85Honesty\]

Shredder = 1 (There is a shredder - you can’t get caught if you cheat)

\[Claimed = 7.27 - .85Honesty - 1.18Shredder + .77Honesty*Shredder\] \[Claimed = 7.27 - .85Honesty - 1.18(1) + .77Honesty*(1)\] \[Claimed = 6.09 - .08Honesty\]
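
These simple slopes can also be pulled straight from the coefficient vector — a sketch using the coefficient names from the summary() output above:

b <- coef(model_c)
b["honesty"]                                  # honesty slope when condition = Non-shredder
b["honesty"] + b["honesty:conditionShredder"] # honesty slope when condition = Shredder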

Visualizing simple slopes

In ggplot:

data %>% 
ggplot(aes(x = honesty, y = claimed_solved)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, aes(color = condition)) +
  scale_color_manual(values = c("red", "blue")) + 
  theme_minimal() +
  theme(legend.position = "none") +
  geom_label(x = 4.5, y = 6, label = "shredder") +
  geom_label(x = 4.5, y = 3.6, label = "non-shredder")
## `geom_smooth()` using formula 'y ~ x'

In sjPlot:

plot_model(model_c, type = "pred", terms = c("honesty", "condition"))
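
If I remember the sjPlot API correctly, plot_model() can also find the interaction terms itself: type = "int" plots predicted values for the model’s interaction without you naming the terms.

plot_model(model_c, type = "int")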

Testing simple slopes

For this section, run the code below. Take some time to look at the results and try to figure out what tests we are running and what you can conclude from them.

emtrends(model_c, ~condition, var = "honesty")
##  condition    honesty.trend    SE  df lower.CL upper.CL
##  Non-shredder       -0.8532 0.136 160   -1.121   -0.586
##  Shredder           -0.0751 0.130 160   -0.333    0.182
## 
## Confidence level used: 0.95

Question: What is this testing? What can you conclude?

emtrends(model_c, pairwise~condition, var = "honesty")
## $emtrends
##  condition    honesty.trend    SE  df lower.CL upper.CL
##  Non-shredder       -0.8532 0.136 160   -1.121   -0.586
##  Shredder           -0.0751 0.130 160   -0.333    0.182
## 
## Confidence level used: 0.95 
## 
## $contrasts
##  contrast                  estimate    SE  df t.ratio p.value
##  (Non-shredder) - Shredder   -0.778 0.188 160  -4.137  0.0001

Question: What is this testing? What can you conclude?

# estimated means at every combination of honesty (1-5) and condition
mylist <- list(honesty = c(1,2,3,4,5), condition = c("Non-shredder","Shredder"))
combinations <- emmeans(model_c, ~ honesty*condition, at = mylist)

# compare the two conditions at each level of honesty
contrast(combinations, "pairwise", by = "honesty")
## honesty = 1:
##  contrast                  estimate    SE  df t.ratio p.value
##  (Non-shredder) - Shredder    0.399 0.416 160   0.960  0.3386
## 
## honesty = 2:
##  contrast                  estimate    SE  df t.ratio p.value
##  (Non-shredder) - Shredder   -0.379 0.264 160  -1.434  0.1534
## 
## honesty = 3:
##  contrast                  estimate    SE  df t.ratio p.value
##  (Non-shredder) - Shredder   -1.157 0.194 160  -5.968  <.0001
## 
## honesty = 4:
##  contrast                  estimate    SE  df t.ratio p.value
##  (Non-shredder) - Shredder   -1.935 0.276 160  -7.016  <.0001
## 
## honesty = 5:
##  contrast                  estimate    SE  df t.ratio p.value
##  (Non-shredder) - Shredder   -2.714 0.431 160  -6.303  <.0001

Question: What is this testing? What can you conclude?

Centering

One thing you may have noticed while interpreting output is that we sometimes said “when honesty = 0.” However, for this example, honesty cannot be 0 because the scale runs from 1 to 5, which makes those interpretations less meaningful. Instead, we can mean-center honesty so that 0 corresponds to the average honesty score.

data <- data %>%
  mutate(honesty_c = honesty - mean(honesty),            # center by hand
         honesty_center = scale(honesty, scale = FALSE)) # equivalent, using scale()
head(data)

Now, we can re-run our model…

model_d <- lm(data = data, claimed_solved ~ honesty_c*condition)
summary(model_d)
## 
## Call:
## lm(formula = claimed_solved ~ honesty_c * condition, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.9265 -0.8677  0.0491  1.0581  3.3533 
## 
## Coefficients:
##                             Estimate Std. Error t value             Pr(>|t|)
## (Intercept)                   4.7531     0.1379  34.476 < 0.0000000000000002
## honesty_c                    -0.8532     0.1356  -6.294        0.00000000284
## conditionShredder             1.1190     0.1938   5.775        0.00000003894
## honesty_c:conditionShredder   0.7781     0.1881   4.137        0.00005660703
##                                
## (Intercept)                 ***
## honesty_c                   ***
## conditionShredder           ***
## honesty_c:conditionShredder ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.235 on 160 degrees of freedom
## Multiple R-squared:  0.332,  Adjusted R-squared:  0.3194 
## F-statistic:  26.5 on 3 and 160 DF,  p-value: 0.00000000000005695

Question: now, how can we interpret the intercept?

Let’s compare that to our model where the predictor was not centered…

summary(model_c)
## 
## Call:
## lm(formula = claimed_solved ~ honesty * condition, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.9265 -0.8677  0.0491  1.0581  3.3533 
## 
## Coefficients:
##                           Estimate Std. Error t value             Pr(>|t|)    
## (Intercept)                 7.2708     0.4359  16.681 < 0.0000000000000002 ***
## honesty                    -0.8532     0.1356  -6.294        0.00000000284 ***
## conditionShredder          -1.1772     0.5887  -2.000               0.0472 *  
## honesty:conditionShredder   0.7781     0.1881   4.137        0.00005660703 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.235 on 160 degrees of freedom
## Multiple R-squared:  0.332,  Adjusted R-squared:  0.3194 
## F-statistic:  26.5 on 3 and 160 DF,  p-value: 0.00000000000005695

Question: what changed from the first model to the model with centered honesty?
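
Note that centering changes the coefficients’ meaning but not the model’s fit; the predictions are identical. A quick check, a sketch using the two models above:

# same fitted values, same R^2 -- only where honesty = 0 sits has moved
all.equal(fitted(model_c), fitted(model_d)) # TRUE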


Minihacks

You are interested in whether the time students spend studying (study_time) interacts with test anxiety (anxiety) to predict students’ test performance (perf).

Import the data below:

test_perf <- import("https://raw.githubusercontent.com/uopsych/psy612/master/labs/lab-7/data/test_perf.csv")
head(test_perf)
str(test_perf)
psych::describe(test_perf)

As you can see, anxiety is measured categorically (“low” or “high”) and study time ranges from 0.90 to 6.30 hours.

Minihack 1

Run a model testing whether test anxiety moderates the relationship between time spent studying and test performance. Center study time so that a value of 0 is meaningful.

test_perf <- test_perf %>% 
  mutate(anxiety = factor(anxiety,
                          levels= c("low", "high")),
         study_time_c = scale(study_time, scale = FALSE))

model <- lm(data= test_perf, perf ~ anxiety*study_time_c)
summary(model)
## 
## Call:
## lm(formula = perf ~ anxiety * study_time_c, data = test_perf)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.5858  -2.2278  -0.7574   1.0058  17.9678 
## 
## Coefficients:
##                          Estimate Std. Error t value             Pr(>|t|)    
## (Intercept)               79.4890     0.8682  91.552 < 0.0000000000000002 ***
## anxietyhigh               -6.2434     1.4033  -4.449        0.00003226284 ***
## study_time_c               7.8435     0.6197  12.657 < 0.0000000000000002 ***
## anxietyhigh:study_time_c  -6.1967     0.9290  -6.671        0.00000000523 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.815 on 69 degrees of freedom
## Multiple R-squared:  0.735,  Adjusted R-squared:  0.7235 
## F-statistic:  63.8 on 3 and 69 DF,  p-value: < 0.00000000000000022

Interpret the model coefficients.


Minihack 2

Visualize the interaction using sjPlot or ggplot.

test_perf %>% 
  ggplot(aes(x = study_time, y = perf, color = anxiety)) + 
  geom_smooth(method = "lm")
## `geom_smooth()` using formula 'y ~ x'
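
If you prefer sjPlot, the plot_model() pattern from earlier in the lab works here too (note that this model was fit with the centered predictor):

plot_model(model, type = "pred", terms = c("study_time_c", "anxiety"))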


Minihack 3

Test whether each simple slope is significantly different from 0.

emtrends(model, ~anxiety, var = "study_time_c")
##  anxiety study_time_c.trend    SE df lower.CL upper.CL
##  low                   7.84 0.620 69    6.607     9.08
##  high                  1.65 0.692 69    0.266     3.03
## 
## Confidence level used: 0.95

Test whether the simple slopes for low anxiety and high anxiety are significantly different from each other.

emtrends(model, pairwise~anxiety, var = "study_time_c")
## $emtrends
##  anxiety study_time_c.trend    SE df lower.CL upper.CL
##  low                   7.84 0.620 69    6.607     9.08
##  high                  1.65 0.692 69    0.266     3.03
## 
## Confidence level used: 0.95 
## 
## $contrasts
##  contrast   estimate    SE df t.ratio p.value
##  low - high      6.2 0.929 69   6.671  <.0001

What do the results of these significance tests mean?