You can download the Rmd file here to follow along.
Today we will review how to run models containing interactions between a continuous and categorical predictor. We will go over how to specify interaction terms in R, how to interpret the model output, and how to visualize the results.
Be sure to have the following packages loaded:
library(rio) # for importing
library(tidyverse) # for plotting and data wrangling
library(psych) # for calculating descriptives
library(sjPlot) # for plotting interactions
library(emmeans) # for simple slope tests
Today’s dataset was inspired by a recent study by Markowitz & Levine (2021) (the data you will be working with has been simulated). In the study, participants completed a matrix task under time pressure and then self-reported their scores. For each matrix problem that they got right, they could earn 25 cents, so it was tempting to cheat and self-report a higher score. Half of the participants shredded their worksheet before self-reporting and half of the participants handed the worksheet to the experimenter before self-reporting. Honesty scores were also self-reported from the HEXACO Personality Inventory (from 1 = extremely low honesty to 5 = extremely high honesty). The researchers hypothesized that personality and situation would interact to predict dishonest behavior.
#import data
data <- import("https://raw.githubusercontent.com/uopsych/psy612/master/labs/lab-7/data/cheating_data.csv")
data <- data %>% mutate(condition = factor(condition,
levels = c(0,1),
labels = c("Non-shredder", "Shredder")))
Use str() to look at the structure of the data and head() to look at the first few rows. Then calculate the means of honesty and claimed_solved, first for the full sample and then grouped by condition.
str(data)
## 'data.frame': 164 obs. of 3 variables:
## $ condition : Factor w/ 2 levels "Non-shredder",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ honesty : num 1.44 3.3 1.83 1.22 2.72 ...
## $ claimed_solved: int 6 6 4 8 7 5 5 5 7 6 ...
head(data)
data %>% summarise(m_honesty = mean(honesty),
m_solved = mean(claimed_solved))
data %>%
group_by(condition) %>%
summarise(m_honesty = mean(honesty),
m_solved = mean(claimed_solved))
\[\hat{Claimed_i} = \beta_0 + \beta_1Honesty \]
Run this model using lm().
model_a <- lm(data = data, claimed_solved ~ honesty)
summary(model_a)
##
## Call:
## lm(formula = claimed_solved ~ honesty, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.7943 -0.9189 0.0864 0.8906 3.2935
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.7614 0.3334 20.282 < 0.0000000000000002 ***
## honesty -0.5019 0.1067 -4.706 0.0000054 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.408 on 162 degrees of freedom
## Multiple R-squared: 0.1202, Adjusted R-squared: 0.1148
## F-statistic: 22.14 on 1 and 162 DF, p-value: 0.000005396
Graph this model in ggplot().
data %>%
ggplot(aes(x = honesty, y = claimed_solved)) +
geom_point()+
geom_smooth(method = "lm", se = FALSE)+
geom_label(x=2, y = 5.9, label = "all") +
theme_minimal()
## `geom_smooth()` using formula 'y ~ x'
\[\hat{Claimed_i} = \beta_0 + \beta_1Honesty + \beta_2Condition \]
So far in this course, we’ve worked only with additive effects, as shown here in this model. In this model, we are measuring the relation between honesty and how many matrices the person claimed to have solved while holding condition constant.
Let’s run this model using lm().
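The additive model’s code chunk is not reproduced here; a minimal sketch of the call, assuming the object name model_b, would be:
model_b <- lm(data = data, claimed_solved ~ honesty + condition) # additive model: both predictors, no interaction
summary(model_b)
The fitted equation below comes from this additive specification.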
\[\hat{Claimed} = 6.04 + 1.12Condition- .45Honesty \]
Question: interpret the model coefficients and R^2.
To calculate simple slopes, you calculate the regression equation at specific levels of one of your variables. You typically wouldn’t calculate simple slopes for a purely additive model like this one; we do it here only so that we can compare these slopes to the simple slopes from the interaction model later.
When condition = 0 (there isn’t a shredder - you could get caught if you cheat)
\[Claimed = 6.04 + 1.12Condition- .45Honesty\] \[Claimed = 6.04 + 1.12(0) - .45Honesty\] \[Claimed = 6.04 - .45Honesty\]
When condition = 1 (there is a shredder - you can’t get caught if you cheat)
\[Claimed = 6.04 + 1.12Condition- .45Honesty\] \[Claimed = 6.04 + 1.12(1) - .45Honesty\] \[Claimed = 7.16 - .45Honesty\]
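If you want to check these values in R rather than by hand, you can pull them from the coefficient vector (a sketch, assuming the additive model above was saved as model_b; the dummy coefficient should be named conditionShredder, matching the naming in the interaction model output later):
b <- coef(model_b)
b["(Intercept)"]                          # intercept when condition = 0 (non-shredder)
b["(Intercept)"] + b["conditionShredder"] # intercept when condition = 1 (shredder)
b["honesty"]                              # honesty slope, identical in both groups (no interaction term)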
Question: now calculate the “simple slopes” when honesty = 0, honesty = 2, and honesty = 3
data %>%
  ggplot(aes(x = honesty, y = claimed_solved)) +
  geom_point() +
  geom_abline(slope = -.45, intercept = 6.04, color = "red", size = 1) +  # condition = 0: non-shredder
  geom_abline(slope = -.45, intercept = 7.16, color = "blue", size = 1) + # condition = 1: shredder
  geom_label(x = 1.5, y = 5.2, label = "no shredder") +
  geom_label(x = 1.5, y = 6.5, label = "shredder") +
  theme_minimal()
What model do you run if you think that the honesty slope is going to be different under different conditions?
\[\hat{Claimed_i} = \beta_0 + \beta_1Honesty + \beta_2Condition + \beta_3(Honesty*Condition)\] When the interaction term is added to the model, you are allowing for the possibility that the slope of one predictor differs at different levels of the other predictor. In this example, we can now account for the possibility that the relationship between honesty and matrices solved can differ at different levels of the condition variable.
To run an interaction model with the function lm(), enter the predictor variables you are interacting separated by an asterisk on the right side of the equation, e.g., lm(Y ~ X*Z). This is equivalent to spelling out the main effects and the interaction term, e.g., lm(Y ~ X + Z + X:Z).
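A quick way to convince yourself that the two specifications are identical (a sketch using hypothetical object names):
m_star  <- lm(claimed_solved ~ honesty * condition, data = data)                     # asterisk shorthand
m_spell <- lm(claimed_solved ~ honesty + condition + honesty:condition, data = data) # spelled out
all.equal(coef(m_star), coef(m_spell)) # should return TRUE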
model_c <- lm(data = data, claimed_solved ~ honesty*condition)
summary(model_c)
##
## Call:
## lm(formula = claimed_solved ~ honesty * condition, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.9265 -0.8677 0.0491 1.0581 3.3533
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.2708 0.4359 16.681 < 0.0000000000000002 ***
## honesty -0.8532 0.1356 -6.294 0.00000000284 ***
## conditionShredder -1.1772 0.5887 -2.000 0.0472 *
## honesty:conditionShredder 0.7781 0.1881 4.137 0.00005660703 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.235 on 160 degrees of freedom
## Multiple R-squared: 0.332, Adjusted R-squared: 0.3194
## F-statistic: 26.5 on 3 and 160 DF, p-value: 0.00000000000005695
\[\hat{Claimed} = 7.27 - .85Honesty - 1.18Shredder + .77Honesty*Shredder\]
Question: interpret the model coefficients and R^2.
Shredder = 0 (There isn’t a shredder - you could get caught if you cheat)
\[Claimed = 7.27 -.85Honesty -1.18Shredder + .77Honesty*Shredder\] \[Claimed = 7.27 -.85Honesty -1.18(0) + .77Honesty*(0)\] \[Claimed = 7.27 -.85Honesty\]
Shredder = 1 (There is a shredder - you can’t get caught if you cheat)
\[Claimed = 7.27 -.85Honesty -1.18Shredder + .77Honesty*Shredder\] \[Claimed = 7.27 -.85Honesty -1.18(1) + .77Honesty*(1)\] \[Claimed = 6.09 -.08Honesty\]
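You can also recover these simple slopes directly from the fitted model; the coefficient names below are the ones shown in the summary(model_c) output above:
b <- coef(model_c)
b["honesty"]                                  # simple slope for the non-shredder group (dummy = 0)
b["honesty"] + b["honesty:conditionShredder"] # simple slope for the shredder group (dummy = 1)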
In ggplot:
data %>%
ggplot(aes(x = honesty, y = claimed_solved)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE, aes(color = condition)) +
scale_color_manual(values = c("red", "blue")) +
theme_minimal() +
theme(legend.position = "none") +
geom_label(x = 4.5, y = 6, label = "shredder") +
geom_label(x = 4.5, y = 3.6, label = "non-shredder")
## `geom_smooth()` using formula 'y ~ x'
In sjPlot:
plot_model(model_c, type = "pred", terms = c("honesty", "condition"))
For this section, uncomment and run this code. Take some time to look at these results and try to figure out what tests we are running and what you can conclude from them.
emtrends(model_c, ~condition, var = "honesty")
## condition honesty.trend SE df lower.CL upper.CL
## Non-shredder -0.8532 0.136 160 -1.121 -0.586
## Shredder -0.0751 0.130 160 -0.333 0.182
##
## Confidence level used: 0.95
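By default, emtrends() reports each simple slope with a confidence interval. If you would also like explicit t-tests of each slope against zero, one option (sketched here) is to wrap the call in emmeans’ test() function:
test(emtrends(model_c, ~condition, var = "honesty")) # adds a t-ratio and p-value for each simple slope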
Question: What is this testing? What can you conclude?
emtrends(model_c, pairwise~condition, var = "honesty")
## $emtrends
## condition honesty.trend SE df lower.CL upper.CL
## Non-shredder -0.8532 0.136 160 -1.121 -0.586
## Shredder -0.0751 0.130 160 -0.333 0.182
##
## Confidence level used: 0.95
##
## $contrasts
## contrast estimate SE df t.ratio p.value
## (Non-shredder) - Shredder -0.778 0.188 160 -4.137 0.0001
Question: What is this testing? What can you conclude?
mylist <- list(honesty=c(1,2,3,4,5), condition=c("Non-shredder","Shredder"))
combinations <- emmeans(model_c, ~ honesty*condition, at=mylist)
contrast(combinations, "pairwise", by = "honesty")
## honesty = 1:
## contrast estimate SE df t.ratio p.value
## (Non-shredder) - Shredder 0.399 0.416 160 0.960 0.3386
##
## honesty = 2:
## contrast estimate SE df t.ratio p.value
## (Non-shredder) - Shredder -0.379 0.264 160 -1.434 0.1534
##
## honesty = 3:
## contrast estimate SE df t.ratio p.value
## (Non-shredder) - Shredder -1.157 0.194 160 -5.968 <.0001
##
## honesty = 4:
## contrast estimate SE df t.ratio p.value
## (Non-shredder) - Shredder -1.935 0.276 160 -7.016 <.0001
##
## honesty = 5:
## contrast estimate SE df t.ratio p.value
## (Non-shredder) - Shredder -2.714 0.431 160 -6.303 <.0001
Question: What is this testing? What can you conclude?
One thing you may have noticed while interpreting the output is that we sometimes said “when honesty = 0.” However, in this example honesty cannot be 0, since the scale goes from 1 to 5, which makes those interpretations less meaningful. Instead, we can mean-center honesty.
data <- data %>%
  mutate(honesty_c = honesty - mean(honesty),              # mean-center by subtracting the mean
         honesty_center = scale(honesty, scale = FALSE))   # equivalent centering using scale()
head(data)
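As a quick sanity check, the mean of a mean-centered variable should be (numerically) zero:
round(mean(data$honesty_c), 10) # should print 0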
Now, we can re-run our model…
model_d <- lm(data = data, claimed_solved ~ honesty_c*condition)
summary(model_d)
##
## Call:
## lm(formula = claimed_solved ~ honesty_c * condition, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.9265 -0.8677 0.0491 1.0581 3.3533
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.7531 0.1379 34.476 < 0.0000000000000002
## honesty_c -0.8532 0.1356 -6.294 0.00000000284
## conditionShredder 1.1190 0.1938 5.775 0.00000003894
## honesty_c:conditionShredder 0.7781 0.1881 4.137 0.00005660703
##
## (Intercept) ***
## honesty_c ***
## conditionShredder ***
## honesty_c:conditionShredder ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.235 on 160 degrees of freedom
## Multiple R-squared: 0.332, Adjusted R-squared: 0.3194
## F-statistic: 26.5 on 3 and 160 DF, p-value: 0.00000000000005695
Question: now, how can we interpret the intercept?
Let’s compare that to our model where the predictor was not centered…
summary(model_c)
##
## Call:
## lm(formula = claimed_solved ~ honesty * condition, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.9265 -0.8677 0.0491 1.0581 3.3533
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.2708 0.4359 16.681 < 0.0000000000000002 ***
## honesty -0.8532 0.1356 -6.294 0.00000000284 ***
## conditionShredder -1.1772 0.5887 -2.000 0.0472 *
## honesty:conditionShredder 0.7781 0.1881 4.137 0.00005660703 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.235 on 160 degrees of freedom
## Multiple R-squared: 0.332, Adjusted R-squared: 0.3194
## F-statistic: 26.5 on 3 and 160 DF, p-value: 0.00000000000005695
Question: what changed from the first model to the model with centered honesty?
You are interested in whether the time students spend studying (study_time) interacts with test anxiety (anxiety) to predict students’ test performance (perf).
Import the data below:
test_perf <- import("https://raw.githubusercontent.com/uopsych/psy612/master/labs/lab-7/data/test_perf.csv")
head(test_perf)
str(test_perf)
psych::describe(test_perf)
As you can see, anxiety is measured categorically (“low” or “high”), and study time ranges from 0.9 to 6.3 hours.
Run a model testing whether test anxiety moderates the relationship between time spent studying and test performance. Center study time so that it is more meaningful.
test_perf <- test_perf %>%
mutate(anxiety = factor(anxiety,
levels= c("low", "high")),
study_time_c = scale(study_time, scale = FALSE))
model <- lm(data= test_perf, perf ~ anxiety*study_time_c)
summary(model)
##
## Call:
## lm(formula = perf ~ anxiety * study_time_c, data = test_perf)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.5858 -2.2278 -0.7574 1.0058 17.9678
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 79.4890 0.8682 91.552 < 0.0000000000000002 ***
## anxietyhigh -6.2434 1.4033 -4.449 0.00003226284 ***
## study_time_c 7.8435 0.6197 12.657 < 0.0000000000000002 ***
## anxietyhigh:study_time_c -6.1967 0.9290 -6.671 0.00000000523 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.815 on 69 degrees of freedom
## Multiple R-squared: 0.735, Adjusted R-squared: 0.7235
## F-statistic: 63.8 on 3 and 69 DF, p-value: < 0.00000000000000022
Interpret the model coefficients.
Visualize the interaction using sjPlot or ggplot.
test_perf %>%
ggplot(aes(x= study_time, y = perf, color = anxiety)) +
geom_smooth(method = "lm")
## `geom_smooth()` using formula 'y ~ x'
Test whether each simple slope is significantly different from 0.
emtrends(model, ~anxiety, var = "study_time_c")
## anxiety study_time_c.trend SE df lower.CL upper.CL
## low 7.84 0.620 69 6.607 9.08
## high 1.65 0.692 69 0.266 3.03
##
## Confidence level used: 0.95
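As before, the default output gives confidence intervals; wrapping the call in test() (a sketch) would add t-tests of each simple slope against zero:
test(emtrends(model, ~anxiety, var = "study_time_c"))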
Test whether the simple slopes for low anxiety and high anxiety are significantly different from each other.
emtrends(model, pairwise~anxiety, var = "study_time_c")
## $emtrends
## anxiety study_time_c.trend SE df lower.CL upper.CL
## low 7.84 0.620 69 6.607 9.08
## high 1.65 0.692 69 0.266 3.03
##
## Confidence level used: 0.95
##
## $contrasts
## contrast estimate SE df t.ratio p.value
## low - high 6.2 0.929 69 6.671 <.0001
What do the results of these significance tests mean?