You can download the Rmd file here to follow along.
Today’s lab will focus on the one-way ANOVA. However, rather than spend a lot of time on the traditional way of doing an ANOVA, we will instead focus on running an ANOVA as a linear regression with a categorical predictor. In particular, we will discuss how to code categorical variables in R and how this affects interpretation of model coefficients.
Be sure to have the following packages loaded:
library(tidyverse) # for plotting and data wrangling
library(rio) # for importing data
library(psych) # for descriptives
We’ll be looking at a dataset comparing different kinds of treatment for depression. In this study, depressed patients (n = 5 per group) were randomly assigned to one of three groups: CBT, Psychotherapy, or Control.
After 10 weeks, participants’ depression scores were measured (on a scale of 1 = no depression to 12 = severely depressed). Our dataset will have just 2 variables: group (CBT, Psychotherapy, or Control) and depress (depression scores).
NOTE: 1 = CBT; 2 = Psychotherapy; 3 = Control
Import the data
depression <- import("https://raw.githubusercontent.com/uopsych/psy612/master/labs/lab-5/data/depression_therapy.csv")
The first thing we should check is how the data are structured. We should have a factor called group and a numeric variable for depression scores called depress. Let’s use the function str().
#Your code here
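For example, a minimal check might look like this:

# inspect the variable types in the dataset
str(depression)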
Recode group as a factor
R thinks that group is an integer, so we need to make it into a factor in order to run our analysis. We can use mutate() and base R’s factor(), which can be used to turn variables into factors.
NOTE: You want to provide labels in the same order as the levels of the factor, so in this case we want them in the order CBT, Psychotherapy, and Control (see above).
depression <- depression %>%
  mutate(group = factor(group,
                        labels = c("CBT", "Psychotherapy", "Control"))) # order matters here!
Look at the structure of the data again. Now it’s clear that group is a factor variable with the correct levels.
#Your code here
Next, we’ll get descriptives for each group using two tidyverse functions, group_by() and summarize(). First, you want to group by group, and then you want to summarize with three arguments: mean, sd, and n. Last, you can quickly turn the result into a nice table using the knitr::kable() function.
Click here to see how we did it in psy611.
# Your code here
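One possible sketch (assuming the knitr package is installed; the summary column names are just illustrative choices):

depression %>%
  group_by(group) %>%
  summarize(mean = mean(depress, na.rm = TRUE),
            sd = sd(depress, na.rm = TRUE),
            n = n()) %>%
  knitr::kable()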
You can also get descriptives with the describeBy() function. Try adding mat = TRUE as an argument.
#Your code here
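For instance, a minimal sketch using the variables defined above:

# descriptives of depress split by group, returned in matrix (data frame) form
describeBy(depression$depress, group = depression$group, mat = TRUE)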
There are many ways that you can visualize the data. In ggplot, I recommend adding a geom_boxplot(), geom_bar(), or geom_violin() layer.
This is a good resource for boxplots with ggplot, this is a good resource for bar plots with ggplot, and this is a good resource for violin plots with ggplot.
# Your code here
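As one example, a boxplot sketch (the axis labels are just illustrative choices):

ggplot(depression, aes(x = group, y = depress)) +
  geom_boxplot() +
  labs(x = "Treatment group", y = "Depression score")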
In regression, categorical predictors with more than two levels are broken up into more than one predictor. This actually happens under the hood in R when our independent variable is a factor. Let’s see what happens when we put group as the predictor in our model:
# Your code here
Here is our regression equation:
\[Depression_i = \beta_0 + \beta_1psychotherapy_i + \beta_2control_i + e_i\]
R is creating two dummy variables for us under the hood: groupPsychotherapy and groupControl. This is what R is doing:
depression %>%
  mutate(groupPsychotherapy = ifelse(group == "Psychotherapy", 1, 0),
         groupControl = ifelse(group == "Control", 1, 0))
By default, R treats whatever the first level of the factor variable is as the reference group. In this case, CBT was the first level of group (because it was initially coded as 1 in our raw data), so the model treated CBT as our reference group.
Here is our regression equation, created from the output:
\[Depression_i = 7.20 - 3.80psychotherapy_i + 2.60control_i\]
Question: Determine the mean depression score of each group.
Now, let’s revisit the output.
model <- lm(depress ~ group, data = depression)
summary(model)
##
## Call:
## lm(formula = depress ~ group, data = depression)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.4 -1.2 -0.2 1.0 2.6
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.2000 0.7394 9.738 0.000000477 ***
## groupPsychotherapy -3.8000 1.0456 -3.634 0.00342 **
## groupControl 2.6000 1.0456 2.487 0.02861 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.653 on 12 degrees of freedom
## Multiple R-squared: 0.7595, Adjusted R-squared: 0.7195
## F-statistic: 18.95 on 2 and 12 DF, p-value: 0.0001934
Question: Interpret the intercept. What does a significant p-value signify?
Question: Interpret the slopes. What do the significant p-values signify?
Let’s imagine that we have the following (more intuitive) research questions: Is CBT effective compared to the Control group? Is Psychotherapy effective compared to the Control group?
Question: What do we want our reference group to be to answer these research questions?
Now let’s make the appropriate dummy codes. Recall that we need k-1 dummy codes (where k is the number of groups). We have 3 groups, so we need 2 dummy codes.
Remember, our decision of how to set the dummy codes (which group to set as the reference group) should be guided by our research questions.
We’ll create the new dummy variables using mutate():
# create new dummy variables containing 1's and 0's
depression <- depression %>%
  mutate(CBT.v.Control = ifelse(group == "CBT", 1, 0),
         Psychotherapy.v.Control = ifelse(group == "Psychotherapy", 1, 0))
depression
Now, we can run the linear model using these new dummy variables as the IVs.
# Your code here
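A sketch of what that might look like (the object name model_dummy is just an illustrative choice):

model_dummy <- lm(depress ~ CBT.v.Control + Psychotherapy.v.Control, data = depression)
summary(model_dummy)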
Question: What does the intercept mean? What do the slopes mean?
Question: What does the F test tell us?
Now we’re going to use another method to create our dummy codes: a contrast matrix. When we pass a factor to lm(), it uses a contrast matrix under the hood. Although R creates the contrast matrix automatically with certain default assumptions, you can set it yourself if you have a specific set of codes in mind (which we do).
Use contrasts() to view the default contrast matrix associated with our grouping variable.
# Your code here
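For example:

# view the default (treatment/dummy) contrast matrix for group
contrasts(depression$group)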
Again, by default our contrasts are not how we want them because we want Control to be the reference group. To manually specify our dummy codes, we’ll use the function contr.treatment(), which will return a contrast matrix with specific characteristics, depending on what we tell it. The first argument, n, is the number of levels of the factor variable. The second argument, base, refers to the level of the factor that should be considered the reference group.
# Your code here
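For example, with 3 levels and the third level (Control) as the reference group:

contr.treatment(n = 3, base = 3)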
We can now overwrite the default contrast matrix for group with the one we just manually specified.
# Your code here
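One way to do this:

# replace the default contrasts for group, making Control (level 3) the reference
contrasts(depression$group) <- contr.treatment(n = 3, base = 3)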
What do you notice about group now?
# Your code here
Let’s take a closer look at group using attributes()…
# Your code here
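For example:

# examine the levels and contrasts attributes attached to group
attributes(depression$group)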
Now, run the linear model with group (now containing our desired dummy codes) as the IV.
# Your code here
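A sketch (the object name model_dc is just an illustrative choice):

model_dc <- lm(depress ~ group, data = depression)
summary(model_dc)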
Let’s take another look at the levels of group using the levels() function.
# Your code here
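For example:

levels(depression$group)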
We can manually change the order of the levels by using factor() and manually specifying the levels argument to contain the order we want. Again, since R treats the first level of the factor as the reference group, and we want Control to be the reference group in order to obtain our contrasts of interest when we run the linear model, we will create a new version of group called group_relevel and make Control the first level of the factor in this new variable.
depression <- depression %>%
  mutate(group_relevel = factor(group, levels = c("Control",
                                                  "CBT",
                                                  "Psychotherapy")))
To double check that this worked, let’s look at the levels of group_relevel.
# Your code here
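For example:

levels(depression$group_relevel)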
Now when we run the linear model, the contrast matrix that is automatically generated will correspond to the contrasts we want. To verify this, let’s re-run the linear model (yet again!) with the re-leveled group_relevel variable as the IV.
# Your code here
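A sketch (the object name model_relevel is just an illustrative choice):

model_relevel <- lm(depress ~ group_relevel, data = depression)
summary(model_relevel)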
Instead of dummy coding, we can use other types of coding, like deviation coding (also called “effects coding”), to compare individual group means to the overall mean of the dependent variable (i.e. the grand mean). In R, we do this in basically the same way as dummy coding, but we just change the values we use to set the codes.
NOTE: This will change how we interpret our coefficients! When using effects coding, the intercept represents the grand mean. The coefficient of each effect-coded variable is equal to the difference between the mean of the group coded 1 and the grand mean.
With effects coding, the reference group will end up being coded with all -1’s.
Let’s imagine our research questions are as follows: Is CBT effective compared to the overall sample? Is Psychotherapy effective compared to the overall sample?
We’ll create our contrast matrix of effects codes in a very similar way to how we made our dummy codes. The only difference is that this time we will use contr.sum(), again telling it the number of levels of our grouping factor (n = 3).
# Your code here
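For example:

contr.sum(n = 3)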
And as before with dummy codes, we overwrite the default contrast matrix with our deviation codes.
# Your code here
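One way to do this (assuming we assign the codes to group, whose last level is Control, so Control becomes the group coded all -1’s):

contrasts(depression$group) <- contr.sum(n = 3)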
Run the linear model with deviation codes
# Your code here
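A sketch, assuming the deviation codes were assigned to group as above (the object name model_dev is just an illustrative choice; the intercept should equal the grand mean):

model_dev <- lm(depress ~ group, data = depression)
summary(model_dev)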
Question: What does the intercept mean? The slopes?
There are many other coding schemes besides dummy and deviation coding. You can read more about that here.
We have just reviewed in detail how to do an ANOVA within a linear regression framework. We will briefly review how to generate a traditional ANOVA table, though this will give us the same information we already got from our regression model.
To get an ANOVA table, you can use stats::aov() and provide it with the same information you would pass to lm(), namely a formula (e.g. DV ~ IV) and the data. Again, make sure that the categorical IV is a factor.
model_anova <- aov(depress ~ group, data = depression)
summary(model_anova)
## Df Sum Sq Mean Sq F value Pr(>F)
## group 2 103.6 51.80 18.95 0.000193 ***
## Residuals 12 32.8 2.73
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
To calculate pairwise comparisons between group levels, use pairwise.t.test(). This function takes arguments x (the outcome variable), g (the grouping variable), and p.adjust.method (e.g. “bonferroni” or “holm”).
pairwise.t.test(x = depression$depress,
                g = depression$group,
                p.adjust.method = "none")
##
## Pairwise comparisons using t tests with pooled SD
##
## data: depression$depress and depression$group
##
## CBT Psychotherapy
## Psychotherapy 0.0034 -
## Control 0.0286 0.000052
##
## P value adjustment method: none
Since we are running a lot of tests, we need to correct for multiple comparisons so that we don’t have an inflated Type I error rate.
pairwise.t.test(x = depression$depress,
                g = depression$group,
                p.adjust.method = "bonferroni")
##
## Pairwise comparisons using t tests with pooled SD
##
## data: depression$depress and depression$group
##
## CBT Psychotherapy
## Psychotherapy 0.01027 -
## Control 0.08584 0.00016
##
## P value adjustment method: bonferroni
pairwise.t.test(x = depression$depress,
                g = depression$group,
                p.adjust.method = "holm")
##
## Pairwise comparisons using t tests with pooled SD
##
## data: depression$depress and depression$group
##
## CBT Psychotherapy
## Psychotherapy 0.00685 -
## Control 0.02861 0.00016
##
## P value adjustment method: holm
Question: What difference do you notice about the Bonferroni vs. Holm correction?
Research Problem:
You are interested in the effectiveness of a new drug in treating the symptoms of depression. You randomly assign patients with depression into one of 3 treatment groups with different levels of the drug (low, medium, and high), and also one control group that only received a placebo. You then measure their symptoms after 4 weeks of drug treatment. You are interested in whether the drug is effective, and what level of the drug is most effective.
Load in the raw data with the following code
depression_drug <- import("https://raw.githubusercontent.com/uopsych/psy612/master/labs/lab-5/data/depression_drug.csv")
# Your code here
#Your code here
Run an omnibus test of whether there is a relationship between drug and depression. What does this tell you?
# Your code here
Using dummy coding, compare each level of the drug (low, medium, high) and the none group. Interpret all model coefficients.
# Your code here
Perform pairwise comparisons of depression scores across all levels of drug. Use a correction method to account for multiple comparisons.
# Your code here