You can download the Rmd file here to follow along.
Today we will review univariate regression, discuss how to summarize and visualize uncertainty in regression models, and discuss how to estimate regression coefficients using matrix algebra. At the end, we will introduce the General Linear Model and demonstrate how GLM can be used to understand all of the statistical tests we have learned so far (t-tests, ANOVA, correlations, regressions) within one unifying framework.
For today’s lab, you will need to load the following libraries:
library(tidyverse) # for plotting and data wrangling
library(rio) # for importing data
library(broom) # for cleaning
library(sjPlot) # for plotting
library(ggpubr) # for plotting
library(carData) # for Guyer dataset
For today’s lab, we are going to continue using the same dataset from previous labs that contains variables measuring gender, self-reported conscientiousness (from the BFI), and self-reported physical health.
First, let’s load in the data.
health <- import("https://raw.githubusercontent.com/uopsych/psy612/master/labs/lab-3/data/consc_health.csv")
Our model can be described with the following regression equation:
\[health_i = b_0 + b_1consc_i + e_i\]
We can run the model in R using the lm() function.
model <- lm(sr_health ~ consc, data = health)
summary(model)
##
## Call:
## lm(formula = sr_health ~ consc, data = health)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.5450 -0.6414 -0.1623 0.6817 2.0154
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.6570 0.3571 4.641 0.0000203 ***
## consc 0.4904 0.1188 4.128 0.000119 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8854 on 58 degrees of freedom
## Multiple R-squared: 0.2271, Adjusted R-squared: 0.2138
## F-statistic: 17.04 on 1 and 58 DF, p-value: 0.0001186
Question: What do the intercept and slope mean? What do the p-values tell us?
Our b's (intercept and slope) are estimates from our sample of the true population parameters (\(\beta\)'s). Whenever we calculate an estimate, we should also determine how precise that estimate is.
Recall the formula for calculating confidence intervals, where CV is the critical value (here, from a t distribution) and \(SE_b\) is the standard error of the coefficient:
\[CI_b = b \pm CV(SE_b)\]
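To make the formula concrete, here is a minimal sketch of computing these intervals by hand in base R. The object names (coefs, t_crit) are our own; everything else comes from summary() and qt().

coefs <- summary(model)$coefficients            # matrix of estimates, SEs, t values, and p values
t_crit <- qt(p = .975, df = df.residual(model)) # critical value for a 95% CI

coefs[, "Estimate"] - t_crit * coefs[, "Std. Error"] # lower bounds
coefs[, "Estimate"] + t_crit * coefs[, "Std. Error"] # upper bounds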
We can calculate the confidence interval around the estimates in R with the stats::confint() function. This function takes the model object as its first argument and returns 95% CIs by default.
confint(model)
## 2.5 % 97.5 %
## (Intercept) 0.9422515 2.3716950
## consc 0.2526050 0.7282069
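Since we already loaded broom, a convenient alternative is broom::tidy(), which can return the coefficients and their confidence intervals together in one tidy data frame (conf.int and conf.level are arguments to broom's method for lm objects):

tidy(model, conf.int = TRUE, conf.level = .95)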
Question: What does this 95% CI for the slope of conscientiousness mean in plain English?
In addition to estimating precision around our coefficients, we can also estimate our precision around each fitted value, \(\hat{Y_i}\). These standard errors are generated by broom::augment() (and are labeled .se.fit).
model %>%
  augment(se_fit = TRUE) %>%              # get fitted values and their standard errors
  select(sr_health, .fitted, .se.fit) %>% # keep the outcome, fitted values, and SEs
  head()
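As a sketch of how these standard errors translate into interval bounds, we could compute a 95% CI around each fitted value by hand; the column names lower and upper below are our own additions:

model %>%
  augment(se_fit = TRUE) %>%
  mutate(lower = .fitted - qt(.975, df.residual(model)) * .se.fit,  # lower bound of 95% CI
         upper = .fitted + qt(.975, df.residual(model)) * .se.fit) %>% # upper bound of 95% CI
  select(sr_health, .fitted, lower, upper) %>%
  head()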
If we were to string all of this information together, it would generate a confidence band around our regression line. As we've already seen, you can get this confidence band when creating a scatterplot by adding geom_smooth(method = "lm").
health %>%
  ggplot(aes(x = consc, y = sr_health)) +
  geom_point() +
  geom_smooth(method = "lm") + # adds a layer with the regression line & 95% confidence band
  labs(x = "Conscientiousness", y = "Self-rated health") +
  theme_minimal()
The animation below is an example of a “Hypothetical Outcomes Plot” (HOP) that visually demonstrates what this 95% CI band represents. In essence, this plot shows what the regression line could look like if we were to repeat our experiment over and over (sampling from the same population each time).
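The animation itself is not reproduced here, but a rough static approximation of the same idea is to refit the model to many resamples of the data and overlay the resulting lines. This is only a sketch, assuming bootstrap resampling stands in for repeated sampling from the population:

set.seed(42) # for reproducibility
boot_lines <- map_dfr(1:50, function(i) {
  resample <- slice_sample(health, n = nrow(health), replace = TRUE) # resample rows with replacement
  fit <- lm(sr_health ~ consc, data = resample)                      # refit the model to the resample
  tibble(intercept = coef(fit)[1], slope = coef(fit)[2])
})

ggplot(health, aes(x = consc, y = sr_health)) +
  geom_point() +
  geom_abline(data = boot_lines,
              aes(intercept = intercept, slope = slope),
              alpha = .2, color = "steelblue") + # one line per resample
  theme_minimal()

A true HOP would show these lines one at a time rather than all at once.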
The predict() function allows us to get the fitted Y values for all 60 conscientiousness values in our dataset. Note: you can also get the fitted Y values with the broom::augment() function, as we did earlier in the lab.
predict(model)
## 1 2 3 4 5 6 7 8
## 2.961300 4.119826 3.543509 2.507581 3.350326 3.621532 3.038562 3.570843
## 9 10 11 12 13 14 15 16
## 3.178485 3.147345 2.758911 3.403290 2.987058 3.015841 3.216697 3.135321
## 17 18 19 20 21 22 23 24
## 3.056917 3.006047 3.015321 3.297619 2.458993 2.840238 3.082434 2.292952
## 25 26 27 28 29 30 31 32
## 3.103269 3.133964 2.989617 3.354057 2.882657 3.952820 2.891139 3.307783
## 33 34 35 36 37 38 39 40
## 2.745710 3.555768 3.643154 2.474954 2.778601 2.520440 2.935472 2.524356
## 41 42 43 44 45 46 47 48
## 2.467469 3.442900 4.079152 2.368996 3.882772 2.256636 3.239823 2.609330
## 49 50 51 52 53 54 55 56
## 2.309207 3.229900 2.795072 2.476516 3.459384 3.363102 3.536529 1.938202
## 57 58 59 60
## 2.698847 3.737070 3.129544 2.779391
We can use this information to create “prediction bands”. First we will generate our fitted values along with a “prediction interval” (lower and upper bound) for each of these values.
fitted_interval <- predict(model, interval = "prediction")
head(fitted_interval)
## fit lwr upr
## 1 2.961300 1.1737189 4.748882
## 2 4.119826 2.2594779 5.980173
## 3 3.543509 1.7407478 5.346271
## 4 2.507581 0.7010676 4.314095
## 5 3.350326 1.5575084 5.143143
## 6 3.621532 1.8133935 5.429671
Next we’ll bind these fitted values (and their intervals) to our original dataset.
new_df <- cbind(health, fitted_interval)
head(new_df)
And finally, we'll plot a prediction band on top of the data by adding a geom_ribbon() layer.
new_df %>%
  ggplot(aes(x = consc, y = sr_health)) +
  geom_point() +
  geom_smooth(method = "lm", se = TRUE) + # regression line with 95% confidence band
  geom_ribbon(aes(ymin = lwr, ymax = upr), fill = "blue", alpha = 0.1) + # prediction band
  labs(x = "Conscientiousness", y = "Self-rated health") +
  theme_minimal()
Question: What does the prediction band represent? How does it differ from the confidence band?
One shortcut for creating this kind of plot is sjPlot::plot_model():
plot_model(model = model,    # name of model object
           type = "pred",    # show predicted values (i.e., regression line)
           show.data = TRUE, # include data points on plot
           jitter = TRUE)    # add a small amount of random variation to prevent overlapping points
Another shortcut is ggpubr::ggscatter():
ggscatter(data = health,              # name of data frame
          x = "consc",                # IV (must be quoted)
          y = "sr_health",            # DV (must be quoted)
          add = "reg.line",           # add regression line
          xlab = "Conscientiousness", # x-axis label
          ylab = "Self-rated health", # y-axis label
          conf.int = TRUE,            # show 95% confidence band around regression line
          cor.coef = TRUE)            # display correlation coefficient and p-value
Consider the following matrix algebra equation:
\[\mathbf{Y} = \mathbf{XB}\]

In this equation:

1. \(\mathbf{Y}\) is an \(n \times 1\) matrix of outcome variable values from the dataset (e.g., health scores).
2. \(\mathbf{X}\) is an \(n \times 2\) matrix. The first column is filled with 1's, and the second column contains the predictor variable values from the dataset (e.g., conscientiousness scores).
3. \(\mathbf{B}\) is a \(2 \times 1\) matrix containing our estimates of the intercept and slope.
Using this equation, we can plug in our data and solve for the intercept and slope estimates.
First, we create the X matrix (first column of 1's, second column of conscientiousness scores):

x_mat <- health %>% # start with the original data frame
  mutate(ones = rep(1, nrow(.))) %>% # create a column of 1's to represent the intercept
  select(ones, consc) %>% # select only the column of 1's and the X variable
  as.matrix() %>% # coerce to a matrix
  unname() # get rid of dimnames
head(x_mat)
## [,1] [,2]
## [1,] 1 2.659688
## [2,] 1 5.022069
## [3,] 1 3.846887
## [4,] 1 1.734497
## [5,] 1 3.452961
## [6,] 1 4.005985
Next, we create the Y matrix (self-reported health scores):

y_mat <- health %>% # start with the original data frame
  select(sr_health) %>% # select just the Y variable
  as.matrix() %>% # coerce to a matrix
  unname() # get rid of dimnames
head(y_mat)
## [,1]
## [1,] 1.909508
## [2,] 4.493539
## [3,] 3.403894
## [4,] 2.045359
## [5,] 4.210038
## [6,] 5.111872
In matrix algebra, you can’t divide by a matrix, so you will need to do some more complicated arithmetic to solve for B (the vector of slope and intercept estimates). You can calculate B with the following formula:
\[\mathbf{B} = (\mathbf{X'X})^{-1} \mathbf{X'Y}\]
# t() transposes a matrix, %*% is matrix multiplication, and solve() takes the inverse
solve(t(x_mat) %*% x_mat) %*% (t(x_mat) %*% y_mat)
## [,1]
## [1,] 1.6569733
## [2,] 0.4904059
Let’s check our work:
model$coefficients
## (Intercept) consc
## 1.6569733 0.4904059
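As one more check (the object name b_mat is our own), multiplying X by the solved B matrix should reproduce the fitted values we got from predict(model) earlier:

b_mat <- solve(t(x_mat) %*% x_mat) %*% (t(x_mat) %*% y_mat) # 2x1 matrix of estimates
head(x_mat %*% b_mat) # compare to head(predict(model))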
We just saw that regression works “under the hood” by solving a matrix algebra equation to get the intercept and slope of the regression line. As Sara mentioned in class, this matrix algebra equation works for any type of data we have. This should clue us into the idea that there is some fundamental process going on behind the scenes of all of our models…
The general linear model (GLM) is a family of models that assume the relationship between your DV and IV(s) is additive, and that your outcome is normally distributed. In its simplest form, we can think of the general linear model as:
\[Data = Model + Error \]
This provides a unifying framework for all of the statistical tests we have learned (or at least touched on) so far: t-tests, correlations, ANOVA, and linear regression. All of these tests are really doing the same math at the end of the day.
To illustrate this, let's look at the relationship between gender and self-reported health from the health dataset.
head(health)
str(health)
## 'data.frame': 60 obs. of 4 variables:
## $ pid : int 1 2 3 4 5 6 7 8 9 10 ...
## $ gender : chr "male" "male" "male" "male" ...
## $ consc : num 2.66 5.02 3.85 1.73 3.45 ...
## $ sr_health: num 1.91 4.49 3.4 2.05 4.21 ...
Let's convert the gender variable to 0's and 1's. We will talk more about dummy coding in a future lab.
health <- health %>%
  mutate(gender = case_when(gender == "male" ~ 0,
                            gender == "female" ~ 1))
head(health)
First, an independent samples t-test:
t_test <- t.test(formula = sr_health ~ gender, data = health, var.equal = TRUE)
t_test
##
## Two Sample t-test
##
## data: sr_health by gender
## t = 1.1623, df = 58, p-value = 0.2499
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.2157688 0.8133372
## sample estimates:
## mean in group 0 mean in group 1
## 3.202735 2.903950
Next, a correlation:
cor_test <- cor.test(formula = ~ sr_health + gender, data = health)
cor_test
##
## Pearson's product-moment correlation
##
## data: sr_health and gender
## t = -1.1623, df = 58, p-value = 0.2499
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.3898633 0.1071553
## sample estimates:
## cor
## -0.1508746
Then, an ANOVA:
anova_test <- aov(formula = sr_health ~ gender, data = health)
summary(anova_test)
## Df Sum Sq Mean Sq F value Pr(>F)
## gender 1 1.34 1.3391 1.351 0.25
## Residuals 58 57.49 0.9912
And finally, a linear regression:
regression <- lm(formula = sr_health ~ gender, data = health)
summary(regression)
##
## Call:
## lm(formula = sr_health ~ gender, data = health)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.76485 -0.92279 -0.05685 0.60980 2.02939
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.2027 0.1818 17.620 <0.0000000000000002 ***
## gender -0.2988 0.2571 -1.162 0.25
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9956 on 58 degrees of freedom
## Multiple R-squared: 0.02276, Adjusted R-squared: 0.005914
## F-statistic: 1.351 on 1 and 58 DF, p-value: 0.2499
Notice that all four tests arrive at exactly the same p-value:
t_test$p.value
## [1] 0.2498611
cor_test$p.value
## [1] 0.2498611
anova(anova_test)
summary(regression)
##
## Call:
## lm(formula = sr_health ~ gender, data = health)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.76485 -0.92279 -0.05685 0.60980 2.02939
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.2027 0.1818 17.620 <0.0000000000000002 ***
## gender -0.2988 0.2571 -1.162 0.25
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9956 on 58 degrees of freedom
## Multiple R-squared: 0.02276, Adjusted R-squared: 0.005914
## F-statistic: 1.351 on 1 and 58 DF, p-value: 0.2499
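To see the equivalence at a glance, we can pull each test's p-value directly from its model object and line them up; the indexing below reflects where each function stores its p-value:

c(t_test      = t_test$p.value,
  correlation = cor_test$p.value,
  anova       = anova(anova_test)$`Pr(>F)`[1],                           # p-value for the gender row
  regression  = summary(regression)$coefficients["gender", "Pr(>|t|)"]) # p-value for the gender slope

All four should be the same value (.2499) we saw above.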
Load the following data from Fox and Guyer's (1978) anonymity and cooperation study. In the study, twenty groups of four participants each played 30 trials of the prisoner's dilemma game. The number of cooperative choices (cooperation) made by each group was scored out of 120 (i.e., cooperative choices made by 4 participants over 30 trials). The groups either made decisions publicly or privately (condition).
guyer <- carData::Guyer
Use matrix algebra to manually calculate the intercept and slope for the regression equation:
\[cooperation_i = b_0 + b_1condition_i + e_i\]

1. Dummy code the condition variable (public = 0 and anonymous = 1).

#your code here

2. Create the X matrix.

#your code here

3. Create the Y matrix.

#your code here

4. Calculate the B matrix.

#your code here

5. Check your work using the lm() function.

#your code here
Using the mtcars dataset, create a scatterplot of cyl (x-axis) and mpg (y-axis) that includes both a prediction band and a confidence band.

#your code here

Hint: to generate the prediction band, you will need the predict() function.

#your code here

#your code here

#your code here
For this minihack, we will refer back to the example about conscientiousness and health. We used confint() to calculate the 95% CI for our regression coefficients. Your job is to start with the model output, stored as a list object (see below), and extract the relevant pieces of information to calculate the 95% CI around the intercept and slope.

Note: the formula for a confidence interval is \(estimate \pm cv \times se\), where cv is the critical value and se is the standard error.
model <- lm(sr_health ~ consc, data = health)
summary_model <- summary(model)
#your code here
#your code here
#your code here
#your code here
#your code here
Check your work using confint().

#your code here