You can download the Rmd file here to follow along.
Today we will review univariate regression, discuss how to summarize and visualize uncertainty in regression models, and discuss how to estimate regression coefficients using matrix algebra. At the end, we will introduce the General Linear Model and demonstrate how GLM can be used to understand all of the statistical tests we have learned so far (t-tests, ANOVA, correlations, regressions) within one unifying framework.
For today’s lab, you will need to load the following libraries:
library(tidyverse) # for plotting and data wrangling
library(rio) # for importing data
library(broom) # for tidying model output
library(sjPlot) # for plotting
library(ggpubr) # for plotting
library(carData) # for Guyer dataset
For today’s lab, we are going to continue using the same dataset from previous labs that contains variables measuring gender, self-reported conscientiousness (from the BFI), and self-reported physical health.
First, let’s load in the data.
health <- import("https://raw.githubusercontent.com/uopsych/psy612/master/labs/lab-3/data/consc_health.csv")
Our model can be described with the following regression equation:
\[health_i = b_0 + b_1consc_i + e_i\]
We can run the model in R using the lm() function.
model <- lm(sr_health ~ consc, data = health)
summary(model)
##
## Call:
## lm(formula = sr_health ~ consc, data = health)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.5450 -0.6414 -0.1623 0.6817 2.0154
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.6570 0.3571 4.641 0.0000203 ***
## consc 0.4904 0.1188 4.128 0.000119 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8854 on 58 degrees of freedom
## Multiple R-squared: 0.2271, Adjusted R-squared: 0.2138
## F-statistic: 17.04 on 1 and 58 DF, p-value: 0.0001186
Question: What do the intercept and slope mean? What do the p-values tell us?
Our b's (the intercept and slope) are sample estimates of the true population parameters (\(\beta\)'s). Whenever we calculate an estimate, we should also determine how precise that estimate is.
Recall the formula for calculating confidence intervals:
\[CI_b = b \pm CV(SE_b)\]
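For a 95% CI, the critical value (CV) is the cutoff from a t distribution with the model’s residual degrees of freedom (here n − 2 = 58). As a quick sketch of where that number comes from (it is roughly 2):
qt(p = .975, df = 58) # two-tailed critical value for a 95% CI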
We can calculate the confidence interval around the estimates in R with the stats::confint() function. This function takes the model object as its first argument. By default it will give you 95% CIs.
confint(model)
## 2.5 % 97.5 %
## (Intercept) 0.9422515 2.3716950
## consc 0.2526050 0.7282069
Question: What does this 95% CI for the slope of conscientiousness mean in plain English?
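Note that confint() also takes a level argument if you want something other than the default 95% interval, e.g. a 90% CI:
confint(model, level = .90)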
In addition to estimating the precision around our coefficients, we can also estimate our precision around each fitted value, \(\hat{Y_i}\). These standard errors are generated by broom::augment() (and are labeled .se.fit).
model %>%
augment(se_fit = TRUE) %>%
select(sr_health, .fitted, .se.fit) %>%
head()
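As a rough sketch of how these pieces fit together, each fitted value’s 95% CI can be built as .fitted ± CV × .se.fit, using the same t critical value (df = 58) as before:
model %>%
  augment(se_fit = TRUE) %>%
  mutate(lower = .fitted - qt(.975, df = 58) * .se.fit, # lower bound of the 95% CI for each fitted value
         upper = .fitted + qt(.975, df = 58) * .se.fit) %>% # upper bound
  select(sr_health, .fitted, .se.fit, lower, upper) %>%
  head()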
If we were to string all of this information together, it would generate a confidence band around our regression line. As we’ve already seen, you can get this confidence band when creating a scatter plot by adding geom_smooth(method = "lm").
health %>%
ggplot(aes(x = consc, y = sr_health)) +
geom_point() +
geom_smooth(method = "lm") + # adds a layer that includes the regression line & 95% confidence band
labs(x = "Conscientiousness", y = "Self-rated health") +
theme_minimal()
The animation below is an example of a “Hypothetical Outcomes Plot” (HOP) that visually demonstrates what this 95% CI band represents. In essence, this plot shows what the regression line could look like if we were to repeat our experiment over and over (sampling from the same population each time).
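The animation itself isn’t embedded here, but a rough static approximation of the same idea is to bootstrap the data (resample rows with replacement), refit the regression to each resample, and overlay all of the resulting lines. This is just a sketch; the number of resamples (50), the seed, and the object name boot_lines are arbitrary choices.
set.seed(123) # for reproducibility
boot_lines <- map_dfr(1:50, function(i) {
  resampled <- slice_sample(health, n = nrow(health), replace = TRUE) # resample rows with replacement
  fit <- lm(sr_health ~ consc, data = resampled) # refit the model to the resampled data
  tibble(sample = i, intercept = coef(fit)[1], slope = coef(fit)[2])
})

health %>%
  ggplot(aes(x = consc, y = sr_health)) +
  geom_point() +
  geom_abline(data = boot_lines,
              aes(intercept = intercept, slope = slope),
              alpha = .2, color = "cornflowerblue") + # one line per resample
  labs(x = "Conscientiousness", y = "Self-rated health") +
  theme_minimal()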
The predict() function allows us to get the fitted Y values for all 60 conscientiousness values in our dataset. Note: you can also get the fitted Y values with the broom::augment() function, as we did earlier in the lab.
predict(model)
## 1 2 3 4 5 6 7 8
## 2.961300 4.119826 3.543509 2.507581 3.350326 3.621532 3.038562 3.570843
## 9 10 11 12 13 14 15 16
## 3.178485 3.147345 2.758911 3.403290 2.987058 3.015841 3.216697 3.135321
## 17 18 19 20 21 22 23 24
## 3.056917 3.006047 3.015321 3.297619 2.458993 2.840238 3.082434 2.292952
## 25 26 27 28 29 30 31 32
## 3.103269 3.133964 2.989617 3.354057 2.882657 3.952820 2.891139 3.307783
## 33 34 35 36 37 38 39 40
## 2.745710 3.555768 3.643154 2.474954 2.778601 2.520440 2.935472 2.524356
## 41 42 43 44 45 46 47 48
## 2.467469 3.442900 4.079152 2.368996 3.882772 2.256636 3.239823 2.609330
## 49 50 51 52 53 54 55 56
## 2.309207 3.229900 2.795072 2.476516 3.459384 3.363102 3.536529 1.938202
## 57 58 59 60
## 2.698847 3.737070 3.129544 2.779391
We can use this information to create “prediction bands”. First we will generate our fitted values along with a “prediction interval” (lower and upper bound) for each of these values.
fitted_interval <- predict(model, interval = "prediction")
head(fitted_interval)
## fit lwr upr
## 1 2.961300 1.1737189 4.748882
## 2 4.119826 2.2594779 5.980173
## 3 3.543509 1.7407478 5.346271
## 4 2.507581 0.7010676 4.314095
## 5 3.350326 1.5575084 5.143143
## 6 3.621532 1.8133935 5.429671
Next we’ll bind these fitted values (and their intervals) to our original dataset.
new_df <- cbind(health, fitted_interval)
head(new_df)
And finally, we’ll plot a prediction band on top of the data by adding a geom_ribbon() layer.
new_df %>%
ggplot(aes(x = consc, y = sr_health)) +
geom_point() +
geom_smooth(method = lm, se = TRUE) +
geom_ribbon(aes(ymin = lwr, ymax = upr), fill = "blue", alpha = 0.1) +
labs(x = "Conscientiousness", y = "Self-rated health") +
theme_minimal()
Question: What does the prediction band represent? How does it differ from the confidence band?
We just saw that regression works “under the hood” by solving a matrix algebra equation to get the intercept and slope of the regression line. As Sara briefly mentioned in class, this matrix algebra equation works for any type of data we have. This should clue us into the idea that there is some fundamental process going on behind the scenes of all of our models…
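As a quick refresher, here is a minimal sketch of that equation, the normal equations \(b = (X'X)^{-1}X'y\), applied to our conscientiousness model. The two resulting values should match the intercept (1.657) and slope (0.490) that lm() gave us earlier.
X <- cbind(1, health$consc) # design matrix: a column of 1's (for the intercept) plus the predictor
y <- health$sr_health # outcome vector
solve(t(X) %*% X) %*% t(X) %*% y # (X'X)^-1 X'y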
The general linear model (GLM) is a family of models that assume the relationship between your DV and IV(s) is additive, and that your outcome is normally distributed. In its simplest form, we can think of the general linear model as:
\[Data = Model + Error \]
This provides a unifying framework for all of the statistical tests we have learned (or at least touched on) so far: t-tests, correlations, ANOVA, and linear regression. All of these tests are really doing the same math at the end of the day.
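A quick sanity check of this idea with our conscientiousness model: every observed health score is exactly its fitted value (the model) plus its residual (the error).
# should return TRUE (up to floating-point tolerance)
all.equal(health$sr_health, unname(fitted(model) + resid(model)))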
To illustrate this, let’s look at the relationship between gender and self-reported health from the health data set.
head(health)
str(health)
## 'data.frame': 60 obs. of 4 variables:
## $ pid : int 1 2 3 4 5 6 7 8 9 10 ...
## $ gender : chr "male" "male" "male" "male" ...
## $ consc : num 2.66 5.02 3.85 1.73 3.45 ...
## $ sr_health: num 1.91 4.49 3.4 2.05 4.21 ...
Let’s convert the gender variable to 0's and 1's. We will talk more about dummy coding in a future lab.
health <- health %>%
mutate(gender = case_when(gender == "male" ~ 0,
gender == "female" ~ 1))
head(health)
t_test <- t.test(formula = sr_health ~ gender, data = health, var.equal = TRUE)
t_test
##
## Two Sample t-test
##
## data: sr_health by gender
## t = 1.1623, df = 58, p-value = 0.2499
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
## -0.2157688 0.8133372
## sample estimates:
## mean in group 0 mean in group 1
## 3.202735 2.903950
cor_test <- cor.test(formula = ~ sr_health + gender, data = health)
cor_test
##
## Pearson's product-moment correlation
##
## data: sr_health and gender
## t = -1.1623, df = 58, p-value = 0.2499
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.3898633 0.1071553
## sample estimates:
## cor
## -0.1508746
anova_test <- aov(formula = sr_health ~ gender, data = health)
summary(anova_test)
## Df Sum Sq Mean Sq F value Pr(>F)
## gender 1 1.34 1.3391 1.351 0.25
## Residuals 58 57.49 0.9912
regression <- lm(formula = sr_health ~ gender, data = health)
summary(regression)
##
## Call:
## lm(formula = sr_health ~ gender, data = health)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.76485 -0.92279 -0.05685 0.60980 2.02939
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.2027 0.1818 17.620 <0.0000000000000002 ***
## gender -0.2988 0.2571 -1.162 0.25
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9956 on 58 degrees of freedom
## Multiple R-squared: 0.02276, Adjusted R-squared: 0.005914
## F-statistic: 1.351 on 1 and 58 DF, p-value: 0.2499
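Before comparing p-values, notice how the regression maps onto the t-test: the intercept (3.2027) is the mean of group 0, and the gender slope (-0.2988) is the difference between the two group means. A quick check (the object names below are just for illustration):
mean_male <- mean(health$sr_health[health$gender == 0]) # group 0 mean = the regression intercept
mean_female <- mean(health$sr_health[health$gender == 1])
mean_female - mean_male # difference in means = the gender slope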
Finally, notice that all four tests produce exactly the same p-value:
t_test$p.value
## [1] 0.2498611
cor_test$p.value
## [1] 0.2498611
anova(anova_test)$`Pr(>F)`[1]
## [1] 0.2498611
summary(regression)$coefficients[2,4]
## [1] 0.2498611
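One more equivalence worth noting: squaring the t statistic from the t-test gives the F statistic reported by both the ANOVA and the regression (1.1623² ≈ 1.351).
t_test$statistic^2 # equals the F statistic
summary(regression)$fstatistic[1]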
Using the mtcars dataset, create a scatterplot of cyl (x-axis) and mpg (y-axis) that includes both a prediction band and a confidence band.
#your code here
You will need to use the predict() function.
#your code here
#your code here
#your code here
For this minihack, we will refer back to the example about conscientiousness and health. We used confint() to calculate the 95% CI for our regression coefficients. Your job is to start with the model output, stored as a list object (see below), and extract the relevant pieces of information to calculate the 95% CI around the intercept and slope.
Note: the formula for a confidence interval is estimate ± CV × SE
model <- lm(sr_health ~ consc, data = health)
summary_model <- summary(model)
#your code here
#your code here
#your code here
#your code here
#your code here
Check your answer against the output from confint().
#your code here