You can download the Rmd file here to follow along.
Today we will review univariate regression, discuss how to summarize and visualize uncertainty in regression models, and discuss how to estimate regression coefficients using matrix algebra. At the end, we will introduce the General Linear Model and demonstrate how GLM can be used to understand all of the statistical tests we have learned so far (t-tests, ANOVA, correlations, regressions) within one unifying framework.
For today’s lab, you will need to load the following libraries:
library(tidyverse) # for plotting and data wrangling
library(rio) # for importing data
library(broom) # for tidying model output
library(sjPlot) # for plotting
library(ggpubr) # for plotting
library(carData) # for Guyer dataset
For today’s lab, we are going to continue using the same dataset from previous labs that contains variables measuring gender, self-reported conscientiousness (from the BFI), and self-reported physical health.
First, let’s load in the data.
health <- import("https://raw.githubusercontent.com/uopsych/psy612/master/labs/lab-3/data/consc_health.csv")
Our model can be described with the following regression equation:
\[health_i = b_0 + b_1consc_i + e_i\]
We can run the model in R using the lm() function.
model <- lm(sr_health ~ consc, data = health)
summary(model)
##
## Call:
## lm(formula = sr_health ~ consc, data = health)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.5450 -0.6414 -0.1623 0.6817 2.0154
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.6570 0.3571 4.641 0.0000203 ***
## consc 0.4904 0.1188 4.128 0.000119 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8854 on 58 degrees of freedom
## Multiple R-squared: 0.2271, Adjusted R-squared: 0.2138
## F-statistic: 17.04 on 1 and 58 DF, p-value: 0.0001186
Question: What do the intercept and slope mean? What do the p-values tell us?
Our b's (the intercept and slope) are sample estimates of the true population parameters (\(\beta\)'s). Whenever we calculate an estimate, we should also determine how precise that estimate is.
Recall the formula for calculating confidence intervals:
\[CI_b = b \pm CV(SE_b)\]
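For a 95% CI, the critical value (CV) is the cutoff from a t distribution with the model’s residual degrees of freedom (here n − 2 = 58). As a quick sketch of where that number comes from (it is roughly 2):
qt(p = .975, df = 58) # two-tailed critical value for a 95% CI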
We can calculate the confidence interval around the estimates in R with the stats::confint() function. This function takes the model object as its first argument. By default it will give you 95% CIs.
confint(model)
## 2.5 % 97.5 %
## (Intercept) 0.9422515 2.3716950
## consc 0.2526050 0.7282069
Question: What does this 95% CI for the slope of conscientiousness mean in plain English?
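Note that confint() also takes a level argument if you want something other than the default 95% interval, e.g. a 90% CI:
confint(model, level = .90)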
In addition to estimating the precision around our coefficients, we can also estimate our precision around each fitted value, \(\hat{Y_i}\). These standard errors are generated by broom::augment() (and are labeled .se.fit).
model %>%
augment(se_fit = TRUE) %>%
select(sr_health, .fitted, .se.fit) %>%
head()
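As a rough sketch of how these pieces fit together, each fitted value’s 95% CI can be built as .fitted ± CV × .se.fit, using the same t critical value (df = 58) as before:
model %>%
  augment(se_fit = TRUE) %>%
  mutate(lower = .fitted - qt(.975, df = 58) * .se.fit, # lower bound of the 95% CI for each fitted value
         upper = .fitted + qt(.975, df = 58) * .se.fit) %>% # upper bound
  select(sr_health, .fitted, .se.fit, lower, upper) %>%
  head()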
If we were to string all of this information together, it would generate a confidence band around our regression line. As we’ve already seen, you can get this confidence band when creating a scatter plot by adding geom_smooth(method = "lm").
health %>%
ggplot(aes(x = consc, y = sr_health)) +
geom_point() +
geom_smooth(method = "lm") + # adds a layer that includes the regression line & 95% confidence band
labs(x = "Conscientiousness", y = "Self-rated health") +
theme_minimal()
The animation below is an example of a “Hypothetical Outcomes Plot” (HOP) that visually demonstrates what this 95% CI band represents. In essence, this plot shows what the regression line could look like if we were to repeat our experiment over and over (sampling from the same population each time).
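The animation itself isn’t embedded here, but a rough static approximation of the same idea is to bootstrap the data (resample rows with replacement), refit the regression to each resample, and overlay all of the resulting lines. This is just a sketch; the number of resamples (50), the seed, and the object name boot_lines are arbitrary choices.
set.seed(123) # for reproducibility
boot_lines <- map_dfr(1:50, function(i) {
  resampled <- slice_sample(health, n = nrow(health), replace = TRUE) # resample rows with replacement
  fit <- lm(sr_health ~ consc, data = resampled) # refit the model to the resampled data
  tibble(sample = i, intercept = coef(fit)[1], slope = coef(fit)[2])
})

health %>%
  ggplot(aes(x = consc, y = sr_health)) +
  geom_point() +
  geom_abline(data = boot_lines,
              aes(intercept = intercept, slope = slope),
              alpha = .2, color = "cornflowerblue") + # one line per resample
  labs(x = "Conscientiousness", y = "Self-rated health") +
  theme_minimal()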
The predict() function allows us to get the fitted Y values for all 60 conscientiousness values in our dataset. Note: you can also get the fitted Y values with the broom::augment() function, as we did earlier in the lab.
predict(model)
## 1 2 3 4 5 6 7 8
## 2.961300 4.119826 3.543509 2.507581 3.350326 3.621532 3.038562 3.570843
## 9 10 11 12 13 14 15 16
## 3.178485 3.147345 2.758911 3.403290 2.987058 3.015841 3.216697 3.135321
## 17 18 19 20 21 22 23 24
## 3.056917 3.006047 3.015321 3.297619 2.458993 2.840238 3.082434 2.292952
## 25 26 27 28 29 30 31 32
## 3.103269 3.133964 2.989617 3.354057 2.882657 3.952820 2.891139 3.307783
## 33 34 35 36 37 38 39 40
## 2.745710 3.555768 3.643154 2.474954 2.778601 2.520440 2.935472 2.524356
## 41 42 43 44 45 46 47 48
## 2.467469 3.442900 4.079152 2.368996 3.882772 2.256636 3.239823 2.609330
## 49 50 51 52 53 54 55 56
## 2.309207 3.229900 2.795072 2.476516 3.459384 3.363102 3.536529 1.938202
## 57 58 59 60
## 2.698847 3.737070 3.129544 2.779391
We can use this information to create “prediction bands”. First we will generate our fitted values along with a “prediction interval” (lower and upper bound) for each of these values.
fitted_interval <- predict(model, interval = "prediction")
head(fitted_interval)
## fit lwr upr
## 1 2.961300 1.1737189 4.748882
## 2 4.119826 2.2594779 5.980173
## 3 3.543509 1.7407478 5.346271
## 4 2.507581 0.7010676 4.314095
## 5 3.350326 1.5575084 5.143143
## 6 3.621532 1.8133935 5.429671
Next we’ll bind these fitted values (and their intervals) to our original dataset.
new_df <- cbind(health, fitted_interval)
head(new_df)
And finally, we’ll plot a prediction band on top of the data by adding a geom_ribbon() layer.
new_df %>%
ggplot(aes(x = consc, y = sr_health)) +
geom_point() +
geom_smooth(method = lm, se = TRUE) +
geom_ribbon(aes(ymin = lwr, ymax = upr), fill = "blue", alpha = 0.1) +
labs(x = "Conscientiousness", y = "Self-rated health") +
theme_minimal()
Question: What does the prediction band represent? How does it differ from the confidence band?
We just saw that regression works “under the hood” by solving a matrix algebra equation to get the intercept and slope of the regression line. As Sara briefly mentioned in class, this matrix algebra equation works for any type of data we have. This should clue us into the idea that there is some fundamental process going on behind the scenes of all of our models…
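As a quick refresher, here is a minimal sketch of that equation, the normal equations \(b = (X'X)^{-1}X'y\), applied to our conscientiousness model. The two resulting values should match the intercept (1.657) and slope (0.490) that lm() gave us earlier.
X <- cbind(1, health$consc) # design matrix: a column of 1's (for the intercept) plus the predictor
y <- health$sr_health # outcome vector
solve(t(X) %*% X) %*% t(X) %*% y # (X'X)^-1 X'y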
The general linear model (GLM) is a family of models that assume the relationship between your DV and IV(s) is additive, and that your outcome is normally distributed. In its simplest form, we can think of the general linear model as:
\[Data = Model + Error \]
This provides a unifying framework for all of the statistical tests we have learned (or at least touched on) so far: t-tests, correlations, ANOVA, and linear regression. All of these tests are really doing the same math at the end of the day.
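A quick sanity check of this idea with our conscientiousness model: every observed health score is exactly its fitted value (the model) plus its residual (the error).
# should return TRUE (up to floating-point tolerance)
all.equal(health$sr_health, unname(fitted(model) + resid(model)))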
To illustrate this, let’s look at the relationship between gender and self-reported health from the health data set.
head(health)
str(health)
## 'data.frame': 60 obs. of 4 variables:
## $ pid : int 1 2 3 4 5 6 7 8 9 10 ...
## $ gender : chr "male" "male" "male" "male" ...
## $ consc : num 2.66 5.02 3.85 1.73 3.45 ...
## $ sr_health: num 1.91 4.49 3.4 2.05 4.21 ...
Let’s convert the gender variable to 0's and 1's. We will talk more about dummy coding in a future lab.
health <- health %>%
mutate(gender = case_when(gender == "male" ~ 0,
gender == "female" ~ 1))
head(health)
t_test <- t.test(formula = sr_health ~ gender, data = health, var.equal = TRUE)
t_test
##
## Two Sample t-test
##
## data: sr_health by gender
## t = 1.1623, df = 58, p-value = 0.2499
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
## -0.2157688 0.8133372
## sample estimates:
## mean in group 0 mean in group 1
## 3.202735 2.903950
cor_test <- cor.test(formula = ~ sr_health + gender, data = health)
cor_test
##
## Pearson's product-moment correlation
##
## data: sr_health and gender
## t = -1.1623, df = 58, p-value = 0.2499
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.3898633 0.1071553
## sample estimates:
## cor
## -0.1508746
anova_test <- aov(formula = sr_health ~ gender, data = health)
summary(anova_test)
## Df Sum Sq Mean Sq F value Pr(>F)
## gender 1 1.34 1.3391 1.351 0.25
## Residuals 58 57.49 0.9912
regression <- lm(formula = sr_health ~ gender, data = health)
summary(regression)
##
## Call:
## lm(formula = sr_health ~ gender, data = health)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.76485 -0.92279 -0.05685 0.60980 2.02939
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.2027 0.1818 17.620 <0.0000000000000002 ***
## gender -0.2988 0.2571 -1.162 0.25
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9956 on 58 degrees of freedom
## Multiple R-squared: 0.02276, Adjusted R-squared: 0.005914
## F-statistic: 1.351 on 1 and 58 DF, p-value: 0.2499
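Before comparing p-values, notice how the regression maps onto the t-test: the intercept (3.2027) is the mean of group 0, and the gender slope (-0.2988) is the difference between the two group means. A quick check (the object names below are just for illustration):
mean_male <- mean(health$sr_health[health$gender == 0]) # group 0 mean = the regression intercept
mean_female <- mean(health$sr_health[health$gender == 1])
mean_female - mean_male # difference in means = the gender slope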
Finally, notice that all four tests produce exactly the same p-value:
t_test$p.value
## [1] 0.2498611
cor_test$p.value
## [1] 0.2498611
anova(anova_test)$`Pr(>F)`[1]
## [1] 0.2498611
summary(regression)$coefficients[2,4]
## [1] 0.2498611
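One more equivalence worth noting: squaring the t statistic from the t-test gives the F statistic reported by both the ANOVA and the regression (1.1623² ≈ 1.351).
t_test$statistic^2 # equals the F statistic
summary(regression)$fstatistic[1]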
Using the mtcars dataset, create a scatterplot of cyl (x-axis) and mpg (y-axis) that includes both a prediction band and a confidence band.
#your code here
You will need to use the predict() function.
#your code here
#your code here
#your code here
For this minihack, we will refer back to the example about conscientiousness and health. We used confint() to calculate the 95% CI for our regression coefficients. Your job is to start with the model output, stored as a list object (see below), and extract the relevant pieces of information to calculate the 95% CI around the intercept and slope.
Note: the formula for a confidence interval is estimate ± CV × SE
model <- lm(sr_health ~ consc, data = health)
summary_model <- summary(model)
#your code here
#your code here
#your code here
#your code here
#your code here
Check your answer against the output from confint().
#your code here