You can download the Rmd file here to follow along.

Purpose

Today we will review univariate regression, discuss how to summarize and visualize uncertainty in regression models, and show how to estimate regression coefficients using matrix algebra. At the end, we will introduce the General Linear Model and demonstrate how the GLM can be used to understand all of the statistical tests we have learned so far (t-tests, ANOVA, correlations, regressions) within one unifying framework.

For today’s lab, you will need to load the following libraries:

library(tidyverse) # for plotting and data wrangling
library(rio) # for importing data
library(broom) # for cleaning
library(sjPlot) # for plotting
library(ggpubr) # for plotting
library(carData) # for Guyer dataset

Visualizing uncertainty

For today’s lab, we are going to continue using the same dataset from previous labs that contains variables measuring gender, self-reported conscientiousness (from the BFI), and self-reported physical health.

Data and review

First, let’s load in the data.

health <- import("https://raw.githubusercontent.com/uopsych/psy612/master/labs/lab-3/data/consc_health.csv")

Our model can be described with the following regression equation:

\[health_i = b_0 + b_1consc_i + e_i\]

We can run the model in R using the lm() function.

model <- lm(sr_health ~ consc, data = health)
summary(model)
## 
## Call:
## lm(formula = sr_health ~ consc, data = health)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.5450 -0.6414 -0.1623  0.6817  2.0154 
## 
## Coefficients:
##             Estimate Std. Error t value  Pr(>|t|)    
## (Intercept)   1.6570     0.3571   4.641 0.0000203 ***
## consc         0.4904     0.1188   4.128  0.000119 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8854 on 58 degrees of freedom
## Multiple R-squared:  0.2271, Adjusted R-squared:  0.2138 
## F-statistic: 17.04 on 1 and 58 DF,  p-value: 0.0001186

Question: What do the intercept and slope mean? What do the p-values tell us?

Confidence intervals

Our b's (intercept and slope) are estimates, calculated from our sample, of the true population parameters (\(\beta\)’s). Whenever we calculate an estimate, we should also determine how precise that estimate is.

Recall the formula for calculating confidence intervals:

\[CI_b = b \pm CV(SE_b)\]
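Here \(CV\) is the critical value of the test statistic; for regression coefficients it is a t value based on the model's residual degrees of freedom. As a minimal sketch of where that piece comes from for our model (58 residual df), we can ask qt() for the two-tailed critical value (you will assemble the full CI by hand in Minihack 3):

qt(p = 0.975, df = 58) # two-tailed critical t value for a 95% CI with 58 residual df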

We can calculate the confidence interval around the estimates in R with the function stats::confint(). This function takes the model object as its first argument. By default it will give you 95% CIs.

confint(model)
##                 2.5 %    97.5 %
## (Intercept) 0.9422515 2.3716950
## consc       0.2526050 0.7282069

Question: What does this 95% CI for the slope of conscientiousness mean in plain English?

Confidence bands

In addition to estimating precision around our coefficients, we can also estimate our precision around each fitted value, \(\hat{Y_i}\). These standard errors are generated by broom::augment() (and are labeled .se.fit).

model %>% 
  augment(se_fit = TRUE) %>% 
  select(sr_health, .fitted, .se.fit) %>% 
  head() 

If we were to string all of this information together, it would generate a confidence band around our regression line. As we’ve already seen, you can get this confidence band when creating a scatter plot by adding geom_smooth(method = "lm").

health %>%
  ggplot(aes(x = consc, y = sr_health)) + 
  geom_point() + 
  geom_smooth(method = "lm") + # adds a layer that includes the regression line & 95% confidence band
  labs(x = "Conscientiousness", y = "Self-rated health") +
  theme_minimal()

The animation below is an example of a “Hypothetical Outcomes Plot” (HOP) that visually demonstrates what this 95% CI band represents. In essence, this plot shows what the regression line could look like if we were to repeat our experiment over and over (sampling from the same population each time).
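If you want to generate something like it yourself, a rough approximation of the same idea is to refit the model on many resamples of the data and overlay the resulting regression lines. This is only a sketch, assuming a simple bootstrap resampling scheme (the original animation may be generated differently), and the object names are arbitrary:

set.seed(123)
boot_lines <- purrr::map_dfr(1:50, function(i) {
  resample <- health[sample(nrow(health), replace = TRUE), ] # resample rows with replacement
  fit <- lm(sr_health ~ consc, data = resample)              # refit the model on the resample
  tibble(sample = i, intercept = coef(fit)[1], slope = coef(fit)[2])
})

health %>%
  ggplot(aes(x = consc, y = sr_health)) +
  geom_point() +
  geom_abline(data = boot_lines,
              aes(intercept = intercept, slope = slope),
              alpha = 0.2, color = "steelblue") + # one faint line per resample
  labs(x = "Conscientiousness", y = "Self-rated health") +
  theme_minimal()

Each faint line is the regression line from one resample; the spread of those lines corresponds roughly to the confidence band above.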

Prediction bands

The predict() function allows us to get the fitted Y values for all 60 conscientiousness values in our dataset. Note: you can also get the fitted Y values with the broom::augment() function, as we did earlier in the lab.

predict(model)
##        1        2        3        4        5        6        7        8 
## 2.961300 4.119826 3.543509 2.507581 3.350326 3.621532 3.038562 3.570843 
##        9       10       11       12       13       14       15       16 
## 3.178485 3.147345 2.758911 3.403290 2.987058 3.015841 3.216697 3.135321 
##       17       18       19       20       21       22       23       24 
## 3.056917 3.006047 3.015321 3.297619 2.458993 2.840238 3.082434 2.292952 
##       25       26       27       28       29       30       31       32 
## 3.103269 3.133964 2.989617 3.354057 2.882657 3.952820 2.891139 3.307783 
##       33       34       35       36       37       38       39       40 
## 2.745710 3.555768 3.643154 2.474954 2.778601 2.520440 2.935472 2.524356 
##       41       42       43       44       45       46       47       48 
## 2.467469 3.442900 4.079152 2.368996 3.882772 2.256636 3.239823 2.609330 
##       49       50       51       52       53       54       55       56 
## 2.309207 3.229900 2.795072 2.476516 3.459384 3.363102 3.536529 1.938202 
##       57       58       59       60 
## 2.698847 3.737070 3.129544 2.779391

We can use this information to create “prediction bands”. First we will generate our fitted values along with a “prediction interval” (lower and upper bound) for each of these values.

fitted_interval <- predict(model, interval = "prediction")
head(fitted_interval)
##        fit       lwr      upr
## 1 2.961300 1.1737189 4.748882
## 2 4.119826 2.2594779 5.980173
## 3 3.543509 1.7407478 5.346271
## 4 2.507581 0.7010676 4.314095
## 5 3.350326 1.5575084 5.143143
## 6 3.621532 1.8133935 5.429671

Next we’ll bind these fitted values (and their intervals) to our original dataset.

new_df <- cbind(health, fitted_interval)
head(new_df)

And finally, we’ll plot a prediction band on top of the data by adding a geom_ribbon() layer.

new_df %>% 
  ggplot(aes(x = consc, y = sr_health)) +
  geom_point() +
  geom_smooth(method = lm, se = TRUE) +
  geom_ribbon(aes(ymin = lwr, ymax = upr), fill = "blue", alpha = 0.1) + 
  labs(x = "Conscientiousness", y = "Self-rated health") +
  theme_minimal()

Question: What does the prediction band represent? How does it differ from the confidence band?

Other visualization tools

sjPlot::plot_model()

plot_model(model = model,    # name of model object
           type = "pred",    # show predicted values (i.e. regression line)
           show.data = TRUE, # include data points on plot
           jitter = TRUE)    # add a small amount of random variation to the points to prevent overlap

ggpubr::ggscatter()

ggscatter(data = health,              # name of data.frame
          x = "consc",                # IV (must be quoted)
          y = "sr_health",            # DV (must be quoted)
          add = "reg.line",           # add regression line
          xlab = "Conscientiousness", # x-axis label
          ylab = "Self-rated health", # y-axis label
          conf.int = TRUE,            # show 95% confidence band around regression line
          cor.coef = TRUE)            # display correlation coefficient and p-value


Regression with matrix algebra

Consider the following matrix algebra equation:

\[\mathbf{Y} = \mathbf{XB}\]

In this equation:

1. Y is an n × 1 matrix of outcome variable values from the dataset (e.g., health scores)
2. X is an n × 2 matrix; the first column is filled with 1s and the second column contains the predictor variable values from the dataset (e.g., conscientiousness scores)
3. B is a 2 × 1 matrix containing our estimates of the intercept and slope.
Using this equation, we can plug in our data and solve for the intercept and slope estimates.

Example

1. Create the X matrix (first column of 1s, second column of conscientiousness scores)
x_mat <- health %>% # start with the original data frame
  mutate(ones = rep(1, nrow(.))) %>% # create a column of 1's to represent the intercept
  select(ones, consc) %>% # select only the column of 1's and X variable
  as.matrix() %>%  # coerce to a matrix
  unname() # get rid of dimnames 

head(x_mat)
##      [,1]     [,2]
## [1,]    1 2.659688
## [2,]    1 5.022069
## [3,]    1 3.846887
## [4,]    1 1.734497
## [5,]    1 3.452961
## [6,]    1 4.005985
2. Create the Y matrix (self-reported health scores)
y_mat <- health %>% # start with the original data frame
  select(sr_health) %>% # select just the Y variable
  as.matrix() %>% # coerce to a matrix
  unname() # get rid of dimnames 

head(y_mat)
##          [,1]
## [1,] 1.909508
## [2,] 4.493539
## [3,] 3.403894
## [4,] 2.045359
## [5,] 4.210038
## [6,] 5.111872

In matrix algebra, you can’t divide by a matrix, so you will need to do a bit more work to solve for B (the vector of intercept and slope estimates). Pre-multiplying both sides of the equation by \(\mathbf{X'}\), and then by \((\mathbf{X'X})^{-1}\), gives us the following formula for B:

\[\mathbf{B} = (\mathbf{X'X})^{-1} \mathbf{X'Y}\]

# (X'X)^-1 X'Y: t() transposes, %*% is matrix multiplication, and solve() inverts X'X
solve(t(x_mat) %*% x_mat) %*% (t(x_mat) %*% y_mat)
##           [,1]
## [1,] 1.6569733
## [2,] 0.4904059

Let’s check our work:

model$coefficients
## (Intercept)       consc 
##   1.6569733   0.4904059

The General Linear Model

We just saw that regression works “under the hood” by solving a matrix algebra equation to get the intercept and slope of the regression line. As Sara mentioned in class, this matrix algebra equation works for any type of data we have. This should clue us into the idea that there is some fundamental process going on behind the scenes of all of our models…

The general linear model (GLM) is a family of models that assume the relationship between your DV and IV(s) is additive, and that your errors (residuals) are normally distributed. In its simplest form, we can think of the general linear model as:

\[Data = Model + Error \]

This provides a unifying framework for all of the statistical tests we have learned (or at least touched on) so far: t-tests, correlations, ANOVA, and linear regression. All of these tests are really doing the same math at the end of the day.

To illustrate this, let’s look at the relationship between gender and self-reported health from the health data set.

head(health)
str(health)
## 'data.frame':    60 obs. of  4 variables:
##  $ pid      : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ gender   : chr  "male" "male" "male" "male" ...
##  $ consc    : num  2.66 5.02 3.85 1.73 3.45 ...
##  $ sr_health: num  1.91 4.49 3.4 2.05 4.21 ...

Let’s convert the gender variable to 0s and 1s. We will talk more about dummy coding in a future lab.

health <-  health %>% 
  mutate(gender = case_when(gender == "male" ~ 0,
                               gender == "female" ~ 1))
head(health)

t-test

t_test <- t.test(formula = sr_health ~ gender, data = health, var.equal = TRUE)
t_test
## 
##  Two Sample t-test
## 
## data:  sr_health by gender
## t = 1.1623, df = 58, p-value = 0.2499
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.2157688  0.8133372
## sample estimates:
## mean in group 0 mean in group 1 
##        3.202735        2.903950

Correlation

cor_test <- cor.test(formula = ~ sr_health + gender, data = health)
cor_test
## 
##  Pearson's product-moment correlation
## 
## data:  sr_health and gender
## t = -1.1623, df = 58, p-value = 0.2499
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.3898633  0.1071553
## sample estimates:
##        cor 
## -0.1508746

ANOVA

anova_test <- aov(formula = sr_health ~ gender, data = health)
summary(anova_test)
##             Df Sum Sq Mean Sq F value Pr(>F)
## gender       1   1.34  1.3391   1.351   0.25
## Residuals   58  57.49  0.9912

Regression

regression <- lm(formula = sr_health ~ gender, data = health)
summary(regression)
## 
## Call:
## lm(formula = sr_health ~ gender, data = health)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.76485 -0.92279 -0.05685  0.60980  2.02939 
## 
## Coefficients:
##             Estimate Std. Error t value            Pr(>|t|)    
## (Intercept)   3.2027     0.1818  17.620 <0.0000000000000002 ***
## gender       -0.2988     0.2571  -1.162                0.25    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9956 on 58 degrees of freedom
## Multiple R-squared:  0.02276,    Adjusted R-squared:  0.005914 
## F-statistic: 1.351 on 1 and 58 DF,  p-value: 0.2499

Comparison

t_test$p.value
## [1] 0.2498611
cor_test$p.value
## [1] 0.2498611
anova(anova_test)
summary(regression)
## 
## Call:
## lm(formula = sr_health ~ gender, data = health)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.76485 -0.92279 -0.05685  0.60980  2.02939 
## 
## Coefficients:
##             Estimate Std. Error t value            Pr(>|t|)    
## (Intercept)   3.2027     0.1818  17.620 <0.0000000000000002 ***
## gender       -0.2988     0.2571  -1.162                0.25    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9956 on 58 degrees of freedom
## Multiple R-squared:  0.02276,    Adjusted R-squared:  0.005914 
## F-statistic: 1.351 on 1 and 58 DF,  p-value: 0.2499
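To make the equivalence easier to see, here is a small sketch that gathers the p-value for the gender effect from each of the four tests into one table, using the objects created above (the ANOVA p-value is pulled from its summary table and the regression p-value via broom::tidy()):

tibble(
  test    = c("t-test", "correlation", "ANOVA", "regression"),
  p_value = c(t_test$p.value,                          # from t.test()
              cor_test$p.value,                        # from cor.test()
              summary(anova_test)[[1]][["Pr(>F)"]][1], # from the aov() summary table
              tidy(regression)$p.value[2])             # gender row from lm(), via broom::tidy()
)

All four p-values should be identical (up to rounding).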

Minihacks

Minihack 1: Matrix algebra

Load the following data from Fox and Guyer’s (1978) anonymity and cooperation study. In the study, twenty groups of four participants each played 30 trials of the prisoner’s dilemma game. The number of cooperative choices (cooperation) made by each group was scored out of 120 (i.e., cooperative choices made by 4 participants over 30 trials). The groups either made their decisions publicly or anonymously (condition).

guyer <- carData::Guyer

Use matrix algebra to manually calculate the intercept and slope for the regression equation:

\[cooperation_i = b_0 + b_1condition_i + e_i\]

1. Dummy code the condition (public = 0 and anonymous = 1).

#your code here
2. Create the X matrix.
#your code here
3. Create the Y matrix.
#your code here
4. Solve for the B matrix.
#your code here
5. Confirm your results using the lm() function.
#your code here

Minihack 2: Prediction and confidence bands

Using the mtcars dataset, create a scatterplot of cyl (x-axis) and mpg (y-axis) that includes both a prediction band and a confidence band.

1. Run a regression model.
#your code here
2. Generate fitted values and their prediction interval using the predict() function.
#your code here
3. Bind the fitted values and the prediction intervals to the original dataset.
#your code here
4. Using ggplot(), create a scatterplot that includes both a prediction band and a confidence band.
#your code here

Minihack 3: Confidence intervals

For this minihack, we will refer back to the example about conscientiousness and health. We used confint() to calculate the 95% CI for our regression coefficients. Your job is to start with the model output, stored as a list object (see below), and extract the relevant pieces of information to calculate the 95% CI around the intercept and slope.

Note: the formula for a confidence interval is \(CI_b = b \pm CV(SE_b)\).

model <- lm(sr_health ~ consc, data = health)
summary_model <- summary(model)
1. Extract the estimates.
#your code here
2. Extract the standard errors. Hint: use summary_model$coefficients.
#your code here
3. Extract the df (it will be the denominator df from the F-statistic).
#your code here
4. Calculate the critical value.
#your code here
5. Calculate a 95% CI for both estimates.
#your code here
6. Verify that your answer corresponds to the result from confint().
#your code here