Purpose

Today we’ll be going over the basics of univariate regression in R, including reporting regression results in tables.

To review:

  • Regression is a general approach for data analysis.
  • It can handle a variety of independent variables.
  • It includes everything we might want:
    • Statistical tests (often several).
    • Effect sizes (often several).
    • It can include many independent or non-independent effects.

Today, we’ll just be focusing on univariate regression with a single continuous predictor, but over the coming weeks we will build up to much more complicated regression models.

To quickly navigate to the desired section, click one of the following links:

  1. Estimating Regression Models
  2. NHST in Regression
  3. A Tidier Way to Extract Information from Regression Models
  4. Reporting Regressions
library(tidyverse) # for plotting and data wrangling
library(rio) # for importing data
library(psych) # for covariance and correlation functions
library(apaTables) # for correlation tables
library(pwr) # for power calculation
library(broom) # for cleaning up output
library(stargazer) # for regression tables
library(sjPlot) # for regression tables (easy export to Word)

Estimating Regression Models

Even in simple univariate regression, we estimate several values at once. Remember, one of our primary goals is to estimate \(Y\) values with our regression model:

\[Y_i = b_0 + b_1X_i + e_i\]

Alternatively:

\[\hat{Y_i} = b_0 + b_1X_i\]

And sometimes we use Greek letters for the model parameters when we’re referring to the (generally hypothetical) population parameters:

\[Y_i = \beta_0 + \beta_1X_i + \epsilon_i\]

As Sara mentioned, within psychology it is common to use \(b\) for unstandardized slopes and \(\beta\) for standardized slopes.

Without further ado, let’s go ahead and estimate a regression equation in R.

Estimating Regressions in R

Conducting regressions in R is actually pretty simple. We use the lm() function, which is part of the pre-loaded {stats} package. There are basically two ingredients we pass to the lm() function:

  1. The formula: Specify your regression formula in the general form y ~ x.

  2. The data: the dataframe that contains the variables in the formula. This is technically optional (an example of omitting it appears below).

Let’s take a look. First we’ll import the same data that we used last week:

health <- import("https://raw.githubusercontent.com/uopsych/psy612/master/labs/lab-1/data/consc_health.csv")

Next we run the model, using lm(), setting the formula and data arguments. We want to see to what extent self-reported conscientiousness relates to self-reported health.

model <- lm(formula = sr_health ~ consc, # remember, outcome ~ predictors
            data = health) # enter the dataframe

More typically, people omit the formula =:

model <- lm(sr_health ~ consc, # remember, outcome ~ predictors
            data = health)
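
Because the data argument is technically optional, the same model can also be fit by referring to the columns directly (a quick sketch; in practice, passing data = is usually cleaner):

lm(health$sr_health ~ health$consc) # same fit; the slope will just be labeled health$consc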

Now our regression model object, called model above, is a list containing various useful pieces of information. Let’s take a look with the str() function:

Code
str(model)
Output
## List of 12
##  $ coefficients : Named num [1:2] 1.66 0.49
##   ..- attr(*, "names")= chr [1:2] "(Intercept)" "consc"
##  $ residuals    : Named num [1:60] -1.052 0.374 -0.14 -0.462 0.86 ...
##   ..- attr(*, "names")= chr [1:60] "1" "2" "3" "4" ...
##  $ effects      : Named num [1:60] -23.6511 -3.655 -0.0716 -0.2861 0.9479 ...
##   ..- attr(*, "names")= chr [1:60] "(Intercept)" "consc" "" "" ...
##  $ rank         : int 2
##  $ fitted.values: Named num [1:60] 2.96 4.12 3.54 2.51 3.35 ...
##   ..- attr(*, "names")= chr [1:60] "1" "2" "3" "4" ...
##  $ assign       : int [1:2] 0 1
##  $ qr           :List of 5
##   ..$ qr   : num [1:60, 1:2] -7.746 0.129 0.129 0.129 0.129 ...
##   .. ..- attr(*, "dimnames")=List of 2
##   .. .. ..$ : chr [1:60] "1" "2" "3" "4" ...
##   .. .. ..$ : chr [1:2] "(Intercept)" "consc"
##   .. ..- attr(*, "assign")= int [1:2] 0 1
##   ..$ qraux: num [1:2] 1.13 1.29
##   ..$ pivot: int [1:2] 1 2
##   ..$ tol  : num 1e-07
##   ..$ rank : int 2
##   ..- attr(*, "class")= chr "qr"
##  $ df.residual  : int 58
##  $ xlevels      : Named list()
##  $ call         : language lm(formula = sr_health ~ consc, data = health)
##  $ terms        :Classes 'terms', 'formula'  language sr_health ~ consc
##   .. ..- attr(*, "variables")= language list(sr_health, consc)
##   .. ..- attr(*, "factors")= int [1:2, 1] 0 1
##   .. .. ..- attr(*, "dimnames")=List of 2
##   .. .. .. ..$ : chr [1:2] "sr_health" "consc"
##   .. .. .. ..$ : chr "consc"
##   .. ..- attr(*, "term.labels")= chr "consc"
##   .. ..- attr(*, "order")= int 1
##   .. ..- attr(*, "intercept")= int 1
##   .. ..- attr(*, "response")= int 1
##   .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv> 
##   .. ..- attr(*, "predvars")= language list(sr_health, consc)
##   .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
##   .. .. ..- attr(*, "names")= chr [1:2] "sr_health" "consc"
##  $ model        :'data.frame':   60 obs. of  2 variables:
##   ..$ sr_health: num [1:60] 1.91 4.49 3.4 2.05 4.21 ...
##   ..$ consc    : num [1:60] 2.66 5.02 3.85 1.73 3.45 ...
##   ..- attr(*, "terms")=Classes 'terms', 'formula'  language sr_health ~ consc
##   .. .. ..- attr(*, "variables")= language list(sr_health, consc)
##   .. .. ..- attr(*, "factors")= int [1:2, 1] 0 1
##   .. .. .. ..- attr(*, "dimnames")=List of 2
##   .. .. .. .. ..$ : chr [1:2] "sr_health" "consc"
##   .. .. .. .. ..$ : chr "consc"
##   .. .. ..- attr(*, "term.labels")= chr "consc"
##   .. .. ..- attr(*, "order")= int 1
##   .. .. ..- attr(*, "intercept")= int 1
##   .. .. ..- attr(*, "response")= int 1
##   .. .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv> 
##   .. .. ..- attr(*, "predvars")= language list(sr_health, consc)
##   .. .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
##   .. .. .. ..- attr(*, "names")= chr [1:2] "sr_health" "consc"
##  - attr(*, "class")= chr "lm"

And we can extract elements from this list like any other list in R, by using LIST$ELEMENT or LIST[["ELEMENT"]]. There are also specific functions for extracting those elements. Let’s start by getting the predicted values of Y.

Extracting Predicted Y values

We can get predicted values from model by subsetting our list with $:

List Subsetting

Code
model$fitted.values
Output
##        1        2        3        4        5        6        7        8 
## 2.961300 4.119826 3.543509 2.507581 3.350326 3.621532 3.038562 3.570843 
##        9       10       11       12       13       14       15       16 
## 3.178485 3.147345 2.758911 3.403290 2.987058 3.015841 3.216697 3.135321 
##       17       18       19       20       21       22       23       24 
## 3.056917 3.006047 3.015321 3.297619 2.458993 2.840238 3.082434 2.292952 
##       25       26       27       28       29       30       31       32 
## 3.103269 3.133964 2.989617 3.354057 2.882657 3.952820 2.891139 3.307783 
##       33       34       35       36       37       38       39       40 
## 2.745710 3.555768 3.643154 2.474954 2.778601 2.520440 2.935472 2.524356 
##       41       42       43       44       45       46       47       48 
## 2.467469 3.442900 4.079152 2.368996 3.882772 2.256636 3.239823 2.609330 
##       49       50       51       52       53       54       55       56 
## 2.309207 3.229900 2.795072 2.476516 3.459384 3.363102 3.536529 1.938202 
##       57       58       59       60 
## 2.698847 3.737070 3.129544 2.779391

List subsetting II

We can also use double-brackets:

Code
model[["fitted.values"]]
Output
##        1        2        3        4        5        6        7        8 
## 2.961300 4.119826 3.543509 2.507581 3.350326 3.621532 3.038562 3.570843 
##        9       10       11       12       13       14       15       16 
## 3.178485 3.147345 2.758911 3.403290 2.987058 3.015841 3.216697 3.135321 
##       17       18       19       20       21       22       23       24 
## 3.056917 3.006047 3.015321 3.297619 2.458993 2.840238 3.082434 2.292952 
##       25       26       27       28       29       30       31       32 
## 3.103269 3.133964 2.989617 3.354057 2.882657 3.952820 2.891139 3.307783 
##       33       34       35       36       37       38       39       40 
## 2.745710 3.555768 3.643154 2.474954 2.778601 2.520440 2.935472 2.524356 
##       41       42       43       44       45       46       47       48 
## 2.467469 3.442900 4.079152 2.368996 3.882772 2.256636 3.239823 2.609330 
##       49       50       51       52       53       54       55       56 
## 2.309207 3.229900 2.795072 2.476516 3.459384 3.363102 3.536529 1.938202 
##       57       58       59       60 
## 2.698847 3.737070 3.129544 2.779391

fitted.values() function

Or the fitted.values function:

Code
fitted.values(model)
Output
##        1        2        3        4        5        6        7        8 
## 2.961300 4.119826 3.543509 2.507581 3.350326 3.621532 3.038562 3.570843 
##        9       10       11       12       13       14       15       16 
## 3.178485 3.147345 2.758911 3.403290 2.987058 3.015841 3.216697 3.135321 
##       17       18       19       20       21       22       23       24 
## 3.056917 3.006047 3.015321 3.297619 2.458993 2.840238 3.082434 2.292952 
##       25       26       27       28       29       30       31       32 
## 3.103269 3.133964 2.989617 3.354057 2.882657 3.952820 2.891139 3.307783 
##       33       34       35       36       37       38       39       40 
## 2.745710 3.555768 3.643154 2.474954 2.778601 2.520440 2.935472 2.524356 
##       41       42       43       44       45       46       47       48 
## 2.467469 3.442900 4.079152 2.368996 3.882772 2.256636 3.239823 2.609330 
##       49       50       51       52       53       54       55       56 
## 2.309207 3.229900 2.795072 2.476516 3.459384 3.363102 3.536529 1.938202 
##       57       58       59       60 
## 2.698847 3.737070 3.129544 2.779391
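
predict() function

Yet another option is predict(). When no new data are supplied, it simply returns the same fitted values (it can also generate predictions for new observations, which we won’t need today):

predict(model) # with no newdata argument, this returns the fitted values shown above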

Extracting Residuals

We can also extract our residuals, or the deviation of each observed score \(Y_i\) from its predicted value \(\hat{Y_i}\). That too can be done using list subsetting:

With List Subsetting

Code
model$residuals
Output
##           1           2           3           4           5           6 
## -1.05179179  0.37371307 -0.13961566 -0.46222179  0.85971248  1.49034005 
##           7           8           9          10          11          12 
##  0.05099371  1.00473746  0.72444971 -0.51013035  0.84130202 -0.46213405 
##          13          14          15          16          17          18 
## -0.35681873  0.68032922  2.01542826 -0.05149984 -0.64163120  1.41852518 
##          19          20          21          22          23          24 
##  0.63739066 -0.38306263 -0.18498050 -0.56514672 -0.57972677 -0.64136398 
##          25          26          27          28          29          30 
## -1.46937901 -0.96204444  0.13844794  0.42834928  0.08103582 -0.22450409 
##          31          32          33          34          35          36 
## -1.34106226  1.61985264  1.01873328  1.07527601 -0.27924189  0.38539640 
##          37          38          39          40          41          42 
## -0.99089855 -0.94483134  0.06312578  1.76842089  1.17242288 -1.12845379 
##          43          44          45          46          47          48 
## -0.95248954 -0.71015044 -1.54500530 -0.31210445  0.37530444 -0.70690430 
##          49          50          51          52          53          54 
##  0.63716375 -0.39605196  1.19505547 -1.33741467 -0.07534450  0.09696171 
##          55          56          57          58          59          60 
## -0.26867918  0.68566564 -0.71606326 -0.32167592  0.99511089 -1.12082173

With residuals() function

Or we could use the residuals() function:

Code
residuals(model)
Output
##           1           2           3           4           5           6 
## -1.05179179  0.37371307 -0.13961566 -0.46222179  0.85971248  1.49034005 
##           7           8           9          10          11          12 
##  0.05099371  1.00473746  0.72444971 -0.51013035  0.84130202 -0.46213405 
##          13          14          15          16          17          18 
## -0.35681873  0.68032922  2.01542826 -0.05149984 -0.64163120  1.41852518 
##          19          20          21          22          23          24 
##  0.63739066 -0.38306263 -0.18498050 -0.56514672 -0.57972677 -0.64136398 
##          25          26          27          28          29          30 
## -1.46937901 -0.96204444  0.13844794  0.42834928  0.08103582 -0.22450409 
##          31          32          33          34          35          36 
## -1.34106226  1.61985264  1.01873328  1.07527601 -0.27924189  0.38539640 
##          37          38          39          40          41          42 
## -0.99089855 -0.94483134  0.06312578  1.76842089  1.17242288 -1.12845379 
##          43          44          45          46          47          48 
## -0.95248954 -0.71015044 -1.54500530 -0.31210445  0.37530444 -0.70690430 
##          49          50          51          52          53          54 
##  0.63716375 -0.39605196  1.19505547 -1.33741467 -0.07534450  0.09696171 
##          55          56          57          58          59          60 
## -0.26867918  0.68566564 -0.71606326 -0.32167592  0.99511089 -1.12082173

Residuals by hand

And we could reproduce those values ourselves by subtracting the predicted values we just obtained from the observed \(Y\) values in our data:

Code
health$sr_health - fitted.values(model)
Output
##           1           2           3           4           5           6 
## -1.05179179  0.37371307 -0.13961566 -0.46222179  0.85971248  1.49034005 
##           7           8           9          10          11          12 
##  0.05099371  1.00473746  0.72444971 -0.51013035  0.84130202 -0.46213405 
##          13          14          15          16          17          18 
## -0.35681873  0.68032922  2.01542826 -0.05149984 -0.64163120  1.41852518 
##          19          20          21          22          23          24 
##  0.63739066 -0.38306263 -0.18498050 -0.56514672 -0.57972677 -0.64136398 
##          25          26          27          28          29          30 
## -1.46937901 -0.96204444  0.13844794  0.42834928  0.08103582 -0.22450409 
##          31          32          33          34          35          36 
## -1.34106226  1.61985264  1.01873328  1.07527601 -0.27924189  0.38539640 
##          37          38          39          40          41          42 
## -0.99089855 -0.94483134  0.06312578  1.76842089  1.17242288 -1.12845379 
##          43          44          45          46          47          48 
## -0.95248954 -0.71015044 -1.54500530 -0.31210445  0.37530444 -0.70690430 
##          49          50          51          52          53          54 
##  0.63716375 -0.39605196  1.19505547 -1.33741467 -0.07534450  0.09696171 
##          55          56          57          58          59          60 
## -0.26867918  0.68566564 -0.71606326 -0.32167592  0.99511089 -1.12082173

Checking our calculations

Let’s make sure those are all the same:

Code
round(health$sr_health - fitted.values(model), 8) == round(residuals(model), 8)
Output
##    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16 
## TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE 
##   17   18   19   20   21   22   23   24   25   26   27   28   29   30   31   32 
## TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE 
##   33   34   35   36   37   38   39   40   41   42   43   44   45   46   47   48 
## TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE 
##   49   50   51   52   53   54   55   56   57   58   59   60 
## TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE

Sum of Squared Residuals

With this information, we could calculate the sum of squared residuals:

\[\Sigma(Y_i - \hat{Y_i})^2\]

Code
sum((health$sr_health - fitted.values(model))^2)
Output
## [1] 45.46788
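
The same quantity can also be computed from the stored residuals, and for lm objects it is what deviance() returns:

sum(residuals(model)^2) # sum of squared residuals, from the stored residuals
deviance(model) # for lm objects, deviance() is the residual sum of squares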

Standard Error of the Estimate

This of course is one of the building blocks for estimating the standard error of the estimate:

\[\sqrt{\frac{\Sigma(Y_i - \hat{Y_i})^2}{N-2}}\]

Let’s go ahead and calculate that too:

Code
sqrt(sum((health$sr_health - fitted.values(model))^2)/(nrow(health)-2))
Output
## [1] 0.8853976

Extracting Standard Error with summary

We can also get the standard error of the estimate more directly. However, it is not stored in the model list; we have to use the summary() function, which gives us some additional useful information about our model. Let’s take a look at its structure:

Code
str(summary(model))
Output
## List of 11
##  $ call         : language lm(formula = sr_health ~ consc, data = health)
##  $ terms        :Classes 'terms', 'formula'  language sr_health ~ consc
##   .. ..- attr(*, "variables")= language list(sr_health, consc)
##   .. ..- attr(*, "factors")= int [1:2, 1] 0 1
##   .. .. ..- attr(*, "dimnames")=List of 2
##   .. .. .. ..$ : chr [1:2] "sr_health" "consc"
##   .. .. .. ..$ : chr "consc"
##   .. ..- attr(*, "term.labels")= chr "consc"
##   .. ..- attr(*, "order")= int 1
##   .. ..- attr(*, "intercept")= int 1
##   .. ..- attr(*, "response")= int 1
##   .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv> 
##   .. ..- attr(*, "predvars")= language list(sr_health, consc)
##   .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
##   .. .. ..- attr(*, "names")= chr [1:2] "sr_health" "consc"
##  $ residuals    : Named num [1:60] -1.052 0.374 -0.14 -0.462 0.86 ...
##   ..- attr(*, "names")= chr [1:60] "1" "2" "3" "4" ...
##  $ coefficients : num [1:2, 1:4] 1.657 0.49 0.357 0.119 4.641 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:2] "(Intercept)" "consc"
##   .. ..$ : chr [1:4] "Estimate" "Std. Error" "t value" "Pr(>|t|)"
##  $ aliased      : Named logi [1:2] FALSE FALSE
##   ..- attr(*, "names")= chr [1:2] "(Intercept)" "consc"
##  $ sigma        : num 0.885
##  $ df           : int [1:3] 2 58 2
##  $ r.squared    : num 0.227
##  $ adj.r.squared: num 0.214
##  $ fstatistic   : Named num [1:3] 17 1 58
##   ..- attr(*, "names")= chr [1:3] "value" "numdf" "dendf"
##  $ cov.unscaled : num [1:2, 1:2] 0.1626 -0.0513 -0.0513 0.018
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:2] "(Intercept)" "consc"
##   .. ..$ : chr [1:2] "(Intercept)" "consc"
##  - attr(*, "class")= chr "summary.lm"

Extracting sigma

Remember, R calls the standard error of the estimate sigma, so we can get it out of the new summary object using list subsetting:

Code
summary(model)$sigma
Output
## [1] 0.8853976

Check Equivalency

And we can see that it is equivalent to what we calculated above:

Code
sqrt(sum((health$sr_health - fitted.values(model))^2)/(nrow(health)-2)) == summary(model)$sigma
Output
## [1] TRUE

Q1:

Question: What does a standard error of the estimate of 0.89 mean? Is that good?

Extracting the coefficient of determination, \(R^2\)

Recall that another way we can evaluate our regression model is by extracting the coefficient of determination, \(R^2\). This represents the proportion of variance explained by the model. Like the standard error of the estimate, we need to use the summary function to get it:

Code
summary(model)$r.squared
Output
## [1] 0.2270871

Q2:

Question: What does an \(R^2\) of 0.23 mean (in plain English)?

Extracting Regression Coefficients

Recall that we also get estimates for the individual regression coefficients, \(b_0\) and \(b_1\) in the case of univariate regression. Like many of the other components we’ve covered, you can extract the coefficients with list subsetting:

Code
model$coefficients
Output
## (Intercept)       consc 
##   1.6569733   0.4904059

coefficients() function

Or we could get the same information with the coefficients() function:

Code
coefficients(model)
Output
## (Intercept)       consc 
##   1.6569733   0.4904059
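
coef() is a shorter alias that returns the same named vector:

coef(model) # identical to coefficients(model)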

Q3:

Question: What does the intercept mean?

Question: What about the slope for conscientiousness?

You probably recall that these are called the unstandardized coefficients. We can also get the standardized coefficients, but that is a little trickier. Before I show you how to do that:

Q4:

Question: How does a standardized coefficient differ from an unstandardized coefficient?

Getting standardized coefficients

Standardized regression coefficients, often notated as \(\beta\), are just the regression coefficients after the variables have been standardized (z-scored). To obtain them, we need to z-score our data with scale() before we run the lm() function. One really cool thing is that we can do it right inside the lm() call:

Code
std_model <- lm(scale(sr_health) ~ scale(consc), data = health)

coefficients(std_model) %>% 
  round(3)
Output
##  (Intercept) scale(consc) 
##        0.000        0.477
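
Equivalently, you could z-score the variables in the data frame first and then fit the model as usual (a quick sketch using the same health data; the coefficients should match those above):

health_z <- health %>% 
  mutate(sr_health = scale(sr_health), # z-score the outcome
         consc = scale(consc))         # z-score the predictor

lm(sr_health ~ consc, data = health_z) %>% 
  coefficients() %>% 
  round(3)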

Q5:

Question: What does the standardized slope for conscientiousness mean?

Question: Why is the intercept zero?

NHST in Regression

So far, we’ve been covering how to estimate the various regression components in R and how to extract those components from our model object. However, within the NHST tradition, we also typically want significance tests. With regression, we have several significance tests simultaneously.

Omnibus test

First, we have the omnibus test, which tests whether the regression model significantly outperforms the null (intercept-only) model. This is typically done with an F ratio:

\[F = \frac{MS_{model}}{MS_{residual}}\]

Getting F

That is also stored in the summary of our model:

Code
summary(model)$fstatistic
Output
##   value   numdf   dendf 
## 17.0408  1.0000 58.0000

Getting p value

And we can use these numbers to look up its p value:

Code
pf(summary(model)$fstatistic[1], 
   summary(model)$fstatistic[2], 
   summary(model)$fstatistic[3], 
   lower.tail = FALSE)
Output
##        value 
## 0.0001186082

Getting p value II

Alternatively, we can pass our model to the anova() function, which gives us an F table or ANOVA table for our regression model:

Code
anova(model)
Output
## Analysis of Variance Table
## 
## Response: sr_health
##           Df Sum Sq Mean Sq F value    Pr(>F)    
## consc      1 13.359 13.3588  17.041 0.0001186 ***
## Residuals 58 45.468  0.7839                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
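
As a quick check, the F reported above is just the ratio of the two mean squares in this table:

ms <- anova(model)$"Mean Sq" # extract the Mean Sq column from the ANOVA table
ms[1] / ms[2] # MS_model / MS_residual, which should match the F above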

Q6:

Question: Is the F significant? What does this mean?

Coefficient tests

In addition to the omnibus test, we get a significance test for each of our model’s coefficients.

Recall that for each coefficient, we get a t test from the formula:

\[t = \frac{b_1}{se_b}, \qquad \text{where } se_b = \frac{s_Y}{s_X}\sqrt{\frac{1 - r^2_{XY}}{n-2}}\]

This t is distributed with \(df = n - 2\).
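
Here is a sketch of building that t by hand from the sample statistics (using the health data from above):

# standard error of the slope, from the sample SDs and the correlation
se_b <- (sd(health$sr_health) / sd(health$consc)) *
  sqrt((1 - cor(health$sr_health, health$consc)^2) / (nrow(health) - 2))

coefficients(model)["consc"] / se_b # t = slope estimate divided by its standard error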

More directly, we can get these tests by extracting the coefficients from the summary of our model object:

Code
summary(model)$coefficients
Output
##              Estimate Std. Error  t value     Pr(>|t|)
## (Intercept) 1.6569733  0.3570543 4.640676 0.0000202958
## consc       0.4904059  0.1187984 4.128050 0.0001186082
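
And the two-tailed p value for the slope can be recovered from its t and the residual degrees of freedom:

2 * pt(abs(summary(model)$coefficients["consc", "t value"]), # two-tailed, so double the upper tail
       df = model$df.residual, lower.tail = FALSE)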

Q7:

Question: Is the test of the intercept significant? What does this mean?

Question: Is the test of the slope significant? What does this mean?

Also, recall that in the case of simple univariate regression, the omnibus F is equivalent to the squared t for the slope:

Code
summary(model)$fstatistic[[1]]
summary(model)$coefficients[2,3]^2
Output
## [1] 17.0408
## [1] 17.0408

summary() on its own

Finally, all of this information is displayed when we just run summary():

summary(model)
## 
## Call:
## lm(formula = sr_health ~ consc, data = health)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.5450 -0.6414 -0.1623  0.6817  2.0154 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   1.6570     0.3571   4.641 2.03e-05 ***
## consc         0.4904     0.1188   4.128 0.000119 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8854 on 58 degrees of freedom
## Multiple R-squared:  0.2271, Adjusted R-squared:  0.2138 
## F-statistic: 17.04 on 1 and 58 DF,  p-value: 0.0001186

A Tidier way to Extract Information from Regression Models

You may have noticed at this point that working with lists has its challenges. Even extracting the information we’ve pulled out so far required some pretty messy code. There must be a better (tidier) way!

Thankfully, there is. It requires the {broom} library, which might be new to you; it is a package for tidying model results objects. It’s pretty easy to use - you just pass the model object to a function from {broom} called tidy(). There are some more advanced things you can do, but tidy(model_obj) (where model_obj is the name of the model) works for most purposes. And, one really nice thing about {broom} is that it works with a lot of different types of models, so this will continue to work as we move on to other techniques (e.g., multilevel models with lme4).

broom::tidy()

Let’s see what happens when we tidy our model:

tidy(model)
## # A tibble: 2 x 5
##   term        estimate std.error statistic   p.value
##   <chr>          <dbl>     <dbl>     <dbl>     <dbl>
## 1 (Intercept)    1.66      0.357      4.64 0.0000203
## 2 consc          0.490     0.119      4.13 0.000119

You can see it produces a dataframe containing the model coefficients and their significance tests.

broom::glance()

{broom} also has a function called glance() that provides some of the omnibus model information we might want:

glance(model)
## # A tibble: 1 x 11
##   r.squared adj.r.squared sigma statistic p.value    df logLik   AIC   BIC
##       <dbl>         <dbl> <dbl>     <dbl>   <dbl> <int>  <dbl> <dbl> <dbl>
## 1     0.227         0.214 0.885      17.0 1.19e-4     2  -76.8  160.  166.
## # … with 2 more variables: deviance <dbl>, df.residual <int>
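
Because glance() returns a one-row data frame, you can use ordinary dplyr verbs on it, for example to keep just the statistics you plan to report:

glance(model) %>% 
  select(r.squared, adj.r.squared, statistic, p.value) # keep just the omnibus statistics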

broom::augment()

augment() provides even more information from our model:

augment(model)
## # A tibble: 60 x 9
##    sr_health consc .fitted .se.fit  .resid   .hat .sigma   .cooksd .std.resid
##        <dbl> <dbl>   <dbl>   <dbl>   <dbl>  <dbl>  <dbl>     <dbl>      <dbl>
##  1      1.91  2.66    2.96   0.116 -1.05   0.0173  0.882 0.0126       -1.20  
##  2      4.49  5.02    4.12   0.283  0.374  0.102   0.892 0.0112        0.445 
##  3      3.40  3.85    3.54   0.165 -0.140  0.0347  0.893 0.000462     -0.160 
##  4      2.05  1.73    2.51   0.175 -0.462  0.0390  0.891 0.00575      -0.533 
##  5      4.21  3.45    3.35   0.135  0.860  0.0233  0.886 0.0115        0.982 
##  6      5.11  4.01    3.62   0.179  1.49   0.0408  0.870 0.0629        1.72  
##  7      3.09  2.82    3.04   0.114  0.0510 0.0167  0.893 0.0000286     0.0581
##  8      4.58  3.90    3.57   0.170  1.00   0.0367  0.883 0.0255        1.16  
##  9      3.90  3.10    3.18   0.118  0.724  0.0178  0.888 0.00619       0.826 
## 10      2.64  3.04    3.15   0.117 -0.510  0.0173  0.891 0.00298      -0.581 
## # … with 50 more rows

Using augment, we get the DV, IV, fitted values, residuals, and other diagnostic information we’ll learn about in future weeks.
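
For instance, the sum of squared residuals we calculated earlier can be recomputed straight from the augmented data:

augment(model) %>% 
  summarize(ss_resid = sum(.resid^2)) # residual sum of squares from the .resid column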

So with broom, just remember:
1. We tidy up coefficients.
2. We glance at (omnibus) model statistics.
3. We augment to find everything else.

The payoff for these functions comes when you want to do something with your regression results. As an example, you could use tidy() to make it easier to plot your regression coefficients:

tidy(model) %>% 
  ggplot(aes(x = term, y = estimate)) +
  geom_point() +                                   # one point per coefficient estimate
  geom_linerange(aes(ymin = estimate - std.error,  # bars spanning +/- 1 SE
                     ymax = estimate + std.error))

Reporting Regressions

The last thing we’ll cover today is how to report the results of your regression in tables. We’ll cover three different methods, each of which has its pros and cons.

‘by hand’ using broom and kable

Our first option would be to make a table ‘by hand’ using a combination of tidy() and the kable function from {knitr}.

tidy(model) %>% # first run tidy on the model.
                # Then pipe it to knitr's kable function
  knitr::kable(digits = c(NA, 2, 2, 2, 3), # set digits; everything rounded to 
                                          # 2 except the labels (NA) and p's (3 digits)
               caption = "Results of Regressing Self-Reported Health on Conscientiousness") # provide a table caption
Results of Regressing Self-Reported Health on Conscientiousness

term          estimate   std.error   statistic   p.value
(Intercept)       1.66        0.36        4.64         0
consc             0.49        0.12        4.13         0

We could clean things up a bit more by changing the names and re-formatting that pesky p value column:

tidy(model) %>% # we'll still run tidy on the model first
  # but we'll pass it to rename (part of the tidyverse/dplyr)
  # and rename some of the columns to be more similar to how
  # we normally report these things.
  # rename is pretty easy, it's a way to rename column names
  # the general format is rename(new_name = old_name),
  # where new_name is the new name you want the column to have
  # and old_name is the old name that you're replacing.
  rename(coefficient = term,
        b = estimate,
        SE = std.error,
        t = statistic,
        p = p.value) %>%
  # Then we can mutate the p column to deal with the
  # triple zeroes
  mutate(p = ifelse(p > .001, round(p, 3), "< .001")) %>% 
  knitr::kable(digits = c(NA, 2, 2, 2, 3), # Then we'll do the same as above with kable
               caption = "Results of Regressing Self-Reported Health on Conscientiousness") 
Results of Regressing Self-Reported Health on Conscientiousness

coefficient      b     SE      t        p
(Intercept)   1.66   0.36   4.64   < .001
consc         0.49   0.12   4.13   < .001

This method is nice for two reasons:
1. You have a lot of control over how things look.
2. It’s pretty general-purpose, and you can easily adapt it to new things you learn how to do in R.

However, it has a downside in that it is hard to get this into picture-perfect APA format (we didn’t get all the way there above), so you may have to do some editing once you get it into Word.

Stargazer

The {stargazer} library can be used to pretty easily get APA-formatted tables from many kinds of results or model objects. It’s super easy to use:

stargazer(model, # give it the model
          type = "html", # tell it the format type.
                        # I'm using html since this is an html doc
          out = "./tbl/reg_tbl_sg.html") # you can give it a path to save
                                  Dependent variable:
                                  -------------------
                                       sr_health
---------------------------------------------------------------
consc                                   0.490***
                                        (0.119)
Constant                                1.657***
                                        (0.357)
---------------------------------------------------------------
Observations                               60
R2                                        0.227
Adjusted R2                               0.214
Residual Std. Error                 0.885 (df = 58)
F Statistic                      17.041*** (df = 1; 58)
---------------------------------------------------------------
Note:                        *p<0.1; **p<0.05; ***p<0.01

This is nice in that it is super easy and puts things into APA format for you. Its downsides are that it is less flexible, and stargazer may not have a method for your results object if you’re using more advanced or cutting-edge tools.

sjPlot

A third option is to use tab_model() from the {sjPlot} library. This one does not always render well within the R Markdown document itself, but it is very easy to export these tables to Word. We can do so by setting the file extension of our output to .doc:

sjPlot::tab_model(model, file = "./tbl/reg_tbl_sjp.doc")
                       sr health
Predictors             Estimates   CI            p
(Intercept)            1.66        0.94 – 2.37   <0.001
consc                  0.49        0.25 – 0.73   <0.001
Observations           60
R2 / R2 adjusted       0.227 / 0.214

This has similar cons to stargazer, and additionally it doesn’t always play nicely with R Markdown. However, it’s super easy to use and even easier than stargazer for exporting tables to MS Word.

Minihack 1

For this minihack, you’ll demonstrate the equivalence of a correlation between two variables and the standardized coefficient from a univariate regression regressing one on the other. You’ll be working with a dataset called PSY612_Lab2_Data.csv, which is located in the lab-2/data subfolder. It can be downloaded from the following web address:

https://raw.githubusercontent.com/uopsych/psy612/master/labs/lab-2/data/PSY612_Lab2_Data.csv

It contains a number of variables, but we’ll focus on Extraversion and Happiness first.

Calculate the bivariate correlation between Extraversion and Happiness. Then, conduct a univariate regression, regressing Happiness on Extraversion, and obtain the standardized estimate for the slope. Finally, conduct a series of logical tests showing the equivalence of the estimates (the correlation and the standardized slope), their test statistics, and the p values associated with those test statistics (Hint: You can round to 3 digits, which will probably be necessary).

# code here

Minihack 2

For this minihack, you’ll calculate predicted scores ‘by hand’ using the regression equation, compare them to the predicted scores stored in the model, and finally use the predicted scores you calculate to estimate \(R^2\). We’ll work with the same dataset, but this time you’ll regress social support (SocSup) on Extraversion.

a.) Run the regression model. Extract the coefficients, calculate the predicted scores using the extracted coefficients (HINT: think about the regression equation).

# code here

b.) Demonstrate that the predicted scores are the same as the values from fitted.values.

# code here

c.) Use those predicted scores to calculate \(R^2\). Demonstrate its equivalence to the result of summary(model)$r.squared.

# code here

Minihack 3

Create two tables using two different methods we covered today. The first table should correspond to the regression from Minihack 1 (regressing Happiness on Extraversion) and the second should correspond to the regression from Minihack 2 (regressing social support on Extraversion).

# Code here
# code here