You will need the Rmd file for today’s lab - you can download it here.
Today we’ll be going over the basics of univariate linear regression in R, including how to report regression results in tables.
Today, we’ll just be focusing on univariate linear regression with a single continuous predictor, but over the coming weeks we will build up to much more complicated regression models.
library(tidyverse) # for plotting and data wrangling
library(rio) # for importing data
library(psych) # for covariance and correlation functions
library(apaTables) # for correlation tables
library(pwr) # for power calculation
library(broom) # for cleaning up output
library(stargazer) # for regression tables
library(sjPlot) # for exporting regression tables to Word
A univariate linear regression model is made up of two parameters: a y-intercept (\(\beta_0\)) and a slope (\(\beta_1\)). If we had population-level data, we could assess the true values of each of these model terms.
\[Y_i = \beta_0 + \beta_1X_i + \epsilon_i\]
However, because we are typically working with sample data, we most often have to estimate these values in the univariate linear regression model:
\[Y_i = b_0 + b_1X_i + e_i\]
We determine the values of these estimates by finding the best-fitting linear model for a set of collected data and determining the slope and y-intercept of that model. Once we have our best estimates of the slope and y-intercept, we can use these values to calculate fitted values of Y (\(\hat{Y_i}\)) given a particular value of our predictor variable, X.
\[\hat{Y_i} = b_0 + b_1X_i\]
Question: Using the Ordinary Least Squares (OLS) estimation method, what criterion do we use to determine the best-fitting linear model?
Without further ado, let’s go ahead and estimate a regression equation in R.
Conducting regressions in R is actually pretty simple. We use the lm() function, which is part of the pre-loaded {stats} library. The lm() function takes two arguments:
1. The formula: specify your regression formula in the general form y ~ x.
2. The data: the dataframe that contains the variables in the formula. This argument is technically optional.
Example: model <- lm(y ~ x, data = df)
Let’s perform a linear regression analysis. First we’ll import the same data that we used last week:
health <- import("https://raw.githubusercontent.com/uopsych/psy612/master/labs/lab-1/data/consc_health.csv")
head(health)
Say we’re interested in whether self-reported health scores are related to self-reported conscientiousness scores. Here’s a quick reminder of the visualization you looked at last week:
ggplot(data = health, aes(x = consc, y = sr_health)) +
geom_point() +
geom_smooth(method = "lm") +
labs(x = "Conscientiousness", y = "Self-reported health", title = "Conscientiousness and self-reported health")
Run the regression using the lm() function. Store the results of lm() in an object called model.
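A minimal sketch of that call (assuming we predict sr_health from consc):
# fit the univariate regression and store the results in an object called model
model <- lm(sr_health ~ consc, data = health)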
Let’s see what the output looks like using print():
print(model)
The output provides us with the following:
* The formula we used for the linear model.
* The estimates of the model coefficients: b0 (the intercept) and b1 (the slope corresponding to the predictor variable).
We can get even more information using summary():
summary(model)
This output provides us with the following:
* The model formula and a summary of the distribution of the residuals.
* The coefficient estimates, along with their standard errors, t statistics, and p values.
* Two measures of overall model fit: the residual standard error and R-squared.
* The omnibus F-test for the model.
Let’s unpack some of that output, starting with how the two measures of overall model fit, the residual standard error and R-squared, are derived.
The residual standard error, also called the standard error of the estimate, measures the average deviation between observed and fitted values of Y. It is in the original units of the outcome variable, Y. See the formula below.
\[\sqrt{\frac{\Sigma(Y_i - \hat{Y_i})^2}{N-2}}\]

Let’s calculate the residual standard error from scratch, and then I will show you how to pull this value directly from the output.
First, we need to get the predicted values of Y.
Our regression model object, called model, is actually a list that contains a lot of useful information we don’t see in the output above. Let’s take a look at all the information stored in this object using the str() function:
str(model)
We can extract elements from this list by using LIST$ELEMENT or LIST[["ELEMENT"]].
The fitted values of Y are stored in model in an element called fitted.values. Using the extraction methods described right above, extract the fitted values from the model and store them in an object called fitted_Y. Print the fitted Y values.
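For example, one way to do this:
# extract the fitted values from the model list and store them
fitted_Y <- model$fitted.values

# print the fitted Y values
fitted_Y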
Next, calculate the difference between the actual values of Y and the fitted values of Y. Store the output in an object called residuals and print them.
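A minimal sketch, assuming the outcome column is named sr_health (as in the imported data):
# residuals are the observed values of Y minus the fitted values of Y
residuals <- health$sr_health - fitted_Y

# print the residuals
residuals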
OK, so I made you do extra work. You can actually grab the residuals directly from the model object using model$residuals:
model$residuals
Let’s check to see that our calculation of the residuals matches what we grabbed directly from the model output.
round(residuals, 8) == round(model$residuals, 8)
With this information, we can calculate the sum of squared residuals, which is the numerator of the fraction in the residual standard error equation:
\[\Sigma(Y_i - \hat{Y_i})^2\]

Calculate the sum of squared residuals and store the output in an object called sum_sq_residuals.
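For example:
# square each residual, then sum the squared residuals
sum_sq_residuals <- sum(residuals^2)
sum_sq_residuals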
Finally, now that we have pulled the necessary components, calculate the residual standard error using the formula above. Store the results of the calculation to an object called residual_standard_error.
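One way to do this, using nrow() to get N:
# residual standard error = sqrt(SS_residual / (N - 2))
residual_standard_error <- sqrt(sum_sq_residuals / (nrow(health) - 2))
residual_standard_error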
We can also get the residual standard error more directly. However, it is not stored in the model object we created with lm(); it is stored in the output of the summary() function. Let’s assign the output of this function to an object called model_summary, and then inspect its structure:
model_summary <- summary(model)
str(model_summary)
Can you find what the residual standard error is called? Once you find it, extract it from model_summary.
And see that it is equivalent to what we calculated above.
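One way to do both of these steps (in the str() output above, the residual standard error is stored in the element called sigma):
# extract the residual standard error from the summary object
model_summary$sigma

# compare to the value we calculated by hand (rounding to avoid floating-point mismatch)
round(residual_standard_error, 8) == round(model_summary$sigma, 8)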
Question: What does this residual standard error (aka standard error of the estimate) mean?
Another measure of how well our linear model is capturing variation in the outcome variable is \(R^2\). Let’s calculate \(R^2\) by hand first. Here is the formula:

\[R^2 = \frac{SS_{Model}}{SS_Y} = \frac{\Sigma(\hat{Y_i} - \bar{Y})^2}{\Sigma(Y_i - \bar{Y})^2}\]
Calculate the following:
* SS_Model
* SS_Y
* r_squared
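One way to calculate these, reusing fitted_Y from above:
# SS_Model: squared deviations of the fitted values from the mean of Y
SS_Model <- sum((fitted_Y - mean(health$sr_health))^2)

# SS_Y: squared deviations of the observed values from the mean of Y
SS_Y <- sum((health$sr_health - mean(health$sr_health))^2)

# R-squared is their ratio
r_squared <- SS_Model / SS_Y
r_squared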
We can also extract the coefficient of determination, \(R^2\), directly from the output; it is stored in the element called r.squared and represents the proportion of variance in Y explained by the model. Like the residual standard error, we need to use the summary function to get it:
model_summary$r.squared
round(r_squared, 8) == round(model_summary$r.squared, 8)
Question: What does this r-squared value mean (in plain English)?
We can perform an omnibus test to see whether, overall, the amount of variation accounted for by the model is significantly different from what we would expect if the null hypothesis were true. The omnibus test typically uses an F-statistic, which is the variance accounted for by the model divided by the residual variance:
\[F = \frac{MS_{model}}{MS_{residual}}\]
Remember that variance (aka mean square) is the sum of squares divided by the degrees of freedom for each respective term in the equation above.
Let’s get MS_Model, which is SS_Model divided by df_Model. We already calculated SS_Model above, so we just need to divide it by df_Model, which is equal to k (the number of predictors).
MS_Model <- SS_Model/1 # df_Model = k = 1 predictor in this model
Now get MS_Residual. We could get SS_Residual to calculate this a couple of ways:
\[SS_Y = SS_{Model} + SS_{Residual}\]
Either rearrange the above equation to solve for SS_Residual, or calculate it from the raw and fitted values using the equation below:
\[SS_{Residual} = \Sigma(Y_i - \hat{Y_i})^2\]
Use one of these methods to calculate SS_Residual below.
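For example, rearranging the equation above:
# SS_Residual = SS_Y - SS_Model (sum(residuals^2) would give the same answer)
SS_Residual <- SS_Y - SS_Model
SS_Residual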
Divide SS_Residual by df_Residual, which is n - k - 1.
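A sketch, with n taken from the data and k = 1:
# df_Residual = n - k - 1 (with k = 1 predictor)
MS_Residual <- SS_Residual / (nrow(health) - 1 - 1)
MS_Residual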
We can finally calculate the F-statistic. Store it in an object called F_stat.
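For example:
# F is the ratio of the model variance to the residual variance
F_stat <- MS_Model / MS_Residual
F_stat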
The value of the F-statistic is also stored in the summary of our model:
summary(model)$fstatistic
Does it match what we got when we calculated from scratch?
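One way to check, rounding to avoid floating-point mismatch:
round(F_stat, 8) == round(summary(model)$fstatistic[[1]], 8)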
And we can use these numbers to look up its p value:
pf(summary(model)$fstatistic[1], # the F-statistic
summary(model)$fstatistic[2], # numerator df
summary(model)$fstatistic[3], # denominator df
lower.tail = FALSE) # we want the probability of obtaining an F-statistic more extreme than ours
Question: Is the F significant? What does this mean?
Recall that we also get estimates for the individual regression coefficients, \(b_0\) and \(b_1\) in the case of univariate regression.
We can calculate these from scratch using the following formulas.
For obtaining b1:

\[b_1 = r(\frac{s_y}{s_x})\]

Calculate b1 using the equation above. Store the value in an object called b1.
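One way to do this with cor() and sd(), assuming the variable names consc and sr_health:
# b1 = correlation multiplied by the ratio of standard deviations (SD of Y over SD of X)
b1 <- cor(health$consc, health$sr_health) * (sd(health$sr_health) / sd(health$consc))
b1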
And for obtaining b0:

\[b_0 = M_y - b_1M_x\]
Calculate b0 using the equation above. Store the value in an object called b0.
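For example:
# b0 = mean of Y minus b1 times the mean of X
b0 <- mean(health$sr_health) - b1 * mean(health$consc)
b0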
We can also grab these estimates from the model output. They are stored in the element called coefficients.
model$coefficients
Question: What does the intercept mean?
Question: What about the slope for conscientiousness?
You probably recall that these are called the unstandardized coefficients. We can also get the standardized coefficients, but that is a little trickier. Before I show you how to do that:
Question: How does a standardized coefficient differ from an unstandardized coefficient?
Standardized regression coefficients, often notated as \(\beta\), are just the regression coefficients after the variables have been standardized (i.e., z-scored). To obtain them, we need to z-score our data with scale() before we run the lm() function. One really cool thing is that we can do it within the lm() call:
std_model <- lm(scale(sr_health) ~ scale(consc), data = health)
coefficients(std_model) %>%
round(3)
Question: What does the standardized slope for conscientiousness mean?
Question: Why is the intercept zero?
In addition to the omnibus test, we can also test the significance of each of the regression coefficient estimates.
Recall that for each coefficient, we get a t test from the formula:
\[t = \frac{b_1}{se_b}\]
where the standard error is:

\[se_b = \frac{s_Y}{s_X}\sqrt{\frac{1 - r^2_{XY}}{n-2}}\]
This t is distributed with \(df = n - 2\).
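To make the formula concrete, here is a sketch of the by-hand calculation for the slope, reusing b1 from above:
# standard error of the slope: (s_Y / s_X) * sqrt((1 - r^2) / (n - 2))
se_b <- (sd(health$sr_health) / sd(health$consc)) *
  sqrt((1 - cor(health$consc, health$sr_health)^2) / (nrow(health) - 2))

# t-statistic for the slope
t_b1 <- b1 / se_b
t_b1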
We can get these tests by extracting the coefficients element from the summary of our model object:
summary(model)$coefficients
Question: Is the test of the intercept significant? What does this mean?
Question: Is the test of the slope significant? What does this mean?
You may have noticed at this point that working with lists has its challenges. Even extracting the information we’ve pulled out so far required some pretty messy code. There must be a better (tidier) way!
Thankfully, there is. This requires the {broom} library, which might be new to you. It is a package for tidying model results objects. It’s pretty easy to use: you just pass the model object to a function from {broom} called tidy(). There are some more advanced things you can do, but just tidy(model_obj) (where model_obj is the name of the model) works for most purposes. And one really nice thing about {broom} is that it works with a lot of different types of models, so this will continue to work as we move on to other techniques (e.g., multilevel models with lme4).
Let’s see what happens when we tidy our model:
tidy(model)
You can see it produces a dataframe containing the model coefficients and their significance tests.
{broom} also has a function called glance() that provides some of the omnibus model information we might want:
glance(model)
augment() provides even more information from our model:
augment(model)
Using augment, we get the DV, IV, fitted values, residuals, and other diagnostic information we’ll learn about in future weeks.
So with {broom}, just remember:
1. We tidy() up coefficients.
2. We glance() at (omnibus) model statistics.
3. We augment() to find everything else.
The payoff for these functions comes when you want to do something with your regression results. As an example, you could use tidy() to make it easier to plot your regression coefficients:
tidy(model) %>%
  ggplot(aes(x = term, y = estimate)) +
  geom_point() +
  geom_linerange(aes(ymin = estimate - std.error,
                     ymax = estimate + std.error))
The last thing we’ll cover today is how to report the results of your regression in tables.
Our first option is to make a table ‘by hand’ using a combination of tidy() and the kable() function from {knitr}.
tidy(model) %>% # first run tidy on the model.
# Then pipe it to knitr's kable function
knitr::kable(digits = c(NA, 2, 2, 2, 3), # set digits; everything rounded to
# 2 except the labels (NA) and p's (3 digits)
caption = "Results of Regressing Self-Reported Health on Conscientiousness") # provide a table caption
We could clean things up a bit more by changing the names and re-formatting that pesky p value column:
tidy(model) %>% # we'll still run tidy on the model first
# but we'll pass it to rename (part of the tidyverse/dplyr)
# and rename some of the columns to be more similar to how
# we normally report these things.
# rename is pretty easy, it's a way to rename column names
# the general format is rename(new_name = old_name),
# where new_name is the new name you want the column to have
# and old_name is the old name that you're replacing.
rename(coefficient = term,
b = estimate,
SE = std.error,
t = statistic,
p = p.value) %>%
# Then we can mutate the p column to deal with the
# triple zeroes
mutate(p = ifelse(p > .001, round(p, 3), "< .001")) %>%
knitr::kable(digits = c(NA, 2, 2, 2, 3), # Then we'll do the same as above with kable
caption = "Results of Regressing Self-Reported Health on Conscientiousness")
This method is nice for two reasons:
1. You have a lot of control over how things look.
2. It’s pretty general-purpose, and you can easily adapt it to new things you learn how to do in R.
However, it has a downside: it is hard to get this into picture-perfect APA format (we didn’t get all the way there above), so you may have to do some editing once you get it into Word.
This gets us most of the way to an APA-formatted table, but not quite all the way. We can get APA-formatted tables with even less code by using the {stargazer} library:
stargazer(model, type = "html", out = "./tbl/reg_tbl_sg.html")
sjPlot
A third option is to use tab_model() from the {sjPlot} library. This one does not work well within the R Markdown document, but it is very easy to export these tables to Word. We can do so by setting the file extension of our output to .doc:
sjPlot::tab_model(model, file = "./tbl/reg_tbl_sjp.doc")
For this minihack, you’ll demonstrate the equivalence of a correlation between two variables and the standardized coefficient from a univariate regression regressing one on the other. You’ll be working with a dataset called PSY612_Lab2_Data.csv, which is located in the lab-2/data subfolder. It can be downloaded from the following web address:
https://raw.githubusercontent.com/uopsych/psy612/master/labs/lab-2/data/PSY612_Lab2_Data.csv
It contains a number of variables, but we’ll focus on Extraversion and Happiness first.
Calculate the bivariate correlation between Extraversion and Happiness. Then, conduct a univariate regression, regressing Happiness on Extraversion, and obtain the standardized estimate for the slope. Finally, conduct a series of logical tests showing the equivalence of the estimates (the correlation and the standardized slope), their test statistics, and the p values associated with those test statistics (Hint: you can round to 3 digits, which will probably be necessary).
# code here
For this minihack, you’ll calculate fitted scores ‘by hand’ using the regression equation, compare them to the fitted scores stored in the model, and finally use the fitted scores you calculate to estimate \(R^2\). We’ll work with the same dataset, but this time you’ll regress social support (SocSup) on Extraversion.
a.) Run the regression model. Extract the coefficients, and calculate the fitted scores using the extracted coefficients (HINT: think about the regression equation).
# code here
b.) Demonstrate that the fitted scores are the same as the values from fitted.values.
# code here
c.) Use those fitted scores to calculate \(R^2\). Demonstrate its equivalence to the result of summary(model)$r.squared.
# code here
Create two tables using two different methods we covered today. The first table should correspond to the regression from Minihack 1 (regressing Happiness on Extraversion) and the second should correspond to the regression from Minihack 2 (regressing social support on Extraversion).
# code here
# code here