You will need the Rmd file for today's lab; you can download it here.
Before we begin, load the following libraries. Install any of the packages you have not yet installed on your computers.
library(rio) # for importing data
library(tidyverse) # for plotting and data wrangling
library(ppcor) # for calculating partial and semi-partial correlations
library(janitor) # for cleaning up data names
library(scatterplot3d) # for drawing non-interactive 3d plots
library(car) # for drawing interactive 3d plots
# suppress scientific notation
options(scipen = 999)
Import the data set and inspect it. What variables do we have in our data set? What measurement scale is each variable on?
# import data
data <- import(file = paste("https://raw.githubusercontent.com",
                            "uopsych",
                            "psy612",
                            "master",
                            "labs",
                            "lab-4",
                            "data",
                            "data_htwtage.csv",
                            sep = "/"),
               setclass = "tibble") %>%
  clean_names()
# look at the data
data
Let's start by inspecting the relationship between age and weight.
Visualize the relationship between age and weight. Let's also label the data points with the name of each participant.
ggplot(data = data,
       mapping = aes(x = scale(age),
                     y = scale(weight))) +
  geom_smooth(method = "lm",
              se = TRUE,
              lwd = 1.5,
              color = "turquoise4") +
  geom_text(aes(label = name),
            # hjust and vjust are fixed nudges, so they belong outside aes()
            hjust = -0.1,
            vjust = -0.1) +
  geom_point(alpha = .8,
             color = "red3") +
  labs(x = "Standardized Age",
       y = "Standardized Weight",
       title = "Weight Regressed On Age") +
  theme_bw() +
  theme(plot.title = element_text(hjust = 0.5))
Regress weight on age. Standardize the variables in the model. Save the model to an object called model_1.
model_1 <- # your code here
summary(model_1)
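If you get stuck, one possible way to fill in the placeholder (a sketch; scale() standardizes each variable and lm() fits the regression):

# a possible solution: standardize both variables inside lm()
model_1 <- lm(scale(weight) ~ scale(age), data = data)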
The results provide us with the following model:
\[W_i = .00 + .74A_i + \epsilon_i\]
Question How is the standardized regression coefficient in a univariate regression model related to the zero-order correlation between the predictor and the outcome variable?
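You can check your answer empirically. With a single predictor, the standardized slope and the zero-order correlation should agree (a quick sketch, refitting the standardized model inline):

# zero-order correlation between age and weight
cor(data$age, data$weight)
# standardized slope from the univariate regression
coef(lm(scale(weight) ~ scale(age), data = data))["scale(age)"]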
Question What's another variable that could help explain the differences between the predicted values and the actual values? In other words, what's another variable that could explain variation in weight that is not explained by age?
Now, let's use both age and height to predict weight.
Regress weight on age and height. Standardize the variables in the model. Save the model to an object called model_2.
model_2 <- # your code here
summary(model_2)
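Again, one possible way to fill in the placeholder (a sketch):

# a possible solution: standardize all three variables inside lm()
model_2 <- lm(scale(weight) ~ scale(age) + scale(height), data = data)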
The results provide us with the following model:
\[W_i = .00 + .08A_i + .81H_i + \epsilon_i\]
Question How does the regression coefficient corresponding to age in model_2 differ from the corresponding coefficient in model_1?
We can visualize the relationship among weight, age, and height using a 3D scatterplot. We can create a non-interactive version of a 3D scatterplot using the scatterplot3d() function from the {scatterplot3d} package.
s3d <- scatterplot3d(x = scale(data$age),
                     y = scale(data$height),
                     z = scale(data$weight),
                     # labels
                     xlab = "Age",
                     ylab = "Height",
                     zlab = "Weight",
                     # point color
                     color = "red3",
                     # point shape
                     pch = 20)
s3d$plane3d(model_2, lty.box = "solid", col = "turquoise4")
We can also create an interactive version of a 3D scatterplot using the scatter3d()
function from the {car}
package.
scatter3d(formula = scale(weight) ~ scale(age) + scale(height),
          data = data,
          # labels
          xlab = "Age",
          ylab = "Weight",
          zlab = "Height",
          # colours
          point.col = "red3",
          surface.col = "turquoise4",
          neg.res.col = "darkorange1",
          pos.res.col = "darkorange1",
          axis.col = rep("black", 3),
          # animation
          revolutions = 1.50,
          speed = 0.25)
The regression coefficients corresponding to the predictors in a multiple regression model are called partial regression coefficients. Let’s break down what this term means.
A new concept that gets introduced when we have multiple predictors in the regression model is redundancy, which occurs when the predictors in a multiple regression model are correlated (which they often are) and there is overlap in the variability each can account for in the outcome variable. In other words, some of the variability X1 can account for in Y is redundant with the variability X2 can account for in Y.
The partial regression coefficients that we obtain in the “Estimate” column from the summary output of our multiple regression model account for this redundancy. A partial regression coefficient is the relationship between a predictor variable (X1) and outcome variable (Y) when the relationship between X1 and the other predictor variable(s) in the model has been removed.
Let’s walk through the logic of a partial regression coefficient by re-creating the partial regression coefficient corresponding to age
(\(b_1\) = 0.0838) from our multiple regression model above.
1. Regress age on height. Standardize the variables. Store the model in an object called model_3.
model_3 <- # your code here
2. Store the part of age that is unrelated to height in a new column in data called age_resid.
data$age_resid <- # your code here
3. Regress weight on the part of age that is unrelated to height (i.e., data$age_resid). Don't forget to standardize weight to stay consistent with how we ran the multiple regression model earlier. Store the model in an object called model_4.
model_4 <- # your code here
4. Examine the output of model_4 using the summary() function. (If you get stuck, a sketch of one possible solution follows these steps.)
summary(model_4)
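Here is that sketch, one possible way to complete the three placeholders above. It uses the standardized variables (to stay consistent with model_2) and resid() to pull the residuals out of model_3:

model_3 <- lm(scale(age) ~ scale(height), data = data)  # regress age on height
data$age_resid <- resid(model_3)                         # part of age unrelated to height
model_4 <- lm(scale(weight) ~ age_resid, data = data)    # regress weight on residualized age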
Question How does the regression coefficient corresponding to the part of age that is unrelated to height in the univariate model (i.e., model_4) predicting weight compare to the regression coefficient corresponding to age in the multiple regression model (i.e., model_2)?
Question Which type of correlation is the partial regression coefficient conceptually most similar to?
If we were to calculate the correlation between weight and age_resid, it would produce the semi-partial correlation.
cor(data$weight, data$age_resid)
Let’s see that this correlation is equal to the semi-partial correlation by calculating it below.
Given X1, X2, and Y, the semi-partial correlation between X1 and Y (controlling for X2) is the correlation between the part of X1 that is not related to X2 and Y:
\[\large r_{Y(X1.X2)} = \frac{r_{YX1}-r_{YX2}r_{X1X2}}{\sqrt{1-r_{X1X2}^2}}\]
r_yx1 <- cor(data$weight, data$age)
r_yx2 <- cor(data$weight, data$height)
r_x1x2 <- cor(data$age, data$height)
semi_partial <- (r_yx1 - (r_yx2 * r_x1x2)) / (sqrt(1 - (r_x1x2^2)))
semi_partial
Notice the similarity between the formula for the semi-partial correlation and the formula for the standardized regression coefficient.
The equation for calculating the standardized regression coefficient is:
\[\large b^*_{1} = \frac{r_{YX1}-r_{YX2}r_{X1X2}}{1-r_{X1X2}^2}\]
std_regr_coef <- (r_yx1 - (r_yx2 * r_x1x2)) / ((1 - (r_x1x2^2)))
std_regr_coef
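As a check, this value should match the age coefficient from the multiple regression, assuming model_2 was fit with scale() as in the sketch above:

coef(model_2)["scale(age)"]  # should match std_regr_coef (about .08)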
You can also use functions in the {ppcor}
package to calculate semi-partial and partial correlations more quickly.
semi_partial_cor <- spcor.test(
  x = data$weight, # Y (var of interest 1)
  y = data$age,    # X1 (var of interest 2 - the one you want residualized)
  z = data$height  # X2 (control variable)
)
semi_partial_cor
semi_partial_cor$estimate^2
partial_cor <- pcor.test(
  x = data$weight, # Y (var of interest 1)
  y = data$age,    # X1 (var of interest 2)
  z = data$height  # X2 (control variable)
)
partial_cor
partial_cor$estimate^2
Question What's the difference, conceptually, in how we should understand the partial correlation between weight and age and the semi-partial correlation between weight and age?
Is there a way to check that pcor.test() is really correlating only the variation in Y that is not already explained by X2 with the variation in X1 that is not already explained by X2? We can residualize both weight and age on height ourselves and correlate the residuals:
cor(resid(lm(weight ~ height, data = data)),
resid(lm(age ~ height, data = data)))
A helpful way of framing the null hypothesis we want to test is using the model comparison approach. In this approach, you construct two models. One model represents what you would expect if the null hypothesis was true, and one model represents the alternative hypothesis. The model representing the alternative hypothesis includes the predictor(s) you want to test the significance of. The model representing the null hypothesis does not include these predictor(s). These predictors should be the only thing that differs between the two models.
Testing the significance of the overall model (a.k.a. the omnibus test) means examining whether, as a set, both age and height account for a significant amount of variation in weight scores.
We can test this by comparing two different models:
\[Model 1: W_i = \beta_0 + \epsilon_i\]
\[Model 2: W_i = \beta_0 + \beta_1A_i + \beta_2H_i + \epsilon_i\]
Where the null hypothesis states:
\[H0: \beta_1 = \beta_2 = 0\]
And the alternative hypothesis states:
\[H1: \beta_1 \neq 0 \; \textrm{AND/OR} \; \beta_2 \neq 0 \]
The key to the model comparison approach is comparing the amount of unaccounted for error remaining when Model 1 is used compared to when Model 2 is used. Then, we can test whether the difference in the amount of unaccounted for error remaining between the two models is significant.
Let’s visualize Model 1.
ggplot(data = data, aes(x = name, y = weight)) +
  geom_point(color = "red3") +
  geom_hline(yintercept = mean(data$weight),
             color = "turquoise4") +
  theme_bw()
Now, let's run Model 1 and see how much error is left unaccounted for. Save Model 1 to model_1.
model_1 <- # your code here
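Model 1 here is the intercept-only model, so one possible solution is (a sketch):

# intercept-only model: every person's predicted weight is the mean weight
model_1 <- lm(weight ~ 1, data = data)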
Let me introduce a new way of looking at the output of a regression analysis, the anova() function. This function gives us the SSE (the amount of unaccounted for error remaining in Model 1).
model_1_anova <- # your code here
model_1_anova
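A possible way to fill in the placeholder (a sketch):

model_1_anova <- anova(model_1)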
The Sum of Squares on the residuals row corresponds to the SSE. That is, if you were to calculate the distance between each person’s actual weight and the weight predicted by the model (in this case, the mean), square those distances, and sum them up, you would get 9,335.7.
Let’s visualize Model 2.
scatter3d(formula = weight ~ age + height,
          data = data,
          # labels
          xlab = "Age",
          ylab = "Weight",
          zlab = "Height",
          # colours
          point.col = "red3",
          surface.col = "turquoise4",
          neg.res.col = "darkorange1",
          pos.res.col = "darkorange1",
          axis.col = rep("black", 3))
Now, let's run Model 2 and see how much error is left unaccounted for. Save Model 2 to model_2.
model_2 <- # your code here
model_2_anova <- # your code here
model_2_anova
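One possible solution for the two placeholders above (a sketch; Model 2 here uses the unstandardized variables, so the SSE is on the original weight scale):

model_2 <- lm(weight ~ age + height, data = data)
model_2_anova <- anova(model_2)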
Look at the residuals row again to get the SSE. The amount of unaccounted for error remaining when we use Model 2 is 2,120.1. An improvement over Model 1!
How much was SSE reduced by using Model 2 instead of Model 1? Save the output to a variable called ssr
(i.e., sum of squares reduced).
model_1_sse <- model_1_anova$`Sum Sq`[1]
model_2_sse <- model_2_anova$`Sum Sq`[3]
ssr <- model_1_sse - model_2_sse
ssr
We can give both models as arguments to the anova() function to make the model comparison in a single step and test whether the reduction in SSE is significant. Save the results of the model comparison to an object called model_comparison.
model_comparison <- # your code here
model_comparison
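A sketch of one possible solution:

model_comparison <- anova(model_1, model_2)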
Let’s also calculate how much we reduced the amount of unaccounted for error using Model 2 compared to Model 1 as a proportion of the total amount of unaccounted for error we started with in Model 1. Save the output to a variable called per
(i.e., proportion_error_reduced).
per <- (model_1_sse - model_2_sse) / model_1_sse
per
Question What does this proportional reduction in error mean?
Question What is the more familiar term for this value?
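One way to check your answer to the second question (a sketch; it compares per to the multiple R-squared reported by summary()):

summary(model_2)$r.squared  # should match per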
Look again at the summary output from our multiple regression model (i.e., model_2) and see how our results map on.
summary(model_2)
The statistics given at the bottom of the summary output correspond to a test of the significance of the overall model (i.e., the omnibus test).
We can also construct a model comparison to represent the null hypothesis we want to test to see if a specific predictor in the model is significant. Let's do an example testing the significance of height in our multiple regression model.
\[Model 1: W_i = \beta_0 + \beta_1A_i + \epsilon_i\]
\[Model 2: W_i = \beta_0 + \beta_1A_i + \beta_2H_i + \epsilon_i\]
Where the null hypothesis states:
\[H0: \beta_2 = 0\]
And the alternative hypothesis states:
\[H1: \beta_2 \neq 0\]
Construct Model 1 and Model 2. Store them in objects called model_1 and model_2, respectively.
model_1 <- # your code here
model_2 <- # your code here
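One possible way to construct the two models (a sketch; these use the unstandardized variables, as in the omnibus example above):

model_1 <- lm(weight ~ age, data = data)
model_2 <- lm(weight ~ age + height, data = data)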
Compare the models using the anova() function. Save the results of the model comparison to an object called model_comparison.
model_comparison <- # your code here
model_comparison
Compute \(R^2\) for the comparison (i.e., the proportional reduction in error, which here is the squared partial correlation for height).
r_squared <- model_comparison$`Sum of Sq`[2] / model_comparison$RSS[1]
r_squared
Look at summary output from Model 2 again.
summary(model_2)
There’s equivalence between what we’ve done using the model comparison approach and the t-test approach that’s used to test the significance of the specific predictors in the model in the summary output. Notice what we get if we take the square root of the F-statistic from the model comparison output.
sqrt(model_comparison$F[2])
And also notice the correspondence in the p-values.
model_comparison$`Pr(>F)`[2]
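If you want to see the correspondence explicitly, you can pull the t statistic and p-value for height straight out of the coefficient table of the summary output (a sketch):

summary(model_2)$coefficients["height", c("t value", "Pr(>|t|)")]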
Finally, we may also want confidence intervals around our regression coefficients.
confint(model_2)
Use the data from Kay (2021) for these minihacks.
df <- import(file = paste("https://raw.githubusercontent.com",
                          "uopsych",
                          "psy612",
                          "master",
                          "labs",
                          "lab-4",
                          "data",
                          "data_kay2021.csv",
                          sep = "/"),
             setclass = "tibble")
df
Say a researcher is interested in whether there is a relationship between the tendency to believe in conspiracy theories (Y = conspiracy) and paranoia (X1 = paranoia) when controlling for a person's general tendency to have delusions (X2 = delusions). Calculate the semi-partial and partial correlations for this scenario. Explain the difference in what the semi-partial versus the partial correlation means.
# your code here
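One possible approach (a sketch using the {ppcor} functions from earlier; it assumes the columns are named conspiracy, paranoia, and delusions, as in the prompt):

# semi-partial correlation: delusions removed from paranoia only
spcor.test(x = df$conspiracy, y = df$paranoia, z = df$delusions)
# partial correlation: delusions removed from both conspiracy and paranoia
pcor.test(x = df$conspiracy, y = df$paranoia, z = df$delusions)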
Run a multiple regression predicting the tendency to believe in conspiracy theories (Y = conspiracy) from paranoia (X1 = paranoia) and delusional ideation (X2 = delusions). Interpret the meaning of the regression coefficients in your output.
# your code here
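A sketch of one possible model (the object name mh2_model is arbitrary):

mh2_model <- lm(conspiracy ~ paranoia + delusions, data = df)
summary(mh2_model)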
Treat the model you ran in minihack 2 as Model 1. Make a new model, Model 2, that includes all the predictors in Model 1 plus the desire for uniqueness (uniqueness) and the desire for control (control).
Perform a model comparison to test whether the variation in the tendency to believe in conspiracy theories is accounted for by a desire for uniqueness and control above and beyond the variation already accounted for by paranoia and delusional ideation.
Report the change in SSE between Model 1 and Model 2, the F-statistic with its degrees of freedom, and the p-value. Also, calculate \(R^2\).
# your code here
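One possible way to set up the comparison (a sketch; the object names are arbitrary):

mh3_model_1 <- lm(conspiracy ~ paranoia + delusions, data = df)
mh3_model_2 <- lm(conspiracy ~ paranoia + delusions + uniqueness + control, data = df)
anova(mh3_model_1, mh3_model_2)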