You will need the Rmd file for today’s lab - you can download it here.
Install any packages you have not used before.
# load packages
library(rio) # for importing data
library(tidyverse) # for dplyr, ggplot2, ETC
library(gridExtra) # for plotting multiple figures side-by-side
library(broom) # for extracting residuals and outlier indices
library(olsrr) # for creating outlier pltos
library(psych) # for calculating mahalanobis distance
# suppress scientific notation
options(scipen = 999)
We’ll be working with several versions of a dataset that includes participants’ scores on happiness and extraversion. Each version has modified the data to represent one of the violations of the assumptions underlying linear regression.
df1 <- import(file = "https://raw.githubusercontent.com/uopsych/psy612/master/labs/lab-6/data/df_raw.csv",
setclass = "tibble") %>%
janitor::clean_names()
df2 <- import(file = "https://raw.githubusercontent.com/uopsych/psy612/master/labs/lab-6/data/df_nl.csv",
setclass = "tibble") %>%
janitor::clean_names()
df3 <- import(file = "https://raw.githubusercontent.com/uopsych/psy612/master/labs/lab-6/data/df_nne.csv",
setclass = "tibble") %>%
janitor::clean_names()
df4 <- import(file = "https://raw.githubusercontent.com/uopsych/psy612/master/labs/lab-6/data/df_het.csv",
setclass = "tibble") %>%
janitor::clean_names()
df5 <- import(file = "https://raw.githubusercontent.com/uopsych/psy612/master/labs/lab-6/data/df_out.csv",
setclass = "tibble") %>%
janitor::clean_names()
df6 <- import(file = "https://raw.githubusercontent.com/uopsych/psy612/master/labs/lab-6/data/df_mc.csv",
setclass = "tibble") %>%
janitor::clean_names()
The same variables (mostly) are contained in each dataset. Let’s inspect just df1
.
df1
We’re going to focus on the relationship between happiness
(Y) and extraversion
(X). Let’s visualize this using a scatterplot for each dataset. Visualization is always a good place to start when inspecting the relationships between variables.
# scatterplot using df1
scatter1 <- ggplot(data = df1, aes(x = extraversion, y = happiness)) +
geom_point() +
geom_smooth(method = "lm") +
labs(x = "Extraversion",
y = "Happiness",
title = "df1") +
theme(plot.title = element_text(hjust = 0.5)) # Center the title
# scatterplot using df2
scatter2 <- ggplot(data = df2, aes(x = extraversion, y = happiness)) +
geom_point() +
geom_smooth(method = "lm") +
labs(x = "Extraversion",
y = "Happiness",
title = "df2") +
theme(plot.title = element_text(hjust = 0.5))
# scatterplot using df3
scatter3 <- ggplot(data = df3, aes(x = extraversion, y = happiness)) +
geom_point() +
geom_smooth(method = "lm") +
labs(x = "Extraversion",
y = "Happiness",
title = "df3") +
theme(plot.title = element_text(hjust = 0.5))
# scatterplot using df4
scatter4 <- ggplot(data = df4, aes(x = extraversion, y = happiness)) +
geom_point() +
geom_smooth(method = "lm") +
labs(x = "Extraversion",
y = "Happiness",
title = "df4") +
theme(plot.title = element_text(hjust = 0.5))
# scatterplot using df5
scatter5 <- ggplot(data = df5, aes(x = extraversion, y = happiness)) +
geom_point() +
geom_smooth(method = "lm") +
labs(x = "Extraversion",
y = "Happiness",
title = "df5") +
theme(plot.title = element_text(hjust = 0.5))
# scatterplot using df6
scatter6 <- ggplot(data = df6, aes(x = extraversion, y = happiness)) +
geom_point() +
geom_smooth(method = "lm") +
labs(x = "Extraversion",
y = "Happiness",
title = "df6") +
theme(plot.title = element_text(hjust = 0.5))
Let’s look at them side-by-side by using the grid.arrange()
function from the {gridExtra}
package.
grid.arrange(scatter1,
scatter2,
scatter3,
scatter4,
scatter5,
scatter6,
nrow = 2)
Question: Do you notice any potential issues from any of these scatterplots that we used to represent the data so far?
Listed below are a number of issues that violate one or more of the assumptions underlying linear regression analysis. Included with each issue is a method for determining the presence of each issue.
Examine the distribution of the residuals
Examine a Q-Q plot
Univariate outliers: Examine a histogram or boxplot
Multivariate outliers: Examine a scatterplot, Mahalanobis Distances, Cook’s D, dfbetas, and dffits
model1 <- lm(happiness ~ extraversion, data = df1)
model2 <- lm(happiness ~ extraversion, data = df2)
model3 <- lm(happiness ~ extraversion, data = df3)
model4 <- lm(happiness ~ extraversion, data = df4)
model5 <- lm(happiness ~ extraversion, data = df5)
To examine whether there is the presence of a non-linear trend in the relationship between extraversion
and happiness
that is not captured by our linear model, you can first examine a scatterplot (as we did above).
Additionally, you can examine a plot of the model’s fitted values by the residuals. We can obtain this plot using the function plot()
, which is included in base R. The plot()
function provides a set of four different diagnostic plots. Let’s look at them all for model1
.
par(mfrow = c(2, 2)) # display plots in a 2x2 grid
plot(model1)
Now, let’s focus just on examining the plot of the fitted values by the residuals, which is the first plot.
par(mfrow = c(1, 1))
plot(model1, 1)
Let’s also examine the plot for model_2
.
# your code here
Question: Do you see a pattern in either of these plots that suggest the presence of a non-linear trend between extraversion and happiness that is not captured by the linear model?
You can check whether a model’s residuals are normally distributed by plotting either a (a) distribution of the residuals or (b) Q-Q plot.
Remember, we can pass our model to the augment()
function to retrieve a model’s residuals. Let’s run the augment()
function on each of our models so we can use the output later.
model1_aug <- augment(model1)
model2_aug <- augment(model2)
model3_aug <- augment(model3)
model4_aug <- augment(model4)
model5_aug <- augment(model5)
Let’s check whether the residuals are normally distributed for model1
. First, we’ll use a density plot with a normal distribution overlaid on top to easily compare the two:
ggplot(data = model1_aug, aes(x = .resid)) +
geom_density(fill = "purple") +
stat_function(linetype = 2,
fun = dnorm,
args = list(mean = mean(model1_aug$.resid),
sd = sd(model1_aug$.resid))) +
theme_minimal()
Next, let’s use a Q-Q plot. We can create a Q-Q plot by using the stat_qq()
function in the {ggplot2}
package.
ggplot(model1) +
geom_abline(color = "turquoise4") +
stat_qq(aes(sample = .stdresid), color = "darkorchid4", alpha = .50) +
theme_minimal()
A Q-Q plot plots the quantiles of a reference distribution (in this case, the normal distribution) on the x-axis by the quantiles of the residuals on the y-axis. If your residuals come from a normal distribution, the percentage of data lying between any two points should approximate the percentage of data lying between any two points on a normal distribution.
When the points on a Q-Q plot are close to lying on the diagonal line (with a slope of 1) the residuals are approximately normally distributed.
Now, do the same thing, but this time let’s examine whether we have violated the assumption of normally distributed residuals for model3
.
Obtain a density plot of the residuals from model3
:
# your code here
Obtain a Q-Q plot for model3
:
# your code here
Question: Do the errors appear to be approximately normally distributed?
The plot of the fitted values by the residuals for a model can also be used to check whether the assumption of homogeneity of variance has been met (i.e., whether error variance appears to be approximately the same across all values fitted by the model).
Let’s check for heteroscedasticity by examining the plot of the fitted values vs. residuals for model1
.
plot(model1, 1)
Now, check for heteroscedasticity by examining the plot of the fitted values vs. residuals for model4
.
# your code here
Question: Does the amount of variability in the residuals seem to differ across the model’s fitted values for either of these models? In other words, is there evidence of heteroscedasticity?
One way to check for non-independence among errors is to examine whether the errors have a non-random (i.e., systematic) relationship with a variable they shouldn’t, such as ID number. In this case, ID represents the order in which participants participated.
Let’s check whether we have violated the assumption of independence of errors by plotting residuals against ID numbers for model1
.
model1_aug$id <- df1$id
ggplot(data = model1_aug, aes(x = id, y = .resid)) +
geom_point() +
geom_smooth(se = F) +
geom_hline(yintercept = 0)
Repeat the same thing, but this time testing the independence of errors assumption for model4
.
# your code here
Question: Does there appear to be a relationship between ID number and the residuals for either model above?
The presence of outliers means a few values could be having a disproportionate influence on a single variable (univariate outliers) or on the fit of a regression model (multivariate outliers). Therefore, we want to identify outliers so we can consider whether they should be removed or changed.
One way of visually seeing potential outliers for a single variable is by using a boxplot.
Let’s obtain a boxplot of the happiness outcome variable from the dataset called df5
.
ggplot(df5) +
aes(y = happiness) +
geom_boxplot() +
theme_minimal()
The outer edges of the boxplot correspond to the first and third quartiles (the values partitioning where the middle 50% of the data is located).
The upper whisker extends from the top edge of the boxplot to the largest value no further than 1.5 * IQR from the edge (recall that IQR is the distance from the first to the third quartile). The lower whisker extends from the lower edge of the boxplot to the smallest value that is smallest value that is no further than 1.5 * IQR from the edge. Data points beyond these are considered outliers.
To determine the value of the potential outliers, you can use the function boxplot.stats()
. The value(s) underneath $out
are scores on the variable that landed more extreme than the boxplot’s whiskers.
boxplot.stats(df5$happiness)
Let’s find out who in the original dataset this value corresponds to.
df5[df5$happiness == 32,]
Not all outliers determined in this way necessarily need to be removed from or changed in the dataset. Use your knowledge of your field, the methods used, and how extreme the outlier is to determine if it should be addressed. You can also examine the mean of your variable with and without the outlier present to gauge how strong of an influence it is having.
You can also have multivariate outliers, which are outliers because they are far away from the values predicted by your linear regression model or they have an undue influence on the fit of the model.
One straightforward way to visually inspect the data for multivariate outliers is to use bivariate scatterplots for each combination of outcome and predictor.
Let’s look again at the scatterplot of the relationship between extraversion
and happiness
for df5
.
scatter5
Question: Do there appear to be any multivariate outliers in the scatterplot?
We can get a quantitative measure of the distance of points from the regression model using either 1) standardized residuals or 2) studentized deleted residuals.
Standardized residuals are the raw residuals put into standardized units by dividing by the residual standard error.
We can plot the standardized residuals by using the ols_plot_resid_stand()
function from the {olsrr}
package.
ols_plot_resid_stand(model5)
We can also look at the standardized residuals stored in model5_aug
:
model5_aug$id <- df5$id
model5_aug %>%
select(id, .std.resid) %>%
arrange(desc(abs(.std.resid)))
One standard is that standardized residuals greater than \(\pm2\) are worth inspecting further.
Studentized deleted residuals are calculated by deleting data points from a dataset one by one, fitting models with the new datasets, and calculating the standardized distance between the deleted data points and the predicted values for the models without the data points.
Let’s plot the studentized deleted residuals by using the ols_plot_resid_stud()
function from the {olsrr}
package.
ols_plot_resid_stud(model5)
# for some of you, the above function might not work. try this...
data.frame(id = seq_along(rstudent(model5)),
rs = rstudent(model5)) %>%
ggplot(aes(x = id, y = rs, label = id)) +
geom_hline(yintercept = c(-3, 3), color = "red") +
geom_bar(width = 0.25,
stat = "identity",
colour = "turquoise4",
fill = "turquoise4") +
geom_text(hjust = -0.2,
nudge_x = 0.05,
size = 2,
na.rm = TRUE) +
xlab("Observation") +
ylab("Deleted Studentized Residuals") +
ggtitle("Studentized Residuals Plot") +
theme_bw()
We can also view the studentized residuals by using the rstudent()
function from {stats}
package.
studentized_resids <- rstudent(model5)
sort(abs(studentized_resids), decreasing = TRUE)
One standard is that studentized residuals greater than \(\pm3\) are worth inspecting further.
Another way of examining whether outliers are present and should be dealt with is by measuring their influence on the regression model. Three measures you can use to measure influence of a data point on the regression model are 1) Cook’s Distance, 2) DFFITS, and 3) DFBETAS
Cook’s Distance summarizes how much the regression model would change if you removed a particular value. In other words, it represents the difference between the regression coefficients obtained from the full dataset and the regression coefficients that would be obtained if you deleted that particular value.
We can plot the Cook’s Distances using the ols_plot_cooksd_chart()
function from the {olsrr}
package:
ols_plot_cooksd_chart(model5)
We can also view the individual Cook’s Distance in model5_aug
:
model5_aug %>%
select(id, .cooksd) %>%
arrange(desc(abs(.cooksd)))
One standard is that Cook’s D values greater than 3 times the average Cook’s D values are worth investigating.
DFBETA (AKA difference in betas) measures the difference in each parameter estimate with and without a particular point being included in the data.
We can visualize this by using the ols_plot_dfbetas()
function from the {olsrr}
package:
ols_plot_dfbetas(model5)
We can also view the individual values by using the dfbeta
function from the {stats}
package:
dfbetas <- dfbeta(model5)
sort(abs(dfbetas[,1]), decreasing = TRUE)
sort(abs(dfbetas[,2]), decreasing = TRUE)
One standard is that DFBETAS larger than \(\frac{2}{\sqrt{n}}\) should be investigated as potential outliers.
DFFITS (AKA difference in fits) measure the difference in the model’s fitted values with and without a particular point being included in the data.
We can visualize this by using the ols_plot_dffits()
function from the {olsrr}
package:
ols_plot_dffits(model5)
We can also view the individual values by using the dffits
function from the {stats}
package:
dffits <- dffits(model5)
sort(abs(dffits), decreasing = TRUE)
One standard is that DFFITS larger than \(2 \times \frac{p+1}{\sqrt{n−p−1}}\) should be investigated as potential outliers (n = number of observations, p = number of predictors including the intercept).
Multicollinearity occurs when one or more predictors in our model are highly correlated.
One way we can check this before we even run our model is to look at the zero-order correlations between variables. The df6
dataset contains two predictor variables. Let’s investigate the correlation between them.
cor(df6)
The variables extraversion
and social_engagement
appear to be VERY highly correlated (r = 0.96)!
Recall the two measures of multicollinearity that we discussed in lecture: tolerance and VIF.
Tolerance is: \[1 - R_{12}^2\]
And VIF is \[\frac{1}{Tolerance}\]
Let’s run a model with two predictors using df6
. We’ll predict happiness
from both extraversion
and social_engagement
.
model6 <- lm(happiness ~ extraversion + social_engagement, data = df6)
We can examine the tolerance and VIF values of the model by using the ols_vif_tol()
function from the {olsrr}
package.
ols_vif_tol(model6)
Either a low tolerance (below 0.20 is one standard) or a high VIF (above 5 or 10 are standards) is an indication of a problem with multicollinearity.
Load the bfi
dataset from the {psych}
package (this is done for you below, in case you haven’t loaded a dataset from a package).
data(bfi, package = "psych")
First, create a model regressing age on the 5 items for Conscientiousness
(labeled C1
, C2
, C3
, C4
, and C5
). Asign the object to a variable called model
.
# your code here
Next, use augment()
on the model. Save the output to variable called model_aug
.
# your code here
Next, check if the data meet the homoscedasticity assumption.
# your code here
Next, check if the errors are normally distributed using a plot.
# your code here
Using df5
, run a regression model with and without the outliers we identified.
# your code here
Compare the output of the two models. How similar/different are they? What values change?
For this Minihack, use the fitness
data from the {olsrr}
package (loaded for you below).
data(fitness, package = "olsrr")
Regress runtime
on weight
, restpulse
, runpulse
, and maxpulse
.
# your code here
Check the multicollinearity of predictors. Is multicollinearity at a tolerable level?
# your code here