You will need the Rmd file for today’s lab; you can download it here.
Install any packages you have not used before.
library(rio)
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.2 v purrr 0.3.4
## v tibble 3.0.4 v dplyr 1.0.2
## v tidyr 1.1.2 v stringr 1.4.0
## v readr 1.4.0 v forcats 0.5.0
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(broom)
library(sjPlot)
library(olsrr)
## Registered S3 methods overwritten by 'car':
## method from
## influence.merMod lme4
## cooks.distance.influence.merMod lme4
## dfbeta.influence.merMod lme4
## dfbetas.influence.merMod lme4
##
## Attaching package: 'olsrr'
## The following object is masked from 'package:datasets':
##
## rivers
library(gridExtra)
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
##
## combine
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
## The following object is masked from 'package:purrr':
##
## some
We’ll be working with several versions of a dataset that includes participants’ scores on happiness and extraversion. Each version has been modified to illustrate a violation of one of the assumptions underlying linear regression.
df <- import("https://raw.githubusercontent.com/uopsych/psy612/master/labs/lab-2/data/PSY612_Lab2_Data.csv") %>% janitor::clean_names() %>% select(extraversion, happiness) # to get things into snake_case
df2 <- import("https://raw.githubusercontent.com/uopsych/psy612/master/labs/lab-6/data/df_nl.csv")
df3 <- import("https://raw.githubusercontent.com/uopsych/psy612/master/labs/lab-6/data/df_nne.csv")
df4 <- import("https://raw.githubusercontent.com/uopsych/psy612/master/labs/lab-6/data/df_het.csv")
df5 <- import("https://raw.githubusercontent.com/uopsych/psy612/master/labs/lab-6/data/lab6_outliers.txt") %>%
janitor::clean_names()
df6 <- import("https://raw.githubusercontent.com/uopsych/psy612/master/labs/lab-6/data/df_mc.csv")
The same variables (mostly) are contained in each dataset. Let’s inspect just the df dataset.
head(df)
## extraversion happiness
## 1 66 68
## 2 69 78
## 3 56 70
## 4 91 88
## 5 97 90
## 6 47 63
str(df)
## 'data.frame': 147 obs. of 2 variables:
## $ extraversion: int 66 69 56 91 97 47 28 63 72 72 ...
## $ happiness : int 68 78 70 88 90 63 60 65 75 83 ...
We’re going to focus on the relationship between happiness (Y) and extraversion (X). Let’s visualize this using a scatterplot for each dataset. Visualization is always a good place to start when inspecting the relationships between variables.
# Scatterplot using df
scatter1 <- ggplot(data = df, aes(x = extraversion, y = happiness)) +
geom_point() +
geom_smooth(method = "lm") +
labs(x = "Extraversion", y = "Happiness", title = "df") +
theme(plot.title = element_text(hjust = 0.5)) # Center the title
# Scatterplot using df2
scatter2 <- ggplot(data = df2, aes(x = extraversion, y = happiness)) +
geom_point() +
geom_smooth(method = "lm") +
labs(x = "Extraversion", y = "Happiness", title = "df2") +
theme(plot.title = element_text(hjust = 0.5))
# Scatterplot using df3
scatter3 <- ggplot(data = df3, aes(x = extraversion, y = happiness)) +
geom_point() +
geom_smooth(method = "lm") +
labs(x = "Extraversion", y = "Happiness", title = "df3") +
theme(plot.title = element_text(hjust = 0.5))
# Scatterplot using df4
scatter4 <- ggplot(data = df4, aes(x = extraversion, y = happiness)) +
geom_point() +
geom_smooth(method = "lm") +
labs(x = "Extraversion", y = "Happiness", title = "df4") +
theme(plot.title = element_text(hjust = 0.5))
# Scatterplot using df5
scatter5 <- ggplot(data = df5, aes(x = extraversion, y = happiness)) +
geom_point() +
geom_smooth(method = "lm") +
labs(x = "Extraversion", y = "Happiness", title = "df5") +
theme(plot.title = element_text(hjust = 0.5))
# Scatterplot using df6
scatter6 <- ggplot(data = df6, aes(x = extraversion, y = happiness)) +
geom_point() +
geom_smooth(method = "lm") +
labs(x = "Extraversion", y = "Happiness", title = "df6") +
theme(plot.title = element_text(hjust = 0.5))
Let’s look at them side-by-side.
grid.arrange(scatter1, scatter2, scatter3, scatter4, scatter5, scatter6)
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
Question: Do you notice any potential issues in any of these scatterplots?
Several potential issues in the data can violate one or more of the assumptions underlying linear regression. We’ll walk through each of them below, along with ways of investigating it. First, let’s fit the regression model predicting happiness from extraversion in each of the first five datasets:
model1 <- lm(happiness ~ extraversion, data = df)
model2 <- lm(happiness ~ extraversion, data = df2)
model3 <- lm(happiness ~ extraversion, data = df3)
model4 <- lm(happiness ~ extraversion, data = df4)
model5 <- lm(happiness ~ extraversion, data = df5)
To examine whether there is a non-linear trend in the relationship between X and Y that is not captured by our linear model, you can first examine a scatterplot (as we did above).
Additionally, you can examine a plot of the model’s fitted values by the residuals. We can obtain this plot using the function plot(), which is included in base R. Give the model as an input.
The plot() function provides a set of four different diagnostic plots. Let’s look at them all first for model1.
par(mfrow = c(2, 2)) # display plots in a 2x2 grid
plot(model1)
Let’s focus just on examining the plot of the fitted values by the residuals, which is the first plot.
Checking for non-linearity in model 1
par(mfrow = c(2, 1))
plot(model1, 1)
Now, check for non-linearity in model2.
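For reference, a minimal sketch of that check (the same base R call as above, pointed at model2):
# Fitted values vs. residuals for model2 (check for a non-linear pattern)
plot(model2, 1)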
Question: Do you see a pattern in either of these plots that suggests the presence of a non-linear trend between extraversion and happiness that is not captured by the linear model?
You can check whether the model’s residuals are normally distributed by plotting either a) a distribution of the residuals or b) a Q-Q plot.
Remember that we can pass our model to the augment() function to retrieve the model residuals. Let’s run augment() on each model here so we can use the output later.
aug_model1 <- augment(model1)
str(aug_model1)
## tibble [147 x 8] (S3: tbl_df/tbl/data.frame)
## $ happiness : int [1:147] 68 78 70 88 90 63 60 65 75 83 ...
## $ extraversion: int [1:147] 66 69 56 91 97 47 28 63 72 72 ...
## $ .fitted : num [1:147] 72.3 73.5 68.4 82.2 84.6 ...
## $ .resid : num [1:147] -4.33 4.49 1.62 5.81 5.44 ...
## $ .std.resid : num [1:147] -0.442 0.459 0.165 0.601 0.566 ...
## $ .hat : num [1:147] 0.00788 0.00907 0.00708 0.03127 0.04144 ...
## $ .sigma : num [1:147] 9.85 9.85 9.86 9.84 9.85 ...
## $ .cooksd : num [1:147] 7.77e-04 9.64e-04 9.73e-05 5.83e-03 6.92e-03 ...
## - attr(*, "terms")=Classes 'terms', 'formula' language happiness ~ extraversion
## .. ..- attr(*, "variables")= language list(happiness, extraversion)
## .. ..- attr(*, "factors")= int [1:2, 1] 0 1
## .. .. ..- attr(*, "dimnames")=List of 2
## .. .. .. ..$ : chr [1:2] "happiness" "extraversion"
## .. .. .. ..$ : chr "extraversion"
## .. ..- attr(*, "term.labels")= chr "extraversion"
## .. ..- attr(*, "order")= int 1
## .. ..- attr(*, "intercept")= int 1
## .. ..- attr(*, "response")= int 1
## .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv>
## .. ..- attr(*, "predvars")= language list(happiness, extraversion)
## .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
## .. .. ..- attr(*, "names")= chr [1:2] "happiness" "extraversion"
aug_model2 <- augment(model2)
aug_model3 <- augment(model3)
aug_model4 <- augment(model4)
aug_model5 <- augment(model5)
Let’s check whether the residuals are normally distributed for model 1. First, we’ll use a density plot with a normal distribution overlaid on top to easily compare the two.
Density Plot
ggplot(data = aug_model1, aes(x = .resid)) + # Plot the residuals
geom_density(fill = "purple") + # Density plot
stat_function(linetype = 2, fun = dnorm, # Add a normal curve (fun = dnorm) as a dashed line (linetype = 2)
args = list(mean = mean(aug_model1$.resid), # define mean and sd to line up the normal curve with the plot of the residuals
sd = sd(aug_model1$.resid))) +
theme_minimal()
Next, let’s use a Q-Q plot.
QQ-Plot
graphics.off() # Returns the plotting dimensions back to the default
ggplot(model1) +
stat_qq(aes(sample = .stdresid)) + # Plot a qq-plot
geom_abline() + # Add a diagonal line
theme_minimal()
A Q-Q plot displays the quantiles of a reference distribution (in this case, the normal distribution) on the x-axis against the quantiles of the residuals on the y-axis. If the residuals come from a normal distribution, the proportion of residuals falling between any two quantiles should approximate the proportion of a normal distribution falling between those same quantiles.
When the points on a QQ-plot are close to lying on the diagonal line (with a slope of 1) the residuals are approximately normally distributed.
Now, do the same thing, but this time let’s examine whether we have violated the assumption of normally distributed residuals for model 3.
Obtain a density plot of the residuals from model 3:
Obtain a Q-Q plot for model 3:
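If you get stuck, here is one possible sketch that mirrors the code used for model 1 (it assumes the aug_model3 object created above):
# Density plot of model 3 residuals with a normal curve overlaid
ggplot(data = aug_model3, aes(x = .resid)) +
  geom_density(fill = "purple") +
  stat_function(linetype = 2, fun = dnorm,
                args = list(mean = mean(aug_model3$.resid),
                            sd = sd(aug_model3$.resid))) +
  theme_minimal()
# Q-Q plot of the standardized residuals from model 3
ggplot(model3) +
  stat_qq(aes(sample = .stdresid)) +
  geom_abline() +
  theme_minimal()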
Question: Do the errors appear to be approximately normally distributed?
The plot of the fitted values by the residuals for a model can also be used to check whether the assumption of homogeneity of variance has been met, aka whether error variance appears to be approximately the same across all values fitted by the model.
Let’s check for heteroscedasticity by examining the plot of the fitted values vs. residuals for model1.
plot(model1, 1)
Now, check for heteroscedasticity by examining the plot of the fitted values vs. residuals for model4.
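As a sketch, the same base R call works here as well:
# Fitted values vs. residuals for model4 (check for heteroscedasticity)
plot(model4, 1)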
Question: Does the amount of variability in the residuals seem to differ across the model’s fitted values for either of these models (aka, is there evidence of heteroscedasticity)?
One way to check for non-independence among errors is to examine whether the errors have a non-random (aka, systematic) relationship with a variable they shouldn’t, such as ID number (which indicates the order in which participants completed the study), season, month, etc.
Let’s check whether we have violated the assumption of independence of errors by plotting residuals against ID numbers for model 1.
aug_model1$ID <- c(1:nrow(aug_model1)) # Add a column of ID numbers to the object containing the model residuals
ggplot(data = aug_model1,
aes(x=ID, y = .resid)) +
geom_point() +
geom_smooth(se = F) +
geom_hline(yintercept = 0)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
Repeat the same thing, but this time testing the independence of errors assumption for model 2.
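One possible sketch, reusing the approach above with the aug_model2 object created earlier:
# Add ID numbers and plot the residuals against them for model 2
aug_model2$ID <- c(1:nrow(aug_model2))
ggplot(data = aug_model2,
       aes(x = ID, y = .resid)) +
  geom_point() +
  geom_smooth(se = F) +
  geom_hline(yintercept = 0)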
Question: Does there appear to be a relationship between ID number and the residuals for either model above?
The presence of outliers means a few values could be having a disproportionate influence on a single variable (univariate outliers) or on the fit of a regression model (multivariate outliers). Therefore, we want to identify outliers so we can consider whether they should be removed or changed.
One way of visually identifying potential outliers for a single variable is by using a boxplot.
Let’s obtain a boxplot of the happiness outcome variable from the dataset called df5.
ggplot(df5) +
aes(y = happiness) +
geom_boxplot() +
theme_minimal()
The outer edges of the boxplot correspond to the first and third quartiles (the values partitioning where the middle 50% of the data is located).
The upper whisker extends from the top edge of the boxplot to the largest value no further than 1.5 * IQR from the edge (recall that the IQR is the distance from the first to the third quartile). The lower whisker extends from the lower edge of the boxplot to the smallest value no further than 1.5 * IQR from the edge. Data points beyond the whiskers are considered outliers.
To determine the value of the potential outliers, you can use boxplot.stats(). The value(s) underneath $out are scores on the variable that fall beyond the boxplot’s whiskers.
boxplot.stats(df5$happiness)
## $stats
## [1] 35 60 70 78 98
##
## $n
## [1] 165
##
## $conf
## [1] 67.78595 72.21405
##
## $out
## [1] 32
Let’s find out who in the original dataset this value corresponds to.
df5$ID <- c(1:nrow(df5)) # Add a column of ID numbers
df5[df5$happiness == 32,] # Use indexing [row, column] to grab the row for the person whose score on happiness is equal to the value of the outlier
## extraversion happiness ID
## 106 18 32 106
Not all outliers determined in this way necessarily need to be removed from or changed in the dataset. Use your knowledge of your field, the methods used, and how extreme the outlier is to determine if it should be addressed. You can also examine the mean of your variable with and without the outlier present to gauge how strong of an influence it is having.
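For example, a quick sketch comparing the mean of happiness in df5 with and without the outlying score of 32 identified above:
# Mean of happiness with all observations vs. excluding the outlier
mean(df5$happiness)
mean(df5$happiness[df5$happiness != 32])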
You can also have multivariate outliers, which are outliers because they are far away from the values predicted by your linear regression model or they have an undue influence on the fit of the model.
One straightforward way to visually inspect the data for multivariate outliers is to use bi-variate scatterplots for each combination of outcome and predictor.
Let’s look again at the scatterplot of the relationship between extraversion and happiness from df5.
scatter5 <- ggplot(data = df5, aes(x = extraversion, y = happiness)) +
geom_point() +
geom_smooth(method = "lm") +
labs(x = "Extraversion", y = "Happiness", title = "df5") +
theme(plot.title = element_text(hjust = 0.5))
scatter5
## `geom_smooth()` using formula 'y ~ x'
Question: Do there appear to be any multivariate outliers in the scatterplot?
We can get a quantitative measure of the distance of points from the regression model using either 1) Standardized residuals or 2) Studentized residuals.
Standardized residuals are the raw residuals put into standardized units by dividing by the residual standard error.
Plot
ols_plot_resid_stand(model5)
Values
aug_model5$ID <- c(1:nrow(aug_model5))
aug_model5 %>%
arrange(desc(abs(.std.resid)))%>% # Arrange the dataset by highest to lowest standardized residuals
head(n = 20)
## # A tibble: 20 x 9
## happiness extraversion .fitted .resid .std.resid .hat .sigma .cooksd ID
## <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
## 1 95 22 54.4 40.6 3.67 0.0235 10.8 0.162 161
## 2 91 17 52.4 38.6 3.50 0.0292 10.8 0.184 165
## 3 87 20 53.6 33.4 3.02 0.0257 10.9 0.120 149
## 4 85 16 52.0 33.0 2.99 0.0304 10.9 0.141 134
## 5 42 53 66.9 -24.9 -2.23 0.00615 11.1 0.0154 43
## 6 50 66 72.1 -22.1 -1.98 0.00785 11.1 0.0156 159
## 7 78 28 56.8 21.2 1.91 0.0178 11.1 0.0330 150
## 8 32 18 52.8 -20.8 -1.88 0.0280 11.1 0.0511 106
## 9 90 59 69.3 20.7 1.85 0.00627 11.1 0.0109 17
## 10 53 69 73.3 -20.3 -1.83 0.00900 11.1 0.0151 117
## 11 98 81 78.2 19.8 1.79 0.0164 11.1 0.0266 140
## 12 93 69 73.3 19.7 1.76 0.00900 11.1 0.0141 11
## 13 39 32 58.4 -19.4 -1.75 0.0146 11.1 0.0227 44
## 14 80 38 60.9 19.1 1.72 0.0108 11.1 0.0161 74
## 15 43 41 62.1 -19.1 -1.71 0.00929 11.1 0.0137 71
## 16 90 63 70.9 19.1 1.71 0.00699 11.1 0.0103 87
## 17 58 75 75.8 -17.8 -1.60 0.0121 11.1 0.0157 22
## 18 48 50 65.7 -17.7 -1.58 0.00651 11.1 0.00823 155
## 19 39 27 56.4 -17.4 -1.57 0.0187 11.1 0.0235 113
## 20 35 15 51.6 -16.6 -1.51 0.0316 11.2 0.0370 160
One standard I have seen is that standardized residuals greater than +/- 2 are worth inspecting further.
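For example, a sketch flagging the observations that exceed that cutoff:
# Observations with standardized residuals beyond +/- 2
aug_model5 %>%
  filter(abs(.std.resid) > 2)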
Studentized residuals are calculated by deleting each data point one at a time, refitting the regression model without it, and computing the standardized distance between that data point and the fitted value from the model fit without it.
Plot
ols_plot_resid_stud(model5)
Values
studentized_resids <- rstudent(model5)
sort(desc(abs(studentized_resids)))
## 161 165 149 134 43 159
## -3.819637376 -3.628564078 -3.101559004 -3.071176868 -2.258516493 -2.002908811
## 150 106 17 117 140 11
## -1.924585581 -1.899319160 -1.868277131 -1.838297489 -1.798356902 -1.776114639
## 44 74 71 87 22 155
## -1.760219608 -1.730393699 -1.720887176 -1.720659584 -1.603405223 -1.592438965
## 113 160 63 131 124 39
## -1.578172050 -1.511761361 -1.466729648 -1.431827257 -1.428185997 -1.427937918
## 154 139 143 57 18 99
## -1.420385410 -1.399720929 -1.356638945 -1.324457667 -1.300614610 -1.300614610
## 110 37 94 80 50 103
## -1.300614610 -1.264485769 -1.264485769 -1.258452846 -1.230475980 -1.175887568
## 84 104 164 14 35 56
## -1.147443531 -1.137890607 -1.083146892 -1.082942241 -1.082942241 -1.066642279
## 69 162 89 122 92 68
## -1.065995704 -1.052278638 -1.047678112 -1.046895902 -1.031431416 -1.013588724
## 62 12 152 21 78 40
## -0.995262781 -0.978877243 -0.940493208 -0.904623143 -0.884808315 -0.872282646
## 45 119 58 153 53 114
## -0.872282646 -0.872282646 -0.846361143 -0.831188963 -0.817908797 -0.812266845
## 65 34 10 60 42 76
## -0.786833448 -0.763205589 -0.758238330 -0.740580529 -0.725408220 -0.725408220
## 105 112 46 101 163 115
## -0.725408220 -0.725408220 -0.724859243 -0.711570632 -0.705296264 -0.686842591
## 126 20 125 109 130 86
## -0.654491052 -0.650438765 -0.640918555 -0.638205170 -0.632446679 -0.625232099
## 156 97 111 31 142 145
## -0.602653888 -0.579445414 -0.574147588 -0.542050868 -0.540142567 -0.531974790
## 8 36 64 88 4 100
## -0.529595635 -0.529595635 -0.525378227 -0.524437649 -0.524009839 -0.516000427
## 135 19 51 151 83 95
## -0.516000427 -0.508785214 -0.508785214 -0.508785214 -0.508326032 -0.508326032
## 33 72 30 54 5 132
## -0.505832792 -0.504462925 -0.494072406 -0.489145899 -0.488158235 -0.477821753
## 137 133 120 13 2 146
## -0.470394341 -0.469615522 -0.437864206 -0.433716781 -0.417361066 -0.417361066
## 96 1 128 49 48 52
## -0.385754840 -0.369328478 -0.364726467 -0.361427504 -0.347900851 -0.337187531
## 144 158 67 116 47 107
## -0.337187531 -0.315039301 -0.309353202 -0.309353202 -0.298643122 -0.292303940
## 7 118 81 85 127 24
## -0.285504702 -0.284783406 -0.277101512 -0.277101512 -0.277101512 -0.256752149
## 98 38 55 82 26 93
## -0.255371344 -0.246809573 -0.246809573 -0.239914054 -0.206905343 -0.190373519
## 66 91 3 77 32 102
## -0.185853704 -0.185853704 -0.169629946 -0.169629946 -0.169118573 -0.169118573
## 129 59 79 61 70 6
## -0.164297772 -0.154887759 -0.154887759 -0.148760343 -0.148760343 -0.131958608
## 41 141 27 138 16 25
## -0.131958608 -0.123987655 -0.111199948 -0.093085898 -0.077880522 -0.072353091
## 147 123 29 108 9 28
## -0.067554984 -0.061683111 -0.061172369 -0.061172369 -0.040681337 -0.040681337
## 136 23 148 73 75 90
## -0.040681337 -0.034885310 -0.030174354 -0.023965022 -0.015314419 -0.015314419
## 121 157 15
## -0.009045418 -0.009045418 -0.002869881
One standard I have seen is that studentized residuals greater than +/- 3 are worth inspecting further.
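A sketch applying that cutoff to the studentized residuals computed above:
# Observations with studentized residuals beyond +/- 3
which(abs(studentized_resids) > 3)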
Another way of examining whether outliers are present and should be dealt with is to measure their influence on the regression model. Three measures of a data point’s influence on the regression model are 1) Cook’s Distance, 2) DFFITS, and 3) DFBETAS.
Cook’s Distance summarizes how much the regression model would change if you removed a particular value (it measures the difference between the regression coefficients obtained from the full dataset and the regression coefficients that would be obtained if that value were deleted).
Plot
ols_plot_cooksd_chart(model5)
Values
aug_model5 %>%
arrange(desc(abs(.cooksd)))%>% # Arrange the dataset by highest to lowest Cook's D
head(n = 20)
## # A tibble: 20 x 9
## happiness extraversion .fitted .resid .std.resid .hat .sigma .cooksd ID
## <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
## 1 91 17 52.4 38.6 3.50 0.0292 10.8 0.184 165
## 2 95 22 54.4 40.6 3.67 0.0235 10.8 0.162 161
## 3 85 16 52.0 33.0 2.99 0.0304 10.9 0.141 134
## 4 87 20 53.6 33.4 3.02 0.0257 10.9 0.120 149
## 5 32 18 52.8 -20.8 -1.88 0.0280 11.1 0.0511 106
## 6 35 15 51.6 -16.6 -1.51 0.0316 11.2 0.0370 160
## 7 78 28 56.8 21.2 1.91 0.0178 11.1 0.0330 150
## 8 37 17 52.4 -15.4 -1.40 0.0292 11.2 0.0293 139
## 9 98 81 78.2 19.8 1.79 0.0164 11.1 0.0266 140
## 10 39 27 56.4 -17.4 -1.57 0.0187 11.1 0.0235 113
## 11 39 32 58.4 -19.4 -1.75 0.0146 11.1 0.0227 44
## 12 40 15 51.6 -11.6 -1.05 0.0316 11.2 0.0181 162
## 13 80 38 60.9 19.1 1.72 0.0108 11.1 0.0161 74
## 14 58 75 75.8 -17.8 -1.60 0.0121 11.1 0.0157 22
## 15 50 66 72.1 -22.1 -1.98 0.00785 11.1 0.0156 159
## 16 42 53 66.9 -24.9 -2.23 0.00615 11.1 0.0154 43
## 17 53 69 73.3 -20.3 -1.83 0.00900 11.1 0.0151 117
## 18 93 84 79.4 13.6 1.23 0.0190 11.2 0.0146 50
## 19 93 69 73.3 19.7 1.76 0.00900 11.1 0.0141 11
## 20 43 41 62.1 -19.1 -1.71 0.00929 11.1 0.0137 71
One standard I have seen is that Cook’s D values greater than 3 times the average Cook’s D are worth investigating.
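A sketch applying that rule of thumb to the Cook’s D values from augment():
# Observations with Cook's D greater than 3 times the average Cook's D
aug_model5 %>%
  filter(.cooksd > 3 * mean(.cooksd))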
DFBETAS
DFBETAS measure the difference in each parameter estimate with and without a particular point included in the data.
Plot
ols_plot_dfbetas(model5)
Value
dfbetas <- dfbeta(model5)
sort(desc(abs(dfbetas)))
## [1] -1.567056e+00 -1.457327e+00 -1.371171e+00 -1.261344e+00 -8.244636e-01
## [6] -7.048208e-01 -6.432736e-01 -6.248112e-01 -5.451115e-01 -5.198217e-01
## [11] -4.923671e-01 -4.095238e-01 -4.091340e-01 -3.736228e-01 -3.564059e-01
## [16] -3.320981e-01 -3.281228e-01 -3.136707e-01 -3.108772e-01 -3.090735e-01
## [21] -2.870816e-01 -2.788880e-01 -2.622220e-01 -2.567645e-01 -2.363793e-01
## [26] -2.255097e-01 -2.255097e-01 -2.185162e-01 -2.025141e-01 -2.007930e-01
## [31] -1.986876e-01 -1.957449e-01 -1.940032e-01 -1.940032e-01 -1.940032e-01
## [36] -1.935363e-01 -1.903847e-01 -1.845945e-01 -1.774783e-01 -1.763052e-01
## [41] -1.694113e-01 -1.682222e-01 -1.626504e-01 -1.561528e-01 -1.561528e-01
## [46] -1.561528e-01 -1.536908e-01 -1.527392e-01 -1.487820e-01 -1.484796e-01
## [51] -1.465284e-01 -1.427316e-01 -1.371014e-01 -1.365570e-01 -1.326626e-01
## [56] -1.320124e-01 -1.227999e-01 -1.213243e-01 -1.197476e-01 -1.157756e-01
## [61] -1.142295e-01 -1.110806e-01 -1.082132e-01 -1.080169e-01 -1.066312e-01
## [66] -9.650634e-02 -9.648761e-02 -9.378822e-02 -9.378822e-02 -9.178472e-02
## [71] -9.036037e-02 -8.679186e-02 -8.369211e-02 -8.318888e-02 -7.875395e-02
## [76] -7.402521e-02 -7.215159e-02 -7.193058e-02 -6.867100e-02 -6.595416e-02
## [81] -6.595416e-02 -6.595416e-02 -6.595416e-02 -6.554705e-02 -6.439321e-02
## [86] -6.439321e-02 -6.119816e-02 -6.119816e-02 -6.092314e-02 -5.965686e-02
## [91] -5.871003e-02 -5.564519e-02 -5.246369e-02 -5.243608e-02 -5.158903e-02
## [96] -4.722237e-02 -4.702550e-02 -4.645731e-02 -4.462771e-02 -3.260897e-02
## [101] -3.167067e-02 -3.142636e-02 -3.093977e-02 -3.081960e-02 -3.081960e-02
## [106] -3.027250e-02 -2.967387e-02 -2.948057e-02 -2.948057e-02 -2.890152e-02
## [111] -2.779041e-02 -2.779041e-02 -2.745444e-02 -2.745444e-02 -2.698589e-02
## [116] -2.654446e-02 -2.492622e-02 -2.396734e-02 -2.317284e-02 -2.178640e-02
## [121] -2.105370e-02 -2.056027e-02 -1.978474e-02 -1.978474e-02 -1.965020e-02
## [126] -1.908387e-02 -1.904292e-02 -1.708402e-02 -1.708402e-02 -1.708402e-02
## [131] -1.640725e-02 -1.640725e-02 -1.640725e-02 -1.539987e-02 -1.539987e-02
## [136] -1.438501e-02 -1.342129e-02 -1.255802e-02 -1.203048e-02 -1.181446e-02
## [141] -1.086274e-02 -1.045967e-02 -1.045967e-02 -9.858514e-03 -9.790199e-03
## [146] -9.790199e-03 -9.556175e-03 -9.332325e-03 -9.264974e-03 -8.718321e-03
## [151] -8.718321e-03 -8.437191e-03 -8.138992e-03 -7.907866e-03 -7.588388e-03
## [156] -7.370422e-03 -7.370422e-03 -7.234914e-03 -6.920255e-03 -6.762375e-03
## [161] -6.260426e-03 -6.122199e-03 -5.585495e-03 -5.559828e-03 -5.274375e-03
## [166] -5.049406e-03 -4.933507e-03 -4.855396e-03 -4.707727e-03 -4.707647e-03
## [171] -4.638047e-03 -4.569334e-03 -4.412349e-03 -4.334384e-03 -4.334163e-03
## [176] -4.300185e-03 -4.265996e-03 -4.146023e-03 -4.034497e-03 -3.994025e-03
## [181] -3.985003e-03 -3.947269e-03 -3.877952e-03 -3.877952e-03 -3.877952e-03
## [186] -3.743306e-03 -3.705246e-03 -3.666236e-03 -3.666236e-03 -3.650748e-03
## [191] -3.388141e-03 -3.341668e-03 -3.149493e-03 -3.001766e-03 -2.977125e-03
## [196] -2.931791e-03 -2.831293e-03 -2.821974e-03 -2.821974e-03 -2.742507e-03
## [201] -2.742507e-03 -2.684721e-03 -2.645482e-03 -2.624258e-03 -2.583539e-03
## [206] -2.530327e-03 -2.526043e-03 -2.517873e-03 -2.318212e-03 -2.239954e-03
## [211] -2.183546e-03 -2.110426e-03 -2.108478e-03 -2.060274e-03 -1.990716e-03
## [216] -1.985961e-03 -1.969046e-03 -1.934420e-03 -1.909295e-03 -1.909295e-03
## [221] -1.909295e-03 -1.805715e-03 -1.801900e-03 -1.801900e-03 -1.747613e-03
## [226] -1.747613e-03 -1.747613e-03 -1.710694e-03 -1.710694e-03 -1.642869e-03
## [231] -1.605451e-03 -1.544244e-03 -1.491184e-03 -1.484718e-03 -1.389697e-03
## [236] -1.350893e-03 -1.326901e-03 -1.301347e-03 -1.287587e-03 -1.287587e-03
## [241] -1.277796e-03 -1.277796e-03 -1.208891e-03 -1.205927e-03 -1.202503e-03
## [246] -1.157970e-03 -1.124196e-03 -1.069580e-03 -1.011616e-03 -1.011616e-03
## [251] -9.931352e-04 -9.478930e-04 -9.166344e-04 -9.152274e-04 -9.152274e-04
## [256] -8.841690e-04 -8.624174e-04 -8.624174e-04 -8.541673e-04 -7.853943e-04
## [261] -7.509008e-04 -7.285237e-04 -7.240528e-04 -7.193823e-04 -7.193823e-04
## [266] -6.984511e-04 -6.572989e-04 -6.337364e-04 -6.168387e-04 -5.868480e-04
## [271] -5.578053e-04 -5.578053e-04 -4.956556e-04 -4.856593e-04 -4.826419e-04
## [276] -4.793333e-04 -4.793333e-04 -4.647418e-04 -4.617870e-04 -4.215319e-04
## [281] -4.089958e-04 -3.753083e-04 -3.638759e-04 -3.607404e-04 -3.607404e-04
## [286] -3.601338e-04 -3.307209e-04 -3.307209e-04 -3.307209e-04 -3.110215e-04
## [291] -3.110215e-04 -2.985266e-04 -2.985266e-04 -2.985266e-04 -2.985266e-04
## [296] -2.610847e-04 -2.526480e-04 -2.526480e-04 -2.360959e-04 -2.263705e-04
## [301] -1.952341e-04 -1.947128e-04 -1.947128e-04 -1.607277e-04 -1.473420e-04
## [306] -1.433501e-04 -1.266422e-04 -1.203914e-04 -1.203914e-04 -1.203914e-04
## [311] -1.075110e-04 -8.622436e-05 -7.317689e-05 -6.999826e-05 -6.999826e-05
## [316] -6.970400e-05 -6.970400e-05 -5.772868e-05 -5.772868e-05 -5.213892e-05
## [321] -4.812631e-05 -4.012685e-05 -3.300771e-05 -3.300771e-05 -3.300771e-05
## [326] -2.020893e-05 -2.020893e-05 -1.242336e-05 -1.077725e-06 -1.077725e-06
One standard I have seen is that DFBETAS larger than 2/sqrt(n) should be investigated as potential outliers.
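As a sketch, you could apply that cutoff to standardized DFBETAS; note that this sketch uses stats::dfbetas(), which returns the standardized version (the dfbeta() call above returns the raw coefficient differences):
# Standardized DFBETAS; flag values exceeding 2/sqrt(n)
dfbetas_std <- dfbetas(model5)
threshold_dfbetas <- 2 / sqrt(nrow(aug_model5))
which(abs(dfbetas_std) > threshold_dfbetas, arr.ind = TRUE) # rows = observations, columns = coefficients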
DFFITS measures the difference in a model’s fitted value for a particular point with and without that point included in the data.
Plot
ols_plot_dffits(model5)
Values
dffits <- dffits(model5)
sort(desc(abs(dffits)))
## 165 161 134 149 106
## -0.6288673131 -0.5928872219 -0.5436387359 -0.5035906823 -0.3221894932
## 160 150 139 140 113
## -0.2732388551 -0.2591164611 -0.2425859709 -0.2323931665 -0.2177338052
## 44 162 74 159 22
## -0.2143892880 -0.1901910036 -0.1806530784 -0.1781795750 -0.1777962558
## 43 117 50 11 71
## -0.1776030053 -0.1751895094 -0.1711919473 -0.1692634919 -0.1666295913
## 92 154 80 17 152
## -0.1600996238 -0.1575017993 -0.1507366838 -0.1484347895 -0.1459841211
## 87 124 69 89 155
## -0.1443254034 -0.1429806471 -0.1377535888 -0.1342209106 -0.1288744686
## 84 39 153 58 131
## -0.1272361850 -0.1270299326 -0.1231881531 -0.1225143224 -0.1200987392
## 163 57 164 63 78
## -0.1196423700 -0.1195248493 -0.1190417157 -0.1165318051 -0.1143395986
## 115 18 99 110 20
## -0.1115205380 -0.1103803468 -0.1103803468 -0.1103803468 -0.1064207823
## 37 94 143 14 35
## -0.1060624777 -0.1060624777 -0.1059982621 -0.1048588342 -0.1048588342
## 122 88 103 104 65
## -0.0997690421 -0.0987382253 -0.0924682935 -0.0920883315 -0.0910491295
## 5 4 54 56 142
## -0.0906991284 -0.0857352605 -0.0853982417 -0.0833399544 -0.0825933270
## 12 68 111 40 45
## -0.0821062192 -0.0805297172 -0.0798792054 -0.0787185988 -0.0787185988
## 119 114 72 62 10
## -0.0787185988 -0.0786499512 -0.0783031458 -0.0782644986 -0.0778510178
## 86 53 21 33 133
## -0.0748899043 -0.0727614962 -0.0706809143 -0.0703747997 -0.0696003813
## 101 125 60 31 34
## -0.0688997660 -0.0669118883 -0.0668332241 -0.0649265092 -0.0647716064
## 130 13 100 135 42
## -0.0612383735 -0.0583934321 -0.0572175658 -0.0572175658 -0.0570439403
## 76 105 112 109 46
## -0.0570439403 -0.0570439403 -0.0570439403 -0.0567749891 -0.0566354226
## 49 156 126 137 145
## -0.0552659646 -0.0543861212 -0.0529672963 -0.0521604591 -0.0515099090
## 97 64 132 8 36
## -0.0491762781 -0.0467378588 -0.0455363503 -0.0444213977 -0.0444213977
## 30 83 95 19 51
## -0.0419308557 -0.0411383096 -0.0411383096 -0.0404230319 -0.0404230319
## 151 2 146 52 144
## -0.0404230319 -0.0397744548 -0.0397744548 -0.0390179538 -0.0390179538
## 7 118 120 1 67
## -0.0384389079 -0.0368012141 -0.0342116412 -0.0328556102 -0.0317623901
## 116 96 128 47 38
## -0.0317623901 -0.0312187475 -0.0305925094 -0.0284606503 -0.0273678901
## 55 48 158 98 107
## -0.0273678901 -0.0273578860 -0.0267367036 -0.0266607648 -0.0263787852
## 24 81 85 127 129
## -0.0228407747 -0.0216507706 -0.0216507706 -0.0216507706 -0.0212314249
## 82 141 93 26 66
## -0.0194160008 -0.0172499817 -0.0169357050 -0.0167446393 -0.0155890282
## 91 61 70 59 79
## -0.0155890282 -0.0141768412 -0.0141768412 -0.0139777484 -0.0139777484
## 32 102 3 77 27
## -0.0132989805 -0.0132989805 -0.0132536955 -0.0132536955 -0.0128676004
## 25 6 41 138 147
## -0.0118379669 -0.0111990415 -0.0111990415 -0.0111497698 -0.0074909468
## 16 23 29 108 123
## -0.0069282826 -0.0064816426 -0.0049506177 -0.0049506177 -0.0049007288
## 9 28 136 148 73
## -0.0041768971 -0.0041768971 -0.0041768971 -0.0028756119 -0.0021627084
## 75 90 121 157 15
## -0.0019790100 -0.0019790100 -0.0007067456 -0.0007067456 -0.0003585266
One standard I have seen is that DFFITS larger than 2*(p+1)/sqrt(n-p-1) should be investigated as potential outliers (n = number of observations, p = number of predictors including the intercept).
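A sketch applying that cutoff as written above (the object names here are just for illustration):
# Flag observations whose |DFFITS| exceeds the cutoff described above
n <- nrow(aug_model5)
p <- length(coef(model5)) # number of predictors including the intercept
threshold_dffits <- 2 * (p + 1) / sqrt(n - p - 1)
which(abs(dffits) > threshold_dffits)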
Lastly, the plot_model() function from the sjPlot package is another handy option for easily obtaining a variety of regression diagnostic plots.
Within the plot_model() function, provide the argument type = "diag" in order to obtain the regression diagnostic plots.
plot_model(model5, type = "diag")
## [[1]]
## `geom_smooth()` using formula 'y ~ x'
##
## [[2]]
##
## [[3]]
## `geom_smooth()` using formula 'y ~ x'
Multicollinearity occurs when two or more predictors in our model are highly correlated with one another.
One way we can check this before we even run our model is to look at the zero-order correlations between variables. The df6 dataset contains two predictor variables. Let’s investigate the correlation between them.
head(df6)
## extraversion happiness social_engagement
## 1 66 68 72
## 2 69 78 85
## 3 56 70 63
## 4 91 88 104
## 5 97 90 100
## 6 47 63 56
cor(df6$extraversion, df6$social_engagement)
## [1] 0.9619373
Extraversion and social engagement appear to be VERY highly correlated (r = 0.96)!
VIF (Variance Inflation Factor) / Tolerance
Recall the two measures of multicollinearity we have, tolerance and VIF.
Tolerance is: \[1 - R_{12}^2\] where the R-squared comes from regressing one predictor on the other predictor(s).
And VIF (variance inflation factor) is \[1/Tolerance\]
Let’s run a model with two predictors using df6. We’ll predict happiness from both extraversion and social engagement, two predictors that sound like they have the potential to be correlated.
model6 <- lm(happiness ~ extraversion + social_engagement, data = df6)
When there are multiple predictors in a model, the output of the plot_model() function will also provide a plot of VIFs.
plot_model(model6, type = "diag")
## [[1]]
##
## [[2]]
## `geom_smooth()` using formula 'y ~ x'
##
## [[3]]
##
## [[4]]
## `geom_smooth()` using formula 'y ~ x'
Values
ols_vif_tol(model6)
## Variables Tolerance VIF
## 1 extraversion 0.07467659 13.39108
## 2 social_engagement 0.07467659 13.39108
Either a low tolerance (below 0.20 is one standard I have seen) or a high VIF (above 5 or 10 are standards I have seen) is an indication of a problem with multicollinearity.
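As a sanity check, here is a sketch computing tolerance and VIF by hand for one predictor, using the formulas above (regress that predictor on the other predictor, then take 1 - R-squared and its reciprocal):
# Tolerance and VIF for extraversion, computed by hand
r2_ext <- summary(lm(extraversion ~ social_engagement, data = df6))$r.squared
tolerance_ext <- 1 - r2_ext   # should match the Tolerance column above
vif_ext <- 1 / tolerance_ext  # should match the VIF column above
tolerance_ext
vif_ext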
Load the bfi dataset from the psych package (this is done for you below, in case you haven’t loaded a dataset from a package).
First, create a model regressing age on the 5 items for Conscientiousness (labeled C1, C2, C3, C4, and C5).
data(bfi, package = "psych")
Next, check if the data meet the homoscedasticity assumption.
Next, check if the errors are normally distributed using a plot.
Using the df5 data from above, run a regression model with and without the outliers we identified.
Compare the output of the two models. How similar/different are they? What values change?
Using the fitness data from the olsrr package (loaded for you below), regress runtime on weight, restpulse, runpulse, and maxpulse.
data("fitness") # load data
Check the multicollinearity of predictors. Is multicollinearity at a tolerable level?