You will need the Rmd file for today’s lab - you can download it here.

Load Libraries

Install any packages you have not used before.

library(rio)
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.2     v purrr   0.3.4
## v tibble  3.0.4     v dplyr   1.0.2
## v tidyr   1.1.2     v stringr 1.4.0
## v readr   1.4.0     v forcats 0.5.0
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(broom)
library(sjPlot)
library(olsrr)
## Registered S3 methods overwritten by 'car':
##   method                          from
##   influence.merMod                lme4
##   cooks.distance.influence.merMod lme4
##   dfbeta.influence.merMod         lme4
##   dfbetas.influence.merMod        lme4
## 
## Attaching package: 'olsrr'
## The following object is masked from 'package:datasets':
## 
##     rivers
library(gridExtra)
## 
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
## 
##     combine
library(car)
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
## 
##     recode
## The following object is masked from 'package:purrr':
## 
##     some

Import the Data

We’ll be working with several versions of a dataset that includes participants’ scores on happiness and extraversion. Each version of the data has been modified to illustrate one of the violations of the assumptions underlying linear regression.

df <- import("https://raw.githubusercontent.com/uopsych/psy612/master/labs/lab-2/data/PSY612_Lab2_Data.csv") %>% 
  janitor::clean_names() %>% # convert variable names to snake_case
  select(extraversion, happiness)

df2 <- import("https://raw.githubusercontent.com/uopsych/psy612/master/labs/lab-6/data/df_nl.csv")

df3 <- import("https://raw.githubusercontent.com/uopsych/psy612/master/labs/lab-6/data/df_nne.csv")

df4 <- import("https://raw.githubusercontent.com/uopsych/psy612/master/labs/lab-6/data/df_het.csv")

df5 <- import("https://raw.githubusercontent.com/uopsych/psy612/master/labs/lab-6/data/lab6_outliers.txt") %>% 
  janitor::clean_names()

df6 <- import("https://raw.githubusercontent.com/uopsych/psy612/master/labs/lab-6/data/df_mc.csv")

Inspect the Data

Each dataset contains (mostly) the same variables. Let’s inspect just the df dataset.

head(df)
##   extraversion happiness
## 1           66        68
## 2           69        78
## 3           56        70
## 4           91        88
## 5           97        90
## 6           47        63
str(df)
## 'data.frame':    147 obs. of  2 variables:
##  $ extraversion: int  66 69 56 91 97 47 28 63 72 72 ...
##  $ happiness   : int  68 78 70 88 90 63 60 65 75 83 ...

Visualize Each Dataset

We’re going to focus on the relationship between happiness (Y) and extraversion (X). Let’s visualize this using a scatterplot for each dataset. Visualization is always a good place to start when inspecting the relationships between variables.

# Scatterplot using df
scatter1 <- ggplot(data = df, aes(x = extraversion, y = happiness)) +
  geom_point() + 
  geom_smooth(method = "lm") +
  labs(x = "Extraversion", y = "Happiness", title = "df") +
  theme(plot.title = element_text(hjust = 0.5)) # Center the title

# Scatterplot using df2
scatter2 <- ggplot(data = df2, aes(x = extraversion, y = happiness)) +
  geom_point() + 
  geom_smooth(method = "lm") +
  labs(x = "Extraversion", y = "Happiness", title = "df2") +
  theme(plot.title = element_text(hjust = 0.5))

# Scatterplot using df3
scatter3 <- ggplot(data = df3, aes(x = extraversion, y = happiness)) +
  geom_point() + 
  geom_smooth(method = "lm") +
  labs(x = "Extraversion", y = "Happiness", title = "df3") +
  theme(plot.title = element_text(hjust = 0.5))

# Scatterplot using df4
scatter4 <- ggplot(data = df4, aes(x = extraversion, y = happiness)) +
  geom_point() + 
  geom_smooth(method = "lm") +
  labs(x = "Extraversion", y = "Happiness", title = "df4") +
  theme(plot.title = element_text(hjust = 0.5))

# Scatterplot using df5
scatter5 <- ggplot(data = df5, aes(x = extraversion, y = happiness)) +
  geom_point() + 
  geom_smooth(method = "lm") +
  labs(x = "Extraversion", y = "Happiness", title = "df5") +
  theme(plot.title = element_text(hjust = 0.5))

# Scatterplot using df6
scatter6 <- ggplot(data = df6, aes(x = extraversion, y = happiness)) +
  geom_point() + 
  geom_smooth(method = "lm") +
  labs(x = "Extraversion", y = "Happiness", title = "df6") +
  theme(plot.title = element_text(hjust = 0.5))

Let’s look at them side-by-side.

grid.arrange(scatter1, scatter2, scatter3, scatter4, scatter5, scatter6)
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'

Q1

Question: Do you notice any potential issues in any of the scatterplots we have used to represent the data so far?

Potential Issues in the Data

Potential issues in the data that would violate one or more of the assumptions underlying linear regression include the following (we will look at ways of investigating each):

  1. Non-linearity
  2. Non-normally distributed errors
  3. Heteroscedasticity
  4. Non-independence among errors
  5. Outliers
  6. Multicollinearity

Running the linear regression models

model1 <- lm(happiness ~ extraversion, data = df)
model2 <- lm(happiness ~ extraversion, data = df2)
model3 <- lm(happiness ~ extraversion, data = df3)
model4 <- lm(happiness ~ extraversion, data = df4)
model5 <- lm(happiness ~ extraversion, data = df5)

Checking for non-linearity

To examine whether there is a non-linear trend in the relationship between X and Y that is not captured by our linear model, you can start by examining a scatterplot (as we did above).

Additionally, you can examine a plot of the model’s fitted values against the residuals. We can obtain this plot using the plot() function included in base R, passing the model as the input.

The plot() function provides a set of four different diagnostic plots. Let’s look at them all first for model1.

par(mfrow = c(2, 2)) # display plots in a 2x2 grid
plot(model1)

Let’s focus just on examining the plot of the fitted values by the residuals, which is the first plot.

Checking for non-linearity in model 1

par(mfrow = c(2, 1))
plot(model1, 1)

Now, check for non-linearity in model2.
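
For reference, a minimal sketch (using the same approach we used for model1) might look like this:

# Fitted values vs. residuals for model2
plot(model2, 1)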

Q2

Question: Do you see a pattern in either of these plots that suggests the presence of a non-linear trend between extraversion and happiness that is not captured by the linear model?

Checking for normally distributed errors

You can check whether the model’s residuals are normally distributed by plotting either a) the distribution of the residuals or b) a Q-Q plot.

Remember that we can pass our model to the augment() function to retrieve the model residuals. Let’s run augment() on each model here so we can use the output later.

aug_model1 <- augment(model1)
str(aug_model1)
## tibble [147 x 8] (S3: tbl_df/tbl/data.frame)
##  $ happiness   : int [1:147] 68 78 70 88 90 63 60 65 75 83 ...
##  $ extraversion: int [1:147] 66 69 56 91 97 47 28 63 72 72 ...
##  $ .fitted     : num [1:147] 72.3 73.5 68.4 82.2 84.6 ...
##  $ .resid      : num [1:147] -4.33 4.49 1.62 5.81 5.44 ...
##  $ .std.resid  : num [1:147] -0.442 0.459 0.165 0.601 0.566 ...
##  $ .hat        : num [1:147] 0.00788 0.00907 0.00708 0.03127 0.04144 ...
##  $ .sigma      : num [1:147] 9.85 9.85 9.86 9.84 9.85 ...
##  $ .cooksd     : num [1:147] 7.77e-04 9.64e-04 9.73e-05 5.83e-03 6.92e-03 ...
##  - attr(*, "terms")=Classes 'terms', 'formula'  language happiness ~ extraversion
##   .. ..- attr(*, "variables")= language list(happiness, extraversion)
##   .. ..- attr(*, "factors")= int [1:2, 1] 0 1
##   .. .. ..- attr(*, "dimnames")=List of 2
##   .. .. .. ..$ : chr [1:2] "happiness" "extraversion"
##   .. .. .. ..$ : chr "extraversion"
##   .. ..- attr(*, "term.labels")= chr "extraversion"
##   .. ..- attr(*, "order")= int 1
##   .. ..- attr(*, "intercept")= int 1
##   .. ..- attr(*, "response")= int 1
##   .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv> 
##   .. ..- attr(*, "predvars")= language list(happiness, extraversion)
##   .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
##   .. .. ..- attr(*, "names")= chr [1:2] "happiness" "extraversion"
aug_model2 <- augment(model2)
aug_model3 <- augment(model3)
aug_model4 <- augment(model4)
aug_model5 <- augment(model5)

Let’s check whether the residuals are normally distributed for model 1. First, we’ll use a density plot with a normal distribution overlaid on top to easily compare the two.

Density Plot

ggplot(data = aug_model1, aes(x = .resid)) + # Plot the residuals
  geom_density(fill = "purple") + # Density plot
  stat_function(linetype = 2, fun = dnorm, # Add a normal curve (fun = dnorm) as a dashed line (linetype = 2)
                args = list(mean = mean(aug_model1$.resid), # define mean and sd to line up the normal curve with the plot of the residuals
                            sd = sd(aug_model1$.resid))) +
  theme_minimal()

Next, let’s use a Q-Q plot.

QQ-Plot

graphics.off() # Returns the plotting dimensions back to the default

ggplot(model1) +
  stat_qq(aes(sample = .stdresid)) + # Plot a qq-plot
  geom_abline() + # Add a diagonal line
  theme_minimal()

A Q-Q plot plots the quantiles of a reference distribution (in this case, the normal distribution) on the x-axis against the quantiles of the residuals on the y-axis. If the residuals come from a normal distribution, the proportion of residuals falling between any two quantiles should approximate the proportion of a normal distribution falling between those same quantiles.

When the points on a QQ-plot are close to lying on the diagonal line (with a slope of 1) the residuals are approximately normally distributed.

Now, do the same thing, but this time let’s examine whether we have violated the assumption of normally distributed residuals for model 3.

Obtain a density plot of the residuals from model 3:

Obtain a Q-Q plot for model 3:
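
For reference, a minimal sketch mirroring the model1 code above:

# Density plot of model 3's residuals with a normal curve overlaid
ggplot(data = aug_model3, aes(x = .resid)) +
  geom_density(fill = "purple") +
  stat_function(linetype = 2, fun = dnorm,
                args = list(mean = mean(aug_model3$.resid),
                            sd = sd(aug_model3$.resid))) +
  theme_minimal()

# Q-Q plot of model 3's standardized residuals
ggplot(model3) +
  stat_qq(aes(sample = .stdresid)) +
  geom_abline() +
  theme_minimal()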

Q3

Question: Do the errors appear to be approximately normally distributed?

Checking for heteroscedasticity

The plot of the fitted values by the residuals for a model can also be used to check whether the assumption of homogeneity of variance has been met, aka whether error variance appears to be approximately the same across all values fitted by the model.

Let’s check for heteroscedasticity by examining the plot of the fitted values vs. residuals for model1.

plot(model1, 1)

Now, check for heteroscedasticity by examining the plot of the fitted values vs. residuals for model4.
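
For reference, a minimal sketch (same approach as for model1):

# Fitted values vs. residuals for model4
plot(model4, 1)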

Q4

Question: Does the amount of variability in the residuals seem to differ across the model’s fitted values for either of these models (aka, is there evidence of heteroscedasticity)?

Checking for non-independence among errors

One way to check for non-independence among errors is to examine whether the errors have a non-random (aka, systematic) relationship with a variable they shouldn’t, such as ID number (which indicates the order in which participants participated in the study), season, month, etc.

Let’s check whether we have violated the assumption of independence of errors by plotting residuals against ID numbers for model 1.

aug_model1$ID <- c(1:nrow(aug_model1)) # Add a column of ID numbers to the object containing the model residuals

ggplot(data = aug_model1, 
       aes(x=ID, y = .resid)) + 
  geom_point() +  
  geom_smooth(se = F) +
  geom_hline(yintercept = 0)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Repeat the same thing, but this time testing the independence of errors assumption for model 2.
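
For reference, a minimal sketch mirroring the model1 code above:

aug_model2$ID <- c(1:nrow(aug_model2)) # Add a column of ID numbers

ggplot(data = aug_model2, 
       aes(x = ID, y = .resid)) + 
  geom_point() +  
  geom_smooth(se = F) +
  geom_hline(yintercept = 0)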

Q5

Question: Does there appear to be a relationship between ID number and the residuals for either model above?

Checking for outliers

The presence of outliers means a few values could be having a disproportionate influence on a single variable (univariate outliers) or on the fit of a regression model (multivariate outliers). Therefore, we want to identify outliers so we can consider whether they should be removed or changed.

Univariate outliers

One way of visually seeing potential outliers for a single variable is by using a boxplot.

Let’s obtain a boxplot of the happiness outcome variable from the dataset called df5.

ggplot(df5) +
  aes(y = happiness) +
  geom_boxplot() +
  theme_minimal()

The outer edges of the boxplot correspond to the first and third quartiles (the values partitioning where the middle 50% of the data is located).

The upper whisker extends from the top edge of the boxplot to the largest value no further than 1.5 * IQR from the edge (recall that the IQR is the distance from the first to the third quartile). The lower whisker extends from the lower edge of the boxplot to the smallest value no further than 1.5 * IQR from the edge. Data points beyond the whiskers are considered outliers.

To determine the value of the potential outliers, you can use boxplot.stats(). The value(s) underneath $out are scores on the variable that landed more extreme than the boxplot’s whiskers.

boxplot.stats(df5$happiness)
## $stats
## [1] 35 60 70 78 98
## 
## $n
## [1] 165
## 
## $conf
## [1] 67.78595 72.21405
## 
## $out
## [1] 32

Let’s find out who in the original dataset this value corresponds to.

df5$ID <- c(1:nrow(df5)) # Add a column of ID numbers

df5[df5$happiness == 32,] # Use indexing [row, column] to grab the row for the person whose score on happiness is equal to the value of the outlier
##     extraversion happiness  ID
## 106           18        32 106

Not all outliers determined in this way necessarily need to be removed from or changed in the dataset. Use your knowledge of your field, the methods used, and how extreme the outlier is to determine if it should be addressed. You can also examine the mean of your variable with and without the outlier present to gauge how strong of an influence it is having.

Multivariate outliers

You can also have multivariate outliers, which are outliers because they are far away from the values predicted by your linear regression model or they have an undue influence on the fit of the model.

One straightforward way to visually inspect the data for multivariate outliers is to use bi-variate scatterplots for each combination of outcome and predictor.

Let’s look again at the scatterplot of the relationship between extraversion and happiness from df5.

scatter5 <- ggplot(data = df5, aes(x = extraversion, y = happiness)) +
  geom_point() + 
  geom_smooth(method = "lm") +
  labs(x = "Extraversion", y = "Happiness", title = "df5") +
  theme(plot.title = element_text(hjust = 0.5))

scatter5
## `geom_smooth()` using formula 'y ~ x'

Q6

Question: Do there appear to be any multivariate outliers in the scatterplot?

We can get a quantitative measure of the distance of points from the regression model using either 1) Standardized residuals or 2) Studentized residuals.

Standardized Residuals

Standardized residuals are the raw residuals put into standardized units by dividing by the residual standard error.
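
In symbols, a common definition (the one R’s rstandard() uses and what broom reports as .std.resid), which also adjusts for each observation’s leverage, is:

\[r_i = \frac{e_i}{\hat{\sigma}\sqrt{1 - h_i}}\]

where \(e_i\) is the raw residual, \(\hat{\sigma}\) is the residual standard error, and \(h_i\) is the observation’s leverage (the .hat column from augment()).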

Plot

ols_plot_resid_stand(model5)

Values

aug_model5$ID <- c(1:nrow(aug_model5))

aug_model5 %>% 
  arrange(desc(abs(.std.resid)))%>% # Arrange the dataset by highest to lowest standardized residuals
  head(n = 20)
## # A tibble: 20 x 9
##    happiness extraversion .fitted .resid .std.resid    .hat .sigma .cooksd    ID
##        <int>        <int>   <dbl>  <dbl>      <dbl>   <dbl>  <dbl>   <dbl> <int>
##  1        95           22    54.4   40.6       3.67 0.0235    10.8 0.162     161
##  2        91           17    52.4   38.6       3.50 0.0292    10.8 0.184     165
##  3        87           20    53.6   33.4       3.02 0.0257    10.9 0.120     149
##  4        85           16    52.0   33.0       2.99 0.0304    10.9 0.141     134
##  5        42           53    66.9  -24.9      -2.23 0.00615   11.1 0.0154     43
##  6        50           66    72.1  -22.1      -1.98 0.00785   11.1 0.0156    159
##  7        78           28    56.8   21.2       1.91 0.0178    11.1 0.0330    150
##  8        32           18    52.8  -20.8      -1.88 0.0280    11.1 0.0511    106
##  9        90           59    69.3   20.7       1.85 0.00627   11.1 0.0109     17
## 10        53           69    73.3  -20.3      -1.83 0.00900   11.1 0.0151    117
## 11        98           81    78.2   19.8       1.79 0.0164    11.1 0.0266    140
## 12        93           69    73.3   19.7       1.76 0.00900   11.1 0.0141     11
## 13        39           32    58.4  -19.4      -1.75 0.0146    11.1 0.0227     44
## 14        80           38    60.9   19.1       1.72 0.0108    11.1 0.0161     74
## 15        43           41    62.1  -19.1      -1.71 0.00929   11.1 0.0137     71
## 16        90           63    70.9   19.1       1.71 0.00699   11.1 0.0103     87
## 17        58           75    75.8  -17.8      -1.60 0.0121    11.1 0.0157     22
## 18        48           50    65.7  -17.7      -1.58 0.00651   11.1 0.00823   155
## 19        39           27    56.4  -17.4      -1.57 0.0187    11.1 0.0235    113
## 20        35           15    51.6  -16.6      -1.51 0.0316    11.2 0.0370    160

One standard I have seen is that standardized residuals greater than +/- 2 are worth inspecting further.
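
If you want to pull those cases out directly, one quick sketch (assuming the +/- 2 cutoff) is:

# Flag observations whose standardized residuals exceed +/- 2
aug_model5 %>% 
  filter(abs(.std.resid) > 2)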

Studentized Residuals

Studentized residuals are calculated by deleting each data point one at a time, refitting the regression model without it, and computing the standardized distance between that data point and the fitted value from the model fit without it.

Plot

ols_plot_resid_stud(model5)

Values

studentized_resids <- rstudent(model5)
sort(desc(abs(studentized_resids)))
##          161          165          149          134           43          159 
## -3.819637376 -3.628564078 -3.101559004 -3.071176868 -2.258516493 -2.002908811 
##          150          106           17          117          140           11 
## -1.924585581 -1.899319160 -1.868277131 -1.838297489 -1.798356902 -1.776114639 
##           44           74           71           87           22          155 
## -1.760219608 -1.730393699 -1.720887176 -1.720659584 -1.603405223 -1.592438965 
##          113          160           63          131          124           39 
## -1.578172050 -1.511761361 -1.466729648 -1.431827257 -1.428185997 -1.427937918 
##          154          139          143           57           18           99 
## -1.420385410 -1.399720929 -1.356638945 -1.324457667 -1.300614610 -1.300614610 
##          110           37           94           80           50          103 
## -1.300614610 -1.264485769 -1.264485769 -1.258452846 -1.230475980 -1.175887568 
##           84          104          164           14           35           56 
## -1.147443531 -1.137890607 -1.083146892 -1.082942241 -1.082942241 -1.066642279 
##           69          162           89          122           92           68 
## -1.065995704 -1.052278638 -1.047678112 -1.046895902 -1.031431416 -1.013588724 
##           62           12          152           21           78           40 
## -0.995262781 -0.978877243 -0.940493208 -0.904623143 -0.884808315 -0.872282646 
##           45          119           58          153           53          114 
## -0.872282646 -0.872282646 -0.846361143 -0.831188963 -0.817908797 -0.812266845 
##           65           34           10           60           42           76 
## -0.786833448 -0.763205589 -0.758238330 -0.740580529 -0.725408220 -0.725408220 
##          105          112           46          101          163          115 
## -0.725408220 -0.725408220 -0.724859243 -0.711570632 -0.705296264 -0.686842591 
##          126           20          125          109          130           86 
## -0.654491052 -0.650438765 -0.640918555 -0.638205170 -0.632446679 -0.625232099 
##          156           97          111           31          142          145 
## -0.602653888 -0.579445414 -0.574147588 -0.542050868 -0.540142567 -0.531974790 
##            8           36           64           88            4          100 
## -0.529595635 -0.529595635 -0.525378227 -0.524437649 -0.524009839 -0.516000427 
##          135           19           51          151           83           95 
## -0.516000427 -0.508785214 -0.508785214 -0.508785214 -0.508326032 -0.508326032 
##           33           72           30           54            5          132 
## -0.505832792 -0.504462925 -0.494072406 -0.489145899 -0.488158235 -0.477821753 
##          137          133          120           13            2          146 
## -0.470394341 -0.469615522 -0.437864206 -0.433716781 -0.417361066 -0.417361066 
##           96            1          128           49           48           52 
## -0.385754840 -0.369328478 -0.364726467 -0.361427504 -0.347900851 -0.337187531 
##          144          158           67          116           47          107 
## -0.337187531 -0.315039301 -0.309353202 -0.309353202 -0.298643122 -0.292303940 
##            7          118           81           85          127           24 
## -0.285504702 -0.284783406 -0.277101512 -0.277101512 -0.277101512 -0.256752149 
##           98           38           55           82           26           93 
## -0.255371344 -0.246809573 -0.246809573 -0.239914054 -0.206905343 -0.190373519 
##           66           91            3           77           32          102 
## -0.185853704 -0.185853704 -0.169629946 -0.169629946 -0.169118573 -0.169118573 
##          129           59           79           61           70            6 
## -0.164297772 -0.154887759 -0.154887759 -0.148760343 -0.148760343 -0.131958608 
##           41          141           27          138           16           25 
## -0.131958608 -0.123987655 -0.111199948 -0.093085898 -0.077880522 -0.072353091 
##          147          123           29          108            9           28 
## -0.067554984 -0.061683111 -0.061172369 -0.061172369 -0.040681337 -0.040681337 
##          136           23          148           73           75           90 
## -0.040681337 -0.034885310 -0.030174354 -0.023965022 -0.015314419 -0.015314419 
##          121          157           15 
## -0.009045418 -0.009045418 -0.002869881

One standard I have seen is that studentized residuals greater than +/- 3 are worth inspecting further.
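
A quick sketch for flagging those cases (assuming the +/- 3 cutoff):

# Indices of observations whose studentized residuals exceed +/- 3
which(abs(studentized_resids) > 3)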

Another way of examining whether outliers are present and should be dealt with is by measuring their influence on the regression model. Three measures of a data point’s influence on the regression model are 1) Cook’s Distance, 2) DFFITS, and 3) DFBETAS.

Cook’s Distance

Cook’s distance summarizes how much the regression model would change if you removed a particular observation (it measures the difference between the regression coefficients obtained from the full dataset and the regression coefficients that would be obtained if that observation were deleted).

Plot

ols_plot_cooksd_chart(model5)

Values

aug_model5 %>% 
  arrange(desc(abs(.cooksd)))%>% # Arrange the dataset by highest to lowest Cook's D
  head(n = 20)
## # A tibble: 20 x 9
##    happiness extraversion .fitted .resid .std.resid    .hat .sigma .cooksd    ID
##        <int>        <int>   <dbl>  <dbl>      <dbl>   <dbl>  <dbl>   <dbl> <int>
##  1        91           17    52.4   38.6       3.50 0.0292    10.8  0.184    165
##  2        95           22    54.4   40.6       3.67 0.0235    10.8  0.162    161
##  3        85           16    52.0   33.0       2.99 0.0304    10.9  0.141    134
##  4        87           20    53.6   33.4       3.02 0.0257    10.9  0.120    149
##  5        32           18    52.8  -20.8      -1.88 0.0280    11.1  0.0511   106
##  6        35           15    51.6  -16.6      -1.51 0.0316    11.2  0.0370   160
##  7        78           28    56.8   21.2       1.91 0.0178    11.1  0.0330   150
##  8        37           17    52.4  -15.4      -1.40 0.0292    11.2  0.0293   139
##  9        98           81    78.2   19.8       1.79 0.0164    11.1  0.0266   140
## 10        39           27    56.4  -17.4      -1.57 0.0187    11.1  0.0235   113
## 11        39           32    58.4  -19.4      -1.75 0.0146    11.1  0.0227    44
## 12        40           15    51.6  -11.6      -1.05 0.0316    11.2  0.0181   162
## 13        80           38    60.9   19.1       1.72 0.0108    11.1  0.0161    74
## 14        58           75    75.8  -17.8      -1.60 0.0121    11.1  0.0157    22
## 15        50           66    72.1  -22.1      -1.98 0.00785   11.1  0.0156   159
## 16        42           53    66.9  -24.9      -2.23 0.00615   11.1  0.0154    43
## 17        53           69    73.3  -20.3      -1.83 0.00900   11.1  0.0151   117
## 18        93           84    79.4   13.6       1.23 0.0190    11.2  0.0146    50
## 19        93           69    73.3   19.7       1.76 0.00900   11.1  0.0141    11
## 20        43           41    62.1  -19.1      -1.71 0.00929   11.1  0.0137    71

One standard I have seen is that Cook’s D values greater than 3 times the average Cook’s D value are worth investigating.
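
A quick sketch applying that rule of thumb to model5:

# Flag observations with Cook's D more than 3 times the average Cook's D
aug_model5 %>% 
  filter(.cooksd > 3 * mean(.cooksd))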

DFBETAS

DFBETA measures the difference in each parameter estimate with and without a particular point being included in the data.

Plot

ols_plot_dfbetas(model5)

Value

dfbetas <- dfbeta(model5)
sort(desc(abs(dfbetas)))
##   [1] -1.567056e+00 -1.457327e+00 -1.371171e+00 -1.261344e+00 -8.244636e-01
##   [6] -7.048208e-01 -6.432736e-01 -6.248112e-01 -5.451115e-01 -5.198217e-01
##  [11] -4.923671e-01 -4.095238e-01 -4.091340e-01 -3.736228e-01 -3.564059e-01
##  [16] -3.320981e-01 -3.281228e-01 -3.136707e-01 -3.108772e-01 -3.090735e-01
##  [21] -2.870816e-01 -2.788880e-01 -2.622220e-01 -2.567645e-01 -2.363793e-01
##  [26] -2.255097e-01 -2.255097e-01 -2.185162e-01 -2.025141e-01 -2.007930e-01
##  [31] -1.986876e-01 -1.957449e-01 -1.940032e-01 -1.940032e-01 -1.940032e-01
##  [36] -1.935363e-01 -1.903847e-01 -1.845945e-01 -1.774783e-01 -1.763052e-01
##  [41] -1.694113e-01 -1.682222e-01 -1.626504e-01 -1.561528e-01 -1.561528e-01
##  [46] -1.561528e-01 -1.536908e-01 -1.527392e-01 -1.487820e-01 -1.484796e-01
##  [51] -1.465284e-01 -1.427316e-01 -1.371014e-01 -1.365570e-01 -1.326626e-01
##  [56] -1.320124e-01 -1.227999e-01 -1.213243e-01 -1.197476e-01 -1.157756e-01
##  [61] -1.142295e-01 -1.110806e-01 -1.082132e-01 -1.080169e-01 -1.066312e-01
##  [66] -9.650634e-02 -9.648761e-02 -9.378822e-02 -9.378822e-02 -9.178472e-02
##  [71] -9.036037e-02 -8.679186e-02 -8.369211e-02 -8.318888e-02 -7.875395e-02
##  [76] -7.402521e-02 -7.215159e-02 -7.193058e-02 -6.867100e-02 -6.595416e-02
##  [81] -6.595416e-02 -6.595416e-02 -6.595416e-02 -6.554705e-02 -6.439321e-02
##  [86] -6.439321e-02 -6.119816e-02 -6.119816e-02 -6.092314e-02 -5.965686e-02
##  [91] -5.871003e-02 -5.564519e-02 -5.246369e-02 -5.243608e-02 -5.158903e-02
##  [96] -4.722237e-02 -4.702550e-02 -4.645731e-02 -4.462771e-02 -3.260897e-02
## [101] -3.167067e-02 -3.142636e-02 -3.093977e-02 -3.081960e-02 -3.081960e-02
## [106] -3.027250e-02 -2.967387e-02 -2.948057e-02 -2.948057e-02 -2.890152e-02
## [111] -2.779041e-02 -2.779041e-02 -2.745444e-02 -2.745444e-02 -2.698589e-02
## [116] -2.654446e-02 -2.492622e-02 -2.396734e-02 -2.317284e-02 -2.178640e-02
## [121] -2.105370e-02 -2.056027e-02 -1.978474e-02 -1.978474e-02 -1.965020e-02
## [126] -1.908387e-02 -1.904292e-02 -1.708402e-02 -1.708402e-02 -1.708402e-02
## [131] -1.640725e-02 -1.640725e-02 -1.640725e-02 -1.539987e-02 -1.539987e-02
## [136] -1.438501e-02 -1.342129e-02 -1.255802e-02 -1.203048e-02 -1.181446e-02
## [141] -1.086274e-02 -1.045967e-02 -1.045967e-02 -9.858514e-03 -9.790199e-03
## [146] -9.790199e-03 -9.556175e-03 -9.332325e-03 -9.264974e-03 -8.718321e-03
## [151] -8.718321e-03 -8.437191e-03 -8.138992e-03 -7.907866e-03 -7.588388e-03
## [156] -7.370422e-03 -7.370422e-03 -7.234914e-03 -6.920255e-03 -6.762375e-03
## [161] -6.260426e-03 -6.122199e-03 -5.585495e-03 -5.559828e-03 -5.274375e-03
## [166] -5.049406e-03 -4.933507e-03 -4.855396e-03 -4.707727e-03 -4.707647e-03
## [171] -4.638047e-03 -4.569334e-03 -4.412349e-03 -4.334384e-03 -4.334163e-03
## [176] -4.300185e-03 -4.265996e-03 -4.146023e-03 -4.034497e-03 -3.994025e-03
## [181] -3.985003e-03 -3.947269e-03 -3.877952e-03 -3.877952e-03 -3.877952e-03
## [186] -3.743306e-03 -3.705246e-03 -3.666236e-03 -3.666236e-03 -3.650748e-03
## [191] -3.388141e-03 -3.341668e-03 -3.149493e-03 -3.001766e-03 -2.977125e-03
## [196] -2.931791e-03 -2.831293e-03 -2.821974e-03 -2.821974e-03 -2.742507e-03
## [201] -2.742507e-03 -2.684721e-03 -2.645482e-03 -2.624258e-03 -2.583539e-03
## [206] -2.530327e-03 -2.526043e-03 -2.517873e-03 -2.318212e-03 -2.239954e-03
## [211] -2.183546e-03 -2.110426e-03 -2.108478e-03 -2.060274e-03 -1.990716e-03
## [216] -1.985961e-03 -1.969046e-03 -1.934420e-03 -1.909295e-03 -1.909295e-03
## [221] -1.909295e-03 -1.805715e-03 -1.801900e-03 -1.801900e-03 -1.747613e-03
## [226] -1.747613e-03 -1.747613e-03 -1.710694e-03 -1.710694e-03 -1.642869e-03
## [231] -1.605451e-03 -1.544244e-03 -1.491184e-03 -1.484718e-03 -1.389697e-03
## [236] -1.350893e-03 -1.326901e-03 -1.301347e-03 -1.287587e-03 -1.287587e-03
## [241] -1.277796e-03 -1.277796e-03 -1.208891e-03 -1.205927e-03 -1.202503e-03
## [246] -1.157970e-03 -1.124196e-03 -1.069580e-03 -1.011616e-03 -1.011616e-03
## [251] -9.931352e-04 -9.478930e-04 -9.166344e-04 -9.152274e-04 -9.152274e-04
## [256] -8.841690e-04 -8.624174e-04 -8.624174e-04 -8.541673e-04 -7.853943e-04
## [261] -7.509008e-04 -7.285237e-04 -7.240528e-04 -7.193823e-04 -7.193823e-04
## [266] -6.984511e-04 -6.572989e-04 -6.337364e-04 -6.168387e-04 -5.868480e-04
## [271] -5.578053e-04 -5.578053e-04 -4.956556e-04 -4.856593e-04 -4.826419e-04
## [276] -4.793333e-04 -4.793333e-04 -4.647418e-04 -4.617870e-04 -4.215319e-04
## [281] -4.089958e-04 -3.753083e-04 -3.638759e-04 -3.607404e-04 -3.607404e-04
## [286] -3.601338e-04 -3.307209e-04 -3.307209e-04 -3.307209e-04 -3.110215e-04
## [291] -3.110215e-04 -2.985266e-04 -2.985266e-04 -2.985266e-04 -2.985266e-04
## [296] -2.610847e-04 -2.526480e-04 -2.526480e-04 -2.360959e-04 -2.263705e-04
## [301] -1.952341e-04 -1.947128e-04 -1.947128e-04 -1.607277e-04 -1.473420e-04
## [306] -1.433501e-04 -1.266422e-04 -1.203914e-04 -1.203914e-04 -1.203914e-04
## [311] -1.075110e-04 -8.622436e-05 -7.317689e-05 -6.999826e-05 -6.999826e-05
## [316] -6.970400e-05 -6.970400e-05 -5.772868e-05 -5.772868e-05 -5.213892e-05
## [321] -4.812631e-05 -4.012685e-05 -3.300771e-05 -3.300771e-05 -3.300771e-05
## [326] -2.020893e-05 -2.020893e-05 -1.242336e-05 -1.077725e-06 -1.077725e-06

One standard I have seen is that DFBETAS larger than 2/sqrt(n) in absolute value should be investigated as potential outliers.
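
Note that this cutoff is typically applied to the standardized DFBETAS. A quick sketch using base R’s dfbetas(), which returns the standardized values (one column per coefficient):

# Rows where any standardized DFBETA exceeds 2/sqrt(n) in absolute value
dfbetas_std <- dfbetas(model5)
which(apply(abs(dfbetas_std) > 2 / sqrt(nrow(df5)), 1, any))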

DFFITS

DFFITS measure the difference in the model’s fitted values with and without a particular point being included in the data.

Plot

ols_plot_dffits(model5)

Values

dffits <- dffits(model5)
sort(desc(abs(dffits)))
##           165           161           134           149           106 
## -0.6288673131 -0.5928872219 -0.5436387359 -0.5035906823 -0.3221894932 
##           160           150           139           140           113 
## -0.2732388551 -0.2591164611 -0.2425859709 -0.2323931665 -0.2177338052 
##            44           162            74           159            22 
## -0.2143892880 -0.1901910036 -0.1806530784 -0.1781795750 -0.1777962558 
##            43           117            50            11            71 
## -0.1776030053 -0.1751895094 -0.1711919473 -0.1692634919 -0.1666295913 
##            92           154            80            17           152 
## -0.1600996238 -0.1575017993 -0.1507366838 -0.1484347895 -0.1459841211 
##            87           124            69            89           155 
## -0.1443254034 -0.1429806471 -0.1377535888 -0.1342209106 -0.1288744686 
##            84            39           153            58           131 
## -0.1272361850 -0.1270299326 -0.1231881531 -0.1225143224 -0.1200987392 
##           163            57           164            63            78 
## -0.1196423700 -0.1195248493 -0.1190417157 -0.1165318051 -0.1143395986 
##           115            18            99           110            20 
## -0.1115205380 -0.1103803468 -0.1103803468 -0.1103803468 -0.1064207823 
##            37            94           143            14            35 
## -0.1060624777 -0.1060624777 -0.1059982621 -0.1048588342 -0.1048588342 
##           122            88           103           104            65 
## -0.0997690421 -0.0987382253 -0.0924682935 -0.0920883315 -0.0910491295 
##             5             4            54            56           142 
## -0.0906991284 -0.0857352605 -0.0853982417 -0.0833399544 -0.0825933270 
##            12            68           111            40            45 
## -0.0821062192 -0.0805297172 -0.0798792054 -0.0787185988 -0.0787185988 
##           119           114            72            62            10 
## -0.0787185988 -0.0786499512 -0.0783031458 -0.0782644986 -0.0778510178 
##            86            53            21            33           133 
## -0.0748899043 -0.0727614962 -0.0706809143 -0.0703747997 -0.0696003813 
##           101           125            60            31            34 
## -0.0688997660 -0.0669118883 -0.0668332241 -0.0649265092 -0.0647716064 
##           130            13           100           135            42 
## -0.0612383735 -0.0583934321 -0.0572175658 -0.0572175658 -0.0570439403 
##            76           105           112           109            46 
## -0.0570439403 -0.0570439403 -0.0570439403 -0.0567749891 -0.0566354226 
##            49           156           126           137           145 
## -0.0552659646 -0.0543861212 -0.0529672963 -0.0521604591 -0.0515099090 
##            97            64           132             8            36 
## -0.0491762781 -0.0467378588 -0.0455363503 -0.0444213977 -0.0444213977 
##            30            83            95            19            51 
## -0.0419308557 -0.0411383096 -0.0411383096 -0.0404230319 -0.0404230319 
##           151             2           146            52           144 
## -0.0404230319 -0.0397744548 -0.0397744548 -0.0390179538 -0.0390179538 
##             7           118           120             1            67 
## -0.0384389079 -0.0368012141 -0.0342116412 -0.0328556102 -0.0317623901 
##           116            96           128            47            38 
## -0.0317623901 -0.0312187475 -0.0305925094 -0.0284606503 -0.0273678901 
##            55            48           158            98           107 
## -0.0273678901 -0.0273578860 -0.0267367036 -0.0266607648 -0.0263787852 
##            24            81            85           127           129 
## -0.0228407747 -0.0216507706 -0.0216507706 -0.0216507706 -0.0212314249 
##            82           141            93            26            66 
## -0.0194160008 -0.0172499817 -0.0169357050 -0.0167446393 -0.0155890282 
##            91            61            70            59            79 
## -0.0155890282 -0.0141768412 -0.0141768412 -0.0139777484 -0.0139777484 
##            32           102             3            77            27 
## -0.0132989805 -0.0132989805 -0.0132536955 -0.0132536955 -0.0128676004 
##            25             6            41           138           147 
## -0.0118379669 -0.0111990415 -0.0111990415 -0.0111497698 -0.0074909468 
##            16            23            29           108           123 
## -0.0069282826 -0.0064816426 -0.0049506177 -0.0049506177 -0.0049007288 
##             9            28           136           148            73 
## -0.0041768971 -0.0041768971 -0.0041768971 -0.0028756119 -0.0021627084 
##            75            90           121           157            15 
## -0.0019790100 -0.0019790100 -0.0007067456 -0.0007067456 -0.0003585266

One standard I have seen is that DFFITS larger in absolute value than 2 * sqrt((p + 1) / (n - p - 1)) should be investigated as potential outliers (n = number of observations, p = number of predictors including the intercept).
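
A quick sketch applying that cutoff to model5 (here p = 2, counting the intercept and extraversion):

# Flag observations whose |DFFITS| exceeds the suggested cutoff
n <- nrow(df5)
p <- 2 # number of predictors including the intercept
which(abs(dffits) > 2 * sqrt((p + 1) / (n - p - 1)))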

Using sjPlot to easily obtain Regression Diagnostic Plots

Lastly, the plot_model() function from the sjPlot package is another handy option for easily obtaining a variety of regression diagnostic plots.

Within plot_model(), provide the argument type = "diag" to obtain the regression diagnostic plots.

plot_model(model5, type = "diag")
## [[1]]
## `geom_smooth()` using formula 'y ~ x'

## 
## [[2]]

## 
## [[3]]
## `geom_smooth()` using formula 'y ~ x'

Checking for multicollinearity

Multicollinearity occurs when two or more predictors in our model are highly correlated with one another.

One way we can check this before we even run our model is to look at the zero-order correlations between variables. The df6 dataset contains two predictor variables. Let’s investigate the correlation between them.

head(df6)
##   extraversion happiness social_engagement
## 1           66        68                72
## 2           69        78                85
## 3           56        70                63
## 4           91        88               104
## 5           97        90               100
## 6           47        63                56
cor(df6$extraversion, df6$social_engagement)
## [1] 0.9619373

Extraversion and social engagement appear to be VERY highly correlated (r = 0.96)!

VIF (Variance Inflation Factor) / Tolerance

Recall that we have two measures of multicollinearity, tolerance and VIF.

Tolerance is: \[1 - R_{12}^2\]

where \(R_{12}^2\) is the \(R^2\) obtained from regressing one predictor on the other predictor(s) in the model.

And VIF (variance inflation factor) is: \[1/Tolerance\]
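
To make the formulas concrete, here is a quick sketch computing tolerance and VIF by hand for the two predictors in df6 (regressing one predictor on the other and plugging the resulting \(R^2\) into the formulas above); it should reproduce the values that ols_vif_tol() reports below.

# R^2 from regressing one predictor on the other
r2_12 <- summary(lm(extraversion ~ social_engagement, data = df6))$r.squared

tolerance <- 1 - r2_12 # Tolerance
vif <- 1 / tolerance   # VIF
c(tolerance = tolerance, vif = vif)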

Let’s run a model with two predictors using df6. We’ll predict happiness from both extraversion and social engagement, two predictors that sound like they have the potential to be correlated.

model6 <- lm(happiness ~ extraversion + social_engagement, data = df6)

When there are multiple predictors in a model, the output of the plot_model() function will also provide a plot of VIFs.

plot_model(model6, type = "diag")
## [[1]]

## 
## [[2]]
## `geom_smooth()` using formula 'y ~ x'

## 
## [[3]]

## 
## [[4]]
## `geom_smooth()` using formula 'y ~ x'

Values

ols_vif_tol(model6)
##           Variables  Tolerance      VIF
## 1      extraversion 0.07467659 13.39108
## 2 social_engagement 0.07467659 13.39108

Either a low tolerance (below 0.20 is one standard I have seen) or a high VIF (above 5 or 10 are standards I have seen) is an indication of a problem with multicollinearity.

Minihack 1

Load the bfi dataset from the psych package (this is done for you below, in case you haven’t loaded a dataset from a package).

First, create a model regressing age on the 5 items for Conscientiousness (labeled C1, C2, C3, C4, and C5).

data(bfi, package = "psych")

Next, check if the data meet the homoscedasticity assumption.

Next, check if the errors are normally distributed using a plot.

Minihack 2

Using the df5 data from above, run a regression model with and without the outliers we identified.

Compare the output of the two models. How similar/different are they? What values change?

Minihack 3

Using the fitness data from the olsrr package (loaded for you below), regress runtime on weight, restpulse, runpulse, and maxpulse.

data("fitness") # load data

Check the multicollinearity of predictors. Is multicollinearity at a tolerable level?