You will need the Rmd file for today’s lab. You can download it here.
Be sure to have the following packages installed and loaded:
library(tidyverse) # for plotting and data wrangling
library(rio) # for importing data
library(boot) # for bootstrapping
library(psych)
We’ll work with the Lab 2 dataset we’ve used before:
df <- import("https://raw.githubusercontent.com/uopsych/psy612/master/labs/lab-2/data/PSY612_Lab2_Data.csv") %>%
janitor::clean_names()
And we’ll also work with the dataframe we used in the regression diagnostics lab (df_het) that demonstrated an issue with heteroskedasticity. It has only the variables extraversion and happiness.
df_het <- import(file = "https://raw.githubusercontent.com/uopsych/psy612/master/labs/lab-6/data/df_het.csv",
setclass = "tibble") %>%
janitor::clean_names()
Bootstrapping is a procedure for empirically determining a sampling distribution when we cannot assume a statistic follows a known distribution. Specifically, we resample with replacement from our own sample to determine how a statistic is distributed when using the current data.
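As a toy illustration of the idea (not using {boot} yet), we can bootstrap the mean of a small vector by hand with sample() and replicate(); the numbers here are made up purely for demonstration:

```r
set.seed(123)
x <- c(2, 4, 5, 7, 9, 12)            # a small sample
boot_means <- replicate(1000, {
  # resample with replacement from our own sample...
  resample <- sample(x, size = length(x), replace = TRUE)
  mean(resample)                     # ...and compute the statistic each time
})
sd(boot_means)                       # the SD of these estimates is the bootstrapped SE
```

The 1,000 means form an empirical sampling distribution, which is exactly what {boot} automates for arbitrary statistics.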
This lab covers how to use bootstrapping in R using the {boot} library.
Bootstrapping is a general-purpose tool to use whenever you can’t rely on assumptions about how a statistic is distributed. Some cases where it is useful include:
When test assumptions (e.g., normality, homoscedasticity) have been violated
When you don’t know the shape of a statistic’s sampling distribution (or have good reason to believe it is non-normal)
With highly unbalanced sample sizes
It is often used to construct CIs based on bootstrapped samples rather than CIs that rely on assumptions about the data.
For example, recall from the regression diagnostics lab that we ran into a model that demonstrated heteroskedasticity.
model <- lm(happiness ~ extraversion, data = df_het)
plot(model, 1) # get residuals plot
Ordinary least squares calculates standard errors under the assumption that the residuals are approximately equally distributed across the range of our model (the homoscedasticity assumption).
summary(model)
Thus, the standard errors calculated for our model coefficients are calculated based on an assumption that we can see we have violated. Instead of using SEs that rely on a model assumption we’ve violated, we can estimate SEs from the data using the bootstrapping method.
boot()
We’re going to focus on using the boot() function from the {boot} package. boot() has many arguments, but there are three that are necessary:
data = the name of the dataframe you’re working with.
statistic = a function that returns the desired statistic when applied to the data.
R = the number of resamples. You usually want to set it to a very large number (e.g., 10,000).
And that’s it. The only part that is a little tricky is specifying argument 2, because you have to write your own function to get the statistic you’re looking for. This may be new to all of you, but writing functions in R is a powerful skill and so it is something that is definitely worth practicing.
boot()
For boot(), we want to write a function that returns the statistic we’re interested in bootstrapping. Basically, you want to set up the body such that it has two necessary arguments, in order: data and index. The data argument comes first, so whatever you name it will represent the dataframe in the body of the function. The index argument comes second, so whatever you name it should be used to subset the data.
Let’s write a function that runs our model (lm(happiness ~ extraversion)) and extracts from the model output the parameter estimate corresponding to extraversion (b_ext).
get_b_ext <- function(data, index){
  data_subset <- data[index, ]                # resample rows using the bootstrap index
  model <- lm(happiness ~ extraversion, data = data_subset)
  b_ext <- summary(model)$coefficients[2, 1]  # slope estimate for extraversion
  return(b_ext)
}
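Before handing this function to boot(), it can be worth sanity-checking it by applying it to the full, unresampled data (i.e., passing the identity index 1:nrow(df_het)); it should reproduce the slope from fitting the model directly:

```r
# with the identity index, the function should return the original estimate
get_b_ext(df_het, 1:nrow(df_het))
coef(lm(happiness ~ extraversion, data = df_het))["extraversion"]
```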
boot()
Now we have the function that boot() can use to bootstrap the statistic we chose. Let’s go ahead and give it a try, setting (in order) the data, statistic, and R arguments:
set.seed(101) # don't forget to set.seed()
boot_b_ext <- boot(data = df_het,
statistic = get_b_ext,
R = 10000)
boot() Results
The most important object in the list returned by boot() is t, a matrix that contains our bootstrapped results. It will have R rows, so in this case 10,000 rows; this means there are 10,000 estimates of \(b_{ext}\), each pertaining to the estimate obtained in one of the resamples.
Let’s look at just the first ten estimates of b_ext:
boot_b_ext$t[1:10,]
Let’s also see what the mean and SD of the b_ext values are:
mean(boot_b_ext$t)
sd(boot_b_ext$t)
And let’s look at what the resulting object contains:
boot_b_ext
The original value refers to the estimate you would have obtained if you had not used bootstrapping, the bias value refers to the difference between the bootstrapped estimate and that original estimate, and std.error refers to the standard deviation of the bootstrapped sampling distribution.
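These same quantities can be computed by hand from the returned object; the original estimate is stored in the t0 element:

```r
boot_b_ext$t0                        # "original": the non-bootstrapped estimate
mean(boot_b_ext$t) - boot_b_ext$t0   # "bias": bootstrapped mean minus the original
sd(boot_b_ext$t)                     # "std. error": SD of the bootstrapped estimates
```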
We can also generate plots to describe our bootstrapped distribution.
Let’s do that using {ggplot2}. The only slightly tricky part is that we have to turn the bootstrapped results into a dataframe or tibble first. I’ve done this below using the tibble() function from the {tidyverse}.
First, let’s create a density plot:
# create a density plot of the bootstrapped estimates
tibble(boot_b_ext = boot_b_ext$t[, 1]) %>%  # $t is a one-column matrix; [, 1] makes it a vector
ggplot(aes(x = boot_b_ext)) +
geom_density(fill = "turquoise", alpha = .6) +
theme_bw()
And we can get the 95% CI by using boot.ci(), passing the results of boot() as the first argument and setting type = "perc" for percentile intervals.
ci <- boot.ci(boot_b_ext, type = "perc")
ci
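If you want just the lower and upper bounds as numbers (e.g., for reporting or plotting), they are stored in the percent element of the boot.ci() output; the last two entries of that row are the bounds:

```r
ci$percent[4:5]   # lower and upper limits of the 95% percentile CI
```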
The above was an example of bootstrapping a statistic when assumptions about the model have been violated. Let’s also go over an example of bootstrapping a statistic to estimate its standard error when the shape of its sampling distribution is unknown or non-normally distributed, like the sampling distribution of \(R^2\).
First, let’s examine the other data set.
head(df)
str(df)
Let’s use these variables to construct a model predicting happiness from all other variables in the data frame.
model <- lm(happiness ~ extraversion + pos_express + soc_sup + optimism + exercise, data = df)
summary(model)
What if we wanted to know the standard error associated with the \(R^2\) value? Notice there is none given in the output. Instead, we can estimate the standard error by bootstrapping this statistic.
First, write a function called get_r_sq that will run the above model and extract the \(R^2\) value from the output:
# your code here
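If you get stuck, here is one possible solution, mirroring the structure of get_b_ext above (this is a sketch, not the only correct answer):

```r
get_r_sq <- function(data, index){
  data_subset <- data[index, ]   # resample rows using the bootstrap index
  model <- lm(happiness ~ extraversion + pos_express + soc_sup + optimism + exercise,
              data = data_subset)
  summary(model)$r.squared       # extract R^2 from the model summary
}
```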
Next, use that function in the boot() function to resample from the data set 10,000 times, obtaining an estimate of \(R^2\) each time:
set.seed(21) # don't forget to set.seed()
boot_r_sq <- boot(data = # your code here
statistic = # your code here
R = # your code here)
Examine the mean and SD of the \(R^2\) values you obtained through bootstrapping:
# your code here
Plot the distribution of \(R^2\) values:
# your code here
Obtain a 95%CI around the mean estimate of the \(R^2\) value:
# your code here
You can also add additional arguments that you want to provide as inputs to your statistic-grabbing function.
For example, say we want to specify a different model each time:
get_r_sq <- function(data, index, formula){
  data_subset <- data[index, ]      # resample rows using the bootstrap index
  model <- lm(formula, data = data_subset)
  rsq <- summary(model)$r.squared   # extract R^2 from the fitted model
  return(rsq)
}
Then, when we use our boot function, we can run a number of different models to construct a sampling distribution of \(R^2\) values from:
boot_r2_model1 <- boot(data = df,
statistic = get_r_sq,
formula = happiness ~ extraversion,
R = 1000)
boot_r2_model2 <- boot(data = df,
statistic = get_r_sq,
formula = happiness ~ extraversion + pos_express,
R = 1000)
boot_r2_model3 <- boot(data = df,
statistic = get_r_sq,
formula = happiness ~ extraversion + pos_express + soc_sup,
R = 1000)
boot_r2_model4 <- boot(data = df,
statistic = get_r_sq,
formula = happiness ~ extraversion + pos_express + soc_sup + optimism,
R = 1000)
boot_r2_model5 <- boot(data = df,
statistic = get_r_sq,
formula = happiness ~ extraversion + pos_express + soc_sup + optimism + exercise,
R = 1000)
Let’s see some cool visualizations we can do with these five different sampling distributions. First, let’s organize the output of each one (the sampled \(R^2\) values) and the names of the predictors for each model into separate tables:
r2_model1_results <- tibble(rsq = boot_r2_model1$t[, 1],  # [, 1] converts the one-column matrix to a vector
                            predictors = "Extr.")
r2_model2_results <- tibble(rsq = boot_r2_model2$t[, 1],
                            predictors = "Extr. + Pos. Expr.")
r2_model3_results <- tibble(rsq = boot_r2_model3$t[, 1],
                            predictors = "Extr. + Pos. Expr. + Soc. Supp.")
r2_model4_results <- tibble(rsq = boot_r2_model4$t[, 1],
                            predictors = "Extr. + Pos. Expr. + Soc. Supp. + Opt.")
r2_model5_results <- tibble(rsq = boot_r2_model5$t[, 1],
                            predictors = "Extr. + Pos. Expr. + Soc. Supp. + Opt. + Exercise")
Combine these results for each model into a single data frame. Also, calculate the medians and CIs for each of the models:
r2_all_models <- r2_model1_results %>%
rbind(r2_model2_results) %>%
rbind(r2_model3_results) %>%
rbind(r2_model4_results) %>%
rbind(r2_model5_results)
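To compute the medians and CIs mentioned above, one option is to summarize r2_all_models by model, using quantile() for the 95% percentile limits:

```r
r2_summaries <- r2_all_models %>%
  group_by(predictors) %>%
  summarize(median_rsq = median(rsq),
            ci_lower   = quantile(rsq, .025),  # lower 95% percentile limit
            ci_upper   = quantile(rsq, .975))  # upper 95% percentile limit
r2_summaries
```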
We could show each of the sampling distributions side-by-side using geom_density, filling the color of the density geoms according to the model they correspond to. Let’s take a look at what that looks like:
ggplot(r2_all_models, aes(x = rsq, fill = predictors)) +
geom_density(alpha = .5) +
labs(x = expression(paste("Bootstrapped estimates of ", R^{2}))) +
guides(fill = guide_legend(nrow = 3)) +
scale_fill_viridis_d() +
theme_bw() +
theme(legend.position = "bottom")
Using the same dataset above, calculate bootstrapped estimates of adjusted \(R^2\) value:
# your code here
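If you get stuck, note that the only change needed from get_r_sq is the element extracted from the model summary; a sketch of one possible solution:

```r
get_adj_r_sq <- function(data, index){
  data_subset <- data[index, ]       # resample rows using the bootstrap index
  model <- lm(happiness ~ extraversion + pos_express + soc_sup + optimism + exercise,
              data = data_subset)
  summary(model)$adj.r.squared       # adjusted R^2 instead of R^2
}
```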
Get bootstrapped estimates of correlations between:
Then, construct a plot of the distribution for each one.
# your code here