You will need the Rmd file for today’s lab - you can download it here.
Bootstrapping is a procedure for empirically determining a sampling distribution when we cannot assume a statistic follows a known distribution. This lab covers how to do bootstrapping in R with the {boot} library.
Be sure to have the following packages installed and loaded:
library(tidyverse) # for plotting and data wrangling
library(rio) # for importing data
library(boot) # for bootstrapping
We’ll work with the lab 2 dataset we worked with before. Recall that it has measures of extraversion, positive expressivity, social support, optimism, exercise, and happiness:
df <- import("https://raw.githubusercontent.com/uopsych/psy612/master/labs/lab-2/data/PSY612_Lab2_Data.csv") %>%
janitor::clean_names() # to get things into snake_case
Recall from lecture that bootstrapping is a general-purpose tool you can use whenever you can’t make assumptions about how a statistic is distributed. One of its main uses is to construct CIs, but it can also be used to obtain a standard error and, from that, a test statistic. Today we will focus just on CIs and on plotting the full (bootstrapped) distribution.
We’re going to focus on using the boot() function from the {boot} library. boot() has many optional arguments, but there are three that are necessary:
1. data = the name of the dataframe you’re working with.
2. statistic = a function that returns the desired statistic when applied to the data.
3. R = the number of resamples. You usually want to set it to a very large number (e.g., 10,000).
And that’s it. The only part that is a little tricky is 2., because you have to write your own function to get the statistic you’re looking for. This may be new to some of you, but writing functions in R is a powerful skill that is definitely worth practicing.
You can write your own functions using the function() function. Functions have a few different parts:
1. Arguments: these go inside the () when you call function(). They are typically names that will be referenced in the body of your function.
2. A body: the code that runs when the function is called, which you enclose in {}.
3. A return value: the output of your function, which you can specify using return() within the body of your function.
Let’s walk through an example. Let’s make a function that calculates the mean of a vector and prints it in the format: "The mean is _", where the blank is filled in with the mean.
To create a function, you use the function() function. Like anything you make in R, you need to save it to a name with <-. Within the parentheses, you put the arguments. Then you define the body of the function in {}:
pretty_mean <- function(data){ # name and arguments
# Body of function:
mean <- mean(data, na.rm = TRUE)
x <- paste("The Mean is", mean)
# specifying what to return as output
return(x)
}
Now let’s test it out on extraversion from the data we loaded above:
pretty_mean(df$extraversion)
## [1] "The Mean is 59.3741496598639"
Notice that it works how we expected it to, and also that it didn’t create an object in our global environment called mean. That’s because variables within a function are local: they only exist within that function and don’t get created in the global environment.
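If you want to check this for yourself, one quick test (our addition, not part of the original lab) is to ask R whether x exists after the function has run:
exists("x") # FALSE: x was only created inside pretty_mean(), not in the global environment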
To learn more about functions, check out Hadley Wickham’s Advanced R book.
Writing a Function for boot()
Now for boot(), we want to write a function that returns the statistic we’re interested in bootstrapping. The function you pass to boot() has two necessary arguments, which are (in order) the data and an index. The data argument comes first, so whatever you name it will represent the dataframe in the body of the function. The index argument comes second, so whatever you name it should be used to subset the data (boot() passes the row indices of each resample in through this argument). Basically, you want to set up the body so that it subsets the data using the index argument, computes the statistic of interest on the subsetted data, and returns that statistic to boot().
We could create a simple one that subsets our data, passes the subsetted data to the linear model we want to run, saves out the \(R^2\) from that model, and then returns it:
get_r2 <- function(d, i){ # calling data d; calling index i
d2 <- d[i,] # subset data using the i argument
m <- lm(happiness ~ extraversion + pos_express + soc_sup + optimism + exercise, data = d2) # run lm()
rsq <- summary(m)$r.squared # extract r^2
return(rsq) # return r^2
}
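Before handing this to boot(), it can be worth a quick sanity check (our addition): calling the function with the full set of row indices should reproduce the \(R^2\) from the original, un-resampled data.
get_r2(df, 1:nrow(df)) # should be about .547, the value boot() reports as t0 below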
Running boot()
Now, we have the function that boot() can use to bootstrap our model \(R^2\). Let’s go ahead and give it a try. We set (in order) the data, statistic, and R arguments:
set.seed(227) # don't forget to set.seed()
boot_r2 <- boot(data = df,
statistic = get_r2, # name of function, no ()
R = 1000)
Okay, now let’s take a look at what it contains:
str(boot_r2)
## List of 11
## $ t0 : num 0.547
## $ t : num [1:1000, 1] 0.612 0.5 0.572 0.542 0.609 ...
## $ R : num 1000
## $ data :'data.frame': 147 obs. of 6 variables:
## ..$ extraversion: int [1:147] 66 69 56 91 97 47 28 63 72 72 ...
## ..$ pos_express : int [1:147] 79 63 58 100 96 63 29 88 88 92 ...
## ..$ soc_sup : int [1:147] 72 94 89 94 100 56 69 64 83 81 ...
## ..$ optimism : int [1:147] 46 75 67 58 96 79 50 58 92 50 ...
## ..$ exercise : int [1:147] 3 5 4 5 5 4 3 4 3 5 ...
## ..$ happiness : int [1:147] 68 78 70 88 90 63 60 65 75 83 ...
## $ seed : int [1:626] 10403 624 -1466693457 -1951738076 1541348309 31988370 1791537387 1687984944 544847217 -451016578 ...
## $ statistic:function (d, i)
## ..- attr(*, "srcref")= 'srcref' int [1:8] 1 11 6 1 11 1 1 6
## .. ..- attr(*, "srcfile")=Classes 'srcfilecopy', 'srcfile' <environment: 0x00000000192e7788>
## $ sim : chr "ordinary"
## $ call : language boot(data = df, statistic = get_r2, R = 1000)
## $ stype : chr "i"
## $ strata : num [1:147] 1 1 1 1 1 1 1 1 1 1 ...
## $ weights : num [1:147] 0.0068 0.0068 0.0068 0.0068 0.0068 ...
## - attr(*, "class")= chr "boot"
## - attr(*, "boot_type")= chr "boot"
The most important object in the list returned by boot() is t, which is a matrix that contains our bootstrapped results. It has R rows, so in this case 1000: one estimate of \(R^2\) from each resample.
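For example, we can confirm its dimensions and peek at the first few estimates (a quick check we’ve added):
dim(boot_r2$t) # 1000 rows, 1 column: one R^2 estimate per resample
head(boot_r2$t) # the first few bootstrapped R^2 values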
boot() Results
We could get the mean of the bootstrapped estimates:
mean(boot_r2$t)
## [1] 0.5617347
And median:
median(boot_r2$t)
## [1] 0.5627374
And we can get the 95% CI by using boot.ci(), passing the results of boot() as the first argument and setting type = "perc":
CI <- boot.ci(boot_r2, type = "perc")
CI
## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 1000 bootstrap replicates
##
## CALL :
## boot.ci(boot.out = boot_r2, type = "perc")
##
## Intervals :
## Level Percentile
## 95% ( 0.4678, 0.6459 )
## Calculations and Intervals on Original Scale
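If you want the numeric bounds themselves (we use them for the plot below), they are stored in the percent element of the object boot.ci() returns; its 4th and 5th entries are the lower and upper limits:
CI$percent[4] # lower bound of the 95% percentile CI (0.4678 in our run)
CI$percent[5] # upper bound of the 95% percentile CI (0.6459 in our run)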
We can also generate plots to describe our bootstrapped distribution. One way is to produce a density plot. Let’s do that using ggplot.
The only sort of tricky part is that we have to turn the bootstrapped results into a dataframe or tibble first. I’ve done this below using the tibble() function from {tidyverse}.
tibble(rsquared = boot_r2$t) %>% # turn to tibble
ggplot(aes(x = rsquared)) + # put statistic on the x axis (rsquared here)
geom_density(fill = "blue") + # add a geom_density
theme_minimal() # set a theme if you're so inclined
Alternatively, we could produce a histogram. Let’s do that and also set up vertical lines for the 95% CI and Median values:
tibble(rsquared = boot_r2$t) %>% # turn to tibble
ggplot(aes(x = rsquared)) + # put stat on x axis
geom_histogram() + # add geom_histogram
geom_vline(aes(xintercept = CI$percent[4], # vertical line for 95% CI lower bound
color = "Lower 2.5%"), size = 2) + # Label it
geom_vline(aes(xintercept = median(boot_r2$t), # vertical line for median
color = "Median"), size = 2) + # label it
geom_vline(aes(xintercept = CI$percent[5], # vertical line for 95% CI upper bound
color = "Upper 2.5%"), size = 2) + # label it
theme_minimal() # set a theme if you like
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
That plot contains a lot of useful information - we can see the lower limit is around .47 and the upper limit is around .65, with the median value between .55 and .6. Plus, you get a sense of the shape of the distribution as well.
What we’ve done so far works pretty well, but there is some advantage to making a slightly more flexible function for more advanced or complicated cases. For example, let’s say we wanted to compare several models with different sets of variables in terms of \(R^2\). One straightforward way to implement this would be to add an additional argument that controls the formula passed to lm(). Let’s try it:
get_r2 <- function(d, i, formula){ # calling data d; calling index i; formula sets the model
d2 <- d[i,] # subset data using the i argument
m <- lm(formula, data = d2) # this changed; now it runs
# whatever model we setup
# in the formula argument
rsq <- summary(m)$r.squared # extract r^2
return(rsq) # return r^2
}
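Any extra named arguments we give boot() (like formula) are passed along to our statistic function, so we can also sanity-check the new version by hand with the full set of row indices and whatever formula we like (our addition):
get_r2(df, 1:nrow(df), formula = happiness ~ extraversion) # R^2 for a one-predictor model on the full data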
Next, we’ll run boot(), changing the formula to each of the models we’re interested in. FYI, you need to set the seed before each model if you want your results to be reproducible.
set.seed(671)
boot_r2_model1 <- boot(df, # data
get_r2, # function
formula = happiness ~ extraversion, # formula
R = 1000) # resamples
set.seed(028383)
boot_r2_model2 <- boot(df, # data
get_r2, # function
formula = happiness ~ extraversion + pos_express, # formula
R = 1000) # resamples
set.seed(5421)
boot_r2_model3 <- boot(df, # data
get_r2, # function
formula = happiness ~ extraversion + pos_express +
soc_sup, # formula
R = 1000) # resamples
set.seed(6777)
boot_r2_model4 <- boot(df, # data
get_r2, # function
formula = happiness ~ extraversion + pos_express +
soc_sup + optimism, # formula
R = 1000) # resamples
set.seed(9148)
boot_r2_model5 <- boot(df, # data
get_r2, # function
formula = happiness ~ extraversion + pos_express +
soc_sup + optimism + exercise, # formula
R = 1000) # resamples
Then we would want to pull out the bootstrapped estimates from each and label which model they came from. We’ll do that for each below, specifying which model it is by creating a variable called predictors with the predictors that were included in the relevant model.
r2_model1_results <- tibble(rsq = boot_r2_model1$t,
predictors = "Extr.")
r2_model2_results <- tibble(rsq = boot_r2_model2$t,
predictors = "Extr. + Pos. Expr.")
r2_model3_results <- tibble(rsq = boot_r2_model3$t,
predictors = "Extr. + Pos. Expr. + Soc. Supp.")
r2_model4_results <- tibble(rsq = boot_r2_model4$t,
predictors = "Extr. + Pos. Expr. + Soc. Supp. + Opt.")
r2_model5_results <- tibble(rsq = boot_r2_model5$t,
predictors = "Extr. + Pos. Expr. + Soc. Supp. + Opt. + Exercise")
Then, we could put the dataframes together, and calculate medians and CIs for each of the models:
r2_all_models <- r2_model1_results %>% # row bind each df
rbind(r2_model2_results) %>%
rbind(r2_model3_results) %>%
rbind(r2_model4_results) %>%
rbind(r2_model5_results) %>%
group_by(predictors) %>% # group by predictors since we want a median and CI per model
mutate(median = median(rsq),
lower = quantile(rsq, probs = .025),
upper = quantile(rsq, probs = .975))
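If you want to look at those summary values directly before plotting, one option (not part of the original lab) is to collapse the data down to one row per model:
r2_all_models %>%
  distinct(predictors, median, lower, upper) # one row per model: median and 95% CI bounds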
Finally, we could display the \(R^2\) and bootstrapped CIs for each model to compare them.
ggplot(r2_all_models,
aes(x = predictors, y = rsq)) + # predictors on the x, rsq on the y
geom_pointrange(aes(y = median, ymin = lower, ymax = upper)) + # geom_pointrange: point at the median with 95% CI bounds
geom_hline(yintercept = 0, color = "red") + # set a red line at 0
coord_flip() + # this flips the coordinates, which is useful when you have longer x-axis labels
labs(x = NULL, # remove x axis label
y = expression(paste(R^{2}, " (with bootstrapped 95% CIs)"))) + # label Y
theme_minimal()
We could instead show their full distributions using geom_density, and color (fill) the density geoms according to which model they correspond to. Let’s take a look at what that looks like:
ggplot(r2_all_models,
aes(x = rsq, fill = predictors)) + # rsq on x; set fill to predictors
geom_density(alpha = .2) + # set alpha to .2 to make overlap easier to see
labs(x = expression(paste("Bootstrapped estimates of models' ", R^{2}))) + # set the xlabel
theme(legend.position="bottom") +
guides(fill = guide_legend(nrow = 3))
This is a nice visualization because it makes it clearer that adding optimism and then adding exercise seem to have the biggest impact on \(R^2\), with the other three models showing a little more overlap.
1. Using the same dataset we used here, calculate bootstrapped estimates of adjusted \(R^2\) for the same 5 models we ran above (see the hint after this list).
2. Get bootstrapped estimates of correlations between:
3. Then, plot these alongside one another using one of the methods covered today.
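A hint for the first two minihacks (our addition, just a pointer): adjusted \(R^2\) can be pulled from a fitted model in much the same way we pulled out \(R^2\), and correlations can be computed with cor(); your statistic functions just need to return those values.
m <- lm(happiness ~ extraversion, data = df) # an example fitted model
summary(m)$adj.r.squared # adjusted R^2 lives here (vs. $r.squared)
cor(df$happiness, df$extraversion, use = "pairwise.complete.obs") # a correlation with cor(), handling any NAs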