You can download the Rmd file here.

Background

Statistical models can be built to serve different research purposes, such as explanation or prediction.

  • When one’s purpose is explanation, the researcher writes a statistical model that tests a specific, literature-informed theory (the model may even have been pre-registered).
    • Typically, the researcher wants the model to have interpretable parameter estimates that describe the nature of the relationship between each predictor and the outcome variable.
  • When one’s purpose is prediction, the researcher wants to build a statistical model that predicts new data as accurately as possible.
    • The model does not necessarily need to produce interpretable parameter estimates, as long as it does a good job of predicting new values when applied to new data.

You could also be interested in achieving a mix of these two goals.

Machine Learning

Machine learning encompasses a set of techniques developed to build good predictive models; that is, statistical models that accurately predict new, not-yet-seen data points.

Common Machine Learning Algorithms

There are many algorithms that can be used to build a machine learning model. You can predict continuous or categorical outcomes - just make sure you choose an algorithm that’s appropriate for that type of DV!

  • Linear regression, lm()

  • Logistic regression, glm() with family = binomial (caret also offers boosted logistic regression via LogitBoost())

  • Support vector machines, svm() or svmLinear()

  • Random forests, randomForest()

And there are hundreds more… for a full list, run: names(getModelInfo())
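For instance, you can count the available algorithms and peek at the first few method names (a quick sketch using {caret}):

library(caret)

length(names(getModelInfo())) # how many algorithms caret currently knows about
head(names(getModelInfo()))   # the first few method names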

As the algorithm you use becomes increasingly complex, its parameter estimates tend to become less and less interpretable.

Overfitting

A very important consideration when building predictive models is overfitting: fitting a model so closely to the idiosyncratic variability (noise) of a specific data set that it has poor predictive power when applied to a new data set.

When building a statistical model, you want to strike a balance between fitting the data well and not capturing so much sample-specific noise that the model fails to generalize.
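To make overfitting concrete, here is a minimal sketch using simulated data (not the workshop data set): a 10th-degree polynomial fits the training sample better than a straight line, but it usually predicts a new sample worse.

set.seed(1)
train <- data.frame(x = runif(20))
train$y <- train$x + rnorm(20, sd = 0.3) # true trend is linear
test <- data.frame(x = runif(20))
test$y <- test$x + rnorm(20, sd = 0.3)

simple  <- lm(y ~ x, data = train)           # matches the true trend
complex <- lm(y ~ poly(x, 10), data = train) # chases noise in the sample

# The complex model "wins" on the training data...
c(simple = summary(simple)$r.squared, complex = summary(complex)$r.squared)

# ...but typically "loses" on new data (larger RMSE = worse predictions)
sqrt(mean((predict(simple, test) - test$y)^2))
sqrt(mean((predict(complex, test) - test$y)^2))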

Training Set vs Testing Set

One of the main innovations of machine learning is that it requires you to build your statistical model using a set of data separate from the one used to test the model’s accuracy.

The original data is split into a training set that is used to build and optimize the statistical model, and a testing set that is used to test the resulting model’s predictive accuracy (i.e., ability to accurately predict new, not-yet-seen data).

You typically want more data in the training set, since this data is used to build your model. A common split is 70% training and 30% testing, but you can adjust these proportions.
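Conceptually, a 70/30 split is just random row sampling. Here is a minimal base-R sketch (mydata is a hypothetical placeholder; below we use caret's helper function instead):

n <- nrow(mydata) # mydata is a hypothetical data frame
train_rows <- sample(seq_len(n), size = floor(0.7 * n))
training.set <- mydata[train_rows, ]  # 70% of rows
testing.set  <- mydata[-train_rows, ] # the remaining 30%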

Cross-Validation

Recall that the statistical model is built using only the data in the training set. To avoid overfitting, we do not want the model we build to fit too specifically to the unique error in the training set.

Cross-validation is a technique for repeatedly splitting the training set into mini training and testing sets. The model is fit to each mini training set, its predictive accuracy is measured on the corresponding mini testing set, and the results are averaged across these fits.

Here are a couple of cross-validation approaches:

  • leave-one-out cross-validation: Using the training set, fit the statistical model to n-1 observations. Then, test the predictive accuracy of the model on the single observation that’s been left out. Repeat this n times.

  • k-fold cross-validation: Randomly split the training set into k chunks (aka, folds) of roughly equal size. One of the chunks is treated as the testing set; the remaining k-1 chunks are treated as the training set for building the model, and the predictive accuracy of the model is examined using the testing set. This process is repeated k times, with a different chunk treated as the testing set each time. Typically, people use values of k = 5 or k = 10 (but it can be any value up to n).
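In {caret}, both approaches are specified through trainControl(), whose output gets passed to the train() function used below. A minimal sketch:

library(caret)

loocv_control <- trainControl(method = "LOOCV")           # leave-one-out CV
kfold_control <- trainControl(method = "cv", number = 10) # 10-fold CV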

Tuning

In machine learning, there are two types of parameters: model parameters and hyperparameters.

  • Model parameters are estimated from the data (e.g., model coefficients in linear regression)

  • Hyperparameters are values specific to some algorithms that can be “tuned” by the researcher. Different algorithms have different hyperparameters. You won’t know the best values of hyperparameters prior to training a model. You have to rely on rules of thumb or try to find the best values through trial and error. This is the tuning process.
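For example (previewing the random forest model we build later), mtry, the number of predictors randomly sampled at each split, is a hyperparameter; we supply candidate values ourselves rather than estimating them from the data:

# Candidate mtry values; train() will evaluate each one during tuning
rf_grid <- expand.grid(mtry = c(2, 3, 4))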

Example: Steps of Machine Learning

We will focus on examples that use a continuous outcome variable (but you can certainly perform machine learning with categorical outcomes, too; just make sure you use an appropriate algorithm!).

The data set we will use for this example is called Prestige from the car package. This data set contains various features (aka, variables) across a variety of occupations.

library(car)   # contains the Prestige data set
library(caret) # machine learning interface used throughout

head(Prestige) 
str(Prestige)
## 'data.frame':    102 obs. of  6 variables:
##  $ education: num  13.1 12.3 12.8 11.4 14.6 ...
##  $ income   : int  12351 25879 9271 8865 8403 11030 8258 14163 11377 11023 ...
##  $ women    : num  11.16 4.02 15.7 9.11 11.68 ...
##  $ prestige : num  68.8 69.1 63.4 56.8 73.5 77.6 72.6 78.1 73.1 68.8 ...
##  $ census   : int  1113 1130 1171 1175 2111 2113 2133 2141 2143 2153 ...
##  $ type     : Factor w/ 3 levels "bc","prof","wc": 2 2 2 2 2 2 2 2 2 2 ...
# see a description of the variables
# ?Prestige

# Not ideal way of handling missing data, but using for sake of today's example
Prestige <- na.omit(Prestige)

Let’s see if we can build a model that does well at predicting income from the rest of the variables in the data set.

Data Splitting

First, split the original data into a training set and a testing set. Remember, the statistical model is built using only the training set.

createDataPartition is used to split the original data into a training set and a testing set. The inputs include:

  • y = the outcome variable

  • times = the number of times you want to split the data

  • p = the percentage of the data that goes into the training set

  • list = FALSE returns the result as a matrix of row numbers (rather than a list); you can use these row numbers to index the original data and split it

set.seed(50) 

# Split the original data set into a training set and a testing set
partition_data <- createDataPartition(Prestige$income, times = 1, p = .7, list = FALSE)

training.set <- Prestige[partition_data, ] # Training set
testing.set <- Prestige[-partition_data, ] # Testing set

Building Model using Training Set

caret is an R package that provides a consistent, easy-to-use interface to the many machine learning algorithms available in R.

The train function is used for model training (aka, building a model). It uses the following inputs:

  • y = the outcome variable, specified as a model formula
    • y ~ . means predict y from all other variables in the data set
  • method = the machine learning algorithm you want to use
  • trControl = the cross-validation method
  • tuneGrid = a data frame of the hyperparameters that you want to be evaluated during model training
  • preProc = any pre-processing adjustments that you want done on the predictor data

Let’s build the model using an algorithm we’re familiar with: linear regression.

set.seed(21)

ml_linear_model <- train(income ~. , 
                         data = training.set, 
                         method = "lm", 
                         trControl = trainControl(method = "cv", number = 5), # k-fold CV with k = 5 
                         #tuneGrid = , # left empty because there are no hyperparameters to tune in this case
                         preProc = c("center")) 


ml_linear_model
## Linear Regression 
## 
## 70 samples
##  5 predictor
## 
## Pre-processing: centered (6) 
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 58, 55, 57, 54, 56 
## Resampling results:
## 
##   RMSE      Rsquared   MAE     
##   2475.216  0.6444453  1616.523
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE
ml_linear_model$results

How accurately does this model predict incomes in the testing set?

linear_predicted <- predict(ml_linear_model, testing.set) 

# Overall accuracy assessment
postResample(linear_predicted, testing.set$income)
##         RMSE     Rsquared          MAE 
## 2671.2692906    0.6826051 1398.0471955

The model we built accounts for 68% of the variability in incomes in the testing set.
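A note on that value: as far as I understand caret's postResample(), Rsquared is computed as the squared correlation between the predicted and observed values, so you can reproduce it by hand:

cor(linear_predicted, testing.set$income)^2 # should match the Rsquared above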

Comparing Models

One of the advantages of this approach is you can easily run and compare models fit using different types of algorithms. Another commonly used machine learning model is called a Support Vector Machine (SVM).

Support Vector Machine

Unlike linear regression, SVMs do have a hyperparameter that can be tuned. You can see what it’s called by passing the model algorithm, called svmLinear, to the getModelInfo function.

getModelInfo("svmLinear")$svmLinear$parameters 

Then, we can choose “tuning” values to try out for this hyperparameter. expand.grid() is the function that allows you to specify tuning values.

tuning_values <- expand.grid(C = c(0.001, 0.01, 0.1, 1, 10, 100)) # C is the cost hyperparameter

Now, let’s build the model using the training set.

set.seed(42)

ml_svm_model <- train(income ~. , 
                      data = training.set, 
                      method = "svmLinear", 
                      trControl = trainControl(method="cv", number = 5), 
                      tuneGrid = tuning_values, 
                      preProc = c("center"))

ml_svm_model
## Support Vector Machines with Linear Kernel 
## 
## 70 samples
##  5 predictor
## 
## Pre-processing: centered (6) 
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 56, 56, 56, 56, 56 
## Resampling results across tuning parameters:
## 
##   C      RMSE      Rsquared   MAE     
##   1e-03  3514.928  0.6228650  2478.984
##   1e-02  2624.018  0.6890410  1614.340
##   1e-01  2324.423  0.6988928  1403.360
##   1e+00  2411.839  0.6679553  1483.526
##   1e+01  2425.302  0.6610138  1507.840
##   1e+02  2430.745  0.6596352  1513.087
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was C = 0.1.
ml_svm_model$results

How accurately does this model predict incomes in the testing set?

svm_predicted <- predict(ml_svm_model, testing.set) 

# Overall accuracy assessment
postResample(svm_predicted, testing.set$income)
##        RMSE    Rsquared         MAE 
## 2876.800584    0.708867 1452.412219

The model we built accounts for 71% of the variability in incomes in the testing set.

Random Forest

Let’s try one more algorithm called Random Forest.

Find the hyperparameters that can be tuned.

getModelInfo("rf")$rf$parameters
tuning_values <- expand.grid(mtry = c(2, 3, 4, 5)) # number of predictors sampled at each split

Build the model using the training set.

set.seed(47)

ml_rf_model <- train(income ~. , 
                  data = training.set, 
                  method = "rf", 
                  trControl = trainControl(method="cv", number = 5), 
                  tuneGrid = tuning_values,
                  preProc = c("center"))

ml_rf_model
## Random Forest 
## 
## 70 samples
##  5 predictor
## 
## Pre-processing: centered (6) 
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 55, 56, 57, 57, 55 
## Resampling results across tuning parameters:
## 
##   mtry  RMSE      Rsquared   MAE     
##   2     2535.659  0.6308492  1614.988
##   3     2600.559  0.6211255  1628.985
##   4     2593.525  0.6269852  1614.363
##   5     2651.750  0.6123459  1620.062
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 2.
ml_rf_model$results

How accurately does this model predict incomes in the testing set?

# Testing predictive ability of model in testing set
rf_predicted <- predict(ml_rf_model, testing.set)

postResample(rf_predicted, testing.set$income) 
##         RMSE     Rsquared          MAE 
## 2539.1096777    0.7673937 1400.0285682

The model we built accounts for 77% of the variability in incomes in the testing set.

Which model did the best?

# Collect testing-set performance for each model
linear_perf <- postResample(linear_predicted, testing.set$income)
svm_perf    <- postResample(svm_predicted, testing.set$income)
rf_perf     <- postResample(rf_predicted, testing.set$income)

# Note: building the data frame directly (rather than through cbind())
# keeps the performance columns numeric instead of coercing them to text
compare_models <- data.frame(
  Algorithm    = c("Linear Regression", "Support Vector Machine", "Random Forest"),
  RMSE_Testing = c(linear_perf["RMSE"], svm_perf["RMSE"], rf_perf["RMSE"]),
  RSq_Testing  = c(linear_perf["Rsquared"], svm_perf["Rsquared"], rf_perf["Rsquared"])
)

compare_models
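To pull out the winner programmatically, index on the smallest testing-set RMSE:

# Algorithm with the best (smallest) RMSE in the testing set
compare_models[which.min(compare_models$RMSE_Testing), ]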

Papaja

papaja is a package that allows you to produce APA-formatted reports, tables, and graphs directly from your .Rmd file.

We need to install a version of LaTeX that works with R (I chose tinytex), along with papaja and citr.

# tinytex::install_tinytex()
library(tinytex) # LaTeX

# install.packages("papaja")
library(papaja) # for producing APA-style output
## Loading required package: tinylabels
# install.packages("citr")
library(citr) # for citing works in-text

Inserting Object Values in Text

To insert a value stored within an object into text in Rmarkdown, you can write it as “r name_of_object” (replacing the quotes with back ticks; i.e., `).

For example, if you stored the value 5 in an object called x you would write “r x” (replacing the quotes with back ticks) to have the number 5 inserted in the text when you knit the Rmarkdown document.

M <- 2.37
SD <- 1.22

The mean of the sample was 2.37 and the standard deviation was 1.22.

Tables

Descriptive Statistics

First, let’s get descriptive statistics for a set of variables. We can go ahead and keep using the Prestige data set.

library(psych) # for describe()
library(dplyr) # for select() and the pipe

descriptives <- describe(Prestige) %>%
  select(n, mean, sd) 

colnames(descriptives) <- c("N", "Mean", "SD")

descriptives # raw table

Next, we format the numerical columns using the printnum() function from {papaja}. In essence, this rounds everything to 2 decimal places (you can change this), pads with zeroes when necessary, and converts the values to strings so they print correctly. The mutate_if(is.numeric, ...) call below applies this formatting to the numeric columns only.

descriptives <- descriptives %>%
  mutate_if(is.numeric, printnum)

To convert this table into APA-style, we can use the apa_table() function. Make sure in the code chunk options you set results = "asis" for it to render correctly. We’re also going to give the code chunk a label.
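In the .Rmd source, that chunk header looks something like {r descrips-tab, results = "asis"} between the usual chunk back ticks (the label descrips-tab is an illustrative choice):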

apa_table(descriptives, 
          caption = "Descriptive Statistics", 
          note    = "This table was created with apa_table().")
Descriptive Statistics

            N      Mean      SD
education   98.00  10.80     2.75
income      98.00  6,938.86  4,228.13
women       98.00  28.99     31.38
prestige    98.00  47.33     17.09
census      98.00  5,400.06  2,684.30
type*       98.00  1.79      0.80

Note. This table was created with apa_table().


Regression Results

As an example, let’s run a regression predicting income from the other variables in the Prestige data set.

head(Prestige)
regr_model <- lm(income ~ ., data = Prestige)
summary(regr_model)
## 
## Call:
## lm(formula = income ~ ., data = Prestige)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7752.4  -954.6  -331.2   742.6 14301.3 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    7.32053 3037.27048   0.002  0.99808    
## education    131.18372  288.74961   0.454  0.65068    
## women        -53.23480    9.83107  -5.415 4.96e-07 ***
## prestige     139.20912   36.40239   3.824  0.00024 ***
## census         0.04209    0.23568   0.179  0.85865    
## typeprof     509.15150 1798.87914   0.283  0.77779    
## typewc       347.99010 1173.89384   0.296  0.76757    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2633 on 91 degrees of freedom
## Multiple R-squared:  0.6363, Adjusted R-squared:  0.6123 
## F-statistic: 26.54 on 6 and 91 DF,  p-value: < 2.2e-16

The apa_print() function can be used to convert results from a statistical analysis into an APA-results object that can then be passed to apa_table().

apa_regr_model <- apa_print(regr_model)

The table is stored in the $table component of this object. Pass it to the apa_table() function. Make sure in the code chunk options you set results = "asis" for it to render correctly. We’re also going to give the code chunk a label:

apa_regr_model$table %>% 
  apa_table(caption = "Regression Model Predicting Income from Education, Women, Prestige, Census and Type",
            note    = "* p < 0.05; ** p < 0.01; *** p < 0.001")
Regression Model Predicting Income from Education, Women, Prestige, Census and Type

Predictor   b        95% CI                  t      df  p
Intercept   7.32     [-6,025.84, 6,040.49]   0.00   91  .998
Education   131.18   [-442.38, 704.75]       0.45   91  .651
Women       -53.23   [-72.76, -33.71]        -5.41  91  < .001
Prestige    139.21   [66.90, 211.52]         3.82   91  < .001
Census      0.04     [-0.43, 0.51]           0.18   91  .859
Typeprof    509.15   [-3,064.10, 4,082.40]   0.28   91  .778
Typewc      347.99   [-1,983.81, 2,679.79]   0.30   91  .768

Note. * p < 0.05; ** p < 0.01; *** p < 0.001


Figures

You can convert ggplot figures into APA-style by adding theme_apa().

library(ggplot2) # for ggplot(); theme_apa() comes from papaja

ggplot(Prestige, aes(x = prestige, y = income)) +
  geom_point(alpha   = .5) +
  geom_smooth(method = "lm") +
  labs(x = "Prestige",
       y = "Income") +
  theme_apa()
## `geom_smooth()` using formula = 'y ~ x'

In-text summaries of tests

One of the most useful aspects of writing manuscripts in .Rmd and {papaja} is that we can report our results directly from the statistical models we run in R, reducing the likelihood of copy-and-pasting errors.

We can use apa_print() from {papaja} to report the results of our statistical tests in text. Abbreviated results can be called with $estimate from the apa_print() object. For example, apa_regr_model$estimate$prestige will insert a nicely formatted report of the slope for prestige on income.
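In the .Rmd source, the sentence below is written with inline code, something like this (a sketch; the surrounding wording is illustrative):

Prestige was a significant, positive predictor of income while controlling for education, percentage of women, census, and position type, `r apa_regr_model$estimate$prestige`.

which renders as: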

Prestige was a significant, positive predictor of income while controlling for education, percentage of women, census, and position type, \(b = 139.21\), 95% CI \([66.90, 211.52]\).

We can instead get the full result (including t, df, and p) by calling the $full_result object, so apa_regr_model$full_result$prestige will be rendered like so:

Prestige was a significant, positive predictor of income while controlling for education, percentage of women, census, and position type, \(b = 139.21\), 95% CI \([66.90, 211.52]\), \(t(91) = 3.82\), \(p < .001\).

You can also extract the R-squared value and its corresponding F-test:
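In papaja, these model-fit statistics live under $full_result$modelfit (to the best of my knowledge of current versions); inline code like `r apa_regr_model$full_result$modelfit$r2` produces: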

The model accounted for a significant amount of variability in people’s income, (\(R^2 = .64\), 90% CI \([0.50, 0.71]\), \(F(6, 91) = 26.54\), \(p < .001\)).