You can download the Rmd file here.
Statistical models can be built to serve different research purposes, such as explanation or prediction.
You could also be interested in achieving a mix of these two goals.
Machine learning encompasses a set of techniques that have been developed to build good predictive models. That is, the goal of machine learning is to build a statistical model that does the best at predicting new, not-yet-seen data points.
There are many algorithms that can be used to build a machine learning model. You can predict continuous or categorical outcomes - just make sure you choose an algorithm that’s appropriate for that type of DV!
Linear regression, lm()
Logistic regression, LogitBoost()
Support vector machines, svm() or svmLinear()
Random forests, randomForest()
And there are hundreds more… for a full list: names(getModelInfo())
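If you want to poke around that list yourself, here is a minimal sketch (assuming the caret package is installed and loaded):

library(caret)

# All algorithm names that train() understands
all_models <- names(getModelInfo())
length(all_models)   # there are several hundred
head(all_models, 10) # peek at the first few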
As the algorithm you use becomes increasingly complex, its parameter estimates tend to become less and less interpretable, were you to examine them.
A very important consideration in the building of predictive models is overfitting. Overfitting is the problem of fitting a model too well to the variability of a specific data set. This could result in the model having poor predictive power when applied to a new data set.
When building a statistical model, you want to strike a balance: the model should fit the data well, but not absorb so much of the data set's idiosyncratic error that you overfit it.
One of the main innovations of machine learning is that it requires you to build your statistical model using a separate set of data than the one used to test the accuracy of your model.
The original data is split into a training set that is used to build and optimize the statistical model, and a testing set that is used to test the resulting model’s predictive accuracy (i.e., ability to accurately predict new, not-yet-seen data).
You typically want more data in the training set, since this data is used to build your model. A common split is 70% training and 30% testing, but you can adjust these proportions.
Recall that the statistical model is built using only the data in the training set. To avoid overfitting, we do not want the model we build to fit too specifically to the unique error in the training set.
Cross-validation is a technique for splitting the training set multiple times into mini training-testing sets. The model is fit using each of the mini training sets and then its predictive accuracy is measured in each of the mini testing sets. The results are averaged across each of these models.
Here are a couple of cross-validation approaches:
leave-one-out cross-validation: Using the training set, fit the statistical model to n-1 observations. Then, test the predictive accuracy of the model on the single observation that’s been left out. Repeat this n times.
k-fold cross-validation: Randomly split the training set into k chunks (aka, folds) of roughly equal size. One chunk is treated as the testing set, the remaining k-1 chunks are treated as the training set for building the model, and the model's predictive accuracy is examined using the held-out chunk. This process is repeated k times, with a different chunk treated as the testing set each time. Typically, people use k = 5 or k = 10 (but it can be any value up to n).
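In caret (which we use below), both approaches are requested through the trainControl() function, whose output is later passed to train(). A minimal sketch, with k = 10 as an illustrative choice:

library(caret)

# k-fold cross-validation with k = 10
cv_kfold <- trainControl(method = "cv", number = 10)

# leave-one-out cross-validation
cv_loocv <- trainControl(method = "LOOCV")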
In machine learning, there are two types of parameters: model parameters and hyperparameters.
Model parameters are estimated from the data (e.g., model coefficients in linear regression)
Hyperparameters are values specific to some algorithms that can be “tuned” by the researcher. Different algorithms have different hyperparameters. You won’t know the best values of hyperparameters prior to training a model. You have to rely on rules of thumb or try to find the best values through trial and error. This is the tuning process.
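caret's modelLookup() function lists the tunable hyperparameters for a given algorithm, which is a convenient starting point for tuning. A small sketch using the algorithms mentioned above:

library(caret)

modelLookup("lm")        # linear regression: only an 'intercept' placeholder, nothing to tune
modelLookup("svmLinear") # linear SVM: the cost parameter C
modelLookup("rf")        # random forest: mtry, the number of predictors tried at each split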
We will focus on examples that use a continuous outcome variable (but you can certainly perform machine learning for categorical outcomes, too. Just make sure you use an appropriate algorithm!).
The data set we will use for this example is called Prestige from the car package. This data set contains various features (aka, variables) across a variety of occupations.
head(Prestige)
str(Prestige)
## 'data.frame': 102 obs. of 6 variables:
## $ education: num 13.1 12.3 12.8 11.4 14.6 ...
## $ income : int 12351 25879 9271 8865 8403 11030 8258 14163 11377 11023 ...
## $ women : num 11.16 4.02 15.7 9.11 11.68 ...
## $ prestige : num 68.8 69.1 63.4 56.8 73.5 77.6 72.6 78.1 73.1 68.8 ...
## $ census : int 1113 1130 1171 1175 2111 2113 2133 2141 2143 2153 ...
## $ type : Factor w/ 3 levels "bc","prof","wc": 2 2 2 2 2 2 2 2 2 2 ...
# see a description of the variables
# ?Prestige
# Not ideal way of handling missing data, but using for sake of today's example
Prestige <- na.omit(Prestige)
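Before dropping incomplete rows (as we just did), it can be worth checking how much missing data there actually is. A quick base R sketch, not part of the original example; note it would need to run before the na.omit() call above:

# How many rows contain at least one missing value?
sum(!complete.cases(Prestige))

# Which variables do the missing values come from?
colSums(is.na(Prestige))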
Let’s see if we can build a model that does well at predicting income from the rest of the variables in the data set.
First, split the original data into a training set and a testing set. Remember, the statistical model is built using only the training set.
createDataPartition() is used to split the original data into a training set and a testing set. Its inputs include:
y = the outcome variable
times = the number of times you want to split the data
p = the proportion of the data that goes into the training set
list = FALSE returns the result as a matrix of row numbers for the partition, which you can use to index the original data and split it
set.seed(50)
# Split the original data set into a training set and a testing set
partition_data <- createDataPartition(Prestige$income, times = 1, p = .7, list = FALSE)
training.set <- Prestige[partition_data, ] # Training set
testing.set <- Prestige[-partition_data, ] # Testing set
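As a quick sanity check (not part of the original code), you can confirm that roughly 70% of the observations landed in the training set:

nrow(training.set)                  # about 70% of the rows
nrow(testing.set)                   # the remaining ~30%
nrow(training.set) / nrow(Prestige) # proportion in the training set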
caret is an R package that consolidates the many different machine learning algorithms into an easy-to-use interface.

The train() function is used for model training (aka, building a model). As you can see in the code below, its inputs include:
a model formula (e.g., income ~ ., meaning predict income from all other variables)
data = the training set
method = the algorithm to use (e.g., "lm")
trControl = the resampling/cross-validation settings, created with trainControl()
tuneGrid = the hyperparameter values to try (if the algorithm has any)
preProc = any pre-processing steps (e.g., centering the predictors)
Let's build the model using an algorithm we're already familiar with: linear regression.
set.seed(21)
ml_linear_model <- train(income ~. ,
data = training.set,
method = "lm",
trControl = trainControl(method="cv", number = 5), # k-folds CV with k=5
#tuneGrid = , # leaving empty because there are no hyperparameters to tune in this case
preProc = c("center"))
ml_linear_model
## Linear Regression
##
## 70 samples
## 5 predictor
##
## Pre-processing: centered (6)
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 58, 55, 57, 54, 56
## Resampling results:
##
## RMSE Rsquared MAE
## 2475.216 0.6444453 1616.523
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
ml_linear_model$results
How accurately does this model predict incomes in the testing set?
linear_predicted <- predict(ml_linear_model, testing.set)
# Overall accuracy assessment
postResample(linear_predicted, testing.set$income)
## RMSE Rsquared MAE
## 2671.2692906 0.6826051 1398.0471955
The model we built accounts for 68% of the variability in incomes in the testing set.
One of the advantages of this approach is you can easily run and compare models fit using different types of algorithms. Another commonly used machine learning model is called a Support Vector Machine (SVM).
Unlike linear regression, SVMs do have a hyperparameter that can be tuned. You can see what it's called by passing the algorithm's name, svmLinear, to the getModelInfo() function.
getModelInfo("svmLinear")$svmLinear$parameters
Then, we can choose "tuning" values to try out for this hyperparameter. expand.grid() is the function that allows you to specify the tuning values.
tuning_values <- expand.grid(C = c(0.001, 0.01, 0.1, 1, 10, 100))
Now, let’s build the model using the training set.
set.seed(42)
ml_svm_model <- train(income ~. ,
data = training.set,
method = "svmLinear",
trControl = trainControl(method="cv", number = 5),
tuneGrid = tuning_values,
preProc = c("center"))
ml_svm_model
## Support Vector Machines with Linear Kernel
##
## 70 samples
## 5 predictor
##
## Pre-processing: centered (6)
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 56, 56, 56, 56, 56
## Resampling results across tuning parameters:
##
## C RMSE Rsquared MAE
## 1e-03 3514.928 0.6228650 2478.984
## 1e-02 2624.018 0.6890410 1614.340
## 1e-01 2324.423 0.6988928 1403.360
## 1e+00 2411.839 0.6679553 1483.526
## 1e+01 2425.302 0.6610138 1507.840
## 1e+02 2430.745 0.6596352 1513.087
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was C = 0.1.
ml_svm_model$results
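As an optional aside, calling plot() on a caret train object shows the resampled performance (here, RMSE) across the tuning values we supplied, which makes the choice of C = 0.1 easy to see:

# Visualize cross-validated RMSE across the candidate values of C
plot(ml_svm_model)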
How accurately does this model predict incomes in the testing set?
svm_predicted <- predict(ml_svm_model, testing.set)
# Overall accuracy assessment
postResample(svm_predicted, testing.set$income)
## RMSE Rsquared MAE
## 2876.800584 0.708867 1452.412219
The model we built accounts for 71% of the variability in incomes in the testing set.
Let’s try one more algorithm called Random Forest.
Find the hyperparameters that can be tuned.
getModelInfo("rf")$rf$parameters
tuning_values <- expand.grid(mtry = c(2, 3, 4, 5))
Build the model using the training set.
set.seed(47)
ml_rf_model <- train(income ~. ,
data = training.set,
method = "rf",
trControl = trainControl(method="cv", number = 5),
tuneGrid = tuning_values,
preProc = c("center"))
ml_rf_model
## Random Forest
##
## 70 samples
## 5 predictor
##
## Pre-processing: centered (6)
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 55, 56, 57, 57, 55
## Resampling results across tuning parameters:
##
## mtry RMSE Rsquared MAE
## 2 2535.659 0.6308492 1614.988
## 3 2600.559 0.6211255 1628.985
## 4 2593.525 0.6269852 1614.363
## 5 2651.750 0.6123459 1620.062
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 2.
ml_rf_model$results
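As another optional aside (not part of the original walkthrough), caret's varImp() function reports which predictors the random forest relied on most heavily:

# Variable importance for the fitted random forest
varImp(ml_rf_model)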
How accurately does this model predict incomes in the testing set?
# Testing predictive ability of model in testing set
rf_predicted <- predict(ml_rf_model, testing.set)
postResample(rf_predicted, testing.set$income)
## RMSE Rsquared MAE
## 2539.1096777 0.7673937 1400.0285682
The model we built accounts for 77% of the variability in incomes in the testing set.
RMSE_Testing <- c(postResample(linear_predicted, testing.set$income)[1],
                  postResample(svm_predicted, testing.set$income)[1],
                  postResample(rf_predicted, testing.set$income)[1])
RSq_Testing <- c(postResample(linear_predicted, testing.set$income)[2],
                 postResample(svm_predicted, testing.set$income)[2],
                 postResample(rf_predicted, testing.set$income)[2])
Algorithm <- c("Linear Regression", "Support Vector Machine", "Random Forest")

# Keep the numeric columns numeric (wrapping cbind() in data.frame() would coerce them to character)
compare_models <- data.frame(Algorithm, RMSE_Testing, RSq_Testing)
compare_models
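If you'd rather have R flag the winner for you, here is a small sketch using the vectors created above:

# Algorithm with the lowest RMSE in the testing set
Algorithm[which.min(RMSE_Testing)]

# Algorithm with the highest R-squared in the testing set
Algorithm[which.max(RSq_Testing)]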
papaja is a package that allows you to produce APA-formatted reporting, tables, and graphs directly from your .Rmd file.

We need to install a version of LaTeX that works with R (I chose tinytex), along with papaja and citr.
# tinytex::install_tinytex()
library(tinytex) # LaTeX
# install.packages("papaja")
library(papaja) # for producing APA-style output
## Loading required package: tinylabels
# install.packages("citr")
library(citr) # for citing works in-text
To insert a value stored within an object into text in R Markdown, you can write it as "r name_of_object" (replacing the quotes with back ticks; i.e., `).

For example, if you stored the value 5 in an object called x, you would write "r x" (again with back ticks instead of quotes) to have the number 5 inserted into the text when you knit the R Markdown document.
M <- 2.37
SD <- 1.22
The mean of the sample was 2.37 and the standard deviation was 1.22.
First, let's get descriptive statistics for a set of variables. We can go ahead and keep using the Prestige data set.
# describe() comes from the psych package; %>% and select() are from the tidyverse
descriptives <- describe(Prestige) %>%
  select(n, mean, sd)

colnames(descriptives) <- c("N", "Mean", "SD")
descriptives # raw table
Next, we format the numerical columns using the printnum() function from {papaja}. In essence, this rounds everything to 2 decimal places (you can change this), pads with zeroes when necessary, and turns the values into strings so they print correctly. Applying it with mutate_if(is.numeric, ...) leaves the non-numeric columns untouched.
descriptives <- descriptives %>%
mutate_if(is.numeric, printnum)
To convert this table into APA style, we can use the apa_table() function. Make sure you set results = "asis" in the code chunk options for it to render correctly. We're also going to give the code chunk a label:
apa_table(descriptives,
caption = "Descriptive Statistics",
note = "This table was created with apa_table().")
|           | N     | Mean     | SD       |
|-----------|-------|----------|----------|
| education | 98.00 | 10.80    | 2.75     |
| income    | 98.00 | 6,938.86 | 4,228.13 |
| women     | 98.00 | 28.99    | 31.38    |
| prestige  | 98.00 | 47.33    | 17.09    |
| census    | 98.00 | 5,400.06 | 2,684.30 |
| type*     | 98.00 | 1.79     | 0.80     |

Note. This table was created with apa_table().
As an example, let's run a regression predicting income from the other variables in the Prestige data set.
head(Prestige)
regr_model <- lm(income ~ ., data = Prestige)
summary(regr_model)
##
## Call:
## lm(formula = income ~ ., data = Prestige)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7752.4 -954.6 -331.2 742.6 14301.3
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.32053 3037.27048 0.002 0.99808
## education 131.18372 288.74961 0.454 0.65068
## women -53.23480 9.83107 -5.415 4.96e-07 ***
## prestige 139.20912 36.40239 3.824 0.00024 ***
## census 0.04209 0.23568 0.179 0.85865
## typeprof 509.15150 1798.87914 0.283 0.77779
## typewc 347.99010 1173.89384 0.296 0.76757
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2633 on 91 degrees of freedom
## Multiple R-squared: 0.6363, Adjusted R-squared: 0.6123
## F-statistic: 26.54 on 6 and 91 DF, p-value: < 2.2e-16
The apa_print() function can be used to convert the results of a statistical analysis into an APA results object, which can then be passed to apa_table().
apa_regr_model <- apa_print(regr_model)
The table is stored in the $table component of this object; pass it to the apa_table() function. Again, make sure you set results = "asis" in the code chunk options so it renders correctly, and give the code chunk a label:
apa_regr_model$table %>%
apa_table(caption = "Regression Model Predicting Income from Education, Women, Prestige, Census and Type",
note = "* p < 0.05; ** p < 0.01; *** p < 0.001")
| Predictor | \(b\)  | 95% CI                 | \(t\) | \(\mathit{df}\) | \(p\)  |
|-----------|--------|------------------------|-------|-----------------|--------|
| Intercept | 7.32   | [-6,025.84, 6,040.49]  | 0.00  | 91              | .998   |
| Education | 131.18 | [-442.38, 704.75]      | 0.45  | 91              | .651   |
| Women     | -53.23 | [-72.76, -33.71]       | -5.41 | 91              | < .001 |
| Prestige  | 139.21 | [66.90, 211.52]        | 3.82  | 91              | < .001 |
| Census    | 0.04   | [-0.43, 0.51]          | 0.18  | 91              | .859   |
| Typeprof  | 509.15 | [-3,064.10, 4,082.40]  | 0.28  | 91              | .778   |
| Typewc    | 347.99 | [-1,983.81, 2,679.79]  | 0.30  | 91              | .768   |

Note. * p < 0.05; ** p < 0.01; *** p < 0.001
You can convert ggplot figures into APA style by adding theme_apa().
ggplot(Prestige, aes(x = prestige, y = income)) +
geom_point(alpha = .5) +
geom_smooth(method = "lm") +
labs(x = "Prestige",
y = "Income") +
theme_apa()
## `geom_smooth()` using formula = 'y ~ x'
One of the most useful aspects of writing manuscripts in .Rmd with {papaja} is that we can report our results directly from the statistical models we run in R, reducing the likelihood of copy-and-paste errors.
We can use apa_print() from {papaja} to report the results of our statistical tests in text. Abbreviated results can be called with $estimate from the apa_print() object. For example, apa_regr_model$estimate$prestige will print a nicely formatted write-up of the slope for prestige on income:
Prestige was a significant, positive predictor of income while controlling for education, percentage of women, census, and position type, \(b = 139.21\), 95% CI \([66.90, 211.52]\).
We can instead get the full result (including t, df, and p) by calling the $full_result object, so apa_regr_model$full_result$prestige will be rendered like so:
Prestige was a significant, positive predictor of income while controlling for education, percentage of women, census, and position type, \(b = 139.21\), 95% CI \([66.90, 211.52]\), \(t(91) = 3.82\), \(p < .001\).
You can also extract the R-squared value and its corresponding F-test:
The model accounted for a significant amount of variability in people’s income, (\(R^2 = .64\), 90% CI \([0.50, 0.71]\), \(F(6, 91) = 26.54\), \(p < .001\)).
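If you want to reproduce that sentence yourself, the model-fit statistics live alongside the coefficient results in the apa_print() object. The accessor below reflects papaja's output structure for lm models, but it's worth confirming with str(apa_regr_model) in your own version:

# R-squared, its CI, and the F-test, formatted for inline reporting
# (papaja stores these under $full_result$modelfit for lm objects)
apa_regr_model$full_result$modelfit$r2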