papaja
The following lab covers papaja
, a powerful package for creating reproducible, APA-style manuscripts in Rmarkdown. We will go over how to write a manuscript in papaja
, including creating tables, figures, citations, and in-text statistical reporting. This page has some broad (overview) information and links to useful resources. We will then switch to working hands-on with a papaja
RMD.
papaja
papaja is an R package for Preparing APA Journal Articles (get it: PAPAJA). It contains several functions and perhaps most importantly, an RMD template that formats the output document (.docx or .pdf) in APA format.
The single most useful thing I will provide today is a link to papaja’s documentation. When I’ve used papaja, I have this open on a tab and check it pretty regularly.
papaja
First, we need to install papaja
. papaja
is not yet on CRAN, and so we have to download it from github rather than using install.packages()
, which requires the devtools
package. Let’s install both.
# Install devtools package if necessary
if(!"devtools" %in% rownames(installed.packages())) install.packages("devtools")
# Install the stable development verions from GitHub
if(!"papaja" %in% rownames(installed.packages())) devtools::install_github("crsh/papaja")
# call papaja
library(papaja)
In addition to papaja
, the citr
package is going to be helpful when it comes to citing works in text (within our RMD) and the corx
package is going to make it easier to create a table of correlations.
if(!"citr" %in% rownames(installed.packages())) install.packages("citr")
if(!"corx" %in% rownames(installed.packages())) install.packages("corx")
OK, now we’re ready to go! Below I’ll briefly review some of the functions we’ll use today before we dive into hands-on work in a papaja template RMD.
There are a few options for making apa-style tables in papaja
, each of which provide different pros and cons.
We’ll start by going over apa_table
from papaja
.apa_table()
can be used to format a dataframe as an APA-style table. FYI, it is built on top of knitr::kable()
. When using apa_table()
, the workflow generally goes something like:
printnum()
(from papaja
) to format the numeric columns.apa_table()
, set the caption and note (if applicable).results = "asis"
Let’s start by getting a table of descriptive stats. We’ll use the bfi data from psych
, and use the bfi.keys
to score the 5 scale scores.
b5 <- psych::bfi %>%
# score items with psych's scoreVeryFast
cbind( # column-bind the 5 scale scores
# using scoreVeryFast from psych
# which scores questionnaires based on a
# key
psych::scoreVeryFast(psych::bfi.keys, # scoring key goes here, which psych has for this data
psych::bfi) # data goes here
)
First, we create a dataframe with the descriptives of our Big Five scales. We’ll use some of the tidyverse functions for this.
descriptives <- b5 %>%
# select just the scale scores
select(agree:openness) %>%
# gather so that each row is a person's score for a particular
# Big Five scale
gather(scale, score) %>%
# Group by the scale (so we can get descriptives for each scale)
group_by(scale) %>%
#
summarize(
Mean = mean(score),
Median = median(score),
SD = sd(score),
Min = min(score),
Max = max(score)
)
descriptives
Next, we format the numerical columns using printnum()
function from papaja
for this. In essence, this puts everything to 2 decimal points (you can change this), pads with zeroes when necessary, and turns those to strings so they print correctly.
descriptives[, -1] <- printnum(descriptives[, -1]) # printnum changes numbers to character (from papaja)
# note: -1 since the first column is labels (not numerical)
Then, we run apa_table()
on the dataframe, and we can add a caption and note. You have to set results = "asis"
in the chunk options for it to render correctly:
apa_table(
descriptives
, caption = "Descriptive statistics of Big Five Scale Scores."
, note = "This table was created with apa_table()."
, escape = TRUE
)
scale | Mean | Median | SD | Min | Max |
---|---|---|---|---|---|
agree | 4.65 | 4.80 | 0.90 | 1.00 | 6.00 |
conscientious | 4.27 | 4.40 | 0.95 | 1.00 | 6.00 |
extraversion | 4.15 | 4.20 | 1.06 | 1.00 | 6.00 |
neuroticism | 3.16 | 3.00 | 1.20 | 1.00 | 6.00 |
openness | 4.59 | 4.60 | 0.81 | 1.20 | 6.00 |
Now we have a nicely formatted APA-style table that is fully reproducible. We can also reference the table within the papaja rmd with \@ref(tab:descrips_tab)
as you’ll see when we walk through the example script.
Next, let’s get a table of scale inter-correlations for the same dataset. Correlation tables are a little difficult in papaja
, but the corx()
function from the corx
library makes it a little easier.
Basically, the workflow for this is:
corx
to create the correlation tableapa_table()
to add a caption, note, and print itlibrary(corx)
## Warning: package 'corx' was built under R version 3.6.3
# create table with corx
cor <- b5 %>%
select(agree:openness) %>%
corx(triangle = "lower", # print just lower triangle
stars = c(0.05, 0.01, 0.001), # tell it which p values to star
describe = c(`$M$` = mean, `$SD$` = sd)) # optionally provide descriptives
# print the apa table with apa_table()
apa_table(cor$apa, # apa contains the data.frame needed for apa_table
caption = "Example corr matrix",
note = "* p < 0.05; ** p < 0.01; *** p < 0.001",
escape = TRUE)
1 | 2 | 3 | 4 | $M$ | $SD$ | |
---|---|---|---|---|---|---|
1. agree | - | 4.65 | 0.90 | |||
2. conscientious | .26*** | - | 4.27 | 0.95 | ||
3. extraversion | .46*** | .26*** | - | 4.15 | 1.06 | |
4. neuroticism | -.19*** | -.23*** | -.22*** | - | 3.16 | 1.20 |
5. openness | .15*** | .20*** | .21*** | -.09*** | 4.59 | 0.81 |
Now we’ll get a regression table using another method available in papaja
. We’ll use the apa_print()
function. This function takes statistical results as input and produces a list containing an APA-style table and several objects useful for printing results in-text (covered below).
The workflow for this method is:
apa_print()
$table
Let’s do that with a couple of regression models, regressing Conscientiousness on age (model 1), and also education (model 2).
First, conduct the regressions
b5 <- b5 %>%
mutate(
# turning education to a factor
education = as.factor(education),
# re-labelling education
education = case_when(
education == 1 ~ "Some HS",
education == 2 ~ " (Finished HS)",
education == 3 ~ " (Some College)",
education == 4 ~ " (College Grad)",
education == 5 ~ " (Grad Degree)"
)
) %>%
# removing NAs to make thing a little
# easier for today
filter(!is.na(conscientious),
!is.na(age),
!is.na(education))
# regressions
m1 <- lm(conscientious ~ age, data = b5)
m2 <- lm(conscientious ~ age + education, data = b5)
Next, use apa_print()
to create those apa-style results objects:
# model 1 results
m1_results <- apa_print(m1)
# model 2 results
m2_results <- apa_print(m2)
# model comparison results
mod_comp_results <- apa_print(list(m1, m2),
# setting to 0
# otherwise it creates bootstrapped
# CI for delta R^2 - we will cover bootstrapping
# soon.
boot_samples = 0)
Then, we can pull the tables out of them.
You could get a table of just the Model 1 results:
m1_results$table %>%
apa_table(caption = "Model Regressing Conscientiousness on Age",
note = "* p < 0.05; ** p < 0.01; *** p < 0.001",
escape = TRUE)
Predictor | \(b\) | 95% CI | \(t(2575)\) | \(p\) |
---|---|---|---|---|
Intercept | 4.07 | \([3.96\), \(4.17]\) | 75.42 | < .001 |
Age | 0.01 | \([0.00\), \(0.01]\) | 4.71 | < .001 |
Or we could get a table of just model 2’s results:
m2_results$table %>%
apa_table(caption = "Model Regressing Conscientiousness on Age & Education",
note = "* p < 0.05; ** p < 0.01; *** p < 0.001",
escape = TRUE)
Predictor | \(b\) | 95% CI | \(t(2571)\) | \(p\) |
---|---|---|---|---|
Intercept | 3.89 | \([3.75\), \(4.04]\) | 51.66 | < .001 |
Age | 0.01 | \([0.01\), \(0.01]\) | 5.54 | < .001 |
Education Finished HS | 0.02 | \([-0.12\), \(0.16]\) | 0.32 | .746 |
Education Grad Degree | 0.04 | \([-0.09\), \(0.17]\) | 0.62 | .536 |
Education Some College | 0.22 | \([0.12\), \(0.33]\) | 4.09 | < .001 |
EducationSome HS | -0.02 | \([-0.18\), \(0.13]\) | -0.28 | .782 |
And finally, we can get them together with their comparison:
mod_comp_results$table %>%
apa_table(caption = "Model Comparison: Conscientiousness by Age and Age + Education",
note = "* p < 0.05; ** p < 0.01; *** p < 0.001",
escape = TRUE)
Model 1 | Model 2 | |
---|---|---|
Intercept | \(4.07\) \([3.96\), \(4.17]\) | \(3.89\) \([3.75\), \(4.04]\) |
Age | \(0.01\) \([0.00\), \(0.01]\) | \(0.01\) \([0.01\), \(0.01]\) |
Education Finished HS | \(0.02\) \([-0.12\), \(0.16]\) | |
Education Grad Degree | \(0.04\) \([-0.09\), \(0.17]\) | |
Education Some College | \(0.22\) \([0.12\), \(0.33]\) | |
EducationSome HS | \(-0.02\) \([-0.18\), \(0.13]\) | |
\(R^2\) [90% CI] | \(.01\) \([0.00\), \(0.02]\) | \(.02\) \([0.01\), \(0.03]\) |
\(F\) | 22.16 | 10.93 |
\(df_1\) | 2 | 6 |
\(df_2\) | 2575 | 2571 |
\(p\) | < .001 | < .001 |
\(\mathrm{AIC}\) | 6,979.81 | 6,955.69 |
\(\mathrm{BIC}\) | 6,997.38 | 6,996.68 |
\(\Delta R^2\) | \(.01\) | |
\(F\) | 8.06 | |
\(df_1\) | 4 | |
\(df_2\) | 2575 | |
\(p\) | < .001 | |
\(\Delta \mathrm{AIC}\) | -24.12 | |
\(\Delta \mathrm{BIC}\) | -0.70 |
The last thing I’ll mention here is that we can use the chunk labels for our table to reference them in text. For example, we could reference this last one with by adding Table \@ref(tab:reg_tbl_3)
in the papaja RMD.
Figures are, in some ways, a little easier. You can create figures using basically any of the options we’ve learned about. I personally like ggplot, which can be put into APA format using the theme_apa()
theme (which comes from papaja).
Let’s get an APA-formatted plot for the relation between age and conscientiousness.
ggplot(b5, aes(x = age, y = conscientious)) +
geom_point(alpha = .5) +
geom_smooth(method = "lm") +
labs(x = "Age",
y = "Conscientiousness") +
# theme_apa puts it in apa format
theme_apa()
The best way to give that figure a caption is to use preface it with:
(ref:fig-ageXconsc-cap) Conscientiousness by Age.
You then set the chunk option fig.cap = "(ref:fig-ageXconsc-cap) "
, There is an example of this in the example papaja manuscript, so we will see it in action in a moment.
Just to be perfectly clear, you should be able to expand this to more complicated plots. For example, let’s take a look a figure that also has education in it:
ggplot(b5, aes(x = age, y = conscientious, color = education)) +
geom_point(alpha = .5) +
geom_smooth(method = "lm", se = FALSE) +
labs(x = "Age",
y = "Conscientiousness") +
# theme_apa puts it in apa format
theme_apa()
Or one that has age by each of the Big Five:
b5 %>%
select(age, agree:openness) %>%
gather(trait, score, -age) %>%
ggplot(aes(x = age, y = score)) +
geom_point(alpha = .5) +
geom_smooth(method = "lm") +
labs(x = "Age",
y = "Scale Score") +
facet_grid(~trait) +
# theme_apa puts it in apa format
theme_apa()
One of the most useful aspects of writing manuscripts in RMD and papaja
is that we can report our results directly from the statistical models we run in R, reducing the likelihood of copypasta errors.
We can use apa_print()
from papaja
to report the results of our statistical tests in text. Abbreviated results (estimate and CI) can be called with $estimate
from the apa_print()
obect. For example, m1_results$estimate$age
will print a nicely formatted printout of the slope for age on conscientiousness:
Age significantly predicted conscientiousness, \(b = 0.01\), 95% CI \([0.00\), \(0.01]\).
We can instead get the full result (including t, df, and p) by calling the $full_result
object, so m1_results$full_result$age
will be rendered like so:
Age significantly predicted conscientiousness, \(b = 0.01\), 95% CI \([0.00\), \(0.01]\), \(t(2575) = 4.71\), \(p < .001\).
It probably goes without saying, but we could get the resylts of other slopes by changing out $age
to reference other terms. For example, we could report the result of the difference between some high school and having finished highschool with m2_results$estimate$education_Finished_HS
:
Finishing highschool (vs. not finishing highschool) was not associated with Conscientiousness, \(b = 0.02\), 95% CI \([-0.12\), \(0.16]\), \(t(2571) = 0.32\), \(p = .746\).
And we could get model fit results too, both from each model and the compairson object. For example
m1_results$full_result$modelfit$r2
will print the \(R^2\) for model 1m2_results$full_result$modelfit$r2
will print the \(R^2\) for model 2mod_comp_results$full_result$model2
will print the \(\Delta R^2\) value (difference in \(R^2\)), and the F test for that difference.Putting it all together, we can report those results with something like the following:
The model with just age explained less variance, \(R^2 = .01\), 90% CI \([0.00\), \(0.02]\), \(F(1, 2575) = 22.16\), \(p < .001\), than the model that also included (dummy-coded) education \(R^2 = .02\), 90% CI \([0.01\), \(0.03]\), \(F(5, 2571) = 10.93\), \(p < .001\), which was significant \(\Delta R^2 = .01\), \(F(4, 2575) = 8.06\), \(p < .001\)
papaja
has various print methods, so be sure to check that part of the documentation.
The final ingredient we will talk about before getting into our example manuscript and the (singular) hack for the lab is the bibliography. All RMDs can include a bibliography:
entry in the YAML (the header portion of an RMD), where you can link to a bibtex (i.e., .bib
) file. Bibtex entries look something like this:
@article{goldberg1990alternative,
title={An alternative" description of personality": the big-five factor structure.},
author={Goldberg, Lewis R},
journal={Journal of personality and social psychology},
volume={59},
number={6},
pages={1216--1229},
year={1990},
doi = {https://doi.org/10.1037/0022-3514.59.6.1216},
publisher={American Psychological Association}
}
One convenient thing, is that you can get them from google scholar. Let’s take a look at the scholar results when we search for McClelland & Judd (1993).
From there, click on the quotes, and then on the BibTeX link, which takes you here
You should see this:
@article{mcclelland1993statistical,
title={Statistical difficulties of detecting interactions and moderator effects.},
author={McClelland, Gary H and Judd, Charles M},
journal={Psychological bulletin},
volume={114},
number={2},
pages={376},
year={1993},
publisher={American Psychological Association}
}
You would then want to copy that into a .bib file (you can edit it within RStudio or choose another text editor), and then make sure you reference that .bib file in the references:
part of your YAML (we’ll see this in the example).
You can also use popular reference managers like mendeley, zotero, etc. to create bibtex files. I use mendeley and it works pretty well.
Then, we could reference this article by referring to its name. There are basically three options:
Type | code | how it renders |
---|---|---|
Parenthetical citation | [@mcclelland1993statistical] |
(McClelland and Judd 1993) |
Non-parenthetical | @mcclelland1993statistical |
McClelland and Judd (1993) |
just the year | [-@mcclelland1993statistical] . |
(1993) |
Note that you can also put additional text in the brackets. For example,
[see @mcclelland1993statistical]
becomes…
(see McClelland and Judd 1993)
You can also add multiple citations by separating them with a ;
. For example,
[@mcclelland1993statistical; @cohen2013applied]
becomes…
(McClelland and Judd 1993; Cohen et al. 2013)
One of the coolest things about using BibTeX is that references will only appear in your reference list that you cite in text! You can see that here, because the ‘lab-9.bib’ file has three entries and the reference list below has just the two we reference above. That means no more last-minute checks on the reference list to make sure you removed the articles you references in a previous draft (that are no longer referenced). This alone is a huge time saver.
papaja
)papaja
makes it easy to cite any R package you’ve used with the r_refs(file = "r-references.bib")
, which creates a .bib file (named r-references.bib in that example code) with bibtex entries for all R packages you’ve called in the script. It is super handy!
You can combine this with the cite_r()
package from papaja
, which will create an in-text citation of all the packages you’ve used - we’ll see this in the example papaja Rmd.
That is all for this tutorial. Now let’s turn to the example RMD, and finally work on the hack (time permitting)!
papaja
hackToday there is just one hack. For this hack, I want you to get started on your final project RMD.
Create a new RMD file for your final project using the papaja template.
Edit the YAML to have the title and author of your project
Create a .bib with at least one reference you know you’ll need for your final project.
Cite the reference from your .bib (part 3) in text.
Run at least one statistical test.
Report the statistical test from part 4 in-text, in a table, and in a figure.
Cohen, Jacob, Patricia Cohen, Stephen G West, and Leona S Aiken. 2013. Applied Multiple Regression/Correlation Analysis for the Behavioral Sciences. Routledge.
McClelland, Gary H, and Charles M Judd. 1993. “Statistical Difficulties of Detecting Interactions and Moderator Effects.” Psychological Bulletin 114 (2): 376.