This document contains all the information shared in lecture (Nov 22, 2022). However, the format of the lecture differs from previous lectures and labs. If you were not able to join the lecture in person or live on Zoom, I strongly recommend watching the video. The document here is meant to serve as a summary sheet/reference, but not as the formal teaching tool.
The purpose of today’s lecture/lab is to get more familiar with RMarkdown. RMarkdown is a file format that allows us to produce reports. You can use RMarkdown to produce html files (as we have been doing), but also word documents, pdfs, slideshows, websites, and more. In 612, we will be using RMarkdown to create APA manuscripts in R.
Here are two resources about RMarkdown that I recommend bookmarking: this cheatsheet and this book.
You’ll need the following packages for this lecture. If you don’t have them installed, be sure to do that before loading them:
library(tidyverse)
library(gt)
library(gtsummary)
The YAML is the header of your RMarkdown file that starts and ends with `---`. When you open a new RMarkdown file, it will automatically create a default YAML with a title, author, date, and output. You can set the output to whichever file type you want the report to print out in (`html_document`, `pdf_document`, `word_document`, `ioslides_presentation`, etc.)
---
title: 'Lab 7: RMarkdown'
author: Vinita Vader
date: November 11, 2021
output:
  html_document
---
For an html document, you can add a table of contents by specifying `toc: true` under `html_document`. If you like how the table of contents is floating to the left like it is on the course website, you can also add `toc_float: true`. Additionally, you can customize your theme, change the width and height of figures for the whole report, number your sections, and so on. For how to do these, refer here.
---
title: 'Lab 7: RMarkdown'
output:
  html_document:
    toc: true
    toc_float: true
---
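As a sketch of the other customizations mentioned above (theme, section numbering, and figure size for the whole report), the header could be extended as follows. The option names are standard `html_document` options; the specific values (`flatly`, 7, 5) are only illustrative choices, not settings used in this lab.

```yaml
---
title: 'Lab 7: RMarkdown'
output:
  html_document:
    toc: true
    toc_float: true
    theme: flatly          # illustrative theme choice
    number_sections: true  # number the section headers
    fig_width: 7           # default figure width (inches) for every chunk
    fig_height: 5          # default figure height (inches) for every chunk
---
```

Indentation matters here: each option has to sit under `html_document:` at the same depth as `toc`.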
You can create a new code chunk with the shortcut `Ctrl + Alt + I` (or `Cmd + Option + I` if you have a Mac).
Code chunks have the default setting that all code will be evaluated and all code and results of the code will be printed out in your report when you knit it. But you may want to override the defaults. For example, you may want your final report to show only your figures but not the code that created the figures. You would edit the code chunk, inside the brackets, like this:
```{r, echo = FALSE}
# code that creates the figure goes here
```
Here are some more useful options.
Option | Purpose | Useful for… |
---|---|---|
`echo = FALSE` | The results will print but not the code chunk | If you want to hide your code |
`eval = FALSE` | The code chunk will print but is not evaluated | If you want to showcase sample code |
`include = FALSE` | The code chunk is evaluated but nothing is printed | You are loading libraries |
`warning = FALSE` | Warning messages will not be displayed | You want to get rid of a warning message in your report but you still want it to print the rest of your results out (if it is not a warning message try `message = FALSE`) |
`error = TRUE` | The error message will display in your report instead of in the console | You can use this to knit even when you have an error message |
Refer here for more options. To change the settings for all of the code chunks in your report, you can set a global option. For example, if I wanted only the results and no code to print out for my entire report, I would put the code `knitr::opts_chunk$set(echo = FALSE)` in my first code chunk.
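A minimal sketch of what such a first chunk might contain is below. Combining several options in one call is an assumption about what a report might want, not a requirement; the chunk itself would typically carry `include = FALSE` so that nothing from it shows up in the report.

```r
library(knitr)

# Set defaults for every chunk that follows in the report.
# Individual chunks can still override these in their own headers.
opts_chunk$set(
  echo = FALSE,     # hide code, keep results
  warning = FALSE,  # suppress warning messages
  message = FALSE   # suppress package start-up messages and the like
)
```

Any chunk that needs to show its code can still set `echo = TRUE` in its own header.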
Tabbed sections are a way to organize your html file a little better. Follow this format exactly to make a tabbed section. Your first line will be the header of the section `# header title {.tabset .tabset-fade .tabset-pills}` followed by tab names `## tab 1`, `## tab 2`, and so on. In this section, we will also be exploring how to make a table using the `gt` and `gtsummary` packages.
We’ll show some of the code today, but there’s a lot more on the websites for these packages: gt.rstudio.com and www.danieldsjoberg.com/gtsummary/.
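Put together, the skeleton of a tabbed section might look like the sketch below. The header and tab names are placeholders; the only fixed parts are the `{.tabset ...}` attributes on the parent header and the use of headers one level lower for the tabs.

```markdown
# Tables {.tabset .tabset-fade .tabset-pills}

## Raw data

Text and code chunks for the first tab go here.

## Summary

Text and code chunks for the second tab go here.

# Next section
```

The tabbed section ends at the next header of the same level as the parent (here, the next `#` header).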
Here is a simple data frame.
(table_1 <- gtcars %>%
  # which variables
  select(year, hp, mpg_c, mpg_h, msrp) %>%
  # this is a big df, so let's just work with the first 10 rows
  slice_head(n = 10))
## # A tibble: 10 × 5
## year hp mpg_c mpg_h msrp
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2017 647 11 18 447000
## 2 2015 597 13 17 291744
## 3 2015 562 13 17 263553
## 4 2014 562 13 17 233509
## 5 2016 661 15 22 245400
## 6 2015 553 16 23 198973
## 7 2017 680 12 17 298000
## 8 2015 652 11 16 295000
## 9 2015 731 11 16 319995
## 10 2015 949 12 16 1416362
This is a simple table using the `gt::gt` function.
# Creating a gt table
table_1 %>%
  gt()
year | hp | mpg_c | mpg_h | msrp |
---|---|---|---|---|
2017 | 647 | 11 | 18 | 447000 |
2015 | 597 | 13 | 17 | 291744 |
2015 | 562 | 13 | 17 | 263553 |
2014 | 562 | 13 | 17 | 233509 |
2016 | 661 | 15 | 22 | 245400 |
2015 | 553 | 16 | 23 | 198973 |
2017 | 680 | 12 | 17 | 298000 |
2015 | 652 | 11 | 16 | 295000 |
2015 | 731 | 11 | 16 | 319995 |
2015 | 949 | 12 | 16 | 1416362 |
This is technically a little better but we can further customize it. With the `gt::gt` function, you can add a table caption, change the column names, and format the currency column so that it’s a little easier to read.
table_1 %>%
  gt(
    caption = "Table 1. An example dataset from the gt package." # add table caption
  ) %>%
  cols_label( # rename the columns
    year = "Year",
    hp = "Horsepower",
    mpg_c = "MPG City", # mpg_c is the city rating in gtcars
    mpg_h = "MPG Highway",
    msrp = "MSRP ($)"
  ) %>%
  fmt_currency(
    columns = c(msrp),
    currency = "USD"
  )
Year | Horsepower | MPG City | MPG Highway | MSRP ($) |
---|---|---|---|---|
2017 | 647 | 11 | 18 | $447,000.00 |
2015 | 597 | 13 | 17 | $291,744.00 |
2015 | 562 | 13 | 17 | $263,553.00 |
2014 | 562 | 13 | 17 | $233,509.00 |
2016 | 661 | 15 | 22 | $245,400.00 |
2015 | 553 | 16 | 23 | $198,973.00 |
2017 | 680 | 12 | 17 | $298,000.00 |
2015 | 652 | 11 | 16 | $295,000.00 |
2015 | 731 | 11 | 16 | $319,995.00 |
2015 | 949 | 12 | 16 | $1,416,362.00 |
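Since every price in this table is a whole-dollar amount, the trailing `.00` adds little. As a small sketch (not part of the original lab code), the `decimals` argument of `fmt_currency()` can drop the cents; the object name `whole_dollar_tbl` and the choice of three rows are only for illustration.

```r
library(dplyr)
library(gt)

whole_dollar_tbl <- gtcars %>%
  select(year, hp, msrp) %>%
  slice_head(n = 3) %>%   # a few rows are enough to see the formatting
  gt() %>%
  fmt_currency(
    columns = msrp,
    currency = "USD",
    decimals = 0          # show whole dollars only
  )

whole_dollar_tbl
```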
But we can do even more! Let’s rename our MPG columns and put them under a single banner. We can also group our values by year.
table_1 %>%
  group_by(year) %>% # group by year
  gt(
    caption = "Table 2. An example dataset from the gt package." # add table caption
  ) %>%
  cols_label( # note that year is not here anymore
    hp = "Horsepower",
    mpg_c = "City", # mpg_c is the city rating in gtcars
    mpg_h = "Highway",
    msrp = "MSRP ($)"
  ) %>%
  tab_spanner(
    label = "Miles Per Gallon",
    columns = c(mpg_c, mpg_h)
  )
Horsepower | Miles Per Gallon: City | Miles Per Gallon: Highway | MSRP ($) |
---|---|---|---|
2017 | | | |
647 | 11 | 18 | 447000 |
680 | 12 | 17 | 298000 |
2015 | | | |
597 | 13 | 17 | 291744 |
562 | 13 | 17 | 263553 |
553 | 16 | 23 | 198973 |
652 | 11 | 16 | 295000 |
731 | 11 | 16 | 319995 |
949 | 12 | 16 | 1416362 |
2014 | | | |
562 | 13 | 17 | 233509 |
2016 | | | |
661 | 15 | 22 | 245400 |
It would be helpful to include some summary statistics here.
table_1 %>%
  group_by(year) %>%
  gt(
    caption = "Table 3. An example dataset from the gt package." # add table caption
  ) %>%
  cols_label(
    year = "Year",
    hp = "Horsepower",
    mpg_c = "City", # mpg_c is the city rating in gtcars
    mpg_h = "Highway",
    msrp = "MSRP ($)"
  ) %>%
  tab_spanner(
    label = "Miles Per Gallon",
    columns = c(mpg_c, mpg_h)
  ) %>%
  summary_rows(
    columns = everything(), # all the columns
    fns = list( # which summary statistics
      Mean = ~mean(.),
      SD = ~sd(.)
    )
  )
| | Horsepower | Miles Per Gallon: City | Miles Per Gallon: Highway | MSRP ($) |
|---|---|---|---|---|
| 2017 | | | | |
| | 647 | 11 | 18 | 447000 |
| | 680 | 12 | 17 | 298000 |
| 2015 | | | | |
| | 597 | 13 | 17 | 291744 |
| | 562 | 13 | 17 | 263553 |
| | 553 | 16 | 23 | 198973 |
| | 652 | 11 | 16 | 295000 |
| | 731 | 11 | 16 | 319995 |
| | 949 | 12 | 16 | 1416362 |
| 2014 | | | | |
| | 562 | 13 | 17 | 233509 |
| 2016 | | | | |
| | 661 | 15 | 22 | 245400 |
| Mean | 659.40 | 12.70 | 17.90 | 400,953.60 |
| SD | 117.29 | 1.70 | 2.51 | 362,918.85 |
If I’m writing a manuscript, I don’t want to print the entire table of data. Instead, I only want those summary statistics. Let’s combine the `gt` package with the `gtsummary` package for some useful descriptives.
gtcars %>%
  select(year, hp, mpg_c, mpg_h, msrp) %>%
  tbl_summary(
    statistic = list(
      all_continuous() ~ "{mean} ({sd})"
    )
  )
Characteristic | N = 47¹ |
---|---|
year | |
2014 | 2 (4.3%) |
2015 | 9 (19%) |
2016 | 27 (57%) |
2017 | 9 (19%) |
hp | 515 (140) |
mpg_c | 15.33 (3.43) |
Unknown | 1 |
mpg_h | 22.2 (3.9) |
Unknown | 1 |
msrp | 193,929 (207,626) |
¹ n (%); Mean (SD)
And of course, if this is to be ready for publication, I should rename the variables and do some light formatting. Here I’m also testing differences by groups.
gtcars %>%
  select(bdy_style, year, hp, mpg_c, mpg_h, msrp) %>%
  # for this example, I want to work with only two groups,
  # so here I'm collapsing anything that's not a coupe into
  # a "not coupe" category
  mutate(
    bdy_style = case_when(
      bdy_style == "coupe" ~ "coupe",
      TRUE ~ "not coupe"
    )
  ) %>%
  # now we calculate summary statistics
  tbl_summary(
    by = bdy_style, # separate summary statistics by group
    label = list(
      year ~ "Year",
      hp ~ "Horsepower",
      msrp ~ "MSRP ($)",
      mpg_h ~ "MPG (Highway)",
      mpg_c ~ "MPG (City)" # mpg_c is the city rating in gtcars
    ),
    statistic = list(
      all_continuous() ~ "{mean} ({sd})"
    )
  ) %>%
  add_n() %>% # add column with total number of non-missing observations
  add_p(test = list(
    all_categorical() ~ "chisq.test",
    all_continuous() ~ "t.test"
  )) %>% # test for a difference between groups
  bold_labels() # make it pretty
## Warning for variable 'year':
## simpleWarning in stats::chisq.test(x = c(2017, 2015, 2015, 2014, 2016, 2015, 2017, : Chi-squared approximation may be incorrect
Characteristic | N | coupe, N = 32¹ | not coupe, N = 15¹ | p-value² |
---|---|---|---|---|
Year | 47 | | | 0.4 |
2014 | | 2 (6.2%) | 0 (0%) | |
2015 | | 7 (22%) | 2 (13%) | |
2016 | | 16 (50%) | 11 (73%) | |
2017 | | 7 (22%) | 2 (13%) | |
Horsepower | 47 | 546 (141) | 448 (114) | 0.015 |
MPG (City) | 46 | 14.94 (3.72) | 16.21 (2.52) | 0.2 |
Unknown | | 0 | 1 | |
MPG (Highway) | 46 | 21.4 (3.8) | 24.0 (3.4) | 0.030 |
Unknown | | 0 | 1 | |
MSRP ($) | 47 | 224,250 (240,642) | 129,245 (82,660) | 0.052 |
¹ n (%); Mean (SD)
² Pearson's Chi-squared test; Welch Two Sample t-test
There are many great packages available for making tables. One challenge is that no single package will do everything you’ll ever want. Some are great at summarizing models, some play nicely with HTML, Word, and PDF output, some are very customizable, but few are all three. Here are a few packages that we recommend.
Inline text refers to text that is outside of code chunks. I am currently writing in inline text. In order to format the text that you write outside of code chunks, you have to abide by Markdown syntax. For example, you may want to bold, italicize, or strikethrough a word. You also may want to insert a list, table, headers, link, image, blockquote, or equation. Refer to the RMarkdown cheatsheet for how to format in Markdown syntax.
Sometimes you may want to code outside of a code chunk. The most common reason for this is to report statistics in your inline text. For example, let’s say you are writing a manuscript in R and you need to report the mean and standard deviation for a variable. You can calculate the mean and standard deviation of the variable in the code chunk, then call the answer in the in-line text. This will reduce human transfer errors.
#Descriptive stats for car hp
range_hp_low <- range(gtcars$hp)[1]
range_hp_high <- range(gtcars$hp)[2]
m_hp <- mean(gtcars$hp)
sd_hp <- sd(gtcars$hp)
Now, I can call the variables inline, like this: The horsepower of cars ranged from 259 to 949 (M = 514.9574468, SD = 139.8205305).
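The unrounded values in that sentence are hard to read. One way to tidy them up, sketched here with the two numbers copied from the sentence above, is to round once inside a code chunk and store display-ready strings. The names `m_hp_txt` and `sd_hp_txt` are made up for this illustration.

```r
# Values copied from the sentence above; in a real report they would come
# from mean(gtcars$hp) and sd(gtcars$hp).
m_hp <- 514.9574468
sd_hp <- 139.8205305

# Round once, in a chunk, and keep two decimal places for display
m_hp_txt <- format(round(m_hp, 2), nsmall = 2)
sd_hp_txt <- format(round(sd_hp, 2), nsmall = 2)

paste0("M = ", m_hp_txt, ", SD = ", sd_hp_txt)
## [1] "M = 514.96, SD = 139.82"
```

The inline text can then call the stored strings with an inline expression such as `` `r m_hp_txt` ``, which keeps the sentence itself short.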
You can do more than call a variable outside of the code chunks. For example, here is some code that will make the sentence look a little nicer. However, when you code outside of the code chunks, the code becomes really difficult to read, so I recommend keeping complicated code inside of the code chunks.
#Descriptive stats for car price
range_msrp_low <- range(gtcars$msrp)[1]
range_msrp_high <- range(gtcars$msrp)[2]
m_msrp <- mean(gtcars$msrp)
sd_msrp <- sd(gtcars$msrp)
The msrp of gtcars ranged from $53900.00 to $1,416,362.00 (M = $193,929.06, SD = $207,626.4).
Instead of this mess, you want to first create your variables:
range_msrp_low <- range(gtcars$msrp)[1] %>%
  as.double() %>% # makes range_msrp_low a double (rather than an integer)
  format(nsmall = 2, big.mark = ",") %>% # at least two decimal places; big.mark adds a comma for large numbers
  paste0("$", .) # add a dollar sign before the number
range_msrp_high <- range(gtcars$msrp)[2] %>%
  as.double() %>%
  format(nsmall = 2, big.mark = ",") %>%
  paste0("$", .)
m_msrp <- gtcars$msrp %>%
  mean() %>%
  format(nsmall = 2, big.mark = ",") %>%
  paste0("$", .)
sd_msrp <- gtcars$msrp %>%
  sd() %>%
  format(nsmall = 2, big.mark = ",") %>%
  paste0("$", .)
And then you can print your sentence: The msrp of gtcars ranged from $53,900.00 to $1,416,362.00 (M = $193,929.06, SD = $207,626.38).
# To remove redundancy in your code, you may want to create a function instead
# Here I've created a function with one argument "x" that converts numbers into money formatted numbers
money_format <- function(x){
  x %>%
    as.double() %>%
    format(nsmall = 2, big.mark = ",") %>%
    paste0("$", .)
}
#testing the function
money_format(23948234)
## [1] "$23,948,234.00"
money_format(23)
## [1] "$23.00"
gtcars$msrp %>%
mean() %>%
money_format()
## [1] "$193,929.06"
Your new function will work inline too: $32,334.00
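One caveat worth knowing: `nsmall` sets a minimum number of decimals, not an exact number, and `format()` pads the elements of a vector to a common width. The sketch below is an alternative built on base R's `formatC()`, which always rounds to exactly two decimals and does not pad. The name `money_format_fixed` is hypothetical; this is a variation for illustration, not the function used in lecture.

```r
# Alternative sketch: exactly two decimals, comma separators, no padding
money_format_fixed <- function(x) {
  paste0("$", formatC(as.double(x), format = "f", digits = 2, big.mark = ","))
}

money_format_fixed(23)
## [1] "$23.00"
money_format_fixed(1234.567)
## [1] "$1,234.57"
money_format_fixed(c(5, 1000))
## [1] "$5.00"     "$1,000.00"
```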
Knitting errors are particularly frustrating error messages if you thought you were done with your project and the code had worked up to the point of knitting. Plus, the messages can be vague and confusing. Here’s our advice:
Knit early and knit often. If you’re working on the homework, it’s a good idea to knit at least once after every problem.
When you get an error, you first want to try to locate where the error is. If you think you found it, you can comment out the entire code chunk by highlighting the code and typing `Ctrl + Shift + C` (or `Cmd + Shift + C` on a Mac), or you can change the code chunk options to `error = TRUE`. You should be able to knit now if the error message is coming from the code chunk that you just disabled. Otherwise, disable more code chunks, one at a time, until it knits.
Once you’ve established where the error is, you may still not understand why the code isn’t working. The most common reasons are…
… `library()` command.
For today’s minihacks, you will be using RMarkdown to create your own html file.
… `cerulean`, `journal`, `flatly`, `lumen`, `paper`, and `readable`.
… `tidyverse`, `gt`, `gtsummary`, `rio` and `here`.
… `rio` and `here` packages, import the data set us_contagious_diseases.csv and store it in a data frame called data.
… `options(scipen = 999)`, which will turn off scientific notation
… `us_contagious_diseases` data set has the yearly counts for seven contagious diseases for the years 1928 - 2011. Use these data to create the following tables (include each table in a separate tab under your new header):