This document contains all the information shared in lecture (Nov 22, 2022). However, the format of the lecture differs from previous lectures and labs. If you were not able to join the lecture in person or live on Zoom, I strongly recommend watching the video. The document here is meant to serve as a summary sheet/reference, but not as the formal teaching tool.

Purpose

The purpose of today’s lecture/lab is to get more familiar with RMarkdown. RMarkdown is a file format that allows us to produce reports. You can use RMarkdown to produce html files (as we have been doing), but also word documents, pdfs, slideshows, websites, and more. In 612, we will be using RMarkdown to create APA manuscripts in R.

Here are two resources about RMarkdown that I recommend bookmarking: this cheatsheet and this book.


You’ll need the following packages for this lecture. If you don’t have them installed, be sure to do that before loading them:

library(tidyverse)
library(gt)
library(gtsummary)
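
If you are not sure which of these packages are already installed, a small check like the one below can help. This is only a sketch: the vector `pkgs` and the helper `missing_pkgs()` are names made up for illustration and are not part of the lecture materials.

```r
# Packages this lecture uses
pkgs <- c("tidyverse", "gt", "gtsummary")

# Hypothetical helper: return the packages in `pkgs` that are not installed.
# requireNamespace() returns FALSE (instead of erroring) when a package is missing.
missing_pkgs <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

to_install <- missing_pkgs(pkgs)

if (length(to_install) > 0) {
  # Install these once, from the console (needs an internet connection):
  #   install.packages(to_install)
  message("Not installed yet: ", paste(to_install, collapse = ", "))
} else {
  message("All packages are installed.")
}
```

Installation belongs in the console rather than in the report itself, so that knitting does not try to reinstall packages every time.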

YAML

The YAML is the header of your RMarkdown file that starts and ends with ---. When you open a new RMarkdown file, it will automatically create a default YAML with a title, author, date, and output. You can set the output to which file type you want the report to print out in (html_document, pdf_document, word_document, ioslides_presentation, etc.)

---
title: 'Lab 7: RMarkdown'
author: Vinita Vader
date: November 11, 2021
output: 
  html_document
---

For an html document, you can add a table of contents by specifying toc: true under html_document. If you like how the table of contents is floating to the left like it is on the course website, you can also add toc_float: true. Additionally, you can customize your theme, change the width and height of figures for the whole report, number your sections, and so on. For how to do these, refer here.

---
title: 'Lab 7: RMarkdown'
output:
  html_document:
    toc: true
    toc_float: true
---
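
The other customizations mentioned above go in the same place, as extra lines under html_document. The header below is a sketch rather than a recommended setup: the theme and the figure dimensions are arbitrary values chosen for illustration.

```yaml
---
title: 'Lab 7: RMarkdown'
output:
  html_document:
    toc: true
    toc_float: true
    number_sections: true  # number the section headers
    theme: flatly          # one of the built-in html themes
    fig_width: 6           # default figure width, in inches
    fig_height: 4          # default figure height, in inches
---
```

Indentation matters in the YAML: the options must be indented under html_document, and html_document must end with a colon once it has options beneath it.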

Code chunks

You can create a new code chunk with the shortcut Ctrl + Alt + I (or Cmd + Option + I if you have a Mac).

Code chunk options

By default, every code chunk is evaluated, and both the code and its results are printed in your report when you knit it. But you may want to override the defaults. For example, you may want your final report to show only your figures but not the code that created the figures. You would edit the chunk header, inside the curly braces, like this:

```{r, echo = FALSE}

Here are some more useful options.

| Option | Purpose | Useful for… |
|---|---|---|
| echo = FALSE | The results will print but not the code chunk | If you want to hide your code |
| eval = FALSE | The code chunk will print but is not evaluated | If you want to showcase sample code |
| include = FALSE | The code chunk is evaluated but nothing is printed | You are loading libraries |
| warning = FALSE | Warning messages will not be displayed | You want to get rid of a warning message in your report but you still want it to print the rest of your results out (if it is not a warning message try message = FALSE) |
| error = TRUE | The error message will display in your report instead of in the console | You can use this to knit even when you have an error message |

Refer here for more options. To change the settings for all of the code chunks in your report, you can set a global option. For example, if I wanted only the results and no code to print out for my entire report, I would put the code knitr::opts_chunk$set(echo = FALSE) in my first code chunk.
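
As a sketch of what such a first chunk might contain (assuming the knitr package, which RMarkdown relies on, is installed), the code below sets three defaults at once. Only echo = FALSE comes from the example above; the warning and message settings are added here purely for illustration.

```r
# First chunk of the report: set defaults for every chunk that follows
knitr::opts_chunk$set(
  echo = FALSE,     # hide the code, keep the results
  warning = FALSE,  # hide warning messages
  message = FALSE   # hide messages (e.g., package startup notes)
)
```

Any single chunk can still override a global default by setting the option in its own header, for example echo = TRUE on a chunk whose code you do want to show.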



Tables and tabbed sections

Tabbed sections are a way to organize your html file a little better. Follow this format exactly to make a tabbed section. Your first line will be the header of the section # header title {.tabset .tabset-fade .tabset-pills} followed by tab names ## tab 1, ## tab 2, and so on. In this section, we will also be exploring how to make a table using the gt and gtsummary packages.

We’ll show some of the code today, but there’s a lot more on the websites for these packages: gt.rstudio.com and www.danieldsjoberg.com/gtsummary/.
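
To see the tabbed-section format written out, the sketch below stores a minimal skeleton as lines of text and writes it to a temporary .Rmd file. The header title, the tab names, and the object names `tab_lines` and `skeleton_path` are placeholders invented for this example. In practice you would type these lines directly into your own RMarkdown file.

```r
# Lines of a minimal tabbed section; the header and tab names are placeholders
tab_lines <- c(
  "# Tables {.tabset .tabset-fade .tabset-pills}",
  "",
  "## Table 1",
  "",
  "Content of the first tab.",
  "",
  "## Table 2",
  "",
  "Content of the second tab."
)

# Write the skeleton to a temporary .Rmd file (for practice only)
skeleton_path <- tempfile(fileext = ".Rmd")
writeLines(tab_lines, skeleton_path)

# Print the skeleton to the console to check it
cat(readLines(skeleton_path), sep = "\n")
```

The tabs continue until the next level 1 header, so everything you want inside a tab has to sit under its level 2 header.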


Table 1

Here is a simple data frame.

(table_1 <- gtcars %>% 
   # which variables
   select(year, hp, mpg_c, mpg_h, msrp) %>% 
   # this is a big df, so let's just work with the first 10 rows
   slice_head(n = 10)) 
## # A tibble: 10 × 5
##     year    hp mpg_c mpg_h    msrp
##    <dbl> <dbl> <dbl> <dbl>   <dbl>
##  1  2017   647    11    18  447000
##  2  2015   597    13    17  291744
##  3  2015   562    13    17  263553
##  4  2014   562    13    17  233509
##  5  2016   661    15    22  245400
##  6  2015   553    16    23  198973
##  7  2017   680    12    17  298000
##  8  2015   652    11    16  295000
##  9  2015   731    11    16  319995
## 10  2015   949    12    16 1416362

Table 2

This is a simple table using the gt::gt function.

#Creating a gt table
table_1 %>% 
  gt()
year hp mpg_c mpg_h msrp
2017 647 11 18 447000
2015 597 13 17 291744
2015 562 13 17 263553
2014 562 13 17 233509
2016 661 15 22 245400
2015 553 16 23 198973
2017 680 12 17 298000
2015 652 11 16 295000
2015 731 11 16 319995
2015 949 12 16 1416362


This is technically a little better but we can further customize it.


Table 3

With the gt::gt function, you can add a table caption, change the column names, and format the currency column so that it’s a little easier to read.

table_1 %>% 
  gt(
    caption = "Table 1. An example dataset from the gt package." #add table caption
  ) %>% 
  cols_label( # rename the columns
    year = "Year", 
    hp = "Horsepower", 
    mpg_c = "MPG City", 
    mpg_h = "MPG Highway",
    msrp = "MSRP ($)"
  ) %>% 
  fmt_currency(
    columns = c(msrp),
    currency = "USD"
  )
Table 1. An example dataset from the gt package.
Year Horsepower MPG City MPG Highway MSRP ($)
2017 647 11 18 $447,000.00
2015 597 13 17 $291,744.00
2015 562 13 17 $263,553.00
2014 562 13 17 $233,509.00
2016 661 15 22 $245,400.00
2015 553 16 23 $198,973.00
2017 680 12 17 $298,000.00
2015 652 11 16 $295,000.00
2015 731 11 16 $319,995.00
2015 949 12 16 $1,416,362.00

Table 4

But we can do even more! Let’s rename our MPG columns and put them under a single banner. We can also group our values by year.

table_1 %>% 
  group_by(year) %>% # group by year
  gt(
    caption = "Table 2. An example dataset from the gt package." #add table caption
  ) %>% 
  cols_label( # note that year is not here anymore
    hp = "Horsepower", 
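    mpg_c = "City", 
    mpg_h = "Highway",
    msrp = "MSRP ($)"
  ) %>% 
  tab_spanner(
    label = "Miles Per Gallon",
    columns = c(mpg_c, mpg_h)
  )

Table 2. An example dataset from the gt package.

| | Horsepower | Miles Per Gallon: City | Miles Per Gallon: Highway | MSRP ($) |
|---|---|---|---|---|
| 2017 | | | | |
| | 647 | 11 | 18 | 447000 |
| | 680 | 12 | 17 | 298000 |
| 2015 | | | | |
| | 597 | 13 | 17 | 291744 |
| | 562 | 13 | 17 | 263553 |
| | 553 | 16 | 23 | 198973 |
| | 652 | 11 | 16 | 295000 |
| | 731 | 11 | 16 | 319995 |
| | 949 | 12 | 16 | 1416362 |
| 2014 | | | | |
| | 562 | 13 | 17 | 233509 |
| 2016 | | | | |
| | 661 | 15 | 22 | 245400 |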

Table 5

It would be helpful to include some summary statistics here.

table_1 %>% 
  group_by(year) %>% 
  gt(
    caption = "Table 3. An example dataset from the gt package." #add table caption
  ) %>% 
  cols_label(
    year = "Year", 
    hp = "Horsepower", 
    mpg_c = "City", 
    mpg_h = "Highway",
    msrp = "MSRP ($)"
  ) %>% 
  tab_spanner(
    label = "Miles Per Gallon",
    columns = c(mpg_c, mpg_h)
  ) %>% 
  summary_rows(
    columns = everything(), #all the columns
    fns = list( # which summary statistics
      Mean = ~mean(.),
      SD = ~sd(.)
    )
  )

Table 3. An example dataset from the gt package.

| | Horsepower | Miles Per Gallon: City | Miles Per Gallon: Highway | MSRP ($) |
|---|---|---|---|---|
| 2017 | | | | |
| | 647 | 11 | 18 | 447000 |
| | 680 | 12 | 17 | 298000 |
| 2015 | | | | |
| | 597 | 13 | 17 | 291744 |
| | 562 | 13 | 17 | 263553 |
| | 553 | 16 | 23 | 198973 |
| | 652 | 11 | 16 | 295000 |
| | 731 | 11 | 16 | 319995 |
| | 949 | 12 | 16 | 1416362 |
| 2014 | | | | |
| | 562 | 13 | 17 | 233509 |
| 2016 | | | | |
| | 661 | 15 | 22 | 245400 |
| Mean | 659.40 | 12.70 | 17.90 | 400,953.60 |
| SD | 117.29 | 1.70 | 2.51 | 362,918.85 |

Table 6 (Summary statistics)

If I’m writing a manuscript, I don’t want to print the entire table of data. Instead, I only want those summary statistics. Let’s combine the gt package with the gtsummary package for some useful descriptives.

gtcars %>% 
  select(year, hp, mpg_c, mpg_h, msrp) %>% 
  tbl_summary(
    statistic = list(
      all_continuous() ~ "{mean} ({sd})"
    ))

| Characteristic | N = 47¹ |
|---|---|
| year | |
|     2014 | 2 (4.3%) |
|     2015 | 9 (19%) |
|     2016 | 27 (57%) |
|     2017 | 9 (19%) |
| hp | 515 (140) |
| mpg_c | 15.33 (3.43) |
|     Unknown | 1 |
| mpg_h | 22.2 (3.9) |
|     Unknown | 1 |
| msrp | 193,929 (207,626) |

¹ n (%); Mean (SD)

Table 7

And of course, if this is to be ready for publication, I should rename the variables and do some light formatting. Here I’m also testing differences by groups.

gtcars %>% 
  select(bdy_style, year, hp, mpg_c, mpg_h, msrp) %>% 
  # for this example, i want to work with only two groups,
  # so here i'm collapsing anything that's not a coupe into
  # a "not coupe" category
  mutate(
    bdy_style = case_when(
      bdy_style == "coupe" ~ "coupe",
      TRUE ~ "not coupe"
     )) %>% 
  # now we calculate summary statistics
  tbl_summary(
    by = bdy_style, #separate summary statistics by group
    label = list(
      year ~ "Year",
      hp ~ "Horsepower",
      msrp ~ "MSRP ($)",
      mpg_h ~ "MPG (Highway)",
      mpg_c ~ "MPG (City)"
    ),
    statistic = list(
      all_continuous() ~ "{mean} ({sd})"
      )
    )  %>% 
  add_n() %>% # add column with total number of non-missing observations
  add_p(test = list(
    all_categorical() ~ "chisq.test",
    all_continuous() ~ "t.test"
  )) %>% # test for a difference between groups
  bold_labels() # make it pretty
## Warning for variable 'year':
## simpleWarning in stats::chisq.test(x = c(2017, 2015, 2015, 2014, 2016, 2015, 2017, : Chi-squared approximation may be incorrect

| Characteristic | N | coupe, N = 32¹ | not coupe, N = 15¹ | p-value² |
|---|---|---|---|---|
| Year | 47 | | | 0.4 |
|     2014 | | 2 (6.2%) | 0 (0%) | |
|     2015 | | 7 (22%) | 2 (13%) | |
|     2016 | | 16 (50%) | 11 (73%) | |
|     2017 | | 7 (22%) | 2 (13%) | |
| Horsepower | 47 | 546 (141) | 448 (114) | 0.015 |
| MPG (City) | 46 | 14.94 (3.72) | 16.21 (2.52) | 0.2 |
|     Unknown | | 0 | 1 | |
| MPG (Highway) | 46 | 21.4 (3.8) | 24.0 (3.4) | 0.030 |
|     Unknown | | 0 | 1 | |
| MSRP ($) | 47 | 224,250 (240,642) | 129,245 (82,660) | 0.052 |

¹ n (%); Mean (SD)
² Pearson's Chi-squared test; Welch Two Sample t-test

Other packages

There are many great packages available for making tables. One challenge is that no single package will do everything you’ll ever want. Some are great at summarizing models, some play nice with HTML and Word and PDFs, some are very customizable, but few are all three. Here are a few packages that we recommend.

Inline text

Inline text refers to text that is outside of code chunks. I am currently writing in inline text. In order to format the text that you write outside of code chunks, you have to abide by Markdown syntax. For example, you may want to bold, italicize, or strikethrough a word. You also may want to insert a list, table, headers, link, image, blockquote, or equation. Refer to the RMarkdown cheatsheet for how to format in Markdown syntax.




Inline code

Inline code

Sometimes you may want to code outside of a code chunk. The most common reason for this is to report statistics in your inline text. For example, let’s say you are writing a manuscript in R and you need to report the mean and standard deviation for a variable. You can calculate the mean and standard deviation of the variable in the code chunk, then call the answer in the in-line text. This will reduce human transfer errors.

#Descriptive stats for car hp 

range_hp_low <- range(gtcars$hp)[1]

range_hp_high <- range(gtcars$hp)[2]

m_hp <- mean(gtcars$hp)

sd_hp <- sd(gtcars$hp)

Now, I can call the variables inline, like this: The horsepower of cars ranged from 259 to 949 (M = 514.9574468, SD = 139.8205305).
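
In the .Rmd source, an inline call is written as a backtick, the letter r, a space, the expression, and a closing backtick. Because the raw values print with many decimals, it can help to round them before calling them inline. The helper below is a sketch, and `fmt_stat()` is a name invented for this example.

```r
# Hypothetical helper: fixed number of decimals, returned as text
fmt_stat <- function(x, digits = 2) {
  formatC(x, format = "f", digits = digits)
}

# Values copied from the sentence above, used here only as a demonstration
fmt_stat(514.9574468)
## [1] "514.96"
fmt_stat(139.8205305)
## [1] "139.82"

# In the .Rmd text you would then write, for example:
#   (M = `r fmt_stat(m_hp)`, SD = `r fmt_stat(sd_hp)`)
```

Returning text rather than a number keeps trailing zeros, so a value like 12.70 does not print as 12.7.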



Advanced code

You can do more than call a variable outside of the code chunks. For example, here is some code that will make the sentence look a little nicer. However, when you code outside of the code chunks, the code becomes really difficult to read, so I recommend keeping complicated code inside of the code chunks.

#Descriptive stats for car price 

range_msrp_low <- range(gtcars$msrp)[1]

range_msrp_high <- range(gtcars$msrp)[2]

m_msrp <- mean(gtcars$msrp)

sd_msrp <- sd(gtcars$msrp)

The msrp of gtcars ranged from $53900.00 to $1,416,362.00 (M = $193,929.06, SD = $207,626.4).

Instead of this mess, you want to first create your variables:

range_msrp_low <- range(gtcars$msrp)[1] %>%
  as.double() %>% #makes range_msrp_low a double (rather than an integer)
  format(nsmall = 2) %>% #format it to two decimal places
  paste0("$", .) #add a dollar sign before the number

range_msrp_high <- range(gtcars$msrp)[2] %>% 
  as.double() %>% 
  format(nsmall = 2, big.mark = ",") %>% #big.mark adds a comma for large numbers
  paste0("$", .)

m_msrp <- gtcars$msrp %>%
  mean() %>% 
  format(nsmall = 2, big.mark = ",") %>% 
  paste0("$", .)

sd_msrp <- gtcars$msrp %>%
  sd() %>% 
  format(nsmall = 2, big.mark = ",") %>% 
  paste0("$", .)

And then you can print your sentence: The msrp of gtcars ranged from $53900.00 to $1,416,362.00 (M = $193,929.06, SD = $207,626.38).



Creating a function

#To remove redundancy in your code, you may want to create a function instead
#Here I've created a function with one argument "x" that converts numbers into money formatted numbers
money_format <- function(x){
  x %>%
  as.double() %>% 
  format(nsmall = 2, big.mark = ",") %>% 
  paste0("$", .)
}

#testing the function
money_format(23948234)
## [1] "$23,948,234.00"
money_format(23)
## [1] "$23.00"
gtcars$msrp %>% 
  mean() %>% 
  money_format()
## [1] "$193,929.06"

Your new function will work inline too: $32,334.00
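
One thing to be aware of: format(nsmall = 2) guarantees at least two decimal places, not exactly two, and when given a vector it pads every element to a common width. If you need exactly two decimals, one alternative is to build the function on formatC() instead. This is a sketch of that alternative, not a replacement for the function above. The name `money_format_fixed()` is made up for this example.

```r
# Hypothetical variant of money_format() built on formatC()
money_format_fixed <- function(x) {
  x %>%
    as.double() %>%
    formatC(format = "f", digits = 2, big.mark = ",") %>% # always two decimals
    paste0("$", .)
}

# Uses the pipe, which is available once tidyverse (or magrittr) is loaded
library(magrittr)

money_format_fixed(23948234)   # a single large number
money_format_fixed(c(5, 1000)) # a vector: each element is formatted on its own
money_format_fixed(1.23456)    # extra decimals are rounded away
```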


Knitting errors

Knitting errors are particularly frustrating error messages if you thought you were done with your project and the code had worked up to the point of knitting. Plus, the messages can be vague and confusing. Here’s our advice:

  • Knit early and knit often. If you’re working on the homework, it’s a good idea to knit at least once after every problem.

  • When you get an error, you first want to try to locate where the error is. If you think you found it, you can comment out the entire code chunk by highlighting the code and typing Ctrl + Shift + C (or Cmd + Shift + C on a Mac), or you can change the code chunk options to error = TRUE. You should be able to knit now if the error message is coming from the code chunk that you just disabled. Otherwise, disable more code chunks, one at a time until it knits.

  • Once you’ve established where the error is, you may still not understand why the code isn’t working. The most common reasons are…

    • Your data set wasn’t imported properly.
    • You haven’t loaded the proper libraries and you need to add at least one library() command.
    • You are referring to a variable that hasn’t been assigned yet. It probably worked before because it was assigned in your global environment, but it needs to be assigned before the line of code in order for it to knit.

Minihacks

For today’s minihacks, you will be using RMarkdown to create your own html file.

  1. Open up a new RMarkdown file and erase everything but the YAML. Edit the YAML in the following ways:
    • Change the title to “Lecture 17 Minihacks.”
    • Remove the author and date.
    • Under output, add a table of contents, numbered sections, and choose a theme. You should be outputting to an html file. Some html themes you may want to try are cerulean, journal, flatly, lumen, paper, and readable.
    • Knit the file. You should only see the title. The color and font will depend on which theme you chose.
  2. Create a new code chunk and load the following libraries: tidyverse, gt, gtsummary, rio and here.
    • Change the default chunk option so that the code is evaluated, but neither the code nor any resulting messages will show up in your report. You can do this all by changing one default option.
    • In the same code chunk, using the rio and here packages, import the data set us_contagious_diseases.csv and store it in a data frame called data.
    • Finally, in this code chunk, add the code options(scipen = 999), which will turn off scientific notation.
  3. Create a header called “Tables”. The us_contagious_diseases data set has the yearly counts for seven contagious diseases for the years 1928 - 2011. Use these data to create the following tables (include each table in a separate tab under your new header):
    • A table showing the first 8 rows of the data frame. Be sure to format the column names. Include summary statistics (mean and standard deviation) of the continuous variables.
    • A table summarizing the continuous variables (i.e., not state). Present these statistics separately for each disease.
    • Pick 2 of the diseases and create a table summarizing the continuous variables only for these groups. Include p-values testing the differences between them. Be sure to use t-tests to calculate those p-values.
  4. Create one last level 1 header called “Questions”. Create three subheadings “Question 1”, “Question 2”, and “Question 3”. Using in-line code, answer the following questions under the respective subheading:
    • Question 1) Measles had the highest number of infections in the US during this time span. What was the number of infections?
    • Question 2) What was the average number of Measles cases per year in the US from 1928 to 2011? Round this number to two decimal places.
    • Question 3) In 1938, Wisconsin had the highest number of Measles cases per capita. What percent of Wisconsin’s population contracted Measles in 1938?