Purpose

The purpose of today’s lab is to introduce you to visualizing data. Although R has built-in plotting functions, we will be using the {ggplot2} package in today’s lab. It is far more powerful than the base R plotting functions and is the basis for other packages that allow you to create interactive plots (e.g., {plotly}), animated plots (e.g., {gganimate}), and 3D plots (e.g., {rayshader}). As part of the {tidyverse}, {ggplot2} was also designed to work well with other tidyverse packages (e.g., {dplyr}, {tidyr}).

The content of today’s lab will be split into four sections. The first section, The Grammar of Graphics, will provide a brief overview of the logic behind how {ggplot2} works. The second section, Histograms, will provide an overview of how to create a histogram and customize it. The third section, Categorical by Continuous Plots, will guide you through the process of creating a bar chart in R. Finally, in the Continuous by Continuous Plots section, we will discuss how to create plots that have a continuous variable on both axes. As always, the lab will end with a set of minihacks to test your knowledge.

To quickly navigate to the desired section, click one of the following links:

  1. The Grammar of Graphics
  2. Histograms
  3. Categorical by Continuous Plots
  4. Continuous by Continuous Plots
  5. Minihacks

Below are some resources that you may find useful going forward:


The Grammar of Graphics

The {ggplot2} package is built around Leland Wilkinson’s idea of the Grammar of Graphics. The Grammar of Graphics proposes that every graph can be created from (1) a data set (e.g., mtcars, world happiness), (2) a coordinate system or canvas (e.g., a Cartesian coordinate system), and (3) a set of geometric elements that represent the data (e.g., points, lines, polygons).
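
As a minimal sketch of how those three components map onto {ggplot2} code, the example below uses R's built-in mtcars data set (an illustrative choice; any data frame would do). Cartesian coordinates are already the default, so coord_cartesian() is written out only to make the second component visible.

```r
library(ggplot2)

# (1) the data set, with variables mapped to the axes
p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  # (2) the coordinate system (Cartesian is already the default)
  coord_cartesian() +
  # (3) the geometric elements that represent the data
  geom_point()

p
```

Swapping geom_point() for another geom_ function changes how the data are drawn without touching the data set or the coordinate system.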


Histograms

Below is the data set pirates from the {yarrr} package. All of the plots in today’s lab, excluding those created in the Minihacks, will be using the pirates data set.
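
The setup code is not shown in this lab. A plausible version, assuming the {tidyverse} and {yarrr} packages are installed, is sketched below. Converting pirates to a tibble is an assumption made here so that printing resembles the output that follows.

```r
# load the packages used throughout the lab
library(tidyverse)  # provides ggplot2, dplyr, and the pipe (%>%)
library(yarrr)      # provides the pirates and movies data sets

# assumption: convert the data frame to a tibble for compact printing
pirates <- as_tibble(yarrr::pirates)
```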

# the data set
pirates
## # A tibble: 1,000 x 17
##       id sex     age height weight headband college tattoos tchests parrots
##    <int> <chr> <dbl>  <dbl>  <dbl> <chr>    <chr>     <dbl>   <dbl>   <dbl>
##  1     1 male     28   173.   70.5 yes      JSSFP         9       0       0
##  2     2 male     31   209.  106.  yes      JSSFP         9      11       0
##  3     3 male     26   170.   77.1 yes      CCCC         10      10       1
##  4     4 fema…    31   144.   58.5 no       JSSFP         2       0       2
##  5     5 fema…    41   158.   58.4 yes      JSSFP         9       6       4
##  6     6 male     26   190.   85.4 yes      CCCC          7      19       0
##  7     7 fema…    31   158.   59.6 yes      JSSFP         9       1       7
##  8     8 fema…    31   173.   74.5 yes      JSSFP         5      13       7
##  9     9 fema…    28   165.   68.7 yes      JSSFP        12      37       2
## 10    10 male     30   184.   84.7 yes      JSSFP        12      69       4
## # … with 990 more rows, and 7 more variables: favorite.pirate <chr>,
## #   sword.type <chr>, eyepatch <dbl>, sword.time <dbl>,
## #   beard.length <dbl>, fav.pixar <chr>, grogg <dbl>

We can pipe (%>%) the pirates data set to the ggplot() function to create the canvas of our plot. In this case, we will add aes(x = age) inside the ggplot() function.

# the data set
pirates %>%
  # the canvas
  ggplot(aes(x = age))

The aes() function above tells ggplot() that we are providing it aesthetic information for the canvas. In this case, we are saying the canvas should have age on the x-axis. We can also map variables to y, colour, size, shape, alpha, and group, among others. We will return to some of these later.

As should be readily apparent from looking at our plot, a canvas is not that exciting without anything on it. We can add geometric elements to the plot by using {ggplot2} functions that begin with the geom_ prefix. Let’s add a histogram to the plot by adding geom_histogram() to the code. Unlike other functions in the tidyverse, we use the plus symbol (+) instead of the pipe symbol (%>%) to add elements to a plot. This makes sense if you consider that we want to add a layer to the plot rather than pass information from one function to the next.
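
The code for this step does not appear in the text. Following the pattern of the other code in this lab, it presumably looks like the sketch below (the object name p_hist is introduced here only for illustration).

```r
library(ggplot2)
library(dplyr)   # for the pipe (%>%)

# assumption: pirates comes from the {yarrr} package
pirates <- yarrr::pirates

# the data set
p_hist <- pirates %>%
  # the canvas
  ggplot(aes(x = age)) +
    # the geometric elements
    geom_histogram()

p_hist
```

With no further arguments, geom_histogram() falls back to 30 bins and prints a message suggesting that you pick a better value.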

Great - it looks like a histogram! …but it is a bit hard to distinguish between the different bins. We can use the fill and colour (color) arguments inside the geom_histogram() function to specify the colour of the inside and outline of the geometric element, respectively. We can also include the alpha argument to set the opacity (i.e., the lack of transparency) of the geometric element. Let’s make the inside of the bars turquoise, the outline of the bars black, and the opacity of the bars .6 (i.e., 40% transparent).

# the data set
pirates %>%
  # the canvas
  ggplot(aes(x = age)) +
    # the geometric elements
    geom_histogram(fill   = "turquoise",
                   colour = "black",
                   alpha  = .6)

Looks much better, but the data look pretty leptokurtic. The geometric element geom_histogram() has an additional argument bins that allows us to specify how many bins the variable should be categorized into. Let’s change the number of bins to 35 and see if that gives a better impression of the distribution.

# the data set
pirates %>%
  # the canvas
  ggplot(aes(x = age)) +
    # the geometric elements
    geom_histogram(fill   = "turquoise",
                   colour = "black",
                   alpha  = .6,
                   bins   = 35)

Looks like our data are closer to normally distributed than we previously thought.

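Currently the plot is showing us the frequency of cases on the y-axis. Let’s get the probability distribution instead by including the argument aes(y = ..density..) in the geom_histogram() function. (In {ggplot2} 3.4.0 and later, aes(y = after_stat(density)) is the preferred spelling; the ..density.. notation still works but triggers a deprecation warning.)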

# the data set
pirates %>%
  # the canvas
  ggplot(aes(x = age)) +
    # the geometric elements
    geom_histogram(aes(y  = ..density..), 
                   fill   = "turquoise",
                   colour = "black",
                   alpha  = .6,
                   bins   = 35)

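As you can see, the y-axis now shows density rather than counts: the bar heights are rescaled so that the areas of the bars sum to 1. The proportion of cases in a bin is therefore the bar’s height multiplied by its width.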
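
To check what the density scale means, the sketch below rebuilds a density histogram on R's built-in faithful data set (used here only so the example runs without {yarrr}) and adds up the bar areas. It uses after_stat(density), the newer spelling of ..density.., and layer_data() to pull out the values {ggplot2} computed for each bar.

```r
library(ggplot2)

# faithful is a built-in data set; waiting is a continuous variable
p_density <- ggplot(faithful, aes(x = waiting)) +
  geom_histogram(aes(y = after_stat(density)), bins = 35)

# the values ggplot2 computed for each bar
bars <- layer_data(p_density)

# area of each bar = height (density) x width; the areas sum to 1
total_area <- sum(bars$density * (bars$xmax - bars$xmin))
total_area
```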

I haven’t mentioned this yet, but you can add multiple geometric elements to a single plot. Let’s also add a purple density curve over the histogram using geom_density(colour = "darkorchid", lwd = 1.20). I also included the argument lwd = 1.20 to make the width of the line slightly bigger.

# the data set
pirates %>%
  # the canvas
  ggplot(aes(x = age)) +
    # the geometric elements
    geom_histogram(aes(y  = ..density..), 
                   fill   = "turquoise",
                   colour = "black",
                   alpha  = .6,
                   bins   = 35) +
    geom_density(colour = "darkorchid",
                 lwd    = 1.20)

We can also change the way the canvas looks by using the suite of functions that start with the theme_ prefix. I like theme_bw(), but other choices include theme_gray(), theme_minimal(), and theme_classic().

# the data set
pirates %>%
  # the canvas
  ggplot(aes(x = age)) +
    # the geometric elements
    geom_histogram(aes(y  = ..density..), 
                   fill   = "turquoise",
                   colour = "black",
                   alpha  = .6,
                   bins   = 35) +
    geom_density(colour = "darkorchid",
                 lwd    = 1.20) +
    # set theme
    theme_bw()

No plot is complete without proper labels. Fortunately, {ggplot2} includes the labs() function for that exact purpose.

# the data set
pirates %>%
  # the canvas
  ggplot(aes(x = age)) +
    # the geometric elements
    geom_histogram(aes(y  = ..density..), 
                   fill   = "turquoise",
                   colour = "black",
                   alpha  = .6,
                   bins   = 35) +
    geom_density(colour = "darkorchid",
                 lwd    = 1.20) +
    # set theme
    theme_bw() +
    # add labels 
    labs(title    = "Probability Distribution of Pirate Ages",
         subtitle = "A ggplot plot",
         x        = "Age",
         y        = "Density",
         caption  = "Data from the `yarrr` package.")

Finally, let’s see if the age distribution differs by the pirate’s sex. To do so, we can add facet_wrap(~sex). In short, the plot will be split into different plots based on the groups of the variable included after the tilde (~).

# the data set
pirates %>%
  # the canvas
  ggplot(aes(x = age)) +
    # the geometric elements
    geom_histogram(aes(y  = ..density..), 
                   fill   = "turquoise",
                   colour = "black",
                   alpha  = .6,
                   bins   = 35) +
    geom_density(colour = "darkorchid",
                 lwd    = 1.20) +
    # set theme
    theme_bw() +
    # add labels 
    labs(title    = "Probability Distribution of Pirate Ages",
         subtitle = "A ggplot plot",
         x        = "Age",
         y        = "Density",
         caption  = "Data from the `yarrr` package.") +
    # split the plot by sex
    facet_wrap(~sex)


Categorical by Continuous Plots

Cool. But when we are looking at our data, histograms only get us so far. Usually we are interested in the relationship between two or more variables.

Below we’ve created a canvas with sex on the x-axis and height on the y-axis.

# the data set
pirates %>%
  # the canvas
  ggplot(aes(x = sex, y = height))

Even without adding geometric elements, we can see that sex is on the x-axis and height is on the y-axis. Now, let’s add bars using geom_col() to compare the average height for women, men, and those who identify as some other sex.

# the data set
pirates %>%
  # the canvas
  ggplot(aes(x = sex, y = height)) +
    # the geometric elements
    geom_col()

Shoot! It looks like it added up all the heights. We wanted the average. Let’s use group_by() and summarize() to calculate the means of the groups before plotting the values.

# the data set
pirates %>%
  # get mean height by sex
  group_by(sex) %>%
  summarise(height_avg = mean(height)) %>%
  # the canvas
  ggplot(aes(x = sex, y = height_avg)) +
    # the geometric elements
    geom_col()

There we go!
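
To see why the first attempt overshot, it can help to put the two summaries side by side. The sketch below uses the built-in mtcars data set as a stand-in (cyl plays the role of sex and mpg the role of height). Given one row per case, geom_col() stacks the rows, so the bar height is the group sum, which is the mean multiplied by the group size.

```r
library(dplyr)

# mtcars is built in; cyl stands in for sex and mpg for height
by_cyl <- mtcars %>%
  group_by(cyl) %>%
  summarise(mpg_sum = sum(mpg),   # what geom_col() stacks up from raw rows
            mpg_avg = mean(mpg),  # what we actually want to plot
            n       = n())

by_cyl
```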

Now let’s update the colours (using fill and colour), the opacity (using alpha), the width of the bars (using width), the theme (using theme_bw()), and the labels (using labs()).

# the data set
pirates %>%
  # get mean height by sex
  group_by(sex) %>%
  summarise(height_avg = mean(height)) %>%
  # the canvas
  ggplot(aes(x = sex, y = height_avg)) +
    # the geometric elements
    geom_col(fill   = "turquoise",
             colour = "white",
             alpha  = .7,
             width  = .5) +
    # set theme
    theme_bw() +
    # add labels 
    labs(title    = "Average height of pirate by pirate sex",
         subtitle = "A ggplot plot",
         x        = "Sex",
         y        = "Height (cm)",
         caption  = "Data from the `yarrr` package.")

Looks much better! Let’s also rearrange the columns from shortest to tallest using reorder(sex, height_avg). Here we are saying reorder the levels of the sex variable by average height (height_avg). Note. If we wanted to arrange from tallest to shortest, we would prepend - to height_avg (i.e., reorder(sex, -height_avg)).

# the data set
pirates %>%
  # get mean height by sex
  group_by(sex) %>%
  summarise(height_avg = mean(height)) %>%
  # the canvas
  ggplot(aes(x = reorder(sex, height_avg), y = height_avg)) +
    # the geometric elements
    geom_col(fill   = "turquoise",
             colour = "white",
             alpha  = .7,
             width  = .5) +
    # set theme
    theme_bw() +
    # add labels 
    labs(title    = "Average height of pirate by pirate sex",
         subtitle = "A ggplot plot",
         x        = "Sex",
         y        = "Height (cm)",
         caption  = "Data from the `yarrr` package.")
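
To see what reorder() does on its own, here is a toy sketch with made-up group names and averages (grp and avg are hypothetical and not part of the pirates data). reorder() returns a factor whose levels are sorted by the second argument, and {ggplot2} draws discrete axes in level order.

```r
# a toy example: three groups and their (made-up) averages
grp <- c("a", "b", "c")
avg <- c(3, 1, 2)

# ascending: levels are ordered by avg, smallest first
lev_up <- levels(reorder(grp, avg))
lev_up
# → "b" "c" "a"

# descending: negate the ordering variable
lev_down <- levels(reorder(grp, -avg))
lev_down
# → "a" "c" "b"
```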

Now, let’s flip our coordinates so that x is shown on the vertical axis and y is shown on the horizontal axis.

# the data set
pirates %>%
  # get mean height by sex
  group_by(sex) %>%
  summarise(height_avg = mean(height)) %>%
  # the canvas
  ggplot(aes(x = reorder(sex, height_avg), y = height_avg)) +
    # the geometric elements
    geom_col(fill   = "turquoise",
             colour = "white",
             alpha  = .7,
             width  = .5) +
    # set theme
    theme_bw() +
    # add labels 
    labs(title    = "Average height of pirate by pirate sex",
         subtitle = "A ggplot plot",
         x        = "Sex",
         y        = "Height (cm)",
         caption  = "Data from the `yarrr` package.") +
    # flip coordinates
    coord_flip()


Continuous by Continuous Plots

A second (and far more common) type of bivariate (two-variable) plot is a scatterplot. A scatterplot has a continuous variable on the x-axis and a continuous variable on the y-axis. Let’s create the canvas for that plot below by specifying that height should be on the x-axis and weight should be on the y-axis.

# the data set
pirates %>%
  # the canvas
  ggplot(aes(x = height, y = weight))

Now, let’s add some points to our plot using geom_point().

# the data set
pirates %>%
  # the canvas
  ggplot(aes(x = height, y = weight)) +
    # the geometric elements
    geom_point() 

Let’s also add a smoothed trend line using geom_smooth().

# the data set
pirates %>%
  # the canvas
    ggplot(aes(x = height, y = weight)) +
      # the geometric elements
      geom_point() +
      geom_smooth()

Although the line looks pretty straight, it is actually slightly curved because, by default, geom_smooth() fits a flexible smoother rather than a straight line. Let’s change that by specifying method = "lm" in geom_smooth(); method = "lm" tells geom_smooth() that we want the line to be perfectly linear. Let’s also get rid of the confidence interval around the line by specifying se = FALSE.

# the data set
pirates %>%
  # the canvas
    ggplot(aes(x = height, y = weight)) +
    # the geometric elements
    geom_point() +
      geom_smooth(method = "lm", 
                  se     = FALSE)
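
As a quick check that method = "lm" really draws the ordinary least-squares line, the sketch below compares the slope of the drawn line with the slope from calling lm() directly. It uses the built-in mtcars data set (wt and mpg stand in for height and weight) so that it runs without {yarrr}. The formula is written out only to silence the message geom_smooth() otherwise prints.

```r
library(ggplot2)

# mtcars is built in; wt and mpg stand in for height and weight
p_lm <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

# the points that make up the drawn line (layer 2 is geom_smooth)
line_pts <- layer_data(p_lm, 2)

# slope of the drawn line vs. slope from fitting lm() directly
slope_plot <- coef(lm(y ~ x, data = line_pts))[["x"]]
slope_lm   <- coef(lm(mpg ~ wt, data = mtcars))[["wt"]]
c(plot = slope_plot, lm = slope_lm)
```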

What if we wanted to distinguish the points by the pirates’ sex? As mentioned before, we can also specify an aesthetic mapping for colour (and fill). Let’s map the points’ colours to the pirates’ sexes by including aes(colour = sex) in the geom_point() function.

# the data set
pirates %>%
  # the canvas
    ggplot(aes(x = height, y = weight)) +
      # the geometric elements
      geom_point(aes(colour = sex)) +
      geom_smooth(method = "lm", 
                  se     = FALSE)

The points are different colours, but I was also hoping to get separate lines for each sex.

Only the points are grouped by colour because we only specified colour = sex in the aesthetic mapping (aes()) of geom_point(). We could specify colour = sex in both geom_point() and geom_smooth() to get different coloured points and different coloured lines for each sex OR we can specify colour = sex inside the aes() argument in the ggplot() function and geom_point() and geom_smooth() will inherit the aesthetic mapping.

# the data set
pirates %>%
  # the canvas
    ggplot(aes(x = height, y = weight, colour = sex)) +
      # the geometric elements
      geom_point() +
      geom_smooth(method = "lm", 
                  se     = FALSE)
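
The difference between mapping colour inside one geom and mapping it inside ggplot() can be checked by counting how many lines geom_smooth() draws. The sketch below uses the built-in mtcars data set, with factor(cyl) standing in for sex, and layer_data() to inspect the smoother layer.

```r
library(ggplot2)

# colour mapped only inside geom_point(): the smoother sees one group
p_local <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(aes(colour = factor(cyl))) +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

# colour mapped inside ggplot(): every layer inherits it
p_global <- ggplot(mtcars, aes(x = wt, y = mpg, colour = factor(cyl))) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

# number of separate lines drawn by geom_smooth() (layer 2)
n_lines_local  <- length(unique(layer_data(p_local, 2)$group))
n_lines_global <- length(unique(layer_data(p_global, 2)$group))
c(local = n_lines_local, global = n_lines_global)
```

The first plot yields a single line and the second yields one line per level of the colour variable.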

Success! (Well, sort of). I really don’t like the colours.

Luckily, {ggplot2} has a solution for that: the scale_ functions. Scale functions start with scale_ and are followed by the aesthetic mapping you wish to scale (e.g., x, y, colour, fill, size, alpha, shape). The final part of the function name specifies how you would like to scale that aesthetic. For example, scale_y_log10() scales the y-axis to be on a logarithmic scale.

Let’s use scale_colour_manual() to manually set the colours of the plot.

# the data set
pirates %>%
  # the canvas
    ggplot(aes(x = height, y = weight, colour = sex)) +
      # the geometric elements
      geom_point() +
      geom_smooth(method = "lm", 
                  se     = FALSE) +
    # manually set colours
    scale_colour_manual(values = c("red", "blue", "green"))

Note. Instead of using strings (e.g., “red”, “blue”), we can also use the HTML hex codes to set the colours (e.g., “#FF0000”, “#000CF3”). You can also set a colour using the rgb() function. The rgb function takes three primary arguments (i.e., red, green, blue) that allow you to set the amount of red, green, and blue in your desired colour (e.g., for the colour blue, you would use rgb(0, 0, 1)).
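
As a small illustration of the note above, the sketch below writes the same blue three ways. The maxColorValue argument of rgb() lets you use the 0 to 255 scale instead of proportions.

```r
# three ways of writing the same blue
blue_rgb <- rgb(0, 0, 1)                         # proportions from 0 to 1
blue_255 <- rgb(0, 0, 255, maxColorValue = 255)  # 0 to 255 scale
blue_hex <- "#0000FF"                            # hex code

c(blue_rgb, blue_255, blue_hex)
```

rgb() simply returns a hex string, so its result can be used anywhere a colour name is accepted, including the values argument of scale_colour_manual().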

The new colours are even worse. Okay, let’s use a built-in colour palette. To do so, we can use the scale_colour_brewer() function and specify the palette choice with the palette argument. The available palettes are the ColorBrewer palettes; they can be previewed in R with RColorBrewer::display.brewer.all() or by using the interactive website mentioned at the outset of this lab.

# the data set
pirates %>%
  # the canvas
    ggplot(aes(x = height, y = weight, colour = sex)) +
      # the geometric elements
      geom_point() +
      geom_smooth(method = "lm", 
                  se     = FALSE) +
    # set colour palette
    scale_colour_brewer(palette = "RdYlBu")

I also really don’t like that. Let’s use scale_colour_viridis_d() to use a colour-blind safe colour palette.

# the data set
pirates %>%
  # the canvas
    ggplot(aes(x = height, y = weight, colour = sex)) +
      # the geometric elements
      geom_point() +
      geom_smooth(method = "lm", 
                  se     = FALSE) +
    # set colour palette
    scale_colour_viridis_d()

Much better! But I still can’t distinguish between the different lines. Let’s use facet_wrap() to produce a separate plot for each sex.

# the data set
pirates %>%
  # the canvas
    ggplot(aes(x = height, y = weight, colour = sex)) +
      # the geometric elements
      geom_point() +
      geom_smooth(method = "lm", 
                  se     = FALSE) +
    # set colour palette
    scale_colour_viridis_d() +
    # facet wrap by sex
    facet_wrap(~sex)

Finally, let’s add proper labels (using labs()), change the theme (using theme_bw()), and drop the redundant legend (using theme(legend.position = "none")).

# the data set
pirates %>%
  # the canvas
    ggplot(aes(x = height, y = weight, colour = sex)) +
      # the geometric elements
      geom_point() +
      geom_smooth(method = "lm", 
                  se     = FALSE) +
    # set colour palette
    scale_colour_viridis_d() +
    # facet wrap by sex
    facet_wrap(~sex) +
    # add labels
    labs(title    = "Associations between height and weight by sex",
         x        = "Height (cm)",
         y        = "Weight (kg)") +
    # set theme
    theme_bw() +
    # remove legend
    theme(legend.position = "none")

Minihacks

You are welcome to work with a partner or in a small group of 2-3 people. Please feel free to ask the lab leader any questions you might have!

The minihacks all use the movies data set from the {yarrr} package.

Minihack 1: Histograms

  1. Create a histogram of domestic and international revenue combined (revenue.all). Is it normally distributed?
# your code here
  2. Change the number of bins to 50.
# your code here
  3. Change the theme of the canvas to your preferred theme.
# your code here
  4. Change the fill and the colour of the histogram.
# your code here
  5. Add an informative title and a label for the x and the y axis.
# your code here

Minihack 2: Categorical by Continuous Plots

  1. Create a barplot with total revenue (revenue.all) on the y-axis and movie genre (genre) on the x-axis. (Hint. You will want to calculate the mean revenue for each genre before you plot it. You will also want to filter out cases (using filter()) that are NA for the genre variable.)
# your code here
  2. Reorder the bars from the lowest total revenue to the greatest total revenue.
# your code here
  3. Flip the coordinates by adding the function coord_flip().
# your code here
  4. Lastly, change the colour of the bars (Hint. You will want to use fill, not colour) and the width of the bars.
# your code here

Minihack 3: Continuous by Continuous Plots

  1. Use filter() to remove all rows that have NA for release year (year), total revenue (revenue.all), budget (budget), rating (rating), or genre (genre). Also, remove all rows that have a rating (rating) of "Not Rated".
# your code here
  2. Using the new data, create a {ggplot2} canvas with year mapped to the x-axis, revenue.all mapped to the y-axis, and rating mapped to colour.
# your code here
  3. Add geom_point() and geom_smooth() to the plot. Make sure the geom_smooth() line is linear and remove standard errors.
# your code here
  4. Add an argument to geom_point() that maps budget to the size of the points. Make the points half transparent using the alpha argument.
# your code here
  5. Scale the y-axis to be logarithmic.
# your code here
  6. Scale the colours to be colour-blind friendly.
# your code here
  7. Facet wrap by genre.
# your code here
  8. Use labs() to add labels to your plot and use one of the theme_ functions to change the plot’s theme to your preferred theme.
# your code here