The purpose of today’s lab is to introduce you to visualizing data. Although R has built in plotting functions, we will be using the {ggplot2}
package in today’s lab. It is far more powerful than the base R plotting functions and is the basis for other packages that allow you to create interactive plots (e.g., {plotly}
), animated plots (e.g., {gganimate}
), and 3D plots (e.g., {rayshader}
). Being part of the {tidyverse}
, {ggplot2}
was also designed to play nice with other tidyverse
packages (e.g., {dplyr}
, {tidyr}
).
The content of today’s labs will be split into four sections. The first section, The Grammar of Graphics will provide a brief overview of the logic behind how {ggplot2}
works. The second section, Histograms, will provide an overview of how to create a histogram and customize it. The third section, Categorical by Continuous Plots, will guide you through the process of creating a bar chart in R. Finally, in the Continuous by Continuous Plots section, we will discuss how to create plots where you have have a continuous variable on both axes. As always, the lab will end with a set of minihacks to test your knowledge.
To quickly navigate to the desired section, click one of the following links:
Below are some resources that you may find useful going forward:
The {ggplot2}
package is built around Leland Wilkinson’s idea of the Grammar of Graphics. The Grammar of Graphics proposes that every graph can be created from (1) a data set (e.g., mtcars
, world happiness
), (2) a coordinate system or canvas (e.g., a Cartesian coordinate system), and (3) a set of geometric elements that represent the data (e.g., points, lines, polygons).
Below is the data set pirates
from the {yarrr}
package. All of the plots in today’s lab, excluding those created in the Minihacks, will be using the pirates
data set.
# the data set
pirates
## # A tibble: 1,000 x 17
## id sex age height weight headband college tattoos tchests parrots
## <int> <chr> <dbl> <dbl> <dbl> <chr> <chr> <dbl> <dbl> <dbl>
## 1 1 male 28 173. 70.5 yes JSSFP 9 0 0
## 2 2 male 31 209. 106. yes JSSFP 9 11 0
## 3 3 male 26 170. 77.1 yes CCCC 10 10 1
## 4 4 fema… 31 144. 58.5 no JSSFP 2 0 2
## 5 5 fema… 41 158. 58.4 yes JSSFP 9 6 4
## 6 6 male 26 190. 85.4 yes CCCC 7 19 0
## 7 7 fema… 31 158. 59.6 yes JSSFP 9 1 7
## 8 8 fema… 31 173. 74.5 yes JSSFP 5 13 7
## 9 9 fema… 28 165. 68.7 yes JSSFP 12 37 2
## 10 10 male 30 184. 84.7 yes JSSFP 12 69 4
## # … with 990 more rows, and 7 more variables: favorite.pirate <chr>,
## # sword.type <chr>, eyepatch <dbl>, sword.time <dbl>, beard.length <dbl>,
## # fav.pixar <chr>, grogg <dbl>
We can pipe (%>%
) the pirates
data set to the ggplot()
function to create the canvas of our plot. In this case, we will add aes(x = age)
inside the ggplot()
function.
# the data set
pirates %>%
# the canvas
ggplot(aes(x = age))
The aes()
function above tells ggplot()
that we are providing it aesthetic information for the canvas. In this case, we are saying the canvas should have age
on the x-axis. We can also map variables to y
, colour
, size
, shape
, alpha
, and group
, among others. We will return to some of these later.
As should be readily apparent from looking at our plot, a canvas is not that exciting without anything on it. We can add geometric elements to the plot by using {ggplot2}
functions that begin with the geom_
prefix. Let’s add a histogram to the plot by adding geom_histogram()
to the code. Unlike other functions in the tidyverse
, we use the plus symbol(+
) instead of the pipe symbol (%>%
) to add elements to a plot. This makes sense if you consider that we want to add the function to the plot rather than simply wanting to pass information from one function to the next.
Great - it looks like a histogram! …but it is a bit hard to distinguish between the different bins. We can use the fill
and colour
(color
) arguments inside the geom_histogram()
function to specify the colour of the inside and outline of the geometric element, respectively. We can also include the alpha
argument to set the opacity (i.e., the lack of transparency) of the geometric element. Let’s make the inside of the bars turquoise
, the outline of the bars black
, and the opacity of the bars .6
(i.e., 40% transparent).
# the data set
pirates %>%
# the canvas
ggplot(aes(x = age)) +
# the geometric elements
geom_histogram(fill = "turquoise",
colour = "black",
alpha = .6)
Looks much better, but the data look pretty leptokurtotic. The geometric element geom_histogram()
has an additional argument bins
that allows us to specify how many bins the variable should be categorized into. Let’s change the number of bins
to 35
and see if that gives a better impression of the distribution.
# the data set
pirates %>%
# the canvas
ggplot(aes(x = age)) +
# the geometric elements
geom_histogram(fill = "turquoise",
colour = "black",
alpha = .6,
bins = 35)
Looks like our data is more normal than we previously thought.
Currently the plot is showing us the frequency of cases on the y-axis. Let’s get the probability distribution instead by including the argument aes(y = ..density..)
in the geom_histogram()
function.
# the data set
pirates %>%
# the canvas
ggplot(aes(x = age)) +
# the geometric elements
geom_histogram(aes(y = ..density..),
fill = "turquoise",
colour = "black",
alpha = .6,
bins = 35)
As you can see, the y-axis shows the proportion of cases that fall into each age bin.
I haven’t mentioned this yet, but you can add multiple geometric elements to a single plot. Let’s also add a purple density curve over the histogram using geom_density(colour = "darkorchid", lwd = 1.20)
. I also included the argument lwd = 1.20
to make the width of the line slightly bigger.
# the data set
pirates %>%
# the canvas
ggplot(aes(x = age)) +
# the geometric elements
geom_histogram(aes(y = ..density..),
fill = "turquoise",
colour = "black",
alpha = .6,
bins = 35) +
geom_density(colour = "darkorchid",
lwd = 1.20)
We can also change the way the canvas looks by using the suite of functions that start with the theme_
prefix. I like theme_bw()
, but other choices include theme_gray()
, theme_minimal()
, and theme_classic()
.
# the data set
pirates %>%
# the canvas
ggplot(aes(x = age)) +
# the geometric elements
geom_histogram(aes(y = ..density..),
fill = "turquoise",
colour = "black",
alpha = .6,
bins = 35) +
geom_density(colour = "darkorchid",
lwd = 1.20) +
# set theme
theme_bw()
No plot is complete without proper labels. Fortunately, {ggplot2}
includes the labs()
function for that exact purpose.
# the data set
pirates %>%
# the canvas
ggplot(aes(x = age)) +
# the geometric elements
geom_histogram(aes(y = ..density..),
fill = "turquoise",
colour = "black",
alpha = .6,
bins = 35) +
geom_density(colour = "darkorchid",
lwd = 1.20) +
# set theme
theme_bw() +
# add labels
labs(title = "Probability Distribution of Pirate Ages",
subtitle = "A ggplot plot",
x = "Age",
y = "Frequency",
caption = "Data from that `yarrr` package.")
Finally, let’s see if the age distribution differs by the pirate’s sex
. To do so, we can add facet_wrap(~sex)
. In short, the plot will be split into different plots based on the groups of the variable included after the tilde (~
).
# the data set
pirates %>%
# the canvas
ggplot(aes(x = age)) +
# the geometric elements
geom_histogram(aes(y = ..density..),
fill = "turquoise",
colour = "black",
alpha = .6,
bins = 35) +
geom_density(colour = "darkorchid",
lwd = 1.20) +
# set theme
theme_bw() +
# add labels
labs(title = "Probability Distribution of Pirate Ages",
subtitle = "A ggplot plot",
x = "Age",
y = "Frequency",
caption = "Data from that `yarrr` package.") +
# split the plot by sex
facet_grid(~sex)
Your turn! 1. Plot a histogram of the number of parrots each pirate has 2. Split by whether they wear a headband. 3. Label it appropriately and change the colors to ones that a pirate might like!
# the data set
pirates %>%
# the canvas
ggplot(aes(x = parrots)) +
# the geometric elements
geom_histogram(aes(y = ..density..),
fill = "red",
colour = "black",
alpha = 1,
bins = 35) +
geom_density(colour = "blue",
lwd = 1.20) +
# set theme
theme_bw() +
# add labels
labs(title = "Probability Distribution of Parrot Ownership",
subtitle = "A ggplot plot",
x = "Parrots",
y = "Frequency",
caption = "Data from that `yarrr` package.") +
# split the plot by sex
facet_grid(~headband)
Cool. But when we are looking at our data, histograms only get us so far. Usually we are interested in the relationship between two or more variables.
Below we’ve created a canvas with favorite.pirate
on the x-axis and height
on the y-axis.
# the data set
pirates %>%
# the canvas
ggplot(aes(x = favorite.pirate, y = height))
Even without adding geometric elements, we can see that favorite.pirate
is on the x-axis and height
is on the y-axis. Now, let’s use geom_boxplot()
to compare the height based on pirate preferences
# the data set
pirates %>%
# the canvas
ggplot(aes(x = favorite.pirate, y = height)) +
# the geometric elements
geom_violin()
Now let’s update the colours (using fill
and colour
), the opacity (using alpha
), the width of the boxes (using width
), the theme (using theme_bw()
), and the labels (using lab()
).
# the data set
pirates %>%
# the canvas
ggplot(aes(x = favorite.pirate, y = height)) +
# the geometric elements
geom_boxplot(fill = "turquoise",
colour = "black",
alpha = .7,
width = .8) +
# set theme
theme_bw() +
# add labels
labs(title = "Height by favorite pirate",
subtitle = "A ggplot plot",
x = "favorite pirate",
y = "Height (cm)",
caption = "Data from that `yarrr` package.")
Looks much better! Let’s also rearrange the columns from shortest to tallest using reorder(favorite.pirate, height)
. Here we are saying reorder the levels of the favorite.pirate
variable by Height (height
). Note. If we wanted to arrange from tallest to shortest, we would append -
to the beginning of height
(i.e., reorder(favorite.pirate, -height)
).
# the data set
pirates %>%
# the canvas
ggplot(aes(x = reorder(favorite.pirate, -height), y = height)) +
# the geometric elements
geom_boxplot(fill = "turquoise",
colour = "black",
alpha = .7,
width = .5) +
# set theme
theme_bw() +
# add labels
labs(title = "Height by favorite pirate",
subtitle = "A ggplot plot",
x = "Favorite Pirate",
y = "Height (cm)",
caption = "Data from that `yarrr` package.")
Now, let’s flip our coordinates so that x
is shown on the vertical axis and y
is shown on the horizontal axis.
# the data set
pirates %>%
# the canvas
ggplot(aes(x = reorder(favorite.pirate, height), y = height)) +
# the geometric elements
geom_boxplot(fill = "turquoise",
colour = "black",
alpha = .7,
width = .5) +
# set theme
theme_bw() +
# add labels
labs(title = "Height by favorite pirate",
subtitle = "A ggplot plot",
x = "Favorite Pirate",
y = "Height (cm)",
caption = "Data from that `yarrr` package.") +
# flip coordinates
coord_flip()
Your Turn! 1. Plot the weight of pirates by their sword type. 2. Order from the highest to lowest weight. 3. Label appropriately. 4. Choose better colors than I did.
# the data set
pirates %>%
# the canvas
ggplot(aes(x = reorder(sword.type, weight), y = weight)) +
# the geometric elements
geom_boxplot(fill = "#808284",
colour = "#9b111e",
alpha = .7,
width = .5) +
# set theme
theme_bw() +
# add labels
labs(title = "Average weight by sword type",
subtitle = "A ggplot plot",
x = "Sword Type",
y = "Height (cm)",
caption = "Data from that `yarrr` package.")
A second (and far more common) type of bivariate (two variable) plot is a scatterplot. A scatterplot has a continuous variable on the x-axis and a continuous variable on the y-axis. Let’s create the canvas for that plot below by specifying that height
should be on the x-axis and weight
should by on the y-axis.
# the data set
pirates %>%
# the canvas
ggplot(aes(x = height, y = weight))
Now, let’s add some points to our plot using geom_point
.
# the data set
pirates %>%
# the canvas
ggplot(aes(x = height, y = weight)) +
# the geometric elements
geom_point()
Let’s also add a regression line using geom_smooth()
.
# the data set
pirates %>%
# the canvas
ggplot(aes(x = height, y = weight)) +
# the geometric elements
geom_point() +
geom_smooth()
Although the line looks pretty straight, the line is actually slightly curved. Let’s change that by specifying method = "lm"
ingeom_smooth()
; method = lm
tells geom_smooth()
that we want the line to be perfectly linear. Let’s also get rid of the confidence interval around the line by specifying se = FALSE
.
# the data set
pirates %>%
# the canvas
ggplot(aes(x = height, y = weight)) +
# the geometric elements
geom_point() +
geom_smooth(method = "lm",
se = FALSE)
What if we wanted to distinguish the points by the pirates sex? As mentioned before, we can also specify an aesthetic mapping for colour
(and fill
). Let’s map the point’s colours to the pirates’ sexes by including aes(colour = sex))
in the geom_point()
function.
# the data set
pirates %>%
# the canvas
ggplot(aes(x = height, y = weight)) +
# the geometric elements
geom_point(aes(colour = sex)) +
geom_smooth(method = "lm",
se = FALSE)
Your turn! The points are different colors now, but there is a lot of overlap, so they are hard to see, let’s change the opacity to help make the differences more visible. Another problem is that the line is the same color as the “other”, fix that too.
# the data set
pirates %>%
# the canvas
ggplot(aes(x = height, y = weight)) +
# the geometric elements
geom_point(aes(colour = sex), alpha = .4) +
geom_smooth(method = "lm",
se = FALSE,
color = 'black')
Right now, there is only one regression line. What if I want to look at the relationship between height and weight for each sex?
Only the points are grouped by colour because we only specified colour = sex
in the aesthetic mapping (aes()
) of geom_point()
. We could specify colour = sex
in both geom_point()
and geom_smooth()
to get different coloured points and different coloured lines for each sex OR we can specify colour = sex
inside the aes()
argument in the ggplot()
function and geom_point()
and geom_smooth()
will inherit the aesthetic mapping.
# the data set
pirates %>%
# the canvas
ggplot(aes(x = height, y = weight, colour = sex)) +
# the geometric elements
geom_point(alpha = .4) +
geom_smooth(method = "lm",
se = FALSE) +
facet_grid(.~sex)
Your turn! These lines are pretty hard to see since they are overlapping. Let’s make separate grids to visualize this better.
# the data set
pirates %>%
# the canvas
ggplot(aes(x = height, y = weight, colour = sex)) +
# the geometric elements
geom_point(alpha = .4) +
geom_smooth(method = "lm",
se = FALSE) +
# split the plot by sex
facet_grid(~sex)
Getting better! I really don’t like the colors though.
Luckily, {ggplot2}
has a solution for that: the scale_
functions. Scale functions start with scale_
and are followed by the aesthetic mapping you wish to scale (e.g., x
, y
, colour
, fill
, size
, alpha
, shape
). The final part of the function specifies how you would like to scale the axis. For example, scale_y_log10()
scales the y-axis to be on a logarithmic scale.
Let’s use scale_colour_manual()
to manually set the colours of the plot.
# the data set
pirates %>%
# the canvas
ggplot(aes(x = height, y = weight, colour = sex)) +
# the geometric elements
geom_point(alpha = .4) +
geom_smooth(method = "lm",
se = FALSE) +
# split the plot by sex
facet_grid(~sex) +
# manually set colours
scale_colour_manual(values = c("red", "blue", "green"))
Note. Instead of using strings (e.g., “red”, “blue”), we can also use the HTML hex codes to set the colours (e.g., “#FF0000”, “#000CF3”). You can also set a colour using the rgb()
function. The rgb
function takes three primary arguments (i.e., red, green, blue) that allow you to set the amount of red, green, and blue in your desired colour (e.g., for the colour blue, you would use rgb(0, 0, 1)
).
The new colours are even worse. Okay, let’s use a built in colour palette. To do so, we can use the scale_colour_brewer()
function and specify the palette choice after that. The available palettes can be found here or by using the interactive website mentioned at the outset of this lab.
# the data set
pirates %>%
# the canvas
ggplot(aes(x = height, y = weight, colour = sex)) +
# the geometric elements
geom_point(alpha = .4) +
geom_smooth(method = "lm",
se = FALSE) +
# split the plot by sex
facet_grid(~sex) +
# set colour pallete
scale_colour_brewer(palette = "RdYlBu")
I also really don’t like that. Let’s use scale_colour_viridis_d()
to use a colour-blind safe colour palette.
# the data set
pirates %>%
# the canvas
ggplot(aes(x = height, y = weight, colour = sex)) +
# the geometric elements
geom_point(alpha = .4) +
geom_smooth(method = "lm",
se = FALSE) +
# split the plot by sex
facet_grid(~sex) +
# set colour pallete
scale_colour_viridis_d()
Finally, let’s add proper labels (using labs()
), change the theme (using theme_bw()
), and drop the redundant legend (using theme(legend.position = "none")
).
# the data set
pirates %>%
# the canvas
ggplot(aes(x = height, y = weight, colour = sex)) +
# the geometric elements
geom_point() +
geom_smooth(method = "lm",
se = FALSE) +
# set colour pallete
scale_colour_viridis_d() +
# facet wrap by sex
facet_wrap(~sex) +
# add labels
labs(title = "Associations between height and weight by gender",
x = "Height (cm)",
y = "Weight (kg)") +
# set theme
theme_bw() +
# remove legend
theme(legend.position = "none")
You are welcome to work with a partner or in a small group of 2-3 people. Please feel free to ask the lab leader any questions you might have!
The minihacks all use the movies
data set from the {yarrr}
package.
revenue.all
). Is it normally distributed?movies %>%
ggplot(aes(x = revenue.all)) +
geom_histogram()
50
.movies %>%
ggplot(aes(x = revenue.all)) +
geom_histogram(bins = 50)
movies %>%
ggplot(aes(x = revenue.all)) +
geom_histogram(bins = 50) +
theme_bw()
fill
and the colour
of the histogram.movies %>%
ggplot(aes(x = revenue.all)) +
geom_histogram(bins = 50, fill = "darkorchid4", colour = "white") +
theme_bw()
movies %>%
ggplot(aes(x = revenue.all)) +
geom_histogram(bins = 50, fill = "darkorchid4", colour = "white") +
theme_bw() +
labs(x = "International Revenue",
y = "Frequency")
revenue.all
) on the y-axis and movie genre (genre
) on the x-axis.movies %>%
ggplot(aes(x = genre, y = revenue.all)) +
geom_boxplot()
movies %>%
ggplot(aes(x = reorder(genre, revenue.all), y = revenue.all)) +
geom_boxplot()
movies %>%
ggplot(aes(x = reorder(genre, revenue.all), y = revenue.all)) +
geom_boxplot() +
coord_flip()
movies %>%
ggplot(aes(x = reorder(genre, revenue.all), y = revenue.all)) +
geom_boxplot(color='blue',
fill = 'yellow') +
coord_flip()
movies %>%
filter(!is.na(genre)) %>%
ggplot(aes(x = reorder(genre, revenue.all), y = revenue.all)) +
geom_boxplot(color='blue',
fill = 'yellow') +
coord_flip()
movies %>%
filter(!is.na(genre)) %>%
ggplot(aes(x = reorder(genre, revenue.all), y = revenue.all)) +
geom_violin(color='blue',
fill = 'yellow') +
scale_y_log10()
movies %>%
filter(!is.na(genre)) %>%
ggplot(aes(x = reorder(genre, revenue.all), y = revenue.all)) +
geom_boxplot(color='blue',
fill = 'yellow') +
coord_flip() +
scale_y_log10()
filter
to remove all rows that have NA
for release year (year
), total revenue (revenue.all
), budget (budget
), rating (rating
), and genre (genre
). Also, remove all rows that have a rating (rating
) of "Not Rated"
.movies_filtered <- movies %>%
filter(!is.na(year),
!is.na(revenue.all),
!is.na(budget),
!is.na(rating),
!is.na(genre),
rating != "Not Rated")
{ggplot2}
canvas with year
mapped to the x-axis, revenue.all
mapped to the y-axis, and rating
mapped to colour.movies_filtered %>%
ggplot(aes(x = year,
y = revenue.all,
colour = rating))
movies_filtered %>%
ggplot(aes(x = year,
y = revenue.all,
colour = rating)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE)
geom_point
that maps budget
to the size
of the points. Make the points half transparent using the alpha
argument.movies_filtered %>%
ggplot(aes(x = year,
y = revenue.all,
colour = rating)) +
geom_point(aes(size = budget), alpha = .5) +
geom_smooth(method = "lm", se = FALSE)
movies_filtered %>%
ggplot(aes(x = year,
y = revenue.all,
colour = rating)) +
geom_point(aes(size = budget), alpha = .5) +
geom_smooth(method = "lm", se = FALSE) +
scale_y_log10()
movies_filtered %>%
ggplot(aes(x = year,
y = revenue.all,
colour = rating)) +
geom_point(aes(size = budget), alpha = .5) +
geom_smooth(method = "lm", se = FALSE) +
scale_y_log10() +
scale_colour_viridis_d()
genre
.movies_filtered %>%
ggplot(aes(x = year,
y = revenue.all,
colour = rating)) +
geom_point(aes(size = budget), alpha = .5) +
geom_smooth(method = "lm", se = FALSE) +
scale_y_log10() +
scale_colour_viridis_d() +
facet_wrap(~genre)
labs()
to add labels to your plot and use one of the theme_
functions to change the plot’s theme to your preferred theme.movies_filtered %>%
ggplot(aes(x = year,
y = revenue.all,
colour = rating)) +
geom_point(aes(size = budget), alpha = .5) +
geom_smooth(method = "lm", se = FALSE) +
scale_y_log10() +
scale_colour_viridis_d() +
facet_wrap(~genre) +
labs(x = "Revenue",
y = "Year") +
theme_minimal()