You can download the rmd file here.
ggplot2 follows a theory of data visualization called the grammar of graphics. You can summarize this grammar as:
Each graph has the following components:
data: the dataset containing the variables you want to visualize
geom: the type of geometric object you want to graph (e.g., bars, points, boxplots)
aes: the aesthetic attributes you want to apply to the geometric object (including which variables should be on the x and y axes, and the color, shape, and size of the geometric object)
Here is a general ggplot template:
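A minimal sketch of that general form, followed by one concrete instance. The capitalized placeholder names are my own illustrative assumptions, and the example assumes the pirates data from the yarrr package:

```r
library(ggplot2)
library(yarrr)  # assumed source of the pirates data used in this lab

# General shape (capitalized placeholders are illustrative, not real names):
#   ggplot(data = DATA) +
#     GEOM_FUNCTION(aes(MAPPINGS))

# One concrete instance: bars (geom) of the sex variable (aes) from pirates (data)
template_plot <- ggplot(data = pirates) +
  geom_bar(aes(x = sex))
template_plot
```

Every plot in this lab is a variation on this shape: swap the geom, change the mappings, or add more layers with +.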
The data set we’re using for today’s lab is called pirates, and it’s from the {yarrr} package. Below, we’re converting the pirates data set to a tibble (an upgraded version of the base R data frame).
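library(tidyverse) # provides as_tibble() and ggplot2
library(yarrr) # provides the pirates data set
pirates <- as_tibble(pirates) # convert pirates data to a tibble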
pirates$sex <- factor(pirates$sex) # convert sex to a factor
levels(pirates$sex) <- c("female", "male", "intersex") # label the levels of sex
levels(pirates$sex)
## [1] "female" "male" "intersex"
Next, go ahead and inspect the pirates data set. What variables does the data set contain?
head(pirates)
?pirates
By default, the y-axis will be set to frequency (i.e., count).
ggplot(data = pirates) +
geom_bar(aes(x = sex))
Changing y-axis to proportions.
ggplot(data = pirates) +
geom_bar(aes(x = sex, y = ..prop.., group = 1))
Changing y-axis to percentages.
ggplot(data = pirates) +
geom_bar(aes(x = sex, y = ..prop.., group = 1)) +
scale_y_continuous(labels = scales::percent_format(accuracy = 1L))
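If you want to check what the proportion axis is showing, you can compute the same numbers directly. This is a small sketch that assumes pirates is loaded from yarrr. Setting group = 1 tells ggplot to treat all bars as one group, so each proportion is that bar's count divided by the total:

```r
library(yarrr)  # assumed source of the pirates data

# Proportion of pirates in each sex category, computed without ggplot2
sex_counts <- table(pirates$sex)
sex_props <- prop.table(sex_counts)  # each count divided by the total
round(sex_props, 3)
```

The values should line up with the bar heights in the proportion plot above.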
To make all of the bars the same color, set fill to a specific color outside of the aesthetic function.
ggplot(data = pirates) +
geom_bar(aes(x = sex), fill = 'blue')
You can also change the outline color of the bars using color.
ggplot(data = pirates) +
geom_bar(aes(x = sex), fill = 'blue', color = 'black')
Fill the bars with colors based on the levels of a categorical variable by setting the fill argument inside of the aesthetic function to a specific variable.
ggplot(data = pirates) +
geom_bar(aes(x = sex, fill = sex), color = 'black')
Set the fill argument to a second categorical variable.
ggplot(data = pirates) +
geom_bar(aes(x = sex, fill = headband))
It can be aesthetically helpful to dodge the bars.
ggplot(data = pirates) +
geom_bar(aes(x = sex, fill = headband), position = "dodge")
Or fill the bars.
ggplot(data = pirates) +
geom_bar(aes(x = sex, fill = headband), position = "fill")
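To see the numbers behind the stacked, dodged, and filled versions, here is a quick sketch with base R tables (assuming pirates comes from yarrr). The raw counts are what the stacked and dodged bars show; the row proportions are what position = "fill" shows:

```r
library(yarrr)  # assumed source of the pirates data

# Cross-tabulate sex by headband use
counts <- table(pirates$sex, pirates$headband)
counts  # segment heights in the stacked and dodged plots

# Within each sex, the share of pirates with and without a headband
within_sex <- prop.table(counts, margin = 1)  # each row sums to 1
round(within_sex, 2)  # segment heights under position = "fill"
```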
You can manually choose the colors that ggplot fills the bars with by using scale_fill_manual.
ggplot(data = pirates) +
geom_bar(aes(x = favorite.pirate, fill = favorite.pirate), color = 'black') +
scale_fill_manual(values = c('blue', 'red', 'green', 'yellow', 'purple', 'pink'))
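An unnamed vector of colors is assigned to the levels in order. If you'd rather tie each color to a specific level, scale_fill_manual also accepts a named vector. Here's a sketch of that; named_colors and pirate_names are names I made up for illustration, and the level names are pulled from the data rather than typed by hand:

```r
library(ggplot2)
library(yarrr)  # assumed source of the pirates data

# Take the level names from the data so each color is tied to a specific level
pirate_names <- sort(unique(as.character(pirates$favorite.pirate)))
named_colors <- setNames(
  c("blue", "red", "green", "yellow", "purple", "pink"),
  pirate_names
)

named_plot <- ggplot(data = pirates) +
  geom_bar(aes(x = favorite.pirate, fill = favorite.pirate), color = "black") +
  scale_fill_manual(values = named_colors)
named_plot
```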
The RColorBrewer package contains some rad custom color palettes. You can see the color palette options using display.brewer.all() after opening the RColorBrewer library.
# install.packages("RColorBrewer")
library(RColorBrewer)
display.brewer.all()
Let’s manually choose our colors for the graph above again, but this time use one of the RColorBrewer color palettes. Choose one of the palettes and the number of colors to take from it.
my_palette <- brewer.pal(6, "Spectral") # number of colors, name of color palette
ggplot(data = pirates) +
geom_bar(aes(x = favorite.pirate, fill = favorite.pirate), color = 'black') +
scale_fill_manual(values = my_palette)
You can also choose specific colors from a color palette using indexing.
my_2nd_palette <- c(brewer.pal(9, "Greens")[2], brewer.pal(9, "Greens")[3], brewer.pal(9, "Greens")[4], brewer.pal(9, "Greens")[5], brewer.pal(9, "Greens")[6], brewer.pal(9, "Greens")[7])
ggplot(data = pirates) +
geom_bar(aes(x = favorite.pirate, fill = favorite.pirate), color = 'black') +
scale_fill_manual(values = my_2nd_palette)
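The same six shades can be pulled out in one step by indexing with a range. A short sketch (my_2nd_palette_short is just an illustrative name):

```r
library(RColorBrewer)

greens <- brewer.pal(9, "Greens")    # all nine shades, light to dark
my_2nd_palette_short <- greens[2:7]  # shades 2 through 7 in one step
my_2nd_palette_short
```

You can also skip around, for example greens[c(2, 4, 6, 8)], if you want more contrast between neighboring bars.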
Here are some custom palettes other people have made from hex codes that are pretty nice:
jazzcup <- c("#80E0DF", "#31AEA6", "#3E88BC", "#783A9C", "#3A2C82")
crystal_pepsi <- c("#CCFFFC", "#E4E9FF", "#F2DCFF", "#FFCEFF")
sunset <- c("#F58F80", "#D5539C", "#FC2A7F", "#A81B56", "#691344")
ggplot(data = pirates) +
geom_bar(aes(x = favorite.pirate, fill = favorite.pirate), color = 'black') +
scale_fill_manual(values = c(crystal_pepsi, '#E4E9FF', '#CCFFFC'))
You can also choose one of a number of ggplot themes.
Some examples of available themes:
theme_gray() # this is the default theme for ggplot
theme_bw()
theme_dark()
theme_classic()
theme_light()
theme_linedraw()
theme_minimal()
theme_void()
centered_plot +
theme_minimal() +
text_settings
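To try a theme on one of the plots from this lab, add it as another layer. This sketch uses a stand-in plot I'm calling base_plot (a made-up name), assuming pirates comes from yarrr:

```r
library(ggplot2)
library(yarrr)  # assumed source of the pirates data

# base_plot is a stand-in name for illustration
base_plot <- ggplot(data = pirates) +
  geom_bar(aes(x = sex), fill = "blue")

base_plot + theme_minimal()  # light grid lines, no background panel
base_plot + theme_classic()  # axis lines only, no grid
```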
Default histogram.
ggplot(data = pirates) +
geom_histogram(aes(x = age))
Specify the number of bins.
ggplot(data = pirates) +
geom_histogram(aes(x = age), bins = 10)
ggplot(data = pirates) +
geom_histogram(aes(x = age), bins = 30)
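The bins argument sets how many bars you get. If you'd rather set how wide each bar is, geom_histogram also has a binwidth argument, given in the units of the x variable. A quick sketch, assuming pirates comes from yarrr and picking 5 years as an arbitrary width:

```r
library(ggplot2)
library(yarrr)  # assumed source of the pirates data

# bins = number of bars; binwidth = width of each bar in data units (years here)
age_hist <- ggplot(data = pirates) +
  geom_histogram(aes(x = age), binwidth = 5)
age_hist
```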
One option: make the entire histogram one color by using fill outside of the aesthetic argument. You can also use the color argument to choose a border color.
ggplot(data = pirates) +
geom_histogram(aes(x = age), fill = 'turquoise', color = 'black')
Another option: fill based on a categorical variable by setting fill to a specific variable inside of the aesthetic argument.
ggplot(data = pirates) +
geom_histogram(aes(x = age, fill = sex), color = 'black')
Change transparency using the alpha argument.
ggplot(data = pirates) +
geom_histogram(aes(x = age, fill = sex), color = 'black', alpha = 0.6)
Add a smooth curve on top of your histogram.
ggplot(data = pirates, aes(x = age, y = ..density..)) +
geom_histogram(fill = 'turquoise', color = 'black', alpha = 0.45) +
geom_density(color = 'darkorchid', size = 1.1)
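If you're on a recent version of ggplot2 (3.4.0 or later, as far as I know), the code above still runs but prints deprecation warnings about ..density.. and about size for lines. Here is a sketch of the same plot in the newer syntax, assuming that version and the pirates data from yarrr:

```r
library(ggplot2)  # assumes ggplot2 >= 3.4.0
library(yarrr)    # assumed source of the pirates data

# after_stat(density) replaces ..density..; linewidth replaces size for lines
density_plot <- ggplot(data = pirates, aes(x = age, y = after_stat(density))) +
  geom_histogram(fill = "turquoise", color = "black", alpha = 0.45) +
  geom_density(color = "darkorchid", linewidth = 1.1)
density_plot
```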
Use facet_wrap to get a histogram for a particular continuous variable across different levels of a categorical variable.
ggplot(data = pirates, aes(x = age, y = ..density..)) +
geom_histogram(fill = 'turquoise', color = 'black', alpha = 0.45) +
geom_density(color = 'darkorchid', size = 1.1) +
facet_wrap(~sex)
Now, fancy it up with label customization.
ggplot(data = pirates, aes(x = age, y = ..density..)) +
geom_histogram(fill = 'turquoise', color = 'black', alpha = 0.45) +
geom_density(color = 'darkorchid', size = 1.1) +
facet_wrap(~sex) +
labs(title = "Age Distribution of Pirates", # add labels
subtitle = "Arrrrrrr matey!",
x = "Age",
y = "Density",
caption = "Data taken from the `yarrr` package.") +
theme_bw() + # add a theme
theme(plot.title = element_text(hjust = 0.5), # center the title
plot.subtitle = element_text(hjust = 0.5))
Basic frequency polygon for a single variable.
ggplot(data = pirates) +
geom_freqpoly(aes(x = age), color = 'blue', size = 1)
See a frequency polygon for a particular continuous variable across different levels of a categorical variable.
ggplot(data = pirates) +
geom_freqpoly(aes(x = age, color = headband), size = 1)
Fancy it up with label customization.
ggplot(data = pirates) +
geom_freqpoly(aes(x = age, color = headband), size = 1) +
labs(title = "Age Distribution of Pirates", # add labels
subtitle = "Arrrrrrr matey!",
x = "Age",
y = "Frequency",
caption = "Data taken from the `yarrr` package.") +
theme_bw() + # add a theme
theme(plot.title = element_text(hjust = 0.5), # center the title
plot.subtitle = element_text(hjust = 0.5))
Another variable in the pirates dataset is favorite.pirate, which is each pirate’s self-reported favorite pirate. Let’s visualize age by favorite pirate.
Notice below that, instead of a color name, I’ve used a color’s hex code. Here are a couple of websites for finding hex codes:
RGB Color Codes Chart
HTML Color Codes
ggplot(data = pirates) +
geom_boxplot(aes(x = favorite.pirate, y = age), fill = '#808284', color = '#9b111e', alpha = .7)
We can also reorder the boxes from youngest to oldest using reorder(favorite.pirate, age), which sorts the levels of favorite.pirate by the mean age within each level.
ggplot(data = pirates) +
geom_boxplot(aes(x = reorder(favorite.pirate, age), y = age), fill = '#808284', color = '#9b111e', alpha = .7)
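By default, reorder() sorts by the mean of the second variable. Since the line inside each box is the median, you may prefer to sort by the median instead. A sketch using reorder()'s FUN argument, assuming pirates comes from yarrr:

```r
library(ggplot2)
library(yarrr)  # assumed source of the pirates data

# FUN = median orders the boxes by the middle line each box displays
median_box <- ggplot(data = pirates) +
  geom_boxplot(aes(x = reorder(favorite.pirate, age, FUN = median), y = age))
median_box
```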
And if it’s helpful, you can flip the axes using coord_flip.
ggplot(data = pirates) +
geom_boxplot(aes(x = reorder(favorite.pirate, age), y = age), fill = '#808284', color = '#9b111e', alpha = .7) +
coord_flip()
Now, fancying it up.
ggplot(data = pirates) +
geom_boxplot(aes(x = reorder(favorite.pirate, age), y = age), fill = '#808284', color = '#9b111e', alpha = .7) +
coord_flip() +
labs(title = "Age by Favorite Pirate",
subtitle = "Arrrrrrr matey!",
x = "Favorite Pirate", # labels follow the aesthetic, so x is still favorite.pirate after coord_flip()
y = "Age",
caption = "Data taken from the `yarrr` package.") +
theme_bw() +
theme(plot.title = element_text(hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5))
Let’s look at the relationship between the pirates’ heights and weights using geom_point.
ggplot(data = pirates) +
geom_point(aes(x = height, y = weight))
You can add the best-fitting regression line to the scatterplot using geom_smooth. Use the argument method = "lm" when you want the best-fitting linear regression line.
ggplot(data = pirates) +
geom_point(aes(x = height, y = weight)) +
geom_smooth(aes(x = height, y = weight), method = "lm")
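The line that geom_smooth draws with method = "lm" comes from an ordinary linear regression of y on x. If you want the actual intercept and slope, you can fit that model yourself. A minimal sketch, assuming pirates comes from yarrr:

```r
library(yarrr)  # assumed source of the pirates data

# Linear regression of weight on height, the same model the smooth line shows
fit <- lm(weight ~ height, data = pirates)
coef(fit)  # intercept and slope of the fitted line
```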
You can make all the points on the scatterplot one color by setting the color argument equal to the desired hue outside of the aesthetic argument.
ggplot(data = pirates) +
geom_point(aes(x = height, y = weight), color = 'light coral') +
geom_smooth(aes(x = height, y = weight), method = "lm")
Or color the points on the scatterplot based on which level of a categorical variable each point belongs to. To do so, set the color argument equal to the categorical variable of choice inside the aesthetic argument. (Set se = FALSE if you don’t want the confidence band around the line to show.)
ggplot(data = pirates) +
geom_point(aes(x = height, y = weight, color = sex)) +
geom_smooth(aes(x = height, y = weight), method = "lm", se = FALSE, color = 'black')
As you just saw, you can add a third variable to a scatterplot representing the relationship between two continuous variables. There are a couple of ways of achieving this.
Option 1: Mapping the third variable to an aesthetic
Aesthetic Options:
* Color
* Alpha
* Size
* Shape
Change the color argument below to the other three aesthetic options and see what you notice about how the graph changes.
ggplot(data = pirates) +
geom_point(aes(x = height, y = weight), shape = 8)
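The chunk above sets one fixed shape for every point. To map a third variable instead, the aesthetic goes inside aes(). Here's a sketch of the color version, assuming sex as the third variable and pirates from yarrr:

```r
library(ggplot2)
library(yarrr)  # assumed source of the pirates data

# A variable goes inside aes(); a fixed value such as shape = 8 goes outside
mapped_points <- ggplot(data = pirates) +
  geom_point(aes(x = height, y = weight, color = sex))
mapped_points
# Swap color = sex for alpha = sex, size = sex, or shape = sex to compare
```

ggplot will warn that alpha and size are not ideal for a categorical variable, which is part of what's worth noticing here.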
Options for the shape aesthetic:
Option 2: Facet Wrapping
Another option is to use facet wrapping, which produces separate, side-by-side scatterplots showing the relationship between two continuous variables at each level of a chosen categorical variable.
ggplot(data = pirates) +
geom_point(aes(x = height, y = weight)) +
facet_wrap(~sex)
And you can do both:
ggplot(data = pirates) +
geom_point(aes(x = height, y = weight, color = sex)) +
facet_wrap(~sex)
ggplot(data = pirates) +
geom_point(aes(x = height, y = weight, color = sex), alpha = 0.4) +
geom_smooth(aes(x = height, y = weight), method = "lm", se = FALSE, color = 'black') +
facet_wrap(~sex) +
labs(title = "The Relationship Between Height and Weight Across Levels of Self-Reported Sex",
subtitle = "Arrrrrrr matey!",
x = "Height",
y = "Weight",
caption = "Data taken from the `yarrr` package.") +
theme_bw() +
theme(plot.title = element_text(hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5))
To display multiple plots simultaneously, you can use the plot_grid() function from the cowplot package.
This can be helpful if there are multiple visualizations that you want to compare or that have information you want to take in at the same time.
# install.packages("cowplot")
library(cowplot)
final_hist <- ggplot(data = pirates, aes(x = age, y = ..density..)) +
geom_histogram(fill = 'turquoise', color = 'black', alpha = 0.45) +
geom_density(color = 'darkorchid', size = 1.1) +
facet_wrap(~sex) +
labs(title = "Age Distribution of Pirates",
subtitle = "Arrrrrrr matey!",
x = "Age",
y = "Density",
caption = "Data taken from the `yarrr` package.") +
theme_bw() + # add a theme
theme(plot.title = element_text(hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5))
final_box <- ggplot(data = pirates) +
geom_boxplot(aes(x = reorder(favorite.pirate, age), y = age), fill = '#808284', color = '#9b111e', alpha = .7) +
coord_flip() +
labs(title = "Age by Favorite Pirate",
subtitle = "Arrrrrrr matey!",
x = "Favorite Pirate", # labels follow the aesthetic, so x is still favorite.pirate after coord_flip()
y = "Age",
caption = "Data taken from the `yarrr` package.") +
theme_bw() +
theme(plot.title = element_text(hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5))
plot_grid(final_hist, final_box)
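By default the plots sit side by side, which can get cramped. plot_grid() also takes ncol and labels arguments, so you can stack the plots and tag them. A sketch with two small stand-in plots (p_hist and p_box are illustrative names), assuming pirates comes from yarrr:

```r
library(ggplot2)
library(cowplot)
library(yarrr)  # assumed source of the pirates data

# Two small stand-in plots (names are illustrative)
p_hist <- ggplot(data = pirates) + geom_histogram(aes(x = age), bins = 30)
p_box <- ggplot(data = pirates) + geom_boxplot(aes(x = favorite.pirate, y = age))

# Stack the plots in one column and tag them A and B
stacked <- plot_grid(p_hist, p_box, ncol = 1, labels = c("A", "B"))
stacked
```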
If you want to plot data coming from different data sets, then specify the data argument within the geom_x function itself.
Below, I’ve just made a shortened version of the pirates dataset (pirates_short) and plotted weight by height from both the original and shortened versions. It is likely helpful to specify different colors for each plot so you can tell which dataset the points are originating from.
pirates_short <- pirates[1:100,]
ggplot() +
geom_point(data = pirates, aes(x = weight, y = height), color = 'blue') +
geom_point(data = pirates_short, aes(x = weight, y = height), color = 'red')
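One downside of setting the colors by hand is that you don't get a legend. An alternative is to stack the two data sets with a column saying where each row came from, then map color to that column. This is just one way to do it; both and source are names I made up, and pirates is assumed to come from yarrr:

```r
library(ggplot2)
library(yarrr)  # assumed source of the pirates data

pirates_short <- pirates[1:100, ]

# Tag each row with its origin, then stack the two data frames
both <- rbind(
  cbind(pirates, source = "all pirates"),
  cbind(pirates_short, source = "first 100")
)

# Mapping color to the source column gives a legend automatically
combined_plot <- ggplot(data = both) +
  geom_point(aes(x = weight, y = height, color = source))
combined_plot
```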
The minihacks today are intentionally very open-ended. Get as creative as you want!
Data visualization is a great way to uncover stories in the data that would be difficult to notice by just looking at the numbers. See what stories you can uncover by exploring individual variables and their relationships with each other.
For all of these minihacks, you can use any variable(s) from the pirates dataset, or you can take a look at the movies dataset from the {yarrr} package to see if there are variables of interest to you.
While we still have a few minutes left in class, I’ll ask people to share some of their visualizations!
1a. Create a visualization of a single categorical variable and a single continuous variable. Add as many customization features as you want (e.g., color, labels, text settings, themes, etc.).
1b. Describe what’s being illustrated by the visualization (as if you were explaining to someone who is very unfamiliar with this data and with interpreting visualizations).
2a. Create a visualization of a continuous variable by a categorical variable. For example, you can create a boxplot/histogram/frequency polygon split by the levels of a categorical variable.
2b. Again, describe the story being told by the visualization.
3a. Create a scatterplot representing the relationship between two continuous variables. Choose one of the methods we discussed to add a third variable to the plot.
3b. What’s the story here?