You can download the rmd file here.
ggplot2 follows a theory of data visualization called the grammar of graphics. You can summarize this grammar as:
Each graph has the following components:
data: the dataset containing the variables you want to visualize
geom: the type of geometric object you want to graph (e.g., bars, points, boxplots)
aes: the aesthetic attributes you want to apply to the geometric object (including which variables should be on the x and y axes, and the color, shape, and size of the geometric object)
Here is a general ggplot template:
You don’t need to remember the syntax! Here’s the
A ggplot object can have multiple components (connected with +), each of which specifies a layer on the graph.
# load the packages used throughout
library(ggplot2)
library(dplyr)
# load a dataset
data(CPS85, package = "mosaicData")
# check the structure
str(CPS85)
## 'data.frame': 534 obs. of 11 variables:
## $ wage : num 9 5.5 3.8 10.5 15 9 9.57 15 11 5 ...
## $ educ : int 10 12 12 12 12 16 12 14 8 12 ...
## $ race : Factor w/ 2 levels "NW","W": 2 2 2 2 2 2 2 2 2 2 ...
## $ sex : Factor w/ 2 levels "F","M": 2 2 1 1 2 1 1 2 2 1 ...
## $ hispanic: Factor w/ 2 levels "Hisp","NH": 2 2 2 2 2 2 2 2 2 2 ...
## $ south : Factor w/ 2 levels "NS","S": 1 1 1 1 1 1 1 1 1 1 ...
## $ married : Factor w/ 2 levels "Married","Single": 1 1 2 1 1 1 1 2 1 1 ...
## $ exper : int 27 20 4 29 40 27 5 22 42 14 ...
## $ union : Factor w/ 2 levels "Not","Union": 1 1 1 1 2 1 2 1 1 1 ...
## $ age : int 43 38 22 47 58 49 23 42 56 32 ...
## $ sector : Factor w/ 8 levels "clerical","const",..: 2 7 7 1 2 1 8 7 4 7 ...
ggplot(data = <data>, mapping = aes(x = <x-axis variable>, y = <y-axis variable>))
# generate a univariate graph with a categorical variable
ggplot(data = CPS85, mapping = aes(x = sex))
We need to make sure the categorical variable is a factor. We can then rename the category labels with recode() and change the order of the categories with the levels parameter of factor().
# check the class of the variable
class(CPS85$sex)
## [1] "factor"
# rename the labels
CPS85_clean <- CPS85 %>%
mutate(sex = recode(sex, F = "Female", M = "Male"))
ggplot(data = CPS85_clean, mapping = aes(x = sex))
# change the order
CPS85_clean %>%
mutate(sex = factor(sex, levels = c("Male", "Female"))) %>%
ggplot(mapping = aes(x = sex))
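To see how labels and levels interact outside of a plot, here is a minimal base R sketch. The vector sex and the objects sex_labeled and sex_reordered are made up for this illustration and are not part of the CPS85 data.

```r
# a small made-up factor; levels default to alphabetical order
sex <- factor(c("M", "F", "F", "M"))
levels(sex)
## [1] "F" "M"

# relabel: labels are matched to levels position by position
sex_labeled <- factor(sex, levels = c("F", "M"), labels = c("Female", "Male"))

# reorder: levels sets the order used on the axis and in the legend
sex_reordered <- factor(sex_labeled, levels = c("Male", "Female"))
levels(sex_reordered)
## [1] "Male"   "Female"
```

Reordering the levels does not change the underlying values, only the order in which ggplot draws the categories.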
ggplot(data = CPS85_clean, mapping = aes(x = sex)) +
geom_bar()
Add a fill color by specifying the fill parameter, and an outline color by specifying the color parameter.
ggplot(data = CPS85_clean, mapping = aes(x = sex)) +
geom_bar(fill = 'cornflowerblue', color = 'black')
Fill the bars with colors based on the levels of a categorical variable by assigning the categorical variable to fill.
Note: When assigning a variable to fill, it has to be inside aes(), for example the same aes() as the associated variable.
ggplot(data = CPS85_clean, mapping = aes(x = sex, fill = sex)) +
geom_bar(color = 'black')
# this doesn't work
# ggplot(data = CPS85_clean, mapping = aes(x = sex)) +
# geom_bar(fill = sex, color = 'black')
# this works
ggplot(data = CPS85_clean) +
geom_bar(aes(x = sex, fill = sex), color = 'black')
ggplot(CPS85_clean,aes(x = wage)) +
geom_histogram()
ggplot(CPS85_clean,aes(x = wage)) +
geom_histogram(fill = "darkorange", color = "black", bins = 10)
ggplot(CPS85_clean,aes(x = wage)) +
geom_histogram(fill = "darkorange", color = "black", bins = 10, alpha = 0.3)
Specify the categorical variable that determines the color with fill, and the type of bar graph with position.
ggplot(CPS85_clean, aes(x = sector,fill = sex)) +
geom_bar(position = "stack")
ggplot(CPS85_clean, aes(x = sector,fill = sex)) +
geom_bar(position = "dodge")
ggplot(CPS85_clean, aes(x = sector,fill = sex)) +
geom_bar(position = "fill")
Do you find anything wrong with this figure?
ggplot(CPS85_clean, aes(x = sector,fill = sex)) +
geom_bar(position = "fill") +
scale_y_continuous(labels = scales::percent_format(accuracy = 1)) +
labs(y = "Percentage")
Specify the continuous variable on the y-axis with y =, and either use geom_col() or specify stat = "identity" inside geom_bar().
ggplot(CPS85_clean, aes(x = sector, y = exper)) +
geom_col(fill = "darkorange", alpha = 0.7)
ggplot(CPS85_clean, aes(x = sector, y = exper)) +
geom_bar(fill = "darkorange", alpha = 0.7, stat = "identity")
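Because CPS85 has one row per person, the bars above stack every observation, so each bar shows the total years of experience in a sector. If the goal is to compare typical experience across sectors, one option is to summarize first and plot the means. This is only a sketch: exper_by_sector, mean_exper, and mean_exper_plot are names invented for the illustration.

```r
library(ggplot2)
library(dplyr)

data(CPS85, package = "mosaicData")

# collapse to one row per sector (illustrative object and column names)
exper_by_sector <- CPS85 %>%
  group_by(sector) %>%
  summarize(mean_exper = mean(exper), .groups = "drop")

# each bar is now the mean, not the sum, of exper
mean_exper_plot <- ggplot(exper_by_sector, aes(x = sector, y = mean_exper)) +
  geom_col(fill = "darkorange", alpha = 0.7) +
  labs(x = "Sector", y = "Mean experience (years)")
mean_exper_plot
```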
Specify the continuous variable on the x-axis and the categorical variable with fill.
ggplot(CPS85_clean, aes(x = exper, fill = race)) +
geom_density(alpha = 0.4)
Specify the categorical variable on the x-axis and the continuous variable with y =.
ggplot(CPS85_clean, aes(x = sector, y = exper)) +
geom_boxplot()
Reorder the boxplots by the continuous variable.
ggplot(CPS85_clean) +
geom_boxplot(aes(x = reorder(sector, exper), y = exper), color = "darkorange", alpha = .7)
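By default, reorder() sorts the categories by the mean of its second argument. Since a boxplot draws the median, it can be more natural to order by the median through the FUN argument. A small sketch on the original CPS85 data (sector_by_median is an illustrative name):

```r
library(ggplot2)

data(CPS85, package = "mosaicData")

# order sectors by median experience (reorder() uses the mean unless FUN is given)
sector_by_median <- reorder(CPS85$sector, CPS85$exper, FUN = median)
levels(sector_by_median)

# the same reordering applied directly inside aes()
ggplot(CPS85, aes(x = reorder(sector, exper, FUN = median), y = exper)) +
  geom_boxplot(color = "darkorange") +
  labs(x = "sector")
```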
ggplot(CPS85_clean,
aes(x = age,
y = exper)) +
geom_point(color= "darkorange")
Add a linear fit line by adding a geom_smooth layer with the method specified.
ggplot(CPS85_clean,
aes(x = age,
y = exper)) +
geom_point(color= "darkorange") +
geom_smooth(method = "lm", se = TRUE)
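To see what a linear fit line represents, the sketch below fits the straight-line model with lm() and draws it by hand with geom_abline(). It should coincide with the line that geom_smooth(method = "lm") draws for the same variables. The object name fit is illustrative.

```r
library(ggplot2)

data(CPS85, package = "mosaicData")

# least-squares line for experience as a function of age
fit <- lm(exper ~ age, data = CPS85)
coef(fit)  # intercept and slope

# draw the fitted line from its coefficients
ggplot(CPS85, aes(x = age, y = exper)) +
  geom_point(color = "darkorange") +
  geom_abline(intercept = coef(fit)[["(Intercept)"]],
              slope = coef(fit)[["age"]],
              color = "steelblue")
```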
Specify the grouping variable by adding color to aes.
ggplot(CPS85_clean,
aes(x = age,
y = wage,
color = sex)) +
geom_point(color= "darkorange") + # a constant color set inside the geom overrides the color mapping from ggplot()
geom_smooth(method = "lm")
ggplot(CPS85_clean,
aes(x = age,
y = wage,
color = sex)) +
geom_point() + # keep the color pattern for the dots
geom_smooth(method = "lm")
ggplot(CPS85_clean,
aes(x = age,
y = wage,
color = sex)) +
geom_point() + # keep the color pattern for the dots
geom_smooth(method = "lm") +
facet_wrap(~race)
Adjust the order with limits and the labels with labels inside the scale_x_discrete layer.
# check the current levels of the factor
levels(CPS85_clean$race)
## [1] "NW" "W"
ggplot(data = CPS85_clean, mapping = aes(x = race, fill = race)) +
geom_bar() +
scale_x_discrete(limits = c("W", "NW"), # need to match the levels of the factor
labels = c("White", "Non-White")) # need to match the order of the limits
Customize the legend by specifying parameters inside scale_fill_discrete.
ggplot(data = CPS85_clean, mapping = aes(x = race, fill = race)) +
geom_bar() +
scale_x_discrete(limits = c("W", "NW"), # need to match the levels of the factor
labels = c("White", "Non-White")) + # need to match the order of the limits
scale_fill_discrete(name = "Race", labels = c("Non-White", "White"))+
labs(x = "Race")
Specify the min, max, and interval of the axis breaks with scale_x_continuous(breaks = seq()).
# check the range
x_range <- range(CPS85_clean$age, na.rm = TRUE)
ggplot(CPS85_clean,
aes(x = age,
y = wage,
color = sex)) +
geom_point() +
geom_smooth(method = "lm") +
scale_x_continuous(breaks = seq(x_range[1], x_range[2], 5)) # have to be within range
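seq(from, to, by) returns the break positions, so it can help to print them before handing them to scale_x_continuous(). Note that the sequence stops at the last value that does not exceed to. A quick check with the age range of 18 to 64 in this dataset (age_breaks is an illustrative name):

```r
# break positions from 18 to 64 in steps of 5
age_breaks <- seq(18, 64, by = 5)
age_breaks
## [1] 18 23 28 33 38 43 48 53 58 63

# starting from a round number gives more readable tick labels
seq(20, 60, by = 10)
## [1] 20 30 40 50 60
```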
Add a dollar sign to the y-axis labels with scales::dollar.
# check the range
range(CPS85_clean$age)
## [1] 18 64
ggplot(CPS85_clean,
aes(x = age,
y = wage,
color = sex)) +
geom_point() +
geom_smooth(method = "lm") +
scale_x_continuous(breaks = seq(18, 64, 5)) + # have to be within range
scale_y_continuous(labels = scales::dollar)
Specify all labels with labs
ggplot(CPS85_clean,
aes(x = age,
y = exper)) +
geom_point(color= "darkorange") +
geom_smooth(method = "lm") +
labs(title = "A positive correlation between age and experience",
subtitle = "Arrrrrrr matey!",
x = "Age",
y = "Experience (year)",
caption = "Data taken from the `mosaicData` package.")
Store the text settings in a reusable theme object, text_settings, and add it to the plot like any other layer.
text_settings <-
  theme(plot.title = element_text(size = 16, face = 'bold', hjust = 0.5),
        plot.subtitle = element_text(size = 14, hjust = 0.5),
        axis.title.x = element_text(size = 16, face = 'bold'),
        axis.title.y = element_text(size = 16, face = 'bold'),
        axis.text.x = element_text(size = 10),
        axis.text.y = element_text(size = 10))
ggplot(CPS85_clean,
aes(x = age,
y = exper)) +
geom_point(color= "darkorange") +
geom_smooth(method = "lm") +
labs(title = "A positive correlation between age and experience",
subtitle = "Arrrrrrr matey!",
x = "Age",
y = "Experience (year)",
caption = "Data taken from the `mosaicData` package.") +
theme_minimal() +
text_settings
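Add a density plot on top of the histogram. The y-axis needs to be changed to density so that both layers share the same scale.
# after_stat(density) replaces the deprecated ..density.. notation
ggplot(CPS85_clean,aes(x = wage, y = after_stat(density))) +
geom_histogram(fill = "darkorange", color = "black", bins = 10) +
geom_density(color = 'steelblue', linewidth = 1.1) +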
facet_wrap(~sex)
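Why the density scale matters: a histogram normally shows counts, while a density curve is scaled so that the area under it is 1, so the two only line up when the bars are on the density scale too. The sketch below checks this with base R's hist(), which uses its own default bins rather than the 10 bins in the plot above; wage_bins and bar_areas are illustrative names.

```r
data(CPS85, package = "mosaicData")

# compute histogram bins without drawing them
wage_bins <- hist(CPS85$wage, plot = FALSE)

# on the count scale, the bar heights sum to the number of observations
sum(wage_bins$counts)

# on the density scale, the bar areas (height x width) sum to 1
bar_areas <- wage_bins$density * diff(wage_bins$breaks)
sum(bar_areas)
```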
Assign figures to variables, then organize multiple figures using plot_grid.
library(cowplot)
wage_hist <- ggplot(CPS85_clean,aes(x = wage, y = after_stat(density))) +
geom_histogram(fill = "darkorange", color = "black", bins = 10) +
geom_density(color = 'steelblue', linewidth = 1.1) +
facet_wrap(~sex) +
labs(title = "Wage distribution by gender") +
theme_bw() + # add a theme
theme(plot.title = element_text(hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5))
wage_age_plot <- ggplot(CPS85_clean,
aes(x = age,
y = wage,
color = sex)) +
geom_point() +
geom_smooth(method = "lm") +
scale_x_continuous(breaks = seq(18, 64, 5)) +
labs(title = "Association between wage and age") +
theme_light() + # add a theme
theme(plot.title = element_text(hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5))
plot_grid(wage_hist, wage_age_plot)
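plot_grid() places the figures side by side by default. Its ncol and labels arguments can change the layout and tag each panel. A small sketch with two placeholder plots built from the built-in mtcars data (p_hist, p_scatter, and combined are illustrative names):

```r
library(ggplot2)
library(cowplot)

# two small placeholder plots
p_hist <- ggplot(mtcars, aes(x = hp)) +
  geom_histogram(bins = 10)
p_scatter <- ggplot(mtcars, aes(x = wt, y = hp)) +
  geom_point()

# stack the plots in one column and tag them as panels A and B
combined <- plot_grid(p_hist, p_scatter, ncol = 1, labels = c("A", "B"))
combined
```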
Sometimes, you need to plot data from multiple sources and overlay them. When you work with repeated-measures designs, you often want to plot individual traces of data and then overlay aggregate statistics on top. Here, we’re using the ChickWeight data built into R, which records the weight of individual chicks over time (Time) under different diets (Diet).
ChickWeight_df <- ChickWeight #not necessary, but I prefer to re-assign the data frame
#now, we're aggregating so that we can compute aggregate statistics like the mean for Time X Diet
ChickWeight_agg <- ChickWeight_df %>%
group_by(Time, Diet) %>%
summarize(M_weight = mean(weight))
ggplot(mapping = aes(x = Time, color = Diet))+
geom_line(data = ChickWeight_df, aes(y = weight, group = Chick), alpha = 0.3)+ #first, plot the individual lines from the raw data frame
geom_point(data = ChickWeight_df, aes(y = weight, group = Chick), alpha = 0.3)+ #then the individual points from the same data frame
geom_line(data = ChickWeight_agg, aes(y = M_weight, group = Diet), linewidth = 1.5)+ #next, overlay the mean lines from the data frame with aggregate statistics
geom_point(data = ChickWeight_agg, aes(y = M_weight), size = 3)+ #and the mean points from the same aggregate data frame
labs(x = "Days",
y = "Weight (g)",
title = "Chicken Weight as a function of Time and Diet")+
theme_classic()
#to make it look a bit nicer, binarize into high vs. low
ChickWeight_bi <- ChickWeight %>%
filter(Time < 5 | Time > 16) %>% #filter only high and low time points
mutate(Time_bi = ifelse(Time < 5, "T1", "T2")) %>%#binarize
group_by(Chick, Time_bi, Diet) %>%
summarize(weight = mean(weight))
ChickWeight_bi_agg <- ChickWeight_bi %>%
group_by(Time_bi, Diet) %>%
summarize(M_weight = mean(weight),
SE_weight = sd(weight)/sqrt(n())) # standard error = SD / sqrt(n)
ggplot(mapping = aes(x = Time_bi, color = Diet))+
geom_line(data = ChickWeight_bi, aes(y = weight, group = Chick), alpha = 0.3)+
geom_point(data = ChickWeight_bi, aes(y = weight, group = Chick), alpha = 0.3)+
geom_line(data = ChickWeight_bi_agg, aes(y = M_weight, group = Diet), linewidth = 1.5)+
geom_point(data = ChickWeight_bi_agg, aes(y = M_weight))+
geom_errorbar(data = ChickWeight_bi_agg, aes(ymin = M_weight-SE_weight, ymax = M_weight+SE_weight), width = 0.3)+
labs(x = "Time point",
y = "Weight (g)",
title = "Chicken Weight as a function of Time and Diet")+
facet_wrap(~Diet)+
theme_classic()
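The standard error of the mean is conventionally the standard deviation divided by the square root of the group size. A minimal sketch on the built-in ChickWeight data, restricted to the final day (Time == 21). The names chick_df, se_by_diet, and n_chicks are illustrative.

```r
library(dplyr)

# plain data frame copy of the built-in dataset
chick_df <- as.data.frame(ChickWeight)

se_by_diet <- chick_df %>%
  filter(Time == 21) %>%
  group_by(Diet) %>%
  summarize(n_chicks = n(),
            M_weight = mean(weight),
            SE_weight = sd(weight) / sqrt(n()),  # SE = SD / sqrt(n)
            .groups = "drop")
se_by_diet
```

These SE_weight values can be passed to geom_errorbar() through ymin = M_weight - SE_weight and ymax = M_weight + SE_weight, as in the plot above.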
ggplot gives you the flexibility to customize almost everything. Data visualization is an art, but it is also an important means of communication. So even if you would like to spend hours finding the perfect color combination, increasing the clarity and interpretability of your data should always be your priority. Before deciding on colors, make sure the color palettes you use have sufficient contrast and are color-blind friendly.
Want to use colors that perfectly match your slide design? In PowerPoint and many other programs, you can use an eyedropper tool to determine the exact color of any object. There are multiple color codes you can then use to match colors exactly. Most commonly, people use HEX or RGB codes. Hex codes can be directly pasted as a string for color or fill arguments in ggplot. RGB colors have to be converted first with rgb(). See the example below:
mtcars <- mtcars
mtcars %>%
mutate(cyl = factor(cyl)) %>%
ggplot(aes(x = cyl, y = hp))+
geom_boxplot(aes(fill = cyl))+
scale_fill_manual(values = c("4" = "navy",
"6" = "maroon",
"8" = "gold"))+
labs(x = "# Cylinders",
y = "Horsepower",
title = "Engine Horsepower as a Function of the Number of Cylinders")+
theme_classic()
mtcars %>%
mutate(cyl = factor(cyl)) %>%
ggplot(aes(x = cyl, y = hp))+
geom_boxplot(aes(fill = cyl))+
# scale_fill_manual(values = c("4" = rgb(10, 122, 64, maxColorValue = 255),
# "6" = rgb(0, 50, 120, maxColorValue = 255),
# "8" = rgb(100, 100, 50, maxColorValue = 255)))+
scale_fill_manual(values = c("4" = "#0A7A40",
"6" = rgb(0, 50, 120, maxColorValue = 255),
"8" = rgb(100, 100, 50, maxColorValue = 255)))+
labs(x = "# Cylinders",
y = "Horsepower",
title = "Engine Horsepower as a Function of the Number of Cylinders")+
theme_classic()
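The conversion itself can be checked in the console. The sketch below turns the RGB triplet used above into its hex string and back again (green_hex is an illustrative name):

```r
# convert an RGB triplet (0-255 scale) into a hex string
green_hex <- rgb(10, 122, 64, maxColorValue = 255)
green_hex
## [1] "#0A7A40"

# and back again, to inspect a hex code you copied from elsewhere
col2rgb("#0A7A40")
```

Without maxColorValue = 255, rgb() expects values between 0 and 1.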
I recommend saving ggplots for publications as .svg. This is a vector graphic format, which means that rather than saving pixels and their associated colors in a fixed dimension and size, it saves the relative position and size of objects. As a result, you can zoom into an .svg as much as you like without ever seeing pixels, and the graphic stays crystal-clear. This is not the case for raster formats like .png and .jpg: if you zoom into these types of images, they will eventually become pixelated.
Additionally, when you specify the size of text elements within the theme() command, you specify them in pt units (like the font size in Word, for instance). The effective font size will change when you manually increase the size of the graphic in your final document. If you like to be precise about this, I recommend that in the document where you want to insert a graphic, you do the following:
1. Insert a placeholder shape that has the dimensions of the graphic you would like to add.
2. Note the dimensions of that graphic.
3. Use the dimensions to create a figure in R using the ggsave() command, like so:
#Create a simple plot:
mtcars_plot1 <-
mtcars %>%
mutate(cyl = factor(cyl)) %>%
ggplot(aes(x = cyl, y = hp))+
geom_boxplot()+
labs(x = "# Cylinders",
y = "Horsepower",
title = "Engine Horsepower as a Function of the Number of Cylinders")+
theme_classic()
mtcars_plot1
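library(here) # builds file paths relative to the project root
#let's make a png plot first
ggsave(here("labs/lab6_plots/mtcars_plot.png"), width = 6, height = 2, units = "in")
#let's make an svg plot next (saving .svg files requires the svglite package to be installed)
ggsave(here("labs/lab6_plots/mtcars_plot.svg"), width = 6, height = 2, units = "in")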
ggsave() assumes you want to save the most recent plot you made. But you can also pass a plot that you saved into an object somewhere in your document via the plot argument.
mtcars_plot <- mtcars %>%
mutate(cyl = factor(cyl)) %>%
ggplot(aes(x = cyl, y = hp))+
geom_boxplot()+
labs(x = "# Cylinders",
y = "Horsepower",
title = "Engine Horsepower as a Function of the Number of Cylinders")+
theme_classic()
mtcars_plot
#let's make an svg plot next
ggsave(filename = here("labs/lab6_plots/mtcars_plot2.svg"), plot = mtcars_plot, width = 6, height = 2, units = "in")
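For raster formats such as .png, the dpi argument of ggsave() controls how many pixels each inch gets, so the pixel size is width x dpi by height x dpi. A small sketch that writes to a temporary file: the tempfile() path and the names demo_plot and out_file are only for illustration, and in the lab you would build the path with here() as above.

```r
library(ggplot2)

demo_plot <- ggplot(mtcars, aes(x = factor(cyl), y = hp)) +
  geom_boxplot() +
  theme_classic()

# 6 x 2 inches at 300 dpi gives an 1800 x 600 pixel image
out_file <- tempfile(fileext = ".png")
ggsave(filename = out_file, plot = demo_plot,
       width = 6, height = 2, units = "in", dpi = 300)
file.exists(out_file)
```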
The minihacks today are intentionally very open-ended. Get as creative as you want!
Data visualization is a great way to uncover stories in the data that would be difficult to notice by just looking at the numbers. See what stories you can uncover by exploring individual variables and their relationships with each other.
Load the SaratogaHouses dataset from the mosaicData package.
data(SaratogaHouses, package="mosaicData")
1a. Create visualizations for the heating variable and the livingArea variable. Add as many customization features as you want (e.g., color, labels, text settings, themes, etc.).
1b. Bonus: Can you find a way to highlight the most common heating pattern and mark the average living area on the figure?
2a. Create visualizations to demonstrate whether newly constructed houses have different heating patterns or not.
2b. Create a histogram of price with different colors representing different fuel types.
2c. Create visualizations to demonstrate whether the age of houses differs by fuel type.
3a. Create visualizations to demonstrate the association between the age of the houses and price.
3b. Create visualizations to demonstrate whether the association between the age of the houses and price depends on the waterfront and centralAir of the house.