You can download the rmd file here.
ggplot2 follows a theory of data visualization called the grammar of graphics. You can summarize this grammar as:
Each graph has the following components:
data: the dataset containing the variables you want to visualize
geom: the type of geometric object you want to graph (e.g., bars, points, boxplots)
aes: the aesthetic attributes you want to apply to the geometric object (including which variables should be on the x and y axes, and the color, shape, and size of the geometric object)
Here is a general ggplot template:
You don’t need to remember the syntax! Here’s the
A ggplot object can have multiple components (connected with +), each of which specifies a layer on the graph.
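# load the packages used below (ggplot2 for plotting; dplyr for %>%, mutate() and recode())
library(ggplot2)
library(dplyr)
# load a dataset
data(CPS85, package = "mosaicData")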
# check the structure
str(CPS85)
## 'data.frame': 534 obs. of 11 variables:
## $ wage : num 9 5.5 3.8 10.5 15 9 9.57 15 11 5 ...
## $ educ : int 10 12 12 12 12 16 12 14 8 12 ...
## $ race : Factor w/ 2 levels "NW","W": 2 2 2 2 2 2 2 2 2 2 ...
## $ sex : Factor w/ 2 levels "F","M": 2 2 1 1 2 1 1 2 2 1 ...
## $ hispanic: Factor w/ 2 levels "Hisp","NH": 2 2 2 2 2 2 2 2 2 2 ...
## $ south : Factor w/ 2 levels "NS","S": 1 1 1 1 1 1 1 1 1 1 ...
## $ married : Factor w/ 2 levels "Married","Single": 1 1 2 1 1 1 1 2 1 1 ...
## $ exper : int 27 20 4 29 40 27 5 22 42 14 ...
## $ union : Factor w/ 2 levels "Not","Union": 1 1 1 1 2 1 2 1 1 1 ...
## $ age : int 43 38 22 47 58 49 23 42 56 32 ...
## $ sector : Factor w/ 8 levels "clerical","const",..: 2 7 7 1 2 1 8 7 4 7 ...
# general ggplot template: replace each <...> placeholder with your own data and variables
ggplot(data = <data>, mapping = aes(x = <x-axis variable>, y = <y-axis variable>))
# generate a univariate graph with a categorical variable
ggplot(data = CPS85, mapping = aes(x = sex))
We need to make sure the categorical variable is a factor. We can then rename its labels with recode() and change the order of the categories with the levels parameter of factor().
# check the class of the variable
class(CPS85$sex)
## [1] "factor"
# rename the labels
CPS85_clean <- CPS85 %>%
  mutate(sex = recode(sex, F = "Female", M = "Male"))
ggplot(data = CPS85_clean, mapping = aes(x = sex))
# change the order
CPS85_clean %>%
  mutate(sex = factor(sex, levels = c("Male", "Female"))) %>%
  ggplot(mapping = aes(x = sex))
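The relabeling and reordering above can also be done in one step with base R's factor(), which accepts both levels (the order) and labels (the display names). This is only a minimal sketch on a made-up vector; sex_raw and sex_clean are hypothetical names and the values are not taken from CPS85.

```r
# hypothetical toy vector standing in for CPS85$sex
sex_raw <- factor(c("M", "F", "F", "M", "F"))

# levels sets the order; labels renames each level in that same order
sex_clean <- factor(sex_raw,
                    levels = c("M", "F"),
                    labels = c("Male", "Female"))

levels(sex_clean)  # order of the categories after the change
table(sex_clean)   # counts per category, in the new order
```

Because labels are matched to levels by position, it is worth checking table() afterwards to make sure no category was mislabeled.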
ggplot(data = CPS85_clean, mapping = aes(x = sex)) +
  geom_bar()
Add a fill color by specifying the fill parameter, and an outline color by specifying the color parameter
ggplot(data = CPS85_clean, mapping = aes(x = sex)) +
  geom_bar(fill = 'darkorange', color = 'black')
Fill the bars with colors based on the levels of a categorical variable by assigning the categorical variable to fill.
Note: When assigning a variable to fill, it has to be inside an aes() call (for example, the same aes() as the x variable). Outside aes(), fill expects a fixed color rather than a variable.
ggplot(data = CPS85_clean, mapping = aes(x = sex, fill = sex)) +
  geom_bar(color = 'black')
# this doesn't work: fill = sex is outside aes(), so R looks for an object named sex
# ggplot(data = CPS85_clean, mapping = aes(x = sex)) +
#   geom_bar(fill = sex, color = 'black')
# this works: the mapping is inside aes(), here at the layer level
ggplot(data = CPS85_clean) +
  geom_bar(aes(x = sex, fill = sex), color = 'black')
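To see the difference between mapping fill to a variable and setting fill to a fixed color, you can inspect what ggplot2 computed for each layer with layer_data(). The sketch below uses a small made-up data frame (toy is hypothetical, not CPS85), so treat it as an illustration rather than part of the analysis.

```r
library(ggplot2)

# hypothetical toy data standing in for CPS85_clean
toy <- data.frame(sex = factor(c("Female", "Male", "Male", "Female", "Male")))

# fill inside aes(): mapped to the variable, one color per level
p_mapped <- ggplot(toy, aes(x = sex, fill = sex)) +
  geom_bar()

# fill outside aes(): a fixed color applied to every bar
p_fixed <- ggplot(toy, aes(x = sex)) +
  geom_bar(fill = "darkorange")

# layer_data() returns the data ggplot2 uses to draw a layer
unique(layer_data(p_mapped)$fill)
unique(layer_data(p_fixed)$fill)
```

The mapped version should report one fill value per level of sex, while the fixed version reports only the color you passed in.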
ggplot(CPS85_clean, aes(x = wage)) +
  geom_histogram()
ggplot(CPS85_clean, aes(x = wage)) +
  geom_histogram(fill = "darkorange", color = "black", bins = 10)
ggplot(CPS85_clean, aes(x = wage)) +
  geom_histogram(fill = "darkorange", color = "black", bins = 10, alpha = 0.7)
Specify the categorical variable that determines the color with fill, and the type of bar graph with position
ggplot(CPS85_clean, aes(x = sector, fill = sex)) +
  geom_bar(position = "stack")
ggplot(CPS85_clean, aes(x = sector, fill = sex)) +
  geom_bar(position = "dodge")
ggplot(CPS85_clean, aes(x = sector, fill = sex)) +
  geom_bar(position = "fill")
Do you find anything wrong with this figure?
ggplot(CPS85_clean, aes(x = sector, fill = sex)) +
  geom_bar(position = "fill") +
  scale_y_continuous(labels = scales::percent_format(accuracy = 1L)) +
  labs(y = "Percentage")
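If you want to check the proportions that position = "fill" draws, you can compute them by hand with dplyr. This is a minimal sketch on made-up data; toy and sector_share are hypothetical names, and the values are not from CPS85.

```r
library(dplyr)

# hypothetical toy data standing in for CPS85_clean
toy <- data.frame(
  sector = c("clerical", "clerical", "clerical", "const", "const"),
  sex    = c("Female", "Female", "Male", "Male", "Male")
)

sector_share <- toy %>%
  count(sector, sex) %>%         # number of rows per sector/sex combination
  group_by(sector) %>%
  mutate(prop = n / sum(n)) %>%  # share of each sex within a sector
  ungroup()

sector_share
```

Within each sector the prop values add up to 1, which is exactly what the filled bars show.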
Specify the continuous variable on the y-axis with y = and use geom_col(), which is shorthand for geom_bar(stat = "identity")
ggplot(CPS85_clean, aes(x = sector, y = exper)) +
  geom_col(fill = "darkorange", alpha = 0.7)
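Because geom_col() stacks the values when a category has several rows, the bars above show the total experience per sector. If you would rather plot the mean, one option is to summarise first and then plot. The sketch below assumes a small made-up data frame (toy and mean_exper are hypothetical names).

```r
library(dplyr)
library(ggplot2)

# hypothetical toy data standing in for CPS85_clean
toy <- data.frame(
  sector = c("clerical", "clerical", "const", "const", "const"),
  exper  = c(10, 20, 5, 15, 40)
)

# one row per sector, holding the mean experience
mean_exper <- toy %>%
  group_by(sector) %>%
  summarise(mean_exper = mean(exper))

ggplot(mean_exper, aes(x = sector, y = mean_exper)) +
  geom_col(fill = "darkorange", alpha = 0.7) +
  labs(y = "Mean experience (years)")
```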
Specify the continuous variable on the x-axis and the categorical variable with fill
ggplot(CPS85_clean, aes(x = exper, fill = race)) +
  geom_density(alpha = 0.4)
Specify the continuous variable with y =
ggplot(CPS85_clean, aes(x = sector, y = exper)) +
  geom_boxplot()
Reorder the boxplots by the continuous variable (reorder() sorts the categories by their mean by default)
ggplot(CPS85_clean) +
  geom_boxplot(aes(x = reorder(sector, exper), y = exper), color = "darkorange", alpha = .7)
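Since a boxplot draws the median, you may prefer to order the boxes by median rather than by mean. reorder() takes a FUN argument for this. Here is a small sketch with made-up numbers (toy is hypothetical) chosen so that the two orderings differ.

```r
# hypothetical toy data: group "a" has one large value that pulls its mean up
toy <- data.frame(
  sector = factor(c("a", "a", "a", "b", "b", "b")),
  exper  = c(1, 2, 30, 8, 9, 10)
)

# default: levels ordered by the group mean
by_mean <- levels(reorder(toy$sector, toy$exper))

# FUN = median: levels ordered by the group median
by_median <- levels(reorder(toy$sector, toy$exper, FUN = median))

by_mean
by_median
```

In a plot you would write aes(x = reorder(sector, exper, FUN = median), y = exper) in the same way as above.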
ggplot(CPS85_clean,
       aes(x = age,
           y = exper)) +
  geom_point(color = "darkorange")
Add a linear fit line by adding a geom_smooth layer with method = "lm"
ggplot(CPS85_clean,
       aes(x = age,
           y = exper)) +
  geom_point(color = "darkorange") +
  geom_smooth(method = "lm")
Specify the grouping variable by adding color to aes
ggplot(CPS85_clean,
       aes(x = age,
           y = wage,
           color = sex)) +
  geom_point(color = "darkorange") + # a fixed color set in geom_point() overrides the color mapping from aes()
  geom_smooth(method = "lm")
ggplot(CPS85_clean,
       aes(x = age,
           y = wage,
           color = sex)) +
  geom_point() + # keep the color pattern for the dots
  geom_smooth(method = "lm")
ggplot(CPS85_clean,
       aes(x = age,
           y = wage,
           color = sex)) +
  geom_point() + # keep the color pattern for the dots
  geom_smooth(method = "lm") +
  facet_wrap(~race)
Adjust the order with limits and the labels with labels inside the scale_x_discrete layer.
# check the current levels of the factor
levels(CPS85_clean$race)
## [1] "NW" "W"
ggplot(data = CPS85_clean, mapping = aes(x = race, fill = race)) +
  geom_bar() +
  scale_x_discrete(limits = c("W", "NW"), # need to match the levels of the factor
                   labels = c("White", "Non-White")) # need to match the order of the limits
Customize the legend by specifying parameters inside scale_fill_discrete
ggplot(data = CPS85_clean, mapping = aes(x = race, fill = race)) +
  geom_bar() +
  scale_x_discrete(limits = c("W", "NW"), # need to match the levels of the factor
                   labels = c("White", "Non-White")) + # need to match the order of the limits
  scale_fill_discrete(name = "Race", labels = c("Non-White", "White")) # need to match the order of the factor levels
Specify the minimum, maximum, and interval of the axis breaks with scale_x_continuous(breaks = seq())
# check the range
range(CPS85_clean$age)
## [1] 18 64
ggplot(CPS85_clean,
       aes(x = age,
           y = wage,
           color = sex)) +
  geom_point() +
  geom_smooth(method = "lm") +
  scale_x_continuous(breaks = seq(18, 64, 5)) # have to be within range
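It can help to look at what seq() returns before handing it to breaks. A quick sketch (age_breaks is just an illustrative name):

```r
# breaks from 18 up to at most 64, in steps of 5
age_breaks <- seq(from = 18, to = 64, by = 5)

age_breaks       # the tick positions that will be drawn
max(age_breaks)  # the last tick; it stops before exceeding 64
```

Note that the last break is not necessarily the maximum of the data: seq() stops at the largest value that does not go past the upper bound.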
Add a dollar sign to the y-axis labels with scales::dollar
# check the range
range(CPS85_clean$age)
## [1] 18 64
ggplot(CPS85_clean,
       aes(x = age,
           y = wage,
           color = sex)) +
  geom_point() +
  geom_smooth(method = "lm") +
  scale_x_continuous(breaks = seq(18, 64, 5)) + # have to be within range
  scale_y_continuous(labels = scales::dollar)
Specify all labels with labs
ggplot(CPS85_clean,
       aes(x = age,
           y = exper)) +
  geom_point(color = "darkorange") +
  geom_smooth(method = "lm") +
  labs(title = "A positive correlation between age and experience",
       subtitle = "Arrrrrrr matey!",
       x = "Age",
       y = "Experience (year)",
       caption = "Data taken from the `mosaicData` package.")
Collect the text settings in a reusable theme object, text_settings
text_settings <- theme(
  # titles: size, weight, and centering (hjust = 0.5)
  plot.title = element_text(size = 16, face = 'bold', hjust = 0.5),
  plot.subtitle = element_text(size = 14, hjust = 0.5),
  # axis titles and tick labels
  axis.title.x = element_text(size = 16, face = 'bold'),
  axis.title.y = element_text(size = 16, face = 'bold'),
  axis.text.x = element_text(size = 10),
  axis.text.y = element_text(size = 10)
)
ggplot(CPS85_clean,
       aes(x = age,
           y = exper)) +
  geom_point(color = "darkorange") +
  geom_smooth(method = "lm") +
  labs(title = "A positive correlation between age and experience",
       subtitle = "Arrrrrrr matey!",
       x = "Age",
       y = "Experience (year)",
       caption = "Data taken from the `mosaicData` package.") +
  theme_minimal() +
  text_settings
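Add a density plot on top of the histogram. The y-axis needs to be changed from count to density so that both layers share the same scale
# after_stat(density) and linewidth replace the deprecated ..density.. and size (ggplot2 >= 3.4.0)
ggplot(CPS85_clean, aes(x = wage, y = after_stat(density))) +
  geom_histogram(fill = "darkorange", color = "black", bins = 10) +
  geom_density(color = 'steelblue', linewidth = 1.1) +
  facet_wrap(~sex)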
Assign figures to variables, then organize multiple figures using plot_grid
library(cowplot)
# after_stat(density) and linewidth replace the deprecated ..density.. and size (ggplot2 >= 3.4.0)
wage_hist <- ggplot(CPS85_clean, aes(x = wage, y = after_stat(density))) +
  geom_histogram(fill = "darkorange", color = "black", bins = 10) +
  geom_density(color = 'steelblue', linewidth = 1.1) +
  facet_wrap(~sex) +
  labs(title = "Wage distribution by gender") +
  theme_bw() + # add a theme
  theme(plot.title = element_text(hjust = 0.5),
        plot.subtitle = element_text(hjust = 0.5))
wage_age_plot <- ggplot(CPS85_clean,
                        aes(x = age,
                            y = wage,
                            color = sex)) +
  geom_point() +
  geom_smooth(method = "lm") +
  scale_x_continuous(breaks = seq(18, 64, 5)) +
  labs(title = "Association between wage and age") +
  theme_light() + # add a theme
  theme(plot.title = element_text(hjust = 0.5),
        plot.subtitle = element_text(hjust = 0.5))
plot_grid(wage_hist, wage_age_plot)
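plot_grid() places the figures side by side by default. If that feels cramped, the ncol, labels, and rel_heights arguments of cowplot's plot_grid() let you stack and label the panels. A minimal sketch, assuming two made-up plots (toy, p1, p2, and combined are hypothetical names):

```r
library(ggplot2)
library(cowplot)

# hypothetical toy data and plots standing in for wage_hist and wage_age_plot
toy <- data.frame(x = c(1, 2, 3, 4), y = c(2, 4, 3, 5))
p1 <- ggplot(toy, aes(x = x)) +
  geom_histogram(bins = 3)
p2 <- ggplot(toy, aes(x = x, y = y)) +
  geom_point()

# one column, panel labels A and B, and a taller second panel
combined <- plot_grid(p1, p2,
                      ncol = 1,
                      labels = c("A", "B"),
                      rel_heights = c(1, 1.5))
combined
```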
ggplot gives you the flexibility to customize almost everything. Data visualization is an art, but it is also an important means of communication. So even if you would like to spend hours finding the perfect color combination, increasing the clarity and interpretability of your data should always be your priority. Before deciding on the colors, make sure the color palettes you use have sufficient contrast and are color-blind friendly.
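One option for a color-blind-friendly palette is the viridis scale that ships with ggplot2. The sketch below is only an illustration on made-up data (toy and p_viridis are hypothetical names); the end = 0.8 setting is an assumption on my part to avoid the very light yellow at the end of the palette.

```r
library(ggplot2)

# hypothetical toy data standing in for CPS85_clean
toy <- data.frame(
  age  = c(20, 30, 40, 50, 25, 35, 45, 55),
  wage = c(6, 9, 11, 12, 5, 8, 9, 10),
  sex  = rep(c("Male", "Female"), each = 4)
)

# scale_color_viridis_d() applies the discrete viridis palette to the color aesthetic
p_viridis <- ggplot(toy, aes(x = age, y = wage, color = sex)) +
  geom_point(size = 3) +
  scale_color_viridis_d(end = 0.8) +
  theme_minimal()
p_viridis
```

For bars or densities, scale_fill_viridis_d() does the same for the fill aesthetic.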
The minihacks today are intentionally very open-ended. Get as creative as you want!
Data visualization is a great way to uncover stories in the data that would be difficult to notice by just looking at the numbers. See what stories you can uncover by exploring individual variables and their relationships with each other.
Load the SaratogaHouses dataset from the mosaicData package
data(SaratogaHouses, package="mosaicData")
1a. Create visualizations for the heating variable and the livingArea variable. Add as many customization features as you want (e.g., color, labels, text settings, themes, etc.).
1b. Bonus: Can you find a way to highlight the most common heating pattern and mark the average living area on the figure?
2a. Create visualizations to demonstrate whether newly constructed houses have different heating patterns or not.
2b. Create a histogram of price with different colors representing different fuel types.
2c. Create visualizations to demonstrate whether the age of houses differs by fuel type.
3a. Create visualizations to demonstrate the association between the age of the houses and price.
3b. Create visualizations to demonstrate whether the association between the age of the houses and price depends on the waterfront and centralAir of the house.