You can download the rmd file here.

Introduction to Data Visualization

ggplot2 follows a theory of data visualization called the grammar of graphics. You can summarize this grammar as:

Each graph has the following components:

  • data: the dataset containing the variables you want to visualize
  • geom: the type of geometric object you want to graph (e.g., bars, points, boxplots)
  • aes: the aesthetic attributes you want to apply to the geometric object (including which variables should be on the x- and y-axes, and the color, shape, and size of the geometric object)

Here is a general ggplot template:

You don’t need to remember the syntax! Here’s the ggplot cheat sheet

A ggplot object can have multiple components (connected with +), each of which specifies a layer on the graph.
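
As a rough sketch of how layers are chained with +, the snippet below writes the general pattern as a comment and then builds one concrete plot. It uses R's built-in mtcars data, which is an assumption made only so the snippet runs on its own; it is not the dataset used in the rest of this tutorial.

```r
library(ggplot2)

# General pattern (placeholders in angle brackets are not runnable):
# ggplot(data = <data>, mapping = aes(x = <x>, y = <y>)) +
#   geom_<type>() +
#   <scales, labels, themes ...>

# One concrete instance, using the built-in mtcars data for illustration
p <- ggplot(data = mtcars, mapping = aes(x = wt, y = mpg)) +  # data + aesthetics
  geom_point() +                                              # layer 1: points
  geom_smooth(method = "lm") +                                # layer 2: linear fit
  labs(x = "Weight (1000 lbs)", y = "Miles per gallon")       # axis labels

p  # printing the object draws the plot
```

Each + adds one component, so you can build a figure step by step and inspect it after every addition.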

# load the packages used below
library(ggplot2)
library(dplyr)

# load a dataset
data(CPS85, package = "mosaicData")

# check the structure
str(CPS85)
## 'data.frame':    534 obs. of  11 variables:
##  $ wage    : num  9 5.5 3.8 10.5 15 9 9.57 15 11 5 ...
##  $ educ    : int  10 12 12 12 12 16 12 14 8 12 ...
##  $ race    : Factor w/ 2 levels "NW","W": 2 2 2 2 2 2 2 2 2 2 ...
##  $ sex     : Factor w/ 2 levels "F","M": 2 2 1 1 2 1 1 2 2 1 ...
##  $ hispanic: Factor w/ 2 levels "Hisp","NH": 2 2 2 2 2 2 2 2 2 2 ...
##  $ south   : Factor w/ 2 levels "NS","S": 1 1 1 1 1 1 1 1 1 1 ...
##  $ married : Factor w/ 2 levels "Married","Single": 1 1 2 1 1 1 1 2 1 1 ...
##  $ exper   : int  27 20 4 29 40 27 5 22 42 14 ...
##  $ union   : Factor w/ 2 levels "Not","Union": 1 1 1 1 2 1 2 1 1 1 ...
##  $ age     : int  43 38 22 47 58 49 23 42 56 32 ...
##  $ sector  : Factor w/ 8 levels "clerical","const",..: 2 7 7 1 2 1 8 7 4 7 ...

ggplot (data)

  • specify the dataset(s)
  • specify aesthetics (variables on the x & y axis)
  • use the formula: ggplot(data = <data>, mapping = aes(x = <x-axis variable>, y = <y-axis variable>))
# set up a univariate graph with a categorical variable (the plot stays empty until a geom is added)
ggplot(data = CPS85, mapping = aes(x = sex))

Rename & Reorder Categorical Variables

We need to make sure the categorical variable is a factor. We can then rename the labels with recode and adjust the order of the categories with the levels parameter of factor.

# check the class of the variable
class(CPS85$sex)
## [1] "factor"
# rename the labels
CPS85_clean <- CPS85 %>%
  mutate(sex = recode(sex, F = "Female", M = "Male"))

ggplot(data = CPS85_clean, mapping = aes(x = sex))

# change the order
CPS85_clean %>%
  mutate(sex = factor(sex, levels = c("Male", "Female"))) %>%
  ggplot(mapping = aes(x = sex))
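
To see what the two steps above do to the factor itself, here is a small sketch on a toy factor. The values are made up for illustration and only stand in for CPS85$sex.

```r
library(dplyr)

# toy factor standing in for CPS85$sex (values are made up)
sex <- factor(c("M", "F", "F", "M"))

# rename the labels; the levels keep their original (alphabetical) order
sex_renamed <- recode(sex, F = "Female", M = "Male")
levels(sex_renamed)

# reorder the levels so that "Male" is drawn first on the axis
sex_reordered <- factor(sex_renamed, levels = c("Male", "Female"))
levels(sex_reordered)
```

Only the order of the levels changes in the second step; the values attached to each observation stay the same.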

geom_ (geometric objects)

  • specify the type of graph
  • specify grouping variable
  • specify color, shape and size of the geometric objects

One Categorical Variable (geom_bar)

ggplot(data = CPS85_clean, mapping = aes(x = sex)) +
  geom_bar()

Adjust color

Add a fill color by specifying the fill parameter, and an outline color by specifying the color parameter.

ggplot(data = CPS85_clean, mapping = aes(x = sex)) +
  geom_bar(fill = 'darkorange', color = 'black')

Fill the bars with colors based on the levels of a categorical variable by assigning the categorical variable to fill. Note: When assigning a variable to fill, it has to be inside aes(), either in ggplot() or in the geom.

ggplot(data = CPS85_clean, mapping = aes(x = sex, fill = sex)) +
  geom_bar(color = 'black')

# this doesn't work: outside aes(), fill = sex looks for an object named sex instead of the column
# ggplot(data = CPS85_clean, mapping = aes(x = sex)) +
#   geom_bar(fill = sex, color = 'black')

# this works
ggplot(data = CPS85_clean) +
  geom_bar(aes(x = sex, fill = sex), color = 'black')

One Continuous Variable (geom_histogram)

ggplot(CPS85_clean,aes(x = wage)) +
  geom_histogram()

Adjust the number of bins

ggplot(CPS85_clean,aes(x = wage)) +
  geom_histogram(fill = "darkorange", color = "black", bins = 10)
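
The bins argument fixes how many bins are drawn. If you would rather fix the width of each bin, geom_histogram also accepts binwidth. A minimal sketch, assuming a width of 2 (in the units of wage) is a sensible choice for these data and using the raw CPS85 data so the snippet runs on its own:

```r
library(ggplot2)
data(CPS85, package = "mosaicData")

# each bin spans 2 units of wage instead of splitting the range into 10 bins
p_binwidth <- ggplot(CPS85, aes(x = wage)) +
  geom_histogram(fill = "darkorange", color = "black", binwidth = 2)

p_binwidth
```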

Adjust transparency

ggplot(CPS85_clean,aes(x = wage)) +
  geom_histogram(fill = "darkorange", color = "black", bins = 10, alpha = 0.7)

Bivariate: Categorical & Categorical (geom_bar)

Use fill to specify the categorical variable that determines the color, and position to specify the type of bar graph.

Stacked bar chart

ggplot(CPS85_clean, aes(x = sector,fill = sex)) +
  geom_bar(position = "stack")

Grouped bar chart

ggplot(CPS85_clean, aes(x = sector,fill = sex)) +
  geom_bar(position = "dodge")

Segmented bar chart

ggplot(CPS85_clean, aes(x = sector,fill = sex)) +
  geom_bar(position = "fill")

Do you find anything wrong with this figure?

ggplot(CPS85_clean, aes(x = sector,fill = sex)) +
  geom_bar(position = "fill") + 
  scale_y_continuous(labels = scales::percent_format(accuracy = 1L)) +
  labs(y = "Percentage")

Bivariate: Categorical & Continuous

Bar graph with group means (geom_col)

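Compute the group means first, then put the mean on the y-axis with y= and draw the bars with geom_col (equivalent to geom_bar with stat = "identity"). Without the summary step, geom_col stacks every observation, so the bars show group totals rather than means.

CPS85_clean %>%
  group_by(sector) %>% summarize(mean_exper = mean(exper)) %>%
  ggplot(aes(x = sector, y = mean_exper)) +
  geom_col(fill = "darkorange", alpha = 0.7)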
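
As an alternative sketch, ggplot2 can compute the group means itself if you ask geom_bar for stat = "summary" with fun = "mean" (the fun argument assumes ggplot2 3.3.0 or later). The snippet uses the raw CPS85 data so it runs on its own.

```r
library(ggplot2)
data(CPS85, package = "mosaicData")

# one bar per sector; the bar height is the mean of exper within that sector
p_means <- ggplot(CPS85, aes(x = sector, y = exper)) +
  geom_bar(stat = "summary", fun = "mean", fill = "darkorange", alpha = 0.7) +
  labs(y = "Mean experience (years)")

p_means
```

This keeps the raw data in the plot call, which can be convenient when you want to swap mean for another summary function.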

Grouped kernel density plots (geom_density)

specify the continuous variable on the x-axis and the categorical variable with fill

ggplot(CPS85_clean, aes(x = exper, fill = race)) +
  geom_density(alpha = 0.4)

Boxplot (geom_boxplot)

specify the continuous variable with y=

ggplot(CPS85_clean, aes(x = sector, y = exper)) +
  geom_boxplot()

reorder the boxplots by the continuous variable (reorder sorts by the group mean by default)

ggplot(CPS85_clean) +
  geom_boxplot(aes(x = reorder(sector, exper), y = exper), color = "darkorange", alpha = .7)
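
Because the center line of a boxplot is the median, you may prefer to sort by the median instead. A minimal sketch, using the FUN argument of base R's reorder and the raw CPS85 data so it runs on its own (the object names are illustrative):

```r
library(ggplot2)
data(CPS85, package = "mosaicData")

# reorder() defaults to FUN = mean; pass median to match the boxplot's center line
sector_by_median <- reorder(CPS85$sector, CPS85$exper, FUN = median)
levels(sector_by_median)

p_ordered <- ggplot(CPS85, aes(x = reorder(sector, exper, FUN = median), y = exper)) +
  geom_boxplot(color = "darkorange", alpha = .7) +
  labs(x = "sector")  # replace the default axis title, which would show the reorder() call

p_ordered
```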

Bivariate: Continuous & Continuous

Scatterplot (geom_point)

ggplot(CPS85_clean,
       aes(x = age, 
           y = exper)) +
  geom_point(color= "darkorange")

Scatterplot with linear fit line

Add a linear fit line by adding a geom_smooth layer with method = "lm"

ggplot(CPS85_clean,
       aes(x = age, 
           y = exper)) +
  geom_point(color= "darkorange") + 
  geom_smooth(method = "lm")

Grouping

Add a grouping variable with color

Specify the grouping variable with color by adding color to aes

ggplot(CPS85_clean,
       aes(x = age, 
           y = wage, 
           color = sex)) +
  geom_point(color= "darkorange") + # a color set inside the geom overrides the color mapping inherited from ggplot()
  geom_smooth(method = "lm")

ggplot(CPS85_clean,
       aes(x = age, 
           y = wage, 
           color = sex)) +
  geom_point() + # keep the color pattern for the dots 
  geom_smooth(method = "lm")

Add a grouping variable with facets

ggplot(CPS85_clean,
       aes(x = age, 
           y = wage, 
           color = sex)) +
  geom_point() + # keep the color pattern for the dots 
  geom_smooth(method = "lm") + 
  facet_wrap(~race)

Scales

Categorical Variables

Re-order categorical variable

Adjust the order with limits and label with labels inside the scale_x_discrete layer.

# check the current levels of the factor 
levels(CPS85_clean$race)
## [1] "NW" "W"
ggplot(data = CPS85_clean, mapping = aes(x = race, fill = race)) +
  geom_bar() + 
  scale_x_discrete(limits = c("W", "NW"), # need to match the levels of the factor 
                   labels = c("White", "Non-White")) # need to match the order of the limits

Customize legend

Customize the legend by specifying parameters inside scale_fill_discrete

ggplot(data = CPS85_clean, mapping = aes(x = race, fill = race)) +
  geom_bar() + 
  scale_x_discrete(limits = c("W", "NW"), # need to match the levels of the factor 
                   labels = c("White", "Non-White")) + # need to match the order of the limits
  scale_fill_discrete(name = "Race", labels = c("Non-White", "White"))

Continuous Variables

Adjust the label intervals

specify the min, max and interval with scale_x_continuous(breaks = seq())

# check the range 
range(CPS85_clean$age)
## [1] 18 64
ggplot(CPS85_clean,
       aes(x = age, 
           y = wage, 
           color = sex)) +
  geom_point() + 
  geom_smooth(method = "lm") + 
  scale_x_continuous(breaks = seq(18, 64, 5)) # have to be within range 

Specify the unit

Add dollar sign

# check the range 
range(CPS85_clean$age)
## [1] 18 64
ggplot(CPS85_clean,
       aes(x = age, 
           y = wage, 
           color = sex)) +
  geom_point() + 
  geom_smooth(method = "lm") + 
  scale_x_continuous(breaks = seq(18, 64, 5)) + # have to be within range 
  scale_y_continuous(labels = scales::dollar)

Labels

  • Have a title
  • Make sure the x and y labels make sense

Specify all labels with labs

ggplot(CPS85_clean,
       aes(x = age, 
           y = exper)) +
  geom_point(color= "darkorange") + 
  geom_smooth(method = "lm") + 
  labs(title    = "A positive correlation between age and experience",
       subtitle = "Arrrrrrr matey!",
       x        = "Age",
       y        = "Experience (year)",
       caption  = "Data taken from the `mosaicData` package.")

Themes

  • Specify a general layout
  • Customize the font

Specify text_settings

text_settings <- 
  theme(plot.title = element_text(size = 16, face = 'bold')) +
  theme(plot.subtitle = element_text(size = 14)) +
  theme(axis.title.x = element_text(size = 16, face = 'bold')) +
  theme(axis.title.y = element_text(size = 16, face = 'bold')) +
  theme(axis.text.x = element_text(size = 10)) +
  theme(axis.text.y = element_text(size = 10)) + 
  theme(plot.title = element_text(hjust = 0.5),
        plot.subtitle = element_text(hjust = 0.5))
ggplot(CPS85_clean,
       aes(x = age, 
           y = exper)) +
  geom_point(color= "darkorange") + 
  geom_smooth(method = "lm") + 
  labs(title    = "A positive correlation between age and experience",
       subtitle = "Arrrrrrr matey!",
       x        = "Age",
       y        = "Experience (year)",
       caption  = "Data taken from the `mosaicData` package.") + 
  theme_minimal() + 
  text_settings
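
Adding theme() calls together merges their settings, so the same text settings can also be collected in a single call, grouped by element. A sketch of that equivalent form (text_settings_compact and p_themed are illustrative names, and the raw CPS85 data is used so the snippet runs on its own):

```r
library(ggplot2)
data(CPS85, package = "mosaicData")

# same settings as text_settings, written as one theme() call
text_settings_compact <- theme(
  plot.title    = element_text(size = 16, face = "bold", hjust = 0.5),
  plot.subtitle = element_text(size = 14, hjust = 0.5),
  axis.title.x  = element_text(size = 16, face = "bold"),
  axis.title.y  = element_text(size = 16, face = "bold"),
  axis.text.x   = element_text(size = 10),
  axis.text.y   = element_text(size = 10)
)

p_themed <- ggplot(CPS85, aes(x = age, y = exper)) +
  geom_point(color = "darkorange") +
  theme_minimal() +        # general layout first
  text_settings_compact    # then the custom text settings, so they are not overwritten

p_themed
```

The order matters: a complete theme such as theme_minimal() resets text elements, so custom settings should come after it.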

Display Multiple Figures

Overlaying

Add a density plot on top of the histogram. The y-axis needs to be changed to density so that both layers share the same scale.

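ggplot(CPS85_clean,aes(x = wage, y = after_stat(density))) + # after_stat() replaces the deprecated ..density..
  geom_histogram(fill = "darkorange", color = "black", bins = 10) + 
  geom_density(color = 'steelblue', linewidth = 1.1) + # linewidth replaces size for lines (ggplot2 >= 3.4.0)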
  facet_wrap(~sex)

Organize figures into grid

Assign figures to variables, then organize multiple figures using plot_grid

library(cowplot)

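wage_hist <- ggplot(CPS85_clean,aes(x = wage, y = after_stat(density))) + # after_stat() replaces the deprecated ..density..
  geom_histogram(fill = "darkorange", color = "black", bins = 10) + 
  geom_density(color = 'steelblue', linewidth = 1.1) + # linewidth replaces size for lines (ggplot2 >= 3.4.0)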
  facet_wrap(~sex) + 
  labs(title = "Wage distribution by gender") +
  theme_bw() +  # add a theme
  theme(plot.title = element_text(hjust = 0.5),   
        plot.subtitle = element_text(hjust = 0.5)) 

wage_age_plot <- ggplot(CPS85_clean,
       aes(x = age, 
           y = wage, 
           color = sex)) +
  geom_point() + 
  geom_smooth(method = "lm") + 
  scale_x_continuous(breaks = seq(18, 64, 5)) + 
  labs(title = "Association between wage and age") +
  theme_light() +  # add a theme
  theme(plot.title = element_text(hjust = 0.5),   
        plot.subtitle = element_text(hjust = 0.5)) 

plot_grid(wage_hist, wage_age_plot)
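
plot_grid places figures side by side by default. A minimal sketch of two of its layout arguments, ncol and labels, using two simplified plots (hist_plot and scatter_plot are illustrative stand-ins for the figures above, built from the raw CPS85 data so the snippet runs on its own):

```r
library(ggplot2)
library(cowplot)
data(CPS85, package = "mosaicData")

# two simple stand-in figures
hist_plot <- ggplot(CPS85, aes(x = wage)) +
  geom_histogram(bins = 10)
scatter_plot <- ggplot(CPS85, aes(x = age, y = wage)) +
  geom_point()

# stack the figures in one column and tag them A and B
stacked <- plot_grid(hist_plot, scatter_plot, ncol = 1, labels = c("A", "B"))

stacked
```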

Yes, you can customize EVERYTHING!

ggplot gives you the flexibility to customize almost everything. Data visualization is an art, but it’s also an important way of communicating. Therefore, even though I would happily spend hours finding the perfect color combination, increasing the clarity and interpretability of your data should always be your priority. So, before deciding on colors, make sure the color palettes you use have sufficient contrast and are color-blind friendly.

A good reference for customizing ggplot

A guide for finding color blind friendly colors
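
As a starting point for color-blind friendly colors, here is a sketch of two palettes that are commonly recommended for this purpose: the viridis scale that ships with ggplot2, and the Okabe-Ito palette from base R (palette.colors assumes R 4.0.0 or later). The choice of these two palettes is an assumption of this example, not a recommendation taken from the guides linked above.

```r
library(ggplot2)
data(CPS85, package = "mosaicData")

base_plot <- ggplot(CPS85, aes(x = age, y = wage, color = sex)) +
  geom_point()

# option 1: discrete viridis palette shipped with ggplot2
# (end = 0.8 avoids the very light yellow at the top of the scale)
p_viridis <- base_plot + scale_color_viridis_d(end = 0.8)

# option 2: Okabe-Ito palette from base R
# unname() drops the color names so the values are matched to levels by position
okabe_ito <- unname(palette.colors(palette = "Okabe-Ito"))
p_okabe <- base_plot + scale_color_manual(values = okabe_ito[2:3])  # skip black

p_viridis
p_okabe
```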

Minihacks

The minihacks today are intentionally very open-ended. Get as creative as you want!

Data visualization is a great way to uncover stories in the data that would be difficult to notice by just looking at the numbers. See what stories you can uncover by exploring individual variables and their relationships with each other.

Load the SaratogaHouses dataset from the mosaicData package:

data(SaratogaHouses, package="mosaicData")

1a. Create visualizations for the heating variable and the livingArea variable. Add as many customization features as you want (e.g., color, labels, text settings, themes, etc.).

1b. Bonus: Can you find a way to highlight the most common heating pattern and mark the average living area on the figure?

2a. Create visualizations to demonstrate whether newly constructed houses have different heating patterns or not.

2b. Create a histogram of price with different colors representing different fuel types.

2c. Create visualizations to demonstrate whether the age of houses differs by fuel type.

3a. Create visualizations to demonstrate the association between the age of the houses and price.

3b. Create visualizations to demonstrate whether the association between the age of the houses and price depends on the waterfront and centralAir of the house.