You can download the rmd file here.

Introduction to Data Visualization

ggplot2 follows a theory of data visualization called the grammar of graphics. In this grammar, each graph has the following components:

  • data: the dataset containing the variables you want to visualize
  • geom: the type of geometric object you want to graph (e.g., bars, points, boxplots)
  • aes: the aesthetic attributes you want to apply to the geometric object (including which variables should be on the x & y axis, the color, shape, and size of the geometric object)

Here is a general ggplot template:

You don’t need to remember the syntax! Here’s the ggplot cheat sheet

A ggplot object can have multiple components (connected with +), each of which adds a layer or a setting to the graph.
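
As a minimal sketch of how components combine with +, the example below builds one plot in three steps. It uses the built-in mtcars data rather than the lab dataset, and the object names p_base, p_points, and p_full are placeholders chosen for illustration.

```r
library(ggplot2)

# Step 1: data and aesthetic mappings only (draws an empty panel with axes)
p_base <- ggplot(data = mtcars, mapping = aes(x = wt, y = mpg))

# Step 2: add a geom layer to draw the observations
p_points <- p_base + geom_point()

# Step 3: add a second layer and axis labels on top
p_full <- p_points +
  geom_smooth(method = "lm", formula = y ~ x) +
  labs(x = "Weight (1000 lbs)", y = "Miles per gallon")

p_full
```

Because each step returns a regular object, you can store a partial plot and reuse it with different layers.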

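# load the packages used throughout (ggplot2 for plotting; dplyr for %>%, mutate, and recode)
library(ggplot2)
library(dplyr)

# load a dataset
data(CPS85, package = "mosaicData")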

# check the structure
str(CPS85)
## 'data.frame':    534 obs. of  11 variables:
##  $ wage    : num  9 5.5 3.8 10.5 15 9 9.57 15 11 5 ...
##  $ educ    : int  10 12 12 12 12 16 12 14 8 12 ...
##  $ race    : Factor w/ 2 levels "NW","W": 2 2 2 2 2 2 2 2 2 2 ...
##  $ sex     : Factor w/ 2 levels "F","M": 2 2 1 1 2 1 1 2 2 1 ...
##  $ hispanic: Factor w/ 2 levels "Hisp","NH": 2 2 2 2 2 2 2 2 2 2 ...
##  $ south   : Factor w/ 2 levels "NS","S": 1 1 1 1 1 1 1 1 1 1 ...
##  $ married : Factor w/ 2 levels "Married","Single": 1 1 2 1 1 1 1 2 1 1 ...
##  $ exper   : int  27 20 4 29 40 27 5 22 42 14 ...
##  $ union   : Factor w/ 2 levels "Not","Union": 1 1 1 1 2 1 2 1 1 1 ...
##  $ age     : int  43 38 22 47 58 49 23 42 56 32 ...
##  $ sector  : Factor w/ 8 levels "clerical","const",..: 2 7 7 1 2 1 8 7 4 7 ...

ggplot (data)

  • specify the dataset(s)
  • specify aesthetics (variables on the x & y axis)
  • use the formula: ggplot(data = <data>, mapping = aes(x = <x-axis variable>, y = <y-axis variable>))
# generate a univariate graph with a categorical variable
ggplot(data = CPS85, mapping = aes(x = sex))

Rename & Reorder Categorical Variables

We need to make sure the categorical variable is a factor. We can then rename its labels with recode() and change the order of the categories with the levels argument of factor().

# check the class of the variable
class(CPS85$sex)
## [1] "factor"
# rename the labels
CPS85_clean <- CPS85 %>%
  mutate(sex = recode(sex, F = "Female", M = "Male"))

ggplot(data = CPS85_clean, mapping = aes(x = sex))

# change the order
CPS85_clean %>%
  mutate(sex = factor(sex, levels = c("Male", "Female"))) %>%
  ggplot(mapping = aes(x = sex))
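
Renaming and reordering can also be done in one step with base R's factor(). The sketch below uses a toy vector (hypothetical data, not the CPS85 column) to show how levels and labels line up.

```r
# A toy vector standing in for a coded categorical variable (hypothetical data)
sex_raw <- c("M", "F", "F", "M", "F")

# levels = the codes found in the data, in the display order you want;
# labels = the names to show, given in the same order as levels
sex_clean <- factor(sex_raw, levels = c("M", "F"), labels = c("Male", "Female"))

levels(sex_clean)  # display order used by ggplot for axes and legends
table(sex_clean)   # counts follow the same order
```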

geom_ (geometric objects)

  • specify the type of graph
  • specify grouping variable
  • specify color, shape and size of the geometric objects

One Categorical Variable (geom_bar)

ggplot(data = CPS85_clean, mapping = aes(x = sex)) +
  geom_bar()

Adjust color

Add a fill color with the fill parameter and an outline color with the color parameter.

ggplot(data = CPS85_clean, mapping = aes(x = sex)) +
  geom_bar(fill = 'cornflowerblue', color = 'black')

Fill the bars with colors based on the levels of a categorical variable by assigning the categorical variable to fill. Note: when assigning a variable to fill, it has to be inside an aes() call, either in ggplot() or in the geom.

ggplot(data = CPS85_clean, mapping = aes(x = sex, fill = sex)) +
  geom_bar(color = 'black')

# this doesn't work: outside aes(), fill expects a constant color, and R cannot find an object called sex
# ggplot(data = CPS85_clean, mapping = aes(x = sex)) +
#   geom_bar(fill = sex, color = 'black')

# this works
ggplot(data = CPS85_clean) +
  geom_bar(aes(x = sex, fill = sex), color = 'black')
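
The difference between mapping a variable to fill and setting a constant fill can be inspected with layer_data(), which returns the values ggplot2 computes for a layer. This is a small sketch on the built-in mtcars data; the object names p_mapped and p_set are placeholders.

```r
library(ggplot2)

# Mapping: fill is tied to a variable inside aes(), so each level gets its own color
p_mapped <- ggplot(mtcars, aes(x = factor(am), fill = factor(am))) +
  geom_bar(color = "black")

# Setting: fill is a constant outside aes(), so every bar gets the same color
p_set <- ggplot(mtcars, aes(x = factor(am))) +
  geom_bar(fill = "cornflowerblue", color = "black")

# layer_data() returns one row per bar, including the fill that is drawn
unique(layer_data(p_mapped)$fill)
unique(layer_data(p_set)$fill)
```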

One Continuous Variable (geom_histogram)

ggplot(CPS85_clean,aes(x = wage)) +
  geom_histogram()

Adjust the number of bins

ggplot(CPS85_clean,aes(x = wage)) +
  geom_histogram(fill = "darkorange", color = "black", bins = 10)
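
geom_histogram accepts either bins (how many bars) or binwidth (how wide each bar is, in the units of x). A minimal sketch of both on the built-in mtcars data, with a binwidth of 2 chosen arbitrarily for illustration:

```r
library(ggplot2)

# bins fixes how many bars are drawn
p_bins <- ggplot(mtcars, aes(x = mpg)) +
  geom_histogram(bins = 10, fill = "darkorange", color = "black")

# binwidth fixes how wide each bar is, in the units of x (here, 2 mpg)
p_width <- ggplot(mtcars, aes(x = mpg)) +
  geom_histogram(binwidth = 2, fill = "darkorange", color = "black")

nrow(layer_data(p_bins))        # number of bins drawn
sum(layer_data(p_width)$count)  # every observation lands in exactly one bin
```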

Adjust transparency

ggplot(CPS85_clean,aes(x = wage)) +
  geom_histogram(fill = "darkorange", color = "black", bins = 10, alpha = 0.3)

Bivariate: Categorical & Categorical (geom_bar)

Specify the categorical variable that determines the color with fill, and the type of bar graph with position.

Stacked bar chart

ggplot(CPS85_clean, aes(x = sector,fill = sex)) +
  geom_bar(position = "stack")

Grouped bar chart

ggplot(CPS85_clean, aes(x = sector,fill = sex)) +
  geom_bar(position = "dodge")

Segmented bar chart

ggplot(CPS85_clean, aes(x = sector,fill = sex)) +
  geom_bar(position = "fill")
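
With position = "fill", every bar is rescaled to a height of 1, so the segments show within-category proportions rather than counts. The same proportions can be computed by hand, sketched here with base R on the built-in mtcars data (not the lab dataset):

```r
# Cross-tabulate two categorical variables from the built-in mtcars data
counts <- table(transmission = mtcars$am, cylinders = mtcars$cyl)

# margin = 2 divides each column by its total, which mirrors how position = "fill"
# rescales each bar (one bar per x category) to a height of 1
props <- prop.table(counts, margin = 2)

round(props, 2)
colSums(props)  # each column sums to 1
```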

Do you find anything wrong with this figure?

ggplot(CPS85_clean, aes(x = sector,fill = sex)) +
  geom_bar(position = "fill") + 
  scale_y_continuous(labels = scales::percent_format(accuracy = 1)) +   
  labs(y = "Percentage")

Bivariate: Categorical & Continuous

Bar graph with group means (geom_col)

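geom_col (or geom_bar with stat = "identity") draws the y values exactly as they appear in the data and stacks rows that share an x value, so raw data would produce group totals rather than means. To plot group means, summarize the data first and pass the means to geom_col, or let geom_bar compute them with stat = "summary" and fun = mean.

CPS85_clean %>%
  group_by(sector) %>%
  summarize(mean_exper = mean(exper)) %>%
  ggplot(aes(x = sector, y = mean_exper)) +
  geom_col(fill = "darkorange", alpha = 0.7)

ggplot(CPS85_clean, aes(x = sector, y = exper)) +
  geom_bar(fill = "darkorange", alpha = 0.7, stat = "summary", fun = mean)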
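
A quick way to check what a bar layer actually draws is to compare layer_data() against values computed by hand. The sketch below does this on the built-in mtcars data (an assumption for illustration): geom_col on unaggregated rows reaches the group total, while stat = "summary" with fun = mean reaches the group mean.

```r
library(ggplot2)

# Group totals and group means of horsepower by cylinder count
totals <- tapply(mtcars$hp, mtcars$cyl, sum)
means  <- tapply(mtcars$hp, mtcars$cyl, mean)

# geom_col() on raw rows stacks one segment per row, so each bar tops out at the group total
p_total <- ggplot(mtcars, aes(x = factor(cyl), y = hp)) +
  geom_col()

# stat = "summary" with fun = mean collapses each group to its mean before drawing
p_mean <- ggplot(mtcars, aes(x = factor(cyl), y = hp)) +
  geom_bar(stat = "summary", fun = mean)

total_data <- layer_data(p_total)
bar_tops <- tapply(total_data$ymax, as.numeric(total_data$x), max)

bar_tops              # matches totals
layer_data(p_mean)$y  # matches means
```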

Grouped kernel density plots (geom_density)

specify the continuous variable on the x-axis and the categorical variable with fill

ggplot(CPS85_clean, aes(x = exper, fill = race)) +
  geom_density(alpha = 0.4)

Boxplot (geom_boxplot)

Specify the continuous variable with y =

ggplot(CPS85_clean, aes(x = sector, y = exper)) +
  geom_boxplot()

Reorder the boxplots by the mean of the continuous variable (reorder() uses the mean by default)

ggplot(CPS85_clean) +
  geom_boxplot(aes(x = reorder(sector, exper), y = exper), color = "darkorange", alpha = .7)

Bivariate: Continuous & Continuous

Scatterplot (geom_point)

ggplot(CPS85_clean,
       aes(x = age, 
           y = exper)) +
  geom_point(color= "darkorange")

Scatterplot with linear fit line

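Add a linear fit line by adding a geom_smooth layer with method = "lm". Without a method, geom_smooth fits a loess or GAM smoother instead of a straight line.

ggplot(CPS85_clean,
       aes(x = age, 
           y = exper)) +
  geom_point(color= "darkorange") + 
  geom_smooth(method = "lm", se = TRUE)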
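
The straight line drawn by geom_smooth(method = "lm") is an ordinary least-squares fit of y on x, so its intercept and slope can be read from lm() directly. A minimal sketch on the built-in mtcars data (not the lab dataset):

```r
library(ggplot2)

# Fit the same straight line that geom_smooth(method = "lm") draws for y ~ x
fit <- lm(mpg ~ wt, data = mtcars)
coef(fit)  # intercept and slope of the fitted line

# The line on the scatterplot; se = TRUE adds the confidence band
ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(color = "darkorange") +
  geom_smooth(method = "lm", formula = y ~ x, se = TRUE)
```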

Grouping

Add a grouping variable with color

Specify the grouping variable with color by adding color to aes

ggplot(CPS85_clean,
       aes(x = age, 
           y = wage, 
           color = sex)) +
  geom_point(color = "darkorange") + # a constant color set inside a geom overrides the color mapping from aes()
  geom_smooth(method = "lm")

ggplot(CPS85_clean,
       aes(x = age, 
           y = wage, 
           color = sex)) +
  geom_point() + # keep the color pattern for the dots 
  geom_smooth(method = "lm")

Add a grouping variable with facets

ggplot(CPS85_clean,
       aes(x = age, 
           y = wage, 
           color = sex)) +
  geom_point() + # keep the color pattern for the dots 
  geom_smooth(method = "lm") + 
  facet_wrap(~race)

Scales

Categorical Variables

Re-order categorical variable

Adjust the order with limits and label with labels inside the scale_x_discrete layer.

# check the current levels of the factor 
levels(CPS85_clean$race)
## [1] "NW" "W"
ggplot(data = CPS85_clean, mapping = aes(x = race, fill = race)) +
  geom_bar() + 
  scale_x_discrete(limits = c("W", "NW"), # need to match the levels of the factor 
                   labels = c("White", "Non-White")) # need to match the order of the limits

Customize legend

Customize the legend by specifying parameters inside scale_fill_discrete

ggplot(data = CPS85_clean, mapping = aes(x = race, fill = race)) +
  geom_bar() + 
  scale_x_discrete(limits = c("W", "NW"), # need to match the levels of the factor 
                   labels = c("White", "Non-White")) + # need to match the order of the limits
  scale_fill_discrete(name = "Race", labels = c("Non-White", "White"))+
  labs(x = "Race")

Continuous Variables

Adjust the label intervals

Specify the min, max, and interval with scale_x_continuous(breaks = seq(min, max, interval))

# check the range 
x_range <- range(CPS85_clean$age, na.rm = TRUE)

ggplot(CPS85_clean,
       aes(x = age, 
           y = wage, 
           color = sex)) +
  geom_point() + 
  geom_smooth(method = "lm") + 
  scale_x_continuous(breaks = seq(x_range[1], x_range[2], 5)) # have to be within range 
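
It can help to look at the break values that seq() produces before passing them to the scale. The sketch below hard-codes a range of 18 to 64 (the age range of CPS85_clean); x_min and x_max are placeholder names.

```r
# Assumed axis range: the minimum and maximum age in CPS85_clean
x_min <- 18
x_max <- 64

# seq() starts at the minimum and steps by 5 without going past the maximum
age_breaks <- seq(from = x_min, to = x_max, by = 5)
age_breaks

# The maximum itself is only a break when the range is a multiple of the step
x_max %in% age_breaks
```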

Specify the unit

Add dollar sign

# check the range 
range(CPS85_clean$age)
## [1] 18 64
ggplot(CPS85_clean,
       aes(x = age, 
           y = wage, 
           color = sex)) +
  geom_point() + 
  geom_smooth(method = "lm") + 
  scale_x_continuous(breaks = seq(18, 64, 5)) + # have to be within range 
  scale_y_continuous(labels = scales::dollar)
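
The labels argument takes a function that turns numbers into strings. As a small sketch, the label functions from the scales package can be called directly to preview what will appear on the axis (the input values are arbitrary):

```r
library(scales)

# Label functions take numbers and return formatted strings for the axis
wage_labels  <- dollar(c(5, 10, 25))
share_labels <- percent(c(0.25, 0.5, 1), accuracy = 1)

wage_labels
share_labels
```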

Labels

  • Have a title
  • Make sure the x and y labels make sense

Specify all labels with labs

ggplot(CPS85_clean,
       aes(x = age, 
           y = exper)) +
  geom_point(color= "darkorange") + 
  geom_smooth(method = "lm") + 
  labs(title    = "A positive correlation between age and experience",
       subtitle = "Arrrrrrr matey!",
       x        = "Age",
       y        = "Experience (year)",
       caption  = "Data taken from the `mosaicData` package.")

Themes

  • Specify a general layout
  • Customize the font

Specify text_settings

text_settings <- 
  theme(plot.title = element_text(size = 16, face = 'bold')) +
  theme(plot.subtitle = element_text(size = 14)) +
  theme(axis.title.x = element_text(size = 16, face = 'bold')) +
  theme(axis.title.y = element_text(size = 16, face = 'bold')) +
  theme(axis.text.x = element_text(size = 10)) +
  theme(axis.text.y = element_text(size = 10)) + 
  theme(plot.title = element_text(hjust = 0.5),
        plot.subtitle = element_text(hjust = 0.5))
ggplot(CPS85_clean,
       aes(x = age, 
           y = exper)) +
  geom_point(color= "darkorange") + 
  geom_smooth(method = "lm") + 
  labs(title    = "A positive correlation between age and experience",
       subtitle = "Arrrrrrr matey!",
       x        = "Age",
       y        = "Experience (year)",
       caption  = "Data taken from the `mosaicData` package.") + 
  theme_minimal() + 
  text_settings

Display Multiple Figures

Overlaying

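Add a density plot on top of a histogram. The histogram's y-axis must be changed from counts to density so that both layers share the same scale.

# after_stat(density) replaces the deprecated ..density..; linewidth replaces size for lines (ggplot2 >= 3.4.0)
ggplot(CPS85_clean, aes(x = wage, y = after_stat(density))) +
  geom_histogram(fill = "darkorange", color = "black", bins = 10) + 
  geom_density(color = 'steelblue', linewidth = 1.1) + 
  facet_wrap(~sex)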

Organize figures into grid

Assign figures into variables, then organize multiple figures using plot_grid

library(cowplot)

wage_hist <- ggplot(CPS85_clean, aes(x = wage, y = after_stat(density))) +
  geom_histogram(fill = "darkorange", color = "black", bins = 10) + 
  geom_density(color = 'steelblue', linewidth = 1.1) + 
  facet_wrap(~sex) + 
  labs(title = "Wage distribution by gender") +
  theme_bw() +  # add a theme
  theme(plot.title = element_text(hjust = 0.5),   
        plot.subtitle = element_text(hjust = 0.5)) 

wage_age_plot <- ggplot(CPS85_clean,
       aes(x = age, 
           y = wage, 
           color = sex)) +
  geom_point() + 
  geom_smooth(method = "lm") + 
  scale_x_continuous(breaks = seq(18, 64, 5)) + 
  labs(title = "Association between wage and age") +
  theme_light() +  # add a theme
  theme(plot.title = element_text(hjust = 0.5),   
        plot.subtitle = element_text(hjust = 0.5)) 

plot_grid(wage_hist, wage_age_plot)

Plotting more than one data frame

Sometimes you need to overlay data from multiple data frames in one plot. In repeated-measures designs, for example, you often want to plot individual traces of data and then add aggregate statistics on top. Here, we're using the ChickWeight data built into R, which records the weight of individual chicks over time (Time) under different diets (Diet).

ChickWeight_df <- ChickWeight #not necessary, but I prefer to re-assign the data frame

#now, we're aggregating so that we can compute aggregate statistics like the mean for Time X Diet
ChickWeight_agg <- ChickWeight_df %>% 
  group_by(Time, Diet) %>% 
  summarize(M_weight = mean(weight))

ggplot(mapping = aes(x = Time, color = Diet))+
  geom_line(data = ChickWeight_df, aes(y = weight, group = Chick), alpha = 0.3)+ # first, plot the individual traces from the raw data frame
  geom_point(data = ChickWeight_df, aes(y = weight, group = Chick), alpha = 0.3)+ # individual observations from the same data frame
  geom_line(data = ChickWeight_agg, aes(y = M_weight, group = Diet), linewidth = 1.5)+ # next, add the means from the data frame with aggregate statistics
  geom_point(data = ChickWeight_agg, aes(y = M_weight), size = 3)+ # mean points from the same aggregate data frame
  labs(x = "Days",
       y = "Weight (g)",
       title = "Chicken Weight as a function of Time and Diet")+
  theme_classic()

# to simplify the plot, collapse Time into an early (T1) and a late (T2) time point
ChickWeight_bi <- ChickWeight %>% 
  filter(Time < 5 | Time > 16) %>% # keep only the earliest and latest time points
  mutate(Time_bi = ifelse(Time < 5, "T1", "T2")) %>% # binarize
  group_by(Chick, Time_bi, Diet) %>%
  summarize(weight = mean(weight))
            
ChickWeight_bi_agg <- ChickWeight_bi %>%
  group_by(Time_bi, Diet) %>%
  summarize(M_weight = mean(weight),
            SE_weight = sd(weight) / sqrt(n())) # standard error = SD / sqrt(n)

ggplot(mapping = aes(x = Time_bi, color = Diet))+
  geom_line(data = ChickWeight_bi, aes(y = weight, group = Chick), alpha = 0.3)+
  geom_point(data = ChickWeight_bi, aes(y = weight, group = Chick), alpha = 0.3)+
  geom_line(data = ChickWeight_bi_agg, aes(y = M_weight, group = Diet), linewidth = 1.5)+
  geom_point(data = ChickWeight_bi_agg, aes(y = M_weight))+
  geom_errorbar(data = ChickWeight_bi_agg, aes(ymin = M_weight-SE_weight, ymax = M_weight+SE_weight), width = 0.3)+
  labs(x = "Time point",
       y = "Weight (g)",
       title = "Chicken Weight as a function of Time and Diet")+
  facet_wrap(~Diet)+
  theme_classic()
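
Error bars of this kind usually show the standard error of the mean, which is the standard deviation divided by the square root of the sample size. A minimal base R sketch on one slice of the built-in ChickWeight data (the variable names are placeholders):

```r
# Weights of the chicks on the final measurement day (Time == 21)
final_weights <- ChickWeight$weight[ChickWeight$Time == 21]

n_chicks  <- length(final_weights)
sd_weight <- sd(final_weights)

# Standard error of the mean: SD divided by the square root of n
se_weight <- sd_weight / sqrt(n_chicks)

# Dividing by n instead of sqrt(n) gives a smaller value and error bars that are too short
c(se = se_weight, sd_over_n = sd_weight / n_chicks)
```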

Yes, you can customize EVERYTHING!

ggplot gives you the flexibility to customize almost everything. Data visualization is an art, but it is also an important means of communication. So even though it is tempting to spend hours finding the perfect color combination, increasing the clarity and interpretability of your data should always be your priority. Before deciding on colors, make sure the color palettes you use have sufficient contrast and are color-blind friendly.

A good reference for customizing ggplot

A guide for finding color blind friendly colors

Selecting specific colors

Want to use colors that perfectly match your slide design? In PowerPoint and many other programs, you can use an eyedropper tool to determine the exact color of any object. There are multiple color codes you can then use to match colors exactly. Most commonly, people use HEX or RGB codes. Hex codes can be pasted directly as a string for color or fill arguments in ggplot. RGB values have to be converted first with rgb(). See the example below:

mtcars <- mtcars

mtcars %>% 
  mutate(cyl = factor(cyl)) %>%
  ggplot(aes(x = cyl, y = hp))+
  geom_boxplot(aes(fill = cyl))+
  scale_fill_manual(values = c("4" = "navy",
                               "6" = "maroon",
                               "8" = "gold"))+
  labs(x = "# Cylinders",
       y = "Horsepower",
       title = "Engine Horsepower as a Function of the Number of Cylinders")+
  theme_classic()

mtcars %>% 
  mutate(cyl = factor(cyl)) %>%
  ggplot(aes(x = cyl, y = hp))+
  geom_boxplot(aes(fill = cyl))+
  # hex codes are passed as strings; RGB values are converted with rgb()
  # "#0A7A40" is the same color as rgb(10, 122, 64, maxColorValue = 255)
  scale_fill_manual(values = c("4" = "#0A7A40",
                               "6" = rgb(0, 50, 120, maxColorValue = 255),
                               "8" = rgb(100, 100, 50, maxColorValue = 255)))+
  labs(x = "# Cylinders",
       y = "Horsepower",
       title = "Engine Horsepower as a Function of the Number of Cylinders")+
  theme_classic()
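
The conversion between the two color codes can be checked in base R. In this short sketch, rgb() turns RGB values into a hex string and col2rgb() goes the other way:

```r
# rgb() converts red, green, and blue values on a 0-255 scale into a hex string
green_hex <- rgb(10, 122, 64, maxColorValue = 255)
green_hex

# col2rgb() goes the other way, from a hex string (or color name) back to RGB values
green_rgb <- col2rgb("#0A7A40")
green_rgb
```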

Saving plots

I recommend saving ggplots for publications as .svg. This is a vector graphic format: rather than storing pixels and their colors at a fixed size, it stores the relative position and size of objects. As a result, you can zoom into an svg as far as you like without ever seeing pixels, and the graphic stays crisp. This does not hold for raster formats like .png and .jpg, which eventually become pixelated when you zoom in.

Additionally, when you specify the size of text elements within the theme() command, you specify them in pt units (like the font size in Word, for instance). The effective font size will change if you manually resize the graphic in your final document. If you want to be precise about this, do the following in the document where you want to insert the graphic:

  1. Insert a placeholder shape that has the dimensions of the graphic you would like to add.
  2. Note the dimensions of that placeholder.
  3. Use those dimensions to create a figure in R with the ggsave() command, like so:

#Create a simple plot:

mtcars_plot1 <- 
  mtcars %>% 
  mutate(cyl = factor(cyl)) %>%
  ggplot(aes(x = cyl, y = hp))+
  geom_boxplot()+
  labs(x = "# Cylinders",
       y = "Horsepower",
       title = "Engine Horsepower as a Function of the Number of Cylinders")+
  theme_classic()

mtcars_plot1

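library(here) # here() builds file paths relative to the project root

# let's make a png plot first (the target folder should already exist)
ggsave(here("labs/lab6_plots/mtcars_plot.png"), width = 6, height = 2, units = "in")
# let's make an svg plot next (ggsave() needs the svglite package for .svg output)
ggsave(here("labs/lab6_plots/mtcars_plot.svg"), width = 6, height = 2, units = "in")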

ggsave() saves the most recent plot you made by default. You can also save a plot that you stored in an object somewhere in your document by passing it to the plot argument.

mtcars_plot <- mtcars %>% 
  mutate(cyl = factor(cyl)) %>%
  ggplot(aes(x = cyl, y = hp))+
  geom_boxplot()+
  labs(x = "# Cylinders",
       y = "Horsepower",
       title = "Engine Horsepower as a Function of the Number of Cylinders")+
  theme_classic()

mtcars_plot

# save the stored plot as svg by passing it to the plot argument
ggsave(filename = here("labs/lab6_plots/mtcars_plot2.svg"), plot = mtcars_plot, width = 6, height = 2, units = "in")
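
The following sketch runs the same workflow without relying on a project folder. It writes to a temporary file and uses PDF, another vector format, on the assumption that no extra package is installed; the object names p and out_file are placeholders.

```r
library(ggplot2)

p <- ggplot(mtcars, aes(x = factor(cyl), y = hp)) +
  geom_boxplot()

# Write to a temporary file so the sketch does not depend on a project folder
out_file <- tempfile(fileext = ".pdf")

# width and height set the physical size; font sizes given in pt only stay true
# if the figure is inserted at 6 x 2 inches without rescaling
ggsave(filename = out_file, plot = p, width = 6, height = 2, units = "in")

file.exists(out_file)
```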

Minihacks

The minihacks today are intentionally very open-ended. Get as creative as you want!

Data visualization is a great way to uncover stories in the data that would be difficult to notice by just looking at the numbers. See what stories you can uncover by exploring individual variables and their relationships with each other.

Load the SaratogaHouses dataset from the mosaicData package.

data(SaratogaHouses, package="mosaicData")

1a. Create visualizations for the heating variable and the livingArea variable. Add as many customization features as you want (e.g., color, labels, text settings, themes, etc.).

1b. Bonus: Can you find a way to highlight the most common heating pattern and mark the average living area on the figure?

2a. Create visualizations to demonstrate whether newly constructed houses have different heating patterns or not.

2b. Create a histogram of price with different colors representing different fuel types.

2c. Create visualizations to demonstrate whether the age of houses differs by fuel type.

3a. Create visualizations to demonstrate the association between the age of the houses and price.

3b. Create visualizations to demonstrate whether the association between the age of the houses and price depends on the waterfront and centralAir variables.