You can download an R Markdown file of today’s lab here and a PDF of today’s lab here.

Purpose

The purpose of today’s lab is to introduce you to the tidyverse, which is a collection of packages designed for data science. We will focus on data wrangling, including how to extract specific observations and variables, how to generate new variables, and how to summarize data.

For further resources on these topics, check out R for Data Science by Hadley Wickham and this cheatsheet on data wrangling from RStudio.


Intro to the tidyverse

Let’s start by loading the tidyverse:

#install.packages("tidyverse")
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.2     v purrr   0.3.4
## v tibble  3.0.4     v dplyr   1.0.2
## v tidyr   1.1.2     v stringr 1.4.0
## v readr   1.4.0     v forcats 0.5.0
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
  • When you load the tidyverse into R, you are actually loading several different packages that are designed to work well together and help you complete common tasks in R. Some of these packages are…

    • data manipulation (dplyr) = our focus today
    • reshaping data (tidyr)
    • data visualization (ggplot2)
    • working with strings (stringr)
    • working with factors (forcats)

Data wrangling with R

  • Data wrangling means getting your data into a useful form for visualizing and modeling it. Before you analyze your data, you will need to clean and transform it. For example, you may want to remove a variable from your data set, or remove participants who do not meet certain criteria.

  • If you get your data set ready for analysis in Excel or SPSS, you can end up deleting important information. In contrast, when you clean and transform a data set in R, you aren’t editing the underlying file. Rather, you are editing the data frame in your global environment.

Today’s focus: {dplyr}

  • Most of the functions we’ll go over today come from the {dplyr} package. You can think of this package as a set of “pliers” that you can use to tweak data frames, hence its name “d(ataframe)plyr.”

  • The functions of {dplyr} are verbs:

    • mutate() adds new column(s).

    • select() selects column(s).

    • filter() selects rows based on one or more criteria.

    • summarize() summarizes multiple selected values (useful for mean, sd, n).

    • arrange() changes the ordering of the rows.

Pipes

  • Pipes come from the {magrittr} package and are available when you load the tidyverse. Pipes are a way to string functions together into pipelines that are easier to read. A pipe looks like this: %>%
  • You can enter a pipe with the shortcut CTRL+Shift+M for PC or CMD+Shift+M for Mac.
#practice entering a pipe with the shortcut here
  • A pipe passes an object on the left-hand side as the first argument (or . argument) of whatever function is on the right-hand side.

    • x %>% f(y) is the same as f(x, y)

    • y %>% f(x, ., z) is the same as f(x, y, z)
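
For example, here is a quick sketch of the . placeholder (the numbers are arbitrary, just for illustration):

# pass 2 into the digits argument of round() via the . placeholder
2 %>% round(3.14159, digits = .)
## [1] 3.14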

Example: I want to calculate the mean of the mpg variable from the mtcars data set and round the answer to 2 decimal places. I can accomplish this by nesting:

round(mean(mtcars$mpg, na.rm = TRUE), 2)
## [1] 20.09

Or, we could use pipes. Grammatically, you can think of a pipe as “then.” I have a variable, the miles per gallon of cars, THEN I want to take the mean of that variable, and THEN I want to round that answer to two decimal places.

mtcars$mpg %>% 
  mean(na.rm = TRUE) %>% 
  round(2)
## [1] 20.09

Now, rewrite the following code using pipes.

round(sqrt(sum(mtcars$cyl)), 1)
## [1] 14.1
#Your code here

Why use pipes?

  1. Cleaner code
    • This is nice because it helps make your code more readable by others (including your future self).
  2. Cleaner environment
    • When you use pipes, you don’t have to save objects from intermediary steps in your global environment because you can pass output from function to function without saving it (see the sketch after this list).
  3. Efficiency in writing code
    • Naming objects is hard; piping means coming up with fewer names.
  4. More error-proof
    • Because naming is hard, you might accidentally re-use a name and make an error.
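
As a quick sketch of the “cleaner environment” point, here is the mtcars example from above done both ways (mpg_mean is just an arbitrary name for the intermediate object):

# without pipes: an intermediate object is left sitting in the global environment
mpg_mean <- mean(mtcars$mpg, na.rm = TRUE)
round(mpg_mean, 2)
## [1] 20.09

# with pipes: the intermediate value is passed along without being saved
mtcars$mpg %>% 
  mean(na.rm = TRUE) %>% 
  round(2)
## [1] 20.09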

Example data set

  • Because you are already familiar with the World Happiness data set, we will use this as an example today (we’ll use the same version from Homework 1). You can import the data with the following code:
world_happiness <- rio::import("https://raw.githubusercontent.com/uopsych/psy611/master/labs/resources/lab5/data/world_happiness.csv")

Clean names

  • If we look at the names of the variables in world_happiness, we’ll notice that all of the variable names are capitalized.
names(world_happiness)
## [1] "Country"    "Happiness"  "GDP"        "Support"    "Life"      
## [6] "Freedom"    "Generosity" "Corruption" "World"
  • The clean_names() function from the {janitor} package will (by default) convert all variable names to snake_case (but there are several other options…see here for more info).
#install.packages("janitor")
library(janitor)
# clean variable names and re-save the data
world_happiness <- world_happiness %>% 
  clean_names()

Now all of our variable names are lower case.

names(world_happiness)
## [1] "country"    "happiness"  "gdp"        "support"    "life"      
## [6] "freedom"    "generosity" "corruption" "world"

Note: Remember to save your new data frame to an object of the same name as your old data frame if you want to overwrite the old one.


Manipulating rows

Select rows with filter()

  • The filter() function is used to subset observations based on their values. The result of filtering is a data frame with the same number of columns as before but fewer rows.

  • The first argument is data and subsequent arguments are logical expressions that tell you which observations to retain in the data frame.

For example, you can filter rows to choose only rows from the United States.

world_happiness %>% 
    filter(country == "United States")

Logical operators

  • The == we just used is an example of a comparison operator that tests for equality. The other comparison operators available are:

    • > (greater than)
    • >= (greater than or equal to)
    • < (less than)
    • <= (less than or equal to)
    • != (not equal to)
  • You can combine multiple arguments to filter() with Boolean operators (described in full in R for Data Science). The most common are & (“and”), | (“or”), and ! (“not”).
  • For example, let’s select the rows for the United States, Mexico, and Canada. Since each row is one country, we are telling filter() to keep the rows where the variable country is either United States, Mexico, or Canada.
world_happiness %>% 
   filter(country == "United States" | 
            country == "Mexico" |
            country == "Canada")
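
One more quick example, this time combining two conditions with & (the happiness cutoff of 7 is arbitrary, chosen just for illustration):

world_happiness %>% 
  filter(world == 1 & happiness > 7)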
  • Since it is somewhat cumbersome to write country three times, we can use a special short-hand with the %in% operator. Generally speaking, specifying x %in% y will select every row where x is one of the values in y.

So we could have written our filter statement like this:

world_happiness %>% 
  filter(country %in% c("United States", "Mexico", "Canada"))

Examples

  • Filter for countries that are greater than the mean of happiness
# your code here
  • Filter for countries that are greater than the mean of happiness and are considered “second world” countries (world is coded as 2).
# your code here

Sort rows with arrange()

  • The arrange() function keeps the same number of rows but changes the order of the rows in your data frame.

  • The first argument is data and subsequent arguments are name(s) of columns to order the rows by. If you provide more than one column name, each additional column will be used to break ties in the values of preceding columns.

For example, let’s re-order observations by happiness. Note that rows are sorted in ascending order (smallest values first) by default.

world_happiness %>% 
  arrange(happiness)

If you want to see the happiest countries first, you need to sort in descending order (largest values first) using the desc() function.

world_happiness %>% 
  arrange(desc(happiness))
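
arrange() can also take more than one column. For instance, here is a small sketch that sorts by world first and then orders the rows within each value of world by descending happiness:

world_happiness %>% 
  arrange(world, desc(happiness))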

Manipulating variables

Extract columns with select()

  • The select() function subsets columns in your data frame. This is particularly useful when you have a data set with a huge number of variables and you want to narrow down to the variables that are relevant for your analysis.

  • The first argument is data, followed by the name(s) of the column(s) you want to subset. Let’s go through some simple examples of common uses of select().

  • Select one variable

world_happiness %>% 
  select(country)
  • Select multiple variables
world_happiness %>% 
  select(country, freedom, corruption)
  • Select a range of variables
world_happiness %>% 
  select(country:support)
  • Rearrange the order of variables
    • Note: everything() is a helper function that gives us all the remaining variables in the data frame (see more on helper functions below)
world_happiness %>% 
  select(country, world, everything()) 
  • De-select variables with a minus sign (-)
world_happiness %>% 
  select(-happiness)
  • De-select range of variables
world_happiness %>% 
  select(-(gdp:world))

Examples

  • Produce a data frame of the variables country, gdp, and happiness for countries whose gdp is greater than average.
# your code here
  • Produce a data frame that has “first world” countries (world = 1) with greater than average levels of freedom. Arrange the rows by freedom scores so that the highest freedom scores are at the top of the data set. Only display the country and freedom variables (in that order). How many observations are you left with?
# your code here

Helper functions for select()

There are some “helper” functions that you can use along with select() that can sometimes be more efficient than selecting your variables explicitly by name.

function        what it does
starts_with()   selects columns starting with a string
ends_with()     selects columns that end with a string
contains()      selects columns that contain a string
matches()       selects columns that match a regular expression
num_range()     selects columns that match a numerical range
one_of()        selects columns whose names match entries in a character vector
everything()    selects all columns
last_col()      selects the last column; can include an offset

Quick example:

world_happiness %>% 
  select(starts_with("c")) 
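
Here are a couple more quick sketches with other helpers (using the world_happiness column names we already have):

# columns whose names end in "dom" (here, just freedom)
world_happiness %>% 
  select(ends_with("dom"))

# the last column in the data frame
world_happiness %>% 
  select(last_col())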

Create new variables with mutate()

  • The mutate() function is most commonly used to add new columns to your data frame that are functions of existing columns.

  • mutate() requires data as its first argument, followed by a set of expressions defining new columns. Let’s take a couple examples…

  • Create new variables

    • Note: New variables are automatically added at the end of the data frame (scroll to the right to see them)

Here, we are creating a new variable called corruption_z, which standardizes the corruption scores, and a new variable life_int, which rounds the life scores to the nearest integer.

world_happiness %>% 
  mutate(corruption_z = scale(corruption),
         life_int = round(life, 0)) 
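
Another common pattern is creating a categorical variable from a condition. In this sketch, the new variable name happy_group and the cutoff of 5 are arbitrary, just for illustration:

world_happiness %>% 
  mutate(happy_group = ifelse(happiness > 5, "high", "low"))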
  • Change existing variables

When we imported our data, the world variable was automatically categorized as an integer.

class(world_happiness$world)
## [1] "integer"

However, this variable refers to discrete categories, and we want to change it to be a factor. We can do this using mutate().

# Note that I am re-saving the dataframe here to preserve this change
world_happiness <- world_happiness %>% 
  mutate(world = as.factor(world))

Now check the type again…

class(world_happiness$world)
## [1] "factor"
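
If you want to double-check which categories the factor contains, levels() will list them (the exact labels depend on the data):

levels(world_happiness$world)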

Summarizing data

  • The next dplyr verb we’ll cover is summarize(), which is used to summarize across rows of a data set. Like the other dplyr verbs, summarize() requires data as its first argument, and then you enter your summary functions separated by commas. Summary functions take vectors as inputs and return single values as outputs.

  • The resulting data set will have just the summary variables you created. In other words, you are going from your raw data frame to a smaller summary data frame that only contains the summary variables you specify within summarize().

Let’s use summarize() to get the mean of gdp across all observations in the data set.

world_happiness %>% 
  summarize(mean_gdp = mean(gdp, na.rm = TRUE))
  • Of course, we typically want to calculate more than just a mean. We can add other summary variables, separating them by commas.
world_happiness %>% 
  summarize(mean_gdp = mean(gdp, na.rm = TRUE), 
            sd_gdp = sd(gdp, na.rm = TRUE),
            n = n())

Grouping data

  • The group_by() function creates groups based on one or more variables in the data. This affects all kinds of things that you then do with the data, such as mutating and/or summarizing. group_by() requires data as its first argument, and then you name the variable(s) to group by.
world_happiness %>% 
  group_by(world)

At first glance, it doesn’t appear that anything has happened. However, under the hood it has indeed grouped the data frame by the world variable. Copy and paste this code into the console. What do you notice?
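
If you want to confirm the grouping in code rather than by inspecting the printed output, group_vars() returns the current grouping variables:

world_happiness %>% 
  group_by(world) %>% 
  group_vars()
## [1] "world"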

Combining group_by() and summarize()

  • group_by() and summarize() can be combined to get group-level statistics. This is a great way to make tables of descriptive stats in R or to create aggregated data sets for some purposes.

  • To use these together, you just run group_by() followed by summarize() in a pipeline.

world_happiness %>% 
  group_by(world) %>%
  summarize(mean_gdp = mean(gdp, na.rm = TRUE), 
            sd_gdp = sd(gdp, na.rm = TRUE), 
            n = n())
## `summarise()` ungrouping output (override with `.groups` argument)

Minihacks

For the minihacks today, we will be working with the diamonds data set, which is built into R. This data set contains the prices and various other attributes of about 54,000 different diamonds. Take a peek at the data set with the following functions:

head(diamonds) # first few rows
str(diamonds) # structure of the data frame

Here are what the variables refer to:

variable   meaning
price      price in US dollars
carat      weight of the diamond
cut        quality of the cut (Fair, Good, Very Good, Premium, Ideal)
color      diamond colour, from D (best) to J (worst)
clarity    a measurement of how clear the diamond is (I1 (worst), SI2, SI1, VS2, VS1, VVS2, VVS1, IF (best))
x          length in mm
y          width in mm
z          depth in mm
depth      total depth percentage
table      width of top of diamond relative to widest point

Minihack 1: Make code more legible

  1. Take a look at the following chunk of code. See if you can understand what this code is trying to do.
arrange(select(filter(diamonds, carat > 3 & carat < 4, cut == "Premium", color == "G" |  color == "H" | color == "I" | color == "J"), carat, color, price), color, desc(price))
  2. Re-write this code using pipes (%>%) so it is easier to read. Make sure you get the same result that you get when you run the above code. For an extra challenge, try making the filtering step a little more concise.
# your code here

Minihack 2: Translating into {dplyr} verbs

Answer the questions below using functions from {dplyr}.

  1. On average, which cut of diamond is the most expensive?
# your code here
  2. Which is more expensive on average: a diamond that is the worst color but best clarity, or a diamond that is the best color but worst clarity? Note: The best clarity is coded as IF and the worst clarity is coded as I1. The best color is coded as D and the worst color is coded as J.
# your code here

Minihack 3: Summarizing data

  1. Calculate the summary statistics listed below for the carat variable for each color of diamond. Give your summary variables the names indicated in parentheses.
  • mean (mean)
  • standard deviation (sd)
  • number of observations (n)
  • standard error of the mean (sem)
  • 95% confidence interval (lower and upper bounds) around the mean (calculated using a t distribution). (ci_lower and ci_upper)

Note: In your final summary output, only include color, mean, ci_lower, and ci_upper.

# your code here