The purpose of today’s lab is to introduce you to the tidyverse as a framework for working with data structures in R. We will mostly focus on data wrangling (particularly data transformation), including how to extract specific observations and variables, how to generate new variables and how to summarize data.
For further resources on these topics, check out R for Data Science by Hadley Wickham and this cheatsheet on data wrangling from RStudio.
The tidyverse, according to its creators, is “an opinionated collection of R packages designed for data science.” It’s a suite of packages designed with a consistent philosophy and aesthetic. This is nice because all of the packages are designed to work well together, providing a consistent framework to do many of the most common tasks in R, including, but not limited to…
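Data transformation (dplyr) = our focus today
Data reshaping (tidyr)
Data visualization (ggplot2)
Working with strings (stringr)
Working with factors (forcats)
To load all the packages included in the tidyverse, use: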
#install.packages("tidyverse")
library(tidyverse)
Three qualities of the tidyverse are worth mentioning at the outset:
Packages are designed to be like grammars for their task, so we’ll be using functions that are named as verbs to discuss the tidyverse. The idea is that you can string these grammatical elements together to form more complex statements, just like with language.
The first argument of (basically) every function we’ll review today is data (in the form of a data frame). This is very handy, especially when it comes to piping (discussed below).
Variable names are usually not quoted.
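To make the second and third points concrete, here is a minimal sketch using the mtcars data set that ships with R. The data set and the object names (powerful, powerful_small) are our own choices for illustration, not part of today’s lab data.

```r
library(dplyr)

# The data frame is always the first argument; column names are unquoted.
powerful <- filter(mtcars, hp > 200)         # keep cars with more than 200 horsepower
powerful_small <- select(powerful, mpg, hp)  # keep only two columns

powerful_small
```

Each call takes a data frame in and hands a data frame back, which is what lets the verbs be chained together later on.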
Data wrangling, broadly speaking, means getting your data into a useful form for visualizing and modelling it. Hadley Wickham, who has developed a lot of the tidyverse, conceptualizes the main steps involved in data wrangling as follows:
Importing your data (we covered this in Week 1’s lab)
Tidying your data (see brief overview below)
Transforming your data (what we’ll cover today)
The figure below highlights the steps in data wrangling in relation to the broader scope of a typical data science workflow:
Data is considered “tidy” when:
Each variable has its own column
Each observation has its own row
Each value has its own cell
The following figure from R for Data Science illustrates this visually.
If your data is not already in tidy format when you import it, you can use functions from the {tidyr} package, e.g. gather() and spread() (or their newer replacements, pivot_longer() and pivot_wider()), to “reshape” your data into tidy format.
However, this term we are mostly going to work with simpler data sets that are already tidy, so we are not going to focus on these functions today. These functions will become especially useful in the future when we work with repeated measures data that has multiple observations for each subject. If you are interested in learning more about reshaping your data with {tidyr}, check out this chapter from R for Data Science.
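For a quick taste of what reshaping looks like, here is a small sketch using pivot_longer() and pivot_wider(), the newer {tidyr} interface to the same idea as gather() and spread(). The wide data frame below is made up for illustration (two subjects measured at two time points).

```r
library(tidyr)

# Hypothetical repeated-measures data in "wide" format:
# one row per subject, one column per time point.
wide <- data.frame(
  subject = c("s1", "s2"),
  time1   = c(5, 7),
  time2   = c(6, 9)
)

# Tidy ("long") format: one row per subject-by-time observation.
long <- pivot_longer(
  wide,
  cols      = c(time1, time2),
  names_to  = "time",
  values_to = "score"
)

long

# pivot_wider() reverses the operation.
wide_again <- pivot_wider(long, names_from = time, values_from = score)
```

In the long version each variable (subject, time, score) has its own column and each observation has its own row, which matches the definition of tidy data given above.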
{dplyr}
Today’s functions come from the {dplyr} package. Essentially, you can think of this package as a set of “pliers” that you can use to tweak data frames, hence its name (and hex sticker).
{dplyr} is a “grammar” of data manipulation. As such, its functions are verbs:
mutate() adds new variables that are functions of existing variables.
select() picks variables based on their names.
filter() picks cases based on their values.
summarize() reduces multiple values down to a single summary.
arrange() changes the ordering of the rows.
Note that {dplyr} functions always take a data frame as the first argument and return a modified data frame back to you. The fact that you always get a data frame back is useful down the road when you are modelling and visualizing data.
Pipes come from the {magrittr} package and are available when you load the tidyverse. (Technically, the pipe is imported with {dplyr}.) Pipes are a way to write strings of functions more easily, creating pipelines. They are extremely powerful and useful. A pipe looks like this: %>%
You can enter a pipe with the shortcut CTRL+Shift+M for PC or CMD+Shift+M for Mac.
Strictly speaking, a pipe passes the object on its left-hand side as the first argument (or the . argument) of whatever function is on the right-hand side.
x %>% f(y) is the same as f(x, y)
y %>% f(x, ., z) is the same as f(x, y, z)
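The two equivalences above can be checked directly. This is a small sketch with a made-up numeric vector; round() stands in for the generic function f.

```r
library(dplyr)  # loading dplyr makes %>% available

x <- c(1.234, 5.678)

# Left-hand side becomes the first argument: x %>% round(1) is round(x, 1)
piped  <- x %>% round(1)
nested <- round(x, 1)
identical(piped, nested)

# The dot placeholder sends the left-hand side somewhere else:
# 2 %>% round(x, .) is round(x, 2)
piped_dot <- 2 %>% round(x, .)
identical(piped_dot, round(x, 2))
```

Both identical() calls should return TRUE, since the pipe only changes how the call is written, not what is computed.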
For example, to calculate the mean of the mpg variable from the mtcars data set and round our answer to 2 decimal places, we can do the following…
mtcars$mpg %>% # select the `mpg` variable from the `mtcars` dataset
mean(na.rm = TRUE) %>% # calculate the mean
round(2) # round to 2 decimal places
## [1] 20.09
This accomplishes the same thing as “nesting” functions within each other…
round(mean(mtcars$mpg, na.rm = TRUE), 2)
## [1] 20.09
world_happiness <- rio::import("https://raw.githubusercontent.com/uopsych/psy611/master/labs/resources/lab5/data/world_happiness.csv")
If we look at the variable names in world_happiness, we’ll notice that all of the variable names are capitalized.
names(world_happiness)
## [1] "Country" "Happiness" "GDP" "Support" "Life"
## [6] "Freedom" "Generosity" "Corruption" "World"
The clean_names() function from the {janitor} package will (by default) convert all variable names to snake_case (but there are several other options; see the {janitor} documentation for more info).
install.packages("janitor") # if not already installed
library(janitor)
# clean variable names and re-save the data
world_happiness <- world_happiness %>%
clean_names()
Now all of our variable names are lower case.
names(world_happiness)
## [1] "country" "happiness" "gdp" "support" "life"
## [6] "freedom" "generosity" "corruption" "world"
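If you would rather not install an extra package, a rougher alternative is rename_with() from {dplyr}, which applies a function to every column name. This is only a sketch on a made-up data frame (toy); unlike clean_names(), tolower() does nothing about spaces or punctuation in names.

```r
library(dplyr)

# Hypothetical data frame with capitalized names, mirroring the pattern above
toy <- data.frame(Country = c("A", "B"), Happiness = c(5.1, 6.3))

toy_lower <- toy %>%
  rename_with(tolower)  # apply tolower() to every column name

names(toy_lower)
```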
filter()
The filter() function is used to subset observations based on their values. The result of filtering is a data frame with the same number of columns as before but fewer rows, as illustrated below…
The first argument is data, and subsequent arguments are logical expressions that tell you which observations to retain in the data frame.
For example, we can filter rows to retain data only for the United States.
world_happiness %>%
filter(country == "United States")
The == we just used is an example of a comparison operator that tests for equality. The other comparison operators available are:
> (greater than)
>= (greater than or equal to)
< (less than)
<= (less than or equal to)
!= (not equal to)
You can combine multiple conditions in filter() with Boolean operators. The figure below from R for Data Science shows the complete set of Boolean operators.
For example, the | (or) operator retains rows that meet any one of several conditions:
world_happiness %>%
filter(country == "United States" | country == "Mexico" | country == "Canada")
Rather than typing country three times, we can use a special shorthand here with the %in% operator. Generally speaking, specifying x %in% y will select every row where x is one of the values in y.
So we could have written our filter statement like this:
world_happiness %>%
filter(country %in% c("United States", "Mexico", "Canada"))
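The same ideas can be tried on the mtcars data set that ships with R (used here only so the sketch runs without downloading anything). It shows that comma-separated conditions behave like &, that %in% matches a chain of | comparisons, and that ! negates a condition.

```r
library(dplyr)

# Conditions separated by commas are combined with AND
and_comma <- mtcars %>% filter(cyl == 4, mpg > 30)
and_amp   <- mtcars %>% filter(cyl == 4 & mpg > 30)
identical(and_comma, and_amp)

# %in% is shorthand for a chain of == comparisons joined by |
in_op <- mtcars %>% filter(cyl %in% c(4, 6))
or_op <- mtcars %>% filter(cyl == 4 | cyl == 6)
identical(in_op, or_op)

# ! flips a condition: keep everything that is NOT a 4- or 6-cylinder car
not_in <- mtcars %>% filter(!cyl %in% c(4, 6))
```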
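Filter for countries whose happiness is greater than the mean of happiness.
# your code here
Filter for countries whose happiness is greater than the mean of happiness but whose gdp is less than the mean of gdp.
# your code here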
arrange()
The arrange() function keeps the same number of rows but changes the order of the rows in your data frame, as illustrated below…
The first argument is data, and subsequent arguments are name(s) of columns to order the rows by. If you provide more than one column name, each additional column will be used to break ties in the values of preceding columns.
For example, let’s re-order observations by happiness. Note that rows are sorted in ascending order by default.
world_happiness %>%
arrange(happiness) # sorts in ascending order by default
world_happiness %>%
arrange(desc(happiness)) # sort in descending order
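Tie-breaking with a second column is easier to see in data with many repeated values, so this sketch uses mtcars (our choice for illustration): rows are ordered by number of cylinders, and within each cylinder count by mpg from highest to lowest.

```r
library(dplyr)

sorted <- mtcars %>%
  arrange(cyl, desc(mpg)) %>%  # cyl ascending; ties broken by mpg descending
  select(cyl, mpg)

head(sorted)
```

The first row should be the most fuel-efficient 4-cylinder car, because cyl is sorted first and mpg only decides the order among cars with the same cyl.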
select()
The select() function subsets columns in your data frame. This is particularly useful when you have a data set with a huge number of variables and you want to narrow down to the variables that are relevant for your analysis.
The first argument is data, followed by the name(s) of the column(s) you want to subset. Note that you can use variable positions rather than their names, but positions are harder to read and break if the column order changes. Let’s go through some simple examples of common uses of select().
Select one variable
world_happiness %>%
select(country)
Select multiple variables
world_happiness %>%
  select(country, freedom, corruption)
Select a range of variables
world_happiness %>%
  select(country:support)
everything() is a helper function that gives us all the remaining variables in the data frame (see more on helper functions below)
world_happiness %>%
select(country, world, everything())
De-select a variable with a minus sign (-)
world_happiness %>%
  select(-happiness)
De-select a range of variables
world_happiness %>%
  select(-(gdp:world))
Filter for countries in world 1 that have greater than average levels of freedom. Arrange the rows by freedom scores in descending order, and only display the country and freedom variables (in that order). How many observations are you left with?
# your code here
Helper functions for select()
There are some “helper” functions that you can use along with select() that can sometimes be more efficient than selecting your variables explicitly by name.
function | what it does |
---|---|
starts_with() | selects columns starting with a string |
ends_with() | selects columns that end with a string |
contains() | selects columns that contain a string |
matches() | selects columns that match a regular expression |
num_range() | selects columns that match a numerical range |
one_of() | selects columns whose names match entries in a character vector |
everything() | selects all columns |
last_col() | selects last column; can include an offset. |
Quick example:
world_happiness %>%
select(starts_with("c"))
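A few more helpers in action, sketched on the iris data set that ships with R (chosen because its column names share prefixes and suffixes). Note that starts_with(), ends_with(), contains() and matches() ignore case by default, which is why "petal" matches "Petal.Length".

```r
library(dplyr)

# Column names in iris: Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, Species

petal_cols <- iris %>% select(starts_with("petal")) %>% names()
width_cols <- iris %>% select(ends_with("Width")) %>% names()
s_cols     <- iris %>% select(matches("^S")) %>% names()  # regular expression
last_name  <- iris %>% select(last_col()) %>% names()

petal_cols
width_cols
s_cols
last_name
```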
mutate()
The mutate() function is most commonly used to add new columns to your data frame that are functions of existing columns.
mutate() requires data as its first argument, followed by a set of expressions defining new columns. Let’s look at a couple of examples…
world_happiness %>%
mutate(corruption_z = scale(corruption), # z-score `corruption` variable
life_int = round(life, 0)) # round `life` variable to a whole number
When we imported our data, the world variable was automatically categorized as an integer.
class(world_happiness$world)
## [1] "integer"
However, this variable refers to discrete categories, and we want to change it to be a factor. We can do this using mutate().
# Note that I am re-saving the dataframe here to preserve this change
world_happiness <- world_happiness %>%
mutate(world = as.factor(world))
Now check the type again…
class(world_happiness$world)
## [1] "factor"
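One detail worth knowing about z-scoring inside mutate(): scale() returns a one-column matrix rather than a plain numeric vector. The sketch below, on mtcars (our choice for illustration), wraps it in as.numeric() and checks it against the z-score formula written out by hand. It also shows that assigning to an existing column name overwrites that column.

```r
library(dplyr)

cars_z <- mtcars %>%
  mutate(
    mpg_z      = as.numeric(scale(mpg)),     # flatten scale()'s matrix to a vector
    mpg_z_hand = (mpg - mean(mpg)) / sd(mpg), # same z-score computed by hand
    cyl        = as.factor(cyl)               # overwrite an existing column
  )

class(cars_z$mpg_z)
class(cars_z$cyl)
all.equal(cars_z$mpg_z, cars_z$mpg_z_hand)
```

Whether you need as.numeric() depends on what you do next; it is shown here only as one way to keep the new column a plain vector.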
The summarize() function is used to summarize across rows of a data set. Like the other {dplyr} verbs, summarize() requires data as its first argument, and then you enter your summary functions separated by commas. Summary functions take vectors as inputs and return single values as outputs, so an ungrouped data frame is collapsed to a single row by summarize(), as illustrated below…
Let’s use summarize() to get the mean of gdp across all observations in the data set.
world_happiness %>%
summarize(mean_gdp = mean(gdp, na.rm = TRUE))
world_happiness %>%
summarize(mean_gdp = mean(gdp, na.rm = TRUE), # mean
sd_gdp = sd(gdp, na.rm = TRUE), # standard deviation
n = n()) # number of observations
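Any function that turns a vector into a single value can be used inside summarize(). Here is a short sketch with a few other common ones, run on mtcars (our choice for illustration); the column names on the left of each = are arbitrary.

```r
library(dplyr)

mpg_summary <- mtcars %>%
  summarize(
    median_mpg = median(mpg),
    min_mpg    = min(mpg),
    max_mpg    = max(mpg),
    n_cars     = n()        # number of rows
  )

mpg_summary
dim(mpg_summary)  # one row, one column per summary
```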
The group_by() function creates groups based on one or more variables in the data. This affects all kinds of things that you then do with the data, such as mutating and/or summarizing. group_by() requires data as its first argument, and then you name the variable(s) to group by.
world_happiness %>%
group_by(world)
At first glance, it doesn’t appear that anything has happened. However, under the hood it has indeed grouped the data frame by the world variable. Copy and paste this code into the console. What do you notice?
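If the grouping is hard to spot in the printed output, a couple of {dplyr} helpers can confirm it. This sketch uses mtcars grouped by cyl (our choice for illustration) and also shows how grouping changes what mutate() does: the mean is computed within each group rather than over the whole data set.

```r
library(dplyr)

by_cyl <- mtcars %>% group_by(cyl)

group_vars(by_cyl)  # which variable(s) the data are grouped by
n_groups(by_cyl)    # how many groups there are

# Grouped mutate: center mpg around the mean of its own cylinder group
centered <- by_cyl %>%
  mutate(mpg_centered = mpg - mean(mpg)) %>%
  ungroup()         # drop the grouping once it is no longer needed
```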
group_by() and summarize()
group_by() and summarize() can be combined to get group-level statistics. This is a great way to make tables of descriptive stats in R or to create aggregated data sets.
To do this, use group_by() followed by summarize() in a pipeline.
world_happiness %>%
group_by(world) %>% # group by the world variable
summarize(mean_gdp = mean(gdp, na.rm = TRUE), # mean
sd_gdp = sd(gdp, na.rm = TRUE), # standard deviation
n = n()) # number of observations
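The same pattern extends to more than one grouping variable, in which case you get one row per combination that actually occurs in the data. A sketch on mtcars (our choice for illustration), grouping by number of cylinders and transmission type; the .groups = "drop" argument asks summarize() to return an ungrouped data frame.

```r
library(dplyr)

by_cyl_am <- mtcars %>%
  group_by(cyl, am) %>%          # group by two variables at once
  summarize(
    mean_mpg = mean(mpg),
    n        = n(),
    .groups  = "drop"            # return an ungrouped result
  )

by_cyl_am
```

The n column is a handy sanity check: summed across rows, it should equal the number of rows in the original data.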
For the minihacks today, we will be working with the diamonds data set, which ships with {ggplot2} and is available once the tidyverse is loaded. This data set contains the prices and various other attributes of about 54,000 different diamonds. Take a peek at the data set with the following functions:
head(diamonds) # first few rows
str(diamonds) # structure of the data frame
Here are what the variables refer to:
variable | meaning |
---|---|
price | price in US dollars |
carat | weight of the diamond |
cut | quality of the cut (Fair, Good, Very Good, Premium, Ideal) |
color | diamond colour, from D (best) to J (worst) |
clarity | a measurement of how clear the diamond is (I1 (worst), SI2, SI1, VS2, VS1, VVS2, VVS1, IF (best)) |
x | length in mm |
y | width in mm |
z | depth in mm |
depth | total depth percentage |
table | width of top of diamond relative to widest point |
arrange(select(filter(diamonds, carat > 3 & carat < 4, cut == "Premium", color == "G" | color == "H" | color == "I" | color == "J"), carat, color, price), color, desc(price))
Rewrite the code above using the pipe (%>%) so it is easier to read. Make sure you get the same result that you get when you run the above code. For an extra challenge, try making the filtering step a little more concise.
# your code here
{dplyr} verbs
Answer the questions below using functions from {dplyr}. Think about the order in which you will need to do different operations.
… Fair and Ideal diamonds?
# your code here
… a diamond that is Fair and IF (worst cut, best clarity), or a diamond that is Ideal and I1 (best cut, worst clarity)?
# your code here
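Summarize the carat variable for each color of diamond. Give your summary variables the names indicated in parentheses.
Mean (mean)
Standard deviation (sd)
Number of observations (n)
Standard error of the mean (sem)
Lower and upper bounds of the confidence interval (ci_lower and ci_upper)
***In your final summary output only include color, mean, ci_lower and ci_upper
# your code here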
Summarize the carat variable for each combination of color and cut. How many observations do you have in your summary data frame this time?
# your code here