You can download the .Rmd file here. You can download the .doc file here.
The purpose of today’s lab is to introduce you to the
tidyverse
as a framework for working with data structures
in R. We will mostly focus on data wrangling (particularly data
transformation), including how to extract specific observations and
variables, how to generate new variables and how to summarize data.
For further resources on these topics, check out R for Data Science by Hadley Wickham and this cheatsheet on data wrangling from RStudio.
The tidyverse
, according to its creators, is “an opinionated
collection of R packages designed for data science.” It’s a suite of
packages designed with a consistent philosophy and aesthetic. This is
nice because all of the packages are designed to work well together,
providing a consistent framework to do many of the most common tasks in
R, including, but not limited to…

Data transformation (dplyr) = our focus today
Data tidying (tidyr)
Data visualization (ggplot2)
String manipulation (stringr)
Working with factors (forcats)

To load all the packages included in the tidyverse, use:
#install.packages("tidyverse")
library(tidyverse)
Three qualities of the tidyverse
are worth
mentioning at the outset:
Packages are designed to be like grammars for their task, so we’ll be using functions that are named as verbs to discuss the tidyverse. The idea is that you can string these grammatical elements together to form more complex statements, just like with language.
The first argument of (basically) every function we’ll review
today is data
(in the form of a data frame). This is very
handy, especially when it comes to piping (discussed below).
Variable names are usually not quoted.
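As a quick illustration of the last two qualities, here is a hypothetical one-liner using the built-in mtcars data: the data frame comes first, and the variable name cyl is not quoted.

```r
library(dplyr)

# The data frame is the first argument; `cyl` is an unquoted column name
filter(mtcars, cyl == 6)
```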
Data wrangling, broadly speaking, means getting your data into a useful form for visualizing and modelling it. Hadley Wickham, who has developed a lot of the tidyverse, conceptualizes the main steps involved in data wrangling as follows:
Importing your data (we covered this in Week 1’s lab)
Tidying your data (see brief overview below)
Transforming your data (what we’ll cover today)
The figure below highlights the steps in data wrangling in relation to the broader scope of a typical data science workflow:
Data is considered “tidy” when:
Each variable has its own column
Each observation has its own row
Each value has its own cell
The following figure from R for Data Science illustrates this visually.
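For a concrete example with made-up numbers, the following small data frame is tidy: each variable (country, year, cases) is a column, each observation is a row, and each value has its own cell.

```r
# A minimal tidy data frame (hypothetical data)
tidy_df <- data.frame(
  country = c("A", "A", "B", "B"),
  year    = c(2019, 2020, 2019, 2020),
  cases   = c(10, 12, 20, 18)
)
tidy_df
```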
If your data is not already in tidy format when you import it, you can use functions from the {tidyr} package, e.g. pivot_longer() and pivot_wider(), that allow you to “reshape” your data to get it into tidy format.
However, this term we are mostly going to work with simpler
datasets that are already tidy, so we are not going to focus on these
functions today. These functions will become especially useful in the
future when we work with repeated measures data that has multiple
observations for each subject. If you are interested in learning more
about reshaping your data with {tidyr}
, check out this chapter from R for Data
Science.
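As a quick sketch of what reshaping looks like (hypothetical data, not part of today's exercises): pivot_longer() stacks a set of columns into name/value pairs, turning wide data into tidy, long-format data.

```r
library(tidyr)

# Wide format: one column per time point (untidy for repeated measures)
wide <- data.frame(subject = c("s1", "s2"),
                   t1 = c(10, 12),
                   t2 = c(14, 11))

# Long format: one row per subject-by-time observation
long <- pivot_longer(wide,
                     cols = c(t1, t2),
                     names_to = "time",
                     values_to = "score")
long
```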
{dplyr}

Today we will focus on the {dplyr} package. Essentially, you can think of this package as a set of “pliers” that you can use to tweak data frames, hence its name (and hex sticker).

{dplyr} is a “grammar” of data manipulation. As such, its functions are verbs:
mutate() adds new variables that are functions of existing variables.
select() picks variables based on their names.
filter() picks cases based on their values.
summarize() reduces multiple values down to a single summary.
arrange() changes the ordering of the rows.
Note that {dplyr}
functions always take a data frame
as the first argument and return a modified data frame back to you. The
fact that you always get a data frame back is useful down the road when
you are modelling and visualizing data.
Pipes from the {magrittr} package are available when you load the tidyverse. (Technically, the pipe is imported with {dplyr}.) Pipes are a way to write strings of functions more easily, creating pipelines. They are extremely powerful and useful. A pipe looks like this: %>%

The keyboard shortcut for inserting a pipe is Ctrl+Shift+M on a PC or Cmd+Shift+M on a Mac.

#practice entering a pipe with the shortcut here
A pipe passes an object on the left-hand side as the first
argument (or .
argument) of whatever function is on the
right-hand side.
x %>% f(y) is the same as f(x, y)

y %>% f(x, ., z) is the same as f(x, y, z)
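Here is a small runnable illustration of both forms (the pipe is available once the tidyverse, or just magrittr, is loaded):

```r
library(magrittr)

4 %>% sqrt()             # same as sqrt(4)
4 %>% seq(1, ., by = 1)  # the . placeholder: same as seq(1, 4, by = 1)
```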
Example: I want to calculate the mean of the mpg variable from the mtcars data set and round the answer to 2 decimal places. I can accomplish this by nesting:
round(mean(mtcars$mpg, na.rm = TRUE), 2)
Or, we could use pipes. Grammatically, you can think of a pipe as “then.” I have a variable, the miles per gallon of cars, THEN I want to take the mean of that variable, and THEN I want to round that answer to two decimal places.
mtcars$mpg %>% # select the `mpg` variable from the `mtcars` dataset
mean(na.rm = TRUE) %>% # calculate the mean
round(2) # round to 2 decimal places
Now, rewrite the following code using pipes.
round(sqrt(sum(mtcars$cyl)), 1)
mtcars$cyl %>%
sum() %>%
sqrt() %>%
round(1)
1. Cleaner environment
When you use pipes, you have basically no reason to save objects from intermediary steps in your data wrangling / analysis workflow, because you can just pass output from function to function without saving it. Finding the objects you’re looking for is easier.

2. Efficiency in writing code
Naming objects is hard; piping means coming up with fewer names.

3. More error-proof
Because naming is hard, you might accidentally re-use a name and make an error.
world_happiness <- rio::import("https://raw.githubusercontent.com/uopsych/psy611/master/labs/resources/lab5/data/world_happiness.csv")
If we look at the names of the variables in world_happiness, we’ll notice that they are all capitalized.

names(world_happiness)
## [1] "Country" "Happiness" "GDP" "Support" "Life"
## [6] "Freedom" "Generosity" "Corruption" "World"
The clean_names() function from the {janitor} package will (by default) convert all variable names to snake_case (but there are several other options…see here for more info).

#install.packages("janitor") # if not already installed
library(janitor)
# clean variable names and re-save the data
world_happiness <- world_happiness %>%
clean_names()
Now all of our variable names are lower case.
names(world_happiness)
## [1] "country" "happiness" "gdp" "support" "life"
## [6] "freedom" "generosity" "corruption" "world"
filter()

The filter() function is used to subset observations based on their values. The result of filtering is a data frame with the same number of columns as before but fewer rows, as illustrated below…

The first argument is data, and subsequent arguments are logical expressions that tell you which observations to retain in the data frame.

For example, we can filter rows to retain data only for the United States.
world_happiness %>%
filter(country == "United States")
The == we just used is an example of a comparison operator that tests for equality. The other comparison operators available are:

> (greater than)
>= (greater than or equal to)
< (less than)
<= (less than or equal to)
!= (not equal to)
%in% (tests whether the element on the left side of %in% is inside a vector on the right side)

You can also combine multiple logical expressions inside filter() with Boolean operators. The figure below from R for Data Science shows the complete set of Boolean operators.

world_happiness %>%
  filter(country == "United States" | country == "Mexico" | country == "Canada")
Instead of typing country three times, we can use a special short-hand here with the %in% operator. Generally speaking, specifying x %in% y will select every row where x is one of the values in y.

So we could have written our filter statement like this:
world_happiness %>%
filter(country %in% c("United States", "Mexico", "Canada"))
# or, save the vector of country names first
country_names <- c("United States", "Mexico", "Canada")

world_happiness %>%
  filter(country %in% country_names)
We can also use a computed value inside filter(). For example, let’s keep only the rows where happiness is greater than the mean of happiness.

world_happiness %>%
  filter(happiness > mean(happiness, na.rm = TRUE))
Now filter the rows where happiness is greater than the mean of happiness but less than the mean of gdp.

# your code here
world_happiness %>%
  filter(happiness > mean(happiness, na.rm = TRUE) &
           gdp < mean(gdp, na.rm = TRUE))

# Note: filter() treats comma-separated conditions as "and", so this is equivalent:
world_happiness %>%
  filter(happiness > mean(happiness, na.rm = TRUE),
         gdp < mean(gdp, na.rm = TRUE))
arrange()

The arrange() function keeps the same number of rows but changes the order of the rows in your data frame, as illustrated below…

The first argument is data, and subsequent arguments are the name(s) of columns to order the rows by. If you provide more than one column name, each additional column will be used to break ties in the values of preceding columns.

For example, let’s re-order observations by happiness. Note that rows are sorted in ascending order by default.
world_happiness %>%
arrange(happiness) %>% # sorts in ascending order by default
head()
To sort in descending order instead, wrap the column name in desc():

world_happiness %>%
  arrange(desc(happiness)) %>% # sort in descending order
  head()
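To see tie-breaking with multiple columns in action, here is a sketch using the built-in mtcars data (rather than world_happiness), sorting by cyl and then by descending mpg within each cylinder count:

```r
library(dplyr)

mtcars %>%
  arrange(cyl, desc(mpg)) %>% # `mpg` breaks ties within each `cyl` value
  head()
```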
select()

The select() function subsets columns in your data frame. This is particularly useful when you have a data set with a huge number of variables and you want to narrow down to the variables that are relevant for your analysis.

The first argument is data, followed by the name(s) of the column(s) you want to subset. Note that you can use variable positions rather than their names, but this is usually not as useful.
Let’s go through some simple examples of common uses of select().
Select one variable
world_happiness %>%
select(country)
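As an aside, selection by position (mentioned above) looks like this; a sketch with the built-in mtcars data:

```r
library(dplyr)

# Select the first and third columns by position
# (usually less readable than selecting by name)
mtcars %>%
  select(1, 3) %>%
  names() # "mpg" "disp"
```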
Select multiple variables

world_happiness %>%
  select(country, freedom, corruption)
Select a range of contiguous variables with :

world_happiness %>%
  select(country:support)
names(world_happiness)
You can also use select() to re-order the columns in your data frame. everything() is a helper function that gives us all the remaining variables in the data frame (see more on helper functions below).

world_happiness %>%
  select(country, world, everything()) %>%
  names()
You can de-select a variable with a minus sign (-):

world_happiness %>%
  select(-happiness) %>%
  names()
You can also de-select a range of columns:

world_happiness %>%
  select(-(gdp:world)) %>%
  names()
Now display the variables country, gdp, and happiness for the countries whose gdp is greater than average.

# your code here
world_happiness %>%
  janitor::clean_names() %>%
  select(country, gdp, happiness) %>%
  filter(gdp > mean(gdp, na.rm = TRUE)) %>%
  head()
#mean(as.numeric(world_happiness$GDP), na.rm = TRUE)
Filter the rows for countries in world 1 that have greater than average levels of freedom. Arrange the rows by freedom scores in descending order, and only display the country and freedom variables (in that order). How many observations are you left with?

# your code here
world_happiness %>%
filter(world == 1 & freedom > mean(freedom, na.rm = TRUE)) %>%
arrange(desc(freedom)) %>%
select(country, freedom)
Helper functions for select()

There are some “helper” functions that you can use along with select() that can sometimes be more efficient than selecting your variables explicitly by name.
| function | what it does |
|---|---|
| starts_with() | selects columns starting with a string |
| ends_with() | selects columns that end with a string |
| contains() | selects columns that contain a string |
| matches() | selects columns that match a regular expression |
| num_range() | selects columns that match a numerical range |
| one_of() | selects columns whose names match entries in a character vector |
| everything() | selects all columns |
| last_col() | selects the last column; can include an offset |
Quick example:
world_happiness %>%
select(starts_with("c"))
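A couple more sketches with the built-in mtcars data, showing ends_with() and last_col():

```r
library(dplyr)

mtcars %>% select(ends_with("p")) %>% names() # "disp" "hp"
mtcars %>% select(last_col()) %>% names()     # "carb" (the last column)
```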
mutate()

The mutate() function is most commonly used to add new columns to your data frame that are functions of existing columns.

mutate() requires data as its first argument, followed by a set of expressions defining new columns. Let’s take a couple of examples…
Create new variables
world_happiness %>%
mutate(corruption_z = scale(corruption), # z-score `corruption` variable
life_int = round(life, 0)) %>% # round `life` variable to a whole number
head()
By default, new columns are added at the end of the data frame. You can control where they are placed using .before = [column name] or .after = [column name]. However, this refers to the placement of all new variables as they are created within a single mutate() function:

world_happiness %>%
mutate(corruption_z = scale(corruption), # z-score `corruption` variable
life_int = round(life, 0),# round `life` variable to a whole number
.before = life) %>% #place the two new variables before life
head()
When we imported our data, the world
variable was
automatically categorized as an integer
.
class(world_happiness$world)
However, this variable refers to discrete categories, and we want to
change it to be a factor
. We can do this using
mutate()
.
# Note that I am re-saving the dataframe here to preserve this change
world_happiness <- world_happiness %>%
mutate(world = as.factor(world))
Now check the type again…
class(world_happiness$world)
summarize()

The summarize() function is used to summarize across rows of a dataset. Like all tidyverse functions, summarize() requires data as its first argument, and then you enter your summary functions separated by commas. Summary functions take vectors as inputs and return single values as outputs, as illustrated below…

Let’s use summarize() to get the mean of gdp across all observations in the dataset.
world_happiness %>%
summarize(mean_gdp = mean(gdp, na.rm = TRUE))
We can also calculate multiple summary statistics at once:

world_happiness %>%
summarize(mean_gdp = mean(gdp, na.rm = TRUE), # mean
sd_gdp = sd(gdp, na.rm = TRUE), # standard deviation
n = n()) # number of observations
group_by()

The group_by() function creates groups based on one or more variables in the data. This affects all kinds of things that you then do with the data, such as mutating and/or summarizing. group_by() requires data as its first argument, and then you name the variable(s) to group by.

world_happiness %>%
  group_by(world)
At first glance, it doesn’t appear that anything has happened. However, under the hood it has indeed grouped the data frame by the world variable. Copy and paste this code into the console. What do you notice?
Combining group_by() and summarize()

group_by() and summarize() can be combined to get group-level statistics. This is a great way to make tables of descriptive stats in R or to create aggregated datasets for some purposes. To do this, run group_by() followed by summarize() in a pipeline.

world_happiness %>%
group_by(world) %>% # group by the world variable
summarize(mean_gdp = mean(gdp, na.rm = TRUE), # mean
sd_gdp = sd(gdp, na.rm = TRUE), # standard deviation
n = n()) # number of observations
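As noted above, grouping also changes what mutate() does. A sketch with the built-in mtcars data: after group_by(), the mean() inside mutate() is computed within each group, so we can center mpg within each cylinder group.

```r
library(dplyr)

mtcars %>%
  group_by(cyl) %>%
  mutate(mpg_centered = mpg - mean(mpg, na.rm = TRUE)) %>% # group-wise mean
  ungroup() %>%
  select(cyl, mpg, mpg_centered) %>%
  head()
```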
For the minihacks today, we will be working with the diamonds
dataset, which is built into
R. This dataset contains the prices and various other attributes of
about 54,000 different diamonds. Take a peek at the dataset with the
following functions:
head(diamonds) # first few rows
str(diamonds) # structure of the data frame
Here are what the variables refer to:
| variable | meaning |
|---|---|
| price | price in US dollars |
| carat | weight of the diamond |
| cut | quality of the cut (Fair, Good, Very Good, Premium, Ideal) |
| color | diamond colour, from D (best) to J (worst) |
| clarity | a measurement of how clear the diamond is (I1 (worst), SI2, SI1, VS2, VS1, VVS2, VVS1, IF (best)) |
| x | length in mm |
| y | width in mm |
| z | depth in mm |
| depth | total depth percentage |
| table | width of top of diamond relative to widest point |
arrange(select(filter(diamonds, carat > 3 & carat < 4, cut == "Premium", color == "G" | color == "H" | color == "I" | color == "J"), carat, color, price), color, desc(price))
Rewrite the above code using pipes (%>%) so it is easier to read. Make sure you get the same result that you get when you run the above code. For an extra challenge, try making the filtering step a little more concise.

# your code here
diamonds %>%
  filter(carat > 3 & carat < 4,
         cut == "Premium",
         color %in% c("G", "H", "I", "J")) %>%
  select(carat, color, price) %>%
  arrange(color, desc(price))
{dplyr} verbs

Answer the questions below using functions from {dplyr}. Think about the order in which you will need to do different operations.
Which cut of diamonds has the highest average price?

# your code here
diamonds %>%
group_by(cut) %>%
summarize(mean_price = mean(price)) %>%
arrange(desc(mean_price))
On average, which is more expensive: a diamond that is Fair and IF (worst cut, best clarity), or a diamond that is Ideal and I1 (best cut, worst clarity)?

# your code here
# two steps
diamonds %>%
filter(cut == "Fair",
clarity == "IF") %>%
summarize(mean_price = mean(price, na.rm = TRUE))
diamonds %>%
filter(cut == "Ideal",
clarity == "I1") %>%
summarize(mean_price = mean(price, na.rm = TRUE))
# one step
diamonds %>%
group_by(cut, clarity) %>%
summarize(mean_price = mean(price, na.rm = TRUE)) %>%
filter(cut == "Fair" & clarity == "IF" |
cut == "Ideal" & clarity == "I1")
Calculate summary statistics of the carat variable for each color of diamond. Give your summary variables the names indicated in parentheses:

the mean (mean)
the standard deviation (sd)
the number of observations (n)
the standard error of the mean (sem)
the lower and upper bounds of the 95% CI (ci_lower and ci_upper)

***In your final summary output, only include color, mean, ci_lower and ci_upper.

# your code here
# group by color, CI based on t distribution
diamonds %>%
group_by(color) %>%
summarize(mean = mean(carat, na.rm = TRUE),
sd = sd(carat, na.rm = TRUE),
n = n(),
sem = sd/sqrt(n),
ci_lower = mean - sem * qt(p = .975, df = n-1),
ci_upper = mean + sem * qt(p = .975, df = n-1)) %>%
select(-n, -sd, -sem)
Now calculate the same summary statistics of the carat variable for each combination of color and cut. How many observations do you have in your summary data frame this time?

# your code here
diamonds %>%
group_by(color, cut) %>%
summarize(mean = mean(carat, na.rm = TRUE),
sd = sd(carat, na.rm = TRUE),
n = n(),
sem = sd/sqrt(n),
ci_lower = mean - sem * qt(p = .975, df = n-1),
ci_upper = mean + sem * qt(p = .975, df = n-1)) %>%
select(-n, -sd, -sem)