Download this lab: lab-1_knit.docx
Note. If you are unable to open the link, a back-up copy of the knitted lab can be found on Canvas under Home > Labs.
The purpose of today’s lab is not to teach you everything there is to know about coding in R. It is not even to describe why the code in R works the way it does. Instead, we take an applied approach to learning R. We hope that giving you a functional understanding of R and suggesting some strategies for overcoming common coding obstacles will allow you to begin playing around with the language. We believe that the easiest way to learn R is by using R.
Today’s lab will cover:
After we have covered the content of the lab, we will move on to Minihacks. Minihacks are small coding projects intended to test your knowledge of the day’s material. The minihacks will be similar to, but narrower in focus than, the questions on your homework assignments. If you are able to successfully complete all of the minihacks, you should be well equipped to begin tackling your homework!
So what is R?
In the simplest possible terms, R is a programming language used for conducting analyses and producing graphics. It is substantially more flexible than GUI-based statistics programs (e.g., SPSS, LISREL) but less flexible than other programming languages. This lack of flexibility is on purpose; it allows the code to be written in a far more efficient and intuitive way than in other programming languages.
Only one piece of software is required to get started using the R programming language and, confusingly, it is also called R. I will refer to it here as the R Engine. The R Engine essentially allows the computer to understand the R programming language, turning your lines of text into computer operations. Unlike other popular statistics programs (e.g., SPSS, SAS), the R Engine is free. Instructions for downloading the R Engine are below.
A second piece of software that is not required to use R but is nonetheless useful is RStudio. RStudio is an integrated development environment (IDE) or, in potentially overly simplistic terms, a tool that makes interacting with the R Engine easier. Instructions for downloading RStudio are also below.
R 3.6.1. "Action of the Toes"
(all version nicknames are references to the Peanuts comic strip). I would click R-3.6.1.pkg
to start the download. Note. The same steps are used to update the R Engine: You install a new version and replace the old version in the process.
RStudio 1.2.1335 - macOS 10.12+ (64-bit)
. Note. To update RStudio after it is already installed, all you have to do is navigate to Help > Check for Updates
in the menubar.
As shown in the image below, an RStudio session is split into four sections called panes: the console, the source pane, the environment/history pane, and the succinctly named files/plots/packages/help pane.
In RStudio, the console is the access point to the underlying R Engine. It evaluates the code you provide it, including code run from the source pane. You can pass commands to the R Engine by typing them in after the >
.
The source pane shows you a collection of code called a script. In R, we primarily work with R Script
files (files ending in .R
) or R Markdown
documents (files ending in .Rmd
). In this class, we will mostly be working with R Markdown
files. The document you are currently reading was created with an R Markdown
document.
The environment/history pane shows, well, your environment and history. Specifically, if you have the “Environment” tab selected, you will see a list of all the variables that exist in your global environment. If you have the “History” tab selected, you will see previous commands that were passed to the R Engine.
The final pane, the files/plots/packages/help pane, includes a number of helpful tabs. The “Files” tab shows you the files in your current working directory, the “Plots” tab shows you a preview of any plots you have created, the “Packages” tab shows you a list of the packages currently installed on your computer, and the “Help” tab is where help documentation will appear. We will discuss packages and help documentation later in this lab.
As noted above, you will mostly be using R Markdown
documents in this course. In fact, it is required that your homeworks be created using an R Markdown
document. The following section will guide you through the process of creating an R Markdown
document.
Click on the blank piece of paper with the plus sign over it in the upper left-hand corner of RStudio.
Click on R Markdown...
.
lab_1
. Congratulations! You now have an R Markdown
document!
The content of R Markdown
documents can be split into two main types. I will call the first type simple text. Simple text will not be evaluated by the computer other than to be formatted according to markdown syntax. If you are answering a homework question or interpreting the results of an analysis, you will likely be using simple text.
Markdown syntax is used to format the simple text, such as italicizing words by enclosing them in asterisks (e.g., *this is italicized*
becomes this is italicized) or bolding words by enclosing them in double-asterisks (e.g., **this is bold**
becomes this is bold). For a quick rundown of what you can do with R Markdown formatting, I suggest you check out the Markdown section of the R Markdown Cheat Sheet.
In addition to simple text, R Markdown
documents support blocks (also called chunks) of R code. In contrast to simple text, the R code chunks are evaluated by the computer. The chunks are surrounded by ```{r}
and ```
. In the example image below, the 1 + 2
in the R Code chunk will be evaluated when the document is “knitted” (rendered). For your homeworks, you will want to include your analyses in these chunks.
In order to knit an R Markdown document, you can either use the shortcut command + shift + k
or click the button at the top of the R Markdown document that says Knit
. The computer will take several seconds (or, depending on the length of the R Markdown document, several minutes) to knit the document. Once the computer has finished knitting the document, a new document will appear in the same location that the R Markdown
document is saved. In this example, the new document will end with a .html
extension.
As shown in the above image, the simple text in the R Markdown
document on the left was rendered into formatted text in the knitted document on the right. The equation in the code chunk was also evaluated in the knitted document, returning the value 3
.
As mentioned above, you can pass commands to the R Engine via the console. R has arithmetic commands for doing basic math operations, including addition (+
), subtraction (-
), multiplication (*
), division (/
), and exponentiation (^
).
R will automatically follow the PEMDAS order of operations (BEDMAS if you are from Canada or New Zealand). Parentheses can be used to tell R what parts of the equation should be evaluated first. As shown below and as expected, (10 + 5) * 2
is not equivalent to 10 + 5 * 2
.
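As a quick sketch of the operators listed above, the lines below can be typed into the console one at a time. The first five numbers are arbitrary; the last two lines are the comparison from the text.

```r
2 + 3    # addition: 5
7 - 4    # subtraction: 3
6 * 3    # multiplication: 18
10 / 4   # division: 2.5
2 ^ 3    # exponentiation: 8

# Parentheses are evaluated first, so these two lines give different answers
(10 + 5) * 2   # 30
10 + 5 * 2     # 20
```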
You can create variables using the assignment operator (<-
). The value on the right of the assignment operator is saved to the name specified on the left of the assignment operator. I like to imagine that there is a box with a name on it and you are placing a value inside of the box. For example, if we wanted to place 10
into a variable called my_number
, we would write:
If we want to see what is stored in my_number
, we can simply type my_number
into the console and press enter
. We are essentially asking the computer, “What’s in the box with my_number
written on it?”
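The two steps just described (putting a value in the box, then looking inside it) might look like this in the console:

```r
# Place the value 10 into a variable called my_number
my_number <- 10

# Typing the name on its own prints what is stored in it
my_number
```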
If we want to overwrite my_number
with a new value, we simply assign a new value to my_number
.
Looking at my_number
again, we can see that it is now 20
.
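A minimal sketch of overwriting, assuming my_number starts out as 10 as in the earlier example:

```r
my_number <- 10   # the original value
my_number <- 20   # assigning again replaces the old value

my_number         # now 20
```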
We can treat variables just like we would the underlying values. For example, we can add 5
to my_number
by using +
.
Keep in mind, the above operation does not save the result of my_number + 5
to my_number
. To do that, we would have to assign the result of my_number + 5
to my_number
.
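The difference between computing a result and saving it can be sketched as follows, starting from a my_number of 20:

```r
my_number <- 20

my_number + 5   # prints 25, but nothing is saved
my_number       # still 20

# To keep the result, assign it back to my_number
my_number <- my_number + 5
my_number       # now 25
```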
If we want to remove a variable from our environment, we can use rm()
.
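A small sketch of removing a variable. The call to exists() is not covered in this lab; it is included here only as one way to confirm that the variable is gone.

```r
my_number <- 25
exists("my_number")   # TRUE: the variable is in the environment

rm(my_number)         # remove it
exists("my_number")   # FALSE: the variable no longer exists
```

After rm(), my_number should also disappear from the “Environment” tab in RStudio.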
In R, there are four basic types of data: (1) logical
values (also called booleans
), which can either be TRUE
or FALSE
, (2) integer
values, which can be any whole number (i.e., a number without digits after the decimal place), (3) double
values, which can be any number with digits before and after the decimal place, and (4) character
values (also called strings
), which are pieces of text enclosed in quotation marks ("
).
Type | Examples |
---|---|
Logical/Boolean | TRUE , FALSE |
Integer | 10L , -10L |
Double | 10.50 , -10.50 |
Character | "Hello" , "World" |
A collection of values is called a vector
. If they are all of the same type, we call them atomic vectors
. In R, we use the c()
command to concatenate (or combine) values into an atomic vector
.
Just as we did with the scalar
values above, we can assign a vector to a variable.
To print out the entire vector, we simply type my_vector
into the console.
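For illustration, a vector could be created and printed like this. The five values are arbitrary placeholders, not necessarily the ones used when this lab was written.

```r
# Combine five values into an atomic vector and store it in my_vector
my_vector <- c(2, 4, 6, 8, 10)

# Print the entire vector
my_vector
```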
In order to select just one value from the vector, we use square brackets ([]
). For example, if we wanted the third value from my_vector
we would type my_vector[3]
.
If we want to replace a specific value in a vector, we use the assignment operator (<-
) in conjunction with the square brackets ([]
).
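A sketch of selecting and replacing a value, assuming a my_vector with the placeholder values 2, 4, 6, 8, and 10:

```r
my_vector <- c(2, 4, 6, 8, 10)

# Select the third value
my_vector[3]        # 6

# Replace the third value with 60
my_vector[3] <- 60
my_vector           # 2 4 60 8 10
```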
As with single-value objects we can perform arithmetic operations on vectors, but the behaviour is not identical. If the vectors are the same length, each value from one vector will be paired with a corresponding value from the other vector. See below for an example of this in action.
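One way to see this pairing, using two made-up vectors of length three (the names vector_a and vector_b are placeholders):

```r
vector_a <- c(1, 2, 3)
vector_b <- c(10, 20, 30)

# Each value is paired with the value in the same position
vector_a + vector_b   # 11 22 33
vector_a * vector_b   # 10 40 90
```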
If the vectors are of different lengths, the shorter vector will be recycled (i.e., repeated) to be the same length as the longer vector.
This also works when the longer vector is not a multiple of the shorter vector, but you will get the warning: longer object length is not a multiple of shorter object length
.
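Recycling can be sketched with two placeholder vectors, long_vector and short_vector. The last line deliberately triggers the warning quoted above.

```r
long_vector  <- c(1, 2, 3, 4)
short_vector <- c(10, 20)

# short_vector is repeated to become 10, 20, 10, 20
long_vector + short_vector   # 11 22 13 24

# Length 3 is not a multiple of length 2: this still runs, but R warns
c(1, 2, 3) + short_vector    # 11 22 13
```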
Note for Python users. Indexing in R starts at 1 instead of 0. For instance, if you want to select the first element of a vector, you would write my_vector[1] instead of my_vector[0]. A second difference to keep in mind is that a negative index in R removes whichever value is in the spot indicated by the index value. Using vector[-2] on the vector c(10, 20, 30, 40, 50, 60) would return c(10, 30, 40, 50, 60) in R. In Python, it would return 50.
A vector that can accommodate more than one type of value (e.g., a double
AND a character
) is called a list
. To create a list
, we use list()
instead of c()
. If we wanted to create a vector with the values 5L
, 10
, "fifteen"
, and FALSE
we would use list(5L, 10, "fifteen", FALSE)
.
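The sketch below builds the list described above. The comparison with c() is an addition for illustration: it suggests why a list is needed, since c() would quietly convert every value to a single type.

```r
# A list keeps each value's original type
my_list <- list(5L, 10, "fifteen", FALSE)
my_list

# c() forces all four values to become character strings instead
my_strings <- c(5L, 10, "fifteen", FALSE)
my_strings   # "5" "10" "fifteen" "FALSE"
```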
Although lists
are an incredibly powerful type of data structure, dealing with them can be quite frustrating (especially for beginning coders). Since you are unlikely to need to know the inner workings of list
s for anything we will be doing in this course, I have chosen not to include much about them here. However, as you become a more advanced user, learning to leverage lists will allow you to write code that is far more efficient.
In R you will mostly be working with data frames
. A data frame
is technically a list of atomic vectors. For our purposes, we can think of a data frame
as a spreadsheet with columns of variables and rows of observations.
Let’s look at a data frame
that is automatically loaded when you open R, mtcars
. Type mtcars
to print out the data frame.
The data frame mtcars
has a row for each of the 32 cars featured in the 1974 Motor Trend magazine. There is a column for the car’s miles per gallon (mpg
), number of cylinders (cyl
), engine displacement (disp
), horsepower (hp
), rear axle ratio (drat
), weight in thousands of pounds (wt
), quarter-mile time (qsec
), engine shape (vs
), transmission type (am
), number of forward gears (gear
), and number of carburetors (carb
).
With data frames, you can extract a value by including [row, col]
immediately after the object. For example, if we wanted to extract the number of gears in the Datsun 710
we could use mtcars[3, 10]
to extract the value stored in the third row, tenth column.
Since the rows and columns have names, we can also be explicit and use the name of the row ("Datsun 710"
) and the name of the column ("gear"
) instead of the row and column indices.
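Both ways of extracting the number of gears in the Datsun 710 can be tried directly, since mtcars ships with R:

```r
# By position: third row, tenth column
mtcars[3, 10]

# By name: the same cell, but easier to read
mtcars["Datsun 710", "gear"]
```

Both lines should return the same value.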
We can also extract an entire column by dropping the index value for the row. Since you don’t specify a given row, the computer assumes you want all of the values in the column. For example, to extract all values stored in the gear column, we could use [, 10]
or [, "gear"]
.
To extract an entire row, we drop the column index. To extract all of the values associated with the Datsun 710
, we would drop the column index (e.g., [3, ]
or ["Datsun 710", ]
).
You can also extract columns using $
followed by the column name without quotes.
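A short sketch of the column and row shortcuts described above, again using mtcars:

```r
# An entire column: leave the row index empty
mtcars[, 10]
mtcars[, "gear"]

# An entire row: leave the column index empty
mtcars[3, ]
mtcars["Datsun 710", ]

# A column by name with the dollar sign
mtcars$gear
```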
If we want to extract multiple columns (or multiple rows) we use vectors. For example, if we wanted the number of gears and carburetors in the Datsun 710
and the Duster 360
we would use [c("Datsun 710", "Duster 360"), c("gear", "carb")]
or [c(3, 7), c(10:11)]
.
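The two versions of that call can be compared side by side. The variable names by_name and by_index are placeholders for this sketch.

```r
# Select two rows and two columns by name
by_name <- mtcars[c("Datsun 710", "Duster 360"), c("gear", "carb")]
by_name

# Select the same cells by position (10:11 is shorthand for c(10, 11))
by_index <- mtcars[c(3, 7), 10:11]
by_index
```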
Up to this point, we have been more-or-less directly telling R what we want it to do. This is great if we want to understand the processes that underlie R, but it can be incredibly time-consuming. Thankfully, we have functions. Functions are essentially pre-packaged snippets of code that take one or more pieces of input (called arguments
) and return one or more pieces of output (called values
). For example, length()
is a function that takes a vector as its sole argument and returns the length of the vector as its sole value.
The function unique()
also takes a vector as its primary argument, but—instead of returning the length of the vector as its value—it returns only the unique values of that vector.
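As a sketch, here are both functions applied to a made-up vector that contains repeated values:

```r
my_vector <- c(1, 2, 2, 3, 3, 3)

length(my_vector)   # 6: the number of values in the vector
unique(my_vector)   # 1 2 3: each distinct value once
```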
The mean()
function and sd()
function are two functions that you will end up using a lot. The former (mean()
) takes a numeric vector and returns the average of the vector.
The latter (sd()
) also takes a numeric vector, but it returns the standard deviation of the vector instead.
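For example, with a small made-up vector called scores:

```r
scores <- c(1, 2, 3, 4, 5)

mean(scores)   # 3
sd(scores)     # about 1.58
```

The same functions work on a data frame column, such as mean(mtcars$mpg).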
Although it is more conceptual, it is also useful to mention the typeof()
function here. The function typeof()
takes any object and tells you what type of variable it is.
Using the suite of as.*()
functions (e.g., as.numeric()
, as.character()
, as.logical()
, as.integer()
), we can likewise coerce objects to other types.
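A brief sketch of checking and converting types, using example values like those in the table of data types:

```r
# Checking the type of a value
typeof(TRUE)      # "logical"
typeof(10L)       # "integer"
typeof(10.5)      # "double"
typeof("Hello")   # "character"

# Coercing a value to another type
as.character(10.5)   # "10.5"
as.numeric("10.5")   # 10.5
as.integer(10.9)     # 10: the decimal part is dropped, not rounded
as.logical(0)        # FALSE
```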
Sometimes when working in R you will want to know more about a function. For example, you might want to know what arguments the function sd()
takes. You can use ?
before the name of any function (e.g., ?sd) to display the help documentation for that function.
From the help documentation we can see that sd()
takes two arguments: (1) An R object and (2) a logical value indicating whether NA
s (unknown values) should be removed before the standard deviation is calculated.
Typically R will infer, based on the order of the arguments, what values correspond to which arguments. For example, since sd()
expects that the argument x
will be provided first and the argument na.rm
will be provided second, the following works:
However, we can also explicitly tell R what values are associated with which arguments.
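Both styles of call can be sketched with a made-up vector, my_values, that contains one missing value:

```r
my_values <- c(2, 4, NA, 8)

# Arguments matched by position: x first, na.rm second
sd(my_values, TRUE)

# Arguments matched by name
sd(x = my_values, na.rm = TRUE)

# When arguments are named, their order no longer matters
sd(na.rm = TRUE, x = my_values)
```

All three calls should return the same standard deviation. Naming the arguments is more typing, but it tends to make the code easier to read later.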
The help documentation for a function often also includes an example of how to use the function and details on what the expected output will be.
A package
can include code, documentation for that code, and/or data. A helpful way to think of packages is as a toolbox full of data analysis tools.
There are general purpose toolboxes that contain tools for running common analyses in psychology (e.g., psych
), toolboxes for helping you run advanced statistical models (e.g., lavaan
; lme4
), toolboxes for text mining (e.g., tidytext
), and toolboxes for plotting (e.g., ggplot2
, gganimate
). If you have a problem that needs to be solved, there will probably be a package for it.
To install a package onto your computer, you simply pass the name of the package to install.packages()
. As a demonstration, we install the psych
package below. The psych
package has several useful data analysis tools for psychologists.
Note. When installing packages, the package name must be enclosed in quotes: install.packages("psych")
NOT install.packages(psych)
. You generally only need to install a package once.
Just because we’ve installed a package to our computer doesn’t mean we have access to its functions. Buying a toolbox doesn’t necessarily give you access to its tools. You also have to open the toolbox. To open psych
and load its functions, we use library()
.
Note. A package can be loaded with or without quotes: library("psych")
OR library(psych)
. We have to load a package every time that R is restarted.
Now that we have installed and loaded the psych
package, let’s try out some of its commands.
Using corr.test()
we can make a correlation matrix of the variables in mtcars
.
Using skew()
, we can look at the skew of all of the columns in mtcars
.
We can also use t2d()
to calculate the Cohen’s d for a t-value of 3.00 with 300 participants.
This is only a small subset of the functions available in the psych
package, and psych
is only one package of over 11,000 on CRAN (as of 2018). This is not to mention the tens of thousands of packages hosted on online repositories like GitHub. As Cory Costello noted during R Bootcamp, the question with R is never if but how.
The final topic that we will cover in this lab is how to load data into R. Over the course of your grad school careers (and many times in this class) you will need to import data into R to be analyzed.
For this example, we will be using the planets data set from Star Wars. The data can be downloaded here.
When I took this course, you would have to use file-type-specific functions to load data into R (e.g., read.csv
, read_excel
). The rio
package streamlines this process by having a single import function (import()
) that infers the file type from its extension (e.g., .csv
, .xlsx
, .sav
). As we did for psych
, we will first need to install rio
.
Second, we will need to load rio
.
Once this is done, importing the data is as easy as passing the location of the downloaded file to import
and saving the data into a variable (called planets_data
here). In this case, the sw_planets.xlsx
was saved to my downloads
folder. If it was in a folder called data_sets
on my desktop
, I would have used "~/desktop/data_sets/sw_planets.xlsx"
as the argument.
The tilde (~
) in the above string is a shortcut for the home directory on my computer (i.e., /Users/cameronkay
). Replacing the tilde with the path to the home directory should have the exact same result as using the tilde.
To ensure it was read in properly, we can look at the first six rows of the imported dataset by using the head()
function.
We can also look at the last six rows by using the tail()
function.
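Because planets_data only exists once the file has been downloaded and imported, the sketch below uses the built-in mtcars data frame as a stand-in. The same calls should work on any data frame.

```r
# First six rows
head(mtcars)

# Last six rows
tail(mtcars)

# Both functions take an optional n argument to change the number of rows
head(mtcars, n = 3)
```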
Now that we have covered the lab material, we will move on to the Minihacks. You are welcome to work with a partner or in a small group of 2-3 people. If you have any questions, I would be happy to answer them!
Create an R Markdown
document called lab1_minihacks
.
Complete minihack 2
through minihack 5
using a combination of simple text and code chunks.
Try rendering your R Markdown
document by clicking knit
. If it doesn’t render correctly, try to figure out why it didn’t.
Use R to solve for \(x\): \[x = \frac{(102 + 68) \times (3 + 2) + 1250}{50}\]
Assign the result to a variable called x
.
Assign the numbers 10
, 20
, and 30
to a vector called y
.
Before running any code, decide whether you think adding x
to y
would result in a single value or three separate values. Add x
to y
.
Assign the string "I AM NOT YELLING"
to a variable called exclamation
.
Use the function tolower()
to convert every letter of exclamation
to lower case. Assign the result to exclamation
.
Use the capitalize()
function from the Hmisc
package to capitalize the first letter of exclamation
.
I am trying to create a vector of 5
values between 10
and 50
using seq()
, but the code I wrote is creating a vector of 9
values between 10
and 50
. I believe it has something to do with the arguments I used, but I can’t remember how to access the help documentation to check. Without changing the values (i.e., 10
, 50
, and 5
), can you fix my code?
seq(from = 10, to = 50, by = 5)
## [1] 10 15 20 25 30 35 40 45 50
Download the Marvel character dataset to your computer.
Import the data into R and assign it to a variable called marvel_data
.
Ah! The value for the number of appearances of Spider-Man seems to be an error! It should be 4043
not 40430
! Use square brackets ([]
) to replace the erroneous value with the correct value (hint: The value is stored in the first row of the eighth column).
Using mean()
and dollar sign notation (data$column
), calculate the average number of appearances for all of the Marvel characters. Assign the result to a variable called mean_appearances
.
Install and load the package ggplot2
.
If you successfully completed the preceding steps, you should be able to run the following code without producing an error. If you get an error, try to figure out why you are receiving the error.
ggplot(marvel_data, aes(x = reorder(align, -appearances),
                        y = appearances,
                        fill = align)) +
  # Bars show the mean number of appearances for each alignment
  # (fun replaces fun.y, which was deprecated in ggplot2 3.3.0)
  geom_bar(stat = "summary", fun = "mean") +
  # One jittered point per character
  geom_point(shape = 21,
             alpha = .7,
             position = position_jitter(width = 0.4, height = 0)) +
  # Dashed line at the overall mean calculated in the previous step
  geom_hline(yintercept = mean_appearances,
             linetype = "twodash",
             lwd = 1,
             colour = "firebrick") +
  annotate(geom = "text",
           x = 3,
           y = 800,
           size = 5,
           label = paste("Mean = ", round(mean_appearances, 2)),
           colour = "firebrick") +
  scale_fill_viridis_d() +
  theme_bw(base_size = 15) +
  theme(legend.position = "none") +
  labs(title = "Alignment and Appearances",
       subtitle = "Marvel character appearances by alignment",
       x = "Alignment",
       y = "Appearances")
Comments
Comments are pieces of text in your code that are not interpreted by the computer. In R we use the octothorpe/pound sign/hashtag (#) at the beginning of a line to denote a comment. The first and third line of code below are not evaluated, whereas the second and fourth line are.
Comments are mostly used to remind yourself (or other people) what a piece of code does and why the code is written the way that it is. Below is a piece of code that checks if a string is a valid phone number. We can see that the comments explain, not only what each piece of code is doing, but also why the second piece of code was written the way that it was.
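The code referred to above did not survive in this copy of the lab, so the two examples below are stand-in sketches rather than the original code. The phone number, the variable names, and the 10-digit rule are all assumptions made for illustration.

```r
# This line is a comment, so it is not evaluated
1 + 2
# Neither is this one
3 * 4

# A hypothetical phone number to check
phone_number <- "(541) 555-0123"

# Remove everything that is not a digit. This is done first so that
# "(541) 555-0123", "541-555-0123", and "541.555.0123" are all treated
# the same way, instead of writing a separate check for each format.
digits_only <- gsub("[^0-9]", "", phone_number)

# A valid number (under the assumption above) has exactly 10 digits
is_valid <- nchar(digits_only) == 10
is_valid
```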