You can download the .Rmd file here.
The purpose of today’s lab is to introduce you to some simple tools that will allow you to calculate and visualize descriptive statistics in R. This will not be a deep dive into the theoretical meaning behind descriptives, but rather a guide to some practical steps toward getting the most basic summary information out of a given dataset. Along the way, we will implement some of the skills we learned in last week’s lab, such as creating variables, working with data frames, installing packages, and learning how to use new functions.
Today’s lab will cover:
Advantages of organizing R scripts/markdowns with projects:
Having a separate R environment for each data analysis project, so your analyses and variables won't interfere with each other.
Organizing scripts, inputs, and outputs with relative paths, so transferring the project folder won't affect file locations.
A directory refers to a file path (location). A working directory in R is the default file path where R reads and saves files. You can check the current working directory with getwd().
# Print my current working directory
getwd()
## [1] "/Users/BerniceCheung/Dropbox (University of Oregon)/Teaching/psy611"
Because I'm using a Mac, sub-folders are separated by /. If you use a PC, sub-folders are separated by \.
The working directory of an R project is where the .Rproj file is located.
You can access the sub-folders inside the working directory using a relative path: it starts with a ., representing the current location (the working directory), followed by /{name of the sub-folder} (Mac) or \{name of the sub-folder} (PC).
# examples of using relative file directory
df <- import("./labs/resources/lab2/world_happiness_2015.csv")
Or you can use the here() function to locate the directory of a file by adding the names of the sub-folders.
# examples of using here function
df <- import(here("labs", "resources", "lab2", "world_happiness_2015.csv"))
We’ll use the same 2015 World Happiness Report dataset from lecture throughout today’s lab.
Create an R project.
Download the data file here (right click the file -> Save As).
Move the lab-2.Rmd document and the world_happiness_2015.csv file to the appropriate directory within your R project.
Go ahead and load {rio}, {here}, and {psych}, as we are going to need functions from these packages throughout.
# load libraries (install the libraries if not previously installed)
if (!require(rio)) {
  install.packages('rio')
  library(rio)
}
if (!require(psych)) {
  install.packages('psych')
  library(psych)
}
if (!require(here)) {
  install.packages('here')
  library(here)
}
Import the dataset using the import() function from the rio package and save it to an object called world_happiness_df. You can use the import function along with the here function. Supplement: R script style guide
# load data
world_happiness_df <- import(here("labs", "resources", "lab2", "world_happiness_2015.csv"))
Use the str() function to see the structure of the dataset. This tells us we have a data frame with 136 observations of 8 different variables: Country, Happiness, GDP, Support, Life, Freedom, Generosity, and Corruption.
# check data structure
str(world_happiness_df)
## 'data.frame': 136 obs. of 8 variables:
## $ Country : chr "Albania" "Argentina" "Armenia" "Australia" ...
## $ Happiness : num 4.61 6.7 4.35 7.31 7.08 ...
## $ GDP : num 9.25 NA 8.97 10.68 10.69 ...
## $ Support : num 0.639 0.926 0.723 0.952 0.928 ...
## $ Life : num 68.4 67.3 65.3 72.6 70.8 ...
## $ Freedom : num 0.704 0.881 0.551 0.922 0.9 ...
## $ Generosity: num -0.0823 NA -0.1867 0.3157 0.0891 ...
## $ Corruption: num 0.885 0.851 0.901 0.357 0.557 ...
Things to check:
The number of observations and variables
The class of the variables
Untidy variable names (including spaces, capitalization inconsistency, etc.)
The presence of NA's (may not be obvious at this step)
You can also look at the first few rows of the data frame using head().
# check the first few rows of the dataset
head(world_happiness_df)
## Country Happiness GDP Support Life Freedom Generosity
## 1 Albania 4.606651 9.251464 0.6393561 68.43517 0.7038507 -0.08233768
## 2 Argentina 6.697131 NA 0.9264923 67.28722 0.8812237 NA
## 3 Armenia 4.348320 8.968936 0.7225510 65.30076 0.5510266 -0.18669653
## 4 Australia 7.309061 10.680326 0.9518616 72.56024 0.9218710 0.31570196
## 5 Austria 7.076447 10.691354 0.9281103 70.82256 0.9003052 0.08908856
## 6 Azerbaijan 5.146775 9.730904 0.7857028 61.97585 0.7642895 -0.22263514
## Corruption
## 1 0.8847930
## 2 0.8509062
## 3 0.9014622
## 4 0.3565544
## 5 0.5574796
## 6 0.6155525
Notice that there are some NA's, indicating that there is missing data. This will become important later.
Question: What does it indicate when the class of an interval variable is chr?
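As a small illustration (this toy vector is made up and is not part of the dataset), a single non-numeric entry is enough to make R store an interval variable as character:
# a made-up example: one non-numeric entry forces the whole vector to be character
x <- c("4.61", "6.70", "not reported")
class(x)
# converting with as.numeric() coerces the non-numeric entry to NA (with a warning)
as.numeric(x)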
Recall from lecture that a distribution often refers to a description of the (relative) number of times a given variable will take each of its unique values.
One common way of visualizing distributions is using a histogram, which plots the frequencies of different values for a given variable.
For example, let's take a look at a distribution of the Happiness variable. We do this using the hist() function. (Remember, you can check out the help documentation using ?hist).
Things to check:
x (inputs)
# create a histogram
hist(x = world_happiness_df$Happiness, main = "Histogram of Happiness Scores", xlab = "Happiness")
You can also change the number of bins (i.e. bars) in your histogram using the breaks argument. Try 5, 10, and 20 breaks. What do you notice as the number of breaks increases?
# create a histogram with a different number of bins
hist(x = world_happiness_df$Happiness, breaks = 20, main = "Histogram of Happiness Scores", xlab = "Happiness")
Question: What else do you want to adjust in the figure?
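For instance, two optional arguments you could experiment with (just a sketch, not something the lab requires) are col for bar color and xlim for the x-axis limits:
# a sketch of two optional tweaks: bar color and x-axis limits
hist(x = world_happiness_df$Happiness, breaks = 20,
     main = "Histogram of Happiness Scores", xlab = "Happiness",
     col = "lightblue", xlim = c(0, 10))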
We can also visualize distributions using a boxplot, which gives us different information. For a short guide on how to read boxplots, see here or refer to this section of your textbook.
# create a boxplot for the Happiness variable
boxplot(x = world_happiness_df$Happiness, main = "Boxplot of Happiness Scores")
Get a boxplot for the Corruption variable below. What do you notice?
# create a boxplot for the Corruption variable
boxplot(x = world_happiness_df$Corruption, main = "Boxplot of Corruption Scores")
You can use the boxplot.stats function to identify specific outliers. We're specifically interested in the $out component of the output.
# Identify the outliers of Corruption
boxplot.stats(x = world_happiness_df$Corruption)$out
## [1] 0.35655439 0.19101639 0.22336966 0.37539047 0.18588871 0.29881436
## [7] 0.09460447 0.09894388 0.23196414 0.20953351
Challenge: Can you create a new dataset without the outliers?
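One possible approach (a sketch; the object name world_happiness_no_outliers is just an example) is to keep only the rows whose Corruption value is not among the boxplot outliers:
# identify the Corruption outliers, then keep the rows that are not outliers (rows with NA on Corruption are also kept)
corruption_outliers <- boxplot.stats(world_happiness_df$Corruption)$out
world_happiness_no_outliers <- world_happiness_df[!(world_happiness_df$Corruption %in% corruption_outliers), ]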
So far we have been plotting in base R. However, the ggplot2 package is generally a much better tool for plotting. For now we'll stick with base plotting to keep things simple, but in a future class you will learn how to use ggplot to make better-looking plots, such as this:
Ok, so now that we know how to visualize a basic distribution, let’s think about how we commonly characterize distributions with descriptive statistics…
For a given set of observations, measures of central tendency allow us to get the “gist” of the data. They tell us about where the “average” or the “mid-point” of the data lies. Let’s take a really simple example first. We’ll use a vector of integers as our dataset.
simple_vector <- c(1, 1, 1, 2, 2, 2, 2, 2, 3, 5, 5, 6, 7)
A quick way to find the mean is to use the aptly named mean() function from base R. Use this function on the simple_vector we just created.
mean(simple_vector)
## [1] 3
What happens if we try to calculate the mean when there are missing values in our data? Let's see what happens when we add a couple NA's to our simple vector and calculate the mean.
# add NA's
simple_vector_missing <- c(1, 1, NA, 1, 2, 2, 2, 2, 2, 3, 5, NA, 5, 6, 7)
Try getting the mean again for this new vector, simple_vector_missing. What happens?
mean(simple_vector_missing)
## [1] NA
…It gives us an NA! The reason for this is that the mean is calculated by using every value for a given variable, so if you don't remove (or impute) the missing values before getting the mean, it won't work.
Use the mean function again, but add the additional argument: na.rm = TRUE.
mean(simple_vector_missing, na.rm = TRUE)
## [1] 3
The median is the middle value of a set of observations: 50% of the data points fall below the median, and 50% fall above.
To find the median, we can use the median() function. Use it on simple_vector.
median(simple_vector)
## [1] 2
Oddly enough, the core packages in R do not contain a function to find the mode. In the case of our very simple dataset, it is easy to determine that the mode (i.e. the most common value) is 2. Your first minihack exercise will be to find the mode using the R package that accompanies your textbook.
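If you're curious, here is one way to get the mode with base R alone (a sketch; the minihack below will use modeOf() from the {lsr} package instead):
# count how often each value occurs, then pull out the most frequent one
value_counts <- table(simple_vector)
as.numeric(names(value_counts)[which.max(value_counts)])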
The range gives us the distance between the smallest and largest value in a dataset. You can find the range using the range() function, which will output the minimum and maximum values. Find the range of the simple_vector.
range(simple_vector)
## [1] 1 7
To find the variance and standard deviation, we use var() and sd(), respectively. Find the variance and standard deviation of the simple_vector.
# variance
var(simple_vector)
## [1] 4.166667
# standard deviation
sd(simple_vector)
## [1] 2.041241
Note: As it pertains to missing data, the same rule described above for the mean() function also applies to these functions (namely, that you need to add the argument na.rm = TRUE when there is missing data).
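For example, a quick check with the vector that contains NA's (a sketch):
# without na.rm = TRUE these would both return NA
var(simple_vector_missing, na.rm = TRUE)
sd(simple_vector_missing, na.rm = TRUE)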
Outliers are extreme values in the data. They don’t follow the same general pattern as all the other data. One easy way to detect outliers is to visually inspect a distribution for values that stand out. For example, if we add an extreme value to our simple vector, it is easy to see that it is an outlier when we look at the boxplot.
# add extreme value (25)
simple_vector_outlier <- c(1, 1, 1, 2, 2, 2, 2, 2, 3, 5, 5, 6, 7, 25)
# visualize outlier with boxplot
boxplot(simple_vector_outlier)
# identify specific outliers
boxplot.stats(simple_vector_outlier)$out
## [1] 25
Another way to identify outliers is to standardize the scores in our dataset, i.e. to convert them to z-scores. To do this, we take each score’s distance from the mean and divide it by the standard deviation:
\[ z = \frac{x_i - M}{s} \]
This tells us, for each score, how many standard deviations it is away from the mean. Some people use the rule of thumb that if a score is above or below 3 standard deviations from the mean, it should be considered an outlier.
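To see the formula in action, you could also compute the z-scores "by hand" (a sketch that should match the scale() output below):
# subtract the mean from each score and divide by the standard deviation
(simple_vector_outlier - mean(simple_vector_outlier)) / sd(simple_vector_outlier)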
To standardize scores in our dataset, use the scale() function on the simple_vector_outlier.
# transform the vector 'simple_vector_outlier' into z-scores
scale(simple_vector_outlier)
## [,1]
## [1,] -0.57620491
## [2,] -0.57620491
## [3,] -0.57620491
## [4,] -0.41486753
## [5,] -0.41486753
## [6,] -0.41486753
## [7,] -0.41486753
## [8,] -0.41486753
## [9,] -0.25353016
## [10,] 0.06914459
## [11,] 0.06914459
## [12,] 0.23048196
## [13,] 0.39181934
## [14,] 3.29589207
## attr(,"scaled:center")
## [1] 4.571429
## attr(,"scaled:scale")
## [1] 6.198192
What do you notice about the output? Is there anything you want to change about the output?
# transform the output of the scale function into a vector
as.vector(scale(simple_vector_outlier))
## [1] -0.57620491 -0.57620491 -0.57620491 -0.41486753 -0.41486753 -0.41486753
## [7] -0.41486753 -0.41486753 -0.25353016 0.06914459 0.06914459 0.23048196
## [13] 0.39181934 3.29589207
Based on this output, how many standard deviations above the mean is our extreme value, 25? Should it be considered an outlier?
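One way to check this programmatically (a sketch, using the ±3 SD rule of thumb mentioned above):
# flag values whose z-score is more than 3 standard deviations from the mean
z_scores <- as.vector(scale(simple_vector_outlier))
simple_vector_outlier[abs(z_scores) > 3]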
So far we have been calculating various descriptive statistics (somewhat painstakingly) using an assortment of different functions. So what if we have a dataset with a bunch of variables we want descriptive statistics for? Surely we don’t want to calculate descriptives for each variable by hand…
Fortunately for us, there is a function called describe() from the {psych} package, which we can use to quickly summarize a whole set of variables in a dataset. Let's look back at our world_happiness dataset…
psych::describe()
This function automatically calculates all of the descriptives we reviewed above (and more!). Use the describe function from the psych package on the entire world_happiness dataset.
Notes: If you load a library up front, you can directly call any function from it. Alternatively, you can call a function with library_name::function_name without loading the entire library.
# generate descriptives of the dataset
psych::describe(world_happiness_df)
## vars n mean sd median trimmed mad min max range skew
## Country* 1 136 68.50 39.40 68.50 68.50 50.41 1.00 136.00 135.00 0.00
## Happiness 2 136 5.43 1.11 5.42 5.44 1.20 2.70 7.60 4.90 -0.07
## GDP 3 121 9.22 1.16 9.45 9.28 1.20 6.61 11.43 4.82 -0.43
## Support 4 135 0.80 0.12 0.83 0.81 0.12 0.43 0.99 0.55 -0.88
## Life 5 135 63.12 7.46 64.64 63.73 7.35 43.74 76.04 32.30 -0.67
## Freedom 6 132 0.75 0.13 0.78 0.76 0.16 0.40 0.98 0.58 -0.45
## Generosity 7 120 0.00 0.16 -0.03 -0.01 0.15 -0.28 0.46 0.74 0.59
## Corruption 8 125 0.73 0.20 0.81 0.77 0.12 0.09 0.96 0.87 -1.48
## kurtosis se
## Country* -1.23 3.38
## Happiness -0.69 0.10
## GDP -0.78 0.11
## Support 0.11 0.01
## Life -0.34 0.64
## Freedom -0.60 0.01
## Generosity -0.27 0.01
## Corruption 1.53 0.02
NOTE: Country is not numeric; it is a categorical variable of type character. By default, the describe() function forces non-numeric variables to be numeric and attempts to calculate descriptives for them. These variables are marked with an asterisk (*). In this case, it doesn't make sense to calculate descriptive statistics for Country, so we get a warning message and a bunch of NaN's and NA's for this variable.
A better approach would be to remove non-numeric variables before you attempt to run numerical calculations on your dataset. You will do just that in Minihack #4.
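As a preview (a sketch that assumes Country is the first column, as str() showed above), you could drop the character column before calling describe():
# run describe() on the numeric columns only
psych::describe(world_happiness_df[ , -1])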
If we want to focus on a particular variable, we can calculate descriptives for that variable only. Again, use the describe function from the psych package, but this time get descriptives just for the Happiness variable.
# generate descriptives of the Happiness variable
psych::describe(world_happiness_df$Happiness)
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 136 5.43 1.11 5.42 5.44 1.2 2.7 7.6 4.9 -0.07 -0.69 0.1
Note: psych::describe() also gives us skew and kurtosis. Skewness values greater than +1 or less than -1 are typically considered highly skewed. Kurtosis values near zero indicate normality.
Bivariate descriptives pertain to the relationship between two variables.
cor() function
To find the strength of the association between two variables, we can use the cor() function (from base R).
For example, in our world_happiness dataset, we might want to know the strength of the relationship between a country's Happiness and average level of social support, i.e. Support.
# calculate the correlation between Happiness & Support
cor(world_happiness_df$Happiness, world_happiness_df$Support)
## [1] NA
What’s the outcome? How can we correct that?
# calculate the correlation between Happiness & Support
cor(world_happiness_df$Happiness, world_happiness_df$Support, use = "pairwise.complete.obs")
## [1] 0.7258971
Not surprisingly, we see that there is a strong positive association between happiness and social support.
Note the use of use = "pairwise.complete.obs". Because there is missing data in these variables, this argument tells R to only use observations (i.e. rows in the data frame) for which there is no missing data for either variable.
If we want to see this relationship graphically, we can create a scatterplot.
# create a scatterplot
plot(world_happiness_df$Support, world_happiness_df$Happiness, xlab = "Social Support", ylab = "Happiness", main = "Relationship between social support and happiness \nat the country level")
pairs.panels() function
If we have a dataset with a reasonably small number of variables, we can use pairs.panels() (also from the {psych} package) to generate a matrix that contains scatterplots between all of the variables below the diagonal and Pearson correlations between all of the variables above the diagonal.
# create a bivariate panel
pairs.panels(world_happiness_df, lm=T, ellipses = F)
Again, note that we can't interpret the results for Country because it is a non-numeric character variable.
Lastly, if we want to find the covariance between two variables, we use the cov() function.
# calculate the covariance between Happiness & Support
cov(world_happiness_df$Happiness, world_happiness_df$Support, use = "pairwise.complete.obs")
NOTE: You can also generate a covariance matrix for a whole dataframe by passing a dataframe to cov() instead of only two specific variables. The same goes for generating a correlation matrix using cor().
Question: Can you generate a correlation matrix? What does the error message mean? How can you correct it?
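One possible fix (a sketch that assumes Country is the first column) is to drop the non-numeric column before passing the data frame to cov() or cor():
# covariance and correlation matrices for the numeric variables only
cov(world_happiness_df[ , -1], use = "pairwise.complete.obs")
cor(world_happiness_df[ , -1], use = "pairwise.complete.obs")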
We’ve seen how RMarkdown is a powerful way to create documents that combine code, output, and explanations/comments.
Rmarkdown text can be combined with code to embed the output of functions into one's written explanations. Here's an example:
To embed your R code within your text, use the following formatting:
## Write whatever you want, then `r #insert code here` then write some more.
Let’s try this out. Here is a vector of values. We will store the mean of the vector in an object called my_mean and the standard deviation in an object called my_sd. Use the formatting above to report the mean and standard deviation of the vector in a sentence after this code chunk.
my_vector <- c(5,6,14,11,2,1,3,2,8)
my_mean <- mean(my_vector)
my_sd <- sd(my_vector)
– write your sentence here –
The mean of my_vector is 5.7777778.
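For reference, a sentence like the one above can be written with inline code (a sketch; the round() calls are optional):
## The mean of my_vector is `r round(my_mean, 2)` and the standard deviation is `r round(my_sd, 2)`.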
Install and load the {lsr} package, the companion R package to your textbook, Learning Statistics with R. Check out the help documentation for the function modeOf().
#install.packages("lsr")
library(lsr)
?modeOf
?maxFreq
Use modeOf() to find the mode of simple_vector. Make sure that the answer you get is 2. Save your result to an object called mode.
simple_vector <- c(1, 1, 1, 2, 2, 2, 2, 2, 3, 5, 5, 6, 7)
my_mode <- modeOf(simple_vector)
Use maxFreq() to find the frequency of the modal value (i.e. how many times it occurs). Save your result to an object called modal_freq.
modal_freq <- maxFreq(simple_vector)
The mode of simple_vector is 2 and its frequency is 5.
Notes: Because mode is also a function name in R, using it as a variable name can sometimes cause trouble, e.g. Error: The inline value cannot be coerced to character: mode. Giving the variable another name, such as my_mode, will solve this issue.
Import the following dataset and save it to an object called corruption_std. This contains a reduced form of the World Happiness dataset with three variables:
- Country
- Corruption
- Corruption_scaled (a standardized, i.e. z-scored, version of the Corruption variable)
corruption_std <- import('https://github.com/uopsych/psy611/raw/master/labs/resources/lab2/corruption_standardized.csv')
Create a histogram of Corruption. What do you notice about the shape of this distribution? How would you describe the skewness?
hist(x = corruption_std$Corruption, main = "Histogram of Corruption Scores", xlab = "Corruption")
Corruption is negatively skewed.
Looking at corruption_std, and based on the z-scores (i.e. Corruption_scaled), which countries would be considered outliers on the Corruption variable?
# extract the indices of scaled corruption whose absolute value > 3
outlier_idx <- which(abs(corruption_std$Corruption_scaled) > 3)
# extract the country name using the outlier indices
corruption_std$Country[outlier_idx]
## [1] "Singapore" "Rwanda"
Create a new version of corruption_std, called corruption_std_clean, that excludes these outliers.
- Hint #1: you can use the minus sign (-) for de-selecting.
- Hint #2: pay attention to rows 124 and 125.
corruption_std_clean <- corruption_std[-outlier_idx,]
For this minihack, we're going to work with a slightly different version of the world_happiness data. The dataset can be imported from the following url: https://github.com/uopsych/psy611/raw/master/labs/resources/lab2/world_happiness_2015_sim.csv.
This version has two new variables:
- War represents whether or not a country was an active participant in WWII or was invested in by the US or Soviet Union just after WWII
- Service represents the proportion of the population that works in the service industry. Note: this variable has been simulated and is completely made up!
Import the data and save it to an object called world_happiness_sim.
world_happiness_sim <- import('https://github.com/uopsych/psy611/raw/master/labs/resources/lab2/world_happiness_2015_sim.csv')
Create a histogram of the Service variable. Set the number of bins to 30. What do you notice about the shape of this distribution?
hist(x = world_happiness_sim$Service, main = "Histogram of Service Scores", xlab = "Service", breaks = 30)
The Service variable has a bimodal distribution.
Check out the help documentation for psych::describeBy(). Use this function to get descriptive statistics for the Service variable for countries involved/not involved in WWII. (That is, use War as your grouping variable.) What are the means and standard deviations for the two groups?
(service_by_war <- psych::describeBy(world_happiness_sim$Service, group = world_happiness_sim$War))
##
## Descriptive statistics by group
## group: NotWWII
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 85 0.26 0.12 0.25 0.26 0.12 -0.04 0.59 0.63 0.07 -0.2 0.01
## ------------------------------------------------------------
## group: WWII
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 51 0.7 0.1 0.72 0.71 0.1 0.48 0.88 0.39 -0.21 -0.76 0.01
The mean and standard deviation for the countries that were involved in WWII are 0.7037614 and 0.100226, respectively.
The mean and standard deviation for the countries that weren't involved in WWII are 0.2625464 and 0.1211548, respectively.
What do you conclude about the Service variable?
The proportion of the population that works in the service industry had different yet overlapping distributions among countries that were involved in WWII and those that were not.
For this minihack, we're going to use the cor() function (from base R) to create a correlation matrix of the entire world_happiness dataset. Be sure to use the original world_happiness data frame without the added variables from Minihack #3.
First, remove the Country variable, because this variable is not numeric and will not yield meaningful correlations with the other variables in the dataset. Save the resulting dataframe to a new object called world_happiness_num.
# use the within function to remove columns by name
world_happiness_num <- within(world_happiness_df, rm(Country))
# or use the subset function to remove columns by names
world_happiness_num <- subset(world_happiness_df, select = -Country)
Use cor() to generate a correlation matrix for all the variables in world_happiness_num. Use listwise deletion by specifying use = "complete.obs".
(cor_mx <- cor(world_happiness_num, use = "complete.obs"))
## Happiness GDP Support Life Freedom
## Happiness 1.0000000 0.823217031 0.7317866 0.76599429 0.5085981
## GDP 0.8232170 1.000000000 0.7121414 0.84842896 0.3453470
## Support 0.7317866 0.712141406 1.0000000 0.60578186 0.3670727
## Life 0.7659943 0.848428965 0.6057819 1.00000000 0.3006421
## Freedom 0.5085981 0.345346974 0.3670727 0.30064207 1.0000000
## Generosity 0.2174625 0.009531499 0.1841681 0.04838897 0.3947917
## Corruption -0.4823416 -0.346664211 -0.2998693 -0.33739376 -0.5264837
## Generosity Corruption
## Happiness 0.217462530 -0.4823416
## GDP 0.009531499 -0.3466642
## Support 0.184168126 -0.2998693
## Life 0.048388971 -0.3373938
## Freedom 0.394791684 -0.5264837
## Generosity 1.000000000 -0.3960720
## Corruption -0.396072030 1.0000000
What is the correlation between Freedom and Happiness? Between Freedom and Corruption? Do these correlations make intuitive sense?
The correlation between Freedom and Happiness is 0.5085981, and the correlation between Freedom and Corruption is -0.5264837.
Calculate the correlation between GDP and happiness. Report the result using in-line R code.
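One possible approach (a sketch; the object name cor_gdp_happiness is just an example, and pairwise deletion is assumed because both variables contain NA's):
# store the correlation so it can be reported with in-line code, e.g. `r round(cor_gdp_happiness, 2)`
cor_gdp_happiness <- cor(world_happiness_df$GDP, world_happiness_df$Happiness, use = "pairwise.complete.obs")
cor_gdp_happiness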