You can download the rmd file here.

Purpose

The purpose of today’s lab is to introduce you to some simple tools that will allow you to calculate and visualize descriptive statistics in R. This will not be a deep dive into the theoretical meaning behind descriptives, but rather a guide to some practical steps toward getting the most basic summary information out of a given dataset. Along the way, we will implement some of the skills we learned in last week’s lab, such as creating variables, working with data frames, installing packages, and learning how to use new functions.

Today’s lab will cover:

  1. Visualizing Distributions
  2. Basic Descriptives
  3. Summarizing a dataset
  4. Bivariate Descriptives
  5. In-Line R Code

Recap: R Project

Advantages of organizing R scripts/markdowns with projects:

  1. Having a separate R environment for each data analysis project, so your analyses and variables won’t interfere with each other.

  2. Organizing scripts, inputs, and outputs with relative paths, so transferring the project folder won’t break file locations.

Directory: Where’s my file?

A directory refers to a file path (location). The working directory in R is the default file path where R reads and saves files. You can check the current working directory with getwd().

# Print my current working directory
getwd()
## [1] "/Users/BerniceCheung/Dropbox (University of Oregon)/Teaching/psy611"

Because I’m using a Mac, sub-folders are separated by /. If you use a PC, sub-folders are separated by \.

The working directory of an R project is wherever the .Rproj file is located.

[Figure: the current working directory]

You can access the sub-folders inside the working directory using a relative path: it starts with a ., representing the current location (the working directory), followed by /{name of the sub-folder} (Mac) or \{name of the sub-folder} (PC).

# examples of using relative file directory
df <- import("./labs/resources/lab2/world_happiness_2015.csv")

Or you can use the here() function to build the path to a file by supplying the names of the sub-folders:

# examples of using here function
df <- import(here("labs", "resources", "lab2", "world_happiness_2015.csv"))

Read more about R project & working directories

Getting Started

We’ll use the same 2015 World Happiness Report dataset from lecture throughout today’s lab.

  1. Create an R project.

  2. Download the data file here (right-click the file -> Save As).

  3. Move the lab-2.Rmd document and the world_happiness_2015.csv file to the appropriate directory within your R project.

  4. Go ahead and load {rio}, {here}, and {psych}, as we are going to need functions from these packages throughout.

# load libraries (install the libraries if not previously installed)

if (!require(rio)) {
  install.packages('rio')
  library(rio)
}

if (!require(psych)) {
  install.packages('psych')
  library(psych)
}

if (!require(here)) {
  install.packages('here')
  library(here)
}
  5. Import the World Happiness data using the rio package and save it to an object called world_happiness_df. You can use the import function along with the here function. Supplement: R script style guide
# load data
world_happiness_df <- import(here("labs", "resources", "lab2", "world_happiness_2015.csv"))

Use the str() function to see the structure of the dataset. This tells us we have a data frame with 136 observations of 8 different variables: Country, Happiness, GDP, Support, Life, Freedom, Generosity and Corruption.

# check data structure
str(world_happiness_df)
## 'data.frame':    136 obs. of  8 variables:
##  $ Country   : chr  "Albania" "Argentina" "Armenia" "Australia" ...
##  $ Happiness : num  4.61 6.7 4.35 7.31 7.08 ...
##  $ GDP       : num  9.25 NA 8.97 10.68 10.69 ...
##  $ Support   : num  0.639 0.926 0.723 0.952 0.928 ...
##  $ Life      : num  68.4 67.3 65.3 72.6 70.8 ...
##  $ Freedom   : num  0.704 0.881 0.551 0.922 0.9 ...
##  $ Generosity: num  -0.0823 NA -0.1867 0.3157 0.0891 ...
##  $ Corruption: num  0.885 0.851 0.901 0.357 0.557 ...

Things to check:

  1. The number of observations and variables

  2. The class of the variables

  3. Untidy variable names (including space, capitalization inconsistency, etc.)

  4. The presence of NA’s (may not be obvious at this step)

You can also look at the first few rows of the data frame using head().

# check the first few rows of the dataset
head(world_happiness_df)
##      Country Happiness       GDP   Support     Life   Freedom  Generosity
## 1    Albania  4.606651  9.251464 0.6393561 68.43517 0.7038507 -0.08233768
## 2  Argentina  6.697131        NA 0.9264923 67.28722 0.8812237          NA
## 3    Armenia  4.348320  8.968936 0.7225510 65.30076 0.5510266 -0.18669653
## 4  Australia  7.309061 10.680326 0.9518616 72.56024 0.9218710  0.31570196
## 5    Austria  7.076447 10.691354 0.9281103 70.82256 0.9003052  0.08908856
## 6 Azerbaijan  5.146775  9.730904 0.7857028 61.97585 0.7642895 -0.22263514
##   Corruption
## 1  0.8847930
## 2  0.8509062
## 3  0.9014622
## 4  0.3565544
## 5  0.5574796
## 6  0.6155525

Notice that there are some NA’s, indicating that there is missing data. This will become important later.


Question: What does it indicate when the class of an interval variable is chr?
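For intuition, here is a hypothetical example (not from the World Happiness data; the object name messy_scores is just illustrative): a single non-numeric entry in a column is enough to make R import the whole column as character, and coercing it back to numeric turns that entry into NA.

# hypothetical example: one malformed entry forces the whole vector to character
messy_scores <- c("4.61", "6.70", "unknown", "7.31")
class(messy_scores)

# coercing to numeric converts the malformed entry to NA (with a warning)
as.numeric(messy_scores)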


Visualizing Distributions

Recall from lecture that a distribution often refers to a description of the (relative) number of times a given variable will take each of its unique values.

Histogram

One common way of visualizing distributions is using a histogram, which plots the frequencies of different values for a given variable.

For example, let’s take a look at a distribution of the Happiness variable. We do this using the hist() function. (Remember, you can check out the help documentation using ?hist).

Things to check:

  1. Usage:
  • Required syntax and parameters
  • The default values for parameters
  2. Arguments:
  • The class of x (inputs)
  • Options for parameters
  3. Value:
  • The class of the outputs
  • The components of outputs (for output extraction)
  4. In a hurry? Read the examples
# create a histogram 
hist(x = world_happiness_df$Happiness, main = "Histogram of Happiness Scores", xlab = "Happiness")

You can also change the number of bins (i.e. bars) in your histogram using the breaks argument. Try 5, 10, and 20 breaks. What do you notice as the number of breaks increases?

# create a histogram with a different number of bins
hist(x = world_happiness_df$Happiness, breaks = 20, main = "Histogram of Happiness Scores", xlab = "Happiness")


Question: What else do you want to adjust in the figure?
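For example (a sketch of a few optional hist() arguments; the exact choices are up to you), you might adjust the bar color, the x-axis limits, and the y-axis label:

# one possibility: adjust the color, x-axis limits, and y-axis label
hist(x = world_happiness_df$Happiness,
     breaks = 20,
     col = "lightblue",
     xlim = c(0, 10),
     main = "Histogram of Happiness Scores",
     xlab = "Happiness",
     ylab = "Number of countries")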


Box Plot

We can also visualize distributions using a boxplot, which gives us different information. For a short guide on how to read boxplots, see here or refer to this section of your textbook.

# create a boxplot for the Happiness variable
boxplot(x = world_happiness_df$Happiness, main = "Boxplot of Happiness Scores")

Get a boxplot for the Corruption variable below. What do you notice?

# create a boxplot for the Corruption variable
boxplot(x = world_happiness_df$Corruption, main = "Boxplot of Corruption Scores")

You can use the boxplot.stats function to identify specific outliers. We’re specifically interested in the $out component of the output.

# Identify the outliers of Corruption 
boxplot.stats(x = world_happiness_df$Corruption)$out
##  [1] 0.35655439 0.19101639 0.22336966 0.37539047 0.18588871 0.29881436
##  [7] 0.09460447 0.09894388 0.23196414 0.20953351

Challenge: Can you create a new dataset without the outliers?
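One possible approach (a sketch; the object names are just illustrative) is to match the outlier values back to their rows with %in% and then drop those rows:

# find which rows of Corruption were flagged as outliers, then drop those rows
corruption_outliers <- boxplot.stats(world_happiness_df$Corruption)$out
outlier_rows <- which(world_happiness_df$Corruption %in% corruption_outliers)
world_happiness_no_outliers <- world_happiness_df[-outlier_rows, ]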


Looking ahead…

So far we have been plotting in base R. However, the ggplot2 package is generally a much better tool for plotting. For now we’ll stick with base plotting to keep things simple, but in a future class you will learn how to use ggplot to make better-looking plots, such as this:

Ok, so now that we know how to visualize a basic distribution, let’s think about how we commonly characterize distributions with descriptive statistics…


Basic Descriptives

Measures of Central Tendency

For a given set of observations, measures of central tendency allow us to get the “gist” of the data. They tell us about where the “average” or the “mid-point” of the data lies. Let’s take a really simple example first. We’ll use a vector of integers as our dataset.

simple_vector <- c(1, 1, 1, 2, 2, 2, 2, 2, 3, 5, 5, 6, 7)

Mean

A quick way to find the mean is to use the aptly named mean() function from base R. Use this function on the simple_vector we just created.

mean(simple_vector)
## [1] 3

What happens if we try to calculate the mean when there are missing values in our data? Let’s see what happens when we add a couple NA’s to our simple vector and calculate the mean.

# add NA's
simple_vector_missing <- c(1, 1, NA, 1, 2, 2, 2, 2, 2, 3, 5, NA, 5, 6, 7)

Try getting the mean again for this new vector, simple_vector_missing. What happens?

mean(simple_vector_missing)
## [1] NA

…It gives us an NA! The reason for this is that the mean is calculated by using every value for a given variable, so if you don’t remove (or impute) the missing values before getting the mean, it won’t work.

Use the mean function again, but add the additional argument: na.rm = TRUE.

mean(simple_vector_missing, na.rm = TRUE)
## [1] 3

Median

The median is the middle value of a set of observations: 50% of the data points fall below the median, and 50% fall above.

To find the median, we can use the median() function. Use it on simple_vector.

median(simple_vector)
## [1] 2

Mode

Oddly enough, the core packages in R do not contain a function to find the mode. In the case of our very simple dataset, it is easy to determine that the mode (i.e. the most common value) is 2. Your first minihack exercise will be to find the mode using the R package that accompanies your textbook.
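As an aside, a mode can also be computed by hand in base R by tabulating the values and picking the most frequent one (a sketch, not the approach used in the minihack; value_counts is just an illustrative name):

# tabulate each unique value, then pull out the value with the highest count
value_counts <- table(simple_vector)
as.numeric(names(value_counts)[which.max(value_counts)])  # returns 2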

Measures of Variability

Range

The range gives us the distance between the smallest and largest value in a dataset. You can find the range using the range() function, which will output the minimum and maximum values. Find the range of the simple_vector.

range(simple_vector)
## [1] 1 7

Variance and standard deviation

To find the variance and standard deviation, we use var() and sd(), respectively. Find the variance and standard deviation of the simple_vector.

# variance
var(simple_vector)
## [1] 4.166667
# standard deviation
sd(simple_vector)
## [1] 2.041241

Note: As it pertains to missing data, the same rule described above for the mean() function also applies to these functions (namely, that you need to add the argument na.rm = TRUE when there is missing data).
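For example, using the vector with missing values from above:

# var() and sd() also return NA unless the missing values are removed
var(simple_vector_missing, na.rm = TRUE)
sd(simple_vector_missing, na.rm = TRUE)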

Outliers

Outliers are extreme values in the data. They don’t follow the same general pattern as all the other data. One easy way to detect outliers is to visually inspect a distribution for values that stand out. For example, if we add an extreme value to our simple vector, it is easy to see that it is an outlier when we look at the boxplot.

# add extreme value (25)
simple_vector_outlier <- c(1, 1, 1, 2, 2, 2, 2, 2, 3, 5, 5, 6, 7, 25)

# visualize outlier with boxplot 
boxplot(simple_vector_outlier)

# identify specific outliers
boxplot.stats(simple_vector_outlier)$out
## [1] 25

z-scores

Another way to identify outliers is to standardize the scores in our dataset, i.e. to convert them to z-scores. To do this, we take each score’s distance from the mean and divide it by the standard deviation:

\[ z = \frac{x_i - M}{s} \]

This tells us, for each score, how many standard deviations it is away from the mean. Some people use the rule of thumb that if a score is above or below 3 standard deviations from the mean, it should be considered an outlier.

To standardize scores in our dataset, use the scale() function on the simple_vector_outlier.

# transform the vector 'simple_vector_outlier' into z-scores
scale(simple_vector_outlier)
##              [,1]
##  [1,] -0.57620491
##  [2,] -0.57620491
##  [3,] -0.57620491
##  [4,] -0.41486753
##  [5,] -0.41486753
##  [6,] -0.41486753
##  [7,] -0.41486753
##  [8,] -0.41486753
##  [9,] -0.25353016
## [10,]  0.06914459
## [11,]  0.06914459
## [12,]  0.23048196
## [13,]  0.39181934
## [14,]  3.29589207
## attr(,"scaled:center")
## [1] 4.571429
## attr(,"scaled:scale")
## [1] 6.198192

What do you notice about the output? Is there anything you want to change about the output?

# transform the output of the scale function into a vector
as.vector(scale(simple_vector_outlier))
##  [1] -0.57620491 -0.57620491 -0.57620491 -0.41486753 -0.41486753 -0.41486753
##  [7] -0.41486753 -0.41486753 -0.25353016  0.06914459  0.06914459  0.23048196
## [13]  0.39181934  3.29589207

Based on this output, how many standard deviations above the mean is our extreme value, 25? Should it be considered an outlier?
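To check this against the 3-SD rule of thumb programmatically, one option (a sketch; z_scores is just an illustrative name) is:

# store the z-scores as a plain vector, then flag values more than 3 SDs from the mean
z_scores <- as.vector(scale(simple_vector_outlier))
simple_vector_outlier[abs(z_scores) > 3]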


Summarizing data

So far we have been calculating various descriptive statistics (somewhat painstakingly) using an assortment of different functions. So what if we have a dataset with a bunch of variables we want descriptive statistics for? Surely we don’t want to calculate descriptives for each variable by hand…

Fortunately for us, there is a function called describe() from the {psych} package, which we can use to quickly summarize a whole set of variables in a dataset.

Let’s look back at our world_happiness dataset…

psych::describe()

This function automatically calculates all of the descriptives we reviewed above (and more!). Use the describe function from the psych package on the entire world_happiness dataset.

Note: If you load a library up front, you can call any of its functions directly. Alternatively, you can call a single function with library_name::function_name without loading the entire library.

# generate descriptives of the dataset
psych::describe(world_happiness_df)
##            vars   n  mean    sd median trimmed   mad   min    max  range  skew
## Country*      1 136 68.50 39.40  68.50   68.50 50.41  1.00 136.00 135.00  0.00
## Happiness     2 136  5.43  1.11   5.42    5.44  1.20  2.70   7.60   4.90 -0.07
## GDP           3 121  9.22  1.16   9.45    9.28  1.20  6.61  11.43   4.82 -0.43
## Support       4 135  0.80  0.12   0.83    0.81  0.12  0.43   0.99   0.55 -0.88
## Life          5 135 63.12  7.46  64.64   63.73  7.35 43.74  76.04  32.30 -0.67
## Freedom       6 132  0.75  0.13   0.78    0.76  0.16  0.40   0.98   0.58 -0.45
## Generosity    7 120  0.00  0.16  -0.03   -0.01  0.15 -0.28   0.46   0.74  0.59
## Corruption    8 125  0.73  0.20   0.81    0.77  0.12  0.09   0.96   0.87 -1.48
##            kurtosis   se
## Country*      -1.23 3.38
## Happiness     -0.69 0.10
## GDP           -0.78 0.11
## Support        0.11 0.01
## Life          -0.34 0.64
## Freedom       -0.60 0.01
## Generosity    -0.27 0.01
## Corruption     1.53 0.02

NOTE: Country is not numeric; it is a categorical variable of type character. By default, the describe() function forces non-numeric variables to be numeric and attempts to calculate descriptives for them. These variables are marked with an asterisk (*). In this case, it doesn’t make sense to calculate descriptive statistics for Country, so the values reported in that row are meaningless (and, depending on your data, you may also get a warning message and a bunch of NaN’s and NA’s).

A better approach would be to remove non-numeric variables before you attempt to run numerical calculations on your dataset. You will do just that in Minihack #4.

If we want to focus on a particular variable, we can calculate descriptives for that variable only. Again, use the describe function from the psych package, but this time get descriptives just for the Happiness variable.

# generate descriptives of the Happiness variable 
psych::describe(world_happiness_df$Happiness)
##    vars   n mean   sd median trimmed mad min max range  skew kurtosis  se
## X1    1 136 5.43 1.11   5.42    5.44 1.2 2.7 7.6   4.9 -0.07    -0.69 0.1

Note: psych::describe() also gives us skew and kurtosis. Skewness values greater than +1 or less than -1 are typically considered highly skewed. Kurtosis values near zero indicate normality.


Bivariate Descriptives

Bivariate descriptives pertain to the relationship between two variables.

Correlation

cor() function

To find the strength of the association between two variables, we can use the cor() function, which comes with base R.

For example, in our world_happiness dataset, we might want to know the strength of the relationship between a country’s Happiness and average level of social support, i.e. Support.

# calculate the correlation between Happiness & Support
cor(world_happiness_df$Happiness, world_happiness_df$Support)
## [1] NA

What’s the outcome? How can we correct that?

# calculate the correlation between Happiness & Support
cor(world_happiness_df$Happiness, world_happiness_df$Support, use = "pairwise.complete.obs")
## [1] 0.7258971

Not surprisingly, we see that there is a strong positive association between happiness and social support.

Note the use of use = "pairwise.complete.obs". Because there is missing data in these variables, this argument tells R to use only observations (i.e. rows in the data frame) that have no missing data on either variable.

Scatterplots

If we want to see this relationship graphically, we can create a scatterplot.

# create a scatterplot 
plot(world_happiness_df$Support, world_happiness_df$Happiness, xlab = "Social Support", ylab = "Happiness", main = "Relationship between social support and happiness \nat the country level")

pairs.panels() function

If we have a dataset with a reasonably small number of variables, we can use pairs.panels() (also from the {psych} package) to generate a matrix that contains scatterplots between all of the variables below the diagonal and Pearson correlations between all of the variables above the diagonal.

# create a bivariate panel 
pairs.panels(world_happiness_df, lm = TRUE, ellipses = FALSE)

Again, note that we can’t interpret the results for Country because it is a non-numeric character variable.

Covariance

Lastly, if we want to find the covariance between two variables, we use the cov() function.

# calculate the covariance between Happiness & Support
cov(world_happiness$Happiness, world_happiness_df$Support, use = "pairwise.complete.obs")
## Error in is.data.frame(x): object 'world_happiness' not found

NOTE: You can also generate a covariance matrix for a whole dataframe by passing a dataframe to cov() instead of only two specific variables. The same goes for generating a correlation matrix using cor().


Question: Can you generate a correlation matrix? What does the error message mean? How can you correct it?
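One way to think it through (a sketch): the error says the object world_happiness doesn’t exist because our data frame is called world_happiness_df. Using the correct name fixes the covariance call, and dropping the non-numeric Country column lets us pass the whole data frame to cor():

# corrected covariance call (the object is world_happiness_df, not world_happiness)
cov(world_happiness_df$Happiness, world_happiness_df$Support, use = "pairwise.complete.obs")

# correlation matrix for the numeric columns only (Country, column 1, is dropped)
cor(world_happiness_df[, -1], use = "pairwise.complete.obs")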


In-Line R code

We’ve seen how RMarkdown is a powerful way to create documents that combine code, output, and explanations/comments.

Rmarkdown text can be combined with code to embed the output of functions into one’s written explanations. Here’s an example:

To embed your r code within your text, use the following formatting:

## Write whatever you want, then `r #insert code here` then write some more.

Let’s try this out. Here is a vector of values. We will store the mean of the vector in an object called my_mean and the standard deviation in an object called my_sd. Use the formatting above to report the mean and standard deviation of the vector in a sentence after this code chunk.

my_vector <- c(5,6,14,11,2,1,3,2,8)
my_mean <- mean(my_vector)
my_sd <- sd(my_vector)

– write your sentence here –

The mean of my_vector is 5.7777778.
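For reference, the .Rmd source for a sentence like that might look something like this (using round() to keep the decimals manageable):

## The mean of my_vector is `r round(my_mean, 2)` and its standard deviation is `r round(my_sd, 2)`.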


Minihacks

Minihack 1: Find the mode

  1. Install and load the {lsr} package, the companion R package to your textbook, Learning Statistics with R.

  2. Check out the help documentation for the function modeOf().

#install.packages("lsr")
library(lsr)

?modeOf
?maxFreq
  3. Use modeOf() to find the mode of simple_vector. Make sure that the answer you get is 2. Save your result to an object called mode.
simple_vector <- c(1, 1, 1, 2, 2, 2, 2, 2, 3, 5, 5, 6, 7)

my_mode <- modeOf(simple_vector)
  4. Use the function maxFreq() to find the frequency of the modal value (i.e. how many times it occurs). Save your result to an object called modal_freq.
modal_freq <- maxFreq(simple_vector)
  5. Report the mode and frequency of the mode using r code embedded within your text.

The mode of simple_vector is 2 and its frequency is 5.

Note: Because mode is also the name of a function in R, using it as a variable name can sometimes cause trouble, e.g. Error: The inline value cannot be coerced to character: mode. Giving the variable another name, such as my_mode, solves this issue.


Minihack 2: Outliers

  1. Import the corruption_standardized.csv dataset using the import function from the rio package and this url: https://github.com/uopsych/psy611/raw/master/labs/resources/lab2/corruption_standardized.csv. Save it as an object called corruption_std. This contains a reduced form of the World Happiness dataset with three variables:

- Country
- Corruption
- Corruption_scaled (a standardized, i.e. z-scored, version of the Corruption variable)

corruption_std <- import('https://github.com/uopsych/psy611/raw/master/labs/resources/lab2/corruption_standardized.csv')
  2. Create a histogram of Corruption. What do you notice about the shape of this distribution? How would you describe the skewness?
hist(x = corruption_std$Corruption, main = "Histogram of Corruption Scores", xlab = "Corruption")

Corruption is negatively skewed.

  3. View corruption_std. Based on the z-scores (i.e. Corruption_scaled), which countries would be considered outliers on the Corruption variable?
# extract the indices of scaled corruption whose absolute value > 3
outlier_idx <- which(abs(corruption_std$Corruption_scaled) > 3)

# extract the country name using the outlier indices
corruption_std$Country[outlier_idx]
## [1] "Singapore" "Rwanda"
  4. After identifying which countries are outliers, use what you learned in last week’s lab about indexing data frames to remove these rows from the data frame.

- Hint #1: you can use the minus sign (-) for de-selecting.
- Hint #2: pay attention to rows 124 and 125

corruption_std_clean <- corruption_std[-outlier_idx,]

Minihack 3: Descriptive plots

For this minihack, we’re going to work with a slightly different version of the world_happiness data. The dataset can be imported from the following url: https://github.com/uopsych/psy611/raw/master/labs/resources/lab2/world_happiness_2015_sim.csv. This version has two new variables:

- War represents whether or not a country was an active participant in WWII or was invested in by the US or Soviet Union just after WWII

- Service represents the proportion of the population that works in the service industry. Note: this variable has been simulated and is completely made up!

  1. Import the new version of the data and save it to an object called world_happiness_sim.
world_happiness_sim <- import('https://github.com/uopsych/psy611/raw/master/labs/resources/lab2/world_happiness_2015_sim.csv')
  2. Create a histogram of the Service variable. Set the number of bins to 30. What do you notice about the shape of this distribution?
hist(x = world_happiness_sim$Service, main = "Histogram of Service Scores", xlab = "Service", breaks = 30)

The Service variable has a bimodal distribution.

  3. Look up the help documentation for psych::describeBy(). Use this function to get descriptive statistics for the Service variable for countries involved/not involved in WWII. (That is, use War as your grouping variable.) What are the means and standard deviations for the two groups?
(service_by_war <- psych::describeBy(world_happiness_sim$Service, group = world_happiness_sim$War))
## 
##  Descriptive statistics by group 
## group: NotWWII
##    vars  n mean   sd median trimmed  mad   min  max range skew kurtosis   se
## X1    1 85 0.26 0.12   0.25    0.26 0.12 -0.04 0.59  0.63 0.07     -0.2 0.01
## ------------------------------------------------------------ 
## group: WWII
##    vars  n mean  sd median trimmed mad  min  max range  skew kurtosis   se
## X1    1 51  0.7 0.1   0.72    0.71 0.1 0.48 0.88  0.39 -0.21    -0.76 0.01

The mean and standard deviation for the countries that were involved in WWII are 0.7037614 and 0.100226, respectively.

The mean and standard deviation for the countries that were not involved in WWII are 0.2625464 and 0.1211548, respectively.

  4. Based on what you observe about these group means, how might this explain the shape of the distribution of the Service variable?

The proportion of the population that works in the service industry has different (yet overlapping) distributions for countries that were involved in WWII and those that were not; combining these two group distributions produces the bimodal shape we saw in the histogram.
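One way to see this (a sketch) is to plot Service separately for each War group and compare the two histograms to the overall one:

# histograms of Service for each War group
hist(world_happiness_sim$Service[world_happiness_sim$War == "WWII"],
     main = "Service: WWII countries", xlab = "Service")
hist(world_happiness_sim$Service[world_happiness_sim$War == "NotWWII"],
     main = "Service: non-WWII countries", xlab = "Service")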


Minihack 4: Correlations

For this minihack, we’re going to use the cor() function from base R to create a correlation matrix of the entire world_happiness dataset. Be sure to use the original world_happiness_df data frame without the added variables from Minihack #3.

  1. First remove the Country variable because this variable is not numeric and will not yield meaningful correlations with the other variables in the dataset. Save the resulting dataframe to a new object called world_happiness_num.
# use the within function to remove columns by names 
world_happiness_num <- within(world_happiness_df, rm(Country))

# or use the subset function to remove columns by names
world_happiness_num <- subset(world_happiness_df, select = -Country)
  2. Use cor() to generate a correlation matrix for all the variables in world_happiness_num. Use listwise deletion by specifying use = "complete.obs".
(cor_mx <- cor(world_happiness_num, use = "complete.obs"))
##             Happiness          GDP    Support        Life    Freedom
## Happiness   1.0000000  0.823217031  0.7317866  0.76599429  0.5085981
## GDP         0.8232170  1.000000000  0.7121414  0.84842896  0.3453470
## Support     0.7317866  0.712141406  1.0000000  0.60578186  0.3670727
## Life        0.7659943  0.848428965  0.6057819  1.00000000  0.3006421
## Freedom     0.5085981  0.345346974  0.3670727  0.30064207  1.0000000
## Generosity  0.2174625  0.009531499  0.1841681  0.04838897  0.3947917
## Corruption -0.4823416 -0.346664211 -0.2998693 -0.33739376 -0.5264837
##              Generosity Corruption
## Happiness   0.217462530 -0.4823416
## GDP         0.009531499 -0.3466642
## Support     0.184168126 -0.2998693
## Life        0.048388971 -0.3373938
## Freedom     0.394791684 -0.5264837
## Generosity  1.000000000 -0.3960720
## Corruption -0.396072030  1.0000000
  3. What is the correlation between Freedom and Happiness? Between Freedom and Corruption? Do these correlations make intuitive sense?

The correlation between Freedom and Happiness is 0.5085981, and the correlation between Freedom and Corruption is -0.5264837.

  4. Use bracket notation to extract the correlation between GDP and Happiness. Report the result using in-line r code.
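A minimal sketch of one possible solution, assuming the correlation matrix cor_mx from step 2 (the object name gdp_happiness_cor is just illustrative): bracket notation with row and column names extracts a single element, which can then be embedded in your text with in-line code such as `r round(gdp_happiness_cor, 2)`.

# extract the correlation between GDP and Happiness by name
gdp_happiness_cor <- cor_mx["GDP", "Happiness"]
gdp_happiness_cor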