Purpose

In today’s lab, we will discuss how to use R to visualize correlations, calculate correlation coefficients, and assess statistical significance of correlations. You can download the Rmd file here to follow along.

For today’s lab, you will need to load the following libraries.

If you are missing any of the following packages, you can use the following code to install it.

install.packages("pwr")

library(tidyverse) # for plotting and data wrangling
library(rio) # for importing data
library(psych) # for covariance and correlation functions
library(apaTables) # for correlation tables
library(pwr) # for power calculations
library(here) # for file paths
library(corrplot) # for heat maps

Visualizing correlations

It is very important to always visualize your data. There might be patterns in your data that are not apparent just from looking at a correlation between two variables. We will go over just a few examples here, but there are many different options for visualizing correlations. See here if you want to explore more options.

Scatter plots

Scatter plots allow us to visualize the relationship between two variables. Here is the code for a scatter plot using ggplot(). You can add a line of best fit by adding a geom_smooth() layer to the plot.

ggplot(data = mtcars, aes(x = mpg, y = hp)) +
  geom_point() +
  geom_smooth(method = "lm")

SPLOM plots

“SPLOM” stands for scatter plot matrix. The pairs.panel() function from the {psych} package allows a quick way to visualize relationships among all the continuous variables in your data frame. The lower diagonal contains scatter plots showing bivariate relationships between pairs of variables, and the upper diagonal contains the corresponding correlation coefficients. Histograms for each variable are shown along the diagonal.

pairs.panels(mtcars, lm = TRUE)

Heat maps

Heat maps are a great way to get a high-level visualization of a correlation matrix. They are particularly useful for visualizing the number of “clusters” in your data if that’s something you’re looking for. Thurstone, built into the {psych} package, is a correlation matrix of variables assessing different aspects of cognitive ability. We can plot a heatmap of this correlation matrix using the corrplot() function from the {corrplot} package. Note: make sure that you are feeding the function a correlation matrix (not the data set).

corrplot(corr = Thurstone, method = "square")

Question: What do you notice about the structure of this data?

APA Tables

The package {apaTables} has a very useful function apa.cor.table() that creates nicely formatted tables of correlation matrices in APA format. This code prints the table to a word document called “mtcars.doc” that I am storing in the lab-1 folder of my working directory. table.number = 1 indicates “Table 1.”

apa.cor.table(mtcars, 
              filename = here("labs", "lab-1", "mtcars.doc"), 
              table.number = 1)

Covariances and correlations

Covariance

Covariance captures how the variances of two variables are related, i.e., how they co-vary. If higher values of one variable tend to correspond with higher values of the other variable, and lower values of one variable tend to correspond with lower values of the other variable, then the covariance would be positive. However, if the two variables are inversely related (i.e., higher values on one variable correspond with lower values on the other variable), then the covariance would be negative.

\[\large cov_{xy} = {\frac{\sum{(x-\bar{x})(y-\bar{y})}}{N-1}}\]

To calculate covariance, use the function cov() from the {stats} package. The cov() function takes two arguments: the first variable “x” and the second variable “y”.

Calculate the covariance between mtcars$mpg and mtcars$hp:

cov(x = mtcars$mpg, y = mtcars$hp)

## [1] -320.7321

Feeding cov() a data frame (of numeric variables) will generate a covariance matrix. Calculate a covariance matrix for the mtcars data frame. Round to two decimal places.

cov(mtcars)

##              mpg         cyl        disp          hp         drat          wt
## mpg    36.324103  -9.1723790  -633.09721 -320.732056   2.19506351  -5.1166847
## cyl    -9.172379   3.1895161   199.66028  101.931452  -0.66836694   1.3673710
## disp -633.097208 199.6602823 15360.79983 6721.158669 -47.06401915 107.6842040
## hp   -320.732056 101.9314516  6721.15867 4700.866935 -16.45110887  44.1926613
## drat    2.195064  -0.6683669   -47.06402  -16.451109   0.28588135  -0.3727207
## wt     -5.116685   1.3673710   107.68420   44.192661  -0.37272073   0.9573790
## qsec    4.509149  -1.8868548   -96.05168  -86.770081   0.08714073  -0.3054816
## vs      2.017137  -0.7298387   -44.37762  -24.987903   0.11864919  -0.2736613
## am      1.803931  -0.4657258   -36.56401   -8.320565   0.19015121  -0.3381048
## gear    2.135685  -0.6491935   -50.80262   -6.358871   0.27598790  -0.4210806
## carb   -5.363105   1.5201613    79.06875   83.036290  -0.07840726   0.6757903
##              qsec           vs           am        gear        carb
## mpg    4.50914919   2.01713710   1.80393145   2.1356855 -5.36310484
## cyl   -1.88685484  -0.72983871  -0.46572581  -0.6491935  1.52016129
## disp -96.05168145 -44.37762097 -36.56401210 -50.8026210 79.06875000
## hp   -86.77008065 -24.98790323  -8.32056452  -6.3588710 83.03629032
## drat   0.08714073   0.11864919   0.19015121   0.2759879 -0.07840726
## wt    -0.30548161  -0.27366129  -0.33810484  -0.4210806  0.67579032
## qsec   3.19316613   0.67056452  -0.20495968  -0.2804032 -1.89411290
## vs     0.67056452   0.25403226   0.04233871   0.0766129 -0.46370968
## am    -0.20495968   0.04233871   0.24899194   0.2923387  0.04637097
## gear  -0.28040323   0.07661290   0.29233871   0.5443548  0.32661290
## carb  -1.89411290  -0.46370968   0.04637097   0.3266129  2.60887097

Question: In the above output, what do the numbers along the diagonal represent?

Correlation

Correlations are standardized covariances. Because correlations are in standardized units, we can compare them across scales of measurements and across studies. Recall that mathematically, a correlation is the covariance divided by the product of the standard deviations of each variable.

\[\large r_{xy} = {\frac{cov(X,Y)}{\hat\sigma_{x}\hat\sigma_{y}}}\]

Calculate the correlation coefficient for mtcars$mpg and mtcars$hp using the cor() function (again, from the {stats} package).

cor(mtcars$mpg, mtcars$hp)

## [1] -0.7761684

As with covariances, we can generate a matrix of correlations by feeding a data frame to cor(). Calculate a correlation matrix for the variables in the mtcars data set. Round to two decimal places.

cor(mtcars)

##             mpg        cyl       disp         hp        drat         wt
## mpg   1.0000000 -0.8521620 -0.8475514 -0.7761684  0.68117191 -0.8676594
## cyl  -0.8521620  1.0000000  0.9020329  0.8324475 -0.69993811  0.7824958
## disp -0.8475514  0.9020329  1.0000000  0.7909486 -0.71021393  0.8879799
## hp   -0.7761684  0.8324475  0.7909486  1.0000000 -0.44875912  0.6587479
## drat  0.6811719 -0.6999381 -0.7102139 -0.4487591  1.00000000 -0.7124406
## wt   -0.8676594  0.7824958  0.8879799  0.6587479 -0.71244065  1.0000000
## qsec  0.4186840 -0.5912421 -0.4336979 -0.7082234  0.09120476 -0.1747159
## vs    0.6640389 -0.8108118 -0.7104159 -0.7230967  0.44027846 -0.5549157
## am    0.5998324 -0.5226070 -0.5912270 -0.2432043  0.71271113 -0.6924953
## gear  0.4802848 -0.4926866 -0.5555692 -0.1257043  0.69961013 -0.5832870
## carb -0.5509251  0.5269883  0.3949769  0.7498125 -0.09078980  0.4276059
##             qsec         vs          am       gear        carb
## mpg   0.41868403  0.6640389  0.59983243  0.4802848 -0.55092507
## cyl  -0.59124207 -0.8108118 -0.52260705 -0.4926866  0.52698829
## disp -0.43369788 -0.7104159 -0.59122704 -0.5555692  0.39497686
## hp   -0.70822339 -0.7230967 -0.24320426 -0.1257043  0.74981247
## drat  0.09120476  0.4402785  0.71271113  0.6996101 -0.09078980
## wt   -0.17471588 -0.5549157 -0.69249526 -0.5832870  0.42760594
## qsec  1.00000000  0.7445354 -0.22986086 -0.2126822 -0.65624923
## vs    0.74453544  1.0000000  0.16834512  0.2060233 -0.56960714
## am   -0.22986086  0.1683451  1.00000000  0.7940588  0.05753435
## gear -0.21268223  0.2060233  0.79405876  1.0000000  0.27407284
## carb -0.65624923 -0.5696071  0.05753435  0.2740728  1.00000000

If you are given a covariance matrix, you can convert it to a correlation matrix using cov2cor().

# covariance matrix
cov_mat <- cov(mtcars)

# convert to correlation matrix
cov2cor(cov_mat) %>% 
  round(2)

##        mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
## mpg   1.00 -0.85 -0.85 -0.78  0.68 -0.87  0.42  0.66  0.60  0.48 -0.55
## cyl  -0.85  1.00  0.90  0.83 -0.70  0.78 -0.59 -0.81 -0.52 -0.49  0.53
## disp -0.85  0.90  1.00  0.79 -0.71  0.89 -0.43 -0.71 -0.59 -0.56  0.39
## hp   -0.78  0.83  0.79  1.00 -0.45  0.66 -0.71 -0.72 -0.24 -0.13  0.75
## drat  0.68 -0.70 -0.71 -0.45  1.00 -0.71  0.09  0.44  0.71  0.70 -0.09
## wt   -0.87  0.78  0.89  0.66 -0.71  1.00 -0.17 -0.55 -0.69 -0.58  0.43
## qsec  0.42 -0.59 -0.43 -0.71  0.09 -0.17  1.00  0.74 -0.23 -0.21 -0.66
## vs    0.66 -0.81 -0.71 -0.72  0.44 -0.55  0.74  1.00  0.17  0.21 -0.57
## am    0.60 -0.52 -0.59 -0.24  0.71 -0.69 -0.23  0.17  1.00  0.79  0.06
## gear  0.48 -0.49 -0.56 -0.13  0.70 -0.58 -0.21  0.21  0.79  1.00  0.27
## carb -0.55  0.53  0.39  0.75 -0.09  0.43 -0.66 -0.57  0.06  0.27  1.00

Hypothesis testing with correlations

Once you have a correlation coefficient, you may want to assess whether the correlation is statistically meaningful.

For this example, we are going to work with a data set (n = 60; 30 men and 30 women) that contains variables measuring self-reported conscientiousness (from the BFI) and self-reported physical health. We want to know if there is a significant correlation between conscientiousness and physical health, and then whether or not that correlation differs between men and women.
Import the data using the following code and check the structure of the data.

health <- import("https://raw.githubusercontent.com/uopsych/psy612/master/labs/lab-1/data/consc_health.csv")

str(health)
head(health)

Visualize the data

ggplot(data = health, aes(x = consc, y = sr_health)) +
  geom_point() + 
  geom_smooth(method = "lm") +
  labs(x = "Conscientiousness", y = "Self-reported health", title = "Conscientiousness and self-reported health")

Statistical hypotheses

\[\large H_{0}: \rho_{xy} = 0\]

\[\large H_{A}: \rho_{xy} \neq 0\]

Correlation test

You can run a correlation test using the corr.test() function from the {psych} package.

(r_consc_health <- health %>% 
  select(consc, sr_health) %>% 
  corr.test())

## Call:corr.test(x = .)
## Correlation matrix 
##           consc sr_health
## consc      1.00      0.48
## sr_health  0.48      1.00
## Sample Size 
## [1] 60
## Probability values (Entries above the diagonal are adjusted for multiple tests.) 
##           consc sr_health
## consc         0         0
## sr_health     0         0
## 
##  To see confidence intervals of the correlations, print with the short=FALSE option

Question: What can we conclude about the relationship between conscientiousness and health from this data?

It is often very useful to save the output of a statistical test to an object that you can then pull useful information out of it.

r_consc_health$ci #confidence interval

##                 lower         r     upper            p
## consc-sr_hl 0.2532571 0.4765366 0.6516132 0.0001186082

r_consc_health$p #p-value matrix

##                  consc    sr_health
## consc     0.0000000000 0.0001186082
## sr_health 0.0001186082 0.0000000000

Comparing correlations

We saw above that there is a significant positive relationship between conscientiousness and self-reported health. However, now we want to know whether or not this correlation is significantly different for men and women.

Visualizing by group

First let’s plot our data again, but this time split the data by gender.

health %>% 
  ggplot(aes(consc, sr_health, color = gender)) + 
  geom_point(size = 2) +
  geom_smooth(method = "lm") + 
  labs(x = "Conscientiousness", y = "Self-reported Health")

## `geom_smooth()` using formula 'y ~ x'

Question: What can we conclude from this graph?

`psych::r.test()`

To statistically compare the correlations between conscientiousness and self-reported health for men and women, we can use the r.test() function. This particular function requires the sample size of the first group and two correlations we’re testing against each other. So, we’ll need to first run the correlation separately for men and women.

health_women <- health %>% 
  filter(gender == "female") 

health_men <- health %>% 
  filter(gender == "male")

r_women <- cor(health_women$consc, health_women$sr_health)

r_men <- cor(health_men$consc, health_men$sr_health)

r.test() has three arguments:

n = sample size for group 1 (the female group)
r12 = r for group 1 (the female group) 3.r34 = r for group 2 (the male group)

r.test(n = nrow(health_women), r12 = r_women, r34 = r_men)

## Correlation tests 
## Call:r.test(n = nrow(health_women), r12 = r_women, r34 = r_men)
## Test of difference between two independent correlations 
##  z value 1.16    with probability  0.25

Question: What does this test suggest about the correlation between conscientiousness and health across genders?

Minihacks

For these minihacks, you should practice using inline code to write up your answers. Inline code allows you to report a variable outside of a code chunk, e.g., 0.61.

Minihack 1

Use the pwr.r.test function to answer the following questions:

How many participants should we collect if we expect a small correlation (r = .2) and want to achieve power = .8? Assume alpha = .05.

#pwr.r.test(n = , sig.level = , power = , r = )

How does the number of participants needed to obtain .8 power change when you decrease the significance level to .01?

#pwr.r.test(n = , sig.level = , power = , r = )

How does the number of participants needed to obtain .8 power change when you increase the effect size (r) to .5?

#pwr.r.test(n = , sig.level = , power = , r = )

Minihack 2

For this minihack, calculate a 95% CI “by hand” (i.e., you can use R but not the cor.test() function) for the correlation between conscientiousness and self-reported health from the health() dataset. Make sure you get the same answer that was given by cor.test(). Hint: when calculating CI’s, use the functions fisherz() and fisherz2r() from {psych}.

To review the steps of calculating confidence intervals using Fisher’s transformation, see here.

# Your code here

Minihack 3

The following table is from this paper from Dawtry et al. (2015).

You can import the data using the following code:

dawtry_clean <- import("https://raw.githubusercontent.com/uopsych/psy612/master/labs/lab-1/data/dawtry_2015_clean.csv")

Replicate the correlation matrix from the table using the apa.cor.table() function and open it in Microsoft Word.

#Your text here

Visualize the correlation matrix using a SPLOM plot and/or a heat map.

#Your text here

Lab 1: Correlations