In today’s lab, we will discuss how to use R to visualize correlations, calculate correlation coefficients, and assess statistical significance of correlations. You can download the Rmd file here to follow along.
For today’s lab, you will need to load the following libraries.
If you are missing any of the following packages, you can use the following code to install it.
install.packages("pwr")
library(tidyverse) # for plotting and data wrangling
library(rio) # for importing data
library(psych) # for covariance and correlation functions
library(apaTables) # for correlation tables
library(pwr) # for power calculations
library(here) # for file paths
library(corrplot) # for heat maps
It is very important to always visualize your data. There might be patterns in your data that are not apparent just from looking at a correlation between two variables. We will go over just a few examples here, but there are many different options for visualizing correlations. See here if you want to explore more options.
Scatter plots allow us to visualize the relationship between two variables. Here is the code for a scatter plot using ggplot()
. You can add a line of best fit by adding a geom_smooth()
layer to the plot.
ggplot(data = mtcars, aes(x = mpg, y = hp)) +
geom_point() +
geom_smooth(method = "lm")
“SPLOM” stands for scatter plot matrix. The pairs.panel()
function from the {psych}
package allows a quick way to visualize relationships among all the continuous variables in your data frame. The lower diagonal contains scatter plots showing bivariate relationships between pairs of variables, and the upper diagonal contains the corresponding correlation coefficients. Histograms for each variable are shown along the diagonal.
pairs.panels(mtcars, lm = TRUE)
Heat maps are a great way to get a high-level visualization of a correlation matrix. They are particularly useful for visualizing the number of “clusters” in your data if that’s something you’re looking for. Thurstone
, built into the {psych}
package, is a correlation matrix of variables assessing different aspects of cognitive ability. We can plot a heatmap of this correlation matrix using the corrplot()
function from the {corrplot}
package. Note: make sure that you are feeding the function a correlation matrix (not the data set).
corrplot(corr = Thurstone, method = "square")
Question: What do you notice about the structure of this data?
{apaTables}
has a very useful function apa.cor.table()
that creates nicely formatted tables of correlation matrices in APA format. This code prints the table to a word document called “mtcars.doc” that I am storing in the lab-1
folder of my working directory. table.number = 1
indicates “Table 1.”apa.cor.table(mtcars,
filename = here("labs", "lab-1", "mtcars.doc"),
table.number = 1)
Covariance captures how the variances of two variables are related, i.e., how they co-vary. If higher values of one variable tend to correspond with higher values of the other variable, and lower values of one variable tend to correspond with lower values of the other variable, then the covariance would be positive. However, if the two variables are inversely related (i.e., higher values on one variable correspond with lower values on the other variable), then the covariance would be negative.
\[\large cov_{xy} = {\frac{\sum{(x-\bar{x})(y-\bar{y})}}{N-1}}\]
To calculate covariance, use the function cov()
from the {stats}
package. The cov()
function takes two arguments: the first variable “x” and the second variable “y”.
mtcars$mpg
and mtcars$hp
:cov(x = mtcars$mpg, y = mtcars$hp)
## [1] -320.7321
cov()
a data frame (of numeric variables) will generate a covariance matrix. Calculate a covariance matrix for the mtcars
data frame. Round to two decimal places.cov(mtcars)
## mpg cyl disp hp drat wt
## mpg 36.324103 -9.1723790 -633.09721 -320.732056 2.19506351 -5.1166847
## cyl -9.172379 3.1895161 199.66028 101.931452 -0.66836694 1.3673710
## disp -633.097208 199.6602823 15360.79983 6721.158669 -47.06401915 107.6842040
## hp -320.732056 101.9314516 6721.15867 4700.866935 -16.45110887 44.1926613
## drat 2.195064 -0.6683669 -47.06402 -16.451109 0.28588135 -0.3727207
## wt -5.116685 1.3673710 107.68420 44.192661 -0.37272073 0.9573790
## qsec 4.509149 -1.8868548 -96.05168 -86.770081 0.08714073 -0.3054816
## vs 2.017137 -0.7298387 -44.37762 -24.987903 0.11864919 -0.2736613
## am 1.803931 -0.4657258 -36.56401 -8.320565 0.19015121 -0.3381048
## gear 2.135685 -0.6491935 -50.80262 -6.358871 0.27598790 -0.4210806
## carb -5.363105 1.5201613 79.06875 83.036290 -0.07840726 0.6757903
## qsec vs am gear carb
## mpg 4.50914919 2.01713710 1.80393145 2.1356855 -5.36310484
## cyl -1.88685484 -0.72983871 -0.46572581 -0.6491935 1.52016129
## disp -96.05168145 -44.37762097 -36.56401210 -50.8026210 79.06875000
## hp -86.77008065 -24.98790323 -8.32056452 -6.3588710 83.03629032
## drat 0.08714073 0.11864919 0.19015121 0.2759879 -0.07840726
## wt -0.30548161 -0.27366129 -0.33810484 -0.4210806 0.67579032
## qsec 3.19316613 0.67056452 -0.20495968 -0.2804032 -1.89411290
## vs 0.67056452 0.25403226 0.04233871 0.0766129 -0.46370968
## am -0.20495968 0.04233871 0.24899194 0.2923387 0.04637097
## gear -0.28040323 0.07661290 0.29233871 0.5443548 0.32661290
## carb -1.89411290 -0.46370968 0.04637097 0.3266129 2.60887097
Question: In the above output, what do the numbers along the diagonal represent?
\[\large r_{xy} = {\frac{cov(X,Y)}{\hat\sigma_{x}\hat\sigma_{y}}}\]
mtcars$mpg
and mtcars$hp
using the cor()
function (again, from the {stats}
package).cor(mtcars$mpg, mtcars$hp)
## [1] -0.7761684
cor()
. Calculate a correlation matrix for the variables in the mtcars
data set. Round to two decimal places.cor(mtcars)
## mpg cyl disp hp drat wt
## mpg 1.0000000 -0.8521620 -0.8475514 -0.7761684 0.68117191 -0.8676594
## cyl -0.8521620 1.0000000 0.9020329 0.8324475 -0.69993811 0.7824958
## disp -0.8475514 0.9020329 1.0000000 0.7909486 -0.71021393 0.8879799
## hp -0.7761684 0.8324475 0.7909486 1.0000000 -0.44875912 0.6587479
## drat 0.6811719 -0.6999381 -0.7102139 -0.4487591 1.00000000 -0.7124406
## wt -0.8676594 0.7824958 0.8879799 0.6587479 -0.71244065 1.0000000
## qsec 0.4186840 -0.5912421 -0.4336979 -0.7082234 0.09120476 -0.1747159
## vs 0.6640389 -0.8108118 -0.7104159 -0.7230967 0.44027846 -0.5549157
## am 0.5998324 -0.5226070 -0.5912270 -0.2432043 0.71271113 -0.6924953
## gear 0.4802848 -0.4926866 -0.5555692 -0.1257043 0.69961013 -0.5832870
## carb -0.5509251 0.5269883 0.3949769 0.7498125 -0.09078980 0.4276059
## qsec vs am gear carb
## mpg 0.41868403 0.6640389 0.59983243 0.4802848 -0.55092507
## cyl -0.59124207 -0.8108118 -0.52260705 -0.4926866 0.52698829
## disp -0.43369788 -0.7104159 -0.59122704 -0.5555692 0.39497686
## hp -0.70822339 -0.7230967 -0.24320426 -0.1257043 0.74981247
## drat 0.09120476 0.4402785 0.71271113 0.6996101 -0.09078980
## wt -0.17471588 -0.5549157 -0.69249526 -0.5832870 0.42760594
## qsec 1.00000000 0.7445354 -0.22986086 -0.2126822 -0.65624923
## vs 0.74453544 1.0000000 0.16834512 0.2060233 -0.56960714
## am -0.22986086 0.1683451 1.00000000 0.7940588 0.05753435
## gear -0.21268223 0.2060233 0.79405876 1.0000000 0.27407284
## carb -0.65624923 -0.5696071 0.05753435 0.2740728 1.00000000
cov2cor()
.# covariance matrix
cov_mat <- cov(mtcars)
# convert to correlation matrix
cov2cor(cov_mat) %>%
round(2)
## mpg cyl disp hp drat wt qsec vs am gear carb
## mpg 1.00 -0.85 -0.85 -0.78 0.68 -0.87 0.42 0.66 0.60 0.48 -0.55
## cyl -0.85 1.00 0.90 0.83 -0.70 0.78 -0.59 -0.81 -0.52 -0.49 0.53
## disp -0.85 0.90 1.00 0.79 -0.71 0.89 -0.43 -0.71 -0.59 -0.56 0.39
## hp -0.78 0.83 0.79 1.00 -0.45 0.66 -0.71 -0.72 -0.24 -0.13 0.75
## drat 0.68 -0.70 -0.71 -0.45 1.00 -0.71 0.09 0.44 0.71 0.70 -0.09
## wt -0.87 0.78 0.89 0.66 -0.71 1.00 -0.17 -0.55 -0.69 -0.58 0.43
## qsec 0.42 -0.59 -0.43 -0.71 0.09 -0.17 1.00 0.74 -0.23 -0.21 -0.66
## vs 0.66 -0.81 -0.71 -0.72 0.44 -0.55 0.74 1.00 0.17 0.21 -0.57
## am 0.60 -0.52 -0.59 -0.24 0.71 -0.69 -0.23 0.17 1.00 0.79 0.06
## gear 0.48 -0.49 -0.56 -0.13 0.70 -0.58 -0.21 0.21 0.79 1.00 0.27
## carb -0.55 0.53 0.39 0.75 -0.09 0.43 -0.66 -0.57 0.06 0.27 1.00
Once you have a correlation coefficient, you may want to assess whether the correlation is statistically meaningful.
For this example, we are going to work with a data set (n = 60; 30 men and 30 women) that contains variables measuring self-reported conscientiousness (from the BFI) and self-reported physical health. We want to know if there is a significant correlation between conscientiousness and physical health, and then whether or not that correlation differs between men and women.
Import the data using the following code and check the structure of the data.
health <- import("https://raw.githubusercontent.com/uopsych/psy612/master/labs/lab-1/data/consc_health.csv")
str(health)
head(health)
ggplot(data = health, aes(x = consc, y = sr_health)) +
geom_point() +
geom_smooth(method = "lm") +
labs(x = "Conscientiousness", y = "Self-reported health", title = "Conscientiousness and self-reported health")
\[\large H_{0}: \rho_{xy} = 0\]
\[\large H_{A}: \rho_{xy} \neq 0\]
You can run a correlation test using the corr.test()
function from the {psych}
package.
(r_consc_health <- health %>%
select(consc, sr_health) %>%
corr.test())
## Call:corr.test(x = .)
## Correlation matrix
## consc sr_health
## consc 1.00 0.48
## sr_health 0.48 1.00
## Sample Size
## [1] 60
## Probability values (Entries above the diagonal are adjusted for multiple tests.)
## consc sr_health
## consc 0 0
## sr_health 0 0
##
## To see confidence intervals of the correlations, print with the short=FALSE option
Question: What can we conclude about the relationship between conscientiousness and health from this data?
r_consc_health$ci #confidence interval
## lower r upper p
## consc-sr_hl 0.2532571 0.4765366 0.6516132 0.0001186082
r_consc_health$p #p-value matrix
## consc sr_health
## consc 0.0000000000 0.0001186082
## sr_health 0.0001186082 0.0000000000
We saw above that there is a significant positive relationship between conscientiousness and self-reported health. However, now we want to know whether or not this correlation is significantly different for men and women.
health %>%
ggplot(aes(consc, sr_health, color = gender)) +
geom_point(size = 2) +
geom_smooth(method = "lm") +
labs(x = "Conscientiousness", y = "Self-reported Health")
## `geom_smooth()` using formula 'y ~ x'
Question: What can we conclude from this graph?
psych::r.test()
r.test()
function. This particular function requires the sample size of the first group and two correlations we’re testing against each other. So, we’ll need to first run the correlation separately for men and women.health_women <- health %>%
filter(gender == "female")
health_men <- health %>%
filter(gender == "male")
r_women <- cor(health_women$consc, health_women$sr_health)
r_men <- cor(health_men$consc, health_men$sr_health)
r.test()
has three arguments:
n
= sample size for group 1 (the female group)r12
= r for group 1 (the female group) 3.r34
= r for group 2 (the male group)r.test(n = nrow(health_women), r12 = r_women, r34 = r_men)
## Correlation tests
## Call:r.test(n = nrow(health_women), r12 = r_women, r34 = r_men)
## Test of difference between two independent correlations
## z value 1.16 with probability 0.25
Question: What does this test suggest about the correlation between conscientiousness and health across genders?
For these minihacks, you should practice using inline code to write up your answers. Inline code allows you to report a variable outside of a code chunk, e.g., 0.61.
Use the pwr.r.test
function to answer the following questions:
#pwr.r.test(n = , sig.level = , power = , r = )
#pwr.r.test(n = , sig.level = , power = , r = )
#pwr.r.test(n = , sig.level = , power = , r = )
For this minihack, calculate a 95% CI “by hand” (i.e., you can use R but not the cor.test()
function) for the correlation between conscientiousness and self-reported health from the health()
dataset. Make sure you get the same answer that was given by cor.test()
. Hint: when calculating CI’s, use the functions fisherz()
and fisherz2r()
from {psych}
.
To review the steps of calculating confidence intervals using Fisher’s transformation, see here.
# Your code here
The following table is from this paper from Dawtry et al. (2015).
You can import the data using the following code:
dawtry_clean <- import("https://raw.githubusercontent.com/uopsych/psy612/master/labs/lab-1/data/dawtry_2015_clean.csv")
apa.cor.table()
function and open it in Microsoft Word.#Your text here
#Your text here