Psy 612: Data Analysis II

Welcome back!

Last term:

  • Descriptive statistics
  • Probability
  • Sampling
  • Hypothesis testing

This term:

  • Model building
  • Correlation and regression
  • General linear model
  • Multiple regression
library(psych)
library(psychTools)
library(tidyverse)
library(RColorBrewer)

options(scipen=999)

PSY 612 details

UOpsych.github.io/psy612/

Structure of this course:

  • Lectures, Labs, Reading
  • Weekly quizzes (T/F)
  • Homework assignments (3, 30 points each)
  • Final project (1)

Journals are optional (no impact on grade)

PSY 612 goals

  • Understand how models are built and estimated

  • Exposure to a broad range of ideas and tools

    • Can’t learn everything in a year – exposure helps you when you need to learn something new
  • Practice practical skills

    • Looking up code online
    • Troubleshooting errors
    • Using real data, dealing with real problems
    • Asking for help

Relationships

  • What is the relationship between IV and DV?

  • Measuring relationships depend on type of measurement

  • You have primarily been working with categorical IVs

    • t-tests
    • chi-square

Scatter Plot with best fit line

Code
galton.data <- galton
purples = brewer.pal(n = 3, name = "Purples")

ggplot(galton.data, aes(x=parent, y=child)) +
    geom_jitter() +    
    geom_smooth(method=lm,   # Add linear regression line
                se=FALSE) +     # Don't add shaded confidence region
    labs(x = "parent height", y = "child height")

Review of Dispersion

Variation (sum of squares)

\[SS = {\sum{(x-\bar{x})^2}}\] \[SS = {\sum{(x-\mu)^2}}\]

Review of Dispersion

Variance

\[\large s^{2} = \hat{\sigma}^2 = {\frac{\sum{(x-\bar{x})^2}}{N-1}}\]

\[\large\sigma^{2} = {\frac{\sum{(x-\mu)^2}}{N}}\]

Review of Dispersion

Standard Deviation

\[ s = \hat{\sigma} = \sqrt{\frac{\sum{(x-\bar{x})^2}}{N-1}}\]

\[\sigma = \sqrt{\frac{\sum{(x-\mu)^2}}{N}}\]

Formula for standard error of the mean?

  • \[\sigma_M = \frac{\sigma}{\sqrt{N}}\]

  • \[\sigma_M = \frac{s}{\sqrt{N}}\]

Associations

  • i.e., relationships
  • to look at continuous variable associations we need to think in terms of how variables relate to one another

Associations

Covariation (cross products)

Sample:

\[ SS = {\sum{(x-\bar{x})(y-\bar{y})}}\]

Population:

\[SS = {\sum{{(x-\mu_{x}})(y-\mu_{y})}}\]

Associations

Covariance

Sample:

\[ cov_{xy} = {\frac{\sum{(x-\bar{x})(y-\bar{y})}}{N-1}}\]

Population:

\[ \sigma_{xy}^{2} = {\frac{\sum{(x-\mu_{x})(y-\mu_{y})}}{N}}\]

  • Covariance matrix is basis for many analyses
  • What are some issues that may arise when comparing covariances?

Associations

Correlations

Sample:

\[ r_{xy} = {\frac{\sum({z_{x}z_{y})}}{N}}\]

Population:

\[\rho_{xy} = {\frac{cov(X,Y)}{\sigma_{x}\sigma_{y}}}\]

Many other formulas exist for specific types of data, these were more helpful when we computed everything by hand (more on this later).

Correlations

  • How much two variables are linearly related

  • -1 to 1

  • Invariant to changes in mean or scaling

  • Most common (and basic) effect size measure

  • Will use to build our regression model

Correlations

Code
ggplot(galton.data, aes(x=parent, y=child)) +
    geom_jitter(alpha = .4) +
    geom_smooth(method=lm,   # Add linear regression line
                se=FALSE) +     # Don't add shaded confidence region
    labs(x = "parent height", y = "child height") +
  theme_bw(base_size = 20)

Conceptually

Ways to think about a correlation:

  • How two vectors of numbers co-relate

  • Product of z-scores

    • Mathematically, it is
  • The average squared distance between two vectors in the same space

  • The cosine of the angle between Y and the projected Y from X \((\hat{Y})\).

Statistical test

Hypothesis testing

\[ H_{0}: \rho_{xy} = 0\]

\[ H_{A}: \rho_{xy} \neq 0\]

Assumes:

  • Observations are independent
  • Symmetric bivariate distribution (joint probability distribution)

Univariate distributions

Code
pop = ggplot(data.frame(x = seq(-30, 30)), aes(x)) +
  stat_function(fun = function(x) dnorm(x, m = 0, sd = 10),
                geom = "area", fill = purples[3]) +
  scale_x_continuous("X") +
  scale_y_continuous("density", labels = NULL) +
  ggtitle("Population")+
  theme_bw(base_size = 20)
sample = data.frame(x = rnorm(n = 30, m = 0, sd = 10)) %>%
  ggplot(aes(x = x)) +
  geom_histogram(fill = purples[2], color = "black", bins = 20) +
  scale_x_continuous("X", limits = c(-30, 30)) +
  scale_y_continuous("frequency", labels = NULL) +
  ggtitle("Sample")+
  theme_bw(base_size = 20)
sampling = ggplot(data.frame(x = seq(-30, 30)), aes(x)) +
  stat_function(fun = function(x) dnorm(x, m = 0, sd = 10/sqrt(30)),
                geom = "area", fill = purples[3]) +
  scale_x_continuous("X") +
  scale_y_continuous("density", labels = NULL) +
  ggtitle("Sampling")+
  theme_bw(base_size = 20)

ggpubr::ggarrange(pop, sample, sampling)

Population

Code
mu1<-0 # setting the expected value of x1
mu2<-0 # setting the expected value of x2
s11<-4 # setting the variance of x1
s12<-2 # setting the covariance between x1 and x2
s22<-5 # setting the variance of x2
rho<-0.1 # setting the correlation coefficient between x1 and x2
x1<-seq(-10,10,length=41) # generating the vector series x1
x2<-x1 # copying x1 to x2
f<-function(x1,x2){
    term1<-1/(2*pi*sqrt(s11*s22*(1-rho^2)))
    term2<--1/(2*(1-rho^2))
    term3<-(x1-mu1)^2/s11
    term4<-(x2-mu2)^2/s22
    term5<--2*rho*((x1-mu1)*(x2-mu2))/(sqrt(s11)*sqrt(s22))
    term1*exp(term2*(term3+term4-term5))
} # setting up the function of the multivariate normal density
#
z<-outer(x1,x2,f) # calculating the density values
#
persp(x1, x2, z,
main="Joint Probability Distribution", sub=expression(italic(f)~(bold(x))==frac(1,2~pi~sqrt(sigma[11]~ sigma[22]~(1-rho^2)))~phantom(0)^bold(.)~exp~bgroup("{", list(-frac(1,2(1-rho^2)),
bgroup("[", frac((x[1]~-~mu[1])^2, sigma[11])~-~2~rho~frac(x[1]~-~mu[1], sqrt(sigma[11]))~ frac(x[2]~-~mu[2],sqrt(sigma[22]))~+~ frac((x[2]~-~mu[2])^2, sigma[22]),"]")),"}")),
col= purples[3],
theta=30, phi=20,
r=50,
d=0.1,
expand=0.5,
ltheta=90, lphi=180,
shade=0.75,
ticktype="detailed",
nticks=5)

# produces the 3-D plot
mtext(expression(list(mu[1]==0,mu[2]==0,sigma[11]==4,sigma[22]==5,sigma[12 ]==2,rho==0.1)), side=3)

Sample

Code
sigma <- matrix( c( 4, 2,
                    2, 5 ), 2, 2 )  # covariance matrix

sample = BDgraph::rmvnorm(n = 30, mean = c(0,0), sigma = sigma)

sample %>% as.data.frame() %>%
  ggplot(aes(x = V1, y = V2)) +
  geom_smooth(method = "lm", se = F, color = purples[2])+
  geom_point(color = purples[3]) +
  scale_x_continuous("X1") + 
  scale_y_continuous("X2")+
  theme_bw(base_size = 20)

Sampling distribution?

  • The sampling distribution we use depends on our null hypothesis.

  • If our null hypothesis is the nil \((\rho = 0)\) , then we can use a t-distribution to estimate the statistical significance of a correlation.

Statistical test

Test statistic

\[\large t = {\frac{r}{SE_{r}}}\]

\[SE_r = \sqrt{\frac{1-r^2}{N-2}}\]

\[t = {\frac{r}{\sqrt{\frac{1-r^{2}}{N-2}}}}\]

\[ DF = N-2\]

Example

You’re comparing scores on the GRE Quantitative section and GRE Verbal section. You randomly select 30 applications out of those submitted to the University of Oregon and find a correlation between these scores of .80. Is this significantly different from 0?

Code
r = .80
N = 30
SE = sqrt((1-r^2)/(N-2))
tstat = r/SE
cv = qt(.975, df = N-2)

\[SE_r = \sqrt{\frac{1-.80^2}{30-2}} = 0.11\] \[t = \frac{.80}{0.11} = 7.06\]

pt(7.055, df = 30-2, lower.tail = F)*2
[1] 0.0000001127696

Power calculations

library(pwr)
pwr.r.test(n = , r = .1, sig.level = .05 , power = .8)

     approximate correlation power calculation (arctangh transformation) 

              n = 781.7516
              r = 0.1
      sig.level = 0.05
          power = 0.8
    alternative = two.sided
pwr.r.test(n = , r = .3, sig.level = .05 , power = .8)

     approximate correlation power calculation (arctangh transformation) 

              n = 84.07364
              r = 0.3
      sig.level = 0.05
          power = 0.8
    alternative = two.sided

Confidence intervals

But here’s where we get into some trouble. What happens if we try to estimate the precision around our estimate using the techniques we learned in PSY 611?

For the CI around a mean or a difference in means, we would use: \[CI_{95} = \bar{X} \pm SE(t_{\text{critical value}})\]

(cv = qt(.975, df = 30-2))
[1] 2.048407
.80 - 0.11*cv; .80 + 0.11*cv
[1] 0.5746752
[1] 1.025325

Sampling distributions are skewed

If we want to make calculations around correlation values that are not equal to 0, then we will run into a skewed sampling distribution. This applies to both calculating confidence intervals around estimates of correlations and null hypotheses in which \(\rho \neq 0\).

Skewed sampling distributions

Code
r_sampling = function(x, r, n){
  z = fisherz(r)
  se = 1/(sqrt(n-3))
  x_z = fisherz(x)
  density = dnorm(x_z, mean = z, sd = se)
  return(density)
}

cor_75 = ggplot(data.frame(x = seq(-.99, .99)), aes(x)) +
  stat_function(fun = function(x) r_sampling(x, r = .75, 
                                             n = 30),
                geom = "area", fill = purples[3]) +
  scale_x_continuous(limits = c(-.99, .99)) +
  ggtitle("r = .75, n = 30") +
  theme_bw(base_size = 20)

cor_32 = ggplot(data.frame(x = seq(-.99, .99)), aes(x)) +
  stat_function(fun = function(x) r_sampling(x, r = .32, 
                                             n = 30),
                geom = "area", fill = purples[2]) +
  scale_x_continuous(limits = c(-.99, .99)) +
  ggtitle("r = .32, n = 30")+
  theme_bw(base_size = 20)


cor_n85 = ggplot(data.frame(x = seq(-.99, .99)), aes(x)) +
  stat_function(fun = function(x) r_sampling(x, r = -.85, 
                                             n = 30),
                geom = "area", fill = purples[3]) +
  scale_x_continuous(limits = c(-.99, .99)) +
  ggtitle("r = -.85, n = 30")+
  theme_bw(base_size = 20)

cor_75b = ggplot(data.frame(x = seq(-.99, .99)), aes(x)) +
  stat_function(fun = function(x) r_sampling(x, r = .75, 
                                             n = 150),
                geom = "area", fill = purples[3]) +
  scale_x_continuous(limits = c(-.99, .99)) +
  ggtitle("r = .75, n = 150") +
  theme_bw(base_size = 20)

cor_32b = ggplot(data.frame(x = seq(-.99, .99)), aes(x)) +
  stat_function(fun = function(x) r_sampling(x, r = .32, 
                                             n = 150),
                geom = "area", fill = purples[2]) +
  scale_x_continuous(limits = c(-.99, .99)) +
  ggtitle("r = .32, n = 150")+
  theme_bw(base_size = 20)


cor_n85b = ggplot(data.frame(x = seq(-.99, .99)), aes(x)) +
  stat_function(fun = function(x) r_sampling(x, r = -.85, 
                                             n = 150),
                geom = "area", fill = purples[3]) +
  scale_x_continuous(limits = c(-.99, .99)) +
  ggtitle("r = -.85, n = 150")+
  theme_bw(base_size = 20)


ggpubr::ggarrange(cor_n85, cor_32, cor_75, 
                  cor_n85b, cor_32b, cor_75b)

Skewed sampling distribution will rear their heads when:

  • \(H_{0}: \rho \neq 0\)

  • Calculating confidence intervals

  • Testing two correlations against one another

Code
ggplot(data.frame(x = seq(-.99, .99)), aes(x)) +
  stat_function(fun = function(x) r_sampling(x, r = .75, 
                                             n = 30),
                geom = "area", fill = purples[3]) +
  scale_x_continuous(limits = c(-.99, .99)) +
  ggtitle("r = .75, n = 30") +
  theme_bw(base_size = 20)

Fisher’s r to z’ transformation

  • r to z’:

\[ z^{'} = {\frac{1}{2}}ln{\frac{1+r}{1-r}}\]

Fisher’s r to z’ transformation

Code
r = seq(-.99,.99,.01)
z = psych::fisherz(r)
data.frame(r,z) %>%
  ggplot(aes(x = r, y = z)) +
  geom_line() +
  scale_x_continuous(expr(r))+
  scale_y_continuous(expr(z_r))+
  theme_bw(base_size = 20)

Steps for computing confidence interval

  1. Transform r into z’ 1
  2. Compute CI as you normally would using z’
  3. Reconvert back to r

\[ SE_z = \frac{1}{\sqrt{N-3}}\]

\[r = {\frac{e^{2z'}-1}{e^{2z'}+1}}\]

Example

In a sample of 42 students, you calculate a correlation of 0.44 between hours spent outside on Saturday and self-rated health. What is the precision of your estimate?

\[z = {\frac{1}{2}}ln{\frac{1+.44}{1-.44}} = 0.47\] \[SE_z = \frac{1}{\sqrt{42-3}} = 0.16\]

\[CI_{Z_{LB}} = 0.47-(1.96)0.16 = 0.16\]

\[CI_{Z_{UB}} = 0.47+(1.96)0.16 = 0.79\]

\[CI_{r_{LB}} = {\frac{e^{2(0.16)}-1}{e^{2(0.16)}+1}} = 0.16\]

\[CI_{r_{UB}} = {\frac{e^{2(0.79)}-1}{e^{2(0.79)}+1}} = 0.66\]

These formulas are easy to mistype in R – use the psych package and its functions fisherz() and fisherz2r() to save time and reduce human error.

library(psych)
z_r = fisherz(.44)
se = 1/( sqrt(42-3) )
cv = qnorm(p = .975, lower.tail = T)
lbz = z_r - ( se*cv )
lb = fisherz2r(lbz)
ubz = z_r + ( se*cv )
ub = fisherz2r(ubz)
lb; ub
[1] 0.1570741
[1] 0.6561805

Comparing two correlations

Again, we use the Fisher’s r to z’ transformation. Here, we’re transforming the correlations into z’s, then using the difference between z’s to calculate the test statistic.

\[Z = \frac{z_1^{'}- z_2^{'}}{se_{z_1-z_2}}\]

\[se_{z_1-z_2} = \sqrt{se_{z_1}+se_{z_2}} = \sqrt{\frac{1}{n_1-3}+\frac{1}{n_2-3}}\]

Example

You measure narcissism and happiness in two sets of adults: young adults (19-25) and older adults (over 25). You calculate the correlations separately for these two groups, and you want to know whether this relationship is stronger for one group or another.

Young adults

\[N = 327\] \[r = .402\]

Older adults

\[N = 273\] \[r = .283\]

\[H_0:\rho_1 = \rho_2\] \[H_1:\rho_1 \neq \rho_2\]

z1 = fisherz(.402)
[1] 0.4260322
z2 = fisherz(.283)
[1] 0.2909403
n1 = 327
n2 = 273
se = sqrt(1/(n1-3) + 1/(n2-3))
[1] 0.08240221
zstat = (z1-z2)/se
[1] 1.639421
pnorm(abs(zstat), lower.tail = F)*2
[1] 0.1011256

Effect size

  • The strength of relationship between two variables

  • \(\eta^2\), Cohen’s d, Cohen’s f, hedges g, \(R^2\) , Risk-ratio, etc

  • Significance is a function of effect size and sample size

  • Statistical significance \(\neq\) practical significance

Next time…

More correlations