Homework 1 due in < two weeks
Measure yourself for an in-class demo today tinyurl.com/uwn463vj
Sums of squares
\[ \small SS = \Sigma(X_i-\bar{X})^2 \]
Variance
\[ \small \sigma^2 = \frac{\Sigma(X_i-\bar{X})^2}{N} = \frac{SS}{N} \]
Standard deviation
\[ \scriptsize \sigma = \sqrt{\frac{\Sigma(X_i-\bar{X})^2}{N}}= \sqrt{\frac{SS}{N}} = \sqrt{\sigma^2} \]
Sums of squares
\[ \small SS = \Sigma(X_i-\bar{X})^2 \]
Variance
\[ \small \hat{\sigma}^2 = s^2 = \frac{\Sigma(X_i-\bar{X})^2}{N-1} = \frac{SS}{N-1} \]
Standard deviation
\[ \scriptsize \hat{\sigma} = s = \sqrt{\frac{\Sigma(X_i-\bar{X})^2}{N-1}}= \sqrt{\frac{SS}{N-1}} = \sqrt{s^2} \]
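A minimal check in R, using an arbitrary made-up vector x, that these sample formulas reproduce the built-in var() and sd():
x = c(4, 7, 8, 6, 5) # any small vector will do
SS = sum((x - mean(x))^2) # sum of squares
SS / (length(x) - 1) # sample variance; same as var(x)
sqrt(SS / (length(x) - 1)) # sample standard deviation; same as sd(x)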
“Sum of the cross-products”
\[SP_{XY} =\Sigma(X_i−\mu_X)(Y_i−\mu_Y)\]
\[ SP_{XY} =\Sigma(X_i−\bar{X})(Y_i−\bar{Y})\]
Sort of like the variance of two variables
\[\sigma_{XY} =\frac{\Sigma(X_i−\mu_X)(Y_i−\mu_Y)}{N}\]
\[s_{XY} = cov_{XY} =\frac{\Sigma(X_i−\bar{X})(Y_i−\bar{Y})}{N-1}\]
\[\Large \mathbf{K_{XX}} = \left[\begin{array} {rrr} \sigma^2_X & cov_{XY} & cov_{XZ} \\ cov_{YX} & \sigma^2_Y & cov_{YZ} \\ cov_{ZX} & cov_{ZY} & \sigma^2_Z \end{array}\right]\]
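A small sketch of these quantities in R, with made-up vectors x, y, and z; cov() applied to a data frame returns the full variance-covariance matrix shown above:
x = c(4, 7, 8, 6, 5)
y = c(1, 3, 5, 4, 2)
z = c(10, 9, 12, 11, 8)
SP = sum((x - mean(x)) * (y - mean(y))) # sum of cross-products
SP / (length(x) - 1) # sample covariance; same as cov(x, y)
cov(data.frame(x, y, z)) # variance-covariance matrix (variances on the diagonal)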
Measure of association
How much two variables are linearly related
-1 to 1
Sign indicates direction of relationship
Invariant to changes in mean or scaling
Pearson product moment correlation
\[\rho_{XY} = \frac{\Sigma z_Xz_Y}{N} = \frac{SP}{\sqrt{SS_X}\sqrt{SS_Y}} = \frac{\sigma_{XY}}{\sigma_X \sigma_Y}\]
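To see the equivalence numerically, here is a sketch with made-up vectors; the SP/SS version matches cor() exactly, with no N versus N - 1 bookkeeping:
x = c(4, 7, 8, 6, 5)
y = c(1, 3, 5, 4, 2)
SP = sum((x - mean(x)) * (y - mean(y)))
SP / sqrt(sum((x - mean(x))^2) * sum((y - mean(y))^2)) # SP / (sqrt(SSx) * sqrt(SSy))
cor(x, y) # built-in Pearson correlation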
library(MASS) # for mvrnorm
library(tidyverse) # for ggplot2 and the pipe
set.seed(101019) # so we all get the same random numbers
mu = c(50, 5) # means of two variables (MX = 50, MY = 5)
Sigma = matrix(c(.8, .5, .5, .7), ncol = 2) # covariance matrix: variances on the diagonal, covariances off the diagonal
data = mvrnorm(n = 150, mu = mu, Sigma = Sigma)
data = as.data.frame(data)
colnames(data) = c("x", "y")
data %>% ggplot(aes(x = x, y = y)) + geom_point(size = 3) + theme_bw()
What is the correlation between these two variables?
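One way to check, using the data object simulated in the chunk above:
cor(data$x, data$y) # Pearson correlation of the simulated x and y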
set.seed(101019) # so we all get the same random numbers
mu = c(10, 100)
Sigma = matrix(c(.8, -.3, -.3, .7), ncol = 2) # covariance matrix: variances on the diagonal, covariances off the diagonal
data = mvrnorm(n = 150, mu = mu, Sigma = Sigma)
data = as.data.frame(data)
colnames(data) = c("x", "y")
data %>% ggplot(aes(x = x, y = y)) + geom_point(size = 3) + theme_bw()
What is the correlation between these two variables?
set.seed(101019) # so we all get the same random numbers
mu = c(3, 4)
Sigma = matrix(c(.8, 0, 0, .7), ncol = 2) # covariance matrix: variances on the diagonal, covariances off the diagonal
data = mvrnorm(n = 150, mu = mu, Sigma = Sigma)
data = as.data.frame(data)
colnames(data) = c("x", "y")
data %>% ggplot(aes(x = x, y = y)) + geom_point(size = 3) + theme_bw()
What is the correlation between these two variables?
Recall that z-scores allow us to compare across units of measure; the products of standardized scores are themselves standardized.
The correlation coefficient is a standardized effect size that can be used to communicate the strength of a relationship.
Correlations can be compared across studies, measures, constructs, time.
Example: the correlation between age and height among children is \(r = .70\). The correlation between self- and other-ratings of extraversion is \(r = .25\).
- Cohen (1988): .1 (small), .3 (medium), .5 (large)
Often forgotten: Cohen said to use these benchmarks only when you have nothing else to go on, and he later regretted ever suggesting them.
Rosenthal & Rubin (1982): life and death (the Binomial Effect Size Display)
Effect sizes are often misinterpreted. How?
What can fix this?
Pitfalls of small effects and large effects
Recommendations?
It’s not enough to calculate a correlation between two variables. You should always plot the data to make sure the number accurately describes the relationship. Correlations can easily be fooled by qualities of your data, such as:
Skewed distributions
Outliers
Restriction of range
Nonlinearity
library(ggExtra) # for ggMarginal
set.seed(101019) # so we all get the same random numbers
mu = c(3, 4)
Sigma = matrix(c(.8, .2, .2, .7), ncol = 2) # covariance matrix: variances on the diagonal, covariances off the diagonal
data = mvrnorm(n = 150, mu = mu, Sigma = Sigma)
data = as.data.frame(data)
colnames(data) = c("x", "y")
data$x = data$x^4 # transform x to create a strongly skewed distribution
p = data %>% ggplot(aes(x = x, y = y)) + geom_point()
ggMarginal(p, type = "density")
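Because the transformation makes x heavily skewed, the Pearson correlation (computed on raw scores) and the Spearman correlation (computed on ranks) can diverge; continuing with the data object above:
cor(data$x, data$y) # Pearson
cor(data$x, data$y, method = "spearman") # Spearman (rank-order)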
set.seed(101019) # so we all get the same random numbers
mu = c(3, 4)
Sigma = matrix(c(.8, 0, 0, .7), ncol = 2) # covariance matrix: variances on the diagonal, covariances off the diagonal
data = mvrnorm(n = 50, mu = mu, Sigma = Sigma)
data = as.data.frame(data)
colnames(data) = c("x", "y")
data[51, ] = c(7, 10) # add an outlier as case 51
data %>% ggplot(aes(x=x, y=y)) + geom_point()
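To see how much the single added point matters, compare the correlation with and without row 51 (using the data object above):
cor(data$x[1:50], data$y[1:50]) # the original 50 simulated cases
cor(data$x, data$y) # with the extra point included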
set.seed(101019) # so we all get the same random numbers
mu = c(3, 4)
n = 15
Sigma = matrix(c(.9, .8, .8, .9), ncol = 2) # covariance matrix: variances on the diagonal, covariances off the diagonal
data = mvrnorm(n = n, mu = mu, Sigma = Sigma)
data = as.data.frame(data)
colnames(data) = c("x", "y")
data[n+1, ] = c(1.5, 5.5) # add one extra (potentially influential) point away from the main cloud
set.seed(1010191) # so we all get the same random numbers
mu = c(100, 4)
Sigma = matrix(c(.7, .4, .4, .75), ncol = 2) # covariance matrix: variances on the diagonal, covariances off the diagonal
data = mvrnorm(n = 150, mu = mu, Sigma = Sigma)
data = as.data.frame(data)
colnames(data) = c("x", "y")
real_data = data # keep the full sample before restricting the range
data = filter(data, x > 100 & x < 101) # restrict the range of x
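Comparing the full and range-restricted samples (both objects come from the chunk above) shows what restriction of range does to the correlation:
cor(real_data$x, real_data$y) # all 150 simulated cases
cor(data$x, data$y) # only cases with 100 < x < 101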
Sometimes issues that affect correlations won’t appear in your graph, but you still need to know how to look for them.
Low reliability
Content overlap
Multiple groups
\[r_{xy} = \rho_{xy}\sqrt{r_{xx}r_{yy}}\]
This means that our estimate of the population correlation coefficient is attenuated in proportion to the reduction in reliability.
If you have a bad measure of X or Y, you should expect a lower estimate of \(\rho\).
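A small sketch of the correction for attenuation implied by this formula; the numbers here are made up for illustration:
r_xy = .30 # observed correlation
r_xx = .70 # reliability of X
r_yy = .80 # reliability of Y
r_xy / sqrt(r_xx * r_yy) # estimated disattenuated (population) correlation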
If your operationalization Y of Construct B includes items (or tasks or manipulations) that could also be influenced by Construct A, then the correlation between X and Y will be inflated.
Example: SAT scores and IQ tests
Example: Depression and number of hours sleeping
Which kind of validity is this associated with?
Add your height (in inches), forearm length (in inches), and gender to this spreadsheet: tinyurl.com/uwn463vj
set.seed(101019) # so we all get the same random numbers
m_mu = c(100, 4)
m_Sigma = matrix(c(.7, .4, .4, .75), ncol = 2) # covariance matrix: variances on the diagonal, covariances off the diagonal
m_data = mvrnorm(n = 150, mu = m_mu, Sigma = m_Sigma)
m_data = as.data.frame(m_data)
colnames(m_data) = c("x", "y")
f_mu = c(102, 3)
f_Sigma = matrix(c(.7, .4, .4, .75), ncol = 2) # covariance matrix: variances on the diagonal, covariances off the diagonal
f_data = mvrnorm(n = 150, mu = f_mu, Sigma = f_Sigma)
f_data = as.data.frame(f_data)
colnames(f_data) = c("x", "y")
m_data$gender = "male"
f_data$gender = "female"
data = rbind(m_data, f_data)
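To see the multiple-groups problem numerically, compare the overall correlation to the correlation within each group (assumes the tidyverse is loaded, as in the first chunk):
cor(data$x, data$y) # both groups pooled
data %>% group_by(gender) %>% summarize(r = cor(x, y)) # within each group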
For Spearman, you’ll get a different answer.
Here are two ways to analyze these data
If your data are naturally binary, there is no difference between Pearson and point-biserial.
               x y
[1,] -0.48974849 1
[2,] -2.53667101 0
[3,]  0.03521883 1
[4,]  0.03043436 0
[5,] -0.27043857 0
[6,] -0.55228283 1
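As a check on that claim, here is a self-contained sketch (with data simulated here for illustration, not the data frame printed above) showing that cor() applied to a 0/1 variable reproduces the point-biserial formula:
set.seed(1)
x = rnorm(100)
y = rbinom(100, size = 1, prob = .5) # naturally binary variable
cor(x, y) # Pearson correlation with a binary variable
m1 = mean(x[y == 1]) # mean of x in the y = 1 group
m0 = mean(x[y == 0]) # mean of x in the y = 0 group
p = mean(y) # proportion in the y = 1 group
q = 1 - p
s_n = sqrt(sum((x - mean(x))^2) / length(x)) # SD of x with N in the denominator
(m1 - m0) / s_n * sqrt(p * q) # point-biserial formula; matches cor(x, y)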
Here are two ways to analyze these data
If you artificially dichotomize data, there can be big differences. This is bad.
Here are two ways to analyze these data
Why do we have special cases of the correlation?
Sometimes we get different results
Sometimes we get the same result
Even when formulas are different
Example: Point biserial formula
Probability!