Chi-sq Test of Independence

class: center, middle, inverse, title-slide

# Chi-sq Test of Independence

---

Another common way that categorical data are encountered in research is in a cross-tabulation.  The simplest is a two-way table representing how two categorical variables are related.

The second example data set comes from Goldstein et al. (2012), who followed 413 young participants with bipolar disorder over an average of 5 years to determine the predictors of suicide attempts.  One predictor was whether a first-degree or second-degree relative had ever been diagnosed with depression.

.small[Goldstein et al. (2012). Predictors of prospectively examined suicide attempts among youth with bipolar disorder. *Archives of General Psychiatry, 69*, 1113-1122.
]
---
<table style="border-collapse:collapse; border:none;">
 <tr>
 <th style="border-top:double; text-align:center; font-style:italic; font-weight:normal; border-bottom:1px solid;" rowspan="2">Depression</th>
 <th style="border-top:double; text-align:center; font-style:italic; font-weight:normal;" colspan="2">Suicide</th>
 <th style="border-top:double; text-align:center; font-style:italic; font-weight:normal; font-weight:bolder; font-style:italic; border-bottom:1px solid; " rowspan="2">Total</th>
 </tr>
 
<tr>
 <td style="border-bottom:1px solid; text-align:center; padding:0.2cm;">No Attempt</td>
 <td style="border-bottom:1px solid; text-align:center; padding:0.2cm;">Attempt</td>
 </tr>
 
<tr> 
<td style="padding:0.2cm; text-align:left; vertical-align:middle;">Nondepressed</td>
<td style="padding:0.2cm; text-align:center; ">70</td>
<td style="padding:0.2cm; text-align:center; ">4</td>
<td style="padding:0.2cm; text-align:center; ">74</td> 
</tr>
 
<tr> 
<td style="padding:0.2cm; text-align:left; vertical-align:middle;">Depressed</td>
<td style="padding:0.2cm; text-align:center; ">267</td>
<td style="padding:0.2cm; text-align:center; ">72</td>
<td style="padding:0.2cm; text-align:center; ">339</td> 
</tr>
 
<tr> 
<td style="padding:0.2cm; border-bottom:double; font-weight:bolder; font-style:italic; text-align:left; vertical-align:middle;">Total</td>
<td style="padding:0.2cm; text-align:center; border-bottom:double;">337</td>
<td style="padding:0.2cm; text-align:center; border-bottom:double;">76</td>
<td style="padding:0.2cm; text-align:center; border-bottom:double;">413</td> 
</tr>
 
</table>

Descriptively it appears that suicide attempts were more common among participants who had depressed relatives.

The rate of suicide attempts among participants without depressed relatives was (4/74) .054. The rate of suicide attempts among participants with depressed relatives was (72/339) .21.  Seems to be quite different.

---

What is the null hypothesis here?

There are several ways to think about it.  First, we could propose that the rate of suicide attempts is the same for those with depressed relatives and those without depressed relatives.  Second, we could propose that the rate of depressed relatives is the same for those who attempt suicide and those who do not attempt suicide.

Both of these are ways of saying that depression status of relatives is independent of suicide attempt status.

How do we turn that into expected frequencies so that we can use the same basic procedure that we used for the superpower data?

---
First, let’s convert the frequency table to proportions, which can also be viewed as the probabilities of the cells and margin totals.

<table style="border-collapse:collapse; border:none;">
 <tr>
 <th style="border-top:double; text-align:center; font-style:italic; font-weight:normal; border-bottom:1px solid;" rowspan="2">Depression</th>
 <th style="border-top:double; text-align:center; font-style:italic; font-weight:normal;" colspan="2">Suicide</th>
 <th style="border-top:double; text-align:center; font-style:italic; font-weight:normal; font-weight:bolder; font-style:italic; border-bottom:1px solid; " rowspan="2">Total</th>
 </tr>
 
<tr>
 <td style="border-bottom:1px solid; text-align:center; padding:0.2cm;">No Attempt</td>
 <td style="border-bottom:1px solid; text-align:center; padding:0.2cm;">Attempt</td>
 </tr>
 
<tr> 
<td style="padding:0.2cm; text-align:left; vertical-align:middle;">Nondepressed</td>
<td style="padding:0.2cm; text-align:center; ">70 16.9&nbsp;&#37;</td>
<td style="padding:0.2cm; text-align:center; ">4 1&nbsp;&#37;</td>
<td style="padding:0.2cm; text-align:center; ">74 17.9&nbsp;&#37;</td> 
</tr>
 
<tr> 
<td style="padding:0.2cm; text-align:left; vertical-align:middle;">Depressed</td>
<td style="padding:0.2cm; text-align:center; ">267 64.6&nbsp;&#37;</td>
<td style="padding:0.2cm; text-align:center; ">72 17.4&nbsp;&#37;</td>
<td style="padding:0.2cm; text-align:center; ">339 82&nbsp;&#37;</td> 
</tr>
 
<tr> 
<td style="padding:0.2cm; border-bottom:double; font-weight:bolder; font-style:italic; text-align:left; vertical-align:middle;">Total</td>
<td style="padding:0.2cm; text-align:center; border-bottom:double;">337 81.6&nbsp;&#37;</td>
<td style="padding:0.2cm; text-align:center; border-bottom:double;">76 18.4&nbsp;&#37;</td>
<td style="padding:0.2cm; text-align:center; border-bottom:double;">413 100&nbsp;&#37;</td> 
</tr>
 
</table>

---

If two events are independent, the probability of their joint occurrence is the product of their individual probabilities.

Under null hypothesis, we are assuming the two events (depressed status of relatives, suicide attempt status) are independent.

To get the expected probability of a joint event under the null hypothesis (a cell in the two-way table), we can take the product of the corresponding row and column probabilities.

`$$P_{rc} = P_{r.}P_{.c}$$`

---

<table style="border-collapse:collapse; border:none;">
 <tr>
 <th style="border-top:double; text-align:center; font-style:italic; font-weight:normal; border-bottom:1px solid;" rowspan="2">Depression</th>
 <th style="border-top:double; text-align:center; font-style:italic; font-weight:normal;" colspan="2">Suicide</th>
 <th style="border-top:double; text-align:center; font-style:italic; font-weight:normal; font-weight:bolder; font-style:italic; border-bottom:1px solid; " rowspan="2">Total</th>
 </tr>
 
<tr>
 <td style="border-bottom:1px solid; text-align:center; padding:0.2cm;">No Attempt</td>
 <td style="border-bottom:1px solid; text-align:center; padding:0.2cm;">Attempt</td>
 </tr>
 
<tr> 
<td style="padding:0.2cm; text-align:left; vertical-align:middle;">Nondepressed</td>
<td style="padding:0.2cm; text-align:center; ">16.9&nbsp;&#37;</td>
<td style="padding:0.2cm; text-align:center; ">1&nbsp;&#37;</td>
<td style="padding:0.2cm; text-align:center; ">17.9&nbsp;&#37;</td> 
</tr>
 
<tr> 
<td style="padding:0.2cm; text-align:left; vertical-align:middle;">Depressed</td>
<td style="padding:0.2cm; text-align:center; ">64.6&nbsp;&#37;</td>
<td style="padding:0.2cm; text-align:center; ">17.4&nbsp;&#37;</td>
<td style="padding:0.2cm; text-align:center; ">82&nbsp;&#37;</td> 
</tr>
 
<tr> 
<td style="padding:0.2cm; border-bottom:double; font-weight:bolder; font-style:italic; text-align:left; vertical-align:middle;">Total</td>
<td style="padding:0.2cm; text-align:center; border-bottom:double;">81.6&nbsp;&#37;</td>
<td style="padding:0.2cm; text-align:center; border-bottom:double;">18.4&nbsp;&#37;</td>
<td style="padding:0.2cm; text-align:center; border-bottom:double;">100&nbsp;&#37;</td> 
</tr>
 
</table>

.15 = .18(.82) = the expected proportion of the sample that has not attempted suicide and also does not have first-degree or second-degree relatives with depression.

These proportions can, in turn, be transformed to expected frequencies by multiplying by the sample size.

`$$E_{rc} = P_{rc}N$$`

---
<table style="border-collapse:collapse; border:none;">
 <tr>
 <th style="border-top:double; text-align:center; font-style:italic; font-weight:normal; border-bottom:1px solid;" rowspan="2">Depression</th>
 <th style="border-top:double; text-align:center; font-style:italic; font-weight:normal;" colspan="2">Suicide</th>
 <th style="border-top:double; text-align:center; font-style:italic; font-weight:normal; font-weight:bolder; font-style:italic; border-bottom:1px solid; " rowspan="2">Total</th>
 </tr>
 
<tr>
 <td style="border-bottom:1px solid; text-align:center; padding:0.2cm;">No Attempt</td>
 <td style="border-bottom:1px solid; text-align:center; padding:0.2cm;">Attempt</td>
 </tr>
 
<tr> 
<td style="padding:0.2cm; text-align:left; vertical-align:middle;">Nondepressed</td>
<td style="padding:0.2cm; text-align:center; ">60</td>
<td style="padding:0.2cm; text-align:center; ">14</td>
<td style="padding:0.2cm; text-align:center; ">74</td> 
</tr>
 
<tr> 
<td style="padding:0.2cm; text-align:left; vertical-align:middle;">Depressed</td>
<td style="padding:0.2cm; text-align:center; ">277</td>
<td style="padding:0.2cm; text-align:center; ">62</td>
<td style="padding:0.2cm; text-align:center; ">339</td> 
</tr>
 
<tr> 
<td style="padding:0.2cm; border-bottom:double; font-weight:bolder; font-style:italic; text-align:left; vertical-align:middle;">Total</td>
<td style="padding:0.2cm; text-align:center; border-bottom:double;">337</td>
<td style="padding:0.2cm; text-align:center; border-bottom:double;">76</td>
<td style="padding:0.2cm; text-align:center; border-bottom:double;">413</td> 
</tr>
 
</table>

The expected cell frequencies can be obtained from the marginal frequencies as well:

`$$E_{rc} = \frac{O_{r.}O_{.c}}{N}$$`
---

Once the observed and expected frequencies are determined, the `$\chi^2$` statistic is calculated as before:

`$$\chi^2_{df = (r-1)(c-1)} = \sum^r_{i=1}\sum^c_{j=1}\frac{(O_{ij}-E_{ij})^2}{E_{ij}}$$`
---

```r
cross_tabs = table(suicide$depression, suicide$attempt)
chisq.test(cross_tabs, correct = F)
```

```
## 
## 	Pearson's Chi-squared test
## 
## data:  cross_tabs
## X-squared = 10.141, df = 1, p-value = 0.00145
```

The obtained `$\chi^2$` is rare under the null hypothesis of no relationship between the two event categories. We can reject the null: Suicide attempts are more common among individuals with relatives who have been diagnosed with depression.

---

The test statistic has a single degree of freedom, determined by (number of rows-1)(number of columns-1).

This might seem odd given that there are 4 cells in the two-way table and by our prior logic that might imply df=4-1=3.

But, we have three constraints on the model:
 
* The cell frequencies must sum to the overall sample size

* The row totals must sum to the overall sample size

* The column totals must sum to the overall sample size

---

![](chi_square_independence_files/figure-html/unnamed-chunk-8-1.png)

---

The `chisq.test()` function defaults to a correction that reduces the size of the test statistic, making it more conservative.

```r
cross_tabs = table(suicide$depression, suicide$attempt)
chisq.test(cross_tabs, correct = F)
```

```
## 
## 	Pearson's Chi-squared test
## 
## data:  cross_tabs
## X-squared = 10.141, df = 1, p-value = 0.00145
```

```r
chisq.test(cross_tabs)
```

```
## 
## 	Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  cross_tabs
## X-squared = 9.1142, df = 1, p-value = 0.002536
```

---
Categorical data are discrete, and under the null hypothesis, we can think of the cell frequencies as arising from a binomial distribution with N trials and cell probabilities equal to the null hypothesis values `$(P_i)$`.

The observed frequencies are compared to the expected frequencies under the  null hypothesis.  The differences are capture in the form of the Pearson `$\chi^2$` statistic.

The sampling distribution for that statistic is approximated by the `$\chi^2$` family of distributions. But, the `$\chi^2$` distribution is continuous and so can only approximate the sampling behavior of a discrete variable.  That approximation will be poorest with small sample sizes, where the discrete nature of the data is more apparent.

---
A number of solutions have been suggested to improve the accuracy of the `$\chi^2$` approximation with small samples. The most common is the Yates correction (also called the correction for continuity):

`$$\chi^2_{df = k-1} = \sum^k_{i=1}\frac{(|O_i-E_i|-.5)^2}{E_i}$$`

`$$\chi^2_{df = (r-1)(c-1)} = \sum^r_{i=1}\sum^c_{j=1}\frac{(|O_{ij}-E_{ij}| - .5)^2}{E_{ij}}$$`

The correction penalizes the `$\chi^2$` statistic.  The intent is to remove a positive bias for small samples and make it less likely that a false rejection will be made.  The correction is recommended for small samples and when expected frequencies in any cell are low (some say 5, others say 10).

---

Effect sizes frequently accompany chi-square tests of independence.  The particular effect size used depends on whether the two-way contingency table is 2 x 2 or has more than 2 categories for one of the nominal variables.

Effect sizes for 2 x 2 tables:

* Phi coefficient
*	Odds ratio

Effect sizes for larger two-way tables:

* Cramer’s V

---

The phi `$(\phi)$` coefficient is simply the Pearson product-moment correlation applied to two binary variables.  It has the usual range for correlations (-1 to 1) and when squared has the usual "proportion of variance shared" interpretation.

It can be calculated in the usual way or it can be obtained from the  `$\chi^2$` statistic for the independence test:

`$$\phi = \sqrt{\frac{\chi^2}{N}}$$`

---

```r
# as a correlation (require numeric variables)
cor(as.numeric(suicide$depression),
    as.numeric(suicide$attempt))
```

```
## [1] 0.156701
```

```r
# from the test statistic
chi_fit = chisq.test(table(suicide$depression, suicide$attempt),
                     correct = F)

sqrt(chi_fit$statistic/nrow(suicide))
```

```
## X-squared 
##  0.156701
```

Proportion of variance shared:

```r
cor(as.numeric(suicide$depression),
    as.numeric(suicide$attempt))^2
```

```
## [1] 0.02455521
```

---

The odds ratio effect size begins with frequencies.

**Observed**

<table style="border-collapse:collapse; border:none;">
 <tr>
 <th style="border-top:double; text-align:center; font-style:italic; font-weight:normal; border-bottom:1px solid;" rowspan="2">Depression</th>
 <th style="border-top:double; text-align:center; font-style:italic; font-weight:normal;" colspan="2">Suicide</th>
 <th style="border-top:double; text-align:center; font-style:italic; font-weight:normal; font-weight:bolder; font-style:italic; border-bottom:1px solid; " rowspan="2">Total</th>
 </tr>
 
<tr>
 <td style="border-bottom:1px solid; text-align:center; padding:0.2cm;">No Attempt</td>
 <td style="border-bottom:1px solid; text-align:center; padding:0.2cm;">Attempt</td>
 </tr>
 
<tr> 
<td style="padding:0.2cm; text-align:left; vertical-align:middle;">Nondepressed</td>
<td style="padding:0.2cm; text-align:center; ">70</td>
<td style="padding:0.2cm; text-align:center; ">4</td>
<td style="padding:0.2cm; text-align:center; ">74</td> 
</tr>
 
<tr> 
<td style="padding:0.2cm; text-align:left; vertical-align:middle;">Depressed</td>
<td style="padding:0.2cm; text-align:center; ">267</td>
<td style="padding:0.2cm; text-align:center; ">72</td>
<td style="padding:0.2cm; text-align:center; ">339</td> 
</tr>
 
<tr> 
<td style="padding:0.2cm; border-bottom:double; font-weight:bolder; font-style:italic; text-align:left; vertical-align:middle;">Total</td>
<td style="padding:0.2cm; text-align:center; border-bottom:double;">337</td>
<td style="padding:0.2cm; text-align:center; border-bottom:double;">76</td>
<td style="padding:0.2cm; text-align:center; border-bottom:double;">413</td> 
</tr>
 
</table>

---

The odds ratio effect size begins with frequencies.

**Observed**

|              | No Attempt | Attempt | Row Total |
|--------------|:----------:|:-------:|:---------:|
| Nondepressed | `$O_{11}$`   | `$O_{12}$`| `$O_{1.}$`  |
| Depressed    | `$O_{21}$`   | `$O_{22}$`| `$O_{2.}$`  |
| Column total | `$O_{.1}$`   | `$O_{.2}$`| `$O_{..}$`  |

.pull-left[
* SA = Suicide Attempt
* NR = Nondepressed Relatives
* DR = Depressed Relatives

From the cell and marginal frequencies we can define the following:
]

.pull-right[
`$$P(SA|NR) = \frac{O_{12}}{O_{1.}}$$`
`$$P(SA|DR) = \frac{O_{22}}{O_{2.}}$$`

]

---

The odds ratio effect size begins with frequencies.

**Observed**

.pull-left[
* SA = Suicide Attempt
* NR = Nondepressed Relatives
* DR = Depressed Relatives

From the cell and marginal frequencies we can define the following:
]

.pull-right[
`$$Odds(SA|NR) = \frac{P(SA|NR)}{1-P(SA|NR)}$$`
`$$Odds(SA|DR) = \frac{P(SA|DR)}{1-P(SA|DR)}$$`
]
---

The odds ratio effect size begins with frequencies.

**Observed**

.pull-left[
* SA = Suicide Attempt
* NR = Nondepressed Relatives
* DR = Depressed Relatives

From the cell and marginal frequencies we can define the following:
]

.pull-right[
`$$OddsRatio = \frac{Odds(SA|NR)}{Odds(SA|DR)}$$`

]

---
`$$P(SA|NR) = \frac{4}{74} = .0541$$`
`$$P(SA|DR) = \frac{72}{339} = .212$$`

`$$Odds(SA|NR) = \frac{.0541}{1-.0541} = .0572$$`
`$$Odds(SA|DR) = \frac{.212}{1-.212} = .269$$`

`$$Odds Ratio = \frac{.269}{.0572} = 4.72$$`

The odds of attempting suicide if you have depressed relatives are 4.72 times the odds of attempting suicide if you don’t have depressed relatives.

---

There is a convenient short-cut calculation based on the cell frequencies:

`$$Odds Ratio = \frac{O_{11}O_{22}}{O_{12}O_{21}} = \frac{70(72)}{4(267)} = 4.72$$`
---
When tables are larger than 2 x 2, Cramer’s V is the most common effect size and follows in form the calculation of the phi coefficient:

`$$V = \sqrt{\frac{\chi^2}{min(r-1,c-1)(N)}}$$`

When df=1 (a 2 x 2 table), `$V = \phi$`.  In all cases, it is bounded by 0 (no association) and 1 (perfect association).