class: center, middle, inverse, title-slide

# Comparing two means I

---

# Today

* One mean versus the population (review)
* Two independent means

---

# One-sample *t*-test

The one-sample *t*-test is appropriate when a single sample mean is compared to a population mean but the population standard deviation is unknown. A sample estimate of the population standard deviation is used instead.

The appropriate sampling distribution is the *t*-distribution, with N-1 degrees of freedom.

`$$t_{df=N-1} = \frac{\bar{X}-\mu}{\frac{\hat{\sigma}}{\sqrt{N}}}$$`

---

## A new example

We can again use the Census at School database, which includes six questions asking students to rate the following issues on a scale from 0 (not important) to 1000 (very important):

* Reducing pollution
* Recycling
* Conserving water
* Conserving energy (electricity, gas, oil for heating, lighting, car travel)
* Owning a computer
* Access to the internet

A random sample of 247 Oregon high schoolers can be compared to the mean ratings of high schoolers in the rest of the United States.

---

```r
school = read_csv(here("data/census_at_school.csv"))
school = school %>% filter(ClassGrade >= 9)

usa = read_csv(here("data/census_at_school_usa.csv"))
usa = usa %>% filter(ClassGrade >= 9 & Region != "OR")
```

---

The null and alternative hypotheses:

`$$H_0: \mu_{OR} = \mu_{USA}$$`
`$$H_1: \mu_{OR} \neq \mu_{USA}$$`

```r
t.test(school$Importance_reducing_pollution, 
       mu = mean(usa$Importance_reducing_pollution, na.rm = T), 
       alternative = "two.sided", 
       conf.level = .95)
```

<table class="table" style="margin-left: auto; margin-right: auto;">
 <thead>
  <tr>
   <th style="text-align:left;"> Issue </th>
   <th style="text-align:right;"> USA Mean </th>
   <th style="text-align:right;"> OR Mean </th>
   <th style="text-align:right;"> t value </th>
   <th style="text-align:right;"> p value </th>
  </tr>
 </thead>
 <tbody>
  <tr>
   <td style="text-align:left;"> reducing_pollution </td>
   <td style="text-align:right;"> 778.27 </td>
   <td style="text-align:right;"> 787.04 </td>
   <td style="text-align:right;"> 0.13 </td>
   <td style="text-align:right;"> 0.90 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> recycling_rubbish </td>
   <td style="text-align:right;"> 794.71 </td>
   <td style="text-align:right;"> 714.85 </td>
   <td style="text-align:right;"> -1.70 </td>
   <td style="text-align:right;"> 0.09 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> conserving_water </td>
   <td style="text-align:right;"> 645.47 </td>
   <td style="text-align:right;"> 628.03 </td>
   <td style="text-align:right;"> -0.37 </td>
   <td style="text-align:right;"> 0.71 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> saving_energy </td>
   <td style="text-align:right;"> 776.94 </td>
   <td style="text-align:right;"> 652.66 </td>
   <td style="text-align:right;"> -2.64 </td>
   <td style="text-align:right;"> 0.01 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> owning_computer </td>
   <td style="text-align:right;"> 648.93 </td>
   <td style="text-align:right;"> 458.85 </td>
   <td style="text-align:right;"> -3.27 </td>
   <td style="text-align:right;"> 0.00 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Internet_access </td>
   <td style="text-align:right;"> 856.06 </td>
   <td style="text-align:right;"> 670.62 </td>
   <td style="text-align:right;"> -3.03 </td>
   <td style="text-align:right;"> 0.00 </td>
  </tr>
 </tbody>
</table>
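---

We can verify the first row of this table by hand. A minimal sketch, computing the statistic directly from the formula above (`x` and `t_stat` are just illustrative names; this assumes the same listwise deletion of missing cases that `t.test()` applies):

```r
x  = na.omit(school$Importance_reducing_pollution)
mu = mean(usa$Importance_reducing_pollution, na.rm = T)

# t = (observed mean - population mean) / estimated standard error
(t_stat = (mean(x) - mu) / (sd(x) / sqrt(length(x))))

# two-sided p-value from the t-distribution with N-1 df
2 * pt(abs(t_stat), df = length(x) - 1, lower.tail = F)
```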
---

.small[The lsr package has t-test functions that provide more detailed output.]

```r
lsr::oneSampleTTest(x = school$Importance_reducing_pollution, 
                    mu = mean(usa$Importance_reducing_pollution, na.rm = T))
```

```
## Warning in lsr::oneSampleTTest(x = school$Importance_reducing_pollution, :
## 33 case(s) removed due to missingness
```

```
## 
##    One sample t-test 
## 
## Data variable:   school$Importance_reducing_pollution 
## 
## Descriptive statistics: 
##             Importance_reducing_pollution
##    mean                           787.036
##    std dev.                       932.148
## 
## Hypotheses: 
##    null:        population mean equals 778.2687 
##    alternative: population mean not equal to 778.2687 
## 
## Test results: 
##    t-statistic:  0.132 
##    degrees of freedom:  196 
##    p-value:  0.895 
## 
## Other information: 
##    two-sided 95% confidence interval:  [656.06, 918.011] 
##    estimated effect size (Cohen's d):  0.009
```

---

### Assumptions

* Normality
* Independence

If population scores are normally distributed, the sampling distribution of the mean will also be normally distributed. If the data are independent, then the *t*-distribution chosen by `\(df = N-1\)` will provide an accurate approximation of the sampling distribution of the mean for purposes of making inferences.

---

## Independent samples *t*-test

The independent samples *t*-test is appropriate when the means from two independent samples are compared.

`$$H_0: \mu_1 = \mu_2$$`
`$$H_1: \mu_1 \neq \mu_2$$`

The test statistic is (roughly) defined by

`$$t = \frac{\bar{X}_1 - \bar{X}_2}{SE}$$`

The standard error of the mean difference can be estimated in two different ways. If the variances of the two groups are assumed to be the same, then pooling them provides the most powerful solution, known as Student's t-test. Otherwise, a separate-variances estimate is used, resulting in Welch's t-test.

---

More formally, the numerator of the *t*-statistic compares our observed difference to the hypothesized one:

`$$t = \frac{(\bar{X}_1 - \bar{X}_2)-(\mu_1 - \mu_2)}{SE}$$`

In other words, is the difference between the sample means different from the hypothesized difference in the population means?

* `\(\bar{X}_1 - \bar{X}_2\)` = our observed difference in (independent) means
* `\(\mu_1 - \mu_2\)` = the (null) hypothesized difference in (independent) means

In this sense, the independent samples t-test is similar to a one-sample test of differences.

.pull-left[
`$$H_0: \mu_1 - \mu_2 = 0$$`
]

.pull-right[
`$$H_1: \mu_1 - \mu_2 \neq 0$$`
]

---

Calculating the denominator of this test is not straightforward. If we think of our test as a one-sample test of a difference score, then we need the standard error of the difference in means, not of the means themselves.

`\(\text{SE (difference in means)} = \sigma_D\)`

To adjust our standard error of the mean calculation to reflect a difference in means, we make a small change:

`$$\sigma_D = \hat{\sigma}_p\sqrt{\frac{1}{N_1} + \frac{1}{N_2}}$$`

---

A key assumption of the independent samples t-test is **homogeneity of variance,** meaning that even if these samples come from different populations, those populations differ only in terms of their mean, not in terms of their variance. The benefit of making this assumption is that we can calculate a single estimate of variance and simply adjust the units (as in the previous slide).

If the variance in both samples is exactly the same, then we can plug `\(\hat{\sigma}\)` in for `\(\sigma_p\)` and be done.
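---

One way to convince yourself that this adjustment works is with a small simulation sketch (the `sigma` value and seed here are arbitrary; the sample sizes match the example coming up):

```r
set.seed(42)
sigma = 10; n1 = 122; n2 = 83

# 10,000 differences between the means of two independent samples
# drawn from populations with the same standard deviation
diffs = replicate(10000, mean(rnorm(n1, 0, sigma)) - mean(rnorm(n2, 0, sigma)))

sd(diffs)                  # empirical SD of the differences in means
sigma * sqrt(1/n1 + 1/n2)  # the formula from the previous slide
```

The two numbers agree closely — the units adjustment really does turn a standard deviation of scores into a standard error of mean differences.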
---

### Pooled variance

Rarely do we find that the variance estimates are *exactly* the same, in which case we need to combine the two estimates into a single variance number. This is called the **pooled variance**.

`$$\large{\hat{\sigma}^2_p = \frac{(N_1-1)\hat{\sigma}^2_1 + (N_2-1)\hat{\sigma}^2_2}{N_1+N_2-2}}$$`

--

The bottom part, `\(N_1+N_2-2\)`, is the degrees of freedom for an independent samples *t*-test: the total number of quantities ( `\(N_1+N_2\)` total scores) minus the number of constraints (2 means). Each variance is weighted based on its sample's contribution to the degrees of freedom.

---

So our final calculation for the standard error of the difference in independent means is

`$$\large{\hat{\sigma}_D = \sqrt{\frac{(N_1-1)\hat{\sigma}^2_1 + (N_2-1)\hat{\sigma}^2_2}{N_1+N_2-2}} \sqrt{\frac{1}{N_1} + \frac{1}{N_2}}}$$`

**Recap**

The first part of this equation pools the two variance estimates by weighting each one based on sample size. The second part changes the units from scores to differences in means.

* Recall that in hypothesis testing, we're not interested in the likelihood of scores, but in the likelihood of statistics.

---

At this point, the procedure of hypothesis testing goes back to normal. We have our sampling distribution under the null hypothesis, which is a *t* distribution defined by

$$
`\begin{aligned}
\mu &= 0 \\
\sigma_D &= \sqrt{\frac{(N_1-1)\hat{\sigma}^2_1 + (N_2-1)\hat{\sigma}^2_2}{N_1+N_2-2}} \sqrt{\frac{1}{N_1} + \frac{1}{N_2}} \\
df &= N_1 + N_2 - 2
\end{aligned}`
$$

In this case, we're using `\(\sigma_D\)` the same way we use `\(\sigma_M\)`; we just change the notation to reflect our interest in a difference in means, rather than the mean itself.

---

We calculate our test statistic

$$t = \frac{\bar{X}_1-\bar{X}_2}{\sigma_D}$$

and then find the probability of this test statistic, or one more extreme, under the null.

---

## Example

We turn again to the Census at School data. One question asked students to report how many hours per week they spend with friends (stored as `Hanging_Out_With_Friends_Hours`, shortened to `friends` below). We might be interested to know whether adolescent girls spend more time with friends than adolescent boys.

```r
table(school$Gender)
```

```
## 
## Female   Male 
##    144    101
```

```r
psych::describe(school$Hanging_Out_With_Friends_Hours)
```

```
##    vars   n  mean    sd median trimmed mad min  max range  skew kurtosis   se
## X1    1 208 23.59 78.41      9   12.49 8.9   0 1000  1000 10.28   118.18 5.44
```

---

.pull-left[
```r
school %>%
  ggplot(aes(x = friends)) +
  geom_density(color = "red", size = 2, fill = "red", alpha = .3) +
  geom_histogram(aes(y = ..density..), bins = 50) +
  labs(x = "Hours per week", title = "Full sample") +
  theme_minimal(base_size = 20)
```

![](15-two_means_I_files/figure-html/unnamed-chunk-6-1.png)<!-- -->
]

.pull-right[
```r
school %>%
  filter(friends < 168) %>%
  ggplot(aes(x = friends)) +
  geom_density(color = "red", size = 2, fill = "red", alpha = .3) +
  geom_histogram(aes(y = ..density..), bins = 50) +
  labs(x = "Hours per week", title = "Mathematically possible") +
  theme_minimal(base_size = 20)
```

![](15-two_means_I_files/figure-html/unnamed-chunk-7-1.png)<!-- -->
]
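---

Before excluding anyone, it's worth checking how many students report more hours than the 168 that exist in a week. A quick sketch (this count doesn't appear in the original output):

```r
# 24 hours x 7 days = 168; anything at or above that is impossible
sum(school$friends >= 168, na.rm = T)
```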
---

Maybe I'll exclude participants who report impossible numbers.

```r
school_new = filter(school, friends < 168)

psych::describeBy(school_new$friends, group = school_new$Gender)
```

```
## 
##  Descriptive statistics by group 
## group: Female
##    vars   n  mean    sd median trimmed   mad min max range skew kurtosis   se
## X1    1 122 18.31 21.44     10   14.19 10.38 0.3 124 123.7 2.14     5.45 1.94
## -------------------------------------------------------- 
## group: Male
##    vars  n  mean    sd median trimmed  mad min max range skew kurtosis   se
## X1    1 83 12.34 15.91      7    8.87 7.41   0  80    80 2.46     6.32 1.75
```

--

.pull-left[
$$
`\begin{aligned}
\bar{X}_1 &= 18.31 \\
\hat{\sigma}_1 &= 21.44 \\
N_1 &= 122 \\
\end{aligned}`
$$
]

.pull-right[
$$
`\begin{aligned}
\bar{X}_2 &= 12.34 \\
\hat{\sigma}_2 &= 15.91 \\
N_2 &= 83 \\
\end{aligned}`
$$
]

---

Next we build the sampling distribution under the null hypothesis.

$$
`\begin{aligned}
\mu &= 0 \\
\sigma_D &= \sqrt{\frac{(122-1){21.44}^2 + (83-1){15.91}^2}{122+83-2}} \sqrt{\frac{1}{122} + \frac{1}{83}} \\
&= 19.39\sqrt{\frac{1}{122} + \frac{1}{83}} = 2.76 \\
df &= 203
\end{aligned}`
$$

---

```r
(cv = qt(p = .975, df = 203))
```

```
## [1] 1.971719
```

![](15-two_means_I_files/figure-html/unnamed-chunk-11-1.png)<!-- -->

---

.left-column[We calculate our test statistic and compare it to our critical value.]

```r
# x1, x2, and s_d hold the values computed on the previous slides
(t = (x1-x2)/s_d)
```

```
## [1] 2.165904
```

![](15-two_means_I_files/figure-html/unnamed-chunk-13-1.png)<!-- -->

---

.left-column[
We can also calculate the probability of this test statistic, or one more extreme, under the null.

```r
# df = 203, as above
pt(t, df = df, lower.tail = F)*2
```

```
## [1] 0.03148413
```
]

![](15-two_means_I_files/figure-html/unnamed-chunk-15-1.png)<!-- -->

---

Alternatively, we can skip all of that and use an R function.

```r
t.test(friends ~ Gender, data = school_new, var.equal = T)
```

```
## 
##  Two Sample t-test
## 
## data:  friends by Gender
## t = 2.1659, df = 203, p-value = 0.03148
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##   0.5358328 11.4173373
## sample estimates:
## mean in group Female   mean in group Male 
##             18.31393             12.33735
```

---

## Confidence intervals

**Confidence intervals** are used to communicate how precisely our statistic estimates the parameter. As a reminder, they are grounded in frequentist probability: if we repeated our experiment many times, we would expect our 95% confidence interval to capture the true population parameter 95% of the time.

In an independent samples *t*-test, you have calculated three different statistics, and so you can construct three different confidence intervals.

---

### Confidence interval around the difference in means

The most interpretable statistic is the difference in means -- this is the statistic you are testing using NHST.

`$$CI_{\text{Difference}} = (\bar{X}_1 - \bar{X}_2) \pm \sigma_D(CV)$$`

.pull-left[
Your critical value (CV) is determined by your degrees of freedom and alpha. In our example, we used `\(\alpha = .05\)` and our df = `\(122+83-2 = 203\)`.

```r
qt(p = .975, df = 203)
```

```
## [1] 1.971719
```
]

--

.pull-right[
`$$CI_{\text{Difference}} = \\ (18.31 - 12.34) \pm 2.76(1.97)$$`

`$$[0.53, 11.41]$$`
]
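---

A minimal sketch of that interval in R, plugging in the rounded values from the slides above:

```r
x1 = 18.31; x2 = 12.34       # group means
s_d = 2.76                   # standard error of the difference
cv = qt(p = .975, df = 203)  # critical value

(x1 - x2) + c(-1, 1) * s_d * cv
```

Any small discrepancy from the `t.test()` interval comes from rounding the means and standard error.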
---

### Confidence intervals around estimates of the mean

In addition to calculating the precision of the estimated difference in means, you may also want to calculate the precision of the mean estimates themselves. In this case, you should use the standard deviation of each group's sample as your estimate of the population standard deviation, instead of pooling them.

$$
`\begin{aligned}
CI_{\text{Mean}} &= \bar{X} \pm \sigma_M(CV) \\
&= \bar{X} \pm \frac{\hat{\sigma}}{\sqrt{N}}(CV)
\end{aligned}`
$$

---

$$
`\begin{aligned}
CI_{\text{Mean}} &= \bar{X} \pm \frac{\hat{\sigma}}{\sqrt{N}}(CV)
\end{aligned}`
$$

.pull-left[
**Females**

```r
sd(school_new$friends[school_new$Gender == "Female"], na.rm = T)
```

```
## [1] 21.43591
```

```r
qt(.975, df = 122-1)
```

```
## [1] 1.979764
```

`$$18.31 \pm \frac{21.44}{\sqrt{122}}(1.98)$$`

`$$[14.47, 22.15]$$`
]

.pull-right[
**Males**

```r
sd(school_new$friends[school_new$Gender == "Male"], na.rm = T)
```

```
## [1] 15.90772
```

```r
qt(.975, df = 83-1)
```

```
## [1] 1.989319
```

`$$12.34 \pm \frac{15.91}{\sqrt{83}}(1.99)$$`

`$$[8.87, 15.81]$$`
]

---

```r
library(ggpubr)
ggerrorplot(school_new, 
            x = "Gender", 
            y = "friends", 
            desc_stat = "mean_ci", 
            color = "Gender", 
            ylab = "Hours spent with friends (per week)")
```

![](15-two_means_I_files/figure-html/unnamed-chunk-20-1.png)<!-- -->

---

```r
ggerrorplot(school_new, 
            x = "Gender", 
            y = "friends", 
            desc_stat = "mean_ci", 
            color = "Gender", 
            ylim = range(school_new$friends, na.rm = T), 
            ylab = "Hours spent with friends (per week)")
```

![](15-two_means_I_files/figure-html/unnamed-chunk-21-1.png)<!-- -->

---

### What is a test statistic?

.pull-left[
`$$Z = \frac{\bar{X}-\mu}{\sigma_M}$$`

`$$t = \frac{\bar{X}-\mu}{\frac{s}{\sqrt{N}}}$$`
]

.pull-right[
`$$t = \frac{(\bar{X}_1 - \bar{X}_2)-(\mu_1 - \mu_2)}{\sqrt{\frac{(N_1-1)\hat{\sigma}^2_1 + (N_2-1)\hat{\sigma}^2_2}{N_1+N_2-2}} \sqrt{\frac{1}{N_1} + \frac{1}{N_2}}}$$`
]

All test statistics measure

`$$\frac{\text{Signal}}{\text{Noise}}$$`

---

class: center

![](images/signal.png)

---

class: inverse

## Next time

* Assumptions
* Cohen's D
* Matrix algebra