class: center, middle, inverse, title-slide

# Probability: Normal Distribution

---

![](images/twitter1.png)

---

![](images/twitter2.png)

---

![](images/twitter3.png)

---

## Today...

* The normal distribution
* Why the normal is so important
* `\(t\)`, `\(F\)`, and `\(\chi^2\)` distributions (briefly, for now)

---

The **normal distribution** ("bell curve" or "Gaussian distribution") is a two-parameter distribution defined by the mean ( `\(\mu\)` ) and standard deviation ( `\(\sigma\)` ) and having the following probability density function:

`$$p(X|\mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}}\exp\left[-\frac{(X-\mu)^2}{2\sigma^2}\right]$$`

`$$X\sim N(\mu,\sigma)$$`

???

Note p not P

### Also note function of Z-scores

pull out 2, pull out 1/2

---

### Expected value

The expected value of the normal distribution is quite simple:

`$$E(X) = \mu$$`

`$$Var(X) = \sigma^2$$`

---

.left-column[
.small[The **probability density function** gives the height of the curve at a particular value of X. Although these values communicate information about probability or likelihood, they are not themselves probabilities.
]
]

![](8-normal_files/figure-html/normal-1.png)<!-- -->


```r
ggplot(data.frame(x = seq(-4, 4)), aes(x)) +
  stat_function(fun = function(x) dnorm(x)) +
  scale_x_continuous("Variable X") +
  scale_y_continuous("Density") +
  ggtitle("The normal curve") +
  theme(text = element_text(size = 20))
```

---

.pull-left[
![](8-normal_files/figure-html/unnamed-chunk-3-1.png)<!-- -->

`$$\small p(X|\mu, \sigma) = \\ \frac{1}{\sigma\sqrt{2\pi}}\exp\left[-\frac{(X-\mu)^2}{2\sigma^2}\right]$$`

Probability **Density**
]

.pull-right[
![](8-normal_files/figure-html/unnamed-chunk-4-1.png)<!-- -->

`$$\small P(X|\theta,N) = \\\frac{N!}{X!(N-X)!}\theta^X(1-\theta)^{N-X}$$`

Probability **Mass**
]

---

Both the normal distribution and the binomial distribution follow the Law of Total Probability, but in different ways.

In the *binomial distribution*, each outcome in the sample space has a probability, and these probabilities sum to 1.

In the *normal distribution*, there are an infinite number of possible values, each with a probability of 0, but the probability that *some* value will occur is 1. The area under the curve is 1.

The density values are derived to ensure that the area under the density curve is 1. They have no inherent meaning beyond that.

---

Density is mass per volume. In this context (a curve in two dimensions), density is mass per area. The total density in the normal curve is 1. The density for a part of the curve (a smaller area) will necessarily be less than 1.

---

.left-column[
The area under the curve that lies between the mean (here 0) and a value of 1 is the probability of a score between 0 and 1.

As we shrink that area by moving X closer to the mean, that probability interpretation holds.
]

![](8-normal_files/figure-html/unnamed-chunk-5-1.png)<!-- -->

---

.left-column[
As our interval shrinks closer and closer to 0, our area (probability) shrinks as well. It can get vanishingly close to 0—essentially a point rather than an area. The probability of that "point" is 0.
]

![](8-normal_files/figure-html/unnamed-chunk-6-1.png)<!-- -->

---

.left-column[
We can keep shrinking the distance between Z and 0, never reaching 0, and still calculate an area. It will be very, very small.
]

![](8-normal_files/figure-html/unnamed-chunk-7-1.png)<!-- -->

???

## Key point

It does not make sense to talk about the probability of a point for a continuous variable, but it does make sense to talk about the probability that a score will fall between two points, or the probability that a score will fall above or below a point.
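---

To make this concrete: in R, `pnorm()` gives the area under the standard normal curve to the left of a value, while `dnorm()` gives only the height of the curve. A minimal sketch (the particular values are arbitrary):


```r
pnorm(1) - pnorm(0)      # P(0 < Z < 1), about .34 -- an area, so a probability
pnorm(0.1) - pnorm(0)    # a narrower interval, a smaller probability (about .04)
pnorm(0.001) - pnorm(0)  # shrinking toward a "point": about .0004
pnorm(0) - pnorm(0)      # a single point has probability 0
dnorm(0)                 # the density (height) at 0 is about .40 -- not a probability
```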
---

## Characteristics of the normal distribution

* The mean and standard deviation are independent.
* The distribution is unimodal and symmetrical.
* For two normal distributions, the area under the curve between corresponding locations in standard deviation units is the same regardless of `\(\mu\)` and `\(\sigma\)`.
* For example, 68.3% of the area under a normal curve falls between 1 `\(\sigma\)` below the mean and 1 `\(\sigma\)` above the mean—for every normal curve. This general characteristic is called the **Empirical Rule**.

---

![](8-normal_files/figure-html/unnamed-chunk-8-1.png)<!-- -->

---

All of these distributions are normal and have an equivalent area (proportion) that falls between one standard deviation below and one standard deviation above their respective means.

![](8-normal_files/figure-html/unnamed-chunk-9-1.png)<!-- -->

---

- Approximately 68% of the data in a normal distribution will be within one standard deviation of the mean.
- About 95% of the data will be within two standard deviations of the mean.
- About 99.7% of the data will be within three standard deviations of the mean.

In other words, nearly all of the data in a normal distribution will fall within 3 standard deviations of the mean.

These benchmarks are convenient for determining if a score (and later, a mean) is rare or unusual in the context of a particular distribution.

---

## Standard normal distribution

The normal distribution with `\(\mu = 0\)` and `\(\sigma = 1\)` is called the **standard normal** distribution.

Variables with quite different means and standard deviations can be standardized so that they can be compared in the same metric (standard deviation units).

This allows statements such as "relative to the mean, I am more conscientious (e.g., `\(Z=2\)`) than I am extraverted (e.g., `\(Z=1\)`)." I could not say, however, that I am twice as conscientious as I am extraverted.

--

All continuous distributions can be standardized, but if they are not normal to begin with, standardization will not make them so. *Standardization does not alter distribution shape.*

???

Why?
- not ratio scales, no set amount to compare
- not the same units -- what, I'm twice as heavy as I am tall?
- Z-score is a comparison to other people

---

### Standard normal distribution

There is only one (1) standard normal distribution.

**How is this useful?**

- Given any z-score, we can calculate the probability of getting a value greater than that z-score. *(Or less than that z-score.)*
- Given any two z-scores, we can calculate the probability of getting a value between those scores. *(Or outside those z-scores.)*
- Given a probability `\(p\)`, we can identify the z-score below *(or above)* which the proportion `\(p\)` of scores falls.
- Given a probability `\(p\)`, we can identify the value `\(Z\)` such that the proportion of scores falling above `\(-Z\)` and below `\(Z\)` is equal to `\(p\)`.
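---

Each of those statements corresponds to a single call to `pnorm()` or `qnorm()` in R. A minimal sketch (the z-scores and probabilities used here are arbitrary examples):


```r
pnorm(1.5, lower.tail = FALSE)  # P(Z > 1.5)
pnorm(1.5)                      # P(Z < 1.5)
pnorm(2) - pnorm(-1)            # P(-1 < Z < 2)
qnorm(.90)                      # z-score below which 90% of scores fall
qnorm(.90, lower.tail = FALSE)  # z-score above which 90% of scores fall
qnorm((1 + .95) / 2)            # Z such that 95% of scores fall between -Z and Z (about 1.96)
```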
---

### Sampling distributions are normal

One of the most important discoveries in statistics is that the sampling distributions of many statistics are approximately normal even when the sample (and population) distributions are not.

The **sampling distribution** of a statistic is the probability distribution that specifies probabilities for the possible values that the statistic can take.

- For example, the mean of a random sample will not precisely equal the population mean. But how far off will it be? And what distribution shape will these possible sample mean values have?

The error represented by how far off a sample mean is from the population mean is called **sampling error**.

???

## DRAW SAMPLING DISTRIBUTION

Show mean, show how values close to the mean are more likely

Draw bar to represent sampling error

---

### Central limit theorem

According to the **central limit theorem**, as sample size increases, the sampling distribution of the mean approaches normality, even when the data upon which the mean is based are not normally distributed.

The sample size necessary to be "approximately normal" depends on the nature of the underlying data. The less normal the data are, the larger the sample size necessary for the sampling distribution of the mean to become normal. "Around a sample size of 30" is a common rule of thumb.

- Note, however, that this rule of thumb is sufficient only for assuming that the sampling distribution is approximately normal. It says nothing about statistical power.

???

Regardless of the CLT, if the data are skewed, we might wonder if the mean is the best estimator to use here.

---

.left-column[It turns out that quite a few sample statistics approach normality as sample size increases. Here is the sampling distribution of the sample standard deviation when the data come from a normal distribution with `\(\sigma = 1\)`.
]

![](8-normal_files/figure-html/unnamed-chunk-10-1.png)<!-- -->

---

.left-column[And the range.]

![](8-normal_files/figure-html/unnamed-chunk-11-1.png)<!-- -->

???

Note here that we're not just narrowing in, but our mean estimate is getting larger. Remember bias?

---

The standard deviation of the sampling distribution of the mean is called the **standard error of the mean**. It is directly related to the variability of the underlying data:

`$$\sigma_{\text{mean}} = \frac{\sigma_X}{\sqrt{N}}$$`

The same benchmarks for "rarity" that exist for normally distributed data also exist for normally distributed means. This is the basis for much statistical inference.

What if we don't know `\(\mu\)` or `\(\sigma\)`? What if we want to understand the sampling behavior of the variance rather than the mean?

---

If we can assume the data are normally distributed, then other distributions can be used to address these additional questions.

The `\(t\)` distribution, like the normal, is unimodal and symmetrical. It has a single parameter, `\(\nu\)`, called the degrees of freedom (related to the sample size, `\(N\)`).

If a sample of size `\(N\)` comes from a normal distribution with unknown `\(\sigma\)`, then the standardized sample mean, `\(\frac{\bar{X}-\mu}{s/\sqrt{N}}\)`, will follow a `\(t\)` distribution with `\(\nu = N - 1\)` degrees of freedom.

---

.left-column[
.small[
The `\(t\)` distribution is used when data are normally distributed but we do not know the population mean or standard deviation. These must be estimated from the sample, and the added uncertainty produces a distribution with heavier tails.

As sample size increases, the `\(t\)` distribution converges on the normal distribution.
]
]

![](8-normal_files/figure-html/unnamed-chunk-12-1.png)<!-- -->

???

Just like with the standard normal (Z) distribution, there are 1-1 relationships between probability and "$t$-scores".
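---

A minimal sketch of that convergence, using the same `stat_function()` approach as the earlier plot (the degrees of freedom shown are arbitrary choices):


```r
library(ggplot2)

# t quantiles approach the corresponding normal quantile as df grows
qnorm(.975)                       # about 1.96
qt(.975, df = c(5, 10, 30, 100))  # about 2.57, 2.23, 2.04, 1.98

# Heavier tails at small df, converging on the normal curve
ggplot(data.frame(x = c(-4, 4)), aes(x)) +
  stat_function(aes(color = "Normal"), fun = dnorm) +
  stat_function(aes(color = "t, df = 5"), fun = dt, args = list(df = 5)) +
  stat_function(aes(color = "t, df = 30"), fun = dt, args = list(df = 30)) +
  scale_x_continuous("Z or t") +
  scale_y_continuous("Density") +
  labs(color = NULL)
```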
---

Sometimes our interest will be in variability rather than means. When we want to make inferences about sample variability, we need to know its sampling distribution. We can make use of the `\(\chi^2\)` distribution here.

It has a single parameter, `\(\nu\)`, related to the sample size, `\(N\)`.

There is an important relationship between the normal distribution and the `\(\chi^2\)` distribution:

`$$\text{If } Z \sim N(0,1), \\\text{then } Z^2 \sim \chi^2_1 \\\text{ and } \sum_{i=1}^{N}Z_i^2 \sim \chi^2_N$$`

--

In other words, the sum of squared standard normal variables will have a `\(\chi^2\)` distribution. This is the key to understanding the sampling behavior of the sample variance...

---

The sample variance is defined as:

`$$s^2 = \frac{\sum_{i = 1}^{N}(X_i-\bar{X})^2}{N-1}$$`

--

The important part is the numerator. It is a sum of squares. We can standardize by the population value, `\(\sigma\)`. If we assume the data are normally distributed, then

`$$\frac{(N-1)s^2}{\sigma^2} \sim \chi^2_{N-1}$$`

and we can use the `\(\chi^2\)` distribution to inform us about the expected sampling behavior of the variance.

???

## On the board

Write the formula (will it work if I just move the screen up?)

We have X's, but if we can turn these into Z's, we might be able to simplify with a new distribution.

How do we get from X to Z?

`$$Z = \frac{X-\bar{X}}{s}$$`

`$$s^2 = \frac{\sum_{i = 1}^{N}(X_i-\bar{X})^2}{N-1}$$`

Divide by `\(\sigma^2\)` and multiply by (N-1) and you'll get Z's!

Also point out: if the sample comes from the same population, what does the equation simplify to? 1

If from different populations -- not 1!
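---

A quick simulation sketch of that claim (the choices of `N`, `sigma`, and the number of replicates are arbitrary): when the data really are normal, the scaled sample variance behaves like a `\(\chi^2\)` variable with `\(N-1\)` degrees of freedom.


```r
set.seed(123)
N     <- 10
sigma <- 2

# Sampling distribution of (N - 1) * s^2 / sigma^2 for normal data
scaled_var <- replicate(10000, {
  x <- rnorm(N, mean = 0, sd = sigma)
  (N - 1) * var(x) / sigma^2
})

hist(scaled_var, breaks = 50, freq = FALSE,
     main = "Scaled sample variance", xlab = "(N - 1) s^2 / sigma^2")
curve(dchisq(x, df = N - 1), add = TRUE, lwd = 2)
```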
---

Another common question is related to comparing multiple variances.

Suppose we conduct an experiment that has four different conditions. We expect that the means for these conditions will be different, but how can we know if they are different enough to claim that the difference is due to our experiment and not just random variability?

| Condition 1 | Condition 2 | Condition 3 | Condition 4 |
|:-----------:|:-----------:|:-----------:|:-----------:|
| `\(M_1\)`   | `\(M_2\)`   | `\(M_3\)`   | `\(M_4\)`   |
| `\(SD_1\)`  | `\(SD_2\)`  | `\(SD_3\)`  | `\(SD_4\)`  |

One approach to this question would start by taking the variability of the means, that is, the variability between conditions:

`$$\Large s^2_{\text{between}}$$`

---

`$$\Large s^2_{\text{between}}$$`

This variability will include any that was produced by the experimental conditions as well as any random variability. We have no way of separating those two parts. How can we know if the experiment-induced part is "big enough" to claim that the experiment "worked"?

| Condition 1 | Condition 2 | Condition 3 | Condition 4 |
|:-----------:|:-----------:|:-----------:|:-----------:|
| `\(M_1\)`   | `\(M_2\)`   | `\(M_3\)`   | `\(M_4\)`   |
| `\(SD_1\)`  | `\(SD_2\)`  | `\(SD_3\)`  | `\(SD_4\)`  |

--

We could compare it to the variability of the individual scores within the conditions:

`$$\Large s^2_{\text{within}}$$`

---

Within each group, everyone is exposed to the same experimental conditions, and so the only source of variability is random variability.

The two variances are in different metrics (means versus scores), but that can be taken care of by the following relation:

.pull-left[
`$$\sigma_{\text{mean}} = \frac{\sigma_X}{\sqrt{N}}$$`
]

.pull-right[
`$$s^2_{\text{means}} = \frac{s^2_{\text{scores}}}{N}$$`

`$$Ns^2_{\text{means}} = s^2_{\text{scores}}$$`
]

---

In other words, if the null hypothesis (no mean differences) is true, then both of these variance estimates are estimates of random error. If their ratio departs enough from 1.00, then I can claim there is something extra in the between source—the contribution due to the experimental conditions.

This ratio is called `\(F\)`:

`$$F \approx \frac{Ns^2_{\text{between}}}{s^2_{\text{within}}}$$`

--

Each variance estimate in the ratio is `\(\chi^2\)` distributed, if the data are normally distributed. The ratio of two `\(\chi^2\)` distributed variables is `\(F\)` distributed. Eventually we will need to make this more precise (adjusting each by its degrees of freedom, as below), but this is close enough for now.

`$$F_{\nu_1,\nu_2} = \frac{\frac{\chi^2_{\nu_1}}{\nu_1}}{\frac{\chi^2_{\nu_2}}{\nu_2}}$$`

---

The `\(t\)`, `\(\chi^2\)`, and `\(F\)` distributions all depend on the normal distribution assumption. The normal distribution is the "parent" population.

As sample size increases, `\(t\)` and `\(\chi^2\)` converge on the normal. `\(F\)` converges on:

`$$\frac{\chi^2_{\nu_1}}{\nu_1}$$`

with the numerator depending on the normal.

---

### But keep in mind

These probability distributions have a key assumption:

- the sample is a random selection from the population

If the sample is not a random selection, the rules of probability don't apply.

---

class: inverse

## Next time...

sampling