You can download the rmd file here.
The purpose of today’s lab is to introduce tools for sampling from and calculating statistics for different types of distributions in R. The content of the lab will be split into two sections. The first section will focus on binomial distributions and the second section will focus on normal distributions. The Minihacks will test your knowledge of both types of distributions, as well as some distributions that are not discussed in the lab but are nonetheless important.
To quickly navigate to the desired section, click one of the following links:
The binomial distribution describes the theoretical probability of
obtaining a certain outcome over a number of trials when * (1) the
outcome on every trial is binary (e.g., a coin landed on a
heads
or a tails
; a dice was either a
6
or not a 6
) * (2) the probability of the
outcome on every trial is the same (e.g., the probability of getting a
heads
on flip 1
is the same as the probability
of getting a heads
on flip 100
). * (3) the
outcomes are independent of each other
Imagine you are flipping a coin. If it is a fair coin, you would expect a 50% chance of the coin landing on heads and a 50% chance of the coin landing on tails. However, every heads is not always paired with a tails. Sometimes there is a run of heads and sometimes there is a run of tails. Still, we would probably expect that, overall, there would be the same number of heads as tails. In other words, we would expect that if we flipped a single coin 100 times, the most likely outcome would be 50 heads and 50 tails.
Integrating our knowledge from last week about functions, we can turn this command into a function such that when we provide the number of trials and the probability for a success, we can almost instantly draw a binomial distribution.
draw_binomial <- function(size, prob){
#size: the number of trials
#prob: probability of success
binom_dist <- data.frame(x = 0:size) %>% #for the x axis, all possible successes for the x axis
mutate(prob = dbinom(x, size = size, prob = prob)) #probability of any number of heads
(plt <- ggplot(binom_dist, aes(x = x, y = prob))+
geom_bar(stat = "identity", width = 0.7, fill = "white", color = "black")+
geom_hline(yintercept = 0)+
labs(x = "# Successes",
y = "Probability")+
theme_classic())
return(plt)
}
draw_binomial(100, 0.5)
If we want to randomly sample trials from a binomial distribution, we
can use the rbinom()
function in R. The function takes
three arguments. The first argument (n
) is the number of
trials to sample. If we wanted to flip 2
coins
10
times, we would include the argument
n = 10
. The second argument (size
) is the
number of events associated with each trial. If we are flipping
2
coins, we would include size = 2
. The third
argument (prob
) is the probability of success on a given
trial. If we consider a heads
a success and everything else
a failure, we would include the argument prob = 1/2
.
Putting all of that together, we get
rbinom(n = 10, size = 2, prob = 1/2)
.
With random processes, I recommend that you use
set.seed([any number that you like])
to make your code
reproducible and spit out the same result across computers and every
time you run it. If you don’t use this function, the same code will
produce different results on different computers and every time you run
it.
Let’s make the computer flip a coin 10 times.
#side note: if we wrap a command in parentheses, it will assign the output of a function to the object and still print it at the same time!
(flip <- rbinom(n = 10, #trials
size = 1, #number of coins per trial
prob = 0.5)) #probability of success (1)
## [1] 0 0 1 1 0 1 1 1 1 0
What’s your expectation of how many heads we should get? How many heads did we end up getting?
mean(flip)
## [1] 0.6
How about we make the computer do this 10000 times? 😈
flip <- rbinom(n = 10000, #trials
size = 1, #number of coins per trial
prob = 0.5) #probability of success (1)
mean(flip)
## [1] 0.4952
We can also ask the computer to flip two coins per trial.
rbinom(n = 10, size = 2, prob = 1/2)
## [1] 0 1 0 1 1 2 1 1 1 0
Now, the proportion of heads might be closer to the expectation.
rbinom
can be helpful for running simulations. Let’s see
how the variance of the proportion of heads depends on the number of
flips.
flip_simulation <- data.frame(N_flips = seq(1, 10000, by = 10), #this is the number of flips we want to look at
prop_head = NA) #just a placeholder, we're going to fill this with the proportions iteratively
#Don't worry too much about the code below, but this is a for loop, implementing an iterative process
for (i in 1:nrow(flip_simulation)){ #this is telling R to go through each row of our data frame
#and then we fill the proportion of head column with the proportion of heads from the result of the binomial process
flip_simulation$prop_head[i] <- mean(rbinom(n = flip_simulation$N_flips[i], size = 1, prob = 0.5))
}
ggplot(flip_simulation, aes(x = N_flips, y = prop_head))+
geom_line()+
geom_hline(yintercept = 0.5, color = "red")+
labs(x = "Number of flips",
y = "Proportion of Heads")+
theme_classic()
The more trials we have for a random process, the more the proportion of successes converges on the expected value
Now, run the following simulations: How would we change
this if we were flipping 3
coins 10
times?
# your code here
What about rolling 5
6
-sided dice
10
times where getting a 6
is considered a
successful outcome?
# your code here
What about pulling an Ace
out of 1
deck
of cards 100
times?
# your code here
The function dbinom()
gives us the probability of
getting any one result. It takes four arguments, but we will only
concern ourselves with the first three: (1) x
- the number
of successful outcomes expected, (2) size
- the number of
events, and (3) prob
- the probability of success on a
given event.
To get the probability of getting 1
heads
by flipping 2
coins, we could run the following code.
dbinom(x = 1, size = 2, prob = .5)
## [1] 0.5
draw_binomial(size = 2, prob = .5) #that checks out!
There is a .50
probability of getting 1
heads
when you flip 2
coins. We can
investigate why this is the case by looking at the probability of every
outcome.
# HT + TH
(.5 * .5) + (.5 * .5)
## [1] 0.5
I think it might be helpful to look at some visualization of this.
Another way of looking at dbinom()
for, let’s say, a
binomial distribution of 50 trials and probability for success of
0.5:
dbinom(x = 20, size = 50, prob = 0.5)
## [1] 0.04185915
draw_binomial(size = 50, prob = 0.5)+
geom_segment(aes(x = 20-1, xend = 20, y = dbinom(x = 20, size = 50, prob = 0.5)+0.01, yend = dbinom(x = 20, size = 50, prob = 0.5)))+
geom_label(aes(x = 20-1, y = dbinom(x = 20, size = 50, prob = 0.5)+0.01, label = round(dbinom(x = 20, size = 50, prob = 0.5), 3)))
What’s the probability of getting 1
head when you
flip 3
coins?
# your code here
What’s the probability of drawing 2
aces
out of 1
deck of cards (with replacement)
when drawing twice?
# your code here
What’s the probability of drawing 0
Aces out of a
deck of cards twice (with replacement)?
# your code here
If we want to calculate the cumulative probability of getting a
certain result (i.e., the probability of getting a result equal to or
less than what we expect), we would use the function
pbinom()
. Cumulative probability may not sound important,
but it is when you consider that a p-value is the probability of getting
a result equal to or more extreme than that observed in the sample. The
function pbinom()
takes essentially the same arguments as
dbinom()
, but instead of the first argument being called
x
it is called q
.
Returning to the example from above, if we wanted to get the
probability of getting 1 or less
heads
when we
flip 2
coins, we would use
pbinom(q = 1, size = 2, prob = 1/2)
.
pbinom(q = 1, size = 2, prob = 1/2)
## [1] 0.75
The result is .75
. Again, this makes sense if we look at
the probability of every outcome.
# HT + TH + TT
(.5 * .5) + (.5 * .5) + (.5 * .5)
## [1] 0.75
The function pbinom()
can also take the argument
lower.tail
(defaults to TRUE
). The argument
lower.tail
is what specifies what side of the probability
distribution we should be testing from. In practical terms, it is what
decided that we wanted 1 or less
heads rather than
greather than 1
heads
.
For instance, if we wanted to test the probability of getting
greater than 1
heads
when flipping
2
coins, we would specify
lower.tail = FALSE
.
pbinom(q = 1, size = 2, prob = 1/2, lower.tail = FALSE)
## [1] 0.25
This figure tries to illustrate cumulative probability:
#this is just for instruction purposes, no need to fully understand this code
q <- 20
size <- 50
prob <- 0.5
draw_binomial(size = size, prob = prob)+
geom_bracket(xmin = 0, xmax = q,
y.position = (dbinom(x = q, size = size, prob = prob)+0.01),
label = paste0("lower.tail = TRUE: ", round(pbinom(q = q, size = size, prob = prob, lower.tail = TRUE), 3)), color = "red")+
geom_bracket(xmin = q+1, xmax = size,
y.position = (dbinom(x = q+1, size = size, prob = prob)+0.01),
label = paste0("lower.tail = FALSE: ", round(pbinom(q = q, size = size, prob = prob, lower.tail = FALSE), 3)), color = "red")
What’s the probability of getting 1 or less
heads
when flipping 10
coins?
# your code here
What’s the probability of getting 1 or less
6
s when rolling 1
dice?
# your code here
What’s the probability of getting greater than 3
6
s when rolling 9
dice?
# your code here
The function qbinom()
essentially does the opposite of
pbinom
. Instead of taking an outcome (q
) and
returning the cumulative probability, it returns the value that
corresponds to the cumulative probability (p
).
For instance if we wanted the value for which there is
100
% probability of getting that value or less
on 10
coin flips, we would enter
qbinom(p = 1.00, size = 10, prob = 1/2)
.
qbinom(p = 1.00, size = 10, prob = 1/2, lower.tail = TRUE)
## [1] 10
Unsurprisingly, 10 or less
heads
has a 100%
chance of occurring when you flip 10
coins.
q <- 1
size <- 2
prob <- 0.5
#this is just for instruction purposes, no need to fully understand this code
draw_binomial(size = size, prob = prob)+
geom_bracket(xmin = 0, xmax = q,
y.position = (dbinom(x = q, size = size, prob = prob)+0.01),
label = paste0("lower.tail = TRUE: ", round(pbinom(q = q, size = size, prob = prob, lower.tail = TRUE), 3)), color = "red")+
geom_bracket(xmin = q+1, xmax = size,
y.position = (dbinom(x = q+1, size = size, prob = prob)+0.01),
label = paste0("lower.tail = FALSE: ", round(pbinom(q = q, size = size, prob = prob, lower.tail = FALSE), 3)), color = "red")
#At what number of successes (or less) is X% of the distribution covered?
qbinom(p = 0.75, size = size, prob = prob, lower.tail = TRUE)
## [1] 1
#Above (!) what number of successes is X% of the distribution covered?
qbinom(p = 0.75, size = size, prob = prob, lower.tail = FALSE)
## [1] 0
With 100
coin flips, what is the number of
heads
(or less) that has a .50
probability of
occurring?
# your code here
With 100
coin flips, what is the number of
heads
(or less) that has a .25
probability of
occurring?
# your code here
With 100
coin flips, what is the number of
heads
(or greater) that has a .25
probability
of occurring?
# your code here
Recall from class that a normal distribution
is a
continuous probability distribution that is defined by a mean (\(\mu\)) and a standard deviation (\(\sigma\)). Whereas the binomial
distribution describes the theoretical probability of obtaining a
certain outcome over a number of trials when the outcome of every trial
is binary, the normal distribution describes the theoretical probability
of obtaining a certain outcome from a continuous distribution that has a
certain mean (\(\mu\)) and standard
deviation (\(\sigma\)).
A function to create a quick plot might be helpful…
draw_normal <- function(M, SD){
#M = mean of the normal distribution we want to draw
#SD = sd of the normal distribution we want to draw
x = seq(M - 4*SD, M + 4*SD, 0.1) #this is for the x axis, we're plotting a range from M +/- 4 SD, that covers most of the interesting part of the distribution
norm_dist <- data.frame(x = x) %>%
mutate(prob = dnorm(x, mean = M, sd = SD)) #probability of any number of heads
(plt <- ggplot(norm_dist, aes(x = x, y = prob))+
geom_line()+
labs(x = "x",
y = "Density")+
theme_classic())
return(plt)
}
In order to randomly sample observations from a normal distribution,
we use the function rnorm()
. Similar to
rbinom()
, rnorm()
takes three arguments: (1)
n
- the number of observations to sample from the normal
distribution, (2) mean
- the mean of the normal
distribution, and (3) sd
- the standard deviation of the
normal distribution.
Below we sample 5
values from a normal distribution with
a mean
of 0
and a sd
of
1
.
x <- rnorm(n = 5, mean = 0, sd = 1)
x
## [1] -0.89691455 0.18484918 1.58784533 -1.13037567 -0.08025176
Calculating the mean()
and sd()
of our
5
numbers can serve as a bit of a sanity check.
mean(x)
## [1] -0.06696949
sd(x)
## [1] 1.0749
With only 5 samples drawn, the distribution may not look normal, but with a sufficiently large sample size, it will start to resemble the true distribution.
How would you sample 10
observations from a normal
distribution with a mean of 100
and a standard deviation of
10
?
# your code here
Are the descriptives for this sample what we would expect?
# your code here
The normal distribution counterpart of dbinom()
is
dnorm()
. Similar to rnorm()
, it takes a mean
(mean
) and a standard deviation (sd
), but
instead of an argument for the number of observations you want to sample
(n
) you provide it a value (x
). The reason
dnorm()
exists is to calculate the height of the
probability curve for any one value.
For example, if we enter the value 0
with a mean of
0
and a standard deviation 1
, we get
0.3989423
dnorm(x = 0, mean = 0, sd = 1)
## [1] 0.3989423
draw_normal(M = 0, SD = 1) +
geom_segment(x = 0, xend = -Inf, y = dnorm(x = 0, mean = 0, sd = 1), color = "red")+
geom_segment(x = 0, y = -Inf, yend = dnorm(x = 0, mean = 0, sd = 1))+
geom_label(label = round(dnorm(x = 0, mean = 0, sd = 1), 3), x = 0, y = dnorm(x = 0, mean = 0, sd = 1), hjust = 0)
Likewise, if we calculate the height of the probability plot at an
x
value of 1
, we see the result
0.2419707.
dnorm(x = 1, mean = 0, sd = 1)
## [1] 0.2419707
What would be the height of the probability line at a value of -1
when the mean
is 0
and the sd
is
1
?
# your code here
Like pbinom()
, pnorm()
tells you the
probability of getting a certain value (or less) in a given normal
distribution. Once again, you can set the mean and standard deviation of
the distribution using mean
and sd
. Instead of
taking its value using the x
argument, it takes its
argument using the q
argument.
If we wanted to calculate the probability of getting a value below 0 from a normal distribution with a mean of 0 and sd of 1, we would use:
pnorm(q = 0, mean = 0, sd = 1, lower.tail = TRUE)
## [1] 0.5
q <- 0
M <- 0
SD <- 1
draw_normal(M = M, SD = SD)+
geom_vline(xintercept = q)+
geom_label(x = q, y = 0, label = q)+
geom_bracket(xmin = -4, xmax = q,
y.position = (dnorm(x = q, mean = M, sd = SD)+0.01),
label = paste0("lower.tail = TRUE: ", round(pnorm(q = q, mean = 0, sd = 1, lower.tail = TRUE), 3)), color = "red")+
geom_bracket(xmin = q+0.01, xmax = Inf,
y.position = (dnorm(x = q, mean = M, sd = SD)+0.03),
label = paste0("lower.tail = FALSE: ", round(pnorm(q = q, mean = 0, sd = 1, lower.tail = FALSE), 3)), color = "red")
So, what’s the probability of getting 40 or less
when
the mean is 50
and the standard deviation is
10
?
pnorm(q = 40, mean = 50, sd = 10, lower.tail = TRUE)
## [1] 0.1586553
Looks like the probability of getting a value of
40 or less
is slightly less than .16
.
We can also work through this mathematically. We expect that 68% of
values on a normal distribution will fall between plus or minus one
standard deviation. In our example, 40
is one standard
deviation below the mean. As such, the probability of getting
40
or a value less than 40
would be \(\frac{1 - 0.68}{2} = \frac{.32}{2} =
.16\).
What’s the probability of getting a value of
60 or less
, when the mean of the distribution is
30
and the standard deviation is 15?
# your code here
What’s the probability of getting a value greater than
-5
when the mean is 0
and the sd is
5
?
# your code here
Finally, we can use qnorm()
to get the value that
corresponds to a particular cumulative probability. Once again, we can
set the mean (mean
) and standard deviation
(sd
) of the distribution, but we use p
to set
the target probability.
To get the value or less that has a probability of
0.00135
of occurring in a distribution with a mean of
0
and a standard deviation of 1
is calculated
using qnorm(p = .00135, mean = 0, sd = 1)
.
qnorm(p = .00135, mean = 0, sd = 1, lower.tail = TRUE)
## [1] -2.999977
As we can see, the value is just about -3.00
. This makes
sense if we consider that 99.73
% of values in a normal
distribution are between three standard deviations below and above the
mean; probability of getting -3.00
would be \(\frac{1-.9973}{2} = 0.00135\).
What value or less is associated with a .51
cumulative probability in a normal distribution with a mean of
100
and a standard deviation of 10
?
# your code here
What value or greater is associated with a .51
probability in a normal distribution with a mean of 100
and
a standard deviation of 10
?
# your code here
You are welcome to work with a partner or in a small group of 2-3 people. Please feel free to ask the lab leader any questions you might have!
You are playing Dungeons and Dragons and, to the Dungeon Master’s displeasure, you run immediately to the dragon that is meant to be encountered at the end of her carefully-crafted campaign.
5
20
-sided dice, and get a 20
on each die. What
is the probability of getting this result?# your code here
2
(i.e., 3
or more)
20
s she will let you slay the dragon. What is the
probability of getting more than 2
20
s when
rolling 5
20
-sided dice?# your code here
.10
. She acquiesces.
What number of 20
s or greater is associated with a
cumulative probability of .10
when rolling 5
20
-sided dice?# your code here
From data released from the Graduate Coffee Drinkers Association
(GCDA), you know coffee consumption is normally distributed among
graduate students, with the average student drinking 5
cups
of coffee per day and 68% of students drinking between 4
and 6
cups of coffee per day (i.e., the distribution has a
standard deviation of 1
).
2 or less
cups of coffee per day?# your code here
50
graduate students from the distribution three
times. Plot each of these samples as a histogram. Are the histograms
identical? Why or why not?# your code here
10
cups of coffee per day.# your code here
0.00
. Calculate the probability that a graduate student
would drink 10 or more
cups of coffee a day.# your code here
1
and 100
. You think of
37
and the magician guesses 37
. Assuming the
choice of numbers follows a uniform distribution, what is the
probability that the magician guessed your number at random? Use
dunif()
to prove your intuition is correct.# your code here
3.00
with 10
degrees of freedom. Use
pchisq()
to calculate the probability.# your code here
3.00 or greater
with 10
degrees of freedom from a t-distribution. Google (or use your intuition)
to determine how to calculate a cumulative probability from a
t-distribution.# your code here