The purpose of today’s lab is to introduce tools for sampling from and calculating statistics for different types of distributions in R. The content of the lab will be split into two sections. The first section will focus on binomial distributions and the second section will focus on normal distributions. The Minihacks will test your knowledge of both types of distributions, as well as some distributions that are not discussed in the lab but are nonetheless important.
To quickly navigate to the desired section, click one of the following links:
Imagine you are flipping a coin. If it is a fair coin, you would expect a 50% chance of the coin landing on heads and a 50% chance of the coin landing on heads. However, as shown in the animation below, every heads is not always paired with a tails. Sometimes there is a run of heads and sometimes there is a run of tails. Still, we would probably expect that, overall, there would be the same number of heads as tails. In other words, we would expect that if we flipped a single coin 100 times, the most likely outcome would be 50 heads and 50 tails.
## NULL
As you might recall from class, our intuition about the outcome of
the 100 coin flips can be described in terms of a binomial distribution.
Essentially, the binomial distribution describes the theoretical
probability of obtaining a certain outcome over a number of trials when
(1) the outcome on every trial is binary (e.g., a coin landed on a
heads
or a tails
; a dice was either a
6
or not a 6
) and the probability of the
outcome on every trial is the same (e.g., the probability of getting a
heads
on flip 1
is the same as the probability
of getting a heads
on flip 100
).
If we plot the probability distribution of the example above, two
things are apparent: (1) the most probable outcome of flipping
100
coins is 50
heads
and (2)
there are additional outcomes that are, although less probable, also
possible. For instance, one could even expect to get 0
heads
0% of the time.
If we want to randomly sample trials from a binomial distribution, we
can use the rbinom()
function in R. The function takes
three arguments. The first argument (n
) is the number of
trials to sample. If we wanted to flip 2
coins
10
times, we would include the argument
n = 10
. The second argument (size
) is the
number of events associated with each trial. If we are flipping
2
coins, we would include size = 2
. The third
argument (prob
) is the probability of success on a given
trial. If we consider a heads
a success and everything else
a failure, we would include the argument prob = 1/2
.
Putting all of that together, we get
rbinom(n = 10, size = 2, prob = 1/2)
.
rbinom(n = 10, size = 2, prob = 1/2)
## [1] 0 2 1 1 1 1 0 1 1 1
From the results, we can see we got 0
heads
on the first toss of our 2
coins, 2
heads
on the second toss our 2
coins, and
1
head
on the third through sixth tosses of
our 2
coins.
How would we change this if we were flipping 3
coins
10
times?
rbinom(n = 10, size = 3, prob = .5)
## [1] 2 2 2 2 2 2 0 2 3 1
What about rolling 5
6
-sided dice
10
times where getting a 6
is considered a
successful outcome?
rbinom(n = 10, size = 5, prob = 1/6)
## [1] 0 0 0 0 0 1 1 2 1 1
What about pulling an Ace
out of 1
deck
of cards 100
times?
rbinom(n = 100, size = 1, prob = 1/13)
## [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0
## [38] 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [75] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
The function dbinom()
gives us the probability of
getting any one result. It takes four arguments, but we will only
concern ourselves with the first three: (1) x
- the number
of successful outcomes expected, (2) size
- the number of
events, and (3) prob
- the probability of success on a
given event.
To get the probability of getting 1
heads
by flipping 2
coins, we could run the following code.
dbinom(x = 1, size = 2, prob = .5)
## [1] 0.5
There is a .50
probability of getting 1
heads
when you flip 2
coins. We can
investigate why this is the case by looking at the probability of every
outcome.
# HT + TH
(.5 * .5) + (.5 * .5)
## [1] 0.5
What’s the probability of getting 1
head when you
flip 3
coins?
dbinom(x = 1, size = 3, prob = .5)
## [1] 0.375
# HTT + THT + TTH
(.5 * .5 * .5) + (.5 * .5 * .5) + (.5 * .5 * .5)
## [1] 0.375
When drawing a card from a deck of cards twice (with
replacement), what’s the probability of drawing 2
aces
?
dbinom(x = 2, size = 2, prob = 1/13)
## [1] 0.00591716
# AA
(1 / 13) * (1 / 13)
## [1] 0.00591716
When drawing a card from a deck of cards twice (with
replacement), what’s the probability of drawing 0
aces
?
# method 1
dbinom(x = 0, size = 2, prob = 1/13)
## [1] 0.852071
# method 2
dbinom(x = 2, size = 2, prob = 12/13)
## [1] 0.852071
# 00
(12 / 13) * (12 / 13)
## [1] 0.852071
If we want to calculate the cumulative probability of getting a
certain result (i.e., the probability of getting a result equal to or
less than what we expect), we would use the function
pbinom()
. Cumulative probability may not sound important,
but it is when you consider that a p-value is the probability of getting
a result equal to or more extreme than that observed in the sample. The
function pbinom()
takes essentially the same arguments as
dbinom()
, but instead of the first argument being called
x
it is called q
.
Returning to the example from above, if we wanted to get the
probability of getting 1 or less
heads
when we
flip 2
coins, we would use
pbinom(q = 1, size = 2, prob = 1/2)
.
pbinom(q = 1, size = 2, prob = 1/2)
## [1] 0.75
The result is .75
. Again, this makes sense if we look at
the probability of every outcome.
# HT + TH + TT
(.5 * .5) + (.5 * .5) + (.5 * .5)
## [1] 0.75
What’s the probability of getting 1 or less
heads
when flipping 10
coins?
pbinom(q = 1, size = 10, prob = 1/2)
## [1] 0.01074219
What’s the probability of getting 1 or less
6
s when rolling 1
dice?
pbinom(q = 1, size = 1, prob = 1/6)
## [1] 1
The function pbinom()
can also take the argument
lower.tail
(defaults to TRUE
). The argument
lower.tail
is what specifies what side of the probability
distribution we should be testing from. In practical terms, it is what
decided that we wanted 1 or less
heads rather than
greather than 1
heads
.
For instance, if we wanted to test the probability of getting
greater than 1
heads
when flipping
2
coins, we would specify
lower.tail = FALSE
.
pbinom(q = 1, size = 2, prob = 1/2, lower.tail = FALSE)
## [1] 0.25
What’s the probability of getting greater than 3
6
s when rolling 9
dice?
pbinom(q = 3, size = 9, prob = 1/6, lower.tail = FALSE)
## [1] 0.04802149
The function qbinom()
essentially does the opposite of
pbinom
. Instead of taking an outcome (q
) and
returning the cumulative probability, it returns the value that
corresponds to the cumulative probability (p
).
For instance if we wanted the value for which there is
100
% probability of getting that value or less
on 10
coin flips, we would enter
qbinom(p = 1.00, size = 10, prob = 1/2)
.
qbinom(p = 1.00, size = 10, prob = 1/2)
## [1] 10
Unsurprisingly, 10 or less
heads
has a 100%
chance of occurring when you flip 10
coins.
With 100
coin flips, what is the number of
heads
(or less) that has a .50
probability of
occurring?
qbinom(p = .50, size = 100, prob = 1/2)
## [1] 50
With 100
coin flips, what is the number of
heads
(or less) that has a .25
probability of
occurring?
qbinom(p = .25, size = 100, prob = 1/2)
## [1] 47
With 100
coin flips, what is the number of
heads
(or greater) that has a .25
probability
of occurring?
qbinom(p = .25, size = 100, prob = 1/2, lower.tail = FALSE)
## [1] 53
Recall from class that a normal distribution
is a
continuous probability distribution that is defined by a mean (\(\mu\)) and a standard deviation (\(\sigma\)). Whereas the binomial
distribution describes the theoretical probability of obtaining a
certain outcome over a number of trials when the outcome of every trial
is binary, the normal distribution describe the theoretical probability
of obtaining a certain outcome from a continuous distribution that has a
certain mean (\(\mu\)) and standard
deviation (\(\sigma\)).
In order to randomly sample observations from a normal distribution,
we use the function rnorm()
. Similar to
rbinom()
, rnorm()
takes three arguments: (1)
n
- the number of observations to sample from the normal
distribution, (2) mean
- the mean of the normal
distribution, and (3) sd
- the standard deviation of the
normal distribution.
Below we sample 5
values from a normal distribution with
a mean
of 0
and a sd
of
1
.
x <- rnorm(n = 5, mean = 0, sd = 1)
x
## [1] -0.89691455 0.18484918 1.58784533 -1.13037567 -0.08025176
The five values were -0.8969145
, 0.1848492
,
1.5878453
, -1.13037567
and
-0.08025176
. Calculating the mean()
and
sd()
of our 5
numbers can serve as a bit of a
sanity check.
mean(x)
## [1] -0.06696949
sd(x)
## [1] 1.0749
If we plot a histogram of the data, it doesn’t look normally distributed.
Not to worry! As illustrated in the animation below, many samples do not appear normal until samples of sufficient size are taken from the population.
How would you sample 10
observations from a normal
distribution with a mean of 100
and a standard deviation of
10
?
x <- rnorm(n = 100, mean = 50, sd = 15)
x
## [1] 70.10702 33.26432 72.59808 25.85051 64.74490 42.08073 40.90247 31.44694
## [9] 60.84194 40.86770 78.70571 37.72669 60.77980 45.40782 38.96492 68.91429
## [17] 56.77912 31.54454 64.42119 42.73362 49.73978 39.50173 80.14184 39.56021
## [25] 20.54169 49.77191 65.71358 35.53914 10.88095 44.83591 44.69725 49.81173
## [33] 52.03666 39.80848 40.65800 56.30383 40.88204 52.85271 66.08110 40.23969
## [41] 20.72601 41.70519 72.86999 52.62223 85.21151 37.54641 46.89434 69.44872
## [49] 42.42647 48.28229 37.41338 55.13768 69.42520 41.95130 75.69487 41.86698
## [57] 62.14193 43.11261 40.42576 53.27791 15.67171 46.75063 66.48480 38.93634
## [65] 51.66526 46.54303 41.52317 39.90583 54.62883 53.93384 70.03018 56.37187
## [73] 38.78450 81.85908 66.13855 63.98095 46.74410 54.21393 41.15392 34.31660
## [81] 64.93405 69.60571 57.38871 48.88348 45.86954 50.38997 78.66571 78.81954
## [89] 42.40974 62.19493 58.28126 39.91096 58.55067 58.07785 78.43734 63.38223
## [97] 52.78941 52.53657 57.99779 31.48749
Are the descriptives for this sample what we would expect?
mean(x)
## [1] 51.26711
sd(x)
## [1] 15.20467
The normal distribution counterpart of dbinom()
is
dnorm()
. Similar to rnorm()
, it takes a mean
(mean
) and a standard deviation (sd
), but
instead of an argument for the number of observations you want to sample
(n
) you provide it a value (x
). As mentioned
in class, the probability of any one value in a normal distribution is
0.00
because the total probability of the distribution is
1.00
and there are infinite values in any continuous
distribution. The reason dnorm()
exists is mostly
mathematical, but we can use it to calculate the height of the
probability curve for any one value.
dnorm(x = 0, mean = 0, sd = 1)
## [1] 0.3989423
For example, if we enter the value 0
with a mean of
0
and a standard deviation 1
, we get
0.3989423
As shown in the plot above, the height of the probability curve at a
value of 0
is .399
.
Likewise, if we calculate the height of the probability plot at an
x
value of 1
, we see the result
0.2419707.
dnorm(1, mean = 0, sd = 1)
## [1] 0.2419707
Looking at the plotted normal distribution, this is also expected.
What would be the height of the probability line at a value of -2
when the mean
is 0
and the sd
is
1
?
dnorm(x = -2, mean = 0, sd = 1)
## [1] 0.05399097
Like pbinom()
, pnorm()
tells you the
probability of getting a certain value (or less) in a given normal
distribution. Once again, you can set the mean and standard deviation of
the distribution using mean
and sd
. Instead of
taking its value using the x
argument, it takes its
argument using the q
argument.
If we wanted to calculate the probability of getting a value in the
shaded region above, we would use
pnorm(q = 0, mean = 0, sd = 1)
.
pnorm(q = 0, mean = 0, sd = 1)
## [1] 0.5
As would be expected, we have a 50% probability of getting a value in the shaded region of the plot. However, for most of us, it is not easy to try to convert the shaded region under a curve to a probability value. An easier way to visualize it is through the use of a cumulative probability plot, where each shaded ball represents a .01 (1.00%) probability.
If we were to count every shaded ball on the left of the vertical
line below, we would see that there are 50
balls to the
left of the red line. In other words, there is a 50% probability of
getting a 0 or less
in our distribution.
So, what’s the probability of getting 40 or less
when
the mean is 50
and the standard deviation is
10
?
Looking at our cumulative probability plot, it looks like
16
balls are on the left of the dashed red line. We can
check this with pnorm()
.
pnorm(q = 40, mean = 50, sd = 10)
## [1] 0.1586553
Looks like the probability of getting a value of
40 or less
is slightly less than .16
.
We can also work through this mathematically. We expect that 68% of
values on a normal distribution will fall between plus or minus one
standard deviation. In our example, 40
is one standard
deviation below the mean. As such, the probability of getting
40
or a value less than 40
would be \(\frac{1 - 0.68}{2} = \frac{.32}{2} =
.16\).
What’s the probability of getting a value of 60
or
less, when the mean of the distribution is
30and the standard deviation is
15?`
pnorm(q = 60, mean = 30, sd = 15)
## [1] 0.9772499
((1 - .9545) / 2) + .9545
## [1] 0.97725
What’s the probability of getting a value greater than
-5
when the mean is 0
and the sd is
5
?
pnorm(q = -5, mean = 0, sd = 5, lower.tail = FALSE)
## [1] 0.8413447
Finally, we can use qnorm()
to get the value that
corresponds to a particular cumulative probability. Once again, we can
set the mean (mean
) and standard deviation
(sd
) of the distribution, but we use p
to set
the target probability.
To get the value or less that has a probability of
0.00135
of occurring in a distribution with a mean of
0
and a standard deviation of 1
is calculated
using qnorm(p = .00135, mean = 0, sd = 1)
.
qnorm(p = .00135, mean = 0, sd = 1)
## [1] -2.999977
As we can see, the value is just about -3.00
. This makes
sense if we consider that 99.73
% of values in a normal
distribution are between three standard deviations below and above the
mean; probability of getting -3.00
would be \(\frac{1-.9973}{2} = 0.00135\).
What value or less is associated with a .51
cumulative probability in a normal distribution with a mean of
100
and a standard deviation of 10
?
qnorm(p = .51, mean = 100, sd = 10)
## [1] 100.2507
What value or greater is associated with a .51
probability in a normal distribution with a mean of 100
and
a standard deviation of 10
?
qnorm(p = .51, mean = 100, sd = 10, lower.tail = FALSE)
## [1] 99.74931
You are welcome to work with a partner or in a small group of 2-3 people. Please feel free to ask the lab leader any questions you might have!
You are playing Dungeons and Dragons and, to the Dungeon Master’s displeasure, you run immediately to the dragon that is meant to be encountered at the end of her carefully-crafted campaign.
5
20
-sided dice, and get a 20
on each die. What
is the probability of getting this result?dbinom(x = 5, size = 5, prob = 1/20)
## [1] 0.0000003125
2
(i.e., 3
or more)
20
s she will let you slay the dragon. What is the
probability of getting more than 2
20
s when
rolling 5
20
-sided dice?pbinom(q = 2, size = 5, prob = 1/20, lower.tail = FALSE)
## [1] 0.001158125
Note that with the default setting of lower.tail=TRUE, you include the specified q value and below. When you set lower.tail=FALSE, you are including only q values above what you specified. So this line of code give you the cumulative probability of of all outputs above (and not including) 2 20’s.
.10
. She acquiesces.
What number of 20
s or greater is associated with a
cumulative probability of .10
when rolling 5
20
-sided dice?qbinom(p = .10, size = 5, prob = 1/20, lower.tail = FALSE)
## [1] 1
We are drawing from the same distribution, so size and prob should stay the same. In this case we want to find the output (number of times out of 5 you rolled a 20) where the cumulative probability of that output and all outputs above that (because it’s ok to go above the minimum needed) is at least .1. As noted above, using lower.tail=FALSE can be kinda tricky because it performs its calculations for above and not including the output value.
This isn’t a concern in this case though because qbinom() essentially “rounds up” (note that since we have lower.tail=FALSE, that by “rounding up” I mean summing up the probability from right to left). In the case of a continuous distribution, there is a continuous change in cumulative probability, but because the binomial distribution is discrete, with each change in outcome, the cumulative probability jumps up. Therefore the output of qbinom() gives you the outcome needed to get at least the specified probability.
To check my work, I added up each of the discrete probabilities using dbinom():
dbinom(5, size = 5, prob = 1/20) + dbinom(4, size = 5, prob = 1/20) + dbinom(3, size = 5, prob = 1/20) + dbinom(2, size = 5, prob = 1/20)
## [1] 0.0225925
#OR
pbinom(1, size = 5, prob = 1/20, lower.tail = FALSE)
## [1] 0.0225925
#The probabilities of 2 and above is .023--insufficient to reach the .1 threshold I set.
dbinom(5, size = 5, prob = 1/20) + dbinom(4, size = 5, prob = 1/20) + dbinom(3, size = 5, prob = 1/20) + dbinom(2, size = 5, prob = 1/20) + dbinom(1, size = 5, prob = 1/20)
## [1] 0.2262191
#OR
pbinom(0, size = 5, prob = 1/20, lower.tail = FALSE)
## [1] 0.2262191
#The probabilities of 1 and above is .226, therefore this is the minimum I need.
From data released from the Graduate Coffee Drinkers Association
(GCDA), you know coffee consumption is normally distributed among
graduate students, with the average student drinking 5
cups
of coffee per day and 68% of students drinking between 4
and 6
cups of coffee per day (i.e., the distribution has a
standard deviation of 1
).
2 or less
cups of coffee per day?pnorm(q = 2, mean = 5, sd = 1)
## [1] 0.001349898
50
graduate students from the distribution three
times. Plot each of these samples as a histogram. Are the histograms
identical? Why or why not?hist(rnorm(n = 50, mean = 5, sd = 1))
hist(rnorm(n = 50, mean = 5, sd = 1))
hist(rnorm(n = 50, mean = 5, sd = 1))
10
cups of coffee per day.dnorm(x = 10, mean = 5, sd = 1)
## [1] 0.00000148672
#this is the probability density
#One thing I can do to get a estimate of probability is to consider values between 9.5 and 10.5 to be 10 cups of coffee. I can then calculate the area under the curve for that range through subtraction:
dnorm(x = 10.5, mean = 5, sd = 1) - dnorm(x = 9.5, mean = 5, sd = 1)
## [1] -0.00001587604
#Still quit small, because we're in the range of outliers, but I think a more meaningful value.
0.00
. Calculate the probability that a graduate student
would drink 10 or more
cups of coffee a day.pnorm(q = 9, mean = 5, sd = 1, lower.tail = FALSE)
## [1] 0.00003167124
1
and 100
. You think of
37
and the magician guesses 37
. Assuming the
choice of numbers follows a uniform distribution, what is the
probability that the magician guessed your number at random? Use
dunif()
to prove your intuition is correct.dunif(x = 1, min = 0, max = 100)
## [1] 0.01
3.00
with 10
degrees of freedom. Use
pchisq()
to calculate the probability.pchisq(q = 3.00, df = 10, lower.tail = FALSE)
## [1] 0.9814241
3.00 or greater
with 10
degrees of freedom from a t-distribution. Google (or use your intuition)
to determine how to calculate a cumulative probability from a
t-distribution.pt(q = 3, df = 10, lower.tail = FALSE)
## [1] 0.006671828