Please complete this assignment using the RMarkdown file provided (download Markdown file and data here). Once you download the RMarkdown file, please (1) include your name in the preamble and (2) rename the file to include your last name (e.g., “weston-homework-1.Rmd”). When you turn in the assignment, include both the .Rmd and knitted .doc files.
To receive full credit on this homework assignment, you must earn 30 points. You may notice that the total number of points available on this assignment is 65 – this means you do not have to answer all of the questions. You may choose which questions to answer. You cannot earn more than 30 points, but it may be worth attempting many questions. Here are a couple of things to keep in mind:
Points are all-or-nothing, meaning you cannot receive partial credit if you correctly answer only some of the bullet points for a question. All must be answered correctly.
After the homework has been graded, you may retry questions to earn full credit, but you may only retry the questions you attempted on the first round.
The first time you complete this assignment, it must be turned in by 9am on the due date (October 25). Late assignments will receive 50% of the points earned. For example, if you correctly answer questions totalling 28 points, the assignment will receive 14 points. If you resubmit this assignment with corrected answers (a total of 30 points), the assignment will receive 15 points.
You may discuss homework assignments with your classmates; however, it is important that you complete each assignment on your own and do not simply copy someone else’s code. If we believe one student has copied another’s work, both students will receive a 0 on the homework assignment and will not be allowed to resubmit the assignment for points.
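The scoring rules above (a 30-point cap and 50% credit for late work) can be sketched as a small R function. The function name `homework_credit` and its arguments are hypothetical and are not part of the course materials; this is only an illustration of the arithmetic in the examples above.

```r
# Hypothetical helper (an assumption, not part of the course materials):
# credit received for one submission.
homework_credit <- function(points_correct, late = FALSE, cap = 30) {
  earned <- min(points_correct, cap)   # credit is capped at 30 points
  if (late) earned * 0.5 else earned   # late work receives 50% of points earned
}

homework_credit(28, late = TRUE)   # 14
homework_credit(30, late = TRUE)   # 15
homework_credit(42)                # 30
```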
Data: Some of the questions in this homework assignment use the dataset referred to as homework-world. This dataset is similar to one you’ve seen in class and contains a new variable called World. So-called “first world” countries (coded 1) are those that were aligned with the United States after World War II (e.g., members of NATO) or were considered to be clearly in the U.S. sphere of influence. “Second world” countries (coded 2) are former members of the Soviet Union or countries considered to have been clearly in the Soviet Union’s sphere of influence. “Third world” countries (coded 3) include those considered by the United Nations to be among the least developed countries in the world. The remaining countries are coded 4 for this variable.
This question is about order of operations:
4 divided by 2 plus 2 times 5
Using R, show that this value equals 12.
Using R, show that this value equals 5.
Using R, show that this value equals 20.
Calculate descriptive statistics (mean, standard deviation, median, min, max, range, skew, and kurtosis) for all continuous variables in the homework-world dataset.
Calculate descriptive statistics (mean, standard deviation, median, min, max, range, skew, and kurtosis) for all continuous variables in the homework-world dataset, but only for the “second world” countries.
Using the homework-world dataset, create a boxplot graph for the Corruption variable, with separate boxplots for the four groups defined by the new variable, World. What does this graph tell you, in broad strokes, about:
Calculate the correlations between the following variables in the homework-world dataset:
In class, you were shown a formula for variance as a function of expected values, \(\sigma^2 = E(X^2) - E(X)^2\).
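As a rough illustration of the formula above, the following sketch checks the identity numerically on simulated data (not the homework dataset), reading E as "take the mean." The variable names are assumptions chosen for this example. One detail worth keeping in mind is that R's `var()` divides by n - 1 rather than n, so it has to be rescaled before it matches the expected-value form.

```r
set.seed(1)                        # arbitrary seed, for reproducibility only
x <- rnorm(50, mean = 10, sd = 3)  # simulated data, not the homework dataset
n <- length(x)

# E(X^2) - E(X)^2, reading E as "take the mean"
var_expected <- mean(x^2) - mean(x)^2

# Population-style variance: mean squared deviation from the mean
var_population <- mean((x - mean(x))^2)

# var() divides by n - 1, so rescale it before comparing
var_rescaled <- var(x) * (n - 1) / n

all.equal(var_expected, var_population)
all.equal(var_expected, var_rescaled)
```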
(Treat E as if it means “take the mean.”)
Using the homework-world dataset, create a matrix of scatterplots for the happiness, freedom, and support variables. The scatterplot for each pair should include the linear best fit line (a straight line). (You may need to consult the help page for a particular function.)
Using the homework-world dataset:
Calculate the correlation between the average perception of freedom in a country and the typical generosity of its citizens.
Create a scatterplot showing the relationship between freedom and generosity.
Referencing the scatterplot, consider the different threats to the validity of a correlation. Are there any that you are concerned about? Are there any that you don’t think are a problem?
If you had to guess, do you think the correlation you calculated (a) underestimates the strength of the true relationship between freedom and generosity, (b) overestimates this relationship, or (c) does a good job representing the relationship?
Use the following steps to create a composite score for the non-categorical variables (Happiness, GDP, Life, Support, Generosity, Freedom, and Corruption) and answer some questions using that composite:
Using the homework-world dataset, create a histogram of the life expectancy variable (Life) that includes the following:
Is this distribution skewed? What are two ways you can tell?
Apply a transformation to the Life variable to try to make it more normal (see here for some common ones). Plot the transformed variable with the same vertical lines and density curve. Is your transformed variable skewed? What are two ways you can tell?
We evaluate statistics on the degree to which they are biased and efficient. We discussed bias in class. Now, you’ll have to evaluate efficiency. While bias is a comparison of the statistic to the population parameter (e.g., how close is the expected value of the statistic to the parameter), efficiency is the comparison of two different statistics. More specifically, efficiency refers to the ratio of the precision of the estimates from two different statistics at a given sample size.
Use simulation to estimate the value of the mean and median from a known distribution 100,000 times at many (at least 10) different sample sizes, between \(N = 5\) and \(N = 400\). At each sample size, estimate the variance of the estimates of the mean and median. Use those estimates to calculate the relative efficiency of the mean compared to the median at each sample size, defined as:
\[\text{Efficiency} = \frac{\sigma^2_{mean}}{\sigma^2_{median}}\]
What does an efficiency value of 1 mean? What does an efficiency value of less than 1 mean?
Create a figure to visually display these results; include a description of this figure, including any key takeaway points.
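One possible way to structure the simulation is sketched below. It is a scaled-down starting point under stated assumptions, not a complete answer: it uses a standard normal as the known distribution, only four sample sizes, and 2,000 replications instead of the 100,000 requested, and the names `n_reps`, `sample_sizes`, and `relative_efficiency` are chosen here for illustration.

```r
set.seed(2024)                      # arbitrary seed, for reproducibility only
n_reps <- 2000                      # far fewer than the 100,000 the assignment asks for
sample_sizes <- c(5, 25, 100, 400)  # only four sizes, to keep the sketch fast

# For one sample size: draw `reps` samples from a standard normal (an assumed
# choice of known distribution), compute the mean and median of each sample,
# and return the ratio of the two variances.
relative_efficiency <- function(n, reps = n_reps) {
  samples <- matrix(rnorm(n * reps), nrow = reps)  # one sample per row
  means <- rowMeans(samples)
  medians <- apply(samples, 1, median)
  var(means) / var(medians)         # ratio as defined in the assignment
}

efficiency <- sapply(sample_sizes, relative_efficiency)
data.frame(n = sample_sizes, efficiency = efficiency)

# Simple base-graphics display of the ratio across sample sizes
plot(sample_sizes, efficiency, type = "b",
     xlab = "Sample size (N)", ylab = "Var(mean) / Var(median)")
abline(h = 1, lty = 2)              # reference line where the two variances are equal
```

Building each batch of samples as a matrix keeps the loop over replications implicit; with 100,000 replications and larger sample sizes, memory use grows accordingly, so processing one sample size at a time (as here) is a reasonable design choice.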