Instructions

Please complete this assignment using the RMarkdown file provided. Once you download the RMarkdown file please (1) include your name in the preamble, (2) rename the file to include your last name (e.g., “weston-homework-2.Rmd”). When you turn in the assignment, include both the .Rmd and knitted .html files.

To receive full credit on this homework assignment, you must earn 30 points. You may notice that the total number of points available to earn on this assignment is 65 – this means you do not have to answer all of the questions. You may choose which questions to answer. You cannot earn more than 30 points, but it may be worth attempting many questions for learning’s sake. Here are a couple of things to keep in mind:

  1. Points are all-or-nothing, meaning you cannot receive partial credit if you correctly answer only some of the bullet points for a question. All must be answered correctly.

  2. After the homework has been graded, you may retry questions to earn full credit, but you may only retry the questions you attempted on the first round.

  3. The first time you complete this assignment, it must be turned in by 9am on the due date. Late assignments will receive 50% of the points earned. For example, if you correctly answer questions totaling 27 points, the assignment will receive 13.5 points. If you resubmit this assignment with corrected answers (a total of 30 points), the assignment will receive 15 points.

  4. You may discuss homework assignments with your classmates; however, it is important that you complete each assignment on your own and do not simply copy someone else’s code. If we believe one student has copied another’s work, both students will receive a 0 on the homework assignment and will not be allowed to resubmit the assignment for points.

Data:

Some of the questions in this homework assignment use the dataset referred to as homework-world. This dataset is similar to one you’ve seen in class and contains a new variable called World. So-called “first world” countries (coded 1) are those that were aligned with the United States after World War II (e.g., members of NATO) or were considered to be clearly in the U.S. sphere of influence. “Second world” countries (coded 2) are former members of the Soviet Union or countries considered to have been clearly in the Soviet Union’s sphere of influence. “Third world” countries (coded 3) include those considered by the United Nations to be among the least developed countries in the world. The remaining countries are coded 4 for this variable.

You’ll also be asked to use a dataset called homework-happy. This dataset contains variables indexing friendship quality, happiness, school success, parental SES, and an intervention group. Note that some of these variable names have spaces in them, which can make coding difficult. You will have to surround variable names with the symbol ` in order for R to recognize the full variable name. Alternatively, the janitor package includes the function clean_names, which will modify variable names by adding _ in place of spaces. I recommend this approach. I’ve provided an example of this in question 2.5.
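As a minimal sketch of the two options, here is how each looks on a small made-up data frame. The column names and values below are hypothetical stand-ins, not the actual contents of homework-happy.

```r
library(janitor)

# Hypothetical toy data with spaces in the column names
toy <- data.frame(
  `Friendship Quality` = c(3.2, 4.1, 2.8, 3.9),
  `School Success`     = c(70, 85, 64, 90),
  check.names = FALSE  # keep the spaces instead of letting R replace them
)

# Option 1: wrap the full variable name in backticks
mean_fq <- mean(toy$`Friendship Quality`)

# Option 2: convert all names to snake_case once, then refer to them normally
toy_clean <- clean_names(toy)
names(toy_clean)
```

After the second option, every later line of code can use the cleaned names without backticks.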

2-point Questions

Question 1

Use the homework-world data. You’ll need the functions pcor.test and spcor.test in the ppcor package.

  • Calculate the zero-order correlation between Happiness and Freedom. Interpret this correlation and its statistical significance.

  • Calculate the semi-partial correlation of Happiness and Freedom controlling for GDP. Treat Happiness as the outcome. Interpret this correlation and its significance. What do you learn when you compare this correlation to the zero-order correlation?

Question 2

  • Calculate the partial correlation of Happiness and Freedom controlling for GDP. Treat Happiness as the outcome. Interpret this correlation and its significance. What do you learn when you compare this correlation to the semi-partial correlation?
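If the difference between the two kinds of correlation feels abstract, it may help to build them by hand from residuals. The sketch below uses simulated variables (y, x, and z are hypothetical stand-ins, not the homework data) and base R only; the ppcor functions are expected to return the same values along with significance tests.

```r
set.seed(1)

# Simulated stand-ins: y = outcome, x = predictor of interest, z = control
n <- 200
z <- rnorm(n)
x <- 0.5 * z + rnorm(n)
y <- 0.4 * x + 0.3 * z + rnorm(n)

# Residualize: the part of x (and of y) that z cannot predict
x_resid <- resid(lm(x ~ z))
y_resid <- resid(lm(y ~ z))

zero_order   <- cor(y, x)
semi_partial <- cor(y, x_resid)        # z removed from x only
partial      <- cor(y_resid, x_resid)  # z removed from both x and y

round(c(zero_order = zero_order,
        semi_partial = semi_partial,
        partial = partial), 3)
```

The only difference between the last two lines is whether the outcome has also been residualized, which is why the partial correlation can never be smaller in absolute value than the semi-partial.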

Question 3

Using the data homework-world, create a regression model predicting happiness from GDP, Freedom, and Corruption. Interpret each regression coefficient.

Question 4

Calculate the standardized regression coefficients (b*) for the model above. You can use whatever method you would like.
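Two common methods are sketched below on simulated data (the variables y, x1, and x2 are hypothetical, chosen only for illustration). Both should give the same answer, so use whichever you find easier to explain.

```r
set.seed(2)

# Simulated data: one outcome and two predictors on different scales
n  <- 150
x1 <- rnorm(n, mean = 10, sd = 2)
x2 <- rnorm(n, mean = 50, sd = 8)
y  <- 1 + 0.6 * x1 + 0.1 * x2 + rnorm(n)
d  <- data.frame(y, x1, x2)

fit <- lm(y ~ x1 + x2, data = d)

# Method 1: rescale each slope by sd(predictor) / sd(outcome)
b      <- coef(fit)[-1]                       # drop the intercept
b_star <- b * sapply(d[names(b)], sd) / sd(d$y)

# Method 2: z-score every variable, then refit the same model
d_z      <- as.data.frame(scale(d))
fit_z    <- lm(y ~ x1 + x2, data = d_z)
b_star_z <- coef(fit_z)[-1]

rbind(rescaled = b_star, refit = b_star_z)
```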

Question 5

You are testing the efficacy of multiple intervention programs on school success. Your research design includes three groups: a control group, a tutoring program, and a study group program. You analyze your data and report the results:

library(tidyverse)
library(janitor)
happy_d = read_csv("https://raw.githubusercontent.com/uopsych/psy612/master/homework/homework-happy.csv")
happy_d = clean_names(happy_d)
mod2.3 = lm(school_success ~ intervention_group, data = happy_d)
anova(mod2.3)
## Analysis of Variance Table
## 
## Response: school_success
##                     Df  Sum Sq Mean Sq F value Pr(>F)
## intervention_group   1    1.02  1.0191  0.0941 0.7595
## Residuals          116 1255.69 10.8249

During a Zoom meeting, you show these results to your adviser, who hasn’t had any coffee yet today. They glance briefly at your Markdown output and say, “You did this wrong. Do it again.” Before they can explain how to fix the code, their dog starts barking. They mutter something under their breath, say, “Gotta deal with the dog. I’ll see you next week,” and end the call.

Fix the code above to generate the correct analysis for this research question.

5-point Questions

Question 1

Use the homework-world data. Fit a model with Happiness as the outcome and Generosity as the predictor. Use it to answer the following questions:

  • What is the confidence interval for the estimate of the slope? In your own words, what does this confidence interval tell you?

  • A hitherto unknown country is discovered in the middle of the Pacific Ocean, known to its citizens as Westonia. Westonia has a Generosity score of .25. Would you be surprised to learn that they have a happiness score of 7.7?

  • Plot your regression line. Be sure to include (1) the raw data points, (2) the 95% confidence band, and (3) the 95% prediction band.
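As a reminder of how the two kinds of interval differ, here is a minimal sketch using base R's predict() on simulated data. The variables x and y and the value 0.25 are illustrative assumptions, not results from homework-world.

```r
set.seed(3)

# Simulated data with a single predictor
n   <- 100
x   <- runif(n, 0, 1)
y   <- 5 + 1.5 * x + rnorm(n)
fit <- lm(y ~ x)

# Confidence interval for the slope
slope_ci <- confint(fit, "x", level = 0.95)

# Intervals at one new predictor value
new_obs  <- data.frame(x = 0.25)
conf_int <- predict(fit, newdata = new_obs, interval = "confidence", level = 0.95)
pred_int <- predict(fit, newdata = new_obs, interval = "prediction", level = 0.95)

rbind(confidence = conf_int[1, ], prediction = pred_int[1, ])
```

The confidence interval describes uncertainty about the average outcome at that predictor value, while the prediction interval describes where a single new observation is likely to fall, so the prediction interval is always the wider of the two.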

Question 2

You’re interested in studying the joint and combined influence of school success, SES, and friendship quality on the happiness of adolescents. Using the dataset homework-happy, build three regression models. Each model should use happiness as the outcome variable and be nested within the subsequent model. That is, you should start with one predictor and add one additional predictor at a time.

  • Justify the choices you made in building your models: How did you decide what order to add the variables?

  • Formally (statistically) compare these models. What do you conclude?
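One way to set up a formal comparison of nested models is to pass them, in order, to anova(). The sketch below uses simulated data with hypothetical predictors a, b, and c; the order in which they are entered here is arbitrary and is not a recommendation for your own models.

```r
set.seed(4)

# Simulated data: c is unrelated to the outcome by construction
n   <- 120
d   <- data.frame(a = rnorm(n), b = rnorm(n), c = rnorm(n))
d$y <- 0.5 * d$a + 0.3 * d$b + rnorm(n)

# Each model is nested within the next one
m1 <- lm(y ~ a, data = d)
m2 <- lm(y ~ a + b, data = d)
m3 <- lm(y ~ a + b + c, data = d)

# Each row after the first tests the change from the previous model
comparison <- anova(m1, m2, m3)
comparison

# R-squared for each model, for describing the size of each change
r2 <- sapply(list(m1, m2, m3), function(m) summary(m)$r.squared)
round(r2, 3)
```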

Question 3

Using the dataset homework-happy, create a regression model predicting happiness from friendship quality, SES, and school success; be sure to save the output to an object.

Load the sjPlot package and run through the following code. Interpret each figure along the way. (As a general hint: you’ll want to open the help page for this function or use Google to find tutorials online.)

  • Enter your model object into the function plot_model and set the argument type = "est". What does this plot represent?

  • Enter your model object into the function plot_model and set the argument type = "std". What does this plot represent?

  • Enter your model object into the function plot_model and set the argument type = "pred" and the argument terms = "school_success" (or whatever the name of your school success variable is). What does this plot represent?

  • Enter your model object into the function plot_model and set the argument type = "pred" and the argument terms = c("school_success", "ses[meansd]") (or whatever the names of your school success and SES variables are). What does this plot represent? (Hint: think about the regression plane….)

10-point Questions

Question 1

Using the dataset homework-happy, run a three-predictor regression predicting happiness by friendship quality, SES, and school success.

Check each of the six assumptions discussed in class. List each assumption and state how you examined (or would examine) that assumption. Note that not every assumption can be directly examined, but all should be addressed. Include plots where applicable and be sure to interpret your output.

Question 2

Use the homework-world dataset for this question. This dataset contains 7 variables of interest: happiness, gross domestic product, support, life expectancy, freedom, generosity, and corruption.

  • Conduct an analysis of variance (use the aov() function) for each of the measures, using country development status (world) as the grouping variable. In a table, report the degrees of freedom, F-statistic, and p-value for each analysis (one row for each analysis). Make sure this table is formatted nicely, i.e., not just R output. I recommend kable from the knitr package, but there are many good options available.

  • If the F-test is significant for an analysis, conduct follow-up pairwise comparisons using a Holm correction. Report these in a table (one table per analysis). Make sure these tables are formatted nicely.

  • Recall that the ANOVA model has the same homogeneity of variance assumption as the independent samples t-test. Assess the homogeneity of variance assumption for each of the outcome measures (use the leveneTest() function from the car package). Report these results in a single table and comment on whether this assumption is satisfied for each measure.

  • If we cannot assume homogeneity of variance when using a (Student’s) independent samples t-test, we run a Welch’s t-test, which doesn’t have this assumption (but has lower power). There is an analogous test in the ANOVA framework, called Welch’s one-way test. Re-run the analyses of variance, but now use the Welch one-way test (use the oneway.test() function). Report these results in a single table. Make sure this table is formatted nicely. Comment on whether any conclusions about group differences change compared to the original ANOVAs.

Earn 5 additional points: If you are looking to stretch your R skills a bit, you can earn additional points by completing this problem using iteration functions. Here’s what that means: you’ll see this problem asks you to run multiple ANOVA tests. Instead of hard coding these (writing separate code for each test), write code that applies a single function or set of code to each outcome variable. The tricky part is doing this without loops. To accomplish this, you might look into the various apply functions (apply, sapply, and lapply), but I would recommend diving into the world of purrr. These analyses can be completed elegantly with the use of gather (or pivot_longer) and map. This tutorial is a nice starting place, but you should expect to do a bit of independent learning. To earn these bonus points, you’ll need to use iteration on all parts of this problem except for the pairwise comparisons. (Unless you want to try that too!)
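To give a sense of what iteration without loops can look like, here is a rough sketch in base R using lapply on simulated data. The data frame toy, its columns, and the helper run_anova are all hypothetical; a purrr version would follow the same pattern with map in place of lapply, and the same idea should extend to the other tests in this problem.

```r
set.seed(5)

# Simulated data: three outcome measures and one grouping factor
toy <- data.frame(
  group = factor(rep(c("a", "b", "c"), each = 30)),
  m1 = rnorm(90),
  m2 = rnorm(90, mean = rep(c(0, 0.5, 1), each = 30)),
  m3 = rnorm(90)
)
outcomes <- c("m1", "m2", "m3")

# Hypothetical helper: run one ANOVA and return a one-row summary
run_anova <- function(outcome, data) {
  fit <- aov(reformulate("group", response = outcome), data = data)
  tab <- summary(fit)[[1]]
  data.frame(
    outcome = outcome,
    df1     = tab[["Df"]][1],
    df2     = tab[["Df"]][2],
    f_value = tab[["F value"]][1],
    p_value = tab[["Pr(>F)"]][1]
  )
}

# Apply the helper to every outcome and stack the rows into one table
anova_table <- do.call(rbind, lapply(outcomes, run_anova, data = toy))
anova_table
```

The resulting data frame has one row per analysis, which makes it straightforward to pass to a table-formatting function.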

20-point Questions

Question 1

In PSY 611 we discussed the problem of p-hacking, which can arise in quite a number of ways. While we have not spent much time talking about p-hacking this term, regression models are just as susceptible to p-hacking. In fact, there are additional tools researchers can exploit to generate significant p-values.

Imagine that you have access to a dataset \((N = 50)\) that has an outcome you’re interested in studying (Y) and a variable that you believe causes that outcome (X), as well as 30 other variables that may or may not be related to your research question. You create a model regressing Y on X and test the significance of the slope coefficient of X. If this is significant, then great! You stop analyzing the data. But if this is not significant, you try adding different covariates to your model until the slope of X is significant or you run out of covariates, whichever comes first. You’re concerned about overfitting, so you only include one covariate in your model at a time.

Simulate this scenario 5,000 times. Set a seed (using set.seed) at the beginning so I can reproduce your results. For each simulated study, each variable should be randomly drawn from a normal distribution. In other words, the data will be consistent with the null hypothesis. For each study, tally whether the null hypothesis is rejected, the number of regression models that are built, the final unstandardized regression coefficient associated with X, and the final p-value associated with the slope of X. The proportion of rejections over the 5,000 studies is the empirical Type I error rate. We are interested in whether this matches closely the significance level chosen for the t-test (i.e., .05). The average effect size across the 5,000 studies should be close to 0. We’re also interested in the distribution of p-values across this simulation.

One strategy would be to use an outer loop to index the 5,000 scenarios and an inner loop to index the changing of the regression model within a study. Some conditionals will be needed to decide if an interim slope test is significant. You will also need a way to stop a given scenario and move on to the next one if the test of the slope is significant.
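The loop structure described above might be sketched roughly as follows. This is only one possible skeleton, not a complete answer: the number of studies is cut to 200 so it runs quickly, the object names (results, covs, est) are arbitrary, and you will still need to scale it up and summarize the output yourself.

```r
set.seed(6)

n_studies <- 200  # reduced from 5,000 so the sketch runs quickly
n_obs     <- 50
n_cov     <- 30

# One row of tallies per simulated study
results <- data.frame(rejected = logical(n_studies),
                      n_models = integer(n_studies),
                      b_x      = numeric(n_studies),
                      p_x      = numeric(n_studies))

for (i in seq_len(n_studies)) {            # outer loop: one study per pass
  y    <- rnorm(n_obs)
  x    <- rnorm(n_obs)
  covs <- matrix(rnorm(n_obs * n_cov), nrow = n_obs)

  # First model: no covariates
  est      <- summary(lm(y ~ x))$coefficients["x", ]
  n_models <- 1L

  for (j in seq_len(n_cov)) {              # inner loop: one covariate at a time
    if (est[["Pr(>|t|)"]] < .05) break     # stop as soon as the slope is significant
    cov_j    <- covs[, j]
    est      <- summary(lm(y ~ x + cov_j))$coefficients["x", ]
    n_models <- n_models + 1L
  }

  results$rejected[i] <- est[["Pr(>|t|)"]] < .05
  results$n_models[i] <- n_models
  results$b_x[i]      <- est[["Estimate"]]
  results$p_x[i]      <- est[["Pr(>|t|)"]]
}

head(results)
```

Here break plays the role of the stopping rule: it exits the inner loop for the current study, and the outer loop then moves on to the next one.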

  • First, determine the empirical Type I error rate for the scenario described above. How does it compare to the significance level of .05 set for the inferential test of the slope?

  • Determine the average effect size for this repeated testing scenario. How does it compare to the expected value given that the null hypothesis is true in this simulation?

  • Construct a histogram that shows the distribution of final p-values for the 5,000 studies. Comment on anything you notice as being odd about the shape of this distribution.

  • Construct a histogram that shows the distribution of final p-values for the 5,000 studies, but limit this figure to only p-values smaller than .10. Can you make any conclusion about the likelihood of specific p-values when the null hypothesis is true and the researcher is p-hacking?

  • Repeat the scenario above, but without the addition of any covariates. Construct a histogram that shows the distribution of final p-values for the 5,000 studies. What do you conclude about the distribution of p-values when the null hypothesis is true and the researcher is not p-hacking?

  • Finally, repeat the scenario above (no p-hacking), but using an X and a Y that are associated with each other. (Hint: A regression equation may be especially helpful here.) Construct a histogram that shows the distribution of final p-values for the 5,000 studies.