class: center, middle, inverse, title-slide

# Causal models

---

## Correlation does not imply

--

causation

--

always?

---

## Karl Pearson (1857-1936)

- Francis Galton demonstrates that one phenomenon -- regression to the mean -- is not caused by external factors, but is merely the result of natural chance variation.

- Pearson (his student) takes this to the extreme, arguing that causation can never be proven.

"Force as a cause of motion is exactly on the same footing as a tree-god as a cause of growth." .small[(Pearson, 1892, p. 119)]

---

## Sewall Wright (1889-1988)

.pull-left[
![](images/wright.jpeg)
]

.pull-right[
Geneticist; studied guinea pigs at the USDA; then teaches zoology at the University of Chicago and later at the University of Wisconsin-Madison

Developed path models as a way to identify causal forces and estimate their values

Rebuked by the mathematical community, including Ronald Fisher (who also disagreed with Wright's theories of evolution)
]

---

## Judea Pearl

.pull-left[
- [Professor of computer science at UCLA](http://bayes.cs.ucla.edu/jp_home.html); studies AI

- Extensively studied the history of causation in statistics, inferring causality from data, and the development of path analysis

- Popularized the use of causal graph theory
]

.pull-right[
![](images/pearl.jpg)
]

---

## Article: Rohrer (2018)

**Thinking clearly about correlation and causation: Graphical causal models for observational data**

Psychologists are interested in inherently causal relationships.

* e.g., how does social class cause behavior?

According to Rohrer, how have psychologists attempted to study causal relationships while avoiding issues of ethicality and feasibility?

--

It is impossible to infer causation from correlation without background knowledge about the domain.

* Experimental studies require assumptions as well.

???

"Surrogate interventions" -- e.g., manipulating perceived social class with a subjective ladder

* these lack external validity

Avoiding causal language

* Doesn't stop lay people or other researchers from inferring causation; also, maybe that inference rests on problematic assumptions

Controlling for third variables

* How do we know what to control for?

---

The work by these figures (Pearson, Wright, Pearl, Rohrer) and others can be used to understand the role of covariates in regression models.

Regression models *imply* causality. Pretending that they don't is silly and potentially dangerous.

Why include covariates or controls (e.g., age, gender) in a regression model?

--

* Remove the third-variable problem
* Isolate the unique contribution of the IV(s) of interest
* Reduce non-random noise in the DV

Each covariate needs to be [justified](https://osf.io/38mxq/) theoretically! How can we do this?

---

## Directed Acyclic Graphs (DAGs)

* Visual representations of causal assumptions

DAG models are _qualitative descriptions_ of your theory. This tool is used before data collection and analysis to plan the most appropriate regression model.

* You will not use a DAG model to estimate numeric values.
* The benefit of DAG models is that they
  * (1) help you clarify the causal model at the heart of your research question, and
  * (2) identify which variables you should and should **not** include in a regression model, based on your causal assumptions.

---

## Directed Acyclic Graphs (DAGs)

### Basic structures

* Boxes or circles represent constructs in a causal model: **nodes**.
* Arrows represent relationships. A → B
* Relationships can follow any functional form (linear, polynomial, sinusoidal, etc.).
* DAGs only allow single-headed arrows; constructs cannot mutually cause each other, nor can you cycle from a construct back to itself.
  * A ⇄ B is not allowed; A ← U → B is.
  * Constructs cannot cause themselves, so no feedback loops.

---

```r
library(ggdag)
dag.obj = dagify(B ~ A)
ggdag(dag.obj)
```

![](13-dag_files/figure-html/unnamed-chunk-1-1.png)<!-- -->

---

## DAGs

**Paths** lead from one node to the next; they can include multiple nodes, and there may be multiple paths connecting two nodes. There are three kinds of paths:

* Chains
* Forks
* Inverted forks

---

## Paths: Chains (Mediation)

Chains have the structure A → B → C.

Chains can transmit associations from the node at the beginning to the node at the end, meaning that you will find a correlation between A and C. These associations represent *real* causal relationships.

**B mediates the influence of A on C.**

```r
dag.obj = dagify(edu ~ iq, income ~ edu)

ggdag(dag.obj) +
  geom_dag_node(color = "grey") +
  geom_dag_text(color = "black") +
  theme_bw()
```

---

.right-column[
The correlation between IQ and income would be significant and meaningful.
]

![](13-dag_files/figure-html/unnamed-chunk-2-1.png)<!-- -->

---

## Paths: Forks (Confounds)

Forks have the structure A ← B → C.

Forks can transmit associations, but those associations are not causal. Forks are the structure most relevant to the phenomenon of confounding.

**B is a confound of A and C.**

```r
dag.obj = dagify(edu ~ iq, income ~ iq)

ggdag(dag.obj) +
  geom_dag_node(color = "grey") +
  geom_dag_text(color = "black") +
  theme_bw()
```

---

.right-column[
The correlation between education and income would be significant but not meaningful.
]

![](13-dag_files/figure-html/unnamed-chunk-3-1.png)<!-- -->

---

## Paths: Inverted Forks (Colliders)

Inverted forks have the structure A → B ← C.

Inverted forks cannot transmit associations. In other words, if this is the true causal structure, there won't be a correlation between A and C.

**B is a collider of A and C.**

```r
dag.obj = dagify(income ~ edu, income ~ iq)

ggdag(dag.obj) +
  geom_dag_node(color = "grey") +
  geom_dag_text(color = "black") +
  theme_bw()
```

---

.right-column[
The correlation between education and IQ would be neither significant nor meaningful.
]

![](13-dag_files/figure-html/unnamed-chunk-4-1.png)<!-- -->

---

## Complex paths

In reality, the causal pathways between the constructs in your model will be very complex.

```r
dag.obj = dagify(gpa ~ iq,
                 edu ~ gpa + iq,
                 income ~ edu + iq)
```

```r
ggdag(dag.obj) +
  geom_dag_node(color = "grey") +
  geom_dag_text(color = "black") +
  theme_bw()
```

---

![](13-dag_files/figure-html/unnamed-chunk-7-1.png)<!-- -->

---

## Confounding

DAG models are most useful for psychological scientists thinking about which variables to include in a regression model. Recall that one of the uses of regression is to statistically control for variables when estimating the relationship between X and Y in our research question.

Which variables should we be statistically controlling for?

---

## Confounding

We can see with DAG models that we should control for constructs that create forks -- that is, constructs that cause both our X and our Y variable. These constructs are known as **confounds**: constructs that represent common causes of our variables of interest.

We may also want to control for variables that cause Y but are unrelated to X, as this can increase power by reducing unexplained variance in Y (think about SS).

Importantly, we don't need to control for variables that cause just X and not Y. The simulation on the next slide illustrates these points.
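
---

## Confounding: a quick simulation

DAG models themselves are qualitative, but a minimal simulation sketch (hypothetical variables, not from Rohrer) can illustrate both claims. Here `z` is a fork causing both `x` and `y`, `q` causes only `y`, and the true effect of `x` on `y` is zero.

```r
set.seed(123)
n = 1000
z = rnorm(n)                   # confound: common cause of x and y
q = rnorm(n)                   # causes y only; unrelated to x
x = .7 * z + rnorm(n)          # x does not cause y
y = .7 * z + .7 * q + rnorm(n)

round(coef(summary(lm(y ~ x)))["x", ], 2)          # biased: confounded by z
round(coef(summary(lm(y ~ x + z)))["x", ], 2)      # near the true value of zero
round(coef(summary(lm(y ~ x + z + q)))["x", ], 2)  # adding q shrinks the SE
```

Controlling for the fork `z` removes the spurious association; adding `q`, a cause of Y alone, leaves the estimate essentially unchanged but makes it more precise.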
---

```r
dag.obj = dagify(gpa ~ iq,
                 edu ~ gpa + iq,
                 income ~ edu + iq,
                 exposure = "edu",
                 outcome = "income")

ggdag_paths(dag.obj) +
  geom_dag_text(color = "black") +
  theme_bw()
```

---

![](13-dag_files/figure-html/unnamed-chunk-8-1.png)<!-- -->

What should I control for?

---

```r
ggdag_adjustment_set(dag.obj) +
  geom_dag_text(color = "black") +
  theme_bw()
```

![](13-dag_files/figure-html/unnamed-chunk-9-1.png)<!-- -->

---

Confounds open "back doors" between variables that act like causal pathways but are not. By closing off the back doors, we isolate the true causal pathways, if there are any, and our regression models will estimate those.

Based on the causal structure I hypothesized on the last slide, if I want to estimate how much education causes income, I should control for intelligence, but I don't need to control for grades.

`$$\large \hat{\text{Income}} = b_0 + b_1(\text{Education}) + b_2(\text{Intelligence})$$`

---

## More control is not always better

We can see with forks that controlling for confounds improves our ability to correctly estimate causal relationships. So controlling for things is good...

Except the other thing that DAGs should teach us is that controlling for the wrong things can dramatically hurt our ability to estimate true causal relationships and can create spurious correlations -- new associations that don't represent true causal pathways.

Let's return to the other two types of paths: chains and inverted forks.

---

## Control in chains

Chains represent mediation; the effect of construct A on C goes *through* construct B.

Intelligence → Educational attainment → Income

What happens if we control for a mediating variable, like education?

```r
dag.obj = dagify(edu ~ iq,
                 income ~ edu,
                 exposure = "iq",
                 outcome = "income")
```

---

.pull-left[
```r
ggdag_dseparated(dag.obj) +
  geom_dag_text(color = "black") +
  theme_bw()
```

![](13-dag_files/figure-html/unnamed-chunk-11-1.png)<!-- -->
]

.pull-right[
```r
ggdag_dseparated(dag.obj, controlling_for = "edu") +
  geom_dag_text(color = "black") +
  theme_bw()
```

![](13-dag_files/figure-html/unnamed-chunk-12-1.png)<!-- -->
]

---

Controlling for mediators removes the association of interest from the model -- you may be removing the variance you most want to study!

Inverted forks also teach us what not to control for: colliders. A **collider** for a pair of constructs is any third construct that is caused by both constructs in the pair. Controlling for (or conditioning on) these variables introduces bias into your estimates.

Intelligence → Income ← Educational attainment

What happens if we control for the collider?

```r
dag.obj = dagify(income ~ iq,
                 income ~ edu,
                 exposure = "iq",
                 outcome = "edu")
```

---

.pull-left[
```r
ggdag_dseparated(dag.obj) +
  geom_dag_text(color = "black") +
  theme_bw()
```

![](13-dag_files/figure-html/unnamed-chunk-14-1.png)<!-- -->
]

.pull-right[
```r
ggdag_dseparated(dag.obj, controlling_for = "income") +
  geom_dag_text(color = "black") +
  theme_bw()
```

![](13-dag_files/figure-html/unnamed-chunk-15-1.png)<!-- -->
]

---

## Collider bias

### Unexpected colliders

Missing data or restricted-range problems can introduce collider bias.

```r
dag.obj = dagify(c ~ x,
                 c ~ y,
                 labels = c("x" = "Respiratory Disease",
                            "c" = "Hospitalization",
                            "y" = "Locomotor Disease"),
                 exposure = "x",
                 outcome = "y")
```

---

.pull-left[
Sampling from a specific environment (or a specific subpopulation) can result in collider bias.
Be wary of:

* Missing data
* Subgroup analysis
* Any post-treatment variables

This extends beyond controlling for constructs to estimating interactions with them: if your moderating variable is a collider, you've introduced collider bias.
]

.pull-right[
```r
ggdag_dseparated(dag.obj,
                 controlling_for = "c",
                 text = FALSE,
                 use_labels = "label") +
  theme_bw()
```

![](13-dag_files/figure-html/unnamed-chunk-17-1.png)<!-- -->
]

.small[Sackett et al. (1979)]

---

## Example

A health researcher is interested in studying the relationship between dieting and weight loss. She collects a sample of participants, measures their weight, and then asks whether they are on a diet. She returns to these participants two months later and measures their weight again.

What should the researcher control for when assessing the relationship between dieting and weight loss?

---

```r
dag.obj = dagify(diet ~ wi,
                 wf ~ diet + wi,
                 loss ~ wi + wf,
                 exposure = "diet",
                 outcome = "loss")
```

---

```r
ggdag_adjustment_set(dag.obj) +
  theme_bw()
```

![](13-dag_files/figure-html/unnamed-chunk-19-1.png)<!-- -->

---

What should be controlled here?

![](13-dag_files/figure-html/unnamed-chunk-20-1.png)<!-- -->

---

```r
ggdag_adjustment_set(dag.obj) +
  theme_bw()
```

![](13-dag_files/figure-html/unnamed-chunk-21-1.png)<!-- -->

---

DAG models carry a **strong** assumption: that no relevant variables have been omitted from the graph and that the causal pathways have been correctly specified. This is a tall order, and potentially impossible to meet. Moreover, two different researchers may reasonably disagree about what the true model looks like.

It's your responsibility to decide for yourself what you think the model is, provide as much evidence as you can, and listen with an open mind to those who disagree.

![](images/xkcd.png)

---

class: inverse

## Next time...

Interactions (moderators)