class: center, middle, inverse, title-slide

# Causal models

---

## Announcements

* Recording Lecture 14 tonight (5pm, Zoom only)
* Homework #2 due Friday
* Time to find data for your final project
  * 3 continuous variables
  * 1 categorical variable

---

## Correlation does not imply

--

causation

--

always?

--

It's true that the presence of a correlation is not sufficient to demonstrate causality. However, this axiom has been overgeneralized to the point where psychologists refuse to acknowledge any observational work as indicative of causal models. This ignores a rich history of modeling work demonstrating that -- under certain assumptions -- correlational work can uncover real causal mechanisms. Moreover, the causal modeling tradition informs even analyses that are not necessarily interested in proving causation.

---

## Karl Pearson (1857-1936)

- Francis Galton demonstrates that one phenomenon -- regression to the mean -- is not caused by external factors, but is merely the result of natural chance variation.

- Pearson (his student) takes this to the extreme, arguing that causation can never be proven.

"Force as a cause of motion is exactly on the same footing as a tree-god as a cause of growth." .small[(Pearson, 1892, p. 119)]

---

## Sewall Wright (1889-1988)

.pull-left[
![](images/wright.jpeg)
]

.pull-right[
Geneticist; studied guinea pigs at the USDA; later taught zoology at the University of Chicago and then the University of Wisconsin-Madison

Developed path models as a way to identify causal forces and estimate their values

Rebuked by the mathematical community, including Ronald Fisher (who also disagreed with Wright's theories of evolution)
]

---

## Judea Pearl

.pull-left[
- [Professor of computer science at UCLA](http://bayes.cs.ucla.edu/jp_home.html); studies AI

- Extensively studied the history of causation in statistics, inferring causality through data, and the development of path analysis

- Popularized the use of causal graph theory
]

.pull-right[
![](images/pearl.jpg)
]

---

## Article: Rohrer (2018)

**Thinking clearly about correlation and causation: Graphical causal models for observational data**

Psychologists are interested in inherently causal relationships.

* e.g., how does social class cause behavior?

According to Rohrer, how have psychologists attempted to study causal relationships while avoiding issues of ethicality and feasibility?

--

It is impossible to infer causation from correlation without background knowledge about the domain.

* Experimental studies require assumptions as well.

???

"surrogate interventions" -- e.g., perceived social class with subjective ladder

* these lack external validity

Avoiding causal language

* Doesn't stop lay persons or other researchers from inferring; also, maybe that's based in problematic assumptions

Controlling for 3rd variables

* How do we know what to control for?

---

The work by these figures (Pearson*, Wright, Pearl, Rohrer) and others can be used to understand the role of covariates in regression models.

Regression models *imply* causality. Pretending that they don't is silly and potentially dangerous.

Why include covariates or controls (e.g., age, gender) in a regression model?

--

* Remove third-variable problem
* Isolate unique contribution of IV(s) of interest
* Reduce non-random noise in DV

Each covariate needs to be [justified](https://osf.io/38mxq/) theoretically! How can we do this?

---

## Directed Acyclic Graphs (DAGs)

* Visual representations of causal assumptions

DAG models are _qualitative descriptions_ of your theory.
This tool is used before data collection and analysis to plan the most appropriate regression model.

* You will not use a DAG model to estimate numeric values.

* The benefits of DAG models are that they
  * (1) help you clarify the causal model at the heart of your research question, and
  * (2) identify which variables you should and should **not** include in a regression model, based on your causal assumptions.

---

## Directed Acyclic Graphs (DAGs)

### Basic structures

* Boxes or circles represent constructs in a causal model: **nodes**.

* Arrows represent relationships. A → B

* Relationships can follow any functional form (linear, polynomial, sinusoidal, etc.)

* DAGs only allow single-headed arrows; constructs cannot cause each other, nor can you cycle from a construct back to itself.
  * A ⇄ B is not allowed; A ← U → B is
  * Constructs cannot cause themselves, so no feedback loops.

---


```r
library(ggdag)

dag.obj = dagify(B ~ A)

ggdag(dag.obj)
```

![](13-dag_files/figure-html/unnamed-chunk-1-1.png)<!-- -->

---

## DAGs

**Paths** lead from one node to the next; they can include multiple nodes, and there may be multiple paths connecting two nodes.

There are three kinds of paths:

* Chains
* Forks
* Inverted forks

---

## Paths: Chains (Mediation)

Chains have the structure A → B → C.

Chains can transmit associations from the node at the beginning to the node at the end, meaning that you will find a correlation between A and C. These associations represent *real* causal relationships.

**B mediates the influence of A on C.**


```r
dag.obj = dagify(edu ~ iq,
                 income ~ edu)

ggdag(dag.obj) +
  geom_dag_node(color = "grey") +
  geom_dag_text(color = "black") +
  theme_bw()
```

---

.right-column[
The correlation between IQ and income would be significant and meaningful.
]

![](13-dag_files/figure-html/unnamed-chunk-2-1.png)<!-- -->

---

## Paths: Forks (Confounds)

Forks have the structure A ← B → C.

Forks can transmit associations, but they are not causal. They are the structure most relevant to the phenomenon of confounding.

**B is a confound of A and C.**


```r
dag.obj = dagify(edu ~ iq,
                 income ~ iq)

ggdag(dag.obj) +
  geom_dag_node(color = "grey") +
  geom_dag_text(color = "black") +
  theme_bw()
```

---

.right-column[
The correlation between education and income would be significant but not meaningful.
]

![](13-dag_files/figure-html/unnamed-chunk-3-1.png)<!-- -->

---

## Paths: Inverted Forks (Colliders)

Inverted forks have the structure A → B ← C.

Inverted forks cannot transmit associations. In other words, if this is the true causal structure, there won't be a correlation between A and C.

**B is a collider of A and C.**


```r
dag.obj = dagify(income ~ edu,
                 income ~ iq)

ggdag(dag.obj) +
  geom_dag_node(color = "grey") +
  geom_dag_text(color = "black") +
  theme_bw()
```

---

.right-column[
The correlation between education and IQ would be neither significant nor meaningful.
]

![](13-dag_files/figure-html/unnamed-chunk-4-1.png)<!-- -->

---

## Complex paths

In reality, the causal pathways between the constructs in your model will be very complex.


```r
dag.obj = dagify(gpa ~ iq,
                 edu ~ gpa + iq,
                 income ~ edu + iq)
```


```r
ggdag(dag.obj) +
  geom_dag_node(color = "grey") +
  geom_dag_text(color = "black") +
  theme_bw()
```

---

![](13-dag_files/figure-html/unnamed-chunk-7-1.png)<!-- -->

---

## Confounding

DAG models are most useful for psychological scientists thinking about which variables to include in a regression model.
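To see why forks matter for a regression model, here is a minimal simulation sketch (hypothetical values made up for illustration, not data from this lecture): intelligence causes both education and income, and education has *no* causal effect on income.


```r
# hypothetical simulation: iq is a common cause (fork) of edu and income
set.seed(123)
n      = 1000
iq     = rnorm(n)
edu    = 0.5 * iq + rnorm(n)     # iq -> edu
income = 0.7 * iq + rnorm(n)     # iq -> income; edu does NOT cause income

coef(lm(income ~ edu))           # education appears to predict income
coef(lm(income ~ edu + iq))      # edu coefficient shrinks toward zero once iq is included
```

Leaving the common cause out of the model produces a spurious education effect; including it recovers (approximately) the true null effect.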
Recall that one of the uses of regression is to statistically control for variables when estimating the relationship between an X and Y in our research question.

Which variables should we be statistically controlling for?

---

## Confounding

We can see with DAG models that we should be controlling for constructs that create forks, or constructs that cause both our X and Y variables. These constructs are known as .purple[confounds], or constructs that represent common causes of our variables of interest.

We may also want to control for variables that cause Y but are unrelated to X, as this can increase power by reducing unexplained variance in Y (think about sums of squares).

Importantly, we probably do not want to control for variables that cause just X and not Y.

---


```r
dag.obj = dagify(gpa ~ iq,
                 edu ~ gpa + iq,
                 income ~ edu + iq,
                 exposure = "edu",
                 outcome = "income")

ggdag_paths(dag.obj) +
  geom_dag_text(color = "black") +
  theme_bw()
```

---

![](13-dag_files/figure-html/unnamed-chunk-8-1.png)<!-- -->

What should I control for?

---


```r
ggdag_adjustment_set(dag.obj) +
  geom_dag_text(color = "black") +
  theme_bw()
```

![](13-dag_files/figure-html/unnamed-chunk-9-1.png)<!-- -->

---

Confounds open "back-doors" between variables that act like causal pathways, but are not. By closing off the back doors, we isolate the true causal pathways, if there are any, and our regression models will estimate those.

Based on the causal structure I hypothesized on the last slide, if I want to estimate how much education causes income, I should control for intelligence, but I don't need to control for grades.

`$$\large \hat{\text{Income}} = b_0 + b_1(\text{Education}) + b_2(\text{Intelligence})$$`

---

## More control is not always better

We can see with forks that controlling for confounds improves our ability to correctly estimate causal relationships. So controlling for things is good...

Except the other thing that DAGs should teach us is that controlling for the wrong things can dramatically hurt our ability to estimate true causal relationships and can create .purple[spurious correlations], or open up new associations that don't represent true causal pathways.

Let's return to the other two types of paths: chains and inverted forks.

---

## Control in chains

Chains represent mediation; the effect of construct A on C goes *through* construct B.

Intelligence → Educational attainment → Income

What happens if we control for a mediating variable, like education?


```r
dag.obj = dagify(edu ~ iq,
                 income ~ edu,
                 exposure = "iq",
                 outcome = "income")
```

---

.pull-left[

```r
ggdag_dseparated(dag.obj) +
  geom_dag_text(color = "black") +
  theme_bw()
```

![](13-dag_files/figure-html/unnamed-chunk-11-1.png)<!-- -->
]

.pull-right[

```r
ggdag_dseparated(dag.obj, controlling_for = "edu") +
  geom_dag_text(color = "black") +
  theme_bw()
```

![](13-dag_files/figure-html/unnamed-chunk-12-1.png)<!-- -->
]

---

Controlling for mediators removes the association of interest from the model -- you may be removing the variance you most want to study!

Inverted forks also teach us what not to control for: colliders. A .purple[collider] for a pair of constructs is any third construct that is caused by both constructs in the pair. Controlling for (or conditioning on) these variables introduces bias into your estimates.

Intelligence → Income ← Educational attainment

What happens if we control for the collider?
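Before looking at the DAG, here is a minimal simulation sketch of this structure (hypothetical values made up for illustration, not data from this lecture): intelligence and educational attainment are generated independently, and both cause income.


```r
# hypothetical simulation: income is a collider of iq and edu
set.seed(42)
n      = 1000
iq     = rnorm(n)
edu    = rnorm(n)                          # no causal link between iq and edu
income = 0.6 * iq + 0.6 * edu + rnorm(n)   # both cause income

coef(lm(edu ~ iq))            # near zero: iq and edu are unrelated
coef(lm(edu ~ iq + income))   # conditioning on the collider induces a (negative) association
```

Conditioning on the collider manufactures an association between two causally unrelated variables. The ggdag code below demonstrates the same structure graphically.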

```r
dag.obj = dagify(income ~ iq,
                 income ~ edu,
                 exposure = "iq",
                 outcome = "edu")
```

---

.pull-left[

```r
ggdag_dseparated(dag.obj) +
  geom_dag_text(color = "black") +
  theme_bw()
```

![](13-dag_files/figure-html/unnamed-chunk-14-1.png)<!-- -->
]

.pull-right[

```r
ggdag_dseparated(dag.obj, controlling_for = "income") +
  geom_dag_text(color = "black") +
  theme_bw()
```

![](13-dag_files/figure-html/unnamed-chunk-15-1.png)<!-- -->
]

---

## Collider bias

### Unexpected colliders

Missing data or restricted-range problems can introduce collider bias.


```r
dag.obj = dagify(c ~ x,
                 c ~ y,
                 labels = c("x" = "Respiratory Disease",
                            "c" = "Hospitalization",
                            "y" = "Locomotor Disease"),
                 exposure = "x",
                 outcome = "y")
```

---

.pull-left[
Sampling from a specific environment (or a specific subpopulation) can result in collider bias.

Be wary of:
* Missing data
* Subgroup analysis
* Any post-treatment variables

This extends not just to controlling for constructs, but also to estimating interactions with them. If your moderating variable is a collider, you've introduced collider bias.
]

.pull-right[

```r
ggdag_dseparated(dag.obj,
                 controlling_for = "c",
                 text = FALSE,
                 use_labels = "label") +
  theme_bw()
```

![](13-dag_files/figure-html/unnamed-chunk-17-1.png)<!-- -->
]

.small[Sackett et al. (1979)]

---

## Example

A health researcher is interested in studying the relationship between dieting and weight loss. She collects a sample of participants, measures their weight, and then asks whether they are on a diet. She returns to these participants two months later and measures their final weight and how much weight they lost.

What should the researcher control for when assessing the relationship between dieting and weight loss?

---


```r
dag.obj = dagify(diet ~ wi,
                 wf ~ diet + wi,
                 loss ~ wi + wf,
                 exposure = "diet",
                 outcome = "loss")
```

---


```r
ggdag_adjustment_set(dag.obj) +
  theme_bw()
```

![](13-dag_files/figure-html/unnamed-chunk-19-1.png)<!-- -->

The researcher should control for the initial weight (fork/third variable), but not the final weight (chain/mediator).

---

The researcher is studying the association between X and Y. What should be controlled for?

![](13-dag_files/figure-html/unnamed-chunk-20-1.png)<!-- -->

---


```r
ggdag_adjustment_set(dag.obj) +
  theme_bw()
```

![](13-dag_files/figure-html/unnamed-chunk-21-1.png)<!-- -->

---

DAG models come with a **strong** assumption: no relevant variables have been omitted from the graph, and the causal pathways have been correctly specified. This is a tall order, and potentially impossible to meet. Moreover, two different researchers may reasonably disagree about what the true model looks like.

It's your responsibility to decide for yourself what you think the model is, provide as much evidence as you can, and listen with an open mind to those who disagree.

![](images/xkcd.png)

---

<img src="images/rohrer_tweet.png" width="50%" />

---

class: inverse

## Next time...

Interactions (moderators)