Latent variables

class: center, middle, inverse, title-slide

# Latent variables
## What are we measuring?

---

<style>
.remark-slide thead, .remark-slide tr:nth-child(2n) {
        background-color: white;
    }
</style>

## Last time

* Descriptive statistics

  * Central tendency

  * Spread

  * Correlations

???

* What does correlation communicate?
* Range? What does direction mean?
* What is a large correlation?

---

## Today

Moral of the story: measuring stuff is hard. Put some thought into it.

---

## Quantitude podcast

* According to Curran and Hancock, what is a latent variable?

* What are some examples of latent variables in the podcast?

???

From C&H

Latent variables -- the things we cannot directly observe
Variable -- individual differences on it
A placeholder for the covariation among a set of variables.

Is it missing for at least some of the observations in your dataset?

Examples: depression, anxiety, the economy, the quality of a college/uni, Pandora music station, gravity, disease or underlying conditions

---

### [Bollen (2002)](../readings/Bollen_2002.pdf)

Definitions of latent variables:
.pull-left[
* Informal
  * Hypothetical construct
  * Unmeasurable
  * Data reduction
]
.pull-right[
* Formal
  * Local independence
  * Expected value
  * Nondeterministic function
]

???

Underlying assumption of these definitions: we measure latent variables using multiple observed variables

Sample realization definition: A latent random (or nonrandom) variable is a random (or nonrandom) variable for which there is no sample realization for at least some observations in a given sample.

---
<img src="images/sanjay_twitter.png" width="75%" />

---

### Measuring latent variables

If latent variables are unobserved, how do we study them?

* The challenge of **psychometrics** is to assign numbers to observations in a way that best summarizes the underlying constructs ([Revelle, 2009](http://personality-project.org/r/book/Chapter3.pdf))

How do we create this in our dataset (practically speaking)?

* With the people around you, come up with one latent variable that you might be interested in and describe how you would measure it.

???

Walk students through the example of job success. How would you measure this? What items would you use? How would you assign numbers to those items?

How would you use those numbers to create a job success score?
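
One possible answer to sketch on the board, assuming three hypothetical items (`salary`, `supervisor`, `promotions`) and made-up scores for five employees. Standardizing and averaging is only one of many scoring choices, and it quietly assumes every item deserves equal weight.

```r
# Hypothetical item scores for five employees (made-up numbers, for illustration only)
job <- data.frame(
  salary     = c(42, 55, 61, 48, 90),  # thousands of dollars
  supervisor = c(3, 4, 5, 2, 4),       # 1-5 supervisor rating
  promotions = c(0, 1, 2, 0, 3)        # promotions in the last five years
)

# Put the items on a common scale, then average them with equal weights
z <- scale(job)              # column-wise z-scores
job$success <- rowMeans(z)   # unit-weighted composite score

# Employees ordered from highest to lowest composite
job[order(-job$success), ]
```

Follow-up question for students: what changes if salary is weighted twice as heavily as the other items?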

---

### Thinking about measurements

What questions should we ask ourselves as we construct latent variables?

* What else does our measure capture?

* (If multiple items) are all items weighted equally?

* (If multiple items) are items causal indicators or effect indicators?

* Is our latent variable _a posteriori_ or _a priori_?

???

Use job success measure. How are observations biased?
(who gets raises or promotions?)

---

### Relationship between latent variables and theory

Latent variables live at the level of theory.

* Your theory is about success/happiness/arousal/memory/etc, not about the measure (items or operationalizations).

* Does your theory specify how the latent variable is associated with your measure?
    * Probably not... we'll return to this.

---

### Relationship between latent variables and theory

Do you need theory for good statistics or empirical work?

* Machine learning models
  
  * Don't need theory to make predictions.
  * In fact, best predictions often come by throwing out theory.

* Network models

  * No underlying theory about the cause of covariation between items.
  * Allows for exploration of item structure.
  * E.g., work on [depression](https://eiko-fried.com/)
  
---

![](images/Scientific_Method.png)

---

## What's wrong with latent variables

![](images/borsboom.png)

---

[Borsboom (2006)](../readings/Borsboom_2006.pdf) argues that good measurement practices -- specifically, testing whether measures capture the latent variable -- have been ignored in psychology.

* Operationalizations assumed substitutes for latent variables
* No exploration or tests of whether measure captures latent variable
* Construct validity (Cronbach & Meehl, 1955, among others) made to seem too difficult

---

### IAT

???

Go through assumptions

* Operationalizations assumed substitutes -- what are we actually trying to measure?

---

### IAT
.pull-left[
<img src="images/original_iat.png" width="100%" />
]
.pull-right[
From Greenwald, McGhee, & Schwartz ([1998](https://pubmed.ncbi.nlm.nih.gov/9654756/))
]

???

* No exploration or tests of whether measure captures latent variable
* Evidence presented is group means, not individual differences.
* What's the theory underlying this? What is the shape of the outcome? Is it really linear? Is there a cut-off? How do practice effects weigh in?

---

### IAT

![](images/iat_correlation.png)
From Greenwald, McGhee, & Schwartz ([1998](https://pubmed.ncbi.nlm.nih.gov/9654756/))

???
Construct validity is hard, am I right?

---

## The underlying process

Where do the numbers come from?

What assumptions do our statistics make about where the numbers come from?

A few examples from Revelle ([2009](http://personality-project.org/r/book/Chapter3.pdf))

---

### Whose point of view?

Consider the problem of a department chairman who wants to recruit faculty by emphasizing the smallness of class size but also report to a dean how effective the department is at meeting its teaching requirements. What is the typical class size?

<table class=" lightable-classic" style='font-family: "Arial Narrow", "Source Sans Pro", sans-serif; margin-left: auto; margin-right: auto;'>
 <thead>
  <tr>
   <th style="text-align:left;"> Faculty Member </th>
   <th style="text-align:right;"> Freshman/Sophomore </th>
   <th style="text-align:right;"> Junior </th>
   <th style="text-align:right;"> Senior </th>
   <th style="text-align:right;"> Graduate </th>
   <th style="text-align:right;"> Mean </th>
   <th style="text-align:right;"> Median </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> A </td>
   <td style="text-align:right;"> 20 </td>
   <td style="text-align:right;"> 10 </td>
   <td style="text-align:right;"> 10 </td>
   <td style="text-align:right;"> 10 </td>
   <td style="text-align:right;background-color: lightgray !important;"> 12.5 </td>
   <td style="text-align:right;background-color: lightgray !important;"> 10 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> B </td>
   <td style="text-align:right;"> 20 </td>
   <td style="text-align:right;"> 10 </td>
   <td style="text-align:right;"> 10 </td>
   <td style="text-align:right;"> 10 </td>
   <td style="text-align:right;background-color: lightgray !important;"> 12.5 </td>
   <td style="text-align:right;background-color: lightgray !important;"> 10 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> C </td>
   <td style="text-align:right;"> 20 </td>
   <td style="text-align:right;"> 10 </td>
   <td style="text-align:right;"> 10 </td>
   <td style="text-align:right;"> 10 </td>
   <td style="text-align:right;background-color: lightgray !important;"> 12.5 </td>
   <td style="text-align:right;background-color: lightgray !important;"> 10 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> D </td>
   <td style="text-align:right;"> 20 </td>
   <td style="text-align:right;"> 100 </td>
   <td style="text-align:right;"> 10 </td>
   <td style="text-align:right;"> 10 </td>
   <td style="text-align:right;background-color: lightgray !important;"> 35.0 </td>
   <td style="text-align:right;background-color: lightgray !important;"> 15 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> E </td>
   <td style="text-align:right;"> 200 </td>
   <td style="text-align:right;"> 100 </td>
   <td style="text-align:right;"> 400 </td>
   <td style="text-align:right;"> 10 </td>
   <td style="text-align:right;background-color: lightgray !important;"> 177.5 </td>
   <td style="text-align:right;background-color: lightgray !important;"> 150 </td>
  </tr>
  <tr grouplength="2"><td colspan="7" style="border-bottom: 0;"><strong>Total</strong></td></tr>
<tr>
   <td style="text-align:left;padding-left: 2em;background-color: lightgray !important;" indentlevel="1"> Mean </td>
   <td style="text-align:right;background-color: lightgray !important;"> 56 </td>
   <td style="text-align:right;background-color: lightgray !important;"> 46 </td>
   <td style="text-align:right;background-color: lightgray !important;"> 88 </td>
   <td style="text-align:right;background-color: lightgray !important;"> 10 </td>
   <td style="text-align:right;background-color: lightgray !important;background-color: lightgray !important;"> 50.0 </td>
   <td style="text-align:right;background-color: lightgray !important;background-color: lightgray !important;"> 39 </td>
  </tr>
  <tr>
   <td style="text-align:left;padding-left: 2em;background-color: lightgray !important;" indentlevel="1"> Median </td>
   <td style="text-align:right;background-color: lightgray !important;"> 20 </td>
   <td style="text-align:right;background-color: lightgray !important;"> 10 </td>
   <td style="text-align:right;background-color: lightgray !important;"> 10 </td>
   <td style="text-align:right;background-color: lightgray !important;"> 10 </td>
   <td style="text-align:right;background-color: lightgray !important;background-color: lightgray !important;"> 12.5 </td>
   <td style="text-align:right;background-color: lightgray !important;background-color: lightgray !important;"> 10 </td>
  </tr>
</tbody>
</table>
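
A minimal sketch of how the faculty-level summaries in the table can be reproduced. The class sizes are copied from the table; the object names (`sizes`, `faculty_mean`, `faculty_median`) and the column labels are illustrative choices.

```r
# Class sizes from the table: rows are faculty members, columns are course levels
sizes <- matrix(
  c( 20,  10,  10, 10,
     20,  10,  10, 10,
     20,  10,  10, 10,
     20, 100,  10, 10,
    200, 100, 400, 10),
  nrow = 5, byrow = TRUE,
  dimnames = list(LETTERS[1:5],
                  c("FreshSoph", "Junior", "Senior", "Graduate"))
)

faculty_mean   <- apply(sizes, 1, mean)    # mean class size per faculty member
faculty_median <- apply(sizes, 1, median)  # median class size per faculty member

mean(faculty_mean)      # mean of the faculty means: 50
median(faculty_median)  # median of the faculty medians: 10
```

Both summaries describe the same 20 classes. Which one counts as "typical" depends on who is asking.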

???

Tell faculty that median class size is 10 and tell dean the mean class size is 50. Excellent!

---

What about from the students' perspective?

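```r
# One entry per student: each class size is repeated once per student enrolled at that size
class_size <- c(rep(10, 120),   # 12 classes of 10
                rep(20, 80),    # 4 classes of 20
                rep(100, 200),  # 2 classes of 100
                rep(200, 200),  # 1 class of 200
                rep(400, 400))  # 1 class of 400
mean(class_size)
```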

```
## [1] 222.8
```

```r
median(class_size)
```

```
## [1] 200
```
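
The student's-eye mean can also be written as an enrollment-weighted mean of the 20 class sizes, which may make it clearer why large classes dominate. This is a sketch; the vector name `classes` is illustrative.

```r
# The 20 classes in the table, one entry per class (not per student)
classes <- c(rep(10, 12), rep(20, 4), rep(100, 2), 200, 400)

mean(classes)                        # per-class mean: 50
weighted.mean(classes, w = classes)  # per-student mean: 222.8 (weights = enrollment)
sum(classes^2) / sum(classes)        # the same quantity written out by hand
```

A class of 400 is experienced by 400 students, so it enters the student-level average 400 times.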

---

### Is the process generating numbers linear?

Many of the statistics we use (e.g., mean) assume the process generating numbers is linear. That is, as you move up on the latent construct, you move in a linear fashion along the measurement. What happens if that's not the case?
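
A small sketch of what can go wrong, assuming a logistic measurement process (`plogis`) chosen purely for illustration. The latent scores in `theta` are hypothetical.

```r
# Hypothetical latent scores, equally spaced
theta <- seq(-3, 3, by = 1)

# Assumed measurement process: monotone but nonlinear (logistic, for illustration)
observed <- plogis(theta)

diff(theta)     # equal steps on the latent scale
diff(observed)  # unequal steps on the observed scale

# Averaging observed scores is not the same as measuring the average latent score
high <- theta[theta >= 1]  # latent scores 1, 2, 3 (latent mean = 2)
mean(plogis(high))         # mean of the observed scores
plogis(mean(high))         # observed score at the latent mean (a different number)
```

The same one-unit latent difference shows up as a large or small observed difference depending on where on the scale it falls.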

---

Scores indicate the time of day the subject experienced their peak.

---

### Non-linearity and pre-existing differences

The issues of non-linearity are especially troublesome when there are pre-existing differences between groups. This can lead to interactions at the level of the observations (measures/operationalization) even when there are not interactions at the level of the latent variable.

Consider a study of "thematic analysis" across three schools:

* a "high-quality, high prestige 4-year liberal arts college located in New England" (Ivy)
* a "4-year state supported institution, relatively nonselective, and enrolling mostly lower-middle-class commuter students who are preparing for specific vocations such as teaching" (TC)
* a community college (CC).

(From Winter & McClelland, 1978)

---

![](05-variables_files/figure-html/unnamed-chunk-10-1.png)

What is your conclusion?

---

![](05-variables_files/figure-html/unnamed-chunk-11-1.png)

What is your conclusion?

---

.pull-left[
![](05-variables_files/figure-html/unnamed-chunk-12-1.png)
]

.pull-right[
![](05-variables_files/figure-html/unnamed-chunk-13-1.png)
]

Both panels are generated from the exact same monotonic curve, but with items of different difficulties.

`$$\text{prob}(\text{correct} \mid \theta, \delta) = \frac{1}{1+e^{\delta-\theta}}$$`
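
A minimal sketch of how this formula can produce an apparent interaction. The latent abilities, the two time points, and the item difficulties below are hypothetical values picked for illustration, not the values behind the figures.

```r
# Formula from the slide: probability correct given ability (theta) and difficulty (delta)
p_correct <- function(theta, delta) 1 / (1 + exp(delta - theta))

# Hypothetical latent abilities: every school gains exactly 1 unit between time points
theta <- rbind(time1 = c(CC = -1, TC = 0, Ivy = 1),
               time2 = c(CC =  0, TC = 1, Ivy = 2))

easy <- p_correct(theta, delta = -2)  # easy item
hard <- p_correct(theta, delta =  4)  # hard item

# Observed gains (time2 - time1) by school
round(easy["time2", ] - easy["time1", ], 2)  # largest gain for CC
round(hard["time2", ] - hard["time1", ], 2)  # largest gain for Ivy
```

The latent gain is identical for all three schools, yet the easy item suggests CC improved most and the hard item suggests Ivy improved most.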

---

## Takeaways

* Latent variables are not directly measured for at least some people in a given sample
* We try to infer the value of a latent variable through our observed variable(s)
* In doing so, we must bring theory to bear, not only on how the latent variable connects to other (latent) variables or constructs, but specifically how our latent variable is related to our operationalization
* Misspecifying the relationship between latent variables and operationalizations can result in misleading or wrong results.

---

class: inverse

## Next time...

Probability