class: center, middle, inverse, title-slide # Latent variables ## What are we measuring? --- <style type="text/css"> /* Change the background color to white for shaded rows (even rows) */ .remark-slide thead, .remark-slide tr:nth-child(2n) { background-color: white; } </style> ## Last time * Descriptive statistics * Central tendency * Spread * Correlations ??? * What does correlation communicate? * Range? What does direction mean? * What is a large correlation? --- ## Today Moral of the story: measuring stuff is hard. Put some thought into it. --- ## Quantitude podcast * According to Curran and Hancock, what is a latent variable? * What are some examples of latent variables in the podcast? ??? From C&H Latent variables -- the things we cannot directly observe Variable -- individual differences on it A placeholder for the covariation among a set of variables. Is it missing for at least some of the observations in your dataset? Examples: depression, anxiety, the economy, the quality of a college/uni, Pandora music station, gravity, disease or underlying conditions --- ### [Bollen (2002)](../readings/Bollen_2002.pdf) Definitions of latent variables: .pull-left[ * Informal * Hypothetical construct * Unmeasureable * Data reduction ] .pull-right[ * Formal * Local independence * Expected value * Nondeterministic function ] ??? Underlying assumption of these definitions: we measure latent variables using multiple observed variables -- Sample realization definition: A latent random (or nonrandom) variable is a random (or nonrandom) variable for which there is no sample realization for at least some observations in a given sample. --- <img src="images/sanjay_twitter.png" width="75%" /> --- ### Measuring latent variables If latent variables are unobserved, how do we study them? -- * The challenge of **psychometrics** is assign numbers to observations in a way that best summarizes the underlying constructs ([Revelle, 2009](http://personality-project.org/r/book/Chapter3.pdf)) How do we create this in our dataset (practically speaking)? -- * With the people around you, come up with one latent variables that you might be interested in and describe how you would measure them. ??? Walk students through example of job success How would you measure this? What items would you use? How would you assign numbers to those items? How would you use those numbers to create a job success score? --- ### Thinking about measurements What questions should we ask ourselves as we construct latent variables? -- * What else does our measure capture? * (If multiple items) are all items weighted equally? * (If multiple items) are items causal indicators or effect indicators? * Is our latent variable _a posteriori_ and _a priori_? ??? Use job success measure. How are observations biased? (who gets raises or promotions?) --- ### Relationship between latent variables and theory Latent variables live at the level of theory. * Your theory is about success/happiness/arousal/memory/etc, not about the measure (items or operationalizations). * Does your theory specify how the latent variable is associated with your measure? * Probably not... we'll return to this. --- ### Relationship between latent variables and theory Do you need theory for good statistics or empirical work? * Machine learning models * Don't need theory to make predictions. * In fact, best predictions often come by throwing out theory. * Network models * No underlying theory about the cause of covariation between items. * Allows for exploration of item structure. * E.g., work on [depression](https://eiko-fried.com/) --- ![](images/Scientific_Method.png) --- ## What's wrong with latent variables ![](images/borsboom.png) --- [Borsboom (2006)](../readings/Borsboom_2006.pdf) argues that good measurement practices -- specifically, testing that measures capture latent variable -- has been ignored in psychology. * Operationalizations assumed substitutes for latent variables * No exploration or tests of whether measure captures latent variable * Construct validity (Cronbach & Meehl, 1955, among others) made to seem too difficult --- ### IAT <img src="images/iat.jpeg" width="100%" /> ??? Go through assumptions * Operationalizations assumed substitutes -- what are we actually trying to measure? --- ### IAT .pull-left[ <img src="images/original_iat.png" width="100%" /> ] .pull-right[ From Greenwald, McGhee, & Schwartz ([1998](https://pubmed.ncbi.nlm.nih.gov/9654756/)) ] ??? * No exploration or tests of whether more captures latent variable * Evidence presented is group means, not individual differences. * What's the theory underlying this. What is the shape of the outcome? Is it really linear? Is there a cut-off? How do practice effects weigh in? --- ### IAT ![](images/iat_correlation.png) From Greenwald, McGhee, & Schwartz ([1998](https://pubmed.ncbi.nlm.nih.gov/9654756/)) ??? Construct validity is hard, am I right? --- ## The underlying process Where do the numbers come from? What assumptions do our statistics make about where the numbers come from? A few examples from Revelle ([2009](http://personality-project.org/r/book/Chapter3.pdf)) --- ### Whose point of view? Consider the problem of a department chairman who wants to recruit faculty by emphasizing the smallness of class size but also report to a dean how effective the department is at meeting its teaching requirements. What is the typical class size? <table class=" lightable-classic" style='font-family: "Arial Narrow", "Source Sans Pro", sans-serif; margin-left: auto; margin-right: auto;'> <thead> <tr> <th style="text-align:left;"> Faculty Member </th> <th style="text-align:right;"> Freshman/Sophmore </th> <th style="text-align:right;"> Junior </th> <th style="text-align:right;"> Senior </th> <th style="text-align:right;"> Graduate </th> <th style="text-align:right;"> Mean </th> <th style="text-align:right;"> Median </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> A </td> <td style="text-align:right;"> 20 </td> <td style="text-align:right;"> 10 </td> <td style="text-align:right;"> 10 </td> <td style="text-align:right;"> 10 </td> <td style="text-align:right;background-color: lightgray !important;"> 12.5 </td> <td style="text-align:right;background-color: lightgray !important;"> 10 </td> </tr> <tr> <td style="text-align:left;"> B </td> <td style="text-align:right;"> 20 </td> <td style="text-align:right;"> 10 </td> <td style="text-align:right;"> 10 </td> <td style="text-align:right;"> 10 </td> <td style="text-align:right;background-color: lightgray !important;"> 12.5 </td> <td style="text-align:right;background-color: lightgray !important;"> 10 </td> </tr> <tr> <td style="text-align:left;"> C </td> <td style="text-align:right;"> 20 </td> <td style="text-align:right;"> 10 </td> <td style="text-align:right;"> 10 </td> <td style="text-align:right;"> 10 </td> <td style="text-align:right;background-color: lightgray !important;"> 12.5 </td> <td style="text-align:right;background-color: lightgray !important;"> 10 </td> </tr> <tr> <td style="text-align:left;"> D </td> <td style="text-align:right;"> 20 </td> <td style="text-align:right;"> 100 </td> <td style="text-align:right;"> 10 </td> <td style="text-align:right;"> 10 </td> <td style="text-align:right;background-color: lightgray !important;"> 35.0 </td> <td style="text-align:right;background-color: lightgray !important;"> 15 </td> </tr> <tr> <td style="text-align:left;"> E </td> <td style="text-align:right;"> 200 </td> <td style="text-align:right;"> 100 </td> <td style="text-align:right;"> 400 </td> <td style="text-align:right;"> 10 </td> <td style="text-align:right;background-color: lightgray !important;"> 177.5 </td> <td style="text-align:right;background-color: lightgray !important;"> 150 </td> </tr> <tr grouplength="2"><td colspan="7" style="border-bottom: 0;"><strong>Total</strong></td></tr> <tr> <td style="text-align:left;padding-left: 2em;background-color: lightgray !important;" indentlevel="1"> Mean </td> <td style="text-align:right;background-color: lightgray !important;"> 56 </td> <td style="text-align:right;background-color: lightgray !important;"> 46 </td> <td style="text-align:right;background-color: lightgray !important;"> 110 </td> <td style="text-align:right;background-color: lightgray !important;"> 10 </td> <td style="text-align:right;background-color: lightgray !important;background-color: lightgray !important;"> 50.0 </td> <td style="text-align:right;background-color: lightgray !important;background-color: lightgray !important;"> 39 </td> </tr> <tr> <td style="text-align:left;padding-left: 2em;background-color: lightgray !important;" indentlevel="1"> Median </td> <td style="text-align:right;background-color: lightgray !important;"> 20 </td> <td style="text-align:right;background-color: lightgray !important;"> 10 </td> <td style="text-align:right;background-color: lightgray !important;"> 10 </td> <td style="text-align:right;background-color: lightgray !important;"> 10 </td> <td style="text-align:right;background-color: lightgray !important;background-color: lightgray !important;"> 12.5 </td> <td style="text-align:right;background-color: lightgray !important;background-color: lightgray !important;"> 10 </td> </tr> </tbody> </table> ??? Tell faculty that median class size is 10 and tell dean the mean class size is 50. Excellent! --- What about from the students' perspective? <table class=" lightable-classic" style='font-family: "Arial Narrow", "Source Sans Pro", sans-serif; margin-left: auto; margin-right: auto;'> <thead> <tr> <th style="text-align:right;"> Class size </th> <th style="text-align:right;"> Number of classes </th> <th style="text-align:right;"> Number of students </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 10 </td> <td style="text-align:right;"> 12 </td> <td style="text-align:right;"> 120 </td> </tr> <tr> <td style="text-align:right;"> 20 </td> <td style="text-align:right;"> 4 </td> <td style="text-align:right;"> 80 </td> </tr> <tr> <td style="text-align:right;"> 100 </td> <td style="text-align:right;"> 2 </td> <td style="text-align:right;"> 200 </td> </tr> <tr> <td style="text-align:right;"> 200 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 200 </td> </tr> <tr> <td style="text-align:right;"> 400 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 400 </td> </tr> </tbody> </table> ```r class_size = c(rep(10, 120), rep(20, 80), rep(100, 200), rep(200, 200), rep(400, 400)) mean(class_size) ``` ``` ## [1] 222.8 ``` ```r median(class_size) ``` ``` ## [1] 200 ``` --- ### Is the process generating numbers linear? Many of the statistics we use (e.g., mean) assume the process generating numbers is linear. That is, as you move up on the latent construct, you move in a linear fashion along the measurement. What happens if that's not the case? --- Scores indicate the time of day the subject experienced their peak. <table class=" lightable-classic" style='font-family: "Arial Narrow", "Source Sans Pro", sans-serif; margin-left: auto; margin-right: auto;'> <thead> <tr> <th style="text-align:left;"> Subject </th> <th style="text-align:right;"> Energetic Arousal </th> <th style="text-align:right;"> Positive Affect </th> <th style="text-align:right;"> Tense Arousal </th> <th style="text-align:right;"> Negative Affect </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> 1 </td> <td style="text-align:right;"> 9 </td> <td style="text-align:right;"> 14 </td> <td style="text-align:right;"> 19 </td> <td style="text-align:right;"> 24 </td> </tr> <tr> <td style="text-align:left;"> 2 </td> <td style="text-align:right;"> 11 </td> <td style="text-align:right;"> 16 </td> <td style="text-align:right;"> 21 </td> <td style="text-align:right;"> 2 </td> </tr> <tr> <td style="text-align:left;"> 3 </td> <td style="text-align:right;"> 13 </td> <td style="text-align:right;"> 18 </td> <td style="text-align:right;"> 23 </td> <td style="text-align:right;"> 4 </td> </tr> <tr> <td style="text-align:left;"> 4 </td> <td style="text-align:right;"> 15 </td> <td style="text-align:right;"> 20 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 6 </td> </tr> <tr> <td style="text-align:left;"> 5 </td> <td style="text-align:right;"> 17 </td> <td style="text-align:right;"> 22 </td> <td style="text-align:right;"> 3 </td> <td style="text-align:right;"> 8 </td> </tr> <tr> <td style="text-align:left;"> 6 </td> <td style="text-align:right;"> 19 </td> <td style="text-align:right;"> 24 </td> <td style="text-align:right;"> 5 </td> <td style="text-align:right;"> 10 </td> </tr> <tr grouplength="2"><td colspan="5" style="border-bottom: 0;"><strong>Mean</strong></td></tr> <tr> <td style="text-align:left;padding-left: 2em;" indentlevel="1"> Arithmetic </td> <td style="text-align:right;"> 14 </td> <td style="text-align:right;"> 19 </td> <td style="text-align:right;"> 12 </td> <td style="text-align:right;"> 9 </td> </tr> <tr> <td style="text-align:left;padding-left: 2em;" indentlevel="1"> Circular </td> <td style="text-align:right;"> 14 </td> <td style="text-align:right;"> 19 </td> <td style="text-align:right;"> 24 </td> <td style="text-align:right;"> 5 </td> </tr> </tbody> </table> --- ### Non-linearirty and pre-existing differences The issues of non-linearity are especially troublesome when there are pre-existing differences between groups. This can lead to interactions at the level of the observations (measures/operationalization) even when there are not interactions at the level of the latent variable. Consider a study of "thematic analysis" across three schools: * a "high-quality, high prestige 4-year liberal arts college located in New England" (Ivy) * a "4-year state supported institution, relatively nonselective, and enrolling mostly lower-middle-class commuter students who are preparing for specific vocations such as teaching" (TC) * a community college (CC). (From Winter & McClelland, 1978) --- ![](05-variables_files/figure-html/unnamed-chunk-10-1.png)<!-- --> What is your conclusion? --- ![](05-variables_files/figure-html/unnamed-chunk-11-1.png)<!-- --> What is your conclusion? --- .pull-left[ ![](05-variables_files/figure-html/unnamed-chunk-12-1.png)<!-- --> ] .pull-right[ ![](05-variables_files/figure-html/unnamed-chunk-13-1.png)<!-- --> ] Both panels are generated from the exact same monotonic curve, but with items of different difficulties. `$$prob(correct|\theta,\delta) = \frac{1}{1+e^{\delta-\theta}}$$` --- ## Takeaways * Latent variables are not directly measured for at least some people in a given sample * We try to infer the value of a latent variable through our observed variable(s) * In doing so, we must bring theory to bear, not only on how the latent variable connects to other (latent) variables or constructs, but specifically how our latent variable is related to our operationalization * Misspecifying the relationship between latent variables and operationalizations can result in misleading or wrong results. --- class: inverse ## Next time... Probability <!-- Physiological arousal is thought to reflect levels of excitement, alertness and energy. --> <!-- * Two psychophysiological indicators of the degree of palmer sweating. --> <!-- * **Skin conductance (SC)** is measured by passing a small current through two electrodes, one attached to one finger, another attached to another finger. --> <!-- * **Skin resistance (SR)** is also measured by two electrodes, and reflects the resistance of the skin to passing an electric current. --> <!-- * These two measures are the reciprocal of one another. --> <!-- --- --> <!-- Consider two experimenters, A and B. They both are interested in the effect of an exciting movie upon the arousal of their subjects. Experimenter A uses Skin Conductance, experimenter B measures Skin Resistance. They first take their measures, and then, after the movie, take their measures again. --> <!-- ```{r, echo = F, results = 'asis'} --> <!-- data.frame( --> <!-- #Condition = c("Pretest", "Pretest", "Average", "Posttest", "Posttest", "Average"), --> <!-- Subject = c(1, 2, NA, 1, 2, NA), --> <!-- Conductance = c(2, 2, 2, 1, 4, 2.5), --> <!-- Resistance = c(.50, .50, .5, 1, .25, .63) --> <!-- ) %>% --> <!-- kable(escape = F, booktab = T) %>% --> <!-- kable_styling(font_size = 16) %>% --> <!-- group_rows("Pretest", 1, 2) %>% --> <!-- group_rows("Average", 3, 3) %>% --> <!-- group_rows("Posttest", 4, 5) %>% --> <!-- group_rows("Average", 6, 6) --> <!-- ``` --> <!-- --- --> <!-- ```{r} --> <!-- conductance = seq(1, 5, length = 100) --> <!-- resistance = 1/conductance --> <!-- plot(conductance, resistance, type = "l") --> <!-- ``` --> <!-- --- --> <!-- ```{r, echo = F, fig.width = 10, fig.height = 8} --> <!-- xcoord = c(2, 2, 1, 4) --> <!-- ycoord = c(.5, .5, 1, .25) --> <!-- conductance = seq(1, 5, length = 100) --> <!-- resistance = 1/conductance --> <!-- plot(conductance, resistance, type = "l") --> <!-- points(xcoord, ycoord) --> <!-- text(xcoord, ycoord, --> <!-- label = c("Pre S1", "Pre S2", "Post S1", "Post S2"), --> <!-- pos = 4) --> <!-- ``` --> <!-- --- --> <!-- ```{r, echo = F, fig.width = 10, fig.height = 8} --> <!-- xcoord = c(2, 2, 1, 4) --> <!-- ycoord = c(.5, .5, 1, .25) --> <!-- conductance = seq(1, 5, length = 100) --> <!-- resistance = 1/conductance --> <!-- plot(conductance, resistance, type = "l") --> <!-- points(xcoord, ycoord) --> <!-- text(xcoord, ycoord, --> <!-- label = c("", "", "Post S1", "Post S2"), --> <!-- pos = 4) --> <!-- points(c(2, 2.5), c(.5, .63)) --> <!-- text(c(2, 2.5), c(.5, .63), label = c("Mean Pre", "Mean Post")) --> <!-- segments(xcoord[3], ycoord[3], xcoord[4], ycoord[4]) --> <!-- ``` --> <!-- --- --> <!-- What's the real conclusion? --> <!-- * The movie induces greater variability --> <!-- What's the moral? --> <!-- * We don't expect either of these variables to behave linearly -- there are bounds. --> <!-- * A traditional mean calculation won't work here. --> <!-- A similar situation is measuring a construct over time that has circular rhythms. Think of anything that goes up and down over the course of a day (energy, affect, etc). -->