Last time
Validity (4 types)
- Statistical Conclusion
- Internal
- External
- Construct(?)
Today
More about construct validity and its role in scale development.
Interactive focus: We’ll “develop” a scale together and learn what makes measurement good (or bad).
Conceptual clarity (Bringmann, Elmer, & Eronen, 2022)
- Identifying and characterizing the concept of study
- This is independent of measurement and must happen before measurement
- Why is conceptual clarity important?
Quantitative fallacy
See this article.
Applicant selection for medical residency positions.
- Subtest of medical licensing exam is best predictor of residency success.
- Great!
- Right?
Quantitative fallacy
Fallacy:
- Measure whatever can be easily measured.
- Disregard things that cannot be measured easily.
- Presume things that cannot be measured easily are not important.
- Presume that things that are not measured easily do not exist.
From concept to measurement
Once a concept has been clarified, the next step is to measure it.
Classical test theory states that:
\[
X = T + E
\]
- X: Observed Score
- T: True Score
- E: Error (random and unpredictable)
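To make X = T + E concrete, here is a minimal simulation (all numbers hypothetical; numpy assumed): observed scores are just true scores plus random noise.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 75  # hypothetical sample size

T = rng.normal(loc=3.5, scale=0.8, size=n)  # true scores
E = rng.normal(loc=0.0, scale=0.6, size=n)  # random, unpredictable error
X = T + E                                   # observed scores

# Error averages out across people but inflates the spread:
print(round(X.mean(), 2), round(T.mean(), 2))  # similar means
print(round(X.var(), 2), round(T.var(), 2))    # var(X) ≈ var(T) + var(E)
```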
Reliability
Reliability = consistency of measurement
- We use multiple items/measurements to reduce error (see the sketch below)
- Like measuring the person many times in a single session
- Not the same as validity! (You can have reliable measures that aren’t valid)
Key question: How many items? Which items?
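A sketch of why more items help (same simulated setup as above; all numbers hypothetical): averaging several noisy measurements of one true score cancels out much of the random error.

```python
import numpy as np

rng = np.random.default_rng(7)
n, k = 75, 6  # hypothetical: 75 people answering 6 items

T = rng.normal(3.5, 0.8, size=n)                      # one true score each
items = T[:, None] + rng.normal(0, 0.6, size=(n, k))  # k noisy items each

# The 6-item mean tracks the true score better than any single item:
print(round(np.corrcoef(T, items[:, 0])[0, 1], 2))         # single item
print(round(np.corrcoef(T, items.mean(axis=1))[0, 1], 2))  # full scale
```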
Today’s exercise: Developing a scale
We’re going to “develop” a scale for “academic stress”
What are some specific aspects of academic stress?
Examples: coursework pressure, deadlines, fear of failure, workload, exams, research demands, imposter syndrome…
But first: What makes a BAD item?
Let’s look at some problematic items for our academic stress scale.
What’s wrong with these?
Bad Item #1
“I feel stressed about my coursework and my relationships with classmates.”
Problem: Double-barreled!
- Asks about two different things
- What if someone is stressed about coursework but NOT relationships?
- Can’t interpret the response
Bad Item #2
“Sometimes I’m like totally freaking out about school stuff, you know?”
Problem: Slang and vague language
- “freaking out” = ? (different meanings for different people)
- “school stuff” = ? (too broad)
- “you know?” = unnecessary filler
Bad Item #3
“On Thursday evening when I was studying for my statistics exam, I felt anxious.”
Problem: Too specific
- What about people who don’t take statistics?
- What about other days/times?
- We want general patterns, not specific instances
Bad Item #4
“Are you worried about your academic performance impacting your future career prospects and disappointing your family?”
Problem: Multiple problems!
- Double-barreled (career AND family)
- Sensitive topic phrased judgmentally (“disappointing”)
- Leading/assumes family expectations exist
Guidelines for GOOD items
Content:
- Simple and straightforward
- No double-barrels
- Avoid slang
- Phrase items generally
- Use matter-of-fact language for sensitive topics
Format:
- Choose response format carefully
- Dichotomous vs. polytomous
- Number of options (5? 7?)
- Response categories (agree/disagree? frequency?)
- Match phrasing to format
Good Item Examples
Compare these to our bad examples:
✓ “I feel overwhelmed by my academic workload.”
✓ “I worry about my academic performance.”
✓ “I feel pressure to meet academic deadlines.”
What makes these better?
Activity: Write items (10 min)
Break into small groups (3-4 people)
Your task:
- Choose ONE specific aspect of academic stress from the list
- Write 3 items measuring that aspect
- Check each item against the guidelines
- Identify potential problems in your own items
Aspects: coursework pressure, deadlines, fear of failure, workload, exam anxiety, research stress, imposter syndrome
Share & discuss (5 min)
Each group shares:
- Your best item
- One problem you caught and fixed
Moving from items to scales
You’ve written items… now what?
Next steps:
- Collect data from a sample
- Examine item properties
- Assess reliability
- Test dimensionality (factor analysis)
- Iterate!
Let’s look at what this actually looks like…
Hypothetical data: Academic stress scale
Suppose I collected data from 75 students using 6 items.
Let’s analyze it together and see:
- Which items work well?
- Which items are problematic?
- How do we decide what to keep?
Item descriptives
| Item | Mean | SD | Skew |
|------|------|-----|------|
| 1. I feel overwhelmed by workload | 4.2 | 1.1 | -0.8 |
| 2. I worry about my performance | 3.9 | 1.3 | -0.5 |
| 3. Deadlines stress me out | 4.5 | 0.9 | -1.2 |
| 4. Sometimes I’m stressed | 4.8 | 0.5 | -2.1 |
| 5. I feel confident in my abilities* | 3.2 | 1.2 | 0.3 |
| 6. I have trouble sleeping due to stress | 3.1 | 1.4 | 0.2 |
*reverse-coded
What do you notice?
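A sketch of how a table like this can be computed, assuming the responses sit in a pandas DataFrame (the data here is a random stand-in, not the actual hypothetical dataset; column names are made up):

```python
import numpy as np
import pandas as pd
from scipy.stats import skew

rng = np.random.default_rng(1)
# Random stand-in for 75 students x 6 items on a 1-5 scale:
df = pd.DataFrame(rng.integers(1, 6, size=(75, 6)),
                  columns=[f"item{i}" for i in range(1, 7)])

desc = pd.DataFrame({
    "Mean": df.mean(),
    "SD": df.std(),
    "Skew": df.apply(skew),  # per-item sample skewness
}).round(1)
print(desc)
```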
Red flags in the data
Item 4: “Sometimes I’m stressed”
- Mean = 4.8 (on 1-5 scale)
- SD = 0.5
- Skew = -2.1
Problem: Almost no variance! Nearly everyone agrees.
- Too general/vague
- Doesn’t discriminate between people
- Won’t correlate with anything
- Should be removed
Item-total correlations
How well does each item correlate with the total score (excluding that item)?
| Item | Item-total r |
|------|--------------|
| 1. Overwhelmed by workload | .72 |
| 2. Worry about performance | .68 |
| 3. Deadlines stress me out | .75 |
| 4. Sometimes I’m stressed | .23 |
| 5. Confident in abilities* | .41 |
| 6. Trouble sleeping | .59 |
Rule of thumb: r > .30 (ideally > .50)
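These corrected (item-rest) correlations come from correlating each item with the sum of the remaining items; a sketch reusing the stand-in df from the previous slide:

```python
# Corrected item-total correlation: each item vs. the sum of the OTHER
# items, so an item cannot inflate its own correlation.
for col in df.columns:
    rest_total = df.drop(columns=col).sum(axis=1)
    print(col, round(df[col].corr(rest_total), 2))
```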
What do these correlations tell us?
Strong correlations (.68-.75):
- Items measuring the same construct
- Good indicators of academic stress
Weak correlation (.23):
- Item 4: Too general, not useful
Moderate correlation (.41):
- Item 5 (confidence): Related but different construct?
- Might be tapping self-efficacy, not stress
- Decision: Keep or remove?
Reliability: Cronbach’s alpha
\[\alpha = \frac{k}{k-1}\left(1 - \frac{\sum s_i^2}{s_T^2}\right)\]
Don’t worry about the formula. What matters:
- α ranges from 0 to 1
- Measures internal consistency (assumes unidimensionality!)
- Rule of thumb: α > .70 (adequate), α > .80 (good)
Our scale: α = .83 (with all items)
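The formula translates almost line for line into code; a sketch, again on the stand-in df:

```python
import pandas as pd

def cronbach_alpha(df: pd.DataFrame) -> float:
    """Cronbach's alpha: k items, item variances s_i^2, total-score variance s_T^2."""
    k = df.shape[1]
    item_vars = df.var(ddof=1)              # s_i^2 for each item
    total_var = df.sum(axis=1).var(ddof=1)  # s_T^2
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

print(round(cronbach_alpha(df), 2))
```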
What happens if we remove bad items?
Original scale (6 items): α = .83
Remove Item 4 (too general): α = .85
Remove Item 4 & 5 (general + confidence): α = .87
Key insight: Sometimes removing items improves reliability!
- Bad items add noise, not signal
- Quality > quantity
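This is the standard “alpha if item deleted” diagnostic; a sketch using the cronbach_alpha function from the previous slide:

```python
# Recompute alpha with each item dropped in turn:
for col in df.columns:
    print(col, round(cronbach_alpha(df.drop(columns=col)), 2))
# Items whose removal RAISES alpha are adding noise, not signal.
```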
Factor analysis preview
Do all items measure ONE construct, or multiple?
Our data: Factor analysis suggests 2 dimensions
- Performance anxiety (worry, deadlines, overwhelm)
- Physical symptoms (sleep, physical stress responses)
Implications:
- Might need subscales
- Or might narrow construct definition
- This is why iteration matters!
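As a preview of what that analysis can look like: scikit-learn’s FactorAnalysis returns unrotated loadings (a dedicated package such as factor_analyzer adds rotations and fit indices). A sketch on the stand-in df:

```python
from sklearn.decomposition import FactorAnalysis

fa = FactorAnalysis(n_components=2, random_state=0).fit(df)
print(fa.components_.round(2))  # loadings: rows = factors, columns = items
# Items that load strongly on the same factor likely form a subscale.
```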
The iterative process
Scale development isn’t one-and-done:
- ✓ Conceptualize construct
- ✓ Write item pool
- ✓ Collect data
- ✓ Analyze items
- → Revise items
- → Collect new data
- → Re-analyze
- → Repeat until satisfactory
Published scales often go through 3-5 iterations
Construct validity revisited
Good measurement = good construct validity
Requires:
- Clear conceptualization
- Well-written items
- Adequate reliability
- Evidence the scale measures what you think it measures
  - Correlates with related constructs (convergent validity)
  - Doesn’t correlate with unrelated constructs (discriminant validity)
  - Predicts relevant outcomes (predictive validity)
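In practice, the convergent and discriminant checks above often boil down to correlations with other measures from the same sample; a sketch with hypothetical companion variables (reusing the stand-in df):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
stress_total = df.sum(axis=1)  # total score on the stand-in scale

# Hypothetical companion measures:
anxiety = stress_total * 0.6 + rng.normal(0, 3, size=len(df))  # related
shoe_size = pd.Series(rng.normal(42, 2, size=len(df)))         # unrelated

print(round(stress_total.corr(anxiety), 2))    # convergent: expect substantial r
print(round(stress_total.corr(shoe_size), 2))  # discriminant: expect r ≈ 0
```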
Key takeaways
- Conceptual clarity before measurement (avoid quantitative fallacy)
- Good items are harder to write than you think
  - Simple, clear, general, one idea at a time
- Data tells you what works
  - Look at distributions, correlations, reliability
  - Be willing to cut items
- Scale development takes time
  - Multiple iterations
  - Validation across samples
Next time…
Describing data