Appendix V — SPSS Tutorial: Reliability Analysis
Computing ICC, SEM, MDC, and Bland-Altman limits of agreement in SPSS
V.1 Overview
This tutorial demonstrates a complete reliability analysis for strength_kg using the control group’s pre-test and mid-test measurements as a test-retest dataset (n = 30, no intervening intervention). The same workflow applies to any two-occasion, single-variable reliability study.
- Variable: strength_kg
- Occasions: Pre-test (occasion 1), mid-test (occasion 2)
- Group: Control only (n = 30)
- Goal: Compute ICC(2,1), SEM, MDC₉₅, and Bland-Altman limits of agreement
V.2 Part 1: Data preparation
V.2.1 Step 1: Filter to the control group
- Click Data → Select Cases
- Select If condition is satisfied → If…
- Enter: group = "control" (or the coded value used in your file)
- Click Continue → OK
V.2.2 Step 2: Restructure to wide format
The dataset must have one row per participant with separate columns for each test occasion. If your data are in long format:
- Click Data → Restructure → Restructure selected cases into variables → Next
- Move id to Identifier Variable(s)
- Move time to Index Variable
- Click through remaining dialogs → Finish
After restructuring, rename the columns in Variable View:
- The column corresponding to pre → strength_pre
- The column corresponding to mid → strength_mid
Verify the mapping by inspecting the first few rows.
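For readers who want to cross-check the restructured file outside SPSS, the filter and long-to-wide steps can be sketched in Python. This is a minimal sketch, not part of the SPSS workflow: the rows in `long_rows` and the helper `to_wide` are hypothetical, and the column order (id, group, time, strength_kg) is an assumption about your file.

```python
# Hypothetical long-format rows: (id, group, time, strength_kg).
long_rows = [
    (1, "control", "pre", 80.0),
    (1, "control", "mid", 81.0),
    (2, "control", "pre", 95.5),
    (2, "control", "mid", 95.0),
    (3, "training", "pre", 70.0),
    (3, "training", "mid", 78.0),
]


def to_wide(rows, group="control"):
    """Return {id: {"strength_pre": x, "strength_mid": y}} for one group."""
    wide = {}
    for pid, grp, time, value in rows:
        if grp != group:
            continue  # Step 1: keep the selected group only
        # Step 2: one record per participant, one key per occasion
        wide.setdefault(pid, {})[f"strength_{time}"] = value
    # Keep only participants measured on both occasions
    needed = {"strength_pre", "strength_mid"}
    return {pid: rec for pid, rec in wide.items() if rec.keys() >= needed}


wide = to_wide(long_rows)
for pid, rec in sorted(wide.items()):
    print(pid, rec["strength_pre"], rec["strength_mid"])
```

Each printed row should match the corresponding row in the SPSS Data View after restructuring.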
V.2.3 Step 3: Compute the difference and mean scores
These variables are needed for the Bland-Altman analysis in Part 3.
- Click Transform → Compute Variable
- Target Variable: strength_diff; Numeric Expression: strength_mid - strength_pre
- Click OK
- Repeat with Target Variable: strength_mean; Numeric Expression: (strength_pre + strength_mid) / 2
- Click OK
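The two Compute Variable expressions are simple enough to spot-check by hand. The following Python sketch applies them to three hypothetical participants (the scores are illustrative, not taken from the dataset):

```python
# Hypothetical wide-format scores (kg) for three participants.
strength_pre = [80.0, 95.5, 62.0]
strength_mid = [81.0, 95.0, 63.5]

# Same expressions as entered in the Compute Variable dialog.
strength_diff = [mid - pre for pre, mid in zip(strength_pre, strength_mid)]
strength_mean = [(pre + mid) / 2 for pre, mid in zip(strength_pre, strength_mid)]

print(strength_diff)  # [1.0, -0.5, 1.5]
print(strength_mean)  # [80.5, 95.25, 62.75]
```

Note the sign convention: a positive difference means the mid-test score is higher than the pre-test score.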
V.3 Part 2: Intraclass Correlation Coefficient
V.3.1 Step 4: Run ICC in SPSS
- Click Analyze → Scale → Reliability Analysis
- Move strength_pre and strength_mid to the Items box
- Leave Model in the main dialog at its default (Alpha); the ICC is requested in the Statistics dialog
- Click Statistics:
  - Check Intraclass correlation coefficient, then set:
    - Model: Two-Way Mixed (same assessor on all participants, assessor treated as fixed; with Absolute Agreement the estimate is numerically identical to Two-Way Random, which is why it is reported here as ICC(2,1))
    - Type: Absolute Agreement
    - Confidence interval: 95%
    - Test value: 0
  - Check Descriptives for: Item, Scale, Scale if item deleted
  - Click Continue
- Click OK
SPSS offers three model options that map to the Shrout & Fleiss (1979) taxonomy:
| SPSS model | Shrout & Fleiss equivalent | When to use |
|---|---|---|
| One-Way Random | ICC(1,1) | Different raters for each participant |
| Two-Way Random | ICC(2,1) | Same raters, raters = random sample |
| Two-Way Mixed | ICC(3,1) | Same raters, raters = fixed (specific to this study) |
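In most test-retest reliability studies in movement science the same assessor conducts both sessions, so Two-Way Mixed with Absolute Agreement is selected. For a given type, Two-Way Mixed and Two-Way Random return the same ICC estimate; they differ only in whether the result is interpreted as generalizing beyond the assessors or occasions studied. Shrout & Fleiss's ICC(3,1) corresponds to Two-Way Mixed with Consistency, whereas the Absolute Agreement estimate matches their ICC(2,1) formula, which is why the Two-Way Mixed, Absolute Agreement result is commonly reported as ICC(2,1). Check your study design against [1] to confirm the appropriate selection.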
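To make the difference between the model and type options concrete, the single-measures ICCs can be computed from two-way ANOVA mean squares. The sketch below uses the standard mean-squares formulas for the one-way, absolute-agreement, and consistency forms. The function name `icc_single` and the scores are hypothetical, and confidence intervals are not computed.

```python
def icc_single(scores):
    """Single-measures ICCs from an n x k table
    (rows = participants, columns = occasions)."""
    n, k = len(scores), len(scores[0])
    grand = sum(map(sum, scores)) / (n * k)
    row_means = [sum(row) / k for row in scores]
    col_means = [sum(row[j] for row in scores) / n for j in range(k)]

    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)                 # between participants
    ms_cols = ss_cols / (k - 1)                 # between occasions
    ms_error = ss_error / ((n - 1) * (k - 1))   # residual
    ms_within = (ss_cols + ss_error) / (n * (k - 1))

    return {
        # One-Way Random
        "one_way": (ms_rows - ms_within) / (ms_rows + (k - 1) * ms_within),
        # Two-way models, Absolute Agreement
        "absolute": (ms_rows - ms_error)
        / (ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n),
        # Two-way models, Consistency
        "consistency": (ms_rows - ms_error) / (ms_rows + (k - 1) * ms_error),
    }


# Hypothetical scores: every participant is exactly 1 kg higher at mid-test.
scores = [[10, 11], [20, 21], [30, 31], [40, 41]]
icc = icc_single(scores)
print({name: round(value, 4) for name, value in icc.items()})
```

In this constructed example the consistency ICC is 1 because the shift between occasions is constant, while the absolute-agreement ICC falls slightly below 1 because that shift is counted as error.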
V.3.2 Step 5: Interpret ICC output
SPSS produces the following Intraclass Correlation Coefficient table:
| | ICC | 95% CI Lower | 95% CI Upper | F | df1 | df2 | p |
|---|---|---|---|---|---|---|---|
| Single Measures | .996 | .993 | .998 | 276.54 | 29 | 29 | < .001 |
| Average Measures | .998 | .996 | .999 | 276.54 | 29 | 29 | < .001 |
Key values to record:
- Single Measures ICC = .996: this is the value to report; it reflects the reliability of one measurement, which is what will be used in the study
- 95% CI = [.993, .998]: a narrow interval, indicating a precise ICC estimate
- Classify using [1] benchmarks: ICC > .90 = Excellent
V.4 Part 3: Standard Error of Measurement and MDC₉₅
SPSS does not compute SEM or MDC₉₅ directly from the Reliability Analysis output. Two options are available:
- Statistical Calculators appendix — The interactive reliability calculator in the Statistical Calculators appendix accepts the ICC and pooled SD and returns SEM and MDC₉₅ automatically.
- SPSS 31 and later — Analyze → Power Analysis includes a reliability module that reports SEM alongside ICC.
For reference, the formulas are:
\[\text{SEM} = SD_{\text{pooled}} \times \sqrt{1 - \text{ICC}}\]
\[\text{MDC}_{95} = \text{SEM} \times 1.96 \times \sqrt{2}\]
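The pooled SD is obtained from the Item Statistics table in the SPSS output (average of the two SD values). For strength_kg: pooled SD ≈ 13.49 kg, ICC = .996, giving SEM = 0.82 kg and MDC₉₅ = 2.26 kg. These values appear to be based on the unrounded ICC; substituting the rounded .996 into the formulas gives a slightly larger SEM ≈ 0.85 kg and MDC₉₅ ≈ 2.36 kg, so carry the ICC to at least four decimals when checking by hand.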
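The two formulas can be checked with a few lines of Python. This is a minimal sketch; the helper name `sem_mdc95` is hypothetical, and the inputs are the rounded values quoted in this section. Because the ICC enters as the square root of 1 − ICC, the result is sensitive to how many decimals of the ICC are carried.

```python
import math


def sem_mdc95(sd_pooled, icc):
    """SEM = SD_pooled * sqrt(1 - ICC); MDC95 = SEM * 1.96 * sqrt(2)."""
    sem = sd_pooled * math.sqrt(1 - icc)
    mdc95 = sem * 1.96 * math.sqrt(2)
    return sem, mdc95


# Rounded inputs as printed in the text (pooled SD in kg, ICC to 3 decimals).
sem, mdc95 = sem_mdc95(13.49, 0.996)
print(f"SEM = {sem:.2f} kg, MDC95 = {mdc95:.2f} kg")
```

Rerunning the function with the ICC copied from SPSS at full precision (double-click the output table to reveal extra decimals) gives a closer match to values computed by software.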
V.5 Part 4: Bland-Altman analysis
V.5.1 Step 6: Test for systematic bias (paired t-test on difference scores)
- Click Analyze → Compare Means → Paired-Samples T Test
- Move strength_mid and strength_pre as a pair to the Paired Variables box
- Click OK
Output to read:
| | Mean | SD | SE | t | df | p (2-tailed) |
|---|---|---|---|---|---|---|
| strength_mid − strength_pre | 0.517 | 1.153 | 0.211 | 2.45 | 29 | .021 |
The mean difference (bias) is +0.517 kg and is statistically significant (p = .021), indicating a small but real systematic tendency for mid-test scores to be slightly higher than pre-test scores. This likely reflects a minor familiarization effect.
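The relationship between the columns of the paired t-test table can be verified from the summary values alone. The sketch below recomputes the standard error and t statistic from the printed mean difference, SD, and n; the function name is hypothetical. Because the inputs are rounded to three decimals, the recomputed t can differ from the SPSS value in the second decimal place.

```python
import math


def paired_t_from_summary(mean_diff, sd_diff, n):
    """Return (SE, t, df) for a paired t-test from summary statistics."""
    se = sd_diff / math.sqrt(n)   # standard error of the mean difference
    t = mean_diff / se
    return se, t, n - 1


se, t, df = paired_t_from_summary(0.517, 1.153, 30)
print(f"SE = {se:.3f}, t({df}) = {t:.2f}")
```

The p value requires the t distribution, which the Python standard library does not provide, so read it from the SPSS output.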
V.5.2 Step 7: Obtain the limits of agreement
The limits of agreement are computed from the mean difference and its SD (both visible in the paired t-test output):
\[\text{LoA} = \bar{d} \pm 1.96 \times SD_d = 0.517 \pm 1.96 \times 1.153\]
Use the Statistical Calculators appendix or SPSS 31+ to obtain these values without hand arithmetic. For reference, the result is LoA = [−1.74, +2.78] kg.
You can also verify the SD of differences by running:
- Click Analyze → Descriptive Statistics → Explore
- Move strength_diff to Dependent List
- Click OK and read the SD from the Descriptives table
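The limits-of-agreement arithmetic in this step can be reproduced with a short Python sketch (the function name `limits_of_agreement` is hypothetical; the inputs are the mean difference and SD from the paired t-test output):

```python
def limits_of_agreement(mean_diff, sd_diff, z=1.96):
    """95% limits of agreement: mean difference +/- 1.96 * SD of differences."""
    half_width = z * sd_diff
    return mean_diff - half_width, mean_diff + half_width


lower, upper = limits_of_agreement(0.517, 1.153)
print(f"LoA = [{lower:.3f}, {upper:.3f}] kg")  # LoA = [-1.743, 2.777] kg
```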
V.5.3 Step 8: Produce the Bland-Altman scatter plot
- Click Graphs → Legacy Dialogs → Scatter/Dot → Simple Scatter → Define
- Move strength_mean to X Axis
- Move strength_diff to Y Axis
- Click OK
- Double-click the chart to open the Chart Editor
- Add reference lines for the bias and limits of agreement:
- Click Options → Y Axis Reference Line
- Add three lines: one at Y = 0.517 (bias), one at Y = 2.777 (upper LoA), one at Y = −1.743 (lower LoA)
- Style the bias line as solid and the LoA lines as dashed
- Add axis labels and a title, then export via File → Export
What to look for in the plot:
- Are the points randomly scattered around the bias line (desirable), or do they show a pattern (e.g., larger differences at higher means, i.e., heteroscedasticity)?
- Are any points outside the limits of agreement? Approximately 5% (1–2 out of 30) is expected by chance.
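Counting the points that fall outside the limits of agreement can also be done numerically rather than by eye. The following Python sketch does so for a small set of hypothetical difference scores (not the tutorial dataset); `outside_limits` is an illustrative helper name.

```python
import statistics


def outside_limits(diffs, z=1.96):
    """Return the limits of agreement and the indices of points outside them."""
    bias = statistics.mean(diffs)
    sd = statistics.stdev(diffs)  # sample SD (n - 1 denominator)
    lower, upper = bias - z * sd, bias + z * sd
    flagged = [i for i, d in enumerate(diffs) if d < lower or d > upper]
    return (lower, upper), flagged


# Hypothetical difference scores (kg); the ninth value is an obvious outlier.
diffs = [0.4, 1.1, -0.6, 0.9, 0.2, 1.5, -0.3, 0.7, 4.0, 0.5]
(lower, upper), flagged = outside_limits(diffs)
print(f"limits: [{lower:.2f}, {upper:.2f}], outside: {flagged}")
```

A flagged participant is worth checking for data-entry or protocol errors before deciding whether the value is a genuine observation.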
V.6 Part 5: APA-style write-up
Test-retest reliability of the strength_kg assessment was evaluated in 30 control-group participants measured at two occasions separated by approximately six weeks. ICC(2,1) absolute agreement (single measures) was computed using the Two-Way Mixed model in SPSS following the guidelines of [1]. Reliability was excellent, ICC(2,1) = .996 (95% CI [.993, .998]). The SEM was 0.82 kg and the MDC₉₅ was 2.26 kg, indicating that an individual strength change must exceed 2.26 kg to be attributable to genuine change rather than measurement error with 95% confidence. Bland-Altman analysis revealed a small systematic bias of +0.52 kg (mid-test > pre-test), t(29) = 2.45, p = .021, with 95% limits of agreement of −1.74 to +2.78 kg. The limits of agreement were interpreted as acceptable for a study in which training-induced strength gains are expected to substantially exceed 2.78 kg.
V.7 Troubleshooting
“The OK button in Reliability Analysis is greyed out”: The analysis requires at least two items (variables) in the Items box. Confirm that both strength_pre and strength_mid have been moved there before attempting to run the analysis.
“My ICC value is very different from what I expected”: Double-check the model selection (One-Way Random, Two-Way Random, or Two-Way Mixed) and the type (Consistency vs. Absolute Agreement). Two-Way Random and Two-Way Mixed return the same estimate for a given type, so the type usually matters more: Absolute Agreement counts systematic differences between occasions as error and is therefore generally lower than Consistency whenever a bias is present. One-Way Random has no Consistency version. For test-retest with a single assessor, Two-Way Mixed Absolute Agreement is the standard choice.
“The paired t-test shows a significant bias — should I still report ICC?”: Yes — report both, and note the systematic bias. A significant bias is not a reason to abandon ICC; it is a separate finding that should be described alongside the ICC and addressed in your protocol (e.g., by adding a familiarization session). ICC and Bland-Altman analysis are complementary, not interchangeable.
“I have three test occasions, not two — how do I run ICC?”: For three or more occasions, move all three variables into the Items box in Analyze → Scale → Reliability Analysis and select the same ICC model. SPSS computes the ICC across all occasions simultaneously, treating it as a multi-rater design. The interpretation is the same: the resulting ICC reflects the expected reliability of a single-occasion measurement.
“The difference scores appear to spread out more at higher means (heteroscedasticity)”: This suggests that measurement error is proportional to score magnitude, which violates the assumption of constant variance underlying the standard Bland-Altman analysis. In this case, log-transform both strength_pre and strength_mid before computing the difference and mean variables, then rerun the Bland-Altman analysis on the log-transformed scores. Back-transform (exponentiate) the limits of agreement for reporting; the back-transformed limits are ratios of mid-test to pre-test scores rather than differences in kg. See [2] for guidance.
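The log-transform and back-transform procedure described above can be sketched in Python as follows. This is an illustration under stated assumptions: the scores are hypothetical, natural logarithms are used, and `log_limits_of_agreement` is an illustrative helper name.

```python
import math
import statistics


def log_limits_of_agreement(first, second, z=1.96):
    """Ratio limits of agreement from log-transformed paired scores."""
    log_diffs = [math.log(b) - math.log(a) for a, b in zip(first, second)]
    bias = statistics.mean(log_diffs)
    sd = statistics.stdev(log_diffs)
    # Exponentiating returns ratios (second / first), not kg differences.
    return math.exp(bias), math.exp(bias - z * sd), math.exp(bias + z * sd)


# Hypothetical pre- and mid-test scores (kg).
pre = [50.0, 80.0, 100.0, 120.0]
mid = [52.0, 78.0, 105.0, 126.0]
ratio, lower, upper = log_limits_of_agreement(pre, mid)
print(f"mean ratio = {ratio:.3f}, limits = [{lower:.3f}, {upper:.3f}]")
```

A ratio limit of, say, 1.10 would mean that a mid-test score can be up to 10% higher than the pre-test score through measurement error alone.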
V.8 Practice exercises
Exercise 1: Repeat the full reliability analysis (ICC, SEM, MDC₉₅, Bland-Altman) for vo2_mlkgmin using the control group’s pre- and mid-test values. Compare your ICC and limits of agreement to those for strength_kg. What does the difference in reliability tell you about the relative measurement precision of the two instruments?
Exercise 2: Using your MDC₉₅ values for vo2_mlkgmin, revisit the training group’s mean VO₂max change from pre- to post-test (compute from the dataset). Does the group-level mean change exceed the MDC₉₅? What does this mean for interpreting individual responses to training?
Exercise 3: Compute the ICC for sprint_20m_s (control group, pre vs. mid). Before running the analysis, inspect the Bland-Altman plot for this variable and determine whether heteroscedasticity might be present. Describe the pattern you observe and state whether log-transformation would be warranted.
Exercise 4: A new graduate student in your lab suggests that since Pearson r between pre- and mid-test strength_kg scores is r = .997, there is no need to report ICC separately. Write 3–4 sentences explaining why this reasoning is incorrect, referencing the specific limitation of Pearson r in reliability assessment and at least one alternative finding (from the Bland-Altman analysis) that the Pearson r would have missed.