Chapter 13: Comparing Two Means
Student Resources
I use the 4 “P’s” framework to help you learn the material in this chapter: Prepare, Practice, Participate, and Perform. To increase your chances of success in this course, I strongly encourage you to complete all four “P’s” for each chapter.
1 Prepare
1.1 Chapter Overview
This chapter introduces t-tests—essential tools for evaluating group differences in Movement Science. You’ll learn how to distinguish between independent and paired sample designs, select the appropriate t-test, compute and interpret effect sizes (like Cohen’s d), and use confidence intervals to assess the magnitude and precision of mean differences.
1.2 Multimedia Resources
The following table provides access to video and slide resources for this chapter. Click the links to open them in an overlay for better viewing on all devices.
| Resource | Description | Link |
|---|---|---|
| Long Video Overview | A detailed video explaining independent and paired t-tests, assumptions, effect sizes, and interpretation in movement science research. | 🔗 Watch Video |
| Slide Overview PDF | PDF slides that serve as an overview of this chapter. Read these before the textbook to introduce the main concepts and vocabulary. | 🔗 Download PDF |
| Slide Deck HTML | Interactive HTML slides for class. During class, the instructor controls the presentation; after class, review at your own pace. | 🔗 Open Slides |
| Slide Deck PDF | PDF version of the slide deck for download and offline viewing. | 🔗 Download PDF |
1.3 Read the Chapter
Read Weir & Vincent (2021, Ch. 10) and Furtado (2026, Ch. 13) to understand the theoretical and practical application of t-tests for comparing two means.
To succeed in this course, you must read the textbook chapters assigned for each topic. This is the only way to learn the material in depth.
Once done, proceed to the next section to practice what you learned.
2 Practice
Practicing what you learned in the chapter is essential to mastering it. Below are some resources to help you practice the material in this chapter.
2.1 Frequently Asked Questions
**When should I use a paired t-test versus an independent t-test?**

Use a paired t-test when the same participants are measured twice (such as in pre–post designs) or when observations are matched in pairs (e.g., twins, or left–right limb comparisons). Paired designs control for individual differences by comparing each person to themselves, reducing error variance and increasing statistical power. In contrast, use an independent t-test when comparing two separate, unrelated groups (e.g., experimental vs. control) where participants in one group are distinct from those in the other. Using an independent t-test on paired data wastes power, while using a paired t-test on independent data treats unrelated observations as if they were matched, which invalidates the analysis.
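As an illustration only (the course itself uses SPSS), the two designs map onto two different SciPy functions. The data below are hypothetical, simulated values:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Paired design: the same 12 athletes measured before and after training,
# so each person serves as their own control.
pre = rng.normal(50, 5, 12)
post = pre + rng.normal(2, 2, 12)
t_paired, p_paired = stats.ttest_rel(pre, post)

# Independent design: two separate, unrelated groups of participants.
control = rng.normal(50, 5, 15)
treatment = rng.normal(53, 5, 15)
t_ind, p_ind = stats.ttest_ind(control, treatment)

print(f"paired: t = {t_paired:.2f}, p = {p_paired:.4f}")
print(f"independent: t = {t_ind:.2f}, p = {p_ind:.4f}")
```

Note that `ttest_rel` requires the two arrays to have the same length and matching order (person 1's pre score lines up with person 1's post score), while `ttest_ind` accepts groups of different sizes.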
**Why is Welch’s t-test recommended as the default?**

Welch’s t-test does not assume equal population variances, making it more robust than the traditional pooled-variance t-test. When variances differ substantially between groups, the pooled-variance t-test can produce inflated Type I error rates or reduced power. Welch’s t-test corrects for this by using separate variance estimates and adjusting the degrees of freedom. Importantly, Welch’s t-test performs well even when variances are equal, meaning it rarely performs worse than the pooled-variance version and often performs better. For this reason, most modern statistical practice recommends Welch’s t-test as the default.
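In SciPy, the difference between the two tests is a single argument, `equal_var`. A minimal sketch with hypothetical data where one group is far more variable than the other:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
group_a = rng.normal(50, 4, 20)    # small spread
group_b = rng.normal(54, 12, 20)   # three times the standard deviation

# Student's t-test: pools the variances, assumes they are equal.
student = stats.ttest_ind(group_a, group_b)

# Welch's t-test: separate variance estimates, adjusted degrees of freedom.
welch = stats.ttest_ind(group_a, group_b, equal_var=False)

print(f"Student: t = {student.statistic:.2f}, p = {student.pvalue:.4f}")
print(f"Welch:   t = {welch.statistic:.2f}, p = {welch.pvalue:.4f}")
```

When the variances really are equal, the two results converge, which is why defaulting to `equal_var=False` costs little.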
**What are the assumptions of the independent t-test?**

Independent t-tests assume:

1. **Independence of observations:** Scores in one group do not influence scores in the other.
2. **Normality:** Data in each group are approximately normally distributed. This is particularly critical with small samples (e.g., \(n < 15\)); with large samples, the test is robust to normality violations due to the Central Limit Theorem.
3. **Homogeneity of variance:** Population variances are equal (though this assumption can be relaxed by using Welch’s t-test).
**What are the assumptions of the paired t-test?**

Paired t-tests assume:

1. **Pairs are independent:** One pair does not influence another pair.
2. **Differences are normally distributed:** It is critical to check the normality of the difference scores (e.g., post − pre), not the raw pre- or post-test scores separately.
3. **No order effects:** For repeated measures, researchers should use counterbalancing or randomization if possible to prevent systematic order effects, like fatigue or learning.
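The assumption checks above can be sketched with SciPy (the class activities run the equivalent checks in SPSS). The data are hypothetical; Levene’s test probes homogeneity of variance, and the Shapiro–Wilk test probes normality of the difference scores:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

# Independent design: check homogeneity of variance with Levene's test.
g1 = rng.normal(50, 5, 20)
g2 = rng.normal(53, 6, 20)
_, p_levene = stats.levene(g1, g2)   # p > .05: no evidence of unequal variances

# Paired design: check normality of the DIFFERENCE scores, not the raw scores.
pre = rng.normal(50, 5, 15)
post = pre + rng.normal(2, 2, 15)
_, p_shapiro = stats.shapiro(post - pre)

print(f"Levene p = {p_levene:.3f}, Shapiro-Wilk (differences) p = {p_shapiro:.3f}")
```

A non-significant result on these checks is consistent with the assumption holding, though with small samples the checks themselves have little power.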
**How should I interpret Cohen’s d?**

Cohen’s d quantifies the standardized magnitude of a mean difference. Cohen suggested standard benchmarks: \(|d| = 0.2\) (small), \(0.5\) (medium), and \(0.8\) (large). However, context is crucial. In injury prevention research, even a “small” effect (\(d = 0.2\)) may save lives and strongly justify an intervention. Conversely, in elite athletic contexts, a “large” effect (\(d = 0.8\)) may be unrealistic to achieve. Always interpret effect sizes relative to your specific research domain rather than relying exclusively on arbitrary guidelines.
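For two independent groups, Cohen’s d divides the mean difference by the pooled standard deviation. A minimal sketch (illustrative only; SPSS reports this directly in recent versions):

```python
import numpy as np

def cohens_d(x, y):
    """Cohen's d for two independent groups, using the pooled standard deviation."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    nx, ny = len(x), len(y)
    pooled_var = ((nx - 1) * x.var(ddof=1) + (ny - 1) * y.var(ddof=1)) / (nx + ny - 2)
    return (x.mean() - y.mean()) / np.sqrt(pooled_var)

# Toy example: the groups differ by 1 unit, and the pooled SD is about 1.58,
# so the standardized difference is a bit above "medium" in magnitude.
print(round(cohens_d([4, 5, 6, 7, 8], [5, 6, 7, 8, 9]), 3))  # → -0.632
```

The sign of d simply reflects the direction of subtraction; the benchmarks above apply to its absolute value.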
**What do confidence intervals tell me that p-values do not?**

P-values indicate whether an effect is statistically detectable (e.g., \(p < .05\) suggests the difference is unlikely due to chance), but they do not quantify the size or precision of the effect. Confidence intervals (CIs) provide a range of plausible values for the true population difference. They enable researchers to evaluate both statistical significance (does the CI exclude zero?) and practical importance (are the bounds meaningful in the real world?). Wide CIs indicate high uncertainty, while narrow CIs indicate high precision—bringing transparency to your findings.
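The CI for a difference in means can be built from the standard error and a t critical value. A sketch using the Welch approximation (so it does not assume equal variances), with hypothetical toy data:

```python
import numpy as np
from scipy import stats

def welch_ci(x, y, conf=0.95):
    """Confidence interval for the difference in means (Welch approximation)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    vx, vy = x.var(ddof=1) / len(x), y.var(ddof=1) / len(y)
    se = np.sqrt(vx + vy)
    # Welch-Satterthwaite degrees of freedom
    df = (vx + vy) ** 2 / (vx**2 / (len(x) - 1) + vy**2 / (len(y) - 1))
    diff = x.mean() - y.mean()
    margin = stats.t.ppf((1 + conf) / 2, df) * se
    return diff - margin, diff + margin

lo, hi = welch_ci([4, 5, 6, 7, 8], [5, 6, 7, 8, 9])
print(f"95% CI for the mean difference: [{lo:.2f}, {hi:.2f}]")
```

Here the interval straddles zero, matching a non-significant two-sided test at \(\alpha = .05\): with only five scores per group, the data are consistent with anything from a sizable deficit to a sizable advantage.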
**What is statistical power, and how can I increase it?**

Statistical power is the probability of detecting a true effect when it exists (Power = \(1 - \beta\)). Power increases with:

1. Larger sample sizes
2. Larger effect sizes
3. A higher significance level (\(\alpha\))
4. Lower data variability
5. Matched/paired designs, which typically offer higher power than independent designs because they control for individual baseline differences

Low power (e.g., < 0.50) means a study will frequently miss real effects, producing false negatives. Researchers should conduct an a priori power analysis to determine the sample size they need.
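Power for an independent t-test can be computed from the noncentral t distribution. This is a sketch of the calculation that dedicated tools such as G*Power perform; the function name is my own:

```python
import numpy as np
from scipy import stats

def power_ind_t(d, n_per_group, alpha=0.05):
    """Power of a two-sided independent t-test (equal group sizes),
    computed from the noncentral t distribution."""
    df = 2 * n_per_group - 2
    ncp = d * np.sqrt(n_per_group / 2)        # noncentrality parameter
    t_crit = stats.t.ppf(1 - alpha / 2, df)   # two-sided rejection threshold
    # Probability the test statistic lands in either rejection region.
    return (1 - stats.nct.cdf(t_crit, df, ncp)) + stats.nct.cdf(-t_crit, df, ncp)

# Classic benchmark: a medium effect (d = 0.5) needs about 64 participants
# per group to reach 80% power at alpha = .05.
print(round(power_ind_t(0.5, 64), 2))  # → 0.8
```

Running this a priori—solving for the n that pushes power to your target—protects against the underpowered designs described above.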
**What is the difference between statistical and practical significance?**

Statistical significance (\(p < .05\)) simply indicates that an observed difference is unlikely to have occurred by chance, assuming the null hypothesis holds true. Practical significance evaluates whether the magnitude of the difference actually matters in real-world applications. A difference can be statistically significant but completely trivial in practice (e.g., a 0.1 s difference that reaches significance only because \(n = 5000\)). To assess practical significance, focus on effect sizes (like Cohen’s d), confidence intervals, and domain-specific thresholds like minimal clinically important differences (MCIDs).
2.2 Test your Knowledge
Take this low-stakes quiz to test your knowledge of the material in this chapter. This quiz is for practice only and will help you identify areas where you may need additional review.
3 Participate
This section includes activities and discussions that will be completed during class time. Your active participation is essential for deepening your understanding of the material.
During class, we will:

- Differentiate research scenarios that require independent versus paired t-tests
- Verify assumptions (homogeneity of variance, normality) using SPSS
- Run independent and paired-samples t-tests in SPSS
- Compare Student’s t-test with Welch’s t-test outputs
- Interpret effect sizes (Cohen’s d) and confidence intervals practically
- Practice writing APA-style results statements for mean comparisons
4 Perform
4.1 Apply Your Learning
Now that you’ve prepared, practiced, and participated, it’s time to demonstrate your mastery of the material through assignments and assessments.
I strongly encourage you to complete the previous “Ps” (Prepare, Practice, Participate) before attempting any assignments or assessments associated with this chapter.