13  Comparing Two Means

Independent and paired t-tests for evaluating group differences in Movement Science

Tip: 💻 Analytical Software & SPSS Tutorials

A Note on By-Hand Calculations: The purpose of this book is not to teach tedious by-hand statistical calculations. Modern researchers run these analyses using major software packages. While we provide the underlying equations for conceptual understanding, we strongly recommend relying on software for computation to avoid errors and save time.

Please direct your attention to the SPSS Tutorial: Comparing Two Means in the appendix for step-by-step instructions on performing t-tests, checking assumptions, computing effect sizes, and interpreting output!

13.1 Chapter roadmap

Comparing means between groups or conditions is one of the most fundamental tasks in Movement Science research[1,2]. Whether evaluating the effectiveness of a training intervention, comparing performance between athletes and non-athletes, or assessing changes from pre-test to post-test, researchers routinely ask: “Is there a meaningful difference between these two groups?”[3,4]. The t-test provides a principled statistical framework for answering this question by determining whether observed differences between sample means are large enough to infer that true population differences exist, or whether they could plausibly have arisen through sampling variability alone[1,5]. Unlike simply comparing raw sample means (which ignores uncertainty), t-tests account for sample size and data variability, yielding p-values that quantify the strength of evidence against the null hypothesis of no difference[6,7].

Understanding when and how to apply t-tests requires distinguishing between independent samples (comparing two separate groups, such as experimental vs. control) and paired samples (comparing two measurements on the same individuals, such as pre-test vs. post-test)[2,8]. Independent t-tests assume observations in one group do not influence observations in the other, while paired t-tests capitalize on within-subject correlations to increase statistical power[9,10]. Both designs are ubiquitous in Movement Science: independent designs commonly appear in randomized controlled trials comparing different training methods, while paired designs are standard in repeated-measures studies examining learning, fatigue, or intervention effects[11,12]. Choosing the correct test depends on the research design, and misapplying an independent t-test to paired data (or vice versa) can lead to incorrect conclusions[13].

This chapter provides comprehensive coverage of t-tests for comparing two means, including the assumptions underlying these tests, how to check and address violations, and how to compute and interpret effect sizes alongside p-values[7,14]. You will learn about Cohen’s d, the most common standardized effect size for mean differences, and how confidence intervals complement t-tests by revealing not just whether groups differ, but by how much they differ[3,6]. Additionally, you will explore the relationship between sample size, statistical power, and the ability to detect meaningful effects—a critical consideration for planning studies that are adequately powered to answer research questions[10,14]. By integrating hypothesis testing with estimation, effect size reporting, and practical significance evaluation, this chapter equips you to conduct, interpret, and critically evaluate two-group comparisons in Movement Science contexts[2,11].

By the end of this chapter, you will be able to:

  • Distinguish between one-sample, independent, and paired sample designs and select the appropriate t-test.
  • Conduct and interpret one-sample t-tests for comparing a sample mean to a benchmark.
  • Conduct and interpret independent t-tests for comparing two separate groups.
  • Conduct and interpret paired t-tests for comparing two related measurements.
  • Check assumptions of t-tests and recognize when violations may affect results.
  • Compute and interpret Cohen’s d and other effect size measures.
  • Use confidence intervals to assess the magnitude and precision of mean differences.
  • Understand the relationship between sample size, power, and effect detection.
  • Evaluate both statistical significance and practical importance of group differences.

13.2 Workflow for comparing two means

Use this sequence whenever you compare means between two groups or conditions:

  1. Identify the research design (independent or paired samples).
  2. State hypotheses (null: no difference; alternative: difference exists).
  3. Check assumptions (normality, independence; and equal variances for independent t-tests).
  4. Select the appropriate t-test (one-sample, independent, or paired).
  5. Compute the test statistic and p-value.
  6. Calculate the effect size (e.g., Cohen’s d) and confidence interval for the difference.
  7. Interpret results considering both statistical significance and practical importance.

13.3 One-sample t-test: Comparing a sample mean to a constant

A one-sample t-test compares the mean of a single group of observations to a known constant or a hypothesized population mean (\(\mu_0\))[1,5]. Unlike independent or paired tests, there is only one group being evaluated against a benchmark rather than being compared to another group or condition[8].

13.3.1 When to use a one-sample t-test

Use a one-sample t-test when[2,11]:

  • You have one group of participants and one continuous measurement per person
  • You want to compare the group mean to a specific benchmark (e.g., a “passing” score, a neutral point on a Likert scale, or a national average)
  • The population parameter (\(\mu_0\)) is known but the population standard deviation (\(\sigma\)) is unknown (if \(\sigma\) were known, you would use a z-test)
  • The dependent variable is continuous (measured on an interval or ratio scale)
Note (Real example): Comparing average steps to daily recommendations

A researcher measures the average daily step count of 50 university students to see if it differs significantly from the commonly recommended threshold of 10,000 steps per day. Since there is only one group being compared to a constant value, a one-sample t-test is appropriate[1].

13.3.2 Hypotheses for one-sample t-tests

Null hypothesis (H₀): \[ \mu = \mu_0 \]

The population mean is equal to the benchmark value (no difference).

Alternative hypothesis (H₁, two-tailed): \[ \mu \neq \mu_0 \]

The population mean is not equal to the benchmark value (difference exists).

13.3.3 Test statistic for one-sample t-tests

The one-sample t-test statistic is:

\[ t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}} \]

Where:

  • \(\bar{x}\) = sample mean
  • \(\mu_0\) = hypothesized population mean or constant benchmark
  • \(s\) = sample standard deviation
  • \(n\) = sample size
  • \(df = n - 1\)
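In code, this formula is a one-liner. A minimal Python sketch (the sample numbers are hypothetical, loosely echoing the heart-rate example in the next section):

```python
import math

def one_sample_t(xbar, mu0, s, n):
    """One-sample t statistic and degrees of freedom."""
    se = s / math.sqrt(n)        # standard error of the mean
    return (xbar - mu0) / se, n - 1

# Hypothetical sample: mean 63.2 bpm, SD 5.1, n = 20, benchmark 60 bpm
t, df = one_sample_t(63.2, 60, 5.1, 20)
```

Compare \(|t|\) to the critical value of the t distribution with \(df = n - 1\), or in practice read the p-value from software.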

13.3.4 Worked example: One-sample t-test

A researcher investigates whether the average heart rate of a group of 20 yoga practitioners during meditation differs from a target resting heart rate of 60 bpm.

Steps for analysis:

  1. State hypotheses:
    • H₀: μ = 60 (average heart rate is 60 bpm)
    • H₁: μ ≠ 60 (average heart rate differs from 60 bpm)
  2. Check assumptions: Assess the independence of observations and ensure the heart rate scores are approximately normally distributed.
  3. Run the analysis: Use statistical software to input the sample data and the test value (60) to compute the t-statistic, degrees of freedom, and p-value.
  4. Interpretation: If \(p < .05\), reject the null hypothesis and conclude that the practitioners’ heart rate differs significantly from the benchmark.

(For a step-by-step walkthrough, refer to the SPSS Tutorial: One-Sample T-Test in the appendix).

13.3.5 Assumptions of the one-sample t-test

One-sample t-tests assume[1,8]:

  1. Independence of observations: Each participant’s score is independent of others
  2. Normality: The dependent variable is approximately normally distributed in the population; this is less critical with larger samples (n > 30) due to the Central Limit Theorem[15]

13.4 Independent samples: Comparing two separate groups

An independent samples t-test (also called a two-sample t-test or independent t-test) compares the means of two separate, unrelated groups[1,5]. Observations in one group are independent of observations in the other—knowing the values in Group 1 tells you nothing about the values in Group 2[8].

13.4.1 When to use an independent t-test

Use an independent t-test when[2,11]:

  • You have two separate groups of participants (e.g., males vs. females, trained vs. untrained, experimental vs. control)—the groups must be mutually exclusive, meaning no participant can belong to both
  • Participants are randomly assigned to groups (in experimental designs) or naturally fall into groups (in non-experimental designs such as intact cohorts or stratified samples)
  • Each participant contributes one score to one group only; repeated or linked measurements within the same individual call for a paired design instead
  • The dependent variable is continuous (measured on an interval or ratio scale) and represents the outcome of interest
  • You want to determine whether the observed mean difference between groups is larger than what would be expected by sampling variability alone
Note (Real example): Comparing VO₂max between athletes and non-athletes

A researcher measures VO₂max (mL/kg/min) in 25 collegiate athletes and 25 recreationally active non-athletes. Since these are two separate, independent groups, an independent t-test is appropriate[2].

13.4.2 Hypotheses for independent t-tests

Null hypothesis (H₀): \[ \mu_1 = \mu_2 \quad \text{or} \quad \mu_1 - \mu_2 = 0 \]

The population means are equal (no difference between groups).

Alternative hypothesis (H₁, two-tailed): \[ \mu_1 \neq \mu_2 \quad \text{or} \quad \mu_1 - \mu_2 \neq 0 \]

The population means are not equal (groups differ).

For directional hypotheses, you might specify H₁: μ₁ > μ₂ (one-tailed), but two-tailed tests are preferred unless strong directional predictions exist[13,16].

13.4.3 Test statistic for independent t-tests

The independent t-test statistic is:

\[ t = \frac{(\bar{x}_1 - \bar{x}_2) - 0}{\text{SE}_{\text{diff}}} \]

Where:

  • \(\bar{x}_1, \bar{x}_2\) = sample means for Groups 1 and 2
  • \(\text{SE}_{\text{diff}}\) = standard error of the difference between means

The standard error of the difference depends on whether we assume equal population variances[1].

13.4.3.1 Equal variances assumed (pooled variance)

If \(\sigma_1^2 = \sigma_2^2\) (homogeneity of variance), we use pooled variance[5]:

\[ s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2} \]

\[ \text{SE}_{\text{diff}} = \sqrt{s_p^2 \left(\frac{1}{n_1} + \frac{1}{n_2}\right)} \]

Degrees of freedom: \(df = n_1 + n_2 - 2\)
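These two formulas combine into a few lines of code. A minimal Python sketch (the summary statistics are hypothetical, chosen to match the jump-height reporting example in Section 13.11):

```python
import math

def pooled_se(s1, n1, s2, n2):
    """Standard error of the mean difference under equal population variances."""
    sp2 = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)  # pooled variance
    return math.sqrt(sp2 * (1 / n1 + 1 / n2))

# Hypothetical groups: s1 = 6.2, s2 = 7.1, 30 participants each
se = pooled_se(6.2, 30, 7.1, 30)
df = 30 + 30 - 2   # 58 degrees of freedom
```

Note that with equal group sizes, the pooled SE coincides with the unpooled (Welch) SE; the two approaches diverge only when both the variances and the sample sizes differ.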

13.4.3.2 Equal variances not assumed (Welch’s t-test)

When group variances are unequal—or when you are unsure whether they are—use Welch’s t-test, which is the default output in most statistical software (including SPSS and R) because it does not require the homogeneity-of-variance assumption[17,18]:

\[ \text{SE}_{\text{diff}} = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}} \]

Degrees of freedom computed using the Welch-Satterthwaite approximation:

\[ df \approx \frac{\left(\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}\right)^2}{\frac{(s_1^2/n_1)^2}{n_1-1} + \frac{(s_2^2/n_2)^2}{n_2-1}} \]

Tip: Use Welch’s t-test by default

Welch’s t-test is more robust and does not require the equal variance assumption[13,18]. It performs well even when variances are equal, making it a safer default choice[19].
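The Welch standard error and Welch-Satterthwaite df take only a few lines of code. A minimal Python sketch (the group summaries are hypothetical, with deliberately unequal SDs):

```python
import math

def welch_se_df(s1, n1, s2, n2):
    """Welch standard error and Welch-Satterthwaite degrees of freedom."""
    v1, v2 = s1**2 / n1, s2**2 / n2      # per-group variance of the mean
    se = math.sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))
    return se, df

# Hypothetical groups: SD 6.2 (n = 20) vs. SD 12.4 (n = 18)
se, df = welch_se_df(6.2, 20, 12.4, 18)
```

The Welch df always falls between the smaller group’s \(n - 1\) and the pooled \(n_1 + n_2 - 2\), shrinking toward the former as the variances diverge.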

13.4.4 Worked example: Independent t-test

A study compares reaction time (ms) between two groups (young adults: \(n=20\), older adults: \(n=18\)). The research question is whether young and older adults differ significantly in reaction time.

Steps for analysis:

  1. State hypotheses:
    • H₀: μ₁ = μ₂ (no difference in reaction time)
    • H₁: μ₁ ≠ μ₂ (reaction times differ)
  2. Check assumptions: Evaluate independence, assess normality (e.g., using histograms or formal tests like Shapiro-Wilk), and check homogeneity of variance using Levene’s test.
  3. Run the analysis: Provide the data to your statistical software (like SPSS) to compute the t-statistic, degrees of freedom (often using the Welch-Satterthwaite adjustment for unequal variances), and the associated p-value.
  4. Interpretation: If the p-value is less than your chosen α (e.g., .05), reject the null hypothesis. In this hypothetical study, software might show \(p < .001\), allowing you to conclude that the young adults had significantly faster reaction times than older adults.

(For a step-by-step walkthrough, refer to the SPSS Tutorial: Independent-Samples T-Test in the appendix).

13.4.5 Assumptions of the independent t-test

Independent t-tests assume[1,8]:

  1. Independence of observations: Scores in one group do not influence scores in the other; this is violated when participants are related (e.g., siblings, matched pairs) or when the same individual appears in both groups
  2. Normality: Data in each group are approximately normally distributed—this assumption is less critical with larger samples (n > 30 per group) because the Central Limit Theorem ensures the sampling distribution of the mean approaches normality regardless of the population distribution[15]
  3. Homogeneity of variance: Population variances are assumed equal across groups; this assumption can be relaxed by using Welch’s t-test, which adjusts the degrees of freedom to account for unequal variances[18]

13.4.5.1 Checking normality

Normality assessment is covered in detail in Chapter 7: The Normal Distribution. The key approaches are:

  • Visual inspection: Histograms, Q-Q plots[8]
  • Formal tests: Shapiro-Wilk test (recommended for most sample sizes; modern software implementations support up to n = 5,000), Kolmogorov-Smirnov test[20]
  • Robustness: t-tests are robust to moderate non-normality, especially with n > 30 per group[15,21]

13.4.5.2 Checking equal variances

  • Levene’s test: Formally tests H₀: variances are equal across groups; a significant result (p < .05) signals that the equal-variance assumption is violated and that Welch’s t-test should be preferred[22]
  • Variance ratio (rule of thumb): Divide the larger sample variance by the smaller; if the ratio is less than 2:1, the equal-variance assumption is considered reasonable and either version of the t-test will yield similar results[8]
  • Visual check: Side-by-side boxplots or spread-level plots provide a quick visual sense of whether group spreads look similar before running formal tests
  • Best practice: Use Welch’s t-test as the default regardless of Levene’s test outcome—it performs nearly identically to Student’s t-test when variances are equal, but is substantially more accurate when they are not[13,18]
Warning (Common mistake): Assuming normality for small samples

With small samples (n < 15 per group), even moderate departures from normality can affect t-test validity[8]. Check assumptions visually and consider nonparametric alternatives (e.g., Mann-Whitney U test, Chapter 19) if assumptions are severely violated[23].
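13.5 Paired samples: Comparing two related measurements

A paired t-test (also called a dependent or repeated-measures t-test) compares two measurements taken on the same individuals (e.g., pre-test vs. post-test) or on matched pairs[9,10]. The analysis reduces to a one-sample t-test on the within-pair difference scores, testing whether their mean differs from zero.

Null hypothesis (H₀): \[ \mu_d = 0 \]

Alternative hypothesis (H₁, two-tailed): \[ \mu_d \neq 0 \]

The test statistic is:

\[ t = \frac{\bar{d}}{s_d / \sqrt{n}} \]

Where \(\bar{d}\) is the mean of the difference scores, \(s_d\) is their standard deviation, \(n\) is the number of pairs, and \(df = n - 1\). The test assumes that pairs are independent of one another and that the difference scores are approximately normally distributed.

As a minimal sketch (Python, with hypothetical pre-to-post strength gains in kg), the computation looks like:

```python
import math

def paired_t(diffs):
    """Paired t statistic and df from within-pair differences (post - pre)."""
    n = len(diffs)
    dbar = sum(diffs) / n                                          # mean difference
    sd = math.sqrt(sum((d - dbar) ** 2 for d in diffs) / (n - 1))  # SD of differences
    return dbar / (sd / math.sqrt(n)), n - 1

# Hypothetical gains for six participants
t, df = paired_t([4.1, 3.8, 5.2, 4.6, 3.9, 4.4])
```

Because each participant serves as their own control, between-subject variability drops out of the error term, which is why paired designs typically have higher power than independent designs for the same n[10].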

13.6 Effect sizes for comparing two means

Statistical significance (p < α) tells us whether an effect is detectable, but effect sizes tell us how large the effect is[7,14]. Always report effect sizes alongside p-values[24,25].

13.6.1 Cohen’s d

Cohen’s d is the most common standardized effect size for mean differences[14]:

\[ d = \frac{\bar{x}_1 - \bar{x}_2}{s_{\text{pooled}}} \]

For independent samples, the pooled standard deviation is:

\[ s_{\text{pooled}} = \sqrt{\frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}} \]

For paired samples[9]:

\[ d = \frac{\bar{d}}{s_d} \]

Where \(\bar{d}\) is the mean difference and \(s_d\) is the SD of differences.
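Both versions are simple to compute from summary statistics alone. A minimal Python sketch (the independent-groups numbers reuse the jump-height values reported in Section 13.11; the paired inputs are hypothetical):

```python
import math

def cohens_d_independent(m1, s1, n1, m2, s2, n2):
    """Cohen's d for two independent groups (pooled SD denominator)."""
    sp = math.sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))
    return (m1 - m2) / sp

def cohens_d_paired(dbar, s_d):
    """Cohen's d for paired designs: mean difference over SD of differences."""
    return dbar / s_d

# Jump-height summary statistics from the reporting example in Section 13.11
d_ind = cohens_d_independent(55.3, 6.2, 30, 47.8, 7.1, 30)
```

With these values \(d \approx 1.13\), matching the \(d = 1.12\) reported in Section 13.11 up to rounding.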

13.6.1.1 Interpreting Cohen’s d

Benchmarks (Cohen, 1988):

  • |d| = 0.2: Small effect
  • |d| = 0.5: Medium effect
  • |d| = 0.8: Large effect
Important: Context matters

Cohen’s benchmarks are guidelines, not absolute rules[7]. A “small” effect (d = 0.2) may be highly meaningful in some contexts (e.g., injury prevention) and trivial in others[3,4].

13.6.2 Worked example: Computing Cohen’s d

Rather than computing Cohen’s d and pooled standard deviations by hand, modern statistical software (including SPSS) now provides effect size estimates and their confidence intervals automatically as part of the t-test output.

For instance, if software calculates the effect size for the reaction time example (young vs. older adults) to be \(d = -1.42\), we can interpret its magnitude.

Interpretation

\(|d| = 1.42\) indicates a large effect (well exceeding Cohen’s threshold of 0.8)[14]. Young adults’ reaction times are 1.42 standard deviations faster than older adults’—a substantial difference with clear practical significance[3]. (The negative sign of d simply reflects the order of subtraction: young minus older.)

13.6.3 Confidence intervals for effect sizes

Just like means, effect sizes have uncertainty and should be reported with confidence intervals[6,7]. Software packages (e.g., ESCI, R packages like effectsize) can compute CIs for Cohen’s d[26].

Note (Real example): Effect size with CI

A meta-analysis reports that resistance training improves muscle strength with Cohen’s d = 0.78, 95% CI [0.65, 0.91]. This indicates a large effect that is precisely estimated[27].

13.7 Visualizing group comparisons

Effective visualizations communicate both central tendency and variability[9,24].

13.7.1 Box plots for independent groups

```r
library(ggplot2)
set.seed(42)

# Simulate data
trained <- rnorm(30, mean = 55, sd = 6)
untrained <- rnorm(30, mean = 48, sd = 7)

df_jump <- data.frame(
  Group = rep(c("Trained", "Untrained"), each = 30),
  Jump_Height = c(trained, untrained)
)

ggplot(df_jump, aes(x = Group, y = Jump_Height, fill = Group)) +
  geom_boxplot(alpha = 0.7, outlier.shape = 16, outlier.size = 2) +
  geom_jitter(width = 0.15, alpha = 0.3, size = 2) +
  scale_fill_manual(values = c("Trained" = "steelblue", "Untrained" = "coral")) +
  labs(x = "Group", y = "Vertical Jump Height (cm)",
       title = "Vertical Jump Performance by Training Status") +
  theme_minimal() +
  theme(legend.position = "none")
```
Figure 13.1: Comparison of vertical jump height (cm) between trained and untrained groups. Box plots show medians, interquartile ranges, and outliers. Trained athletes demonstrate higher and less variable performance.

Box plots reveal the distribution of scores within each group[8]. In Figure 13.1, trained athletes show consistently higher jump heights with less variability, indicating both a difference in central tendency and potentially more consistent performance[4].

13.7.2 Error bar plots with confidence intervals

```r
library(ggplot2)
library(dplyr)

# Compute summary statistics
summary_stats <- df_jump %>%
  group_by(Group) %>%
  summarise(
    Mean = mean(Jump_Height),
    SD = sd(Jump_Height),
    n = n(),
    SE = SD / sqrt(n),
    CI_lower = Mean - qt(0.975, df = n - 1) * SE,
    CI_upper = Mean + qt(0.975, df = n - 1) * SE
  )

ggplot(summary_stats, aes(x = Group, y = Mean, fill = Group)) +
  geom_col(alpha = 0.7, width = 0.6) +
  geom_errorbar(aes(ymin = CI_lower, ymax = CI_upper), width = 0.2, linewidth = 1) +
  scale_fill_manual(values = c("Trained" = "steelblue", "Untrained" = "coral")) +
  labs(x = "Group", y = "Mean Vertical Jump Height (cm)",
       title = "Mean Jump Height by Training Status (95% CI)") +
  theme_minimal() +
  theme(legend.position = "none") +
  ylim(0, 65)
```
Figure 13.2: Mean vertical jump height (cm) with 95% confidence intervals for trained and untrained groups. Non-overlapping error bars suggest a statistically significant difference.

Error bar plots with 95% confidence intervals (Figure 13.2) provide a clear visual comparison of group means and their precision[9]. Non-overlapping 95% CIs indicate roughly p < .01; however, overlapping CIs do not rule out statistical significance at α = .05—always conduct formal tests rather than relying on visual inspection alone[9,28].

13.7.3 Before-after plots for paired designs

```r
library(ggplot2)

# Simulated paired data
set.seed(42)
n_participants <- 12
pre <- rnorm(n_participants, mean = 42, sd = 4)
post <- pre + rnorm(n_participants, mean = 4.3, sd = 0.9)

df_paired <- data.frame(
  Participant = rep(1:n_participants, 2),
  Time = rep(c("Pre", "Post"), each = n_participants),
  Strength = c(pre, post)
)

df_paired$Time <- factor(df_paired$Time, levels = c("Pre", "Post"))

ggplot(df_paired, aes(x = Time, y = Strength, group = Participant)) +
  geom_line(alpha = 0.6, color = "steelblue") +
  geom_point(size = 3, alpha = 0.8, color = "steelblue") +
  stat_summary(aes(group = 1), fun = mean, geom = "line",
               color = "red", linewidth = 1.5, linetype = "dashed") +
  stat_summary(aes(group = 1), fun = mean, geom = "point",
               color = "red", size = 4, shape = 18) +
  labs(x = "Test Session", y = "Grip Strength (kg)",
       title = "Pre-Post Changes in Grip Strength (Individual Lines)") +
  theme_minimal()
```
Figure 13.3: Individual changes in grip strength (kg) from pre-test to post-test. Each line represents one participant; the consistent upward slopes show that participants improved.

Individual trajectory plots (Figure 13.3) show how each participant changed from pre to post[8,9]. The red dashed line represents the mean change, indicating an overall increase in grip strength across participants.

13.8 Sample size and statistical power

Statistical power is the probability of correctly rejecting a false null hypothesis—in other words, the probability that a study will detect a true effect when one genuinely exists in the population[10,14]. Power ranges from 0 to 1, and a conventional minimum target is 0.80, meaning an 80% chance of detecting a real effect if present[14]. The complement of power is the Type II error rate (β): a power of 0.80 implies a 20% chance of a false negative—concluding “no effect” when in fact there is one.

Underpowered studies are a pervasive problem in movement science and related fields[29]. They fail to detect meaningful effects, waste participant and researcher resources, and produce an inflated false-negative rate. Critically, when an underpowered study does yield a significant result, the observed effect size is likely an overestimate of the true population effect—a phenomenon known as the winner’s curse. Planning adequate power before data collection is therefore an ethical as well as a methodological responsibility[29].

13.8.1 Factors affecting power

  1. Sample size (n): Larger samples reduce the standard error of the mean, making it easier to distinguish a true effect from sampling variability; doubling n does not double power, but it does substantially increase sensitivity, especially in the range of 10–50 participants per group[14]
  2. Effect size (d): Larger effects are inherently easier to detect—a Cohen’s d of 0.8 requires far fewer participants than d = 0.2 for the same power level; if the true effect in the population is small, a large sample is needed to reliably detect it[7,14]
  3. Significance level (α): Setting α = .05 rather than α = .01 increases power by raising the threshold for a “reject” decision, but at the cost of a higher Type I error rate; this trade-off should be made deliberately before data collection, not adjusted post hoc[30]
  4. Measurement reliability: More reliable outcome measures (higher ICC or test-retest correlation) reduce within-group variance and thereby increase power; poor measurement precision has the same effect on power as reducing sample size[12]
  5. Design: Paired designs typically have substantially higher power than independent designs for the same n because between-subject variability is removed from the error term—when within-subject correlation is high (r > .50), a paired design may require less than half the participants needed for an equivalent independent design[8,10]

13.8.2 Power analysis for t-tests

A priori power analysis is conducted before data collection to determine the minimum sample size needed to achieve a desired level of statistical power (typically 0.80) for an expected effect size at a chosen α level[14]. The three inputs—effect size, α, and desired power—are interrelated: fixing any two determines the third. In practice, researchers specify the effect size they consider meaningful (often based on prior literature or pilot data), set α = .05 and power = 0.80, and solve for the required n.

Selecting a realistic effect size is the most consequential—and most difficult—step. Using an inflated effect size (e.g., from an underpowered pilot study) will underestimate the required sample and produce an underpowered study. Better sources include meta-analytic estimates for your outcome domain, published studies with comparable designs and populations, or the smallest effect size that would be practically meaningful given your research context[7].

Computations for power and required sample size involve non-central t-distributions, but dedicated software makes the process straightforward:

Tip: Use G*Power for sample size planning

**G*Power** is free, user-friendly software for power analysis[31]. Input your expected effect size, desired power, and α level to determine required sample size. Available at: https://www.psychologie.hhu.de/arbeitsgruppen/allgemeine-psychologie-und-arbeitspsychologie/gpower
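For a quick cross-check of such software, the normal approximation \(n \approx 2\,(z_{1-\alpha/2} + z_{1-\beta})^2 / d^2\) can be coded directly. A minimal Python sketch (it undershoots the exact noncentral-t result by a participant or two):

```python
import math
from statistics import NormalDist

def n_per_group(d, alpha=0.05, power=0.80):
    """Approximate per-group n for a two-sided independent t-test
    (normal approximation: n = 2 * ((z_crit + z_power) / d) ** 2)."""
    z = NormalDist().inv_cdf
    return math.ceil(2 * ((z(1 - alpha / 2) + z(power)) / d) ** 2)

# Medium effect, conventional alpha and power
n = n_per_group(0.5)   # 63 per group; the exact noncentral-t answer is 64
```

Note how steeply the required n grows as d shrinks: a small effect (d = 0.2) requires roughly six times the sample of a medium effect (d = 0.5).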

13.8.3 Post hoc power analysis: Why it’s problematic

Post hoc power (computing power after data collection) is widely misused[30,32]. When power is calculated from the observed effect size, it is mathematically determined by the p-value, making it circular and uninformative: a non-significant result will always yield low observed power by definition[33].

Better approach: Report effect sizes with confidence intervals[6]. Wide CIs indicate low precision (consistent with low power), but provide more informative guidance for future studies[10].

13.9 Independent vs. paired: Which test to use?

| Characteristic | One-sample t-test | Independent t-test | Paired t-test |
|---|---|---|---|
| Design | One group vs. benchmark | Two separate groups | Same participants measured twice (or matched pairs) |
| Assumptions | Independence, normality | Independence, normality, (equal variances) | Pairs independent, differences normally distributed |
| Power | N/A | Lower (between-subject variability) | Higher (controls for individual differences) |
| Example | Compare average BMI to national average | Compare trained vs. untrained athletes | Compare pre-test vs. post-test in same athletes |
| Null hypothesis | μ = μ₀ | μ₁ = μ₂ | μ_d = 0 |
| Degrees of freedom | n − 1 | n₁ + n₂ − 2 (or Welch’s df) | n − 1 (n = number of pairs) |
Warning (Common mistake): Using independent t-test for paired data

If you use an independent t-test when data are actually paired, you lose power by failing to control for individual differences[10]. Always match the test to the design[8].

13.10 Assumptions violations and alternatives

When assumptions are violated, consider:

13.10.1 For non-normality

  • Transformation: Log, square root, or rank transformations may normalize data[34]
  • Nonparametric alternatives:
    • Mann-Whitney U test (independent samples)
    • Wilcoxon signed-rank test (paired samples)
    • See Chapter 19 for details[23]

13.10.2 For unequal variances (independent t-test)

  • Welch’s t-test: Does not assume equal variances (preferred default)[18]

13.10.3 For small samples

Small samples (n < 15 per group) present particular challenges because the Central Limit Theorem cannot be relied upon to rescue non-normal data, and formal normality tests (e.g., Shapiro-Wilk) have low power to detect departures from normality at these sample sizes[8,20].

  • Check normality visually: With small n, histograms are uninformative—use Q-Q plots and consider whether the underlying construct is plausibly normal given the population[8]
  • Be cautious with outliers: A single extreme value can substantially distort the mean and inflate or deflate the t-statistic in small samples; inspect data carefully before analysis[35]
  • Consider bootstrapping: Resampling-based methods (e.g., bootstrap confidence intervals) make no distributional assumptions and can provide valid inference even with small, non-normal samples[36]
  • Consider permutation tests: Permutation (randomization) tests are exact, assumption-free alternatives to the t-test that work by comparing the observed test statistic to the distribution generated by all possible rearrangements of the data[36]
  • Nonparametric fallback: If normality is clearly violated and the sample is too small to rely on robustness, use the Mann-Whitney U (independent groups) or Wilcoxon signed-rank test (paired design) as described in Chapter 19[23]

13.11 Reporting t-tests in APA style

Template for independent t-test:

“[Group 1] (M = [mean], SD = [SD], n = [n]) [differed/did not differ] significantly from [Group 2] (M = [mean], SD = [SD], n = [n]), t([df]) = [t-value], p = [p-value], d = [d-value], 95% CI [lower, upper].”

Example:

“Trained athletes (M = 55.3 cm, SD = 6.2, n = 30) demonstrated significantly higher vertical jump performance than untrained controls (M = 47.8 cm, SD = 7.1, n = 30), t(58) = 4.52, p < .001, d = 1.12, 95% CI [4.2, 10.8] cm.”

Template for paired t-test:

“[Condition 2] (M = [mean], SD = [SD]) was significantly [higher/lower] than [Condition 1] (M = [mean], SD = [SD]), t([df]) = [t-value], p = [p-value], mean difference = [M_diff], 95% CI [lower, upper], d = [d-value].”

Example:

“Post-training grip strength (M = 45.9 kg, SD = 4.40) was significantly greater than pre-training strength (M = 41.6 kg, SD = 3.75), t(11) = 16.91, p < .001, mean difference = 4.3 kg, 95% CI [3.77, 4.90], d = 4.87.”

13.12 Comparing Two Proportions: The Two-Proportion Z-Test

The same two-group logic extends to proportions (e.g., injury rates, success rates); because a proportion’s variance is determined by the proportion itself, the comparison uses a z-test rather than a t-test[37,38].

For proportions \(p_1\) and \(p_2\), the test statistic is:

\[ z = \frac{p_1 - p_2}{\text{SE}_{\text{diff}}} \]

Where:

\[ \text{SE}_{\text{diff}} = \sqrt{\frac{p_1(1-p_1)}{n_1} + \frac{p_2(1-p_2)}{n_2}} \]

For large samples, this follows a standard normal (z) distribution[1,37]. Software implements this as a two-proportion z-test.

Note: Pooled vs. unpooled standard error for proportions

The formula above uses separate (unpooled) proportions, which is appropriate for confidence intervals for the difference. When conducting a hypothesis test under H₀: p₁ = p₂, many software packages instead use the pooled proportion \(\hat{p} = (x_1 + x_2)/(n_1 + n_2)\):

\[ \text{SE}_{\text{diff (pooled)}} = \sqrt{\hat{p}(1-\hat{p})\left(\frac{1}{n_1} + \frac{1}{n_2}\right)} \]

Both formulas give similar results with large samples, but the pooled version is theoretically preferred for significance testing[37].

Note (Real example): Comparing injury rates

A study finds that 18 of 100 runners using minimalist shoes (18%) suffered injuries, compared to 12 of 100 using traditional shoes (12%). A two-proportion z-test yields z = 1.19, p = .23, suggesting no significant difference in injury rates[37].
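The injury-rate example can be reproduced with a short Python sketch implementing both SE versions, using only the standard library (NormalDist supplies the normal CDF):

```python
import math
from statistics import NormalDist

def two_prop_z(x1, n1, x2, n2, pooled=True):
    """Two-proportion z statistic and two-sided p-value."""
    p1, p2 = x1 / n1, x2 / n2
    if pooled:                       # SE under H0: p1 = p2
        p = (x1 + x2) / (n1 + n2)
        se = math.sqrt(p * (1 - p) * (1 / n1 + 1 / n2))
    else:                            # unpooled SE (used for CIs)
        se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    z = (p1 - p2) / se
    return z, 2 * (1 - NormalDist().cdf(abs(z)))

z, p = two_prop_z(18, 100, 12, 100)   # z ≈ 1.19, p ≈ .23
```

With samples this large, the pooled and unpooled versions agree to two decimal places, as the note above anticipates.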

13.13 Common pitfalls and best practices

13.13.1 Pitfall 1: P-hacking and selective reporting

Problem: Running multiple t-tests and reporting only significant ones inflates Type I error[39,40].

Solution: Preregister analyses, report all tests conducted, and correct for multiple comparisons if appropriate[16].

13.13.2 Pitfall 2: Confusing significance with importance

Problem: Small, trivial effects can be statistically significant with large samples[41].

Solution: Always report and interpret effect sizes and confidence intervals[3,6].

13.13.3 Pitfall 3: Ignoring assumptions

Problem: Violating normality or equal variance assumptions can lead to incorrect p-values[8].

Solution: Check assumptions, use robust methods (e.g., Welch’s t-test), or use nonparametric alternatives[18].

13.13.4 Pitfall 4: Inappropriate test choice

Problem: Using independent t-test for paired data (or vice versa) produces incorrect results[10].

Solution: Match the test to the research design[2,11].

13.14 Chapter summary

Comparing two means is a fundamental task in Movement Science research, and t-tests provide a principled statistical framework for determining whether observed differences between groups or conditions reflect true population differences or merely sampling variability[1,5]. Independent t-tests compare separate groups (e.g., experimental vs. control), while paired t-tests compare related measurements (e.g., pre-test vs. post-test), with paired designs offering greater statistical power by controlling for individual differences[9,10]. Both tests assume approximate normality and independence, with Welch’s t-test providing robust inference when variances differ[17,18]. However, statistical significance (p < α) alone is insufficient for drawing meaningful conclusions—researchers must also report effect sizes (e.g., Cohen’s d) and confidence intervals to assess the magnitude and precision of differences[6,7,14].

Effect sizes quantify how large group differences are in standardized units, enabling comparisons across studies and evaluation of practical significance[7,14]. A Cohen’s d of 1.0 indicates that group means differ by one standard deviation—a substantial difference in most contexts[3]. Confidence intervals complement t-tests by revealing not just whether groups differ (p-value), but by how much they differ and with what precision[6,42]. Sample size and statistical power are inextricably linked: adequately powered studies (typically Power ≥ 0.80) reliably detect meaningful effects, while underpowered studies produce false negatives and waste resources[10,29]. Planning studies with a priori power analysis ensures sufficient sample sizes to answer research questions definitively[14,31].

Ultimately, responsible use of t-tests requires integrating hypothesis testing with estimation, effect size reporting, and practical significance evaluation[3,6]. Researchers should not merely ask “Is there a significant difference?” but also “How large is the difference, how precisely have we estimated it, and does it matter in applied contexts?”[4,7]. By combining t-tests with transparent reporting of means, confidence intervals, and effect sizes, Movement Science practitioners can move beyond binary significant/non-significant thinking toward nuanced interpretation of group differences, fostering more reproducible and impactful research[2,6,11].

13.15 Key terms

independent samples; paired samples; t-test; two-sample t-test; dependent t-test; repeated measures; null hypothesis; alternative hypothesis; test statistic; degrees of freedom; p-value; statistical significance; practical significance; Cohen’s d; effect size; confidence interval; pooled variance; Welch’s t-test; homogeneity of variance; normality; statistical power; Type I error; Type II error; sample size planning

13.16 Practice: quick checks

Use a paired t-test when the same participants are measured twice (pre-post designs) or when observations are matched in pairs (e.g., twins, left-right limb comparisons)[10]. Paired designs control for individual differences by comparing each person to themselves, reducing error variance and increasing statistical power[9]. In contrast, use an independent t-test when comparing two separate, unrelated groups (e.g., experimental vs. control) where participants in one group are distinct from those in the other[2,8]. Using an independent t-test on paired data wastes power, while using a paired t-test on independent data violates the assumption that pairs are related, producing incorrect results[11].
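To see the consequence of test choice concretely, the sketch below (Python with scipy, an assumption; the pre/post scores are hypothetical) runs both tests on the same paired measurements:

```python
from scipy.stats import ttest_rel, ttest_ind

# hypothetical pre/post scores for the same four participants
pre  = [10, 12, 14, 16]
post = [11, 14, 15, 18]

paired = ttest_rel(pre, post)        # correct test for this design
independent = ttest_ind(pre, post)   # ignores the pairing (incorrect here)

# The paired test detects the consistent ~1.5-unit gain; the independent
# test, swamped by between-person variability, does not.
```

Because every participant improved, the paired test reaches significance while the independent test on the identical numbers does not, illustrating the power wasted by mismatching test and design.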

Welch’s t-test does not assume equal population variances, making it more robust than the traditional pooled-variance t-test[17,18]. When variances differ substantially between groups, the pooled-variance t-test can produce inflated Type I error rates (more false positives) or reduced power[19]. Welch’s t-test corrects for this by using separate variance estimates and adjusting degrees of freedom[13]. Importantly, Welch’s t-test performs well even when variances are equal, meaning it rarely performs worse than the pooled-variance version and often performs better[18]. For this reason, many modern statistical packages (e.g., R’s t.test) use Welch’s t-test as the default[8].
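In Python's scipy (an assumption; the book's tutorials use SPSS), switching between the pooled-variance and Welch versions is a single-argument change. The data below are hypothetical, with one group deliberately more variable:

```python
from scipy.stats import ttest_ind

# hypothetical strength scores: group b is far more variable than group a
a = [50, 52, 48, 51, 49, 50]
b = [60, 35, 70, 45, 80, 40]

pooled = ttest_ind(a, b, equal_var=True)    # classic pooled-variance t-test
welch = ttest_ind(a, b, equal_var=False)    # Welch's t-test
```

With equal group sizes the two t statistics coincide, but Welch's test uses fewer degrees of freedom to compensate for the unequal variances, yielding a slightly more conservative p-value.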

Cohen’s d quantifies the standardized magnitude of a mean difference: d = (M₁ − M₂) / SD_pooled[14]. Cohen suggested benchmarks: |d| = 0.2 (small), 0.5 (medium), 0.8 (large). However, context is crucial[3,7]. In injury prevention research, even a “small” effect (d = 0.2) may save lives and justify intervention. Conversely, in elite performance contexts, a “large” effect (d = 0.8) may be unrealistic or impractical to achieve[4]. Always interpret effect sizes relative to domain-specific benchmarks, prior research, and practical significance thresholds rather than relying solely on Cohen’s arbitrary guidelines[7].
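The formula translates directly to code; a minimal sketch in Python (an assumed language choice) using the sample-variance-based pooled SD:

```python
from math import sqrt
from statistics import mean, variance

def cohens_d(x, y):
    """Cohen's d for two independent samples, using the pooled SD."""
    nx, ny = len(x), len(y)
    # pooled variance weights each group's sample variance by its df
    sd_pooled = sqrt(((nx - 1) * variance(x) + (ny - 1) * variance(y))
                     / (nx + ny - 2))
    return (mean(x) - mean(y)) / sd_pooled
```

For example, two small hypothetical samples whose means differ by about 1.5 pooled SDs would return d ≈ −1.55 or +1.55 depending on group order; the sign simply reflects the direction of the difference.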

P-values indicate whether an effect is statistically detectable (p < .05 suggests the difference is unlikely due to chance), but they do not quantify the size or precision of the effect[6,43]. Confidence intervals provide a range of plausible values for the true population difference, enabling researchers to evaluate both statistical significance (does the CI exclude zero?) and practical importance (are the plausible effect sizes meaningful?)[3,42]. For example, a 95% CI of [0.5, 8.2] cm for a mean difference indicates statistical significance (excludes zero) but substantial uncertainty (wide range), while a CI of [4.0, 4.8] cm indicates both significance and high precision[9]. Reporting CIs fosters transparent communication of uncertainty[24,25].
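A pooled-variance confidence interval for a mean difference can be built from summary statistics alone; a sketch in Python with scipy (an assumption; equal variances assumed in the formula):

```python
from math import sqrt
from scipy.stats import t

def mean_diff_ci(m1, s1, n1, m2, s2, n2, conf=0.95):
    """CI for (m1 - m2) from group means, SDs, and sizes (equal variances)."""
    df = n1 + n2 - 2
    sp2 = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / df  # pooled variance
    se = sqrt(sp2 * (1 / n1 + 1 / n2))
    crit = t.ppf((1 + conf) / 2, df)  # two-sided critical t value
    diff = m1 - m2
    return diff - crit * se, diff + crit * se
```

A 95% CI that excludes zero corresponds to p < .05 for the two-sided test at the same α, while its width communicates the precision the p-value alone conceals.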

Statistical power is the probability of detecting a true effect when it exists (Power = 1 − β, where β is the Type II error rate)[14]. Power increases with (1) larger sample sizes, (2) larger effect sizes, (3) a more lenient significance threshold (larger α), and (4) lower data variability[10]. Paired designs also offer higher power than independent designs because they control for individual differences[9]. Low power (< 0.50) means studies frequently miss real effects, producing false negatives and inconclusive results[29]. Underpowered studies waste resources, mislead interpretation, and contribute to publication bias (only “lucky” significant findings get published)[44]. Aim for Power ≥ 0.80 when planning studies to ensure adequate sensitivity for detecting meaningful effects[10,14].
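Power for a two-sided independent t-test can be computed from the noncentral t distribution; the sketch below (Python with scipy, an assumption) reproduces the standard result that detecting d = 0.5 with 80% power at α = .05 requires roughly 64 participants per group:

```python
from math import sqrt
from scipy.stats import t, nct

def power_ind_t(d, n_per_group, alpha=0.05):
    """Power of a two-sided independent t-test with equal group sizes."""
    df = 2 * n_per_group - 2
    nc = d * sqrt(n_per_group / 2)       # noncentrality parameter
    tcrit = t.ppf(1 - alpha / 2, df)     # two-sided critical value
    # probability the test statistic falls in either rejection region
    return nct.sf(tcrit, df, nc) + nct.cdf(-tcrit, df, nc)
```

Calling `power_ind_t(0.5, 64)` returns approximately 0.80, and increasing n per group raises power, matching the planning guidance in the text.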

Statistical significance (p < .05) indicates that an observed difference is unlikely to have occurred by chance, assuming the null hypothesis is true[1,41]. Practical significance evaluates whether the magnitude of the difference is large enough to matter in real-world applications[3,7]. A difference can be statistically significant but trivial (e.g., 0.2 ms improvement in reaction time with n = 1000) or large but not statistically significant (e.g., 10 cm jump improvement with n = 5)[10]. To assess practical significance, examine effect sizes, confidence intervals, and domain-specific thresholds such as minimal clinically important differences (MCIDs)[4]. Always ask: “Even if this difference is real, does it matter for performance, health, or decision-making?”[3,6].

Note: Read further

For deeper exploration of t-tests and mean comparisons, see Cumming (2012)[9] (Understanding the New Statistics), Maxwell et al. (2018)[10] (power and design), Cohen (1988)[14] (effect sizes), and Vincent (2005)[2] (Movement Science applications). For practical guidance on assumption checking and robust alternatives, consult Field (2018)[8] and Wilcox (2017)[35].

Tip: Next chapter

In Chapter 14, you will extend the logic of comparing two means to Analysis of Variance (ANOVA), which enables comparisons across three or more groups simultaneously. You will learn about partitioning variance, F-tests, post hoc comparisons, and interpreting main effects in more complex experimental designs.