13  Comparing Two Means

Independent and paired t-tests for evaluating group differences in Movement Science

Tip: 💻 Analytical Software & SPSS Tutorials

A Note on By-Hand Calculations: The purpose of this book is not to teach tedious by-hand statistical calculations. Modern researchers run these analyses using major software packages. While we provide the underlying equations for conceptual understanding, we strongly recommend relying on software for computation to avoid errors and save time.

Please direct your attention to the SPSS Tutorial: Comparing Two Means in the appendix for step-by-step instructions on performing t-tests, checking assumptions, computing effect sizes, and interpreting output!

13.1 Chapter roadmap

Comparing means between groups or conditions is one of the most fundamental tasks in Movement Science research[1,2]. Whether evaluating the effectiveness of a training intervention, comparing performance between athletes and non-athletes, or assessing changes from pre-test to post-test, researchers routinely ask: “Is there a meaningful difference between these two groups?”[3,4]. The t-test provides a principled statistical framework for answering this question by determining whether observed differences between sample means are large enough to infer that true population differences exist, or whether they could plausibly have arisen through sampling variability alone[1,5]. Unlike simply comparing raw sample means (which ignores uncertainty), t-tests account for sample size and data variability, yielding p-values that quantify the strength of evidence against the null hypothesis of no difference[6,7].

Understanding when and how to apply t-tests requires distinguishing between independent samples (comparing two separate groups, such as experimental vs. control) and paired samples (comparing two measurements on the same individuals, such as pre-test vs. post-test)[2,8]. Independent t-tests assume observations in one group do not influence observations in the other, while paired t-tests capitalize on within-subject correlations to increase statistical power[9,10]. Both designs are ubiquitous in Movement Science: independent designs commonly appear in randomized controlled trials comparing different training methods, while paired designs are standard in repeated-measures studies examining learning, fatigue, or intervention effects[11,12]. Choosing the correct test depends on the research design, and misapplying an independent t-test to paired data (or vice versa) can lead to incorrect conclusions[13].

This chapter provides comprehensive coverage of t-tests for comparing two means, including the assumptions underlying these tests, how to check and address violations, and how to compute and interpret effect sizes alongside p-values[7,14]. You will learn about Cohen’s d, the most common standardized effect size for mean differences, and how confidence intervals complement t-tests by revealing not just whether groups differ, but by how much they differ[3,6]. Additionally, you will explore the relationship between sample size, statistical power, and the ability to detect meaningful effects—a critical consideration for planning studies that are adequately powered to answer research questions[10,14]. By integrating hypothesis testing with estimation, effect size reporting, and practical significance evaluation, this chapter equips you to conduct, interpret, and critically evaluate two-group comparisons in Movement Science contexts[2,11].

By the end of this chapter, you will be able to:

  • Distinguish between one-sample, independent, and paired sample designs and select the appropriate t-test.
  • Conduct and interpret one-sample t-tests for comparing a sample mean to a benchmark.
  • Conduct and interpret independent t-tests for comparing two separate groups.
  • Conduct and interpret paired t-tests for comparing two related measurements.
  • Check assumptions of t-tests and recognize when violations may affect results.
  • Compute and interpret Cohen’s d and other effect size measures.
  • Use confidence intervals to assess the magnitude and precision of mean differences.
  • Understand the relationship between sample size, power, and effect detection.
  • Evaluate both statistical significance and practical importance of group differences.

13.2 Workflow for comparing two means

Use this sequence whenever you compare means between two groups or conditions:

  1. Identify the research design (independent or paired samples).
  2. State hypotheses (null: no difference; alternative: difference exists).
  3. Check assumptions (normality, independence; and equal variances for independent t-tests).
  4. Select the appropriate t-test (one-sample, independent, or paired).
  5. Compute the test statistic and p-value.
  6. Calculate the effect size (e.g., Cohen’s d) and confidence interval for the difference.
  7. Interpret results considering both statistical significance and practical importance.

13.3 One-sample t-test: Comparing a sample mean to a constant

A one-sample t-test compares the mean of a single group of observations to a known constant or a hypothesized population mean (\(\mu_0\))[1,5]. Unlike independent or paired tests, there is only one group being evaluated against a benchmark rather than being compared to another group or condition[8].

13.3.1 When to use a one-sample t-test

Use a one-sample t-test when[2,11]:

  • You have one group of participants and one continuous measurement per person
  • You want to compare the group mean to a specific benchmark (e.g., a “passing” score, a neutral point on a Likert scale, or a national average)
  • The population parameter (\(\mu_0\)) is known but the population standard deviation (\(\sigma\)) is unknown (if \(\sigma\) were known, you would use a z-test)
  • The dependent variable is continuous (measured on an interval or ratio scale)
Note (Real example): Comparing average steps to daily recommendations

A researcher measures the average daily step count of 50 university students to see if it differs significantly from the commonly recommended threshold of 10,000 steps per day. Since there is only one group being compared to a constant value, a one-sample t-test is appropriate[1].

13.3.2 Hypotheses for one-sample t-tests

Null hypothesis (H₀): \[ \mu = \mu_0 \]

The population mean is equal to the benchmark value (no difference).

Alternative hypothesis (H₁, two-tailed): \[ \mu \neq \mu_0 \]

The population mean is not equal to the benchmark value (difference exists).

13.3.3 Test statistic for one-sample t-tests

The one-sample t-test statistic is:

\[ t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}} \]

Where:

  • \(\bar{x}\) = sample mean
  • \(\mu_0\) = hypothesized population mean or constant benchmark
  • \(s\) = sample standard deviation
  • \(n\) = sample size
  • \(df = n - 1\)
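In code, this formula is a one-liner. A minimal Python sketch (the sample numbers are hypothetical, loosely echoing the heart-rate example in the next section):

```python
import math

def one_sample_t(xbar, mu0, s, n):
    """One-sample t statistic and degrees of freedom."""
    se = s / math.sqrt(n)        # standard error of the mean
    return (xbar - mu0) / se, n - 1

# Hypothetical sample: mean 63.2 bpm, SD 5.1, n = 20, benchmark 60 bpm
t, df = one_sample_t(63.2, 60, 5.1, 20)
```

Compare \(|t|\) to the critical value of the t distribution with \(df = n - 1\), or in practice read the p-value from software.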

13.3.4 Worked example: One-sample t-test

A researcher investigates whether the average heart rate of a group of 20 yoga practitioners during meditation differs from a target resting heart rate of 60 bpm.

Steps for analysis:

  1. State hypotheses:
    • H₀: μ = 60 (average heart rate is 60 bpm)
    • H₁: μ ≠ 60 (average heart rate differs from 60 bpm)
  2. Check assumptions: Assess the independence of observations and ensure the heart rate scores are approximately normally distributed.
  3. Run the analysis: Use statistical software to input the sample data and the test value (60) to compute the t-statistic, degrees of freedom, and p-value.
  4. Interpretation: If \(p < .05\), reject the null hypothesis and conclude that the practitioners’ heart rate differs significantly from the benchmark.

(For a step-by-step walkthrough, refer to the SPSS Tutorial: One-Sample T-Test in the appendix).

13.3.5 Assumptions of the one-sample t-test

One-sample t-tests assume[1,8]:

  1. Independence of observations: Each participant’s score is independent of others
  2. Normality: The dependent variable is approximately normally distributed in the population; this is less critical with larger samples (n > 30) due to the Central Limit Theorem[15]

13.4 Independent samples: Comparing two separate groups

An independent samples t-test (also called a two-sample t-test or independent t-test) compares the means of two separate, unrelated groups[1,5]. Observations in one group are independent of observations in the other—knowing the values in Group 1 tells you nothing about the values in Group 2[8].

13.4.1 When to use an independent t-test

Use an independent t-test when[2,11]:

  • You have two separate groups of participants (e.g., males vs. females, trained vs. untrained, experimental vs. control)—the groups must be mutually exclusive, meaning no participant can belong to both
  • Participants are randomly assigned to groups (in experimental designs) or naturally fall into groups (in non-experimental designs such as intact cohorts or stratified samples)
  • Each participant contributes one score to one group only; repeated or linked measurements within the same individual call for a paired design instead
  • The dependent variable is continuous (measured on an interval or ratio scale) and represents the outcome of interest
  • You want to determine whether the observed mean difference between groups is larger than what would be expected by sampling variability alone
Note (Real example): Comparing VO₂max between athletes and non-athletes

A researcher measures VO₂max (mL/kg/min) in 25 collegiate athletes and 25 recreationally active non-athletes. Since these are two separate, independent groups, an independent t-test is appropriate[2].

13.4.2 Hypotheses for independent t-tests

Null hypothesis (H₀): \[ \mu_1 = \mu_2 \quad \text{or} \quad \mu_1 - \mu_2 = 0 \]

The population means are equal (no difference between groups).

Alternative hypothesis (H₁, two-tailed): \[ \mu_1 \neq \mu_2 \quad \text{or} \quad \mu_1 - \mu_2 \neq 0 \]

The population means are not equal (groups differ).

For directional hypotheses, you might specify H₁: μ₁ > μ₂ (one-tailed), but two-tailed tests are preferred unless strong directional predictions exist[13,16].

13.4.3 Test statistic for independent t-tests

The independent t-test statistic is:

\[ t = \frac{(\bar{x}_1 - \bar{x}_2) - 0}{\text{SE}_{\text{diff}}} \]

Where:

  • \(\bar{x}_1, \bar{x}_2\) = sample means for Groups 1 and 2
  • \(\text{SE}_{\text{diff}}\) = standard error of the difference between means

The standard error of the difference depends on whether we assume equal population variances[1].

13.4.3.1 Equal variances assumed (pooled variance)

If \(\sigma_1^2 = \sigma_2^2\) (homogeneity of variance), we use pooled variance[5]:

\[ s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2} \]

\[ \text{SE}_{\text{diff}} = \sqrt{s_p^2 \left(\frac{1}{n_1} + \frac{1}{n_2}\right)} \]

Degrees of freedom: \(df = n_1 + n_2 - 2\)
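These two formulas combine into a few lines of code. A minimal Python sketch (the summary statistics are hypothetical, chosen to match the jump-height reporting example in Section 13.11):

```python
import math

def pooled_se(s1, n1, s2, n2):
    """Standard error of the mean difference under equal population variances."""
    sp2 = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)  # pooled variance
    return math.sqrt(sp2 * (1 / n1 + 1 / n2))

# Hypothetical groups: s1 = 6.2, s2 = 7.1, 30 participants each
se = pooled_se(6.2, 30, 7.1, 30)
df = 30 + 30 - 2   # 58 degrees of freedom
```

Note that with equal group sizes, the pooled SE coincides with the unpooled (Welch) SE; the two approaches diverge only when both the variances and the sample sizes differ.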

13.4.3.2 Equal variances not assumed (Welch’s t-test)

When group variances are unequal—or when you are unsure whether they are—use Welch’s t-test, which is the default output in most statistical software (including SPSS and R) because it does not require the homogeneity-of-variance assumption[17,18]:

\[ \text{SE}_{\text{diff}} = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}} \]

Degrees of freedom computed using the Welch-Satterthwaite approximation:

\[ df \approx \frac{\left(\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}\right)^2}{\frac{(s_1^2/n_1)^2}{n_1-1} + \frac{(s_2^2/n_2)^2}{n_2-1}} \]

Tip: Use Welch’s t-test by default

Welch’s t-test is more robust and does not require the equal variance assumption[13,18]. It performs well even when variances are equal, making it a safer default choice[19].
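The Welch standard error and Welch-Satterthwaite df take only a few lines of code. A minimal Python sketch (the group summaries are hypothetical, with deliberately unequal SDs):

```python
import math

def welch_se_df(s1, n1, s2, n2):
    """Welch standard error and Welch-Satterthwaite degrees of freedom."""
    v1, v2 = s1**2 / n1, s2**2 / n2      # per-group variance of the mean
    se = math.sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))
    return se, df

# Hypothetical groups: SD 6.2 (n = 20) vs. SD 12.4 (n = 18)
se, df = welch_se_df(6.2, 20, 12.4, 18)
```

The Welch df always falls between the smaller group’s \(n - 1\) and the pooled \(n_1 + n_2 - 2\), shrinking toward the former as the variances diverge.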

13.4.4 Worked example: Independent t-test

A study compares reaction time (ms) between two groups (young adults: \(n=20\), older adults: \(n=18\)). The research question is whether young and older adults differ significantly in reaction time.

Steps for analysis:

  1. State hypotheses:
    • H₀: μ₁ = μ₂ (no difference in reaction time)
    • H₁: μ₁ ≠ μ₂ (reaction times differ)
  2. Check assumptions: Evaluate independence, assess normality (e.g., using histograms or formal tests like Shapiro-Wilk), and check homogeneity of variance using Levene’s test.
  3. Run the analysis: Provide the data to your statistical software (like SPSS) to compute the t-statistic, degrees of freedom (often using the Welch-Satterthwaite adjustment for unequal variances), and the associated p-value.
  4. Interpretation: If the p-value is less than your chosen α (e.g., .05), reject the null hypothesis. In this hypothetical study, software might show \(p < .001\), allowing you to conclude that the young adults had significantly faster reaction times than older adults.

(For a step-by-step walkthrough, refer to the SPSS Tutorial: Independent-Samples T-Test in the appendix).

13.4.5 Assumptions of the independent t-test

Independent t-tests assume[1,8]:

  1. Independence of observations: Scores in one group do not influence scores in the other; this is violated when participants are related (e.g., siblings, matched pairs) or when the same individual appears in both groups
  2. Normality: Data in each group are approximately normally distributed—this assumption is less critical with larger samples (n > 30 per group) because the Central Limit Theorem ensures the sampling distribution of the mean approaches normality regardless of the population distribution[15]
  3. Homogeneity of variance: Population variances are assumed equal across groups; this assumption can be relaxed by using Welch’s t-test, which adjusts the degrees of freedom to account for unequal variances[18]

13.4.5.1 Checking normality

Normality assessment is covered in detail in Chapter 7: The Normal Distribution. The key approaches are:

  • Visual inspection: Histograms, Q-Q plots[8]
  • Formal tests: Shapiro-Wilk test (recommended for most sample sizes; modern software implementations support up to n = 5,000), Kolmogorov-Smirnov test[20]
  • Robustness: t-tests are robust to moderate non-normality, especially with n > 30 per group[15,21]

13.4.5.2 Checking equal variances

  • Levene’s test: Formally tests H₀: variances are equal across groups; a significant result (p < .05) signals that the equal-variance assumption is violated and that Welch’s t-test should be preferred[22]
  • Variance ratio (rule of thumb): Divide the larger sample variance by the smaller; if the ratio is less than 2:1, the equal-variance assumption is considered reasonable and either version of the t-test will yield similar results[8]
  • Visual check: Side-by-side boxplots or spread-level plots provide a quick visual sense of whether group spreads look similar before running formal tests
  • Best practice: Use Welch’s t-test as the default regardless of Levene’s test outcome—it performs nearly identically to Student’s t-test when variances are equal, but is substantially more accurate when they are not[13,18]
Warning (Common mistake): Assuming normality for small samples

With small samples (n < 15 per group), even moderate departures from normality can affect t-test validity[8]. Check assumptions visually and consider nonparametric alternatives (e.g., Mann-Whitney U test, Chapter 19) if assumptions are severely violated[23].
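13.5 Paired samples: Comparing two related measurements

A paired t-test (also called a dependent or repeated-measures t-test) compares two measurements taken on the same individuals (e.g., pre-test vs. post-test) or on matched pairs[9,10]. The analysis reduces to a one-sample t-test on the within-pair difference scores, testing whether their mean differs from zero.

Null hypothesis (H₀): \[ \mu_d = 0 \]

Alternative hypothesis (H₁, two-tailed): \[ \mu_d \neq 0 \]

The test statistic is:

\[ t = \frac{\bar{d}}{s_d / \sqrt{n}} \]

Where \(\bar{d}\) is the mean of the difference scores, \(s_d\) is their standard deviation, \(n\) is the number of pairs, and \(df = n - 1\). The test assumes that pairs are independent of one another and that the difference scores are approximately normally distributed.

As a minimal sketch (Python, with hypothetical pre-to-post strength gains in kg), the computation looks like:

```python
import math

def paired_t(diffs):
    """Paired t statistic and df from within-pair differences (post - pre)."""
    n = len(diffs)
    dbar = sum(diffs) / n                                          # mean difference
    sd = math.sqrt(sum((d - dbar) ** 2 for d in diffs) / (n - 1))  # SD of differences
    return dbar / (sd / math.sqrt(n)), n - 1

# Hypothetical gains for six participants
t, df = paired_t([4.1, 3.8, 5.2, 4.6, 3.9, 4.4])
```

Because each participant serves as their own control, between-subject variability drops out of the error term, which is why paired designs typically have higher power than independent designs for the same n[10].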

13.6 Effect sizes for comparing two means

Statistical significance (p < α) tells us whether an effect is detectable, but effect sizes tell us how large the effect is[7,14]. Always report effect sizes alongside p-values[24,25].

13.6.1 Cohen’s d

Cohen’s d is the most common standardized effect size for mean differences[14]:

\[ d = \frac{\bar{x}_1 - \bar{x}_2}{s_{\text{pooled}}} \]

For independent samples, the pooled standard deviation is:

\[ s_{\text{pooled}} = \sqrt{\frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}} \]

For paired samples[9]:

\[ d = \frac{\bar{d}}{s_d} \]

Where \(\bar{d}\) is the mean difference and \(s_d\) is the SD of differences.
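Both versions are simple to compute from summary statistics alone. A minimal Python sketch (the independent-groups numbers reuse the jump-height values reported in Section 13.11; the paired inputs are hypothetical):

```python
import math

def cohens_d_independent(m1, s1, n1, m2, s2, n2):
    """Cohen's d for two independent groups (pooled SD denominator)."""
    sp = math.sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))
    return (m1 - m2) / sp

def cohens_d_paired(dbar, s_d):
    """Cohen's d for paired designs: mean difference over SD of differences."""
    return dbar / s_d

# Jump-height summary statistics from the reporting example in Section 13.11
d_ind = cohens_d_independent(55.3, 6.2, 30, 47.8, 7.1, 30)
```

With these values \(d \approx 1.13\), matching the \(d = 1.12\) reported in Section 13.11 up to rounding.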

13.6.1.1 Interpreting Cohen’s d

Benchmarks (Cohen, 1988):

  • |d| = 0.2: Small effect
  • |d| = 0.5: Medium effect
  • |d| = 0.8: Large effect
Important: Context matters

Cohen’s benchmarks are guidelines, not absolute rules[7]. A “small” effect (d = 0.2) may be highly meaningful in some contexts (e.g., injury prevention) and trivial in others[3,4].

13.6.2 Worked example: Computing Cohen’s d

Rather than computing Cohen’s d and pooled standard deviations by hand, modern statistical software (including SPSS) now provides effect size estimates and their confidence intervals automatically as part of the t-test output.

For instance, if software calculates the effect size for the reaction time example (young vs. older adults) to be \(d = -1.42\), we can interpret its magnitude.

Interpretation

\(|d| = 1.42\) indicates a large effect (well exceeding Cohen’s threshold of 0.8)[14]. Young adults’ reaction times are 1.42 standard deviations faster than older adults’—a substantial difference with clear practical significance[3]. (The negative sign of d simply reflects the order of subtraction: young minus older.)

13.6.3 Confidence intervals for effect sizes

Just like means, effect sizes have uncertainty and should be reported with confidence intervals[6,7]. Software packages (e.g., ESCI, R packages like effectsize) can compute CIs for Cohen’s d[26].

Note (Real example): Effect size with CI

A meta-analysis reports that resistance training improves muscle strength with Cohen’s d = 0.78, 95% CI [0.65, 0.91]. This indicates a large effect that is precisely estimated[27].

13.7 Visualizing group comparisons

Effective visualizations communicate both central tendency and variability[9,24].

13.7.1 Box plots for independent groups

```r
library(ggplot2)
set.seed(42)

# Simulate data
trained <- rnorm(30, mean = 55, sd = 6)
untrained <- rnorm(30, mean = 48, sd = 7)

df_jump <- data.frame(
  Group = rep(c("Trained", "Untrained"), each = 30),
  Jump_Height = c(trained, untrained)
)

ggplot(df_jump, aes(x = Group, y = Jump_Height, fill = Group)) +
  geom_boxplot(alpha = 0.7, outlier.shape = 16, outlier.size = 2) +
  geom_jitter(width = 0.15, alpha = 0.3, size = 2) +
  scale_fill_manual(values = c("Trained" = "steelblue", "Untrained" = "coral")) +
  labs(x = "Group", y = "Vertical Jump Height (cm)",
       title = "Vertical Jump Performance by Training Status") +
  theme_minimal() +
  theme(legend.position = "none")
```
Figure 13.1: Comparison of vertical jump height (cm) between trained and untrained groups. Box plots show medians, interquartile ranges, and outliers. Trained athletes demonstrate higher and less variable performance.

Box plots reveal the distribution of scores within each group[8]. In Figure 13.1, trained athletes show consistently higher jump heights with less variability, indicating both a difference in central tendency and potentially more consistent performance[4].

13.7.2 Error bar plots with confidence intervals

```r
library(ggplot2)
library(dplyr)

# Compute summary statistics
summary_stats <- df_jump %>%
  group_by(Group) %>%
  summarise(
    Mean = mean(Jump_Height),
    SD = sd(Jump_Height),
    n = n(),
    SE = SD / sqrt(n),
    CI_lower = Mean - qt(0.975, df = n - 1) * SE,
    CI_upper = Mean + qt(0.975, df = n - 1) * SE
  )

ggplot(summary_stats, aes(x = Group, y = Mean, fill = Group)) +
  geom_col(alpha = 0.7, width = 0.6) +
  geom_errorbar(aes(ymin = CI_lower, ymax = CI_upper), width = 0.2, linewidth = 1) +
  scale_fill_manual(values = c("Trained" = "steelblue", "Untrained" = "coral")) +
  labs(x = "Group", y = "Mean Vertical Jump Height (cm)",
       title = "Mean Jump Height by Training Status (95% CI)") +
  theme_minimal() +
  theme(legend.position = "none") +
  ylim(0, 65)
```
Figure 13.2: Mean vertical jump height (cm) with 95% confidence intervals for trained and untrained groups. Non-overlapping error bars suggest a statistically significant difference.

Error bar plots with 95% confidence intervals (Figure 13.2) provide a clear visual comparison of group means and their precision[9]. Non-overlapping 95% CIs indicate roughly p < .01; however, overlapping CIs do not rule out statistical significance at α = .05—always conduct formal tests rather than relying on visual inspection alone[9,28].

13.7.3 Before-after plots for paired designs

```r
library(ggplot2)

# Simulated paired data
set.seed(42)
n_participants <- 12
pre <- rnorm(n_participants, mean = 42, sd = 4)
post <- pre + rnorm(n_participants, mean = 4.3, sd = 0.9)

df_paired <- data.frame(
  Participant = rep(1:n_participants, 2),
  Time = rep(c("Pre", "Post"), each = n_participants),
  Strength = c(pre, post)
)

df_paired$Time <- factor(df_paired$Time, levels = c("Pre", "Post"))

ggplot(df_paired, aes(x = Time, y = Strength, group = Participant)) +
  geom_line(alpha = 0.6, color = "steelblue") +
  geom_point(size = 3, alpha = 0.8, color = "steelblue") +
  stat_summary(aes(group = 1), fun = mean, geom = "line",
               color = "red", linewidth = 1.5, linetype = "dashed") +
  stat_summary(aes(group = 1), fun = mean, geom = "point",
               color = "red", size = 4, shape = 18) +
  labs(x = "Test Session", y = "Grip Strength (kg)",
       title = "Pre-Post Changes in Grip Strength (Individual Lines)") +
  theme_minimal()
```
Figure 13.3: Individual changes in grip strength (kg) from pre-test to post-test. Each line represents one participant; the consistent upward slopes show that participants improved.

Individual trajectory plots (Figure 13.3) show how each participant changed from pre to post[8,9]. The red dashed line represents the mean change, indicating an overall increase in grip strength across participants.

13.8 Sample size and statistical power

Statistical power is the probability of correctly rejecting a false null hypothesis—in other words, the probability that a study will detect a true effect when one genuinely exists in the population[10,14]. Power ranges from 0 to 1, and a conventional minimum target is 0.80, meaning an 80% chance of detecting a real effect if present[14]. The complement of power is the Type II error rate (β): a power of 0.80 implies a 20% chance of a false negative—concluding “no effect” when in fact there is one.

Underpowered studies are a pervasive problem in movement science and related fields[29]. They fail to detect meaningful effects, waste participant and researcher resources, and produce an inflated false-negative rate. Critically, when an underpowered study does yield a significant result, the observed effect size is likely an overestimate of the true population effect—a phenomenon known as the winner’s curse. Planning adequate power before data collection is therefore an ethical as well as a methodological responsibility[29].

13.8.1 Factors affecting power

  1. Sample size (n): Larger samples reduce the standard error of the mean, making it easier to distinguish a true effect from sampling variability; doubling n does not double power, but it does substantially increase sensitivity, especially in the range of 10–50 participants per group[14]
  2. Effect size (d): Larger effects are inherently easier to detect—a Cohen’s d of 0.8 requires far fewer participants than d = 0.2 for the same power level; if the true effect in the population is small, a large sample is needed to reliably detect it[7,14]
  3. Significance level (α): Setting α = .05 rather than α = .01 increases power by raising the threshold for a “reject” decision, but at the cost of a higher Type I error rate; this trade-off should be made deliberately before data collection, not adjusted post hoc[30]
  4. Measurement reliability: More reliable outcome measures (higher ICC or test-retest correlation) reduce within-group variance and thereby increase power; poor measurement precision has the same effect on power as reducing sample size[12]
  5. Design: Paired designs typically have substantially higher power than independent designs for the same n because between-subject variability is removed from the error term—when within-subject correlation is high (r > .50), a paired design may require less than half the participants needed for an equivalent independent design[8,10]

13.8.2 Power analysis for t-tests

A priori power analysis is conducted before data collection to determine the minimum sample size needed to achieve a desired level of statistical power (typically 0.80) for an expected effect size at a chosen α level[14]. The three inputs—effect size, α, and desired power—are interrelated: fixing any two determines the third. In practice, researchers specify the effect size they consider meaningful (often based on prior literature or pilot data), set α = .05 and power = 0.80, and solve for the required n.

Selecting a realistic effect size is the most consequential—and most difficult—step. Using an inflated effect size (e.g., from an underpowered pilot study) will underestimate the required sample and produce an underpowered study. Better sources include meta-analytic estimates for your outcome domain, published studies with comparable designs and populations, or the smallest effect size that would be practically meaningful given your research context[7].

Computations for power and required sample size involve non-central t-distributions, but dedicated software makes the process straightforward:

Tip: Use G*Power for sample size planning

**G*Power** is free, user-friendly software for power analysis[31]. Input your expected effect size, desired power, and α level to determine required sample size. Available at: https://www.psychologie.hhu.de/arbeitsgruppen/allgemeine-psychologie-und-arbeitspsychologie/gpower
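For a quick cross-check of such software, the normal approximation \(n \approx 2\,(z_{1-\alpha/2} + z_{1-\beta})^2 / d^2\) can be coded directly. A minimal Python sketch (it undershoots the exact noncentral-t result by a participant or two):

```python
import math
from statistics import NormalDist

def n_per_group(d, alpha=0.05, power=0.80):
    """Approximate per-group n for a two-sided independent t-test
    (normal approximation: n = 2 * ((z_crit + z_power) / d) ** 2)."""
    z = NormalDist().inv_cdf
    return math.ceil(2 * ((z(1 - alpha / 2) + z(power)) / d) ** 2)

# Medium effect, conventional alpha and power
n = n_per_group(0.5)   # 63 per group; the exact noncentral-t answer is 64
```

Note how steeply the required n grows as d shrinks: a small effect (d = 0.2) requires roughly six times the sample of a medium effect (d = 0.5).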

13.8.3 Post hoc power analysis: Why it’s problematic

Post hoc power (computing power after data collection) is widely misused[30,32]. When power is calculated from the observed effect size, it is mathematically determined by the p-value, making it circular and uninformative: a non-significant result will always yield low observed power by definition[33].

Better approach: Report effect sizes with confidence intervals[6]. Wide CIs indicate low precision (consistent with low power), but provide more informative guidance for future studies[10].

13.9 Independent vs. paired: Which test to use?

| Characteristic | One-sample t-test | Independent t-test | Paired t-test |
|---|---|---|---|
| Design | One group vs. benchmark | Two separate groups | Same participants measured twice (or matched pairs) |
| Assumptions | Independence, normality | Independence, normality, (equal variances) | Pairs independent, differences normally distributed |
| Power | N/A | Lower (between-subject variability) | Higher (controls for individual differences) |
| Example | Compare average BMI to national average | Compare trained vs. untrained athletes | Compare pre-test vs. post-test in same athletes |
| Null hypothesis | μ = μ₀ | μ₁ = μ₂ | μ_d = 0 |
| Degrees of freedom | n − 1 | n₁ + n₂ − 2 (or Welch’s df) | n − 1 (n = number of pairs) |
Warning (Common mistake): Using independent t-test for paired data

If you use an independent t-test when data are actually paired, you lose power by failing to control for individual differences[10]. Always match the test to the design[8].

13.10 Assumptions violations and alternatives

When assumptions are violated, consider:

13.10.1 For non-normality

  • Transformation: Log, square root, or rank transformations may normalize data[34]
  • Nonparametric alternatives:
    • Mann-Whitney U test (independent samples)
    • Wilcoxon signed-rank test (paired samples)
    • See Chapter 19 for details[23]

13.10.2 For unequal variances (independent t-test)

  • Welch’s t-test: Does not assume equal variances (preferred default)[18]

13.10.3 For small samples

Small samples (n < 15 per group) present particular challenges because the Central Limit Theorem cannot be relied upon to rescue non-normal data, and formal normality tests (e.g., Shapiro-Wilk) have low power to detect departures from normality at these sample sizes[8,20].

  • Check normality visually: With small n, histograms are uninformative—use Q-Q plots and consider whether the underlying construct is plausibly normal given the population[8]
  • Be cautious with outliers: A single extreme value can substantially distort the mean and inflate or deflate the t-statistic in small samples; inspect data carefully before analysis[35]
  • Consider bootstrapping: Resampling-based methods (e.g., bootstrap confidence intervals) make no distributional assumptions and can provide valid inference even with small, non-normal samples[36]
  • Consider permutation tests: Permutation (randomization) tests are exact, assumption-free alternatives to the t-test that work by comparing the observed test statistic to the distribution generated by all possible rearrangements of the data[36]
  • Nonparametric fallback: If normality is clearly violated and the sample is too small to rely on robustness, use the Mann-Whitney U (independent groups) or Wilcoxon signed-rank test (paired design) as described in Chapter 19[23]

13.11 Reporting t-tests in APA style

Template for independent t-test:

“[Group 1] (M = [mean], SD = [SD], n = [n]) [differed/did not differ] significantly from [Group 2] (M = [mean], SD = [SD], n = [n]), t([df]) = [t-value], p = [p-value], d = [d-value], 95% CI [lower, upper].”

Example:

“Trained athletes (M = 55.3 cm, SD = 6.2, n = 30) demonstrated significantly higher vertical jump performance than untrained controls (M = 47.8 cm, SD = 7.1, n = 30), t(58) = 4.52, p < .001, d = 1.12, 95% CI [4.2, 10.8] cm.”

Template for paired t-test:

“[Condition 2] (M = [mean], SD = [SD]) was significantly [higher/lower] than [Condition 1] (M = [mean], SD = [SD]), t([df]) = [t-value], p = [p-value], mean difference = [M_diff], 95% CI [lower, upper], d = [d-value].”

Example:

“Post-training grip strength (M = 45.9 kg, SD = 4.40) was significantly greater than pre-training strength (M = 41.6 kg, SD = 3.75), t(11) = 16.91, p < .001, mean difference = 4.3 kg, 95% CI [3.77, 4.90], d = 4.87.”

13.12 Comparing Two Proportions: The Two-Proportion Z-Test

The same two-group logic extends to proportions (e.g., injury rates, success rates); because a proportion’s variance is determined by the proportion itself, the comparison uses a z-test rather than a t-test[37,38].

For proportions \(p_1\) and \(p_2\), the test statistic is:

\[ z = \frac{p_1 - p_2}{\text{SE}_{\text{diff}}} \]

Where:

\[ \text{SE}_{\text{diff}} = \sqrt{\frac{p_1(1-p_1)}{n_1} + \frac{p_2(1-p_2)}{n_2}} \]

For large samples, this follows a standard normal (z) distribution[1,37]. Software implements this as a two-proportion z-test.

Note: Pooled vs. unpooled standard error for proportions

The formula above uses separate (unpooled) proportions, which is appropriate for confidence intervals for the difference. When conducting a hypothesis test under H₀: p₁ = p₂, many software packages instead use the pooled proportion \(\hat{p} = (x_1 + x_2)/(n_1 + n_2)\):

\[ \text{SE}_{\text{diff (pooled)}} = \sqrt{\hat{p}(1-\hat{p})\left(\frac{1}{n_1} + \frac{1}{n_2}\right)} \]

Both formulas give similar results with large samples, but the pooled version is theoretically preferred for significance testing[37].

Note (Real example): Comparing injury rates

A study finds that 18 of 100 runners using minimalist shoes (18%) suffered injuries, compared to 12 of 100 using traditional shoes (12%). A two-proportion z-test yields z = 1.19, p = .23, suggesting no significant difference in injury rates[37].
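The injury-rate example can be reproduced with a short Python sketch implementing both SE versions, using only the standard library (NormalDist supplies the normal CDF):

```python
import math
from statistics import NormalDist

def two_prop_z(x1, n1, x2, n2, pooled=True):
    """Two-proportion z statistic and two-sided p-value."""
    p1, p2 = x1 / n1, x2 / n2
    if pooled:                       # SE under H0: p1 = p2
        p = (x1 + x2) / (n1 + n2)
        se = math.sqrt(p * (1 - p) * (1 / n1 + 1 / n2))
    else:                            # unpooled SE (used for CIs)
        se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    z = (p1 - p2) / se
    return z, 2 * (1 - NormalDist().cdf(abs(z)))

z, p = two_prop_z(18, 100, 12, 100)   # z ≈ 1.19, p ≈ .23
```

With samples this large, the pooled and unpooled versions agree to two decimal places, as the note above anticipates.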

13.13 Common pitfalls and best practices

13.13.1 Pitfall 1: P-hacking and selective reporting

Problem: Running multiple t-tests and reporting only significant ones inflates Type I error[39,40].

Solution: Preregister analyses, report all tests conducted, and correct for multiple comparisons if appropriate[16].

13.13.2 Pitfall 2: Confusing significance with importance

Problem: Small, trivial effects can be statistically significant with large samples[41].

Solution: Always report and interpret effect sizes and confidence intervals[3,6].

13.13.3 Pitfall 3: Ignoring assumptions

Problem: Violating normality or equal variance assumptions can lead to incorrect p-values[8].

Solution: Check assumptions, use robust methods (e.g., Welch’s t-test), or use nonparametric alternatives[18].

13.13.4 Pitfall 4: Inappropriate test choice

Problem: Using independent t-test for paired data (or vice versa) produces incorrect results[10].

Solution: Match the test to the research design[2,11].

13.14 Chapter summary

Comparing two means is a fundamental task in Movement Science research, and t-tests provide a principled statistical framework for determining whether observed differences between groups or conditions reflect true population differences or merely sampling variability[1,5]. Independent t-tests compare separate groups (e.g., experimental vs. control), while paired t-tests compare related measurements (e.g., pre-test vs. post-test), with paired designs offering greater statistical power by controlling for individual differences[9,10]. Both tests assume approximate normality and independence, with Welch’s t-test providing robust inference when variances differ[17,18]. However, statistical significance (p < α) alone is insufficient for drawing meaningful conclusions—researchers must also report effect sizes (e.g., Cohen’s d) and confidence intervals to assess the magnitude and precision of differences[6,7,14].

Effect sizes quantify how large group differences are in standardized units, enabling comparisons across studies and evaluation of practical significance[7,14]. A Cohen’s d of 1.0 indicates that group means differ by one standard deviation—a substantial difference in most contexts[3]. Confidence intervals complement t-tests by revealing not just whether groups differ (p-value), but by how much they differ and with what precision[6,42]. Sample size and statistical power are inextricably linked: adequately powered studies (typically Power ≥ 0.80) reliably detect meaningful effects, while underpowered studies produce false negatives and waste resources[10,29]. Planning studies with a priori power analysis ensures sufficient sample sizes to answer research questions definitively[14,31].

Ultimately, responsible use of t-tests requires integrating hypothesis testing with estimation, effect size reporting, and practical significance evaluation[3,6]. Researchers should not merely ask “Is there a significant difference?” but also “How large is the difference, how precisely have we estimated it, and does it matter in applied contexts?”[4,7]. By combining t-tests with transparent reporting of means, confidence intervals, and effect sizes, Movement Science practitioners can move beyond binary significant/non-significant thinking toward nuanced interpretation of group differences, fostering more reproducible and impactful research[2,6,11].

13.15 Key terms

independent samples; paired samples; t-test; two-sample t-test; dependent t-test; repeated measures; null hypothesis; alternative hypothesis; test statistic; degrees of freedom; p-value; statistical significance; practical significance; Cohen’s d; effect size; confidence interval; pooled variance; Welch’s t-test; homogeneity of variance; normality; statistical power; Type I error; Type II error; sample size planning

13.16 Practice: quick checks

Use a paired t-test when the same participants are measured twice (pre-post designs) or when observations are matched in pairs (e.g., twins, left-right limb comparisons)[10]. Paired designs control for individual differences by comparing each person to themselves, reducing error variance and increasing statistical power[9]. In contrast, use an independent t-test when comparing two separate, unrelated groups (e.g., experimental vs. control) where participants in one group are distinct from those in the other[2,8]. Using an independent t-test on paired data wastes power, while using a paired t-test on independent data violates the assumption that pairs are related, producing incorrect results[11].
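To see the consequence of test choice concretely, the sketch below (Python with scipy, an assumption; the pre/post scores are hypothetical) runs both tests on the same paired measurements:

```python
from scipy.stats import ttest_rel, ttest_ind

# hypothetical pre/post scores for the same four participants
pre  = [10, 12, 14, 16]
post = [11, 14, 15, 18]

paired = ttest_rel(pre, post)        # correct test for this design
independent = ttest_ind(pre, post)   # ignores the pairing (incorrect here)

# The paired test detects the consistent ~1.5-unit gain; the independent
# test, swamped by between-person variability, does not.
```

Because every participant improved, the paired test reaches significance while the independent test on the identical numbers does not, illustrating the power wasted by mismatching test and design.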

Welch’s t-test does not assume equal population variances, making it more robust than the traditional pooled-variance t-test[17,18]. When variances differ substantially between groups, the pooled-variance t-test can produce inflated Type I error rates (more false positives) or reduced power[19]. Welch’s t-test corrects for this by using separate variance estimates and adjusting degrees of freedom[13]. Importantly, Welch’s t-test performs well even when variances are equal, meaning it rarely performs worse than the pooled-variance version and often performs better[18]. For this reason, many modern statistical packages (e.g., R’s t.test) use Welch’s t-test as the default[8].
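In Python's scipy (an assumption; the book's tutorials use SPSS), switching between the pooled-variance and Welch versions is a single-argument change. The data below are hypothetical, with one group deliberately more variable:

```python
from scipy.stats import ttest_ind

# hypothetical strength scores: group b is far more variable than group a
a = [50, 52, 48, 51, 49, 50]
b = [60, 35, 70, 45, 80, 40]

pooled = ttest_ind(a, b, equal_var=True)    # classic pooled-variance t-test
welch = ttest_ind(a, b, equal_var=False)    # Welch's t-test
```

With equal group sizes the two t statistics coincide, but Welch's test uses fewer degrees of freedom to compensate for the unequal variances, yielding a slightly more conservative p-value.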

Cohen’s d quantifies the standardized magnitude of a mean difference: d = (M₁ − M₂) / SD_pooled[14]. Cohen suggested benchmarks: |d| = 0.2 (small), 0.5 (medium), 0.8 (large). However, context is crucial[3,7]. In injury prevention research, even a “small” effect (d = 0.2) may save lives and justify intervention. Conversely, in elite performance contexts, a “large” effect (d = 0.8) may be unrealistic or impractical to achieve[4]. Always interpret effect sizes relative to domain-specific benchmarks, prior research, and practical significance thresholds rather than relying solely on Cohen’s arbitrary guidelines[7].
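The formula translates directly to code; a minimal sketch in Python (an assumed language choice) using the sample-variance-based pooled SD:

```python
from math import sqrt
from statistics import mean, variance

def cohens_d(x, y):
    """Cohen's d for two independent samples, using the pooled SD."""
    nx, ny = len(x), len(y)
    # pooled variance weights each group's sample variance by its df
    sd_pooled = sqrt(((nx - 1) * variance(x) + (ny - 1) * variance(y))
                     / (nx + ny - 2))
    return (mean(x) - mean(y)) / sd_pooled
```

For example, two small hypothetical samples whose means differ by about 1.5 pooled SDs would return d ≈ −1.55 or +1.55 depending on group order; the sign simply reflects the direction of the difference.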

P-values indicate whether an effect is statistically detectable (p < .05 suggests the difference is unlikely due to chance), but they do not quantify the size or precision of the effect[6,43]. Confidence intervals provide a range of plausible values for the true population difference, enabling researchers to evaluate both statistical significance (does the CI exclude zero?) and practical importance (are the plausible effect sizes meaningful?)[3,42]. For example, a 95% CI of [0.5, 8.2] cm for a mean difference indicates statistical significance (excludes zero) but substantial uncertainty (wide range), while a CI of [4.0, 4.8] cm indicates both significance and high precision[9]. Reporting CIs fosters transparent communication of uncertainty[24,25].
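A pooled-variance confidence interval for a mean difference can be built from summary statistics alone; a sketch in Python with scipy (an assumption; equal variances assumed in the formula):

```python
from math import sqrt
from scipy.stats import t

def mean_diff_ci(m1, s1, n1, m2, s2, n2, conf=0.95):
    """CI for (m1 - m2) from group means, SDs, and sizes (equal variances)."""
    df = n1 + n2 - 2
    sp2 = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / df  # pooled variance
    se = sqrt(sp2 * (1 / n1 + 1 / n2))
    crit = t.ppf((1 + conf) / 2, df)  # two-sided critical t value
    diff = m1 - m2
    return diff - crit * se, diff + crit * se
```

A 95% CI that excludes zero corresponds to p < .05 for the two-sided test at the same α, while its width communicates the precision the p-value alone conceals.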

Statistical power is the probability of detecting a true effect when it exists (Power = 1 − β, where β is the Type II error rate)[14]. Power increases with (1) larger sample sizes, (2) larger effect sizes, (3) a more lenient significance threshold (larger α), and (4) lower data variability[10]. Paired designs also offer higher power than independent designs because they control for individual differences[9]. Low power (< 0.50) means studies frequently miss real effects, producing false negatives and inconclusive results[29]. Underpowered studies waste resources, mislead interpretation, and contribute to publication bias (only “lucky” significant findings get published)[44]. Aim for Power ≥ 0.80 when planning studies to ensure adequate sensitivity for detecting meaningful effects[10,14].
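Power for a two-sided independent t-test can be computed from the noncentral t distribution; the sketch below (Python with scipy, an assumption) reproduces the standard result that detecting d = 0.5 with 80% power at α = .05 requires roughly 64 participants per group:

```python
from math import sqrt
from scipy.stats import t, nct

def power_ind_t(d, n_per_group, alpha=0.05):
    """Power of a two-sided independent t-test with equal group sizes."""
    df = 2 * n_per_group - 2
    nc = d * sqrt(n_per_group / 2)       # noncentrality parameter
    tcrit = t.ppf(1 - alpha / 2, df)     # two-sided critical value
    # probability the test statistic falls in either rejection region
    return nct.sf(tcrit, df, nc) + nct.cdf(-tcrit, df, nc)
```

Calling `power_ind_t(0.5, 64)` returns approximately 0.80, and increasing n per group raises power, matching the planning guidance in the text.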

Statistical significance (p < .05) indicates that an observed difference is unlikely to have occurred by chance, assuming the null hypothesis is true[1,41]. Practical significance evaluates whether the magnitude of the difference is large enough to matter in real-world applications[3,7]. A difference can be statistically significant but trivial (e.g., 0.2 ms improvement in reaction time with n = 1000) or large but not statistically significant (e.g., 10 cm jump improvement with n = 5)[10]. To assess practical significance, examine effect sizes, confidence intervals, and domain-specific thresholds such as minimal clinically important differences (MCIDs)[4]. Always ask: “Even if this difference is real, does it matter for performance, health, or decision-making?”[3,6].

Note: Read further

For deeper exploration of t-tests and mean comparisons, see Cumming (2012)[9] (Understanding the New Statistics), Maxwell et al. (2018)[10] (power and design), Cohen (1988)[14] (effect sizes), and Vincent (2005)[2] (Movement Science applications). For practical guidance on assumption checking and robust alternatives, consult Field (2018)[8] and Wilcox (2017)[35].

Tip: Next chapter

In Chapter 14, you will extend the logic of comparing two means to Analysis of Variance (ANOVA), which enables comparisons across three or more groups simultaneously. You will learn about partitioning variance, F-tests, post hoc comparisons, and interpreting main effects in more complex experimental designs.