10  Hypothesis Testing and Statistical Inference

Making principled decisions under uncertainty in Movement Science research

Tip: 💻 SPSS Tutorial Available

Learn how to conduct hypothesis tests in SPSS! See the SPSS Tutorial: Hypothesis Testing in the appendix for step-by-step instructions on performing t-tests, interpreting p-values, and making statistical decisions.

10.1 Chapter roadmap

Hypothesis testing provides a formal framework for making decisions about populations based on sample data[1,2]. While confidence intervals (Chapter 9) emphasize estimation and uncertainty, hypothesis testing focuses on yes/no decisions: Does a training intervention improve performance? Are two groups different? Is a correlation different from zero?[3,4]. The logic is straightforward: we propose two competing hypotheses (a null hypothesis claiming no effect and an alternative hypothesis claiming an effect exists), collect sample data, and use probability theory to determine which hypothesis the data support[1,5]. If the observed data would be very unlikely assuming the null hypothesis is true (typically less than 5% probability), we reject the null hypothesis in favor of the alternative[6,7]. This procedure, called null hypothesis significance testing (NHST), has dominated scientific inference for nearly a century, despite ongoing debates about its interpretation and misuse[8,9].

Understanding hypothesis testing requires distinguishing between two types of errors: Type I errors (false positives—concluding an effect exists when it doesn’t) and Type II errors (false negatives—failing to detect a real effect)[6,10]. Researchers control Type I error rates by choosing a significance level (α, typically 0.05), which defines how rare an outcome must be before we reject the null hypothesis[1,5]. However, minimizing Type I errors increases Type II errors, creating a fundamental trade-off[10,11]. In Movement Science, where sample sizes are often modest and effect sizes variable, understanding statistical power (the probability of detecting real effects) is critical for designing informative studies[12,13]. A well-powered study with n = 100 participants may reliably detect a meaningful training effect, while an underpowered pilot study with n = 10 may fail to detect the same effect—not because the effect doesn’t exist, but because the sample is too small[11].

This chapter explains the logic and mechanics of hypothesis testing, contrasts frequentist and Bayesian approaches, and demonstrates how to interpret p-values responsibly[9,14]. You will learn when to use one-tailed versus two-tailed tests, how degrees of freedom affect critical values, and why statistical significance does not equal practical importance[2,15]. The goal is not to apply hypothesis testing mechanically, but to understand what p-values do and do not tell us, recognize the limitations of dichotomous thinking, and integrate hypothesis testing with confidence intervals for more complete inference[4,16]. By the end, you will be equipped to conduct, interpret, and critically evaluate hypothesis tests in Movement Science contexts, balancing statistical rigor with biological and practical significance[13,17].

By the end of this chapter, you will be able to:

  • Explain the logic of null hypothesis significance testing (NHST).
  • Distinguish between null and alternative hypotheses and formulate them correctly.
  • Define Type I and Type II errors and understand the trade-offs between them.
  • Compute and interpret p-values in the context of hypothesis tests.
  • Conduct one-sample, two-sample, and paired t-tests.
  • Understand when to use one-tailed versus two-tailed tests.
  • Explain the concept of statistical power and its importance in research design.
  • Distinguish between statistical significance and practical significance.

10.2 Workflow for hypothesis testing

Use this sequence when conducting a hypothesis test:

  1. State the research question clearly (e.g., “Does plyometric training improve vertical jump height?”).
  2. Formulate hypotheses: Write the null hypothesis (H₀) and alternative hypothesis (H₁).
  3. Choose a significance level (α, typically 0.05).
  4. Select the appropriate test (e.g., independent t-test, paired t-test).
  5. Compute the test statistic and corresponding p-value.
  6. Make a decision: Reject H₀ if p < α; otherwise, fail to reject H₀.
  7. Interpret in context: Consider effect size, confidence intervals, and practical significance.
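The steps above can be sketched end to end in script form. The chapter's step-by-step tooling is SPSS, so the following Python/scipy version is only an illustrative alternative; the data are simulated, with a hypothetical 6 cm true training effect (not from the chapter):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Step 1: research question -- does plyometric training improve jump height?
# Simulated vertical jump heights (cm); the +6 cm effect is hypothetical.
control = rng.normal(loc=50, scale=8, size=20)
training = rng.normal(loc=56, scale=8, size=20)

# Steps 2-3: H0: mu1 = mu2, H1: mu1 != mu2, alpha = 0.05
alpha = 0.05

# Steps 4-5: Welch's independent t-test (does not assume equal variances)
t_stat, p_value = stats.ttest_ind(training, control, equal_var=False)

# Step 6: decision rule
decision = "reject H0" if p_value < alpha else "fail to reject H0"

# Step 7: interpret in context -- mean difference and Cohen's d
diff = training.mean() - control.mean()
pooled_sd = np.sqrt((training.var(ddof=1) + control.var(ddof=1)) / 2)
d = diff / pooled_sd
print(f"t = {t_stat:.2f}, p = {p_value:.4f} -> {decision}; d = {d:.2f}")
```

Note that the decision in step 6 depends on the random draw; the point is the workflow, not this particular outcome.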

10.3 The logic of hypothesis testing

Hypothesis testing rests on a probabilistic version of proof by contradiction[1,5]. We assume the null hypothesis (no effect) is true, then ask: “How likely are our observed data if the null hypothesis were true?” If the data are very unlikely under the null hypothesis (e.g., p < 0.05), we reject the null hypothesis and conclude that an effect probably exists[6].

10.3.1 Null and alternative hypotheses

Every hypothesis test involves two competing hypotheses[1]:

  1. Null hypothesis (H₀): A statement of “no effect,” “no difference,” or “no relationship.” It represents the status quo or the assumption that nothing unusual is happening.
    • Example: “The mean vertical jump height is equal to 50 cm” (H₀: μ = 50)
    • Example: “There is no difference in reaction time between young and older adults” (H₀: μ₁ = μ₂)
  2. Alternative hypothesis (H₁ or Hₐ): A statement that contradicts the null hypothesis. It represents the effect, difference, or relationship we are testing for.
    • Example: “The mean vertical jump height is not equal to 50 cm” (H₁: μ ≠ 50)
    • Example: “There is a difference in reaction time between young and older adults” (H₁: μ₁ ≠ μ₂)

The null hypothesis is the hypothesis we test by computing a p-value. The alternative hypothesis is what we conclude if we reject the null[5,6].

Important: The null hypothesis is never “proven”

We never “accept” or “prove” the null hypothesis[1,5]. We either reject it (sufficient evidence against it) or fail to reject it (insufficient evidence against it). Failing to reject H₀ does not mean H₀ is true—it means the data are compatible with H₀[18,19].

10.3.2 One-tailed versus two-tailed tests

The directionality of the alternative hypothesis determines whether we use a one-tailed or two-tailed test[1,3].

Two-tailed test (default):

  • H₀: μ = μ₀ (or μ₁ = μ₂)
  • H₁: μ ≠ μ₀ (or μ₁ ≠ μ₂)
  • Tests for any difference in either direction
  • Example: “Does training affect reaction time?” (could improve or worsen)

One-tailed test (directional):

  • H₀: μ ≥ μ₀
  • H₁: μ < μ₀
  • Tests for a difference in a specific direction
  • Example: “Does training decrease reaction time?”

Warning: Use two-tailed tests by default

One-tailed tests are only appropriate when you have strong theoretical or practical reasons to test a directional hypothesis and you would not care about effects in the opposite direction[1,20]. Two-tailed tests are more conservative and align with the typical goal of detecting any effect[21].

10.3.3 The p-value: What it is and what it isn’t

The p-value is the probability of observing data as extreme as (or more extreme than) what we actually observed, assuming the null hypothesis is true[9,19].

What a p-value tells you:

  • p = 0.03 means: “If the null hypothesis were true, there is a 3% chance we would observe data this extreme or more extreme by random sampling variability alone.”
  • Small p-values (e.g., p < 0.05) suggest the data are inconsistent with the null hypothesis[5].

What a p-value does NOT tell you:

  • NOT the probability that the null hypothesis is true[9,19]
  • NOT the probability of a Type I error for this specific test[7]
  • NOT the size or importance of an effect[2,14]
  • NOT proof that a hypothesis is true or false[5,22]

Important: The p-value is NOT the probability that H₀ is true

A common misinterpretation: “p = 0.03 means there is a 3% chance the null hypothesis is true.” Wrong. The null hypothesis is either true or false (we don’t know which). The p-value is the probability of the data, not the probability of the hypothesis[9,19].
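The long-run frequency meaning of the p-value can be checked by simulation: if H₀ is exactly true, p-values are uniformly distributed, so about 5% of studies reach p < 0.05 by sampling variability alone. A sketch in Python with scipy (simulated data, for illustration only):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# 10,000 simulated studies in a world where H0 is exactly true (mu = 0).
# Under H0 the p-value is uniformly distributed, so roughly 5% of studies
# will be "significant" at alpha = 0.05 purely by chance.
data = rng.normal(loc=0.0, scale=1.0, size=(10_000, 20))  # n = 20 per study
p_values = stats.ttest_1samp(data, popmean=0.0, axis=1).pvalue

false_positive_rate = float(np.mean(p_values < 0.05))
print(f"Fraction of 'significant' results under H0: {false_positive_rate:.3f}")
```

This is exactly what α controls: the long-run rate of Type I errors when the null hypothesis is true.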

10.4 Type I and Type II errors

Every hypothesis test can result in two types of errors[6,10]:

Reality | Decision: Fail to reject H₀ | Decision: Reject H₀
H₀ is true | ✅ Correct decision | Type I error (α)
H₀ is false (H₁ true) | Type II error (β) | ✅ Correct decision (power)

10.4.1 Type I error (false positive)

A Type I error occurs when we reject the null hypothesis when it is actually true[1,6]. In other words, we conclude an effect exists when it doesn’t.

  • Symbol: α (alpha)
  • Example: Concluding a training program improves strength when it actually has no effect
  • Controlled by: The significance level (e.g., α = 0.05 means we accept a 5% risk of Type I error)

10.4.2 Type II error (false negative)

A Type II error occurs when we fail to reject the null hypothesis when the alternative hypothesis is actually true[6,10]. In other words, we miss a real effect.

  • Symbol: β (beta)
  • Example: Concluding a training program has no effect when it actually does improve strength
  • Related to: Statistical power (Power = 1 − β)

10.4.3 Statistical power

Statistical power is the probability of correctly rejecting the null hypothesis when the alternative hypothesis is true[10,11]. It represents the ability of a study to detect a real effect.

\[ \text{Power} = 1 - \beta \]

Factors affecting power:

  1. Sample size (n): Larger samples → higher power[10]
  2. Effect size: Larger effects → easier to detect → higher power[15]
  3. Significance level (α): Higher α → higher power (but more Type I errors)[11]
  4. Variability: Lower variability → higher power[12]

Tip: Recommended power levels

Aim for Power ≥ 0.80 (80% chance of detecting a real effect) when planning studies[10]. Underpowered studies (Power < 0.50) are unlikely to detect real effects, wasting resources and producing unreliable findings[11,12].
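Power for a two-sided, two-sample t-test can be computed from the noncentral t-distribution. A minimal sketch in Python with scipy (the function name and the scenarios are illustrative, not from the chapter):

```python
import numpy as np
from scipy import stats

def power_two_sample(d, n_per_group, alpha=0.05):
    """Power of a two-sided, two-sample t-test for standardized
    effect size d with equal group sizes (equal-variance case)."""
    df = 2 * n_per_group - 2
    ncp = d * np.sqrt(n_per_group / 2)       # noncentrality parameter
    t_crit = stats.t.ppf(1 - alpha / 2, df)  # two-tailed critical value
    # Probability that |T| exceeds the critical value under H1
    return stats.nct.sf(t_crit, df, ncp) + stats.nct.cdf(-t_crit, df, ncp)

# Cohen's classic benchmark: d = 0.5 needs about 64 per group for 80% power
print(f"n = 64 per group: power = {power_two_sample(0.5, 64):.3f}")
# The same effect with only n = 10 per group is badly underpowered
print(f"n = 10 per group: power = {power_two_sample(0.5, 10):.3f}")
```

Running scenarios like these before data collection is how the sample-size planning recommended above is done in practice.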

10.4.4 The trade-off between Type I and Type II errors

Reducing the risk of Type I errors (by using a smaller α, e.g., 0.01 instead of 0.05) increases the risk of Type II errors[6,23]. Conversely, increasing power (reducing Type II errors) by increasing α makes Type I errors more likely[11].

Why α = 0.05? The conventional significance level of 0.05 is a historical compromise, balancing the risks of false positives and false negatives[5,24]. However, some fields are moving toward more stringent thresholds (e.g., α = 0.005) to reduce false positive rates[25].

10.5 Conducting hypothesis tests: The t-test

The t-test is one of the most common hypothesis tests, used to compare means[1,26]. There are three main types:

  1. One-sample t-test: Compares a sample mean to a known population value
  2. Two-sample (independent) t-test: Compares means between two independent groups
  3. Paired (dependent) t-test: Compares means for the same group measured twice

10.5.1 One-sample t-test

Tests whether a sample mean differs from a hypothesized population mean[1].

Hypotheses:

  • H₀: μ = μ₀ (the population mean equals a specific value)
  • H₁: μ ≠ μ₀ (two-tailed)

Test statistic:

\[ t = \frac{\bar{x} - \mu_0}{\text{SE}} = \frac{\bar{x} - \mu_0}{s / \sqrt{n}} \]

Where:

  • \(\bar{x}\) = sample mean
  • \(\mu_0\) = hypothesized population mean
  • \(s\) = sample standard deviation
  • \(n\) = sample size

The test statistic follows a t-distribution with df = n − 1 degrees of freedom[26].

Decision rule:

  • If |t| > t-critical (from t-table) or p < α, reject H₀
  • Otherwise, fail to reject H₀

10.5.2 Worked example: One-sample t-test

A fitness test manual states that the population mean vertical jump height for college men is 50 cm. A researcher measures 20 college men and obtains:

  • \(\bar{x} = 54.2\) cm
  • \(s = 8.5\) cm
  • \(n = 20\)

Research question: Is the sample mean significantly different from 50 cm?

Step 1: State hypotheses

  • H₀: μ = 50 cm
  • H₁: μ ≠ 50 cm (two-tailed)
  • α = 0.05

Step 2: Compute the test statistic

\[ \text{SE} = \frac{s}{\sqrt{n}} = \frac{8.5}{\sqrt{20}} = 1.90 \text{ cm} \]

\[ t = \frac{54.2 - 50}{1.90} = \frac{4.2}{1.90} = 2.21 \]

Step 3: Determine degrees of freedom and critical value

  • df = n − 1 = 19
  • For α = 0.05 (two-tailed), t-critical ≈ 2.093 (from t-table)

Step 4: Make a decision

  • |t| = 2.21 > 2.093, so reject H₀
  • Alternatively, using software: p = 0.039 < 0.05, reject H₀

Step 5: Interpret

The sample mean (54.2 cm) is significantly different from the hypothesized population mean of 50 cm, t(19) = 2.21, p = .039. The mean jump height for this sample is higher than expected, with a 95% CI [50.2, 58.2] cm.
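For readers who prefer to verify the arithmetic by script rather than by hand or in SPSS, the worked example can be reproduced from its summary statistics in Python with scipy (carried without intermediate rounding, so the last digits may differ slightly from the hand calculation):

```python
import math
from scipy import stats

# Summary statistics from the worked example
x_bar, mu_0, s, n = 54.2, 50.0, 8.5, 20

se = s / math.sqrt(n)                 # standard error of the mean
t_stat = (x_bar - mu_0) / se          # t = 2.21
df = n - 1
p_value = 2 * stats.t.sf(abs(t_stat), df)  # two-tailed p, about .04

# 95% CI for the population mean
t_crit = stats.t.ppf(0.975, df)       # critical value, about 2.093
ci = (x_bar - t_crit * se, x_bar + t_crit * se)
print(f"t({df}) = {t_stat:.2f}, p = {p_value:.3f}, "
      f"95% CI [{ci[0]:.1f}, {ci[1]:.1f}]")
```

With raw data rather than summary statistics, `stats.ttest_1samp(data, popmean=50)` gives the same test directly.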

Note: Real example: Testing VO₂max norms

A study measures VO₂max in 30 collegiate soccer players and compares it to published norms (μ₀ = 52 mL/kg/min). The observed mean is 56.3 mL/kg/min (SD = 6.8). A one-sample t-test yields t(29) = 3.47, p = .002, indicating that this sample has significantly higher aerobic fitness than the normative population[13].

10.6 Two-sample (independent) t-test

Compares means between two independent groups[1,26].

Hypotheses:

  • H₀: μ₁ = μ₂ (the two population means are equal)
  • H₁: μ₁ ≠ μ₂ (two-tailed)

Test statistic:

\[ t = \frac{(\bar{x}_1 - \bar{x}_2) - 0}{\text{SE}_{\text{diff}}} \]

Where:

\[ \text{SE}_{\text{diff}} = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}} \]

Degrees of freedom are computed using the Welch-Satterthwaite approximation (preferred, does not assume equal variances)[27,28]:

\[ df \approx \frac{\left(\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}\right)^2}{\frac{(s_1^2/n_1)^2}{n_1-1} + \frac{(s_2^2/n_2)^2}{n_2-1}} \]

10.6.1 Worked example: Two-sample t-test

A researcher compares sprint times (seconds) between two training groups:

  • Group 1 (traditional training): \(\bar{x}_1 = 5.8\) s, \(s_1 = 0.6\) s, \(n_1 = 15\)
  • Group 2 (plyometric training): \(\bar{x}_2 = 5.3\) s, \(s_2 = 0.5\) s, \(n_2 = 15\)

Research question: Do the two groups differ in sprint performance?

Step 1: State hypotheses

  • H₀: μ₁ = μ₂ (no difference in sprint times)
  • H₁: μ₁ ≠ μ₂ (sprint times differ)
  • α = 0.05

Step 2: Compute the test statistic

\[ \text{SE}_{\text{diff}} = \sqrt{\frac{0.6^2}{15} + \frac{0.5^2}{15}} = \sqrt{0.024 + 0.0167} = \sqrt{0.0407} = 0.20 \text{ s} \]

\[ t = \frac{5.8 - 5.3}{0.20} = \frac{0.5}{0.20} = 2.50 \]

Step 3: Determine degrees of freedom (Welch approximation)

Using software: df ≈ 27.1

Step 4: Compute p-value

Using software: p = 0.019

Step 5: Make a decision

  • p = 0.019 < 0.05, so reject H₀

Interpretation

Plyometric training produced significantly faster sprint times (M = 5.3 s, SD = 0.5) than traditional training (M = 5.8 s, SD = 0.6), t(27.1) = 2.50, p = .019, mean difference = 0.5 s, 95% CI [0.09, 0.91] s. The effect size (Cohen’s d ≈ 0.91) indicates a large practical difference[10].
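The same analysis can be reproduced from the summary statistics in Python with scipy. Carried without intermediate rounding, the statistic is t ≈ 2.48 rather than the 2.50 obtained above (which reflects rounding SE to 0.20); the decision and p-value are essentially unchanged:

```python
import math
from scipy import stats

# Summary statistics from the worked example
m1, s1, n1 = 5.8, 0.6, 15   # traditional training
m2, s2, n2 = 5.3, 0.5, 15   # plyometric training

v1, v2 = s1**2 / n1, s2**2 / n2
se_diff = math.sqrt(v1 + v2)
t_stat = (m1 - m2) / se_diff

# Welch-Satterthwaite degrees of freedom
df = (v1 + v2) ** 2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))
p_value = 2 * stats.t.sf(abs(t_stat), df)

# Cross-check with scipy's summary-statistics helper (Welch when equal_var=False)
res = stats.ttest_ind_from_stats(m1, s1, n1, m2, s2, n2, equal_var=False)

# Cohen's d from the pooled standard deviation
d = (m1 - m2) / math.sqrt((s1**2 + s2**2) / 2)
print(f"t({df:.1f}) = {t_stat:.2f}, p = {p_value:.3f}, d = {d:.2f}")
```

With raw scores per group, `stats.ttest_ind(group1, group2, equal_var=False)` performs the same Welch test.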

10.7 Paired (dependent) t-test

Compares two related measurements (e.g., pre-test and post-test on the same participants)[1].

Hypotheses:

  • H₀: μd = 0 (mean difference is zero)
  • H₁: μd ≠ 0 (mean difference is not zero)

Test statistic:

\[ t = \frac{\bar{d} - 0}{\text{SE}_d} = \frac{\bar{d}}{s_d / \sqrt{n}} \]

Where:

  • \(\bar{d}\) = mean of the difference scores (d = post − pre)
  • \(s_d\) = standard deviation of the differences
  • \(n\) = number of pairs

Degrees of freedom: df = n − 1

10.7.1 Worked example: Paired t-test

A study measures vertical jump height (cm) before and after 8 weeks of strength training in 12 athletes.

Summary of the difference scores (post − pre):

Mean difference: \(\bar{d} = 3.8\) cm
SD of differences: \(s_d = 2.5\) cm
n = 12

Research question: Did training significantly improve jump height?

Step 1: State hypotheses

  • H₀: μd = 0 (no change in jump height)
  • H₁: μd ≠ 0 (jump height changed)
  • α = 0.05

Step 2: Compute test statistic

\[ \text{SE}_d = \frac{s_d}{\sqrt{n}} = \frac{2.5}{\sqrt{12}} = 0.72 \text{ cm} \]

\[ t = \frac{3.8}{0.72} = 5.28 \]

Step 3: Degrees of freedom

df = n − 1 = 11

Step 4: Compute p-value

Using software: p < .001

Step 5: Decision

  • p < .001, strongly reject H₀

Interpretation

Vertical jump height increased significantly following training, mean improvement = 3.8 cm, 95% CI [2.2, 5.4] cm, t(11) = 5.28, p < .001. This represents a meaningful improvement in explosive power[13].
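This worked example can likewise be reproduced from its summary statistics in Python with scipy. Without intermediate rounding the statistic is t ≈ 5.27 rather than 5.28 (which comes from rounding SE to 0.72); the conclusion is identical:

```python
import math
from scipy import stats

# Summary of the difference scores (post - pre) from the worked example
d_bar, s_d, n = 3.8, 2.5, 12

se_d = s_d / math.sqrt(n)                  # SE of the mean difference
t_stat = d_bar / se_d
df = n - 1
p_value = 2 * stats.t.sf(abs(t_stat), df)  # two-tailed p < .001

# 95% CI for the mean improvement
t_crit = stats.t.ppf(0.975, df)
ci = (d_bar - t_crit * se_d, d_bar + t_crit * se_d)
print(f"t({df}) = {t_stat:.2f}, p = {p_value:.4f}, "
      f"95% CI [{ci[0]:.1f}, {ci[1]:.1f}]")
```

With the raw pre/post scores rather than difference summaries, `stats.ttest_rel(post, pre)` performs the same paired test.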

10.8 Degrees of freedom

Degrees of freedom (df) represent the number of independent pieces of information available to estimate a parameter[1,3].

  • One-sample t-test: df = n − 1
  • Paired t-test: df = n − 1 (n = number of pairs)
  • Two-sample t-test (equal variances): df = n₁ + n₂ − 2
  • Two-sample t-test (unequal variances, Welch): df ≈ computed using Welch-Satterthwaite formula

Why df = n − 1? When estimating the sample standard deviation (s), we first compute the sample mean, “using up” one degree of freedom[1]. Only n − 1 deviations are independent (the last one is determined by the others).

Tip: Effect of df on critical values

Smaller df → larger critical t-values → harder to reject H₀[26]. With df = 5, t-critical (α = 0.05, two-tailed) = 2.571. With df = 100, t-critical ≈ 1.984. As df increases, the t-distribution approaches the standard normal (z) distribution[1].
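The shrinking critical values are easy to verify with scipy's t-distribution quantile function:

```python
from scipy import stats

# Two-tailed critical t-values at alpha = 0.05 for increasing df.
# ppf(0.975, df) is the 97.5th percentile: 2.5% in each tail.
for df in (5, 19, 100, 10_000):
    t_crit = stats.t.ppf(0.975, df)
    print(f"df = {df:>6}: t-critical = {t_crit:.3f}")

# As df grows, the critical value approaches the normal z = 1.96
```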

10.9 Statistical significance versus practical significance

Statistical significance (p < α) tells us whether an effect is detectable: data this extreme would be unlikely if the null hypothesis were true[2]. Practical significance evaluates whether the effect is large enough to matter in real-world contexts[13,15].

10.9.1 When they diverge

  1. Statistically significant but trivial effect:
    • Large sample (n = 500) detects a 0.2 cm improvement in jump height (p = .03)
    • Conclusion: Real effect, but too small to care about[2]
  2. Large effect but not statistically significant:
    • Small sample (n = 8) shows a 5 cm improvement (p = .08)
    • Conclusion: Promising effect, but study underpowered[11,12]

Important: Always report effect sizes alongside p-values

A p-value alone is insufficient[4,29]. Report:

  • Mean difference with 95% CI
  • Standardized effect size (e.g., Cohen’s d)
  • Practical interpretation in context

Example: “Mean difference = 4.2 cm, 95% CI [2.1, 6.3], d = 0.67, p = .002. This represents a moderate-to-large improvement in performance.”

10.10 Assumptions of t-tests

T-tests assume[1,3]:

  1. Independence: Observations are independent (one participant does not influence another)
  2. Normality: Data are approximately normally distributed (especially important for small samples)
  3. Equal variances (for independent t-test): Population variances are equal (use Welch’s t-test if violated)[27,28]

Robustness:

  • T-tests are robust to moderate violations of normality, especially with larger samples (n > 30)[30,31]
  • Use Welch’s t-test (default in most software) to avoid assuming equal variances[28]
  • For severe non-normality or small samples, consider nonparametric tests (Chapter 19)

10.11 Frequentist versus Bayesian approaches

The hypothesis testing framework described in this chapter is frequentist, meaning it interprets probability as long-run relative frequency[5,6]. The p-value answers: “What proportion of samples would produce data this extreme if H₀ were true and we repeated the experiment infinitely?”

Bayesian hypothesis testing offers an alternative[32,33]:

  • Computes the probability that H₀ (or H₁) is true given the data
  • Updates prior beliefs with new data to produce posterior probabilities
  • Does not rely on arbitrary α thresholds
  • More intuitive for many researchers but requires specifying prior distributions

Example comparison:

  • Frequentist: p = 0.03 (data are unlikely if H₀ is true)
  • Bayesian: P(H₁|data) = 0.87 (87% probability that H₁ is true given the data)
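The Bayesian figure in this comparison follows from Bayes' rule on the odds scale: posterior odds = Bayes factor × prior odds. A minimal sketch (the BF₁₀ value here is hypothetical, chosen to reproduce the illustrative 0.87 above under even prior odds):

```python
def posterior_prob_h1(bf10, prior_odds=1.0):
    """Convert a Bayes factor BF10 (evidence for H1 over H0)
    to P(H1 | data), given prior odds for H1 vs H0."""
    posterior_odds = bf10 * prior_odds
    return posterior_odds / (1 + posterior_odds)

# Hypothetical BF10 = 6.7 with even prior odds (1:1)
print(f"P(H1 | data) = {posterior_prob_h1(6.7):.2f}")
```

In practice the Bayes factor itself comes from software such as JASP or the R packages mentioned below; this snippet only shows the final conversion step.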

Note: Bayesian methods in Movement Science

Bayesian statistics are gaining popularity in Movement Science for their ability to quantify evidence for or against hypotheses and incorporate prior knowledge[33,34]. Software like JASP and R packages (e.g., BayesFactor, brms) make Bayesian analysis accessible[35].

10.12 Common misinterpretations and pitfalls

10.12.1 Misinterpretation 1: “p = 0.05 is the cutoff for truth”

Wrong: P-values are continuous measures of evidence, not binary truth detectors[9,36]. An effect with p = 0.051 is not fundamentally different from one with p = 0.049[37].

Better: Report exact p-values and interpret them as gradations of evidence[14].

10.12.2 Misinterpretation 2: “Non-significant means no effect”

Wrong: Failing to reject H₀ does not mean H₀ is true[18,19]. The study may lack power to detect a real effect.

Better: Report confidence intervals and effect sizes. “The difference was not statistically significant (p = .12), but the 95% CI [−1.2, 8.5] cm includes both trivial and meaningful effects, suggesting the study was underpowered”[11].

10.12.3 Misinterpretation 3: “p = 0.01 means a stronger effect than p = 0.05”

Wrong: P-values measure evidence against H₀, not effect magnitude[2,14]. A tiny effect in a huge sample can yield p < .001, while a large effect in a small sample may yield p = .08.

Better: Examine effect sizes (e.g., Cohen’s d, mean differences) to assess magnitude[15].

10.12.4 Misinterpretation 4: “Statistically significant = important”

Wrong: Statistical significance only indicates detectability, not practical importance[2,13].

Better: Evaluate practical significance using domain expertise, minimal clinically important differences (MCIDs), and confidence intervals[17].

10.13 Reporting hypothesis tests (APA style)

Template:

“[Group 1] (M = [mean], SD = [SD], n = [n]) [differed/did not differ] significantly from [Group 2] (M = [mean], SD = [SD], n = [n]), t([df]) = [t-value], p = [p-value], [effect size], 95% CI [lower, upper].”

Example:

“Trained athletes (M = 55.3 cm, SD = 6.2, n = 20) jumped significantly higher than untrained controls (M = 48.7 cm, SD = 7.1, n = 20), t(37.3) = 3.13, p = .003, d = 0.99, 95% CI [2.3, 10.9] cm.”

Key elements:

  • Descriptive statistics for each group
  • Test statistic with degrees of freedom
  • Exact p-value (unless p < .001)
  • Effect size
  • Confidence interval for the difference

10.14 Chapter summary

Hypothesis testing provides a formal decision-making framework for evaluating claims about populations based on sample data[1,6]. By formulating null and alternative hypotheses, computing a test statistic, and comparing the resulting p-value to a significance level (α), researchers determine whether observed effects are unlikely to have occurred by chance alone[5,9]. However, hypothesis testing is not without limitations: p-values are often misinterpreted, the dichotomous reject/fail-to-reject framework obscures gradations of evidence, and statistical significance does not imply practical importance[2,14]. Understanding Type I and Type II errors, statistical power, and the assumptions underlying t-tests is critical for conducting and interpreting hypothesis tests responsibly[10,11].

The t-test, one of the most widely used hypothesis tests, enables comparisons of means in one-sample, two-sample, and paired designs[1,26]. Proper interpretation requires reporting not only p-values but also effect sizes and confidence intervals, which provide richer information about the magnitude and precision of effects[4,15]. As Movement Science increasingly embraces estimation-focused approaches alongside traditional significance testing, researchers must balance the yes/no decisions of hypothesis testing with the nuanced uncertainty quantified by confidence intervals[13,17]. The p-value is a tool, not a truth—it tells us about the compatibility of data with the null hypothesis, not about the probability that hypotheses are true or the size of effects[9,22]. By integrating hypothesis testing with effect size reporting, confidence intervals, and substantive expertise, Movement Science researchers can make more informed, transparent, and reproducible inferences about human performance[4,16].

10.15 Key terms

hypothesis testing; null hypothesis; alternative hypothesis; p-value; significance level; Type I error; Type II error; statistical power; one-tailed test; two-tailed test; t-test; one-sample t-test; two-sample t-test; paired t-test; degrees of freedom; test statistic; critical value; statistical significance; practical significance; effect size

10.16 Practice: quick checks

Rejecting the null hypothesis means concluding that the data are sufficiently unlikely under the assumption that H₀ is true, so we infer that an effect probably exists[1,5]. It does not mean we have “proven” the alternative hypothesis or that the null hypothesis is definitely false—only that the evidence is inconsistent with H₀ at the chosen significance level[9]. Conversely, “failing to reject H₀” does not mean H₀ is true; it means we lack sufficient evidence to conclude it is false[18]. This distinction is crucial: hypothesis testing operates on evidence, not proof, and conclusions are probabilistic rather than certain[2].

The p-value is P(data | H₀), not P(H₀ | data)[9,19]. In plain language, the p-value tells us “how likely these data are if H₀ were true,” not “how likely H₀ is given these data.” The null hypothesis is either true or false (we don’t know which), so it doesn’t have a probability in the frequentist framework[7]. Confusing these probabilities is called the inverse probability fallacy[38]. To make probability statements about hypotheses, one must use Bayesian methods, which directly compute P(H₀ | data) using prior probabilities[32,33].

Statistical power is the probability of correctly rejecting the null hypothesis when it is false (i.e., detecting a true effect)[10]. Type II error (β) is the probability of failing to reject the null hypothesis when it is false (i.e., missing a true effect)[6]. They are complementary: Power = 1 − β[11]. High power (e.g., 0.80 or 80%) means low probability of Type II error (β = 0.20 or 20%). Power increases with larger sample sizes, larger effect sizes, higher significance levels, and lower variability[10,12]. Underpowered studies are problematic because they frequently miss real effects, producing false negatives and wasting resources[11].

Use a one-tailed test only when you have strong a priori theoretical or practical reasons to test a directional hypothesis and you genuinely do not care about effects in the opposite direction[20,21]. For example, testing whether a new rehabilitation protocol improves recovery time (and you would not interpret a worsening as meaningful). However, most research situations warrant two-tailed tests because researchers typically want to detect any difference, regardless of direction[1,3]. One-tailed tests are more controversial because they can be perceived as “p-hacking” if the direction is chosen post-hoc to achieve significance[39]. When in doubt, use two-tailed tests[21].

A non-significant result (p > 0.05) means the data are compatible with the null hypothesis, but it does not prove the null hypothesis is true[18,22]. The study may lack statistical power to detect a real but small effect, or the sample size may be insufficient[11]. The confidence interval provides critical context: a wide CI (e.g., [−5, 10] cm) suggests high uncertainty and includes both negative and positive effects, while a narrow CI near zero (e.g., [−0.5, 0.8] cm) provides stronger evidence for a truly negligible effect[4]. “Absence of evidence is not evidence of absence”[18]—to claim no effect, you need precise estimates (narrow CIs) centered near zero, not just p > 0.05[40].

Statistical significance (p < α) indicates that the observed data would be unlikely if the null hypothesis were true, providing evidence that an effect exists[2]. Practical significance assesses whether the magnitude of the effect is large enough to matter in applied contexts[13,15]. A statistically significant effect may be trivial (e.g., 0.1 cm improvement in a large sample), while a large effect may fail to reach significance in a small, underpowered study[11]. To evaluate practical significance, examine effect sizes (e.g., Cohen’s d), confidence intervals, and minimal clinically important differences (MCIDs) specific to the outcome[17]. Always ask: “Even if this effect is real, is it large enough to justify intervention, change practice, or inform theory?”[13].

Note: Read further

For deeper exploration of hypothesis testing controversies and alternatives, see [9] (ASA Statement on P-Values), [14] (Moving to a World Beyond p < 0.05), [4] (The New Statistics), and [33] (Bayesian data analysis). For power analysis and sample size planning, consult [10] and [11].

Tip: Next chapter

In Chapter 11, you will learn about correlation and bivariate regression, methods for quantifying the strength and direction of relationships between two continuous variables. You will see how correlation complements hypothesis testing, understand the critical distinction between correlation and causation, and learn to build simple regression models for prediction.