Chapter 10: Hypothesis Testing
2026-02-11
This presentation is based on the following main sources; unless otherwise specified, references are drawn from these books.
ClassShare App
You may be asked in class to go to the ClassShare App to answer questions.
SPSS Tutorial
By the end of this chapter, you should be able to:
| Symbol | Name | Pronunciation | Definition |
|---|---|---|---|
| \(H_0\) | Null hypothesis | “H naught” | Statement of no effect or no difference |
| \(H_1\) or \(H_a\) | Alternative hypothesis | “H one” or “H sub a” | Statement of an effect or difference |
| \(\alpha\) | Significance level | “alpha” | Probability of Type I error (typically 0.05) |
| \(\beta\) | Type II error rate | “beta” | Probability of failing to detect a real effect |
| \(1 - \beta\) | Statistical power | “one minus beta” | Probability of correctly detecting a real effect |
| \(p\) | P-value | “p value” | Probability of data at least as extreme as observed, if \(H_0\) is true |
| \(t\) | t-statistic | “t” | Test statistic for comparing means |
| \(df\) | Degrees of freedom | “d.f.” | Number of independent pieces of information |
| \(d\) | Cohen’s d | “Cohen’s d” | Effect size (standardized mean difference) |
Hypothesis testing uses indirect reasoning — similar to a courtroom trial[1,2].
Courtroom analogy:
| Courtroom Trial | Hypothesis Testing |
|---|---|
| Presumption of innocence | Assume \(H_0\) is true (no effect) |
| Prosecution presents evidence | Calculate test statistic from data |
| Jury evaluates evidence | Compare p-value to \(\alpha\) |
| Verdict: “Guilty” | Reject \(H_0\) (Significant evidence) |
| Verdict: “Not Guilty” | Fail to reject \(H_0\) (Insufficient evidence) |
Important
“Not guilty” ≠ “innocent” — just as “fail to reject \(H_0\)” ≠ “\(H_0\) is true”
Answer: We start by assuming there is NO effect (the null hypothesis, H₀). This is like the “presumption of innocence” in a courtroom — we need sufficient evidence to overturn this assumption.
Null hypothesis (\(H_0\)): Statement of no effect, no difference, or no relationship
Alternative hypothesis (\(H_1\)): Statement that contradicts \(H_0\)
Movement Science examples:
| Research Question | \(H_0\) | \(H_1\) |
|---|---|---|
| Does training improve jump height? | \(\mu_{\text{post}} = \mu_{\text{pre}}\) | \(\mu_{\text{post}} > \mu_{\text{pre}}\) |
| Is there a difference in balance between groups? | \(\mu_1 = \mu_2\) | \(\mu_1 \neq \mu_2\) |
| Does reaction time differ from the norm? | \(\mu = 200\) ms | \(\mu \neq 200\) ms |
Default choice
Use a two-tailed test unless you have a strong, pre-specified theoretical reason for a directional hypothesis[1].
Example: If testing a new supplement, use a two-tailed test (\(H_1: \mu \neq 0\)) to detect if performance improves OR gets worse. A one-tailed test (\(H_1: \mu > 0\)) would ignore the possibility that the supplement actually harms performance.
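As an illustration (with hypothetical numbers, not from the text), scipy shows how the two choices relate: the one-tailed p-value is half the two-tailed p-value, so a borderline result can cross \(\alpha = .05\) one-tailed while failing two-tailed. This is exactly why the directional choice must be pre-specified, before seeing the data.

```python
from scipy import stats

# Hypothetical supplement study: observed t = 1.80 with df = 19 (n = 20)
t_obs, df = 1.80, 19

p_two = 2 * stats.t.sf(abs(t_obs), df)  # H1: mu != 0 (two-tailed)
p_one = stats.t.sf(t_obs, df)           # H1: mu > 0  (one-tailed)

# Here the one-tailed test is "significant" while the two-tailed is not;
# choosing the tail after seeing the data would inflate Type I error.
print(f"two-tailed p = {p_two:.3f}, one-tailed p = {p_one:.3f}")
```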
The p-value is the probability of observing data as extreme as (or more extreme than) what we actually observed, assuming \(H_0\) is true[1,2].
In other words: If there were truly no effect, how surprising would these results be? A small p-value means the results are very surprising (rare), suggesting the “no effect” assumption (\(H_0\)) might be wrong.
Interpretation: a small p-value (\(p < \alpha\)) is evidence against \(H_0\); a large p-value means the data are consistent with \(H_0\), not that \(H_0\) is true.
What the p-value is NOT:
❌ The probability that \(H_0\) is true
❌ The probability the results are due to chance
❌ The probability of making an error
❌ The size or importance of the effect
Important
A p-value of 0.021 means: “If there were truly no effect (\(H_0\) true) and we repeated this study many times, we would obtain results this extreme only 2.1% of the time.” Since 2.1% < 5% (our threshold), we reject \(H_0\).
Answer: We fail to reject \(H_0\) because p = 0.12 > α = 0.05. The data are not extreme enough to provide sufficient evidence against the null hypothesis at the 5% significance level.
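This interpretation can be made concrete with a simulation (assumed numbers: \(n = 25\) and an observed \(t = 2.13\)): generate many datasets in a world where \(H_0\) is true, and count how often a result at least as extreme as the observed one appears.

```python
import numpy as np

rng = np.random.default_rng(0)

# In a world where H0 is true (true mean difference = 0), how often
# does a one-sample t statistic come out at least as extreme as 2.13?
n, n_sims, t_obs = 25, 20_000, 2.13
extreme = 0
for _ in range(n_sims):
    sample = rng.normal(loc=0.0, scale=1.0, size=n)  # H0 is true here
    t = sample.mean() / (sample.std(ddof=1) / np.sqrt(n))
    if abs(t) >= t_obs:
        extreme += 1

p_hat = extreme / n_sims
print(f"Simulated p-value: {p_hat:.3f}")  # close to the two-tailed p of .043
```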
Every hypothesis test can result in one of four outcomes[1,2]:
| | \(H_0\) is True | \(H_0\) is False |
|---|---|---|
| Reject \(H_0\) | Type I Error (\(\alpha\)) | Correct Decision (Power) |
| Fail to reject \(H_0\) | Correct Decision | Type II Error (\(\beta\)) |
Type I error (False Positive): Concluding there IS an effect when there isn’t one
Type II error (False Negative): Concluding there is NO effect when there is one
The trade-off
Reducing Type I error (lowering \(\alpha\)) increases Type II error (\(\beta\)), and vice versa. The most practical way to reduce both at once is to increase the sample size.
Statistical power is the probability of correctly rejecting a false null hypothesis — the ability to detect a real effect when one exists[1,2].
\[ \text{Power} = 1 - \beta \]
Factors affecting power: sample size (\(n\)), effect size, significance level (\(\alpha\)), measurement variability, and test direction (one- vs. two-tailed).
Recommended minimum: Power ≥ 0.80 (80%)
This means we want at least an 80% chance of detecting a real effect if one exists.
Important
Why power matters: A study with low power has a high probability of missing real effects (Type II error). Always conduct a power analysis before collecting data to determine the minimum sample size needed.
Answer: Type II error (β) = 1 - Power = 1 - 0.60 = 0.40 (40%). This means there’s a 40% chance the study will fail to detect a real effect — which is unacceptably high. The recommended minimum power is 80% (β = 20%).
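The relationship between power, effect size, and sample size can be sketched with a small Monte Carlo simulation (the numbers below are assumptions for illustration): draw many samples in which a real effect of a given size exists, run the test on each, and count how often \(H_0\) is correctly rejected.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

def simulated_power(effect_size, n, alpha=0.05, n_sims=10_000):
    """Estimate one-sample t-test power by simulation: the proportion of
    samples (drawn with a true effect of `effect_size` SDs) where p < alpha."""
    rejections = 0
    for _ in range(n_sims):
        sample = rng.normal(loc=effect_size, scale=1.0, size=n)
        _, p = stats.ttest_1samp(sample, popmean=0.0)
        if p < alpha:
            rejections += 1
    return rejections / n_sims

# A medium effect (d = 0.5) with n = 25 yields power of roughly 0.67,
# below the recommended 0.80: one reason to run a power analysis first.
power = simulated_power(effect_size=0.5, n=25)
print(f"Estimated power: {power:.2f}")
```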
Effect size measures the magnitude of the difference — how large the effect is in practical terms, independent of sample size[3].
Formula:
\[ d = \frac{\bar{x}_1 - \bar{x}_2}{s_{\text{pooled}}} \]
Cohen’s d formula
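The formula above can be implemented directly; the jump-height numbers below are hypothetical, chosen only to illustrate the calculation.

```python
import math

def cohens_d(x1, x2):
    """Cohen's d for two independent groups: mean difference / pooled SD."""
    n1, n2 = len(x1), len(x2)
    m1, m2 = sum(x1) / n1, sum(x2) / n2
    var1 = sum((v - m1) ** 2 for v in x1) / (n1 - 1)  # sample variances
    var2 = sum((v - m2) ** 2 for v in x2) / (n2 - 1)
    s_pooled = math.sqrt(((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2))
    return (m1 - m2) / s_pooled

# Hypothetical jump heights (cm) for a training group vs. a control group
training = [54.0, 56.0, 58.0, 55.0, 57.0]
control = [50.0, 52.0, 54.0, 51.0, 53.0]
print(f"d = {cohens_d(training, control):.2f}")
```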
Benchmarks[3]:
| \(d\) Value | Interpretation | Example (Jump Height) |
|---|---|---|
| 0.2 | Small | ~1.5 cm improvement |
| 0.5 | Medium | ~3.7 cm improvement |
| 0.8 | Large | ~6.0 cm improvement |
Important
Always report effect sizes alongside p-values. A result can be statistically significant (p < .05) but practically meaningless (tiny d), or statistically non-significant but practically important (large d with small sample).
Examples in Movement Science:
| Scenario | \(p\) | \(d\) | Interpretation |
|---|---|---|---|
| New shoe improves sprint by 0.001 s | .02 | 0.05 | Stat. sig. but trivial |
| Training improves jump by 8 cm | .08 | 0.90 | Not stat. sig. but large effect (underpowered?) |
| Rehab reduces pain by 2 points | .01 | 0.65 | Stat. sig. AND meaningful |
The “significance fallacy”
A “significant” result is not necessarily important, and a “non-significant” result does not mean the effect is zero. Always examine effect sizes and confidence intervals alongside p-values.
Answer: Probably not. While the result is statistically significant (p = 0.001), the effect size is tiny (d = 0.05). This likely occurred because of a very large sample size that detected a trivial difference. The effect is detectable but not meaningful in practice.
Follow these five steps for every hypothesis test[1]:
| Step | Action | Example |
|---|---|---|
| 1. State hypotheses | Write \(H_0\) and \(H_1\) based on research question | \(H_0: \mu = 50\); \(H_1: \mu \neq 50\) |
| 2. Set criteria | Choose \(\alpha\) (typically 0.05) and test direction | Two-tailed, \(\alpha = .05\) |
| 3. Calculate statistic | Compute test statistic from data | \(t = 2.13\) |
| 4. Make decision | Compare p-value to \(\alpha\) | \(p = .043 < .05\) → Reject \(H_0\) |
| 5. State conclusion | Interpret in context, report effect size | “Jump height significantly exceeds 50 cm, \(t(24) = 2.13\), \(p = .043\), \(d = 0.43\)” |
APA reporting format
“A one-sample t-test revealed that mean vertical jump height (\(M\) = 53.2, \(SD\) = 7.5) was significantly greater than the hypothesized mean of 50 cm, \(t(24)\) = 2.13, \(p\) = .043, \(d\) = 0.43.”
Research question: Is the mean vertical jump height of kinesiology students different from the national average of 50 cm?
Data: \(n = 25\), \(\bar{x} = 53.2\) cm, \(s = 7.5\) cm
Step 1: State hypotheses: \(H_0: \mu = 50\) cm; \(H_1: \mu \neq 50\) cm (two-tailed)
Step 2: Calculate test statistic
\[t = \frac{53.2 - 50}{7.5 / \sqrt{25}} = \frac{3.2}{1.5} = 2.13\]
Step 3: Find p-value: with \(df = 24\) and \(t = 2.13\), the two-tailed \(p = .043\)
Step 4: Make decision: \(p = .043 < \alpha = .05\), so reject \(H_0\)
Step 5: State conclusion
“The mean vertical jump height of kinesiology students (\(\bar{x} = 53.2\) cm, \(s = 7.5\)) was significantly greater than the national average of 50 cm, \(t(24) = 2.13\), \(p = .043\).”
Effect size:
\[d = \frac{53.2 - 50}{7.5} = 0.43 \text{ (small-to-medium)}\]
Cohen’s d benchmarks
Small = 0.2, Medium = 0.5, Large = 0.8[3]
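The worked example can be verified from the summary statistics alone; this sketch recomputes \(t\), the two-tailed \(p\), and \(d\) (scipy is assumed to be available).

```python
import math
from scipy import stats

# Summary statistics from the worked example
n, xbar, s, mu0 = 25, 53.2, 7.5, 50.0

t = (xbar - mu0) / (s / math.sqrt(n))  # t = 3.2 / 1.5 ≈ 2.13
p = 2 * stats.t.sf(abs(t), df=n - 1)   # two-tailed p ≈ .043
d = (xbar - mu0) / s                   # Cohen's d ≈ 0.43

print(f"t({n - 1}) = {t:.2f}, p = {p:.3f}, d = {d:.2f}")
```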
❌ Misconception 1
“p = 0.03 means there is a 3% chance that \(H_0\) is true.”
✅ Correct: p = 0.03 means there is a 3% chance of observing data this extreme if \(H_0\) were true.
❌ Misconception 2
“Failing to reject \(H_0\) proves there is no effect.”
✅ Correct: Failing to reject means insufficient evidence against \(H_0\). The study may lack power.
❌ Misconception 3
“A significant result means the effect is large and important.”
✅ Correct: Statistical significance depends on sample size. Always check effect size.
❌ Misconception 4
“p = 0.001 is ‘more significant’ than p = 0.04.”
✅ Correct: Both are significant at α = 0.05. A smaller p-value indicates stronger evidence against \(H_0\), but does not indicate a larger or more important effect.
❌ Misconception 5
“If I run enough tests, I’ll eventually find a significant result.”
✅ Correct: Multiple testing inflates Type I error. Running 20 tests at α = 0.05 will produce about 1 false positive by chance alone. Use corrections (Bonferroni, FDR).
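The inflation is easy to quantify: with \(k\) independent tests and every null hypothesis true, the chance of at least one false positive is \(1 - (1 - \alpha)^k\). A quick sketch:

```python
alpha = 0.05
for k in (1, 5, 20, 100):
    p_any = 1 - (1 - alpha) ** k  # P(at least one false positive)
    bonferroni = alpha / k        # Bonferroni-corrected per-test threshold
    print(f"{k:>3} tests: P(>=1 false positive) = {p_any:.2f}; "
          f"per-test alpha = {bonferroni:.4f}")
```

With 20 tests the chance of at least one false positive is about 64%, which is why uncorrected "significant" findings from many exploratory tests should be treated with caution.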
Important
The goal of hypothesis testing is to make informed decisions under uncertainty. Always report effect sizes, confidence intervals, and p-values together for a complete picture.
Please complete the One-Sample t-Test Activity for this week before leaving.