Fundamentals of Inferential Statistics

A rigorous, evidence-based introduction to inferential statistics for beginner researchers in kinesiology and the humanities. Covers probability concepts, sampling distributions, confidence intervals, hypothesis testing, Type I/II errors, effect size, and statistical power—with R-generated visualizations and applied examples throughout.

Keywords: inferential statistics, probability distributions, hypothesis testing, sampling error, confidence intervals, type I error, type II error, effect size, statistical power
Affiliation: Cal State Northridge

Published: March 4, 2026

1 Learning Objectives

By the end of this post, you should be able to:

  1. Distinguish descriptive from inferential statistics and explain why the latter requires probability theory.
  2. Define and contrast a population parameter and a sample statistic.
  3. Explain sampling error and the Central Limit Theorem, and demonstrate their relationship graphically.
  4. Construct and interpret a confidence interval for a population mean.
  5. State a null and alternative hypothesis, select a test statistic, and interpret a p-value.
  6. Distinguish Type I and Type II errors and describe the factors that affect each.
  7. Define effect size and statistical power, and explain why both matter independently of the p-value.
  8. Differentiate one-tailed from two-tailed tests and identify when each is appropriate.

2 The Bridge from Description to Inference

Descriptive statistics (covered in the previous post) summarize a sample—they tell you what you observed. Inferential statistics do something more ambitious: they use the sample to reason about the broader population from which it was drawn (Rosner, 2015; Weir & Vincent, 2021). This shift from “what we measured” to “what is likely true in general” is mathematically grounded in probability theory.

A motivating example. A kinesiology researcher wants to know whether a 12-week resistance training program increases vertical jump height in collegiate soccer players. She cannot test every collegiate soccer player in the country, so she recruits a sample of 30. After the intervention, the sample mean jump height increases by 4.2 cm. The key inferential question is: Is this 4.2 cm increase a real population-level effect, or could it plausibly have arisen by chance in a sample of 30 people even if the program has no true effect? (Thomas et al., 2015)

Answering that question requires understanding populations and samples, probability distributions, and hypothesis testing—the topics of this post.

Note

Parameter vs. Statistic
A population parameter is a fixed (usually unknown) numerical characteristic of the entire population (e.g., \(\mu\), \(\sigma\)). A sample statistic is a value computed from the sample that estimates the corresponding parameter (e.g., \(\bar{x}\), \(s\)). Inferential statistics is the principled process of using statistics to estimate parameters (Gravetter et al., 2021).

3 Probability Foundations

Inferential statistics is built on probability theory. Three concepts are fundamental.

Sample space (\(\Omega\)): The set of all possible outcomes. For a single sprint trial, \(\Omega\) might be all positive real numbers representing completion time.

Event: Any subset of the sample space. “Completing the sprint in under 5.0 s” is an event.

Probability: A number in \([0, 1]\) assigned to an event reflecting its long-run relative frequency. A probability of 0 means the event cannot occur; 1 means it is certain (Rosner, 2015).

Two key relationships between events:

  • Mutually exclusive: Two events cannot both occur. A sprint cannot simultaneously be “under 5.0 s” and “over 6.0 s.”
  • Independent: The occurrence of one event does not change the probability of another. Whether athlete A beats 5.0 s is independent of whether athlete B does, if they perform separately.

These concepts underpin the probability distributions used to compute p-values and confidence intervals.
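These two rules can be checked with simple arithmetic. The probabilities below are made-up illustrative values, not data from any study:

```r
# Hypothetical (illustrative) probabilities for sprint events
p_under5 <- 0.30  # P(sprint completed under 5.0 s)
p_over6  <- 0.15  # P(sprint takes over 6.0 s)

# Mutually exclusive: both cannot occur, so the probabilities simply add
p_either <- p_under5 + p_over6            # P(under 5.0 s OR over 6.0 s)

# Independent: athletes A and B perform separately, so probabilities multiply
p_A_under5 <- 0.30
p_B_under5 <- 0.40
p_both_under5 <- p_A_under5 * p_B_under5  # P(A under 5.0 s AND B under 5.0 s)

p_either        # 0.45
p_both_under5   # 0.12
```

The addition rule applies only when the events cannot co-occur; the multiplication rule only when they are independent.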

4 Probability Distributions

A probability distribution describes how probability is allocated across possible values of a variable (Gravetter et al., 2021; Rosner, 2015). Distributions fall into two broad families.

4.1 Discrete Distributions

Discrete variables take countable values (usually non-negative integers). Two distributions are especially relevant in kinesiology:

Binomial distribution. Models the number of “successes” in \(n\) independent trials where each trial has the same probability \(p\) of success. Example: A physical therapist conducts 20 balance-beam trials with a stroke patient. If the true probability of a successful crossing is 0.7, the number of successes follows a Binomial\((n=20, p=0.7)\) distribution.

Poisson distribution. Models the number of rare events in a fixed interval of time or space. Example: ACL injuries in a professional soccer league across a season. If injuries occur randomly at a constant average rate \(\lambda\), the count follows a Poisson distribution (Rosner, 2015).
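R's built-in `dbinom`/`pbinom` and `dpois`/`ppois` functions make these two examples concrete. The Poisson rate below (\(\lambda = 4\) injuries per season) is an assumed value chosen only for illustration:

```r
# Binomial: 20 balance-beam trials, true success probability 0.7
dbinom(14, size = 20, prob = 0.7)       # P(exactly 14 successes), ~0.19
1 - pbinom(13, size = 20, prob = 0.7)   # P(14 or more successes)

# Poisson: ACL injuries at an assumed average rate of 4 per season
dpois(2, lambda = 4)                    # P(exactly 2 injuries), ~0.15
1 - ppois(6, lambda = 4)                # P(more than 6 injuries)
```

The `d*` functions give the probability of an exact count; the `p*` functions give cumulative probabilities, which is why tail probabilities are computed as one minus the cumulative value.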

4.2 Continuous Distributions

Continuous variables can take any value within a range. Four distributions are central to hypothesis testing (Figure 1):

Normal distribution. Fully described by its mean \(\mu\) and standard deviation \(\sigma\). Many biological and performance variables (height, VO₂max, grip strength) are approximately normal in large samples. The normal distribution has the well-known 68–95–99.7% rule (Gravetter et al., 2021).

Student’s t-distribution. Used whenever the population SD is unknown and must be estimated from the sample, which matters most when samples are small (\(n < 30\)). The t-distribution has heavier tails than the normal, reflecting the added uncertainty from estimating \(\sigma\) with \(s\). As \(n \to \infty\) the t-distribution converges to the normal (Weir & Vincent, 2021).

Chi-square distribution. The distribution of a sum of squared independent standard normal variables, indexed by its degrees of freedom. It is right-skewed and underlies tests of independence and goodness of fit (Rosner, 2015).

F-distribution. Used in analysis of variance (ANOVA) to test whether multiple group means differ. It is the ratio of two independent chi-square quantities, each scaled by its degrees of freedom, and is right-skewed (Rosner, 2015).

Code
#| fig-cap: "Four probability distributions commonly used in kinesiology research. Top-left: standard normal; top-right: t-distribution for small (df=5) and large (df=30) samples; bottom-left: chi-square; bottom-right: F-distribution."

par(mfrow = c(2, 2), mar = c(4, 4, 3, 1))

# Standard Normal
x <- seq(-4, 4, length.out = 300)
plot(x, dnorm(x), type = "l", lwd = 2, col = "steelblue",
     main = "Standard Normal", xlab = "z", ylab = "Density")
polygon(c(x, rev(x)), c(dnorm(x), rep(0, length(x))),
        col = rgb(0.27, 0.51, 0.71, 0.2), border = NA)

# t-distribution
plot(x, dt(x, df=5), type="l", lwd=2, col="darkorange",
     main="t-Distribution", xlab="t", ylab="Density", ylim=c(0,0.41))
lines(x, dt(x, df=30), lwd=2, col="purple", lty=2)
lines(x, dnorm(x), lwd=1.5, col="gray50", lty=3)
legend("topright", legend=c("df=5","df=30","Normal"),
       col=c("darkorange","purple","gray50"), lty=c(1,2,3), lwd=2, cex=0.75, bty="n")

# Chi-square
xc <- seq(0, 20, length.out=300)
plot(xc, dchisq(xc, df=4), type="l", lwd=2, col="darkred",
     main="Chi-Square (df=4)", xlab=expression(chi^2), ylab="Density")

# F-distribution
xf <- seq(0, 6, length.out=300)
plot(xf, df(xf, df1=3, df2=20), type="l", lwd=2, col="darkgreen",
     main="F-Distribution (df1=3, df2=20)", xlab="F", ylab="Density")

par(mfrow = c(1,1))
Figure 1

4.3 Null Distributions

Every hypothesis test is built around a null distribution—the sampling distribution of the test statistic assuming the null hypothesis is true. The p-value is the area in the tail(s) of the null distribution at or beyond the observed test statistic (see Table 1 for examples) (Bland & Altman, 1994; Weir & Vincent, 2021).

Table 1: Common null distributions and their applications

Distribution        Test statistic   Common application
Standard Normal     \(z\)            Large-sample tests for means or proportions
t-distribution      \(t\)            Comparing means with unknown \(\sigma\)
F-distribution      \(F\)            ANOVA; testing equality of variances
\(\chi^2\)          \(\chi^2\)       Tests for independence, goodness of fit
Binomial            —                Exact tests for proportions

5 Sampling Error and the Central Limit Theorem

5.1 Sampling Error

Sampling error is the unavoidable discrepancy between a sample statistic and the true population parameter (Thomas et al., 2015; Weir & Vincent, 2021). Even with a perfectly designed study and no measurement error, different random samples from the same population will yield different means. This fluctuation is not a mistake—it is a mathematical certainty arising from the randomness of sampling.

Movement science example. Suppose the true population mean VO₂max of recreational runners is 48.0 mL·kg⁻¹·min⁻¹ (\(\sigma = 6.0\)). A researcher draws a random sample of 25 runners and computes \(\bar{x} = 50.4\) mL·kg⁻¹·min⁻¹. The sampling error here is \(50.4 - 48.0 = 2.4\) mL·kg⁻¹·min⁻¹. It exists not because of procedural mistakes but because any one sample will not perfectly mirror the population.

How to reduce sampling error:

  • Increase sample size (\(n\))—larger samples produce more stable estimates.
  • Use probability sampling methods (simple random, stratified, cluster).
  • Reduce measurement error (use calibrated equipment, standardized protocols) (Atkinson & Nevill, 1998).

5.2 The Central Limit Theorem

The Central Limit Theorem (CLT) is one of the most important results in statistics (Gravetter et al., 2021; Rosner, 2015). It states that, regardless of the shape of the population distribution, the sampling distribution of the sample mean approaches a normal distribution as sample size increases. Formally:

\[ \bar{X} \sim \mathcal{N}\!\left(\mu, \; \frac{\sigma}{\sqrt{n}}\right) \]

The standard deviation of this sampling distribution, \(\text{SE} = \sigma/\sqrt{n}\), is called the standard error of the mean. It quantifies how much sample means vary around the true population mean (Figure 2).

Code
#| fig-cap: "Central Limit Theorem demonstration using simulated sprint times drawn from a right-skewed population (exponential distribution, mean ≈ 5.3 s). Each panel shows the distribution of 5,000 sample means for increasing sample sizes. As n grows, the sampling distribution becomes approximately normal regardless of the population's shape."

set.seed(123)
pop_size <- 100000
# Right-skewed population: exponential + shift
pop <- rexp(pop_size, rate = 0.5) + 3.3  # mean ≈ 5.3 s, right skewed

sample_sizes <- c(2, 5, 15, 30)
n_sim        <- 5000

par(mfrow = c(2, 2), mar = c(4, 4, 3, 1))

for (n in sample_sizes) {
  means <- replicate(n_sim, mean(sample(pop, n)))
  hist(means,
       breaks = 30,
       col    = "steelblue",
       border = "white",
       freq   = FALSE,
       main   = paste0("n = ", n),
       xlab   = "Sample Mean Sprint Time (s)",
       ylab   = "Density")
  # Overlay theoretical normal
  curve(dnorm(x, mean = mean(pop), sd = sd(pop)/sqrt(n)),
        col = "red", lwd = 2, add = TRUE)
}

par(mfrow = c(1,1))
Figure 2

Key insight: Even though the underlying sprint-time population is skewed, the distribution of sample means is approximately normal by \(n = 15\) to \(30\). This is why parametric tests (which assume normality of the sampling distribution, not the raw data) are robust for moderate sample sizes (Field, 2018).

6 Confidence Intervals

A confidence interval (CI) is a range of plausible values for the unknown population parameter, calculated from the sample data (Rosner, 2015). A 95% CI does not mean “there is a 95% probability the true value lies inside this specific interval.” Rather, it means: if we repeated this study 100 times, approximately 95 of the resulting intervals would contain the true parameter (Gravetter et al., 2021).

6.1 Formula for the CI of a Population Mean

When \(\sigma\) is unknown (the typical case), we use the t-distribution:

\[ CI = \bar{x} \pm t^*_{\alpha/2, \, n-1} \cdot \frac{s}{\sqrt{n}} \]

where \(t^*_{\alpha/2, \, n-1}\) is the critical value from the t-distribution with \(n-1\) degrees of freedom for the chosen confidence level (see Table 2).

Table 2: Critical values for common confidence levels

Confidence level   \(z\) (large \(n\))   \(t\) (\(n = 20\), df = 19)
90%                1.645                 1.729
95%                1.960                 2.093
99%                2.576                 2.861
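The critical values above can be reproduced directly with `qnorm()` and `qt()`:

```r
conf_levels <- c(0.90, 0.95, 0.99)
alpha <- 1 - conf_levels

z_star <- qnorm(1 - alpha / 2)            # large-sample z critical values
t_star <- qt(1 - alpha / 2, df = 20 - 1)  # t critical values for n = 20

round(z_star, 3)  # 1.645 1.960 2.576
round(t_star, 3)  # 1.729 2.093 2.861
```

Note that each t critical value exceeds its z counterpart, reflecting the t-distribution's heavier tails at small \(n\).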

Movement science example. Researchers measure the time to complete an obstacle course (s) for a sample of 30 Army officer candidates. The sample mean is \(\bar{x} = 142.5\) s and \(s = 18.3\) s.

Code
xbar <- 142.5
s    <- 18.3
n    <- 30
alpha <- 0.05

t_star <- qt(1 - alpha/2, df = n - 1)
SE     <- s / sqrt(n)
MOE    <- t_star * SE

writeLines(c(
  paste0("Standard Error (SE): ", round(SE, 3), " s"),
  paste0("Critical t*        : ", round(t_star, 3)),
  paste0("Margin of Error    : ", round(MOE, 2), " s"),
  paste0("95% CI             : [", round(xbar - MOE, 2), ", ",
         round(xbar + MOE, 2), "] s")
))
Standard Error (SE): 3.341 s
Critical t*        : 2.045
Margin of Error    : 6.83 s
95% CI             : [135.67, 149.33] s

The 95% CI is approximately [135.7, 149.3] seconds, meaning the data are consistent with population means ranging from about 2 min 16 s to 2 min 29 s (Figure 3).

Code
#| fig-cap: "Simulation of 50 confidence intervals (95%) drawn from a population with true mean µ = 142.5 s. Intervals that capture the true mean are shown in blue; those that miss it are shown in red. In the long run, 95% of all intervals would capture µ."

set.seed(7)
true_mu <- 142.5
true_sd <- 18.3
n_obs   <- 30
n_intervals <- 50

results <- matrix(NA, nrow = n_intervals, ncol = 3)
for (i in seq_len(n_intervals)) {
  samp <- rnorm(n_obs, mean = true_mu, sd = true_sd)
  xm   <- mean(samp)
  tstar <- qt(0.975, df = n_obs - 1)
  se_i  <- sd(samp) / sqrt(n_obs)
  results[i, ] <- c(xm, xm - tstar * se_i, xm + tstar * se_i)
}

captures <- results[,2] <= true_mu & results[,3] >= true_mu
cols <- ifelse(captures, "steelblue", "red")

plot(NULL, xlim = c(120, 165), ylim = c(0, n_intervals + 1),
     xlab = "Completion Time (s)", ylab = "Interval #",
     main = "50 Simulated 95% Confidence Intervals")
abline(v = true_mu, lty = 2, col = "black", lwd = 2)
text(true_mu + 0.5, n_intervals + 0.5, expression(mu), cex = 1.1)
for (i in seq_len(n_intervals)) {
  segments(results[i,2], i, results[i,3], i, col = cols[i], lwd = 1.5)
  points(results[i,1], i, pch = 19, col = cols[i], cex = 0.6)
}
legend("bottomright", legend = c("Contains µ", "Misses µ"),
       col = c("steelblue","red"), lwd = 2, cex = 0.8, bty = "n")
Figure 3

7 Estimation

7.1 Point Estimation

A point estimate is a single value used to estimate a population parameter. The sample mean \(\bar{x}\) is the most common point estimate for the population mean \(\mu\) (Weir & Vincent, 2021). A good point estimator should be:

  • Unbiased: The expected value of the estimator equals the parameter (\(E[\bar{x}] = \mu\)).
  • Efficient: Among all unbiased estimators, it has the smallest variance.
  • Consistent: The estimate converges to the true parameter as \(n \to \infty\) (Rosner, 2015).

Movement science example. A researcher estimates mean VO₂max (\(\mu\)) for elite road cyclists using a sample of \(n = 40\). The sample yields \(\bar{x} = 72.1\) mL·kg⁻¹·min⁻¹. This is the point estimate; it provides no information about how precisely \(\mu\) is estimated. The 95% CI ([70.0, 74.2]) provides that crucial precision information.
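A short simulation can illustrate unbiasedness and consistency. The population parameters below are assumed values chosen to echo the cycling example:

```r
set.seed(2024)
mu    <- 72.1  # assumed true population mean (mL/kg/min)
sigma <- 5     # assumed population SD

# Unbiasedness: averaged over many samples, x-bar centers on mu even at small n
many_means_n10 <- replicate(20000, mean(rnorm(10, mu, sigma)))
mean(many_means_n10)  # very close to 72.1

# Consistency: individual estimates cluster more tightly as n grows
sd(many_means_n10)                                 # ~ sigma / sqrt(10)  = 1.58
sd(replicate(20000, mean(rnorm(160, mu, sigma))))  # ~ sigma / sqrt(160) = 0.40
```

The second pair of lines is just the standard error formula \(\sigma/\sqrt{n}\) observed empirically: quadrupling \(n\) halves the spread of the estimates.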

7.2 Interval Estimation

As shown in Section 6, interval estimation captures uncertainty around the point estimate through confidence intervals. Interval estimates are preferred over bare point estimates in scientific reporting because they reveal the precision of measurement and the plausibility of the null hypothesis value (Cumming, 2014).

Tip

Reporting recommendation. The American Psychological Association and the American College of Sports Medicine both recommend routinely reporting confidence intervals alongside p-values (Cumming, 2014; Sullivan & Feinn, 2012). A result can be statistically significant (\(p < 0.05\)) yet practically trivial if the CI reveals the effect is tiny.

8 Hypothesis Testing

Hypothesis testing is a formal decision-making procedure that uses sample data to evaluate a claim about a population parameter (Thomas et al., 2015; Weir & Vincent, 2021). It does not prove hypotheses—it assesses whether the data are sufficiently inconsistent with the null hypothesis to warrant rejecting it.

8.1 Null and Alternative Hypotheses

The null hypothesis (\(H_0\)) is the default claim of “no effect” or “no difference.” The alternative hypothesis (\(H_1\) or \(H_a\)) is the researcher’s prediction of an effect.

Movement science example.

Does a plyometric training program increase jump height in adolescent basketball players?

\[ H_0: \mu_{\text{post}} - \mu_{\text{pre}} = 0 \quad \text{(no change)} \] \[ H_1: \mu_{\text{post}} - \mu_{\text{pre}} > 0 \quad \text{(jump height increases)} \]

Humanities example.

Is the mean reading speed of students who used an e-reader different from those who used a printed textbook?

\[ H_0: \mu_{e} = \mu_{p} \] \[ H_1: \mu_{e} \neq \mu_{p} \]

The directional (\(>\)) vs. non-directional (\(\neq\)) distinction has consequences for test selection—covered in detail in the companion post on directional hypotheses.

8.2 Test Statistics and p-Values

A test statistic is a single number that summarizes how far the observed sample result is from what \(H_0\) predicts, measured in units of standard error (Weir & Vincent, 2021):

\[ t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}} \]

The p-value is the probability of observing a test statistic as extreme as—or more extreme than—the one obtained, assuming \(H_0\) is true (Bland & Altman, 1994; Rosner, 2015).

Warning

The p-value is widely misinterpreted. A p-value is not the probability that \(H_0\) is true, nor the probability that the result occurred by chance. It is a conditional probability: \(P(\text{data} \geq \text{observed} \mid H_0 \text{ is true})\). This distinction matters enormously for drawing correct conclusions (Cumming, 2014).

Movement science example. Using data from the jump-height intervention above:

Code
# Pre- and post-intervention vertical jump heights (cm) for 20 athletes
set.seed(42)
pre  <- round(rnorm(20, mean = 42, sd = 5), 1)
post <- round(pre + rnorm(20, mean = 3.5, sd = 4), 1)  # true mean gain ≈ 3.5 cm

result <- t.test(post, pre, paired = TRUE, alternative = "greater")

writeLines(c(
  paste0("Mean pre : ",    round(mean(pre),  2), " cm"),
  paste0("Mean post: ",    round(mean(post), 2), " cm"),
  paste0("Mean gain: ",    round(mean(post - pre), 2), " cm"),
  paste0("t-statistic: ",  round(result$statistic, 3)),
  paste0("df         : ",  result$parameter),
  paste0("p-value    : ",  round(result$p.value, 4)),
  paste0("95% CI (lower bound): ", round(result$conf.int[1], 2), " cm")
))
Mean pre : 42.97 cm
Mean post: 45.38 cm
Mean gain: 2.41 cm
t-statistic: 2.425
df         : 19
p-value    : 0.0127
95% CI (lower bound): 0.69 cm

With \(p < 0.05\), we reject \(H_0\) and conclude the plyometric program produced a statistically significant increase in jump height.

8.3 The Decision Rule and Significance Level

The significance level (\(\alpha\)) is the pre-chosen probability threshold below which the p-value triggers rejection of \(H_0\). By convention, \(\alpha = 0.05\) is most common in kinesiology (Weir & Vincent, 2021), though \(\alpha = 0.01\) is used in fields with higher costs of error (e.g., clinical trials). The choice of \(\alpha\) must be made before data collection, not adjusted after examining results (Bland & Altman, 1994).

8.4 Type I and Type II Errors

Any binary decision procedure can make two types of errors (Table 3) (Gravetter et al., 2021; Weir & Vincent, 2021):

Table 3: Decision outcomes in hypothesis testing

Decision                  \(H_0\) is True                             \(H_0\) is False
Reject \(H_0\)            Type I Error (\(\alpha\)) — False Positive  Correct Decision (Power \(= 1 - \beta\))
Fail to reject \(H_0\)    Correct Decision (\(1 - \alpha\))           Type II Error (\(\beta\)) — False Negative

Type I error (\(\alpha\)): Concluding there is an effect when none exists. In kinesiology, this could mean recommending a useless supplement because a study happened to produce a false positive. Controlled by setting \(\alpha\) low.

Type II error (\(\beta\)): Failing to detect a real effect. This might mean dismissing an effective rehabilitation protocol because the study was underpowered. Controlled by increasing sample size or reducing measurement error.

Weir & Vincent (2021) list five common causes of Type I errors:

  1. Measurement error
  2. Non-random sampling
  3. \(\alpha\) set too liberally (e.g., \(\alpha = .10\))
  4. Investigator bias
  5. Improper use of a one-tailed test

And four common causes of Type II errors:

  1. Measurement error
  2. Insufficient statistical power (\(n\) too small)
  3. \(\alpha\) set too conservatively (e.g., \(\alpha = .01\))
  4. Treatment effect not properly administered

A visual representation of Type I and Type II errors is shown in Figure 4.

Code
#| fig-cap: "Visualization of Type I (α, red region) and Type II (β, blue region) errors. The left curve is the null distribution; the right curve is the alternative distribution (true effect d = 0.8). The vertical dashed line marks the critical value for α = 0.05 (one-tailed). The power of the test (1 - β) is the blue area to the right of the critical value under the alternative distribution."

d   <- 0.8  # effect size (Cohen's d)
n   <- 25   # sample size per group
se  <- 1 / sqrt(n)

x   <- seq(-3.5, 5.5, length.out = 500)
crit_val <- qnorm(0.95)  # one-tailed α = 0.05

y_null <- dnorm(x, mean = 0, sd = 1)
y_alt  <- dnorm(x, mean = d / se, sd = 1)

plot(x, y_null, type = "l", lwd = 2, col = "navy",
     ylim = c(0, 0.42),
     xlab = "Test Statistic", ylab = "Density",
     main = expression(paste("Type I (", alpha, ") and Type II (", beta, ") Errors")))
lines(x, y_alt, lwd = 2, col = "darkgreen")

# Type I error region (right tail of null)
x_I <- x[x >= crit_val]
polygon(c(x_I, rev(x_I)),
        c(dnorm(x_I, 0, 1), rep(0, length(x_I))),
        col = rgb(1, 0, 0, 0.4), border = NA)

# Type II error region (left tail of alternative)
x_II <- x[x <= crit_val]
polygon(c(x_II, rev(x_II)),
        c(dnorm(x_II, d/se, 1), rep(0, length(x_II))),
        col = rgb(0, 0, 1, 0.3), border = NA)

abline(v = crit_val, lty = 2, col = "black", lwd = 1.5)

legend("topright",
       legend = c(expression(H[0]~Distribution),
                  expression(H[1]~Distribution),
                  expression(paste("Type I Error (", alpha, ")")),
                  expression(paste("Type II Error (", beta, ")"))),
       col = c("navy","darkgreen",
               rgb(1,0,0,0.6), rgb(0,0,1,0.5)),
       lty = c(1,1,NA,NA),
       pch = c(NA,NA,15,15),
       pt.cex = 1.5,
       lwd = c(2,2,NA,NA),
       cex = 0.78, bty = "n")
Figure 4

8.5 Effect Size and Statistical Power

A statistically significant result (\(p < \alpha\)) tells us the observed effect is unlikely under \(H_0\), but it says nothing about the size or practical importance of the effect (Cumming, 2014; Sullivan & Feinn, 2012). Two complementary metrics fill this gap.

8.5.1 Effect Size

Cohen’s \(d\) is the standardized mean difference between two groups:

\[ d = \frac{\bar{x}_1 - \bar{x}_2}{s_{\text{pooled}}} \]

Conventional benchmarks (Table 4) (Cohen, 1988):

Table 4: Conventional benchmarks for Cohen’s \(d\)

Cohen’s \(d\)   Interpretation   Kinesiology example
0.2             Small            1–2 cm difference in jump height
0.5             Medium           ~5 cm difference in jump height
0.8             Large            ~8 cm difference in jump height

These benchmarks are domain-general defaults; researchers should consult domain-specific norms when available (Lakens, 2013).

Movement science example. In a study comparing maximal isometric strength (N) between trained and untrained adults:

Code
trained   <- c(350, 380, 410, 390, 420, 360, 405, 430, 375, 395)
untrained <- c(250, 270, 290, 260, 280, 300, 265, 285, 275, 295)

# Pooled SD: this simple average of the two variances assumes equal group sizes
s_pooled <- sqrt((var(trained) + var(untrained)) / 2)
cohens_d <- (mean(trained) - mean(untrained)) / s_pooled

writeLines(c(
  paste0("Mean trained  : ", round(mean(trained),   1), " N"),
  paste0("Mean untrained: ", round(mean(untrained), 1), " N"),
  paste0("Pooled SD     : ", round(s_pooled,        2), " N"),
  paste0("Cohen's d     : ", round(cohens_d,        2), " (large effect)")
))
Mean trained  : 391.5 N
Mean untrained: 277 N
Pooled SD     : 21.42 N
Cohen's d     : 5.34 (large effect)

8.5.2 Statistical Power

Statistical power (\(1 - \beta\)) is the probability of correctly rejecting \(H_0\) when \(H_1\) is true (Cohen, 1988). It is determined by four interacting factors:

  • Effect size (\(d\)): larger effects are easier to detect.
  • Sample size (\(n\)): larger samples yield more power.
  • Significance level (\(\alpha\)): relaxing \(\alpha\) increases power but also increases Type I errors.
  • Variability (\(\sigma\)): lower measurement error increases power.

The conventional minimum power standard is 0.80 (Cohen, 1988), meaning an 80% chance of detecting a true effect. Studies with power below 0.50 are sometimes called “underpowered” and are at substantial risk of Type II errors (see Figure 5).

Code
#| fig-cap: "Statistical power as a function of sample size (per group) for a two-sample independent t-test (α = 0.05, two-tailed) at three effect sizes: small (d = 0.2), medium (d = 0.5), and large (d = 0.8). The dashed horizontal line marks the conventional 0.80 power target."

n_seq <- seq(5, 200, by = 5)
ds    <- c(0.2, 0.5, 0.8)
cols  <- c("salmon", "steelblue", "darkgreen")

plot(NULL, xlim = c(5, 200), ylim = c(0, 1),
     xlab = "Sample Size (per group)", ylab = "Statistical Power",
     main = expression(paste("Power Curves (", alpha, " = 0.05, two-tailed)")))
abline(h = 0.80, lty = 2, col = "gray40")
text(190, 0.82, "0.80", col = "gray40", cex = 0.8)

for (i in seq_along(ds)) {
  d  <- ds[i]
  pw <- sapply(n_seq, function(n) {
    ncp <- d * sqrt(n / 2)      # non-centrality parameter for independent t-test
    crit <- qt(0.975, df = 2*(n-1))
    pt(crit, df = 2*(n-1), ncp = ncp, lower.tail = FALSE) +
    pt(-crit, df = 2*(n-1), ncp = ncp, lower.tail = TRUE)
  })
  lines(n_seq, pw, col = cols[i], lwd = 2)
}
legend("bottomright",
       legend = c("d = 0.2 (small)", "d = 0.5 (medium)", "d = 0.8 (large)"),
       col = cols, lwd = 2, cex = 0.8, bty = "n")
Figure 5

Practical takeaway: Detecting a small effect (\(d = 0.2\)) with 80% power requires roughly 400 participants per group, far beyond the range shown in Figure 5 and a sobering number for under-resourced kinesiology labs. A large effect (\(d = 0.8\)) can be reliably detected with ~25–30 participants per group (Cohen, 1988).

8.6 One- and Two-Tailed Tests

A one-tailed test localizes all of \(\alpha\) in a single tail of the null distribution; a two-tailed test splits \(\alpha\) equally between both tails. The choice is driven by the directional specificity of \(H_1\) (Figure 6) (Bland & Altman, 1994; Weir & Vincent, 2021).

Code
#| fig-cap: "Rejection regions for one-tailed (left) and two-tailed (right) tests at α = 0.05. In the one-tailed test all 5% is in the right tail; in the two-tailed test 2.5% is allocated to each tail. Shaded red regions are rejection regions; the shaded gray region shows the non-rejection region."

par(mfrow = c(1, 2), mar = c(4, 4, 3, 1))

z  <- seq(-3.5, 3.5, length.out = 500)
yz <- dnorm(z)

# --- One-tailed ---
plot(z, yz, type = "l", lwd = 2, col = "navy",
     main = "One-Tailed (Right)\nα = 0.05",
     xlab = "z", ylab = "Density")
# Gray non-rejection region
z_nr <- z[z <= qnorm(0.95)]
polygon(c(z_nr, rev(z_nr)),
        c(dnorm(z_nr), rep(0, length(z_nr))),
        col = "lightgray", border = NA)
# Red rejection region
z_rj <- z[z >= qnorm(0.95)]
polygon(c(z_rj, rev(z_rj)),
        c(dnorm(z_rj), rep(0, length(z_rj))),
        col = rgb(1, 0, 0, 0.5), border = NA)
abline(v = qnorm(0.95), lty = 2, col = "red")
text(qnorm(0.95) + 0.1, 0.20, paste0("z = ", round(qnorm(0.95),2)),
     col = "red", cex = 0.75, adj = 0)
text(2.8, 0.05, "α = 0.05", col = "red", cex = 0.75)

# --- Two-tailed ---
plot(z, yz, type = "l", lwd = 2, col = "navy",
     main = "Two-Tailed\nα = 0.05",
     xlab = "z", ylab = "Density")
# Gray non-rejection region
z_nr2 <- z[z >= qnorm(0.025) & z <= qnorm(0.975)]
polygon(c(z_nr2, rev(z_nr2)),
        c(dnorm(z_nr2), rep(0, length(z_nr2))),
        col = "lightgray", border = NA)
# Red left tail
z_L <- z[z <= qnorm(0.025)]
polygon(c(z_L, rev(z_L)), c(dnorm(z_L), rep(0, length(z_L))),
        col = rgb(1,0,0,0.5), border = NA)
# Red right tail
z_R <- z[z >= qnorm(0.975)]
polygon(c(z_R, rev(z_R)), c(dnorm(z_R), rep(0, length(z_R))),
        col = rgb(1,0,0,0.5), border = NA)
abline(v = qnorm(0.025), lty = 2, col = "red")
abline(v = qnorm(0.975), lty = 2, col = "red")
text(qnorm(0.025) - 0.1, 0.20, paste0("z = ", round(qnorm(0.025),2)),
     col = "red", cex = 0.72, adj = 1)
text(qnorm(0.975) + 0.1, 0.20, paste0("z = ", round(qnorm(0.975),2)),
     col = "red", cex = 0.72, adj = 0)
text(-3, 0.05, "α/2", col = "red", cex = 0.75)
text( 3, 0.05, "α/2", col = "red", cex = 0.75)

par(mfrow = c(1,1))
Figure 6

Two-tailed tests are the default in most kinesiology research; one-tailed tests require strong a priori directional predictions grounded in prior literature (Bland & Altman, 1994; Weir & Vincent, 2021). See the companion post on directional hypotheses for a detailed treatment.

9 Summary

Inferential statistics provides the mathematical tools to make principled conclusions about populations from limited samples. The core chain of reasoning is:

  1. Clearly define the population and formulate \(H_0\) and \(H_1\).
  2. Draw a random, representative sample and measure variables carefully.
  3. Compute an appropriate test statistic; locate it on the relevant null distribution.
  4. Calculate the p-value and compare it to the pre-specified \(\alpha\).
  5. Report the decision, the effect size, the confidence interval, and the power of the study.

No single p-value tells the complete story. Effect sizes and confidence intervals are essential complements that reveal the magnitude and precision of an effect—information that a binary reject/fail-to-reject decision discards (Cumming, 2014; Lakens, 2013; Sullivan & Feinn, 2012).

9.1 Check your Knowledge

Question 1. What is the main purpose of inferential statistics?

- [ ] To summarize the exact characteristics of the collected sample data.
- [x] To use sample data to make probability-based claims about a broader population.
- [ ] To eliminate error from measurement instruments.
- [ ] To ensure the sample perfectly matches the population.

> Inferential statistics uses probability to estimate population parameters from sample statistics.

Question 2. Which distribution is used to compute confidence intervals and test statistics when the population standard deviation is unknown and the sample size is small?

- [ ] Standard normal distribution
- [ ] Chi-square distribution
- [x] Student's t-distribution
- [ ] F-distribution

> The Student's t-distribution has heavier tails than the normal distribution, accounting for the added uncertainty of having to estimate the population standard deviation from the sample.

Question 3. What does the Central Limit Theorem (CLT) state about the sampling distribution of the sample mean?

- [ ] It takes the exact shape of the underlying population distribution.
- [x] It approaches a normal distribution as sample size increases, regardless of the population's shape.
- [ ] It becomes perfectly uniform at $n > 30$.
- [ ] It only applies if the population is normally distributed.

> The CLT explains why sample means tend to follow a normal curve, justifying parametric tests even for skewed data (at sufficient sample sizes).

Question 4. In hypothesis testing, what is a Type I error?

- [x] Concluding an effect exists when the null hypothesis is actually true (a false positive).
- [ ] Failing to detect a real effect when the alternative hypothesis is true (a false negative).
- [ ] Using the wrong test statistic for the data.
- [ ] Using an insufficient sample size to achieve 80% power.

> A Type I error occurs when random chance produces an extreme sample result, leading the researcher to incorrectly reject a true null hypothesis.

Question 5. If a 95% confidence interval for a mean is [12.4, 18.6], what is the correct interpretation?

- [ ] 95% of the data points in the sample fall between 12.4 and 18.6.
- [ ] There is a 95% chance that the next subject tested will score between 12.4 and 18.6.
- [x] If the study were repeated many times, we would expect about 95% of the resulting intervals to contain the true population mean.
- [ ] The true population mean changes 95% of the time.

> A confidence interval is an interval estimate; 95% confidence refers to the long-run success rate of the estimation procedure, not the probability that a specific interval contains the parameter.

Question 6. What does Cohen's $d$ measure?

- [ ] The probability that the null hypothesis is true.
- [x] The standardized mean difference between two groups (effect size).
- [ ] The standard error of the sampling distribution.
- [ ] The probability of making a Type II error.

> Cohen's $d$ is an effect size metric that expresses the difference between means in standard deviation units.

Question 7. What does statistical power ($1 - \beta$) represent?

- [ ] The probability of rejecting a true null hypothesis.
- [x] The probability of correctly rejecting the null hypothesis when an effect actually exists.
- [ ] The probability of failing to find a significant result.
- [ ] The probability that the sample mean perfectly equals the population mean.

> High statistical power means the study has a high chance of detecting a true effect, minimizing the risk of a Type II error.

Question 8. Why are two-tailed tests generally preferred as the scientific default over one-tailed tests?

- [ ] They have higher statistical power in the predicted direction.
- [x] They allow researchers to detect potentially harmful or unexpected effects in the opposite direction.
- [ ] They require half the sample size.
- [ ] They automatically correct for measurement error.

> Two-tailed tests protect against ignoring adverse outcomes and discourage post-hoc switching to a one-tailed test to achieve significance (a form of p-hacking).
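The CLT claim tested in Question 3 is easy to verify by simulation. The sketch below (in Python rather than the R used for this post's figures) draws repeated samples from a heavily right-skewed exponential population, whose mean and standard deviation are both 1, and shows that the distribution of sample means concentrates around the population mean with spread close to the CLT prediction of $\sigma/\sqrt{n}$.

```python
import random
import statistics

random.seed(42)  # reproducible illustration

def sample_means(n, reps=5000):
    """Means of `reps` samples of size n drawn from a skewed
    exponential population (population mean = 1, SD = 1)."""
    return [statistics.fmean(random.expovariate(1.0) for _ in range(n))
            for _ in range(reps)]

for n in (2, 30):
    means = sample_means(n)
    # CLT prediction: mean of sample means ≈ 1, SD ≈ 1 / sqrt(n)
    print(f"n={n:>2}: mean of sample means = {statistics.fmean(means):.3f}, "
          f"SD = {statistics.stdev(means):.3f} "
          f"(CLT predicts SD ≈ {1 / n ** 0.5:.3f})")
```

Plotting a histogram of `sample_means(30)` would show the familiar bell shape emerging even though every individual observation comes from a sharply skewed distribution.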

Image credit

Illustration by Elisabet Guba from Ouch!

References

Atkinson, G., & Nevill, A. M. (1998). Statistical methods for assessing measurement error (reliability) in variables relevant to sports medicine. Sports Medicine, 26(4), 217–238. https://doi.org/10.2165/00007256-199826040-00002
Bland, J. M., & Altman, D. G. (1994). Statistics notes: One and two sided tests of significance. BMJ, 309(6949), 248. https://doi.org/10.1136/bmj.309.6949.248
Cohen, J. (1988). Statistical power analysis for the behavioral sciences. L. Erlbaum Associates.
Cumming, G. (2014). The new statistics: Why and how. Psychological Science, 25(1), 7–29. https://doi.org/10.1177/0956797613504966
Field, A. (2018). Discovering statistics using IBM SPSS statistics (5th ed.). SAGE Publications.
Gravetter, F. J., Wallnau, L. B., & Forzano, L.-A. B. (2021). Statistics for the behavioral sciences (10th ed.). Cengage Learning.
Lakens, D. (2013). Calculating and reporting effect sizes to facilitate cumulative science: A practical primer for t-tests and ANOVAs. Frontiers in Psychology, 4, 863. https://doi.org/10.3389/fpsyg.2013.00863
Rosner, B. (2015). Fundamentals of biostatistics (8th ed.). Cengage Learning.
Sullivan, G. M., & Feinn, R. (2012). Using effect size—or why the p value is not enough. Journal of Graduate Medical Education, 4(3), 279–282. https://doi.org/10.4300/JGME-D-12-00156.1
Thomas, J. R., Nelson, J. K., & Silverman, S. J. (2015). Research methods in physical activity (7th ed.). Human Kinetics.
Weir, J. P., & Vincent, W. J. (2021). Statistics in kinesiology (5th ed.). Human Kinetics.

Reuse

Citation

BibTeX citation:
@misc{furtado2026,
  author = {Furtado, Ovande},
  title = {Fundamentals of {Inferential} {Statistics}},
  date = {2026-03-04},
  url = {https://drfurtado.github.io/randomstats/posts/022523-inferential-stats/},
  langid = {en}
}
For attribution, please cite this work as:
Furtado, O. (2026, March 4). Fundamentals of Inferential Statistics. RandomStats. https://drfurtado.github.io/randomstats/posts/022523-inferential-stats/