Chapter 11: Correlation and Bivariate Regression
2026-02-21
This presentation is based on the following books. Unless otherwise specified, references come from these sources.
Main sources:
ClassShare App
You may be asked in class to go to the ClassShare App to answer questions.
SPSS Tutorial
By the end of this chapter, you should be able to:
| Symbol | Name | Pronunciation | Definition |
|---|---|---|---|
| \(r\) | Pearson’s correlation | “r” | Strength and direction of the linear relationship |
| \(\rho\) | Population correlation | “rho” | True correlation in the population |
| \(r^2\) | Coefficient of determination | “r squared” | Proportion of variance in \(Y\) explained by \(X\) |
| \(R^2\) | Coefficient of determination (regression) | “R squared” | Proportion of variance explained by the regression model |
| \(\hat{y}\) | Predicted value | “y hat” | Value of \(Y\) predicted by the regression equation |
| \(a\) | Intercept | “a” | Predicted \(Y\) when \(X = 0\) |
| \(b\) | Slope | “b” | Change in \(\hat{y}\) for a one-unit increase in \(X\) |
| \(e\) | Residual | “residual” | Difference between observed and predicted \(Y\) |
Correlation measures the strength and direction of the linear relationship between two continuous variables[1,2].
Key properties:
Directions:
Benchmarks[1]:
| \(|r|\) | Strength |
|---|---|
| \(> 0.7\) | Strong |
| \(0.4–0.7\) | Moderate |
| \(< 0.4\) | Weak |
There are many guidelines for interpreting the strength of a correlation. The choice of criteria depends on the field and study. Here are some common guidelines:
\[ \begin{array}{lccccc} \hline \textbf{Guideline} & \textbf{Negligible} & \textbf{Weak} & \textbf{Moderate} & \textbf{Strong} & \textbf{Very Strong / Perfect} \\ \hline \text{Cohen (1988)} & \text{–} & 0.10 \le |r| < 0.30 & 0.30 \le |r| < 0.50 & |r| \ge 0.50 & \text{–} \\ \text{Evans (1996)} & |r| < 0.20 & 0.20 \le |r| < 0.40 & 0.40 \le |r| < 0.60 & 0.60 \le |r| < 0.80 & 0.80 \le |r| \le 1.00 \\ \text{Hinkle (1994)} & |r| < 0.30 & 0.30 \le |r| < 0.50 & 0.50 \le |r| < 0.70 & 0.70 \le |r| < 0.90 & 0.90 \le |r| \le 1.00 \\ \text{Dancey and Reidy (2004)} & |r| < 0.10 & 0.10 \le |r| < 0.30 & 0.30 \le |r| < 0.60 & 0.60 \le |r| < 1.00 & |r| = 1.00 \\ \text{Mukaka (2012)} & |r| < 0.30 & 0.30 \le |r| < 0.50 & 0.50 \le |r| < 0.70 & 0.70 \le |r| < 0.90 & 0.90 \le |r| \le 1.00 \\ \hline \end{array} \]
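As an illustrative sketch (not part of the course materials, which use SPSS), the Mukaka (2012) bands from the table above can be turned into a small helper function. The function name and boundary handling are my own choices:

```python
def classify_r_mukaka(r: float) -> str:
    """Classify the strength of a correlation |r| using the
    Mukaka (2012) bands from the guideline table above."""
    a = abs(r)
    if a >= 0.90:
        return "very strong"
    if a >= 0.70:
        return "strong"
    if a >= 0.50:
        return "moderate"
    if a >= 0.30:
        return "weak"
    return "negligible"
```

Note that the same \(r\) can earn different labels under different guidelines, which is exactly why the table lists several.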
\[ r = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2 \; \sum (y_i - \bar{y})^2}} \tag{1}\]
Where \(\bar{x}\) and \(\bar{y}\) are the means of \(X\) and \(Y\), respectively.
Alternatively, if we convert our data to z-scores first, the formula simplifies elegantly. Since z-scores already standardize for individual variability, \(r\) is simply the “average” product of the z-scores[1]:
\[ r = \frac{1}{n-1} \sum z_x \, z_y \tag{2}\]
Where \(z_x\) and \(z_y\) are the z-scores of \(X\) and \(Y\), respectively.
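A minimal Python sketch (illustrative only; the data here are made up) shows that Equation 1 and Equation 2 are the same quantity. The z-score version divides the cross-products by \(s_x\), \(s_y\), and \(n-1\), which algebraically reduces to Equation 1:

```python
from statistics import mean, stdev

def pearson_r(x, y):
    """Pearson's r via Equation 1: cross-deviations over sqrt(SSx * SSy)."""
    mx, my = mean(x), mean(y)
    num = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    den = (sum((xi - mx) ** 2 for xi in x)
           * sum((yi - my) ** 2 for yi in y)) ** 0.5
    return num / den

def pearson_r_z(x, y):
    """Pearson's r via Equation 2: 'average' product of z-scores (n - 1)."""
    mx, my = mean(x), mean(y)
    sx, sy = stdev(x), stdev(y)          # sample SDs (n - 1 denominator)
    n = len(x)
    return sum(((xi - mx) / sx) * ((yi - my) / sy)
               for xi, yi in zip(x, y)) / (n - 1)

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 8.0, 9.8]
# Both formulas return the same r, up to floating-point error.
```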
Intuition
When above-average \(X\) tends to pair with above-average \(Y\) (both deviations have the same sign), products are mostly positive → \(r > 0\). When they pair with opposite signs → \(r < 0\).
Answer: Positive. When above-average \(X\) pairs with above-average \(Y\), the deviations \((x_i - \bar{x})\) and \((y_i - \bar{y})\) both have the same sign, making their product positive. Summing many positive products gives a positive numerator, and thus \(r > 0\).
Scatterplots are the essential first step — never compute \(r\) without visualizing the data first[3,4].
What scatterplots reveal:
Anscombe’s Quartet
Four datasets with identical \(r = 0.816\), \(\bar{x}\), \(\bar{y}\), and regression lines — but completely different patterns when plotted. Correlation alone can be deeply misleading[1].
Pearson’s \(r\) measures only linear associations[1,3]. Two variables can have a strong, meaningful relationship yet produce \(r \approx 0\) if the relationship is nonlinear.
Movement Science examples of nonlinearity:
What to do if nonlinear:
A strong correlation does not prove that one variable causes the other[1,5].
Three reasons correlations can be misleading:
| Explanation | Description | Example |
|---|---|---|
| Confounding variable | Third variable drives both | Hot weather → ice cream sales AND drownings |
| Reverse causation | Direction assumed backwards | Do fit people exercise, or does exercise make people fit? |
| Spurious correlation | Coincidence | Spelling bee winning-word length ↔︎ deaths from venomous spiders |
Establishing causation requires:
Language matters
Use: “X is associated with Y” or “X and Y are related”
Avoid: “X causes Y” (unless you have experimental evidence)
Movement Science example
A strong negative correlation between physical activity and cardiovascular disease does not prove that activity prevents heart disease[5]. Healthier individuals may simply be more likely to exercise (reverse causation), or genetic factors may influence both (confounding). Only RCTs can establish causation[6].
Answer: No. Alternative explanations include: (1) skilled gymnasts may be more motivated to practice (reverse causation); (2) talent, coaching quality, or physical ability may drive both practice time and skill (confounding). Correlation alone cannot establish causation. A randomized controlled trial, where athletes are randomly assigned to different practice schedules, would be needed[1].
Squaring \(r\) gives \(r^2\), the coefficient of determination: the proportion of variance in \(Y\) explained by \(X\)[1,3]. This plays the same role as the effect size measures used with the t-test and ANOVA.
\[ r^2 = (0.590)^2 = 0.348 \tag{3}\]
Interpretation:
\[ t = \frac{r \sqrt{n-2}}{\sqrt{1 - r^2}}, \quad df = n - 2 \tag{4}\]
For our example: \(t = 1.790\), \(df = 6\), \(p = .124\)
Note
Equation 3 refers to the equation for the coefficient of determination.
Equation 4 refers to the equation for the significance test of the correlation coefficient.
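As a quick check on the worked example, Equation 4 can be computed directly. This is an illustrative Python sketch; the \(p\)-value in the slides (\(p = .124\)) would come from a \(t\) table or software:

```python
import math

def correlation_t(r: float, n: int):
    """Equation 4: t-statistic and degrees of freedom for testing H0: rho = 0."""
    t = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)
    return t, n - 2

t, df = correlation_t(0.590, 8)
# Matches the worked example: t ≈ 1.79 with df = 6.
```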
Practical benchmarks[7]
| \(r^2\) | Interpretation |
|---|---|
| \(< 0.10\) | Weak (< 10% shared variance) |
| \(0.10–0.30\) | Moderate |
| \(\geq 0.30\) | Strong |
Data: leg strength (kg) and vertical jump height (cm) in 8 athletes.
| Athlete | \(X\) (kg) | \(Y\) (cm) |
|---|---|---|
| 1 | 80 | 42 |
| 2 | 90 | 53 |
| 3 | 70 | 44 |
| 4 | 100 | 49 |
| 5 | 85 | 52 |
| 6 | 95 | 48 |
| 7 | 75 | 40 |
| 8 | 88 | 56 |
Summary Stats:
Before calculating any correlation coefficient, we must verify that the dataset meets the necessary statistical assumptions.
| Assumption | How to check in SPSS |
|---|---|
| 1. Continuous Variables | Both \(X\) and \(Y\) must be interval or ratio level variables. Confirm this in the Variable View (Measure column). |
| 2. Linearity | The relationship must be roughly linear. Look at the scatterplot via Graphs > Chart Builder... and ensure the pattern follows a line, not a curve. |
| 3. Independence | Each observation must be independent. This is confirmed via study design (e.g., each row is a unique athlete). |
| 4. No Outliers | Ensure extreme points aren’t pulling the linear trend. Check the scatterplot visually or use boxplots via Analyze > Descriptive Statistics > Explore. |
| 5. Normality | Variables should be roughly normally distributed (necessary for significance testing). Run the Shapiro-Wilk test or check Q-Q plots via Analyze > Descriptive Statistics > Explore. |
| 6. Homoscedasticity | Data points should be evenly spread along the regression line (no funnel shape). Check the scatterplot or regression residuals visually. |
Once assumptions are met, we compute Pearson’s \(r\).
2. Compute \(r\) using z-scores
First, convert all values to z-scores: \(z_x = \frac{x - \bar{x}}{s_x}, \quad z_y = \frac{y - \bar{y}}{s_y}\)
For the 8 athletes, the sum of their cross-products is \(\sum z_x z_y = 4.132\)
\[r = \frac{1}{n-1} \sum z_x z_y = \frac{4.132}{7} = \mathbf{0.590}\]
Interpretation: A moderate positive linear relationship.
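The hand calculation above can be verified in a few lines of Python (an illustrative sketch; in this course the computation is done in SPSS), using the 8 athletes' data and the z-score formula (Equation 2):

```python
from statistics import mean, stdev

strength = [80, 90, 70, 100, 85, 95, 75, 88]   # X, leg strength (kg)
jump     = [42, 53, 44, 49, 52, 48, 40, 56]    # Y, jump height (cm)

n = len(strength)
mx, my = mean(strength), mean(jump)
sx, sy = stdev(strength), stdev(jump)          # sample SDs (n - 1)

# Sum of z-score cross-products, then divide by n - 1 (Equation 2)
cross = sum(((x - mx) / sx) * ((y - my) / sy)
            for x, y in zip(strength, jump))
r = cross / (n - 1)
# cross ≈ 4.132 and r ≈ 0.590, matching the hand calculation.
```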
Calculate in SPSS
You can find step-by-step instructions on how to compute Pearson’s \(r\) in the SPSS Tutorial: Correlation and Bivariate Regression chapter of the SMS textbook.
Regression goes beyond correlation: it fits a mathematical equation to predict \(Y\) from \(X\)[1,3].
\[\hat{y} = a + bx\]
Components:
| Symbol | Name | Meaning |
|---|---|---|
| \(\hat{y}\) | Predicted value | Estimated \(Y\) for a given \(X\) |
| \(a\) | Intercept | Predicted \(Y\) when \(X = 0\) |
| \(b\) | Slope | Change in \(\hat{y}\) per 1-unit increase in \(X\) |
Correlation vs. Regression
| Correlation | Regression | |
|---|---|---|
| Goal | Quantify association | Predict \(Y\) from \(X\) |
| Output | \(r\), \(r^2\) | Equation \(\hat{y} = a + bx\) |
| Symmetric? | Yes (\(r_{XY} = r_{YX}\)) | No (predicting \(Y\) from \(X\) ≠ vice versa) |
| When to use | Describe relationship | Make predictions |
To build the regression equation \(\hat{y} = a + bx\), we must calculate the slope (\(b\)) and intercept (\(a\)).
The Slope (\(b\))
The Intercept (\(a\))
Using the leg strength data: \(\bar{x} = 85.375\), \(\bar{y} = 48.000\), \(s_x = 10.056\), \(s_y = 5.632\), \(r = 0.590\).
Step 1: Compute the slope
\[b = r \frac{s_y}{s_x} = 0.590 \times \frac{5.632}{10.056} = 0.590 \times 0.560 = \mathbf{0.331 \text{ cm/kg}}\]
Step 2: Compute the intercept
\[a = \bar{y} - b\bar{x} = 48.000 - (0.331 \times 85.375) = \mathbf{19.741 \text{ cm}}\]
Step 3: Write the equation
\[\hat{y} = 19.741 + 0.331(x)\]
Making a prediction: If leg strength = 92 kg:
\[\hat{y} = 19.741 + 0.331(92) = 50.19 \text{ cm}\]
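Steps 1–3 can be reproduced from the raw data in a short Python sketch (illustrative only). Note that carrying full precision gives \(b \approx 0.3306\) and \(a \approx 19.78\); the slide's intercept of 19.741 reflects the rounded slope, but both versions give the same prediction of about 50.19 cm at 92 kg:

```python
from statistics import mean

strength = [80, 90, 70, 100, 85, 95, 75, 88]   # X, leg strength (kg)
jump     = [42, 53, 44, 49, 52, 48, 40, 56]    # Y, jump height (cm)

mx, my = mean(strength), mean(jump)
sxx = sum((x - mx) ** 2 for x in strength)                       # sum of squares of X
sxy = sum((x - mx) * (y - my) for x, y in zip(strength, jump))   # cross-products

b = sxy / sxx            # slope; algebraically equal to r * (sy / sx)
a = my - b * mx          # intercept
y_hat_92 = a + b * 92    # predicted jump height for 92 kg of leg strength
```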
Slope (\(b = 0.331\))
For every 1 kg increase in leg strength, the predicted jump height increases by 0.331 cm on average. Note: This represents the average trend across the entire sample. It does not guarantee that if a specific individual gains 1 kg of strength, their jump will mechanically increase by exactly 0.331 cm.
Intercept (\(a = 19.741\))
Predicted jump height when leg strength = 0. This value is not meaningful here — no one has zero leg strength. Do not over-interpret intercepts outside the data range.
Extrapolation
Never predict outside the observed range of \(X\) (70–100 kg in this example). The linear relationship may not hold beyond that range[1].
A residual is the difference between the observed and predicted value[1]:
\[e_i = y_i - \hat{y}_i \tag{5}\]
What residuals tell us:
\(R^2\) (in bivariate regression = \(r^2\)):
\[ R^2 = 0.348 \tag{6}\]
Reading the Residual Plot:
| Pattern | Diagnosis |
|---|---|
| 1. Random scatter ✓ | Assumptions met: Errors are random (linearity) with constant spread (homoscedasticity). |
| 2. Funnel shape | Heteroscedasticity: Spread of errors changes across the range of predicted values, violating constant variance. |
| 3. Curved pattern | Nonlinearity: The linear model missed a curved relationship. |
| 4. Outliers | Influential points: Specific extreme values that might distort the model. |
Answer:
\(\hat{y} = 19.741 + 0.331(80) = 19.741 + 26.48 = \mathbf{46.22 \text{ cm}}\).
What does this mean?
Our model predicts an athlete with 80 kg of leg strength will jump 46.22 cm.
The Residual (Error):
If we look back at our original data table, Athlete 1 actually had 80 kg of leg strength, but only jumped 42 cm.
Residual (\(y - \hat{y}\)) = 42 − 46.22 = -4.22 cm.
The negative residual means Athlete 1 jumped 4.22 cm lower than the model expected!
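The residual for Athlete 1 (and every other athlete) can be checked with a short Python sketch (illustrative; coefficients are computed at full precision rather than from the rounded slide values). A useful property to verify: ordinary least-squares residuals always sum to zero:

```python
from statistics import mean

strength = [80, 90, 70, 100, 85, 95, 75, 88]   # X, leg strength (kg)
jump     = [42, 53, 44, 49, 52, 48, 40, 56]    # Y, jump height (cm)

mx, my = mean(strength), mean(jump)
b = (sum((x - mx) * (y - my) for x, y in zip(strength, jump))
     / sum((x - mx) ** 2 for x in strength))
a = my - b * mx

# Residual e_i = y_i - y_hat_i (Equation 5) for each athlete
residuals = [y - (a + b * x) for x, y in zip(strength, jump)]
# Athlete 1 (80 kg, 42 cm) has a residual near -4.22 cm,
# and the residuals sum to zero (a property of least squares).
```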
Both methods rely on five key assumptions[1,3]:
1. Linearity The \(X\)–\(Y\) relationship must be approximately linear. → Check: scatterplot and residual plot.
2. Homoscedasticity Variance in \(Y\) is constant across all values of \(X\). Violations produce a funnel shape in residual plots. → Check: residual plot.
3. Independence Each observation must be independent (one data point per participant, or use appropriate repeated-measures methods). → Check: study design.
4. Normality of residuals Residuals should be approximately normally distributed (for inference). Less critical for large samples (Central Limit Theorem). → Check: histogram or Q-Q plot of residuals.
5. No extreme outliers Outliers — especially those with high leverage (extreme \(X\)) and large residuals — can distort \(r\) and regression coefficients. → Check: scatterplot, residual plot, Cook’s distance.
Answer: Homoscedasticity is violated. The funnel shape indicates heteroscedasticity — the variance of residuals increases with the predicted value. This can distort standard errors and confidence intervals. Possible remedies include log-transforming \(Y\), using weighted least squares, or robust regression methods[8].
Outliers can have a disproportionate influence on \(r\) and regression coefficients[1,8].
Types of problematic points:
What to do with outliers:
Just as with hypothesis testing, a statistically significant correlation may not be practically meaningful[4,6].
Examples in Movement Science:
| Scenario | \(r\) | \(p\) | \(r^2\) | Practical interpretation |
|---|---|---|---|---|
| Training volume & VO\(_2\)max | .10 | .02 | 1% | Stat. sig. but trivial (n = 400) |
| Strength & jump height | .85 | .08 | 72% | Large effect, underpowered (n = 8) |
| Balance score & fall risk | .55 | .001 | 30% | Stat. sig. AND meaningful |
Key principle:
Effect size benchmarks for \(r^2\)[10]
Correlation:
“Leg strength was significantly and positively correlated with vertical jump height, \(r(6) = .998\), \(p < .001\).”
Note: \(df = n - 2 = 6\) in parentheses.
Regression:
“A bivariate linear regression revealed that leg strength significantly predicted vertical jump height (\(b = 0.50\), \(\beta = .998\)), \(R^2 = .996\), \(F(1, 6) = 1502.1\), \(p < .001\). For every 1 kg increase in leg strength, jump height increased by 0.50 cm.”
Always include:
APA formatting rules
Misconception 1
❌ incorrect: “\(r = 0\) means there is no relationship between the variables.”
✅ correct: \(r = 0\) means there is no linear relationship. A strong nonlinear (curved) relationship can produce \(r \approx 0\). Always check a scatterplot[1].
Misconception 2
❌ incorrect: “\(r = 0.90\) is twice as strong as \(r = 0.45\).”
✅ correct: \(r\) is not a ratio scale. Compare using \(r^2\): \(0.90^2 = 81\%\) vs. \(0.45^2 = 20\%\) variance explained — a 4× difference, not 2×.
Misconception 3
❌ incorrect: “Non-overlapping confidence intervals for \(r\) confirm the correlations are different.”
✅ correct: Use Fisher’s z-transformation to formally test whether two \(r\) values differ — visual overlap of CIs is not a reliable test.
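Fisher's z-transformation test for comparing two independent correlations can be sketched in Python (illustrative; the example values of \(r = .80\) vs. \(r = .50\) with \(n = 50\) per group are hypothetical):

```python
import math

def fisher_z_test(r1: float, n1: int, r2: float, n2: int):
    """Two-tailed z test of H0: rho1 = rho2 for two independent samples,
    using Fisher's z-transformation z = atanh(r)."""
    z1, z2 = math.atanh(r1), math.atanh(r2)
    se = math.sqrt(1 / (n1 - 3) + 1 / (n2 - 3))   # SE of z1 - z2
    z = (z1 - z2) / se
    # Two-tailed p from the standard normal CDF (via the error function)
    p = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p

z, p = fisher_z_test(0.80, 50, 0.50, 50)
# z ≈ 2.66: the two correlations differ significantly at alpha = .05.
```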
Use this sequence whenever examining the relationship between two continuous variables[1,3]:
| Step | Action | Purpose / details |
|---|---|---|
| 1 | Create a scatterplot | Visualize pattern, outliers, linearity |
| 2 | Compute \(r\) | Quantify strength and direction |
| 3 | Test significance | \(t = r\sqrt{n-2}/\sqrt{1-r^2}\), \(df = n-2\) |
| 4 | Fit regression model (if prediction needed) | \(\hat{y} = a + bx\); report slope, intercept, \(R^2\) |
| 5 | Check assumptions | Residual plot, Q-Q plot |
| 6 | Interpret cautiously | Correlation ≠ causation; report effect sizes |
Important
The goal is not just a number — it is understanding the nature of the relationship and communicating it honestly, including its limitations.
Important
Correlation and regression are powerful descriptive tools — but responsible use requires knowing their limits.
Please complete the Bivariate Correlation Activity for this week before leaving.