Week 8: Regression Analysis

KIN 610 - Spring 2023

Ovande Furtado

Credits

Navarro and Foxcroft (2022)

Simple Linear Regression

Linear Regression Models

  • A way of measuring the relationship between two variables
  • Similar to Pearson correlation, but more powerful
  • Can be used to predict one variable from another

Example: Parenthood Data Set

  • Data set contains measures of sleep and grumpiness for Dani
  • Hypothesis: less sleep leads to more grumpiness
  • Scatterplot shows a strong negative correlation (r = -.90)

Regression Line

  • A straight line that best fits the data
  • Represents the average relationship between the variables
  • Can be used to estimate grumpiness from sleep

How to Draw a Regression Line?

  • The line should go through the middle of the data

  • The line should minimize the vertical distances between the data points and the line

  • The line should have a slope and an intercept that can be calculated from the data

The formula for a straight line

  • Usually written like this: \(y = a + bx\)

  • Two variables: \(x\) and \(y\)

  • Two coefficients: \(a\) and \(b\)

  • Coefficient \(a\) represents the y-intercept of the line

  • Coefficient \(b\) represents the slope of the line

The interpretation of intercept and slope

  • Intercept: the value of \(y\) that you get when \(x\) = 0
  • Slope: the change in \(y\) that you get when you increase \(x\) by 1 unit
  • Positive slope: \(y\) goes up as \(x\) goes up
  • Negative slope: \(y\) goes down as \(x\) goes up
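
A quick worked example with made-up numbers: for the line \(y = 5 + 2x\), setting \(x = 0\) gives \(y = 5\) (the intercept), and moving from \(x = 3\) to \(x = 4\) moves \(y\) from \(5 + 2(3) = 11\) to \(5 + 2(4) = 13\), an increase of 2 (the slope).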

The formula for a Regression line

  • Same as the formula for a straight line, but with some extra notation

  • So if \(y\) is the outcome variable (DV) and \(x\) is the predictor variable (IV), then:

\[\hat{y}_i = b_0 + b_1 x_i\]

\(\hat{y}_i\): the predicted value of the outcome variable (\(y\)) for observation \(i\)

\({y}_i\): the actual value of the outcome variable (\(y\)) for observation \(i\)

\({x}_i\): the value of the predictor variable (\(x\)) for observation \(i\)

\({b}_0\): the estimated intercept of the regression line

\({b}_1\): the estimated slope of the regression line

The assumptions of the regression model

  • We assume that the formula works for all observations in the data set (i.e., for all i)

  • We distinguish between the actual data \({y}_i\) and the estimate \(\hat{y}_i\) (i.e., the prediction that our regression line is making)

  • We use \(b_0\) and \(b_1\) to refer to the coefficients of the regression model

    • \(b_0\): the estimated intercept of the regression line

    • \(b_1\): the estimated slope of the regression line

Residuals of the Regression model

Code
# Generate some example data with a strong negative correlation
set.seed(123)
x <- rnorm(100)
y <- -0.8*x + rnorm(100, sd=0.5)

# Plot the data
plot(x,y)

# Add the best fit line
abline(lm(y ~ x), col="red")

Now, we have the complete linear regression model

\[y_i = b_0 + b_1 x_i + {e}_i\]

  • The data do not fall perfectly on the regression line

  • The difference between the model prediction and the actual data point is called a residual, and we refer to it as \({e}_i\)

  • Mathematically, the residuals are defined as \({e}_i = {y}_i - \hat{y}_i\)

  • The residuals measure how well the regression line fits the data

    • Smaller residuals: better fit
    • Larger residuals: worse fit
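
Continuing with the simulated x and y from the code block above, a minimal sketch of computing the residuals by hand and checking them against R’s built-in residuals():

Code
# Fit the simple regression model to the simulated data
model <- lm(y ~ x)

# Predicted values (y-hat) and hand-computed residuals e = y - y-hat
y_hat <- fitted(model)
e <- y - y_hat

# Matches R's own residuals (up to rounding)
all.equal(e, residuals(model), check.attributes = FALSE)

# Overall size of the residuals: smaller means a better-fitting line
sum(e^2)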

Estimating a linear regression model

  • We want to find the regression line that fits the data best

  • We can measure how well the regression line fits the data by looking at the residuals

  • The residuals are the differences between the actual data and the model predictions

  • Smaller residuals mean better fit, larger residuals mean worse fit

Ordinary least squares regression

  • We use the method of least squares to estimate the regression coefficients

  • The regression coefficients are estimates of the population parameters

  • We use \(\hat{b}_0\) and \(\hat{b}_1\) to denote the estimated coefficients

  • Ordinary least squares (OLS) regression is the most common way to estimate a linear regression model

How to find the estimated coefficients

  • There are formulas to calculate \(\hat{b}_0\) and \(\hat{b}_1\) directly from the data (see the sketch after this list)

  • The formulas involve some algebra and calculus that are not essential to understand the logic of regression

  • We can use jamovi to do all the calculations for us

  • jamovi will also provide other useful information about the regression model
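
For the curious, the simple-regression formulas are short: \(\hat{b}_1 = \mathrm{Cov}(x, y) / \mathrm{Var}(x)\) and \(\hat{b}_0 = \bar{y} - \hat{b}_1 \bar{x}\). A minimal sketch, using the simulated x and y from the earlier code block, checks them against lm():

Code
# Closed-form least-squares estimates for a single predictor
b1_hat <- cov(x, y) / var(x)
b0_hat <- mean(y) - b1_hat * mean(x)
c(intercept = b0_hat, slope = b1_hat)

# The same estimates from R's built-in fitting routine
coef(lm(y ~ x))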

Linear Regression in jamovi

  • We can use jamovi to estimate a linear regression model from the data
  • We need to specify the dependent variable and the covariate(s) in the analysis
  • jamovi will output the estimated coefficients and other statistics

Example: Parenthood data

Data file: parenthood.csv (available in the lsj-data module in jamovi)

Dependent variable: dani.grump (Dani’s grumpiness)

Covariate: dani.sleep (Dani’s hours of sleep)

Estimated intercept: \(\hat{b}_0\) = 125.96

Estimated slope: \(\hat{b}_1\) = -8.94

Regression equation: \(\hat{Y}_i = 125.96+(-8.94 X_i)\)
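
Plugging numbers into this equation gives predictions. For example, on a day when Dani gets 6 hours of sleep:

\[\hat{Y}_i = 125.96 + (-8.94 \times 6) = 125.96 - 53.64 \approx 72.3\]

so the model predicts a grumpiness score of roughly 72.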

Interpreting the estimated model

  • We need to understand what the estimated coefficients mean
  • The slope \(\hat{b}_1\) tells us how much the dependent variable changes when the covariate increases by one unit
  • The intercept \(\hat{b}_0\) tells us what the expected value of the dependent variable is when the covariate is zero

Example: Parenthood data

  • Dependent variable: dani.grump (Dani’s grumpiness)
  • Covariate: dani.sleep (Dani’s hours of sleep)
  • Estimated slope: \(\hat{b}_1\) = -8.94
    • Interpretation: Each additional hour of sleep reduces grumpiness by 8.94 points
  • Estimated intercept: \(\hat{b}_0\) = 125.96
    • Interpretation: If Dani gets zero hours of sleep, her grumpiness will be 125.96 points

Multiple Regression

Introduction

  • We can use more than one predictor variable to explain the variation in the outcome variable

    • Add more terms to our regression equation to represent each predictor variable
  • Each term has a coefficient that indicates how much the outcome variable changes when that predictor variable increases by one unit

Example: Parenthood data

  • Outcome variable: dani.grump (Dani’s grumpiness)

  • Predictor variables: dani.sleep (Dani’s hours of sleep) and baby.sleep (Baby’s hours of sleep)

Regression equation: \(Y_i=b_0+b_1X_{i1}+b_2X_{i2}+\epsilon_i\)

\(Y_i\): Dani’s grumpiness on day \(i\)

\(X_{i1}\): Dani’s hours of sleep on day \(i\)

\(X_{i2}\): Baby’s hours of sleep on day \(i\)

\(b_0\): Intercept

\(b_1\): Coefficient for Dani’s sleep

\(b_2\): Coefficient for Baby’s sleep

\(\epsilon_i\): Error term on day \(i\)
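
For readers curious about the code equivalent, the same model can be fitted with R’s lm() function (jamovi itself is built on R). A minimal sketch, assuming parenthood.csv has been downloaded to the working directory:

Code
# Read the parenthood data (assumes parenthood.csv is in the working directory)
parenthood <- read.csv("parenthood.csv")

# Multiple regression: grumpiness predicted by Dani's sleep and the baby's sleep
mod <- lm(dani.grump ~ dani.sleep + baby.sleep, data = parenthood)

# Estimated intercept (b0) and slopes (b1 for dani.sleep, b2 for baby.sleep)
coef(mod)

# Coefficient table with standard errors, t values, and p-values
summary(mod)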

Estimating the coefficients in multiple regression

  • We want to find the coefficients that minimize the sum of squared residuals
  • Residuals are the differences between the observed and predicted values of the outcome variable
  • We use a similar method as in simple regression, but with more terms in the equation

Doing it in jamovi

  • jamovi can estimate multiple regression models easily
  • We just need to add more variables to the Covariates box in the analysis
  • jamovi will output the estimated coefficients and other statistics for each predictor variable
  • The Table shows the coefficients for dani.sleep and baby.sleep as predictors of dani.grump

Interpreting the coefficients in multiple regression

  • The coefficients tell us how much the outcome variable changes when one predictor variable increases by one unit, holding the other predictor variables constant
  • The larger the absolute value of the coefficient, the stronger the effect of that predictor variable on the outcome variable
  • The sign of the coefficient indicates whether the effect is positive or negative

Example: Parenthood data

  • Coefficient (slope) for dani.sleep: -8.94

    • Interpretation: Each additional hour of sleep reduces Dani’s grumpiness by 8.94 points, regardless of how much sleep the baby gets
  • Coefficient (slope) for baby.sleep: 0.01

    • Interpretation: Each additional hour of sleep for the baby increases Dani’s grumpiness by 0.01 points, regardless of how much sleep Dani gets

Quantifying the fit of the regression model

  • We want to know how well our regression model predicts the outcome variable

  • We can compare the predicted values ( \(\hat{Y}_i\) ) to the observed values ( \(Y_i\) ) using two sums of squares

    • Residual sum of squares ( \(SS_{res}\) ): measures how much error there is in our predictions

    • Total sum of squares ( \(SS_{tot}\) ): measures how much variability there is in the outcome variable

The \(R^2\) value (effect size)

  • The \(R^2\) value is a proportion that tells us how much of the variability in the outcome variable is explained by our regression model

  • It is calculated as:

\[R^2=1-\frac{SS_{res}}{SS_{tot}}\]

  • It ranges from 0 to 1, with higher values indicating better fit

  • It can be interpreted as the percentage of variance explained by our regression model
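
A minimal sketch of the calculation, using the simulated x and y from the earlier code block:

Code
# Fit the simple regression and pull out the two sums of squares
model <- lm(y ~ x)
ss_res <- sum(residuals(model)^2)   # residual sum of squares
ss_tot <- sum((y - mean(y))^2)      # total sum of squares

# Proportion of variance explained
1 - ss_res / ss_tot

# Should match the R-squared reported by summary(), and (with one predictor)
# the squared Pearson correlation
summary(model)$r.squared
cor(x, y)^2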

The relationship between regression and correlation

  • Regression and correlation are both ways of measuring the strength and direction of a linear relationship between two variables

  • For a simple regression model with one predictor variable, the \(R^2\) value is equal to the square of the Pearson correlation coefficient (\(r^2\))

    • Running a Pearson correlation is equivalent to running a simple linear regression model

The adjusted \(R^2\) value

  • The adjusted \(R^2\) value is a modified version of the \(R^2\) value that takes into account the number of predictors in the model
    • The adjusted \(R^2\) value adjusts for the degrees of freedom in the model
  • It increases only if adding a predictor improves the model more than expected by chance
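
Concretely, the adjustment divides each sum of squares by its degrees of freedom:

\[R^2_{adj} = 1 - \frac{SS_{res} / (N - K - 1)}{SS_{tot} / (N - 1)}\]

where \(N\) is the number of observations and \(K\) is the number of predictors.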

Which one to report: \(R^2\) or adjusted \(R^2\)?

  • There is no definitive answer to this question
  • It depends on your preference and your research question
  • Some factors to consider are:
    • Interpretability: \(R^2\) is easier to understand and explain

    • Bias correction: Adjusted \(R^2\) is less likely to overestimate the model performance

    • Hypothesis testing: There are other ways to test if adding a predictor improves the model significantly

Hypothesis tests for regression models

  • We can use hypothesis tests to evaluate the significance of our regression model and its coefficients
  • There are two types of hypothesis tests for regression models:
    • Testing the model as a whole: Is there any relationship between the predictors and the outcome?

    • Testing a specific coefficient: Is a particular predictor significantly related to the outcome?

Test the model as a whole

\(H_0\): there is no relationship between the predictors and the outcome

\(H_a\): at least one predictor is related to the outcome (i.e., the data follow the regression model)

\[F=\frac{(R^2/K)}{(1-R^2)/(N-K-1)}\]

  • where \(R^2\) is the proportion of variance explained by our model, \(K\) is the number of predictors, and \(N\) is the number of observations
  • The F-test statistic follows an F-distribution with \(K\) and \(N-K-1\) degrees of freedom
  • We can use a p-value to determine if our F-test statistic is significant
  • jamovi can do this for us!
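
A minimal sketch of the arithmetic behind this test, using the parenthood model’s values (\(K = 2\) predictors, \(N = 100\) observations, and an unrounded \(R^2\) of about .816, consistent with the \(F(2,97) = 215.24\) reported in the conclusion below):

Code
# F statistic for the whole model: F = (R^2 / K) / ((1 - R^2) / (N - K - 1))
R2 <- 0.8161   # unrounded R^2 for the two-predictor parenthood model
K <- 2         # number of predictors
N <- 100       # number of observations

F_stat <- (R2 / K) / ((1 - R2) / (N - K - 1))
F_stat   # about 215.2

# p-value from the F distribution with K and N - K - 1 degrees of freedom
pf(F_stat, df1 = K, df2 = N - K - 1, lower.tail = FALSE)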

Tests for Individual Coefficients

  • The F-test checks if the model as a whole is performing better than chance

  • If the F-test is not significant, then the regression model may not be good

  • However, passing the F-test does not imply that the model is good

Example of Multiple Linear Regression

  • In a multiple linear regression model with baby.sleep and dani.sleep as predictors:

    • The estimated regression coefficient for baby.sleep is small (0.01) compared to dani.sleep (-8.95)

    • This suggests that only dani.sleep matters in predicting grumpiness

Hypothesis Testing for Regression Coefficients

  • A t-test can be used to test if a regression coefficient is significantly different from zero

\(H_0\): b = 0 (the true regression coefficient is zero)

\(H_a\): b ≠ 0 (the true regression coefficient is not zero)
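
Concretely, each coefficient’s test statistic is the estimate divided by its standard error,

\[t = \frac{\hat{b}}{SE(\hat{b})}\]

which is compared against a \(t\) distribution with \(N - K - 1\) degrees of freedom; jamovi reports the resulting p-value for each coefficient.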

Running Hypothesis Tests in Jamovi

  • To compute these statistics, check the relevant options and run the regression in jamovi

  • See result in the next slide

Output

Interpretation

Conclusion

  • The current regression model may not be the best fit for the data
  • Dropping the baby.sleep predictor entirely may improve the model
  • The model performs significantly better than chance
    • \(F(2,97) = 215.24\), \(p < .001\)

    • \(R^2 = .81\), indicating that the regression model accounts for 81% of the variability in the outcome measure

  • Individual Coefficients
    • baby.sleep variable has no significant effect

    • All work in this model is being done by the dani.sleep variable

Assumptions of Regression

The linear regression model relies on several assumptions.

  • Linearity: The relationship between X and Y is assumed to be linear.

  • Independence: Residuals are assumed to be independent of each other.

  • Normality: The residuals are assumed to be normally distributed.

  • Equality of Variance: The standard deviation of the residuals is assumed to be the same for all values of \(\hat{Y}\).

Assumptions of Regression, cont.

Also…

  • Uncorrelated Predictors: In a multiple regression model, predictors should not be too strongly correlated with each other.

    • Strongly correlated predictors (collinearity) can cause problems when evaluating the model.
  • No “Bad” Outliers: The regression model should not be too strongly influenced by one or two anomalous data points.

    • Anomalous data points can raise questions about the adequacy of the model and trustworthiness of data.

Diagnostics

Checking for linearity

Checking Linearity

  • It is important to check for the linearity of relationships between predictors and outcomes.

Plotting Relationships

  • One way to check for linearity is to plot the relationship between predicted values and observed values for the outcome variable.

Using Jamovi

  • In Jamovi, you can save predicted values to the dataset and then draw a scatterplot of observed against predicted (fitted) values.

Interpreting Results

  • If the plot looks approximately linear, then it suggests that your model is not doing too badly. However, if there are big departures from linearity, it suggests that changes need to be made.

Checking for linearity, cont.

To get a more detailed picture of linearity, it can be helpful to look at the relationship between predicted values and residuals.

Using Jamovi

  • In Jamovi, you can save residuals to the dataset and then draw a scatterplot of predicted values against residual values.

Interpreting Results

  • Ideally, the relationship between predicted values and residuals should be a straight, perfectly horizontal line. In practice, we’re looking for a reasonably straight or flat line. This is a matter of judgement.
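
Both plots can also be drawn in base R. A minimal sketch, assuming parenthood.csv is in the working directory and refitting the two-predictor model from earlier:

Code
# Refit the parenthood model (see the multiple regression sketch above)
parenthood <- read.csv("parenthood.csv")
mod <- lm(dani.grump ~ dani.sleep + baby.sleep, data = parenthood)

# Observed vs predicted (fitted) values: the cloud should look roughly linear
plot(fitted(mod), parenthood$dani.grump,
     xlab = "Fitted values", ylab = "Observed dani.grump")
abline(0, 1, col = "red")   # reference line for perfect prediction

# Predicted (fitted) values vs residuals: should look flat and patternless
plot(fitted(mod), residuals(mod),
     xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, col = "red")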

Checking for normality (residuals)

Regression models rely on a normality assumption: the residuals should be normally distributed.

Using Jamovi

  • In Jamovi, you can draw a Q-Q plot via the ‘Assumption Checks’ - ‘Q-Q plot of residuals’ option.

Interpreting Results

  • The output shows the standardized residuals plotted as a function of their theoretical quantiles according to the regression model. The dots should be somewhat near the line.
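
The equivalent check in base R, using the mod object from the linearity sketch above (rstandard() gives the standardized residuals that jamovi plots):

Code
# Q-Q plot of standardized residuals: dots should fall near the reference line
qqnorm(rstandard(mod))
qqline(rstandard(mod))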

Checking for normality (residuals), cont.

Checking Relationship between Predicted Values and Residuals

  • In Jamovi, you can use the ‘Residuals Plots’ option to check the relationship between predicted values and residuals.
  • The output provides a scatterplot for each predictor variable, the outcome variable, and the predicted values against residuals.

Interpreting Results

  • We are looking for a fairly uniform distribution of dots with no clear bunching or patterning.

    • Ideally, the dots should be fairly evenly spread across the whole plot.

  • Issues with the relationship between predicted values and residuals?

    • Transform one or more of the variables (Box-Cox Transform in jamovi)

Checking for equality of variance

Regression models make an assumption of equality (homogeneity) of variance.

  • This means that the variance of the residuals is assumed to be constant.

Plotting Equality of Variance in Jamovi

  • To check this assumption in Jamovi, first calculate the square root of the absolute size of the residual.
    • Compute this new variable using the formula SQRT(ABS(Residuals))
  • Then plot this against the predicted values.
  • The plot should show a straight, horizontal line running through the middle (an equivalent R sketch follows this list).
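
The same plot in base R, using the mod object from earlier; sqrt() and abs() mirror the jamovi formula SQRT(ABS(Residuals)):

Code
# Square root of the absolute residuals against fitted values;
# a roughly flat, horizontal trend suggests constant residual variance
plot(fitted(mod), sqrt(abs(residuals(mod))),
     xlab = "Fitted values", ylab = "sqrt(|residual|)")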

Checking for Collinearity

  • Variance Inflation Factors (VIFs) can be used to determine if predictors in a regression model are too highly correlated with each other.
    • Each predictor has an associated VIF.
  • In Jamovi, click on the ‘Collinearity’ checkbox in the ‘Regression’ - ‘Assumptions’ options to see VIF values.
  • Interpreting VIF
    • A VIF of 1 means no correlation between that predictor and the remaining predictor variables
    • VIFs exceeding 4 warrant further investigation
    • VIFs exceeding 10 are signs of serious multicollinearity requiring correction
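
Behind the scenes, each VIF comes from regressing one predictor on all the others: \(VIF_k = 1 / (1 - R_k^2)\), where \(R_k^2\) is the \(R^2\) from that auxiliary regression. A minimal sketch for the two-predictor parenthood model (with only two predictors, both VIFs are identical); the car package’s vif() function does the same job in general:

Code
# Auxiliary regression: dani.sleep predicted by the other predictor
r2_aux <- summary(lm(dani.sleep ~ baby.sleep, data = parenthood))$r.squared

# Variance inflation factor for dani.sleep
1 / (1 - r2_aux)

# For models with more predictors, car::vif(mod) reports one VIF per predictor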

Checking for outliers

  • Cook’s distance is used in regression analysis to identify influential data points that may negatively affect your regression model
  • Datasets with a large number of highly influential points might not be suitable for linear regression without further processing such as outlier removal or imputation
  • Identifying Outliers
    • A general rule of thumb: Cook’s distance greater than 1 is often considered large
  • What if the value is greater than 1?
    • Remove the outlier and run the regression again

    • How? In jamovi you can save the Cook’s distance values to the dataset, then draw a boxplot of them to identify the specific outliers (an equivalent R sketch follows).
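
In base R, Cook’s distances come straight from the fitted model; a minimal sketch using the mod object from earlier:

Code
# Cook's distance for each observation
cd <- cooks.distance(mod)

# Rule of thumb: values greater than 1 deserve a closer look
which(cd > 1)

# Quick visual check of the distances
boxplot(cd, ylab = "Cook's distance")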

References

Navarro, Danielle J., and David R. Foxcroft. 2022. Learning Statistics with Jamovi: A Tutorial for Psychology Students and Other Beginners (Version 0.75). https://doi.org/10.24384/HGC3-7P15.