KIN 610 - Spring 2023
Navarro and Foxcroft (2022)
The line should go through the middle of the data
The line should minimize the vertical distances between the data points and the line
The line should have a slope and an intercept that can be calculated from the data
Usually written like this: \(y = a + bx\)
Two variables: \(x\) and \(y\)
Two coefficients: \(a\) and \(b\)
Coefficient \(a\) represents the y-intercept of the line
Coefficient \(b\) represents the slope of the line
Same as the formula for a straight line, but with some extra notation
So if \(y\) is the outcome variable (DV) and \(x\) is the predictor variable (IV), then:
\[\hat{y}_i = b_0 + b_1 x_i\]
\(\hat{y}_i\): the predicted value of the outcome variable (\(y\)) for observation \(i\)
\({y}_i\): the actual value of the outcome variable (\(y\)) for observation \(i\)
\({x}_i\): the value of the predictor variable (\(x\)) for observation \(i\)
\({b}_0\): the estimated intercept of the regression line
\({b}_1\): the estimated slope of the regression line
We assume that the formula works for all observations in the data set (i.e., for all i)
We distinguish between the actual data \({y}_i\) and the estimate \(\hat{y}_i\) (i.e., the prediction that our regression line is making)
We use \(b_0\) and \(b_1\) to refer to the coefficients of the regression model
\(b_0\): the estimated intercept of the regression line
\(b_1\): the estimated slope of the regression line
Now, we have the complete linear regression model
\[{y}_i = b_0 + b_1 x_i + {e}_i\]
The data do not fall perfectly on the regression line
The difference between the model prediction and the actual data point is called a residual, and we refer to it as \({e}_i\)
Mathematically, the residuals are defined as \({e}_i = {y}_i - \hat{y}_i\)
The residuals measure how well the regression line fits the data
We want to find the regression line that fits the data best
We can measure how well the regression line fits the data by looking at the residuals
The residuals are the differences between the actual data and the model predictions
Smaller residuals mean better fit, larger residuals mean worse fit
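The idea above can be sketched in a few lines of Python. The data and the candidate line here are made up for illustration; jamovi does this work internally:

```python
import numpy as np

x = np.array([5.0, 6.0, 7.0, 8.0, 9.0])      # hypothetical predictor values
y = np.array([90.0, 80.0, 68.0, 55.0, 45.0])  # hypothetical outcome values

b0, b1 = 145.0, -11.0          # a hand-picked candidate intercept and slope
y_hat = b0 + b1 * x            # model predictions
residuals = y - y_hat          # e_i = y_i - y_hat_i

# The sum of squared residuals summarizes how well this line fits:
ss_res = np.sum(residuals ** 2)
```

Trying different values of `b0` and `b1` changes `ss_res`; the best-fitting line is the one that makes it as small as possible.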
We use the method of least squares to estimate the regression coefficients
The regression coefficients are estimates of the population parameters
We use \(\hat{b}_0\) and \(\hat{b}_1\) to denote the estimated coefficients
Ordinary least squares (OLS) regression is the most common way to estimate a linear regression model
There are formulas to calculate \(\hat{b}_0\) and \(\hat{b}_1\) from the data
The formulas involve some algebra and calculus that are not essential to understand the logic of regression
We can use jamovi to do all the calculations for us
jamovi will also provide other useful information about the regression model
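For the curious, the closed-form least-squares formulas for simple regression can be sketched directly (the data here are made up; jamovi computes the same quantities):

```python
import numpy as np

x = np.array([5.0, 6.0, 7.0, 8.0, 9.0])      # made-up predictor values
y = np.array([90.0, 80.0, 68.0, 55.0, 45.0])  # made-up outcome values

# Slope: covariance-like sum over variance-like sum of the predictor
b1_hat = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
# Intercept: forces the line through the point of means (x-bar, y-bar)
b0_hat = y.mean() - b1_hat * x.mean()
```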
In jamovi, specify the dependent variable and the covariate(s) in the analysis
Data file: parenthood.csv (found in the lsj-data module in jamovi)
Dependent variable: dani.grump (Dani’s grumpiness)
Covariate: dani.sleep (Dani’s hours of sleep)
Estimated intercept: \(\hat{b}_0\) = 125.96
Estimated slope: \(\hat{b}_1\) = -8.94
Regression equation: \(\hat{Y}_i = 125.96 - 8.94\, X_i\)
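As a quick illustration, we can plug an hours-of-sleep value into the fitted equation from the slide (the helper function name is just for this sketch):

```python
# Fitted equation from the slides: grumpiness_hat = 125.96 - 8.94 * sleep
def predict_grumpiness(hours_of_sleep):
    return 125.96 - 8.94 * hours_of_sleep

predicted = predict_grumpiness(8.0)  # predicted grumpiness after 8 hours of sleep
```

With 8 hours of sleep the model predicts a grumpiness of about 54.4 points.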
Slope: how much the dependent variable changes when the covariate increases by one unit
Intercept: the value of the dependent variable when the covariate is zero
Dependent variable: dani.grump (Dani’s grumpiness)
Covariate: dani.sleep (Dani’s hours of sleep)
Interpretation: each additional hour of sleep reduces grumpiness by 8.94 points
With zero hours of sleep, predicted grumpiness is 125.96 points
We can use more than one predictor variable to explain the variation in the outcome variable
Each term has a coefficient that indicates how much the outcome variable changes when that predictor variable increases by one unit
Outcome variable: dani.grump
(Dani’s grumpiness)
Predictor variables: dani.sleep
(Dani’s hours of sleep) and baby.sleep
(Baby’s hours of sleep)
Regression equation: \(Y_i=b_0+b_1X_{i1}+b_2X_{i2}+\epsilon_i\)
\(Y_i\): Dani’s grumpiness on day \(i\)
\(X_{i1}\): Dani’s hours of sleep on day \(i\)
\(X_{i2}\): Baby’s hours of sleep on day \(i\)
\(b_0\): Intercept
\(b_1\): Coefficient for Dani’s sleep
\(b_2\): Coefficient for Baby’s sleep
\(\epsilon_i\): Error term on day \(i\)
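A minimal sketch of how a model like this is estimated, using made-up data generated from an exact linear rule so the recovered coefficients are easy to check (jamovi's estimates come from the same least-squares machinery):

```python
import numpy as np

# Made-up data constructed from grump = 130 - 9*dani_sleep - 0.5*baby_sleep
dani_sleep = np.array([7.0, 5.0, 8.0, 6.0, 9.0, 4.0])
baby_sleep = np.array([10.0, 6.0, 11.0, 8.0, 12.0, 5.0])
grump      = np.array([62.0, 82.0, 52.5, 72.0, 43.0, 91.5])

# Design matrix: a column of ones (for the intercept) plus the two predictors
X = np.column_stack([np.ones_like(dani_sleep), dani_sleep, baby_sleep])
coef, *_ = np.linalg.lstsq(X, grump, rcond=None)
b0, b1, b2 = coef  # intercept, dani.sleep slope, baby.sleep slope
```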
Same idea as simple regression, but with more terms in the equation
In jamovi, add the predictor variables to the Covariates box in the analysis
Each coefficient tells us how much the outcome variable changes when one predictor variable increases by one unit, holding the other predictor variables constant
The larger the absolute value of the coefficient, the stronger the effect of that predictor variable on the outcome variable
Coefficient (slope) for dani.sleep: -8.94
Each extra hour of Dani’s sleep reduces Dani’s grumpiness by 8.94 points, regardless of how much sleep the baby gets
Coefficient (slope) for baby.sleep: 0.01
Each extra hour of the baby’s sleep increases Dani’s grumpiness by 0.01 points, regardless of how much sleep Dani gets
We want to know how well our regression model predicts the outcome variable
We can compare the predicted values ( \(\hat{Y}_i\) ) to the observed values ( \(Y_i\) ) using two sums of squares
Residual sum of squares ( \(SS_{res}\) ): measures how much error there is in our predictions
Total sum of squares ( \(SS_{tot}\) ): measures how much variability there is in the outcome variable
The \(R^2\) value is a proportion that tells us how much of the variability in the outcome variable is explained by our regression model
It is calculated as:
\[R^2=1-\frac{SS_{res}}{SS_{tot}}\]
It ranges from 0 to 1, with higher values indicating better fit
It can be interpreted as the percentage of variance explained by our regression model
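The \(R^2\) formula can be computed directly from the two sums of squares; the observed and predicted values here are made up for illustration:

```python
import numpy as np

y     = np.array([60.0, 82.0, 55.0, 71.0, 47.0])  # made-up observed values
y_hat = np.array([62.0, 80.0, 54.0, 70.0, 49.0])  # made-up model predictions

ss_res = np.sum((y - y_hat) ** 2)        # error left over after prediction
ss_tot = np.sum((y - y.mean()) ** 2)     # total variability in the outcome
r_squared = 1 - ss_res / ss_tot
```

Here the residual error is small relative to the total variability, so \(R^2\) is close to 1.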
Regression and correlation are both ways of measuring the strength and direction of a linear relationship between two variables
For a simple regression model with one predictor variable, the \(R^2\) value is equal to the square of the Pearson correlation coefficient (\(r^2\))
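This equality is easy to verify numerically on made-up data:

```python
import numpy as np

x = np.array([5.0, 6.0, 7.0, 8.0, 9.0])      # made-up predictor
y = np.array([90.0, 80.0, 68.0, 55.0, 45.0])  # made-up outcome

# Fit the simple regression by least squares and compute R^2
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x
r_squared = 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)

# Pearson correlation between x and y
r = np.corrcoef(x, y)[0, 1]
# r ** 2 matches r_squared (up to floating-point rounding)
```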
Adjusted \(R^2\) increases only if adding a predictor improves the model more than expected by chance
Interpretability: \(R^2\) is easier to understand and explain
Bias correction: Adjusted \(R^2\) is less likely to overestimate the model performance
Hypothesis testing: There are other ways to test if adding a predictor improves the model significantly
significance
of our regression model and its coefficients
Testing the model as a whole
: Is there any relationship between the predictors and the outcome?
Testing a specific coefficient
: Is a particular predictor significantly related to the outcome?
\(H_0\): there is no relationship between the predictors and the outcome
\(H_a\): data follow the regression model
\[F=\frac{(R^2/K)}{(1-R^2)/(N-K-1)}\]
We use the p-value to determine if our F-test statistic is significant
The F-test checks if the model as a whole is performing better than chance
If the F-test is not significant, then the regression model may not be good
However, passing the F-test does not imply that the model is good
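The F formula above can be written as a small helper; the inputs in the example call are illustrative, not the parenthood data:

```python
# F statistic from R^2, number of predictors K, and sample size N:
# F = (R^2 / K) / ((1 - R^2) / (N - K - 1))
def f_statistic(r_squared, K, N):
    return (r_squared / K) / ((1 - r_squared) / (N - K - 1))

f_value = f_statistic(0.5, 2, 100)  # e.g., two predictors, 100 observations
```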
In a multiple linear regression model with baby.sleep and dani.sleep as predictors:
The estimated regression coefficient for baby.sleep is small (0.01) compared to dani.sleep (-8.95)
This suggests that only dani.sleep matters in predicting grumpiness
\(H_0\): b = 0 (the true regression coefficient is zero)
\(H_a\): b ≠ 0 (the true regression coefficient is not zero)
To compute statistics, check relevant options and run regression in jamovi
See result in the next slide
Conclusion: removing the baby.sleep predictor entirely may improve the model
\(F(2,97) = 215.24\), \(p < .001\)
\(R^2 = .81\) value indicates that the regression model accounts for 81% of the variability in the outcome measure
The baby.sleep variable has no significant effect
All the work in this model is being done by the dani.sleep variable
The linear regression model relies on several assumptions.
Linearity: The relationship between X and Y is assumed to be linear.
Independence: Residuals are assumed to be independent of each other.
Normality: The residuals are assumed to be normally distributed.
Equality of Variance: The standard deviation of the residual is assumed to be the same for all values of \(\hat{Y}\).
Also…
Uncorrelated Predictors: In a multiple regression model, predictors should not be too strongly correlated with each other.
No “Bad” Outliers: The regression model should not be too strongly influenced by one or two anomalous data points.
Checking Linearity
Plotting Relationships: plot the predicted values against the observed values for the outcome variable
Using Jamovi
Interpreting Results
To get a more detailed picture of linearity, it can be helpful to look at the relationship between predicted values and residuals.
Using Jamovi
Save the residuals to the dataset and then draw a scatterplot of predicted values against residual values.
Interpreting Results
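One useful fact for interpreting this plot: for a least-squares fit, the residuals are uncorrelated with the predicted values by construction, so any visible trend in the scatterplot signals a problem with the model. A small numerical check on made-up data:

```python
import numpy as np

x = np.array([5.0, 6.0, 7.0, 8.0, 9.0])      # made-up predictor
y = np.array([90.0, 80.0, 68.0, 55.0, 45.0])  # made-up outcome

# Fit by least squares, then form predicted values and residuals
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x
residuals = y - y_hat

# Correlation between predictions and residuals is zero up to rounding
corr = np.corrcoef(y_hat, residuals)[0, 1]
```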
Regression models rely on a normality assumption: the residuals should be normally distributed.
Using Jamovi
Interpreting Results
Checking Relationship between Predicted Values and Residuals
Plot the predictor variable, the outcome variable, and the predicted values against the residuals.
Interpreting Results
We are looking for a fairly uniform distribution of dots with no clear bunching or patterning.
Issues with the relationship between predicted values and residuals?
Regression models make an assumption of equality (homogeneity) of variance.
Plotting Equality of Variance in Jamovi
Save the residuals, create a computed variable SQRT(ABS(Residuals)), and plot it against the predicted values
remove the outlier and run the regression again
How? In jamovi you can save the Cook’s distance values to the dataset, then draw a boxplot of the Cook’s distance values to identify the specific outliers.
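A sketch of what those Cook's distance values measure, computed by hand on made-up data with one deliberately anomalous point. The formula combines each point's squared residual with its leverage (the diagonal of the hat matrix):

```python
import numpy as np

x = np.array([5.0, 6.0, 7.0, 8.0, 9.0, 2.0])          # made-up predictor
y = np.array([90.0, 80.0, 68.0, 55.0, 45.0, 30.0])    # last point is anomalous

X = np.column_stack([np.ones_like(x), x])   # design matrix (intercept + slope)
H = X @ np.linalg.inv(X.T @ X) @ X.T        # hat matrix
h = np.diag(H)                              # leverage of each observation

beta = np.linalg.lstsq(X, y, rcond=None)[0]
residuals = y - X @ beta
p = X.shape[1]                              # number of coefficients
mse = np.sum(residuals ** 2) / (len(y) - p)

# Cook's distance: D_i = (e_i^2 / (p * MSE)) * h_i / (1 - h_i)^2
cooks_d = (residuals ** 2 / (p * mse)) * (h / (1 - h) ** 2)
```

The anomalous last point combines a large residual with high leverage, so its Cook's distance stands far above the others, which is exactly the pattern the boxplot in jamovi helps you spot.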