
KIN 610 - Spring 2023
Navarro and Foxcroft (2022)

The line should go through the middle of the data
The line should minimize the vertical distances between the data points and the line
The line should have a slope and an intercept that can be calculated from the data
Usually written like this: \(y = a + bx\)
Two variables: \(x\) and \(y\)
Two coefficients: \(a\) and \(b\)
Coefficient \(a\) represents the y-intercept of the line
Coefficient \(b\) represents the slope of the line
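A quick worked example (illustrative numbers only): with \(a = 2\) and \(b = 3\), the input \(x = 4\) gives \(y = 2 + 3 \times 4 = 14\)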

Same as the formula for a straight line, but with some extra notation
So if \(y\) is the outcome variable (DV) and \(x\) is the predictor variable (IV), then:
\[\hat{y}_i = b_0 + b_1 x_i\]
\(\hat{y}_i\): the predicted value of the outcome variable (\(y\)) for observation \(i\)
\({y}_i\): the actual value of the outcome variable (\(y\)) for observation \(i\)
\({x}_i\): the value of the predictor variable (\(x\)) for observation \(i\)
\({b}_0\): the estimated intercept of the regression line
\({b}_1\): the estimated slope of the regression line
We assume that the formula works for all observations in the data set (i.e., for all \(i\))
We distinguish between the actual data \({y}_i\) and the estimate \(\hat{y}_i\) (i.e., the prediction that our regression line is making)
We use \(b_0\) and \(b_1\) to refer to the coefficients of the regression model
\(b_0\): the estimated intercept of the regression line
\(b_1\): the estimated slope of the regression line

Now, we have the complete linear regression model
\[y_i = b_0 + b_1 x_i + e_i\]
The data do not fall perfectly on the regression line
The difference between the model prediction and the actual data point is called a residual, and we refer to it as \(e_i\)
Mathematically, the residuals are defined as \({e}_i = {y}_i - \hat{y}_i\)
The residuals measure how well the regression line fits the data
We want to find the regression line that fits the data best
We can measure how well the regression line fits the data by looking at the residuals
The residuals are the differences between the actual data and the model predictions
Smaller residuals mean better fit, larger residuals mean worse fit
We use the method of least squares to estimate the regression coefficients
The regression coefficients are estimates of the population parameters
We use \(\hat{b}_0\) and \(\hat{b}_1\) to denote the estimated coefficients
Ordinary least squares (OLS) regression is the most common way to estimate a linear regression model
There are formulas to calculate \(\hat{b}_0\) and \(\hat{b}_1\) from the data
The formulas involve some algebra and calculus that are not essential to understand the logic of regression
We can use jamovi to do all the calculations for us
jamovi will also provide other useful information about the regression model
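For readers curious about the arithmetic jamovi carries out, here is a minimal Python sketch (hypothetical numbers, not the parenthood data) applying the standard closed-form least-squares formulas:

```python
import numpy as np

# Hypothetical example data: hours of sleep (x) and grumpiness (y)
x = np.array([6.0, 7.0, 5.5, 8.0, 6.5, 7.5])
y = np.array([82.0, 70.0, 95.0, 60.0, 85.0, 66.0])

# Standard closed-form estimates:
#   b1_hat = sum((x - xbar)(y - ybar)) / sum((x - xbar)^2)
#   b0_hat = ybar - b1_hat * xbar
b1_hat = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0_hat = y.mean() - b1_hat * x.mean()

y_hat = b0_hat + b1_hat * x    # model predictions
residuals = y - y_hat          # e_i = y_i - y_hat_i

print(f"intercept = {b0_hat:.2f}, slope = {b1_hat:.2f}")
```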
In jamovi, specify the dependent variable and the covariate(s) in the analysis
Data file: parenthood.csv (found in module lsj data in jamovi)
Dependent variable: dani.grump (Dani’s grumpiness)
Covariate: dani.sleep (Dani’s hours of sleep)
Estimated intercept: \(\hat{b}_0\) = 125.96
Estimated slope: \(\hat{b}_1\) = -8.94
Regression equation: \(\hat{Y}_i = 125.96+(-8.94 X_i)\)
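For example, plugging in \(X_i = 6\) hours of sleep gives a predicted grumpiness of \(125.96 - 8.94 \times 6 \approx 72.3\)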
The slope tells us how much the dependent variable changes when the covariate increases by one unit
The intercept tells us what the dependent variable is when the covariate is zero
Dependent variable: dani.grump (Dani’s grumpiness)
Covariate: dani.sleep (Dani’s hours of sleep)
Each additional hour of sleep reduces grumpiness by 8.94 points
With zero hours of sleep, predicted grumpiness is 125.96 points
We can use more than one predictor variable to explain the variation in the outcome variable
Each term has a coefficient that indicates how much the outcome variable changes when that predictor variable increases by one unit
Outcome variable: dani.grump (Dani’s grumpiness)
Predictor variables: dani.sleep (Dani’s hours of sleep) and baby.sleep (Baby’s hours of sleep)
Regression equation: \(Y_i=b_0+b_1X_{i1}+b_2X_{i2}+\epsilon_i\)
\(Y_i\): Dani’s grumpiness on day \(i\)
\(X_{i1}\): Dani’s hours of sleep on day \(i\)
\(X_{i2}\): Baby’s hours of sleep on day \(i\)
\(b_0\): Intercept
\(b_1\): Coefficient for Dani’s sleep
\(b_2\): Coefficient for Baby’s sleep
\(\epsilon_i\): Error term on day \(i\)
The multiple regression model works the same as simple regression, but with more terms in the equation
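Under the hood, adding a predictor just adds a column to the least-squares computation. A minimal Python sketch (hypothetical numbers; the course itself uses jamovi):

```python
import numpy as np

# Hypothetical data: Dani's sleep (x1), baby's sleep (x2), grumpiness (y)
x1 = np.array([6.0, 7.0, 5.5, 8.0, 6.5])
x2 = np.array([8.0, 10.0, 6.0, 11.0, 9.0])
y = np.array([82.0, 68.0, 96.0, 61.0, 78.0])

# Design matrix: a column of ones (intercept b0) plus one column per predictor
X = np.column_stack([np.ones_like(x1), x1, x2])

# Least-squares estimates of [b0, b1, b2]
b, *_ = np.linalg.lstsq(X, y, rcond=None)
print(b)
```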
In jamovi, add both predictors to the Covariates box in the analysis
Each coefficient indicates how much the outcome variable changes when one predictor variable increases by one unit, holding the other predictor variables constant
The larger the absolute value of the coefficient, the stronger the effect of that predictor variable on the outcome variable
Coefficient (slope) for dani.sleep: -8.94
Each extra hour of Dani’s sleep reduces Dani’s grumpiness by 8.94 points, regardless of how much sleep the baby gets
Coefficient (slope) for baby.sleep: 0.01
Each extra hour of the baby’s sleep increases Dani’s grumpiness by 0.01 points, regardless of how much sleep Dani gets
We want to know how well our regression model predicts the outcome variable
We can compare the predicted values ( \(\hat{Y}_i\) ) to the observed values ( \(Y_i\) ) using two sums of squares
Residual sum of squares ( \(SS_{res}\) ): measures how much error there is in our predictions
Total sum of squares ( \(SS_{tot}\) ): measures how much variability there is in the outcome variable
The \(R^2\) value is a proportion that tells us how much of the variability in the outcome variable is explained by our regression model
It is calculated as:
\[R^2=1-\frac{SS_{res}}{SS_{tot}}\]
It ranges from 0 to 1, with higher values indicating better fit
It can be interpreted as the percentage of variance explained by our regression model
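For reference, the two sums of squares have the standard definitions (with \(\bar{Y}\) the mean of the observed outcome):
\[SS_{res} = \sum_{i} \left( Y_i - \hat{Y}_i \right)^2 \qquad SS_{tot} = \sum_{i} \left( Y_i - \bar{Y} \right)^2\]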
Regression and correlation are both ways of measuring the strength and direction of a linear relationship between two variables
For a simple regression model with one predictor variable, the \(R^2\) value is equal to the square of the Pearson correlation coefficient (\(r^2\))
Adjusted \(R^2\) only increases if adding a predictor improves the model more than expected by chance
Interpretability: \(R^2\) is easier to understand and explain
Bias correction: Adjusted \(R^2\) is less likely to overestimate the model performance
Hypothesis testing: There are other ways to test if adding a predictor improves the model significantly
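For reference, the adjusted \(R^2\) has the standard definition (with \(N\) observations and \(K\) predictors):
\[\text{adj. } R^2 = 1 - \left( \frac{SS_{res}}{SS_{tot}} \times \frac{N-1}{N-K-1} \right)\]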
We can test the statistical significance of our regression model and its coefficients
Testing the model as a whole: Is there any relationship between the predictors and the outcome?
Testing a specific coefficient: Is a particular predictor significantly related to the outcome?
\(H_0\): there is no relationship between the predictors and the outcome
\(H_a\): the data follow the regression model
\[F=\frac{R^2/K}{(1-R^2)/(N-K-1)}\]
where \(K\) is the number of predictors and \(N\) is the number of observations
We compare the F statistic to its sampling distribution to get a p-value and determine if our F-test statistic is significant
The F-test checks if the model as a whole is performing better than chance
If the F-test is not significant, then the regression model may not be good
However, passing the F-test does not imply that the model is good
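As a quick sanity check, the formula above can be evaluated directly. A minimal Python sketch, plugging in the rounded values from the worked example later in this deck (\(R^2 = .81\), \(K = 2\), \(N = 100\)):

```python
# F statistic computed from R^2 (rounded values assumed from the example in this deck)
R2, K, N = 0.81, 2, 100
F = (R2 / K) / ((1 - R2) / (N - K - 1))
print(round(F, 2))  # ~206.76; jamovi reports F = 215.24 because it uses the unrounded R^2
```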
In a multiple linear regression model with baby.sleep and dani.sleep as predictors:
The estimated regression coefficient for baby.sleep is small (0.01) compared to dani.sleep (-8.95)
This suggests that only dani.sleep matters in predicting grumpiness
\(H_0\): b = 0 (the true regression coefficient is zero)
\(H_a\): b ≠ 0 (the true regression coefficient is not zero)
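For reference, each coefficient is tested with the standard t statistic, which jamovi reports automatically:
\[t = \frac{\hat{b}}{SE\left(\hat{b}\right)}\]
compared against a \(t\) distribution with \(N - K - 1\) degrees of freedom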
To compute statistics, check relevant options and run regression in jamovi
See the results on the next slide

Conclusion
\(F(2,97) = 215.24\), \(p < .001\)
The \(R^2 = .81\) value indicates that the regression model accounts for 81% of the variability in the outcome measure
The baby.sleep variable has no significant effect
All the work in this model is being done by the dani.sleep variable
Dropping the baby.sleep predictor entirely may improve the model
The linear regression model relies on several assumptions.
Linearity: The relationship between X and Y is assumed to be linear.
Independence: Residuals are assumed to be independent of each other.
Normality: The residuals are assumed to be normally distributed.
Equality of Variance: The standard deviation of the residual is assumed to be the same for all values of \(\hat{Y}\).
Also…
Uncorrelated Predictors: In a multiple regression model, predictors should not be too strongly correlated with each other.
No “Bad” Outliers: The regression model should not be too strongly influenced by one or two anomalous data points.

Checking Linearity
Plotting Relationships
Plot the relationship between the predicted values and the observed values for the outcome variable.
Using Jamovi
Interpreting Results

To get a more detailed picture of linearity, it can be helpful to look at the relationship between predicted values and residuals.
Using Jamovi
In jamovi, save the residuals to the dataset and then draw a scatterplot of predicted values against residual values.
Interpreting Results

Regression models rely on a normality assumption: the residuals should be normally distributed.
Using Jamovi
Interpreting Results

Checking Relationship between Predicted Values and Residuals
In jamovi, plot the predictor variable, the outcome variable, and the predicted values against the residuals.
Interpreting Results
We are looking for a fairly uniform distribution of dots with no clear bunching or patterning.
Issues with the relationship between predicted values and residuals?
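Outside jamovi, the same residual check can be sketched in Python (simulated data and assumed coefficients, for illustration only):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
x = rng.uniform(4, 9, 100)                 # hypothetical predictor (hours of sleep)
y = 125 - 9 * x + rng.normal(0, 6, 100)    # hypothetical outcome (grumpiness)

# Fit the simple regression with the closed-form least-squares estimates
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x
residuals = y - y_hat

# A healthy model shows a flat, patternless cloud centred on zero
plt.scatter(y_hat, residuals)
plt.axhline(0, linestyle="--")
plt.xlabel("Predicted values")
plt.ylabel("Residuals")
plt.show()
```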

Regression models make an assumption of equality (homogeneity) of variance.
Plotting Equality of Variance in Jamovi
Save the residuals, create a computed variable SQRT(ABS(Residuals)), and plot it against the predicted values.
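Continuing the Python sketch above (it reuses y_hat and residuals from that block), the jamovi computed variable corresponds to:

```python
# Scale-location check, continuing the residual sketch above (reuses y_hat, residuals)
# jamovi equivalent: computed variable SQRT(ABS(Residuals)) plotted vs. predictions
plt.scatter(y_hat, np.sqrt(np.abs(residuals)))
plt.xlabel("Predicted values")
plt.ylabel("sqrt(|residual|)")  # spread should be roughly flat if variance is equal
plt.show()
```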

If an influential outlier is distorting the model, one option is to remove the outlier and run the regression again
How? In jamovi you can save the Cook’s distance values to the dataset, then draw a boxplot of the Cook’s distance values to identify the specific outliers.
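For the curious, a minimal Python sketch of the Cook's distance computation itself (simulated data with one planted outlier; in practice jamovi saves these values for you):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(4, 9, 100)                    # hypothetical predictor
y = 125 - 9 * x + rng.normal(0, 6, 100)       # hypothetical outcome
y[0] += 60                                    # plant one anomalous observation

X = np.column_stack([np.ones_like(x), x])     # design matrix with intercept
H = X @ np.linalg.inv(X.T @ X) @ X.T          # hat matrix
h = np.diag(H)                                # leverage of each observation
e = y - H @ y                                 # residuals
p = X.shape[1]                                # number of coefficients
s2 = e @ e / (len(y) - p)                     # residual variance estimate

# Cook's distance: D_i = (e_i^2 / (p * s^2)) * (h_i / (1 - h_i)^2)
cooks_d = e**2 / (p * s2) * h / (1 - h)**2
print(cooks_d.argmax(), cooks_d.max())        # the planted outlier should stand out
```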