Extending the linear model
Multiple predictors
Transforming variables in a model
Mean-centring
Scaling
z-transforming
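$$\text{outcome}_{\text{obs}} = \text{intercept} + \text{slope} \times \text{predictor}_{\text{obs}} + \text{residual}_{\text{obs}}$$
$$y_i = b_0 + b_1 \times x_{1i} + e_i$$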
The model is a line through the scatter of data
The line shows what the value of the outcome should be for a given value of the predictor, according to the model
The residual is the difference between the observed value and the value predicted by the model
Just like we can fix $b_1$ to zero in $y_i = b_0 + 0 \times x_{1i} + e_i$, we can fix any other $b$ coefficient as well
We can think of the basic single-predictor linear model as
$$y_i = b_0 + b_1 \times x_{1i} + 0 \times x_{2i} + 0 \times x_{3i} + \dots + 0 \times x_{ni} + e_i$$
We're just ignoring all but one of the infinitely many possible predictors we could put in the model
Not including a predictor in a model is the same as saying that there is no relationship between that variable and the outcome
If we wish, we can include other predictors in the model so that their associated $b$ coefficients get estimated, rather than set to 0
Including more than one predictor allows us to model the outcome variable in a more sophisticated way
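The point about fixing coefficients to 0 can be checked directly. The sketch below uses simulated data (the variables `x1`, `x2` and `y` are made up for illustration and are not part of the birth weight data set) and base R's `offset()`, which adds a term to the model without estimating a coefficient for it.

```r
# Simulated data (hypothetical; not the birth weight data set)
set.seed(1)
x1 <- rnorm(50)
x2 <- rnorm(50)
y  <- 2 + 0.5 * x1 + 0.3 * x2 + rnorm(50)

# Leaving x2 out of the formula...
m_omit <- lm(y ~ x1)

# ...gives the same estimates as keeping x2 in with its coefficient fixed at 0
# (offset() adds a term whose coefficient is not estimated)
m_fixed <- lm(y ~ x1 + offset(0 * x2))

# Including x2 as a predictor lets lm() estimate its coefficient instead
m_full <- lm(y ~ x1 + x2)

coef(m_omit)
coef(m_fixed)
coef(m_full)
```

The first two sets of coefficients are identical, while the third contains an extra estimated slope for `x2`.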
Every slope ($b_n$ coefficient, for $n > 0$) expresses the relationship between a given predictor and the outcome after the relationships between all other predictors and the outcome have been accounted for
A relationship – causal or not – between two variables can drastically change when another variable is taken into account
It's important to consider all variables with a known effect when modelling a relationship (especially in observational research)
A lot of ink has been spilled over the claim that breastfeeding leads to an increase in child IQ (BBC, The Guardian, The New York Times, FiveThirtyEight)
Taken at face value, breastfed children have higher IQ
Whether or not a person breastfeeds their child is also linked to things like socio‑economic status or the person's IQ
When these effects are adjusted for, the effect shrinks substantially – a difference of 3 IQ points is a generous estimate and even that has been contested
The linear model allows us to build these more nuanced models and get closer to the Truth about the Universe™
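A minimal simulated sketch of how adjusting for another variable can change a relationship. All variable names and effect sizes here are invented for illustration; the data are constructed so that `outcome` depends only on `confounder`, never on `predictor`.

```r
set.seed(42)
n <- 1000

confounder <- rnorm(n)                     # background variable
predictor  <- confounder + rnorm(n)        # predictor is related to the confounder
outcome    <- 2 * confounder + rnorm(n)    # outcome depends ONLY on the confounder

# Model that ignores the confounder
m_unadjusted <- lm(outcome ~ predictor)

# Model that accounts for the confounder
m_adjusted <- lm(outcome ~ predictor + confounder)

coef(m_unadjusted)["predictor"]  # clearly non-zero (about 1 in expectation)
coef(m_adjusted)["predictor"]    # close to 0 once the confounder is in the model
```

In the unadjusted model the predictor appears to be related to the outcome purely because both share a common cause; once that cause is included, the slope for the predictor shrinks towards zero.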
ID <dbl> | Length <dbl> | Birthweight <dbl> | Headcirc <dbl> | Gestation <dbl> | smoker <dbl> | mage <dbl> |
---|---|---|---|---|---|---|
1360 | 56 | 4.55 | 34 | 44 | 0 | 20 |
1016 | 53 | 4.32 | 36 | 40 | 0 | 19 |
462 | 58 | 4.10 | 39 | 41 | 0 | 35 |
1187 | 53 | 4.07 | 38 | 44 | 0 | 20 |
553 | 54 | 3.94 | 37 | 42 | 0 | 24 |
1636 | 51 | 3.93 | 38 | 38 | 0 | 29 |
820 | 52 | 3.77 | 34 | 40 | 0 | 24 |
1191 | 53 | 3.65 | 33 | 42 | 0 | 21 |
1081 | 54 | 3.63 | 38 | 38 | 0 | 18 |
822 | 50 | 3.42 | 35 | 38 | 0 | 20 |
library(tidyverse)  # provides %>%, ggplot(), filter(), mutate() and tibble()

# theme_col and second_col are colour values assumed to be defined earlier in the deck
p1 <- bweight %>%
  ggplot(aes(mage, Birthweight)) +
  geom_point(size = 3, alpha = .4) +
  geom_smooth(method = "lm", color = theme_col, fill = second_col) +
  labs(x = "Mother's age at birth", y = "Birth weight (lbs)")

p2 <- bweight %>%
  ggplot(aes(Gestation, Birthweight)) +
  geom_point(size = 3, alpha = .4) +
  geom_smooth(method = "lm", color = second_col, fill = theme_col) +
  labs(x = "Gestation duration (weeks)", y = "")

# show the two scatterplots side by side
cowplot::plot_grid(p1, p2)
lm()
## Intercept-only model
m_null <- lm(Birthweight ~ 1, bweight)

## Add mother's age as predictor
m_age <- lm(Birthweight ~ mage, bweight)
# alternatively update(m_null, ~ . + mage)

## Add gestation duration as predictor
m_gest <- lm(Birthweight ~ mage + Gestation, bweight)
# same as update(m_age, ~ . + Gestation)
summary(m_null)
##
## Call:
## lm(formula = Birthweight ~ 1, data = bweight)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -1.39286 -0.37286 -0.01786  0.33464  1.25714
##
## Coefficients:
##             Estimate Std. Error t value            Pr(>|t|)
## (Intercept)  3.31286    0.09318   35.55 <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6039 on 41 degrees of freedom
summary(m_age)
##
## Call:
## lm(formula = Birthweight ~ mage, data = bweight)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -1.39275 -0.37288 -0.01786  0.33473  1.25702
##
## Coefficients:
##               Estimate Std. Error t value      Pr(>|t|)
## (Intercept) 3.31238583 0.44072153   7.516 0.00000000362 ***
## mage        0.00001845 0.01685112   0.001         0.999
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6114 on 40 degrees of freedom
## Multiple R-squared:  2.996e-08, Adjusted R-squared:  -0.025
## F-statistic: 1.199e-06 on 1 and 40 DF,  p-value: 0.9991
summary(m_gest)
##
## Call:
## lm(formula = Birthweight ~ mage + Gestation, data = bweight)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -0.77485 -0.35861 -0.00236  0.26948  0.96943
##
## Coefficients:
##               Estimate Std. Error t value    Pr(>|t|)
## (Intercept) -3.0092887  1.0567990  -2.848     0.00699 **
## mage        -0.0007953  0.0120469  -0.066     0.94770
## Gestation    0.1618369  0.0258242   6.267 0.000000221 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4371 on 39 degrees of freedom
## Multiple R-squared:  0.5017, Adjusted R-squared:  0.4762
## F-statistic: 19.64 on 2 and 39 DF,  p-value: 0.00000126
The linear model can tell us the expected value of the outcome for any combination of predictor values
According to our model, the expected birth weight for a baby whose mother is 29 years old and whose gestation period was 38 weeks is:
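$$\begin{aligned}
\hat{y} &= -3.01 + 0 \times \text{age} + 0.16 \times \text{gestation} \\
&= -3.01 + 0 \times 29 + 0.16 \times 38 \\
&= -3.01 + 0 + 6.08 \\
&= 3.07
\end{aligned}$$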
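The same prediction can be sketched in R. The coefficients below are typed in from the `summary(m_gest)` output rather than taken from a fitted object, so the block runs without the data; the names `b`, `x` and `y_hat` are chosen here for illustration only.

```r
# Coefficients copied from the summary(m_gest) output
b <- c(intercept = -3.0092887, mage = -0.0007953, Gestation = 0.1618369)

# 1 for the intercept, then mother's age (29) and gestation (38 weeks)
x <- c(1, 29, 38)

# Prediction = sum of coefficient * value
y_hat <- sum(b * x)
y_hat
```

Using the unrounded coefficients gives a prediction of about 3.12 rather than 3.07; the difference comes purely from rounding the coefficients to two decimal places in the hand calculation. With the fitted model available, `predict(m_gest, newdata = data.frame(mage = 29, Gestation = 38))` should return the same value.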
bweight %>% filter(mage == 29 & Gestation == 38) %>% rmarkdown::paged_table()
ID <dbl> | Length <dbl> | Birthweight <dbl> | Headcirc <dbl> | Gestation <dbl> | smoker <dbl> | mage <dbl> |
---|---|---|---|---|---|---|
1636 | 51 | 3.93 | 38 | 38 | 0 | 29 |
The intercept always tells us the expected value of the outcome, according to the model, when all predictors are 0
# coefs is assumed to hold the coefficients of m_gest, e.g. coefs <- coef(m_gest)
bweight %>%
  ggplot(aes(Gestation, Birthweight)) +
  geom_point(size = 3, alpha = .4) +
  # dashed lines marking Gestation = 0 and the intercept
  geom_vline(xintercept = 0, lty = 2) +
  geom_hline(yintercept = coefs[1], lty = 2) +
  geom_point(data = tibble(x = 0, y = coefs[1]), mapping = aes(x, y), colour = "#fdfdfd", size = 4) +
  # regression line extended all the way back to Gestation = 0
  geom_abline(intercept = coefs[1], slope = coefs[3], color = second_col, size = 1) +
  geom_point(data = tibble(x = 0, y = coefs[1]), mapping = aes(x, y), pch = 21, size = 4) +
  xlim(c(0, 45)) +
  ylim(c(-4, 5)) +
  labs(x = "Gestation duration (weeks)", y = "Birth weight (lbs)")
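To see why the intercept has this meaning, set every predictor to 0 in the model equation; all slope terms then vanish. Applied to the rounded coefficients of `m_gest` (with $x_1$ standing for mother's age and $x_2$ for gestation, a labelling assumed here):

```latex
\hat{y} = b_0 + b_1 \times 0 + b_2 \times 0 = b_0 \approx -3.01
```

So the model's expected birth weight for a baby born to a 0-year-old mother after 0 weeks of gestation is about −3.01, which is not a meaningful value.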
We can apply various transformations to variables in the model
Centring, scaling, standardising
Non-linear transformations are also possible (e.g., log-transform)
Transforming variables changes the interpretation of the coefficients
# untransformed predictor
lm(Birthweight ~ Gestation, bweight)
##
## Call:
## lm(formula = Birthweight ~ Gestation, data = bweight)
##
## Coefficients:
## (Intercept)    Gestation
##     -3.0289       0.1618
# centred predictor
bweight <- bweight %>%
  mutate(gest_cntrd = Gestation - mean(Gestation, na.rm = TRUE))
lm(Birthweight ~ gest_cntrd, bweight)
##
## Call:
## lm(formula = Birthweight ~ gest_cntrd, data = bweight)
##
## Coefficients:
## (Intercept)   gest_cntrd
##      3.3129       0.1618
# centre mother's age
bweight <- bweight %>%
  mutate(age_cntrd = mage - mean(mage, na.rm = TRUE))
lm(Birthweight ~ age_cntrd + gest_cntrd, bweight) %>%
  summary()
##
## Call:
## lm(formula = Birthweight ~ age_cntrd + gest_cntrd, data = bweight)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -0.77485 -0.35861 -0.00236  0.26948  0.96943
##
## Coefficients:
##               Estimate Std. Error t value             Pr(>|t|)
## (Intercept)  3.3128571  0.0674405  49.123 < 0.0000000000000002 ***
## age_cntrd   -0.0007953  0.0120469  -0.066                0.948
## gest_cntrd   0.1618369  0.0258242   6.267          0.000000221 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4371 on 39 degrees of freedom
## Multiple R-squared:  0.5017, Adjusted R-squared:  0.4762
## F-statistic: 19.64 on 2 and 39 DF,  p-value: 0.00000126
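The effect of mean-centring can be reproduced on simulated data. This is a sketch under stated assumptions: `x` and `y` are invented stand-ins (loosely shaped like gestation and birth weight), not the `bweight` data.

```r
set.seed(2)
x <- rnorm(40, mean = 39, sd = 2.5)        # predictor whose values lie far from 0
y <- -3 + 0.16 * x + rnorm(40, sd = 0.4)

# Model with the raw predictor
m_raw <- lm(y ~ x)

# Model with the mean-centred predictor
x_centred <- x - mean(x)
m_centred <- lm(y ~ x_centred)

coef(m_raw)      # intercept = expected y when x is 0
coef(m_centred)  # intercept = expected y when x is at its mean
mean(y)          # equals the intercept of the centred model
```

Centring leaves the slope untouched and only moves the intercept: it now describes the expected outcome at the mean of the predictor, which in a single-predictor model is simply the mean of the outcome.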
# untransformed outcome
lm(Birthweight ~ gest_cntrd, bweight)
##
## Call:
## lm(formula = Birthweight ~ gest_cntrd, data = bweight)
##
## Coefficients:
## (Intercept)   gest_cntrd
##      3.3129       0.1618
# scaled outcome
bweight <- bweight %>%
  mutate(bweight_g = Birthweight / 2.205 * 1000) # 2.205 lbs in kg
lm(bweight_g ~ gest_cntrd, bweight)
##
## Call:
## lm(formula = bweight_g ~ gest_cntrd, data = bweight)
##
## Coefficients:
## (Intercept)   gest_cntrd
##     1502.43        73.39
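Multiplying the outcome by a constant multiplies every coefficient by that same constant, because the coefficients are expressed in units of the outcome. A small simulated check, with made-up variables `x`, `y` and an arbitrary scaling factor `k`:

```r
set.seed(3)
x <- rnorm(40)
y <- 3 + 0.5 * x + rnorm(40)

k <- 1000              # arbitrary scaling factor, e.g. a change of units
y_scaled <- y * k

m_raw    <- lm(y ~ x)
m_scaled <- lm(y_scaled ~ x)

# Ratio of scaled to raw coefficients: k for both intercept and slope
coef(m_scaled) / coef(m_raw)
```

The fit itself does not change; only the units in which the intercept and slope are reported do.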
# untransformed predictor
lm(Birthweight ~ Gestation, bweight)
##
## Call:
## lm(formula = Birthweight ~ Gestation, data = bweight)
##
## Coefficients:
## (Intercept)    Gestation
##     -3.0289       0.1618
# standardised predictor
bweight <- bweight %>%
  mutate(gest_z = scale(Gestation))
lm(bweight_g ~ gest_z, bweight)
##
## Call:
## lm(formula = bweight_g ~ gest_z, data = bweight)
##
## Coefficients:
## (Intercept)       gest_z
##        1502          194
# raw predictor, raw outcome
p1 <- bweight %>%
  ggplot(aes(Gestation, Birthweight)) +
  geom_point(size = 3, alpha = .4) +
  geom_smooth(method = "lm", color = second_col, fill = theme_col) +
  labs(x = "Gestation time (weeks)", y = "Birth weight (lbs)")

# centred predictor, raw outcome
p2 <- bweight %>%
  ggplot(aes(gest_cntrd, Birthweight)) +
  geom_point(size = 3, alpha = .4) +
  geom_smooth(method = "lm", color = second_col, fill = theme_col) +
  labs(x = "Gestation time (weeks from mean)", y = "")

# raw predictor, scaled outcome
p3 <- bweight %>%
  ggplot(aes(Gestation, bweight_g)) +
  geom_point(size = 3, alpha = .4) +
  geom_smooth(method = "lm", color = second_col, fill = theme_col) +
  labs(x = "Gestation time (weeks)", y = "Birth weight (g)")

# standardised predictor, scaled outcome
p4 <- bweight %>%
  ggplot(aes(gest_z, bweight_g)) +
  geom_point(size = 3, alpha = .4) +
  geom_smooth(method = "lm", color = second_col, fill = theme_col) +
  labs(x = "Gestation time (z-score)", y = "")

cowplot::plot_grid(p1, p2, p3, p4, nrow = 2)
Standardised coefficients are equivalent to b coefficients in a model where both the predictors and the outcome have been z-transformed
We'll call them B to distinguish them from "raw" coefficients b, but there is a lot of confusion in the literature about the notation (you may see b, B, β, or Beta used to mean either of the two)
B expresses the change in the outcome, in standard deviations, associated with a 1 SD change in the predictor
Handy function – QuantPsyc::lm.beta()
Only gives B for slopes, not intercept!
m_gest <- lm(Birthweight ~ mage + Gestation, bweight)
# raw coefficients (b)
m_gest %>% coef()
##   (Intercept)          mage     Gestation
## -3.0092887340 -0.0007952874  0.1618368592
# standardised coefficients (B)
m_gest %>% QuantPsyc::lm.beta()
##         mage    Gestation
## -0.007462176  0.708383324
# same as if we z-transform everything ourselves
lm(scale(Birthweight) ~ scale(mage) + scale(Gestation), bweight) %>%
  coef() %>%
  round(9)
##      (Intercept)      scale(mage) scale(Gestation)
##      0.000000000     -0.007462176      0.708383324
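Standardised slopes can also be obtained by hand from the raw ones, since B = b × SD(predictor) / SD(outcome). The sketch below checks this on simulated data using base R only; the variables and the helper `z()` are hypothetical and introduced just for this example, and `QuantPsyc::lm.beta()` is deliberately not used so the block runs without extra packages.

```r
set.seed(4)
x1 <- rnorm(60, mean = 25, sd = 5)
x2 <- rnorm(60, mean = 39, sd = 2.5)
y  <- -3 + 0.01 * x1 + 0.16 * x2 + rnorm(60, sd = 0.4)

# Raw slopes (intercept dropped)
m <- lm(y ~ x1 + x2)
b <- coef(m)[-1]

# Standardised slopes by hand: B = b * SD(predictor) / SD(outcome)
B_manual <- b * c(sd(x1), sd(x2)) / sd(y)

# Hypothetical helper that z-transforms a numeric vector
z <- function(v) (v - mean(v)) / sd(v)

# Slopes from a model fitted to z-transformed variables
B_ztrans <- coef(lm(z(y) ~ z(x1) + z(x2)))[-1]

unname(B_manual)
unname(B_ztrans)
```

Both approaches give the same two numbers, which is why standardised coefficients can be computed after the fact without refitting the model.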
The linear model can easily be extended to more than one predictor
Each predictor entered into the model adds an extra dimension to the space in which the model exists
Each b coefficient (except for b0) is a slope of the regression plane in its dimension
Both including and omitting a variable are claims about its relationship with the outcome
A b coefficient for a predictor tells us about the relationship between the predictor and the outcome after accounting for the relationships between all other predictors and the outcome
The intercept may not be a sensible value if the variables are not transformed
Transforming variables changes the interpretation of the coefficients
Standardised coefficients, B, express the change in the outcome, in standard deviations, associated with a 1 SD change in the predictor