
The Linear Model 3: Return of the yi

Lecture 9

Dr Milan Valášek

28 March 2022

1 / 27

Stats wars

  • LM1:   A New Equation

  • LM2:   R² strikes back

  • LM3:   Return of the yi

2 / 27

Today

  • Extending the linear model

  • Multiple predictors

  • Transforming variables in a model

    • Mean-centring

    • Scaling

    • z-transforming

3 / 27

Basic linear model

$$\text{outcome}_{obs} = \text{intercept} + \text{slope} \times \text{predictor}_{obs} + \text{residual}_{obs}$$

$$y_i = b_0 + b_1 \times x_{1i} + e_i$$

  • The model is a line through the scatter of data

  • The line shows what the value of outcome for a given value of predictor should be according to the model

  • Residual is the difference between prediction and observation

4 / 27

Mean as linear model

  • The simplest linear model is the mean
  • $y_i = b_0 + e_i$
  • $b_0 = \text{Mean}(y)$
  • That's literally the same as $y_i = b_0 + 0 \times x_{1i} + e_i$
  • The mean is the intercept-only model: a linear model where all $b$ coefficients other than $b_0$ have been set (fixed) to zero
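A quick way to convince yourself of this in R (a minimal sketch with made-up numbers, not the lecture data):

y <- c(2.9, 3.1, 3.4, 3.6, 4.0)   # hypothetical outcome values
coef(lm(y ~ 1))["(Intercept)"]    # intercept-only model: 3.4
mean(y)                           # the same value, 3.4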


5 / 27

Other coefficients?

  • Just like we can fix $b_1$ to zero in $y_i = b_0 + 0 \times x_{1i} + e_i$, we can fix any other $b$ coefficient as well

  • We can think of the basic single-predictor linear model as

$$y_i = b_0 + b_1 \times x_{1i} + 0 \times x_{2i} + 0 \times x_{3i} + \dots + 0 \times x_{ni} + e_i$$

  • We're just ignoring all but one of the infinitely many possible predictors we could put in the model

  • Not including a predictor in a model is the same as saying that there is no relationship between that variable and the outcome

    • It's just said implicitly rather than aloud
  • We can include them in the model if we wish to, so that their associated $b$ coefficient gets estimated rather than set to 0

6 / 27

Variables are dimensions

  • We've been representing the mean as a line on a plot of 2 variables
  • It can also be represented as a point on the number line
  • Every predictor adds a dimension


7 / 27

More complex models

  • Including more than one predictor allows us to model the outcome variable in a more sophisticated way

  • Every slope (a $b_n$ coefficient, for $n > 0$) expresses the relationship between a given predictor and the outcome after the relationships of all other predictors with the outcome have been accounted for

  • A relationship – causal or not – between two variables can drastically change when another variable is taken into account

  • It's important to consider all variables with a known effect when modelling a relationship (especially in observational research)

    • Say we find a relationship between home environment and mental health
    • However, mental health has a strong genetic component
    • Parental predisposition to worse mental health is also linked to home environment
    • Can we really claim a relationship between environment and mental health if we don't consider genetics?
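A toy simulation of this scenario (entirely made-up variable names and numbers, just to illustrate how adjusting for a shared cause changes a slope):

set.seed(1)                                # reproducible made-up data
n <- 500
genetics <- rnorm(n)                       # hypothetical genetic predisposition
home_env <- 0.7 * genetics + rnorm(n)      # environment partly reflects parental genetics
mental_h <- 0.8 * genetics + rnorm(n)      # mental health driven by genetics only

coef(lm(mental_h ~ home_env))              # sizeable apparent "effect" of environment
coef(lm(mental_h ~ home_env + genetics))   # the home_env slope shrinks towards zero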
8 / 27

Breast is best but is it smartest?

  • A lot of ink has been spilled over the claim that breastfeeding leads to an increase in child IQ (BBC, The Guardian, The New York Times, FiveThirtyEight)

  • When assessed at face value, breastfed children have higher IQs

  • Whether or not a person breastfeeds their child is also linked to things like socio‑economic status or the person's IQ

  • When these effects are adjusted for, the effect shrinks substantially – a difference of 3 IQ points is a generous estimate and even that has been contested


The linear model allows us to build these more nuanced models and get closer to the Truth about the Universe™

9 / 27

Multiple predictors in practice

  • Today's example focuses on data about babies' birth weights and parental characteristics (source)
  ID  Length  Birthweight  Headcirc  Gestation  smoker  mage
1360      56         4.55        34         44       0    20
1016      53         4.32        36         40       0    19
 462      58         4.10        39         41       0    35
1187      53         4.07        38         44       0    20
 553      54         3.94        37         42       0    24
1636      51         3.93        38         38       0    29
 820      52         3.77        34         40       0    24
1191      53         3.65        33         42       0    21
1081      54         3.63        38         38       0    18
 822      50         3.42        35         38       0    20
10 / 27

Birth weight, mother's age, and gestation time

p1 <- bweight %>%
  ggplot(aes(mage, Birthweight)) +
  geom_point(size = 3, alpha = .4) +
  geom_smooth(method = "lm", color = theme_col, fill = second_col) +
  labs(x = "Mother's age at birth", y = "Birth weight (lbs)")
p2 <- bweight %>%
  ggplot(aes(Gestation, Birthweight)) +
  geom_point(size = 3, alpha = .4) +
  geom_smooth(method = "lm", color = second_col, fill = theme_col) +
  labs(x = "Gestation duration (weeks)", y = "")
cowplot::plot_grid(p1, p2)
11 / 27

Fit model using lm()

## Intercept-only model
m_null <- lm(Birthweight ~ 1, bweight)
## Add mother's age as predictor
m_age <- lm(Birthweight ~ mage, bweight)
# alternatively update(m_null, ~ . + mage)
## Add gestation duration as predictor
m_gest <- lm(Birthweight ~ mage + Gestation, bweight)
# same as update(m_age, ~ . + Gestation)
12 / 27

Results - null model

summary(m_null)
##
## Call:
## lm(formula = Birthweight ~ 1, data = bweight)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.39286 -0.37286 -0.01786 0.33464 1.25714
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.31286 0.09318 35.55 <0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6039 on 41 degrees of freedom
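Note that the intercept of the intercept-only model is just the sample mean of the outcome; a quick check (sketch):

coef(m_null)["(Intercept)"]               # 3.31286
mean(bweight$Birthweight, na.rm = TRUE)   # the same value, ~3.31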
13 / 27

Results - Mother's age as predictor

summary(m_age)
##
## Call:
## lm(formula = Birthweight ~ mage, data = bweight)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.39275 -0.37288 -0.01786 0.33473 1.25702
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.31238583 0.44072153 7.516 0.00000000362 ***
## mage 0.00001845 0.01685112 0.001 0.999
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6114 on 40 degrees of freedom
## Multiple R-squared: 2.996e-08, Adjusted R-squared: -0.025
## F-statistic: 1.199e-06 on 1 and 40 DF, p-value: 0.9991
14 / 27

Results - Mother's age and gestation time

summary(m_gest)
##
## Call:
## lm(formula = Birthweight ~ mage + Gestation, data = bweight)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.77485 -0.35861 -0.00236 0.26948 0.96943
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.0092887 1.0567990 -2.848 0.00699 **
## mage -0.0007953 0.0120469 -0.066 0.94770
## Gestation 0.1618369 0.0258242 6.267 0.000000221 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4371 on 39 degrees of freedom
## Multiple R-squared: 0.5017, Adjusted R-squared: 0.4762
## F-statistic: 19.64 on 2 and 39 DF, p-value: 0.00000126
15 / 27

Model prediction

  • The linear model can tell us the expected value of the outcome for any combination of predictor values

  • According to our model, expected birth weight for a baby whose mother is 29 years old and whose gestation period was 38 weeks is:

$$\hat{y} = -3.01 + 0 \times \text{age} + 0.16 \times \text{gestation} = -3.01 + 0 \times 29 + 0.16 \times 38 = -3.01 + 0 + 6.08 = 3.07$$
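In R, predict() does this arithmetic for us (a minimal sketch using the m_gest model fitted earlier):

predict(m_gest, newdata = data.frame(mage = 29, Gestation = 38))
# ≈ 3.12 (the hand calculation above rounds the coefficients, hence 3.07)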

  • Let's compare this to the observations in the sample
bweight %>%
  filter(mage == 29 & Gestation == 38) %>%
  rmarkdown::paged_table()
  ID  Length  Birthweight  Headcirc  Gestation  smoker  mage
1636      51         3.93        38         38       0    29
16 / 27

Negative intercept?!

  • The intercept always tells us the value of the outcome when all predictors are 0

    • Not always sensible (instantaneous childbirth in women aged 0 is not a common occurrence)
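We can see this directly with predict() (a sketch; setting every predictor to 0 just returns the intercept):

predict(m_gest, newdata = data.frame(mage = 0, Gestation = 0))
# ≈ -3.01 lbs: the intercept, not a plausible birth weight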

coefs <- coef(m_gest) # not defined on the slide; presumably the coefficients of m_gest
bweight %>%
  ggplot(aes(Gestation, Birthweight)) +
  geom_point(size = 3, alpha = .4) +
  geom_vline(xintercept = 0, lty = 2) +
  geom_hline(yintercept = coefs[1], lty = 2) +
  geom_point(data = tibble(x = 0, y = coefs[1]),
             mapping = aes(x, y), colour = "#fdfdfd", size = 4) +
  geom_abline(intercept = coefs[1], slope = coefs[3], color = second_col, size = 1) +
  geom_point(data = tibble(x = 0, y = coefs[1]),
             mapping = aes(x, y), pch = 21, size = 4) +
  xlim(c(0, 45)) +
  ylim(c(-4, 5)) +
  labs(x = "Gestation duration (weeks)", y = "Birth weight (lbs)")
17 / 27

Transforming variables in the model

  • We can apply various transformations to variables in the model

    • Centring, scaling, standardising

    • Non-linear transformations are also possible (e.g., log-transform)

  • Transforming variables changes the interpretation of the coefficients
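For example, a log-transformed outcome (just a sketch; the slope then describes approximately proportional rather than additive change):

lm(log(Birthweight) ~ Gestation, bweight)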

18 / 27

Centring

  • Centring predictors changes the interpretation of the intercept
# untransformed predictor
lm(Birthweight ~ Gestation, bweight)
##
## Call:
## lm(formula = Birthweight ~ Gestation, data = bweight)
##
## Coefficients:
## (Intercept) Gestation
## -3.0289 0.1618
# centred predictor
bweight <- bweight %>%
  mutate(gest_cntrd = Gestation - mean(Gestation, na.rm = TRUE))
lm(Birthweight ~ gest_cntrd, bweight)
##
## Call:
## lm(formula = Birthweight ~ gest_cntrd, data = bweight)
##
## Coefficients:
## (Intercept) gest_cntrd
## 3.3129 0.1618
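With a mean-centred predictor, the intercept is the predicted birth weight at the average gestation; the same number comes out of the uncentred model (a sketch; m_raw is just a name used here):

m_raw <- lm(Birthweight ~ Gestation, bweight)
predict(m_raw, newdata = data.frame(Gestation = mean(bweight$Gestation, na.rm = TRUE)))
# ~3.31, matching the centred model's intercept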
19 / 27

Centring

  • What's the weight of a baby born to a "typical" mother in terms of age and pregnancy duration?
# centre mother's age
bweight <- bweight %>%
  mutate(age_cntrd = mage - mean(mage, na.rm = TRUE))
lm(Birthweight ~ age_cntrd + gest_cntrd, bweight) %>% summary()
##
## Call:
## lm(formula = Birthweight ~ age_cntrd + gest_cntrd, data = bweight)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.77485 -0.35861 -0.00236 0.26948 0.96943
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.3128571 0.0674405 49.123 < 0.0000000000000002 ***
## age_cntrd -0.0007953 0.0120469 -0.066 0.948
## gest_cntrd 0.1618369 0.0258242 6.267 0.000000221 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4371 on 39 degrees of freedom
## Multiple R-squared: 0.5017, Adjusted R-squared: 0.4762
## F-statistic: 19.64 on 2 and 39 DF, p-value: 0.00000126
20 / 27

Scaling

  • Scaling predictors or outcome changes the interpretation of the slopes
# untransformed outcome
lm(Birthweight ~ gest_cntrd, bweight)
##
## Call:
## lm(formula = Birthweight ~ gest_cntrd, data = bweight)
##
## Coefficients:
## (Intercept) gest_cntrd
## 3.3129 0.1618
# scaled outcome
bweight <- bweight %>%
  mutate(bweight_g = Birthweight / 2.205 * 1000) # ~2.205 lbs per kg
lm(bweight_g ~ gest_cntrd, bweight)
##
## Call:
## lm(formula = bweight_g ~ gest_cntrd, data = bweight)
##
## Coefficients:
## (Intercept) gest_cntrd
## 1502.43 73.39
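The new slope is just the old one re-expressed in the new units; the same conversion done by hand (sketch):

0.1618 / 2.205 * 1000   # ≈ 73.4 g per week, matching the gest_cntrd slope above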
21 / 27

Standardising

  • Sometimes it's useful to talk about change in outcome associated with a 1 SD change in predictors
# untransformed predictor
lm(Birthweight ~ Gestation, bweight)
##
## Call:
## lm(formula = Birthweight ~ Gestation, data = bweight)
##
## Coefficients:
## (Intercept) Gestation
## -3.0289 0.1618
# standardised predictor
bweight <- bweight %>%
  mutate(gest_z = scale(Gestation))
lm(bweight_g ~ gest_z, bweight)
##
## Call:
## lm(formula = bweight_g ~ gest_z, data = bweight)
##
## Coefficients:
## (Intercept) gest_z
## 1502 194
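Equivalently, the per-SD slope is the raw per-week slope multiplied by the SD of Gestation (a sketch; m_g is just a name used here):

m_g <- lm(bweight_g ~ Gestation, bweight)                      # slope in g per week
coef(m_g)["Gestation"] * sd(bweight$Gestation, na.rm = TRUE)   # ≈ 194 g per SD, as above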
22 / 27

It's all the same model!

p1 <- bweight %>%
  ggplot(aes(Gestation, Birthweight)) +
  geom_point(size = 3, alpha = .4) +
  geom_smooth(method = "lm", color = second_col, fill = theme_col) +
  labs(x = "Gestation time (weeks)", y = "Birth weight (lbs)")
p2 <- bweight %>%
  ggplot(aes(gest_cntrd, Birthweight)) +
  geom_point(size = 3, alpha = .4) +
  geom_smooth(method = "lm", color = second_col, fill = theme_col) +
  labs(x = "Gestation time (weeks from mean)", y = "")
p3 <- bweight %>%
  ggplot(aes(Gestation, bweight_g)) +
  geom_point(size = 3, alpha = .4) +
  geom_smooth(method = "lm", color = second_col, fill = theme_col) +
  labs(x = "Gestation time (weeks)", y = "Birth weight (g)")
p4 <- bweight %>%
  ggplot(aes(gest_z, bweight_g)) +
  geom_point(size = 3, alpha = .4) +
  geom_smooth(method = "lm", color = second_col, fill = theme_col) +
  labs(x = "Gestation time (z-score)", y = "")
cowplot::plot_grid(p1, p2, p3, p4, nrow = 2)
23 / 27

Standardised coefficients

  • Standardised coefficients are equivalent to b coefficients in a model where both the predictors and the outcome have been z-transformed

  • We'll call them B to distinguish them from the "raw" coefficients b, but there is a lot of confusion in the literature about the notation (you may see b, B, β, or Beta used to mean either of the two)

  • B expresses the change in the outcome, in SD units, associated with a 1 SD change in the predictor

24 / 27

Standardised coefficients

  • Handy function – QuantPsyc::lm.beta()

  • Only gives B for slopes, not intercept!

m_gest <- lm(Birthweight ~ mage + Gestation, bweight)
# raw coefficients (b)
m_gest %>% coef()
## (Intercept) mage Gestation
## -3.0092887340 -0.0007952874 0.1618368592
# standardised coefficients (B)
m_gest %>% QuantPsyc::lm.beta()
## mage Gestation
## -0.007462176 0.708383324
# same as if we z-transform everything ourselves
lm(scale(Birthweight) ~ scale(mage) + scale(Gestation), bweight) %>% coef() %>% round(9)
## (Intercept) scale(mage) scale(Gestation)
## 0.000000000 -0.007462176 0.708383324
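The same numbers can be obtained from the raw coefficients using the identity B = b × SD(predictor) / SD(outcome); a minimal sketch (b and sds are just names used here):

b <- coef(m_gest)[-1]                                         # raw slopes for mage and Gestation
sds <- sapply(bweight[, c("mage", "Gestation")], sd, na.rm = TRUE)
b * sds / sd(bweight$Birthweight, na.rm = TRUE)               # ≈ -0.0075 and 0.708, as above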
25 / 27

Take-home message

  • The linear model can easily be extended to more than one predictor

  • Each predictor entered into the model adds an extra dimension to the space in which the model exists

  • Each $b$ coefficient (except for $b_0$) is the slope of the regression plane along its own dimension

  • Both including and omitting a variable are claims about its relationship with the outcome

  • A b coefficient for a predictor tells us about the relationship between the predictor and the outcome after accounting for the relationship between all other predictors and the outcome

  • The intercept may not be a sensible value if variables are not transformed

  • Transforming variables changes the interpretation of the coefficients

  • Standardised coefficients, B, express the change in the outcome, in SD units, associated with a 1 SD change in the predictor

26 / 27
27 / 27
