Your progress on quizzes is being saved!

Tutorial 9

In the previous tutorial, we practised fitting a linear model with a single predictor to see if we can say anything about the length of a penguin’s flipper if we know something about its body mass. This time, we will build more sophisticated models with more than one predictor. To do that, we’ll make use of the understanding of multiple-predictor linear models that you gained in lecture 9.

Warm-up quiz

Before we start, let’s warm up by answering a few quiz questions!

As you know, the general equation for the linear model with n predictors is

\[y_i = b_0 + b_1\times x_{1_i} + b_2\times x_{2_i} + b_3\times x_{3_i} + \dots + b_{n-1}\times x_{n-1_i} + b_n\times x_{n_i} + e_i\]

QUESTION 1

How many slopes will there be in a model with 4 predictors?

4

Correct!That’s not right…

QUESTION 2

How many intercepts will there be in a model with 3 predictors?

1

Correct!That’s not right…

QUESTION 3

In general terms, how many b coefficients will there be in a model with n predictors? (do not include white spaces)

n + 1

QUESTION 4

What is the geometrical representation of the linear model with 2 predictors?

2D plane in a 3D space

QUESTION 5

What does each slope represent?

Relationship between a given predictor and the outcome after accounting for all other predictors

Setting Up

As usual, all you need is this tutorial and RStudio. Remember that you can easily switch between windows with the Alt + ↹ Tab (Windows) and ⌘ Command + ⇥ Tab (Mac OS) shortcuts.

Task 1punk!

Create a new week_10 project and open it in RStudio. Then, create two new folders in the new week_10 folder: r_docs and data. Finally, open a new R Markdown file and save it in the r_docs folder. Get into the habit of creating new code chunk in the file for each of the tasks below.

Remember, you can add new code chunks by:

  1. Using the RStudio toolbar: Click Code > Insert Chunk
  2. Using a keyboard shortcut: the default is Ctrl + Alt + I (Windows) or ⌘ Command + Alt + I (MacOS), but you can change this under Tools > Modify Keyboard Shortcuts…
  3. Typing it out: ```{r}, press Enter, then ``` again.
  4. Copy and pasting a code chunk you already have (but be careful of duplicated chunk names!)

 

Task 2punk!

In this tutorial, just like in the previous one, we’ll continue using the penguins data set from the palmerpenguins package. You will also need the packages tidyverse, broom, and parameters (install it if you don’t have it yet; watch out for spelling and upper/lower case letters!). Put the code that loads all of these in the setup chunk of your Rmd file and run it.

Data

Task 3punk!

Just like last time, save the penguins data set in your environment as peng_data to avoid having to type out palmerpenguins::penguins every time you want to use it.

Task 4punk!

Remind yourself of what’s in the data by looking at a rough summary.

Single-predictor model

Last time we looked at how we can use information about a penguin’s body mass to make a more accurate prediction about the length of its flippers than simply guessing the mean flipper length for any given specimen.

Task 5

Let’s take a look at this model again.

Task 5.1punk!

Fit the model to the data using the lm(), saving it’s output as m_mass

Task 5.2punk!

Use broom::tidy() to get a neat tibble of model summary statistics for this model. Save it as m_mass_summary

Task 5.3Prog-rocK

Inspect the results and answer the following questions.

QUESTION 6

What kind of relationship between body mass and flipper length does the model describe

Positive

QUESTION 7

By how many millimetres is flipper length predicted to increase for every 10 gramme increase in a penguin’s body mass? (give answer to 2 decimal places)

0.15

Correct!That’s not right…

QUESTION 8

Is this a statistically significant relationship (assuming an α = .05)?

Yes

Task 5.4punk!

Use broom::glance() to get the model fit statistics for m_mass and answer the quiz question.

QUESTION 9

What proportion of total variance in flipper length is accounted for by body mass? (give answer as % to 1 decimal place, such as 77.2)

75.9

Correct!That’s not right…

Now that you’ve refreshed your memory of the single-predictor model we built last time, let’s extend it by adding another predictor.

Multiple-predictor model

Imagine we’re also interested in whether or not bill length is related to flipper length. We can extend out m_mass model to explore this research question.

Before we do that, let’s look at what a simple model of flipper length as predicted by bill length tells us.

Task 6Prog-rocK

Fit this model, saving it as m_bill, inspect the results and answer the following questions.

QUESTION 10

What kind of relationship between bill length and flipper length does the model describe

Positive

QUESTION 11

By how many millimetres is flipper length predicted to increase for every 1 mm increase in a penguin’s bill length? (give answer to 2 decimal places)

1.69

Correct!That’s not right…

QUESTION 12

Is this a statistically significant relationship (assuming an α = .05)?

Yes

QUESTION 13

What proportion of total variance in flipper length is accounted for by bill length? (give answer as % to 1 decimal place, such as 77.2)

43.1

Correct!That’s not right…

OK, so we know that bill length is a significant predictor or flipper length. However, it could well be the case, that both of these variables are just reflections of body mass and, if we take body mass into account, the relationship between bill and flipper length disappears. In other words, it’s possible that larger penguins have both longer bills and longer flippers but once you only consider birds of a certain body mass, the ones with longer bills do not have longer flippers. Putting both of the predictors in the model allows us to see whether or not this is actually the case.

A careful inspection of the R2 values of our two models reveals that the relationship is not going to be completely straightforward. The first model accounts for 75.9% of the variance in the outcome, while the second model accounts for 43.1%. That means, that the model that combines the two predictors cannot simply be the same as the two models added together. The reason for this is that the total amount of variance in flipper length is, well, 100% and not 75.9% + 43.1% = 119%.

In other words, some of the variance accounted for by the two predictors will be shared between them.

Let’s do it then!

Task 7

Let’s fit and interpret a two-predictor model of flipper length one step at a time.

Task 7.1punk!

First of all, let’s fit the model using lm() and save it as m_mass_bill.

Hint

Check out lecture 9 if you can’t remember how to add multiple predictors into the formula.

Task 7.2Prog-rocK

Get the summary of the model and have a look at it. Are the model coefficients the same as the ones from m_mass and m_bill? If not, how are they different?

Remember that the interpretation of the slopes in a multi-predictor model is different from that in a single-predictor model.

In our particular model, the \(b\) coefficient for body mass represents the increase in flipper length given a certain fixed bill length. Conversely, the \(b\) coefficient for body bill length represents the increase in flipper length given a certain fixed body mass.

Task 8jazz...

Given the coefficients of our model – 121.956, 0.013, and 0.549 – give the predicted value of flipper length for the following values of body mass and bill length. (2 decimal places)

Hint

Do you dare use R to do the numbers for you? 😛

QUESTION 14

Mass: 4,200 g
Bill length: 45 mm

201.48 | 201.26

Correct!That’s not right…

QUESTION 15

Mass: 4,200 g
Bill length: 30 mm

193.25 | 193.03

Correct!That’s not right…

QUESTION 16

Mass: 3,800 g
Bill length: 45 mm

196.26 | 196.06

Correct!That’s not right…

QUESTION 17

Mass: 3,800 g
Bill length: 30 mm

188.03 | 187.83

Correct!That’s not right…

A good thing to notice is the relationships between the answers to each of the questions and the \(b\) coefficients.

The values of body mass in questions 14 and 15 are the same and the values of bill length differ by 15 mm. That means that the difference between the results is going to be 15 × 0.549 = 8.24. That’s going to be the same as the difference between answers to questions 16 and 17 because the difference in the respective values of bill length in these two questions are also 15 mm, while the values of body mass are, again, the same.

Questions 14 and 16 on the one hand and 15 and 17 on the other have the same values of bill length but, in both cases, they differ in body mass by 400 g. That means that the differences between both pairs of results is going to be 400 × 0.013 = 5.2.

Finally, the differences between questions 14 and 17 and 15 and 16, respectively, is going to be the sum of the previous two differences: 400 × 0.013 + 15 × 0.549 = 13.44.

gimme moR!

Calculating predicted values in R

Of course, we can tell the computer to calculate the predicted values for us!

There are multiple ways of doing this and there are even functions written exactly for this purpose, such as predict() or broom::augment(). But, for now, let’s use what we already know.

If you look at the output of our model using broom::tidy() (which we saved as m_mass_bill_smry), you’ll find the coefficients there, in the estimate column:

m_mass_bill_smry
# A tibble: 3 × 5
  term           estimate std.error statistic   p.value
  <chr>             <dbl>     <dbl>     <dbl>     <dbl>
1 (Intercept)    122.      2.86         42.7  1.70e-138
2 body_mass_g      0.0131  0.000545     23.9  7.56e- 75
3 bill_length_mm   0.549   0.0801        6.86 3.31e- 11

We can use this column to calculate the predicted value for any set of predictor values. All we need is a column of values to multiply the estimate column by and then add the products together:

b x b * x
121.956 1 121.956
0.013 4200 54.813
0.549 45 24.715
sum 201.484

To do this in R you only need to come up with an algorithm. An OK one could look something like this:

  1. Add a column, let’s call it x, with the values of predictors. Since we’re multiplying two columns, the predictor column also needs a 1 to multiply the intercept by.
  2. Add another column that is the product of the multiplication of the estimate and x columns. Let’s call it bx.
  3. Pull the bx column from the tibble.
  4. Add up all its elements.
  5. Round the result to two decimal places.

All that’s left is to translate these steps into R using basic data wrangling functions:

m_mass_bill_smry %>%
  dplyr::mutate(x = c(1, 4200, 45), # step 1
                bx = estimate * x) %>% # step 2
  dplyr::pull(bx) %>% # step 3
  sum() %>% # step 4
  round(2) # step 5
[1] 201.48

And there’s your answer… It differs from the result of the calculation in the Solution box (201.26), because here we are not rounding our coefficients to 3 decimal places. Not doing so makes the answer more precise.

With a little thinking, you could even write code that calculates the answer to all four questions in one go!

Why don’t you give it a try as an extra task? Using the predict() function will make it very easy.

Task 9punk!

Look at the model fit statistic, R2 in particular.

QUESTION 18

What proportion of variance in the outcome does the model account for? (Give answer as % to 1 decimal place)

78.8

Correct!That’s not right…

Notice that R2 of this model is far smaller than the R2 of the two one-predictor models added together. In fact, it’s not that much larger than that of the body mass only model: While that model accounted for 75.9% of the total variance in flipper length, this one accounts for 78.8%. Once again, this is because there is a substantial overlap (correlation) between body mass and bill length.

Transforming variables in the model

When we were reporting our body mass model in the last tutorial, we changed the scale of the predictor from grammes to kilogrammes. This is by no means a problematic thing to do. After all, given the mean and SD of the body_mass_g variable (4201.75 and 801.95, respectively), it is not really reasonable to expect every 1 g increase in body mass to have a substantial effect on flipper length and so working with body mass in kg is fine.

Task 10Prog-rocK

Rescale the body mass variable, refit the model and inspect the results to see how the coefficients change.

Hint

Creating a mass_kg variable in the peng_data shouldn’t be difficult as it’s just a conversion of grammes to kilogrammes…

You should be able to see that both the intercept and the slope for bill length stayed the same as in the original model but the slope for mass (kg) scaled up by 1,000, compared to the one in the original model. Other than that, it’s the same number.

So as you can see, scaling the predictors scales the coefficients accordingly. If we wanted, we could scale flipper or bill length to some other units (such as centimetres) but, in this case, millimetres work well.

What doesn’t work as well, however, is the intercept. Not that there’s something wrong with it but, in the current model, it’s not very meaningful. After all, the flipper length of a hypothetical massless bill-less penguin is not that interesting. If may be more interesting to know what the expected flipper length is for a penguin of average size and average bill length.

Task 11Prog-rocK

Re-fit the model with the predictors centred around their respective means.

Hint

To mean-centre a variable, you can either subtract its mean from it, or use the scale() function with the right combination of its arguments. Check out the documentation for the function (?scale) to find out what arguments the function accepts.

QUESTION 19

So… What is the expected flipper length (in mm) for a penguin of average size and average bill length? (2 decimal places)

200.92

Correct!That’s not right…

Standardised B

The centred model we just made is pretty informative. However, it’s not immediately obvious which of our two predictors has a stronger effect on the outcome. That’s because the two predictors are measured on incomparable scales (kg vs mm). This is where standardised coefficients (Bs) come in handy: because they are both on the scale of standard deviations, they allow us to compare effects of very different variables.

Task 12

Use parameters::model_parameters() with the standardize = "refit" argument to get B coefficients for the predictors in our model.

QUESTION 20

This of the predictors has the strongest effect on the outcome?

Body mass

Reporting

Task 13jazz...

Report the findings of the model. At the very least, you should report R2 and the individual model coefficients along with their test of significance.

Hint

Check out tutorial 9 for inspiration.

 

 

That’s all for today. Good job!