Tutorial 8

Last time you learnt about the equation for a straight line and how it can be used to model a relationship between a predictor and an outcome variable by fitting a linear model to data. In this tutorial, we’ll talk about how we find the line of best fit that characterises the linear model and what we mean by “best fit”. In other words, we’ll see how we (well, R really) can find the values of the intercept and slope of the line that fits the data best. We will also talk about how the linear model is tested in the framework of Null Hypothesis Significance Testing (NHST). Finally, once we’ve found and tested our linear model, we will see how we can evaluate how good this model is.

Warm-up quiz

Before we start, why not get in the mood by answering a few quiz questions!

Remember, the equation for the linear model is

\[y_i = b_0 + b_1\times x_i + e_i\]

QUESTION 1

What does b1 represent in the formula?

The increase in outcome per unit increase in predictor

QUESTION 2

What value of the outcome does the model yi = 13 − 2.5 × xi predict when the value of the predictor is 4? (give answer to 2 decimal places)

3.00


QUESTION 3

What outcome value does the same model predict when the predictor is −1? (give answer to 2 decimal places)

15.50


QUESTION 4

What kind of relationship is this?

Negative

Remember that \(x\) represents the predictor in the linear model. To solve for the outcome, simply replace \(x\) in the equation with the given value of the predictor and finish up the maths!

Make sure you give your answer to two decimal places, even if there are 0s.

What about the little \(i\) guy?

The \(i\) subscript in \(y_i\)¹, \(x_i\), and \(e_i\) is simply the index of the given data point. You can imagine it as the row number in your data set: \(i = 1\) is the first row, \(i = 2\) is the second row, \(i = n\) is the \(n^{\text{th}}\) row of your data.

The model then expresses a different value of the outcome variable \(y\) for each value of the predictor variable \(x\). Because each point has a different amount of “error” (difference between what the model predicts for each row and what we observe), there’s also a separate \(e_i\) for each data point.

However, the model intercept and slope (the \(b_0\) and \(b_1\) coefficients) are the same for every point, and so they do not have an \(i\) subscript.

Setting Up

All you need is this tutorial and RStudio. Remember that you can easily switch between windows with the Alt + ↹ Tab (Windows) and ⌘ Command + ⇥ Tab (Mac OS) shortcuts.

Task 1

Create a new week_09 project and open it in RStudio. Then, create two new folders in the new week_09 folder: r_docs and data. Finally, open a new R Markdown file and save it in the r_docs folder. Get into the habit of creating a new code chunk in the file for each of the tasks below.

Remember, you can add new code chunks by:

  1. Using the RStudio toolbar: Click Code > Insert Chunk
  2. Using a keyboard shortcut: the default is Ctrl + Alt + I (Windows) or ⌘ Command + Alt + I (MacOS), but you can change this under Tools > Modify Keyboard Shortcuts…
  3. Typing it out: ```{r}, press Enter, then ``` again.
  4. Copying and pasting a code chunk you already have (but be careful of duplicated chunk names!)


Task 2

In this tutorial, we’ll return to the lovely penguins data set from the palmerpenguins package. You will also need the tidyverse and broom packages (install them if you don’t have them yet). Put the code that loads all of these in the setup chunk of your Rmd file and run it.
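Your setup chunk might look something like this (just a sketch; if any of these packages are missing, install them with install.packages() first):

library(tidyverse)
library(broom)
library(palmerpenguins)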

What is best fit?

Remember how we said in the lecture that there are many ways to draw a line through data. Take a moment to play with the intercept and slope of the line in the plot below. Use the yellow circles to drag the line to a position where you think it adequately fits the scattered data points.

If you’re thinking that it’s not really obvious how precisely to fit the line, you are correct! It’s really difficult to just eyeball the position and orientation. What we need is some sort of criterion for deciding which line fits the data best.

Deviation and residual

You learnt about the concepts of deviation and sum of squares last term (you can refresh your memory by revisiting the lecture handout) but to reiterate, deviation is the difference of each point in the data from some reference value, for example the mean. We use deviations to quantify the amount of variability in a variable. However, because the individual deviations of each point from the mean of a variable always add up to zero, we square the deviations and add those up. The result of this operation is the sum of squares and it is the basis for the calculation of the variance and the standard deviation (check out the formulae in the handout linked above if you don’t know what they are).
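To make this concrete, here is a quick sketch in base R using a few made-up numbers:

x <- c(2, 4, 4, 6, 9)
devs <- x - mean(x)   # deviations from the mean
sum(devs)             # deviations always add up to zero
ss <- sum(devs^2)     # the sum of squares
ss / (length(x) - 1)  # the variance; identical to var(x)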

Just like we can calculate the distance of points from a point, we can calculate their distance from a line. After all, the mean can be equally well represented as a point on the number line or as a line on a 2-axis plot! Let’s see what that looks like.

In the applet below, change the slope and intercept of the line so that it is the same as the dashed line that represents the mean of the y variable.

QUESTION 5

What is the intercept of this line?

Mean of y

QUESTION 6

What is the slope of this line? (2 decimal places)

0.00


The thin dashed lines connecting each point with the line are the deviations and the teal squares are, well, their squares. The purple square under the plot is the sum of squares: its area is equal to the area of all the teal squares combined.

Move the line about and change its slope and watch what happens with the teal and purple squares. Have a go!


Let’s talk about jargon for a second. While the term deviation is reserved for differences between data points and their mean, the vertical distances of points from the line of the linear model are called residuals.

There’s no principled difference between the two (after all, the mean can be represented as a line in the context of the linear model), but it’s important that you are familiar with the term residual.

So, when we talk about the Sum of Squared Residuals (SSR, sometimes written \(SS_R\)), we mean the total area of these teal squares, which is shown in the applet above as the purple square under the plot.

SSR tells us the extent to which the line (i.e., the linear model) does not fit the data. If the points were all perfectly aligned, the line connecting them would have SSR = 0.
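To make the idea concrete, here is a quick sketch, with made-up numbers, of how the SSR of any candidate line can be calculated:

x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)
b0 <- 0                     # intercept of a candidate line
b1 <- 2                     # slope of a candidate line
sum((y - (b0 + b1 * x))^2)  # the SSR of that line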

There are several ways of finding the line of best fit, depending on what your definition of “best” is, but the most straightforward way is to find the line that results in the smallest possible SSR. This method is called Ordinary Least Squares (OLS) regression because it finds the line with the smallest (least) sum of squared residuals.

Play around with the visualisation and try to find the OLS line of best fit. The line will change colour when you get it right!

QUESTION 7

What is the value of SSR of the line of best fit in the visualisation? (2 decimal places)

58.80



That’s pretty much all there is to finding the intercept and slope of the line (there’s a little more maths involved but you don’t have to worry about it). Of course, we don’t have to find the line of best fit ourselves. We can ask R to do it for us.
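If you are curious, the OLS intercept and slope can also be calculated directly from the data. Here is a sketch, reusing the made-up numbers from above, showing that the result agrees with R’s own lm() function (which we will meet shortly):

x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)
b1 <- cov(x, y) / var(x)      # slope
b0 <- mean(y) - b1 * mean(x)  # intercept
c(b0, b1)
coef(lm(y ~ x))               # the same two values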

Data

Task 3

The penguins object from the package palmerpenguins contains the dataset we worked with last term. Save it in your environment as peng_data so that you don’t have to type out palmerpenguins::penguins every time you want to do something with it.

peng_data <- palmerpenguins::penguins

Task 4

Make sure you know what variables there are in the dataset and look at a rough summary of it.

summary(peng_data)
      species          island    bill_length_mm  bill_depth_mm  
 Adelie   :152   Biscoe   :168   Min.   :32.10   Min.   :13.10  
 Chinstrap: 68   Dream    :124   1st Qu.:39.23   1st Qu.:15.60  
 Gentoo   :124   Torgersen: 52   Median :44.45   Median :17.30  
                                 Mean   :43.92   Mean   :17.15  
                                 3rd Qu.:48.50   3rd Qu.:18.70  
                                 Max.   :59.60   Max.   :21.50  
                                 NA's   :2       NA's   :2      
 flipper_length_mm  body_mass_g       sex           year     
 Min.   :172.0     Min.   :2700   female:165   Min.   :2007  
 1st Qu.:190.0     1st Qu.:3550   male  :168   1st Qu.:2007  
 Median :197.0     Median :4050   NA's  : 11   Median :2008  
 Mean   :200.9     Mean   :4202                Mean   :2008  
 3rd Qu.:213.0     3rd Qu.:4750                3rd Qu.:2009  
 Max.   :231.0     Max.   :6300                Max.   :2009  
 NA's   :2         NA's   :2                                 

Basic visualisation

Let’s build a model of penguin flipper length predicted by the body mass of these majestic creatures.

Task 5

First of all, let’s visualise the relationship between these two variables on a scatter plot. Create a plot that looks something like this:

peng_data %>%
  ggplot2::ggplot(aes(body_mass_g, flipper_length_mm)) +
  geom_point(alpha=.4) +
  labs(x="Body mass (g)", y="Flipper length (mm)") +
  cowplot::theme_cowplot()

The plot gives us a decent impression of what the relationship between the two variables is like.

QUESTION 8

What can be said of the relationship between body mass and flipper length in the penguins in our sample?

As body mass increases, so does flipper length

QUESTION 9

Looking at the plot, what can you safely claim about the slope of the line of best fit through the data?

It’s smaller than 1 but larger than 0

The relationship is positive and so the slope of the line must also be positive. However, given the scale of the variables (body mass in 1,000s and flipper length in 100s), a unit change in body mass results in a change in flipper length of less than one.

Fitting a linear model with lm()

The null model

Before we fit a model predicting flipper length by body mass to our data, let’s see what the null model looks like. The null – or intercept-only – model is the most basic model of a variable we can build. As the name suggests, it only includes the intercept, meaning that the slope is fixed to the value of zero. Conceptually, this model represents our best guess about the value of any observation of the outcome variable, if we have no information about its relationship with any other variables.

Task 6

Let’s build us one such model!

Task 6.1

Fit the intercept-only model of flipper_length_mm and save its output in an object called null_model. The R formula notation of a null model is outcome ~ 1.

null_model <- lm(flipper_length_mm ~ 1, data = peng_data)

Task 6.2

Let’s have a look what’s inside this object.

null_model

Call:
lm(formula = flipper_length_mm ~ 1, data = peng_data)

Coefficients:
(Intercept)  
      200.9  

What R returns is just a printout of the most basic information about the model contained in the object. The object itself, however, contains much more.

Task 6.3

Tell R to print out the structure of the null_model object.

Use the str() function to do this.

str(null_model)
List of 12
 $ coefficients : Named num 201
  ..- attr(*, "names")= chr "(Intercept)"
 $ residuals    : Named num [1:342] -19.92 -14.92 -5.92 -7.92 -10.92 ...
  ..- attr(*, "names")= chr [1:342] "1" "2" "3" "5" ...
 $ effects      : Named num [1:342] -3715.57 -13.89 -4.89 -6.89 -9.89 ...
  ..- attr(*, "names")= chr [1:342] "(Intercept)" "" "" "" ...
 $ rank         : int 1
 $ fitted.values: Named num [1:342] 201 201 201 201 201 ...
  ..- attr(*, "names")= chr [1:342] "1" "2" "3" "5" ...
 $ assign       : int 0
 $ qr           :List of 5
  ..$ qr   : num [1:342, 1] -18.4932 0.0541 0.0541 0.0541 0.0541 ...
  .. ..- attr(*, "dimnames")=List of 2
  .. .. ..$ : chr [1:342] "1" "2" "3" "5" ...
  .. .. ..$ : chr "(Intercept)"
  .. ..- attr(*, "assign")= int 0
  ..$ qraux: num 1.05
  ..$ pivot: int 1
  ..$ tol  : num 1e-07
  ..$ rank : int 1
  ..- attr(*, "class")= chr "qr"
 $ df.residual  : int 341
 $ na.action    : 'omit' Named int [1:2] 4 272
  ..- attr(*, "names")= chr [1:2] "4" "272"
 $ call         : language lm(formula = flipper_length_mm ~ 1, data = peng_data)
 $ terms        :Classes 'terms', 'formula'  language flipper_length_mm ~ 1
  .. ..- attr(*, "variables")= language list(flipper_length_mm)
  .. ..- attr(*, "factors")= int(0) 
  .. ..- attr(*, "term.labels")= chr(0) 
  .. ..- attr(*, "order")= int(0) 
  .. ..- attr(*, "intercept")= int 1
  .. ..- attr(*, "response")= int 1
  .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv> 
  .. ..- attr(*, "predvars")= language list(flipper_length_mm)
  .. ..- attr(*, "dataClasses")= Named chr "numeric"
  .. .. ..- attr(*, "names")= chr "flipper_length_mm"
 $ model        :'data.frame':  342 obs. of  1 variable:
  ..$ flipper_length_mm: int [1:342] 181 186 195 193 190 181 195 193 190 186 ...
  ..- attr(*, "terms")=Classes 'terms', 'formula'  language flipper_length_mm ~ 1
  .. .. ..- attr(*, "variables")= language list(flipper_length_mm)
  .. .. ..- attr(*, "factors")= int(0) 
  .. .. ..- attr(*, "term.labels")= chr(0) 
  .. .. ..- attr(*, "order")= int(0) 
  .. .. ..- attr(*, "intercept")= int 1
  .. .. ..- attr(*, "response")= int 1
  .. .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv> 
  .. .. ..- attr(*, "predvars")= language list(flipper_length_mm)
  .. .. ..- attr(*, "dataClasses")= Named chr "numeric"
  .. .. .. ..- attr(*, "names")= chr "flipper_length_mm"
  ..- attr(*, "na.action")= 'omit' Named int [1:2] 4 272
  .. ..- attr(*, "names")= chr [1:2] "4" "272"
 - attr(*, "class")= chr "lm"

As you can see, this is quite a long list of things. What’s important to understand is that there’s a lot of useful information returned by lm() and that we can query our objects for all kinds of things using a few handy functions.

One of these functions is summary() which we have already encountered.

Task 6.4

Ask R to give you a summary of the model.

summary(null_model)

Call:
lm(formula = flipper_length_mm ~ 1, data = peng_data)

Residuals:
    Min      1Q  Median      3Q     Max 
-28.915 -10.915  -3.915  12.085  30.085 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 200.9152     0.7604   264.2   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 14.06 on 341 degrees of freedom
  (2 observations deleted due to missingness)

Let’s break down the summary that R gives us:

Call:
lm(formula = flipper_length_mm ~ 1, data = peng_data)

This is just R’s way of reminding us what model we are looking at and what function call was used to fit it.

Residuals:
    Min      1Q  Median      3Q     Max 
-28.915 -10.915  -3.915  12.085  30.085 

These are some basic descriptive statistics of the model residuals. Remember that residuals are just the differences between what the model predicts – the line of best fit – and our actual observed data. The residuals are then a bunch of numbers and there are as many of them as we have data points. This little summary tells us the minimum and maximum value of these numbers, their median, and their 1st and 3rd quartiles.

This information is useful for doing so-called “model diagnostics” which is something we will learn next term so, for the time being, don’t worry about it too much.
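By the way, these five numbers are easy to reproduce yourself: resid() extracts the residuals from a fitted model object and quantile() returns their minimum, quartiles, and maximum:

quantile(resid(null_model))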

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 200.9152     0.7604   264.2   <2e-16 ***

This is the main meat of the summary: the model coefficients and their associated statistics. The model we fitted is an intercept-only model so it stands to reason that it only has one coefficient – the intercept. If we also included a predictor, this little table would have two rows, one for the intercept and one for the slope.

Let’s have a look at what the columns represent:

  - Estimate: the estimated value of the coefficient (here, the intercept \(b_0\))
  - Std. Error: the standard error of the estimate, quantifying its uncertainty
  - t value: the test statistic, calculated as the estimate divided by its standard error
  - Pr(>|t|): the p-value associated with the test statistic

The null hypothesis with respect to every row of this summary table is that the population value of the parameter (in this case \(b_0\)) is zero! If the p-value associated with the test statistic is smaller than our chosen \(\alpha\) level, we reject this null hypothesis and claim that the population value of the given parameter is statistically significantly different from 0.
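Just to demystify these numbers a little, here is a sketch of where the t value and the p-value come from, using the intercept row of the table above (pt() is base R’s cumulative distribution function of the t-distribution):

t_val <- 200.9152 / 0.7604     # estimate divided by its standard error
t_val                          # about 264.2
2 * pt(-abs(t_val), df = 341)  # two-sided p-value; effectively zero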

---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

This is just a legend for the asterisks printed next to the p-values: the more asterisks, the smaller the p-value.

Residual standard error: 14.06 on 341 degrees of freedom
  (2 observations deleted due to missingness)

This is additional information about our model and, at this stage, you don’t need to worry about it. However, notice that R is telling us that two rows were deleted from the dataset because they contained missing values in the outcome variable!

QUESTION 10

Without checking the mean of flipper_length_mm, what’s the relationship of the intercept of the null model to the mean of the outcome?

They are the same

The null model represents the mean of the outcome: its only coefficient is the intercept, and so the intercept of the null model is the mean of the outcome variable.
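You can easily verify this claim (coef() extracts the coefficients from a fitted model object):

coef(null_model)
mean(peng_data$flipper_length_mm, na.rm = TRUE)

Both commands should return the same value.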

Model of flipper length as predicted by body mass

Task 7

OK, now that we know how to read the output of the linear model, let’s fit the model we’ve been meaning to – one that predicts flipper length by body mass of a penguin.

Task 7.1

Fit this model using lm() and save its output into body_mass_model.

body_mass_model <- lm(flipper_length_mm ~ body_mass_g, data = peng_data)

Task 7.2

Ask R for the summary of this model.

summary(body_mass_model)

Call:
lm(formula = flipper_length_mm ~ body_mass_g, data = peng_data)

Residuals:
     Min       1Q   Median       3Q      Max 
-23.7626  -4.9138   0.9891   5.1166  16.6392 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 1.367e+02  1.997e+00   68.47   <2e-16 ***
body_mass_g 1.528e-02  4.668e-04   32.72   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6.913 on 340 degrees of freedom
  (2 observations deleted due to missingness)
Multiple R-squared:  0.759, Adjusted R-squared:  0.7583 
F-statistic:  1071 on 1 and 340 DF,  p-value: < 2.2e-16

As you can see, the printout now has one more row in the Coefficients table and some additional information about the model fit:

Multiple R-squared:  0.759, Adjusted R-squared:  0.7583 
F-statistic:  1071 on 1 and 340 DF,  p-value: < 2.2e-16

As explained in the lecture, R2 (here labelled Multiple R-squared) can be interpreted as the proportion of variance in the outcome accounted for by the predictor.

Let’s dig into this a little.

First, look at this plot:

Look at how much the points are dispersed above and below the purple line that represents the null model (\(\bar{y}\)). That will give you an impression of the total variance in the outcome variable.

Now look at how much the points are scattered with respect to the line of our model (above and below it, not side-to-side). You can see that they vary much less than with respect to the intercept-only model with no predictor.

That means that the predictor accounts for some of the variance in the outcome. So, if we know the body mass of a penguin, we can more accurately guess its flipper length than if we don’t know its body mass!

That’s exactly what it means for a variable to predict another variable – it makes our guess about the likely value of the outcome more “educated” and precise.
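To see this in action, we can ask the model for its best guess about the flipper length of a penguin of any given body mass using base R’s predict() function (the 4,000 g value below is just an arbitrary example):

predict(body_mass_model, newdata = data.frame(body_mass_g = 4000))
# intercept + slope × 4000, i.e., about 198 mm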

As for the adjusted R2 and the F-statistic, we will talk about what they are for next week.

Task 7.3

Answer the following questions.

QUESTION 11

How many coefficients are there in this model? (whole number)

2


QUESTION 12

What’s the value of the slope? (3 decimal places and watch out for scientific notation!)

0.015


QUESTION 13

Is the relationship between body mass and flipper length statistically significant at α = .05?

Yes

QUESTION 14

What proportion of the total variance in flipper length is accounted for by a penguin’s body mass? (in % to 1 decimal place, e.g., 33.3)

75.9


Task 7.4

Finally, visualise the model by adding the line of best fit to our scatterplot. Perhaps something like this.

We did this in the previous tutorial. The function is geom_smooth()

peng_data %>%
  ggplot2::ggplot(aes(body_mass_g, flipper_length_mm)) +
  geom_point(alpha=.4) +
  # for colours, change colour and fill to whatever colours you like!
  geom_smooth(method = "lm", colour = "#28707e", fill = "#ebb142") +
  labs(x="Body mass (g)", y="Flipper length (mm)") +
  cowplot::theme_cowplot()

Tidying up R output

The one drawback of the summary() of an lm object is that it doesn’t come in a nice tibble. Luckily, the package broom was written precisely to tidy² the output of models into this familiar and convenient format. The package contains a few handy functions we’ll be using quite a bit.

First up is the broom::tidy() function.³

Task 8

Take the body_mass_model object and pipe it into the broom::tidy() function to see what it does.

body_mass_model %>%
  broom::tidy()
# A tibble: 2 × 5
  term        estimate std.error statistic   p.value
  <chr>          <dbl>     <dbl>     <dbl>     <dbl>
1 (Intercept) 137.      2.00          68.5 5.71e-201
2 body_mass_g   0.0153  0.000467      32.7 4.37e-107

As you can see, everything that was previously in the Coefficients section of the summary() printout is now in a neat tibble. That means that we can easily query individual values using our dplyr chops.

Task 9

Write code that prints out only the p-value associated with the b1 coefficient.

You want to filter() this tibble so that it only has a single row and then pull() a single variable out of it.

body_mass_model %>%
  broom::tidy() %>%
  # only get the second row representing b1
  dplyr::filter(term == "body_mass_g") %>%
  # pull out the p.value column
  dplyr::pull(p.value)
[1] 4.370681e-107

That’s quite useful, don’t you think?

The second function you should be aware of is broom::glance().

Task 10

Take the body_mass_model object and pipe it into the broom::glance() function to see what it does.

body_mass_model %>%
  broom::glance()
# A tibble: 1 × 12
  r.squared adj.r.squared sigma statistic   p.value    df logLik   AIC
      <dbl>         <dbl> <dbl>     <dbl>     <dbl> <dbl>  <dbl> <dbl>
1     0.759         0.758  6.91     1071. 4.37e-107     1 -1146. 2297.
# … with 4 more variables: BIC <dbl>, deviance <dbl>,
#   df.residual <int>, nobs <int>

The output of this function, provided you gave it an lm object, will only ever have one row with all the various model fit statistics. We will talk about some of these in the coming months but, for now, the most important is the r.squared column containing, you guessed it, R2.
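Because the output is a tibble, the same pull() trick from before works here too. For instance, to extract only R2:

body_mass_model %>%
  broom::glance() %>%
  dplyr::pull(r.squared)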

Reporting results

Now that we have the results from our linear model, we can report them. Let’s first see what a pretty standard way of reporting this kind of model looks like and then break it down piece by piece.

The model accounted for a significant proportion of variance in flipper length, \(R^2 = .76\), 90% CI \([0.72, 0.79]\), \(F(1, 340) = 1,070.74\), \(p < .001\). Body mass was positively related to flipper length, with flipper length increasing by 15.28 millimetres with each one kilogramme increase in body mass, 95% CI \([14.36\), \(16.19]\), \(t(340) = 32.72\), \(p < .001\).

So what do we have here…

First, we report that, overall, the model was significant, i.e., that it explained a significant proportion of the outcome’s variance. To evidence this claim, we should report:

  - the value of \(R^2\) (and, ideally, its confidence interval)
  - the F-statistic with its degrees of freedom
  - the associated p-value

Once we’ve reported the overall model fit, we should report the relationship of interest and, once again, provide evidence by quoting the relevant coefficient and its statistical test. In our case, we’re only interested in the slope (\(b_1\)) because it is the slope of the line that tells us about the relationship between our predictor and outcome. To fully report the relationship, we need:

  - the value of the \(b_1\) estimate and its confidence interval
  - the t-statistic with its degrees of freedom
  - the associated p-value

It is nice to tell the reader what this relationship looks like by telling them what change in the outcome is associated with a unit change in the predictor. Notice that we reported this in terms of a 1 kg change in body mass, even though the data is in grammes. This means that we’re reporting a \(b\) that’s 1,000 times as large as we saw in the model results. The reason for doing this is that the number was so small that rounding it to 2 decimal places resulted in a value and a confidence interval that was not very informative (basically 0.02 [0.01, 0.02]).
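One way of doing this rescaling, assuming we want body mass in kilogrammes (body_mass_kg is just a column name we are making up here), might look like this:

peng_data <- peng_data %>%
  dplyr::mutate(body_mass_kg = body_mass_g / 1000)

body_mass_model_kg <- lm(flipper_length_mm ~ body_mass_kg, data = peng_data)
# the slope is now ~15.28 (mm per kg) instead of ~0.015 (mm per g)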

If this ever happens to you and you decide to change the scale of any of the variables in the model, you must rerun all models and redo all plots to reflect this change.

It is absolutely crucial that all results, tables, and figures are consistent and in agreement with each other!

Extra task: Calculating R2 “by hand”

Task 11

To cement your understanding of the proportion of variance explained, let’s calculate R2 ourselves!

First, let’s introduce another handy broom function.

Task 12

Pipe the model object into the broom::augment() function and assign the output to an object called body_mass_model_data.

body_mass_model_data <- body_mass_model %>%
  broom::augment()

Task 13

Let’s have a look at what the function returned…

body_mass_model_data
# A tibble: 342 × 9
   .rownames flipper_length_mm body_mass_g .fitted .resid    .hat
   <chr>                 <int>       <int>   <dbl>  <dbl>   <dbl>
 1 1                       181        3750    194. -13.0  0.00385
 2 2                       186        3800    195.  -8.78 0.00366
 3 3                       195        3250    186.   8.62 0.00705
 4 5                       193        3450    189.   3.57 0.00550
 5 6                       190        3650    192.  -2.49 0.00431
 6 7                       181        3625    192. -11.1  0.00444
 7 8                       195        4675    208. -13.1  0.00395
 8 9                       193        3475    190.   3.19 0.00533
 9 10                      190        4250    202. -11.7  0.00293
10 11                      186        3300    187.  -1.14 0.00663
# … with 332 more rows, and 3 more variables: .sigma <dbl>,
#   .cooksd <dbl>, .std.resid <dbl>

For this task, we only need to understand a few of these variables:

  - flipper_length_mm: the observed values of the outcome variable
  - .fitted: the value of the outcome that the model predicts for each data point
  - .resid: the residuals, i.e., the differences between the observed and the fitted values

Task 13.1

Calculate the variance of the residuals from this tibble and save the value in var_resid.

var_resid <- var(body_mass_model_data$.resid)

QUESTION 15

What is the variance of the residuals? (to 2 decimal places)

47.65


Task 13.2

Calculate the variance of the outcome variable and assign it to var_total.

There may be NAs in the variable, *nudge, nudge, wink, wink*…

var_total <- var(peng_data$flipper_length_mm, na.rm = TRUE)

# alternatively, using body_mass_model_data (no missing values)
var_total <- var(body_mass_model_data$flipper_length_mm)

QUESTION 16

What is the total variance of the outcome? (to 2 decimal places)

197.73


Task 13.3

Apply the formula for R2 to get the value.

\(R^2 = 1 - \frac{var_{\text{resid}}}{var_{\text{total}}}\)

1 - var_resid / var_total
[1] 0.7589925

As a sanity check, make sure you get the same value as the one in the model summary (or in the broom::glance() tibble).


That’s all for today. Good job!


  1. Pronounced as “why aye” or “why sub-aye”.

  2. Get it? broom, tidy… SO fUnNY!

  3. Seriously, will the lulz ever end?!