Summarising descriptive statistics with kableExtra

Introduction

In this tutorial, we will learn how to create publication quality tables summarising descriptive statistics of variables in your data. Looking at these kinds of statistics is really important because, before you start visualising your data and fitting statistical models, you want to understand what your variables look like and what kind of information they contain. It’s also a good idea to share a summary of these details with whoever is reading your dissertation/report/paper in a neat and concise way.

For the purpose of this tutorial, the particular data is not all that important, so we might as well fake it:

n <- 782
fake_data <- tibble::tibble(
  id = sprintf("%03d", 1:n),
  n_friends = round(rnorm(n, mean = 24, sd = 6)),
  salary_k = round(rnorm(n, 37, 4.5), 1),
  happiness = 32 + 4.5 * n_friends + rnorm(n, 0, 20),
  group = rep(c("control", "experimental"), each = n/2)
) %>%
  dplyr::mutate(happiness = round(happiness/max(happiness) * 98))
## add a few outliers
samp <- sample(n, 5)
# quadruple salary of random 5 participants
fake_data$salary_k[samp] <- fake_data$salary_k[samp] * 4
samp <- sample(n, 4)
# make random 4 extra friendly
fake_data$n_friends[samp] <- fake_data$n_friends[samp] * 13

Let’s see what the data looks like:

OK, let’s get exploring!

Making a good table is a 3 step process:

  1. Design the table
  2. Create a tibble that contains all the information you wish to show in the table
  3. Format the table

Table design

The first step happens in your head or on a piece of paper. To design a good table, you should have a notion of what tables of descriptive statistics tend to look like. To get a feel for that, the best thing to do is read high quality papers and notice how the authors tabulate the descriptives of their data.

A solid table of descriptive stats based on our data might look something like this:

Mean [95% CI] *
SD
Median
Range
Table 1: Table 1
Descriptive statistics of relevant variables.
Happiness 55.57 [54.59, 56.59] 13.35 55.0 16 98
N of friends 25.47 [24.28, 26.98] 19.93 24.0 7 338
Annual salary (£1k) 37.82 [37.22, 38.57] 9.51 37.3 24.9 157.2
Note:
For all reported variables, N = 782.
* Based on 1,000 bootstrap samples.

Look at that beauty! Concise, clear, and jam-packed with useful information. Notice a few characteristics of a good table:

  • It’s numbered and has a title. This particular style (number in boldface and title in italics) is compliant with APA 6th style guide and you can’t really go wrong following this guide.
  • It’s well organised:
    • Variables in rows
    • Statistics in columns
    • Ordering of columns aids understanding: the CIs displayed are confidence intervals around the means so putting them next to the column of means makes sense.
  • It contains a lot of information but there’s no redundancy (no point showing both SD and variance, since the former is just the square root of the latter)
  • Has a note that provides extra information that is required for a full understanding of the table

Now, admittedly, creating this table is a little advanced but you already have the skills to make one that’s not that far off:

Table 2: Table 1 Descriptive statistics of relevant variables.
95% CI *
Mean Lower Upper SD Median Min Max
Happiness 55.57 54.67 56.49 13.35 55.0 16.0 98.0
N of friends 25.47 24.16 27.05 19.93 24.0 7.0 338.0
Annual salary (£1k) 37.82 37.21 38.50 9.51 37.3 24.9 157.2
Note:
For all reported variables, N = 782.
* Based on 1,000 bootstrap samples.

Data wrangling

Now, that we have a good design in mind, let’s replicate this table in R and discuss the whole process in a step-by-step fashion.

The second step of the table-making process consists in creating a tibble with all the information in it. In our case, the tibble should look like this:

# A tibble: 3 x 9
  var                     n     y  ymin  ymax    sd   med   min   max
  <chr>               <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Happiness             782  55.6  54.7  56.6 13.4   55    16     98 
2 *N* of friends        782  25.5  24.2  26.9 19.9   24     7    338 
3 Annual salary (£1k)   782  37.8  37.2  38.6  9.51  37.3  24.9  157.

We don’t have to worry about column names at this point: The table header gets edited at a later stage. What’s important is that all the data in the body of the tibble is correct.

So how do we get from our raw data to this neat summary? Why of course, with tidyverse!

We will be including package names before calling any function that doesn’t get loaded by R on startup, e.g., dplyr::select() or kableExtra::kbl(). The only exception is the pipe operator %>% that cannot be easily written out with its package name. For that reason, we need to load the package that contains it:

Now, you already know the dplyr::summarise() function that you can use to, well, summarise a variable:

fake_data %>%
  dplyr::summarise(m_happiness = mean(happiness, na.rm = TRUE),
            sd_happiness = sd(happiness, na.rm = TRUE))
# A tibble: 1 x 2
  m_happiness sd_happiness
        <dbl>        <dbl>
1        55.6         13.4

We will definitely be using this function but how can we get it to summarise each variable and arrange the variables in rows? In other words, how can we get a summary of multiple columns to look like the result of the dplyr::group_by() %>% dplyr::summarise() pipeline? You know, kinda like this:

fake_data %>%
  dplyr::group_by(group) %>%
  dplyr::summarise(m_happiness = mean(happiness, na.rm = TRUE),
            sd_happiness = sd(happiness, na.rm = TRUE))
# A tibble: 2 x 3
  group        m_happiness sd_happiness
  <chr>              <dbl>        <dbl>
1 control             56.1         13.5
2 experimental        55.0         13.1

But instead of having the two groups in the first column, we want have the different variables we want to report.

To do that, we need to reshape our data so that it has two columns: - a variable column that contains the names of the variable (just like the group column in our data contains the names of the groups) - a value column that has the actual values of these variables.

# A tibble: 2,346 x 2
   var       value
   <chr>     <dbl>
 1 n_friends  21  
 2 salary_k   37  
 3 happiness  46  
 4 n_friends  23  
 5 salary_k   31.7
 6 happiness  45  
 7 n_friends  33  
 8 salary_k   31.1
 9 happiness  79  
10 n_friends  24  
# ... with 2,336 more rows

This essentially means pivoting the tibble from this tidy format to a strange kind of superlong format that you probably aren’t going to want to use for anything but a summary table. It’s rubbish for most kinds of analysis but it works very well with the dplyr::group_by() %>% dplyr::summarise() pipeline.

We can use the tidyr::pivot_longer() function but to do it easyly, we can also use dplyr::select() to only keep columns we’re going to be using:

fake_data %>%
  # keep only relevant variables
  dplyr::select(n_friends, salary_k, happiness) %>%
  tidyr::pivot_longer(
    1:3, # take columns 1-3 (so all columns)
    names_to = "var") # put their names into a var column

That’s all really! The function will take the three columns we selected, turn the data into this superlong format we saw in the output above and create an indicator column called var that includes the names of our original three variables.

After this, we can just add the already familiar dplyr::group_by() %>% dplyr::summarise() to the pipeline to create the summary tibble:

fake_data %>%
  dplyr::select(n_friends, salary_k, happiness) %>%
  tidyr::pivot_longer(1:3, names_to = "var") %>%
  dplyr::group_by(var) %>%
  dplyr::summarise(
    n = n(),
    ggplot2::mean_cl_boot(value),
    sd = sd(value),
    med = median(value),
    min = min(value),
    max = max(value)
  )
# A tibble: 3 x 9
  var           n     y  ymin  ymax    sd   med   min   max
  <chr>     <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 happiness   782  55.6  54.6  56.5 13.4   55    16     98 
2 n_friends   782  25.5  24.2  26.8 19.9   24     7    338 
3 salary_k    782  37.8  37.2  38.5  9.51  37.3  24.9  157.

As you can see the ggplot2::mean_cl_boot() function gave us means of each of the variables along with lower and upper bounds of 95% bootstrap confidence intervals around for these means. It really is a rather handy function!

OK, the only thing that still bothers me about this summary is the names of the summaried variables. It would be much nicer to have proper labels in the table, wouldn’t you agree? To change values of a character variable, we can use the dplyr::recode() function and tell it what value should be changed to what character string. And while we’re at it, let’s store the tibble in the environment so that we don’t have to write the same code over and over again when we’re formatting our table:

smry_tib <- fake_data %>%
  dplyr::select(n_friends, salary_k, happiness) %>%
  tidyr::pivot_longer(1:3, names_to = "var") %>%
  dplyr::group_by(var) %>%
  dplyr::summarise(
    n = n(),
    ggplot2::mean_cl_boot(value),
    sd = sd(value),
    med = median(value),
    min = min(value),
    max = max(value)
  ) %>%
  dplyr::mutate(var = dplyr::recode(
    var, # which column to recode
    n_friends = "*N* of friends", # we can use markdown too!
    salary_k = "Annual salary (£1k)",
    happiness = "Happiness")
  )
smry_tib
# A tibble: 3 x 9
  var                     n     y  ymin  ymax    sd   med   min   max
  <chr>               <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Happiness             782  55.6  54.6  56.5 13.4   55    16     98 
2 *N* of friends        782  25.5  24.2  27.0 19.9   24     7    338 
3 Annual salary (£1k)   782  37.8  37.2  38.5  9.51  37.3  24.9  157.

And that’s it! That’s our summary tibble. All we have left to do now is format it. Once that is done, the *N* will come out as N.

Formatting

To turn our summary tibble into a nice formatted table, we will be using the knitrExtra package which is a very powerful and flexible extension of the knitr::kable() function. The authors of the package created a pretty comprehensive guide to beautiful tables so do check it out!

The first step in formatting a tibble consists of converting it to a HTML table and setting a few basic parameters, such as column names in the table header, column alignment and rounding of the values, and cations. This is done with the kableExtra::kbl() function, which is pretty much the same function as knitr::kable():

smry_tib %>%
  kableExtra::kbl(
    digits = 2,
    col.names = c("", "N", "Mean", "Lower", "Upper",
                  "SD", "Median", "Min", "Max"),
    caption = "**Table 1** *Descriptive statistics of relevant variables.*")
Table 3: Table 1 Descriptive statistics of relevant variables.
N Mean Lower Upper SD Median Min Max
Happiness 782 55.57 54.58 56.48 13.35 55.0 16.0 98.0
N of friends 782 25.47 24.25 26.99 19.93 24.0 7.0 338.0
Annual salary (£1k) 782 37.82 37.20 38.48 9.51 37.3 24.9 157.2

When making tables, it’s always best to knit your Rmd files and see what the tables look like in the final formatted document.

Output under code chunks is often not a good representation of what the table looks like. For example, RStudio doesn’t render markdown so instead of “Table 1 Descriptive statistics of relevant variables.” you can still see the unformatted markdown with asterisks.

Depending on the setup of your final knitted document, this table can look anywhere from not great to pretty awful. To make it look nicer, we can use one of several styling functions that the kableExtra package offers:

  • kable_styling()
  • kable_classic()
  • kable_classic_2()
  • kable_material()
  • kable_material_dark()
  • kable_minimal()
  • kable_paper()

Have a click through the display below to see the different styles.

smry_tib %>%
  kableExtra::kbl(
    digits = 2,
    col.names = c("", "N", "Mean", "Lower", "Upper", "SD", "Median", "Min", "Max"),
    caption = "**Table 1** *Descriptive statistics of relevant variables.*") %>%
  kableExtra::kable_styling()
Table 4: Table 1 Descriptive statistics of relevant variables.
N Mean Lower Upper SD Median Min Max
Happiness 782 55.57 54.58 56.48 13.35 55.0 16.0 98.0
N of friends 782 25.47 24.25 26.99 19.93 24.0 7.0 338.0
Annual salary (£1k) 782 37.82 37.20 38.48 9.51 37.3 24.9 157.2
smry_tib %>%
  kableExtra::kbl(
    digits = 2,
    col.names = c("", "N", "Mean", "Lower", "Upper", "SD", "Median", "Min", "Max"),
    caption = "**Table 1** *Descriptive statistics of relevant variables.*") %>%
  kableExtra::kable_classic()
Table 5: Table 1 Descriptive statistics of relevant variables.
N Mean Lower Upper SD Median Min Max
Happiness 782 55.57 54.58 56.48 13.35 55.0 16.0 98.0
N of friends 782 25.47 24.25 26.99 19.93 24.0 7.0 338.0
Annual salary (£1k) 782 37.82 37.20 38.48 9.51 37.3 24.9 157.2
smry_tib %>%
  kableExtra::kbl(
    digits = 2,
    col.names = c("", "N", "Mean", "Lower", "Upper", "SD", "Median", "Min", "Max"),
    caption = "**Table 1** *Descriptive statistics of relevant variables.*") %>%
  kableExtra::kable_classic_2()
Table 6: Table 1 Descriptive statistics of relevant variables.
N Mean Lower Upper SD Median Min Max
Happiness 782 55.57 54.58 56.48 13.35 55.0 16.0 98.0
N of friends 782 25.47 24.25 26.99 19.93 24.0 7.0 338.0
Annual salary (£1k) 782 37.82 37.20 38.48 9.51 37.3 24.9 157.2
smry_tib %>%
  kableExtra::kbl(
    digits = 2,
    col.names = c("", "N", "Mean", "Lower", "Upper", "SD", "Median", "Min", "Max"),
    caption = "**Table 1** *Descriptive statistics of relevant variables.*") %>%
  kableExtra::kable_material()
Table 7: Table 1 Descriptive statistics of relevant variables.
N Mean Lower Upper SD Median Min Max
Happiness 782 55.57 54.58 56.48 13.35 55.0 16.0 98.0
N of friends 782 25.47 24.25 26.99 19.93 24.0 7.0 338.0
Annual salary (£1k) 782 37.82 37.20 38.48 9.51 37.3 24.9 157.2
smry_tib %>%
  kableExtra::kbl(
    digits = 2,
    col.names = c("", "N", "Mean", "Lower", "Upper", "SD", "Median", "Min", "Max"),
    caption = "**Table 1** *Descriptive statistics of relevant variables.*") %>%
  kableExtra::kable_material_dark()
Table 8: Table 1 Descriptive statistics of relevant variables.
N Mean Lower Upper SD Median Min Max
Happiness 782 55.57 54.58 56.48 13.35 55.0 16.0 98.0
N of friends 782 25.47 24.25 26.99 19.93 24.0 7.0 338.0
Annual salary (£1k) 782 37.82 37.20 38.48 9.51 37.3 24.9 157.2
smry_tib %>%
  kableExtra::kbl(
    digits = 2,
    col.names = c("", "N", "Mean", "Lower", "Upper", "SD", "Median", "Min", "Max"),
    caption = "**Table 1** *Descriptive statistics of relevant variables.*") %>%
  kableExtra::kable_minimal()
Table 9: Table 1 Descriptive statistics of relevant variables.
N Mean Lower Upper SD Median Min Max
Happiness 782 55.57 54.58 56.48 13.35 55.0 16.0 98.0
N of friends 782 25.47 24.25 26.99 19.93 24.0 7.0 338.0
Annual salary (£1k) 782 37.82 37.20 38.48 9.51 37.3 24.9 157.2
smry_tib %>%
  kableExtra::kbl(
    digits = 2,
    col.names = c("", "N", "Mean", "Lower", "Upper", "SD", "Median", "Min", "Max"),
    caption = "**Table 1** *Descriptive statistics of relevant variables.*") %>%
  kableExtra::kable_paper()
Table 10: Table 1 Descriptive statistics of relevant variables.
N Mean Lower Upper SD Median Min Max
Happiness 782 55.57 54.58 56.48 13.35 55.0 16.0 98.0
N of friends 782 25.47 24.25 26.99 19.93 24.0 7.0 338.0
Annual salary (£1k) 782 37.82 37.20 38.48 9.51 37.3 24.9 157.2

Any of these styles are fine, really, but since we’re after a publication-quality table, let’s stick with the one that resembles those found in psychology papers - kableExtra::kable_classic().

At this stage, the tables are good enough. But let’s not let the good be the enemy of the perfect and see how we can make our kableExtraNice, hur, hur, hur…

Going the extra mile

If you read through the package guide linked to above, you’ll see that you can easily change a few options and make the table striped and not take up the entire width of the page:

smry_tib %>%
  kableExtra::kbl(
    digits = 2,
    col.names = c("", "N", "Mean", "Lower", "Upper",
                  "SD", "Median", "Min", "Max"),
    caption = "**Table 1** *Descriptive statistics of relevant variables.*") %>%
  kableExtra::kable_classic(
    lightable_options = "striped",
    full_width = FALSE
  )
Table 11: Table 1 Descriptive statistics of relevant variables.
N Mean Lower Upper SD Median Min Max
Happiness 782 55.57 54.58 56.48 13.35 55.0 16.0 98.0
N of friends 782 25.47 24.25 26.99 19.93 24.0 7.0 338.0
Annual salary (£1k) 782 37.82 37.20 38.48 9.51 37.3 24.9 157.2

OK, now, as all variables have the same N, we can save some space and make the table a little nicer to read by getting rid of the N column and including a note under the table. There are two ways of excluding the column: either we can re-run the code that created our smry_tib, getting rid of the bit in dplyr::summarise() that created the column, or, we can just select the column out of our tib before we pipe it into kableExtra::kbl(). The second approach is a little quicker so I’ll choose that one.

smry_tib %>%
  dplyr::select(-n) %>%
  kableExtra::kbl(
    digits = 2,
    col.names = c("", "N", "Mean", "Lower", "Upper",
                  "SD", "Median", "Min", "Max"),
    caption = "**Table 1** *Descriptive statistics of relevant variables.*") %>%
  kableExtra::kable_classic(
    lightable_options = "striped",
    full_width = FALSE
  )
Error in dimnames(x) <- dn: length of 'dimnames' [2] not equal to array extent

Why do you think this didn’t work?

The tibble now has 8 columns but we’re still giving 9 column names to kableExtra::kbl()

 

OK, so let’s fix the error and use the kableExtra::footnote() function to add a general note to our table.

 smry_tib %>%
  dplyr::select(-n) %>%
  kableExtra::kbl(
    digits = 2,
    col.names = c("", "Mean", "Lower", "Upper",
                  "SD", "Median", "Min", "Max"),
    caption = "**Table 1** *Descriptive statistics of relevant variables.*") %>%
  kableExtra::kable_classic(
    lightable_options = "striped",
    full_width = FALSE
  ) %>%
  kableExtra::footnote(
    general = "For all reported variables, *N* = 782.")
Table 12: Table 1 Descriptive statistics of relevant variables.
Mean Lower Upper SD Median Min Max
Happiness 55.57 54.58 56.48 13.35 55.0 16.0 98.0
N of friends 25.47 24.25 26.99 19.93 24.0 7.0 338.0
Annual salary (£1k) 37.82 37.20 38.48 9.51 37.3 24.9 157.2
Note:
For all reported variables, N = 782.

Getting there!

Let’s just add an extra header above the “Lower” and “Higher” columns that will actually say what those columns contain; 95% confidence intervals. To do this, we’ll use the kableExtra::add_header_above() function. This function pretty much does what it says on the tin: add an extra header above the one we already have. All we need to do is give it header labels, one for each column, or tell it to collapse several columns into one with a single label. In our case, we want two columns of no label, then two columns with a 95% CI label, and finally 4 columns with no label again. For no column, we can simply use a character string with only a space in it (" ").

 smry_tib %>%
  dplyr::select(-n) %>%
  kableExtra::kbl(
    digits = 2,
    col.names = c("", "Mean", "Lower", "Upper",
                  "SD", "Median", "Min", "Max"),
    caption = "**Table 1** *Descriptive statistics of relevant variables.*") %>%
  kableExtra::add_header_above(
    c(" " = 2, "95% CI" = 2, " " = 4)) %>%
  kableExtra::kable_classic(
    lightable_options = "striped",
    full_width = FALSE
  ) %>%
  kableExtra::footnote(
    general = "For all reported variables, *N* = 782.")
Table 13: Table 1 Descriptive statistics of relevant variables.
95% CI
Mean Lower Upper SD Median Min Max
Happiness 55.57 54.58 56.48 13.35 55.0 16.0 98.0
N of friends 25.47 24.25 26.99 19.93 24.0 7.0 338.0
Annual salary (£1k) 37.82 37.20 38.48 9.51 37.3 24.9 157.2
Note:
For all reported variables, N = 782.

As a rule of thumb, put all code, including extra headers, except the footnote before the styling function!

Finally, let’s also put an asterisk next to the CI column label and add a note informing the revered reader that these are in fact bootstrapped CIs based on 1,000 samples.

 smry_tib %>%
  dplyr::select(-n) %>%
  kableExtra::kbl(
    digits = 2,
    col.names = c("", "Mean", "Lower", "Upper",
                  "SD", "Median", "Min", "Max"),
    caption = "**Table 1** *Descriptive statistics of relevant variables.*") %>%
  kableExtra::add_header_above(
    c(" " = 2, "95% CI *" = 2, " " = 4)) %>%
  kableExtra::kable_classic(
    lightable_options = "striped",
    full_width = FALSE
  ) %>%
  kableExtra::footnote(
    general = c(
      "For all reported variables, *N* = 782.",
      "* Based on 1,000 bootstrap samples."))
Table 14: Table 1 Descriptive statistics of relevant variables.
95% CI *
Mean Lower Upper SD Median Min Max
Happiness 55.57 54.58 56.48 13.35 55.0 16.0 98.0
N of friends 25.47 24.25 26.99 19.93 24.0 7.0 338.0
Annual salary (£1k) 37.82 37.20 38.48 9.51 37.3 24.9 157.2
Note:
For all reported variables, N = 782.
* Based on 1,000 bootstrap samples.

This really is a summary table of descriptive statistics that you could easily submit for publication in a papaer. Not that it can’t be even better, mind you, but it’s pretty good.

One important thing we haven’t dealt with yet is what to do if there are missing values in our data. So let’s talk about that for a hot minute…

Dealing with NAs

Let’s introduce a few missing values in the happiness variable:

# introduce missing data
fake_data$happiness[sample(n, 8)] <- NA

If we were to run our entire table-making pipeline (including the summary tibble) with this data, we’d run into some problems.

Table 15: Table 1 Descriptive statistics of relevant variables.
95% CI *
Mean Lower Upper SD Median Min Max
Happiness 55.56 54.64 56.54 NA NA NA NA
N of friends 25.47 24.25 27.01 19.93 24.0 7.0 338.0
Annual salary (£1k) 37.82 37.23 38.55 9.51 37.3 24.9 157.2
Note:
For all reported variables, N = 782.
* Based on 1,000 bootstrap samples.

Because there are now missing values in happiness, most the functions we used in dplyr::summarise() return the value NA. To get around it, we can either include the na.rm=TRUE argument in all of them, or simply discard rows with missing values from our dataset. If we choose the second, easier, option, we need to realise that we want to delete missing observations of happiness but not any of the other variables. That means, that we can only discard the rows once the row only contains the value of happiness, i.e., after we pivot our data to the superlong format:

... %>%
  dplyr::select(n_friends, salary_k, happiness) %>%
  tidyr::pivot_longer(1:3, names_to = "var") %>%
                                          <---- HERE!
  dplyr::group_by(var) %>%
  dplyr::summarise(
    n = n(),
    ggplot2::mean_cl_boot(value),
    ...

The function we want to use here is tidyr::drop_na()

fake_data %>%
  dplyr::select(n_friends, salary_k, happiness) %>%
  tidyr::pivot_longer(1:3, names_to = "var") %>%
  tidyr::drop_na() %>%
  dplyr::group_by(var) %>%
  dplyr::summarise(
    n = n(),
    ggplot2::mean_cl_boot(value),
    sd = sd(value),
    med = median(value),
    min = min(value),
    max = max(value)
  ) %>%
  dplyr::mutate(var = dplyr::recode(
    var, n_friends = "*N* of friends",
    salary_k = "Annual salary (£1k)",
    happiness = "Happiness")
  ) %>%
  dplyr::select(-n) %>%
  kableExtra::kbl(
    digits = 2,
    col.names = c("", "Mean", "Lower", "Upper", "SD", "Median", "Min", "Max"),
    caption = "**Table 1** *Descriptive statistics of relevant variables.*") %>%
  kableExtra::add_header_above(
    c(" " = 2, "95% CI *" = 2, " " = 4)) %>%
  kableExtra::kable_classic(
    lightable_options = "striped",
    full_width = FALSE
  ) %>%
  kableExtra::footnote(
    general = c(
      "For all reported variables, *N* = 782.",
      "* Based on 1,000 bootstrap samples."))
Table 16: Table 1 Descriptive statistics of relevant variables.
95% CI *
Mean Lower Upper SD Median Min Max
Happiness 55.56 54.68 56.43 13.40 55.0 16.0 98.0
N of friends 25.47 24.19 27.03 19.93 24.0 7.0 338.0
Annual salary (£1k) 37.82 37.18 38.50 9.51 37.3 24.9 157.2
Note:
For all reported variables, N = 782.
* Based on 1,000 bootstrap samples.

Hurrah, it worked! But now, we need to remember that our footnote isn’t true anymore: There are fewer than 782 observations of the happiness variable so we need to ammend the table accordingly, bringing back the N column:

Table 17: Table 1 Descriptive statistics of relevant variables.
95% CI *
N Mean Lower Upper SD Median Min Max
Happiness 774 55.56 54.59 56.48 13.40 55.0 16.0 98.0
N of friends 782 25.47 24.22 27.23 19.93 24.0 7.0 338.0
Annual salary (£1k) 782 37.82 37.20 38.50 9.51 37.3 24.9 157.2
Note:
* Based on 1,000 bootstrap samples.

And that’s it, you now know how to create an impressive table of descriptives in R!

Well done!

Extra content for the really keen

So you want even more, huh?

OK, I’m game… Let’s have a look at the very first table in this tutorial again:

Table 18: Table 1
Descriptive statistics of relevant variables.
N
Mean [95% CI] *
SD
Median
Range
Happiness 774 55.56 [54.61, 56.47] 13.40 55.0 16 98
N of friends 782 25.47 [24.27, 27.06] 19.93 24.0 7 338
Annual salary (£1k) 782 37.82 [37.22, 38.56] 9.51 37.3 24.9 157.2
Note:
* Based on 1,000 bootstrap samples.

I remember seeing the “Mean [95% CI]” column in a paper I was reading during my PhD and thought it looked pretty cool1 and I’ve been collapsing CIs, SEs, and SDs in my tables ever since. To see how this can be done, we first need to talk about the paste0() function.

In its most basic use, paste0() (that’s a zero, not O) takes an arbitrary number of arguments and combines them all into a single character string:

paste0(1, 2)
[1] "12"
paste0(1, " and ", 2)
[1] "1 and 2"
x <- 1
y <- 2
paste0("Object ", x, " and object ", y)
[1] "Object 1 and object 2"

We can easily use it to create a string such as the ones in the CI column of the table above:

paste0("[", x, ", ", y, "]")
[1] "[1, 2]"

Let’s use it in our pipeline inside of the dplyr::mutate() step. We’ll create a column ci containing and empty string in the dplyr::summarise() step and then mutate() this column using the paste0() function. The reason for doing it this way rather than just using dplyr::mutate(ci = ...) will become clear in a little bit. We can save this new summary tibble in an object.

extra_tib <- fake_data %>%
  dplyr::select(n_friends, salary_k, happiness) %>%
  tidyr::pivot_longer(1:3, names_to = "var") %>%
  dplyr::group_by(var) %>%
  tidyr::drop_na() %>%
  dplyr::summarise(
    n = dplyr::n(),
    ggplot2::mean_cl_boot(value),
    ci = "",
    sd = sd(value),
    med = median(value),
    min = min(value),
    max = max(value)
  ) %>%
  dplyr::mutate(
    var = dplyr::recode(var,
                        n_friends = "*N* of friends",
                        salary_k = "Annual salary (£1k)",
                        happiness = "Happiness"),
    ci = paste0("[", round(ymin, 2), ", ", round(ymax, 2), "]"))
extra_tib
# A tibble: 3 x 10
  var             n     y  ymin  ymax ci          sd   med   min   max
  <chr>       <int> <dbl> <dbl> <dbl> <chr>    <dbl> <dbl> <dbl> <dbl>
1 Happiness     774  55.6  54.6  56.5 [54.56,~ 13.4   55    16     98 
2 *N* of fri~   782  25.5  24.2  26.9 [24.22,~ 19.9   24     7    338 
3 Annual sal~   782  37.8  37.2  38.6 [37.23,~  9.51  37.3  24.9  157.

Now we can just select out the now obsolete ymin and ymax column and carry on building our table:

extra_tib %>%
  dplyr::select(-ymin, -ymax) %>%
  kableExtra::kbl(
    digits = 2,
    col.names = rep("", ncol(.)),
    align = "lrrlrrrrr",
    caption = "**Table 1**<br>*Descriptive statistics of relevant variables.*") %>%
  kableExtra::add_header_above(
    c("", "N", "Mean [95% CI] *" = 2, "SD", "Median", "Min", "Max"),
    line = FALSE) %>%
  kableExtra::kable_classic(
    lightable_options = "striped",
    full_width = FALSE) %>%
  kableExtra::footnote(
    general = "* Based on 1,000 bootstrap samples.")
Table 19: Table 1
Descriptive statistics of relevant variables.
N
Mean [95% CI] *
SD
Median
Min
Max
Happiness 774 55.56 [54.56, 56.51] 13.40 55.0 16.0 98.0
N of friends 782 25.47 [24.22, 26.9] 19.93 24.0 7.0 338.0
Annual salary (£1k) 782 37.82 [37.23, 38.58] 9.51 37.3 24.9 157.2
Note:
* Based on 1,000 bootstrap samples.

The point of creating the ci = "" column was for the column to be inserted in the correct place (right after the column of means). If we only created it in the mutate() step, it would be inserted as the last column and we’d have to rearrange the columns in another step.2

Also notice that we specified that the main header (col.names=) should just be a vector of empty strings, thereby not displaying it at all, and instead used kableExtra::add_header_above() to create a header with the “[95% CI]” label collapsed across two columns.

That looks almost good, but the mean and CI columns are a tad too far apart. We will solve this later but now let’s focus on the “Range” column of the original table. While it looks like a single column aligned by an en-dash (–), it really is three separate columns: one with the minima, one with just –s and one with the maxima of the variables.

Let’s insert the column of –s between min and max:

extra_tib <- fake_data %>%
  dplyr::select(n_friends, salary_k, happiness) %>%
  tidyr::pivot_longer(1:3, names_to = "var") %>%
  dplyr::group_by(var) %>%
  tidyr::drop_na() %>%
  dplyr::summarise(
    n = dplyr::n(),
    ggplot2::mean_cl_boot(value),
    ci = "",
    sd = sd(value),
    med = median(value),
    min = min(value),
    sep = "–",
    max = max(value)
  ) %>%
  dplyr::mutate(
    var = dplyr::recode(var,
                        n_friends = "*N* of friends",
                        salary_k = "Annual salary (£1k)",
                        happiness = "Happiness"),
    ci = paste0("[", round(ymin, 2), ", ", round(ymax, 2), "]"))
extra_tib
# A tibble: 3 x 11
  var         n     y  ymin  ymax ci        sd   med   min sep     max
  <chr>   <int> <dbl> <dbl> <dbl> <chr>  <dbl> <dbl> <dbl> <chr> <dbl>
1 Happin~   774  55.6  54.6  56.5 [54.6~ 13.4   55    16   –       98 
2 *N* of~   782  25.5  24.2  27.0 [24.2~ 19.9   24     7   –      338 
3 Annual~   782  37.8  37.2  38.6 [37.1~  9.51  37.3  24.9 –      157.
extra_tib %>%
  dplyr::select(-ymin, -ymax) %>%
  kableExtra::kbl(
    digits = 2,
    col.names = rep("", ncol(.)),
    caption = "**Table 1**<br>*Descriptive statistics of relevant variables.*") %>%
  kableExtra::add_header_above(
    c("", "N", "Mean [95% CI] *" = 2, "SD", "Median", "Range" = 3),
    line = FALSE) %>%
  kableExtra::kable_classic(
    lightable_options = "striped",
    full_width = FALSE) %>%
  kableExtra::footnote(
    general = "* Based on 1,000 bootstrap samples.")
Table 20: Table 1
Descriptive statistics of relevant variables.
N
Mean [95% CI] *
SD
Median
Range
Happiness 774 55.56 [54.61, 56.49] 13.40 55.0 16.0 98.0
N of friends 782 25.47 [24.24, 27.03] 19.93 24.0 7.0 338.0
Annual salary (£1k) 782 37.82 [37.18, 38.56] 9.51 37.3 24.9 157.2
Note:
* Based on 1,000 bootstrap samples.

This is almost good but the “Mean [95% CI]” and “Range” columns are still a little weird: They are obviously multiple columns and it just doesn’t look nice. To make them appear like they are each just a single column, we can style these columns using the kableExtra::column_spec() function. This function takes the number of the column we want to style and other arguments. One of these is the extra_css= argument that allows us to style the column in the resulting HTML document using the language used for styling websites – Cascading Style Sheet, or CSS. Without going into detail about CSS, we can use it here to change the padding on either side of any column by specifying the padding-left: and padding-right: CSS porperties:

extra_tib %>%
  dplyr::select(-ymin, -ymax) %>%
  kableExtra::kbl(
    digits = 2,
    col.names = rep("", ncol(.)),
    caption = "**Table 1**<br>*Descriptive statistics of relevant variables.*") %>%
  kableExtra::add_header_above(
    c("", "N", "Mean [95% CI] *" = 2, "SD", "Median", "Range" = 3),
    line = FALSE) %>%
  # right padding of column 3 (the CI colum) set to 3 pixels
  kableExtra::column_spec(3, extra_css = "padding-right: 3px;") %>%
  # left-padding removed from cols 4, 8, and 9
  kableExtra::column_spec(c(4, 8, 9), extra_css = "padding-left: 0;") %>%
  kableExtra::column_spec(7:8, extra_css = "padding-right: 0;") %>%
  kableExtra::kable_classic(
    lightable_options = "striped",
    full_width = FALSE) %>%
  kableExtra::footnote(
    general = "* Based on 1,000 bootstrap samples.")
Table 21: Table 1
Descriptive statistics of relevant variables.
N
Mean [95% CI] *
SD
Median
Range
Happiness 774 55.56 [54.61, 56.49] 13.40 55.0 16.0 98.0
N of friends 782 25.47 [24.24, 27.03] 19.93 24.0 7.0 338.0
Annual salary (£1k) 782 37.82 [37.18, 38.56] 9.51 37.3 24.9 157.2
Note:
* Based on 1,000 bootstrap samples.

So what happened here? We reduced the space (padding) to the right of the “Mean” column to 3 pixels, removed it altogether from columns 7 and 8 (min and sep), and removed left padding from columns 4, 8, and 9 (ci, sep, and max). This has the effect of bringing the columns closer together makeing them look like one column of Means and CIs and one of Ranges separated by –s. The latter still doesn’t look perfect though. The maxima are aligned to the right but it would be better to left-align them so that the entire column looks like it’s aligned by the dash. By default, numeric columns are right-aligned, while strings are left-aligned. We can change the default with the lign= argument of kableExtra::kbl() by giving it a character string of letters corresponding to the alignment of each column of the table:

So the alignment string should be "lrrrrrrcl":

extra_tib %>%
  dplyr::select(-ymin, -ymax) %>%
  kableExtra::kbl(
    digits = 2,
    col.names = rep("", ncol(.)),
    # l = left-aligned; r = right=aligned; c = centre-aligned
    align = "lrrrrrrcl",
    caption = "**Table 1**<br>*Descriptive statistics of relevant variables.*") %>%
  kableExtra::add_header_above(
    c("", "N", "Mean [95% CI] *" = 2, "SD", "Median", "Range" = 3),
    line = FALSE) %>%
  # right padding of column 3 (the CI colum) set to 3 pixels
  kableExtra::column_spec(3, extra_css = "padding-right: 3px;") %>%
  # left-padding removed from cols 4, 8, and 9
  kableExtra::column_spec(c(4, 8, 9), extra_css = "padding-left: 0;") %>%
  kableExtra::column_spec(7:8, extra_css = "padding-right: 0;") %>%
  kableExtra::kable_classic(
    lightable_options = "striped",
    full_width = FALSE) %>%
  kableExtra::footnote(
    general = "* Based on 1,000 bootstrap samples.")
Table 22: Table 1
Descriptive statistics of relevant variables.
N
Mean [95% CI] *
SD
Median
Range
Happiness 774 55.56 [54.61, 56.49] 13.40 55.0 16.0 98.0
N of friends 782 25.47 [24.24, 27.03] 19.93 24.0 7.0 338.0
Annual salary (£1k) 782 37.82 [37.18, 38.56] 9.51 37.3 24.9 157.2
Note:
* Based on 1,000 bootstrap samples.

And there you have it, a really, REALLY professionally-looking summary table!

 

If this somehow still isn’t enough, I’ll leave you with this beauty. See if you can figure out and understand the code!

fake_data %>%
  dplyr::select(group, n_friends, salary_k, happiness) %>%
  tidyr::pivot_longer(2:4, names_to = "var") %>%
  # group by combinations of the group and var variables 
  dplyr::group_by(group, var) %>%
  tidyr::drop_na() %>%
  dplyr::summarise(
    n = dplyr::n(),
    ggplot2::mean_cl_boot(value),
    ci = "",
    sd = sd(value),
    med = median(value),
    # paste is the simplest way to remove trailing zero from numbers
    # e.g. paste(1.0) -> "1"
    min = paste(min(value)),
    sep = "–",
    max = paste(max(value))
  ) %>%
  dplyr::mutate(
    var = dplyr::recode(var,
                        n_friends = "*N* of friends",
                        salary_k = "Annual salary (£1k)",
                        happiness = "Happiness"),
    ci = paste0("[", round(ymin, 2), ", ", round(ymax, 2), "]")) %>%
  dplyr::select(-ymin, -ymax) %>%
  # put descriptives of the two groups side-by-side in the tibble
  tidyr::pivot_wider(names_from = group, values_from = 3:10) %>%
  # create an empty column that will separate the two groups in the table
  dplyr::mutate(empty = "") %>%
  # take all _experimental columns and move them after the last column
  dplyr::relocate(dplyr::contains("experimental"),
                  .after = dplyr::last_col()) %>%
  dplyr::mutate(var = dplyr::recode(var, n_friends = "*N* of friends", salary_k = "Annual salary (£1k)", happiness = "Happiness")) %>%
  kableExtra::kbl(
    digits = 2,
    col.names = rep("", ncol(.)),
    # we have 18 columns now!
    align = "lrrlrrrclcrrlrrrcl",
    caption = "**Table 2**<br>*Descriptive statistics of relevant variables by condition.*") %>%
  kableExtra::column_spec(c(3, 12), extra_css = "padding-right: 3px;") %>%
  kableExtra::column_spec(c(7, 8, 16, 17), extra_css = "padding-right: 0;") %>%
  kableExtra::column_spec(c(4, 8, 9, 13, 17, 18), extra_css = "padding-left: 0;") %>%
  kableExtra::column_spec(c(1, 4, 13), extra_css = "white-space: nowrap;") %>%
  kableExtra::add_header_above(
    c("", "N", "Mean [95% CI] *" = 2, "SD", "Median", "Range" = 3,
      "", "N", "Mean [95% CI] *" = 2, "SD", "Median", "Range" = 3),
    line = FALSE) %>%
  # add another header with names of our two groups over the columns
  kableExtra::add_header_above(c("", "Control" = 8, "", "Experimental" = 8), extra_css = "border-color: #222222") %>%
  kableExtra::kable_classic(
    lightable_options = "striped",
    full_width = FALSE) %>%
  kableExtra::footnote(
    general = "* Based on 1,000 bootstrap samples.")
Table 23: Table 2
Descriptive statistics of relevant variables by condition.
Control
Experimental
N
Mean [95% CI] *
SD
Median
Range
N
Mean [95% CI] *
SD
Median
Range
Happiness 386 56.10 [54.78, 57.37] 13.61 55.0 16 98 388 55.02 [53.69, 56.31] 13.19 55.0 18 98
N of friends 391 24.76 [23.74, 26.38] 13.26 24.0 9 260 391 26.18 [24.05, 28.76] 24.88 24.0 7 338
Annual salary (£1k) 391 37.14 [36.65, 37.56] 4.60 37.2 25.5 51.3 391 38.50 [37.38, 39.82] 12.61 37.5 24.9 157.2
Note:
* Based on 1,000 bootstrap samples.

  1. Yep, that’s the kind of person I am…↩︎

  2. Technically, there is a way of doing it directly in dplyr::mutate() using either the .before= or .after= arguments but this functionality is currently marked as experimental and might not be kept by the package developers, so best not worry about it.↩︎