
Lab 8: Dynamic Documents

🧑‍💻 Make sure to finish previous labs before continuing with this lab! It might be tempting to skip some stuff if you're behind, but you may just end up progressively more lost … we will happily answer questions from any lab each week 😃

5.11 Covid-19 reporting

One of the few "success stories" of the pandemic was the way that the UK Health Security Agency rapidly made available some of the best monitoring data in the world to all researchers via a new UKHSA data dashboard. This dashboard still exists and has been expanded to cover more of the diseases the UK Government needs to monitor, right from measles through to Clostridioides difficile.

Covid-19 now primarily represents a risk to those with existing serious health conditions, so although monitoring continues it is less detailed. For today's lab we'll therefore use some of the historic data, which was very detailed, covering the period from 19 March 2020 to 16 November 2021. This data provides a great example of managing large real-world data which is not in precisely the format you need for analysis, so we will get to cover lots of the data science pipeline: data import, tidying, visualisation, and reporting.

For this toy scenario it is late 2021 and you have landed a job as a data scientist reporting to Prof Sir Chris Whitty, Chief Medical Officer (CMO) for England, and Prof Sir Patrick Vallance FRS, Government Chief Scientific Adviser (GCSA), at a time when monitoring is rapidly updated.

Your first task in the job is to create a new RMarkdown document in which you will produce a mini report for the CMO and GCSA on the state of Covid-19. You need to create this in a way which can be rapidly and automatically updated when the data refreshes every week.

Exercise 5.75 I have created an archive of the data at https://www.louisaslett.com/Courses/DSSC/data/c19.csv

Do not download the data; instead, copy and paste the permanent link into a read.csv() call so that each time you compile the document it downloads the latest data (we want to emulate the situation where your report downloads the data every time you compile it and would automatically pick up new information if the data were being refreshed).

Place that read.csv() in an R code block near the start of your report, with the chunk option that prevents the code from being printed in the document (you want the data loading to be "silent").

Click for solution We add this link to a code block near the start of the document, like this:

```{r,echo=FALSE}
c19 <- read.csv('https://www.louisaslett.com/Courses/DSSC/data/c19.csv')
```


Exercise 5.76 Load the data above into your working environment by running the read.csv() function in the console. You will often find you experiment in the console first to figure out the functions you want to embed in your document.

Look at the data frame that you've loaded.

  1. By using as.Date(), convert the date column from a string to a Date type.

  2. By using the ordered() function (look at the help), convert the age column from a string to an ordered factor.

    Hint 1: be careful, the default ordering will be alphabetical because age is a string … is that ordering correct? If not, you will need to manually override the order by following the help file!

    Hint 2: Look at the levels argument to ordered(). A toy sketch follows below.
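To illustrate on made-up data (a toy sketch, nothing to do with the Covid data), supplying the levels argument explicitly overrides the default alphabetical ordering:

```r
# Toy example: explicit levels give the ordering we actually want
sizes <- c("large", "small", "medium", "small")
ordered(sizes, levels = c("small", "medium", "large"))
#> [1] large  small  medium small 
#> Levels: small < medium < large
```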

Once you have figured out these commands, add them immediately after the read.csv() function in your hidden RMarkdown code block.

Hint: if you find unwanted warnings or messages being produced, you can more aggressively silence the spurious output by adding warning=FALSE, message=FALSE to your code block options.
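For example, the opening chunk of your report might use a header along these lines (just a sketch; these are standard knitr chunk options and can be combined however you need):

```{r, echo=FALSE, warning=FALSE, message=FALSE}
# data import and tidying goes here: hidden, and with warnings/messages silenced
```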

Click for solution

```r
## SOLUTION

library("tidyverse")

c19 <- read.csv("https://www.louisaslett.com/Courses/DSSC/data/c19.csv")

# See what the ages in the data are
ages <- sort(unique(c19$age))
ages
#> [1] "0_to_5"   "18_to_64" "6_to_17"  "65_to_84" "85+"

# So, clearly we want to take these unique levels in the order 1,3,2,4,5 based on that vector

# Update the data as instructed in the question
c19 <- c19 |> 
  mutate(date = as.Date(date),
         age = ordered(age, levels = ages[c(1, 3, 2, 4, 5)]))

# Check that the age is ordered and with the correct levels by looking at the first few with head:
head(c19$age)
#> [1] 85+      18_to_64 65_to_84 0_to_5   6_to_17  85+     
#> Levels: 0_to_5 < 6_to_17 < 18_to_64 < 65_to_84 < 85+
```


Exercise 5.77 Create an opening section and write a short sentence crediting the source of the data, including a working hyperlink to the official government UKHSA data dashboard at https://ukhsa-dashboard.data.gov.uk/.

Following that, create a graphic (using ggplot) which plots date on the x-axis, rate on the y-axis, colours the line by areaName and facets by age. Make sure the code does not print out (so it looks professional) and add some surrounding text to explain the graphic.

Is there a problem with this? By reading the documentation for facet_wrap or otherwise, fix the problem.

Click for solution The graphic you will add should use the following code (wrapped in an R code block with echo=FALSE):

```r
## SOLUTION

ggplot(c19, aes(x = date, y = rate, colour = areaName)) +
  facet_wrap(~ age) +
  geom_line()
```

The problem here is that the rate in different age groups varies a lot, so we may want to allow each facet panel to have different scales:

```r
## SOLUTION

ggplot(c19, aes(x = date, y = rate, colour = areaName)) +
  facet_wrap(~ age, scales = "free") +
  geom_line()
```


Exercise 5.78 Create another hidden code block which uses the Tidyverse to summarise over age groups, producing the total daily rate in each region, and save this in a new data frame.

Note: you need to be really careful! The rate is per 100,000 and the number of people in each age range in each region is not the same, so you can't just sum the rates!

Hint: a rate of 8469.3 per 100k in 85+ in London corresponds to a total of 12914 admissions, but a rate of 3158.2 per 100k in 65–84 in London corresponds to a total of 29846 admissions. So create a total column calculating the population size for each row, then summarise to get the new rate.
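As a quick sanity check of those hint numbers, you can back out the implied population sizes in the console (rough values only, since the rates in the hint are rounded):

```r
# Implied population size = value * 100000 / rate, using the numbers from the hint
12914 * 100000 / 8469.3   # ~152,480 people aged 85+ in London
29846 * 100000 / 3158.2   # ~945,000 people aged 65 to 84 in London
```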

Click for solution Taking into account the note and hint, we have that rate/100000 = value/total, where total is unknown. So we solve for total first, then group before summarising to compute the total rate per 100k across age ranges.

Then, we add a new code block with the following:

```r
## SOLUTION

c19.regions <- c19 |> 
  mutate(total = value * 100000 / rate) |> 
  group_by(date, areaName) |> 
  summarise(totalrate = sum(value) / sum(total) * 100000)
#> `summarise()` has grouped output by 'date'. You can override using the
#> `.groups` argument.
```
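If you would rather not see that grouping message at all, summarise() accepts a .groups argument; a sketch of the same pipeline with the message silenced:

```r
# Same summary as above, but dropping all grouping from the result so no message is printed
c19.regions <- c19 |> 
  mutate(total = value * 100000 / rate) |> 
  group_by(date, areaName) |> 
  summarise(totalrate = sum(value) / sum(total) * 100000, .groups = "drop")
```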


Exercise 5.79 Using this new data frame, create a new graphic in your report with ggplot which shows the total rate over time with line colour indicating the region.

Add the option warning=FALSE to your block in RMarkdown to avoid the warning about missing values (these are from the start of the pandemic before there were any admissions in some regions).

Click for solution You will add a new code block with the following:

```r
## SOLUTION

ggplot(c19.regions, aes(x = date, y = totalrate, colour = areaName)) +
  geom_line()
#> Warning: Removed 34 rows containing missing values or values outside the scale
#> range (`geom_line()`).
```


Exercise 5.80 Add another hidden code block which subsets the full data to retain only the last 7 days available in the data, in only the "North East and Yorkshire" region. Call the data frame recent_rates_ne.

Note: do not use a fixed date to find the last 7 days in the data: it must be automatically calculated. In other words, if you recompile your document a week from now, it should use whatever is the last 7 days at that time without you changing anything.

Click for solution

```r
## SOLUTION

# First figure out the most recent 7 days:
# this gets the unique dates, sorted from most recent to oldest
all_dates <- sort(unique(c19$date), decreasing = TRUE)
# The date 7 days ago (inclusive) will then be in all_dates[7]

# Now do the subset:
recent_rates_ne <- c19 |> 
  filter(areaName == "North East and Yorkshire",
         date >= all_dates[7])
```
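Since date was converted to the Date class earlier and the data are daily, an equivalent alternative (just a sketch, not the only acceptable answer) is to filter relative to the most recent date in the data:

```r
# Keep the last 7 days inclusive: from the maximum date back 6 days
recent_rates_ne <- c19 |> 
  filter(areaName == "North East and Yorkshire",
         date >= max(date) - 6)
```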


Exercise 5.81 Add to your hidden code block. By using pivot_wider in the Tidyverse or otherwise, create a data frame recent_rates_ne_wide like this (these are the most recent 7 days at the time of writing):

```
# A tibble: 7 × 6
  date       `0_to_5` `6_to_17` `18_to_64` `65_to_84` `85+`
  <date>        <dbl>     <dbl>      <dbl>      <dbl> <dbl>
1 2021-11-16     170.      90.9       616       2482. 7746 
2 2021-11-15     170.      90.7       615.      2478. 7739.
3 2021-11-14     169.      90.6       614.      2475  7734.
4 2021-11-13     168.      90.3       613.      2471. 7724.
5 2021-11-12     168.      90         612.      2468. 7716.
6 2021-11-11     168.      89.9       611.      2464. 7709.
7 2021-11-10     167.      89.8       610.      2460. 7700.
```

Hint: if the columns are in a funny order on your first attempt, you might find names_sort = TRUE a useful argument for pivot_wider.

Click for solution

```r
## SOLUTION

recent_rates_ne_wide <- pivot_wider(recent_rates_ne |> select(date, age, rate),
                                    names_from = "age",
                                    values_from = "rate",
                                    names_sort = TRUE)
# Looks right?
recent_rates_ne_wide
#> # A tibble: 7 × 6
#>   date       `0_to_5` `6_to_17` `18_to_64` `65_to_84` `85+`
#>   <date>        <dbl>     <dbl>      <dbl>      <dbl> <dbl>
#> 1 2021-11-16     170.      90.9       616       2482. 7746 
#> 2 2021-11-15     170.      90.7       615.      2478. 7739.
#> 3 2021-11-14     169.      90.6       614.      2475  7734.
#> 4 2021-11-13     168.      90.3       613.      2471. 7724.
#> 5 2021-11-12     168.      90         612.      2468. 7716.
#> 6 2021-11-11     168.      89.9       611.      2464. 7709.
#> 7 2021-11-10     167.      89.8       610.      2460. 7700.
```


Exercise 5.82 As a final line in the hidden code block we've been building up, add the following code:

knitr::kable(recent_rates_ne_wide)

This is a function which automatically creates a nicely formatted table from a data frame (matching whichever output format you compile to), so it will produce a neat table of the rates over the most recent 7 days for the North East and Yorkshire.

Click for solution Once you put it in, the table should look like this:

```r
## SOLUTION

knitr::kable(recent_rates_ne_wide)
```

| date       | 0_to_5 | 6_to_17 | 18_to_64 | 65_to_84 |    85+ |
|:-----------|-------:|--------:|---------:|---------:|-------:|
| 2021-11-16 |  169.9 |    90.9 |    616.0 |   2481.5 | 7746.0 |
| 2021-11-15 |  169.5 |    90.7 |    615.1 |   2477.8 | 7739.4 |
| 2021-11-14 |  169.2 |    90.6 |    613.9 |   2475.0 | 7733.8 |
| 2021-11-13 |  168.5 |    90.3 |    613.1 |   2471.4 | 7723.9 |
| 2021-11-12 |  168.5 |    90.0 |    611.9 |   2468.5 | 7716.4 |
| 2021-11-11 |  167.6 |    89.9 |    610.7 |   2464.2 | 7708.9 |
| 2021-11-10 |  167.1 |    89.8 |    609.7 |   2460.1 | 7699.5 |
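If you want more control over rounding in the printed table, kable() also accepts a digits argument; a sketch:

```r
# Round every numeric column to one decimal place in the rendered table
knitr::kable(recent_rates_ne_wide, digits = 1)
```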


This is a good point to check that your report is successfully compiling in HTML, PDF and Word (if available). Make sure there are no ugly bits of code or warning messages showing up.
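If you prefer checking this from the console rather than repeatedly pressing the Knit button, a sketch (report.Rmd is just a placeholder for whatever you have called your document):

```r
# Render the same document to each output format in turn
rmarkdown::render("report.Rmd", output_format = "html_document")
rmarkdown::render("report.Rmd", output_format = "pdf_document")
rmarkdown::render("report.Rmd", output_format = "word_document")
```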

To see the solution RMarkdown file, download here. To see the solution HTML file which that compiles to, download here.

Exercise 5.83 Prof Whitty and Prof Vallance are impressed with your work and want to roll it out on the official UK Government website https://www.gov.uk/, so you need to update the styling to match correctly.

To do this, install the govdown package:

install.packages("govdown")

You can see the documentation for this package here: https://ukgovdatascience.github.io/govdown/

Next, replace the whole document preamble by copy-pasting the following in its place (this should now be the entire preamble):

---
output:
  govdown::govdown_document:
    font: "sans-serif" # (default) or "new-transport"
    logo_url: "https://ukgovdatascience.github.io/govdown/"
    logo_text: "GOV.UK"
    page_title: "My Covid Dashboard"
    title: "My Covid Dashboard"
    phase: alpha
---

Recompile your document and ensure it looks like the GOV.UK website.

Click for solution To see the solution RMarkdown file in GOV.UK style, download here. To see the solution HTML file in GOV.UK style which that compiles to, download here.


Well done! If Dominic Cummings were still at No. 10, maybe you would have been rewarded with lifetime free entry to Klute! (It was a close shave, but apparently it has reopened!) 💃🪩🕺

๐Ÿ๐Ÿ Done, end of lab! ๐Ÿ๐Ÿ