Data Science and Statistical Computing

class: center, middle, inverse, title-slide

.title[
# Data Science and Statistical Computing
]
.subtitle[
## Practical Lecture 5<br>Data Wrangling
]
.author[
### Dr Louis Aslett
]
.institute[
### Durham University
]
.date[
### 7 November 2024
]

---

## The story so far ...

- Vectors
    - accessing, vector operations, ...

- Data frames
    - loading (CSV), accessing, subsetting/querying, variable creation, ...

- Lists
    - named/unnamed, nested, ...

- `for` loops, `if` statements, functions, ...

- Data types
    - numeric, logical, factors, ...

- Missing data

- Packages

- Base R plotting

---

## Data science workflow

<br>

<br>

Image from "R for Data Science", by H. Wickham & G. Grolemund.

Free online: [https://r4ds.had.co.nz/](https://r4ds.had.co.nz/)

---

## The Tidyverse ([tidyverse.org](https://www.tidyverse.org/))

- Collection of packages which took a new and structured approach to data science in R

- Focus on readable, clean and intuitive code.

- All packages follow common design philosophy, grammar, and data structures.

- Core Tidyverse packages include
    - `dplyr` for data manipulation
    - `tidyr` for wrangling data to a consistent form
    - `forcats` to ease working with factors
    - `ggplot2` advanced graphics with a consistent system for declarative specification based on a sound theory of visualisation
    - ... and more (`readr`, `purrr`, `tibble`, `stringr`)

- Many more packages in R now loosely align with the Tidyverse by adopting the same philosophies.

- There are even other aligned meta-package collections, for example Tidymodels ([tidymodels.org](https://www.tidymodels.org/)) for standard statistical and statistical machine learning modelling.

---

class: inverse

## Install Tidyverse

``` r
install.packages("tidyverse")
```

``` r
library("tidyverse")
```

```
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
```

---

## What is "tidy" data?

We say that data is in a "tidy" form when:

- each variable is in a column

- each observation is in a row

- each type of observational unit forms a table

This is also known in the database world as *"third normal form"*.

Often data is provided in a way that is *not* in this form. The following are a few rows from the World Health Organization Global Tuberculosis Report.

``` r
who_wide1
```

```
      country   1999   2000
1 Afghanistan    745   2666
2      Brazil  37737  80488
3       China 212258 213766
```
Very common to see this in spreadsheets!

---

## Why tidy data?

- Data stored in the tidy form is usually easiest for further manipulation, searching, sorting, etc.
- Tidy form *not* only option ... but easiest to transition to any other layout and aligns with how R operates on vectors.

The most common ways data violates the tidy form is either:

- It is too 'wide' ... one variable is actually spread over multiple columns
- Or, it is too 'long' ... one observation is actually spread over multiple rows

.pull-left[
.smaller[

``` r
who_wide1
```

```
      country   1999   2000
1 Afghanistan    745   2666
2      Brazil  37737  80488
3       China 212258 213766
```

``` r
who_wide2
```

```
      country       1999       2000
1       China 1272915272 1280428583
2      Brazil  172006362  174504898
3 Afghanistan   19987071   20595360
```
]
]
.pull-right[
.smaller[

``` r
who_long
```

```
       country year       type      count
1  Afghanistan 1999      cases        745
2  Afghanistan 1999 population   19987071
3  Afghanistan 2000      cases       2666
4  Afghanistan 2000 population   20595360
5       Brazil 1999      cases      37737
6       Brazil 1999 population  172006362
7       Brazil 2000      cases      80488
8       Brazil 2000 population  174504898
9        China 1999      cases     212258
10       China 1999 population 1272915272
11       China 2000      cases     213766
12       China 2000 population 1280428583
```
]
]

---

## Moving to/from tidy form (`tidyr` package)

The two key functions to tackle the wide/long problem are `pivot_longer()` and `pivot_wider()`.

- `pivot_longer()` to gather multiple columns into key-value pairs: makes wide data longer. Arguments are:
    - data frame
    - columns to transform
    - the name of the column where previous column names should go
    - the name of the column where values from the column should go

.pull-left[
.smaller[

``` r
who_wide1
```

```
      country   1999   2000
1 Afghanistan    745   2666
2      Brazil  37737  80488
3       China 212258 213766
```
]
]

.pull-right[
.smaller[

``` r
pivot_longer(who_wide1,
             c(`1999`, `2000`),
             names_to = "year",
             values_to = "cases")
```

```
      country year  cases
1 Afghanistan 1999    745
2 Afghanistan 2000   2666
3      Brazil 1999  37737
4      Brazil 2000  80488
5       China 1999 212258
6       China 2000 213766
```
]
]

---

## Moving to/from tidy form (`tidyr` package)

- `pivot_wider()` to gather multiple columns into key-value pairs: makes wide data longer. Arguments are:
    - data frame
    - the name of the column where previous column names should go
    - the name of the column where values from the column should go

.pull-left[
.smaller[

``` r
who_long
```

.pull-right[
.smaller[

``` r
pivot_wider(who_long,
            names_from = "type",
            values_from = "count")
```

```
      country year  cases population
1 Afghanistan 1999    745   19987071
2 Afghanistan 2000   2666   20595360
3      Brazil 1999  37737  172006362
4      Brazil 2000  80488  174504898
5       China 1999 212258 1272915272
6       China 2000 213766 1280428583
```
]
]

---

## Other useful `tidyr` functions

If you have data stuck within string column(s) the following are very helpful:

- `separate()` splits one column of strings into multiple new columns
- `unite()` combines many columns into one (as a string)
- `extract()` uses regular expressions to pull out specific information from a string column

``` r
fball
```

```
       home     away score
1     Man U Shef Wed   2-1
2 Tottenham  Arsenal   0-0
3   Chelsea    W Ham   1-0
```

``` r
separate(fball, "score", c("home_goals", "away_goals"))
```

```
       home     away home_goals away_goals
1     Man U Shef Wed          2          1
2 Tottenham  Arsenal          0          0
3   Chelsea    W Ham          1          0
```

---

## Data manipulation (`dplyr`)

The other huge task is to reorganise, manipulate, and transform data.

**Reality check:** this is 80% of data science in the real world!

Following are the main `dplyr` functions which with careful combination are incredibly powerful:

- `filter()` focus on a subset of rows

- `arrange()` reorder the rows

- `select()` focus on a subset of variables (columns)

- `mutate()` create new derived variables

- `summarise()` create summary statistics (collapsing many rows) by groupings

First argument is *always* the data set (important for reasons we'll see in pipelines)

---

## `filter()`

Other args:

- conditions to filter by ... separate logically with `&` (*and*), `|` (*or*)

``` r
filter(who, year == 1999)
```

```
      country year  cases population
1 Afghanistan 1999    745   19987071
2      Brazil 1999  37737  172006362
3       China 1999 212258 1272915272
```

``` r
filter(fball, home == "Chelsea" | away == "Chelsea")
```

```
     home  away score
1 Chelsea W Ham   1-0
```

``` r
filter(fball, home != "Chelsea" & away != "Chelsea")
```

```
       home     away score
1     Man U Shef Wed   2-1
2 Tottenham  Arsenal   0-0
```

---

## `arrange()`

Other args:

- variable names to sort by, subsorting by later variables
- sorts ascending, unless wrap variable with `desc()`

``` r
arrange(who, year, desc(country))
```

```
      country year  cases population
1       China 1999 212258 1272915272
2      Brazil 1999  37737  172006362
3 Afghanistan 1999    745   19987071
4       China 2000 213766 1280428583
5      Brazil 2000  80488  174504898
6 Afghanistan 2000   2666   20595360
```

``` r
arrange(who, desc(country), desc(year))
```

```
      country year  cases population
1       China 2000 213766 1280428583
2       China 1999 212258 1272915272
3      Brazil 2000  80488  174504898
4      Brazil 1999  37737  172006362
5 Afghanistan 2000   2666   20595360
6 Afghanistan 1999    745   19987071
```

---

## `select()`

Other args:

- names of variables to retain

``` r
select(who, year, cases)
```

```
  year  cases
1 1999    745
2 2000   2666
3 1999  37737
4 2000  80488
5 1999 212258
6 2000 213766
```

---

## `mutate()`

Other args:

- equation defining the new variable to create

``` r
mutate(who, rate = cases/population)
```

```
      country year  cases population         rate
1 Afghanistan 1999    745   19987071 0.0000372741
2 Afghanistan 2000   2666   20595360 0.0001294466
3      Brazil 1999  37737  172006362 0.0002193930
4      Brazil 2000  80488  174504898 0.0004612363
5       China 1999 212258 1272915272 0.0001667495
6       China 2000 213766 1280428583 0.0001669488
```

<br>

**Note:** how I can freely refer to variable names within *all* these functions, as though they are variables within my R session without having to use the `$` syntax on the data frame to access them!

---

## `summarise()`

Other args:

- function to summarise by

``` r
summarise(who, total = sum(cases))
```

```
   total
1 547660
```

**BUT** often want summaries by group, so use group_by first:

``` r
who2 <- group_by(who, country)
summarise(who2, total = sum(cases), change = max(cases)-min(cases))
```

```
# A tibble: 3 × 3
  country      total change
  <chr>        <dbl>  <dbl>
1 Afghanistan   3411   1921
2 Brazil      118225  42751
3 China       426024   1508
```

---

## Chaining functions (I)

Often need to run multiple manipulation functions in a row to get to the desired result.
How would you normally do this?

``` r
arrange(group_by(mutate(select(filter(x, ...), ...), ...), ...), ...)
```

``` r
arrange(
  group_by(
    mutate(
      select(
        filter(x, ...),
        ...),
      ...),
    ...),
  ...)
```

Hard to read, need to read backwards, arguments for particular layer of function are miles from the function name, ...

.center[ **Yuk!** 🤮 ]

---

## Chaining function (II)