class: center, middle, inverse, title-slide .title[ # Data Science and Statistical Computing ] .subtitle[ ## Practical Lecture 5
Data Wrangling ] .author[ ### Dr Louis Aslett ] .institute[ ### Durham University ] .date[ ### 7 November 2024 ] --- <style type="text/css"> .smaller .remark-code { /*Change made here*/ font-size: 80% !important; } </style> ## The story so far ... - Vectors - accessing, vector operations, ... - Data frames - loading (CSV), accessing, subsetting/querying, variable creation, ... - Lists - named/unnamed, nested, ... - `for` loops, `if` statements, functions, ... - Data types - numeric, logical, factors, ... - Missing data - Packages - Base R plotting --- ## Data science workflow <br> <img src="i/data-science.png" width="90%" style="display: block; margin: auto;" /> <br> Image from "R for Data Science", by H. Wickham & G. Grolemund. Free online: [https://r4ds.had.co.nz/](https://r4ds.had.co.nz/) --- ## The Tidyverse ([tidyverse.org](https://www.tidyverse.org/)) - Collection of packages which took a new and structured approach to data science in R - Focus on readable, clean and intuitive code. - All packages follow common design philosophy, grammar, and data structures. -- - Core Tidyverse packages include - `dplyr` for data manipulation - `tidyr` for wrangling data to a consistent form - `forcats` to ease working with factors - `ggplot2` advanced graphics with a consistent system for declarative specification based on a sound theory of visualisation - ... and more (`readr`, `purrr`, `tibble`, `stringr`) -- - Many more packages in R now loosely align with the Tidyverse by adopting the same philosophies. - There are even other aligned meta-package collections, for example Tidymodels ([tidymodels.org](https://www.tidymodels.org/)) for standard statistical and statistical machine learning modelling. --- class: inverse ## Install Tidyverse ``` r install.packages("tidyverse") ``` ``` r library("tidyverse") ``` ``` ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ── ✔ dplyr 1.1.4 ✔ readr 2.1.5 ✔ forcats 1.0.0 ✔ stringr 1.5.1 ✔ ggplot2 3.5.1 ✔ tibble 3.2.1 ✔ lubridate 1.9.3 ✔ tidyr 1.3.1 ✔ purrr 1.0.2 ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ── ✖ dplyr::filter() masks stats::filter() ✖ dplyr::lag() masks stats::lag() ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors ``` --- ## What is "tidy" data? We say that data is in a "tidy" form when: - each variable is in a column - each observation is in a row - each type of observational unit forms a table This is also known in the database world as *"third normal form"*. -- Often data is provided in a way that is *not* in this form. The following are a few rows from the World Health Organization Global Tuberculosis Report. ``` r who_wide1 ``` ``` country 1999 2000 1 Afghanistan 745 2666 2 Brazil 37737 80488 3 China 212258 213766 ``` Very common to see this in spreadsheets! --- ## Why tidy data? - Data stored in the tidy form is usually easiest for further manipulation, searching, sorting, etc. - Tidy form *not* only option ... but easiest to transition to any other layout and aligns with how R operates on vectors. -- The most common ways data violates the tidy form is either: - It is too 'wide' ... one variable is actually spread over multiple columns - Or, it is too 'long' ... one observation is actually spread over multiple rows .pull-left[ .smaller[ ``` r who_wide1 ``` ``` country 1999 2000 1 Afghanistan 745 2666 2 Brazil 37737 80488 3 China 212258 213766 ``` ``` r who_wide2 ``` ``` country 1999 2000 1 China 1272915272 1280428583 2 Brazil 172006362 174504898 3 Afghanistan 19987071 20595360 ``` ] ] .pull-right[ .smaller[ ``` r who_long ``` ``` country year type count 1 Afghanistan 1999 cases 745 2 Afghanistan 1999 population 19987071 3 Afghanistan 2000 cases 2666 4 Afghanistan 2000 population 20595360 5 Brazil 1999 cases 37737 6 Brazil 1999 population 172006362 7 Brazil 2000 cases 80488 8 Brazil 2000 population 174504898 9 China 1999 cases 212258 10 China 1999 population 1272915272 11 China 2000 cases 213766 12 China 2000 population 1280428583 ``` ] ] --- ## Moving to/from tidy form (`tidyr` package) The two key functions to tackle the wide/long problem are `pivot_longer()` and `pivot_wider()`. - `pivot_longer()` to gather multiple columns into key-value pairs: makes wide data longer. Arguments are: - data frame - columns to transform - the name of the column where previous column names should go - the name of the column where values from the column should go .pull-left[ .smaller[ ``` r who_wide1 ``` ``` country 1999 2000 1 Afghanistan 745 2666 2 Brazil 37737 80488 3 China 212258 213766 ``` ] ] .pull-right[ .smaller[ ``` r pivot_longer(who_wide1, c(`1999`, `2000`), names_to = "year", values_to = "cases") ``` ``` country year cases 1 Afghanistan 1999 745 2 Afghanistan 2000 2666 3 Brazil 1999 37737 4 Brazil 2000 80488 5 China 1999 212258 6 China 2000 213766 ``` ] ] --- ## Moving to/from tidy form (`tidyr` package) - `pivot_wider()` to gather multiple columns into key-value pairs: makes wide data longer. Arguments are: - data frame - the name of the column where previous column names should go - the name of the column where values from the column should go .pull-left[ .smaller[ ``` r who_long ``` ``` country year type count 1 Afghanistan 1999 cases 745 2 Afghanistan 1999 population 19987071 3 Afghanistan 2000 cases 2666 4 Afghanistan 2000 population 20595360 5 Brazil 1999 cases 37737 6 Brazil 1999 population 172006362 7 Brazil 2000 cases 80488 8 Brazil 2000 population 174504898 9 China 1999 cases 212258 10 China 1999 population 1272915272 11 China 2000 cases 213766 12 China 2000 population 1280428583 ``` ] ] .pull-right[ .smaller[ ``` r pivot_wider(who_long, names_from = "type", values_from = "count") ``` ``` country year cases population 1 Afghanistan 1999 745 19987071 2 Afghanistan 2000 2666 20595360 3 Brazil 1999 37737 172006362 4 Brazil 2000 80488 174504898 5 China 1999 212258 1272915272 6 China 2000 213766 1280428583 ``` ] ] --- ## Other useful `tidyr` functions If you have data stuck within string column(s) the following are very helpful: - `separate()` splits one column of strings into multiple new columns - `unite()` combines many columns into one (as a string) - `extract()` uses regular expressions to pull out specific information from a string column ``` r fball ``` ``` home away score 1 Man U Shef Wed 2-1 2 Tottenham Arsenal 0-0 3 Chelsea W Ham 1-0 ``` ``` r separate(fball, "score", c("home_goals", "away_goals")) ``` ``` home away home_goals away_goals 1 Man U Shef Wed 2 1 2 Tottenham Arsenal 0 0 3 Chelsea W Ham 1 0 ``` --- ## Data manipulation (`dplyr`) The other huge task is to reorganise, manipulate, and transform data. **Reality check:** this is 80% of data science in the real world! -- Following are the main `dplyr` functions which with careful combination are incredibly powerful: - `filter()` focus on a subset of rows - `arrange()` reorder the rows - `select()` focus on a subset of variables (columns) - `mutate()` create new derived variables - `summarise()` create summary statistics (collapsing many rows) by groupings First argument is *always* the data set (important for reasons we'll see in pipelines) --- ## `filter()` Other args: - conditions to filter by ... separate logically with `&` (*and*), `|` (*or*) ``` r filter(who, year == 1999) ``` ``` country year cases population 1 Afghanistan 1999 745 19987071 2 Brazil 1999 37737 172006362 3 China 1999 212258 1272915272 ``` ``` r filter(fball, home == "Chelsea" | away == "Chelsea") ``` ``` home away score 1 Chelsea W Ham 1-0 ``` ``` r filter(fball, home != "Chelsea" & away != "Chelsea") ``` ``` home away score 1 Man U Shef Wed 2-1 2 Tottenham Arsenal 0-0 ``` --- ## `arrange()` Other args: - variable names to sort by, subsorting by later variables - sorts ascending, unless wrap variable with `desc()` ``` r arrange(who, year, desc(country)) ``` ``` country year cases population 1 China 1999 212258 1272915272 2 Brazil 1999 37737 172006362 3 Afghanistan 1999 745 19987071 4 China 2000 213766 1280428583 5 Brazil 2000 80488 174504898 6 Afghanistan 2000 2666 20595360 ``` ``` r arrange(who, desc(country), desc(year)) ``` ``` country year cases population 1 China 2000 213766 1280428583 2 China 1999 212258 1272915272 3 Brazil 2000 80488 174504898 4 Brazil 1999 37737 172006362 5 Afghanistan 2000 2666 20595360 6 Afghanistan 1999 745 19987071 ``` --- ## `select()` Other args: - names of variables to retain ``` r select(who, year, cases) ``` ``` year cases 1 1999 745 2 2000 2666 3 1999 37737 4 2000 80488 5 1999 212258 6 2000 213766 ``` --- ## `mutate()` Other args: - equation defining the new variable to create ``` r mutate(who, rate = cases/population) ``` ``` country year cases population rate 1 Afghanistan 1999 745 19987071 0.0000372741 2 Afghanistan 2000 2666 20595360 0.0001294466 3 Brazil 1999 37737 172006362 0.0002193930 4 Brazil 2000 80488 174504898 0.0004612363 5 China 1999 212258 1272915272 0.0001667495 6 China 2000 213766 1280428583 0.0001669488 ``` <br> **Note:** how I can freely refer to variable names within *all* these functions, as though they are variables within my R session without having to use the `$` syntax on the data frame to access them! --- ## `summarise()` Other args: - function to summarise by ``` r summarise(who, total = sum(cases)) ``` ``` total 1 547660 ``` -- **BUT** often want summaries by group, so use group_by first: ``` r who2 <- group_by(who, country) summarise(who2, total = sum(cases), change = max(cases)-min(cases)) ``` ``` # A tibble: 3 × 3 country total change <chr> <dbl> <dbl> 1 Afghanistan 3411 1921 2 Brazil 118225 42751 3 China 426024 1508 ``` --- ## Chaining functions (I) Often need to run multiple manipulation functions in a row to get to the desired result. How would you normally do this? ``` r arrange(group_by(mutate(select(filter(x, ...), ...), ...), ...), ...) ``` -- ``` r arrange( group_by( mutate( select( filter(x, ...), ...), ...), ...), ...) ``` Hard to read, need to read backwards, arguments for particular layer of function are miles from the function name, ... .center[ **Yuk!** 🤮 ] --- ## Chaining function (II) More readable? ... ``` r tmp <- filter(x, ...) tmp <- select(tmp, ...) tmp <- mutate(tmp, ...) tmp <- group_by(tmp, ...) arrange(tmp, ...) ``` ✅ reads forwards, ✅ arguments kept together, ❌ temporary variables and redundant assignments/code But it does give an idea ... introducing: --- ## Pipelines `|>` First: Tools > Global Options > Use native pipe operator -- ``` r filter(x, ...) |> select(...) |> mutate(...) |> group_by(...) |> arrange(...) ``` 😍 readable, minimalistic and with no unnecessary repeated code! -- The pipe operator `|>` takes the output of a function and passes it as the *first* argument of the function that comes next. *Not* limited to Tidyverse functions! eg ``` r rnorm(10) |> mean() ``` ``` [1] 0.3450766 ``` You might see `%>%` used in older code, as the 'native' operator was only introduced recently in R 4.1.0.