class: center, middle, inverse, title-slide

.title[
# Data Science and Statistical Computing
]
.subtitle[
## Practical Lecture 3
Lists, Factors, Odds and Ends
]
.author[
### Dr Louis Aslett
]
.institute[
### Durham University
]
.date[
### 24 October 2024
]

---

## Built-in datasets

R has a lot of built-in data sets which can be great for experimenting, whilst others are available in packages.

- Run `data()` to see currently available data sets in base (and loaded packages)
- The `data()` function also loads them
  - name as first argument
  - optional `package` argument
  - Initially loads as a *promise* until you access it

--

``` r
data("cars")
head(cars)
```

```
  speed dist
1     4    2
2     4   10
3     7    4
4     7   22
5     8   16
6     9   10
```

---

## Lists

- Think of a list as a "bucket" to hold variables and pass them round together.
- Unlike a data frame, each variable in the bucket can be a completely different size, type, etc
- Allows nesting, so good for hierarchical/tree structures

--

``` r
x <- list(1, "a", c(1,2,3), data.frame(a = 1:3, b = 4:6))
x
```

```
[[1]]
[1] 1

[[2]]
[1] "a"

[[3]]
[1] 1 2 3

[[4]]
  a b
1 1 4
2 2 5
3 3 6
```

---

## Lists: naming

Each variable within can also be named:

``` r
x <- list(bob = 1, jill = "a", jack = c(1,2,3), eve = data.frame(a = 1:3, b = 4:6))
x
```

```
$bob
[1] 1

$jill
[1] "a"

$jack
[1] 1 2 3

$eve
  a b
1 1 4
2 2 5
3 3 6
```

---

## Lists: nesting

Nesting is permitted:

``` r
x <- list(bob = 1, eve = list(jill = "a", jack = c(1,2,3), data.frame(a = 1:3, b = 4:6)))
x
```

```
$bob
[1] 1

$eve
$eve$jill
[1] "a"

$eve$jack
[1] 1 2 3

$eve[[3]]
  a b
1 1 4
2 2 5
3 3 6
```

---

## Lists: access

Access with:

- `[ ]` to access an element of the list as a single item list (ie retains name and structure)
- `[[ ]]` to access item directly, stripping away a level of hierarchy
- `$` to access item by name

Can all be nested, eg:

``` r
x[1]
x[[1]]
x$bob
x[1]$bob
x$eve$jack
x$eve[[3]]
```

--

Note that many complex objects in R (eg returned by modelling functions etc) are just sophisticated list structures.
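As a minimal sketch of that point, the accessors above work directly on the objects returned by `lm` and `t.test` for the `cars` data. The names `fit` and `tt`, and the choice of `mu = 15`, are illustrative assumptions rather than part of the lecture demo:

```r
# Illustrative only: fit, tt and mu = 15 are assumptions for this sketch
data("cars")

fit <- lm(dist ~ speed, data = cars)  # fitted linear model
is.list(fit)                          # TRUE: a list underneath
names(fit)[1:3]                       # first few named components

fit$coefficients                      # access by name with $
fit[["coefficients"]]                 # same item via [[ ]]
is.list(fit["coefficients"])          # [ ] returns a one-item list

tt <- t.test(cars$speed, mu = 15)     # one-sample t-test result
is.list(tt)                           # also a list
tt$p.value                            # pull out a single component
```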
Summarise any object (including lists) with `str()`

``` r
str(x)
```

---

## Lists: demo

.center[**DEMO OF:**]

- `t.test`
- `lm`

Using `cars` data.

---

## Functions

``` r
myFunction <- function(arg1, arg2 = 1) {
  z <- arg1 + arg2
  return(z)
}
myFunction(5)
```

```
[1] 6
```

``` r
myFunction(5,3)
```

```
[1] 8
```

``` r
x <- myFunction(5,3)
x
```

```
[1] 8
```

---

## Some notes on RStudio

.center[**DEMO OF:**]

- Source files
- Don't save workspace on exit
- Do manually save workspace if need be
- Projects! (eg Stats Inference in one project, DSSC in another)

---

## 📦 Packages

R is powerful mostly because of the incredible ecosystem of add-on packages.

- Packages are "libraries" for bundling up code, documentation and data in a super easy format.
- Install **once**
  - `install.packages("dplyr")`
- Load **many times**
  - `library("dplyr")`
- To access without loading
  - `package::function` or `package::data`

--

- CRAN is the centralised (official) repository. [https://cran.r-project.org](https://cran.r-project.org)
- Bioconductor has a lot of very high quality packages [https://bioconductor.org](https://bioconductor.org)
- Many packages still under development are available on GitHub (a software version control & development site) [https://github.com](https://github.com)

--

`install.packages` looks only at CRAN by default. The `remotes` package helps with installing from other places.

See the Packages tab in RStudio.

---

## Not all data is numeric!

- Numeric
  - `\(\in \mathbb{R}, \mathbb{Z}, \mathbb{N}, \mathbb{N}^+, ...\)`

--

- Logical (`TRUE/FALSE`)

--

- Categorical
  - eg eye colour, place of birth, ...
  - in R parlance called a "factor", possible values "levels"
  - can be an ordered factor

--

- Date/Time
  - eg date of birth, precise time of sale, ...
  - One of the most notoriously difficult data types to handle in other languages

--

- Text (in programming speak, a *string*)
  - factor or text?
  - often treated as text when every observation has a unique value

--

- Many more
  - Image, Spatial, Audio, Video, ...

---

## Factors

Factor variables have a value from a limited set of possible *levels*.

``` r
data("chickwts")
head(chickwts)
summary(chickwts)
```

--

Are these just text?

``` r
team <- c("Liverpool", "Man U", "Newcastle U", "Chelsea")
team
```

```
[1] "Liverpool"   "Man U"       "Newcastle U" "Chelsea"
```

``` r
str(team)
```

```
 chr [1:4] "Liverpool" "Man U" "Newcastle U" "Chelsea"
```

``` r
str(as.factor(team))
```

```
 Factor w/ 4 levels "Chelsea","Liverpool",..: 2 3 4 1
```

---

We can use factors within data frame subsetting too:

``` r
chickwts[chickwts$feed=="sunflower", ]
```

Or subset a whole set of factor levels:

``` r
chickwts[chickwts$feed %in% c("sunflower","linseed"), ]
```

--

``` r
nlevels(chickwts$feed)
```

```
[1] 6
```

--

``` r
levels(chickwts$feed)
```

```
[1] "casein"    "horsebean" "linseed"   "meatmeal"  "soybean"   "sunflower"
```

--

``` r
as.integer(chickwts$feed)
```

--

``` r
table(chickwts$feed)
```

```

   casein horsebean   linseed  meatmeal   soybean sunflower
       12        10        12        11        14        12
```

---

class: inverse

## .center[ 🚨🚨 Warning! 🚨🚨 ]

When you load data using the standard `read.csv()` function, it cannot distinguish a factor from a string (or even a factor from a number), so everything is loaded as a string (or number)!

In other words, part of your data tidying work is to identify and correctly transform variables that should be factors.

<br><br>

.center[**DEMO**]

<br><br>

--

``` r
mydat$var <- as.factor(mydat$var)
```

---

background-image: url("i/forcats.png")
background-size: contain

---

## `forcats` package

The `forcats` package provides a suite of tools that solve common problems with factors, including changing the order of levels or the values.
``` r
# REMEMBER: only install ONCE
# install.packages("forcats")
# Load thereafter
library("forcats")
```

--

``` r
data("chickwts")
levels(chickwts$feed)
```

```
[1] "casein"    "horsebean" "linseed"   "meatmeal"  "soybean"   "sunflower"
```

``` r
head(chickwts$feed, 13)
```

```
 [1] horsebean horsebean horsebean horsebean horsebean horsebean horsebean
 [8] horsebean horsebean horsebean linseed   linseed   linseed
Levels: casein horsebean linseed meatmeal soybean sunflower
```

---

``` r
chickwts$feed <- fct_inorder(chickwts$feed)
levels(chickwts$feed)
```

```
[1] "horsebean" "linseed"   "soybean"   "sunflower" "meatmeal"  "casein"
```

--

``` r
table(chickwts$feed)
```

```

horsebean   linseed   soybean sunflower  meatmeal    casein
       10        12        14        12        11        12
```

--

``` r
chickwts$feed <- fct_infreq(chickwts$feed)
levels(chickwts$feed)
```

```
[1] "soybean"   "linseed"   "sunflower" "casein"    "meatmeal"  "horsebean"
```

``` r
table(chickwts$feed)
```

```

  soybean   linseed sunflower    casein  meatmeal horsebean
       14        12        12        12        11        10
```

---

Or, we can reorder by our own custom summary statistic:

``` r
library("forcats")
data("chickwts")
chickwts
boxplot(weight ~ feed, chickwts)
chickwts2 <- chickwts
aggregate(weight ~ feed, chickwts2, median)
chickwts2$feed <- fct_reorder(chickwts2$feed, chickwts2$weight, median)
levels(chickwts2$feed)
aggregate(weight ~ feed, chickwts2, median)
boxplot(weight ~ feed, chickwts2)
as.integer(chickwts$feed)
as.integer(chickwts2$feed)
```

---

## Data science workflow

<br>

<img src="i/data-science.png" width="90%" style="display: block; margin: auto;" />

<br>

Image from "R for Data Science", by H. Wickham & G. Grolemund. Free online: [https://r4ds.had.co.nz/](https://r4ds.had.co.nz/)

---

## Missing data

Unfortunately, few real data sets are blessed with 100% complete observations.

R can express missing data and handle it very well: it is represented as `NA` internally.
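A small illustration of how `NA` behaves in practice. The toy vector `w` is made up for this sketch and is not part of the lecture data:

```r
# Hypothetical toy vector, invented for illustration
w <- c(10, NA, 30)

is.na(w)        # FALSE TRUE FALSE: which entries are missing?
sum(is.na(w))   # 1: how many are missing?
w + 1           # 11 NA 31: arithmetic propagates NA
w == NA         # NA NA NA: comparing with NA is itself NA, so use is.na()
```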
--

Care is required when reading in external data to verify that `NA` has been placed correctly where appropriate: many external data sources indicate missingness differently! See the `na.strings` argument of the `read.csv` function.

--

``` r
chickwts$weight[1] <- NA
mean(chickwts$weight)
```

```
[1] NA
```

--

``` r
mean(chickwts$weight, na.rm = TRUE)
```

```
[1] 262.4857
```

``` r
mean(na.omit(chickwts$weight))
```

```
[1] 262.4857
```
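To make the `na.strings` point concrete, here is a minimal sketch. The inline CSV text, the column names, and the use of `"-"` and `"missing"` as missingness codes are all invented for illustration; real files will follow their own conventions.

```r
# Invented CSV content: "-", "missing" and empty fields stand in for absent values
csv_text <- "id,weight,feed
1,179,horsebean
2,-,linseed
3,missing,linseed
4,203,"

# Default read: the stray codes stop weight being parsed as numbers
raw <- read.csv(text = csv_text)
str(raw)

# Declare the codes (and the empty string) as NA
dat <- read.csv(text = csv_text, na.strings = c("-", "missing", ""))
str(dat)

# The categorical column still needs converting to a factor explicitly
dat$feed <- as.factor(dat$feed)
levels(dat$feed)

mean(dat$weight, na.rm = TRUE)
```

Passing the data through the `text` argument keeps the sketch self-contained; with a real file the first argument would be the file path instead.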