Data Science and Statistical Computing

class: center, middle, inverse, title-slide

.title[
# Data Science and Statistical Computing
]
.subtitle[
## Practical Lecture 3: Lists, Factors, Odds and Ends
]
.author[
### Dr Louis Aslett
]
.institute[
### Durham University
]
.date[
### 24 October 2024
]

---

## Built-in data sets

R has a lot of built-in data sets which can be great for experimenting, whilst others are available in packages.

- Run `data()` to list the data sets currently available in base R and any loaded packages
- The `data()` function also loads a data set
    - name as first argument
    - optional `package` argument
- Initially loads as a *promise*: the data is only read into memory when you first access it

``` r
data("cars")
head(cars)
```

```
  speed dist
1     4    2
2     4   10
3     7    4
4     7   22
5     8   16
6     9   10
```
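
As a minimal sketch of the `package` argument mentioned above, the following uses the `datasets` package that ships with R, so nothing extra needs installing. The variable name `available` is just for illustration.

``` r
# List the names of the data sets provided by one specific package
available <- data(package = "datasets")$results[, "Item"]
head(available)

# Load a data set by name, stating explicitly which package it comes from
data("cars", package = "datasets")
dim(cars)  # number of rows and columns
```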

---

## Lists

- Think of a list as a "bucket" to hold variables and pass them round together.

- Unlike a data frame, each variable in the bucket can be a completely different size, type, etc.

- Allows nesting, so good for hierarchical/tree structures

``` r
x <- list(1, "a", c(1,2,3), data.frame(a = 1:3, b = 4:6))
x
```

```
[[1]]
[1] 1

[[2]]
[1] "a"

[[3]]
[1] 1 2 3

[[4]]
  a b
1 1 4
2 2 5
3 3 6
```

---

## Lists: naming

Each variable within can also be named:

``` r
x <- list(bob = 1, jill = "a", jack = c(1, 2, 3),
          eve = data.frame(a = 1:3, b = 4:6))
x
```

```
$bob
[1] 1

$jill
[1] "a"

$jack
[1] 1 2 3

$eve
  a b
1 1 4
2 2 5
3 3 6
```
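
Names can also be inspected and changed after the list is created. A small sketch, using an arbitrary list called `people` (a made-up name, not one used elsewhere in these slides):

``` r
# A small named list (the names and values here are arbitrary)
people <- list(bob = 1, jill = "a")

names(people)               # "bob" "jill"
people$jack <- c(1, 2, 3)   # add a new named element
people$bob <- NULL          # remove an element by assigning NULL
names(people)               # "jill" "jack"
length(people)              # 2
```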

---

## Lists: nesting

Nesting is permitted:

``` r
x <- list(bob = 1,
          eve = list(jill = "a", jack = c(1, 2, 3),
                     data.frame(a = 1:3, b = 4:6)))
x
```

```
$bob
[1] 1

$eve
$eve$jill
[1] "a"

$eve$jack
[1] 1 2 3

$eve[[3]]
  a b
1 1 4
2 2 5
3 3 6
```

---

## Lists: access

Access with:

- `[ ]` to access an element of the list as a single item list (ie retains name and structure)
- `[[ ]]` to access item directly, stripping away a level of hierarchy
- `$` to access item by name

Can all be nested, eg:

``` r
x[1]
x[[1]]
x$bob
x[1]$bob
x$eve$jack
x$eve[[3]]
```

Note that many complex objects in R (eg returned by modelling functions etc) are just sophisticated list structures. Summarise any object (including lists) with `str()`.

``` r
str(x)
```
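
The difference between `[ ]` and `[[ ]]` is easiest to see by checking what comes back. A minimal sketch, using a cut-down list called `nested` (a name chosen here for illustration):

``` r
nested <- list(bob = 1, eve = list(jill = "a", jack = c(1, 2, 3)))

class(nested[1])          # "list": still a list, of length one
class(nested[[1]])        # "numeric": the item itself
class(nested["eve"])      # "list": names work inside [ ] too

# [[ ]] can be chained to descend through the hierarchy
nested[["eve"]][["jack"]][2]   # 2

# nested[1] + 1 would be an error: you cannot do arithmetic on a list
nested[[1]] + 1           # 2
```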

---

## Lists: demo

.center[**DEMO OF:**]

- `t.test`

- `lm`

Using `cars` data.
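
The live demo itself is not reproduced here. As a rough sketch of the kind of thing it might cover, the following assumes a simple regression of `dist` on `speed` and a one-sample t-test with an arbitrary `mu = 15`; both choices are assumptions for illustration only.

``` r
data("cars")

# Fit a simple linear regression of stopping distance on speed
fit <- lm(dist ~ speed, data = cars)
class(fit)          # "lm"
is.list(fit)        # TRUE: underneath, it is a list
names(fit)          # the components you can extract
fit$coefficients    # access one component by name

# A t-test result is likewise a list, with class "htest"
tt <- t.test(cars$speed, mu = 15)
names(tt)
tt$p.value
tt$conf.int
```

Running `str(fit)` or `str(tt)` shows the full structure of either object.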

---

## Functions

``` r
myFunction <- function(arg1, arg2 = 1) {
  z <- arg1 + arg2
  return(z)
}

myFunction(5)
```

```
[1] 6
```

``` r
myFunction(5,3)
```

```
[1] 8
```

``` r
x <- myFunction(5,3)
x
```

```
[1] 8
```

---

## Some notes on RStudio

.center[**DEMO OF:**]

- Source files

- Don't save workspace on exit
    - Do manually save workspace if need be

- Projects! (eg Stats Inference in one project, DSSC in another)

---

## 📦 Packages

R is powerful mostly because of the incredible ecosystem of add-on packages.

- Packages are "libraries" for bundling up code, documentation and data in a super easy format.
- Install **once**
    - `install.packages("dplyr")`
- Load **many times**
    - `library("dplyr")`
- To access without loading
    - `package::function` or `package::data`

- CRAN is the centralised (official) repository. [https://cran.r-project.org](https://cran.r-project.org)
- Bioconductor has a lot of very high quality packages [https://bioconductor.org](https://bioconductor.org)
- Many packages still under development are available on GitHub (a software version control & development site) [https://github.com](https://github.com)

`install.packages` looks only at CRAN by default. The `remotes` package helps with installing from other places.

See packages tab in RStudio.
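
A short sketch of the `package::function` and `package::data` syntax, using only packages that ship with R. The GitHub lines are commented out, and `"user/repo"` is a placeholder rather than a real repository.

``` r
# Use a function without attaching its package: package::function
stats::median(c(1, 5, 3))      # 3

# The same syntax reaches data sets: package::data
head(datasets::cars, 3)

# Check whether a package is installed (loads its namespace, does not attach it)
requireNamespace("forcats", quietly = TRUE)

# Installing from GitHub via remotes (not run; "user/repo" is a placeholder)
# install.packages("remotes")
# remotes::install_github("user/repo")
```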

---

## Not all data is numeric!

- Numeric
    - `$\in \mathbb{R}, \mathbb{Z}, \mathbb{N}, \mathbb{N}^+, ...$`

- Logical (`TRUE/FALSE`)

- Categorical
    - eg eye colour, place of birth, ...
    - in R parlance called a "factor", possible values "levels"
    - can be ordered factor

- Date/Time
    - eg date of birth, precise time of sale, ...
    - One of the most notoriously difficult data types to handle in other languages

- Text (in programming speak, a *string*)
    - factor or text?
    - often treat as text when every observation has unique value

- Many more
    - Image, Spatial, Audio, Video, ...
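
As a quick illustration of how R labels the main types listed above, here is a small sketch. The values and the variable names `eye` and `size` are made up.

``` r
class(3.14)                     # "numeric"
class(2L)                       # "integer"
class(TRUE)                     # "logical"
class("Durham")                 # "character" (a string)
class(as.Date("2024-10-24"))    # "Date"

# Categorical: a factor whose possible values are its levels
eye <- factor(c("blue", "brown", "blue"))
levels(eye)                     # "blue" "brown"

# Ordered factor: the levels have a meaningful order, so comparisons work
size <- factor(c("small", "large", "medium"),
               levels = c("small", "medium", "large"),
               ordered = TRUE)
size < "large"                  # TRUE FALSE TRUE
```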

---

## Factors

Factor variables have a value from a limited set of possible *levels*.

``` r
data("chickwts")

head(chickwts)
summary(chickwts)
```

Are these just text?

``` r
team <- c("Liverpool", "Man U", "Newcastle U", "Chelsea")
team
```

```
[1] "Liverpool"   "Man U"       "Newcastle U" "Chelsea"    
```

``` r
str(team)
```

```
 chr [1:4] "Liverpool" "Man U" "Newcastle U" "Chelsea"
```

``` r
str(as.factor(team))
```

```
 Factor w/ 4 levels "Chelsea","Liverpool",..: 2 3 4 1
```

---

We can use factors within data frame subsetting too:

``` r
chickwts[chickwts$feed=="sunflower", ]
```

Or subset a whole set of factor levels:

``` r
chickwts[chickwts$feed %in% c("sunflower","linseed"), ]
```

``` r
nlevels(chickwts$feed)
```

```
[1] 6
```

``` r
levels(chickwts$feed)
```

```
[1] "casein"    "horsebean" "linseed"   "meatmeal"  "soybean"   "sunflower"
```

``` r
as.integer(chickwts$feed)
```

``` r
table(chickwts$feed)
```

```

   casein horsebean   linseed  meatmeal   soybean sunflower 
       12        10        12        11        14        12 
```
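
The output of `as.integer()` hints at how a factor is stored: as integer codes that index into the vector of levels. A minimal sketch, reusing the team names from earlier under the new name `team_f`:

``` r
team_f <- factor(c("Liverpool", "Man U", "Newcastle U", "Chelsea"))

typeof(team_f)                       # "integer"
as.integer(team_f)                   # 2 3 4 1 (positions in levels(team_f))
levels(team_f)                       # sorted alphabetically by default
levels(team_f)[as.integer(team_f)]   # recovers the original labels
as.character(team_f)                 # the same thing, more directly
```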

---
class: inverse

## .center[ 🚨🚨 Warning! 🚨🚨 ]

When you load data using the standard `read.csv()` function, it can't distinguish a factor from a string (or even a factor from a number), so everything is loaded as a string (or number)!

In other words, part of your data tidying work is to identify and correctly transform variables that should be factors.

.center[**DEMO**]

``` r
mydat$var <- as.factor(mydat$var)
```
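
A self-contained sketch of the problem and the fix. It reads a tiny made-up CSV from a string via the `text` argument of `read.csv()` instead of a file; the column names and values are invented for illustration.

``` r
# A tiny made-up CSV, supplied as text rather than a file on disk
csv_text <- "id,feed,weight
1,linseed,309
2,soybean,243
3,linseed,229"

mydat <- read.csv(text = csv_text)
str(mydat)   # in R 4.0.0 and later, feed is read as character, not factor

# Convert the column that should be categorical
mydat$feed <- as.factor(mydat$feed)
levels(mydat$feed)   # "linseed" "soybean"
table(mydat$feed)
```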

---

background-image: url("i/forcats.png")
background-size: contain

---

## `forcats` package

The `forcats` package provides a suite of tools that solve common problems with factors, including changing the order of levels or the values.

``` r
# REMEMBER: only install ONCE
# install.packages("forcats")

# Load thereafter
library("forcats")
```

``` r
data("chickwts")

levels(chickwts$feed)
```

```
[1] "casein"    "horsebean" "linseed"   "meatmeal"  "soybean"   "sunflower"
```

``` r
head(chickwts$feed, 13)
```

```
 [1] horsebean horsebean horsebean horsebean horsebean horsebean horsebean
 [8] horsebean horsebean horsebean linseed   linseed   linseed  
Levels: casein horsebean linseed meatmeal soybean sunflower
```

---

``` r
chickwts$feed <- fct_inorder(chickwts$feed)
levels(chickwts$feed)
```

```
[1] "horsebean" "linseed"   "soybean"   "sunflower" "meatmeal"  "casein"   
```

``` r
table(chickwts$feed)
```

```

horsebean   linseed   soybean sunflower  meatmeal    casein 
       10        12        14        12        11        12 
```

``` r
chickwts$feed <- fct_infreq(chickwts$feed)
levels(chickwts$feed)
```

```
[1] "soybean"   "linseed"   "sunflower" "casein"    "meatmeal"  "horsebean"
```

``` r
table(chickwts$feed)
```

```

  soybean   linseed sunflower    casein  meatmeal horsebean 
       14        12        12        12        11        10 
```

---

Or, we can reorder by our own custom summary statistic:

``` r
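library("forcats")

# Reload the original data, resetting any earlier changes to the levels
data("chickwts")
chickwts

boxplot(weight ~ feed, chickwts)

# Work on a copy so the original level order is kept for comparison
chickwts2 <- chickwts

# Median weight per feed, in the original level order
aggregate(weight ~ feed,
          chickwts2,
          median)

# Reorder the levels of feed by the median weight within each level
chickwts2$feed <- fct_reorder(chickwts2$feed,
                              chickwts2$weight,
                              median)
levels(chickwts2$feed)

# Same medians, now listed in increasing order
aggregate(weight ~ feed,
          chickwts2,
          median)

boxplot(weight ~ feed, chickwts2)

# The underlying integer codes differ because the level order changed
as.integer(chickwts$feed)
as.integer(chickwts2$feed)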
```
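
If `forcats` is not available, base R's `reorder()` does a similar job. This is a sketch of an alternative, not part of `forcats`; the name `chickwts3` is made up here.

``` r
data("chickwts")
chickwts3 <- chickwts

# Reorder the levels of feed by median weight (lowest median first)
chickwts3$feed <- reorder(chickwts3$feed, chickwts3$weight, FUN = median)
levels(chickwts3$feed)

# Check: the per-level medians now appear in increasing order
tapply(chickwts3$weight, chickwts3$feed, median)
```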

---

## Data science workflow

Image from "R for Data Science", by H. Wickham & G. Grolemund.

Free online: [https://r4ds.had.co.nz/](https://r4ds.had.co.nz/)

---

## Missing data

Unfortunately, few real data sets are blessed with 100% complete observations. R can express missing data and handles it very well: it is represented as `NA`.

Care is required when reading in external data to verify that `NA` has been correctly placed where appropriate: many external data sources indicate missingness differently!

See `na.strings` argument of `read.csv` function.
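
A sketch of `na.strings` in action, assuming a made-up file in which missing weights are coded three different ways (`-999`, blank and `N/A`). The data and the names `raw` and `clean` are invented for illustration.

``` r
# Made-up CSV where missingness is coded as -999, blank, and N/A
csv_text <- "id,weight
1,179
2,-999
3,
4,N/A"

# Default settings: the N/A text stops weight being read as a number
raw <- read.csv(text = csv_text)
str(raw)

# Tell read.csv which strings mean "missing"
clean <- read.csv(text = csv_text, na.strings = c("", "N/A", "-999"))
str(clean)
clean$weight   # 179 NA NA NA
```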
  
--

``` r
chickwts$weight[1] <- NA
mean(chickwts$weight)
```

```
[1] NA
```

``` r
mean(chickwts$weight, na.rm = TRUE)
```

```
[1] 262.4857
```

``` r
mean(na.omit(chickwts$weight))
```

```
[1] 262.4857
```
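
To find and count missing values, use `is.na()`. A small sketch on a made-up vector `w`:

``` r
w <- c(NA, 160, 227, NA, 217)   # made-up weights with two missing values

is.na(w)               # TRUE FALSE FALSE TRUE FALSE
sum(is.na(w))          # 2 missing values
which(is.na(w))        # positions 1 and 4

NA == NA               # NA, not TRUE: always test with is.na(), never ==
sum(w)                 # NA: missingness propagates
sum(w, na.rm = TRUE)   # 604
```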