Data Science and Statistical Computing

class: center, middle, inverse, title-slide

.title[
# Data Science and Statistical Computing
]
.subtitle[
## Practical Lecture 2<br>Data Frames
]
.author[
### Dr Louis Aslett
]
.institute[
### Durham University
]
.date[
### 17 October 2024
]

---

## 🤔 Quick quiz!

- Create a vector containing the values 1 up to 100 called `x`

``` r
x <- 1:100         # colon operator
x <- seq(1, 100)   # equivalent, using seq()
```

- Bootstrap resample the vector and store the result in `y`

``` r
y <- sample(x, replace = TRUE)        # size defaults to length(x)
y <- sample(x, 100, replace = TRUE)   # equivalent, with explicit size
```

- Store the vector without elements 10 through 20 inclusive into `z`

``` r
z <- y[-(10:20)]
```

- Centre the data by subtracting the mean from every element, overwriting `z`, `$(z_i = z_i - \bar{z})$`

``` r
z <- z - mean(z)
```
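
As a quick sanity check, the quiz steps can be run end to end. This is only a minimal sketch: the `set.seed(1)` call is an addition (not part of the quiz) so that the resample is reproducible.

``` r
set.seed(1)                      # assumption: fixed seed, only for reproducibility
x <- 1:100
y <- sample(x, replace = TRUE)   # bootstrap resample: same length, with replacement
z <- y[-(10:20)]                 # drop positions 10 to 20 (11 elements)
z <- z - mean(z)                 # centre the data

length(z)                        # 89 elements remain
abs(mean(z)) < 1e-10             # mean is zero up to floating point error
```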

---

## More realistic data

Vectors are fine when dealing only with *univariate* data (meaning 1 variable).

Although this is all you saw in Stats 1, in the "real world" data is almost always higher dimensional: we collect many different *variables* for each *observation*.

e.g. we could collect height 📏, weight ⚖️, age 👵, eye colour 👁️, date of birth 📆, ... all on the same individual 💁

**Don't** want to have to store these haphazardly!

``` r
Heights <- c(147, 150, 152)
Weights <- c(52.2, 53.1, 54.4)
```

Most large-scale datasets are stored in a matrix arrangement called a **design matrix**.

- rows: contain observations (e.g. different individuals)
- columns: contain variables (e.g. each recorded quantity)

R calls these a **data frame**.

---

## Data frames

We can create a data frame manually:

``` r
X <- data.frame(Height = c(147, 150, 152),
                Weight = c(52.2, 53.1, 54.4))
X
```

```
  Height Weight
1    147   52.2
2    150   53.1
3    152   54.4
```

But, nobody valuing their sanity actually loads data into R by typing it in!

In practice we will usually load it from some external source, such as a CSV (Comma Separated Values) file.

```
"Height","Weight"
147,52.2
150,53.1
152,54.4
```
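
To see that this text encodes the same data frame, one can parse it directly. A minimal sketch using the `text` argument of `read.csv`; the name `csv_text` is only for illustration.

``` r
# The CSV content from above, held in a character string
csv_text <- '"Height","Weight"
147,52.2
150,53.1
152,54.4'

X <- read.csv(text = csv_text)   # parse the string instead of a file
X
str(X)                           # 3 observations of 2 variables
```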

---

## Loading external data

Probably the best format for loading data into R *is* indeed CSV. We download the full height and weight data from:

[https://www.louisaslett.com/Courses/DSSC/data/hw.csv](https://www.louisaslett.com/Courses/DSSC/data/hw.csv)

Load either in the R console or using RStudio:

``` r
hw <- read.csv("hw.csv")
```

Note:

1. R looks in the current "working directory". If you downloaded the file elsewhere, you need to specify the full path.

1. If not *comma* separated, but some other way (e.g. tab delimited), then `read.table` allows you to specify the separator.
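
Both notes can be sketched together. The tab-delimited file below is a made-up stand-in written to a temporary location purely so the example is self-contained; `tab_file` and `hw_tab` are illustrative names.

``` r
getwd()   # the directory in which R resolves relative paths like "hw.csv"

# Assumption: a small tab-delimited file, created here only for illustration
tab_file <- tempfile(fileext = ".tsv")
writeLines(c("Height\tWeight", "147\t52.2", "150\t53.1"), tab_file)

# Full path plus explicit separator; header = TRUE is given explicitly
# because read.table does not default to it as read.csv does
hw_tab <- read.table(tab_file, header = TRUE, sep = "\t")
hw_tab
```

If many files live in the same folder, `setwd()` changes the working directory so that relative paths work again.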

---

## Using data frames

.pull-left[
### Accessing data frames

``` r
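hw$Height                    # Height column by name
hw[,1]                       # same column by position

hw$Weight
hw[,2]

hw$Height[3]                 # third height
hw[3,1]                      # same: row 3, column 1

hw[3,]                       # all of row 3

hw[hw$Weight>50,]            # rows where Weight exceeds 50
hw[sample(1:nrow(hw), 4),]   # 4 distinct rows chosen at random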
```
]

.pull-right[
### Interrogating data frames

``` r
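names(hw)     # variable (column) names

dim(hw)       # number of rows and columns

nrow(hw)
ncol(hw)

head(hw)      # first 6 rows

summary(hw)   # summary statistics per variable

str(hw)       # compact view of structure and types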
```
]

Useful self-explanatory functions:

`colMeans`, `rowMeans`, `colSums`, `rowSums`, `cov`, `cor`, `scale`, ...
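
A brief sketch of a few of these, assuming a small data frame `hw3` holding only the first three rows of the height and weight data (typed in so the example is self-contained):

``` r
# Assumption: first three rows only, entered by hand
hw3 <- data.frame(Height = c(147, 150, 152),
                  Weight = c(52.2, 53.1, 54.4))

colMeans(hw3)           # mean of each variable (column)
cov(hw3)                # 2 x 2 covariance matrix
cor(hw3)                # 2 x 2 correlation matrix

hw3_std <- scale(hw3)   # centre each column, divide by its standard deviation
colMeans(hw3_std)       # numerically zero after centring
```

Note that `scale` returns a matrix rather than a data frame.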

---
class: inverse

## .center[ 🚨🚨 Warning! 🚨🚨 ]

A common mistake is not to realise that a data frame can suddenly turn into a vector when we subset a single column, but not if we subset more than one (a single row of a multi-column data frame stays a data frame):

``` r
hw[,1]
```

```
 [1] 147 150 152 155 157 160 163 165 168 170 173 175 178 180 183
```

Not how we think mathematically: we 'expect' an `$n \times 1$` matrix?

Use additional argument `drop = FALSE`

``` r
hw[, 1, drop = FALSE]
```

```
   Height
1     147
2     150
3     152
4     155
5     157
6     160
7     163
8     165
9     168
10    170
11    173
12    175
13    178
14    180
15    183
```
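
The difference can be checked programmatically. A minimal sketch, assuming a small stand-in data frame `hw3` with the first three rows of the data:

``` r
# Assumption: first three rows only, entered by hand
hw3 <- data.frame(Height = c(147, 150, 152),
                  Weight = c(52.2, 53.1, 54.4))

is.data.frame(hw3[, 1])                 # FALSE: dropped to a numeric vector
is.data.frame(hw3[, 1, drop = FALSE])   # TRUE: still a one-column data frame

dim(hw3[, 1])                           # NULL: a vector has no dimensions
dim(hw3[, 1, drop = FALSE])             # 3 1

is.data.frame(hw3["Height"])            # TRUE: list-style indexing keeps the data frame
```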

---

## Creating new variables

Putting together vector calculations and data frames we can create new derived variables within a data set.

For example,
`$$\mathrm{BMI} = \frac{\text{mass}(\text{kg})}{\left(\text{height}(\text{m})\right)^2}$$`

``` r
hw$BMI <- hw$Weight/(hw$Height/100)^2
head(hw)
```

```
  Height Weight      BMI
1    147   52.2 24.15660
2    150   53.1 23.60000
3    152   54.4 23.54571
4    155   55.8 23.22581
5    157   57.2 23.20581
6    160   58.5 22.85156
```
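
The same calculation can be checked on the first three rows alone. In this sketch `hw3` and `hw3b` are illustrative names, and `transform` is shown only as one alternative way to write the same formula:

``` r
# Assumption: first three rows only, entered by hand
hw3 <- data.frame(Height = c(147, 150, 152),
                  Weight = c(52.2, 53.1, 54.4))

# Height is recorded in cm, so divide by 100 to get metres
hw3$BMI <- hw3$Weight / (hw3$Height / 100)^2

# Alternative: transform() evaluates the formula inside the data frame
hw3b <- transform(hw3[, c("Height", "Weight")],
                  BMI = Weight / (Height / 100)^2)

round(hw3$BMI, 5)              # 24.15660 23.60000 23.54571
all.equal(hw3$BMI, hw3b$BMI)   # TRUE
```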

---

## 🧸 Toy problem

In a letter to The Economist on 5<sup>th</sup> January 2013, a mathematics professor suggested that a modification to BMI would improve it since "We live in a 3D world"!
`$$\mathrm{BMI_{3D}} = \frac{1.3 \times \text{mass}(\text{kg})}{\left(\text{height}(\text{m})\right)^\frac{5}{2}}$$`
  
Create a new variable called `BMI3D` in the data frame holding this modified BMI. Has anyone switched position in the rankings on this new scale?

``` r
hw$BMI3D <- 1.3*hw$Weight/(hw$Height/100)^2.5

rank(hw$BMI)
```

```
 [1] 15 14 13 12 11 10  9  8  7  6  2  3  1  5  4
```

``` r
rank(hw$BMI3D)
```

```
 [1] 15 14 13 12 11 10  9  8  7  6  5  4  2  3  1
```
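
Rather than comparing the two rankings by eye, one possible approach is to let R find the differences. This sketch copies the two rank vectors from the output above into the illustrative names `r_bmi` and `r_bmi3d`; on the real data the same comparison would be `which(rank(hw$BMI) != rank(hw$BMI3D))`.

``` r
# Rank vectors copied from the printed output
r_bmi   <- c(15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 2, 3, 1, 5, 4)
r_bmi3d <- c(15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 2, 3, 1)

changed <- which(r_bmi != r_bmi3d)   # individuals whose rank differs
changed                              # 11 12 13 14 15

r_bmi3d[changed] - r_bmi[changed]    # change in rank: 3 1 1 -2 -3
```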

---

## Full data frame access

.pull-left[

``` r
wq.red$pH
wq.red[, 9]
wq.red[, "pH"]

wq.red[, c(9,11)]
wq.red[, c("pH","alcohol")]
```
]

.pull-right[

``` r
sort(wq.red$pH)

wq.red[order(wq.red$pH), ]

wq.red[wq.red$pH>3,]
wq.red[wq.red$pH>3, "alcohol"]
```
]

``` r
wq.red[wq.red$pH>3 & wq.red$density<1, "alcohol"]
```
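
The row condition is just a logical vector, built element-wise with `&`. A minimal sketch on a hypothetical toy data frame `wq.toy`; its values are invented for illustration and are not taken from the real data:

``` r
# Hypothetical toy data, NOT the real data set
wq.toy <- data.frame(pH      = c(3.51, 2.95, 3.26, 3.16),
                     density = c(0.9978, 0.9968, 1.0010, 0.9980),
                     alcohol = c(9.4, 9.8, 10.5, 9.9))

keep <- wq.toy$pH > 3 & wq.toy$density < 1   # TRUE only where both hold
keep                                         # TRUE FALSE FALSE TRUE

wq.toy[keep, "alcohol"]                      # 9.4 9.9

# subset() expresses the same filter without repeating the data frame name
subset(wq.toy, pH > 3 & density < 1, select = alcohol)
```

Unlike the bracket form, `subset` with a single selected column still returns a data frame.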

---
class: inverse

## .center[ 🚨🚨 Warning! 🚨🚨 ]

All of these are **accessors** ... the original data frame is unchanged!

So if you run:

``` r
wq.red[order(wq.red$pH), ]
```

then look at `wq.red`, it will *not* be ordered. The ordering was a *new* temporary object.

If you intended to actually change the order of the data, you need to overwrite it:

``` r
wq.red <- wq.red[order(wq.red$pH), ]
```
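
The difference is easy to verify on a small example. A minimal sketch with a hypothetical toy data frame `wq.toy` (values invented for illustration):

``` r
# Hypothetical toy data, NOT the real data set
wq.toy <- data.frame(pH      = c(3.51, 2.95, 3.26),
                     alcohol = c(9.4, 9.8, 10.5))

wq.toy[order(wq.toy$pH), ]             # prints a sorted copy ...
wq.toy$pH                              # ... original untouched: 3.51 2.95 3.26

wq.toy <- wq.toy[order(wq.toy$pH), ]   # overwrite to keep the new order
wq.toy$pH                              # 2.95 3.26 3.51
rownames(wq.toy)                       # "2" "3" "1": row names travel with rows
```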