class: center, middle, inverse, title-slide

.title[
# Data Science and Statistical Computing
]
.subtitle[
## Practical Lecture 2
Data Frames
]
.author[
### Dr Louis Aslett
]
.institute[
### Durham University
]
.date[
### 17 October 2024
]

---

## 🤔 Quick quiz!

- Create a vector containing the values 1 up to 100 called `x`

--

``` r
x <- 1:100
x <- seq(1, 100)
```

- Bootstrap resample the vector and store the result in `y`

--

``` r
y <- sample(x, replace = TRUE)
y <- sample(x, 100, replace = TRUE)
```

- Store the vector without elements 10 through 20 inclusive into `z`

--

``` r
z <- y[-(10:20)]
```

- Centre the data by subtracting the mean from every element, overwriting `z`, `\((z_i = z_i - \bar{z})\)`

--

``` r
z <- z - mean(z)
```

---

## More realistic data

Vectors are fine when dealing only with *univariate* data (meaning 1 variable).

--

Although this is all you saw in Stats 1, in the "real world" data is almost always higher dimensional: collect many different *variables* for each *observation*.

eg could collect height 📏, weight ⚖️, age 👵, eye colour 👁️, date of birth 📆, ... all on the same individual 💁

--

**Don't** want to have to store these haphazardly!

``` r
Heights <- c(147, 150, 152)
Weights <- c(52.2, 53.1, 54.4)
```

--

Most large scale datasets are stored in a matrix arrangement called a **design matrix**.

- rows: contain observations (eg different individuals)
- columns: contain variables (eg each recorded quantity)

R calls these a **data frame**.

---

## Data frames

We can create a data frame manually:

``` r
X <- data.frame(Height = c(147, 150, 152),
                Weight = c(52.2, 53.1, 54.4))
X
```

```
  Height Weight
1    147   52.2
2    150   53.1
3    152   54.4
```

--

But, nobody valuing their sanity actually loads data into R by typing it in! In practice we will usually load it from some external source, like a CSV (Comma Separated Values), etc.

```
"Height","Weight"
147,52.2
150,53.1
152,54.4
```

---

## Loading external data

Probably the best format for loading data into R *is* indeed CSV.
We download the full height and weight data from: [https://www.louisaslett.com/Courses/DSSC/data/hw.csv](https://www.louisaslett.com/Courses/DSSC/data/hw.csv)

--

Load either in the R console or using RStudio:

``` r
hw <- read.csv("hw.csv")
```

Note:

1. R looks in the current "working directory". If you download elsewhere, need to specify the full path.
1. If not *comma* separated, but some other way (e.g. tab delimited), then `read.table` allows you to specify the separator.

---

## Using data frames

.pull-left[
### Accessing data frames

``` r
hw$Height
hw[,1]
hw$Weight
hw[,2]
hw$Height[3]
hw[3,1]
hw[3,]
hw[hw$Weight>50,]
hw[sample(1:nrow(hw), 4),]
```
]

--

.pull-right[
### Interrogating data frames

``` r
names(hw)
dim(hw)
nrow(hw)
ncol(hw)
head(hw)
summary(hw)
str(hw)
```
]

--

Useful self explanatory functions: `colMeans`, `rowMeans`, `colSums`, `rowSums`, `cov`, `cor`, `scale`, ...

---

class: inverse

## .center[ 🚨🚨 Warning! 🚨🚨 ]

A common mistake is not to realise that a data frame can suddenly turn into a vector when we subset a single column, but not if we subset more than one:

``` r
hw[,1]
```

```
[1] 147 150 152 155 157 160 163 165 168 170 173 175 178 180 183
```

Not how we think mathematically: we 'expect' an `\(n \times 1\)` matrix? Use additional argument `drop = FALSE`

``` r
hw[, 1, drop = FALSE]
```

```
   Height
1     147
2     150
3     152
4     155
5     157
6     160
7     163
8     165
9     168
10    170
11    173
12    175
13    178
14    180
15    183
```

---

## Creating new variables

Putting together vector calculations and data frames we can create new derived variables within a data set.
For example, `$$\mathrm{BMI} = \frac{\text{mass}(\text{kg})}{\left(\text{height}(\text{m})\right)^2}$$`

--

``` r
hw$BMI <- hw$Weight/(hw$Height/100)^2
head(hw)
```

```
  Height Weight      BMI
1    147   52.2 24.15660
2    150   53.1 23.60000
3    152   54.4 23.54571
4    155   55.8 23.22581
5    157   57.2 23.20581
6    160   58.5 22.85156
```

---

## 🧸 Toy problem

In a letter to The Economist on 5<sup>th</sup> January 2013, a mathematics professor suggested that a modification to BMI would improve it since "We live in a 3D world"!

`$$\mathrm{BMI_{3D}} = \frac{1.3 \times \text{mass}(\text{kg})}{\left(\text{height}(\text{m})\right)^\frac{5}{2}}$$`

Create a new variable called `BMI3D` in the data frame holding this modified BMI. Has anyone switched position in the rankings on this new scale?

--

``` r
hw$BMI3D <- 1.3*hw$Weight/(hw$Height/100)^2.5
rank(hw$BMI)
```

```
[1] 15 14 13 12 11 10  9  8  7  6  2  3  1  5  4
```

``` r
rank(hw$BMI3D)
```

```
[1] 15 14 13 12 11 10  9  8  7  6  5  4  2  3  1
```

---

## Full data frame access

<img src="i/df.png" width="90%" style="display: block; margin: auto;" />

--

.pull-left[
``` r
wq.red$pH
wq.red[, 9]
wq.red[, "pH"]
wq.red[, c(9,11)]
wq.red[, c("pH","alcohol")]
```
]

--

.pull-right[
``` r
sort(wq.red$pH)
wq.red[order(wq.red$pH), ]
wq.red[wq.red$pH>3,]
wq.red[wq.red$pH>3, "alcohol"]
```
]

``` r
wq.red[wq.red$pH>3 & wq.red$density<1, "alcohol"]
```

---

class: inverse

## .center[ 🚨🚨 Warning! 🚨🚨 ]

All of these are **accessors** ... the original data frame is unchanged!

So if you run:

``` r
wq.red[order(wq.red$pH), ]
```

then look at `wq.red`, it will *not* be ordered. The ordering was a *new* temporary object.

If you intended to actually change the order of the data, need to overwrite it:

``` r
wq.red <- wq.red[order(wq.red$pH), ]
```
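
---

## Accessor vs overwrite: a quick check

A minimal sketch of the accessor warning, assuming the small three-row height and weight data frame typed in manually earlier rather than `wq.red`. The helper names `first_before` and `first_after` are hypothetical, used only to record what sits in row 1 at each stage.

``` r
# Three-row data frame from the manual example (not the wine data)
X <- data.frame(Height = c(147, 150, 152),
                Weight = c(52.2, 53.1, 54.4))

# Accessor only: a reordered copy is printed, X itself is untouched
X[order(X$Weight, decreasing = TRUE), ]
first_before <- X$Height[1]   # hypothetical helper: still the original first row

# Overwrite X to keep the new ordering
X <- X[order(X$Weight, decreasing = TRUE), ]
first_after <- X$Height[1]    # hypothetical helper: first row after overwriting
```

Comparing `first_before` and `first_after` shows that only the second, assigning, call changed `X`.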