class: center, middle, inverse, title-slide

.title[
# Data Science and Statistical Computing
]
.subtitle[
## Practical Lecture 1
The base R language
]
.author[
### Dr Louis Aslett
]
.institute[
### Durham University
]
.date[
### 10 October 2024
]

---
background-image: url("i/R_and_RStudio.png")
background-size: contain

---

# What is R?

- "R is an integrated suite of software facilities for data manipulation, calculation and graphical display. ... an environment within which many classical and modern statistical techniques have been implemented."

--

- an interactive language (look for the prompt `>` )
- which you can equally well treat as an ordinary programming language.

--

- R stores results in a way which allows further manipulation and investigation (as 'objects').

--

- R provides most of its functionality through a rich collection of packages.

--

- First publicly released in 1993.
- Considered stable for 'production use' since 2000 (that's the base: packages another story!)

---

## Diving Straight In!

One can view R as an elaborate desktop calculator, which provides a simple starting point.

``` r
1+2
```

```
[1] 3
```

--

Using the 'backwards arrow', you can also save results of calculations in variables:

``` r
x <- 1+2
x
```

```
[1] 3
```

`=` works ... but it's not idiomatic R, so please avoid! `x = 1+2` 🤮

---

## Veeeerrrryyy expensive calculator 🤑🧮

Some useful functions for using R as a calculator include:

| | |
|-----|-----|
| `\(\sqrt{2}\)` | `sqrt(2)` |
| `\(2 \times 3\)` | `2*3` |
| `\(2^3\)` | `2^3` |
| `\(2 \div 3\)` | `2/3` |
| `\(\log_{10}{2}\)` | `log10(2)` |
| `\(\pi\)` | `pi` |
| `\(\ln{2}\)` | `log(2)` |
| `\(\sin{2}\)` | `sin(2)` |
| `\(e^2\)` | `exp(2)` |
| `\(5 \bmod{2}\)` | `5%%2` |

---

## 🧸 Toy problem

Calculate the average of 2, 3, 5 and 8. Store the result in a variable called `avg`.

`$$\frac{2+3+5+8}{4}$$`

--

``` r
avg <- (2+3+5+8)/4
avg
```

```
[1] 4.5
```

--

How would we calculate the standard deviation?

``` r
sqrt(((2-4.5)^2+(3-4.5)^2+(5-4.5)^2+(8-4.5)^2)/3)
```

```
[1] 2.645751
```

This is horrible!
Fortunately, R can do things like this for us if we supply a *vector* of the numbers.

---
background-image: url("i/too-easy-dude.jpg")
background-size: contain

---
background-image: url("i/maybe-i-just-make-it-look-easy.jpg")
background-size: contain

---

## Vectors

A vector, being an ordered list of numbers, is the simplest kind of data. So, our 'data' before is a vector `\((2,3,5,8)\)`.

In R, a vector is created with the `c()` function and can be used in many built-in functions. So, we can create a variable `myData` holding this data and then calculate the mean and standard deviation using R functions:

``` r
myData <- c(2,3,5,8)
mean(myData)
```

```
[1] 4.5
```

``` r
sd(myData)
```

```
[1] 2.645751
```

--

These are *functions* whose first *argument* is the vector of values we want to compute on.

---

.pull-left[

### Accessing vectors

``` r
myData
```

```
[1] 2 3 5 8
```

``` r
myData[2]
```

```
[1] 3
```

``` r
myData[-3]
```

```
[1] 2 3 8
```

``` r
myData[c(1,4)]
```

```
[1] 2 8
```

``` r
myData[2:4]
```

```
[1] 3 5 8
```

]

--

.pull-right[

### Defining vectors

``` r
rep(3,4)
```

```
[1] 3 3 3 3
```

``` r
4:7
```

```
[1] 4 5 6 7
```

``` r
seq(4,7)
```

```
[1] 4 5 6 7
```

``` r
seq(0,10,by=2)
```

```
[1]  0  2  4  6  8 10
```

``` r
seq(0,3,length.out=7)
```

```
[1] 0.0 0.5 1.0 1.5 2.0 2.5 3.0
```

]

---
class: middle, inverse

## 🆘 Getting help!

To obtain documentation for a function, simply prefix the function name with a question mark.

``` r
?mean
?rep
```

OR

Go to the "Help" tab in RStudio and search.

--

... and if you don't even know what you're looking for?

Great resource: [https://rseek.org/](https://rseek.org/)

---

## 🧸 Toy problem

- Sum all the integers from 1 to 100.

--

🤔 Perhaps search for how to "sum a vector"?

--

``` r
sum(1:100)
```

```
[1] 5050
```

--

... for the contrarians

``` r
mean(1:100)*100
```

```
[1] 5050
```

--

- Sum all the even integers between 1 and 100.
--

``` r
sum(seq(2, 100, by = 2))
```

```
[1] 2550
```

---

## Vector calculations

When you use the common arithmetic operations on vectors, R applies them element-wise.

.pull-left[

``` r
x <- 1:4
x+1
```

```
[1] 2 3 4 5
```

``` r
x*3
```

```
[1]  3  6  9 12
```

``` r
x+(4:1)
```

```
[1] 5 5 5 5
```

]

.pull-right[

``` r
x * x
```

```
[1]  1  4  9 16
```

``` r
x %*% x
```

```
     [,1]
[1,]   30
```

]

---
class: inverse

## .center[ 🚨🚨 Warning! 🚨🚨 ]

Care is required, because R will *silently* 'recycle' the shorter vector when the lengths are different:

``` r
x+c(0,10)
```

```
[1]  1 12  3 14
```

--

But R does at least warn if the shorter vector can't be recycled a whole number of times 😮💨

``` r
x+c(0,10,100)
```

```
Warning in x + c(0, 10, 100): longer object length is not a multiple of shorter
object length
```

```
[1]   1  12 103   4
```

---

## Using Vectors: some essential functions

- `sort` sorts a vector.
- `rank` provides the rank of each element.
- `order` gives the indices of the elements in sorted order.
- `unique` returns just the unique values in the vector.
- `table` provides counts of the occurrences of each element.
- `length` gives the total number of elements in the vector.
- `sample` randomly samples from the elements of a vector.
- `paste` concatenates textual representations of vectors together.

Also, fairly self-descriptive: `mean`, `median`, `sd`, `var`, `min`, `max`, `range`, `quantile`, `cumsum`.

---

## 🤔 Quick thought exercise!

How could I simulate throwing a dice 200 times and get counts of the total number of each face?

--

``` r
table(sample(1:6, 200, replace = TRUE))
```

```
 1  2  3  4  5  6
25 29 37 36 44 29
```

--

<br><br>

**Follow-on thought exercise:** could we do a hypothesis test of whether a dice that produced the above numbers is biased?

---

## Code flow control

For loops in R rely on vectors ...

``` r
for(i in vec) {
  # Code between { and } runs length(vec) times, with i set to
  # each element of vec, in order, sequentially
}
```

The above executes the code within `{ ...
}` for each element of the vector (or list, see later) specified after the `in`, assigning each element in turn to the variable named before the `in`.

--

The following have identical output:

.pull-left[

``` r
nums <- 1:4
for(i in nums) {
  print(i)
}
```

```
[1] 1
[1] 2
[1] 3
[1] 4
```

]

.pull-right[

``` r
for(myvar in 1:4) {
  print(myvar)
}
```

```
[1] 1
[1] 2
[1] 3
[1] 4
```

]

---
class: inverse

## .center[ 🚨🚨 Warning 1! 🚨🚨 ]

Modifying the vector you're looping over will not have any effect on the loop: it is copied before the loop starts!

``` r
nums <- 1:4
for(i in nums) {
  nums <- 10:20
  print(i)
}
```

```
[1] 1
[1] 2
[1] 3
[1] 4
```

``` r
nums
```

```
[1] 10 11 12 13 14 15 16 17 18 19 20
```

---
class: inverse

## .center[ 🚨🚨 Warning 2! 🚨🚨 ]

Modifying the assigned element does *not* modify the vector you're looping over!

``` r
nums <- 1:4
for(i in nums) {
  i <- 99
}
nums
```

```
[1] 1 2 3 4
```
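
--

If the aim really is to change the vector from inside a loop, one possible approach (a minimal sketch, not something the slides above prescribe) is to loop over the *positions* rather than the values and assign back by index. The use of `seq_along()` and the multiply-by-10 update are illustrative choices only:

``` r
nums <- 1:4
for(i in seq_along(nums)) {
  # Here i is a position (1, 2, 3, 4), not a value taken from nums,
  # so assigning to nums[i] writes back into the vector itself
  nums[i] <- nums[i] * 10
}
nums
```

```
[1] 10 20 30 40
```

For a simple update like this, the element-wise form `nums * 10` gives the same values without any loop.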