Data Science and Statistical Computing

class: center, middle, inverse, title-slide

.title[
# Data Science and Statistical Computing
]
.subtitle[
## Practical Lecture 9<br>Dates & Strings
]
.author[
### Dr Louis Aslett
]
.institute[
### Durham University
]
.date[
### 5 December 2024
]

---

## Next week ...

... is week 10. Last week of term, last week of this course! 😢

- **Wednesday:** New maths material, as usual (importance sampling)

- **Thursday:** 
    - Finishing up importance sampling and programming topics
    - Then, with any remaining time, you decide! Quick revision of assorted topics
    - **Please send requests for what to recap, ideally by end of Monday next week to louis.aslett@durham.ac.uk**

Also note that:

- Assignment 4 due next Monday. Don't fear, slightly shorter ... I know end of term exhaustion is setting in! 😮‍💨

- MEQs are available to complete for this module
    - Please do them! This is still quite a new course
    - Let me know what doesn't work so I can change it ...
    - ... **and** what is good (so I make sure I **don't** change it!)

---

## Dates

R can internally represent date, time or date-time.

We will use the `lubridate` package. It is part of the Tidyverse: since tidyverse 2.0.0 it is attached by `library("tidyverse")`, but in earlier versions it is *not* loaded by default.

``` r
install.packages("lubridate") # only needed once per machine
library("lubridate")
```

```

Attaching package: 'lubridate'
```

```
The following objects are masked from 'package:base':

date, intersect, setdiff, union
```

The website is *very* useful: <https://lubridate.tidyverse.org/>

---

## Constructing dates/date-times

Current date, or date-time

.smaller[

``` r
today()
```

```
[1] "2024-12-04"
```

``` r
now()
```

```
[1] "2024-12-04 10:56:18 GMT"
```
]

.pull-left[
From a string

.smaller[

``` r
ymd("2024-12-2")
```

```
[1] "2024-12-02"
```

``` r
mdy("December 2nd, 2024")
```

```
[1] "2024-12-02"
```

``` r
dmy("2-Dec-2024")
```

```
[1] "2024-12-02"
```
]
]

.pull-right[
From a number

.smaller[

``` r
ymd(20241202)
```

```
[1] "2024-12-02"
```

``` r
dmy(02122024)
```

```
[1] "2024-12-02"
```
]
]
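
The parsing functions return `NA` (with a warning) for input they cannot interpret, so it is worth checking for failures. A minimal sketch, where the vector `raw_dates` is made up for illustration:

``` r
library(lubridate)

# Hypothetical input: the second entry is not a valid date
raw_dates <- c("2024-12-02", "not a date")

# quiet = TRUE suppresses the "failed to parse" warning
parsed <- ymd(raw_dates, quiet = TRUE)

# Entries that could not be parsed come back as NA
is.na(parsed)
```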

---
background-image: url("i/perfect_date.jpg")
background-size: contain

---

## Constructing dates/date-times

Date-times from strings

.smaller[

``` r
ymd_hms("2024-12-02 12:33:59")
```

```
[1] "2024-12-02 12:33:59 UTC"
```

``` r
mdy_hm("12/2/2024 13:01")
```

```
[1] "2024-12-02 13:01:00 UTC"
```
]

From individual components

.pull-left[
.smaller[

``` r
make_date(2024, 12, 2)
```

```
[1] "2024-12-02"
```
]
]

.pull-right[
.smaller[

``` r
make_date("2024", "12", "2")
```

```
[1] "2024-12-02"
```
]
]

.pull-left[
.smaller[

``` r
make_datetime(2024, 12, 2, 12)
```

```
[1] "2024-12-02 12:00:00 UTC"
```

``` r
make_datetime(2024, 12, 2, 12, 33)
```

```
[1] "2024-12-02 12:33:00 UTC"
```
]
]

.pull-right[
.smaller[

``` r
make_datetime(2024, 12, 2, 12, 33, 59)
```

```
[1] "2024-12-02 12:33:59 UTC"
```
]
]

---

## Time zones

Creation functions support a `tz` argument to specify the time zone, e.g.

.pull-left[
.smaller[

``` r
now()
```

```
[1] "2024-12-04 10:56:18 GMT"
```
]
]

.pull-right[
.smaller[

``` r
now(tz = "America/New_York")
```

```
[1] "2024-12-04 05:56:18 EST"
```
]
]

To see all available time zones use:

.smaller[

``` r
OlsonNames()
```
]

Changing time zone:

.smaller[

``` r
x <- ymd_hm("2024-12-02 15:10")
force_tz(x, "America/New_York") # forces the zone without converting
```

```
[1] "2024-12-02 15:10:00 EST"
```

``` r
with_tz(x, "America/New_York") # converts time to new zone
```

```
[1] "2024-12-02 10:10:00 EST"
```
]
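
One way to see the difference between the two functions is to compare the resulting *instants* rather than the printed clock times. A small sketch (the variable names `forced` and `converted` are just for illustration):

``` r
library(lubridate)

x <- ymd_hm("2024-12-02 15:10")               # 15:10 in UTC

forced    <- force_tz(x, "America/New_York")  # same clock time, different instant
converted <- with_tz(x, "America/New_York")   # same instant, different clock time

converted == x                                # still the same moment in time
difftime(forced, x, units = "hours")          # instants differ by the UTC offset
```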

---

## Converting

Use `as_date()` or `as_datetime()`.
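
A short sketch of what these conversions do, assuming the default UTC time zone throughout:

``` r
library(lubridate)

dt <- ymd_hms("2024-12-02 12:33:59")

d  <- as_date(dt)      # date-time -> date: the time of day is dropped
d2 <- as_datetime(d)   # date -> date-time: midnight UTC is assumed

# Plain numbers are treated as offsets from 1970-01-01
as_date(0)             # days since the origin
as_datetime(0)         # seconds since the origin
```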

---

## Going in reverse! Extracting from dates/date-times

.smaller[

``` r
datetime <- ymd_hms("2024-12-02 12:33:59")

year(datetime)
```

```
[1] 2024
```

``` r
month(datetime)
```

```
[1] 12
```

``` r
mday(datetime)
```

```
[1] 2
```

``` r
yday(datetime)
```

```
[1] 337
```

``` r
wday(datetime)
```

```
[1] 2
```

``` r
wday(datetime, week_start = 1)
```

```
[1] 1
```
]

---

.smaller[

``` r
datetime <- ymd_hms("2024-12-02 12:33:59")

hour(datetime)
```

```
[1] 12
```

``` r
minute(datetime)
```

```
[1] 33
```

``` r
second(datetime)
```

```
[1] 59
```

``` r
month(datetime, label = TRUE)
```

```
[1] Dec
12 Levels: Jan < Feb < Mar < Apr < May < Jun < Jul < Aug < Sep < ... < Dec
```

``` r
wday(datetime, label = TRUE)
```

```
[1] Mon
Levels: Sun < Mon < Tue < Wed < Thu < Fri < Sat
```

``` r
wday(datetime, label = TRUE, week_start = 1)
```

```
[1] Mon
Levels: Mon < Tue < Wed < Thu < Fri < Sat < Sun
```
]

---

Rounding down/up (also nearest, see `round_date()`)

.pull-left[
.smaller[

``` r
floor_date(datetime, unit = "minute")
```

```
[1] "2024-12-02 12:33:00 UTC"
```

``` r
floor_date(datetime, unit = "hour")
```

```
[1] "2024-12-02 12:00:00 UTC"
```

``` r
floor_date(datetime, unit = "day")
```

```
[1] "2024-12-02 UTC"
```

``` r
floor_date(datetime, unit = "week")
```

```
[1] "2024-12-01 UTC"
```

``` r
floor_date(datetime, unit = "month")
```

```
[1] "2024-12-01 UTC"
```

``` r
floor_date(datetime, unit = "quarter")
```

```
[1] "2024-10-01 UTC"
```
]
]

.pull-right[
.smaller[

``` r
ceiling_date(datetime, unit = "minute")
```

```
[1] "2024-12-02 12:34:00 UTC"
```

``` r
ceiling_date(datetime, unit = "hour")
```

```
[1] "2024-12-02 13:00:00 UTC"
```

``` r
ceiling_date(datetime, unit = "day")
```

```
[1] "2024-12-03 UTC"
```

``` r
ceiling_date(datetime, unit = "week")
```

```
[1] "2024-12-08 UTC"
```

``` r
ceiling_date(datetime, unit = "month")
```

```
[1] "2025-01-01 UTC"
```

``` r
ceiling_date(datetime, unit = "quarter")
```

```
[1] "2025-01-01 UTC"
```
]
]

.smaller[

``` r
floor_date(datetime, unit = "week", week_start = 1)
```

```
[1] "2024-12-02 UTC"
```
]
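
For completeness, a sketch of rounding to the *nearest* unit with `round_date()`, using the same example date-time:

``` r
library(lubridate)

datetime <- ymd_hms("2024-12-02 12:33:59")

# 12:33:59 is closer to 12:34 than to 12:33
r_min <- round_date(datetime, unit = "minute")

# Past the half hour, so this rounds up to 13:00
r_hour <- round_date(datetime, unit = "hour")
```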

---

Finally, updating ...

Either use `update()` ...

.smaller[

``` r
datetime <- ymd_hms("2024-12-02 12:33:59")
datetime <- update(datetime, hour = 11, second = 33)
datetime
```

```
[1] "2024-12-02 11:33:33 UTC"
```
]

... or assign to the accessor functions directly

.smaller[

``` r
datetime <- ymd_hms("2024-12-02 12:33:59")
hour(datetime) <- 11
second(datetime) <- 33
datetime
```

```
[1] "2024-12-02 11:33:33 UTC"
```
]

---

## Durations

How old would Einstein be if alive today?

.smaller[

``` r
einstein <- dmy("14th March 1879")
age <- today() - einstein
age
```

```
Time difference of 53226 days
```
]

So, how old was Einstein 42 months ago? We can do arithmetic using `days()`/`hours()`/`months()` etc.

.smaller[

``` r
today() - months(42) - einstein
```

```
Time difference of 51947 days
```
]

Convert to a duration for easier viewing

.smaller[

``` r
age <- as.duration(age)
age
```

```
[1] "4598726400s (~145.72 years)"
```
]
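
Note that `lubridate` distinguishes *periods* (`months()`, `years()`, ...), which follow the calendar, from *durations* (`ddays()`, `dyears()`, ...), which are a fixed number of seconds. A small sketch of the difference; the fixed reference date `ref` is an assumption standing in for `today()` so the results do not change from day to day:

``` r
library(lubridate)

einstein <- dmy("14th March 1879")
ref <- ymd("2024-12-04")           # fixed stand-in for today()

# Age in years, measured on the calendar via an interval
age_years <- time_length(interval(einstein, ref), unit = "years")
floor(age_years)

# Periods follow the calendar; durations count seconds
ymd("2024-01-01") + years(1)       # same calendar day next year
ymd("2024-01-01") + ddays(365)     # 365 * 86400 seconds later (2024 is a leap year)

# Period arithmetic can land on a date that does not exist ...
ymd("2024-01-31") + months(1)      # NA: there is no 31 February
# ... whereas %m+% rolls back to the last valid day of the month
ymd("2024-01-31") %m+% months(1)
```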

---

## Strings

Recall text inside quotes is a *string*:

.smaller[

``` r
x <- "I am a string"
x <- 'I am a string'
```
]

But care is needed. A backslash `\` can escape quotes or encode special characters (you will most likely only ever need `\'`, `\"`, `\n` and `\\`):

.smaller[

``` r
y <- "I'm a string" # Ok!
y <- 'I'm a string' # Error!
y <- 'I\'m a string' # Ok!

z <- "As Roosevelt said,
"Believe you can and you're halfway there."" # Error!
z <- "As Roosevelt said,\n\"Believe you can and you're halfway there.\""
cat(z)
```
]

Or, in R 4.0.0 or newer, open with `r"(` and close with `)"`: you can then enclose anything (except `)"` itself!), including line breaks.

.smaller[

``` r
z <- r"(As Roosevelt said,
"Believe you can and you're halfway there."
)"
cat(z)
```
]
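
A quick check that a raw string and its escaped equivalent are the same value, plus the dash variant of the delimiters for text that itself contains `)"`. The file path is made up for illustration:

``` r
# Requires R >= 4.0.0 for the r"(...)" syntax
escaped <- "C:\\Users\\me"     # hypothetical path, backslashes escaped
raw     <- r"(C:\Users\me)"    # same text, no escaping needed
identical(escaped, raw)

# If the text contains )" then add matching dashes to the delimiters
tricky <- r"-(a )" inside)-"
cat(tricky)
```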

---

## Working with strings

We will use the `stringr` package, which is included in the Tidyverse!

``` r
library("tidyverse")
```

```
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr   1.1.4     ✔ readr   2.1.5
✔ forcats 1.0.0     ✔ stringr 1.5.1
✔ ggplot2 3.5.1     ✔ tibble  3.2.1
✔ purrr   1.0.2     ✔ tidyr   1.3.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
```

Functions from `stringr` consistently start with the prefix `str_` and enable easily doing the most common string manipulations (Hooray for autocomplete! 😀)

If you can't find what you need in `stringr`, look up the `stringi` package (which it is built on top of) ... it has nearly every imaginable string function you might need.

---

## String Basics

String length

.smaller[

``` r
str_length(c("Data Science and Statistical Computing", "by", "Dr Louis Aslett"))
```

```
[1] 38  2 15
```
]

Combining strings

.smaller[

``` r
str_c("Data Science and Statistical Computing", "by", "Dr Louis Aslett")
```

```
[1] "Data Science and Statistical ComputingbyDr Louis Aslett"
```

``` r
str_c("Data Science and Statistical Computing", "by", "Dr Louis Aslett", sep = " ")
```

```
[1] "Data Science and Statistical Computing by Dr Louis Aslett"
```
]

.smaller[

``` r
str_c(c("Data Science and Statistical Computing", "by", "Dr Louis Aslett"))
```

```
[1] "Data Science and Statistical Computing"
[2] "by"                                    
[3] "Dr Louis Aslett"                       
```

``` r
str_c(c("Data Science and Statistical Computing", "by", "Dr Louis Aslett"), collapse = " ")
```

```
[1] "Data Science and Statistical Computing by Dr Louis Aslett"
```
]
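
`str_c()` is vectorised, which is where `sep` and `collapse` differ most clearly. A minimal sketch (the "Lecture" labels are just example data):

``` r
library(stringr)

# A length-one argument is recycled along the longer one
labels <- str_c("Lecture ", 1:3)

# A missing value in any piece makes the whole result missing
with_na <- str_c("Lecture ", c("1", NA))

# sep joins the pieces of each element; collapse then joins the elements
joined <- str_c("Lecture", 1:3, sep = " ", collapse = ", ")
```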

---

Sub-setting strings: `str_sub` takes `start` and `end` (inclusive; negative values count back from the end of the string)

.smaller[

``` r
z <- c("Alice", "Bob", "Connie", "David")
str_sub(z, 1, 4)
```

```
[1] "Alic" "Bob"  "Conn" "Davi"
```

``` r
str_sub(z, -4, -1)
```

```
[1] "lice" "Bob"  "nnie" "avid"
```
]

Assignment form to overwrite

.smaller[

``` r
str_sub(z, 1, 2) <- "Zo"
z
```

```
[1] "Zoice"  "Zob"    "Zonnie" "Zovid" 
```
]

Trimming

.smaller[

``` r
str_trim("  String with  trailing,   middle, and    leading   white space\n\n")
```

```
[1] "String with  trailing,   middle, and    leading   white space"
```

``` r
str_squish("  String with  trailing,   middle, and    leading   white space\n\n")
```

```
[1] "String with trailing, middle, and leading white space"
```
]
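
Two small variations that are often handy, sketched on made-up strings: trimming only one side, and leaving out `end` in `str_sub()` (it defaults to the last character):

``` r
library(stringr)

s <- "  padded  "
left_only  <- str_trim(s, side = "left")    # removes leading white space only
right_only <- str_trim(s, side = "right")   # removes trailing white space only

# With only a negative start, we get the last three characters
last3 <- str_sub("Connie", -3)
```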

---

## Simple *regular expressions*

Regular expressions (*regex* for short) are an incredibly powerful way of finding patterns in strings.

We first look at how to build regexes before deploying them, so hang in there: the utility will be clear soon!

The function `str_view()` shows how R interprets a regex on some example strings. We typically use it only while developing the regex, and then use the other `stringr` functions to achieve what we want.

Exact matching

.smaller[

``` r
x <- c("apple", "banana", "pear")
str_view(x, "an")
```

```
[2] │ b<an><an>a
```
]

---

## Regex: wildcard

The "wildcard" `.` matches any single character

.smaller[

``` r
str_view(x, ".a.")
```

```
[2] │ <ban>ana
[3] │ p<ear>
```
]

How to match `.`?? By providing `\.` ... but the `\` must be escaped!

.smaller[

``` r
str_view(c(".bc", "a.c", "be."), "a\\.c")
```

```
[2] │ <a.c>
```
]

This can get messy fast with special characters: to match a literal `\` you must write `\\\\`!
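
Two ways to tame the backslashes, sketched with `str_detect()`: raw strings (R 4.0.0 or newer) let the regex be written as-is, and `fixed()` switches regex interpretation off entirely. The test vector `s` is made up for illustration:

``` r
library(stringr)

s <- c(".bc", "a.c", "abc", "a\\c")     # the last string is: a, backslash, c

dot_escaped <- str_detect(s, "a\\.c")     # literal dot, double-escaped
dot_raw     <- str_detect(s, r"(a\.c)")   # same regex written as a raw string
dot_fixed   <- str_detect(s, fixed("."))  # pattern treated as plain text

slash_escaped <- str_detect(s, "\\\\")    # literal backslash, the hard way
slash_raw     <- str_detect(s, r"(\\)")   # same regex as a raw string
```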

---

## Regex: anchoring

.pull-left[
Anchoring the start ...

.smaller[

``` r
str_view(x, "^a")
```

```
[1] │ <a>pple
```
]
]

.pull-right[
... and end of a string

.smaller[

``` r
str_view(x, "a$")
```

```
[2] │ banan<a>
```
]
]

Or both to pin a whole string

.smaller[

``` r
y <- c("apple pie", "apple", "apple cake")
```
]

.pull-left[
.smaller[

``` r
str_view(y, "apple")
```

```
[1] │ <apple> pie
[2] │ <apple>
[3] │ <apple> cake
```
]
]

.pull-right[
.smaller[

``` r
str_view(y, "^apple$")
```

```
[2] │ <apple>
```
]
]

---

## Regex: classes & ranges

Match consecutive characters drawn from a set ... exactly one character per match

.smaller[

``` r
str_view(x, "[pan]")
```

```
[1] │ <a><p><p>le
[2] │ b<a><n><a><n><a>
[3] │ <p>e<a>r
```
]

... one or more per match (each match is as long as possible)

.smaller[

``` r
str_view(x, "[pan]+")
```

```
[1] │ <app>le
[2] │ b<anana>
[3] │ <p>e<a>r
```
]

.pull-left[
... exact number

.smaller[

``` r
str_view(x, "[pan]{2}")
```

```
[1] │ <ap>ple
[2] │ b<an><an>a
```
]
]

.pull-right[
... or range

.smaller[

``` r
str_view(x, "[pan]{1,3}")
```

```
[1] │ <app>le
[2] │ b<ana><na>
[3] │ <p>e<a>r
```
]
]

---

.smaller[

``` r
y <- c("There were 122 in total", "Overall about 390 found", "100 but no more")
str_view(y, "[0-9]+")
```

```
[1] │ There were <122> in total
[2] │ Overall about <390> found
[3] │ <100> but no more
```

``` r
str_view(y, "[^A-Za-z ]+")
```

```
[1] │ There were <122> in total
[2] │ Overall about <390> found
[3] │ <100> but no more
```

``` r
str_view(y, "^[0-9]+")
```

```
[3] │ <100> but no more
```

``` r
str_view(y, "[a-z ]+")
```

```
[1] │ T<here were >122< in total>
[2] │ O<verall about >390< found>
[3] │ 100< but no more>
```
]
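
Common classes also have shorthand forms. A brief sketch on the same example strings, using `\d` (digit), `\s` (white space) and the POSIX class `[[:alpha:]]` (letters):

``` r
library(stringr)

y <- c("There were 122 in total", "Overall about 390 found", "100 but no more")

digits  <- str_extract(y, "\\d+")          # first run of digits
letters_first <- str_extract(y, "[[:alpha:]]+")  # first run of letters
n_space <- str_count(y, "\\s")             # number of white space characters
```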

---

## Regex: repeats

`?` (zero or one); `*` (zero or more); `+` (one or more)

.smaller[

``` r
y <- "1888: the longest year in Roman numerals! MDCCCLXXXVIII"
str_view(y, "CC?")
```

```
[1] │ 1888: the longest year in Roman numerals! MD<CC><C>LXXXVIII
```

``` r
str_view(y, "CC+")
```

```
[1] │ 1888: the longest year in Roman numerals! MD<CCC>LXXXVIII
```

``` r
str_view(y, 'C[LX]+')
```

```
[1] │ 1888: the longest year in Roman numerals! MDCC<CLXXX>VIII
```

``` r
str_view(y, "C{2}")
```

```
[1] │ 1888: the longest year in Roman numerals! MD<CC>CLXXXVIII
```

``` r
str_view(y, "C{2,}")
```

```
[1] │ 1888: the longest year in Roman numerals! MD<CCC>LXXXVIII
```

``` r
str_view(y, "C{2,3}")
```

```
[1] │ 1888: the longest year in Roman numerals! MD<CCC>LXXXVIII
```
]
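
By default these repeats are *greedy*: they grab as much as they can. Appending `?` to a repeat makes it *lazy*, so it takes as little as possible. A minimal sketch on the Roman numeral:

``` r
library(stringr)

y <- "MDCCCLXXXVIII"

greedy <- str_extract(y, "C+")    # as many Cs as possible
lazy   <- str_extract(y, "C+?")   # as few Cs as possible (but at least one)
```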

---

.smaller[

``` r
str_view(c("grey", "gray"), "gr[ea]y")
```

```
[1] │ <grey>
[2] │ <gray>
```

``` r
str_view(c("colour", "color"), "colou?r")
```

```
[1] │ <colour>
[2] │ <color>
```
]

Alternation: match one pattern or the other

.smaller[

``` r
str_view(c("red", "blue", "green"), "red|green")
```

```
[1] │ <red>
[3] │ <green>
```
]

Finally, grouping ... the `()` parts can be extracted later (see next slide)

.smaller[

``` r
y <- c("There were 122 in total", "Overall about 390 found", "100 but no more")
str_view(y, "^([A-Za-z]+)[^0-9]+([0-9]+)")
```

```
[1] │ <There were 122> in total
[2] │ <Overall about 390> found
```
]
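
Groups can also be referred back to *inside* the regex: `\1` means "whatever the first group matched". A small sketch on the fruit vector from earlier:

``` r
library(stringr)

x <- c("apple", "banana", "pear")

# Any character immediately followed by itself (a doubled letter)
doubled <- str_detect(x, "(.)\\1")

# Any pair of characters immediately repeated
repeated_pair <- str_extract(x, "(..)\\1")
```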

---

## Using regexes

Too many to cover in lecture ... look up docs!

- `str_detect(x, regex)` returns `TRUE`/`FALSE` depending on whether the regex matches anything
- `str_count()` total number of times the regex matches in each string
    - `str_view()` highlights every match (before `stringr` 1.5.0 it showed only the first and `str_view_all()` was needed; that function is now deprecated)
    - there are many `_all` versions of functions in `stringr` to capture all matches
- `str_extract()` pull out the first matching part of each string
- `str_match()` returns a matrix with the full match and a column for each group match

.smaller[

``` r
y <- c("There were 122 in total", "Overall about 390 found", "100 but no more")
str_match(y, "^([A-Za-z]+)[^0-9]+([0-9]+)")
```

```
     [,1]                [,2]      [,3] 
[1,] "There were 122"    "There"   "122"
[2,] "Overall about 390" "Overall" "390"
[3,] NA                  NA        NA   
```
]

- `str_replace()` replace the first match in each string (`str_replace_all()` replaces every match)
- `str_split()` split each string into pieces at every match
- `str_locate()` start/end position of the first match in each string
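
A quick tour of the functions listed on this slide, sketched on the same example strings (the result names are just for illustration):

``` r
library(stringr)

y <- c("There were 122 in total", "Overall about 390 found", "100 but no more")

starts_with_digit <- str_detect(y, "^[0-9]")        # logical, one per string
n_o          <- str_count(y, "o")                   # lower-case "o"s per string
first_number <- str_extract(y, "[0-9]+")            # first match per string
masked       <- str_replace(y, "[0-9]+", "N")       # replace first match only
no_a         <- str_replace_all("banana", "a", "")  # replace every match
pieces       <- str_split("a, b, c", ", ")          # list of character vectors
where        <- str_locate(y, "[0-9]+")             # matrix: start and end columns
```

`str_split()` returns a list because each input string can produce a different number of pieces.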