APTS: Statistical Machine Learning Computer Lab 1

There has been code provided throughout the course in live code blocks, where you get snippets of code to perform particular tasks by directly leveraging individual packages for particular methods. Therefore, we will not repeat such exercises in the labs and instead we will focus on the full software pipeline which makes running machine learning analyses much easier. Arguably the two top ML pipelines in R (R Core Team, 2021) today are tidymodels (Kuhn and Wickham, 2020) and mlr3 (Lang et al., 2019). The first lab will focus on tidymodels and the second on mlr3.

Data sets

Before beginning with tidymodels, we will ensure you can get the two data sets that will be used for the examples here, which will be suitable for basic binary and multilabel classification models.

The first is a credit scoring dataset: an important application in banking. The data consist of 4454 observations of 14 variables, each a loan that was given together with the status of whether the loan was “good” (ie repaid) or went “bad” (ie defaulted).

# Uncomment and run the following command first if you do not have the modeldata package
# install.packages("modeldata")
data("credit_data", package = "modeldata")

Use some of the exploration techniques from Lab 1 to explore the dataset before you start modelling.

# Try some plotting functions to explore the data here

The second dataset we will use which is multiclass is the famous MNIST handwriting data set. I have already put this data into the format we discussed in lectures for coding of images (an image per row with pixels as features). Download it first (this may take some time, it is ~40MB)

# Uncomment and run the following command first if you do not have the fst package
# install.packages("fst")
download.file("https://www.louisaslett.com/Courses/MISCADA/mnist.fst", "mnist.fst")

Then read the data in use the high performance file storage package fst (Klik, 2021):

mnist <- fst::read.fst("mnist.fst")

Have a look at this data. If you want to see an image, the following function will help plot any of the images for you. Read and try to understand the two lines involved by looking at the documentation for each step.

# Define a plotting function to transform the flattened image data into a
# square graphic. See Data Coding appendix in the notes for how images are
# represented as tabular data!
library("tidyverse")

plotimages <- function(i) {
  imgs <- mnist |> 
    slice(i) |> 
    mutate(image.num = i) |>
    pivot_longer(x0.y27:x27.y0,
                 names_to = c("x", "y"),
                 names_pattern = "x([0-9]{1,2}).y([0-9]{1,2})",
                 names_transform = list(x = as.integer, y = as.integer),
                 values_to = "greyscale")
  
  ggplot(imgs, aes(x = x, y = y, fill = greyscale)) +
    geom_tile() +
    scale_fill_gradient(low = "white", high = "black") +
    facet_wrap(~ image.num + response, labeller = label_both)
}

We can then look at the individual images, or groups of them. For example, to look at images 21 through 24, we run:

plotimages(21:24)

Tidymodels universe

The tidymodels universe (see https://www.tidymodels.org/) comprises a suite of packages which are focused on different aspects of the machine learning software pipeline, but which employ a coherent and unified interface to simplify use, automate and abstract away common tasks.

The tidymodels package itself is really a so-called meta package, meaning that it doesn’t contain any important functions itself and is primarily a way to install the core tidyverse packages which are listed as dependencies. First up, install this package which will automatically pull in the others for you:

# Uncomment and run the following command first if you do not have the tidymodels package
# install.packages("tidymodels")

# Load tidymodels, which loads a suite of other packages for us
library("tidymodels")

## ── Attaching packages ────────────────────────────────────── tidymodels 1.3.0 ──

## ✔ broom        1.0.8     ✔ rsample      1.3.1
## ✔ dials        1.4.1     ✔ tune         1.3.0
## ✔ infer        1.0.9     ✔ workflows    1.2.0
## ✔ modeldata    1.5.0     ✔ workflowsets 1.1.1
## ✔ parsnip      1.3.2     ✔ yardstick    1.3.2
## ✔ recipes      1.3.1

## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## ✖ scales::discard() masks purrr::discard()
## ✖ dplyr::filter()   masks stats::filter()
## ✖ recipes::fixed()  masks stringr::fixed()
## ✖ dplyr::lag()      masks stats::lag()
## ✖ yardstick::spec() masks readr::spec()
## ✖ recipes::step()   masks stats::step()

The core tidymodels packages include:

rsample which does the data splitting techniques such as train/validate/test, cross-validation and bootstrap resampling from Chapter 5 in the notes;
parsnip which provides a unified interface to different modelling methods, such as those of Chapters 4 and 6, so that you can try many methods without having to learn the different underlying pacakge interfaces;
recipies which enables you to perform common data pre-processing tasks in a way that makes replication of those steps easy when you load future data to perform predictions;
workflows which facilitates the creation of end-to-end workflows to execute all steps of a machine learning pipeline;
tune which provides a framework for empirical selection of hyperparameters;
dials offering tools to manage tuning hyperparameters;
yardstick which incorporates many different loss functions (such as those in Chapter 3), as well as other metrics (such as those in Chapter 7);
broom for formatting and outputting information about fitted models in a readable way;

We will go through the use of some of these packages as we run though a lab example here, though it is worthwhile looking at the documentation and vignettes of these packages for more detail than we can cover in a lab. Remember, a key strength of much of the R ecosystem is strong documentation and tidymodels keeps up that tradition.

Data splitting

The first step will be to create a training/testing split using functions from rsample (which will have been loaded by the tidymodels package above).

First, it is good to set the random number seed (recall APTS week 1 (Lee et al., 2024)) so we get reproducible results for the purposes of this lab.

Then, because $k$-nearest neighbour doesn’t support missing data natively, we will eliminate any rows with missingness. Note that it may make sense to explore other options such as missing data imputation, but we take the simplest course of action first.

With this basic pre-processing, we are ready to call initial_split() which is the function that returns an object which will contain all the data but internally randomises it to training/testing. We then use the training() and testing() functions to extract each relevant part of the data.

# Set seed so we reproducibly see the same results
set.seed(212)

# Because knn can't handle missings ... maybe try imputation later?
credit_data <- na.omit(credit_data)

# Create a train/test split into 75% / 25% of the data
credit_data_split <- initial_split(credit_data, strata = Status, prop = 3/4)

# We then extract the training and testing portions
train_data <- training(credit_data_split)
test_data <- testing(credit_data_split)

Fitting a model

With the data split into the simplest possible hold-out setup, we can turn to parsnip to fit a variety of models without having to learn each individual package.

Here we’ll fit a logistic regression and $k$-nearest neighbour.

# First, define a logistic regression model
lr_mod <- logistic_reg()

# Second, define a knn model. Since KNN has a hyperparameter we can specify that
# here, and because it supports both classification and regression we can
# set which mode we want. Both these were unnecessary for logistic regression.
knn_mod <- nearest_neighbor(neighbors = 3) |>
  set_mode("classification")

So this is quite different to base R modelling, where you simply call a fitting function right away. Here we have defined a model first and now we proceed to use this model object to fit to the data and save it to a new object ending _fit. Notice the unified approach for both types of model. We obviously fit to the training data.

# Next we fit the model
lr_fit  <- lr_mod  |> fit(Status ~ ., data = train_data)
knn_fit <- knn_mod |> fit(Status ~ ., data = train_data)

Have a look at the model objects individually:

lr_fit

## parsnip model object
## 
## 
## Call:  stats::glm(formula = Status ~ ., family = stats::binomial, data = data)
## 
## Coefficients:
##      (Intercept)         Seniority         Homeother         Homeowner  
##        2.469e-01         8.482e-02         2.585e-01         1.119e+00  
##      Homeparents          Homepriv          Homerent              Time  
##        1.058e+00         5.901e-01         4.534e-01        -3.346e-03  
##              Age    Maritalmarried  Maritalseparated     Maritalsingle  
##       -8.346e-03         1.471e+00         9.621e-02         8.330e-01  
##     Maritalwidow        Recordsyes      Jobfreelance         Jobothers  
##        3.435e-01        -1.823e+00        -3.437e-01        -3.099e-01  
##       Jobpartime          Expenses            Income            Assets  
##       -1.569e+00        -1.973e-02         7.536e-03         1.755e-05  
##             Debt            Amount             Price  
##       -1.245e-04        -2.199e-03         1.211e-03  
## 
## Degrees of Freedom: 3027 Total (i.e. Null);  3005 Residual
## Null Deviance:       3432 
## Residual Deviance: 2483  AIC: 2529

knn_fit

## parsnip model object
## 
## 
## Call:
## kknn::train.kknn(formula = Status ~ ., data = data, ks = min_rows(3,     data, 5))
## 
## Type of response variable: nominal
## Minimal misclassification: 0.2790621
## Best kernel: optimal
## Best k: 3

Do you notice something interesting about the knn_fit object?

It mentions kernels! In fact, the kknn package (Schliep and Hechenbichler, 2016) combines both ideas from the “Local Methods” Chapter in the notes, and weights the neighbours using a kernel. Read the documentation of kknn::kknn for more details.

Evaluating performance

We know that we shouldn’t really look at the apparent error as it is upward biased for the true model error, but let’s do so anyway first!

Apparent error

Here, we’re going to construct a prediction matrix for each model. The first column will contain the label predictions, the following two columns the probabilities for each kind of outcome, and the last column will be the truth.

Remember that our loss function is $\mathcal{L} : \mathcal{Y} \times [0,1]^g \to [0,\infty)$, $\mathcal{L}(y, \hat{\vec{p}})$ or $\mathcal{L} : \mathcal{Y} \times \mathcal{Y} \to [0,\infty)$, $\mathcal{L}(y, \hat{y})$ so the first column provides $y$, the middle columns provide $\hat{\vec{p}}$, and the final column provides $\hat{y}$.

# APPARENT ERRORS!

lr_apparent <- 
  # First column will be true label
  train_data |> select(Status) |> 
  # Next two columns will be probabilities for the two labels
  bind_cols(predict(lr_fit, train_data, type = "prob")) |> 
  # Last column will be predicted label
  bind_cols(predict(lr_fit, train_data))
head(lr_apparent)

##   Status .pred_bad .pred_good .pred_class
## 1    bad 0.7909789  0.2090211         bad
## 2    bad 0.3671175  0.6328825        good
## 3    bad 0.5463549  0.4536451         bad
## 4    bad 0.5007599  0.4992401         bad
## 5    bad 0.4499215  0.5500785        good
## 6    bad 0.5926929  0.4073071         bad

knn_apparent <- 
  # First column will be true label
  train_data |> select(Status) |> 
  # Next two columns will be probabilities for the two labels
  bind_cols(predict(knn_fit, train_data, type = "prob")) |> 
  # Last column will be predicted label
  bind_cols(predict(knn_fit, train_data))
head(knn_apparent)

##   Status .pred_bad .pred_good .pred_class
## 1    bad 0.9287185 0.07128153         bad
## 2    bad 0.9287185 0.07128153         bad
## 3    bad 1.0000000 0.00000000         bad
## 4    bad 0.6702623 0.32973768         bad
## 5    bad 0.6702623 0.32973768         bad
## 6    bad 0.9287185 0.07128153         bad

tidymodels automatically names the columns .pred_class for $\hat{y}$ and .pred_good/.pred_bad for $\hat{\vec{p}}$, whilst $y$ just retains its original naming Status.

We can now make use of the functions in the yardstick (Kuhn and Vaughan, 2021) package to evaluate some apparent error metrics, making sure to provide either $\hat{y}$ or one of the probabilities as required.

mn_log_loss(lr_apparent, truth = Status, .pred_bad)

## # A tibble: 1 × 3
##   .metric     .estimator .estimate
##   <chr>       <chr>          <dbl>
## 1 mn_log_loss binary         0.410

mn_log_loss(knn_apparent, truth = Status, .pred_bad)

## # A tibble: 1 × 3
##   .metric     .estimator .estimate
##   <chr>       <chr>          <dbl>
## 1 mn_log_loss binary         0.107

accuracy(lr_apparent, truth = Status, .pred_class)

## # A tibble: 1 × 3
##   .metric  .estimator .estimate
##   <chr>    <chr>          <dbl>
## 1 accuracy binary         0.819

accuracy(knn_apparent, truth = Status, .pred_class)

## # A tibble: 1 × 3
##   .metric  .estimator .estimate
##   <chr>    <chr>          <dbl>
## 1 accuracy binary         1.000

conf_mat(lr_apparent, truth = Status, .pred_class)

##           Truth
## Prediction  bad good
##       bad   373  153
##       good  396 2106

conf_mat(knn_apparent, truth = Status, .pred_class)

##           Truth
## Prediction  bad good
##       bad   769    1
##       good    0 2258

roc_auc(lr_apparent, truth = Status, .pred_bad)

## # A tibble: 1 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 roc_auc binary         0.846

roc_auc(knn_apparent, truth = Status, .pred_bad)

## # A tibble: 1 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 roc_auc binary         1.000

The first is the error with cross entropy loss (called log loss by some), then the second uses 0-1 loss. The third is not an error metric but a so-called confusion matrix, where the columns indicate the truth and the rows the predictions. The final is the area under the receiver operator characteristic curve, which we will cover in the final Chapter of the course (this is a metric in the range $[0.5,1]$).

Again, the key thing to note is the unified interface. For example, we didn’t need to know that the glm() function has a different set of choices for type than those tidymodels uses, or indeed that kknn doesn’t even provide a fitted model. This is a key benefit of using abstraction layers like tidymodels.

You can examine the available metrics here and if you want to use a custom loss to compute the error you can see the guide to custom metrics in yardstick.

Test error

We compare now doing a proper test accuracy score. Your turn! Adapt the above code to compute the test error

# TEST ERRORS!
# Adapt the code above to compute the same measures on the test data

## SOLUTION

lr_test <- 
  # First column will be true label
  test_data |> select(Status) |> 
  # Next two columns will be probabilities for the two labels
  bind_cols(predict(lr_fit, test_data, type = "prob")) |> 
  # Last column will be predicted label
  bind_cols(predict(lr_fit, test_data))
head(lr_test)

##   Status .pred_bad .pred_good .pred_class
## 1   good 0.2815948 0.71840521        good
## 2    bad 0.4964453 0.50355473        good
## 3   good 0.2829220 0.71707797        good
## 4   good 0.1736968 0.82630319        good
## 5    bad 0.9467033 0.05329668         bad
## 6   good 0.2269034 0.77309664        good

knn_test <- 
  # First column will be true label
  test_data |> select(Status) |> 
  # Next two columns will be probabilities for the two labels
  bind_cols(predict(knn_fit, test_data, type = "prob")) |> 
  # Last column will be predicted label
  bind_cols(predict(knn_fit, test_data))
head(knn_test)

##   Status  .pred_bad .pred_good .pred_class
## 1   good 0.07128153  0.9287185        good
## 2    bad 0.00000000  1.0000000        good
## 3   good 0.00000000  1.0000000        good
## 4   good 0.00000000  1.0000000        good
## 5    bad 1.00000000  0.0000000         bad
## 6   good 0.00000000  1.0000000        good

mn_log_loss(lr_test, truth = Status, .pred_bad)

## # A tibble: 1 × 3
##   .metric     .estimator .estimate
##   <chr>       <chr>          <dbl>
## 1 mn_log_loss binary         0.445

mn_log_loss(knn_test, truth = Status, .pred_bad)

## # A tibble: 1 × 3
##   .metric     .estimator .estimate
##   <chr>       <chr>          <dbl>
## 1 mn_log_loss binary          3.84

accuracy(lr_test, truth = Status, .pred_class)

## # A tibble: 1 × 3
##   .metric  .estimator .estimate
##   <chr>    <chr>          <dbl>
## 1 accuracy binary         0.804

accuracy(knn_test, truth = Status, .pred_class)

## # A tibble: 1 × 3
##   .metric  .estimator .estimate
##   <chr>    <chr>          <dbl>
## 1 accuracy binary         0.723

conf_mat(lr_test, truth = Status, .pred_class)

##           Truth
## Prediction bad good
##       bad  113   54
##       good 144  700

conf_mat(knn_test, truth = Status, .pred_class)

##           Truth
## Prediction bad good
##       bad  109  132
##       good 148  622

roc_auc(lr_test, truth = Status, .pred_bad)

## # A tibble: 1 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 roc_auc binary         0.809

roc_auc(knn_test, truth = Status, .pred_bad)

## # A tibble: 1 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 roc_auc binary         0.694

Cross validation

We will actually want to do cross validation a lot of the time, perhaps to select the hyperparameters of the model. This works in much the same way as a holdout set, but now there is a list $splits which contains all possible combinations of training and testing sets.

# CROSS VALIDATION!
# We'll now do 3-fold cross validation (perhaps want to do more, we choose 3
# just to make sure the lab code runs quite quickly)
credit_data_cv <- vfold_cv(train_data, v = 3)

split1_train <- training(credit_data_cv$splits[[1]])
split1_test <- testing(credit_data_cv$splits[[1]])

split2_train <- training(credit_data_cv$splits[[2]])
split2_test <- testing(credit_data_cv$splits[[2]])

split3_train <- training(credit_data_cv$splits[[3]])
split3_test <- testing(credit_data_cv$splits[[3]])

NOTE: that above we’re creating cross validation folds of the training data, not the whole data. This is to highlight how you can mix the approaches we learned, in the notes and lectures by, for example, using cross validation within a training split to select hyperparameters, but perhaps evaluating the generalisation error on held out test data.

Examine these to satisfy yourself that, for example, each of the three folds ends up being the test data once, whilst the remaining two are taken as training in each case.

Now, we could proceed to setup a loop to go over each split, compute errors and average them for a bunch of different hyperparameters, but this is a lot of very similar code to have to churn out. Fortunately, the workflow package comes to the rescue here and will wrap up this repetitive work.

# Better than loops is to use workflows in tidymodels
lr_wf <- workflow() |> 
  add_model(lr_mod) |> 
  add_formula(Status ~ .)

lr_cv_fits <- lr_wf |> 
  fit_resamples(credit_data_cv)

Examine the lr_cv_fits object and you will see it has automatically fitted the model specified by our workflow to the training part of each fold of the cross validation. We can then extract some default metrics provided by tidymodels. Note that tidymodels by default shows accuracy, ROC (neither of which is a strictly proper scoring rule), and the Brier score (which is a strictly proper scoring rule).

lr_cv_fits |> collect_metrics()

## # A tibble: 3 × 6
##   .metric     .estimator  mean     n std_err .config             
##   <chr>       <chr>      <dbl> <int>   <dbl> <chr>               
## 1 accuracy    binary     0.810     3 0.00448 Preprocessor1_Model1
## 2 brier_class binary     0.133     3 0.00157 Preprocessor1_Model1
## 3 roc_auc     binary     0.839     3 0.00666 Preprocessor1_Model1

Fortunately we can override what metrics are computed when fitting to the different folds. By looking at the help file for fit_resamples, decide what to replace ??? with below in order to compute the two error metrics above on this model (ie cross-entropy loss and 0-1 loss).

# Update metrics below to compute with cross-entropy loss and 0-1 loss
lr_cv_fits <- lr_wf |> 
  fit_resamples(credit_data_cv,
                metrics = ???) |>
  collect_metrics()
lr_cv_fits

## SOLUTION

lr_cv_fits <- lr_wf |> 
  fit_resamples(credit_data_cv,
                metrics = metric_set(mn_log_loss,
                                     accuracy)) |>
  collect_metrics()
lr_cv_fits

## # A tibble: 2 × 6
##   .metric     .estimator  mean     n std_err .config             
##   <chr>       <chr>      <dbl> <int>   <dbl> <chr>               
## 1 accuracy    binary     0.810     3 0.00448 Preprocessor1_Model1
## 2 mn_log_loss binary     0.418     3 0.00437 Preprocessor1_Model1

Hyper-parameter selection

We are going to take the cross validation further now, but using the tune package to fit lots of models for varying choices of a hyperparameter and then we can select a “best” model. The dials package provides us with convenient functions to define what parameters we want to “tune the dial” for!

Below, we’ll try tuning L1 regularisation in the logistic regression. To fit this we must have the glmnet package, so if you don’t, then install it now:

# Uncomment and run the following command first if you do not have the glmnet package
# install.packages("glmnet")

The dials::penalty() function lets us define a range of values to try (and whether to space them on the log or standard scale) – we’ll leave the defaults initially, $\lambda \in (-10,0)$ on the $log_{10}$ scale. Then, we need to create a new model object which flags any hyperparameter we want to tune by passing the tune() function instead of a value. Look at the help file for logistic_reg and decide what value to replace ??? with to ensure that we perform L1 regularisation. Then, we need to change the ‘engine’ to glmnet since that base glm function does not support regularisation.

# Adjust the mixture parameter below to perform L1 regularisation

lr_grid <- grid_regular(penalty(),
                        levels = 10)

lr_tune_mod <- logistic_reg(penalty = tune(),
                            mixture = ???) |>
  set_engine("glmnet")

# SOLUTION

lr_grid <- grid_regular(penalty(),
                        levels = 10)

lr_tune_mod <- logistic_reg(penalty = tune(),
                            mixture = 1) |>
  set_engine("glmnet")

With that done, we define the workflow in the same way (just with this new model which includes regularisation) and proceed to run tune_grid which now fits not just 1 model per cross validation fold, but rather fits 10 models to every fold for the different penalty choices and computes error metrics for them using the held out fold. Just as with fit_resamples above, we can override this, so use the same errors here for the second ??? as you did above.

# Now, use the metrics found before for tune_grid

lr_tune_wf <- workflow() |> 
  add_model(lr_tune_mod) |> 
  add_formula(Status ~ .)

lr_tune_cv_fits <- lr_tune_wf |> 
  tune_grid(
    resamples = credit_data_cv,
    grid = lr_grid,
    metrics = ???
  )

lr_tune_cv_fits |> collect_metrics()

# SOLUTION

lr_tune_wf <- workflow() |> 
  add_model(lr_tune_mod) |> 
  add_formula(Status ~ .)

lr_tune_cv_fits <- lr_tune_wf |> 
  tune_grid(
    resamples = credit_data_cv,
    grid = lr_grid,
    metrics = metric_set(mn_log_loss,
                         accuracy)
  )

lr_tune_cv_fits |> collect_metrics()

## # A tibble: 20 × 7
##          penalty .metric     .estimator  mean     n std_err .config             
##            <dbl> <chr>       <chr>      <dbl> <int>   <dbl> <chr>               
##  1 0.0000000001  accuracy    binary     0.811     3 0.00489 Preprocessor1_Model…
##  2 0.0000000001  mn_log_loss binary     0.417     3 0.00422 Preprocessor1_Model…
##  3 0.00000000129 accuracy    binary     0.811     3 0.00489 Preprocessor1_Model…
##  4 0.00000000129 mn_log_loss binary     0.417     3 0.00422 Preprocessor1_Model…
##  5 0.0000000167  accuracy    binary     0.811     3 0.00489 Preprocessor1_Model…
##  6 0.0000000167  mn_log_loss binary     0.417     3 0.00422 Preprocessor1_Model…
##  7 0.000000215   accuracy    binary     0.811     3 0.00489 Preprocessor1_Model…
##  8 0.000000215   mn_log_loss binary     0.417     3 0.00422 Preprocessor1_Model…
##  9 0.00000278    accuracy    binary     0.811     3 0.00489 Preprocessor1_Model…
## 10 0.00000278    mn_log_loss binary     0.417     3 0.00422 Preprocessor1_Model…
## 11 0.0000359     accuracy    binary     0.811     3 0.00489 Preprocessor1_Model…
## 12 0.0000359     mn_log_loss binary     0.417     3 0.00422 Preprocessor1_Model…
## 13 0.000464      accuracy    binary     0.811     3 0.00534 Preprocessor1_Model…
## 14 0.000464      mn_log_loss binary     0.417     3 0.00391 Preprocessor1_Model…
## 15 0.00599       accuracy    binary     0.808     3 0.00545 Preprocessor1_Model…
## 16 0.00599       mn_log_loss binary     0.421     3 0.00279 Preprocessor1_Model…
## 17 0.0774        accuracy    binary     0.752     3 0.00679 Preprocessor1_Model…
## 18 0.0774        mn_log_loss binary     0.517     3 0.00429 Preprocessor1_Model…
## 19 1             accuracy    binary     0.746     3 0.00526 Preprocessor1_Model…
## 20 1             mn_log_loss binary     0.567     3 0.00560 Preprocessor1_Model…

We can see that the 0-1 and cross-entropy loss errors remain static over quite a large range of the small penalties, meaning that the penalty is too small to result in any regularisation. Decide what range of penalties to “zoom in” on specify this via the penalty function in the grid that is setup on the first line (you may need to play around to find the right choice). In other words, change the following line using the documentation and results above to decide what to place in ???, then rerun the rest of the above.

# Repeat, starting out with a different grid to zoom in on area of relevance
lr_grid <- grid_regular(penalty(???),
                        levels = 10)

# (rerun the previous solution you came up with after changing that line)

# SOLUTION

lr_grid <- grid_regular(penalty(c(-3.7,-2.3)),
                        levels = 10)

lr_tune_mod <- logistic_reg(penalty = tune(),
                            mixture = 1) |>
  set_engine("glmnet")

lr_tune_wf <- workflow() |> 
  add_model(lr_tune_mod) |> 
  add_formula(Status ~ .)

lr_tune_cv_fits <- lr_tune_wf |> 
  tune_grid(
    resamples = credit_data_cv,
    grid = lr_grid,
    metrics = metric_set(mn_log_loss,
                         accuracy)
  )
lr_tune_cv_fits |> collect_metrics()

## # A tibble: 20 × 7
##     penalty .metric     .estimator  mean     n std_err .config              
##       <dbl> <chr>       <chr>      <dbl> <int>   <dbl> <chr>                
##  1 0.000200 accuracy    binary     0.810     3 0.00473 Preprocessor1_Model01
##  2 0.000200 mn_log_loss binary     0.417     3 0.00411 Preprocessor1_Model01
##  3 0.000285 accuracy    binary     0.811     3 0.00506 Preprocessor1_Model02
##  4 0.000285 mn_log_loss binary     0.417     3 0.00405 Preprocessor1_Model02
##  5 0.000408 accuracy    binary     0.811     3 0.00534 Preprocessor1_Model03
##  6 0.000408 mn_log_loss binary     0.417     3 0.00395 Preprocessor1_Model03
##  7 0.000584 accuracy    binary     0.812     3 0.00513 Preprocessor1_Model04
##  8 0.000584 mn_log_loss binary     0.417     3 0.00384 Preprocessor1_Model04
##  9 0.000836 accuracy    binary     0.814     3 0.00506 Preprocessor1_Model05
## 10 0.000836 mn_log_loss binary     0.417     3 0.00374 Preprocessor1_Model05
## 11 0.00120  accuracy    binary     0.813     3 0.00572 Preprocessor1_Model06
## 12 0.00120  mn_log_loss binary     0.417     3 0.00360 Preprocessor1_Model06
## 13 0.00171  accuracy    binary     0.813     3 0.00657 Preprocessor1_Model07
## 14 0.00171  mn_log_loss binary     0.417     3 0.00342 Preprocessor1_Model07
## 15 0.00245  accuracy    binary     0.810     3 0.00671 Preprocessor1_Model08
## 16 0.00245  mn_log_loss binary     0.418     3 0.00326 Preprocessor1_Model08
## 17 0.00350  accuracy    binary     0.811     3 0.00654 Preprocessor1_Model09
## 18 0.00350  mn_log_loss binary     0.419     3 0.00311 Preprocessor1_Model09
## 19 0.00501  accuracy    binary     0.808     3 0.00492 Preprocessor1_Model10
## 20 0.00501  mn_log_loss binary     0.420     3 0.00290 Preprocessor1_Model10

Is it worth regularising logistic regression in this case? (Hint 1: It’s just as important to know when not to do something fancy as when to do so! Hint 2: notice the standard error column … we know this isn’t quite calibrated to the true generalisation error from lectures, but it is still ok as a rule-of-thumb when deciding if improvements are likely to be spurious or helpful)

You may find it helpful to plot the 0-1 and cross entropy. This will work with the result of the tune_grid function above from both the original or your zoomed in attempt.

# This plot may help to see what's going on!
lr_tune_cv_fits |>
  collect_metrics() |>
  ggplot(aes(penalty, mean)) +
  geom_line(linewidth = 1.5, alpha = 0.6) +
  geom_point(size = 2) +
  facet_wrap(~ .metric, scales = "free", nrow = 2) +
  scale_x_log10(labels = scales::label_number())

Nearest neighbour

Try adapting the above code, reading the documentation where necessary, to automatically tune the number of neighbours in the nearest neighbour model.

# Try tuning the k-nearest neighbour to select number of neighbours

# SOLUTION

knn_grid <- grid_regular(neighbors(c(1,50)),
                        levels = 25)

knn_tune_mod <- nearest_neighbor(neighbors = tune()) |>
  set_mode("classification")

knn_tune_wf <- workflow() |> 
  add_model(knn_tune_mod) |> 
  add_formula(Status ~ .)

knn_tune_cv_fits <- knn_tune_wf |> 
  tune_grid(
    resamples = credit_data_cv,
    grid = knn_grid,
    metrics = metric_set(mn_log_loss,
                         accuracy)
  )

knn_tune_cv_fits |> collect_metrics()

## # A tibble: 50 × 7
##    neighbors .metric     .estimator  mean     n std_err .config              
##        <int> <chr>       <chr>      <dbl> <int>   <dbl> <chr>                
##  1         1 accuracy    binary     0.725     3 0.0117  Preprocessor1_Model01
##  2         1 mn_log_loss binary     9.93      3 0.422   Preprocessor1_Model01
##  3         3 accuracy    binary     0.725     3 0.0117  Preprocessor1_Model02
##  4         3 mn_log_loss binary     3.41      3 0.181   Preprocessor1_Model02
##  5         5 accuracy    binary     0.758     3 0.0103  Preprocessor1_Model03
##  6         5 mn_log_loss binary     2.09      3 0.182   Preprocessor1_Model03
##  7         7 accuracy    binary     0.764     3 0.00943 Preprocessor1_Model04
##  8         7 mn_log_loss binary     1.60      3 0.156   Preprocessor1_Model04
##  9         9 accuracy    binary     0.768     3 0.00814 Preprocessor1_Model05
## 10         9 mn_log_loss binary     1.28      3 0.228   Preprocessor1_Model05
## # ℹ 40 more rows

knn_tune_cv_fits |>
  collect_metrics() |>
  ggplot(aes(neighbors, mean)) +
  geom_line(linewidth = 1.5, alpha = 0.6) +
  geom_point(size = 2) +
  facet_wrap(~ .metric, scales = "free", nrow = 2)

Finalising the model

We can use the show_best and select_best functions to see and pull out the top performing model from the tuning run.

# Examine best model
lr_tune_cv_fits |> show_best(metric = "accuracy")

## # A tibble: 5 × 7
##    penalty .metric  .estimator  mean     n std_err .config              
##      <dbl> <chr>    <chr>      <dbl> <int>   <dbl> <chr>                
## 1 0.000836 accuracy binary     0.814     3 0.00506 Preprocessor1_Model05
## 2 0.00120  accuracy binary     0.813     3 0.00572 Preprocessor1_Model06
## 3 0.00171  accuracy binary     0.813     3 0.00657 Preprocessor1_Model07
## 4 0.000584 accuracy binary     0.812     3 0.00513 Preprocessor1_Model04
## 5 0.000408 accuracy binary     0.811     3 0.00534 Preprocessor1_Model03

lr_tune_cv_fits |> show_best(metric = "mn_log_loss")

## # A tibble: 5 × 7
##    penalty .metric     .estimator  mean     n std_err .config              
##      <dbl> <chr>       <chr>      <dbl> <int>   <dbl> <chr>                
## 1 0.000584 mn_log_loss binary     0.417     3 0.00384 Preprocessor1_Model04
## 2 0.000836 mn_log_loss binary     0.417     3 0.00374 Preprocessor1_Model05
## 3 0.000408 mn_log_loss binary     0.417     3 0.00395 Preprocessor1_Model03
## 4 0.00120  mn_log_loss binary     0.417     3 0.00360 Preprocessor1_Model06
## 5 0.000285 mn_log_loss binary     0.417     3 0.00405 Preprocessor1_Model02

# Pull out best model
# Really, want to make sure you use a loss *relevant to your real problem* for
# final selection of model!
lr_final_mod <- lr_tune_cv_fits |> select_best(metric = "mn_log_loss")

Since we have out held back test data from much earlier, we can now refit the “best” model to all the data we used during cross validation (using last_fit) and estimate the test error. Remember to bring your choice of metrics forward to here at ???:

# First, finalise the workflow to only the best model
lr_final_wf <- lr_tune_wf |> 
  finalize_workflow(lr_final_mod)

# Then fit it ... note we now use the credit_data_split from way back at the start
# of the lab, so that we use all the training_data (which was split for CV) to
# fit and then test on the test data
lr_final_fit <- 
  lr_final_wf |>
  last_fit(credit_data_split,
           metrics = ???)

lr_final_fit |> collect_metrics()

# SOLUTION

lr_final_wf <- lr_tune_wf |> 
  finalize_workflow(lr_final_mod)

lr_final_fit <- 
  lr_final_wf |>
  last_fit(credit_data_split,
           metrics = metric_set(mn_log_loss,
                         accuracy))

lr_final_fit |> collect_metrics()

## # A tibble: 2 × 4
##   .metric     .estimator .estimate .config             
##   <chr>       <chr>          <dbl> <chr>               
## 1 accuracy    binary         0.808 Preprocessor1_Model1
## 2 mn_log_loss binary         0.443 Preprocessor1_Model1

Try this also for your tuned knn model:

# Repeat for your KNN model!

# SOLUTION

knn_final_mod <- knn_tune_cv_fits |> select_best(metric = "accuracy")

knn_final_wf <- knn_tune_wf |> 
  finalize_workflow(knn_final_mod)

knn_final_fit <- 
  knn_final_wf |>
  last_fit(credit_data_split,
           metrics = metric_set(mn_log_loss,
                         accuracy))

knn_final_fit |> collect_metrics()

## # A tibble: 2 × 4
##   .metric     .estimator .estimate .config             
##   <chr>       <chr>          <dbl> <chr>               
## 1 accuracy    binary         0.775 Preprocessor1_Model1
## 2 mn_log_loss binary         0.594 Preprocessor1_Model1

References

Klik, M. (2021). fst: Lightning Fast Serialization of Data Frames. R package version 0.9.4. URL https://CRAN.R-project.org/package=fst

Kuhn, M., Vaughan, D. (2021). yardstick: Tidy Characterizations of Model Performance. R package version 0.0.8. URL https://CRAN.R-project.org/package=yardstick

Kuhn, M., Wickham, H. (2020). Tidymodels: a collection of packages for modeling and machine learning using tidyverse principles. URL https://www.tidymodels.org

Lang, M., Binder, M., Richter, J., Schratz, P., Pfisterer, F., Coors, S., Au, Q., Casalicchio, G., Kotthoff, L., Bischl, B. (2019). mlr3: A modern object-oriented machine learning framework in R. Journal of Open Source Software 4(44), 1903. DOI: 10.21105/joss.01903

Lee, A., Fasiolo, M., Wilkinson, D.J., Lindgren, F., Wood, S.N. (2024). APTS: Statistical Computing Notes. URL https://awllee.github.io/APTS-SC/

R Core Team (2021). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL https://www.R-project.org/

Schliep, K., Hechenbichler, K. (2016). kknn: Weighted k-Nearest Neighbors. R package version 1.3.1. URL https://CRAN.R-project.org/package=kknn