
Code has been provided throughout the course in live code blocks, where you get snippets to perform particular tasks by directly leveraging individual packages for particular methods. Therefore, we will not repeat such exercises in the labs; instead we will focus on the full software pipeline which makes running machine learning analyses much easier. Arguably the two top ML pipelines in R (R Core Team, 2021) today are tidymodels (Kuhn and Wickham, 2020) and mlr3 (Lang et al., 2019). The first lab will focus on tidymodels and the second on mlr3.

Data sets

Before beginning with tidymodels, we will ensure you can get the two data sets that will be used for the examples here, which will be used for basic binary and multiclass classification models.

The first is a credit scoring dataset: an important application in banking. The data consist of 4454 observations of 14 variables, each row a loan that was given, together with the status of whether the loan was “good” (i.e. repaid) or went “bad” (i.e. defaulted).

# Uncomment and run the following command first if you do not have the modeldata package
# install.packages("modeldata")
data("credit_data", package = "modeldata")

Use some of the exploration techniques from Lab 1 to explore the dataset before you start modelling.

# Try some plotting function to explore the data here
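For example, a minimal sketch (assuming ggplot2 is installed, e.g. as part of tidyverse) might start with:

library("ggplot2")

# Overall structure and per-variable summaries
str(credit_data)
summary(credit_data)

# How balanced are the good/bad outcomes?
table(credit_data$Status)

# e.g. does the loan amount look different for good vs bad loans?
ggplot(credit_data, aes(x = Status, y = Amount)) +
  geom_boxplot()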

The second data set we will use, which is multiclass, is the famous MNIST handwriting data set. I have already put this data into the format we discussed in lectures for coding of images (an image per row with pixels as features). Download it first (this may take some time, as it is ~40MB):

# Uncomment and run the following command first if you do not have the fst package
# install.packages("fst")
download.file("https://www.louisaslett.com/Courses/MISCADA/mnist.fst", "mnist.fst")

Then read the data in using the high performance file storage package fst (Klik, 2021):

mnist <- fst::read.fst("mnist.fst")

Have a look at this data. If you want to see an image, the following function will help plot any of the images for you. Read and try to understand the two lines involved by looking at the documentation for each step.

# Define a plotting function to transform the flattened image data into a
# square graphic. See Data Coding appendix in the notes for how images are
# represented as tabular data!
library("tidyverse")

plotimages <- function(i) {
  imgs <- mnist |> 
    slice(i) |> 
    mutate(image.num = i) |>
    pivot_longer(x0.y27:x27.y0,
                 names_to = c("x", "y"),
                 names_pattern = "x([0-9]{1,2}).y([0-9]{1,2})",
                 names_transform = list(x = as.integer, y = as.integer),
                 values_to = "greyscale")
  
  ggplot(imgs, aes(x = x, y = y, fill = greyscale)) +
    geom_tile() +
    scale_fill_gradient(low = "white", high = "black") +
    facet_wrap(~ image.num + response, labeller = label_both)
}

We can then look at the individual images, or groups of them. For example, to look at images 21 through 24, we run:

plotimages(21:24)
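You can also pick out images by their label. For example, a small sketch using the response column (the same label column used by plotimages() above):

# Look at the first four images labelled as a 5
plotimages(which(mnist$response == 5)[1:4])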

Tidymodels universe

The tidymodels universe comprises a suite of packages which are focused on different aspects of the machine learning software pipeline, but which employ a coherent and unified interface to simplify use, automate and abstract away common tasks.

The tidymodels package itself is really a so-called meta package, meaning that it contains few important functions of its own and is primarily a way to install the core tidymodels packages which are listed as dependencies. First up, install this package, which will automatically pull in the others for you:

# Uncomment and run the following command first if you do not have the tidymodels package
# install.packages("tidymodels")
# Load tidymodels, which loads a suite of other packages for us
library("tidymodels")

The core tidymodels packages include:

rsample, for data splitting and resampling;
parsnip, a unified interface to a wide range of models;
recipes, for preprocessing and feature engineering;
workflows, for bundling preprocessing and models together;
tune, for hyperparameter tuning;
yardstick, for performance metrics;
dials, for creating tuning parameter objects and grids;
broom, for tidying model output.

We will go through the use of some of these packages as we run through a lab example here, though you may also wish to explore the documentation at https://www.tidymodels.org in your own time.

Data splitting

The first step will be to create a training/testing split using functions from rsample (which will have been loaded by the tidymodels package above).

First, it is good to set the random number seed (recall APTS week 1 (Wilkinson and Wood, 2021)) so we get reproducible results for the purposes of this lab.

Then, because k-nearest neighbour doesn’t support missing data natively, we will eliminate any rows with missingness. Note that it may make sense to explore other options such as missing data imputation, but we take the simplest course of action first.

With this basic pre-processing, we are ready to call initial_split(), which returns an object containing all the data together with a randomised allocation into training/testing portions. We then use the training() and testing() functions to extract each relevant part of the data.

# Set seed so we reproducibly see the same results
set.seed(212)

# Because knn can't handle missings ... maybe try imputation later?
credit_data <- na.omit(credit_data)

# Create a train/test split into 75% / 25% of the data
credit_data_split <- initial_split(credit_data, strata = Status, prop = 3/4)

# We then extract the training and testing portions
train_data <- training(credit_data_split)
test_data <- testing(credit_data_split)

Fitting a model

With the data split into the simplest possible hold-out setup, we can turn to parsnip to fit a variety of models without having to learn each individual package.

Here we’ll fit a logistic regression and k-nearest neighbour.

# First, define a logistic regression model
lr_mod <- logistic_reg()


# Now we need to load the kknn package explicitly, due to a bug awaiting a fix
library("kknn") # Needed due to bug! https://github.com/tidymodels/parsnip/issues/264

# Second, define a knn model. Since KNN has a hyperparameter we can specify that
# here, and because it supports both classification and regression we can
# set which mode we want
knn_mod <- nearest_neighbor(neighbors = 3) |>
  set_mode("classification")

So this is quite different to base R modelling, where you simply call a fitting function right away. Here we have defined a model first, and now we proceed to use this model object to fit to the data, saving the result to a new object ending in _fit. Notice the unified approach for both types of model. We obviously fit to the training data.

# Next we fit the model
lr_fit  <- lr_mod  |> fit(Status ~ ., data = train_data)
knn_fit <- knn_mod |> fit(Status ~ ., data = train_data)

Have a look at the model objects individually:

lr_fit
knn_fit

Do you notice something interesting about the knn_fit object?

It mentions kernels! In fact, the kknn package (Schliep and Hechenbichler, 2016) combines both ideas from the “Local Methods” Chapter in the notes, and weights the neighbours using a kernel. Read the documentation of kknn::kknn for more details.
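As a small sketch (not required for the lab), parsnip exposes this kernel choice through the weight_func argument of nearest_neighbor(), so choosing the "rectangular" kernel recovers the plain unweighted k-nearest neighbours from lectures:

# Plain (unweighted) 3-nearest neighbours via the rectangular kernel
knn_unweighted_mod <- nearest_neighbor(neighbors = 3, weight_func = "rectangular") |>
  set_mode("classification")

knn_unweighted_fit <- knn_unweighted_mod |> fit(Status ~ ., data = train_data)
knn_unweighted_fit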

Evaluating performance

We know that we shouldn’t really look at the apparent error, as it is a downward biased (i.e. optimistic) estimate of the true model error, but let’s do so anyway first!

Apparent error

Here, we’re going to construct a prediction matrix for each model. The first column will contain the label predictions, the following two columns the probabilities for each kind of outcome, and the last column will be the truth.

Remember that our loss function is \mathcal{L} : \mathcal{Y} \times \mathcal{Y} \to [0,\infty), \mathcal{L}(y, \hat{y}), or \mathcal{L} : \mathcal{Y} \times [0,1]^g \to [0,\infty), \mathcal{L}(y, \vec{p}), so the final column provides y, the first column provides \hat{y}, and the middle columns provide \hat{\vec{p}}.

# APPARENT ERRORS!

lr_apparent <- 
  # First column will be predicted label
  predict(lr_fit, train_data) |> 
  # Next two columns will be probabilities for the two labels
  bind_cols(predict(lr_fit, train_data, type = "prob")) |> 
  # Last column will be true label
  bind_cols(train_data |> select(Status))
head(lr_apparent)

knn_apparent <- 
  # First column will be predicted label
  predict(knn_fit, train_data) |> 
  # Next two columns will be probabilities for the two labels
  bind_cols(predict(knn_fit, train_data, type = "prob")) |> 
  # Last column will be true label
  bind_cols(train_data |> select(Status))
head(knn_apparent)

tidymodels automatically names the columns .pred_class for \hat{y} and .pred_good/.pred_bad for \hat{\vec{p}}, whilst y just retains its original naming Status.

We can now make use of the functions in the yardstick (Kuhn and Vaughan, 2021) package to evaluate some apparent error metrics, making sure to provide either \hat{y} or one of the probabilities as required.

mn_log_loss(lr_apparent, truth = Status, .pred_bad)
mn_log_loss(knn_apparent, truth = Status, .pred_bad)

accuracy(lr_apparent, truth = Status, .pred_class)
accuracy(knn_apparent, truth = Status, .pred_class)

conf_mat(lr_apparent, truth = Status, .pred_class)
conf_mat(knn_apparent, truth = Status, .pred_class)

roc_auc(lr_apparent, truth = Status, .pred_bad)
roc_auc(knn_apparent, truth = Status, .pred_bad)

The first is the error with cross-entropy loss (called log loss by some), while the second is accuracy, which corresponds to 0-1 loss (accuracy being one minus the 0-1 error). The third is not an error metric but a so-called confusion matrix, where the columns indicate the truth and the rows the predictions. The final one is the area under the receiver operating characteristic (ROC) curve, which we will cover in the final Chapter of the course (a metric which lies in [0.5,1] for any classifier that does no worse than random guessing, with 1 being perfect).
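If you are curious about the ROC curve itself ahead of that Chapter, a small sketch: yardstick can also compute and plot the curve underlying the AUC value.

# Compute the ROC curve from the probability predictions and plot it
roc_curve(lr_apparent, truth = Status, .pred_bad) |>
  autoplot()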

Again, the key thing to note is the unified interface. For example, we didn’t need to know that the glm() function has a different set of choices for type than those tidymodels uses, or indeed that kknn doesn’t even provide a fitted model. This is a key benefit of using abstraction layers like tidymodels.

You can examine the full list of available metrics in the yardstick package documentation, and if you want to use a custom loss to compute the error you can see the guide to custom metrics in yardstick.
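As an illustration of the mechanism (a sketch, and deliberately not the metrics needed for the exercises below), yardstick’s metric_set() bundles several metrics into a single function which can be applied to a prediction data frame such as lr_apparent:

# Bundle a few class-prediction metrics together and apply them in one go
my_metrics <- metric_set(accuracy, sens, spec)
my_metrics(lr_apparent, truth = Status, estimate = .pred_class)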

Test error

We now compare with a proper estimate on the held out test data. Your turn! Adapt the above code to compute the test error.

# TEST ERRORS!
# Adapt the code above to compute the same measures on the test data

Cross validation

We will actually want to do cross validation a lot of the time, perhaps to select the hyperparameters of the model. This works in much the same way as a holdout set, but now the returned object contains a list $splits holding each fold’s training and testing sets.

# CROSS VALIDATION!
# We'll now do 3-fold cross validation (perhaps want to do more, we choose 3
# just to make sure the lab code runs quite quickly)
credit_data_cv <- vfold_cv(train_data, v = 3)

split1_train <- training(credit_data_cv$splits[[1]])
split1_test <- testing(credit_data_cv$splits[[1]])

split2_train <- training(credit_data_cv$splits[[2]])
split2_test <- testing(credit_data_cv$splits[[2]])

split3_train <- training(credit_data_cv$splits[[3]])
split3_test <- testing(credit_data_cv$splits[[3]])

NOTE: above we’re creating cross validation folds of the training data, not the whole data. This is to highlight how you can mix the approaches we learned in the notes and lectures by, for example, using cross validation within a training split to select hyperparameters, but then evaluating the generalisation error on the held out test data.

Examine these to satisfy yourself that, for example, each of the three folds ends up being the test data once, whilst the remaining two are taken as training in each case.
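For example, a quick sanity check along these lines (a sketch):

# The three assessment (test) folds partition train_data, so their sizes
# should sum to the size of train_data ...
nrow(split1_test) + nrow(split2_test) + nrow(split3_test)
nrow(train_data)

# ... and each fold's training part is the complement of its test part
nrow(split1_train) + nrow(split1_test)
nrow(train_data)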

Now, we could proceed to set up a loop to go over each split, compute errors and average them for a bunch of different hyperparameters, but this is a lot of very similar code to have to churn out. Fortunately, the workflows package comes to the rescue here and will wrap up this repetitive work.

# Better than loops is to use workflows in tidymodels
lr_wf <- workflow() |> 
  add_model(lr_mod) |> 
  add_formula(Status ~ .)

lr_cv_fits <- lr_wf |> 
  fit_resamples(credit_data_cv)

Examine the lr_cv_fits object and you will see it has automatically fitted the model specified by our workflow to the training part of each fold of the cross validation. We can then extract some default metrics provided by tidymodels. Note that tidymodels by default shows accuracy and the area under the ROC curve (neither of which is a strictly proper scoring rule), and the Brier score (which is a strictly proper scoring rule).

lr_cv_fits |> collect_metrics()

Fortunately we can override what metrics are computed when fitting to the different folds. By looking at the help file for fit_resamples, decide what to replace ??? with below in order to compute the two error metrics above on this model (ie cross-entropy loss and 0-1 loss).

# Update metrics below to compute with cross-entropy loss and 0-1 loss
lr_cv_fits <- lr_wf |> 
  fit_resamples(credit_data_cv,
                metrics = ???) |>
  collect_metrics()
lr_cv_fits

Hyper-parameter selection

We are going to take the cross validation further now, using the tune package to fit many models for varying choices of a hyperparameter, from which we can then select a “best” model. The dials package provides us with convenient functions to define which parameters we want to “tune the dial” for!

Below, we’ll try tuning L1 regularisation in the logistic regression. To fit this we must have the glmnet package, so if you don’t, then install it now:

# Uncomment and run the following command first if you do not have the glmnet package
# install.packages("glmnet")

The dials::penalty() function lets us define a range of values to try (and whether to space them on the log or standard scale) – we’ll leave the defaults initially, \log_{10}(\lambda) \in (-10,0). Then, we need to create a new model object which flags any hyperparameter we want to tune by passing the tune() function instead of a value. Look at the help file for logistic_reg and decide what value to replace ??? with to ensure that we perform L1 regularisation. Then, we need to change the ‘engine’ to glmnet, since the base glm() function does not support regularisation.

# Adjust the mixture parameter below to perform L1 regularisation

lr_grid <- grid_regular(penalty(),
                        levels = 10)

lr_tune_mod <- logistic_reg(penalty = tune(),
                            mixture = ???) |>
  set_engine("glmnet")

With that done, we define the workflow in the same way (just with this new model which includes regularisation) and proceed to run tune_grid which now fits not just 1 model per cross validation fold, but rather fits 10 models to every fold for the different penalty choices and computes error metrics for them using the held out fold. Just as with fit_resamples above, we can override this, so use the same errors here for the second ??? as you did above.

# Now, use the metrics found before for tune_grid

lr_tune_wf <- workflow() |> 
  add_model(lr_tune_mod) |> 
  add_formula(Status ~ .)

lr_tune_cv_fits <- lr_tune_wf |> 
  tune_grid(
    resamples = credit_data_cv,
    grid = lr_grid,
    metrics = ???
  )

lr_tune_cv_fits |> collect_metrics()

We can see that the 0-1 and cross-entropy loss errors remain static over quite a large range of the small penalties, meaning that the penalty is too small to result in any regularisation. Decide what range of penalties to “zoom in” on and specify this via the penalty() function in the grid that is set up on the first line (you may need to play around to find the right choice). In other words, change the following line, using the documentation and results above to decide what to place in ???, then rerun the rest of the above.

# Repeat, starting out with a different grid to zoom in on area of relevance
lr_grid <- grid_regular(penalty(???),
                        levels = 10)

# (rerun the previous solution you came up with after changing that line)

Is it worth regularising logistic regression in this case? (Hint 1: It’s just as important to know when not to do something fancy as when to do so! Hint 2: notice the standard error column … we know this isn’t quite calibrated to the true generalisation error from lectures, but it is still ok as a rule-of-thumb when deciding if improvements are likely to be spurious or helpful)

You may find it helpful to plot the 0-1 and cross-entropy errors. This will work with the result of the tune_grid() call above, from either the original or your zoomed-in attempt.

# This plot may help to see what's going on!
lr_tune_cv_fits |>
  collect_metrics() |>
  ggplot(aes(penalty, mean)) +
  geom_line(linewidth = 1.5, alpha = 0.6) +
  geom_point(size = 2) +
  facet_wrap(~ .metric, scales = "free", nrow = 2) +
  scale_x_log10(labels = scales::label_number())
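Relating to Hint 2 above, one possible extension (a sketch) is to add error bars of plus/minus one standard error using the std_err column returned by collect_metrics():

# As before, but with +/- 1 standard error bars around the mean CV error
lr_tune_cv_fits |>
  collect_metrics() |>
  ggplot(aes(penalty, mean)) +
  geom_point(size = 2) +
  geom_errorbar(aes(ymin = mean - std_err, ymax = mean + std_err)) +
  facet_wrap(~ .metric, scales = "free", nrow = 2) +
  scale_x_log10(labels = scales::label_number())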

Nearest neighbour

Try adapting the above code, reading the documentation where necessary, to automatically tune the number of neighbours in the nearest neighbour model.

# Try tuning the k-nearest neighbour to select number of neighbours
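If it helps to get started, a sketch of the first pieces you might need is below (the range given to neighbors() is just an assumption to adjust); you still need to build the workflow and call tune_grid() with your chosen metrics, as for the logistic regression above.

# Flag the number of neighbours for tuning, keeping the kknn engine as before
knn_tune_mod <- nearest_neighbor(neighbors = tune()) |>
  set_mode("classification")

# A candidate grid of neighbour counts (the range here is only a suggestion)
knn_grid <- grid_regular(neighbors(range = c(1, 50)), levels = 10)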

Finalising the model

We can use the show_best and select_best functions to see and pull out the top performing model from the tuning run.

# Examine best model
lr_tune_cv_fits |> show_best(metric = "accuracy")
lr_tune_cv_fits |> show_best(metric = "mn_log_loss")

# Pull out best model
# Really, want to make sure you use a loss *relevant to your real problem* for
# final selection of model!
lr_final_mod <- lr_tune_cv_fits |> select_best(metric = "mn_log_loss")

Since we still have our held back test data from much earlier, we can now refit the “best” model to all the data we used during cross validation (using last_fit) and estimate the test error. Remember to bring your choice of metrics forward to here at ???:

# First, finalise the workflow to only the best model
lr_final_wf <- lr_tune_wf |> 
  finalize_workflow(lr_final_mod)

# Then fit it ... note we now use the credit_data_split from way back at the start
# of the lab, so that we use all the training_data (which was split for CV) to
# fit and then test on the test data
lr_final_fit <- 
  lr_final_wf |>
  last_fit(credit_data_split,
           metrics = ???)

lr_final_fit |> collect_metrics()
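Once you have lr_final_fit, one possible next step (a sketch) is to pull out the finalised fitted workflow, which can then be used to predict on completely new data:

# Extract the fitted workflow from the last_fit() result ...
lr_final_wf_fit <- lr_final_fit |> extract_workflow()

# ... and use it to predict, e.g. on a few rows of the test data
predict(lr_final_wf_fit, new_data = test_data |> slice(1:5))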

Try this also for your tuned knn model:

# Repeat for your KNN model!

References

Klik, M. (2021). fst: Lightning Fast Serialization of Data Frames. R package version 0.9.4. URL https://CRAN.R-project.org/package=fst
Kuhn, M., Vaughan, D. (2021). yardstick: Tidy Characterizations of Model Performance. R package version 0.0.8. URL https://CRAN.R-project.org/package=yardstick
Kuhn, M., Wickham, H. (2020). Tidymodels: a collection of packages for modeling and machine learning using tidyverse principles. URL https://www.tidymodels.org
Lang, M., Binder, M., Richter, J., Schratz, P., Pfisterer, F., Coors, S., Au, Q., Casalicchio, G., Kotthoff, L., Bischl, B. (2019). mlr3: A modern object-oriented machine learning framework in R. Journal of Open Source Software 4(44), 1903. DOI: 10.21105/joss.01903
R Core Team (2021). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL https://www.R-project.org/
Schliep, K., Hechenbichler, K. (2016). kknn: Weighted k-Nearest Neighbors. R package version 1.3.1. URL https://CRAN.R-project.org/package=kknn
Wilkinson, D.J., Wood, S.N. (2021). APTS: Statistical Computing Notes. URL https://warwick.ac.uk/fac/sci/statistics/apts/students/resources/apts-statcomp.pdf