There has been code provided throughout the course in live code blocks, where you get snippets of code to perform particular tasks by directly leveraging individual packages for particular methods. Therefore, we will not repeat such exercises in the labs and instead we will focus on the full software pipeline, which makes running machine learning analyses much easier. Arguably the two top ML pipelines in R (R Core Team, 2021) today are tidymodels (Kuhn and Wickham, 2020) and mlr3 (Lang et al., 2019). The first lab focused on tidymodels and this second lab now looks at mlr3.
Before beginning with mlr3, we will ensure you can get the data sets that will be used for the examples here, which involve basic binary and multilabel classification models.
The first is a credit scoring dataset: an important application in banking. The data consist of 4454 observations of 14 variables, each a loan that was given together with the status of whether the loan was “good” (ie repaid) or went “bad” (ie defaulted).
# Uncomment and run the following command first if you do not have the modeldata package
# install.packages("modeldata")
data("credit_data", package = "modeldata")
Use some of the exploration techniques from Lab 1 to explore the dataset before you start modelling.
# Try some plotting functions to explore the data here
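As one possible starting point (a sketch using only base R, and assuming the Amount column is present as in the modeldata version of this dataset), you could look at per-variable summaries, the class balance, and how a feature relates to loan status:

# A base R sketch to get you started -- substitute your preferred tools from Lab 1
summary(credit_data)                          # per-variable summaries, including NA counts
table(credit_data$Status)                     # how many good vs bad loans?
boxplot(Amount ~ Status, data = credit_data)  # does the loan amount differ by status?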
Due to technical issues, the alternative dataset has been removed; apologies for the reduced variety in the lab. A solution will be sought for future APTS courses.
MLR takes quite a different approach to building pipelines than Tidymodels does. Neither is “right” and you will probably find you prefer one style over the other, which is totally fine and quite a personal choice. In order to make that choice, it’s good to see both, so today it’s MLR’s turn! Note that MLR 3 is a huge ecosystem and we barely scratch the surface today: there is a fantastic online book by Becker et al. (2021) available, which is highly recommended reading if you decide to use MLR 3 for your applications.
mlr3 is another meta-package for building machine learning software pipelines, with the aim of automating a lot of the repetitive tasks we commonly need to do. This both saves time and makes our analyses less error prone, because we are relying on software stacks that have been extensively tested. Note that MLR is the more mature software at this stage, having been around for quite a while and undergone a big revamp with version 3, but Tidymodels is rapidly catching up and is backed by the fabulous RStudio team, so in the medium to longer term both will have similar levels of code maturity and stability.
# Uncomment and run the following command first if you do not have the mlr3 package
# install.packages("mlr3")
# Load mlr3, which loads a suite of other packages for us
library("mlr3")
MLR breaks machine learning problems up into steps. The most basic ones are where you:

- define a task: the dataset together with the response to be predicted;
- define a learner: the model or algorithm you want to fit;
- train the learner on the task;
- predict with the trained learner;
- evaluate the predictions using one or more measures.

An MLR task defines the dataset and identifies what the response is (all other variables being taken to be the features).
To define a task for the credit data, we therefore provide the data frame and specify that we are predicting the Status variable.
# Define a task, which is a dataset together with target variable for prediction
# We wrap the data in an na.omit to avoid issues with missingness, see later for
# better options
task_credit <- TaskClassif$new(id = "credit",
backend = na.omit(credit_data),
target = "Status")
task_credit
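Printing the task gives a summary, but you can also query it directly. As a quick optional check, Task objects expose fields such as the following:

# Optional: inspect what the task holds
task_credit$nrow           # number of observations (after na.omit)
task_credit$ncol           # number of columns (features plus target)
task_credit$feature_names  # the features (everything except Status)
task_credit$target_names   # the response we are predicting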
Then, we can see what learners MLR has built in.
# This variable shows available learning algorithms
as.data.table(mlr_learners)
This seems rather few! This is because additional packages are used to add features such as new learners. Many staple learners are in the mlr3learners add-on package and further probabilistic learners are in mlr3proba. These update mlr_learners with a lot of additional options.
# Load more learners in supporting packages
library("mlr3learners")
as.data.table(mlr_learners)
# ... whilst this just gives the names
mlr_learners
There are a variety of learners now. The start of their name indicates their functionality, including one additional category we did not separate out in our definition on the course:

- classif for classification problems
- regr for regression problems
- dens for density estimation
- surv for survival analysis (time to event)

You can get a nicer detailed view in RStudio by wrapping this in the View() command:
# Get more details on learners
View(as.data.table(mlr_learners))
In the View tab, pay particular attention to the feature_types and properties columns. These give us information about the capabilities of different learners. For example, we know that we have numeric and factor data from the exploration above, so only those learners listing both under feature_types can handle our data natively; for other learners we have some work to do. Likewise, we know that we have missing values, so learners with missings listed under properties could handle our data without using na.omit or any other additional work.
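You can also query these capabilities programmatically. The sketch below (assuming the dictionary table exposes key, feature_types and properties columns, as in recent mlr3 versions) checks a single learner and then filters the table for classification learners that handle both factors and missing values:

# Check the capabilities of a single learner directly
lrn("classif.rpart")$feature_types
lrn("classif.rpart")$properties

# Filter the learner table for classification learners that handle
# both factor features and missing values
learners_dt <- as.data.table(mlr_learners)
ok <- grepl("^classif", learners_dt$key) &
  sapply(learners_dt$feature_types, function(x) "factor" %in% x) &
  sapply(learners_dt$properties, function(x) "missings" %in% x)
learners_dt$key[ok]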
Having defined the task, the next step is to define the learner; let's say logistic regression here. We pass the lrn function the name from the table to access that learner.
# Define a logistic regression model
learner_lr <- lrn("classif.log_reg")
learner_lr
We now proceed to train this learner on the credit data task.
# Train the model
learner_lr$train(task_credit)
Once the learner is trained, we can use the same object to predict on the same data (ie in-sample training prediction):
# Perform prediction
pred <- learner_lr$predict(task_credit)
pred
Finally, we can, for example, assess the training/apparent accuracy and confusion matrix by accessing this object. The confusion matrix is special and accessed directly in the prediction object that was returned. For other measures, we pass a measure object to the score function and that measure will be computed on the predictions. Just as we accessed learners via lrn, we access measure objects via msr.
# Evaluate some measures of error
pred$score(msr("classif.acc"))
pred$confusion
… and all the measures supported by MLR 3 can be found by querying the mlr_measures object. A detailed reference of what these are is available at https://mlr3measures.mlr-org.com/reference/index.html
# This variable shows available error measures
mlr_measures
What happens if you try to get the apparent error using Brier score loss? By looking at the help file for Learner or otherwise, can you see how to compute this error?
# Try computing the Brier score loss ... what is wrong?
# Look at the help file for Learners and see how to rectify this
?Learner
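One possible resolution (a sketch of an answer, not the only one): the Brier score needs probability predictions, but the learner above was created with the default predict type, which only returns the label. Recreating the learner with predict_type = "prob" and refitting should then allow the binary Brier score measure (classif.bbrier in mlr3) to be computed:

# One possible solution: request probability predictions from the learner
learner_lr_prob <- lrn("classif.log_reg", predict_type = "prob")
learner_lr_prob$train(task_credit)
pred_prob <- learner_lr_prob$predict(task_credit)
pred_prob$score(msr("classif.bbrier"))  # binary Brier score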
We can do the same for a random forest using ranger (Wright and Ziegler, 2017), this time with all the code at once for readability (we do not need to recreate the task).
# Uncomment and run the following command first if you do not have the ranger package
# install.packages("ranger")
# Redo everything for a random forest model
learner_rf <- lrn("classif.ranger")
learner_rf$train(task_credit)
pred_rf <- learner_rf$predict(task_credit)
pred_rf$score(msr("classif.acc"))
pred_rf$confusion
Above we did the minimal fitting run and just estimated the apparent error to get a feel for the design of MLR 3. We now go into full detail, doing all steps properly.
First, we will redefine the task. This time, note two changes: we do not wrap the data in an na.omit call (MLR can handle missingness as part of the pipeline, see later) and we also specifically identify what we mean by a “positive” case (for later calculation of true positives, etc.; note that tidymodels uses the order of the factors to determine this).
# Redefine the task, this time not getting rid of missing data here and
# specifying what constitutes a positive case
credit_task <- TaskClassif$new(id = "BankCredit",
backend = credit_data,
target = "Status",
positive = "bad")
Next, let's examine the data splitting techniques supported by MLR:
# Let's see what resampling strategies MLR supports
# The final column gives the defaults
as.data.table(mlr_resamplings)
# ... whilst this just gives the names
mlr_resamplings
# To see help on any of them, prefix the key name with mlr_resamplings_
?mlr_resamplings_cv
To then use one of these resampling methods, just as we accessed learners via lrn and measure objects via msr, we access resampling via rsmp. You can also see from the cross-validation help file that we can access individual training and testing folds via the $train_set() and $test_set() methods, as demonstrated below.
# The rsmp function constructs a resampling strategy, taking the name given
# above and allowing any options listed there to be chosen
cv <- rsmp("cv", folds = 3)
# We then instantiate this resampling scheme on the particular task we're
# working on
cv$instantiate(credit_task)
# You can see from the documentation that you could access individual folds
# training and testing data via:
cv$train_set(1)
cv$test_set(1)
Again, as with tidymodels we don't want to create for loops etc, but will rely instead on the capabilities of the package to automate repetitive tasks.
Let's now create a few different models. Recall from the learners we listed above, we saw two models that support both factors and missings, so we could try these first:

- classif.featureless is a so-called baseline classifier … it basically just predicts the most common response all the time, ignoring the features! It is often a good idea to include this because if you don't beat it then there is something very wrong!
- classif.rpart does classification trees using the methodology we saw in the lectures.

In both cases we're going to ask for probabilistic prediction (the default just predicts the label):
lrn_baseline <- lrn("classif.featureless", predict_type = "prob")
lrn_cart <- lrn("classif.rpart", predict_type = "prob")
We can see if these have any options or hyperparameters we can change too:
# Have a look at what options and hyperparameters the model possesses
lrn_baseline$param_set
lrn_cart$param_set
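If you want to change any of these, a hyperparameter can be set either when constructing the learner or afterwards through $param_set$values. The sketch below (using maxdepth purely as an illustration, and working on a fresh copy so the learners used later in the lab are untouched) shows both styles:

# Purely for illustration: a copy of the tree learner with a shallower maximum depth
lrn_cart_shallow <- lrn("classif.rpart", predict_type = "prob")
lrn_cart_shallow$param_set$values$maxdepth <- 3
lrn_cart_shallow$param_set$values
# Equivalently, hyperparameters can be passed directly to lrn():
lrn("classif.rpart", predict_type = "prob", maxdepth = 3)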
Let’s now fit these learners using cross-validation to determine the accuracy.
# Fit models with cross validation
res_baseline <- resample(credit_task, lrn_baseline, cv, store_models = TRUE)
res_cart <- resample(credit_task, lrn_cart, cv, store_models = TRUE)
# Calculate and aggregate performance values
res_baseline$aggregate()
res_cart$aggregate()
What is classif.ce? It is the mean misclassification error, or 0-1 loss generalisation error. Note that the error in the baseline classifier is only 0.282! So this gives us an idea of what we need to beat (we might have expected 0.5, but this is an imbalanced dataset, as is often the case with real data).
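You can check the claimed imbalance directly from the raw data: the featureless learner predicts the majority class, so its misclassification error should be close to the proportion of the minority class.

# Class balance of the response: the baseline error should roughly match
# the proportion of the minority ("bad") class
prop.table(table(credit_data$Status))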
We can get many other error measures:
# Remember the error measures we can get ... (or use View() in RStudio for
# nicer format)
as.data.table(mlr_measures)
# ... whilst this just gives the names
mlr_measures
# Again, to see help on any of them, prefix the key name with mlr_measures_
?mlr_measures_classif.ce
We can request multiple of these measures at once from the resample results:
res_baseline$aggregate(list(msr("classif.ce"),
msr("classif.acc"),
msr("classif.auc"),
msr("classif.fpr"),
msr("classif.fnr")))
res_cart$aggregate(list(msr("classif.ce"),
msr("classif.acc"),
msr("classif.auc"),
msr("classif.fpr"),
msr("classif.fnr")))
When we want to run multiple models, it is more convenient to use the benchmark function to run them all on a grid we define (we can in fact run multiple learners on multiple tasks using multiple resampling strategies, as it takes the Cartesian product of all options):
# Use the benchmark functionality to do everything at once, ensuring identical
# settings such as task, folds, etc
res <- benchmark(
benchmark_grid(
task = list(credit_task),
learners = list(lrn_baseline,
lrn_cart),
resamplings = list(rsmp("cv", folds = 3))
), store_models = TRUE)
res
res$aggregate()
Let’s request the more interesting suite of measures from the benchmark (a lot simpler and neater than doing this call repeatedly for each model ourselves):
res$aggregate(list(msr("classif.ce"),
msr("classif.acc"),
msr("classif.auc"),
msr("classif.fpr"),
msr("classif.fnr")))
We can examine in depth the results by getting out the models fitted in each fold:
# Get the trees (2nd model fitted), by asking for second set of resample
# results
trees <- res$resample_result(2)
# Then, let's look at the tree from first CV iteration, for example:
tree1 <- trees$learners[[1]]
# This is a fitted rpart object, so we can look at the model within
tree1_rpart <- tree1$model
# If you look in the rpart package documentation, it tells us how to plot the
# tree that was fitted
plot(tree1_rpart, compress = TRUE, margin = 0.1)
text(tree1_rpart, use.n = TRUE, cex = 0.8)
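If the base graphics output is hard to read, the rpart.plot package (not otherwise used in this lab, so treat it as an optional extra) gives a cleaner rendering of the same fitted tree:

# Uncomment and run the following command first if you do not have the rpart.plot package
# install.packages("rpart.plot")
rpart.plot::rpart.plot(tree1_rpart)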
We can see the other trees too. Change the 3 in double brackets [[]] below to other values from 1 to 3 to see the model from each round of cross validation.
# Looking at other rounds from CV
plot(res$resample_result(2)$learners[[3]]$model, compress = TRUE, margin = 0.1)
text(res$resample_result(2)$learners[[3]]$model, use.n = TRUE, cex = 0.8)
It may be that these trees need to be pruned. To do this, we would need to enable the cross-validation option to rpart in the learner. We can fit this individually and make a selection for the cost penalty (see alpha in lectures), before then setting this value when benchmarking (NOTE: this is not quite optimal but MLR3 doesn't yet have the option for us to select this within folds … coming soon hopefully).

In particular, note we are now doing nested cross validation, which is the correct way to do parameter selection without biasing test error. Change the 3 in double brackets [[]] to other values from 1 to 3 to see the cross-validation plot from each round.
# Enable nested cross validation
lrn_cart_cv <- lrn("classif.rpart", predict_type = "prob", xval = 10)
res_cart_cv <- resample(credit_task, lrn_cart_cv, cv, store_models = TRUE)
rpart::plotcp(res_cart_cv$learners[[3]]$model)
Now, choose a cost penalty and add this as a model to our benchmark set:
# Try refitting with a chosen complexity parameter for pruning
lrn_cart_cp <- lrn("classif.rpart", predict_type = "prob", cp = 0.016)
# Then run this in the benchmark with other options
res <- benchmark(benchmark_grid(
task = list(credit_task),
learners = list(lrn_baseline,
lrn_cart,
lrn_cart_cp),
resamplings = list(rsmp("cv", folds = 3))
), store_models = TRUE)
res$aggregate(list(msr("classif.ce"),
msr("classif.acc"),
msr("classif.auc"),
msr("classif.fpr"),
msr("classif.fnr")))
In this case we see a slight improvement in the false-positive rate at the cost of higher errors elsewhere. These might be tradeoffs you need to make in the real world.
To handle missing data and factors, we will need to introduce a modelling pipeline. In this pipeline we need to impute missing values and dummy/one-hot code factors.
Pipelines allow us to create a sophisticated workflow without having to manually code how everything ties together. To see what pipeline operations are available:
# Trying out pipelines
library("mlr3pipelines")
# Pipelines available (or use View() in RStudio for a nicer look) ...
as.data.table(mlr_pipeops)
# ... whilst this just gives the names
mlr_pipeops
# Again, to see help on any of them, prefix the key name with mlr_pipeops_
?mlr_pipeops_encode
So we can see the encode pipeline operation can do one-hot encoding of factors. We'll do this first. XGBoost, which does gradient boosting, doesn't accept factors (look back at the learners table earlier), so we now create a pipeline operation to encode them before passing to the learner. The function po() adds operations and %>>% connects the steps:
# Uncomment and run the following command first if you do not have the xgboost package
# install.packages("xgboost")
# Create a pipeline which encodes and then fits an XGBoost model
lrn_xgboost <- lrn("classif.xgboost", predict_type = "prob")
pl_xgb <- po("encode") %>>%
po(lrn_xgboost)
# Now fit as normal ... we can just add it to our benchmark set
res <- benchmark(benchmark_grid(
task = list(credit_task),
learners = list(lrn_baseline,
lrn_cart,
lrn_cart_cp,
pl_xgb),
resamplings = list(rsmp("cv", folds = 3))
), store_models = TRUE)
res$aggregate(list(msr("classif.ce"),
msr("classif.acc"),
msr("classif.fpr"),
msr("classif.fnr")))
Handling missingness is slightly more involved. We provide a pipeline recipe here which is quite robust … read the documentation of each step to understand more.
We then apply this to logistic regression.
# First create a pipeline of just missing fixes we can later use with models
pl_missing <- po("fixfactors") %>>%
po("removeconstants") %>>%
po("imputesample", affect_columns = selector_type(c("ordered", "factor"))) %>>%
po("imputemean")
# Now try with a model that requires there to be no missingness
lrn_log_reg <- lrn("classif.log_reg", predict_type = "prob")
pl_log_reg <- pl_missing %>>%
po(lrn_log_reg)
# Now fit as normal ... we can just add it to our benchmark set
res <- benchmark(benchmark_grid(
task = list(credit_task),
learners = list(lrn_baseline,
lrn_cart,
lrn_cart_cp,
pl_xgb,
pl_log_reg),
resamplings = list(rsmp("cv", folds = 3))
), store_models = TRUE)
res$aggregate(list(msr("classif.ce"),
msr("classif.acc"),
msr("classif.fpr"),
msr("classif.fnr")))
Rather than having to choose among the models that we fitted above, we could instead fit all of them and have a final “super learner” fitted which learns how best to combine the base learners' predictions into a final prediction. We can do this using the pipelines in MLR3 …

We start from scratch to make this more advanced example self-contained.
library("mlr3verse")
set.seed(212) # set seed for reproducibility
# Load data
data("credit_data", package = "modeldata")
# Define task
credit_task <- TaskClassif$new(id = "BankCredit",
backend = credit_data,
target = "Status",
positive = "bad")
# Cross validation resampling strategy
cv5 <- rsmp("cv", folds = 5)
cv5$instantiate(credit_task)
# Define a collection of base learners
lrn_baseline <- lrn("classif.featureless", predict_type = "prob")
lrn_cart <- lrn("classif.rpart", predict_type = "prob")
lrn_cart_cp <- lrn("classif.rpart", predict_type = "prob", cp = 0.016, id = "cartcp")
lrn_ranger <- lrn("classif.ranger", predict_type = "prob")
lrn_xgboost <- lrn("classif.xgboost", predict_type = "prob")
lrn_log_reg <- lrn("classif.log_reg", predict_type = "prob")
# Define a super learner
lrnsp_log_reg <- lrn("classif.log_reg", predict_type = "prob", id = "super")
# Missingness imputation pipeline
pl_missing <- po("fixfactors") %>>%
po("removeconstants") %>>%
po("imputesample", affect_columns = selector_type(c("ordered", "factor"))) %>>%
po("imputemean")
# Factors coding pipeline
pl_factor <- po("encode")
# Now define the full pipeline
spr_lrn <- gunion(list(
# First group of learners requiring no modification to input
gunion(list(
po("learner_cv", lrn_baseline),
po("learner_cv", lrn_cart),
po("learner_cv", lrn_cart_cp)
)),
# Next group of learners requiring special treatment of missingness
pl_missing %>>%
gunion(list(
po("learner_cv", lrn_ranger),
po("learner_cv", lrn_log_reg),
po("nop") # This passes through the original features adjusted for
# missingness to the super learner
)),
# Last group needing factor encoding
pl_factor %>>%
po("learner_cv", lrn_xgboost)
)) %>>%
po("featureunion") %>>%
po(lrnsp_log_reg)
# This plot shows a graph of the learning pipeline
spr_lrn$plot()
# Finally fit the base learners and super learner and evaluate
res_spr <- resample(credit_task, spr_lrn, cv5, store_models = TRUE)
res_spr$aggregate(list(msr("classif.ce"),
msr("classif.acc"),
msr("classif.fpr"),
msr("classif.fnr")))
You will note these are the best results achieved of all the learners (except on the false-positive rate), albeit that this is by far the most complicated model.