There has been code provided throughout the course in live code blocks, where you get snippets of code to perform particular tasks by directly leveraging individual packages for particular methods. Therefore, we will not repeat such exercises in the labs and instead we will focus on the full software pipeline, which makes running machine learning analyses much easier. Arguably the two top ML pipelines in R (R Core Team, 2021) today are tidymodels (Kuhn and Wickham, 2020) and mlr3 (Lang et al., 2019). The first lab focused on tidymodels and this second lab now looks at mlr3.
Before beginning with mlr3, we will ensure you can get the data sets that will be used for the examples here, which involve basic binary and multilabel classification models.
The first is a credit scoring dataset: an important application in banking. The data consist of 4454 observations of 14 variables, each a loan that was given together with the status of whether the loan was “good” (ie repaid) or went “bad” (ie defaulted).
# Uncomment and run the following command first if you do not have the modeldata package
# install.packages("modeldata")
data("credit_data", package = "modeldata")
Use some of the exploration techniques from Lab 1 to explore the dataset before you start modelling.
# Try some plotting functions to explore the data here
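As one possible starting point (a sketch using only base R, and assuming the Amount column is present as in the modeldata version of this dataset), you could look at per-variable summaries, the class balance, and how a feature relates to loan status:

# A base R sketch to get you started -- substitute your preferred tools from Lab 1
summary(credit_data)                          # per-variable summaries, including NA counts
table(credit_data$Status)                     # how many good vs bad loans?
boxplot(Amount ~ Status, data = credit_data)  # does the loan amount differ by status?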
Due to technical issues, the alternative dataset has been removed; apologies for the reduced variety in the lab. A solution will be sought for future APTS courses.
MLR takes quite a different approach to building pipelines than Tidymodels does. Neither is “right” and you will probably find you prefer one style over the other, which is totally fine and quite a personal choice. In order to make that choice, it’s good to see both, so today it’s MLR’s turn! Note that MLR 3 is a huge ecosystem and we barely scratch the surface today: there is a fantastic online book by Becker et al. (2021) available, which is highly recommended reading if you decide to use MLR 3 for your applications.
mlr3 is another meta-package for building machine learning software pipelines, with the aim of automating a lot of the repetitive tasks we commonly need to do. This both saves time and makes our analyses less error prone, because we are relying on software stacks that have been extensively tested. Note that MLR is the more mature software at this stage, having been around for quite a while and undergone a big revamp with version 3, but Tidymodels is rapidly catching up and is backed by the fabulous RStudio team, so in the medium to longer term both will have similar levels of code maturity and stability.
# Uncomment and run the following command first if you do not have the mlr3 package
# install.packages("mlr3")
# Load mlr3, which loads a suite of other packages for us
library("mlr3")
MLR breaks machine learning problems up into steps. The most basic ones are where you:

- define a task: the dataset together with the response to be predicted;
- define a learner: the model or algorithm you want to fit;
- train the learner on the task;
- predict with the trained learner;
- evaluate the predictions using one or more measures.

An MLR task defines the dataset and identifies what the response is (all other variables being taken to be the features).
To define a task for the credit data, we therefore provide the data frame and specify that we are predicting the Status variable.
# Define a task, which is a dataset together with target variable for prediction
# We wrap the data in an na.omit to avoid issues with missingness, see later for
# better options
task_credit <- TaskClassif$new(id = "credit",
backend = na.omit(credit_data),
target = "Status")
task_credit
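Printing the task gives a summary, but you can also query it directly. As a quick optional check, Task objects expose fields such as the following:

# Optional: inspect what the task holds
task_credit$nrow           # number of observations (after na.omit)
task_credit$ncol           # number of columns (features plus target)
task_credit$feature_names  # the features (everything except Status)
task_credit$target_names   # the response we are predicting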
Then, we can see what learners MLR has built in.
# This variable shows available learning algorithms
as.data.table(mlr_learners)
This seems rather few! This is because additional packages are used to add features such as new learners. Many staple learners are in the mlr3learners add-on package and further probabilistic learners are in mlr3proba. These update mlr_learners with a lot of additional options.
# Load more learners in supporting packages
library("mlr3learners")
as.data.table(mlr_learners)
# ... whilst this just gives the names
mlr_learners
There are a variety of learners now. The start of their name indicates their functionality, including one additional category we did not separate out in our definition on the course:

- classif for classification problems
- regr for regression problems
- dens for density estimation
- surv for survival analysis (time to event)

You can get a nicer detailed view in RStudio by wrapping this in the View() command:
# Get more details on learners
View(as.data.table(mlr_learners))
In the View tab, pay particular attention to the feature_types and properties columns. These give us information about the capabilities of different learners. For example, we know that we have numeric and factor data from the exploration above, so only those learners listing both under feature_types can handle our data natively; for other learners we have some work to do. Likewise, we know that we have missing values, so learners with missings listed under properties could handle our data without using na.omit or any other additional work.
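You can also query these capabilities programmatically. The sketch below (assuming the dictionary table exposes key, feature_types and properties columns, as in recent mlr3 versions) checks a single learner and then filters the table for classification learners that handle both factors and missing values:

# Check the capabilities of a single learner directly
lrn("classif.rpart")$feature_types
lrn("classif.rpart")$properties

# Filter the learner table for classification learners that handle
# both factor features and missing values
learners_dt <- as.data.table(mlr_learners)
ok <- grepl("^classif", learners_dt$key) &
  sapply(learners_dt$feature_types, function(x) "factor" %in% x) &
  sapply(learners_dt$properties, function(x) "missings" %in% x)
learners_dt$key[ok]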
Having defined the task, the next step is to define the learner; let's say logistic regression here. We pass the lrn function the name from the table to access that learner.
# Define a logistic regression model
learner_lr <- lrn("classif.log_reg")
learner_lr
We now proceed to train this learner on the credit data task.
# Train the model
learner_lr$train(task_credit)
Once the learner is trained, we can use the same object to predict on the same data (ie in-sample training prediction):
# Perform prediction
pred <- learner_lr$predict(task_credit)
pred
Finally, we can, for example, assess the training/apparent accuracy and confusion matrix by accessing this object. The confusion matrix is special and accessed directly in the prediction object that was returned. For other measures, we pass a measure object to the score function and that measure will be computed on the predictions. Just as we accessed learners via lrn, we access measure objects via msr.
# Evaluate some measures of error
pred$score(msr("classif.acc"))
pred$confusion
… and all the measures supported by MLR 3 can be found by querying the mlr_measures object. A detailed reference of what these are is available at https://mlr3measures.mlr-org.com/reference/index.html
# This variable shows available error measures
mlr_measures
What happens if you try to get the apparent error using Brier score loss? By looking at the help file for Learner or otherwise, can you see how to compute this error?
# Try computing the Brier score loss ... what is wrong?
# Look at the help file for Learners and see how to rectify this
?Learner
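One possible resolution (a sketch of an answer, not the only one): the Brier score needs probability predictions, but the learner above was created with the default predict type, which only returns the label. Recreating the learner with predict_type = "prob" and refitting should then allow the binary Brier score measure (classif.bbrier in mlr3) to be computed:

# One possible solution: request probability predictions from the learner
learner_lr_prob <- lrn("classif.log_reg", predict_type = "prob")
learner_lr_prob$train(task_credit)
pred_prob <- learner_lr_prob$predict(task_credit)
pred_prob$score(msr("classif.bbrier"))  # binary Brier score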
We can do the same for a random forest using ranger (Wright and Ziegler, 2017), this time with all the code at once for readability (we do not need to recreate the task).
# Uncomment and run the following command first if you do not have the ranger package
# install.packages("ranger")
# Redo everything for a random forest model
learner_rf <- lrn("classif.ranger")
learner_rf$train(task_credit)
pred_rf <- learner_rf$predict(task_credit)
pred_rf$score(msr("classif.acc"))
pred_rf$confusion
Above we did the minimal fitting run and just estimated the apparent error to get a feel for the design of MLR 3. We now go into full detail, doing all steps properly.
First, we will redefine the task. This time, note two changes: we do not wrap the data in an na.omit call (MLR can handle missingness as part of the pipeline, see later) and we also specifically identify what we mean by a “positive” case (for later calculation of true positives, etc.; note that tidymodels uses the order of the factors to determine this).
# Redefine the task, this time not getting rid of missing data here and
# specifying what constitutes a positive case
credit_task <- TaskClassif$new(id = "BankCredit",
backend = credit_data,
target = "Status",
positive = "bad")
Next, let's examine the data splitting techniques supported by MLR:
# Let's see what resampling strategies MLR supports
# The final column gives the defaults
as.data.table(mlr_resamplings)
# ... whilst this just gives the names
mlr_resamplings
# To see help on any of them, prefix the key name with mlr_resamplings_
?mlr_resamplings_cv
To then use one of these resampling methods, just as we accessed learners via lrn and measure objects via msr, we access resampling via rsmp. You can also see from the cross-validation help file that we can access individual training and testing folds via the $train_set() and $test_set() methods, as demonstrated below.
# The rsmp function constructs a resampling strategy, taking the name given
# above and allowing any options listed there to be chosen
cv <- rsmp("cv", folds = 3)
# We then instantiate this resampling scheme on the particular task we're
# working on
cv$instantiate(credit_task)
# You can see from the documentation that you could access individual folds
# training and testing data via:
cv$train_set(1)
cv$test_set(1)
Again, as with tidymodels we don't want to create for loops etc, but will rely instead on the capabilities of the package to automate repetitive tasks.
Let's now create a few different models. Recall from the learners we listed above, we saw two models that support both factors and missings, so we could try these first:

- classif.featureless is a so-called baseline classifier … it basically just predicts the most common response all the time, ignoring the features! It is often a good idea to include this because if you don't beat it then there is something very wrong!
- classif.rpart does classification trees using the methodology we saw in the lectures.

In both cases we're going to ask for probabilistic prediction (the default just predicts the label):
lrn_baseline <- lrn("classif.featureless", predict_type = "prob")
lrn_cart <- lrn("classif.rpart", predict_type = "prob")
We can see if these have any options or hyperparameters we can change too:
# Have a look at what options and hyperparameters the model possesses
lrn_baseline$param_set
lrn_cart$param_set
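If you want to change any of these, a hyperparameter can be set either when constructing the learner or afterwards through $param_set$values. The sketch below (using maxdepth purely as an illustration, and working on a fresh copy so the learners used later in the lab are untouched) shows both styles:

# Purely for illustration: a copy of the tree learner with a shallower maximum depth
lrn_cart_shallow <- lrn("classif.rpart", predict_type = "prob")
lrn_cart_shallow$param_set$values$maxdepth <- 3
lrn_cart_shallow$param_set$values
# Equivalently, hyperparameters can be passed directly to lrn():
lrn("classif.rpart", predict_type = "prob", maxdepth = 3)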
Let’s now fit these learners using cross-validation to determine the accuracy.
# Fit models with cross validation
res_baseline <- resample(credit_task, lrn_baseline, cv, store_models = TRUE)
res_cart <- resample(credit_task, lrn_cart, cv, store_models = TRUE)
# Calculate and aggregate performance values
res_baseline$aggregate()
res_cart$aggregate()
What is classif.ce? It is the mean misclassification error, or 0-1 loss generalisation error. Note that the error in the baseline classifier is only 0.282! So this gives us an idea of what we need to beat (we might have expected 0.5, but this is an imbalanced dataset, as is often the case with real data).
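You can check the claimed imbalance directly from the raw data: the featureless learner predicts the majority class, so its misclassification error should be close to the proportion of the minority class.

# Class balance of the response: the baseline error should roughly match
# the proportion of the minority ("bad") class
prop.table(table(credit_data$Status))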
We can get many other error measures:
# Remember the error measures we can get ... (or use View() in RStudio for
# nicer format)
as.data.table(mlr_measures)
# ... whilst this just gives the names
mlr_measures
# Again, to see help on any of them, prefix the key name with mlr_measures_
?mlr_measures_classif.ce
We can request multiple of these measures at once from the resample results:
res_baseline$aggregate(list(msr("classif.ce"),
msr("classif.acc"),
msr("classif.auc"),
msr("classif.fpr"),
msr("classif.fnr")))
res_cart$aggregate(list(msr("classif.ce"),
msr("classif.acc"),
msr("classif.auc"),
msr("classif.fpr"),
msr("classif.fnr")))
When we want to run multiple models, it is more convenient to use the benchmark function to run them all on a grid we define (we can in fact run multiple learners on multiple tasks using multiple resampling strategies, as it takes the Cartesian product of all options):
# Use the benchmark functionality to do everything at once, ensuring identical
# settings such as task, folds, etc
res <- benchmark(
benchmark_grid(
task = list(credit_task),
learners = list(lrn_baseline,
lrn_cart),
resamplings = list(rsmp("cv", folds = 3))
), store_models = TRUE)
res
res$aggregate()
Let’s request the more interesting suite of measures from the benchmark (a lot simpler and neater than doing this call repeatedly for each model ourselves):
res$aggregate(list(msr("classif.ce"),
msr("classif.acc"),
msr("classif.auc"),
msr("classif.fpr"),
msr("classif.fnr")))
We can examine in depth the results by getting out the models fitted in each fold:
# Get the trees (2nd model fitted), by asking for second set of resample
# results
trees <- res$resample_result(2)
# Then, let's look at the tree from first CV iteration, for example:
tree1 <- trees$learners[[1]]
# This is a fitted rpart object, so we can look at the model within
tree1_rpart <- tree1$model
# If you look in the rpart package documentation, it tells us how to plot the
# tree that was fitted
plot(tree1_rpart, compress = TRUE, margin = 0.1)
text(tree1_rpart, use.n = TRUE, cex = 0.8)
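If the base graphics output is hard to read, the rpart.plot package (not otherwise used in this lab, so treat it as an optional extra) gives a cleaner rendering of the same fitted tree:

# Uncomment and run the following command first if you do not have the rpart.plot package
# install.packages("rpart.plot")
rpart.plot::rpart.plot(tree1_rpart)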
We can see the other trees too. Change the 3 in double brackets [[]] below to other values from 1 to 3 to see the model from each round of cross validation.
# Looking at other rounds from CV
plot(res$resample_result(2)$learners[[3]]$model, compress = TRUE, margin = 0.1)
text(res$resample_result(2)$learners[[3]]$model, use.n = TRUE, cex = 0.8)
It may be that these trees need to be pruned. To do this, we would need to enable the cross-validation option to rpart in the learner. We can fit this individually and make a selection for the cost penalty (see alpha in lectures), before then setting this value when benchmarking (NOTE: this is not quite optimal but MLR3 doesn't yet have the option for us to select this within folds … coming soon hopefully).

In particular, note we are now doing nested cross validation, which is the correct way to do parameter selection without biasing test error. Change the 3 in double brackets [[]] to other values from 1 to 3 to see the cross-validation plot from each round.
# Enable nested cross validation
lrn_cart_cv <- lrn("classif.rpart", predict_type = "prob", xval = 10)
res_cart_cv <- resample(credit_task, lrn_cart_cv, cv, store_models = TRUE)
rpart::plotcp(res_cart_cv$learners[[3]]$model)
Now, choose a cost penalty and add this as a model to our benchmark set:
# Try refitting with a chosen complexity parameter for pruning
lrn_cart_cp <- lrn("classif.rpart", predict_type = "prob", cp = 0.016)
# Then run this in the benchmark with other options
res <- benchmark(benchmark_grid(
task = list(credit_task),
learners = list(lrn_baseline,
lrn_cart,
lrn_cart_cp),
resamplings = list(rsmp("cv", folds = 3))
), store_models = TRUE)
res$aggregate(list(msr("classif.ce"),
msr("classif.acc"),
msr("classif.auc"),
msr("classif.fpr"),
msr("classif.fnr")))
In this case we see a slight improvement in the false-positive rate at the cost of higher errors elsewhere. These might be tradeoffs you need to make in the real world.
To handle missing data and factors, we will need to introduce a modelling pipeline. In this pipeline we need to impute missing values and dummy/one-hot code factors.
Pipelines allow us to create a sophisticated workflow without having to manually code how everything ties together. To see what pipeline operations are available:
# Trying out pipelines
library("mlr3pipelines")
# Pipelines available (or use View() in RStudio for a nicer look) ...
as.data.table(mlr_pipeops)
# ... whilst this just gives the names
mlr_pipeops
# Again, to see help on any of them, prefix the key name with mlr_pipeops_
?mlr_pipeops_encode
So we can see the encode pipeline operation can do one-hot encoding of factors. We'll do this first. XGBoost, which does gradient boosting, doesn't accept factors (look back at the learners table earlier), so we now create a pipeline operation to encode them before passing to the learner. The function po() adds operations and %>>% connects the steps:
# Uncomment and run the following command first if you do not have the xgboost package
# install.packages("xgboost")
# Create a pipeline which encodes and then fits an XGBoost model
lrn_xgboost <- lrn("classif.xgboost", predict_type = "prob")
pl_xgb <- po("encode") %>>%
po(lrn_xgboost)
# Now fit as normal ... we can just add it to our benchmark set
res <- benchmark(benchmark_grid(
task = list(credit_task),
learners = list(lrn_baseline,
lrn_cart,
lrn_cart_cp,
pl_xgb),
resamplings = list(rsmp("cv", folds = 3))
), store_models = TRUE)
res$aggregate(list(msr("classif.ce"),
msr("classif.acc"),
msr("classif.fpr"),
msr("classif.fnr")))
Handling missingness is slightly more involved. We provide a pipeline recipe here which is quite robust … read the documentation of each step to understand more.
We then apply this to logistic regression.
# First create a pipeline of just missing fixes we can later use with models
pl_missing <- po("fixfactors") %>>%
po("removeconstants") %>>%
po("imputesample", affect_columns = selector_type(c("ordered", "factor"))) %>>%
po("imputemean")
# Now try with a model that requires there to be no missingness
lrn_log_reg <- lrn("classif.log_reg", predict_type = "prob")
pl_log_reg <- pl_missing %>>%
po(lrn_log_reg)
# Now fit as normal ... we can just add it to our benchmark set
res <- benchmark(benchmark_grid(
task = list(credit_task),
learners = list(lrn_baseline,
lrn_cart,
lrn_cart_cp,
pl_xgb,
pl_log_reg),
resamplings = list(rsmp("cv", folds = 3))
), store_models = TRUE)
res$aggregate(list(msr("classif.ce"),
msr("classif.acc"),
msr("classif.fpr"),
msr("classif.fnr")))
Rather than having to choose among the models that we fitted above, we could instead fit all of them and have a final “super learner” fitted which learns how best to combine the base learners' predictions into a final prediction. We can do this using the pipelines in MLR3 …

We start from scratch to make this more advanced example self-contained.
library("mlr3verse")
set.seed(212) # set seed for reproducibility
# Load data
data("credit_data", package = "modeldata")
# Define task
credit_task <- TaskClassif$new(id = "BankCredit",
backend = credit_data,
target = "Status",
positive = "bad")
# Cross validation resampling strategy
cv5 <- rsmp("cv", folds = 5)
cv5$instantiate(credit_task)
# Define a collection of base learners
lrn_baseline <- lrn("classif.featureless", predict_type = "prob")
lrn_cart <- lrn("classif.rpart", predict_type = "prob")
lrn_cart_cp <- lrn("classif.rpart", predict_type = "prob", cp = 0.016, id = "cartcp")
lrn_ranger <- lrn("classif.ranger", predict_type = "prob")
lrn_xgboost <- lrn("classif.xgboost", predict_type = "prob")
lrn_log_reg <- lrn("classif.log_reg", predict_type = "prob")
# Define a super learner
lrnsp_log_reg <- lrn("classif.log_reg", predict_type = "prob", id = "super")
# Missingness imputation pipeline
pl_missing <- po("fixfactors") %>>%
po("removeconstants") %>>%
po("imputesample", affect_columns = selector_type(c("ordered", "factor"))) %>>%
po("imputemean")
# Factors coding pipeline
pl_factor <- po("encode")
# Now define the full pipeline
spr_lrn <- gunion(list(
# First group of learners requiring no modification to input
gunion(list(
po("learner_cv", lrn_baseline),
po("learner_cv", lrn_cart),
po("learner_cv", lrn_cart_cp)
)),
# Next group of learners requiring special treatment of missingness
pl_missing %>>%
gunion(list(
po("learner_cv", lrn_ranger),
po("learner_cv", lrn_log_reg),
po("nop") # This passes through the original features adjusted for
# missingness to the super learner
)),
# Last group needing factor encoding
pl_factor %>>%
po("learner_cv", lrn_xgboost)
)) %>>%
po("featureunion") %>>%
po(lrnsp_log_reg)
# This plot shows a graph of the learning pipeline
spr_lrn$plot()
# Finally fit the base learners and super learner and evaluate
res_spr <- resample(credit_task, spr_lrn, cv5, store_models = TRUE)
res_spr$aggregate(list(msr("classif.ce"),
msr("classif.acc"),
msr("classif.fpr"),
msr("classif.fnr")))
You will note these are the best results achieved of all the learners (except on the false-positive rate), albeit that this is by far the most complicated model.