There has been code provided throughout the course in live code
blocks, where you get snippets of code to perform particular tasks by
directly leveraging individual packages for particular methods.
Therefore, we will not repeat such exercises in the labs and instead we
will focus on the full software pipeline which makes running machine
learning analyses much easier. Arguably the two top ML pipelines in R
(R Core Team, 2021) today are tidymodels (Kuhn and Wickham, 2020) and mlr3 (Lang et al., 2019). The first lab will focus on tidymodels and the second on mlr3.
Before beginning with tidymodels, we will make sure you can get the two data sets used for the examples here, on which we will build basic binary and multiclass classification models.
The first is a credit scoring dataset: an important application in banking. The data consist of 4454 observations of 14 variables, each a loan that was given together with the status of whether the loan was “good” (ie repaid) or went “bad” (ie defaulted).
# Uncomment and run the following command first if you do not have the modeldata package
# install.packages("modeldata")
data("credit_data", package = "modeldata")
Use some of the exploration techniques from Lab 1 to explore the dataset before you start modelling.
# Try some plotting function to explore the data here
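For example, one possible starting point (a sketch only; the ggplot2 call assumes you have it installed) is to look at a summary of the variables and the class balance of the response:
# One possible exploration: variable summary and the balance of good/bad loans
library("ggplot2")
summary(credit_data)
ggplot(credit_data, aes(x = Status)) +
  geom_bar()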
The second dataset we will use, which is multiclass, is the famous MNIST handwritten digit data set. I have already put this data into the format we discussed in lectures for coding of images (an image per row with pixels as features). Download it first (this may take some time, it is ~40MB):
# Uncomment and run the following command first if you do not have the fst package
# install.packages("fst")
download.file("https://www.louisaslett.com/Courses/MISCADA/mnist.fst", "mnist.fst")
Then read the data in using the high-performance file storage package fst (Klik, 2021):
mnist <- fst::read.fst("mnist.fst")
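Before plotting, a couple of quick sanity checks can be useful (a minimal sketch; the response column holds the digit labels and is used by the plotting function below):
# Quick sanity checks: one row per image, pixel columns plus the digit label
dim(mnist)
table(mnist$response)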
Have a look at this data. If you want to see an image, the following function will help plot any of the images for you. Read and try to understand the two lines involved by looking at the documentation for each step.
# Define a plotting function to transform the flattened image data into a
# square graphic. See Data Coding appendix in the notes for how images are
# represented as tabular data!
library("tidyverse")
plotimages <- function(i) {
  imgs <- mnist |>
    slice(i) |>
    mutate(image.num = i) |>
    pivot_longer(x0.y27:x27.y0,
                 names_to = c("x", "y"),
                 names_pattern = "x([0-9]{1,2}).y([0-9]{1,2})",
                 names_transform = list(x = as.integer, y = as.integer),
                 values_to = "greyscale")

  ggplot(imgs, aes(x = x, y = y, fill = greyscale)) +
    geom_tile() +
    scale_fill_gradient(low = "white", high = "black") +
    facet_wrap(~ image.num + response, labeller = label_both)
}
We can then look at the individual images, or groups of them. For example, to look at images 21 through 24, we run:
plotimages(21:24)
The tidymodels
universe comprises a suite of packages
which are focused on different aspects of the machine learning software
pipeline, but which employ a coherent and unified interface to simplify
use, automate and abstract away common tasks.
The tidymodels package itself is really a so-called meta package, meaning that it doesn't contain any important functions itself and is primarily a way to install the core tidymodels packages which are listed as dependencies. First up, install this package, which will automatically pull in the others for you:
# Uncomment and run the following command first if you do not have the tidymodels package
# install.packages("tidymodels")
# Load tidymodels, which loads a suite of other packages for us
library("tidymodels")
The core tidymodels packages include:
- rsample, which provides the data splitting techniques such as train/validate/test, cross-validation and bootstrap resampling from Chapter 5 in the notes;
- parsnip, which provides a unified interface to different modelling methods, such as those of Chapters 4 and 6, so that you can try many methods without having to learn the different underlying package interfaces;
- recipes, which enables you to perform common data pre-processing tasks in a way that makes replication of those steps easy when you load future data to perform predictions;
- workflows, which facilitates the creation of end-to-end workflows to execute all steps of a machine learning pipeline;
- tune, which provides a framework for empirical selection of hyperparameters;
- yardstick, which incorporates many different loss functions (such as those in Chapter 3), as well as other metrics (such as those in Chapter 7);
- broom, for formatting and outputting information about fitted models in a readable way.

We will go through the use of some of these packages as we run through a lab example here.
The first step will be to create a training/testing split using
functions from rsample
(which will have been loaded by the
tidymodels
package above).
First, it is good to set the random number seed (recall APTS week 1 (Wilkinson and Wood, 2021)) so we get reproducible results for the purposes of this lab.
Then, because k-nearest neighbour doesn’t support missing data natively, we will eliminate any rows with missingness. Note that it may make sense to explore other options such as missing data imputation, but we take the simplest course of action first.
With this basic pre-processing done, we are ready to call initial_split(), which is the function that returns an object containing all the data but internally randomised into training/testing portions. We then use the training() and testing() functions to extract each relevant part of the data.
# Set seed so we reproducibly see the same results
set.seed(212)
# Because knn can't handle missings ... maybe try imputation later?
credit_data <- na.omit(credit_data)
# Create a train/test split into 75% / 25% of the data
credit_data_split <- initial_split(credit_data, strata = Status, prop = 3/4)
# We then extract the training and testing portions
train_data <- training(credit_data_split)
test_data <- testing(credit_data_split)
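As a quick check (a minimal sketch), the two portions should contain roughly 75% and 25% of the rows:
# Check the sizes of the two portions of the split
nrow(train_data)
nrow(test_data)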
With the data split into the simplest possible hold-out setup, we can
turn to parsnip
to fit a variety of models without having
to learn each individual package.
Here we’ll fit a logistic regression and k-nearest neighbour.
# First, define a logistic regression model
lr_mod <- logistic_reg()
# Now we need to load the kknn package due to a bug awaiting a fix
library("kknn") # Needed due to bug! https://github.com/tidymodels/parsnip/issues/264
# Second, define a knn model. Since KNN has a hyperparameter we can specify that
# here, and because it supports both classification and regression we can
# set which mode we want
knn_mod <- nearest_neighbor(neighbors = 3) |>
  set_mode("classification")
So this is quite different to base R modelling, where you simply call
a fitting function right away. Here we have defined a model first and
now we proceed to use this model object to fit to the data, saving the result to a new object whose name ends in _fit. Notice the unified approach for both types of model. We obviously fit to the training data.
# Next we fit the model
lr_fit <- lr_mod |> fit(Status ~ ., data = train_data)
knn_fit <- knn_mod |> fit(Status ~ ., data = train_data)
Have a look at the model objects individually:
lr_fit
knn_fit
Do you notice something interesting about the knn_fit
object?
It mentions kernels! In fact, the kknn
package (Schliep and Hechenbichler,
2016) combines both ideas from the “Local Methods” Chapter in
the notes, and weights the neighbours using a kernel. Read the
documentation of kknn::kknn
for more details.
We know that we shouldn’t really look at the apparent error, as it is a downwardly biased (optimistic) estimate of the true model error, but let’s do so anyway first!
Here, we’re going to construct a prediction matrix for each model. The first column will contain the label predictions, the following two columns the probabilities for each kind of outcome, and the last column will be the truth.
Remember that our loss function is either \mathcal{L} : \mathcal{Y} \times \mathcal{Y} \to [0,\infty), \mathcal{L}(y, \hat{y}), or \mathcal{L} : \mathcal{Y} \times [0,1]^g \to [0,\infty), \mathcal{L}(y, \vec{p}), so the final column provides y, the first column provides \hat{y} and the middle columns provide \hat{\vec{p}}.
# APPARENT ERRORS!
lr_apparent <-
  # First column will be predicted label
  predict(lr_fit, train_data) |>
  # Next two columns will be probabilities for the two labels
  bind_cols(predict(lr_fit, train_data, type = "prob")) |>
  # Last column will be true label
  bind_cols(train_data |> select(Status))
head(lr_apparent)

knn_apparent <-
  # First column will be predicted label
  predict(knn_fit, train_data) |>
  # Next two columns will be probabilities for the two labels
  bind_cols(predict(knn_fit, train_data, type = "prob")) |>
  # Last column will be true label
  bind_cols(train_data |> select(Status))
head(knn_apparent)
tidymodels automatically names the columns .pred_class for \hat{y} and .pred_good/.pred_bad for \hat{\vec{p}}, whilst y just retains its original name, Status.
We can now make use of the functions in the yardstick
(Kuhn and Vaughan,
2021) package to evaluate some apparent error metrics, making
sure to provide either \hat{y} or
one of the probabilities as required.
mn_log_loss(lr_apparent, truth = Status, .pred_bad)
mn_log_loss(knn_apparent, truth = Status, .pred_bad)
accuracy(lr_apparent, truth = Status, .pred_class)
accuracy(knn_apparent, truth = Status, .pred_class)
conf_mat(lr_apparent, truth = Status, .pred_class)
conf_mat(knn_apparent, truth = Status, .pred_class)
roc_auc(lr_apparent, truth = Status, .pred_bad)
roc_auc(knn_apparent, truth = Status, .pred_bad)
The first is the error under cross-entropy loss (called log loss by some), and the second uses 0-1 loss. The third is not an error metric but a so-called confusion matrix, where the columns indicate the truth and the rows the predictions. The final is the area under the receiver operating characteristic curve, which we will cover in the final Chapter of the course (this is a metric in the range [0.5,1]).
Again, the key thing to note is the unified interface. For example,
we didn’t need to know that the glm()
function has a
different set of choices for type
than those
tidymodels
uses, or indeed that kknn
doesn’t
even provide a fitted model. This is a key benefit of using abstraction
layers like tidymodels
.
You can examine the available metrics in the yardstick documentation, and if you want to use a custom loss to compute the error you can see the guide to custom metrics in yardstick.
We now turn to computing proper test error estimates. Your turn! Adapt the above code to compute the test error.
# TEST ERRORS!
# Adapt the code above to compute the same measures on the test data
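If you would like to check your own attempt afterwards, one possible adaptation for the logistic regression is sketched below (the object name lr_test is just an illustrative choice; the knn version is analogous):
# One possible adaptation for the logistic regression (knn is analogous)
lr_test <-
  predict(lr_fit, test_data) |>
  bind_cols(predict(lr_fit, test_data, type = "prob")) |>
  bind_cols(test_data |> select(Status))

mn_log_loss(lr_test, truth = Status, .pred_bad)
accuracy(lr_test, truth = Status, .pred_class)
roc_auc(lr_test, truth = Status, .pred_bad)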
We will actually want to do cross validation a lot of the time, perhaps to select the hyperparameters of the model. This works in much the same way as a holdout set, but now there is a list $splits which contains the training and testing split for each fold.
# CROSS VALIDATION!
# We'll now do 3-fold cross validation (perhaps want to do more, we choose 3
# just to make sure the lab code runs quite quickly)
credit_data_cv <- vfold_cv(train_data, v = 3)
split1_train <- training(credit_data_cv$splits[[1]])
split1_test <- testing(credit_data_cv$splits[[1]])
split2_train <- training(credit_data_cv$splits[[2]])
split2_test <- testing(credit_data_cv$splits[[2]])
split3_train <- training(credit_data_cv$splits[[3]])
split3_test <- testing(credit_data_cv$splits[[3]])
NOTE: above we’re creating cross validation folds of the training data, not the whole data. This is to highlight how you can mix the approaches we learned in the notes and lectures by, for example, using cross validation within a training split to select hyperparameters, but perhaps evaluating the generalisation error on held out test data.
Examine these to satisfy yourself that, for example, each of the three folds ends up being the test data once, whilst the remaining two are taken as training in each case.
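For example, one quick way to check the fold sizes is the following (a sketch, using the same training()/testing() extraction as above):
# Check each fold's training/testing sizes
sapply(credit_data_cv$splits,
       function(s) c(train = nrow(training(s)), test = nrow(testing(s))))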
Now, we could proceed to set up a loop to go over each split, compute errors and average them for a bunch of different hyperparameters, but this is a lot of very similar code to have to churn out. Fortunately, the workflows package comes to the rescue here and will wrap up this repetitive work.
# Better than loops is to use workflows in tidymodels
lr_wf <- workflow() |>
  add_model(lr_mod) |>
  add_formula(Status ~ .)

lr_cv_fits <- lr_wf |>
  fit_resamples(credit_data_cv)
Examine the lr_cv_fits
object and you will see it has
automatically fitted the model specified by our workflow to the training
part of each fold of the cross validation. We can then extract some
default metrics provided by tidymodels
. Note that
tidymodels
by default shows accuracy, the area under the ROC curve (neither of which is a strictly proper scoring rule), and the Brier score (which is a strictly proper scoring rule).
lr_cv_fits |> collect_metrics()
Fortunately we can override what metrics are computed when fitting to
the different folds. By looking at the help file for
fit_resamples
, decide what to replace ???
with
below in order to compute the two error metrics above on this model (ie
cross-entropy loss and 0-1 loss).
# Update metrics below to compute with cross-entropy loss and 0-1 loss
lr_cv_fits <- lr_wf |>
  fit_resamples(credit_data_cv,
                metrics = ???) |>
  collect_metrics()
lr_cv_fits
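If you want to check your answer, one way to supply these metrics is with yardstick's metric_set() helper; the sketch below assumes you chose mn_log_loss (cross-entropy loss) and accuracy (corresponding to 0-1 loss):
# One possible choice: combine the metrics with yardstick::metric_set()
lr_cv_fits <- lr_wf |>
  fit_resamples(credit_data_cv,
                metrics = metric_set(mn_log_loss, accuracy)) |>
  collect_metrics()
lr_cv_fits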
We are going to take the cross validation further now, by using the
tune
package to fit lots of models for varying choices of a
hyperparameter and then we can select a “best” model. The
dials
package provides us with convenient functions to
define what parameters we want to “tune the dial” for!
Below, we’ll try tuning L1 regularisation in the logistic regression.
To fit this we must have the glmnet
package, so if you
don’t, then install it now:
# Uncomment and run the following command first if you do not have the glmnet package
# install.packages("glmnet")
The dials::penalty() function lets us define a range of values to try (and whether to space them on the log or standard scale) – we’ll leave the defaults initially, which correspond to \log_{10}(\lambda) \in (-10, 0). Then, we need to create a new model object which flags any hyperparameter we want to tune by passing the tune() function instead of a value. Look at the help file for logistic_reg and decide what value to replace ??? with to ensure that we perform L1 regularisation. Then, we need to change the ‘engine’ to glmnet, since the base glm function does not support regularisation.
# Adjust the mixture parameter below to perform L1 regularisation
lr_grid <- grid_regular(penalty(),
                        levels = 10)

lr_tune_mod <- logistic_reg(penalty = tune(),
                            mixture = ???) |>
  set_engine("glmnet")
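For reference, the mixture argument of logistic_reg() controls the balance between L1 and L2 penalties, with mixture = 1 giving a pure L1 (lasso) penalty, so one possible completion is:
# mixture = 1 corresponds to a pure L1 (lasso) penalty
lr_tune_mod <- logistic_reg(penalty = tune(),
                            mixture = 1) |>
  set_engine("glmnet")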
With that done, we define the workflow in the same way (just with
this new model which includes regularisation) and proceed to run
tune_grid
which now fits not just 1 model per cross
validation fold, but rather fits 10 models to every fold for the
different penalty choices and computes error metrics for them using the
held out fold. Just as with fit_resamples
above, we can override the default metrics, so use the same error metrics here for the second ??? as you did above.
# Now, use the metrics found before for tune_grid
lr_tune_wf <- workflow() |>
  add_model(lr_tune_mod) |>
  add_formula(Status ~ .)

lr_tune_cv_fits <- lr_tune_wf |>
  tune_grid(
    resamples = credit_data_cv,
    grid = lr_grid,
    metrics = ???
  )
lr_tune_cv_fits |> collect_metrics()
We can see that the 0-1 and cross-entropy loss errors remain static over quite a large range of small penalties, meaning that the penalty is too small to result in any regularisation. Decide what range of penalties to “zoom in” on and specify this via the penalty() function in the grid that is set up on the first line (you may need to play around to find the right choice). In other words, change the following line, using the documentation and results above to decide what to place in ???, then rerun the rest of the above.
# Repeat, starting out with a different grid to zoom in on area of relevance
lr_grid <- grid_regular(penalty(???),
                        levels = 10)
# (rerun the previous solution you came up with after changing that line)
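As an illustration only (the exact range is for you to judge from your own results), penalty() takes a range argument in log10 units, so a zoomed-in grid might look something like this, with the values here purely hypothetical:
# Hypothetical zoomed-in grid: range is in log10 units, i.e. 10^-4 to 10^-1
lr_grid <- grid_regular(penalty(range = c(-4, -1)),
                        levels = 10)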
Is it worth regularising logistic regression in this case? (Hint 1: It’s just as important to know when not to do something fancy as when to do so! Hint 2: notice the standard error column … we know this isn’t quite calibrated to the true generalisation error from lectures, but it is still ok as a rule-of-thumb when deciding if improvements are likely to be spurious or helpful)
You may find it helpful to plot the 0-1 and cross-entropy errors. This will work with the result of the tune_grid() function above, from either the original or your zoomed-in attempt.
# This plot may help to see what's going on!
lr_tune_cv_fits |>
  collect_metrics() |>
  ggplot(aes(penalty, mean)) +
  geom_line(linewidth = 1.5, alpha = 0.6) +
  geom_point(size = 2) +
  facet_wrap(~ .metric, scales = "free", nrow = 2) +
  scale_x_log10(labels = scales::label_number())
Try adapting the above code, reading the documentation where necessary, to automatically tune the number of neighbours in the nearest neighbour model.
# Try tuning the k-nearest neighbour to select number of neighbours
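One possible approach is sketched below, mirroring the logistic regression tuning above; the object names (knn_tune_mod, knn_grid, knn_tune_wf, knn_tune_cv_fits) are just illustrative choices, and dials::neighbors() supplies a default range for the number of neighbours:
# Sketch: tune the number of neighbours in the knn model
knn_tune_mod <- nearest_neighbor(neighbors = tune()) |>
  set_mode("classification")
knn_grid <- grid_regular(neighbors(), levels = 10)

knn_tune_wf <- workflow() |>
  add_model(knn_tune_mod) |>
  add_formula(Status ~ .)

knn_tune_cv_fits <- knn_tune_wf |>
  tune_grid(
    resamples = credit_data_cv,
    grid = knn_grid,
    metrics = metric_set(mn_log_loss, accuracy)
  )
knn_tune_cv_fits |> collect_metrics()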
We can use the show_best
and select_best
functions to see and pull out the top performing model from the tuning
run.
# Examine best model
lr_tune_cv_fits |> show_best(metric = "accuracy")
lr_tune_cv_fits |> show_best(metric = "mn_log_loss")
# Pull out best model
# Really, want to make sure you use a loss *relevant to your real problem* for
# final selection of model!
lr_final_mod <- lr_tune_cv_fits |> select_best(metric = "mn_log_loss")
Since we have our held-back test data from much earlier, we can now refit the “best” model to all the data we used during cross validation (using last_fit) and estimate the test error. Remember to bring your choice of metrics forward to here at ???:
# First, finalise the workflow to only the best model
lr_final_wf <- lr_tune_wf |>
  finalize_workflow(lr_final_mod)

# Then fit it ... note we now use the credit_data_split from way back at the start
# of the lab, so that we use all of train_data (which was split for CV) to
# fit and then test on the test data
lr_final_fit <-
  lr_final_wf |>
  last_fit(credit_data_split,
           metrics = ???)
lr_final_fit |> collect_metrics()
Try this also for your tuned knn model:
# Repeat for your KNN model!
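If you built a tuned knn workflow in the earlier exercise (here assumed to be called knn_tune_wf, with tuning results in knn_tune_cv_fits, as in the sketch above), the final fit follows the same pattern:
# Sketch: select the best knn model and estimate its test error
knn_final_mod <- knn_tune_cv_fits |> select_best(metric = "mn_log_loss")
knn_final_fit <- knn_tune_wf |>
  finalize_workflow(knn_final_mod) |>
  last_fit(credit_data_split,
           metrics = metric_set(mn_log_loss, accuracy))
knn_final_fit |> collect_metrics()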