Code has been provided throughout the course in live code blocks, giving snippets that perform particular tasks by directly using individual packages for individual methods. We will therefore not repeat such exercises in the labs; instead, we will focus on the full software pipeline, which makes running machine learning analyses much easier. Arguably the two leading ML pipelines in R (R Core Team, 2021) today are tidymodels (Kuhn and Wickham, 2020) and mlr3 (Lang et al., 2019). The first lab will focus on tidymodels and the second on mlr3.
Before beginning with tidymodels, we will make sure you can get the two data sets used for the examples here, with which we will fit basic binary and multiclass classification models.
The first is a credit scoring dataset: an important application in banking. The data consist of 4454 observations of 14 variables, each a loan that was granted, together with the status of whether the loan was “good” (i.e. repaid) or went “bad” (i.e. defaulted).
# Uncomment and run the following command first if you do not have the modeldata package
# install.packages("modeldata")
data("credit_data", package = "modeldata")
Use some of the exploration techniques from Lab 1 to explore the dataset before you start modelling.
# Try some plotting functions to explore the data here
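For example, a minimal sketch of the kind of exploration you might try (the choice of variables shown is purely illustrative):
# A quick look at the structure and a summary of each variable
str(credit_data)
summary(credit_data)
# Class balance of the response, and loan amount by loan status
library("ggplot2")
ggplot(credit_data, aes(x = Status)) +
  geom_bar()
ggplot(credit_data, aes(x = Status, y = Amount)) +
  geom_boxplot()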
The second dataset we will use, which is multiclass, is the famous MNIST handwritten digits data set. I have already put this data into the format we discussed in lectures for coding images (one image per row, with pixels as features). Download it first (this may take some time; it is ~40MB):
# Uncomment and run the following command first if you do not have the fst package
# install.packages("fst")
download.file("https://www.louisaslett.com/Courses/MISCADA/mnist.fst", "mnist.fst")
Then read the data in using the high-performance file storage package fst (Klik, 2021):
mnist <- fst::read.fst("mnist.fst")
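Before plotting, you might, for instance, check the dimensions and how many examples there are of each digit (assuming the label column is called response, as it is in the plotting function below):
# One image per row: 28 x 28 = 784 pixel columns plus the digit label
dim(mnist)
# How many examples of each digit are there?
table(mnist$response)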
Have a look at this data. If you want to see an image, the following function will plot any of the images for you. Read and try to understand the two steps involved by looking at the documentation for each.
# Define a plotting function to transform the flattened image data into a
# square graphic. See Data Coding appendix in the notes for how images are
# represented as tabular data!
library("tidyverse")
plotimages <- function(i) {
  imgs <- mnist |>
    slice(i) |>
    mutate(image.num = i) |>
    pivot_longer(x0.y27:x27.y0,
                 names_to = c("x", "y"),
                 names_pattern = "x([0-9]{1,2}).y([0-9]{1,2})",
                 names_transform = list(x = as.integer, y = as.integer),
                 values_to = "greyscale")

  ggplot(imgs, aes(x = x, y = y, fill = greyscale)) +
    geom_tile() +
    scale_fill_gradient(low = "white", high = "black") +
    facet_wrap(~ image.num + response, labeller = label_both)
}
We can then look at the individual images, or groups of them. For example, to look at images 21 through 24, we run:
plotimages(21:24)
The tidymodels universe comprises a suite of packages, each focused on a different aspect of the machine learning software pipeline, but all employing a coherent and unified interface to simplify use and to automate and abstract away common tasks.
The tidymodels package itself is really a so-called meta-package, meaning that it doesn't contain any important functions itself and is primarily a way to install the core tidymodels packages, which are listed as its dependencies. First up, install this package, which will automatically pull in the others for you:
# Uncomment and run the following command first if you do not have the tidymodels package
# install.packages("tidymodels")
# Load tidymodels, which loads a suite of other packages for us
library("tidymodels")
## ── Attaching packages ────────────────────────────────────── tidymodels 1.2.0 ──
## ✔ broom 1.0.6 ✔ rsample 1.2.1
## ✔ dials 1.3.0 ✔ tune 1.2.1
## ✔ infer 1.0.7 ✔ workflows 1.1.4
## ✔ modeldata 1.4.0 ✔ workflowsets 1.1.0
## ✔ parsnip 1.2.1 ✔ yardstick 1.3.1
## ✔ recipes 1.1.0
## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## ✖ scales::discard() masks purrr::discard()
## ✖ dplyr::filter() masks stats::filter()
## ✖ recipes::fixed() masks stringr::fixed()
## ✖ dplyr::lag() masks stats::lag()
## ✖ yardstick::spec() masks readr::spec()
## ✖ recipes::step() masks stats::step()
## • Learn how to get started at https://www.tidymodels.org/start/
The core tidymodels packages include:
- rsample, which provides the data splitting techniques such as train/validate/test, cross-validation and bootstrap resampling from Chapter 5 in the notes;
- parsnip, which provides a unified interface to different modelling methods, such as those of Chapters 4 and 6, so that you can try many methods without having to learn the different underlying package interfaces;
- recipes, which enables you to perform common data pre-processing tasks in a way that makes replication of those steps easy when you load future data to perform predictions;
- workflows, which facilitates the creation of end-to-end workflows to execute all steps of a machine learning pipeline;
- tune, which provides a framework for empirical selection of hyperparameters;
- yardstick, which incorporates many different loss functions (such as those in Chapter 3), as well as other metrics (such as those in Chapter 7);
- broom, for formatting and outputting information about fitted models in a readable way.
We will go through the use of some of these packages as we run through a lab example here.
The first step will be to create a training/testing split using functions from rsample (which will have been loaded by the tidymodels package above).
First, it is good to set the random number seed (recall APTS week 1 (Wilkinson and Wood, 2021)) so we get reproducible results for the purposes of this lab.
Then, because \(k\)-nearest neighbour doesn’t support missing data natively, we will eliminate any rows with missingness. Note that it may make sense to explore other options such as missing data imputation, but we take the simplest course of action first.
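It is worth checking how much missingness there is before simply dropping rows; a quick check might look like this (imputation, for example via the step_impute_* functions in recipes, would be one alternative that we do not pursue here):
# Count missing values per variable ...
colSums(is.na(credit_data))
# ... and how many rows would be dropped by removing any incomplete cases
sum(!complete.cases(credit_data))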
With this basic pre-processing done, we are ready to call initial_split(), the function that returns an object containing all the data but internally randomised into training/testing portions. We then use the training() and testing() functions to extract each relevant part of the data.
# Set seed so we reproducibly see the same results
set.seed(212)
# Because knn can't handle missings ... maybe try imputation later?
credit_data <- na.omit(credit_data)
# Create a train/test split into 75% / 25% of the data
credit_data_split <- initial_split(credit_data, strata = Status, prop = 3/4)
# We then extract the training and testing portions
train_data <- training(credit_data_split)
test_data <- testing(credit_data_split)
With the data split into the simplest possible hold-out setup, we can
turn to parsnip
to fit a variety of models without having
to learn each individual package.
Here we’ll fit a logistic regression and \(k\)-nearest neighbour.
# First, define a logistic regression model
lr_mod <- logistic_reg()
# Now we need to load the kknn package due to a bug awaiting a fix
library("kknn") # Needed due to bug! https://github.com/tidymodels/parsnip/issues/264
# Second, define a knn model. Since KNN has a hyperparameter we can specify that
# here, and because it supports both classification and regression we can
# set which mode we want
knn_mod <- nearest_neighbor(neighbors = 3) |>
set_mode("classification")
So this is quite different to base R modelling, where you simply call a fitting function right away. Here we have defined a model first, and now we proceed to use this model object to fit to the data, saving the result in a new object ending in _fit. Notice the unified approach for both types of model. We obviously fit to the training data.
# Next we fit the model
lr_fit <- lr_mod |> fit(Status ~ ., data = train_data)
knn_fit <- knn_mod |> fit(Status ~ ., data = train_data)
Have a look at the model objects individually:
lr_fit
## parsnip model object
##
##
## Call: stats::glm(formula = Status ~ ., family = stats::binomial, data = data)
##
## Coefficients:
## (Intercept) Seniority Homeother Homeowner
## 2.469e-01 8.482e-02 2.585e-01 1.119e+00
## Homeparents Homepriv Homerent Time
## 1.058e+00 5.901e-01 4.534e-01 -3.346e-03
## Age Maritalmarried Maritalseparated Maritalsingle
## -8.346e-03 1.471e+00 9.621e-02 8.330e-01
## Maritalwidow Recordsyes Jobfreelance Jobothers
## 3.435e-01 -1.823e+00 -3.437e-01 -3.099e-01
## Jobpartime Expenses Income Assets
## -1.569e+00 -1.973e-02 7.536e-03 1.755e-05
## Debt Amount Price
## -1.245e-04 -2.199e-03 1.211e-03
##
## Degrees of Freedom: 3027 Total (i.e. Null); 3005 Residual
## Null Deviance: 3432
## Residual Deviance: 2483 AIC: 2529
knn_fit
## parsnip model object
##
##
## Call:
## kknn::train.kknn(formula = Status ~ ., data = data, ks = min_rows(3, data, 5))
##
## Type of response variable: nominal
## Minimal misclassification: 0.2790621
## Best kernel: optimal
## Best k: 3
Do you notice something interesting about the knn_fit object? It mentions kernels! In fact, the kknn package (Schliep and Hechenbichler, 2016) combines both ideas from the “Local Methods” Chapter in the notes, and weights the neighbours using a kernel. Read the documentation of kknn::kknn for more details.
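If you would rather have plain, equally weighted k-nearest neighbours, parsnip exposes the kernel through the weight_func argument of nearest_neighbor(); a sketch (the "rectangular" kernel corresponds to unweighted neighbours):
# Unweighted 3-nearest neighbours: the "rectangular" kernel weights all
# neighbours equally
knn_unweighted_mod <- nearest_neighbor(neighbors = 3,
                                       weight_func = "rectangular") |>
  set_mode("classification")
knn_unweighted_fit <- knn_unweighted_mod |> fit(Status ~ ., data = train_data)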
We know that we shouldn't really look at the apparent error, since it is downward biased for (i.e. optimistic about) the true model error, but let's do so anyway first!
Here, we're going to construct a prediction matrix for each model. The first column will contain the label predictions, the following two columns the probabilities for each kind of outcome, and the last column will be the truth.
Remember that our loss function is \(\mathcal{L} : \mathcal{Y} \times \mathcal{Y} \to [0,\infty)\), \(\mathcal{L}(y, \hat{y})\), or \(\mathcal{L} : \mathcal{Y} \times [0,1]^g \to [0,\infty)\), \(\mathcal{L}(y, \vec{p})\), so the final column provides \(y\), the first column provides \(\hat{y}\) and the middle columns provide \(\hat{\vec{p}}\).
# APPARENT ERRORS!
lr_apparent <-
# First column will be predicted label
predict(lr_fit, train_data) |>
# Next two columns will be probabilities for the two labels
bind_cols(predict(lr_fit, train_data, type = "prob")) |>
# Last column will be true label
bind_cols(train_data |> select(Status))
head(lr_apparent)
## # A tibble: 6 × 4
## .pred_class .pred_bad .pred_good Status
## <fct> <dbl> <dbl> <fct>
## 1 bad 0.791 0.209 bad
## 2 good 0.367 0.633 bad
## 3 bad 0.546 0.454 bad
## 4 bad 0.501 0.499 bad
## 5 good 0.450 0.550 bad
## 6 bad 0.593 0.407 bad
knn_apparent <-
# First column will be predicted label
predict(knn_fit, train_data) |>
# Next two columns will be probabilities for the two labels
bind_cols(predict(knn_fit, train_data, type = "prob")) |>
# Last column will be true label
bind_cols(train_data |> select(Status))
head(knn_apparent)
## # A tibble: 6 × 4
## .pred_class .pred_bad .pred_good Status
## <fct> <dbl> <dbl> <fct>
## 1 bad 0.929 0.0713 bad
## 2 bad 0.929 0.0713 bad
## 3 bad 1 0 bad
## 4 bad 0.670 0.330 bad
## 5 bad 0.670 0.330 bad
## 6 bad 0.929 0.0713 bad
tidymodels automatically names the columns .pred_class for \(\hat{y}\) and .pred_good/.pred_bad for \(\hat{\vec{p}}\), whilst \(y\) just retains its original name, Status.
We can now make use of the functions in the yardstick
(Kuhn and Vaughan,
2021) package to evaluate some apparent error metrics, making
sure to provide either \(\hat{y}\) or
one of the probabilities as required.
mn_log_loss(lr_apparent, truth = Status, .pred_bad)
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 mn_log_loss binary 0.410
mn_log_loss(knn_apparent, truth = Status, .pred_bad)
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 mn_log_loss binary 0.107
accuracy(lr_apparent, truth = Status, .pred_class)
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.819
accuracy(knn_apparent, truth = Status, .pred_class)
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 1.00
conf_mat(lr_apparent, truth = Status, .pred_class)
## Truth
## Prediction bad good
## bad 373 153
## good 396 2106
conf_mat(knn_apparent, truth = Status, .pred_class)
## Truth
## Prediction bad good
## bad 769 1
## good 0 2258
roc_auc(lr_apparent, truth = Status, .pred_bad)
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 roc_auc binary 0.846
roc_auc(knn_apparent, truth = Status, .pred_bad)
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 roc_auc binary 1.00
The first is the error under cross-entropy loss (called log loss by some); the second uses 0-1 loss. The third is not an error metric but a so-called confusion matrix, where the columns indicate the truth and the rows the predictions. The final one is the area under the receiver operating characteristic curve, which we will cover in the final Chapter of the course (this is a metric in the range \([0.5,1]\)).
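To see how the confusion matrix relates to the 0-1 loss, note that accuracy is just the proportion of predictions lying on the diagonal; for the logistic regression apparent results above:
# Accuracy = correctly classified / total, read off the confusion matrix
(373 + 2106) / (373 + 153 + 396 + 2106)  # approximately 0.819, matching accuracy() above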
Again, the key thing to note is the unified interface. For example, we didn't need to know that the glm() function has a different set of choices for type than those tidymodels uses, or indeed that kknn doesn't even provide a fitted model. This is a key benefit of using abstraction layers like tidymodels.
You can examine the available metrics in the yardstick documentation and, if you want to use a custom loss to compute the error, you can see the yardstick guide to custom metrics.
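As a small taster of what comes later, yardstick also lets you bundle several metrics into a single function with metric_set(); something along the following lines should reproduce the individual calls above in one go (class metrics take the estimate argument, probability metrics pick up the probability column):
# Bundle 0-1 loss (accuracy) and cross-entropy (mn_log_loss) into one metric set
my_metrics <- metric_set(accuracy, mn_log_loss)
my_metrics(lr_apparent, truth = Status, estimate = .pred_class, .pred_bad)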
We now compare with a proper assessment on the held-out test data. Your turn! Adapt the above code to compute the test error.
# TEST ERRORS!
# Adapt the code above to compute the same measures on the test data
## SOLUTION
lr_test <-
# First column will be predicted label
predict(lr_fit, test_data) |>
# Next two columns will be probabilities for the two labels
bind_cols(predict(lr_fit, test_data, type = "prob")) |>
# Last column will be true label
bind_cols(test_data |> select(Status))
head(lr_test)
## # A tibble: 6 × 4
## .pred_class .pred_bad .pred_good Status
## <fct> <dbl> <dbl> <fct>
## 1 good 0.282 0.718 good
## 2 good 0.496 0.504 bad
## 3 good 0.283 0.717 good
## 4 good 0.174 0.826 good
## 5 bad 0.947 0.0533 bad
## 6 good 0.227 0.773 good
knn_test <-
# First column will be predicted label
predict(knn_fit, test_data) |>
# Next two columns will be probabilities for the two labels
bind_cols(predict(knn_fit, test_data, type = "prob")) |>
# Last column will be true label
bind_cols(test_data |> select(Status))
head(knn_test)
## # A tibble: 6 × 4
## .pred_class .pred_bad .pred_good Status
## <fct> <dbl> <dbl> <fct>
## 1 good 0.0713 0.929 good
## 2 good 0 1 bad
## 3 good 0 1 good
## 4 good 0 1 good
## 5 bad 1 0 bad
## 6 good 0 1 good
mn_log_loss(lr_test, truth = Status, .pred_bad)
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 mn_log_loss binary 0.445
mn_log_loss(knn_test, truth = Status, .pred_bad)
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 mn_log_loss binary 3.84
accuracy(lr_test, truth = Status, .pred_class)
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.804
accuracy(knn_test, truth = Status, .pred_class)
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.723
conf_mat(lr_test, truth = Status, .pred_class)
## Truth
## Prediction bad good
## bad 113 54
## good 144 700
conf_mat(knn_test, truth = Status, .pred_class)
## Truth
## Prediction bad good
## bad 109 132
## good 148 622
roc_auc(lr_test, truth = Status, .pred_bad)
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 roc_auc binary 0.809
roc_auc(knn_test, truth = Status, .pred_bad)
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 roc_auc binary 0.694
We will actually want to do cross-validation a lot of the time, perhaps to select the hyperparameters of the model. This works in much the same way as a hold-out set, but now there is a list $splits which contains the training and testing combination for each fold.
# CROSS VALIDATION!
# We'll now do 3-fold cross validation (perhaps want to do more, we choose 3
# just to make sure the lab code runs quite quickly)
credit_data_cv <- vfold_cv(train_data, v = 3)
split1_train <- training(credit_data_cv$splits[[1]])
split1_test <- testing(credit_data_cv$splits[[1]])
split2_train <- training(credit_data_cv$splits[[2]])
split2_test <- testing(credit_data_cv$splits[[2]])
split3_train <- training(credit_data_cv$splits[[3]])
split3_test <- testing(credit_data_cv$splits[[3]])
NOTE: above we're creating cross-validation folds of the training data, not the whole data. This is to highlight how you can mix the approaches we learned in the notes and lectures by, for example, using cross-validation within a training split to select hyperparameters, but perhaps evaluating the generalisation error on held-out test data.
Examine these to satisfy yourself that, for example, each of the three folds ends up being the test data once, whilst the remaining two are taken as training in each case.
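For example, one quick check is that the three test folds between them account for every row of the training data exactly once:
# The test folds should partition the training data, so their sizes sum to
# the number of training rows
nrow(split1_test) + nrow(split2_test) + nrow(split3_test) == nrow(train_data)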
Now, we could proceed to set up a loop to go over each split, compute errors and average them for a bunch of different hyperparameters, but this is a lot of very similar code to have to churn out. Fortunately, the workflows package comes to the rescue here and will wrap up this repetitive work.
# Better than loops is to use workflows in tidymodels
lr_wf <- workflow() |>
add_model(lr_mod) |>
add_formula(Status ~ .)
lr_cv_fits <- lr_wf |>
fit_resamples(credit_data_cv)
Examine the lr_cv_fits object and you will see it has automatically fitted the model specified by our workflow to the training part of each fold of the cross-validation. We can then extract some default metrics provided by tidymodels. Note that tidymodels by default shows accuracy and ROC AUC (neither of which is a strictly proper scoring rule), and the Brier score (which is a strictly proper scoring rule).
lr_cv_fits |> collect_metrics()
## # A tibble: 3 × 6
## .metric .estimator mean n std_err .config
## <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 accuracy binary 0.810 3 0.00448 Preprocessor1_Model1
## 2 brier_class binary 0.133 3 0.00157 Preprocessor1_Model1
## 3 roc_auc binary 0.839 3 0.00666 Preprocessor1_Model1
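These figures are averages over the three folds; if you would like to see each fold separately, collect_metrics() can be asked not to summarise:
# Per-fold metrics rather than the average across the folds
lr_cv_fits |> collect_metrics(summarize = FALSE)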
Fortunately, we can override which metrics are computed when fitting to the different folds. By looking at the help file for fit_resamples, decide what to replace ??? with below in order to compute the two error metrics used above on this model (i.e. cross-entropy loss and 0-1 loss).
# Update metrics below to compute with cross-entropy loss and 0-1 loss
lr_cv_fits <- lr_wf |>
fit_resamples(credit_data_cv,
metrics = ???) |>
collect_metrics()
lr_cv_fits
## SOLUTION
lr_cv_fits <- lr_wf |>
fit_resamples(credit_data_cv,
metrics = metric_set(mn_log_loss,
accuracy)) |>
collect_metrics()
lr_cv_fits
## # A tibble: 2 × 6
## .metric .estimator mean n std_err .config
## <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 accuracy binary 0.810 3 0.00448 Preprocessor1_Model1
## 2 mn_log_loss binary 0.418 3 0.00437 Preprocessor1_Model1
We are now going to take the cross-validation further, using the tune package to fit lots of models for varying choices of a hyperparameter, from which we can then select a “best” model. The dials package provides us with convenient functions to define which parameters we want to “tune the dial” for!
Below, we’ll try tuning L1 regularisation in the logistic regression.
To fit this we must have the glmnet
package, so if you
don’t, then install it now:
# Uncomment and run the following command first if you do not have the glmnet package
# install.packages("glmnet")
The dials::penalty() function lets us define a range of values to try (and whether to space them on the log or standard scale) – we'll leave the defaults initially, \(\log_{10}(\lambda) \in (-10, 0)\). Then, we need to create a new model object which flags any hyperparameter we want to tune by passing the tune() function instead of a value. Look at the help file for logistic_reg and decide what value to replace ??? with to ensure that we perform L1 regularisation. Then, we need to change the ‘engine’ to glmnet, since the base glm function does not support regularisation.
# Adjust the mixture parameter below to perform L1 regularisation
lr_grid <- grid_regular(penalty(),
levels = 10)
lr_tune_mod <- logistic_reg(penalty = tune(),
mixture = ???) |>
set_engine("glmnet")
# SOLUTION
lr_grid <- grid_regular(penalty(),
levels = 10)
lr_tune_mod <- logistic_reg(penalty = tune(),
mixture = 1) |>
set_engine("glmnet")
With that done, we define the workflow in the same way (just with this new model which includes regularisation) and proceed to run tune_grid, which now fits not just one model per cross-validation fold, but rather fits 10 models to every fold for the different penalty choices and computes error metrics for them using the held-out fold. Just as with fit_resamples above, we can override the metrics, so use the same errors here for the second ??? as you did above.
# Now, use the metrics found before for tune_grid
lr_tune_wf <- workflow() |>
add_model(lr_tune_mod) |>
add_formula(Status ~ .)
lr_tune_cv_fits <- lr_tune_wf |>
tune_grid(
resamples = credit_data_cv,
grid = lr_grid,
metrics = ???
)
lr_tune_cv_fits |> collect_metrics()
# SOLUTION
lr_tune_wf <- workflow() |>
add_model(lr_tune_mod) |>
add_formula(Status ~ .)
lr_tune_cv_fits <- lr_tune_wf |>
tune_grid(
resamples = credit_data_cv,
grid = lr_grid,
metrics = metric_set(mn_log_loss,
accuracy)
)
lr_tune_cv_fits |> collect_metrics()
## # A tibble: 20 × 7
## penalty .metric .estimator mean n std_err .config
## <dbl> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 0.0000000001 accuracy binary 0.811 3 0.00489 Preprocessor1_Model…
## 2 0.0000000001 mn_log_loss binary 0.417 3 0.00422 Preprocessor1_Model…
## 3 0.00000000129 accuracy binary 0.811 3 0.00489 Preprocessor1_Model…
## 4 0.00000000129 mn_log_loss binary 0.417 3 0.00422 Preprocessor1_Model…
## 5 0.0000000167 accuracy binary 0.811 3 0.00489 Preprocessor1_Model…
## 6 0.0000000167 mn_log_loss binary 0.417 3 0.00422 Preprocessor1_Model…
## 7 0.000000215 accuracy binary 0.811 3 0.00489 Preprocessor1_Model…
## 8 0.000000215 mn_log_loss binary 0.417 3 0.00422 Preprocessor1_Model…
## 9 0.00000278 accuracy binary 0.811 3 0.00489 Preprocessor1_Model…
## 10 0.00000278 mn_log_loss binary 0.417 3 0.00422 Preprocessor1_Model…
## 11 0.0000359 accuracy binary 0.811 3 0.00489 Preprocessor1_Model…
## 12 0.0000359 mn_log_loss binary 0.417 3 0.00422 Preprocessor1_Model…
## 13 0.000464 accuracy binary 0.811 3 0.00534 Preprocessor1_Model…
## 14 0.000464 mn_log_loss binary 0.417 3 0.00391 Preprocessor1_Model…
## 15 0.00599 accuracy binary 0.808 3 0.00545 Preprocessor1_Model…
## 16 0.00599 mn_log_loss binary 0.421 3 0.00279 Preprocessor1_Model…
## 17 0.0774 accuracy binary 0.752 3 0.00679 Preprocessor1_Model…
## 18 0.0774 mn_log_loss binary 0.517 3 0.00429 Preprocessor1_Model…
## 19 1 accuracy binary 0.746 3 0.00526 Preprocessor1_Model…
## 20 1 mn_log_loss binary 0.567 3 0.00560 Preprocessor1_Model…
We can see that the 0-1 and cross-entropy loss errors remain static over quite a large range of the small penalties, meaning that the penalty is too small to result in any regularisation. Decide what range of penalties to “zoom in” on and specify this via the penalty function in the grid that is set up on the first line (you may need to play around to find the right choice). In other words, change the following line, using the documentation and results above to decide what to place in ???, then rerun the rest of the above.
# Repeat, starting out with a different grid to zoom in on area of relevance
lr_grid <- grid_regular(penalty(???),
levels = 10)
# (rerun the previous solution you came up with after changing that line)
# SOLUTION
lr_grid <- grid_regular(penalty(c(-3.7,-2.3)),
levels = 10)
lr_tune_mod <- logistic_reg(penalty = tune(),
mixture = 1) |>
set_engine("glmnet")
lr_tune_wf <- workflow() |>
add_model(lr_tune_mod) |>
add_formula(Status ~ .)
lr_tune_cv_fits <- lr_tune_wf |>
tune_grid(
resamples = credit_data_cv,
grid = lr_grid,
metrics = metric_set(mn_log_loss,
accuracy)
)
lr_tune_cv_fits |> collect_metrics()
## # A tibble: 20 × 7
## penalty .metric .estimator mean n std_err .config
## <dbl> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 0.000200 accuracy binary 0.810 3 0.00473 Preprocessor1_Model01
## 2 0.000200 mn_log_loss binary 0.417 3 0.00411 Preprocessor1_Model01
## 3 0.000285 accuracy binary 0.811 3 0.00506 Preprocessor1_Model02
## 4 0.000285 mn_log_loss binary 0.417 3 0.00405 Preprocessor1_Model02
## 5 0.000408 accuracy binary 0.811 3 0.00534 Preprocessor1_Model03
## 6 0.000408 mn_log_loss binary 0.417 3 0.00395 Preprocessor1_Model03
## 7 0.000584 accuracy binary 0.812 3 0.00513 Preprocessor1_Model04
## 8 0.000584 mn_log_loss binary 0.417 3 0.00384 Preprocessor1_Model04
## 9 0.000836 accuracy binary 0.814 3 0.00506 Preprocessor1_Model05
## 10 0.000836 mn_log_loss binary 0.417 3 0.00374 Preprocessor1_Model05
## 11 0.00120 accuracy binary 0.813 3 0.00572 Preprocessor1_Model06
## 12 0.00120 mn_log_loss binary 0.417 3 0.00360 Preprocessor1_Model06
## 13 0.00171 accuracy binary 0.813 3 0.00657 Preprocessor1_Model07
## 14 0.00171 mn_log_loss binary 0.417 3 0.00342 Preprocessor1_Model07
## 15 0.00245 accuracy binary 0.810 3 0.00671 Preprocessor1_Model08
## 16 0.00245 mn_log_loss binary 0.418 3 0.00326 Preprocessor1_Model08
## 17 0.00350 accuracy binary 0.811 3 0.00654 Preprocessor1_Model09
## 18 0.00350 mn_log_loss binary 0.419 3 0.00311 Preprocessor1_Model09
## 19 0.00501 accuracy binary 0.808 3 0.00492 Preprocessor1_Model10
## 20 0.00501 mn_log_loss binary 0.420 3 0.00290 Preprocessor1_Model10
Is it worth regularising logistic regression in this case? (Hint 1: It's just as important to know when not to do something fancy as when to do so! Hint 2: notice the standard error column … we know from lectures that this isn't quite calibrated to the true generalisation error, but it is still ok as a rule-of-thumb when deciding whether improvements are likely to be spurious or genuine.)
You may find it helpful to plot the 0-1 and cross-entropy errors. This will work with the result of the tune_grid function above, from either the original or your zoomed-in attempt.
# This plot may help to see what's going on!
lr_tune_cv_fits |>
collect_metrics() |>
ggplot(aes(penalty, mean)) +
geom_line(linewidth = 1.5, alpha = 0.6) +
geom_point(size = 2) +
facet_wrap(~ .metric, scales = "free", nrow = 2) +
scale_x_log10(labels = scales::label_number())
Try adapting the above code, reading the documentation where necessary, to automatically tune the number of neighbours in the nearest neighbour model.
# Try tuning the k-nearest neighbour to select number of neighbours
# SOLUTION
knn_grid <- grid_regular(neighbors(c(1,50)),
levels = 25)
knn_tune_mod <- nearest_neighbor(neighbors = tune()) |>
set_mode("classification")
knn_tune_wf <- workflow() |>
add_model(knn_tune_mod) |>
add_formula(Status ~ .)
knn_tune_cv_fits <- knn_tune_wf |>
tune_grid(
resamples = credit_data_cv,
grid = knn_grid,
metrics = metric_set(mn_log_loss,
accuracy)
)
knn_tune_cv_fits |> collect_metrics()
## # A tibble: 50 × 7
## neighbors .metric .estimator mean n std_err .config
## <int> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 1 accuracy binary 0.725 3 0.0117 Preprocessor1_Model01
## 2 1 mn_log_loss binary 9.93 3 0.422 Preprocessor1_Model01
## 3 3 accuracy binary 0.725 3 0.0117 Preprocessor1_Model02
## 4 3 mn_log_loss binary 3.41 3 0.181 Preprocessor1_Model02
## 5 5 accuracy binary 0.758 3 0.0103 Preprocessor1_Model03
## 6 5 mn_log_loss binary 2.09 3 0.182 Preprocessor1_Model03
## 7 7 accuracy binary 0.764 3 0.00943 Preprocessor1_Model04
## 8 7 mn_log_loss binary 1.60 3 0.156 Preprocessor1_Model04
## 9 9 accuracy binary 0.768 3 0.00814 Preprocessor1_Model05
## 10 9 mn_log_loss binary 1.28 3 0.228 Preprocessor1_Model05
## # ℹ 40 more rows
knn_tune_cv_fits |>
collect_metrics() |>
ggplot(aes(neighbors, mean)) +
geom_line(linewidth = 1.5, alpha = 0.6) +
geom_point(size = 2) +
facet_wrap(~ .metric, scales = "free", nrow = 2)
We can use the show_best
and select_best
functions to see and pull out the top performing model from the tuning
run.
# Examine best model
lr_tune_cv_fits |> show_best(metric = "accuracy")
## # A tibble: 5 × 7
## penalty .metric .estimator mean n std_err .config
## <dbl> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 0.000836 accuracy binary 0.814 3 0.00506 Preprocessor1_Model05
## 2 0.00120 accuracy binary 0.813 3 0.00572 Preprocessor1_Model06
## 3 0.00171 accuracy binary 0.813 3 0.00657 Preprocessor1_Model07
## 4 0.000584 accuracy binary 0.812 3 0.00513 Preprocessor1_Model04
## 5 0.000408 accuracy binary 0.811 3 0.00534 Preprocessor1_Model03
lr_tune_cv_fits |> show_best(metric = "mn_log_loss")
## # A tibble: 5 × 7
## penalty .metric .estimator mean n std_err .config
## <dbl> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 0.000584 mn_log_loss binary 0.417 3 0.00384 Preprocessor1_Model04
## 2 0.000836 mn_log_loss binary 0.417 3 0.00374 Preprocessor1_Model05
## 3 0.000408 mn_log_loss binary 0.417 3 0.00395 Preprocessor1_Model03
## 4 0.00120 mn_log_loss binary 0.417 3 0.00360 Preprocessor1_Model06
## 5 0.000285 mn_log_loss binary 0.417 3 0.00405 Preprocessor1_Model02
# Pull out best model
# Really, want to make sure you use a loss *relevant to your real problem* for
# final selection of model!
lr_final_mod <- lr_tune_cv_fits |> select_best(metric = "mn_log_loss")
Since we have our held-back test data from much earlier, we can now refit the “best” model to all the data we used during cross-validation (using last_fit) and estimate the test error. Remember to bring your choice of metrics forward to here at ???:
# First, finalise the workflow to only the best model
lr_final_wf <- lr_tune_wf |>
finalize_workflow(lr_final_mod)
# Then fit it ... note we now use the credit_data_split from way back at the start
# of the lab, so that we use all of train_data (which was split for CV) to
# fit and then test on the test data
lr_final_fit <-
lr_final_wf |>
last_fit(credit_data_split,
metrics = ???)
lr_final_fit |> collect_metrics()
# SOLUTION
lr_final_wf <- lr_tune_wf |>
finalize_workflow(lr_final_mod)
lr_final_fit <-
lr_final_wf |>
last_fit(credit_data_split,
metrics = metric_set(mn_log_loss,
accuracy))
lr_final_fit |> collect_metrics()
## # A tibble: 2 × 4
## .metric .estimator .estimate .config
## <chr> <chr> <dbl> <chr>
## 1 accuracy binary 0.808 Preprocessor1_Model1
## 2 mn_log_loss binary 0.443 Preprocessor1_Model1
Try this also for your tuned knn model:
# Repeat for your KNN model!
# SOLUTION
knn_final_mod <- knn_tune_cv_fits |> select_best(metric = "accuracy")
knn_final_wf <- knn_tune_wf |>
finalize_workflow(knn_final_mod)
knn_final_fit <-
knn_final_wf |>
last_fit(credit_data_split,
metrics = metric_set(mn_log_loss,
accuracy))
knn_final_fit |> collect_metrics()
## # A tibble: 2 × 4
## .metric .estimator .estimate .config
## <chr> <chr> <dbl> <chr>
## 1 accuracy binary 0.775 Preprocessor1_Model1
## 2 mn_log_loss binary 0.594 Preprocessor1_Model1
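Finally, if you want the fitted model itself, for example to predict on genuinely new loans, one way is to pull the finalised, fitted workflow out of the last_fit() result with extract_workflow(); a sketch, using a few test rows purely to stand in for new data:
# Extract the workflow fitted on the full training data ...
lr_final_wflow <- lr_final_fit |> extract_workflow()
# ... and predict as usual; here a few test rows stand in for "new" loans
predict(lr_final_wflow, new_data = test_data |> slice(1:5), type = "prob")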