$$ \require{cancel} \newcommand{\given}{ \,|\, } \renewcommand{\vec}[1]{\mathbf{#1}} \newcommand{\vecg}[1]{\boldsymbol{#1}} \newcommand{\mat}[1]{\mathbf{#1}} \newcommand{\bbone}{\unicode{x1D7D9}} $$

Chapter 2 Introduction

These course notes accompany the taught APTS week. Note that this is an intensive, time-constrained course, so these notes are far from exhaustive; rather, they provide a rapid introduction to a curated subset of the topic, spanning theory, methods, and application, with appropriate signposting to help you orientate yourself in the literature.

In particular, the aim of APTS is to “provide courses which will be attractive and relevant to the research preparation and background education of all statistics and probability PhD students”. Therefore, although these notes are at an introductory (graduate) level and do not assume prior exposure to machine learning, some details that BSc/MSc level material would include are omitted. You are strongly encouraged to practise following the extensive references provided into the primary research papers, to fill in further details of interest or to find proofs and derivations.

2.1 Live code

In this course, we consider code to be a first-class citizen in your learning: to apply or understand the content, it is often helpful to see and play with illustrative code. We will be using the statistician’s favourite tool, R (R Core Team, 2021)! Some code, especially if it is long-running, will simply be displayed inline, like the following:

x <- rnorm(10) # simulate 10 standard Normal draws
mean(x)        # sample mean of the simulated values

However, where a fast-running toy example is possible, it will be provided in a “live code” block, where you can make changes and re-run the code to observe the effect. These appear like this:

x <- rnorm(20) # simulate 20 standard Normal draws
mean(x)        # sample mean of the simulated values

You can run the code inline by clicking the “Run code!” button without leaving these notes and losing your flow. Also, since the code is remotely executed this will work even on mobile devices like an iPad or iPhone where R is not supported.

Where there is more than one live block on the page, you can link them together into a persistent session, akin to an RStudio Notebook or Jupyter Notebook, by enabling the “Persistent” toggle. Note that, as in those other notebook environments, the order in which you run blocks matters: if you run a second block that tries to access a variable x before the first block assigns to it, you will encounter an error. Therefore, when needed, you can evaluate all preceding blocks with the “Run previous” button, which appears when you enable a persistent session. To see this, click the “Persistent” toggle on and try running the next block: this results in an error, because the first code block has not been run with persistence enabled. Then run the first block before trying this one again, which should result in the value of x printing correctly:

print(x)

It is possible to include plots and other more complex output too. To see this, modify the code above to add hist(x) on line 2, as shown below, then click the “Run code!” button again and either “Zoom” or scroll to see the histogram (note that you don’t need to rerun the first block this time, because your session now has x defined).
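That is, after your edit the live block above should read:

print(x) # print the values of x from the persistent session
hist(x)  # plot a histogram of x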

The server running these code chunks is a shared and public resource, so please be considerate. You should also assume that anything you run may be publicly visible, so do not run anything confidential!

2.2 Topic background

As Larry Wasserman (2014) puts it, “Statistics is the science of learning from data. Machine Learning (ML) is the science of learning from data. These fields are identical in intent although they differ in their history, conventions, emphasis and culture.”

This course is primarily concerned with supervised machine learning which, despite the modern buzz, has been around for a considerable time in other guises. An early related moniker was “pattern recognition” and indeed, as early as 1968 one can find review papers packed with methods that are familiar to the modern practitioner of machine learning (Nagy, 1968). It was recognised even then that this was something of a distinct field drawing on many others, especially statistics.

Around this time there was also cogitation within the statistics community, with some encouraging a broadening of scope from the focused mathematical ideals of the day: in his treatise “The Future of Data Analysis”, John Tukey (1962) advocated passionately for statistics to shift focus. Jerome Friedman (1998) revived the debate about whether statistics should embrace the other blossoming disciplines seeking to learn from data, with pattern recognition being joined by “data mining”, “machine learning” and “artificial intelligence” in a list of related fields sharing similar goals. Contemporaneously, Leo Breiman (2001a) was more forthright, arguing clearly in favour of statistics embracing algorithmic methods and a focus on predictive accuracy which is common in machine learning. Most recently, Brad Efron (2020) wrote a compelling account relating modern machine learning methods to traditional statistical approaches.

However, there are certainly critics:

“Machine learning is statistics minus checking of models and assumptions.”

Brian Ripley, useR! 2004, Vienna

Although this epigrammatic characterisation of machine learning is oft humorously quoted and perhaps a little unfair, it does reflect the fact that a machine learning analysis can exhibit a markedly different feel to other forms of statistical modelling. A machine learner may (perhaps unwisely) retort that many black box algorithmic methods do not have a specific model and make minimal assumptions.

It is interesting to read the above references for some historical context, but it should be uncontroversial to say that today there are sufficiently many statisticians working on machine learning theory, methods, and applications that statistics certainly has a voice in the machine learning community. Indeed, it has “had a stimulating effect on traditional theory and practice” (Efron, 2020).

Suffice it to say, we will therefore avoid attempting to dichotomise statistics and machine learning, instead attempting only a loose characterisation of what we mean by supervised machine learning for the purposes of this course: a collection of modelling approaches predominantly concerned with accurate predictive modelling, frequently employing algorithmic or black-box methodology which does not necessarily have an interpretable parametrisation, and where parameter uncertainty is often of secondary concern or neglected entirely. It is a subtle distinction, but it does matter that supervised machine learning is concerned primarily with prediction (whilst statistics is broader, encompassing explanatory modelling, estimation, and attribution too (Efron, 2020; Shmueli, 2010)).

What therefore of the “Statistical” prefix in the course title? From the perspective of a statistician, an unfortunate feature of some (not all) corners of machine learning practice is a failure to fully evaluate the uncertainty of predictions produced by these methods, a disregard for proper scoring rules, or a contentment to set aside a mathematical treatment of the underpinning theory and methods. As such, the prefix “Statistical” is to highlight the emphasis we will place upon uncertainty quantification in predictive modelling and a desire to develop some mathematical understanding of foundations where it helps.

Work attempting to trace some of the historical development and coexistence of machine learning and statistics can be found in Malik (2020) and Jones (2018). The introduction of Vapnik (1998) argues that the historical branching of machine learning theory from its statistical origins happened through four key discoveries in the 1960s–80s.

2.3 Organisation of the course

We start in the next Chapter by formalising the statement of the supervised machine learning problem which will be the focus of this course, and then introduce the learning theory background which provides a mental model for the methodology and application concerns to follow.

We then proceed to examine some modelling approaches, or learning algorithms. The unifying course title “Statistical Machine Learning” hopefully hints that we have no desire to engage in the debate about whether a particular method is “statistics”, “machine learning” (or the even more poorly defined “artificial intelligence”). However, there will be a focus on local methods, trees, and ensembles, which you are less likely to have encountered if coming from a purely statistical modelling background: this is categorically not to be construed as an endorsement that they are superior techniques (for the record, the author’s predilection is for Bayesian probabilistic models!). The APTS Statistical Modelling course (Ogden et al., 2021) is highly recommended for background on some fully probabilistic statistical modelling approaches.

Following this, we examine concerns around the practical application of these methods, including the necessarily empirical evaluation of models, and model selection and combination.

Finally, we devote some time to exploring the popular frameworks in the R (R Core Team, 2021) language for creating a more robust and streamlined analysis pipeline.

References

Breiman, L. (2001a). Statistical modeling: The two cultures. Statistical Science 16(3), 199–231. DOI: 10.1214/ss/1009213726

Efron, B. (2020). Prediction, estimation, and attribution. Journal of the American Statistical Association 115(530), 636–655. DOI: 10.1080/01621459.2020.1762613

Friedman, J.H. (1998). Data mining and statistics: What’s the connection? Computing Science and Statistics 29(1), 3–9.

Jones, M.L. (2018). How we became instrumentalists (again): Data positivism since World War II. Historical Studies in the Natural Sciences 48(5), 673–684. DOI: 10.1525/hsns.2018.48.5.673

Malik, M.M. (2020). A hierarchy of limitations in machine learning. arXiv (2002.05193). URL https://arxiv.org/abs/2002.05193

Nagy, G. (1968). State of the art in pattern recognition. Proceedings of the IEEE 56(5), 836–863. DOI: 10.1109/PROC.1968.6414

Ogden, H.E., Davison, A.C., Forster, J.J., Woods, D.C., Overstall, A.M. (2021). APTS: Statistical Modelling Notes. URL https://warwick.ac.uk/fac/sci/statistics/apts/students/resources/statmod-notes.pdf

R Core Team (2021). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL https://www.R-project.org/

Shmueli, G. (2010). To explain or to predict? Statistical Science 25(3), 289–310. DOI: 10.1214/10-STS330

Tukey, J.W. (1962). The future of data analysis. The Annals of Mathematical Statistics 33(1), 1–67. DOI: 10.1214/aoms/1177704711

Vapnik, V.N. (1998). The Nature of Statistical Learning Theory, 2nd ed. Springer. ISBN: 978-0387987804

Wasserman, L.A. (2014). Rise of the machines, in: Lin, X., Genest, C., Banks, D.L., Molenberghs, G., Scott, D.W., Wang, J.-L. (Eds.), Past, Present, and Future of Statistical Science. Chapman & Hall/CRC, pp. 525–536. DOI: 10.1201/b16720