
The Course

Welcome to the Data Science and Statistical Computing module (MATH2687) on the 2nd year undergraduate programme at Durham University.

If you want to reach these notes quickly without navigating via Durham Ultra, simply bookmark this page (or remember the shortcut URL https://www.louisaslett.com/dssc). Some administrative questions crop up regularly, so I have created a Frequently Asked Questions (FAQ) page in the appendix: please check there first to see if your question is answered. If it is not, then please email me.

These notes will update as the course progresses, so check back each week! At the moment, this is just a welcome page, but additional chapters will appear as we work through the course.

To analyse real-world problems involving data, we need not only methodological and theoretical knowledge of statistics, but also strong computational thinking and skills. This course has a lecture component introducing computational statistics, and a slightly larger practical component introducing the modern programming environments and tools used extensively by statisticians, data scientists and machine learners in both academia and industry.

The component on computational statistics introduces sampling methodology and Monte Carlo methods, which underpin many aspects of modern Bayesian statistics and statistical machine learning. The unusual name, “Monte Carlo”, is a reference to the famous Monte Carlo Casino in Monaco: it was chosen by the pioneers of the field because of the need for a code name for the work, which was part of the secret development of the atomic bomb at Los Alamos during the Second World War. The course starts with basic principles of finite sampling and the bootstrap, a technique that enables us to loosen the assumptions underlying many statistical methods by deploying computer power. Attention then turns to sampling from probability distributions, which lets us bypass the usually intractable integrals encountered in many parts of statistics by appealing to convergence of random sample averages, rather than using complex algebra or naive grid-based methods (like Simpson’s rule).
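To give a flavour of both ideas, here is a minimal R sketch (it is not taken from the course material). The first part uses the bootstrap to estimate the standard error of a sample median by resampling the data with replacement. The second part approximates an integral by a random sample average and compares it with R's deterministic quadrature. The simulated data x, the integrand g, and the sizes B and n are illustrative assumptions chosen purely for this example.

```r
set.seed(1)  # fix the seed so the sketch is reproducible

# --- Bootstrap: standard error of a median without a distributional formula ---
x <- rexp(50)   # hypothetical data set of 50 observations
B <- 2000       # number of bootstrap resamples (illustrative choice)

# Resample the data with replacement B times, recomputing the median each time
boot_medians <- replicate(B, median(sample(x, replace = TRUE)))
boot_se <- sd(boot_medians)   # bootstrap estimate of the standard error

# --- Monte Carlo integration: integral of exp(-x^2) over [0, 1] ---
# The integral equals E[g(U)] for U ~ Uniform(0, 1), so a sample average
# of g(U) converges to it as n grows.
g <- function(x) exp(-x^2)
n <- 100000
g_values <- g(runif(n))
mc_estimate <- mean(g_values)        # random sample average
mc_se <- sd(g_values) / sqrt(n)      # Monte Carlo standard error

# Reference value from deterministic numerical quadrature
reference <- integrate(g, lower = 0, upper = 1)$value

print(c(bootstrap_se = boot_se))
print(c(estimate = mc_estimate, std_error = mc_se, reference = reference))
```

Increasing n should shrink the Monte Carlo standard error at a rate of roughly one over the square root of n, which is the kind of convergence the course will make precise.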

The practical component happens via computer labs and will centre around the R programming language. R, together with a host of related software, provides an incredibly powerful toolset for data science. We will introduce the language from first principles, then move quickly on to the modern packages which enable rapid data manipulation, visualisation (including interactive and 3D visualisations), modelling, report building and construction of interactive dashboards to present data in a compelling fashion. Together, these tools will complement your theoretical and methodological statistics courses, provide invaluable preparation for applied final year projects, and give you skills that are used and sought after in industry.

Course structure

The mathematical component of the course, covering computational statistics, will be delivered via 1 lecture per week and 1 tutorial every other week, together with occasional formative assignments during the term. The data science component of the course will be delivered via 2 practical class hours: one is delivered in a lecture style, whilst the other is a traditional small group practical with detailed problems to work through and someone available to answer questions.

You will see these notes are split into parts on the left as follows:

  • Part I Computational Statistics: these are the notes that go with the main mathematics lectures.
  • Part II R Programming: these are the slides that relate to the programming lectures.
  • Part III Practical Labs: these are the exercise sheets for the hands-on programming practicals.
  • Part IV Tutorials: these are the tutorial sheets for the mathematical part of the module. Solutions will be posted after all tutorial groups have taken place.

R + RStudio

We will be using the statistician’s favourite tool, R (pronounced like the letter, not an enthusiastic pirate 🏴‍☠️!)

There are three ways you can run R and RStudio, which we cover in detail in Appendix A “R and RStudio”. They are: (i) using a GitHub Codespaces cloud server, with an image created especially for the course; (ii) using the standard AppsAnywhere application on CIS or your own machine; (iii) installing and running the software on your own laptop. Option (i) is recommended for everyone, and if you have a laptop, option (iii) is good experience.

Live code

In this course, we consider code to be a first-class citizen in your learning! Some code in these notes, especially if it is long-running, will simply be displayed inline like the following:

x <- rnorm(10)
mean(x)

However, where a fast-running example is possible, it will be provided in a “live code” block, where you can make changes and re-run the code to observe the effect. These appear like this:

x <- rnorm(20)
mean(x)

You can run the code inline by clicking the “Run code!” button, without leaving these notes and losing your flow. Also, since the code is executed remotely, this will work even on mobile devices like an iPad or iPhone where R is not supported.

It is possible to include plots and other more complex output too. To see this, modify the code above to add hist(x) on line 3, click the “Run code!” button again and either “Zoom” or scroll to see the histogram.
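For reference, a sketch of what the modified block would look like, assuming the histogram call is simply appended as the third line:

```r
x <- rnorm(20)
mean(x)
hist(x)
```

Because x is drawn at random, both the mean and the shape of the histogram will change each time the code is re-run.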

The server running these code chunks is a shared and public resource, so please be considerate. You should also assume that anything you run may be publicly visible, so do not run anything confidential!

Note that these live code blocks are just for quick examples and are not a substitute for installing R on your device, because RStudio offers many features that the live code blocks cannot provide.