\require{cancel} \newcommand{\given}{ \,|\, } \renewcommand{\vec}[1]{\mathbf{#1}} \newcommand{\vecg}[1]{\boldsymbol{#1}} \newcommand{\mat}[1]{\mathbf{#1}} \newcommand{\bbone}{\unicode{x1D7D9}}

Chapter 3 The Bootstrap


โช โฉ

Monte Carlo testing helped us enormously by eliminating the need to analytically calculate the sampling distribution of the test statistic under the null hypothesis. We did still need to know the null distribution itself, which is fine when we are testing, but if we turn attention to standard errors and confidence intervals the picture is less clear. What if I don't really know the distribution that generated the data and don't have any particular hypothesis about it, but simply want to generate uncertainty estimates about some statistic (mean, median, …) of interest?

Can we perhaps use simulation to avoid even having to specify the unknown true population distribution, and avoid having to derive the sampling distribution of the statistic we want to estimate? Yes! We are going to dive straight in and describe a solution immediately and return to the formality later. The technique we cover now is called the Bootstrap.

Definition 3.1 (Bootstrap estimate and standard error) Assume we are given a data set consisting of independent samples \vec{x} = (x_1, x_2, \dots, x_n) and a statistic S(\cdot). Then, to construct a bootstrap estimate of the standard error of the statistic S(\cdot), draw B new samples of size n with replacement from \vec{x} = (x_1, x_2, \dots, x_n), giving \vec{x}^{\star 1}, \dots, \vec{x}^{\star B} and compute: \widehat{\text{Var}}(S(\vec{x})) = \frac{1}{B-1} \sum_{b=1}^B \left(S(\vec{x}^{\star b}) - \bar{S}^\star\right)^2 where \bar{S}^\star = \frac{1}{B} \sum_{b=1}^B S(\vec{x}^{\star b}).

So, in the above definition, the x_i could be mouse lifetimes and S(\cdot) could be the mean (or median or โ€ฆ). We refer to drawing a sample from \vec{x} = (x_1, x_2, \dots, x_n) of size n with replacement as resampling the data.

Example 3.1 (mouse survival data (cont.) 🐭) To perform a Bootstrap estimation of the mean (ie S(\cdot) is the sample mean) and the standard error of that estimate, we proceed to resample from our treatment and control group of mouse observations. So, a resample of \vec{x} = (x_1, x_2, x_3, x_4, x_5, x_6, x_7) might, for example, be: \begin{align*} \vec{x}^{\star1} &= (x^{\star1}_1 = x_5, x^{\star1}_2 = x_7, x^{\star1}_3 = x_5, x^{\star1}_4 = x_2, x^{\star1}_5 = x_1, x^{\star1}_6 = x_7, x^{\star1}_7 = x_4) \\ &= (x_5, x_7, x_5, x_2, x_1, x_7, x_4) \end{align*} meaning that, S(\vec{x}^{\star1}) = \frac{x_5 + x_7 + x_5 + x_2 + x_1 + x_7 + x_4}{7} Upon repeated resampling in this way, we end up with many resampled datasets of size 7, and can compute the mean (ie here \bar{S}^\star is the mean of means since S(\cdot) is the sample mean) and standard error using the above. This leads to the following:

🚀 louisaslett.com/surfeR
# Mouse data
x <- c(94,197,16,38,99,141,23)
# Number of bootstraps
B <- 1000
# Statistic
S <- mean
# Perform bootstrap
S.star <- rep(0, B)
for(b in 1:B) {
  x.star <- sample(x, replace = TRUE)
  S.star[b] <- S(x.star)
}
# Bootstrap estimate
S(x)
# Standard error of estimate
sd(S.star)

What is remarkable about the Bootstrap is that I can now decide to estimate the median and get the standard error of that statistic simply by changing the line defining the statistic to be:

S <- median

and that's it … no new algebra, no new asymptotic theory to prove, no new assumptions about the distribution of mouse lifetimes! Try it for yourself and see how the median estimator and its standard error differ from those of the mean.
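For instance, repeating the earlier live code with the statistic swapped out looks like the following sketch (the resampling loop is unchanged):

```r
# Mouse treatment data, as in the example above
x <- c(94, 197, 16, 38, 99, 141, 23)
# Number of bootstraps
B <- 1000
# Statistic is now the median
S <- median
# Perform bootstrap exactly as before
S.star <- rep(0, B)
for(b in 1:B) {
  x.star <- sample(x, replace = TRUE)
  S.star[b] <- S(x.star)
}
# Bootstrap estimate and standard error
S(x)
sd(S.star)
```

Only the line defining S changes; everything else in the bootstrap is identical.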

Before we get carried away, we should note that there are plenty of situations in which the Bootstrap can fail. We'll examine some of these, and in some cases modifications can 'rescue' it, but it remains a surprisingly general method with an elegant simplicity.

Of course, we should note that we just expended 1000\times more computational effort to calculate this than to calculate the standard error of the mean assuming Normality, but thanks to modern computer power this happened in the blink of an eye. Naturally, for harder problems and bigger datasets the computational cost can become noticeable, but things look promising that we can trade some computer effort to make fewer assumptions and get rid of routine mathematical derivations.

3.1 Light relief aside

Where does the name "bootstrap" come from? Boots often have a small strap at the back which can be pulled to help when putting them on. In the 19th century the idiom "to pull oneself up by one's bootstraps" was often used to convey the sense of something that is completely infeasible (I can't levitate by just continuing to pull my bootstraps harder once the boots are on!)

For some reason it's often attributed to The Adventures of Baron Munchausen, a story about a German nobleman who tells extravagant tall tales of his ludicrous exploits. In the 18th century book (and modern film adaptation), the Baron does however analogously pull himself out of a swamp by his own hair! 🤣

Why this name then? Well, it would seem to be an impossible task to use the very sample of data we are trying to learn about in order to also understand how sure of our answer we are: the logic feels circular. As we will see, this seemingly impossible task is in fact sensibly grounded, but the humorous name has stuck! 🥾

3.2 The non-parametric bootstrap

The procedure described in Definition 3.1 is more fully called the "non-parametric bootstrap". We have simply described it, but let's try to understand why this procedure does anything sensible at all.

3.2.1 Empirical distribution functions

Recall from Stats/Prob I the cumulative distribution function (cdf), or just distribution function for short:

Definition 3.2 (Distribution function) For any random variable X, taking values in \mathbb{R}, the distribution function F: \mathbb{R} \to [0,1] is defined by, F(x) := \mathbb{P}(X \le x) \quad \forall\ x \in \mathbb{R}

We often subscript the function if we are dealing with multiple random variables and need to distinguish different distribution functions, so F_X(x) = F(x). From Prob I you also know that the distribution function completely defines the probability distribution of the random variable, so defining a distribution function is enough to fully define the behaviour of a random variable.

Example 3.2 (cdfs) The following shows some plots of example cdfs for some well known distributions:

\begin{align*} \text{Black:} & \ \ \text{Standard Normal} &&\implies F(x) = \Phi(x) \\ \color{blue}{\text{Blue:}} & \ \ \color{blue}{\text{Exponential}, \lambda = 1} &&\implies F(x) = 1 - e^{-\lambda x} \\ \color{red}{\text{Red:}} & \ \ \color{red}{\text{Poisson}, \lambda = 1} &&\implies F(x) = e^{-\lambda} \sum_{k=0}^{\lfloor x \rfloor} \frac{\lambda^k}{k!} \\ \color{green}{\text{Green:}} & \ \ \color{green}{\text{Binomial}, n=4, p=0.7} &&\implies F(x) = \sum_{k=0}^{\lfloor x \rfloor} \binom{n}{k} p^k (1-p)^{n-k} \end{align*}
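Plots of these cdfs can be reproduced with base R's built-in distribution functions; a minimal sketch (colours match the list above, and the plotting range is an illustrative choice):

```r
# Plot the four example cdfs on common axes
xx <- seq(-2, 6, length.out = 400)
plot(xx, pnorm(xx), type = "l", col = "black", xlab = "x", ylab = "F(x)")
lines(xx, pexp(xx, rate = 1), col = "blue")
# Discrete distributions have step-function cdfs, so plot as steps
lines(xx, ppois(xx, lambda = 1), type = "s", col = "red")
lines(xx, pbinom(xx, size = 4, prob = 0.7), type = "s", col = "green")
```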

However, we can construct cdfs in other ways, including building a so-called empirical cdf, based on some collected data:

Definition 3.3 (Empirical distribution function) Given a sample of data (x_1, \dots, x_n), each sample an iid realisation of some random variable X, we define the empirical cumulative distribution function (ecdf) (or just empirical distribution function) to be: \hat{F}(x) := \frac{1}{n} \sum_{i=1}^n \bbone\{x_i \le x\} \quad \forall\ x \in \mathbb{R}

Notice that this is indeed a valid cdf:

  1. \lim_{x \to -\infty} \hat{F}(x) = 0 and \lim_{x \to +\infty} \hat{F}(x) = 1
  2. Monotonicity: For any x' < x, \hat{F}(x') \le \hat{F}(x)
  3. Right-continuity: For any x \in \mathbb{R}, \hat{F}(x) = \hat{F}(x^+) where \hat{F}(x^+) is the limit from the right, \hat{F}(x^+) = \lim_{t \downarrow x} \hat{F}(t)

Slightly more informally, \hat{F}(x) := \frac{\#\{x_i \le x\}}{n} where \# counts how many times the bracketed condition holds. ie the empirical distribution function is given by the proportion of observations which are at most equal to x.
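This counting form translates directly to R; a quick sketch using the mouse treatment data from Example 3.1:

```r
# The ecdf by direct counting, versus R's built-in ecdf()
x <- c(94, 197, 16, 38, 99, 141, 23)   # mouse treatment data
F.hat <- function(t) mean(x <= t)      # proportion of observations <= t
F.hat(50)      # 3 of the 7 observations are <= 50, so 3/7
ecdf(x)(50)    # R's ecdf object gives the same value
```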

Example 3.3 (mouse survival data (cont.) ๐Ÿญ) The empirical distribution function of the treatment group of mice, \vec{x} = (x_1 = 94, x_2 = 197, x_3 = 16, x_4 = 38, x_5 = 99, x_6 = 141, x_7 = 23) is:

Notice that when we say we 'believe' data follows a particular distribution and then try to infer the parameters for it, we are effectively imposing a particular smoothed form on the cdf. So, if we say we believe the mouse data are Normal, then we might estimate \hat{\mu} = 86.86, \hat{\sigma} = 66.77 and we can then examine the empirical cdf and Normal cdfs on the same plot:

So when we perform statistical inference to fit a particular distribution to some data, we are effectively trying to estimate the probability distribution which that random variable follows. Commonly we do this by first making an assumption about the distribution family (eg Normal, Exponential, …) and then use maximum likelihood or Bayesian methods to estimate the parameters. The empirical distribution function is a "non-parametric" way to think about this, which does not specifically assume a parametric distributional form for the random variable.

A really important property of the empirical cdf which justifies thinking of it as an alternative way of fitting a distributional form to data is the following, which we state without proof:

Theorem 3.1 (Glivenko-Cantelli) Let X_1, \dots, X_n be a random sample taken from some probability distribution with cdf F(x). Then, \sup_{x \in \mathbb{R}} |\hat{F}(x) - F(x)| \to 0 \quad \text{as} \ n \to \infty where convergence is in probability.

In other words, the theorem above assures us that as we collect more and more data, the empirical cdf will converge to the true cdf. The following live code lets you see this in action. In it we simulate from a standard Normal, then compute the ecdf, and finally compare the plots for the ecdf and the cdf of a standard Normal.

# Decide the simulation size
n <- 10
# Simulate data from a standard Normal
x <- rnorm(n)
# Compute the ecdf
e <- ecdf(x)
# Plot the ecdf
plot(e)
# Overlay the true cdf
xx <- seq(-4, 4, length.out = 200)
lines(xx, pnorm(xx), lty = 2)

Change n at the top of the code to 100, 1000 and then 10000 and notice that the empirical cdf becomes indistinguishably close to the dashed line (true cdf).

3.2.2 Sampling with replacement and ecdfs

In fact, the procedure of sampling with replacement from some data is equivalent to sampling from the ecdf \hat{F} of that data.

Lemma 3.1 Given data \vec{x} = (x_1, \dots, x_n), sampling uniformly at random from the data is equivalent to sampling from the probability distribution defined by the empirical distribution function \hat{F}(\cdot) constructed using \vec{x}.

Proof (sketch). Simply show the cdf of resampling is the ecdf.

Note drawing a sample with replacement from \vec{x} = (x_1, \dots, x_n) means that we are drawing a sample from the discrete valued random variable which assigns equal probability mass \frac{1}{n} at each x_i.

Then, simply apply the result from Prob I that for a probability mass function p, F is piecewise constant, F(x) = \sum_{t:t \le x} p(t) Expanding the sum, we immediately observe F(x) \equiv \hat{F}(x) here.

In other words,

  • we have seen the ecdf approximates the true cdf
  • bootstrap resampling is equivalent to sampling n times from the ecdf
  • therefore, bootstrap is equivalent to first fitting an ecdf to the data and then sampling from it as though this was a fitted distribution
  • (… notice the likeness to doing a Monte Carlo sample of the same size as the data, treating \hat{F} as a "known" distribution)

This is the logic behind bootstrap sampling: we are effectively approximating what we would do if we knew the true distribution, which is to draw samples from it (see Parametric Bootstrap later …)

As noted in the sketch proof of Lemma 3.1, the ecdf defines a new simple discrete random variable whose probability mass function assigns probability \frac{1}{n} at each value x_i. To be clear: this defines a new random variable, which we will call Y, whose ecdf is defined by this fixed, known set of data \{ x_1, \dots, x_n \}, and with corresponding probability mass function: p(y) = \mathbb{P}(Y = y) = \begin{cases} \frac{1}{n} & \text{ if } y \in \{ x_1, \dots, x_n \} \\ 0 & \text{ otherwise} \end{cases} Note that if x_i = x_j for some i \ne j then the probability mass at that repeated value is a multiple of \frac{1}{n} rather than \frac{1}{n} itself.

Example 3.4 (expectation and variance of ecdf) Since the ecdf corresponds to such a simple probability mass function with support \mathcal{X} = \{ x_1, \dots, x_n \}, we can actually compute the exact expectation and variance for a single sample from it very easily (recall the definition of expectation and variance for a discrete random variable from Prob I). \begin{align*} \mathbb{E}[Y] &= \sum_{y \in \mathcal{X}} y \, p(y) \\ &= \sum_{i=1}^n x_i \, \mathbb{P}(Y=x_i) \\ &= \sum_{i=1}^n x_i \frac{1}{n} \\ &= \bar{x} \end{align*} \begin{align*} \text{Var}[Y] &= \sum_{y \in \mathcal{X}} (y - \mathbb{E}[Y])^2 p(y) \\ &= \sum_{i=1}^n (x_i - \bar{x})^2 \frac{1}{n} \\ &= \frac{1}{n} \sum_{i=1}^n (x_i - \bar{x})^2 \\ &= \frac{n-1}{n} \frac{1}{n-1}\sum_{i=1}^n (x_i - \bar{x})^2 \\ &= \frac{n-1}{n} s_x^2 \end{align*} In other words, the expectation is the mean of the data used to construct the ecdf, and the variance is the usual sample variance of the data (x_1, \dots, x_n), apart from the factor \frac{n-1}{n}.

Similarly, if we look at the mean of a sample of m draws from an ecdf constructed on n data points, then the expectation and variance of the mean are: \mathbb{E}[\bar{Y}] = \bar{x} \quad\text{and}\quad \text{Var}[\bar{Y}] = \frac{n-1}{n} \frac{s_x^2}{m} so again the standard results, but with the same factor in front of the variance. This is left as an easy exercise for you to derive.
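This result is easy to check by simulation; a quick sketch using the mouse data, with a hypothetical choice of m = 5:

```r
# Simulation check that Var[Y-bar] = (n-1)/n * s^2/m
set.seed(1)
x <- c(94, 197, 16, 38, 99, 141, 23)
n <- length(x)
m <- 5
# Mean of m draws from the ecdf, repeated many times
ybar <- replicate(1e5, mean(sample(x, m, replace = TRUE)))
var(ybar)                # simulated variance of the mean of m draws
(n - 1)/n * var(x) / m   # exact value from the formula
```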

The above example shows that some computations with the ecdf can be very simple and lead to quite a striking insight into what bootstrap does. So in light of Example 3.4, let us revisit what is actually happening in the mouse example we gave at the start.

Example 3.5 (mouse survival data (cont.) 🐭) We now know that when we bootstrap resampled the mouse data in Example 3.1, we were actually taking samples from the ecdf defined by the 7 treatment group observations. We then initially proceeded to estimate the standard error when the statistic S(\cdot) was the mean (after that doing the median).

However, we just saw in Example 3.4 that the expectation and variance of the mean when sampling from an ecdf can be trivially computed. Therefore, for a statistic as simple as the mean we can actually compute the bootstrap standard error of our statistic without even doing the simulations at all! This is because if S(\cdot) is the mean, then \widehat{\text{Var}}(S(\vec{x})) \to \frac{n-1}{n} \frac{s^2}{n} \quad\text{as}\quad B \to \infty where s^2 is the usual sample variance.

The mouse code from Example 3.1 is repeated here, with one line added which calculates the exact bootstrap standard error at the end.

# Mouse data
x <- c(94,197,16,38,99,141,23)
# Number of bootstraps
B <- 1000
# Statistic
S <- mean
# Perform bootstrap
S.star <- rep(0, B)
for(b in 1:B) {
  x.star <- sample(x, replace = TRUE)
  S.star[b] <- S(x.star)
}
# Bootstrap estimate
S(x)
# Standard error of estimate
sd(S.star)
# Standard error based on ecdf
sqrt(6/7 * var(x)/7)

Rerun this a few times and you should see that the estimated bootstrap standard error indeed hovers around the exact bootstrap standard error of the mean under the corresponding ecdf, and as you increase B it converges to it.

Of course, the power of the bootstrap is really in the fact it can be used for statistics other than the mean. In those cases, there usually won't be a nice closed form for the variance of the statistic and so actually performing the resampling and computing the estimator \widehat{\text{Var}}(S(\vec{x})) will be required (eg for the median).

3.2.3 Special cases

There are many important special cases in which the standard bootstrap procedure needs more care. These include:

  • Missing data
    • eg sometimes not all variables are collected for all observations, leading to so-called missingness
  • Time-series data
    • eg stock prices, where there is a strong serial dependence and the assumption of independence is not valid
  • Censored data
    • eg data on failure times of an engineered system: we may run an experiment and observe some failures but have other units which still work at the end of the experiment, which are called censored observations

You may not have studied the above areas before, so we only mention them as a "forward reference" to be aware of if you encounter them in later studies and think of applying the bootstrap!

One special case that we can study now, with your existing knowledge, is how the bootstrap works with finite population sizes.

3.2.3.1 Finite populations

The standard non-parametric bootstrap we have seen so far implicitly assumes a theoretically infinite population size, as is the standard setting in a lot of statistical problems. Of course, in many instances the population is not really infinite, but it generally works ok to assume so if the population size, N, is very large compared to our sample size, n. However, if the sampling fraction, f = \frac{n}{N}, exceeds about 0.1 (rule of thumb) then the effect of the finite population size (and without replacement sampling) cannot be dismissed.

Why? What is different with finite populations?

Recall we use \mu for the true population mean and \sigma^2 for the true population variance. Then, although \bar{X} = \frac{1}{n} \sum_{i=1}^n X_i is an unbiased estimate of the population mean \mu, the variance of this estimator, \text{Var}(\bar{X}), is no longer \frac{\sigma^2}{n} as we are used to in the infinite population case.

Theorem 3.2 (Finite population variance of mean) Given a finite population of size N, the sample mean \bar{X} based on a sample of size n drawn uniformly at random without replacement has variance, \text{Var}(\bar{X}) = \left( \frac{N-n}{N-1} \right)\frac{\sigma^2}{n}

Proof. Let us directly derive this in the finite population case: \begin{align*} \text{Var}(\bar{X}) &= \text{Var}\left( \frac{1}{n} \sum_{i=1}^n X_i \right) \\ &= \frac{1}{n^2} \text{Var}\left(\sum_{i=1}^n X_i\right) \\ &= \frac{1}{n^2} \left[ \sum_{i=1}^n \text{Var}\left( X_i \right) + \sum_{i \ne j} \text{Cov}\left( X_i, X_j \right) \right] \\ \end{align*} Let us write \text{Cov}\left( X_i, X_j \right) = c (\ne 0) for brevity. Then, \begin{align*} \text{Var}(\bar{X}) &= \frac{1}{n^2} \left[ n \sigma^2 + \sum_{i \ne j} c \right] \\ &= \frac{1}{n^2} \left[ n \sigma^2 + n(n-1) c \right] \\ &= \frac{1}{n} \left[ \sigma^2 + (n-1) c \right] \\ \end{align*} How do we determine c? If we were to sample the whole population (possible since it is finite), then by definition \bar{X} = \mu and \text{Var}(\bar{X}) = 0 (there is no randomness in the mean when we have sampled everything!) Therefore, \begin{align*} \text{Var}(\bar{X}) = 0 &= \frac{1}{N} \left[ \sigma^2 + (N-1) c \right] \\ \implies c &= \frac{-\sigma^2}{N-1} \end{align*} Hence, \begin{align*} \text{Var}(\bar{X}) &= \frac{1}{n} \left[ \sigma^2 + (n-1) \frac{-\sigma^2}{N-1} \right] \\ &= \frac{\sigma^2}{n} \left( 1 - \frac{n-1}{N-1} \right) \\ &= \frac{\sigma^2}{n} \left( \frac{N-n}{N-1} \right) \end{align*} as required.

Some important notes:

  1. \frac{N-n}{N-1} < 1, so in the finite population case the standard error of the sample mean is smaller than in the infinite population case;
  2. If N is sufficiently large, \frac{N-n}{N-1} \approx 1-\frac{n}{N}, so we can also write \text{Var}(\bar{X}) \approx \left( 1-f \right)\frac{\sigma^2}{n} which highlights that the finite population adjustment matters when the sample size is large relative to the population;
  3. As N \to \infty, \text{Var}(\bar{X}) \to \frac{\sigma^2}{n} which is the usual standard error of the sample mean for infinite populations, as expected.
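Theorem 3.2 can also be checked by simulation; a sketch using a small toy population (the population 1:20 and sample size 5 are illustrative choices):

```r
# Simulation check of Theorem 3.2 on a small toy population
set.seed(1)
N <- 20
pop <- 1:N                              # toy finite population
n <- 5
sigma2 <- sum((pop - mean(pop))^2) / N  # population variance (divide by N)
# Many without-replacement samples of size n
xbar <- replicate(1e5, mean(sample(pop, n, replace = FALSE)))
var(xbar)                         # simulated Var(X-bar)
(N - n)/(N - 1) * sigma2 / n      # value given by the theorem
```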

So we know the theoretical variance of the finite population mean estimator, but this relies on the unknown population variance \sigma^2. We know from Stats I the sample variance is an unbiased estimator for \sigma^2 in the infinite population setting (ie \mathbb{E}[S^2] = \sigma^2). However, we will see this is not so in the finite population setting, so care is needed to compute an unbiased estimate for \text{Var}(\bar{X}).

Theorem 3.3 Given a finite population of size N, and a sample of size n drawn without replacement from it, we have that, \left(1 - \frac{n}{N} \right) \frac{S^2}{n} is an unbiased estimator of \text{Var}(\bar{X})

Proof. First, note that, \begin{align*} \mathbb{E}[S^2] &= \frac{1}{n-1} \mathbb{E}\left[ \sum_i (X_i - \bar{X})^2 \right] \\ &= \frac{1}{n-1} \mathbb{E}\left[ \sum_i (X_i^2 - 2 X_i \bar{X} + \bar{X}^2) \right] \\ &= \frac{1}{n-1} \left( \mathbb{E}\Bigg[ \sum_i X_i^2 \Bigg] - \mathbb{E}\Bigg[ 2 \bar{X} \underbrace{\sum_i X_i}_{n \bar{X}} \Bigg] + \mathbb{E}\Bigg[ \underbrace{\sum_i \bar{X}^2}_{n\bar{X}^2} \Bigg] \right) \\ &= \frac{1}{n-1} \left( \sum_i \mathbb{E}\left[ X_i^2 \right] - 2n \mathbb{E}\left[ \bar{X}^2 \right] + n \mathbb{E}\left[ \bar{X}^2 \right] \right) \\ &= \frac{n}{n-1} \left( \mathbb{E}\Big[ X_i^2 \Big] - \mathbb{E}\Big[ \bar{X}^2 \Big] \right) \end{align*} Therefore, \begin{align*} \mathbb{E}\left[ \left( 1 - \frac{n}{N} \right) \frac{S^2}{n} \right] &= \frac{1}{n} \left( 1 - \frac{n}{N} \right) \mathbb{E}\left[ S^2 \right] \\ &= \frac{1}{\cancel{n}} \left( 1 - \frac{n}{N} \right) \frac{\cancel{n}}{n-1} \left( \mathbb{E}\Big[ X_i^2 \Big] - \mathbb{E}\Big[ \bar{X}^2 \Big] \right) \end{align*} But, we know \text{Var}(X) = \mathbb{E}\Big[X^2\Big] - \mathbb{E}\Big[X\Big]^2, so \begin{align*} &= \frac{1}{n-1} \left( 1 - \frac{n}{N} \right) \left( \left[ \text{Var}\Big(X_i\Big) + \mathbb{E}\Big[X_i\Big]^2 \right] - \left[ \text{Var}\Big(\bar{X}\Big) + \mathbb{E}\Big[\bar{X}\Big]^2 \right] \right) \\ &= \frac{1}{n-1} \left( 1 - \frac{n}{N} \right) \left( \left[ \sigma^2 + \cancel{\mu^2} \right] - \left[ \left( \frac{N-n}{N-1} \right)\frac{\sigma^2}{n} + \cancel{\mu^2} \right] \right) \\ &= \frac{1}{n-1} \left( 1 - \frac{n}{N} \right) \frac{\sigma^2}{n} \left( n - \frac{N-n}{N-1} \right) \\ &= \frac{\sigma^2}{n} \frac{1}{n-1} \frac{N-n}{N} \frac{n(N-1) - (N-n)}{N-1} \\ &= \frac{\sigma^2}{n} \frac{N-n}{N-1} \cdot \frac{n(N-1) - (N-n)}{N(n-1)} \\ &= \frac{\sigma^2}{n} \frac{N-n}{N-1} \end{align*} where the final step follows since n(N-1) - (N-n) = nN - N = N(n-1). Hence, \mathbb{E}\left[ \left( 1 - \frac{n}{N} \right) \frac{S^2}{n} \right] = \text{Var}(\bar{X}) meaning it is an unbiased estimator as required.

So, the problem is laid bare. In the finite population case,

  • The estimated bootstrap standard error of the mean will converge to \widehat{\text{Var}}(S(\vec{x})) \to \frac{n-1}{n} \frac{s^2}{n}
  • The actual standard error of the mean is properly estimated by \left( 1 - \frac{n}{N} \right) \frac{s^2}{n}

Therefore, unless the sampling fraction f = \frac{n}{N} is coincidentally such that 1-f = \frac{n-1}{n} \implies f = n^{-1} (which implies N = n^2), the bootstrap estimate is incorrect. Specifically, it will be too large.

Solution? One option to fix this is to increase the size of each bootstrap resample by exactly the right amount to shrink the estimated standard error. That is, instead of resamples of size n, take resamples of size n' > n. We know from Example 3.4 this will result in \widehat{\text{Var}}(S(\vec{x})) \to \frac{n-1}{n} \frac{s^2}{n'}

Thus, if we select n' = \frac{n-1}{1-f} then \widehat{\text{Var}}(S(\vec{x})) \to \frac{n-1}{n} \frac{s^2}{n'} = (1-f) \frac{s^2}{n} = \left( 1 - \frac{n}{N} \right) \frac{s^2}{n} as required!
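A quick numerical sketch of this correction, using the mouse data with a hypothetical population size N = 21 (chosen so that n' works out to be an integer):

```r
# Correcting the bootstrap for a finite population via resample size n'
set.seed(1)
x <- c(94, 197, 16, 38, 99, 141, 23)
n <- length(x)
N <- 21                        # hypothetical finite population size
f <- n / N                     # sampling fraction, here 1/3
n.prime <- (n - 1) / (1 - f)   # corrected resample size: 6 / (2/3) = 9
B <- 1e4
# Bootstrap with resamples of size n' instead of n
# (round() guards against floating point in n.prime)
S.star <- replicate(B, mean(sample(x, size = round(n.prime), replace = TRUE)))
sd(S.star)                     # standard error from size-n' resamples
sqrt((1 - f) * var(x) / n)     # target finite population standard error
```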

Alas, we have a problem … this is all only for the mean of a finite population. What about some other more interesting choice for S(\cdot)?! 😱 Following is one generic option (of many) to handle this situation.

Definition 3.4 (Population bootstrap) Assume \vec{x} = (x_1, x_2, \dots, x_n) is a sample of data drawn from a finite population of size N, and S(\cdot) is a statistic we wish to use as an estimator. Further assume that N/n = k is an integer.

To perform the population bootstrap, first construct a pseudo-population of size N by repeating the samples \vec{x} k times: \tilde{\vec{x}} = (\underbrace{\underbrace{x_1, \dots, x_n}_{\vec{x}}, \underbrace{x_1, \dots, x_n}_{\vec{x}}, \dots, \underbrace{x_1, \dots, x_n}_{\vec{x}}}_{k \text{ times}})

Then, we replicate the finite population sampling scheme by drawing B new samples of size n without replacement from \tilde{\vec{x}}, giving \vec{x}^{\star 1}, \dots, \vec{x}^{\star B} and then proceed as for the standard bootstrap: \widehat{\text{Var}}(S(\vec{x})) = \frac{1}{B-1} \sum_{b=1}^B \left(S(\vec{x}^{\star b}) - \bar{S}^\star\right)^2 where \bar{S}^\star = \frac{1}{B} \sum_{b=1}^B S(\vec{x}^{\star b}).

If n does not divide N, that is N = kn + m for 0 < m < n, then before each bootstrap sample form the pseudo-population \tilde{\vec{x}} by taking k copies of \vec{x} and adding an additional subsample of size m, drawn without replacement, from \vec{x}.

Example 3.6 We do a quick example where a random sample of n students in the class were given a Fitbit to measure their step count the following day. We're interested in estimating the 0.25 quantile of step counts for the whole class of N=220 students, and an idea of the uncertainty in that estimate: that is, how many steps do at least 75% of students in the class exceed on the day in question?

We'll demonstrate doing this for both cases defined above. First, let's consider the simplest setting where n divides N exactly. For example, let's take n=10:

# Data on step counts for 10 students
x <- c(10449, 6569, 8409, 8801, 8025, 12225, 9609, 8216, 9404, 11321)
# Population size
N <- 220
# Number of population bootstrap resamples
B <- 1000
# Statistic
S <- function(x) {
  quantile(x, probs = 0.25)
}
# Create pseudo population
# NOTE: only for the case where sample size divides population exactly
x.tilde <- rep(x, N/length(x))
# Perform population bootstrap
S.star <- rep(0, B)
for(b in 1:B) {
  x.star <- sample(x.tilde,
                   size = length(x),
                   replace = FALSE) # <= NOTE! No replacement now
  S.star[b] <- S(x.star)
}
# Estimator of 0.25 quantile
S(x)
# Standard error of estimate
sd(S.star)

However, as explained at the end of the definition of the population bootstrap, we aren't limited to sample sizes that divide the population size exactly; it's just a little more complicated, as we have to randomly sample the remainder before each bootstrap resample, as follows.

# Data on step counts for 12 students
x <- c(10449, 6569, 8409, 8801, 8025, 12225, 9609, 8216, 9404, 11321, 7443, 13911)
# Population size
N <- 220
# Number of population bootstrap resamples
B <- 1000
# Statistic
S <- function(x) {
quantile(x, probs = 0.25)
}
# Compute constants k and m for pseudo population size, N = kn + m
# NOTE: this version works for *all* sample sizes
k <- N %/% length(x) # this is integer division, could also do floor(N/length(x))
m <- N %% length(x) # find remainder
cat(paste0("N = ", N, " and n = ", length(x), "\n"))
cat(paste0("Hence, N = ", k, "n + ", m, "\n"))
# Perform population bootstrap
S.star <- rep(0, B)
for(b in 1:B) {
  # Now we must build pseudo population before each resample
  x.tilde <- c(rep(x, k),
               sample(x, m, replace = FALSE))
  x.star <- sample(x.tilde,
                   size = length(x),
                   replace = FALSE) # <= NOTE! No replacement now
  S.star[b] <- S(x.star)
}
# Estimator of 0.25 quantile
S(x)
# Standard error of estimate
sd(S.star)

3.3 The parametric bootstrap

If for some reason we have a strong belief (or perhaps some scientific reason to believe) that the data follow a particular distribution (eg Normal, Exponential, …), then we can often achieve more accurate estimates of standard error by deploying the parametric bootstrap, whereby we estimate the parameters of the distribution and then simulate pseudo data sets from it, rather than resampling the original data.

Definition 3.5 (Parametric bootstrap estimate and standard error) Assume we are given a data set consisting of independent samples \vec{x} = (x_1, x_2, \dots, x_n), which we believe are generated from some distribution parameterised by \theta, F(\cdot \given \theta). To construct a parametric bootstrap estimate of some statistic S(\cdot), and an estimate of the standard error, we:

  1. Find the maximum likelihood estimator \hat{\theta}.
  2. Draw B new samples of size n from \hat{F}(\cdot) = F(\cdot \given \hat{\theta}), giving \vec{x}^{\star 1}, \dots, \vec{x}^{\star B} and compute: \widehat{\text{Var}}(S(\vec{x})) = \frac{1}{B-1} \sum_{b=1}^B \left(S(\vec{x}^{\star b}) - \bar{S}^\star\right)^2 where \bar{S}^\star = \frac{1}{B} \sum_{b=1}^B S(\vec{x}^{\star b}).

Aside from this change to the sampling approach, most other things remain essentially unchanged when using the parametric bootstrap.

Notice that this is very similar in feel to Monte Carlo testing! You might ask, as for Monte Carlo testing, how do we generate samples from F? That will be the focus of the second part of the course! For now, just imagine this involves running a function in R like rnorm(), rexp(), etc. Later we will study the kind of techniques R uses internally, and how to generate samples from all sorts of distributions not built into R.

Example 3.7 As a simple example of parametric bootstrap, let us assume we have data (x_1, \dots, x_n) which we believe is drawn from a Geometric distribution, and we wish to estimate the median (and uncertainty in it).

Recall, if X \sim \text{Geometric}(\theta) then the probability mass function is: f_X(x) = \theta (1-\theta)^x, \quad x \in \{0, 1, 2, 3, ...\}, \theta \in [0,1]

The first step of a parametric bootstrap is to find the maximum likelihood estimator. Here,

\begin{align*} 0 = \frac{\partial \mathcal{L}}{\partial \theta} &= \frac{\partial}{\partial \theta} \left[ \sum_{i=1}^n \log \left( \theta (1-\theta)^{x_i} \right) \right] \\ &= \frac{\partial}{\partial \theta} \left[ n \log\theta + \left( \sum_{i=1}^n x_i \right) \log (1-\theta) \right] \\ \implies 0 &= \frac{n}{\theta} - \frac{\sum_{i=1}^n x_i}{1-\theta} \\ \implies \hat{\theta} &= \frac{n}{n + \sum_{i=1}^n x_i} \end{align*}
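As a quick sanity check on this derivation, we can compare the closed-form MLE against a direct numerical maximisation of the log-likelihood (a sketch only; the data vector here is purely illustrative):

```r
# Illustrative data, assumed drawn from a Geometric distribution
x <- c(4, 8, 1, 0, 3, 0, 6, 1, 2)
# Closed-form MLE: theta.hat = n / (n + sum(x))
theta.hat <- length(x) / (length(x) + sum(x))
# Numerical maximisation of the log-likelihood over theta in (0, 1)
loglik <- function(theta) {
  sum(dgeom(x, theta, log = TRUE))
}
theta.opt <- optimize(loglik, interval = c(1e-6, 1 - 1e-6),
                      maximum = TRUE)$maximum
# The two should agree to within the optimiser's tolerance
c(theta.hat, theta.opt)
```

Here n = 9 and \sum x_i = 25, so the closed form gives \hat{\theta} = 9/34 \approx 0.265, which the numerical optimiser should reproduce.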

Armed with this MLE, we can proceed to code up a parametric bootstrap estimator for this setting:

๐Ÿš€ louisaslett.com/surfeR  โ€ข  SHA
# Data believed to be from a Geometric distribution
x <- c(4, 8, 1, 0, 3, 0, 6, 1, 2)
# Maximum likelihood estimator of Geometric parameter
theta.hat <- length(x)/(length(x)+sum(x))
# Number of bootstraps
B <- 1000
# Statistic
S <- median
# Perform bootstrap
S.star <- rep(0, B)
for(b in 1:B) {
  # NOTE: we now simulate from the Geometric using the MLE of the parameter,
  # instead of resampling
  x.star <- rgeom(length(x), theta.hat)
  S.star[b] <- S(x.star)
}
# Bootstrap estimate
S(x)
# Standard error of estimate
sd(S.star)
ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”ื”
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX

3.4 How to use bootstrap samples

We have now discussed quite extensively how to generate bootstrap samples of a statistic, why the non-parametric approach makes sense in terms of the ecdf, what we can do instead if we have a parametric family in mind for the distribution, and how to use these samples to get a sense of the uncertainty in terms of the standard error. However, how else can we use these samples?

The bootstrap is a vast area of study (there are whole text books devoted only to this topic), so we just mention a couple of use cases:

  • Bias estimation: can we get a sense of whether the standard estimate of the statistic of interest is biased?
  • Confidence intervals: can we approximate the true confidence interval in both non-parametric and parametric settings?

There are many other ways you may encounter in future where bootstrap methods can be used, including approximating a Bayesian analysis with โ€œuninformativeโ€ prior, or assessing the predictive accuracy of machine learning models.

3.4.1 Bias estimation

Often when weโ€™re computing a statistic, itโ€™s because we are estimating some parameter with it.

So if we are trying to estimate a parameter \theta of a true distribution F, using a statistic S(\cdot), so that \hat{\theta} = S(\vec{x}), then recall (Stats I) that the bias is: \text{bias}(\theta,\hat{\theta}) = \mathbb{E}[\hat{\theta}] - \theta = \mathbb{E}[S(\vec{x})] - \theta The bootstrap samples can be used to provide an estimate of this bias.

Aside: why would we ever bother with a biased estimate? Donโ€™t we always want an unbiased estimate of a parameter? Not necessarily! Sometimes unbiased estimators will have quite high variance and in such situations we may prefer an estimator which admits a small non-zero bias if it has much lower variance.

Definition 3.6 (Basic bootstrap bias estimate) Given a data set \vec{x} = (x_1, \dots, x_n) and a parameter \theta which is estimated by \hat{\theta} = S(\vec{x}), then the basic bootstrap bias estimate is given by performing bootstrap resampling and computing \widehat{\text{bias}}(\theta,\hat{\theta}) = \bar{S}^\star - \hat{\theta} = \left( \frac{1}{B} \sum_{b=1}^B S(\vec{x}^{\star b}) \right) - S(\vec{x})

The above estimator is very intuitive: weโ€™re treating \hat{\theta} as the โ€˜truthโ€™, \theta, and then the bootstrap resamples allow us to see how our estimator varies when we collect samples, so that \bar{S}^\star approximates \mathbb{E}[\hat{\theta}].

This approach is perfectly sound, though it can be improved (we may revisit this later in the course).

Important Note: you may think that we should report \bar{S}^\star as our estimate, instead of \hat{\theta} = S(\vec{x}), but this would not be correct. If you want to report a bias correction, then we should subtract the bias from the usual estimate, eg:

\begin{align*} \hat{\theta} - \widehat{\text{bias}}(\theta,\hat{\theta}) &= \hat{\theta} - (\bar{S}^\star - \hat{\theta}) \\ &= 2 \hat{\theta} - \bar{S}^\star \end{align*}

In other words, the bias corrected value is 2 S(\vec{x}) - \bar{S}^\star, not \bar{S}^\star itself! We should report either S(\vec{x}) or 2 S(\vec{x}) - \bar{S}^\star as our estimate. However โ€ฆ

Important Note 2: often 2 S(\vec{x}) - \bar{S}^\star will have higher variance as an estimator, so will not necessarily be preferable to S(\vec{x}) ! It is still useful to get an idea of the level of bias, even if we donโ€™t apply any bias correction.
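To make this concrete, here is a minimal sketch of the basic bootstrap bias estimate and the corresponding bias-corrected value, using the sample median as the statistic. The data vector is purely illustrative:

```r
set.seed(1)
# Illustrative data (seven observations)
x <- c(94, 197, 16, 38, 99, 141, 23)
S <- median
B <- 5000
# Non-parametric bootstrap samples of the statistic
S.star <- rep(0, B)
for(b in 1:B) {
  x.star <- sample(x, length(x), replace = TRUE)
  S.star[b] <- S(x.star)
}
# Basic bootstrap bias estimate: S.bar.star - S(x)
bias.hat <- mean(S.star) - S(x)
# Bias-corrected estimate: 2 S(x) - S.bar.star
corrected <- 2 * S(x) - mean(S.star)
c(S(x), bias.hat, corrected)
```

Note that we report either S(x) or the corrected value, never mean(S.star) itself.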

3.4.2 Confidence intervals

Computation of confidence intervals (CIs) for the statistic we’re interested in can be performed in multiple ways: the Normal CI, the percentile CI, the BC CI, and the BCa CI. These methods have increasing accuracy, but also increasing complexity, both theoretically and in terms of how big B must be for them to function well.

3.4.2.1 Normal CIs

As youโ€™ve learned in other Stats courses, statistics weโ€™re computing will end up converging to a Normal distribution as the amount of data we have available increases, even when the distribution of the data used to compute the statistic is very non-Normal.

In this case, the 100 (1-\alpha)% confidence interval can be written in the usual way using the bootstrap estimate of standard error:

\hat{\theta} \pm z_{\alpha/2} \sqrt{\widehat{\text{Var}}(S(\vec{x}))}

where z_{\alpha/2} is the 100(1-\alpha/2)% percentile of the standard Normal distribution, that is, the upper \alpha/2 point (just as you would be used to using when calculating a confidence interval for a Normal mean).

However, we should check for Normality using a quantile plot (Stats I) before depending on this. Particularly in small sample settings it will often be the case that the sampling distribution is still far away from Normality!

Important Note 1: weโ€™re interested in checking the Normality of the bootstrap samples, not the data! The bootstrap samples are what we take to indicate the form of the sampling distribution of the statistic and that is where we rely on Normality when constructing the above interval.

Important Note 2: the confidence interval is centred on \hat{\theta} = S(\vec{x}), not on \bar{S}^\star.
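As a sketch of both points, using illustrative data with the sample mean as the statistic, we can check Normality of the bootstrap samples with a quantile plot and then form the Normal CI centred on S(x):

```r
set.seed(1)
# Illustrative data
x <- c(94, 197, 16, 38, 99, 141, 23)
S <- mean
B <- 1000
S.star <- rep(0, B)
for(b in 1:B) {
  S.star[b] <- S(sample(x, length(x), replace = TRUE))
}
# Check Normality of the *bootstrap samples*, not the data
qqnorm(S.star); qqline(S.star)
# 95% Normal CI: centred on S(x), using the bootstrap standard error
alpha <- 0.05
ci <- S(x) + qnorm(c(alpha/2, 1 - alpha/2)) * sd(S.star)
ci
```

With only n = 7 observations we would want to inspect the quantile plot carefully before relying on this interval.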

3.4.2.2 Percentile CIs

If \hat{F} was close to the true distribution, we may hope that the bootstrap samples S(\vec{x}^{\star b}), b = 1, \dots, B, will also then be distributed according to the sampling distribution of the statistic (albeit centred on \hat{\theta}). In light of this, we can simply sort the bootstrap sampled statistics and identify the quantiles to give the confidence interval. Let S_b^\star := S(\vec{x}^{\star b}), then using the order statistic notation you may have seen in Tutorial 2, let S_{(i)}^\star denote the i-th largest value, so that: S_{(1)}^\star \le S_{(2)}^\star \le \cdots \le S_{(B)}^\star Then, the 100(1-\alpha)% confidence interval using the percentile CI method is given by the interval: \left[ S_{\left((\alpha/2) B \right)}^\star, S_{\left((1-\alpha/2) B \right)}^\star \right]

That is, for the lower end we take the (\alpha/2) B-th largest value, and for the upper end the (1-\alpha/2) B-th largest value. You will need to choose B carefully so that these indices are integers.

Important Note 1: we are now in the setting where a large B is important. Although you can do the Normal CIs above with quite modest size B (eg sometimes only 50 to 200), it will often require B>2000 to get a reasonable estimate using the Percentile method.

Important Note 2: this method has enabled us to relax the assumption that the distribution of the statistic is Normal, but it is not a universal solution. It will be inaccurate if there is any bias or non-constant standard error.
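A minimal sketch of the percentile method, again with illustrative data and the sample mean, sorting the bootstrap samples and reading off the required order statistics:

```r
set.seed(1)
# Illustrative data
x <- c(94, 197, 16, 38, 99, 141, 23)
S <- mean
alpha <- 0.05
# Choose B so that (alpha/2) * B is an integer
B <- 10000
S.star <- rep(0, B)
for(b in 1:B) {
  S.star[b] <- S(sample(x, length(x), replace = TRUE))
}
# Sort, then take the (alpha/2)B-th and (1-alpha/2)B-th largest values
# (round() guards against floating point error in the index arithmetic)
S.sorted <- sort(S.star)
ci.pct <- c(S.sorted[round((alpha/2) * B)],
            S.sorted[round((1 - alpha/2) * B)])
ci.pct
```

Equivalently, quantile(S.star, c(alpha/2, 1 - alpha/2)) gives essentially the same interval.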

Example 3.8 (aside on confidence intervals) The full detail of this is not examinable, but you should have a basic intuitive understanding.

Remember that a confidence interval is the random interval [\theta_L, \theta_U] which contains the true parameter value \theta with probability 1-\alpha, and that this corresponds to the most extreme values \theta_0 could take such that a null hypothesis H_0: \theta = \theta_0 would not be rejected.

So, mathematically, how do we find those end points? Well, calling to mind the formal definition of a hypothesis test we gave in the last chapter, the above statement means we are looking to find \theta_L and \theta_U which satisfy: \mathbb{P}(S(X) > \hat{\theta} \given \theta = \theta_L) = \frac{\alpha}{2} \quad \text{ and } \quad \mathbb{P}(S(X) < \hat{\theta} \given \theta = \theta_U) = \frac{\alpha}{2}

If the distribution of S(X) \given \theta is symmetric, if \hat{\theta} is unbiased, and if the standard error does not change with \theta, then the percentile method works well, because finding the confidence interval just involves translating the sampling distribution as \theta changes. Note this is a property of the true sampling distribution, not of the bootstrap. To demonstrate when it works and when it doesn’t, we take the following two hypothetical scenarios (where we show the exact sampling distributions, not the bootstrapped estimates, to highlight that the problem arises from using percentiles of the sampling distribution at \hat{\theta} – which the bootstrap estimates – and not due to bootstrapping itself):

Above, the black curve is the sampling density under \hat{\theta}, with \hat{\theta} shown by the vertical black line (there is no bias here). If the standard error is also constant for all values of \theta, then the sampling distribution under other values of \theta simply involves shifting the black curve left or right. Clearly, this means that finding the value \theta = \theta_L for the green curve such that the area beyond the black vertical line (which is \mathbb{P}(S(X) > \hat{\theta} \given \theta = \theta_L)) equals 0.025 is precisely the 0.025 quantile of the black curve. Similarly for the red curve and upper tail. Hence, the black dotted lines which are lower and upper 0.025 quantiles and the true lower and upper confidence limits are overplotted and identical.

This is not true if we have bias and non-constant standard error as in the following situation:

Now the sampling distribution behaves very differently for different values of \theta and a lot more care is required!

  • The black curve shows the density of the statistic when \theta = \hat{\theta}, the black vertical line shows the value of \hat{\theta} and the black dotted vertical lines show the lower 0.025 and upper 0.975 quantiles of the black density
    • ie the area under the black curve to the left of the first dotted black vertical line, or the right of the second dotted black vertical line, are each 0.025.
  • The green dashed curve shows the density of the statistic when \theta = \theta_L is such that \mathbb{P}(S(X) > \hat{\theta} \given \theta = \theta_L) = 0.025 and the green dashed vertical line shows the value of \theta_L
    • ie the area under the green curve to the right of the solid black vertical line is 0.025. \theta_L is the unique value which results in that area.
  • The red dashed curve shows the density of the statistic when \theta = \theta_U is such that \mathbb{P}(S(X) < \hat{\theta} \given \theta = \theta_U) = 0.025 and the red dashed vertical line shows the value of \theta_U
    • ie the area under the red curve to the left of the solid black vertical line is 0.025. \theta_U is the unique value which results in that area.

The point is: if the quantiles of the sampling distribution when \theta = \hat{\theta} were sufficient then both black dotted lines would overlay the coloured dashed lines, but because the form of the sampling distribution changes dramatically with changing value of \theta, the estimate is inaccurate. In other words, the percentile CI approximation will be the dotted black vertical lines, when the true exact confidence interval is the dashed coloured vertical lines.

3.4.2.3 BC/BCa CIs

BC (bias corrected) and BCa (bias corrected and accelerated) CIs are two more advanced methods which can make an adjustment for the situation where bias or non-constant standard error arise. Note that this is a non-trivial task, since all we get to see with the bootstrap is approximate samples from the black density in the example above. Therefore, covering BC/BCa would be the topic of a fourth year advanced course: we simply note that if these adjustments are needed, then you will be able to compute them using the boot::boot() function in R. You should, however, understand how to identify when they might be needed per the above discussion.
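For reference, a sketch of how this looks in practice with the boot package (note that boot() requires the statistic to be a function of the data and a vector of resampled indices; the data here is illustrative):

```r
library(boot)  # shipped with the standard R distribution
set.seed(1)
# Illustrative data
x <- c(94, 197, 16, 38, 99, 141, 23)
# boot() expects a statistic of the form f(data, indices)
S.boot <- function(data, idx) mean(data[idx])
out <- boot(x, S.boot, R = 5000)
# Compare Normal, percentile and BCa intervals
ci.boot <- boot.ci(out, conf = 0.95, type = c("norm", "perc", "bca"))
ci.boot
```

When bias or non-constant standard error is suspected, the BCa interval reported here is generally the one to prefer.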