Chapter 1 Introduction
The canonical problem in statistics is to understand the uncertainty in our estimator (a statistic) of some population quantity (a parameter) based on having observed only some sample of data from the population. We are often interested directly in that uncertainty estimate (think confidence interval or Bayesian posterior distribution), or we may be interested only in whether the population parameter could plausibly be some particular value (think hypothesis test).
It can often be complicated (or even impossible) to answer these questions exactly. In this course, we’ll be interested in exploiting computational power to make answering these kinds of questions much easier, and to avoid having to do a lot of the sometimes routine, sometimes complicated, additional mathematical work each time we want to answer a question involving a different set of distributional assumptions.
The way in which we deploy computational power is via simulation … that is, by generating random numbers from whatever distribution a problem requires, so that we can solve it computationally rather than algebraically. We start the course by diving in with some motivation and applications, basically assuming that we can somehow generate these random numbers; a larger part of the course will then turn to how that simulation can actually be achieved and to understanding its theoretical properties.
1.1 Statistics I refresher
We will start the course off gently by casting our minds back to last year (I know, ages ago right!) to remember a couple of the topics we’ll tackle in a new way. This is partly a refresher, but we’ll also extend things: we’ll take the opportunity to formalise hypothesis testing a little more than in first year, and to tackle a kind of problem you won’t have seen before. So don’t skip the rest of this Chapter even if you’re feeling confident!
1.1.1 Standard errors
You will recall that one of the key ways you evaluated the uncertainty in the mean in Stats I was to compute the standard error of the sample mean.
Recall that thanks to the Central Limit Theorem, if we have a sample of data \((x_1, \dots, x_n)\) where each \(X_i \sim UK(\mu, \sigma^2)\) (where \(UK\) is some unknown population distribution), then: \[\bar{X} \sim N\left(\mu, \frac{\sigma^2}{n}\right)\] as \(n \to \infty\). In words: as the sample size increases, the distribution of the sample mean tends to a Normal distribution with standard deviation \(\frac{\sigma}{\sqrt{n}}\) regardless of what the population distribution may be.
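Although you did not do anything like this in Stats I, we can already get a feel for the CLT by simulation, taking it on trust for now that R can generate the random numbers we need. In this minimal sketch the “unknown” population is taken, purely for illustration, to be an Exponential distribution with rate 1, which is heavily skewed but has mean \(\mu = 1\) and standard deviation \(\sigma = 1\):

set.seed(1)
n <- 50                                    # sample size
xbars <- replicate(10000, mean(rexp(n)))   # 10,000 sample means, each from a fresh sample
mean(xbars)                                # close to mu = 1
sd(xbars)                                  # close to sigma/sqrt(n) = 1/sqrt(50), about 0.14
hist(xbars)                                # roughly bell-shaped, despite the skewed population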
You also saw that if we compute the unbiased estimate of the population standard deviation using: \[s = \sqrt{\frac{1}{n-1} \sum_{i=1}^n (x_i - \bar{x})^2}\] then an estimate of the standard deviation of the sample mean is simply \(\frac{s}{\sqrt{n}}\) and is referred to as the standard error of the sample mean.
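As a quick check in R (using an arbitrary made-up vector of data): the built-in sd() function implements exactly this \(n-1\) formula, so the standard error of the sample mean is simply sd(x) / sqrt(n).

x <- c(4.2, 1.9, 3.3, 5.0, 2.8)           # some made-up data
n <- length(x)
sqrt(sum((x - mean(x))^2) / (n - 1))      # s, from the formula above
sd(x)                                     # the same value: sd() uses the n-1 formula
sd(x) / sqrt(n)                           # the standard error of the sample mean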
But (a big but!), if \(n\) is not sufficiently large, then \(s\) is a poor estimator of the population standard deviation and, perhaps even worse, the distribution of the mean may not be Normal. You also saw that if the population distribution is in fact Normal and \(n\) is small then the difference between the sample mean and population mean (scaled by the standard error of the sample mean) will be \(t\)-distributed. That is, \[\frac{\bar{X}-\mu}{s/\sqrt{n}} \sim t_{n-1}\] We call the quantity on the left-hand side a pivotal quantity, which just means that after transformation its distribution no longer depends on the parameters of the distribution of \(X\) (ie in this case does not depend on \(\mu\) or \(\sigma\)).
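You can see how much this small-sample correction matters by comparing quantiles of the \(t\)-distribution with those of the standard Normal in R (a quick preview of functions which reappear later in the Chapter). The heavier tails of the \(t\) shrink towards the Normal as the degrees of freedom grow:

qnorm(0.975)          # 97.5% point of the standard Normal, about 1.96
qt(0.975, df = 6)     # the same point for a t with 6 degrees of freedom, about 2.45
qt(0.975, df = 30)    # about 2.04, much closer to the Normal value
qt(0.975, df = 1000)  # about 1.96, essentially Normal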
Example 1.1 (mouse survival data) Sixteen brave cousins of Jerry the mouse were assigned either to be treated (treatment group), or not treated (control group), with a drug to try to improve their health and athleticism, helping them cheat Tom the cat of his prey.
Clearly, we would want to try to ascertain whether taking this drug truly does improve mouse longevity or not after they move into the cat’s habitat. There is some population distribution for the lifetime of mice (unknown to us) and we are interested in developing a drug which, when taken, results in the mean of the lifetime distribution increasing. We obviously can’t observe the theoretically infinite number of future mice under the condition of taking or not taking the drug, so we must make do with our small sample of 16 of Jerry’s cousins.
The results of the experiment are as follows, with \(x_i\) being the survival time of mouse \(i\) in the treatment group and \(y_i\) the survival time of mouse \(i\) in the control group, measured in days since moving into the cat’s habitat:
\[\begin{align*} &x_1 = 94, x_2 = 197, x_3 = 16, x_4 = 38, x_5 = 99, x_6 = 141, x_7 = 23 \\ &y_1 = 52, y_2 = 104, y_3 = 146, y_4 = 10, y_5 = 51, y_6 = 30, y_7 = 40, y_8 = 27, y_9 = 46 \end{align*}\]
We then have that \[\bar{x} = \sum_{i=1}^7 \frac{x_i}{7} = 86.86 \quad\text{and}\quad \bar{y} = \sum_{i=1}^9 \frac{y_i}{9} = 56.22\] leading us to an observed difference in sample means of \(\bar{x}-\bar{y} = 30.63\), meaning that at first glance the treated mice are surviving longer. But, as seasoned statisticians from first year, you know that we can’t trust a point estimate like this! Hopefully your immediate reflex was to ask: but what is the accuracy of these estimates?
Let’s cheerfully hope/assume the data is Normally distributed and compute the standard errors: then \(s_{\bar{x}} = 25.24\) and \(s_{\bar{y}} = 14.14\). Remember, based on Stats I, we would say that the sample mean will be less than \(1\times\) the standard error from the true value about 68% of the time, and less than \(2\times\) the standard error from the true value about 95% of the time. Seeing this, we might start to lose some of the initial enthusiasm that the drug is actually beneficial rather than the difference being possibly just down to sampling variability.
More formally, what is the standard error of the difference? Again, recall from first year that, for independent random variables, \(\text{Var}(aX+bY) = a^2 \text{Var}(X) + b^2 \text{Var}(Y)\), so with \(a=1, b=-1\) we have \(s_{\bar{x}-\bar{y}} = \sqrt{25.24^2 + 14.14^2} = 28.93\). Therefore, the observed difference is only \(\frac{30.63}{28.93} = 1.05\times\) the standard error away from zero … if we formulate this as a hypothesis test then this is not a significant difference (… subject to all sorts of assumptions).
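All of these numbers are easily reproduced in R; here is a minimal sketch (any small differences in the final decimal place are just rounding):

x <- c(94, 197, 16, 38, 99, 141, 23)            # treatment group survival times
y <- c(52, 104, 146, 10, 51, 30, 40, 27, 46)    # control group survival times
mean(x) - mean(y)                               # observed difference in means, about 30.63
se_x <- sd(x) / sqrt(length(x))                 # standard error of the treatment mean
se_y <- sd(y) / sqrt(length(y))                 # standard error of the control mean
se_diff <- sqrt(se_x^2 + se_y^2)                # standard error of the difference
(mean(x) - mean(y)) / se_diff                   # just over one standard error from zero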
So, we have methods from first year, why continue studying this problem? Well,
- firstly, it is a small sample size (so we can’t invoke the CLT) and we don’t really know the data is Normal, so possibly none of the above is valid. Indeed, although the rule-of-thumb sample size for invoking the CLT that you were given in first year is ok in reasonably well behaved situations, we’d like to have a method which is less contingent on rules of thumb.
- secondly and perhaps more importantly, it’s all a bit limiting isn’t it? What if I actually want to know whether the median is higher in the treatment group?
- thirdly, maybe I want to study small sample size settings without always having to assume Normality.
Note, asking about the median is actually a very sensible and arguably more interpretable question when we don’t know the distribution of mouse lifetimes: saying the median has increased is saying that the survival time exceeded by the longest-lived 50% of mice has increased, and this remains meaningful even if the true distribution is very non-Normal (ie highly skewed so that the mean and median are different).
Indeed, the second reason above is arguably the most serious problem, because the Central Limit Theorem is specifically for the mean. Consequently, even in large sample size settings if I am interested in something other than the mean then I can’t just invoke the CLT. Hence, we may need to develop a new set of asymptotic theory each time we want to estimate a different quantity and we don’t know if that new theory will be as universal as the CLT is (maybe we need to make an extra assumption about the true population distribution) … nightmare! 😱
Fear not … the computational statistics methods of this course will enable us to use computer power to dig us out of this hole. 👩💻
1.1.2 Hypothesis tests
After seeing standard errors, you studied hypothesis testing. You’re 2nd year now, so we’ll present the recap in a more general and slightly formalised setting, but what you saw in Stats I is just a simple special case.
We again start with some data \(\vec{x} = (x_1, \dots, x_n)\) and then define a null hypothesis \(H_0\), which specifies the distribution we believe, or wish to test, generated each \(x_i\). For example, in Stats I you looked at hypothesis tests where the null specified a Normal distribution with a particular mean, \(H_0: \mu = \mu_0\), versus \(H_1: \mu \ne \mu_0\).
Next, we select some test statistic, \(h(\cdot)\), which is any functional of the data that is extreme when the null is wrong, and not extreme when the null is correct. For example, in Stats I, you used \(\frac{\bar{x}-\mu_0}{\sigma/\sqrt{n}}\) when \(\sigma\) was known, but for more complex settings you will need to make some sensible choice.¹
Now, the observed test statistic for our data is simply \(t = h(x_1, \dots, x_n)\). In other words, we simply plug the observed data into the test statistic functional, but we should be conscious that if we had collected a different sample of size \(n\) it would have been slightly different. A hypothesis test therefore compares the observed test statistic to the distribution of test statistic values we would see if the null hypothesis was correct, in order to figure out if the particular value we saw is unusual/extreme.
In other words, we need to know the distribution of the random variable \(T = h(X_1, \dots, X_n)\) ! Then, we need to see how unusual \(t\) would be as a realisation of \(T\).
Looping this back to Stats I, you learned for example that if the random variables \(X_i\) which generated your data are Normal with mean \(\mu_0\) and variance \(\sigma^2\), then the distribution of the test statistic \(\frac{\bar{x}-\mu_0}{\sigma/\sqrt{n}}\) is standard Normal (mean 0, variance 1). Thus, you were able to compute the observed test statistic for a particular hypothesized \(\mu_0\) and particular data set and see if this would be unusual to observe when the mean is truly \(\mu_0\).
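As a preview of the simulation flavour of this course, here is a minimal sketch that checks this claim by brute force: we repeatedly generate data sets under \(H_0\) and look at the test statistics they produce. The particular values of \(\mu_0\), \(\sigma\) and \(n\) below are chosen arbitrarily for illustration.

set.seed(42)
mu0 <- 5; sigma <- 2; n <- 20             # arbitrary illustrative values
tstats <- replicate(10000, {
  x <- rnorm(n, mean = mu0, sd = sigma)   # a data set generated under H0
  (mean(x) - mu0) / (sigma / sqrt(n))     # its test statistic
})
mean(tstats); sd(tstats)                  # about 0 and about 1, as claimed
hist(tstats)                              # looks standard Normal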
You then learned to express this via a p-value, but you won’t have seen the formal definition … all is now revealed:
Definition 1.1 (p-value) Let us consider data \((x_1, \dots, x_n)\), a null hypothesis \(H_0\) specifying a proposed distribution for the random variable \(X_i\) giving rise to the data, and a test statistic \(h(\cdot)\) measuring departure from the null. Then, if the observed test statistic is \(t = h(x_1, \dots, x_n)\), and the distribution of \(T = h(X_1, \dots, X_n)\) is known, the one sided p-value is given by \[\mathbb{P}(T \ge t \given H_0 \text{ true}) \quad \text{or} \quad \mathbb{P}(T \le t \given H_0 \text{ true})\] In the two sided case (where \(t\) may be extreme either \(+\)’ve or \(-\)’ve) the p-value is \[\mathbb{P}(T \le -|t| \ \cup\ T \ge |t| \given H_0 \text{ true})\] (Aside: Notice that you really only need to worry about centering your test statistic on zero in the two-sided case.)
Example 1.2 (mouse survival data (cont.) 🐭) We have already stated the assumption that the survival times are Normally distributed. Let us now just do a simple test of whether the treatment group could plausibly have a mean survival time of 100 days: \[\begin{align*} H_0: &\ \mu_x = 100 \\ H_1: &\ \mu_x \ne 100 \end{align*}\] Then, taking as test statistic: \[h(x_1, \dots, x_n) = \frac{\bar{x}-100}{s_{x}/\sqrt{n}}\] we get that \(t = \frac{86.86-100}{25.24} = -0.52\) (recall that \(s_x/\sqrt{7} = 25.24\) is the standard error of the treatment mean we computed earlier).
Now, the distribution of the test statistic \(T = h(X_1, \dots, X_n)\), if \(H_0\) is true, is a \(t\)-distribution with \(7-1=6\) degrees of freedom. In this case, \[\mathbb{P}(T < -|t| \ \cup\ T > |t| \given H_0 \text{ true}) = 0.62\]
Note you didn’t actually compute exact p-values in Stats I because you would need a computer to calculate non-table values of the \(t\)-distribution.
So how did I get this value?
We can extract the probability in the left tail of a \(t\)-distribution with 6 degrees of freedom from R using the pt() function, then double it to give both upper and lower tail areas:
pt(-0.52, 6)*2   # two-sided p-value, about 0.62
In Stats I you would look up in the table a value \(t_\text{crit}\) such that: \[\mathbb{P}(T < -|t_\text{crit}| \ \cup\ T > |t_\text{crit}| \given H_0 \text{ true}) = 0.05\] which you should find is \(t_\text{crit}=2.45\). In fact, you can dispense with the tables now, since R can also give you this value, using the following command (recall, for a 2 sided test at 95%, I need 2.5% in each tail)
qt(0.975, 6)   # about 2.45, the critical value for a two-sided test at the 5% level
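As a final sanity check (not something you needed in Stats I), this whole example can be reproduced with a single call to R’s built-in t.test() function, which carries out exactly this kind of one-sample \(t\)-test:

x <- c(94, 197, 16, 38, 99, 141, 23)   # treatment group survival times
t.test(x, mu = 100)                    # reports t of about -0.52, df = 6, p-value of about 0.62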
So why did we slightly formalise hypothesis testing? Did Stats I not cover everything we want?
Well, in the same way that we might want to get different standard error estimates to tell us the accuracy of quantities other than the mean, we might also want to test hypotheses about very different questions than distributional means. The following example is a hypothesis test of a population size, so a very different kind of test. However, with the formalised definition above, it turns out to be quite easy to work through manually …
Example 1.3 (World War II tanks 🪖) In World War II it was critical to gain intelligence on the strength of the German army. In particular, whenever tanks fell into British hands in the field, the serial numbers would be recorded and reported back to the statisticians at HQ.
Let us imagine we have received a report of five serial numbers: \[x_1=61,\ \ x_2=19,\ \ x_3=56,\ \ x_4=24,\ \ x_5=16\] The Allies would like to know if the German army could plausibly have manufactured as many as 350 tanks or if it is actually fewer. How can we test this? Serial numbers are not Normal for sure, and we’re not interested in the mean here!
Well, let the distribution from which serial numbers are drawn be uniform on the set of numbers \(\{1, 2, 3, \dots, N\}\), where \(N\) is the true number of tanks produced (ie the population size). Then \(N\), rather than \(\mu\), is our parameter of interest here. We formulate our hypothesis as: \[\begin{align*} H_0: &\ N = 350 \\ H_1: &\ N < 350 \end{align*}\] Theoretically, there is nothing to stop us using the mean as a test statistic (as \(N\) changes, so the observed mean of serial numbers would change), but it doesn’t really make sense here when we have a much more natural choice which relates directly to \(N\). Instead, we propose the following test statistic: \[h(x_1, \dots, x_5) = \max\{x_1, \dots, x_5\}\] The observed test statistic is therefore 61 (the max of the observed serial numbers). Then, the p-value is just: \[\begin{align*} \mathbb{P}(T \le 61 \given H_0 \text{ true}) &= \mathbb{P}(\max\{X_1, \dots, X_5\} \le 61 \given N = 350) \\ &= \frac{61}{350}\times\frac{60}{349}\times\dots\times\frac{57}{346} \\ &= 0.00014 \end{align*}\] On the basis of these serial numbers we would clearly reject the null and conclude that there is strong evidence that fewer than 350 tanks were manufactured by the German army.
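The p-value here is just a product of five fractions, so it takes only a line of R to check (the fractions come from drawing the five serial numbers without replacement under \(N = 350\)):

prod((61:57) / (350:346))          # P(max of 5 serial numbers <= 61 | N = 350), about 0.00014
choose(61, 5) / choose(350, 5)     # equivalently, by counting the possible sets of 5 serials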
This is not just a classroom exercise but actually played out in a battle of spies 🕵️ versus statisticians 🤓! Interestingly, British spies fed back intelligence estimating the monthly production of tanks in the period 1940–1943 was 1550. The statisticians working with just serial number reports alone estimated 327 per month. Once the war ended and official German war documents became available the true number was found to be 342 per month. Go statistics! 🤓🎉
The purpose of the above example is to demonstrate that the mean of a Normal distribution is a very narrow world in which to do hypothesis testing. Fortunately, the tanks example was pretty easy to work out for ourselves as we went along, because the required probability calculations for the p-value were a kindergarten probability exercise. However, it’s not hard to imagine much more complicated questions for which working it out manually is hard, or genuinely analytically impossible.
Again, fear not … the computational statistics methods of this course will enable us to use computer power to dig us out of this hole. 👩💻
1.2 Summary
In this Chapter, we have refreshed your memory of some Stats I concepts and extended or formalised them slightly in the process. Having studied Stats I well, you will also be aware that what you have learned so far is only applicable under a restrictive set of assumptions. So, by the end of this Chapter, you are hopefully convinced that we should be interested in computing standard errors and hypothesis tests for a far wider variety of problems than the simple setting of means and Normal distributions which you have encountered so far.
However, you did not see the mathematical proofs or derivations of things like the CLT or the sampling distribution of the pivotal quantity being a \(t\)-distribution in Stats I. There is actually a lot of work in deriving and proving such results and the idea of having to do this for every different problem we encounter is not exactly appealing. Instead, we’re going to take a different route and use computational power, together with some elegant mathematics on sampling theory, in order to attack the general problem and not specific cases.
¹ There is a whole part of statistics devoted to identifying what choice of test statistic is best in any situation: we won’t go into this, but essentially some choices are more sensitive than others at detecting when the null is not true, and a test based on such a statistic is called a more powerful test.