\require{cancel} \newcommand{\given}{ \,|\, } \renewcommand{\vec}[1]{\mathbf{#1}} \newcommand{\vecg}[1]{\boldsymbol{#1}} \newcommand{\mat}[1]{\mathbf{#1}} \newcommand{\bbone}{\unicode{x1D7D9}}

Chapter 2 Monte Carlo testing


โช โฉ

Notice that when performing a hypothesis test, we specify the distribution that we believe (or want to test) is the one that generated the data we have observed, so this is usually straightforward to deal with. The test statistic is something we choose, and so long as it is sensitive to departures from the null hypothesis we have a lot of freedom here². So what is it that makes hypothesis testing complicated? Fundamentally, the part that is usually very hard to handle is putting these two things we do know together to find the sampling distribution of the test statistic. Why?

Let us assume we have data (x_1, \dots, x_n) and a data generating distribution F(x \given \theta). Here F is just some cumulative distribution function for an arbitrary probability model with parameter \theta … I'm not using \mu, \sigma here because I want you to think more broadly than Normal distributions! For example, F(\cdot \given \theta) could be the cdf of an Exponential distribution and \theta could be the rate parameter. We are interested in testing the hypothesis: \begin{align*} H_0: &\ \theta = \theta_0 \\ H_1: &\ \theta > \theta_0 \end{align*} We are going to use the test statistic T=h(X_1, \dots, X_n) to perform this test, where we have an observed test statistic value of t=h(x_1, \dots, x_n) for our data. This means we need to calculate: \begin{align*} &\mathbb{P}\big( T \ge t \given H_0 \text{ true} \big) \\ &= \mathbb{P}\big( h(X_1, \dots, X_n) \ge h(x_1, \dots, x_n) \given X_i \sim F(\cdot \given \theta_0) \big) \end{align*} and it is this probability which is often very hard, or simply impossible, to calculate analytically.

However, as we said above, we know F(x \given \theta) and we know h(\cdot), so we can circumvent doing any algebra at all and rely only on simulation to answer this question.

Definition 2.1 (Monte Carlo testing) Given data (x_1, \dots, x_n), an observed test statistic t=h(x_1, \dots, x_n), a data generating distribution F(x \given \theta) and a hypothesis, \begin{align*} H_0: &\ \theta = \theta_0 \\ H_1: &\ \theta > \theta_0 \end{align*} we conduct a Monte Carlo hypothesis test by performing the following steps.

For j \in \{1, \dots, N\},

  1. Simulate n observations (z_1, \dots, z_n) where Z_i \sim F(\cdot \given \theta_0)
  2. Calculate t_j = h(z_1, \dots, z_n)

Then, estimate the p-value of the test by the empirical average: \mathbb{P}\big( T \ge t \given H_0 \text{ true} \big) \approx \frac{1}{N} \sum_{j=1}^N \bbone\{ t_j \ge t \} where \bbone\{ A \} is the indicator function which is 1 when A is true and 0 when A is false.
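
To make the recipe concrete, here is a minimal sketch in R for the Exponential example mentioned earlier: testing H_0: \theta = \theta_0 against H_1: \theta > \theta_0 for the rate of an Exponential distribution. The data values, the null rate theta0 and the choice of test statistic h (the maximum likelihood estimate of the rate, which tends to be large when the data suggest a large rate) are all illustrative choices, not part of the definition.

# A minimal sketch of Definition 2.1 for an Exponential rate test
# (data, theta0 and h() below are illustrative choices, not from the notes)
x <- c(0.8, 1.1, 0.3, 0.6, 0.9, 0.2, 0.5, 0.7)  # pretend these are our observed data
theta0 <- 1                                     # null value of the rate parameter
h <- function(y) 1 / mean(y)                    # test statistic: MLE of the rate
t.obs <- h(x)                                   # observed test statistic
n <- length(x)

N <- 10000
t.sim <- rep(0, N)
for(j in 1:N) {
  z <- rexp(n, rate = theta0)   # step 1: simulate n observations under the null
  t.sim[j] <- h(z)              # step 2: calculate the test statistic
}

# Estimated p-value: proportion of simulated statistics at least as extreme
mean(t.sim >= t.obs)

Here mean(t.sim >= t.obs) is just a compact way of writing the empirical average of the indicator function in Definition 2.1.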

Intuitively this is very appealing and arguably makes the whole process of hypothesis testing easier to understand! Basically,

  1. We “pretend” the null is true, simulate lots of data sets (each a fresh sample z_1, \dots, z_n from F(\cdot \given \theta_0)), and calculate the test statistic on each of them.
  2. We then look at how many of the simulated data sets had test statistics at least as extreme as the one we actually observed … if not many, then the data we saw are not very likely to have come from this distribution, because despite all our simulations we rarely see such a situation arising.

In particular, the beauty of this is that we didn't have to calculate any sampling distributions and just used information we know anyway in order to specify the test!

NOTE: there is however one drawback, which is that we can't perform the above naïve simulation test unless we know the true value of any parameters not being tested. In other words, we can do the known-\sigma case of standard Normal testing, but not the unknown-\sigma case handled by the t-test.

Example 2.1 (candle lifetimes 🕯️) We compare a standard hypothesis test with a Monte Carlo hypothesis test, using one of the Stats I examples you had on candle lifetimes. There we had n=6 observations: 8.1, \ 8.7, \ 9.2, \ 7.8, \ 8.4, \ 9.4 with a known variance \sigma^2=0.4. The test was \begin{align*} H_0: &\ \mu = 9.2 \\ H_1: &\ \mu \ne 9.2 \end{align*} We have that \frac{\bar{x}-\mu_0}{\sigma/\sqrt{n}} = \frac{8.6-9.2}{\sqrt{0.4}/\sqrt{6}} = -2.32 and you saw in the course that \text{p-value} = 2 \mathbb{P}(Z \ge 2.32) = 0.0204
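
As a quick check (this snippet is not part of the original Stats I example), the classical calculation can be reproduced in R:

# Classical known-variance z-test for the candle data (a quick check)
x.bar <- 8.6; mu0 <- 9.2; sigma2 <- 0.4; n <- 6
z <- (x.bar - mu0) / sqrt(sigma2 / n)      # approximately -2.32
2 * pnorm(abs(z), lower.tail = FALSE)      # approximately 0.020; tables with z = 2.32 give 0.0204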

We now perform the test via simulation instead. Note, we don't need to scale by the standard error now (which you do purely for mathematical reasons to create a pivotal quantity in a standard test), since we're just interested in whether each simulation is more or less extreme compared to the null:

# Specify test statistic and null value
x.bar <- 8.6
n <- 6
mu0 <- 9.2
# Simulate N sets of data of size n, assuming the null is true
N <- 50000
t <- rep(0, N)
for(j in 1:N) {
  z <- rnorm(n, mu0, sqrt(0.4))
  t[j] <- abs(mean(z) - mu0)
}
# Calculate empirical p-value
sum(t > abs(x.bar-mu0)) / N

You should get a value very close to 0.0204.

Notice that if you run the example above a few times the result varies a bit: this is because we're using random simulations, so of course the result has some randomness in it. The more simulations we do, the less the result varies and the closer to the truth we'll get. For example, here is a graph showing the running estimate as we increase the number of simulations, with the dashed line showing the true value which it homes in on.
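
A sketch of how such a running-estimate plot can be produced in base R, reusing the vector t of simulated statistics from the code above (the plotting details are illustrative; the exact figure in the notes may differ):

# Running estimate of the p-value as the number of simulations grows
# (assumes t, x.bar, mu0 and N from the candle example code above)
running.p <- cumsum(t > abs(x.bar - mu0)) / (1:N)
plot(1:N, running.p, type = "l", log = "x",
     xlab = "Number of simulations", ylab = "Estimated p-value")
abline(h = 0.0204, lty = 2)   # dashed line at the (approximately) true p-value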

Later in the course we'll study how to evaluate the error we might be making after a certain number of simulations. Of course, if we know the error we might be making then we will be able to decide when we have done enough simulation. For example, if we were testing at the 5% level, we might (see later) find that we have already done enough simulations to decide to reject after just a few thousand, but when the true p-value is closer to the significance level we may need many more simulations to be confident which side of the reject/do-not-reject boundary we fall on.


  2. Again, we are slightly simplifying here. Some choices of test statistic are better than others, in the sense that they are more sensitive to departures from the null when it is wrong, and hence give a more powerful test.↩︎