$$ \require{cancel} \newcommand{\given}{ \,|\, } \renewcommand{\vec}[1]{\mathbf{#1}} \newcommand{\vecg}[1]{\boldsymbol{#1}} \newcommand{\mat}[1]{\mathbf{#1}} \newcommand{\bbone}{\unicode{x1D7D9}} $$

Tutorial 1, Week 2

Q1

Suppose the data are: \[x_1 = 1.49, x_2 = 1.51, x_3 = 1.48, x_4 = 1.54, x_5 = 1.50\] \[x_6 = 1.46, x_7 = 1.44, x_8 = 1.56, x_9 = 1.45, x_{10} = 1.47\] Recall that the Bootstrap takes resamples of the data (i.e. samples drawn with replacement from the original data to create a pseudo data set of the same size). What is the probability that a Bootstrap resample of the above data has:

  1. exactly three values equal to \(1.45\)?
  2. at most two values \(\le 1.48\)?
  3. exactly two values \(\le 1.48\) and all other values \(>1.52\)?
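
One way to sanity-check answers of this kind: each of the 10 draws in a resample lands on any given original value with probability \(1/10\), independently of the other draws, so the number of draws falling in a fixed set of values is binomial. The sketch below (Python, standard library only; the helper names `binom_pmf` and `simulate` are illustrative and not part of the course material) compares the binomial formula for part 1 with a brute-force simulation.

```python
import random
from math import comb

data = [1.49, 1.51, 1.48, 1.54, 1.50, 1.46, 1.44, 1.56, 1.45, 1.47]
n = len(data)


def binom_pmf(k, n, p):
    """P(K = k) for K ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def simulate(event, reps=20_000, seed=1):
    """Estimate P(event) by drawing `reps` bootstrap resamples of `data`."""
    rng = random.Random(seed)
    hits = sum(event(rng.choices(data, k=n)) for _ in range(reps))
    return hits / reps


# Part 1: exactly three of the ten draws equal 1.45 (success probability 1/10).
p_exact = binom_pmf(3, n, 1 / n)
p_sim = simulate(lambda resample: resample.count(1.45) == 3)
print(f"formula: {p_exact:.4f}  simulation: {p_sim:.4f}")
```

The same pattern, with a different event and success probability, can be used to check the other two parts.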

Q2

Reminder: A one-sided Monte Carlo hypothesis test simulates \(N\) ‘pseudo’ data sets under the null hypothesis, each of the same size as the original data, and calculates the test statistic for each one, \(\big\{ t_i : i\in\{1,\dots,N\}\big\}\). We then take the proportion of simulated test statistics that are at least as extreme as the observed test statistic, \(t_\text{obs}\), as our estimate of the p-value: \[p \approx \hat{p} = \frac{1}{N} \sum_{i=1}^N \bbone\{ t_i \ge t_\text{obs} \}\]
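
The recipe in this reminder can be sketched in a few lines. The example below is illustrative only: it is written in Python rather than R, the function names (`rpois`, `mc_p_value`) are made up for this sketch, and the numbers (\(n=5\), \(\lambda_0=2\), \(t_\text{obs}=14\)) are hypothetical and unrelated to the question that follows. The Poisson sampler uses Knuth's multiplication method because the Python standard library has no built-in Poisson generator.

```python
import random
from math import exp


def rpois(lam, rng):
    """Draw one Poisson(lam) variate using Knuth's multiplication method."""
    limit = exp(-lam)
    k, prod = 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k


def mc_p_value(t_obs, simulate_stat, N, rng):
    """One-sided Monte Carlo p-value: proportion of simulated t_i >= t_obs."""
    stats = [simulate_stat(rng) for _ in range(N)]
    return sum(t >= t_obs for t in stats) / N


# Hypothetical set-up (not the data from this sheet): n = 5 observations,
# null value lambda_0 = 2, test statistic = sum of the observations.
n, lam0, t_obs = 5, 2.0, 14
rng = random.Random(42)
p_hat = mc_p_value(
    t_obs,
    simulate_stat=lambda r: sum(rpois(lam0, r) for _ in range(n)),
    N=10_000,
    rng=rng,
)
print(f"estimated p-value: {p_hat:.3f}")
```

Passing the simulator in as a function keeps `mc_p_value` independent of the null model, so the same helper works for any test statistic that can be simulated under \(H_0\).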

Question: You have collected the following data of size \(n=3\):

2, 0, 1

and are interested in testing the hypothesis that it is Poisson(\(\lambda\)) distributed, with: \[\begin{align*} H_0: &\ \lambda = 1 \\ H_1: &\ \lambda > 1 \end{align*}\]

You recall that the expectation of a Poisson random variable is \(\lambda\), so your friend suggests using the sum of the observations as the test statistic, \(T = h(X_1, X_2, X_3) = \sum_{i=1}^3 X_i\).

  1. Is your friend’s suggestion OK as a test statistic, and why (or why not)?
  2. What is the observed test statistic, \(t_\text{obs}\), here?

You now simulate some data from a Poisson distribution with \(\lambda=1\) using the R programming language and obtain the following output:

2, 1, 0       1, 1, 2
0, 2, 0       0, 2, 0
2, 1, 1       0, 3, 1
3, 1, 0       0, 0, 1
1, 4, 1       1, 1, 1
  3. What is \(N\) here?
  4. What is the test statistic, \(t_i\), for each Monte Carlo simulated data set?
  5. What would you estimate the p-value is based on this (admittedly tiny) Monte Carlo simulation?
  6. At a significance level \(\alpha=0.1\), do you reject, or not reject, the null hypothesis?
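
Once you have attempted the parts above by hand, the bookkeeping can be checked mechanically. The sketch below is a Python illustration, not course code, and it rests on one assumption: each comma-separated triple in the printed output (two per line) is read as one simulated data set of size 3. The variable names are chosen for this example only.

```python
# Observed data from the question.
observed = [2, 0, 1]

# Simulated output, read left to right and top to bottom.
# ASSUMPTION: each triple is one simulated data set of size 3.
simulated = [
    [2, 1, 0], [1, 1, 2],
    [0, 2, 0], [0, 2, 0],
    [2, 1, 1], [0, 3, 1],
    [3, 1, 0], [0, 0, 1],
    [1, 4, 1], [1, 1, 1],
]

alpha = 0.1
t_obs = sum(observed)                       # observed test statistic
t_sim = [sum(row) for row in simulated]     # one statistic per simulated set
N = len(t_sim)
p_hat = sum(t >= t_obs for t in t_sim) / N  # proportion with t_i >= t_obs
reject = p_hat <= alpha

print("N =", N)
print("t_obs =", t_obs)
print("t_i =", t_sim)
print("p_hat =", p_hat, "| reject H0:", reject)
```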

Q3

Let \(p\) denote the true (exact) p-value (i.e. the p-value we would obtain if we knew the exact distribution of the test statistic),

\[p = \mathbb{P}(T \ge t_\text{obs} \given H_0 \text{ true})\]

  1. What is the exact p-value of the test in the previous question?

    Hint: You may use, without proof, that: \[X_i \sim \text{Pois}(\lambda) \quad\implies\quad \sum_{i=1}^n X_i \sim \text{Pois}(n \lambda)\]

  2. What is the probability that a single Monte Carlo simulated test statistic will be greater than or equal to \(t_\text{obs}\)?

  3. For \(N\) Monte Carlo simulated test statistics, what is the distribution of the number that are greater than or equal to \(t_\text{obs}\)?
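
An upper-tail probability of the kind needed here is straightforward to evaluate numerically, which is a useful check on a hand calculation. The sketch below is a Python illustration with made-up helper names (`pois_pmf`, `pois_upper_tail`). It uses the hint that the sum of \(n\) independent \(\text{Pois}(\lambda)\) variables is \(\text{Pois}(n\lambda)\), and takes \(n=3\), \(\lambda=1\) and \(t_\text{obs}=3\) from the previous question.

```python
from math import exp, factorial


def pois_pmf(k, mu):
    """P(T = k) for T ~ Poisson(mu)."""
    return exp(-mu) * mu**k / factorial(k)


def pois_upper_tail(t, mu):
    """P(T >= t) for T ~ Poisson(mu), computed as 1 - P(T <= t - 1)."""
    return 1.0 - sum(pois_pmf(k, mu) for k in range(t))


n, lam = 3, 1.0      # sample size and null value of lambda from Q2
t_obs = 2 + 0 + 1    # sum of the observed data 2, 0, 1
p_exact = pois_upper_tail(t_obs, n * lam)
print(f"exact p-value: {p_exact:.4f}")
```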

A problem with Monte Carlo testing is that the estimated p-value is random. The resampling risk is defined to be the probability that the Monte Carlo estimate of the p-value, \(\hat{p}\), and the true p-value, \(p\), are on different sides of the significance threshold, \(\alpha\), because this is exactly the situation in which the Monte Carlo test gives the wrong decision.

\[\text{resampling risk} = \begin{cases} \mathbb{P}(\hat{p} > \alpha) & \text{ if } p \le \alpha \\ \mathbb{P}(\hat{p} \le \alpha) & \text{ if } p > \alpha \end{cases}\]

  4. What is the resampling risk of the Monte Carlo simulated hypothesis test in Q2?
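
The definition of resampling risk translates directly into a short calculation, because \(N\hat{p}\) counts how many of \(N\) independent simulated statistics satisfy \(t_i \ge t_\text{obs}\), each with probability \(p\). The sketch below is a Python illustration with a hypothetical function name (`resampling_risk`). The usage lines assume the Q2 set-up: \(N=10\), \(\alpha=0.1\), and an exact p-value equal to the \(\text{Pois}(3)\) upper tail at \(t_\text{obs}=3\).

```python
from math import comb, exp


def binom_pmf(k, n, p):
    """P(K = k) for K ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def resampling_risk(p, alpha, N):
    """P(p_hat and p fall on different sides of alpha), with N * p_hat ~ Binomial(N, p)."""
    # P(p_hat <= alpha): sum over all counts k whose proportion k / N is <= alpha.
    prob_not_above = sum(
        binom_pmf(k, N, p) for k in range(N + 1) if k / N <= alpha
    )
    return 1.0 - prob_not_above if p <= alpha else prob_not_above


# ASSUMED set-up from Q2: N = 10 simulations, alpha = 0.1, and
# p = P(T >= 3) for T ~ Poisson(3) = 1 - e^{-3} (1 + 3 + 3^2 / 2).
p_exact = 1.0 - exp(-3) * (1 + 3 + 3**2 / 2)
risk = resampling_risk(p_exact, alpha=0.1, N=10)
print(f"resampling risk: {risk:.5f}")
```

Comparing `k / N <= alpha` directly, rather than computing a cut-off count from `alpha * N`, keeps the code aligned with the definition above.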

Q4

In the lecture we saw the mouse data, with lifetimes in days for the treatment group \((x_1, \dots, x_7)\) and the control group \((y_1, \dots, y_9)\).

  1. For each group, what effect would there be on (i) the sample mean, and (ii) the standard error of the sample mean, if all the lifetimes were expressed in weeks?

  2. How many standard errors from zero would the difference \(\bar{x}-\bar{y}\) be after this change of units?

  3. If we have a new dataset where each observation in the original data is repeated \(N\) times (i.e. we get the value \(x_1\) repeated \(N\) times, as well as the value \(x_2\) repeated \(N\) times, etc.), what would the effect be on the standard error of the sample mean? (Is this roughly a factor of \(\frac{1}{\sqrt{N}}\)? We’ll see this factor cropping up a lot later in the course!)

    Hint: First show the mean is unchanged, then write down the standard error of the new mean and put it in terms of the standard error of the original mean.
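
A quick numerical experiment can confirm the algebra in this question. The sketch below is a Python illustration under two stated assumptions: the lifetimes are hypothetical values invented for the example (the mouse data are not reproduced on this sheet), and the standard error is taken to be \(s/\sqrt{n}\) with \(s\) the usual sample standard deviation using an \(n-1\) denominator. Under that convention the repetition factor works out to \(\sqrt{(n-1)/(nN-1)}\), which is close to \(1/\sqrt{N}\). With an \(n\) denominator it would be exactly \(1/\sqrt{N}\).

```python
from math import sqrt
from statistics import mean, stdev


def std_error(values):
    """Standard error of the sample mean, using the (n - 1)-denominator stdev."""
    return stdev(values) / sqrt(len(values))


# HYPOTHETICAL lifetimes in days, invented for illustration only.
x_days = [12.0, 40.0, 55.0, 90.0, 130.0, 20.0, 75.0]
n = len(x_days)

# Change of units: days -> weeks divides every observation by 7.
x_weeks = [v / 7 for v in x_days]
print("mean ratio (weeks/days):", mean(x_weeks) / mean(x_days))
print("s.e. ratio (weeks/days):", std_error(x_weeks) / std_error(x_days))

# Repetition: every observation appears N times.
N = 4
x_rep = [v for v in x_days for _ in range(N)]
ratio = std_error(x_rep) / std_error(x_days)
predicted = sqrt((n - 1) / (n * N - 1))
print("mean unchanged:", abs(mean(x_rep) - mean(x_days)) < 1e-9)
print("s.e. ratio:", ratio, "predicted:", predicted, "1/sqrt(N):", 1 / sqrt(N))
```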