

STAT 450

Lecture 32

Goals for today:

Uniformly Most Powerful Tests

Definition: In the general problem of testing $\Theta_0$ against $\Theta_1$, the level of a test function $\phi$ is

\begin{displaymath}\alpha = \sup_{\theta\in\Theta_0}E_\theta(\phi(X))
\end{displaymath}

The power function is

\begin{displaymath}\pi(\theta) = E_\theta(\phi(X))
\end{displaymath}

A test $\phi^*$ is a Uniformly Most Powerful level $\alpha_0$ test if

1.
$\phi^*$ has level $\alpha \le \alpha_0$

2.
If $\phi$ has level $\alpha \le \alpha_0$ then for every $\theta\in \Theta_1$ we have

\begin{displaymath}E_\theta(\phi(X)) \le E_\theta(\phi^*(X))
\end{displaymath}

Application of the NP lemma: In the $N(\mu,1)$ model consider $\Theta_1=\{\mu>0\}$ and $\Theta_0=\{0\}$ or $\Theta_0=\{\mu \le 0\}$. The UMP level $\alpha_0$ test of $H_0:
\mu\in\Theta_0$ against $H_1:\mu\in\Theta_1$ is

\begin{displaymath}\phi(X_1,\ldots,X_n) = 1(n^{1/2}\bar{X} > z_{\alpha_0})
\end{displaymath}

Proof: I showed there is a function $g(\sqrt{n}\bar{X})$ which is increasing and for which

\begin{displaymath}\frac{f_{\mu_1}(X_1,\ldots,X_n)}{f_{\mu_0}(X_1,\ldots,X_n)} = g(\sqrt{n}\bar{X})
\end{displaymath}

The rejection region is then

\begin{displaymath}\{(x_1,\ldots,x_n): g(\sqrt{n}\bar{x}) > \lambda\}
\end{displaymath}

where we choose $\lambda$ so that

\begin{displaymath}P_{\mu_0}(g(\sqrt{n}\bar{X}) > \lambda) = \alpha_0
\end{displaymath}

If you choose $\lambda = g(z_{\alpha_0})$ you see that
\begin{align*}P_{\mu_0}(g(\sqrt{n}\bar{X}) > \lambda) & = P_{\mu_0}\left(g(\sqrt{n}\bar{X}) >
g(z_{\alpha_0})\right)
\\
& = P_{\mu_0}(\sqrt{n}\bar{X} > z_{\alpha_0})
\\
& = \alpha_0
\end{align*}
This rejection region does not depend on $\mu_1$ so it is the best for all $\mu_1>0$. That is: the usual test is UMP.
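As a small numerical aside (not in the original notes), here is a Python sketch of the power function of this test, $\pi(\mu) = P_\mu(n^{1/2}\bar{X} > z_{\alpha_0}) = 1 - \Phi(z_{\alpha_0} - n^{1/2}\mu)$; the sample size and level are arbitrary illustration values. It shows the power increasing in $\mu$ and equal to $\alpha_0$ at $\mu=0$, so the test has level $\alpha_0$ over the whole null $\mu \le 0$.

\begin{verbatim}
# Power of the UMP one-sided z test in the N(mu, 1) model.
# n and alpha are arbitrary illustration values.
import numpy as np
from scipy.stats import norm

def power_one_sided(mu, n, alpha=0.05):
    z_alpha = norm.ppf(1 - alpha)              # upper alpha critical point
    # P_mu(sqrt(n) * Xbar > z_alpha) with sqrt(n) * Xbar ~ N(sqrt(n) * mu, 1)
    return 1 - norm.cdf(z_alpha - np.sqrt(n) * mu)

# Increasing in mu, equal to alpha at mu = 0, below alpha for mu < 0.
for mu in [-0.5, 0.0, 0.2, 0.5, 1.0]:
    print(mu, power_one_sided(mu, n=25))
\end{verbatim}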

Definition: The family $\{f_\theta;\theta\in \Theta\subset R\}$ has monotone likelihood ratio with respect to a statistic $T(X)$ if for each $\theta_1>\theta_0$ the likelihood ratio $f_{\theta_1}(X)
/ f_{\theta_0}(X)$ is a monotone increasing function of $T(X)$.

Theorem: For a monotone likelihood ratio family the Uniformly Most Powerful level $\alpha$ test of $\theta \le \theta_0$ (or of $\theta=\theta_0$) against the alternative $\theta>\theta_0$ is

\begin{displaymath}\phi(x) =\left\{\begin{array}{ll}
1 & T(x) > t_\alpha
\\
\gamma & T(x)=t_\alpha
\\
0 & T(x) < t_\alpha
\end{array}\right.
\end{displaymath}

where $P_{\theta_0}(T(X) > t_\alpha)+\gamma P_{\theta_0}(T(X) = t_\alpha) = \alpha$.
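To see the randomization in action, here is a hedged Python sketch for a Binomial$(n,\theta)$ family, which has monotone likelihood ratio in $T(X)=X$; the particular $n$, $\theta_0$ and $\alpha$ are arbitrary and the example is mine, not the lecture's.

\begin{verbatim}
# UMP test of theta <= theta0 against theta > theta0 for X ~ Binomial(n, theta):
# choose t and the randomization gamma so that
# P_{theta0}(X > t) + gamma * P_{theta0}(X = t) = alpha exactly.
from scipy.stats import binom

def ump_binomial(n, theta0, alpha):
    t = 0
    while binom.sf(t, n, theta0) > alpha:      # smallest t with P(X > t) <= alpha
        t += 1
    tail = binom.sf(t, n, theta0)              # P_{theta0}(X > t)
    atom = binom.pmf(t, n, theta0)             # P_{theta0}(X = t)
    gamma = (alpha - tail) / atom if atom > 0 else 0.0
    return t, gamma

# Arbitrary illustration: reject if X > t; if X = t, reject with probability gamma.
print(ump_binomial(n=20, theta0=0.3, alpha=0.05))
\end{verbatim}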

A typical family where this will work is a one parameter exponential family. In almost any other problem the method doesn't work and there is no uniformly most powerful test. For instance, to test $\mu=\mu_0$ against the two sided alternative $\mu\neq\mu_0$ there is no UMP level $\alpha$ test. If there were, its power at $\mu>\mu_0$ would have to be as high as that of the one sided level $\alpha$ test and so its rejection region would have to be the same as that of the one sided test, rejecting for large positive values of $\bar{X} -\mu_0$. But it would also have to have power as good as the one sided test against the alternative $\mu < \mu_0$ and so would have to reject for large negative values of $\bar{X} -\mu_0$. This would make its level too large.

Two sided tests

The favourite test is the usual two sided test which rejects for large values of $\vert\bar{X} -\mu_0\vert$, with the critical value chosen appropriately. This test maximizes the power subject to two constraints: first, that the level be $\alpha$ and, second, that the power be minimized at $\mu=\mu_0$. This second condition is really that the power on the alternative be at least as large as it is on the null.

Definition: A test $\phi$ of $\Theta_0$ against $\Theta_1$ is unbiased level $\alpha$ if it has level $\alpha$ and, for every $\theta\in \Theta_1$ we have

\begin{displaymath}\pi(\theta) \ge \alpha \, .
\end{displaymath}

When testing a point null hypothesis like $\mu=\mu_0$ this requires that the power function be minimized at $\mu_0$ which will mean that if $\pi$ is differentiable then

\begin{displaymath}\pi^\prime(\mu_0) =0
\end{displaymath}

In the $N(\mu,1)$ problem there is a version of the Neyman Pearson lemma which proves that the Uniformly Most Powerful Unbiased test rejects for large values of $\vert\bar{X}\vert$.

A test $\phi^*$ is a Uniformly Most Powerful Unbiased level $\alpha_0$ test if

1.
$\phi^*$ has level $\alpha \le \alpha_0$.

2.
$\phi^*$ is unbiased.

3.
If $\phi$ is unbiased and has level $\alpha \le \alpha_0$ then for every $\theta\in \Theta_1$ we have

\begin{displaymath}E_\theta(\phi(X)) \le E_\theta(\phi^*(X))
\end{displaymath}

Conclusion: The two sided z test which rejects if

\begin{displaymath}\vert Z\vert > z_{\alpha/2}
\end{displaymath}

where

\begin{displaymath}Z=n^{1/2}(\bar{X} -\mu_0)
\end{displaymath}

is the uniformly most powerful unbiased test of $\mu=\mu_0$ against the two sided alternative $\mu\neq\mu_0$.
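A quick numerical check (my own illustration, not part of the notes): the power of this two sided test is $\pi(\mu) = \Phi(-z_{\alpha/2} - n^{1/2}(\mu-\mu_0)) + 1 - \Phi(z_{\alpha/2} - n^{1/2}(\mu-\mu_0))$, which equals $\alpha$ at $\mu=\mu_0$ and has derivative 0 there, as unbiasedness requires. The sample size and level below are arbitrary.

\begin{verbatim}
# Power of the two-sided z test; n, mu0 and alpha are arbitrary choices.
import numpy as np
from scipy.stats import norm

def power_two_sided(mu, mu0, n, alpha=0.05):
    z = norm.ppf(1 - alpha / 2)
    d = np.sqrt(n) * (mu - mu0)                # mean of Z under mu
    return norm.cdf(-z - d) + 1 - norm.cdf(z - d)

n, mu0, eps = 25, 0.0, 1e-4
print(power_two_sided(mu0, mu0, n))            # equals alpha
# numerical derivative of the power at mu0: approximately 0 (unbiasedness)
print((power_two_sided(mu0 + eps, mu0, n)
       - power_two_sided(mu0 - eps, mu0, n)) / (2 * eps))
\end{verbatim}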

Nuisance Parameters

What good can be said about the t-test? It's UMPU.

Suppose $X_1,\ldots,X_n$ are iid $N(\mu,\sigma^2)$ and that we want to test $\mu=\mu_0$ or $\mu \le \mu_0$ against $\mu>\mu_0$. Notice that the parameter space is two dimensional and that the boundary between the null and alternatives is

\begin{displaymath}\{(\mu,\sigma); \mu=\mu_0,\sigma>0\}
\end{displaymath}

If a test has $\pi(\mu,\sigma) \le \alpha$ for all $\mu \le \mu_0$ and $\pi(\mu,\sigma) \ge \alpha$ for all $\mu>\mu_0$ then we must have $\pi(\mu_0,\sigma) =\alpha$ for all $\sigma$ because the power function of any test must be continuous.

It is possible to use these facts and ideas of sufficiency and completeness to prove that the t test is UMPU for the one and two sided problems.
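Here is a small Monte Carlo sketch (not from the notes; the sample size, $\sigma$ values, replication count and seed are arbitrary) of the boundary fact above: on $\mu=\mu_0$ the rejection probability of the one sided t test is $\alpha$ for every $\sigma$, because the t statistic is scale invariant.

\begin{verbatim}
# Null rejection probability of the one-sided t test for several sigma values.
# n, the sigma grid, the number of replications and the seed are arbitrary.
import numpy as np
from scipy.stats import t as t_dist

rng = np.random.default_rng(0)
n, mu0, alpha, nrep = 10, 0.0, 0.05, 200_000
crit = t_dist.ppf(1 - alpha, df=n - 1)

for sigma in [0.1, 1.0, 10.0]:
    x = rng.normal(mu0, sigma, size=(nrep, n))
    tstat = np.sqrt(n) * (x.mean(axis=1) - mu0) / x.std(axis=1, ddof=1)
    print(sigma, np.mean(tstat > crit))        # each close to alpha = 0.05
\end{verbatim}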

Optimal tests

Likelihood Ratio tests

For general composite hypotheses optimality theory is not usually successful in producing an optimal test. Instead we look for heuristics to guide our choices. The simplest approach is to consider the likelihood ratio

\begin{displaymath}\frac{f_{\theta_1}(X)}{f_{\theta_0}(X)}
\end{displaymath}

and choose values of $\theta_1 \in \Theta_1$ and $\theta_0 \in \Theta_0$ which are reasonable estimates of $\theta$ assuming respectively the alternative or null hypothesis is true. The simplest method is to make each $\theta_i$ a maximum likelihood estimate, but maximized only over $\Theta_i$.

Example 1: In the $N(\mu,1)$ problem suppose we want to test $\mu \le 0$ against $\mu>0$. (Remember there is a UMP test.) The log likelihood function is, up to an additive constant not involving $\mu$,

\begin{displaymath}-n(\bar{X}-\mu)^2/2
\end{displaymath}

If $\bar{X} >0$ then this function has its global maximum in $\Theta_1$ at $\bar{X}$. Thus $\hat\mu_1$, which maximizes $\ell(\mu)$ subject to $\mu>0$, is $\bar{X}$ if $\bar{X} >0$. When $\bar{X} \le 0$ the maximum of $\ell(\mu)$ over $\mu>0$ is on the boundary, at $\hat\mu_1=0$. (Technically this is in the null, but in this case $\ell(0)$ is the supremum of the values $\ell(\mu)$ for $\mu>0$.) Similarly, the estimate $\hat\mu_0$ will be $\bar{X}$ if $\bar{X} \le 0$ and 0 if $\bar{X} >0$. It follows that

\begin{displaymath}\frac{f_{\hat\mu_1}(X_1,\ldots,X_n)}{f_{\hat\mu_0}(X_1,\ldots,X_n)}= \exp\{\ell(\hat\mu_1) - \ell(\hat\mu_0)\}
\end{displaymath}

which simplifies to

\begin{displaymath}\exp\{n\bar{X}\vert\bar{X}\vert/2\}
\end{displaymath}

This is a monotone increasing function of $\bar{X}$ so the rejection region will be of the form $\bar{X} > K$. To get the level right the test will have to reject if $n^{1/2} \bar{X} > z_\alpha$. Notice that the log likelihood ratio statistic

\begin{displaymath}\lambda \equiv 2\log\left(\frac{f_{\hat\mu_1}(X)}{f_{\hat\mu_0}(X)}\right) = n\bar{X}\vert\bar{X}\vert
\end{displaymath}

is an equivalent but simpler statistic.
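A quick numerical check (mine, not in the notes): rejecting when $\lambda = n\bar{X}\vert\bar{X}\vert$ exceeds $z_\alpha^2$ is the same event as $n^{1/2}\bar{X} > z_\alpha$, so the likelihood ratio test reproduces the UMP one sided test. The $n$, $\alpha$ and seed below are arbitrary.

\begin{verbatim}
# Check that {lambda > z_alpha^2} and {sqrt(n) * Xbar > z_alpha} are the same event.
# n, alpha and the seed are arbitrary.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
n, alpha = 16, 0.05
z_alpha = norm.ppf(1 - alpha)

xbar = rng.normal(0, 1 / np.sqrt(n), size=100_000)   # simulated values of Xbar
lam = n * xbar * np.abs(xbar)                         # log likelihood ratio statistic
agree = (lam > z_alpha**2) == (np.sqrt(n) * xbar > z_alpha)
print(np.mean(agree))                                 # 1.0: identical rejection regions
\end{verbatim}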

Example 2: In the $N(\mu,1)$ problem suppose we make the null $\mu=0$. Then the value of $\hat\mu_0$ is simply 0 while the maximum of the log-likelihood over the alternative $\mu \neq 0$ occurs at $\bar{X}$. This gives

\begin{displaymath}\lambda = n\bar{X}^2
\end{displaymath}

which, under the null hypothesis, has exactly a $\chi^2_1$ distribution. This test leads to the rejection region $\lambda > (z_{\alpha/2})^2$ which is the usual UMPU test.
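A Monte Carlo sketch (my illustration; $n$, the number of replications and the seed are arbitrary) confirming that under $\mu=0$ the statistic $\lambda = n\bar{X}^2$ has the $\chi^2_1$ tail behaviour used to set the critical value.

\begin{verbatim}
# Under mu = 0, sqrt(n) * Xbar ~ N(0, 1), so lambda = n * Xbar^2 ~ chi^2_1.
# n, the number of replications and the seed are arbitrary.
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(2)
n, nrep = 25, 200_000
xbar = rng.normal(0, 1 / np.sqrt(n), size=nrep)
lam = n * xbar**2
print(np.mean(lam > chi2.ppf(0.95, df=1)))     # close to 0.05
\end{verbatim}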

Example 3: For the $N(\mu,\sigma^2)$ problem testing $\mu=0$ against $\mu \neq 0$ we must find two estimates of the pair $(\mu,\sigma^2)$. The maximum of the likelihood over the alternative occurs at the global mle $(\hat\mu,\hat\sigma^2) = (\bar{X}, \sum (X_i-\bar{X})^2/n)$. We find

\begin{displaymath}\ell(\hat\mu,\hat\sigma) = -n/2 - n \log(\hat\sigma)
\end{displaymath}

We also need to maximize $\ell$ over the null hypothesis. Recall

\begin{displaymath}\ell(\mu,\sigma) = -\frac{1}{2\sigma^2} \sum (X_i-\mu)^2 -n\log(\sigma)
\end{displaymath}

On the null hypothesis we have $\mu=0$ and so we must find $\hat\sigma_0$ by maximizing

\begin{displaymath}\ell(0,\sigma) = -\frac{1}{2\sigma^2} \sum X_i^2 -n\log(\sigma)
\end{displaymath}

This leads to

\begin{displaymath}\hat\sigma_0^2 = \sum X_i^2/n
\end{displaymath}

and

\begin{displaymath}\ell(0,\hat\sigma_0) = -n/2 -n\log(\hat\sigma_0)
\end{displaymath}

This gives

\begin{displaymath}\lambda =-n\log(\hat\sigma^2/\hat\sigma_0^2)
\end{displaymath}

Since

\begin{displaymath}\frac{\hat\sigma^2}{\hat\sigma_0^2} = \frac{ \sum (X_i-\bar{X})^2}{
\sum (X_i-\bar{X})^2 + n\bar{X}^2}
\end{displaymath}

we can write

\begin{displaymath}\lambda = n \log(1+t^2/(n-1))
\end{displaymath}

where

\begin{displaymath}t = \frac{n^{1/2} \bar{X}}{s}
\end{displaymath}

is the usual t statistic. The likelihood ratio test thus rejects for large values of |t| which gives the usual test.

Notice that if n is large we have, since $\log(1+u) \approx u$ for small u,

\begin{displaymath}\lambda = n\log\left(1+t^2/(n-1)\right) \approx \frac{nt^2}{n-1} \approx t^2 \, .
\end{displaymath}

Since the t statistic is approximately standard normal if n is large we see that

\begin{displaymath}\lambda = 2[\ell(\hat\theta_1) - \ell(\hat\theta_0)]
\end{displaymath}

has nearly a $\chi^2_1$ distribution.
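As a small check of this approximation (my own, with arbitrary simulated data and sample size), the exact statistic $\lambda = n\log(1+t^2/(n-1))$ is already close to $t^2$ for moderate n:

\begin{verbatim}
# Compare the exact likelihood ratio statistic with its approximation t^2.
# The simulated sample (n = 50, mu = 0, sigma = 2) is arbitrary.
import numpy as np

rng = np.random.default_rng(3)
n = 50
x = rng.normal(0.0, 2.0, size=n)
t = np.sqrt(n) * x.mean() / x.std(ddof=1)      # usual t statistic
lam = n * np.log(1 + t**2 / (n - 1))           # exact lambda
print(t**2, lam)                               # nearly equal
\end{verbatim}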

This is a general phenomenon when the null hypothesis being tested is of the form $\phi=0$. Here is the general theory. Suppose that the vector of p+q parameters $\theta$ can be partitioned into $\theta=(\phi,\gamma)$ with $\phi$ a vector of p parameters and $\gamma$ a vector of q parameters. To test $\phi=\phi_0$ we find two mles of $\theta$. First the global mle $\hat\theta = (\hat\phi,\hat\gamma)$ maximizes the likelihood over $\Theta_1=\{\theta:\phi\neq\phi_0\}$ (because typically the probability that $\hat\phi$ is exactly $\phi_0$ is 0).

Now we maximize the likelihood over the null hypothesis, that is we find $\hat\theta_0 = (\phi_0,\hat\gamma_0)$ to maximize

\begin{displaymath}\ell(\phi_0,\gamma)
\end{displaymath}

The log-likelihood ratio statistic is

\begin{displaymath}2[\ell(\hat\theta)-\ell(\hat\theta_0)]
\end{displaymath}
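In models without closed form mles the two maximizations are done numerically. Here is a hedged Python sketch that reproduces Example 3 by numerical optimization with scipy.optimize.minimize; the simulated data and starting values are arbitrary, and the same recipe applies when no closed form exists.

\begin{verbatim}
# Compute 2[l(theta_hat) - l(theta_hat_0)] numerically for the N(mu, sigma^2)
# model of Example 3 (closed forms exist here; the recipe is what matters).
# The simulated data and starting values are arbitrary.
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(4)
x = rng.normal(0.3, 1.5, size=40)

def negloglik(params):
    mu, log_sigma = params                     # unconstrained parametrization
    return -np.sum(norm.logpdf(x, loc=mu, scale=np.exp(log_sigma)))

full = minimize(negloglik, x0=[0.0, 0.0])      # maximize over (mu, sigma)
null = minimize(lambda g: negloglik([0.0, g[0]]), x0=[0.0])   # mu fixed at 0
lam = 2 * (null.fun - full.fun)                # log likelihood ratio statistic
print(lam)                                     # compare with chi^2_1 quantiles
\end{verbatim}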

Now suppose that the true value of $\theta$ is $(\phi_0,\gamma_0)$ (so that the null hypothesis is true). The score function is a vector of length p+q and can be partitioned as $U=(U_\phi,U_\gamma)$. The Fisher information matrix can be partitioned as

\begin{displaymath}\left[\begin{array}{cc}
I_{\phi\phi} & I_{\phi\gamma}
\\
I_{\gamma\phi} & I_{\gamma\gamma}
\end{array}\right] \, .
\end{displaymath}

According to our large sample theory for the mle we have

\begin{displaymath}\hat\theta \approx \theta_0 + I^{-1} U
\end{displaymath}

and

\begin{displaymath}\hat\gamma_0 \approx \gamma_0 + I_{\gamma\gamma}^{-1} U_\gamma
\end{displaymath}

If you carry out a two term Taylor expansion of both $\ell(\hat\theta)$ and $\ell(\hat\theta_0)$ around $\theta_0$ you get

\begin{displaymath}\ell(\hat\theta) \approx \ell(\theta_0) + U^t I^{-1}U + \frac{1}{2}
U^tI^{-1} V(\theta_0) I^{-1} U
\end{displaymath}

where V is the second derivative matrix of $\ell$. Remember that $V \approx -I$ and you get

\begin{displaymath}2[\ell(\hat\theta) - \ell(\theta_0)] \approx U^t I^{-1}U \, .
\end{displaymath}

A similar expansion for $\hat\theta_0$ gives

\begin{displaymath}2[\ell(\hat\theta_0) -\ell(\theta_0)] \approx U_\gamma^t I_{\gamma\gamma}^{-1}
U_\gamma \, .
\end{displaymath}

If you subtract these you find that

\begin{displaymath}2[\ell(\hat\theta)-\ell(\hat\theta_0)]
\end{displaymath}

can be written in the approximate form

\begin{displaymath}U^t M U
\end{displaymath}

for a suitable matrix M. It is now possible to use the general theory of the distribution of $X^t M X$, where X is $MVN(0,\Sigma)$, to demonstrate that

Theorem: The log-likelihood ratio statistic

\begin{displaymath}\lambda = 2[\ell(\hat\theta) - \ell(\hat\theta_0)]
\end{displaymath}

has, under the null hypothesis, approximately a $\chi_p^2$ distribution.

Aside:

Theorem: Suppose that $X\sim MVN(0,\Sigma)$ with $\Sigma$ non-singular and M a symmetric matrix. If $\Sigma M \Sigma M \Sigma = \Sigma M \Sigma$ then $X^t M X$ has a $\chi^2$ distribution with degrees of freedom $\nu=trace(M\Sigma)$.

Proof: We have $X=AZ$ where $AA^t = \Sigma$ and Z is standard multivariate normal. So $X^t M X = Z^t A^t M A Z$. Let $Q=A^t M A$. Since $AA^t = \Sigma$ the condition in the theorem is actually

\begin{displaymath}AQQA^t = AQA^t
\end{displaymath}

Since $\Sigma$ is non-singular so is A. Multiply by $A^{-1}$ on the left and $(A^t)^{-1}$ on the right to discover $QQ=Q$.

The matrix Q is symmetric and so can be written in the form $P\Lambda P^t$ where $\Lambda$ is a diagonal matrix containing the eigenvalues of Q and P is an orthogonal matrix whose columns are the corresponding orthonormal eigenvectors. It follows that we can rewrite

\begin{displaymath}Z^t Q Z = (P^t Z)^t \Lambda (P^t Z)
\end{displaymath}

The variable $W = P^t Z$ is multivariate normal with mean 0 and variance covariance matrix $P^t P = I$; that is, W is standard multivariate normal. Now

\begin{displaymath}W^t \Lambda W =\sum \lambda_i W_i^2
\end{displaymath}

We have established that the general distribution of any quadratic form $X^t M X$ is that of a linear combination of independent $\chi^2_1$ variables. Now go back to the condition QQ=Q. If $\lambda$ is an eigenvalue of Q and $v\neq 0$ is a corresponding eigenvector then $QQv = Q(\lambda v) = \lambda Qv = \lambda^2 v$ but also $QQv =Qv = \lambda v$. Thus $\lambda(1-\lambda ) v=0$. It follows that either $\lambda=0$ or $\lambda=1$. This means that the weights in the linear combination are all 1 or 0 and that $X^t M X$ has a $\chi^2$ distribution with degrees of freedom, $\nu$, equal to the number of $\lambda_i$ which are equal to 1. This is the same as the sum of the $\lambda_i$ so

\begin{displaymath}\nu = trace(\Lambda)
\end{displaymath}

But
\begin{align*}trace(M\Sigma)& = trace(MAA^t)
\\
&= trace(A^t M A)
\\
& = trace(Q)
\\
& = trace(P\Lambda P^t)
\\
& = trace(\Lambda P^t P)
\\
& = trace(\Lambda)
\end{align*}
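A Monte Carlo sketch of this theorem (my own construction, not in the notes): build a symmetric M satisfying $\Sigma M \Sigma M \Sigma = \Sigma M \Sigma$ by pushing an idempotent Q through the factor A, then check that $X^t M X$ has the advertised $\chi^2$ tail with $trace(M\Sigma)$ degrees of freedom. The particular $\Sigma$ and Q below are arbitrary.

\begin{verbatim}
# Monte Carlo check of the quadratic form theorem. Sigma and Q are arbitrary;
# Q is an idempotent (projection) matrix of rank 2 pushed through A.
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(5)
Sigma = np.array([[2.0, 0.5, 0.3],
                  [0.5, 1.0, 0.2],
                  [0.3, 0.2, 1.5]])
A = np.linalg.cholesky(Sigma)                  # A A^t = Sigma
Q = np.diag([1.0, 1.0, 0.0])                   # QQ = Q, rank 2
Ainv = np.linalg.inv(A)
M = Ainv.T @ Q @ Ainv                          # symmetric, with A^t M A = Q

nu = np.trace(M @ Sigma)                       # degrees of freedom: 2
Z = rng.standard_normal((200_000, 3))
X = Z @ A.T                                    # rows are MVN(0, Sigma)
qf = np.einsum('ij,jk,ik->i', X, M, X)         # X^t M X for each row
print(nu, np.mean(qf > chi2.ppf(0.95, df=round(nu))))   # nu = 2, tail approx 0.05
\end{verbatim}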

In the application $\Sigma$ is ${\cal I}$, the Fisher information, and $M={\cal I}^{-1} - J$ where

\begin{displaymath}J= \left[\begin{array}{cc}
0 & 0 \\ 0 & I_{\gamma\gamma}^{-1}
\end{array}\right]
\end{displaymath}

It is easy to check that $M\Sigma$ becomes

\begin{displaymath}\left[\begin{array}{cc}
I & 0 \\ -I_{\gamma\gamma}^{-1}I_{\gamma\phi} & 0
\end{array}\right]
\end{displaymath}

where I is a $p\times p$ identity matrix. It follows that $M\Sigma M\Sigma= M\Sigma$ and that $trace(M\Sigma) = p$.
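Finally, a numerical sketch (with a made-up Fisher information; p = q = 2 is arbitrary) checking that $M\Sigma$ is idempotent and that $trace(M\Sigma)=p$:

\begin{verbatim}
# Check that M Sigma is idempotent and has trace p when M = I^{-1} - J.
# The Fisher information below is made up; p = q = 2 is arbitrary.
import numpy as np

rng = np.random.default_rng(6)
p, q = 2, 2
B = rng.standard_normal((p + q, p + q))
I_fisher = B @ B.T + (p + q) * np.eye(p + q)   # a positive definite "information"

J = np.zeros((p + q, p + q))
J[p:, p:] = np.linalg.inv(I_fisher[p:, p:])    # I_{gamma gamma}^{-1} block

M = np.linalg.inv(I_fisher) - J
MS = M @ I_fisher                              # here Sigma is the information
print(np.allclose(MS @ MS, MS))                # True: M Sigma idempotent
print(np.trace(MS))                            # equals p = 2
\end{verbatim}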





Richard Lockhart
1999-11-27