STAT 450

Lecture 28

Last time: Exponential Families, Lehmann-Scheffé Theorem

If $X_1,\ldots,X_n$ are iid with density

\begin{displaymath}f(x_j,\theta) = h(x_j) \exp\{\sum_{i=1}^p a_i(\theta)S_i(x_j)+c(\theta)\}
\end{displaymath}

then the joint density of the data is

\begin{displaymath}\prod_j h(x_j) \exp\{\sum_{i=1}^p a_i(\theta) \sum_j S_i(x_j)+nc(\theta)\}
\end{displaymath}

If the range of the function $(a_1(\theta),\ldots,a_p(\theta))$ (as $\theta$ varies over $\Theta$) contains a (hyper-) rectangle in $R^p$ then the statistic

\begin{displaymath}(\sum_j S_1(X_j), \ldots, \sum_j S_p(X_j))
\end{displaymath}

is complete and sufficient.
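
For example, the $N(\mu,\sigma^2)$ density can be written in this form:

\begin{displaymath}f(x;\mu,\sigma^2) = \frac{1}{\sqrt{2\pi}}\exp\left\{\frac{\mu}{\sigma^2}x - \frac{1}{2\sigma^2}x^2 - \frac{\mu^2}{2\sigma^2} - \log\sigma\right\}
\end{displaymath}

so that $p=2$, $S_1(x)=x$, $S_2(x)=x^2$, $a_1(\theta)=\mu/\sigma^2$ and $a_2(\theta)=-1/(2\sigma^2)$. As $(\mu,\sigma^2)$ ranges over $R\times(0,\infty)$ the point $(a_1,a_2)$ ranges over the half plane $R\times(-\infty,0)$, which certainly contains a rectangle, so $(\sum_j X_j, \sum_j X_j^2)$ is complete and sufficient.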

The Lehmann-Scheffé Theorem

Theorem: If S is a complete sufficient statistic for some model and h(S) is an unbiased estimate of some parameter $\phi(\theta)$ then h(S) is the UMVUE of $\phi(\theta)$.

Example: In the $N(\mu,\sigma^2)$ example $(\bar{X},s^2)$ is complete and sufficient, so the UMVUEs of $\mu$, $\sigma^2$ and $\sigma^4$ are $\bar{X}$, $s^2$ and $(n-1)s^4/(n+1)$.
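
The $\sigma^4$ claim can be checked directly: since $(n-1)s^2/\sigma^2 \sim \chi^2_{n-1}$ and $E\{(\chi^2_{n-1})^2\} = (n-1)(n+1)$ we get

\begin{displaymath}E(s^4) = \frac{\sigma^4(n-1)(n+1)}{(n-1)^2} = \frac{n+1}{n-1}\sigma^4
\end{displaymath}

so $(n-1)s^4/(n+1)$ is unbiased for $\sigma^4$ and, being a function of the complete sufficient statistic, is the UMVUE.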

Criticism of Unbiasedness

1.
The UMVUE can be inadmissible for squared error loss, meaning that there is a (biased, of course) estimate whose MSE is smaller for every parameter value. An example is the UMVUE of $\phi=p(1-p)$, which is $\hat\phi =n\hat{p}(1-\hat{p})/(n-1)$. The MSE of

\begin{displaymath}\tilde{\phi} = \min(\hat\phi,1/4)
\end{displaymath}

is smaller than that of $\hat\phi$. Another example is provided by estimation of $\sigma^2$ in the $N(\mu,\sigma^2)$ problem; see the homework. (A small simulation, sketched after this list, illustrates the Binomial comparison.)

2.
There are examples where unbiased estimation is impossible. The log odds in a Binomial model is $\phi=\log(p/(1-p))$. The expectation of any function $g(X)$ of a Binomial$(n,p)$ count is $\sum_{k=0}^n g(k)\binom{n}{k}p^k(1-p)^{n-k}$, which is a polynomial in p of degree at most n. Since $\phi$ is not a polynomial function of p, there is no unbiased estimate of $\phi$.

3.
The UMVUE of $\sigma$ is not the square root of the UMVUE of $\sigma^2$. This method of estimation does not have the parameterization equivariance that maximum likelihood does.

4.
Unbiasedness is irrelevant (unless you plan to average together many estimators). The property is an average over possible values of the estimate, in which positive errors are allowed to cancel negative errors. An exception to this criticism arises when you plan to average a number of estimators to get a single estimator: then it is a problem if all the estimators have the same bias. In Assignment 5 you have the one-way layout example, in which the MLE of the residual variance averages together many biased estimates and so is very badly biased. That assignment shows that the solution is not really to insist on unbiasedness but to consider an alternative to averaging for putting the individual estimates together.
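
Here is a small simulation sketch (in Python; the sample size, parameter grid and replication count are made up for illustration) comparing the MSE of $\hat\phi$ and $\tilde\phi$ in the Binomial example of point 1; the truncated estimator should come out with smaller MSE at every p, if only slightly for p near 0 or 1.

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(1)
n, nrep = 10, 200000     # hypothetical sample size and number of replications

for p in [0.1, 0.3, 0.5, 0.7, 0.9]:
    phi = p * (1 - p)                            # target parameter
    x = rng.binomial(n, p, size=nrep)            # Binomial counts
    phat = x / n
    phi_hat = n * phat * (1 - phat) / (n - 1)    # UMVUE of p(1-p)
    phi_tilde = np.minimum(phi_hat, 0.25)        # truncated (biased) competitor
    mse_hat = np.mean((phi_hat - phi) ** 2)
    mse_tilde = np.mean((phi_tilde - phi) ** 2)
    print("p=%.1f  MSE(UMVUE)=%.5f  MSE(truncated)=%.5f"
          % (p, mse_hat, mse_tilde))
\end{verbatim}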

Minimal Sufficiency

In any model the statistic $S(X)\equiv X$ is sufficient. In any iid model the vector of order statistics $X_{(1)}, \ldots, X_{(n)}$ is sufficient. In the $N(\mu,1)$ model we have three possible sufficient statistics:

1.
$S_1 = (X_1,\ldots,X_n)$.

2.
$S_2 = (X_{(1)}, \ldots, X_{(n)})$.

3.
$S_3 = \bar{X}$.

Notice that I can calculate $S_3$ from the values of $S_1$ or $S_2$ but not vice versa, and that I can calculate $S_2$ from $S_1$ but not vice versa. It turns out that $\bar{X}$ is a minimal sufficient statistic, meaning that it is a function of any other sufficient statistic. (You can't collapse the data set any more without losing information about $\mu$.)

To recognize minimal sufficient statistics you look at the likelihood function:

Fact: If you fix some particular $\theta^*$ then the log likelihood ratio function

\begin{displaymath}\ell(\theta)-\ell(\theta^*)
\end{displaymath}

is minimal sufficient. WARNING: the statistic here is the whole function $\theta \mapsto \ell(\theta)-\ell(\theta^*)$, not its value at any single $\theta$.

The subtraction of $\ell(\theta^*)$ gets rid of those irrelevant constants in the log-likelihood. For instance in the $N(\mu,1)$ example we have

\begin{displaymath}\ell(\mu) = -n\log(2\pi)/2 - \sum X_i^2/2 + \mu\sum X_i -n\mu^2/2
\end{displaymath}

This depends on $\sum X_i^2$ which is not needed for the sufficient statistic. Take $\mu^*=0$ and get

\begin{displaymath}\ell(\mu) -\ell(\mu^*) = \mu\sum X_i -n\mu^2/2
\end{displaymath}

This function of $\mu$ is minimal sufficient. Notice that from $\sum X_i$ you can compute this minimal sufficient statistic and vice versa. Thus $\sum X_i$ is also minimal sufficient.
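
A quick numerical check (a sketch in Python with made-up data): two samples with the same value of $\sum X_i$ but different individual values, and hence different $\sum X_i^2$, give exactly the same function $\ell(\mu)-\ell(0)$.

\begin{verbatim}
import numpy as np

# two made-up samples with the same sum (6.0) but different values
x1 = np.array([0.5, 1.5, 2.0, 2.0])
x2 = np.array([3.0, 3.0, 0.0, 0.0])

def loglik(mu, x):
    # N(mu,1) log likelihood: -n log(2 pi)/2 - sum x_i^2/2 + mu sum x_i - n mu^2/2
    n = len(x)
    return -n * np.log(2 * np.pi) / 2 - (x ** 2).sum() / 2 \
           + mu * x.sum() - n * mu ** 2 / 2

mus = np.linspace(-2, 2, 9)
r1 = loglik(mus, x1) - loglik(0.0, x1)
r2 = loglik(mus, x2) - loglik(0.0, x2)
print(np.allclose(r1, r2))   # True: the ratio depends on the data only through sum(x)
\end{verbatim}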

FACT: A complete sufficient statistic is also minimal sufficient.

Hypothesis Testing

Hypothesis testing is a statistical problem where you must choose, on the basis of data X, between two alternatives. We formalize this as the problem of choosing between two hypotheses: $H_0: \theta\in \Theta_0$ or $H_1: \theta\in\Theta_1$ where $\Theta_0$ and $\Theta_1$ are a partition of the parameter space $\Theta$ of the model $\{P_\theta; \theta\in \Theta\}$. That is, $\Theta_0 \cup \Theta_1 = \Theta$ and $\Theta_0 \cap\Theta_1=\emptyset$.

A rule for making the required choice can be described in two ways:

1.
In terms of the set

\begin{displaymath}C=\{X: \mbox{we choose $\Theta_1$ if we observe $X$}\}
\end{displaymath}

called the rejection or critical region of the test.

2.
In terms of a function $\phi(x)$ which is equal to 1 for those x for which we choose $\Theta_1$ and 0 for those x for which we choose $\Theta_0$.

For technical reasons which will come up soon I prefer to use the second description. However, each $\phi$ corresponds to a unique rejection region $R_\phi=\{x:\phi(x)=1\}$.

The Neyman-Pearson approach to hypothesis testing, which we consider first, treats the two hypotheses asymmetrically. The hypothesis $H_0$ is referred to as the null hypothesis (because traditionally it has been the hypothesis that some treatment has no effect).

Definition: The power function of a test $\phi$ (or the corresponding critical region $R_\phi$) is

\begin{displaymath}\pi(\theta) = P_\theta(X\in R_\phi) = E_\theta(\phi(X))
\end{displaymath}

We are interested here in optimality theory, that is, the problem of finding the best $\phi$. A good $\phi$ will evidently have $\pi(\theta)$ small for $\theta\in\Theta_0$ and large for $\theta\in\Theta_1$. There is generally a trade-off between these two goals, however, and it can be resolved in many ways.
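
As a concrete (hypothetical) illustration, take the $N(\theta,1)$ model and the test which rejects when $\sqrt{n}\bar{X} > 1.645$; since $\sqrt{n}\bar{X}\sim N(\sqrt{n}\theta,1)$, the power function is $\pi(\theta) = 1-\Phi(1.645-\sqrt{n}\theta)$. A short Python sketch (scipy assumed available; the sample size and cutoff are made up):

\begin{verbatim}
import numpy as np
from scipy.stats import norm

n, c = 25, 1.645    # hypothetical sample size and cutoff (the 0.95 normal point)

def power(theta):
    # pi(theta) = P_theta(sqrt(n) * Xbar > c), where sqrt(n) * Xbar ~ N(sqrt(n) * theta, 1)
    return 1 - norm.cdf(c - np.sqrt(n) * theta)

for theta in [0.0, 0.1, 0.2, 0.3, 0.5]:
    print("theta=%.1f  power=%.3f" % (theta, power(theta)))
\end{verbatim}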

Simple versus Simple testing

Finding a best test is easiest when the hypotheses are very precise.

Definition: A hypothesis $H_i$ is simple if $\Theta_i$ contains only a single value $\theta_i$.

The simple versus simple testing problem arises when we test $\theta=\theta_0$ against $\theta=\theta_1$ so that $\Theta$ has only two points in it. This problem is of importance as a technical tool, not because it is a realistic situation.

Suppose that the model specifies that if $\theta=\theta_0$ then the density of X is $f_0(x)$ and if $\theta=\theta_1$ then the density of X is $f_1(x)$. How should we choose $\phi$? To answer the question we begin by studying the problem of minimizing the total error probability.

We define a Type I error as the error made when $\theta=\theta_0$ but we choose $H_1$, that is, $X\in R_\phi$. The other kind of error, made when $\theta=\theta_1$ but we choose $H_0$, is called a Type II error. We define the level of a simple versus simple test to be

\begin{displaymath}\alpha = P_{\theta_0}(\mbox{We make a Type I error})
\end{displaymath}

or

\begin{displaymath}\alpha = P_{\theta_0}(X\in R_\phi) = E_{\theta_0}(\phi(X))
\end{displaymath}

The other error probability is denoted $\beta$ and defined as

\begin{displaymath}\beta= P_{\theta_1}(X\not\in R_\phi) = E_{\theta_1}(1-\phi(X))
\end{displaymath}

Suppose we want to minimize $\alpha+\beta$, the total error probability. We want to minimize

\begin{displaymath}E_{\theta_0}(\phi(X))+E_{\theta_1}(1-\phi(X))
=
\int[ \phi(x) f_0(x) +(1-\phi(x))f_1(x)] dx
\end{displaymath}

The problem is to choose, for each x, either the value 0 or the value 1, in such a way as to minimize the integral. But for each x the quantity

\begin{displaymath}\phi(x) f_0(x) +(1-\phi(x))f_1(x)
\end{displaymath}

can be chosen either to be $f_0(x)$ or $f_1(x)$. To make it small we take $\phi(x) = 1$ if $f_1(x) > f_0(x)$ and $\phi(x) = 0$ if $f_1(x) < f_0(x)$. It makes no difference what we do for those x for which $f_1(x)=f_0(x)$. Notice that we can divide both sides of these inequalities by $f_0(x)$ to rephrase the condition in terms of the likelihood ratio $f_1(x)/f_0(x)$.
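
For example, if X is a single observation and $f_0$, $f_1$ are the $N(0,1)$ and $N(1,1)$ densities then

\begin{displaymath}\frac{f_1(x)}{f_0(x)} = \exp\{x-1/2\}
\end{displaymath}

which exceeds 1 exactly when $x > 1/2$, so the test minimizing $\alpha+\beta$ rejects when $X > 1/2$; its error probabilities are $\alpha = 1-\Phi(1/2)$ and $\beta = \Phi(-1/2)$, each about 0.31.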

Theorem: For each fixed $\lambda$ the quantity $\lambda\alpha+\beta$ is minimized by any $\phi$ which has

\begin{displaymath}\phi(x) =\left\{\begin{array}{ll}
1 & \frac{f_1(x)}{f_0(x)} > \lambda
\\
0 & \frac{f_1(x)}{f_0(x)} < \lambda
\end{array}\right.
\end{displaymath}
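
Continuing the $N(0,1)$ versus $N(1,1)$ illustration, this test rejects when $x > 1/2 + \log\lambda$, and varying $\lambda$ traces out the trade-off between $\alpha$ and $\beta$. A small Python sketch (the grid of $\lambda$ values is arbitrary):

\begin{verbatim}
import numpy as np
from scipy.stats import norm

# Likelihood ratio test of N(0,1) against N(1,1) from one observation:
# f1(x)/f0(x) = exp(x - 1/2) > lambda  is the same as  x > 1/2 + log(lambda)
for lam in [0.25, 0.5, 1.0, 2.0, 4.0]:
    cut = 0.5 + np.log(lam)
    alpha = 1 - norm.cdf(cut)    # P_0(reject H_0)
    beta = norm.cdf(cut - 1)     # P_1(fail to reject H_0)
    print("lambda=%4.2f  alpha=%.3f  beta=%.3f" % (lam, alpha, beta))
\end{verbatim}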





Richard Lockhart
1999-11-20