STAT 450

Lecture 20

Today's notes

Reading for Today's Lecture:

Goals of Today's Lecture:

So far:

The Fisher information matrix is

\begin{displaymath}{\cal I}_n(\theta)=-E_{\theta}(U^\prime(\theta)) = \text{Var}(U(\theta)).
\end{displaymath}

Theorem: In iid sampling

\begin{displaymath}\sqrt{n}(\hat\theta - \theta) \Rightarrow N(0,{\cal I}_1^{-1}(\theta))
\end{displaymath}
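A quick simulation (not part of the notes; exponential data chosen for concreteness) illustrates the theorem: for iid Exponential($\lambda$) data the MLE is $1/\bar X$ and ${\cal I}_1(\lambda)=1/\lambda^2$, so $\sqrt{n}(\hat\lambda-\lambda)$ should have variance near $\lambda^2$.

```python
import random

# Simulation sketch (not from the notes): for iid Exponential(lambda) data
# the MLE is 1/xbar and I_1(lambda) = 1/lambda^2, so sqrt(n)*(lambda_hat -
# lambda) should have variance close to lambda^2.
random.seseed = None  # (no-op placeholder removed below)
random.seed(0)
lam, n, reps = 2.0, 400, 3000
vals = []
for _ in range(reps):
    xbar = sum(random.expovariate(lam) for _ in range(n)) / n
    vals.append(n ** 0.5 * (1 / xbar - lam))

mean_v = sum(vals) / reps
var_hat = sum((v - mean_v) ** 2 for v in vals) / reps
print(round(var_hat, 2))  # should be near lambda^2 = 4
```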

Examples, then uses

D: Uniform $[0,\theta]$. We have $X_1,\ldots,X_n$ iid with density $f(x,\theta) = \frac{1}{\theta}1(0 \le x \le \theta)$. We find
\begin{align*}
L(\theta) & = \frac{1}{\theta^n} 1(\max_i X_i \le \theta)
\\
\hat\theta & = X_{(n)} \equiv \max_i X_i
\\
\bullet & \text{ $\hat\theta$ is {\bf not} asymptotically normal}
\\
\bullet & \text{ Family is {\bf irregular}}
\end{align*}
This family has the feature that the support of the density, namely $\{x:f(x;\theta) > 0 \}$ depends on $\theta$. In such families it is common for the standard mle theory to fail.
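A small simulation (not in the original notes) displays the irregular behaviour: the MLE is the sample maximum and $n(\theta - \hat\theta)$ has an exponential limit with mean $\theta$ rather than a normal one.

```python
import random

# Simulation sketch (not from the notes): for Uniform[0, theta] the MLE is
# the sample maximum, and n*(theta - theta_hat) converges to an exponential
# limit with mean theta -- the hallmark of an irregular family.
random.seed(1)
theta, n, reps = 2.0, 500, 2000
scaled_errors = []
for _ in range(reps):
    theta_hat = max(random.uniform(0, theta) for _ in range(n))
    scaled_errors.append(n * (theta - theta_hat))

mean_err = sum(scaled_errors) / reps
# E[n*(theta - theta_hat)] = n*theta/(n+1), approximately theta = 2 here
print(round(mean_err, 2))
```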

Uses and extensions

Confidence Intervals: We can base confidence intervals on one of several forms. For this section I will assume that $\theta$ is a scalar (one dimensional) parameter and use ${}^\prime$ to denote a derivative with respect to the parameter. There are 3 standard versions of the normal approximation:
\begin{align*}\sqrt{I_n(\theta)}(\hat\theta - \theta) \Rightarrow N(0,1)
\\
\sqrt{I_n(\hat\theta)}(\hat\theta - \theta) \Rightarrow N(0,1)
\\
\sqrt{-U^\prime(\hat\theta)}(\hat\theta - \theta) \Rightarrow N(0,1)
\end{align*}
Each of these quantities may be used to derive confidence intervals for $\theta$ by finding the collection of all $\theta$ for which the quantity is smaller than some critical point.

The second and third quantities are of the form

\begin{displaymath}\frac{\hat\theta -\theta}{\hat\sigma_{\hat\theta}}
\end{displaymath}

If such a quantity is standard normal then

\begin{displaymath}P(\left\vert\frac{\hat\theta -\theta}{\hat\sigma_{\hat\theta}}\right\vert \le
z_{\alpha/2}) \approx 1-\alpha
\end{displaymath}

so

\begin{displaymath}P(\hat\theta - \hat\sigma_{\hat\theta} z_{\alpha/2} \le \theta \le
\hat\theta + \hat\sigma_{\hat\theta} z_{\alpha/2}) \approx 1-\alpha
\end{displaymath}

This means that

\begin{displaymath}\hat\theta \pm \hat\sigma_{\hat\theta} z_{\alpha/2}
\end{displaymath}

is an approximate level $1-\alpha$ confidence interval for $\theta$.
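The recipe above is easy to code. Here is a minimal Python sketch (not from the notes); the numbers 3.2 and 0.4 are made up purely for illustration.

```python
from statistics import NormalDist

def wald_ci(theta_hat, se_hat, alpha=0.05):
    """Approximate level 1-alpha interval: theta_hat +/- se_hat * z_{alpha/2}."""
    z = NormalDist().inv_cdf(1 - alpha / 2)  # z_{alpha/2}, about 1.96 for alpha=0.05
    return theta_hat - se_hat * z, theta_hat + se_hat * z

# Hypothetical estimate and standard error, for illustration only
lo, hi = wald_ci(3.2, 0.4)
print(round(lo, 2), round(hi, 2))  # → 2.42 3.98
```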

The first quantity above can also be used to derive a confidence interval, but usually you must do more work to solve the inequality

\begin{displaymath}\sqrt{n I_1(\theta)}\left\vert\hat\theta - \theta\right\vert\le z_{\alpha/2}
\end{displaymath}

Here are some examples:

Exponential distribution: With $X_1,\ldots,X_n$ iid with density $f(x,\lambda) = \lambda e^{-\lambda x}1(x>0)$ we have

\begin{displaymath}\sqrt{n}(\hat\lambda - \lambda)/\hat\lambda \Rightarrow N(0,1)
\end{displaymath}

In this formula

\begin{displaymath}\hat\sigma_{\hat\lambda} \equiv \hat\lambda/\sqrt{n}
\end{displaymath}

is an estimated standard error. We get the confidence interval for $\lambda$

\begin{displaymath}\hat\lambda \pm \frac{\hat\lambda}{\sqrt{n}} z_{\alpha/2}=
\frac{1}{\bar{X}} \pm \frac{1}{\sqrt{n}\bar{X}} z_{\alpha/2}
\end{displaymath}

We could also use the fact that

\begin{displaymath}\sqrt{n}(\hat\lambda - \lambda)/\lambda \Rightarrow N(0,1)
\end{displaymath}

This shows that

\begin{displaymath}P( \left\vert\sqrt{n}(\hat\lambda - \lambda)/\lambda\right\vert \le z_{\alpha/2} )
\approx 1-\alpha
\end{displaymath}

To get a confidence interval we take

\begin{displaymath}CI = \{\lambda: \left\vert\sqrt{n}(\hat\lambda - \lambda)/\lambda\right\vert \le
z_{\alpha/2} \} \, .
\end{displaymath}

Now I will show you how to simplify this expression to get an interval. A given $\lambda$ will be in CI if and only if

\begin{displaymath}-z_{\alpha/2} \le \sqrt{n}(\hat\lambda/\lambda -1)\le z_{\alpha/2}
\end{displaymath}

Multiply through by $1/\sqrt{n}$ and add 1 to see that these inequalities are equivalent to

\begin{displaymath}1-z_{\alpha/2}/\sqrt{n} \le \hat\lambda/\lambda \le 1 +
z_{\alpha/2}/\sqrt{n}
\end{displaymath}

Now take reciprocals, remembering that this reverses the direction of the inequalities (and assuming $n$ is large enough that $1 - z_{\alpha/2}/\sqrt{n} > 0$), to get

\begin{displaymath}\frac{1}{1+z_{\alpha/2}/\sqrt{n} } \le \frac{\lambda}{\hat\lambda} \le
\frac{1}{1
- z_{\alpha/2}/\sqrt{n} }
\end{displaymath}

Finally, multiply through by $\hat\lambda$ to get the interval

\begin{displaymath}\frac{\hat\lambda}{1+z_{\alpha/2}/\sqrt{n} } \le \lambda \le
\frac{\hat\lambda}{1 - z_{\alpha/2}/\sqrt{n} }
\end{displaymath}
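Both exponential intervals can be computed side by side; this sketch (simulated data, not from the notes) evaluates the plug-in interval $\hat\lambda \pm \hat\lambda z_{\alpha/2}/\sqrt{n}$ and the interval obtained by inverting $\sqrt{n}(\hat\lambda-\lambda)/\lambda$.

```python
import random
from statistics import NormalDist

# Sketch (simulated data, not from the notes) comparing the two exponential
# intervals derived above.
random.seed(2)
n, true_lam = 50, 1.5
x = [random.expovariate(true_lam) for _ in range(n)]
lam_hat = n / sum(x)                     # MLE = 1 / xbar
z = NormalDist().inv_cdf(0.975)          # z_{alpha/2} for alpha = 0.05
c = z / n ** 0.5

plug_in = (lam_hat - lam_hat * c, lam_hat + lam_hat * c)
inverted = (lam_hat / (1 + c), lam_hat / (1 - c))
print(plug_in, inverted)
```

Note that the inverted interval is not symmetric about $\hat\lambda$: its right arm is longer.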

Cauchy example: In this example I use the observed information, namely $-U^\prime(\hat\theta)$. In the Cauchy example we found

\begin{displaymath}-U^\prime(\theta) = 2 \sum \frac{1-(X_i-\theta)^2}{[1+(X_i-\theta)^2]^2}
\end{displaymath}

If $\hat\theta $ is the MLE then

\begin{displaymath}-U^\prime(\hat\theta) = 2 \sum
\frac{1-(X_i-\hat\theta)^2}{[1+(X_i-\hat\theta)^2]^2}
\end{displaymath}

can be used to give the approximation

\begin{displaymath}\sqrt{-U^\prime(\hat\theta)}(\hat\theta-\theta) \approx N(0,1)
\end{displaymath}

leading to confidence intervals

\begin{displaymath}\hat\theta \pm \frac{z_{\alpha/2}}{\sqrt{-U^\prime(\hat\theta)}}
\end{displaymath}

which are quite different from those using the Fisher information ${\cal I}_n(\hat\theta) = {\cal I}_n(\theta) = n/2$:

\begin{displaymath}\hat\theta \pm \frac{z_{\alpha/2}}{\sqrt{n/2}}
\end{displaymath}
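The Cauchy MLE has no closed form, so this sketch (simulated data, not from the notes) finds it by Fisher scoring, using ${\cal I}_n(\theta)=n/2$ as the step scaling, and then computes both intervals.

```python
import math
import random
from statistics import NormalDist, median

# Sketch (simulated data, not from the notes): Fisher scoring for the Cauchy
# location MLE, then the observed-information interval versus the
# Fisher-information (n/2) interval compared above.
random.seed(3)
n = 100
# Cauchy(0.5) draws via the inverse CDF (tangent) method
x = [0.5 + math.tan(math.pi * (random.random() - 0.5)) for _ in range(n)]

theta = median(x)                      # robust starting value
for _ in range(200):
    u = sum(2 * (xi - theta) / (1 + (xi - theta) ** 2) for xi in x)  # score U
    theta += u / (n / 2)               # scoring step with I_n(theta) = n/2

obs_info = sum(2 * (1 - (xi - theta) ** 2) / (1 + (xi - theta) ** 2) ** 2
               for xi in x)            # observed information -U'(theta_hat)
z = NormalDist().inv_cdf(0.975)
ci_obs = (theta - z / obs_info ** 0.5, theta + z / obs_info ** 0.5)
ci_fisher = (theta - z / (n / 2) ** 0.5, theta + z / (n / 2) ** 0.5)
print(ci_obs, ci_fisher)
```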

Hypothesis testing

All these normal approximations can be used to test either the one sided hypotheses $H_o: \theta=\theta_o$ or $H_o: \theta \le \theta_o$ against $H_1: \theta > \theta_o$, or the two sided hypothesis $H_o: \theta=\theta_o$ against $H_1: \theta \neq \theta_o$. All you do is substitute $\theta_o$ for $\theta$ and then get P-values from the normal distribution. In the exponential example, for instance, you use either

\begin{displaymath}Z=\sqrt{n}(\hat\lambda-\lambda_o)/\lambda_o
\end{displaymath}

or

\begin{displaymath}Z= \sqrt{n}(\hat\lambda-\lambda_o)/\hat\lambda
\end{displaymath}

as a test statistic and look up areas under normal curves to get P-values.
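The studentized version of the exponential test can be sketched as follows (simulated data, not from the notes; here the data happen to be generated under $H_o$).

```python
import random
from statistics import NormalDist

# Sketch: two-sided test of H_o: lambda = 1 in the exponential example,
# using the studentized statistic Z = sqrt(n)(lambda_hat - lambda_o)/lambda_hat.
random.seed(4)
n, lam_0 = 40, 1.0
x = [random.expovariate(1.0) for _ in range(n)]   # data generated under H_o
lam_hat = n / sum(x)
Z = n ** 0.5 * (lam_hat - lam_0) / lam_hat
p_value = 2 * (1 - NormalDist().cdf(abs(Z)))      # two-sided normal P-value
print(round(Z, 3), round(p_value, 3))
```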

In the Cauchy case you could use

\begin{displaymath}Z=\sqrt{n/2}(\hat\theta-\theta_o)
\end{displaymath}

or

\begin{displaymath}Z=\sqrt{-U^\prime(\hat\theta)}(\hat\theta-\theta_o)
\end{displaymath}

Other methods of estimation: Method of Moments

Basic strategy: set sample moments equal to population moments and solve for the parameters.

Definition: The $r^{\mbox{th}}$ sample moment (about the origin) is

\begin{displaymath}\frac{1}{n}\sum_{i=1}^n X_i^r
\end{displaymath}

The $r^{\mbox{th}}$ population moment is

\begin{displaymath}{\rm E}(X^r)
\end{displaymath}

Central moments are

\begin{displaymath}\frac{1}{n}\sum_{i=1}^n (X_i-\bar X)^r
\end{displaymath}

and

\begin{displaymath}{\rm E}\left[(X-\mu)^r\right] \, .
\end{displaymath}

If we have p parameters we can estimate the parameters $\theta_1,\ldots,\theta_p$ by solving the system of p equations:

\begin{displaymath}\mu_1 = \bar{X}
\end{displaymath}


\begin{displaymath}\mu_2^\prime = \overline{X^2}
\end{displaymath}

and so on to

\begin{displaymath}\mu_p^\prime = \overline{X^p}
\end{displaymath}

You need to remember that the population moments $\mu_k^\prime$ will be formulas involving the parameters.

Gamma Example

The Gamma($\alpha,\beta$) density is

\begin{displaymath}f(x;\alpha,\beta) =
\frac{1}{\beta\Gamma(\alpha)}\left(\frac{x}{\beta}\right)^{\alpha-1}
\exp\left[-\frac{x}{\beta}\right] 1(x>0)
\end{displaymath}

and has

\begin{displaymath}\mu_1 = \alpha\beta
\end{displaymath}

and second central moment (the variance)

\begin{displaymath}\mu_2 = \alpha\beta^2\end{displaymath}

This gives the equations
\begin{align*}\alpha\beta & = \overline{X}
\\
\alpha\beta^2 & = \overline{X^2} - (\overline{X})^2 \equiv s^2
\end{align*}
where $s^2$ is the sample variance. Divide the second equation by the first to find the method of moments estimate of $\beta$:

\begin{displaymath}\tilde\beta = s^2/\overline{X}
\end{displaymath}

Then from the first equation get

\begin{displaymath}\tilde\alpha = \overline{X}/\tilde\beta= (\overline{X})^2/s^2
\end{displaymath}
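Matching the mean and variance gives $\tilde\alpha = \bar{X}^2/s^2$ and $\tilde\beta = s^2/\bar{X}$; a quick simulation check (not from the notes):

```python
import random

# Sketch of the Gamma method of moments estimates: alpha_tilde = xbar^2/s2,
# beta_tilde = s2/xbar, with s2 the (uncorrected) sample variance.
random.seed(5)
alpha, beta, n = 3.0, 2.0, 5000
x = [random.gammavariate(alpha, beta) for _ in range(n)]  # beta is the scale

xbar = sum(x) / n
s2 = sum(xi * xi for xi in x) / n - xbar ** 2
beta_tilde = s2 / xbar
alpha_tilde = xbar ** 2 / s2
print(round(alpha_tilde, 2), round(beta_tilde, 2))  # near (3, 2)
```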

The equations are much easier to solve than the likelihood equations, which involve the function

\begin{displaymath}\psi(\alpha) = \frac{d}{d\alpha} \log(\Gamma(\alpha))
\end{displaymath}

called the digamma function. The score function in this problem has components

\begin{displaymath}U_\beta = \frac{\sum X_i}{\beta^2} - n \alpha / \beta
\end{displaymath}

and

\begin{displaymath}U_\alpha = -n\psi(\alpha) +\sum\log(X_i) -n\log(\beta)
\end{displaymath}

You can solve $U_\beta = 0$ for $\beta$ in terms of $\alpha$, getting $\beta = \sum X_i/(n\alpha)$, which leaves you trying to find a root of the equation

\begin{displaymath}-n\psi(\alpha) +\sum\log(X_i) -n \log(\sum X_i/(n\alpha)) = 0
\end{displaymath}

To use Newton-Raphson on this you begin with the preliminary estimate $\hat\alpha_1 = \tilde\alpha$ and then compute iteratively

\begin{displaymath}\hat\alpha_{k+1} = \hat\alpha_k - \frac{\overline{\log(X)} - \psi(\hat\alpha_k)
- \log(\overline{X}/\hat\alpha_k)}{1/\hat\alpha_k-\psi^\prime(\hat\alpha_k)}
\end{displaymath}

until the sequence converges. Computation of $\psi^\prime$, the trigamma function, requires special software. Web sites like netlib and statlib are good sources for this sort of thing.
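Since $\psi$ and $\psi^\prime$ need special software, this Python sketch (not from the notes) approximates both by central differences of the standard library's `math.lgamma`; crude, but enough to illustrate the iteration.

```python
import math
import random

# Sketch of the Newton-Raphson iteration above.  The digamma psi and trigamma
# psi' are approximated by central differences of math.lgamma -- an assumption
# of this sketch, not the method the notes recommend (special software).
def psi(a, h=1e-5):
    return (math.lgamma(a + h) - math.lgamma(a - h)) / (2 * h)

def psi_prime(a, h=1e-4):
    return (psi(a + h) - psi(a - h)) / (2 * h)

random.seed(6)
x = [random.gammavariate(3.0, 2.0) for _ in range(1000)]
n = len(x)
xbar = sum(x) / n
logbar = sum(math.log(xi) for xi in x) / n

# Start from the method of moments estimate, then iterate Newton-Raphson.
s2 = sum(xi * xi for xi in x) / n - xbar ** 2
a = xbar ** 2 / s2
for _ in range(20):
    g = logbar - psi(a) - math.log(xbar / a)   # score equation divided by n
    a -= g / (1 / a - psi_prime(a))            # Newton step
alpha_mle = a
print(round(alpha_mle, 2))  # near the true shape 3.0
```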

Optimality theory for point estimates

Why bother doing the Newton-Raphson steps? Why not just use the method of moments estimates? The answer is that the method of moments estimates are not usually as close to the right answer as the mles.

Rough principle: A good estimate $\hat\theta $ of $\theta$ is usually close to $\theta_0$ if $\theta_0$ is the true value of $\theta$. An estimate that is closer, more often, is a better estimate.

This principle must be quantified if we are to ``prove'' that the mle is a good estimate. In the Neyman-Pearson spirit we measure average closeness.

Definition: The Mean Squared Error (MSE) of an estimator $\hat\theta $ is the function

\begin{displaymath}MSE(\theta) = E_\theta[(\hat\theta-\theta)^2]
\end{displaymath}

Standard identity:

\begin{displaymath}MSE = {\rm Var}_\theta(\hat\theta) + Bias_{\hat\theta}^2(\theta)
\end{displaymath}

where the bias is defined as

\begin{displaymath}Bias_{\hat\theta}(\theta) = E_\theta(\hat\theta) - \theta \, .
\end{displaymath}

Primitive example: I take a coin from my pocket and toss it 6 times. I get HTHTTT. The MLE of the probability of heads is

\begin{displaymath}\hat{p} = X/n
\end{displaymath}

where X is the number of heads. In this case I get $\hat{p}
=\frac{1}{3}$.

An alternative estimate is $\tilde p = \frac{1}{2}$. That is, $\tilde p$ ignores the data and guesses the coin is fair. The MSEs of these two estimators are

\begin{displaymath}MSE_{\text{MLE}} = \frac{p(1-p)}{6}
\end{displaymath}

and

\begin{displaymath}MSE_{0.5} = (p-0.5)^2
\end{displaymath}

If $p$ is between 0.311 and 0.689 then the second MSE is smaller than the first. For this reason I would recommend use of $\tilde p$ for sample sizes this small.
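The crossover points can be checked directly: the two MSE curves meet where $p(1-p)/6 = (p-0.5)^2$, that is, at the roots of $7p^2 - 7p + 1.5 = 0$. A small sketch (not from the notes):

```python
# Sketch comparing the two MSE functions above.
def mse_mle(p, n=6):
    return p * (1 - p) / n          # MSE of the MLE X/n

def mse_half(p):
    return (p - 0.5) ** 2           # MSE of the constant guess 1/2

# The curves cross at the roots of 7p^2 - 7p + 1.5 = 0:
lo_root = (7 - 7 ** 0.5) / 14
hi_root = (7 + 7 ** 0.5) / 14
print(round(lo_root, 3), round(hi_root, 3))  # → 0.311 0.689

# Between the roots the constant guess wins; outside, the MLE wins.
assert mse_half(0.5) < mse_mle(0.5)
assert mse_half(0.1) > mse_mle(0.1)
```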

Now suppose I did the same experiment with a thumbtack. The tack can land point up (U) or tipped over (O). If I get UOUOOO how should I estimate $p$, the probability of U? The mathematics is identical to the above but it seems clear that there is less reason to think $\tilde p$ is better than $\hat p$, since there is less reason to believe $0.311 \le p \le 0.689$ than with a coin.

Unbiased Estimation

The problem above illustrates a general phenomenon. An estimator can be good for some values of $\theta$ and bad for others. When comparing $\hat\theta $ and $\tilde\theta$, two estimators of $\theta$, we will say that $\hat\theta $ is better than $\tilde\theta$ if it has uniformly smaller MSE:

\begin{displaymath}MSE_{\hat\theta}(\theta) \le MSE_{\tilde\theta}(\theta)
\end{displaymath}

for all $\theta$. Normally we also require that the inequality be strict for at least one $\theta$.

The definition raises the question of the existence of a best estimate - one which is better than every other estimator. There is no such estimate. Suppose $\hat\theta $ were such a best estimate. Fix a $\theta^*$ in $\Theta$ and let $\tilde\theta\equiv \theta^*$ be the constant estimator. Then the MSE of $\tilde\theta$ is 0 when $\theta=\theta^*$. Since $\hat\theta $ is better than $\tilde\theta$ we must have

\begin{displaymath}MSE_{\hat\theta}(\theta^*) = 0
\end{displaymath}

so that $\hat\theta=\theta^*$ with probability equal to 1 when $\theta^*$ is the true value. Repeating the argument with a second value $\theta^{**}\neq\theta^*$ forces $\hat\theta=\theta^{**}$ with probability 1 as well. If there are actually two different possible values of $\theta$ this gives a contradiction; so no such $\hat\theta $ exists.


Richard Lockhart
1999-10-26