
STAT 450

Lecture 17

Reading for Today's Lecture:

Goals of Today's Lecture:


Large Sample Theory

Our goal is to study the behaviour of $\hat\theta$ and to develop approximate distribution theory for $n^{1/2}(\hat\theta-\theta)$. Here is a summary of the conclusions of the theory:

1.
The log likelihood is probably bigger at $\theta_0$, the true value of $\theta$, than at any single $\theta_1 \neq \theta_0$.

2.
The log likelihood is approximately quadratic in a small region around $\theta_0$.

3.
It follows that there is probably a local maximum of $\ell$ near $\theta_0$ and that this local maximum is unique.

4.
The value of the score function at $\theta_0$ is approximately normal.

5.
The second derivative of the log likelihood is nearly constant for $\theta$ fairly near $\theta_0$.

6.
The likelihood equations can be approximated by Taylor expansion.

7.
The error in the MLE $\hat\theta$ is approximately a multiple of the score function at $\theta_0$.

8.
The MLE is approximately normal.

9.
The approximate standard deviation of the MLE can be estimated.

We now study the approximate behaviour of $\hat\theta$ by studying the score function $U$. Notice first that $U$ is a sum of independent random variables.

Theorem: If $Y_1,Y_2,\ldots$ are iid with mean $\mu$ then

\begin{displaymath}\frac{\sum Y_i}{n} \to \mu
\end{displaymath}

This is called the law of large numbers. The strong law says

\begin{displaymath}P(\lim \frac{\sum Y_i}{n}=\mu)=1
\end{displaymath}

and the weak law that

\begin{displaymath}\lim P(\vert \frac{\sum Y_i}{n}-\mu\vert>\epsilon) = 0
\end{displaymath}

For iid $Y_i$ the stronger conclusion holds, but for our heuristics we will ignore the difference between these notions.
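As a quick numerical illustration of the law of large numbers (a sketch; the Exponential(1) distribution, seed, and sample sizes are arbitrary choices, not from the notes):

```python
# Running means of iid draws settle down at the common mean mu,
# illustrating the (weak/strong) law of large numbers.
import numpy as np

rng = np.random.default_rng(0)                  # fixed seed for reproducibility
y = rng.exponential(scale=1.0, size=100_000)    # iid Y_i with mean mu = 1
running_mean = np.cumsum(y) / np.arange(1, y.size + 1)

for n in (100, 10_000, 100_000):
    print(n, running_mean[n - 1])               # approaches mu = 1 as n grows
```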

Now suppose that $\theta_0$ is the true value of $\theta$. Then

\begin{displaymath}U(\theta)/n \to \mu(\theta)
\end{displaymath}

where
\begin{align*}\mu(\theta) & =
E_{\theta_0}\left[ \frac{\partial \log f}{\partial\theta}(X,\theta) \right]
\\
& = \int \frac{\partial \log f}{\partial\theta}(x,\theta) f(x,\theta_0) \, dx
\end{align*}

Consider as an example the case of $N(\mu,1)$ data where

\begin{displaymath}U(\mu)/n = \sum(X_i -\mu)/n = \bar{X} -\mu
\end{displaymath}

If the true mean is $\mu_0$ then $\bar{X} \to \mu_0$ and

\begin{displaymath}U(\mu)/n \to \mu_0-\mu
\end{displaymath}

For $\mu < \mu_0$ we see that the derivative of $\ell(\mu)$ is likely to be positive, so that $\ell$ increases as we increase $\mu$. For $\mu$ more than $\mu_0$ the derivative is probably negative, and so $\ell$ tends to be decreasing for $\mu > \mu_0$. It follows that $\ell$ is likely to be maximized close to $\mu_0$.
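This sign behaviour of the scaled score is easy to see numerically (a sketch; the value $\mu_0 = 2$, the sample size, and the seed are arbitrary illustrative choices):

```python
# For N(mu, 1) data, U(mu)/n = xbar - mu: positive for mu < mu0 and
# negative for mu > mu0, so ell rises toward mu0 from either side.
import numpy as np

rng = np.random.default_rng(1)
mu0 = 2.0
x = rng.normal(loc=mu0, scale=1.0, size=50_000)
xbar = x.mean()

def scaled_score(mu):
    """U(mu)/n for the N(mu, 1) model."""
    return xbar - mu

print(scaled_score(1.5))   # positive: ell increasing below mu0
print(scaled_score(2.5))   # negative: ell decreasing above mu0
```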

Now we repeat these ideas for a more general case. We study the random variable $\log[f(X_i,\theta)/f(X_i,\theta_0)]$. You know the inequality

\begin{displaymath}E(X)^2 \le E(X^2)
\end{displaymath}

(because the difference is $Var(X) \ge 0$). This inequality has the following generalization, called Jensen's inequality: if g is a convex function (non-negative second derivative, roughly) then

\begin{displaymath}g(E(X)) \le E(g(X))
\end{displaymath}

The inequality above has $g(x)=x^2$. We use $g(x) = -\log(x)$, which is convex because $g^{\prime\prime}(x) = x^{-2} > 0$. We get

\begin{displaymath}-\log(E_{\theta_0}[f(X_i,\theta)/f(X_i,\theta_0)])
\le E_{\theta_0}[-\log\{f(X_i,\theta)/f(X_i,\theta_0)\}]
\end{displaymath}

But
\begin{align*}E_{\theta_0}[f(X_i,\theta)/f(X_i,\theta_0)] & =
\int \frac{f(x,\theta)}{f(x,\theta_0)} f(x,\theta_0) \, dx
\\
& = \int f(x,\theta) \, dx
\\
& = 1
\end{align*}
We can reassemble the inequality and this calculation to get

\begin{displaymath}E_{\theta_0}[\log\{f(X_i,\theta)/f(X_i,\theta_0)\}] \le 0
\end{displaymath}

It is possible to prove that the inequality is strict unless the $\theta$ and $\theta_0$ densities are actually the same. Let $\lambda(\theta) < 0$ denote this expected value. Then for each $\theta$ we find
\begin{align*}n^{-1}[\ell(\theta) - \ell(\theta_0)] & =
n^{-1}\sum \log[f(X_i,\theta)/f(X_i,\theta_0)]
\\
&\to \lambda(\theta)
\end{align*}
This proves that the likelihood is probably higher at $\theta_0$ than at any other single $\theta$. This idea can often be stretched to prove that the mle is consistent.

Definition: A sequence $\hat\theta_n$ of estimators of $\theta$ is consistent if $\hat\theta_n$ converges weakly (or strongly) to $\theta$.

Proto theorem: In regular problems the mle $\hat\theta$ is consistent.

Now let us study the shape of the log likelihood near the true value $\theta_0$, under the assumption that $\hat\theta$ is a root of the likelihood equations close to $\theta_0$. We use Taylor expansion to write, for a one-dimensional parameter $\theta$,
\begin{align*}U(\hat\theta) & = 0
\\
& = U(\theta_0)
+ U^\prime(\theta_0)(\hat\theta-\theta_0)
+ U^{\prime\prime}(\tilde\theta) (\hat\theta-\theta_0)^2/2
\end{align*}
for some $\tilde\theta$ between $\theta_0$ and $\hat\theta$. (This form of the remainder in Taylor's theorem is not valid for multivariate $\theta$.) The derivatives of $U$ are each sums of $n$ terms and so should each be roughly proportional to $n$ in size. The second derivative term is multiplied by the square of the small number $\hat\theta-\theta_0$ and so should be negligible compared to the first derivative term. If we ignore the second derivative term we get

\begin{displaymath}-U^\prime(\theta_0)(\hat\theta-\theta_0) \approx U(\theta_0)
\end{displaymath}

Now let's look at the terms U and $U^\prime$.

In the normal case

\begin{displaymath}U(\theta_0) = \sum (X_i-\mu_0)
\end{displaymath}

has a normal distribution with mean 0 and variance n (SD $\sqrt{n}$). The derivative is simply

\begin{displaymath}U^\prime(\mu) = -n
\end{displaymath}

and the next derivative $U^{\prime\prime}$ is 0. We will analyze the general case by noticing that both U and $U^\prime$ are sums of iid random variables. Let

\begin{displaymath}U_i = \frac{\partial \log f}{\partial\theta} (X_i,\theta_0)
\end{displaymath}

and

\begin{displaymath}V_i = -\frac{\partial^2 \log f}{\partial\theta^2} (X_i,\theta)
\end{displaymath}

In general, $U(\theta_0)=\sum U_i$ has mean 0 and approximately a normal distribution. Here is how we check that:
\begin{align*}E_{\theta_0}(U(\theta_0)) & = n E_{\theta_0} (U_1)
\\
& = n \int \frac{\partial \log f}{\partial\theta}(x,\theta_0) f(x,\theta_0) \, dx
\\
& = n \int \frac{\partial f}{\partial\theta}(x,\theta_0) \, dx
\\
& = n \left. \frac{\partial}{\partial\theta} \int f(x,\theta) \, dx \right|_{\theta=\theta_0}
\\
&= n\frac{\partial}{\partial\theta} 1
\\
& = 0
\end{align*}

Notice that I have interchanged the order of differentiation and integration at one point. This step is usually justified by applying the dominated convergence theorem to the definition of the derivative. The same tactic can be applied by differentiating the identity which we just proved

\begin{displaymath}\int\frac{\partial\log f}{\partial\theta}(x,\theta) f(x,\theta) dx =0
\end{displaymath}

Taking the derivative of both sides with respect to $\theta$ and pulling the derivative under the integral sign again gives

\begin{displaymath}\int \frac{\partial}{\partial\theta} \left[
\frac{\partial\log f}{\partial\theta}(x,\theta) f(x,\theta)
\right] dx =0
\end{displaymath}

Do the derivative and get
\begin{align*}-\int\frac{\partial^2\log f}{\partial\theta^2}(x,\theta) f(x,\theta) \, dx & =
\int \left[ \frac{\partial\log f}{\partial\theta}(x,\theta) \right]^2
f(x,\theta) \, dx
\end{align*}

Definition: The Fisher Information is

\begin{displaymath}I(\theta)=-E_{\theta}(U^\prime(\theta))=nE_{\theta}(V_1)
\end{displaymath}

We refer to ${\cal I}(\theta_0) = E_{\theta_0}(V_1)$ as the information in 1 observation.

The idea is that I is a measure of how curved the log likelihood tends to be at the true value of $\theta$. Big curvature means precise estimates. Our identity above is

\begin{displaymath}I(\theta) = Var_\theta(U(\theta))=n{\cal I}(\theta)
\end{displaymath}

Now we return to our Taylor expansion approximation

\begin{displaymath}-U^\prime(\theta_0)(\hat\theta-\theta_0) \approx U(\theta_0)
\end{displaymath}

and study the two appearances of U.

We have shown that $U=\sum U_i$ is a sum of iid mean 0 random variables. The central limit theorem thus proves that

\begin{displaymath}n^{-1/2} U(\theta_0) \Rightarrow N(0,\sigma^2)
\end{displaymath}

where $\sigma^2 = Var_{\theta_0}(U_i)=E_{\theta_0}(V_i)={\cal I}(\theta_0)$.

Next observe that

\begin{displaymath}-U^\prime(\theta) = \sum V_i
\end{displaymath}

where again

\begin{displaymath}V_i = -\frac{\partial U_i}{\partial\theta}
\end{displaymath}

The law of large numbers can be applied to show

\begin{displaymath}-U^\prime(\theta_0)/n \to E_{\theta_0}[ V_1] = {\cal I}(\theta_0)
\end{displaymath}

Now manipulate our Taylor expansion as follows

\begin{displaymath}n^{1/2} (\hat\theta - \theta_0) \approx
\left[\frac{\sum V_i}{n}\right]^{-1} \frac{\sum U_i}{\sqrt{n}}
\end{displaymath}

Apply Slutsky's Theorem to conclude that the right hand side of this converges in distribution to $N(0,\sigma^2/{\cal I}(\theta_0)^2)$, which simplifies, because of the identities, to $N(0,1/{\cal I}(\theta_0))$.
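The whole chain of approximations can be checked by simulation (a sketch; the Exponential($\theta_0$) model with $\hat\theta = 1/\bar X$ and ${\cal I}(\theta_0) = 1/\theta_0^2$ is an arbitrary illustration, so the limiting standard deviation is $\theta_0$):

```python
# Simulate sqrt(n)(that - theta0), which should be approximately
# N(0, 1/I1(theta0)). Here I1(theta0) = 1/theta0^2, so SD ~ theta0 = 2.
import numpy as np

rng = np.random.default_rng(5)
theta0, n, reps = 2.0, 400, 5_000
x = rng.exponential(scale=1.0 / theta0, size=(reps, n))
that = 1.0 / x.mean(axis=1)            # MLE in each replicate

z = np.sqrt(n) * (that - theta0)
print(z.mean(), z.std())               # mean near 0, SD near theta0 = 2
```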

Summary

In regular families we usually simply say that the mle is consistent and asymptotically normal with an asymptotic variance which is the inverse of the Fisher information. This assertion is actually valid for vector valued $\theta$, where now $I$ is a matrix with $ij$th entry

\begin{displaymath}I_{ij} = - E\left(\frac{\partial^2\ell}{\partial\theta_i\partial\theta_j}\right)
\end{displaymath}
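For example (a standard computation, not worked out in these notes), for the $N(\mu,\sigma^2)$ family with $\theta = (\mu,\sigma^2)$ the mixed partial has expectation 0 and

\begin{displaymath}I(\mu,\sigma^2) = n \begin{pmatrix} 1/\sigma^2 & 0 \\ 0 & 1/(2\sigma^4) \end{pmatrix},
\qquad
I^{-1}(\mu,\sigma^2) = \begin{pmatrix} \sigma^2/n & 0 \\ 0 & 2\sigma^4/n \end{pmatrix}
\end{displaymath}

so the asymptotic variance of $\hat\mu = \bar{X}$ is $\sigma^2/n$, as expected.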


Richard Lockhart
1999-10-18