STAT 801 Lecture 12
Goals of Today's Lecture:
- Develop large sample theory for the mle.
- Define and interpret Fisher information.
- Extend the ideas to estimating equations.
Today's notes
If $X_1, \ldots, X_n$ are independent with common density $f(x;\theta)$, then the log likelihood is of the form
$$\ell(\theta) = \sum_{i=1}^n \log f(X_i;\theta).$$
The score function is
$$U(\theta) = \frac{\partial \ell}{\partial \theta}(\theta) = \sum_{i=1}^n \frac{\partial \log f}{\partial \theta}(X_i;\theta).$$
The mle $\hat\theta$ maximizes $\ell(\theta)$. If the maximum occurs in the interior of the parameter space and the log likelihood is continuously differentiable, then $\hat\theta$ solves the likelihood equations
$$U(\theta) = 0.$$
Examples
$N(\mu,\sigma^2)$: the score function has the two components
$$U(\mu,\sigma) = \begin{pmatrix} \displaystyle\sum_{i=1}^n \frac{X_i-\mu}{\sigma^2} \\[2ex] \displaystyle -\frac{n}{\sigma} + \frac{\sum_{i=1}^n (X_i-\mu)^2}{\sigma^3} \end{pmatrix}.$$
There is a unique root of the likelihood equations,
$$\hat\mu = \bar X, \qquad \hat\sigma^2 = \frac{1}{n}\sum_{i=1}^n (X_i - \bar X)^2.$$
It is a global maximum.
[Remark: Suppose we had called $\tau = \sigma^2$ the parameter. The score function would still have two components, with the first component being the same as before, but now the second component is
$$\frac{\partial \ell}{\partial \tau} = -\frac{n}{2\tau} + \frac{\sum_{i=1}^n (X_i-\mu)^2}{2\tau^2}.$$
Setting the new likelihood equations equal to 0 still gives
$$\hat\mu = \bar X, \qquad \hat\tau = \hat\sigma^2 = \frac{1}{n}\sum_{i=1}^n (X_i - \bar X)^2.$$
This is a general invariance (or equivariance) principle: if $\phi = g(\theta)$ is some reparametrization of a model (a one to one relabelling of the parameter values) then $\hat\phi = g(\hat\theta)$. We will see that this does not apply to other estimators.]
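A quick numerical check of this equivariance (a sketch, using simulated data and scipy's general-purpose optimizer; none of these choices are in the notes): maximizing the likelihood over $(\mu,\sigma)$ or over $(\mu,\tau)$ gives the same fitted model.

import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
x = rng.normal(5.0, 2.0, size=100)

def nll_sigma(p):      # negative log likelihood with parameter (mu, sigma)
    mu, sigma = p
    return np.sum(0.5 * ((x - mu) / sigma) ** 2 + np.log(sigma))

def nll_tau(p):        # same model with parameter (mu, tau), tau = sigma^2
    mu, tau = p
    return np.sum(0.5 * (x - mu) ** 2 / tau + 0.5 * np.log(tau))

bounds = [(None, None), (1e-6, None)]      # keep sigma and tau positive
fit_sigma = minimize(nll_sigma, x0=[0.0, 1.0], method="L-BFGS-B", bounds=bounds)
fit_tau = minimize(nll_tau, x0=[0.0, 1.0], method="L-BFGS-B", bounds=bounds)

print("mu_hat:     ", fit_sigma.x[0], fit_tau.x[0])        # both are xbar
print("sigma_hat^2:", fit_sigma.x[1] ** 2, fit_tau.x[1])   # both are mean((x - xbar)^2)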
Cauchy: location $\theta$
The density is
$$f(x;\theta) = \frac{1}{\pi\left[1 + (x-\theta)^2\right]},$$
so the likelihood equation is
$$U(\theta) = \sum_{i=1}^n \frac{2(X_i-\theta)}{1+(X_i-\theta)^2} = 0.$$
There is at least 1 root of the likelihood equations but often several more. One of the roots is a global maximum; others, if they exist, may be local minima or maxima.
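Here is a short numerical illustration of the multiple-root phenomenon (a sketch; the sample values and the grid-plus-bracketing scheme are my own illustrative choices):

import numpy as np
from scipy.optimize import brentq

def score(theta, x):          # U(theta) for the Cauchy location model
    d = x - theta
    return np.sum(2 * d / (1 + d ** 2))

def loglik(theta, x):
    return -np.sum(np.log(1 + (x - theta) ** 2))

# A widely spread sample tends to produce several roots of U(theta) = 0.
x = np.array([-5.0, -4.6, 0.1, 0.4, 5.2, 5.9])

grid = np.linspace(-10, 10, 2001)
vals = np.array([score(t, x) for t in grid])
roots = [brentq(score, grid[i], grid[i + 1], args=(x,))
         for i in range(len(grid) - 1) if vals[i] * vals[i + 1] < 0]

for r in roots:
    print(f"root at theta = {r:6.3f}, log likelihood = {loglik(r, x):8.3f}")
# The root with the largest log likelihood is the mle; the others are
# local maxima or minima of the log likelihood.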
Binomial($n,\theta$)
The likelihood equation is
$$U(\theta) = \frac{X}{\theta} - \frac{n-X}{1-\theta} = 0.$$
If X=0 or X=n there is no root of the likelihood equations; in this case the likelihood is monotone. For other values of X there is a unique root, a global maximum. The global maximum is at
$$\hat\theta = \frac{X}{n}$$
even if X=0 or n.
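As a quick check of the unique-root claim, solving the likelihood equation directly for $0 < X < n$ gives
$$\frac{X}{\theta} = \frac{n-X}{1-\theta} \;\Longleftrightarrow\; X(1-\theta) = (n-X)\theta \;\Longleftrightarrow\; \theta = \frac{X}{n},$$
while for $X=0$ the likelihood $(1-\theta)^n$ is decreasing in $\theta$, so it is maximized at $\theta = 0 = X/n$ (and similarly at $\theta = 1$ when $X=n$).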
The 2 parameter exponential
The density is
$$f(x;\alpha,\beta) = \frac{1}{\beta}\, e^{-(x-\alpha)/\beta} \qquad \text{for } x > \alpha.$$
The resulting log-likelihood, for $\alpha \le \min\{X_1,\ldots,X_n\}$, is
$$\ell(\alpha,\beta) = -n\log\beta - \frac{\sum_{i=1}^n (X_i-\alpha)}{\beta},$$
and is $-\infty$ otherwise. As a function of $\alpha$ this is increasing until $\alpha$ reaches
$$\hat\alpha = X_{(1)} = \min\{X_1,\ldots,X_n\},$$
which gives the mle of $\alpha$. Now plug in this value for $\alpha$ and get the so-called profile likelihood for $\beta$:
$$\ell_{\rm profile}(\beta) = -n\log\beta - \frac{\sum_{i=1}^n (X_i - X_{(1)})}{\beta}.$$
Take the $\beta$ derivative and set it equal to 0 to get
$$\hat\beta = \frac{1}{n}\sum_{i=1}^n (X_i - X_{(1)}) = \bar X - X_{(1)}.$$
Notice that the mle $\hat\alpha$ does not solve the likelihood equations; we had to look at the edge of the possible parameter space. The parameter $\alpha$ is called a support or truncation parameter. ML methods behave oddly in problems with such parameters.
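A short simulation illustrates the boundary behaviour (a sketch; the true parameter values, sample size, and seed are arbitrary illustrative choices):

import numpy as np

rng = np.random.default_rng(1)
alpha, beta, n = 2.0, 3.0, 50

# Shifted exponential sample: X = alpha + beta * Exp(1)
x = alpha + beta * rng.exponential(size=n)

alpha_hat = x.min()               # mle of the truncation parameter
beta_hat = x.mean() - x.min()     # mle of the scale, from the profile likelihood

print(f"alpha_hat = {alpha_hat:.3f} (true {alpha}), beta_hat = {beta_hat:.3f} (true {beta})")
# alpha_hat is always >= alpha: the support parameter is estimated from the
# edge of the sample, not from a root of the likelihood equations.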
Three parameter Weibull
The density in question is
$$f(x;\alpha,\beta,\gamma) = \frac{\gamma}{\beta}\left(\frac{x-\alpha}{\beta}\right)^{\gamma-1} \exp\left\{-\left(\frac{x-\alpha}{\beta}\right)^{\gamma}\right\} \qquad \text{for } x > \alpha.$$
There are 3 derivatives to take to solve the likelihood equations. Setting the $\beta$ derivative equal to 0 gives the equation
$$\hat\beta(\alpha,\gamma) = \left[\frac{1}{n}\sum_{i=1}^n (X_i-\alpha)^{\gamma}\right]^{1/\gamma},$$
where we use the notation $\hat\beta(\alpha,\gamma)$ to indicate that the mle of $\beta$ could be found by finding the mles of the other two parameters and then plugging in to the formula above. It is not possible to find the remaining two parameters explicitly; numerical methods are needed. However, you can see that putting $\gamma < 1$ and letting $\alpha \to X_{(1)}$ will make the log likelihood go to $\infty$. The mle is not uniquely defined, then, since any $\gamma < 1$ and any $\beta$ will do.
If the true value of $\gamma$ is more than 1 then the probability that there is a root of the likelihood equations is high; in this case there must be two more roots: a local maximum and a saddle point! For a true value of $\gamma > 1$ the theory we detail below applies to the local maximum and not to the global maximum of the likelihood.
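The unboundedness is easy to demonstrate numerically (a sketch; the simulated data and the sequence of $\alpha$ values are illustrative only):

import numpy as np

def weibull_loglik(alpha, beta, gamma, x):
    # log f = log(gamma/beta) + (gamma-1) log((x-alpha)/beta) - ((x-alpha)/beta)^gamma
    z = (x - alpha) / beta
    return np.sum(np.log(gamma / beta) + (gamma - 1) * np.log(z) - z ** gamma)

rng = np.random.default_rng(2)
x = rng.weibull(2.0, size=30) + 1.0    # positive data with minimum above 1

gamma, beta = 0.5, 1.0                 # any gamma < 1 and any beta will do
for eps in [1e-1, 1e-3, 1e-6, 1e-9]:
    alpha = x.min() - eps              # let alpha approach X_(1) from below
    print(f"alpha = X_(1) - {eps:g}: log likelihood = {weibull_loglik(alpha, beta, gamma, x):10.2f}")
# The log likelihood increases without bound as alpha -> X_(1) when gamma < 1,
# so the global maximum does not define a sensible estimator.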
Large Sample Theory
We now study the approximate behaviour of $\hat\theta$ by studying the function $U$. Notice first that $U$ is a sum of independent random variables.
Theorem: If $Y_1, Y_2, \ldots$ are iid with mean $\mu$ then
$$\bar Y = \frac{1}{n}\sum_{i=1}^n Y_i \to \mu.$$
This is called the law of large numbers. The strong law says
$$P\left(\lim_{n\to\infty} \bar Y = \mu\right) = 1,$$
and the weak law that
$$\lim_{n\to\infty} P\left(\left|\bar Y - \mu\right| > \epsilon\right) = 0 \qquad \text{for each } \epsilon > 0.$$
For iid $Y_i$ the stronger conclusion holds, but for our heuristics we will ignore the differences between these notions.
Now suppose that $\theta_0$ is the true value of $\theta$. Then
$$\frac{U(\theta)}{n} \to E_{\theta_0}\left[\frac{\partial \log f}{\partial \theta}(X_1;\theta)\right] = \int \frac{\partial \log f}{\partial \theta}(x;\theta)\, f(x;\theta_0)\, dx.$$
Consider as an example the case of $N(\mu,1)$ data where
$$U(\mu) = \sum_{i=1}^n (X_i - \mu).$$
If the true mean is $\mu_0$ then $\bar X \to \mu_0$ and
$$\frac{U(\mu)}{n} \to \mu_0 - \mu.$$
If we think of a $\mu < \mu_0$ we see that the derivative of $\ell(\mu)$ is likely to be positive, so that $\ell$ increases as we increase $\mu$. For $\mu$ more than $\mu_0$ the derivative is probably negative and so $\ell$ tends to be decreasing for $\mu > \mu_0$. It follows that $\ell$ is likely to be maximized close to $\mu_0$.
Now we repeat these ideas for a more general case. We study the random variable
$$\log\left(\frac{f(X_i;\theta)}{f(X_i;\theta_0)}\right).$$
You know the inequality
$$E(X)^2 \le E(X^2)$$
(because the difference is $\mathrm{Var}(X) \ge 0$). This inequality has the following generalization, called Jensen's inequality. If $g$ is a convex function (non-negative second derivative, roughly) then
$$g(E(X)) \le E(g(X)).$$
The inequality above has $g(x)=x^2$. We use
$$g(x) = -\log(x),$$
which is convex because $g''(x) = 1/x^2 > 0$. We get
$$-\log\left(E_{\theta_0}\left[\frac{f(X_i;\theta)}{f(X_i;\theta_0)}\right]\right) \le E_{\theta_0}\left[-\log\left(\frac{f(X_i;\theta)}{f(X_i;\theta_0)}\right)\right].$$
But
$$E_{\theta_0}\left[\frac{f(X_i;\theta)}{f(X_i;\theta_0)}\right] = \int \frac{f(x;\theta)}{f(x;\theta_0)}\, f(x;\theta_0)\, dx = \int f(x;\theta)\, dx = 1.$$
We can reassemble the inequality and this calculation to get
$$E_{\theta_0}\left[\log\left(\frac{f(X_i;\theta)}{f(X_i;\theta_0)}\right)\right] \le 0.$$
It is possible to prove that the inequality is strict unless the $\theta$ and $\theta_0$ densities are actually the same. Let $\mu(\theta)$ be this expected value. Then for each $\theta \ne \theta_0$ we find
$$\frac{\ell(\theta) - \ell(\theta_0)}{n} = \frac{1}{n}\sum_{i=1}^n \log\left(\frac{f(X_i;\theta)}{f(X_i;\theta_0)}\right) \to \mu(\theta) < 0.$$
This proves that the likelihood is probably higher at $\theta_0$ than at any other single $\theta$. This idea can often be stretched to prove that the mle is consistent.
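A quick simulation of this inequality (a sketch, using $N(\theta,1)$ data with $\theta_0 = 0$; the sample size and trial values of $\theta$ are arbitrary):

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)
theta0, n = 0.0, 10000
x = rng.normal(theta0, 1.0, size=n)

for theta in [-1.0, -0.25, 0.25, 1.0]:
    # (l(theta) - l(theta0)) / n  =  average log likelihood ratio
    avg = np.mean(norm.logpdf(x, theta, 1) - norm.logpdf(x, theta0, 1))
    print(f"theta = {theta:5.2f}: (l(theta) - l(theta0))/n = {avg:7.4f}")
# Each value is negative (close to -(theta - theta0)^2 / 2 for this model),
# so on average the log likelihood is highest at the true value.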
Definition: A sequence $\hat\theta_n$ of estimators of $\theta$ is consistent if $\hat\theta_n$ converges weakly (or strongly) to $\theta$.
Proto theorem: In regular problems the mle $\hat\theta$ is consistent.
Now let us study the shape of the log likelihood near the true value of $\theta$ under the assumption that $\hat\theta$ is a root of the likelihood equations close to $\theta_0$. We use Taylor expansion to write, for a 1 dimensional parameter $\theta$,
$$0 = U(\hat\theta) = U(\theta_0) + (\hat\theta - \theta_0)\, U'(\theta_0) + \frac{1}{2}(\hat\theta - \theta_0)^2\, U''(\tilde\theta)$$
for some $\tilde\theta$ between $\theta_0$ and $\hat\theta$. (This form of the remainder in Taylor's theorem is not valid for multivariate $\theta$.) The derivatives of $U$ are each sums of $n$ terms and so should each be roughly proportional to $n$ in size. The second derivative term is multiplied by the square of the small number $\hat\theta - \theta_0$ and so should be negligible compared to the first derivative term. If we ignore the second derivative term we get
$$\hat\theta - \theta_0 \approx -\frac{U(\theta_0)}{U'(\theta_0)}.$$
Now let's look at the terms $U$ and $U'$. In the normal case
$$U(\mu) = \sum_{i=1}^n (X_i - \mu)$$
has a normal distribution with mean 0 and variance $n$ (SD $\sqrt{n}$). The derivative is simply
$$U'(\mu) = -n,$$
and the next derivative is 0. We will analyze the general case by noticing that both $U$ and $U'$ are sums of iid random variables. Let
$$U_i = \frac{\partial \log f}{\partial \theta}(X_i;\theta)$$
and
$$V_i = -\frac{\partial^2 \log f}{\partial \theta^2}(X_i;\theta).$$
In general, $U(\theta_0)$ has mean 0 and approximately a normal distribution. Here is how we check that:
$$E_{\theta_0}(U_i) = \int \frac{\partial \log f}{\partial \theta}(x;\theta_0)\, f(x;\theta_0)\, dx = \int \frac{\partial f}{\partial \theta}(x;\theta_0)\, dx = \frac{\partial}{\partial \theta}\int f(x;\theta)\, dx\,\Big|_{\theta=\theta_0} = \frac{\partial}{\partial \theta}\, 1 = 0.$$
Notice that I have interchanged the order of differentiation and integration at one point. This step is usually justified by applying the dominated convergence theorem to the definition of the derivative. The same tactic can be applied by differentiating the identity which we just proved,
$$\int \frac{\partial \log f}{\partial \theta}(x;\theta)\, f(x;\theta)\, dx = 0.$$
Taking the derivative of both sides with respect to $\theta$ and pulling the derivative under the integral sign again gives
$$\int \frac{\partial}{\partial \theta}\left[\frac{\partial \log f}{\partial \theta}(x;\theta)\, f(x;\theta)\right] dx = 0.$$
Do the derivative and get
$$\int \frac{\partial^2 \log f}{\partial \theta^2}(x;\theta)\, f(x;\theta)\, dx + \int \left[\frac{\partial \log f}{\partial \theta}(x;\theta)\right]^2 f(x;\theta)\, dx = 0,$$
that is,
$$E_\theta(V_i) = E_\theta(U_i^2) = \mathrm{Var}_\theta(U_i).$$
Definition: The Fisher information is
$$I(\theta) = \mathrm{Var}_\theta\big(U(\theta)\big) = n\,\mathrm{Var}_\theta(U_1).$$
We refer to
$$I_1(\theta) = \mathrm{Var}_\theta(U_1) = E_\theta(U_1^2)$$
as the information in 1 observation. The idea is that $I$ is a measure of how curved the log likelihood tends to be at the true value of $\theta$. Big curvature means precise estimates. Our identity above is
$$I(\theta) = E_\theta\big(-U'(\theta)\big) = n E_\theta(V_1) = n I_1(\theta).$$
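For instance, in the Poisson($\theta$) model (a standard calculation, included here as an illustration):
$$\log f(x;\theta) = x\log\theta - \theta - \log x!, \qquad U_1 = \frac{X}{\theta} - 1, \qquad V_1 = -\frac{\partial U_1}{\partial \theta} = \frac{X}{\theta^2},$$
so
$$I_1(\theta) = \mathrm{Var}\left(\frac{X}{\theta}\right) = \frac{\theta}{\theta^2} = \frac{1}{\theta} = E(V_1) \qquad \text{and} \qquad I(\theta) = \frac{n}{\theta},$$
which illustrates the identity $\mathrm{Var}(U) = E(-U')$.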
Now we return to our Taylor expansion approximation
$$\hat\theta - \theta_0 \approx -\frac{U(\theta_0)}{U'(\theta_0)}$$
and study the two appearances of $U$. We have shown that $U(\theta_0)$ is a sum of iid mean 0 random variables. The central limit theorem thus proves that $n^{-1/2}\,U(\theta_0)$ converges in distribution to $N(0,\sigma^2)$, where $\sigma^2 = \mathrm{Var}(U_1) = I_1(\theta_0)$. Next observe that
$$-U'(\theta) = \sum_{i=1}^n V_i,$$
where again
$$E_\theta(V_i) = E_\theta(U_i^2) = I_1(\theta).$$
The law of large numbers can be applied to show
$$-\frac{U'(\theta_0)}{n} \to I_1(\theta_0).$$
Now manipulate our Taylor expansion as follows:
$$n^{1/2}\,(\hat\theta - \theta_0) \approx \frac{n^{-1/2}\, U(\theta_0)}{-n^{-1}\, U'(\theta_0)}.$$
Apply Slutsky's Theorem to conclude that the right hand side of this converges in distribution to $N\big(0, \sigma^2/I_1^2(\theta_0)\big)$, which simplifies, because of the identities, to $N\big(0, 1/I_1(\theta_0)\big)$.
Summary
In regular families $n^{1/2}(\hat\theta - \theta_0)$ converges in distribution to $N\big(0, 1/I_1(\theta_0)\big)$. We usually simply say that the mle is consistent and asymptotically normal with an asymptotic variance which is the inverse of the Fisher information. This assertion is actually valid for vector valued $\theta$, where now $I$ is a matrix with $ij$th entry
$$I_{ij}(\theta) = -E_\theta\left[\frac{\partial^2 \ell(\theta)}{\partial \theta_i\, \partial \theta_j}\right].$$
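Here is a small simulation check of this summary, using the Poisson($\theta$) illustration above where $\hat\theta = \bar X$ and $1/I_1(\theta) = \theta$ (a sketch; the sample size, number of replications, and seed are arbitrary):

import numpy as np

rng = np.random.default_rng(4)
theta0, n, nrep = 2.0, 100, 20000

# Poisson(theta): the mle is the sample mean and I_1(theta) = 1/theta.
x = rng.poisson(theta0, size=(nrep, n))
theta_hat = x.mean(axis=1)

z = np.sqrt(n) * (theta_hat - theta0)   # should be roughly N(0, 1/I_1) = N(0, theta0)
print("simulated variance of sqrt(n)(theta_hat - theta0):", z.var().round(3))
print("theoretical 1/I_1(theta0):                        ", theta0)
# The two numbers agree closely, as the asymptotic theory predicts.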
Estimating Equations
The same ideas arise in almost any model where estimates are derived by solving some equation. As an example I sketch large sample theory for Generalized Linear Models.
Suppose that for $i=1,\ldots,n$ we have observations of the numbers of cancer cases $Y_i$ in some group of people characterized by values $x_i$ of some covariates. You are supposed to think of $x_i$ as containing variables like age, a dummy variable for sex, average income, and so on. A parametric regression model for the $Y_i$ might postulate that $Y_i$ has a Poisson distribution with mean $\mu_i$ where the mean depends somehow on the covariate values. Typically we might assume that
$$g(\mu_i) = x_i\beta,$$
where $g$ is a so-called link function, often for this case $g(\mu) = \log(\mu)$, and $x_i\beta$ is a matrix product with $x_i$ written as a row vector and $\beta$ a column vector. This is supposed to function as a ``linear regression model with Poisson errors''. I will do as a special case
$$\log(\mu_i) = \beta x_i,$$
where $x_i$ is a scalar. The log likelihood is simply
$$\ell(\beta) = \sum_{i=1}^n \big(Y_i \log \mu_i - \mu_i\big),$$
ignoring irrelevant factorials. The score function is, since $\log(\mu_i) = \beta x_i$,
$$U(\beta) = \sum_{i=1}^n \big(Y_i x_i - x_i \mu_i\big) = \sum_{i=1}^n x_i\big(Y_i - e^{\beta x_i}\big).$$
(Notice again that the score has mean 0 when you plug in the true parameter value.) The key observation, however, is that it is not necessary to believe that $Y_i$ has a Poisson distribution to make solving the equation $U=0$ sensible. Suppose only that
$$E(Y_i) = e^{\beta x_i}.$$
Then we have assumed that
$$E_\beta\big(U(\beta)\big) = 0.$$
This was the key condition in proving that there was a root of the likelihood equations which was consistent, and here it is what is needed, roughly, to prove that the equation $U(\beta)=0$ has a consistent root $\hat\beta$. Ignoring higher order terms in a Taylor expansion will give
$$\hat\beta - \beta \approx \frac{U(\beta)}{V(\beta)},$$
where
$$V(\beta) = -U'(\beta) = \sum_{i=1}^n x_i^2 e^{\beta x_i}.$$
In the mle case we had identities relating the expectation of $V$ to the variance of $U$. In general here we have
$$\mathrm{Var}\big(U(\beta)\big) = \sum_{i=1}^n x_i^2\, \mathrm{Var}(Y_i).$$
If $Y_i$ is Poisson with mean $\mu_i$ (and so $\mathrm{Var}(Y_i) = \mu_i$) this is
$$\mathrm{Var}\big(U(\beta)\big) = \sum_{i=1}^n x_i^2 \mu_i.$$
Moreover we have
$$E\big(V(\beta)\big) = \sum_{i=1}^n x_i^2 e^{\beta x_i} = \sum_{i=1}^n x_i^2 \mu_i,$$
and so $\mathrm{Var}(U) = E(V)$. The central limit theorem (the Lyapunov kind) will show that $U(\beta)$ has an approximate normal distribution with variance
$$\sigma_U^2 = \sum_{i=1}^n x_i^2\, \mathrm{Var}(Y_i),$$
and so
$$\mathrm{Var}\big(\hat\beta\big) \approx \frac{\sigma_U^2}{V^2(\beta)} = \frac{\sum x_i^2\, \mathrm{Var}(Y_i)}{\left(\sum x_i^2 \mu_i\right)^2}.$$
If $\mathrm{Var}(Y_i) = \mu_i$, as it is for the Poisson case, the asymptotic variance simplifies to $1/\sum x_i^2 \mu_i$.
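The whole calculation is easy to mimic numerically (a sketch; the covariates, the true $\beta$, and the use of Newton's method are illustrative choices and not part of the notes):

import numpy as np

rng = np.random.default_rng(5)
n, beta_true = 200, 0.5
x = rng.uniform(0, 2, size=n)              # scalar covariates
y = rng.poisson(np.exp(beta_true * x))     # Poisson counts with log mean beta * x

def U(b):       # estimating function  U(beta) = sum x_i (Y_i - exp(beta x_i))
    return np.sum(x * (y - np.exp(b * x)))

def V(b):       # V(beta) = -U'(beta) = sum x_i^2 exp(beta x_i)
    return np.sum(x ** 2 * np.exp(b * x))

b = 0.0                                    # solve U(beta) = 0 by Newton's method
for _ in range(25):
    b = b + U(b) / V(b)

mu_hat = np.exp(b * x)
model_var = 1.0 / np.sum(x ** 2 * mu_hat)                      # uses Var(Y_i) = mu_i
sandwich_var = np.sum(x ** 2 * (y - mu_hat) ** 2) / V(b) ** 2  # no Poisson assumption
print(f"beta_hat = {b:.3f}, model-based var = {model_var:.5f}, sandwich var = {sandwich_var:.5f}")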
Notice that other estimating equations are possible. People suggest alternatives very often. If $w_i$ is any set of deterministic weights (even possibly depending on $x_i$) then we could define
$$U(\beta) = \sum_{i=1}^n w_i\big(Y_i - e^{\beta x_i}\big)$$
and still conclude that $U=0$ probably has a consistent root which has an asymptotic normal distribution. This idea is being used all over the place these days: see, for example, Zeger and Liang's Generalized Estimating Equations (GEE), which the econometricians call the Generalized Method of Moments.
Problems with maximum likelihood
- 1. In problems with many parameters the approximations don't work very well and maximum likelihood estimators can be far from the right answer. See your homework for the Neyman-Scott example where the mle is not consistent.
- 2. When there are multiple roots of the likelihood equation you must choose the right root. To do so you might start with a different consistent estimator and then apply some iterative scheme like Newton-Raphson to the likelihood equations to find the mle. It turns out that not many steps of Newton-Raphson are generally required if the starting point is a reasonable estimate (see the sketch below).
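As an illustration of point 2 (a sketch reusing the Cauchy location example; taking the sample median as the preliminary consistent estimate is my choice, not something specified in the notes):

import numpy as np

rng = np.random.default_rng(6)
theta0, n = 1.0, 200
x = theta0 + rng.standard_cauchy(size=n)

def U(t):         # Cauchy location score
    d = x - t
    return np.sum(2 * d / (1 + d ** 2))

def Uprime(t):    # derivative of the score
    d = x - t
    return np.sum(2 * (d ** 2 - 1) / (1 + d ** 2) ** 2)

theta = np.median(x)              # preliminary consistent estimate
for step in range(3):             # a few Newton-Raphson steps are enough
    theta = theta - U(theta) / Uprime(theta)
    print(f"after step {step + 1}: theta = {theta:.4f}")
# Starting from a consistent estimate keeps the iteration near the right root
# of the likelihood equations.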
Finding (good) preliminary point estimates
Method of Moments
Basic strategy: set sample moments equal to population moments and solve for the parameters.
Definition: The $k$th sample moment (about the origin) is
$$\frac{1}{n}\sum_{i=1}^n X_i^k.$$
The $k$th population moment is
$$E\big(X^k\big).$$
(Central moments are $\frac{1}{n}\sum_{i=1}^n (X_i - \bar X)^k$ and $E\big[(X-\mu)^k\big]$.)
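For example, for a Gamma model with shape $a$ and scale $b$ we have $E(X) = ab$ and $\mathrm{Var}(X) = ab^2$, so matching these to the sample mean and variance gives $\tilde b = \hat\sigma^2/\bar X$ and $\tilde a = \bar X/\tilde b$. A short sketch (the true parameter values and sample size are illustrative):

import numpy as np

rng = np.random.default_rng(7)
a_true, b_true = 3.0, 2.0
x = rng.gamma(a_true, b_true, size=500)

xbar, s2 = x.mean(), x.var()     # sample moments
b_mm = s2 / xbar                 # match Var = a b^2 and E = a b
a_mm = xbar / b_mm

print(f"method of moments: a = {a_mm:.3f} (true {a_true}), b = {b_mm:.3f} (true {b_true})")
# These make reasonable starting values for Newton-Raphson applied to the
# Gamma likelihood equations.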
Richard Lockhart
1998-10-29