STAT 801 Lecture 12
Goals of Today's Lecture:
- Develop large sample theory for the mle.
- Define and interpret Fisher information.
- Extend the ideas to estimating equations.
Today's notes
If $X_1, \ldots, X_n$ are independent with common density $f(x;\theta)$, then the log likelihood is of the form
$$\ell(\theta) = \sum_{i=1}^n \log f(X_i;\theta).$$
The score function is
$$U(\theta) = \frac{\partial \ell}{\partial \theta}(\theta) = \sum_{i=1}^n \frac{\partial \log f}{\partial \theta}(X_i;\theta).$$
The mle $\hat\theta$ maximizes $\ell(\theta)$. If the maximum occurs in the interior of the parameter space and the log likelihood is continuously differentiable, then $\hat\theta$ solves the likelihood equations
$$U(\theta) = 0.$$
Examples
$N(\mu,\sigma^2)$: the score function has the two components
$$U(\mu,\sigma) = \begin{pmatrix} \displaystyle\sum_{i=1}^n \frac{X_i-\mu}{\sigma^2} \\[2ex] \displaystyle -\frac{n}{\sigma} + \frac{\sum_{i=1}^n (X_i-\mu)^2}{\sigma^3} \end{pmatrix}.$$
There is a unique root of the likelihood equations,
$$\hat\mu = \bar X, \qquad \hat\sigma^2 = \frac{1}{n}\sum_{i=1}^n (X_i - \bar X)^2.$$
It is a global maximum.
[Remark: Suppose we had called $\tau = \sigma^2$ the parameter. The score function would still have two components, with the first component being the same as before, but now the second component is
$$\frac{\partial \ell}{\partial \tau} = -\frac{n}{2\tau} + \frac{\sum_{i=1}^n (X_i-\mu)^2}{2\tau^2}.$$
Setting the new likelihood equations equal to 0 still gives
$$\hat\mu = \bar X, \qquad \hat\tau = \hat\sigma^2 = \frac{1}{n}\sum_{i=1}^n (X_i - \bar X)^2.$$
This is a general invariance (or equivariance) principle: if $\phi = g(\theta)$ is some reparametrization of a model (a one to one relabelling of the parameter values) then $\hat\phi = g(\hat\theta)$. We will see that this does not apply to other estimators.]
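A quick numerical check of this equivariance (a sketch, using simulated data and scipy's general-purpose optimizer; none of these choices are in the notes): maximizing the likelihood over $(\mu,\sigma)$ or over $(\mu,\tau)$ gives the same fitted model.

import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
x = rng.normal(5.0, 2.0, size=100)

def nll_sigma(p):      # negative log likelihood with parameter (mu, sigma)
    mu, sigma = p
    return np.sum(0.5 * ((x - mu) / sigma) ** 2 + np.log(sigma))

def nll_tau(p):        # same model with parameter (mu, tau), tau = sigma^2
    mu, tau = p
    return np.sum(0.5 * (x - mu) ** 2 / tau + 0.5 * np.log(tau))

bounds = [(None, None), (1e-6, None)]      # keep sigma and tau positive
fit_sigma = minimize(nll_sigma, x0=[0.0, 1.0], method="L-BFGS-B", bounds=bounds)
fit_tau = minimize(nll_tau, x0=[0.0, 1.0], method="L-BFGS-B", bounds=bounds)

print("mu_hat:     ", fit_sigma.x[0], fit_tau.x[0])        # both are xbar
print("sigma_hat^2:", fit_sigma.x[1] ** 2, fit_tau.x[1])   # both are mean((x - xbar)^2)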
Cauchy: location $\theta$
The density is
$$f(x;\theta) = \frac{1}{\pi\left[1 + (x-\theta)^2\right]},$$
so the likelihood equation is
$$U(\theta) = \sum_{i=1}^n \frac{2(X_i-\theta)}{1+(X_i-\theta)^2} = 0.$$
There is at least 1 root of the likelihood equations but often several more. One of the roots is a global maximum; others, if they exist, may be local minima or maxima.
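Here is a short numerical illustration of the multiple-root phenomenon (a sketch; the sample values and the grid-plus-bracketing scheme are my own illustrative choices):

import numpy as np
from scipy.optimize import brentq

def score(theta, x):          # U(theta) for the Cauchy location model
    d = x - theta
    return np.sum(2 * d / (1 + d ** 2))

def loglik(theta, x):
    return -np.sum(np.log(1 + (x - theta) ** 2))

# A widely spread sample tends to produce several roots of U(theta) = 0.
x = np.array([-5.0, -4.6, 0.1, 0.4, 5.2, 5.9])

grid = np.linspace(-10, 10, 2001)
vals = np.array([score(t, x) for t in grid])
roots = [brentq(score, grid[i], grid[i + 1], args=(x,))
         for i in range(len(grid) - 1) if vals[i] * vals[i + 1] < 0]

for r in roots:
    print(f"root at theta = {r:6.3f}, log likelihood = {loglik(r, x):8.3f}")
# The root with the largest log likelihood is the mle; the others are
# local maxima or minima of the log likelihood.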
Binomial($n,\theta$)
The likelihood equation is
$$U(\theta) = \frac{X}{\theta} - \frac{n-X}{1-\theta} = 0.$$
If X=0 or X=n there is no root of the likelihood equations; in this case the likelihood is monotone. For other values of X there is a unique root, a global maximum. The global maximum is at
$$\hat\theta = \frac{X}{n}$$
even if X=0 or n.
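As a quick check of the unique-root claim, solving the likelihood equation directly for $0 < X < n$ gives
$$\frac{X}{\theta} = \frac{n-X}{1-\theta} \;\Longleftrightarrow\; X(1-\theta) = (n-X)\theta \;\Longleftrightarrow\; \theta = \frac{X}{n},$$
while for $X=0$ the likelihood $(1-\theta)^n$ is decreasing in $\theta$, so it is maximized at $\theta = 0 = X/n$ (and similarly at $\theta = 1$ when $X=n$).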
The 2 parameter exponential
The density is
$$f(x;\alpha,\beta) = \frac{1}{\beta}\, e^{-(x-\alpha)/\beta} \qquad \text{for } x > \alpha.$$
The resulting log-likelihood, for $\alpha \le \min\{X_1,\ldots,X_n\}$, is
$$\ell(\alpha,\beta) = -n\log\beta - \frac{\sum_{i=1}^n (X_i-\alpha)}{\beta},$$
and is $-\infty$ otherwise. As a function of $\alpha$ this is increasing until $\alpha$ reaches
$$\hat\alpha = X_{(1)} = \min\{X_1,\ldots,X_n\},$$
which gives the mle of $\alpha$. Now plug in this value for $\alpha$ and get the so-called profile likelihood for $\beta$:
$$\ell_{\rm profile}(\beta) = -n\log\beta - \frac{\sum_{i=1}^n (X_i - X_{(1)})}{\beta}.$$
Take the $\beta$ derivative and set it equal to 0 to get
$$\hat\beta = \frac{1}{n}\sum_{i=1}^n (X_i - X_{(1)}) = \bar X - X_{(1)}.$$
Notice that the mle $\hat\alpha$ does not solve the likelihood equations; we had to look at the edge of the possible parameter space. The parameter $\alpha$ is called a support or truncation parameter. ML methods behave oddly in problems with such parameters.
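A short simulation illustrates the boundary behaviour (a sketch; the true parameter values, sample size, and seed are arbitrary illustrative choices):

import numpy as np

rng = np.random.default_rng(1)
alpha, beta, n = 2.0, 3.0, 50

# Shifted exponential sample: X = alpha + beta * Exp(1)
x = alpha + beta * rng.exponential(size=n)

alpha_hat = x.min()               # mle of the truncation parameter
beta_hat = x.mean() - x.min()     # mle of the scale, from the profile likelihood

print(f"alpha_hat = {alpha_hat:.3f} (true {alpha}), beta_hat = {beta_hat:.3f} (true {beta})")
# alpha_hat is always >= alpha: the support parameter is estimated from the
# edge of the sample, not from a root of the likelihood equations.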
Three parameter Weibull
The density in question is
$$f(x;\alpha,\beta,\gamma) = \frac{\gamma}{\beta}\left(\frac{x-\alpha}{\beta}\right)^{\gamma-1} \exp\left\{-\left(\frac{x-\alpha}{\beta}\right)^{\gamma}\right\} \qquad \text{for } x > \alpha.$$
There are 3 derivatives to take to solve the likelihood equations. Setting the $\beta$ derivative equal to 0 gives the equation
$$\hat\beta(\alpha,\gamma) = \left[\frac{1}{n}\sum_{i=1}^n (X_i-\alpha)^{\gamma}\right]^{1/\gamma},$$
where we use the notation $\hat\beta(\alpha,\gamma)$ to indicate that the mle of $\beta$ could be found by finding the mles of the other two parameters and then plugging in to the formula above. It is not possible to find the remaining two parameters explicitly; numerical methods are needed. However, you can see that putting $\gamma < 1$ and letting $\alpha \to X_{(1)}$ will make the log likelihood go to $\infty$. The mle is not uniquely defined, then, since any $\gamma < 1$ and any $\beta$ will do.
If the true value of $\gamma$ is more than 1 then the probability that there is a root of the likelihood equations is high; in this case there must be two more roots: a local maximum and a saddle point! For a true value of $\gamma > 1$ the theory we detail below applies to the local maximum and not to the global maximum of the likelihood.
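The unboundedness is easy to demonstrate numerically (a sketch; the simulated data and the sequence of $\alpha$ values are illustrative only):

import numpy as np

def weibull_loglik(alpha, beta, gamma, x):
    # log f = log(gamma/beta) + (gamma-1) log((x-alpha)/beta) - ((x-alpha)/beta)^gamma
    z = (x - alpha) / beta
    return np.sum(np.log(gamma / beta) + (gamma - 1) * np.log(z) - z ** gamma)

rng = np.random.default_rng(2)
x = rng.weibull(2.0, size=30) + 1.0    # positive data with minimum above 1

gamma, beta = 0.5, 1.0                 # any gamma < 1 and any beta will do
for eps in [1e-1, 1e-3, 1e-6, 1e-9]:
    alpha = x.min() - eps              # let alpha approach X_(1) from below
    print(f"alpha = X_(1) - {eps:g}: log likelihood = {weibull_loglik(alpha, beta, gamma, x):10.2f}")
# The log likelihood increases without bound as alpha -> X_(1) when gamma < 1,
# so the global maximum does not define a sensible estimator.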
Large Sample Theory
We now study the approximate behaviour of $\hat\theta$ by studying the function $U$. Notice first that $U$ is a sum of independent random variables.
Theorem: If $Y_1, Y_2, \ldots$ are iid with mean $\mu$ then
$$\bar Y = \frac{1}{n}\sum_{i=1}^n Y_i \to \mu.$$
This is called the law of large numbers. The strong law says
$$P\left(\lim_{n\to\infty} \bar Y = \mu\right) = 1,$$
and the weak law that
$$\lim_{n\to\infty} P\left(\left|\bar Y - \mu\right| > \epsilon\right) = 0 \qquad \text{for each } \epsilon > 0.$$
For iid $Y_i$ the stronger conclusion holds, but for our heuristics we will ignore the differences between these notions.
Now suppose that $\theta_0$ is the true value of $\theta$. Then
$$\frac{U(\theta)}{n} \to E_{\theta_0}\left[\frac{\partial \log f}{\partial \theta}(X_1;\theta)\right] = \int \frac{\partial \log f}{\partial \theta}(x;\theta)\, f(x;\theta_0)\, dx.$$
Consider as an example the case of $N(\mu,1)$ data where
$$U(\mu) = \sum_{i=1}^n (X_i - \mu).$$
If the true mean is $\mu_0$ then $\bar X \to \mu_0$ and
$$\frac{U(\mu)}{n} \to \mu_0 - \mu.$$
If we think of a $\mu < \mu_0$ we see that the derivative of $\ell(\mu)$ is likely to be positive, so that $\ell$ increases as we increase $\mu$. For $\mu$ more than $\mu_0$ the derivative is probably negative and so $\ell$ tends to be decreasing for $\mu > \mu_0$. It follows that $\ell$ is likely to be maximized close to $\mu_0$.
Now we repeat these ideas for a more general case. We study the random variable
$$\log\left(\frac{f(X_i;\theta)}{f(X_i;\theta_0)}\right).$$
You know the inequality
$$E(X)^2 \le E(X^2)$$
(because the difference is $\mathrm{Var}(X) \ge 0$). This inequality has the following generalization, called Jensen's inequality. If $g$ is a convex function (non-negative second derivative, roughly) then
$$g(E(X)) \le E(g(X)).$$
The inequality above has $g(x)=x^2$. We use
$$g(x) = -\log(x),$$
which is convex because $g''(x) = 1/x^2 > 0$. We get
$$-\log\left(E_{\theta_0}\left[\frac{f(X_i;\theta)}{f(X_i;\theta_0)}\right]\right) \le E_{\theta_0}\left[-\log\left(\frac{f(X_i;\theta)}{f(X_i;\theta_0)}\right)\right].$$
But
$$E_{\theta_0}\left[\frac{f(X_i;\theta)}{f(X_i;\theta_0)}\right] = \int \frac{f(x;\theta)}{f(x;\theta_0)}\, f(x;\theta_0)\, dx = \int f(x;\theta)\, dx = 1.$$
We can reassemble the inequality and this calculation to get
$$E_{\theta_0}\left[\log\left(\frac{f(X_i;\theta)}{f(X_i;\theta_0)}\right)\right] \le 0.$$
It is possible to prove that the inequality is strict unless the $\theta$ and $\theta_0$ densities are actually the same. Let $\mu(\theta)$ be this expected value. Then for each $\theta \ne \theta_0$ we find
$$\frac{\ell(\theta) - \ell(\theta_0)}{n} = \frac{1}{n}\sum_{i=1}^n \log\left(\frac{f(X_i;\theta)}{f(X_i;\theta_0)}\right) \to \mu(\theta) < 0.$$
This proves that the likelihood is probably higher at $\theta_0$ than at any other single $\theta$. This idea can often be stretched to prove that the mle is consistent.
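A quick simulation of this inequality (a sketch, using $N(\theta,1)$ data with $\theta_0 = 0$; the sample size and trial values of $\theta$ are arbitrary):

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)
theta0, n = 0.0, 10000
x = rng.normal(theta0, 1.0, size=n)

for theta in [-1.0, -0.25, 0.25, 1.0]:
    # (l(theta) - l(theta0)) / n  =  average log likelihood ratio
    avg = np.mean(norm.logpdf(x, theta, 1) - norm.logpdf(x, theta0, 1))
    print(f"theta = {theta:5.2f}: (l(theta) - l(theta0))/n = {avg:7.4f}")
# Each value is negative (close to -(theta - theta0)^2 / 2 for this model),
# so on average the log likelihood is highest at the true value.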
Definition: A sequence $\hat\theta_n$ of estimators of $\theta$ is consistent if $\hat\theta_n$ converges weakly (or strongly) to $\theta$.
Proto theorem: In regular problems the mle $\hat\theta$ is consistent.
Now let us study the shape of the log likelihood near the true value of $\theta$ under the assumption that $\hat\theta$ is a root of the likelihood equations close to $\theta_0$. We use Taylor expansion to write, for a 1 dimensional parameter $\theta$,
$$0 = U(\hat\theta) = U(\theta_0) + (\hat\theta - \theta_0)\, U'(\theta_0) + \frac{1}{2}(\hat\theta - \theta_0)^2\, U''(\tilde\theta)$$
for some $\tilde\theta$ between $\theta_0$ and $\hat\theta$. (This form of the remainder in Taylor's theorem is not valid for multivariate $\theta$.) The derivatives of $U$ are each sums of $n$ terms and so should each be roughly proportional to $n$ in size. The second derivative term is multiplied by the square of the small number $\hat\theta - \theta_0$ and so should be negligible compared to the first derivative term. If we ignore the second derivative term we get
$$\hat\theta - \theta_0 \approx -\frac{U(\theta_0)}{U'(\theta_0)}.$$
Now let's look at the terms $U$ and $U'$. In the normal case
$$U(\mu) = \sum_{i=1}^n (X_i - \mu)$$
has a normal distribution with mean 0 and variance $n$ (SD $\sqrt{n}$). The derivative is simply
$$U'(\mu) = -n,$$
and the next derivative is 0. We will analyze the general case by noticing that both $U$ and $U'$ are sums of iid random variables. Let
$$U_i = \frac{\partial \log f}{\partial \theta}(X_i;\theta)$$
and
$$V_i = -\frac{\partial^2 \log f}{\partial \theta^2}(X_i;\theta).$$
In general, $U(\theta_0)$ has mean 0 and approximately a normal distribution. Here is how we check that:
$$E_{\theta_0}(U_i) = \int \frac{\partial \log f}{\partial \theta}(x;\theta_0)\, f(x;\theta_0)\, dx = \int \frac{\partial f}{\partial \theta}(x;\theta_0)\, dx = \frac{\partial}{\partial \theta}\int f(x;\theta)\, dx\,\Big|_{\theta=\theta_0} = \frac{\partial}{\partial \theta}\, 1 = 0.$$
Notice that I have interchanged the order of differentiation and integration at one point. This step is usually justified by applying the dominated convergence theorem to the definition of the derivative. The same tactic can be applied by differentiating the identity which we just proved,
$$\int \frac{\partial \log f}{\partial \theta}(x;\theta)\, f(x;\theta)\, dx = 0.$$
Taking the derivative of both sides with respect to $\theta$ and pulling the derivative under the integral sign again gives
$$\int \frac{\partial}{\partial \theta}\left[\frac{\partial \log f}{\partial \theta}(x;\theta)\, f(x;\theta)\right] dx = 0.$$
Do the derivative and get
$$\int \frac{\partial^2 \log f}{\partial \theta^2}(x;\theta)\, f(x;\theta)\, dx + \int \left[\frac{\partial \log f}{\partial \theta}(x;\theta)\right]^2 f(x;\theta)\, dx = 0,$$
that is,
$$E_\theta(V_i) = E_\theta(U_i^2) = \mathrm{Var}_\theta(U_i).$$
Definition: The Fisher information is
$$I(\theta) = \mathrm{Var}_\theta\big(U(\theta)\big) = n\,\mathrm{Var}_\theta(U_1).$$
We refer to
$$I_1(\theta) = \mathrm{Var}_\theta(U_1) = E_\theta(U_1^2)$$
as the information in 1 observation. The idea is that $I$ is a measure of how curved the log likelihood tends to be at the true value of $\theta$. Big curvature means precise estimates. Our identity above is
$$I(\theta) = E_\theta\big(-U'(\theta)\big) = n E_\theta(V_1) = n I_1(\theta).$$
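For instance, in the Poisson($\theta$) model (a standard calculation, included here as an illustration):
$$\log f(x;\theta) = x\log\theta - \theta - \log x!, \qquad U_1 = \frac{X}{\theta} - 1, \qquad V_1 = -\frac{\partial U_1}{\partial \theta} = \frac{X}{\theta^2},$$
so
$$I_1(\theta) = \mathrm{Var}\left(\frac{X}{\theta}\right) = \frac{\theta}{\theta^2} = \frac{1}{\theta} = E(V_1) \qquad \text{and} \qquad I(\theta) = \frac{n}{\theta},$$
which illustrates the identity $\mathrm{Var}(U) = E(-U')$.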
Now we return to our Taylor expansion approximation
$$\hat\theta - \theta_0 \approx -\frac{U(\theta_0)}{U'(\theta_0)}$$
and study the two appearances of $U$. We have shown that $U(\theta_0)$ is a sum of iid mean 0 random variables. The central limit theorem thus proves that $n^{-1/2}\,U(\theta_0)$ converges in distribution to $N(0,\sigma^2)$, where $\sigma^2 = \mathrm{Var}(U_1) = I_1(\theta_0)$. Next observe that
$$-U'(\theta) = \sum_{i=1}^n V_i,$$
where again
$$E_\theta(V_i) = E_\theta(U_i^2) = I_1(\theta).$$
The law of large numbers can be applied to show
$$-\frac{U'(\theta_0)}{n} \to I_1(\theta_0).$$
Now manipulate our Taylor expansion as follows:
$$n^{1/2}\,(\hat\theta - \theta_0) \approx \frac{n^{-1/2}\, U(\theta_0)}{-n^{-1}\, U'(\theta_0)}.$$
Apply Slutsky's Theorem to conclude that the right hand side of this converges in distribution to $N\big(0, \sigma^2/I_1^2(\theta_0)\big)$, which simplifies, because of the identities, to $N\big(0, 1/I_1(\theta_0)\big)$.
Summary
In regular families $n^{1/2}(\hat\theta - \theta_0)$ converges in distribution to $N\big(0, 1/I_1(\theta_0)\big)$. We usually simply say that the mle is consistent and asymptotically normal with an asymptotic variance which is the inverse of the Fisher information. This assertion is actually valid for vector valued $\theta$, where now $I$ is a matrix with $ij$th entry
$$I_{ij}(\theta) = -E_\theta\left[\frac{\partial^2 \ell(\theta)}{\partial \theta_i\, \partial \theta_j}\right].$$
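Here is a small simulation check of this summary, using the Poisson($\theta$) illustration above where $\hat\theta = \bar X$ and $1/I_1(\theta) = \theta$ (a sketch; the sample size, number of replications, and seed are arbitrary):

import numpy as np

rng = np.random.default_rng(4)
theta0, n, nrep = 2.0, 100, 20000

# Poisson(theta): the mle is the sample mean and I_1(theta) = 1/theta.
x = rng.poisson(theta0, size=(nrep, n))
theta_hat = x.mean(axis=1)

z = np.sqrt(n) * (theta_hat - theta0)   # should be roughly N(0, 1/I_1) = N(0, theta0)
print("simulated variance of sqrt(n)(theta_hat - theta0):", z.var().round(3))
print("theoretical 1/I_1(theta0):                        ", theta0)
# The two numbers agree closely, as the asymptotic theory predicts.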
Estimating Equations
The same ideas arise in almost any model where estimates are derived by solving some equation. As an example I sketch large sample theory for Generalized Linear Models.
Suppose that for $i=1,\ldots,n$ we have observations of the numbers of cancer cases $Y_i$ in some group of people characterized by values $x_i$ of some covariates. You are supposed to think of $x_i$ as containing variables like age, a dummy variable for sex, average income, and so on. A parametric regression model for the $Y_i$ might postulate that $Y_i$ has a Poisson distribution with mean $\mu_i$ where the mean depends somehow on the covariate values. Typically we might assume that
$$g(\mu_i) = x_i\beta,$$
where $g$ is a so-called link function, often for this case $g(\mu) = \log(\mu)$, and $x_i\beta$ is a matrix product with $x_i$ written as a row vector and $\beta$ a column vector. This is supposed to function as a ``linear regression model with Poisson errors''. I will do as a special case
$$\log(\mu_i) = \beta x_i,$$
where $x_i$ is a scalar. The log likelihood is simply
$$\ell(\beta) = \sum_{i=1}^n \big(Y_i \log \mu_i - \mu_i\big),$$
ignoring irrelevant factorials. The score function is, since $\log(\mu_i) = \beta x_i$,
$$U(\beta) = \sum_{i=1}^n \big(Y_i x_i - x_i \mu_i\big) = \sum_{i=1}^n x_i\big(Y_i - e^{\beta x_i}\big).$$
(Notice again that the score has mean 0 when you plug in the true parameter value.) The key observation, however, is that it is not necessary to believe that $Y_i$ has a Poisson distribution to make solving the equation $U=0$ sensible. Suppose only that
$$E(Y_i) = e^{\beta x_i}.$$
Then we have assumed that
$$E_\beta\big(U(\beta)\big) = 0.$$
This was the key condition in proving that there was a root of the likelihood equations which was consistent, and here it is what is needed, roughly, to prove that the equation $U(\beta)=0$ has a consistent root $\hat\beta$. Ignoring higher order terms in a Taylor expansion will give
$$\hat\beta - \beta \approx \frac{U(\beta)}{V(\beta)},$$
where
$$V(\beta) = -U'(\beta) = \sum_{i=1}^n x_i^2 e^{\beta x_i}.$$
In the mle case we had identities relating the expectation of $V$ to the variance of $U$. In general here we have
$$\mathrm{Var}\big(U(\beta)\big) = \sum_{i=1}^n x_i^2\, \mathrm{Var}(Y_i).$$
If $Y_i$ is Poisson with mean $\mu_i$ (and so $\mathrm{Var}(Y_i) = \mu_i$) this is
$$\mathrm{Var}\big(U(\beta)\big) = \sum_{i=1}^n x_i^2 \mu_i.$$
Moreover we have
$$E\big(V(\beta)\big) = \sum_{i=1}^n x_i^2 e^{\beta x_i} = \sum_{i=1}^n x_i^2 \mu_i,$$
and so $\mathrm{Var}(U) = E(V)$. The central limit theorem (the Lyapunov kind) will show that $U(\beta)$ has an approximate normal distribution with variance
$$\sigma_U^2 = \sum_{i=1}^n x_i^2\, \mathrm{Var}(Y_i),$$
and so
$$\mathrm{Var}\big(\hat\beta\big) \approx \frac{\sigma_U^2}{V^2(\beta)} = \frac{\sum x_i^2\, \mathrm{Var}(Y_i)}{\left(\sum x_i^2 \mu_i\right)^2}.$$
If $\mathrm{Var}(Y_i) = \mu_i$, as it is for the Poisson case, the asymptotic variance simplifies to $1/\sum x_i^2 \mu_i$.
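The whole calculation is easy to mimic numerically (a sketch; the covariates, the true $\beta$, and the use of Newton's method are illustrative choices and not part of the notes):

import numpy as np

rng = np.random.default_rng(5)
n, beta_true = 200, 0.5
x = rng.uniform(0, 2, size=n)              # scalar covariates
y = rng.poisson(np.exp(beta_true * x))     # Poisson counts with log mean beta * x

def U(b):       # estimating function  U(beta) = sum x_i (Y_i - exp(beta x_i))
    return np.sum(x * (y - np.exp(b * x)))

def V(b):       # V(beta) = -U'(beta) = sum x_i^2 exp(beta x_i)
    return np.sum(x ** 2 * np.exp(b * x))

b = 0.0                                    # solve U(beta) = 0 by Newton's method
for _ in range(25):
    b = b + U(b) / V(b)

mu_hat = np.exp(b * x)
model_var = 1.0 / np.sum(x ** 2 * mu_hat)                      # uses Var(Y_i) = mu_i
sandwich_var = np.sum(x ** 2 * (y - mu_hat) ** 2) / V(b) ** 2  # no Poisson assumption
print(f"beta_hat = {b:.3f}, model-based var = {model_var:.5f}, sandwich var = {sandwich_var:.5f}")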
Notice that other estimating equations are possible. People suggest alternatives very often. If $w_i$ is any set of deterministic weights (even possibly depending on $x_i$) then we could define
$$U(\beta) = \sum_{i=1}^n w_i\big(Y_i - e^{\beta x_i}\big)$$
and still conclude that $U=0$ probably has a consistent root which has an asymptotic normal distribution. This idea is being used all over the place these days: see, for example, Zeger and Liang's Generalized Estimating Equations (GEE), which the econometricians call the Generalized Method of Moments.
Problems with maximum likelihood
- 1. In problems with many parameters the approximations don't work very well and maximum likelihood estimators can be far from the right answer. See your homework for the Neyman-Scott example where the mle is not consistent.
- 2. When there are multiple roots of the likelihood equation you must choose the right root. To do so you might start with a different consistent estimator and then apply some iterative scheme like Newton-Raphson to the likelihood equations to find the mle. It turns out that not many steps of Newton-Raphson are generally required if the starting point is a reasonable estimate (see the sketch below).
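As an illustration of point 2 (a sketch reusing the Cauchy location example; taking the sample median as the preliminary consistent estimate is my choice, not something specified in the notes):

import numpy as np

rng = np.random.default_rng(6)
theta0, n = 1.0, 200
x = theta0 + rng.standard_cauchy(size=n)

def U(t):         # Cauchy location score
    d = x - t
    return np.sum(2 * d / (1 + d ** 2))

def Uprime(t):    # derivative of the score
    d = x - t
    return np.sum(2 * (d ** 2 - 1) / (1 + d ** 2) ** 2)

theta = np.median(x)              # preliminary consistent estimate
for step in range(3):             # a few Newton-Raphson steps are enough
    theta = theta - U(theta) / Uprime(theta)
    print(f"after step {step + 1}: theta = {theta:.4f}")
# Starting from a consistent estimate keeps the iteration near the right root
# of the likelihood equations.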
Finding (good) preliminary point estimates
Method of Moments
Basic strategy: set sample moments equal to population moments and solve for the parameters.
Definition: The $k$th sample moment (about the origin) is
$$\frac{1}{n}\sum_{i=1}^n X_i^k.$$
The $k$th population moment is
$$E\big(X^k\big).$$
(Central moments are $\frac{1}{n}\sum_{i=1}^n (X_i - \bar X)^k$ and $E\big[(X-\mu)^k\big]$.)
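For example, for a Gamma model with shape $a$ and scale $b$ we have $E(X) = ab$ and $\mathrm{Var}(X) = ab^2$, so matching these to the sample mean and variance gives $\tilde b = \hat\sigma^2/\bar X$ and $\tilde a = \bar X/\tilde b$. A short sketch (the true parameter values and sample size are illustrative):

import numpy as np

rng = np.random.default_rng(7)
a_true, b_true = 3.0, 2.0
x = rng.gamma(a_true, b_true, size=500)

xbar, s2 = x.mean(), x.var()     # sample moments
b_mm = s2 / xbar                 # match Var = a b^2 and E = a b
a_mm = xbar / b_mm

print(f"method of moments: a = {a_mm:.3f} (true {a_true}), b = {b_mm:.3f} (true {b_true})")
# These make reasonable starting values for Newton-Raphson applied to the
# Gamma likelihood equations.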
Richard Lockhart
1998-10-29