Reading for Today's Lecture:
Goals of Today's Lecture:
Today's notes
Definition: A model is a family $\{P_\theta : \theta \in \Theta\}$ of possible distributions for some random variable $X$. (Our data set is $X$, so $X$ will generally be a big vector or matrix or even a more complicated object.)
We will assume throughout this course that the true distribution $P$ of $X$ is in fact some $P_{\theta_0}$ for some $\theta_0 \in \Theta$.
We call $\theta_0$ the true value of the parameter. Notice that this assumption will be wrong; we hope it is not wrong in an important way. If we are very worried that it is wrong, we enlarge our model, putting in more distributions and making $\Theta$ bigger.
Our goal is to observe the value of $X$ and then guess $\theta_0$ or some property of $\theta_0$.
We will consider several classic mathematical versions of this problem.
There are several schools of statistical thinking with different views on how these problems should be handled.
For the next several weeks we do only the Neyman-Pearson approach, though we use that approach to evaluate the quality of likelihood methods.
Suppose you toss a coin 6 times and get Heads twice. If $p$ is the probability of getting H, then the probability of getting 2 heads is $\binom{6}{2} p^2 (1-p)^4 = 15\, p^2 (1-p)^4$.
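A small Python sketch (not part of the original notes) makes this concrete: evaluate $15\,p^2(1-p)^4$ on a grid of $p$ values and locate the maximizer.

    import numpy as np

    p = np.linspace(0.001, 0.999, 999)   # grid of candidate values for p
    prob = 15 * p**2 * (1 - p)**4        # P(2 heads in 6 tosses) as a function of p
    print(p[np.argmax(prob)])            # approximately 1/3

The maximizing value is the observed proportion of heads, $2/6$; the binomial example below shows this is not a coincidence.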
Recall that a model is a family of possible distributions for some random variable $X$. Typically the model is described by specifying $\{f_\theta : \theta \in \Theta\}$, the set of possible densities of $X$.
Definition: The likelihood function is the function $L$ whose domain is $\Theta$ and whose values are given by $L(\theta) = f_\theta(X)$.
The key point is to think about how the density depends on $\theta$, not about how it depends on $X$. Notice that $X$, the observed value of the data, has been plugged into the formula for the density. Notice also that the coin tossing example is like this, but with $f$ being the discrete density. We use the likelihood in most of our inference problems:
Maximum Likelihood Estimation
To find an MLE we maximize L. This is a typical function maximization problem which we approach by setting the gradient of L equal to 0 and then checking to see that the root is a maximum, not a minimum or saddle point.
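In simple problems the maximization can also be done numerically. Here is a minimal sketch (illustrative code, using the coin example above) that minimizes the negative log of $L$ with scipy:

    import numpy as np
    from scipy.optimize import minimize_scalar

    n, x = 6, 2   # 6 tosses, 2 heads, as in the coin example

    def neg_log_L(p):
        # minus the log of p^2 (1-p)^4; the constant factor 15 does not move the maximizer
        return -(x * np.log(p) + (n - x) * np.log(1 - p))

    res = minimize_scalar(neg_log_L, bounds=(1e-6, 1 - 1e-6), method="bounded")
    print(res.x)   # approximately 1/3

Minimizing $-\log L$ instead of maximizing $L$ is standard practice: the logarithm turns products into sums, and numerical optimizers conventionally minimize.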
We begin by examining some likelihood plots in examples:
Cauchy Data
We have a sample $X_1, \ldots, X_n$ from the Cauchy($\theta$) density
$f_\theta(x) = \frac{1}{\pi \left( 1 + (x - \theta)^2 \right)} ,$
so the likelihood is $L(\theta) = \prod_{i=1}^n f_\theta(X_i)$. Here are some plots of this likelihood for 6 samples of size 5.
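Those plots are not reproduced here, but a sketch in the same spirit (illustrative code, with an assumed true value $\theta = 0$) draws the likelihood for 6 simulated samples of size 5:

    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(1)
    theta_grid = np.linspace(-10, 10, 1000)

    fig, axes = plt.subplots(2, 3, figsize=(9, 5))
    for ax in axes.flat:
        x = rng.standard_cauchy(5)   # sample of size 5, true theta = 0
        # L(theta) = prod_i 1 / (pi * (1 + (x_i - theta)^2))
        L = np.prod(1.0 / (np.pi * (1 + (x[:, None] - theta_grid) ** 2)), axis=0)
        ax.plot(theta_grid, L)
        ax.set_xlabel("theta")
    plt.tight_layout()
    plt.show()

With samples this small the Cauchy likelihood is often multimodal, which is the striking feature of these plots.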
To maximize this likelihood we would have to differentiate $L$ and set the result equal to 0. Notice that $L$ is a product of $n$ terms, so by the product rule its derivative is a sum of $n$ terms, each itself a product of $n-1$ factors; it is much easier to work with the logarithm of $L$.
Definition: The Log Likelihood function is $\ell(\theta) = \log L(\theta)$.
For the Cauchy problem we have
$\ell(\theta) = -\sum_{i=1}^n \log\left( 1 + (X_i - \theta)^2 \right) - n \log \pi .$
You can see that the likelihood will tend to 0 as $\theta \to \pm\infty$, so that the maximum of $\ell$ will occur at a root of $\ell'$, the derivative of $\ell$ with respect to $\theta$.
Definition: The Score Function is the gradient of $\ell$:
$U(\theta) = \frac{\partial \ell}{\partial \theta} .$
The MLE $\hat\theta$ usually solves the Likelihood Equations $U(\theta) = 0$. In our Cauchy example we find
$U(\theta) = \sum_{i=1}^n \frac{2 (X_i - \theta)}{1 + (X_i - \theta)^2} .$
Here are some plots of the score functions for $n = 5$ for our Cauchy data sets. Each score is plotted beneath a plot of the corresponding $\ell$.
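A numerical sketch (illustrative code, not from the notes) of solving $U(\theta) = 0$ for one such sample, by scanning for sign changes and refining each with a root finder:

    import numpy as np
    from scipy.optimize import brentq

    rng = np.random.default_rng(1)
    x = rng.standard_cauchy(5)   # one Cauchy sample of size 5

    def score(theta):
        # U(theta) = sum_i 2 (x_i - theta) / (1 + (x_i - theta)^2)
        return np.sum(2 * (x - theta) / (1 + (x - theta) ** 2))

    # scan a grid for sign changes; every root of U lies between min(x) and max(x),
    # so widen the grid if the sample contains extreme observations
    grid = np.linspace(-20, 20, 4001)
    vals = np.array([score(t) for t in grid])
    roots = [brentq(score, grid[i], grid[i + 1])
             for i in range(len(grid) - 1)
             if vals[i] * vals[i + 1] < 0]
    print(roots)   # may contain several roots of U

Each root of $U$ is only a candidate; when there are several, the values of $\ell$ at the roots must be compared to find the actual MLE.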
If $X$ has a Binomial($n, p$) distribution then
$L(p) = \binom{n}{X} p^X (1-p)^{n-X} .$
The function $L$ is 0 at $p = 0$ and at $p = 1$ unless $X = 0$ or $X = n$, so for $0 < X < n$ the MLE must be found by setting $U = 0$, which gives $\hat p = X/n$.
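For completeness, the algebra behind this (a standard calculation):

$\ell(p) = \log\binom{n}{X} + X \log p + (n - X) \log(1 - p)$, so $U(p) = \frac{X}{p} - \frac{n - X}{1 - p}$.

Setting $U(p) = 0$ gives $X(1 - p) = (n - X)p$, i.e. $X = np$, hence $\hat p = X/n$.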
The Normal Distribution
Now we have $X_1, \ldots, X_n$ iid $N(\mu, \sigma^2)$. There are two parameters, $\theta = (\mu, \sigma)$. We find
$\ell(\mu, \sigma) = -\frac{1}{2\sigma^2} \sum_{i=1}^n (X_i - \mu)^2 - n \log\sigma - \frac{n}{2} \log(2\pi)$
and
$U(\mu, \sigma) = \begin{pmatrix} \sum_i (X_i - \mu)/\sigma^2 \\ \sum_i (X_i - \mu)^2/\sigma^3 - n/\sigma \end{pmatrix} .$
Notice that $U$ is a function with two components because $\theta$ has two components. Setting the score $U$ equal to 0 and solving gives
$\hat\mu = \bar X \qquad \text{and} \qquad \hat\sigma = \left( \frac{1}{n} \sum_i (X_i - \bar X)^2 \right)^{1/2} .$
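A quick numerical check of these closed forms (illustrative code; note that the MLE of $\sigma$ divides by $n$, not $n-1$):

    import numpy as np

    rng = np.random.default_rng(3)
    x = rng.normal(5, 2, 1000)                       # simulated sample, true mu = 5, sigma = 2

    mu_hat = np.mean(x)                              # MLE of mu: the sample mean
    sigma_hat = np.sqrt(np.mean((x - mu_hat) ** 2))  # MLE of sigma: divide by n, not n-1
    print(mu_hat, sigma_hat)                         # close to 5 and 2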
Here is a contour plot of the normal log likelihood for two data sets, one with $n = 10$ and one with $n = 100$.
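The contour plots are not reproduced here; a sketch (illustrative code, simulating standard normal data) produces a comparable pair of figures:

    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(2)

    def log_lik(x, mu, sigma):
        # l(mu, sigma) = -sum (x_i - mu)^2 / (2 sigma^2) - n log sigma - (n/2) log(2 pi)
        n = len(x)
        return (-np.sum((x[:, None, None] - mu) ** 2, axis=0) / (2 * sigma**2)
                - n * np.log(sigma) - n * np.log(2 * np.pi) / 2)

    fig, axes = plt.subplots(1, 2, figsize=(10, 4))
    for ax, n in zip(axes, (10, 100)):
        x = rng.normal(0, 1, n)
        mu, sigma = np.meshgrid(np.linspace(-1, 1, 200), np.linspace(0.5, 2, 200))
        ax.contour(mu, sigma, log_lik(x, mu, sigma), levels=20)
        ax.set_title(f"n = {n}")
        ax.set_xlabel("mu")
        ax.set_ylabel("sigma")
    plt.tight_layout()
    plt.show()

Note how the contours for $n = 100$ are much more concentrated and nearly elliptical, centred near $(\hat\mu, \hat\sigma)$.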
We now turn to theory to explain the features of these plots, at least approximately in large samples.