
STAT 870 Lecture 6

Consistency of MLE

Suppose $X_1,X_2,\ldots$ are iid with density $f(x,\theta_o)$ where

\begin{displaymath}f(\cdot,\theta);\theta\in\Theta \subset {\Bbb R}
\end{displaymath}

is a family of densities.

Under what conditions is the MLE of $\theta$ almost surely consistent?

Goal: find conditions under which

\begin{displaymath}P(\hat\theta_n \to \theta_o)=1
\end{displaymath}

where $\hat\theta_n$ is the MLE.

General technical problems:

Example: Cauchy$(\theta)$ density is

\begin{displaymath}f(x,\theta) = \frac{1}{\pi\left\{1+(x-\theta)^2\right\}}
\end{displaymath}

For a sample $X_1,\ldots,X_n$ the likelihood is

\begin{displaymath}\frac{1}{\pi^n \prod_1^n\{1+(X_i-\theta)^2\}}
\end{displaymath}

We ``define'' $\hat\theta_n$ to be the value of $\theta$ which maximizes this function of $\theta$.

This is supposed to define $\hat\theta$ as a function of $X_1,\ldots,X_n$.

Underlying supposition: for each $x_1,\ldots,x_n$ there exists a unique $\hat\theta(x_1,\ldots,x_n)$ which maximizes the likelihood. If this were so, we would have defined a function from ${\Bbb R}^n$ to ${\Bbb R}$. A useful tool is the log-likelihood:

\begin{displaymath}\ell(\theta\vert x_1,\ldots,x_n) = -n\log(\pi) - \sum_1^n \log(1+(x_i-\theta)^2)
\end{displaymath}
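As a numerical aside, the maximization can be carried out directly on a computer. Here is a minimal Python sketch, assuming NumPy and SciPy are available; the sample values are hypothetical:

\begin{verbatim}
import numpy as np
from scipy.optimize import minimize_scalar

def cauchy_loglik(theta, x):
    # ell(theta | x) = -n log(pi) - sum_i log(1 + (x_i - theta)^2)
    return -len(x) * np.log(np.pi) - np.sum(np.log1p((x - theta) ** 2))

x = np.array([-1.3, 0.2, 0.8, 1.1, 4.1])  # hypothetical sample

# Any critical point lies in [min(x), max(x)]: above max(x) every
# term of ell' is negative, below min(x) every term is positive.
res = minimize_scalar(lambda t: -cauchy_loglik(t, x),
                      bounds=(x.min(), x.max()), method="bounded")
print(res.x)  # a local -- possibly not global -- maximizer
\end{verbatim}

The caveat in the last comment is the point of what follows: a search of this kind returns some maximizer, and nothing so far guarantees there is only one.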

Problems:

1.
Is there, for every $(x_1,\ldots,x_n)$ a $\theta$ which maximizes $\ell$?

2.
If so is the $\theta$ unique?

3.
If so is $\hat\theta_n(x_1,\ldots,x_n)$ a Borel function of $x_1,\ldots,x_n$?

Question 1: For the Cauchy density there is always a maximizer. Fix $(x_1,\ldots,x_n)$. As $\theta\to\pm\infty$ it is easy to check that

\begin{displaymath}\ell(\theta\vert x_1,\ldots,x_n) \to -\infty
\end{displaymath}

There is then an M such that

\begin{displaymath}\sup\{\ell(\theta\vert x_1,\ldots,x_n) ; \vert\theta\vert >M\}
\le \sup\{\ell(\theta\vert x_1,\ldots,x_n) ; \vert\theta\vert \le M\}
\end{displaymath}
Now the function

\begin{displaymath}\theta\mapsto\ell(\theta\vert x_1,\ldots,x_n)
\end{displaymath}

is continuous so that it attains its maximum over $[-M,M]$. This shows the existence of at least one maximizing $\theta$ for any set of $x$ values.

Question 2: take $n=2$ and $x_1=x=-x_2$:

\begin{displaymath}\ell(\theta\vert x,-x)
\end{displaymath}

is an even function of $\theta$. Its derivative is

\begin{displaymath}\ell^\prime = \frac{2(x-\theta)}{1+(x-\theta)^2}
-\frac{2(x+\theta)}{1+(x+\theta)^2}
\end{displaymath}

At $\theta=0$ this is 0, so $\theta=0$ is a critical point of $\ell$. The second derivative at 0 is

\begin{displaymath}\ell^{\prime\prime}(0) = \frac{4(x^2-1)}{(1+x^2)^2}
\end{displaymath}

If $\vert x\vert <1$ this is negative, so 0 is a local maximum, but if $\vert x\vert > 1$ it is a local minimum. In the latter case, since $\ell$ is even, there must be two maxima, one on either side of 0. Note that putting the two terms in $\ell^\prime$ over a common denominator gives a numerator which is a multiple of

\begin{displaymath}\theta(\theta^2-(x^2-1))
\end{displaymath}

Notice there are exactly three real roots, namely 0 and $\pm\sqrt{x^2-1}$, if $x^2>1$.
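These algebraic facts are easy to verify symbolically; a quick check, assuming SymPy is available:

\begin{verbatim}
import sympy as sp

theta, x = sp.symbols("theta x", real=True)

# log-likelihood for the sample (x, -x), additive constants dropped
ell = -sp.log(1 + (x - theta) ** 2) - sp.log(1 + (x + theta) ** 2)

num, _ = sp.fraction(sp.together(sp.diff(ell, theta)))
print(sp.factor(num))
# a multiple of theta*(theta**2 - (x**2 - 1))

print(sp.simplify(sp.diff(ell, theta, 2).subs(theta, 0)))
# equals 4*(x**2 - 1)/(1 + x**2)**2
\end{verbatim}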

Summary: defining $\hat\theta$ to be the maximizer of $\ell$ does not actually define a function.

Alternative strategies:

1: You might pick one of the maximizing $\theta$ values in an unequivocal way:

\begin{displaymath}\hat\theta = \inf\{\theta: \ell(\theta) = \sup\ell\}
\end{displaymath}

(The set of such $\theta$ is non-empty and bounded, so such a $\hat\theta$ exists and is finite; by continuity of $\ell$ it satisfies $\ell(\hat\theta)=\sup\ell$.)
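On a computer this tie-breaking rule can be mimicked on a finite grid. A rough Python sketch (the grid range, resolution, and tolerance are arbitrary choices):

\begin{verbatim}
import numpy as np

def smallest_grid_maximizer(ell, grid, tol=1e-9):
    # Discrete stand-in for inf{theta : ell(theta) = sup ell}:
    # among grid points within tol of the maximum (tol guards
    # against floating-point ties), return the smallest.
    vals = np.array([ell(t) for t in grid])
    return grid[vals >= vals.max() - tol].min()

# Bimodal two-point example with x = 2: maxima near +/- sqrt(3);
# the rule picks the left one, about -1.732.
ell = lambda t: -np.log1p((2.0 - t) ** 2) - np.log1p((2.0 + t) ** 2)
print(smallest_grid_maximizer(ell, np.linspace(-5, 5, 100001)))
\end{verbatim}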

2: You might try defining $\hat\theta$ to be a suitably chosen critical point of $\ell$.

3: You might try to prove that
\begin{multline*}P({\rm card}(\{\theta: \ell(\theta\vert X_1,\ldots,X_n)
\\ = \sup_\phi\ell(\phi\vert X_1,\ldots,X_n)\})=1)=1
\end{multline*}
In other words it might be true that the set of $\theta$ where $\ell$ achieves its maximum is almost surely a singleton when the $x_i$ form an actual sample.

I am going to follow strategy 2 since this is the one which works most generally.

For $x=(x_1,\ldots,x_n)$ we define the order statistics

\begin{displaymath}x_{(1)} \le \cdots\le x_{(n)}
\end{displaymath}

to be the entries in $x$ sorted into non-decreasing order. If $n=2m-1$ is odd, set

\begin{displaymath}g_n(x_1,\ldots,x_n) = x_{(m)}
\end{displaymath}

If $n=2m$, set

\begin{displaymath}g_n(x_1,\ldots,x_n) = (x_{(m)}+x_{(m+1)})/2.
\end{displaymath}

Now define

\begin{displaymath}\tilde\theta_n = g_n(X_1,\ldots,X_n)
\end{displaymath}
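Thus $\tilde\theta_n$ is just the sample median. A short Python version of $g_n$, equivalent to numpy.median:

\begin{verbatim}
import numpy as np

def g_n(x):
    # Sort to get x_((1)) <= ... <= x_((n)); return x_((m)) when
    # n = 2m - 1 and (x_((m)) + x_((m+1)))/2 when n = 2m.
    x = np.sort(np.asarray(x, dtype=float))
    n = len(x)
    if n % 2 == 1:
        return x[n // 2]
    return (x[n // 2 - 1] + x[n // 2]) / 2.0

print(g_n([3.0, -1.0, 2.0]))        # 2.0
print(g_n([3.0, -1.0, 2.0, 10.0]))  # 2.5
\end{verbatim}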

Lemma 1   If $X_1,\ldots,X_n$ are iid from a distribution F with the properties:

1.
F(0)=1/2.

2.
For each $\epsilon>0$

\begin{displaymath}F(-\epsilon) < 1/2 < F(\epsilon)
\end{displaymath}

Then $\tilde\theta_n$ converges almost surely to 0.
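Lemma 1 can be illustrated (though certainly not proved) by simulation: the standard Cauchy distribution has median 0 and a strictly increasing CDF, so it satisfies both hypotheses. A sketch along a single sample path, assuming NumPy; the seed is arbitrary:

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_cauchy(100000)  # one path of iid standard Cauchy

# The sample median settles near 0 as n grows -- note that the
# sample mean could not be used here, since Cauchy has no mean.
for n in (10, 100, 1000, 10000, 100000):
    print(n, np.median(x[:n]))
\end{verbatim}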

Remark: Part of the assertion of the lemma is that

\begin{displaymath}A\equiv \{\omega: \tilde\theta_n\to 0\}
\end{displaymath}

is an event.

Proof: We begin by formalizing an argument we have used several times.

Lemma 2   Suppose $Y_n$ is a sequence of random variables. Then $Y_n \to 0$ almost surely is equivalent to $P(C_\epsilon)=1$ for each $\epsilon>0$ where

\begin{displaymath}C_\epsilon=\bigcup_{N=1}^\infty \bigcap_{n=N}^\infty \{\vert Y_n\vert\le \epsilon\}
\end{displaymath}

Fix $\epsilon>0$. For each x the rvs $Y_1,Y_2,\ldots$ defined by $Y_k=1(X_k \le x)-F(x)$ are iid with mean 0. According to the SLLN there is a null set $N_x$ such that for all $\omega\not\in N_x$ we have

\begin{displaymath}\frac{1}{n} \sum_1^n Y_k \to 0 \, .
\end{displaymath}

Let $N=N_{\epsilon}\cup N_{-\epsilon}$. Then N is a null set. If $\omega\not\in N$ then

\begin{displaymath}\frac{1}{n}\sum_1^n 1(X_k \le \epsilon) \to F(\epsilon) > 1/2
\end{displaymath}

and

\begin{displaymath}\frac{1}{n}\sum_1^n 1(X_k \le -\epsilon) \to F(-\epsilon) < 1/2
\end{displaymath}

For any such $\omega$ there is an M such that for all $n\ge M$ the number of $X_i$ exceeding $\epsilon$ is less than n/2 and the number of $X_i$ less than or equal to $-\epsilon$ is less than n/2. Thus for such $\omega$, and $n\ge M$,

\begin{displaymath}-\epsilon \le \tilde\theta_n \le \epsilon
\end{displaymath}

In other words $C_\epsilon^c \subset N$, so $P(C_\epsilon)=1$. Since $\epsilon>0$ was arbitrary, Lemma 2 shows that $\tilde\theta_n \to 0$ almost surely. $\bullet$
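The engine of this proof is just the SLLN applied to the indicators $1(X_k\le \pm\epsilon)$, which is easy to watch numerically. A quick check with standard Cauchy data, assuming NumPy; seed and sample size are arbitrary:

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_cauchy(200000)
eps = 0.5

# Standard Cauchy CDF: F(t) = 1/2 + arctan(t)/pi
F = lambda t: 0.5 + np.arctan(t) / np.pi

print(np.mean(x <= eps), F(eps))    # both about 0.648 > 1/2
print(np.mean(x <= -eps), F(-eps))  # both about 0.352 < 1/2
\end{verbatim}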


Richard Lockhart
2000-10-03