
STAT 870 Lecture 7

Consistency of MLE

Cauchy problem: for each vector $(x_1,\ldots,x_n)$ we define $h_n(x_1,\ldots,x_n)$ to be that root of

\begin{displaymath}\ell^\prime(\theta\vert x_1,\ldots,x_n)
= \sum\frac{2(x_i-\theta)}{1+(x_i-\theta)^2} =0
\end{displaymath}

which is closest to $g_n(x_1,\ldots,x_n)$ (here $g_n$ is the Borel function used in defining the median above). If $g_n(x_1,\ldots,x_n)$ is exactly midway between two roots which are tied for closest, we define $h_n$ to be the root smaller than $g_n$.
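For reference, $\ell^\prime$ here is the derivative of the Cauchy log likelihood

\begin{displaymath}\ell(\theta\vert x_1,\ldots,x_n)
= -\sum \log\{1+(x_i-\theta)^2\} - n\log\pi \, .
\end{displaymath}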

It is possible to prove that this defines a Borel function from ${\Bbb R}^n$ to ${\Bbb R}$. Now define

\begin{displaymath}\hat\theta_n = h_n(X_1,\ldots,X_n)
\end{displaymath}

I claim that if $X_1,X_2,\ldots$ are iid Cauchy(0) then

 \begin{displaymath}
\hat\theta_n \to 0
\end{displaymath} (1)

almost surely.
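Before turning to the proof, here is a minimal numerical sketch (assuming NumPy and SciPy; the function name and the bracketing scheme are illustrative only) of how such an estimate can be computed. It solves $\ell^\prime(\theta)=0$ on a bracket expanded around the sample median, which locates a root near the median, though not necessarily the closest one as required by the formal definition of $h_n$.

\begin{verbatim}
import numpy as np
from scipy.optimize import brentq

def theta_hat(x, width=1.0):
    # Root of the Cauchy score near the sample median (illustrative sketch).
    med = np.median(x)
    def score(t):
        return np.sum(2.0 * (x - t) / (1.0 + (x - t) ** 2))
    lo, hi = med - width, med + width
    # Expand the bracket until the score changes sign across it; the score
    # is positive far to the left and negative far to the right.
    while score(lo) * score(hi) > 0:
        lo -= width
        hi += width
    return brentq(score, lo, hi)

x = np.random.standard_cauchy(1000)   # a Cauchy(0) sample
print(theta_hat(x))                   # close to 0 for large n
\end{verbatim}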

To prove this, fix $\epsilon > 0$; we prove that $P(C_\epsilon)=1$ where

\begin{displaymath}C_\epsilon= \bigcup_{N=1}^\infty \bigcap_{n=N}^\infty
\{\vert\hat\theta_n\vert \le \epsilon\}
\end{displaymath}

Strategy: find an event $D_\epsilon$ inside $C_\epsilon$ with $P(D_\epsilon)=1$. To define this new event note that if

1.
$\ell^\prime$ has a unique root over $[-3\epsilon,3\epsilon]$, and

2.
that root is actually in $[-\epsilon,\epsilon]$ and

3.
$\vert\tilde\theta_n\vert \le \epsilon$ (here $\tilde\theta_n = g_n(X_1,\ldots,X_n)$ is the sample median),

then the root of $\ell^\prime$ closest to $\tilde\theta_n$ is the root in points 1 and 2, and so

\begin{displaymath}\vert\hat\theta_n\vert \le \epsilon
\end{displaymath}

Define $ D^{(1)}_\epsilon $ to be the event that there is an N such that for all $n \ge N$ and all $\vert\theta\vert\le 3\epsilon$ we have

\begin{displaymath}\ell^{\prime\prime}(\theta\vert X_1,\ldots,X_n) < 0
\end{displaymath}

Define $ D^{(2)}_\epsilon$ to be the event that there is an N such that for all $n \ge N$

\begin{displaymath}\ell^{\prime}(\epsilon\vert X_1,\ldots,X_n) < 0
\end{displaymath}

and

\begin{displaymath}\ell^{\prime}(-\epsilon\vert X_1,\ldots,X_n) > 0
\end{displaymath}

(On $D^{(1)}_\epsilon\cap D^{(2)}_\epsilon$ the intermediate value theorem and the strict monotonicity of $\ell^\prime$ on $[-3\epsilon,3\epsilon]$ give points 1 and 2 above.)

Finally define $ D^{(3)}_\epsilon$ to be the event that there is an N such that for all $n \ge N$

\begin{displaymath}\vert\tilde\theta_n\vert \le \epsilon
\end{displaymath}

Already shown: $P(D^{(3)}_\epsilon)=1$.

Next show $P( D^{(2)}_\epsilon)=1$. Note that

\begin{displaymath}U_k(\epsilon) = \frac{2(X_k-\epsilon)}{1+(X_k-\epsilon)^2}
\end{displaymath}

and

\begin{displaymath}U_k(-\epsilon) = \frac{2(X_k+\epsilon)}{1+(X_k+\epsilon)^2}
\end{displaymath}

Then

\begin{displaymath}\frac{1}{n}\ell^{\prime}(\epsilon\vert X_1,\ldots,X_n)
=\overline{U_n(\epsilon)}
\end{displaymath}

and

\begin{displaymath}\frac{1}{n}\ell^{\prime}(-\epsilon\vert X_1,\ldots,X_n)
=\overline{U_n(-\epsilon)}
\end{displaymath}

Thus $ D^{(2)}_\epsilon$ is the event that there is an N such that for all $n \ge N$

\begin{displaymath}\overline{U_n(\epsilon)} < 0 \quad \text{and}
\quad \overline{U_n(-\epsilon)} > 0
\end{displaymath}

$\overline{U_n(-\epsilon)}$ and $\overline{U_n(\epsilon)}$ are averages of iid variates. Apply the SLLN and show
\begin{align*}{\rm E}(U_k(\epsilon)) & < 0
\\
{\rm E}(U_k(-\epsilon)) & > 0 \, .
\end{align*}
In fact
\begin{align*}{\rm E}(U_k(\epsilon)) & = \frac{-2\epsilon}{\epsilon^2+4} < 0
\\
{\rm E}(U_k(-\epsilon)) & = \frac{2\epsilon}{\epsilon^2+4} > 0 \, .
\end{align*}

Defect: the argument is not easy to generalize, since it relies on an exact computation of a moment.

More general tactic: use Jensen's inequality.

If $\ell$ is smaller at $-\epsilon$ and at $\epsilon$ than it is at 0 then there must be a critical point in $[-\epsilon,\epsilon]$, that is, a root of $\ell^\prime$: the maximum of $\ell$ over $[-\epsilon,\epsilon]$ is then attained at an interior point, where $\ell^\prime$ vanishes.

Define

\begin{displaymath}L_i(\theta) = \log(1+X_i^2)-\log(1+(X_i-\theta)^2) \, .
\end{displaymath}

Note that $\ell(\theta)-\ell(0) = \sum L_i(\theta)$.
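Equivalently, writing $f(x\vert\theta)$ for the Cauchy($\theta$) density, $L_i(\theta)$ is a log likelihood ratio (the factors of $\pi$ cancel):

\begin{displaymath}L_i(\theta) = \log\frac{f(X_i\vert\theta)}{f(X_i\vert 0)} \, ,
\qquad f(x\vert\theta) = \frac{1}{\pi\{1+(x-\theta)^2\}} \, .
\end{displaymath}

This is the form to which Jensen's inequality is applied below.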

$ D^{(4)}_\epsilon $ is the event $\exists N$ such that $\forall n \ge N$

\begin{displaymath}\{ \sum L_i(\epsilon) < 0, \sum L_i(-\epsilon) <0\} \, .
\end{displaymath}

Define

\begin{displaymath}\mu(\epsilon) = {\rm E}(L_i(\epsilon))
\end{displaymath}

The SLLN shows $P(D^{(5)}_\epsilon)=1$ where $D^{(5)}_\epsilon$ is the event

\begin{displaymath}\{ \sum L_i(\epsilon) / n \to \mu(\epsilon),
\sum L_i(-\epsilon)/n \to \mu(-\epsilon)\}
\end{displaymath}

Claim: $\mu(\epsilon)< 0$ for all $\epsilon\neq 0$.

If so then (since an average which converges to a strictly negative limit must be negative for all large $n$)

\begin{displaymath}D^{(5)}_\epsilon \subset D^{(4)}_\epsilon
\end{displaymath}

and so $P(D^{(4)}_\epsilon )=1$.

To prove the claim we apply Jensen's inequality:

Proposition 1 (Jensen)   Suppose Y is a random variable and $\phi$ is a function which is convex on an interval (a,b) with $P(a<Y<b)=1$. Assume ${\rm E}(\vert Y\vert)< \infty$. Then

\begin{displaymath}\phi({\rm E}(Y)) \le {\rm E}(\phi(Y))
\end{displaymath}

If $\phi$ is strictly convex then the inequality is strict unless ${\rm Var}(Y)=0$.

Jargon: $\phi$ is convex if for each x,y and $\lambda\in (0,1)$

\begin{displaymath}\phi(\lambda x +(1-\lambda)y) \le \lambda \phi(x) +(1-\lambda)\phi(y)
\end{displaymath}

We call $\phi$ strictly convex if the inequality is strict whenever $x \neq y$.

If $\phi$ is twice differentiable and $\phi^{\prime\prime} \ge 0$ then $\phi$ is convex; if $\phi^{\prime\prime} > 0$ then $\phi$ is strictly convex.
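For example, the function used in the next step satisfies this criterion:

\begin{displaymath}\phi(x) = -\log(x) \, , \qquad
\phi^{\prime\prime}(x) = \frac{1}{x^2} > 0 \quad\text{for } x > 0 \, ,
\end{displaymath}

so $\phi$ is strictly convex on $(0,\infty)$.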

Apply Jensen's inequality with $\phi(x) = - \log(x)$ to $Y=g(X)/f(X)$, where X has density f and g is some other density (so that Y is not degenerate at a constant). Then

\begin{displaymath}{\rm E}\{-\log(Y)\} >
-\log\{{\rm E}(Y)\}
\end{displaymath}

But the latter is

\begin{displaymath}-\log\left\{\int \frac{g(x)}{f(x)} f(x)\, dx\right\} = -\log(1) = 0
\end{displaymath}

Technically: interval (a,b) is $(0,\infty)$. The assumption

\begin{displaymath}P(0 < Y < \infty)=1
\end{displaymath}

deserves some discussion. If f(x)=0 for some places where g(x) is not 0 then

\begin{displaymath}{\rm E}\left\{\frac{g(X)}{f(X)}\right\} = \int g(x)1(f(x)>0)\, dx
\end{displaymath}

which might be less than 1. This just makes the inequality stronger, however.

The other technical detail is that g(x) might be 0 at some places where f(x) is not 0. This might mean P(Y=0) > 0. On the event Y=0 we will agree to take $-\log(Y)=\infty$ and conclude

\begin{displaymath}{\rm E}\{-\log(Y)\}=\infty
\end{displaymath}

In any case we find

\begin{displaymath}{\rm E}\{-\log(Y)\} > 0
\end{displaymath}

or

\begin{displaymath}{\rm E}[\log\{g(X)\} - \log\{f(X)\}] < 0 \, .
\end{displaymath}

Applied to our Cauchy problem, with $g = f(\cdot\vert\theta)$ and $f = f(\cdot\vert 0)$, this shows $\mu(\theta) < 0$ for all $\theta\neq 0$. Hence $P(D^{(4)}_\epsilon )=1$.

Finally we consider $ D^{(1)}_\epsilon $. Up to now we have been able to make do with an arbitrary $\epsilon$. In this case, however, the result holds only for small $\epsilon > 0$. First consider

\begin{displaymath}\frac{1}{n} \ell^{\prime\prime}(0\vert X_1,\ldots,X_n)
\end{displaymath}

According to the strong law of large numbers this converges almost surely to

\begin{displaymath}{\rm E}\left\{\frac{2(X^2-1)}{(1+X^2)^2}\right\} =-\frac{1}{2} < 0
\end{displaymath}
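This value is minus the Fisher information of the Cauchy location family; it follows from the standard integrals $\int (1+x^2)^{-3}\, dx = 3\pi/8$ and $\int x^2(1+x^2)^{-3}\, dx = \pi/8$:
\begin{align*}{\rm E}\left\{\frac{2(X^2-1)}{(1+X^2)^2}\right\}
& = \frac{2}{\pi}\int \frac{x^2-1}{(1+x^2)^3}\, dx
\\
& = \frac{2}{\pi}\left(\frac{\pi}{8} - \frac{3\pi}{8}\right) = -\frac{1}{2} \, .
\end{align*}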

Now you can check that

\begin{displaymath}\left\vert \frac{1}{n} \ell^{\prime\prime\prime}(\theta\vert X_1,\ldots,X_n)
\right\vert < 4 \, .
\end{displaymath}
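Indeed, a direct computation shows that each term of $\ell^{\prime\prime\prime}(\theta\vert X_1,\ldots,X_n)$ is

\begin{displaymath}\frac{4(X_i-\theta)\{(X_i-\theta)^2-3\}}{\{1+(X_i-\theta)^2\}^3} \, .
\end{displaymath}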

(In fact each term in $\ell^{\prime\prime\prime}$ may be shown to be bounded by $3/2+\sqrt{2}$.) As a result

\begin{displaymath}\frac{1}{n}\left\vert \ell^{\prime\prime}(\theta\vert X_1,\ldots,X_n)
- \ell^{\prime\prime}(0\vert X_1,\ldots,X_n) \right\vert \le 4\vert\theta\vert
\end{displaymath}

Pick $\epsilon > 0$ so that $4\epsilon < 1/2$. If B is the event

\begin{displaymath}\frac{1}{n} \ell^{\prime\prime}(0\vert X_1,\ldots,X_n)\to -\frac{1}{2}
\end{displaymath}

and $\omega$ is in B then there is an N such that for $n \ge N$ we have

\begin{displaymath}\frac{1}{n} \ell^{\prime\prime}(\theta\vert X_1,\ldots,X_n) < 0
\end{displaymath}

for all $\vert\theta\vert < \epsilon$.

This proves that for all $\epsilon$ with $3\epsilon < 1/8$

\begin{displaymath}P(D^{(1)}_\epsilon)=1 \, .
\end{displaymath}

We have now shown that for $\epsilon < 1/24$

\begin{displaymath}P(D^{(1)}_\epsilon\cap D^{(2)}_\epsilon\cap D^{(3)}_\epsilon)=1
\end{displaymath}

For $\omega$ in this event there is an N such that

\begin{displaymath}\vert\hat\theta_n\vert \le \epsilon
\end{displaymath}

for all $n \ge N$. This establishes the result.

General Case

Consider a parametric family

\begin{displaymath}\{f(x\vert\theta); a<\theta < b\}
\end{displaymath}

Let $\theta_o$ be the true value of $\theta$.

Let $A_\epsilon$ be the event: $\exists N$ such that $\forall n \ge N$ $\ell$ has a local maximum on the interval $[\theta_o-\epsilon,\theta_o+\epsilon]$.

We have proved quite generally that

\begin{displaymath}P(A_\epsilon) = 1
\end{displaymath} (2)

Add the assumption

\begin{displaymath}\ell\text{ has a continuous derivative.}
\end{displaymath} (A)

Let $B_\epsilon$ be the event that there is an N such that for all $n \ge N$ there is a critical point of $\ell$ in $(\theta_o-\epsilon,\theta_o+\epsilon)$ which is a local maximum of $\ell$. We have proved

\begin{displaymath}P(B_\epsilon) = 1
\end{displaymath} (3)

The event $B = \cap_\epsilon B_\epsilon$ then has probability 1 (a countable intersection over $\epsilon = 1/k$, $k=1,2,\ldots$, suffices since $B_\epsilon \subset B_\delta$ for $\epsilon < \delta$). On this event there is a sequence of roots of the likelihood equations which is consistent.

Remaining problem: prove, under general conditions, that there is probably only one root near $\theta_o$.

Consider the event that $\ell^\prime$ is monotone on $[\theta_o-\epsilon,\theta_o+\epsilon]$. The previous proof was based on showing that the next derivative, $\ell^{\prime\prime}$, was negative at $\theta_o$ and did not change much over a small enough interval.

The behaviour at $\theta_o$ is essentially the behaviour of the average of the second derivative terms $V_i(\theta_o) = L_i^{\prime\prime}(\theta_o)$,

\begin{displaymath}\frac{1}{n}\sum V_i(\theta_o)
\end{displaymath}

which converges almost surely to

\begin{displaymath}{\rm E}(L_1^{\prime\prime}(\theta_o))
\end{displaymath}

I claim this is negative for regular families.

Begin with

\begin{displaymath}1 = \int f(x,\theta) dx
\end{displaymath}

Differentiating with respect to $\theta$ gives
\begin{align*}0 & =\frac{d}{d\theta} \int f(x,\theta) dx
\\
& = \lim_{\epsilon\to 0} \int \frac{f(x,\theta+\epsilon)-f(x,\theta)}{\epsilon}
dx
\end{align*}

In order to take the limit inside the integral sign we must prove that for any sequence $\epsilon_n\to 0$
\begin{multline*}\lim \int
\frac{f(x,\theta+\epsilon_n)-f(x,\theta)}{\epsilon_n}
dx
\\
= \int \frac{\partial}{\partial\theta}f(x,\theta) dx
\end{multline*}
Normally we apply the dominated convergence theorem. If f is continuously differentiable with respect to $\theta$ then, by the mean value theorem, the difference quotient is exactly

\begin{displaymath}\frac{\partial}{\partial\theta}f(x,\theta^*_n)
\end{displaymath}

where $\theta^*_n$ lies between $\theta$ and $\theta+\epsilon_n$ and depends on both n and x.

One tactic: compute

\begin{displaymath}M(x,\epsilon) =\sup\{\vert\frac{\partial}{\partial\theta}f(x,\theta)\vert;
\vert\theta-\theta_o\vert \le \epsilon\}
\end{displaymath}

and show that

\begin{displaymath}\int M(x,\epsilon) dx < \infty
\end{displaymath}

to apply dominated convergence.
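For instance, in the Cauchy location family above this is easy: since $2\vert u\vert \le 1+u^2$,

\begin{displaymath}\left\vert\frac{\partial}{\partial\theta}f(x,\theta)\right\vert
= \frac{2\vert x-\theta\vert}{\pi\{1+(x-\theta)^2\}^2}
\le \frac{1}{\pi\{1+(x-\theta)^2\}} \, ,
\end{displaymath}

so that $M(x,\epsilon) \le 1/\pi$ for every x while, for $\vert x-\theta_o\vert > \epsilon$, $M(x,\epsilon) \le 1/[\pi\{1+(\vert x-\theta_o\vert-\epsilon)^2\}]$; the resulting bound on $M(x,\epsilon)$ is integrable.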

Assuming the dominated convergence theorem applies, and evaluating at $\theta=\theta_o$:
\begin{align*}0 & = \int \frac{\partial}{\partial\theta}f(x,\theta)\, dx
\\
& = \int \frac{\partial \log f(x,\theta)}{\partial\theta}\, f(x,\theta)\, dx
\\
& = {\rm E}(U_k(\theta_o))
\end{align*}
Differentiating again, and again passing the limit through the integral, gives

\begin{displaymath}{\rm E}(U_k^2(\theta_o)) = -{\rm E}(V_k(\theta_o))
\end{displaymath}
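In detail: differentiating the identity $\int \{\partial \log f(x,\theta)/\partial\theta\}\, f(x,\theta)\, dx = 0$ under the integral sign gives, at $\theta=\theta_o$,

\begin{displaymath}0 = \int \frac{\partial^2 \log f(x,\theta)}{\partial\theta^2}\, f(x,\theta)\, dx
+ \int \left\{\frac{\partial \log f(x,\theta)}{\partial\theta}\right\}^2 f(x,\theta)\, dx
= {\rm E}(V_k(\theta_o)) + {\rm E}(U_k^2(\theta_o)) \, .
\end{displaymath}

Since ${\rm E}(U_k^2(\theta_o)) > 0$ (the score is not degenerate), this forces ${\rm E}(V_k(\theta_o)) < 0$.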

This shows that

\begin{displaymath}\frac{1}{n} \ell^{\prime\prime}(\theta_o)\to {\rm E}(V_k(\theta_o)) < 0
\end{displaymath}

almost surely.

Next we consider

\begin{displaymath}\frac{1}{n}\left\{ \ell^{\prime\prime}(\theta) -
\ell^{\prime\prime}(\theta_o)\right\}
\end{displaymath}

For a three times continuously differentiable $\ell$, the mean value theorem gives a $\theta_n^*$ (which is random but between $\theta_o$ and $\theta$) such that this difference is

\begin{displaymath}\frac{\theta-\theta_o}{n} \sum W_i(\theta_n^*)
\end{displaymath}

where $W_i = L_i^{\prime\prime\prime}$ denotes the third derivative of the $i$th log likelihood term. Define

\begin{displaymath}M_i(\epsilon) = \sup\{\vert W_i(\theta)\vert: \vert\theta-\theta_o\vert\le \epsilon\}
\end{displaymath}

The $M_i$ are iid. If for some $\epsilon > 0$ the $M_i$ are integrable then the SLLN shows

\begin{displaymath}\limsup \left\vert\frac{1}{n} \left\{\ell^{\prime\prime}(\theta\vert X_1,\ldots,X_n)
- \ell^{\prime\prime}(\theta_o\vert X_1,\ldots,X_n)\right\}\right\vert \le \epsilon\, {\rm E}(M_1(\epsilon))
\end{displaymath}

almost surely, uniformly in $\vert\theta-\theta_o\vert \le \epsilon$. The right hand side of the inequality can be made arbitrarily small by choosing $\epsilon$ small enough.

Pick $\epsilon$ so small that this bound is strictly smaller than

\begin{displaymath}I(\theta_o) \equiv - {\rm E}(V_k(\theta_o)) \, .
\end{displaymath}

Then

\begin{displaymath}P( E_\epsilon) = 1
\end{displaymath}

where $E_\epsilon$ is the event that there is an N such that for all $n \ge N$, $\ell^\prime$ is monotone decreasing on $[\theta_o-\epsilon,\theta_o+\epsilon]$. On this event $\ell^\prime$ eventually has at most one root in that interval, which settles the uniqueness question raised above.


Richard Lockhart
2000-10-03