
STAT 801: Mathematical Statistics

Large Sample Theory

Study approximate behaviour of $ \hat\theta$ by studying the function $ U$.

Notice $ U$ is sum of independent random variables.

Theorem: If $ Y_1,Y_2,\ldots$ are iid with mean $ \mu$ then

$\displaystyle \frac{\sum Y_i}{n} \to \mu
$

This is called the law of large numbers. The strong law asserts that

$\displaystyle P(\lim \frac{\sum Y_i}{n}=\mu)=1
$

while the weak law asserts that

$\displaystyle \lim P(\vert \frac{\sum Y_i}{n}-\mu\vert>\epsilon) = 0
$

For iid $ Y_i$ the stronger conclusion holds; for our heuristics ignore differences between these notions.
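
A minimal numerical sketch of the law of large numbers (a hypothetical illustration assuming numpy is available; the exponential model and constants are arbitrary choices):

    import numpy as np

    rng = np.random.default_rng(0)
    mu = 2.0                                    # true mean
    for n in (10, 1000, 100000):
        y = rng.exponential(scale=mu, size=n)   # iid sample with mean mu
        print(n, y.mean())                      # sample mean settles near mu as n grows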

Now suppose $ \theta_0$ is true value of $ \theta$. Then

$\displaystyle U(\theta)/n \to \mu(\theta)
$

where

$\displaystyle \mu(\theta) = E_{\theta_0}\left[ \frac{\partial\log f}{\partial\theta}(X_i,\theta)\right]
= \int \frac{\partial \log f}{\partial\theta}(x,\theta) f(x,\theta_0)\, dx
$

Example: $ N(\mu,1)$ data:

$\displaystyle U(\mu)/n = \sum(X_i -\mu)/n = \bar{X} -\mu
$

If the true mean is $ \mu_0$ then $ \bar{X} \to \mu_0$ and

$\displaystyle U(\mu)/n \to \mu_0-\mu
$

Consider $ \mu < \mu_0$: derivative of $ \ell(\mu)$ is likely to be positive so that $ \ell$ increases as $ \mu$ increases.

For $ \mu > \mu_0$: derivative is probably negative and so $ \ell$ tends to be decreasing for $ \mu > \mu_0$.

Hence: $ \ell$ is likely to be maximized close to $ \mu_0$.
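
A short simulation sketch of this sign behaviour (assuming numpy; the true mean and sample size are illustrative choices):

    import numpy as np

    rng = np.random.default_rng(1)
    mu0 = 1.0
    x = rng.normal(loc=mu0, scale=1.0, size=500)    # N(mu0, 1) data

    def score_over_n(mu):
        return x.mean() - mu                        # U(mu)/n = xbar - mu

    print(score_over_n(0.5))   # mu < mu0: usually positive, so ell is increasing there
    print(score_over_n(1.5))   # mu > mu0: usually negative, so ell is decreasing there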

Repeat ideas for more general case. Study rv

$\displaystyle \log[f(X_i,\theta)/f(X_i,\theta_0)].$

You know the inequality

$\displaystyle E(X)^2 \le E(X^2)
$

(difference is $ {\rm Var}(X) \ge 0$.)

Generalization: Jensen's inequality: if $ g$ is a convex function ( $ g^{\prime\prime} \ge 0$, roughly) then

$\displaystyle g(E(X)) \le E(g(X))
$

Inequality above has $ g(x)=x^2$. Use $ g(x) = -\log ( x )$: convex because $ g^{\prime\prime}(x) = x^{-2} > 0$. We get

$\displaystyle -\log\left(E_{\theta_0}[f(X_i,\theta)/f(X_i,\theta_0)]\right)
\le E_{\theta_0}[-\log\{f(X_i,\theta)/f(X_i,\theta_0)\}] \, .
$

But

$\displaystyle E_{\theta_0}\left[\frac{f(X_i,\theta)}{f(X_i,\theta_0)}\right]
= \int \frac{f(x,\theta)}{f(x,\theta_0)}f(x,\theta_0)\, dx
= \int f(x,\theta)\, dx = 1
$

We can reassemble the inequality and this calculation to get

$\displaystyle E_{\theta_0}[\log\{f(X_i,\theta)/f(X_i,\theta_0)\}] \le 0
$

Fact: inequality is strict unless the $ \theta$ and $ \theta_0$ densities are actually the same.
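
As a check in the $ N(\mu,1)$ example, where $ \log f(x,\mu) = -(x-\mu)^2/2 - \log\sqrt{2\pi}$, we get

$\displaystyle E_{\mu_0}[\log\{f(X_i,\mu)/f(X_i,\mu_0)\}]
= E_{\mu_0}\left[\frac{(X_i-\mu_0)^2 - (X_i-\mu)^2}{2}\right]
= -\frac{(\mu-\mu_0)^2}{2}
$

which is strictly negative unless $ \mu=\mu_0$.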

Write $ \mu(\theta)$ (re-using the notation) for this expected value, so that $ \mu(\theta) < 0$ whenever the $ \theta$ and $ \theta_0$ densities differ.

Then for each $ \theta$ we find

$\displaystyle \frac{\ell(\theta) - \ell(\theta_0)}{n}
= \frac{\sum \log[f(X_i,\theta)/f(X_i,\theta_0)] }{n}
\to \mu(\theta)
$

This shows that the log likelihood is probably higher at $ \theta_0$ than at any other fixed $ \theta$.

Idea can often be stretched to prove that the mle is consistent.

Definition A sequence $ \hat\theta_n$ of estimators of $ \theta$ is consistent if $ \hat\theta_n$ converges weakly (or strongly) to $ \theta$.

Proto theorem: In regular problems the mle $ \hat\theta$ is consistent.

More precise statements of possible conclusions. Use notation

$\displaystyle N(\epsilon) = \{\theta:\vert\theta-\theta_0\vert \le \epsilon\} \, .
$

Suppose:

$ \hat\theta_n$ is global maximizer of $ \ell$.

$ \hat\theta_{n,\delta}$ maximizes $ \ell$ over $ N(\delta) =\{\vert\theta-\theta_0\vert \le \delta\}$.

$\displaystyle A_\epsilon= \{ \vert\hat\theta_n-\theta_0\vert\le \epsilon\}
$

$\displaystyle B_{\delta,\epsilon} = \{\vert \hat\theta_{n,\delta}-\theta_0\vert \le \epsilon\}
$

$\displaystyle C_{L} = \{\exists ! \theta\in N( L/n^{1/2}): U(\theta)=0,
U^\prime(\theta)<0\}
$

Theorem:
  1. Under conditions I $ P(A_\epsilon)\to 1$ for each $ \epsilon>0$.

  2. Under conditions II there is a $ \delta > 0$ such that for all $ \epsilon>0$ we have $ P(B_{\delta,\epsilon}) \to 1$.

  3. Under conditions III for all $ \delta > 0$ there is an $ L$ so large and an $ n_0$ so large that for all $ n \ge n_0$, $ P(C_{L}) > 1-\delta$.

  4. Under conditions III there is a sequence $ L_n$ tending to $ \infty$ so slowly that $ P(C_{L_n})\to 1$.

Point: conditions get weaker as conclusions get weaker. Many possible conditions in literature. See book by Zacks for some precise conditions.

Asymptotic Normality

Study shape of log likelihood near the true value of $ \theta$.

Assume $ \hat\theta$ is a root of the likelihood equations close to $ \theta_0$.

Taylor expansion (1 dimensional parameter $ \theta$):

$\displaystyle 0 = U(\hat\theta) = U(\theta_0) + U^\prime(\theta_0)(\hat\theta - \theta_0)
+ U^{\prime\prime}(\tilde\theta) (\hat\theta-\theta_0)^2/2
$

for some $ \tilde\theta$ between $ \theta_0$ and $ \hat\theta$.

WARNING: This form of the remainder in Taylor's theorem is not valid for multivariate $ \theta$. Derivatives of $ U$ are sums of $ n$ terms.

So each derivative should be proportional to $ n$ in size.

Second derivative is multiplied by the square of the small number $ \hat\theta-\theta_0$ so should be negligible compared to the first derivative term.

Ignoring the second derivative term we get

$\displaystyle -U^\prime(\theta_0)(\hat\theta-\theta_0) \approx U(\theta_0)
$

Now look at terms $ U$ and $ U^\prime$.

Normal case:

$\displaystyle U(\theta_0) = \sum (X_i-\mu_0)
$

has a normal distribution with mean 0 and variance $ n$ (SD $ \sqrt{n}$).

Derivative is

$\displaystyle U^\prime(\mu) = -n \, .
$

Next derivative $ U^{\prime\prime}$ is 0.

Notice: both $ U$ and $ U^\prime$ are sums of iid random variables.

Let

$\displaystyle U_i = \frac{\partial \log f}{\partial\theta} (X_i,\theta_0)
$

and

$\displaystyle V_i = -\frac{\partial^2 \log f}{\partial\theta^2} (X_i,\theta)
$

In general, $ U(\theta_0)=\sum U_i$ has mean 0 and approximately a normal distribution.

Here is how we check that:

$\displaystyle E_{\theta_0}(U(\theta_0)) = n E_{\theta_0} (U_1)
= n \int \frac{\partial \log f(x,\theta_0)}{\partial\theta} f(x,\theta_0)\, dx
= n \int \frac{\partial f(x,\theta_0)/\partial\theta}{f(x,\theta_0)} f(x,\theta_0)\, dx
= n \int \frac{\partial f}{\partial\theta}(x,\theta_0)\, dx
= \left. n\frac{\partial}{\partial\theta} \int f(x,\theta)\, dx \right\vert_{\theta=\theta_0}
= n\frac{\partial}{\partial\theta} 1
= 0
$

Notice: interchanged order of differentiation and integration at one point.

This step is usually justified by applying the dominated convergence theorem to the definition of the derivative.

Differentiate identity just proved:

$\displaystyle \int\frac{\partial\log f}{\partial\theta}(x,\theta) f(x,\theta) dx =0
$

Take derivative of both sides wrt $ \theta$; pull derivative under integral sign:

$\displaystyle \int \frac{\partial}{\partial\theta} \left[
\frac{\partial\log f}{\partial\theta}(x,\theta) f(x,\theta)
\right] dx =0
$

Do the derivative and get

$\displaystyle -\int\frac{\partial^2\log f}{\partial\theta^2}(x,\theta)\, f(x,\theta)\, dx
= \int \left[ \frac{\partial\log f}{\partial\theta}(x,\theta) \right]^2
f(x,\theta)\, dx
$

Definition: The Fisher Information is

$\displaystyle I(\theta)=-E_{\theta}(U^\prime(\theta))=nE_{\theta}(V_1)
$

We refer to $ {\cal I}(\theta_0) = E_{\theta_0}(V_1)$ as the information in 1 observation.

The idea is that $ I$ is a measure of how curved the log likelihood tends to be at the true value of $ \theta$. Big curvature means precise estimates. Our identity above is

$\displaystyle I(\theta) = Var_\theta(U(\theta))=n{\cal I}(\theta)
$
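
For example, for $ N(\mu,1)$ data $ U_i = X_i-\mu$ and $ V_i = 1$, so $ {\cal I}(\mu)=1$ and $ I(\mu)=n$. For Poisson($ \theta$) data $ U_i = X_i/\theta - 1$ and $ V_i = X_i/\theta^2$, so

$\displaystyle {\cal I}(\theta) = E_\theta(V_1) = {\rm Var}_\theta(U_1) = \frac{1}{\theta}
\qquad\mbox{and}\qquad I(\theta) = \frac{n}{\theta} \, .
$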

Now we return to our Taylor expansion approximation

$\displaystyle -U^\prime(\theta_0)(\hat\theta-\theta_0) \approx U(\theta_0)
$

and study the two appearances of $ U$.

We have shown that $ U=\sum U_i$ is a sum of iid mean 0 random variables. The central limit theorem thus proves that

$\displaystyle n^{-1/2} U(\theta_0) \Rightarrow N(0,\sigma^2)
$

where $ \sigma^2 = {\rm Var}_{\theta_0}(U_i)=E_{\theta_0}(V_i)={\cal I}(\theta_0)$.

Next observe that

$\displaystyle -U^\prime(\theta) = \sum V_i
$

where again

$\displaystyle V_i = -\frac{\partial U_i}{\partial\theta}
$

The law of large numbers can be applied to show

$\displaystyle -U^\prime(\theta_0)/n \to E_{\theta_0}[ V_1] = {\cal I}(\theta_0)
$

Now manipulate our Taylor expansion as follows

$\displaystyle n^{1/2} (\hat\theta - \theta_0) \approx
\left[\frac{\sum V_i}{n}\right]^{-1} \frac{\sum U_i}{\sqrt{n}}
$

Apply Slutsky's Theorem to conclude that the right hand side of this converges in distribution to $ N(0,\sigma^2/{\cal I}(\theta_0)^2)$ which simplifies, because of the identities, to $ N\{0,1/{\cal I}(\theta_0)\}$.
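
A minimal Monte Carlo sketch of this limit (assuming numpy; the Poisson model, for which the mle is $ \bar{X}$ and $ {\cal I}(\theta)=1/\theta$, and the constants are illustrative choices):

    import numpy as np

    rng = np.random.default_rng(2)
    theta0, n, reps = 3.0, 200, 5000
    x = rng.poisson(lam=theta0, size=(reps, n))
    mle = x.mean(axis=1)                        # Poisson mle is the sample mean
    z = np.sqrt(n / theta0) * (mle - theta0)    # sqrt(n I(theta0)) (mle - theta0)
    print(z.mean(), z.std())                    # should be close to 0 and 1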

Summary

In regular families, assuming $ \hat\theta=\hat\theta_n$ is a consistent root of $ U(\theta) = 0$, we have the four limit laws

$\displaystyle \sqrt{I(\theta_0)}(\hat\theta - \theta_0) \Rightarrow N(0,1)
$

$\displaystyle \sqrt{I(\hat\theta)}(\hat\theta - \theta_0) \Rightarrow N(0,1)
$

$\displaystyle \sqrt{V(\theta_0)}(\hat\theta - \theta_0) \Rightarrow N(0,1)
$

$\displaystyle \sqrt{V(\hat\theta)}(\hat\theta - \theta_0) \Rightarrow N(0,1)
$

where $ V(\theta) = -U^\prime(\theta) = \sum V_i$ is the observed information.

Note: If the square roots are replaced by matrix square roots we can let $ \theta$ be vector valued and get $ MVN(0,I)$ as the limit law.

Why all these different forms? We use the limit laws to test hypotheses and compute confidence intervals. Test $ H_0:\theta=\theta_0$ using one of the 4 quantities as the test statistic. Find confidence intervals using these quantities as pivots. E.g.: the second and fourth limits lead to confidence intervals

$\displaystyle \hat\theta \pm z_{\alpha/2}/\sqrt{I(\hat\theta)}
$

and

$\displaystyle \hat\theta \pm z_{\alpha/2}/\sqrt{V(\hat\theta)}
$

respectively. The other two are more complicated. For iid $ N(0,\sigma^2)$ data we have

$\displaystyle V(\sigma) = \frac{3\sum X_i^2}{\sigma^4}-\frac{n}{\sigma^2}
$

and

$\displaystyle I(\sigma) = \frac{2n}{\sigma^2}
$

The first line above then justifies confidence intervals for $ \sigma$ computed by finding all those $ \sigma$ for which

$\displaystyle \left\vert\frac{\sqrt{2n}(\hat\sigma-\sigma)}{\sigma}\right\vert \le z_{\alpha/2}
$
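
Solving this inequality for $ \sigma$ gives (provided $ z_{\alpha/2} < \sqrt{2n}$) the interval

$\displaystyle \frac{\hat\sigma}{1+z_{\alpha/2}/\sqrt{2n}} \le \sigma \le
\frac{\hat\sigma}{1-z_{\alpha/2}/\sqrt{2n}} \, .
$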

Similar interval can be derived from 3rd expression, though this is much more complicated.

Usual summary: mle is consistent and asymptotically normal with an asymptotic variance which is the inverse of the Fisher information.

Problems with maximum likelihood

  1. Models with many parameters lead to poor approximations; MLEs can be far from the right answer. See the homework for the Neyman-Scott example, where the MLE is not consistent.

  2. Multiple roots of the likelihood equations: you must choose the right root. Start with different, consistent, estimator; apply iterative scheme like Newton Raphson to likelihood equations to find MLE. Not many steps of NR generally required if starting point is a reasonable estimate.

Finding (good) preliminary Point Estimates

Method of Moments

Basic strategy: set sample moments equal to population moments and solve for the parameters.

Definition: The $ r^{\mbox{th}}$ sample moment (about the origin) is

$\displaystyle \frac{1}{n}\sum_{i=1}^n X_i^r
$

The $ r^{\mbox{th}}$ population moment is

$\displaystyle {\rm E}(X^r)
$

The central moments are

$\displaystyle \frac{1}{n}\sum_{i=1}^n (X_i-\bar X)^r
$

and

$\displaystyle {\rm E}\left[(X-\mu)^r\right] \, .
$

If we have $ p$ parameters we can estimate the parameters $ \theta_1,\ldots,\theta_p$ by solving the system of $ p$ equations:

$\displaystyle \mu_1 = \bar{X}
$

$\displaystyle \mu_2^\prime = \overline{X^2}
$

and so on to

$\displaystyle \mu_p^\prime = \overline{X^p}
$

You need to remember that the population moments $ \mu_k^\prime$ will be formulas involving the parameters.

Gamma Example

The Gamma( $ \alpha,\beta$) density is

$\displaystyle f(x;\alpha,\beta) =
\frac{1}{\beta\Gamma(\alpha)}\left(\frac{x}{\beta}\right)^{\alpha-1}
\exp\left[-\frac{x}{\beta}\right] 1(x>0)
$

and has

$\displaystyle \mu_1 = \alpha\beta
$

and

$\displaystyle \mu_2^\prime = \alpha(\alpha+1)\beta^2$

This gives the equations

$\displaystyle \alpha\beta = \overline{X}, \qquad
\alpha(\alpha+1)\beta^2 = \overline{X^2}
$

or

$\displaystyle \alpha\beta = \overline{X}, \qquad
\alpha\beta^2 = \overline{X^2} - \overline{X}^2
$

Divide the second equation by the first to find the method of moments estimate of $ \beta$ is

$\displaystyle \tilde\beta = (\overline{X^2} - \overline{X}^2)/\overline{X} \, .
$

Then from the first equation get

$\displaystyle \tilde\alpha = \overline{X}/\tilde\beta=
(\overline{X})^2/(\overline{X^2} - \overline{X}^2) \, .
$
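
A numerical sketch of these estimates (assuming numpy; the true parameter values are illustrative):

    import numpy as np

    rng = np.random.default_rng(3)
    x = rng.gamma(shape=2.5, scale=1.5, size=2000)   # alpha = 2.5, beta = 1.5

    m1, m2 = x.mean(), (x**2).mean()
    beta_mom = (m2 - m1**2) / m1     # tilde beta
    alpha_mom = m1 / beta_mom        # tilde alpha
    print(alpha_mom, beta_mom)       # should land near 2.5 and 1.5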

The method of moments equations are much easier to solve than the likelihood equations which involve the function

$\displaystyle \psi(\alpha) = \frac{d}{d\alpha} \log(\Gamma(\alpha))
$

called the digamma function.

Score function has components

$\displaystyle U_\beta = \frac{\sum X_i}{\beta^2} - n \alpha / \beta
$

and

$\displaystyle U_\alpha = -n\psi(\alpha) +\sum\log(X_i) -n\log(\beta) \, .
$

You can solve for $ \beta$ in terms of $ \alpha$ to leave you trying to find a root of the equation

$\displaystyle -n\psi(\alpha) +\sum\log(X_i) -n \log(\sum X_i/(n\alpha)) = 0
$

To use Newton Raphson on this you begin with the preliminary estimate $ \hat\alpha_1 = \tilde\alpha$ and then compute iteratively

$\displaystyle \hat\alpha_{k+1} = \hat\alpha_k - \frac{\overline{\log(X)} - \psi(\hat\alpha_k)
- \log(\overline{X}/\hat\alpha_k)}{1/\hat\alpha_k-\psi^\prime(\hat\alpha_k)}
$

until the sequence converges. Computation of $ \psi^\prime$, the trigamma function, requires special software. Web sites like netlib and statlib are good sources for this sort of thing.
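
A sketch of the iteration (assuming scipy is available; the function name gamma_mle_alpha is mine, and the starting value would be the method of moments estimate $ \tilde\alpha$):

    import numpy as np
    from scipy.special import digamma, polygamma   # polygamma(1, .) is the trigamma function

    def gamma_mle_alpha(x, alpha0):
        # Newton-Raphson for g(alpha) = mean(log X) - psi(alpha) - log(xbar/alpha) = 0
        logx_bar, xbar = np.log(x).mean(), x.mean()
        alpha = alpha0
        for _ in range(50):
            g = logx_bar - digamma(alpha) - np.log(xbar / alpha)
            gprime = 1.0 / alpha - polygamma(1, alpha)
            step = g / gprime
            alpha -= step
            if abs(step) < 1e-9:
                break
        return alpha, xbar / alpha                 # alpha hat, and beta hat = xbar / alpha hat

Starting from $ \tilde\alpha$ the iteration typically converges in a handful of steps.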

Estimating Equations

Same large sample ideas arise whenever estimates derived by solving some equation.

Example: large sample theory for Generalized Linear Models.

Suppose $ Y_i$ is number of cancer cases in some group of people characterized by values $ x_i$ of some covariates.

Think of $ x_i$ as containing variables like age, or a dummy for sex or average income or $ \ldots$.

Possible parametric regression model: $ Y_i$ has a Poisson distribution with mean $ \mu_i$ where the mean $ \mu_i$ depends somehow on $ x_i$.

Typically assume $ g(\mu_i) = \beta_0+ x_i \beta$; $ g$ is link function.

Often $ g(\mu) = \log(\mu)$ and $ x_i\beta$ is a matrix product: $ x_i$ row vector, $ \beta$ column vector.

``Linear regression model with Poisson errors''.

Special case $ \log(\mu_i) = \beta x_i$ where $ x_i$ is a scalar.

The log likelihood is simply

$\displaystyle \ell(\beta) = \sum(Y_i \log(\mu_i) - \mu_i)
$

ignoring irrelevant factorials. The score function is, since $ \log(\mu_i) = \beta x_i$,

$\displaystyle U(\beta) = \sum (Y_i x_i - x_i \mu_i) = \sum x_i(Y_i-\mu_i)
$

(Notice again that the score has mean 0 when you plug in the true parameter value.) The key observation, however, is that it is not necessary to believe that $ Y_i$ has a Poisson distribution to make solving the equation $ U=0$ sensible. Suppose only that $ \log(E(Y_i)) = x_i\beta$. Then we have assumed that

$\displaystyle E_\beta(U(\beta)) = 0
$

This was the key condition in proving that there was a consistent root of the likelihood equations; here it is, roughly, what is needed to prove that the equation $ U(\beta)=0$ has a consistent root $ \hat\beta$. Ignoring higher order terms in a Taylor expansion will give

$\displaystyle V(\beta)(\hat\beta-\beta) \approx U(\beta)
$

where $ V=-U^\prime$. In the mle case we had identities relating the expectation of $ V$ to the variance of $ U$. In general here we have

$\displaystyle {\rm Var}(U) = \sum x_i^2 {\rm Var}(Y_i) \, .
$

If $ Y_i$ is Poisson with mean $ \mu_i$ (and so $ {\rm Var}(Y_i)=\mu_i$) this is

$\displaystyle {\rm Var}(U) = \sum x_i^2\mu_i \, .
$

Moreover we have

$\displaystyle V_i = x_i^2 \mu_i
$

and so

$\displaystyle V(\beta) = \sum x_i^2 \mu_i \, .
$

The central limit theorem (the Lyapunov kind) will show that $ U(\beta)$ has an approximate normal distribution with variance $ \sigma_U^2 = \sum x_i^2 {\rm Var}(Y_i)
$ and so

$\displaystyle \hat\beta-\beta \approx N(0,\sigma_U^2/(\sum x_i^2\mu_i)^2)
$

If $ {\rm Var}(Y_i)=\mu_i$, as it is for the Poisson case, the asymptotic variance simplifies to $ 1/\sum x_i^2\mu_i$.
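
A sketch putting this together for the scalar model $ \log(\mu_i)=\beta x_i$ (assuming numpy; the data-generating choices are purely illustrative):

    import numpy as np

    rng = np.random.default_rng(4)
    n, beta_true = 200, 0.3
    x = rng.uniform(0.0, 2.0, size=n)
    y = rng.poisson(lam=np.exp(beta_true * x))

    beta = 0.0
    for _ in range(25):                       # Newton-Raphson on U(beta) = 0
        mu = np.exp(beta * x)
        U = np.sum(x * (y - mu))              # score
        V = np.sum(x**2 * mu)                 # V = -U'(beta)
        beta += U / V
    se = 1.0 / np.sqrt(np.sum(x**2 * np.exp(beta * x)))   # 1 / sqrt(sum x_i^2 mu_i)
    print(beta, se)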

Other estimating equations are possible and are also popular. If $ w_i$ is any set of deterministic weights (possibly depending on $ \mu_i$) then we could define

$\displaystyle U(\beta) = \sum w_i (Y_i-\mu_i)
$

and still conclude that $ U=0$ probably has a consistent root which has an asymptotic normal distribution.

Idea widely used:

Example: Generalized Estimating Equations, Zeger and Liang.

Abbreviation: GEE.

Called by econometricians Generalized Method of Moments.

An estimating equation is unbiased if

$\displaystyle E_\theta(U(\theta)) = 0
$

Theorem: Suppose $ \hat\theta$ is a consistent root of the unbiased estimating equation

$\displaystyle U(\theta)=0. $

Let $ V=-U^\prime$. Suppose there is a sequence of constants $ B(\theta)$ such that

$\displaystyle V(\theta)/B(\theta) \to 1
$

and let

$\displaystyle A(\theta) = Var_\theta(U(\theta))$

and

$\displaystyle C(\theta) = B(\theta)A^{-1}(\theta)B(\theta).
$

Then

$\displaystyle \sqrt{C(\theta_0)}(\hat\theta - \theta_0) \Rightarrow N(0,1)
\qquad\mbox{and}\qquad
\sqrt{C(\hat\theta)}(\hat\theta - \theta_0) \Rightarrow N(0,1) \, .
$

Other ways to estimate $ A$, $ B$ and $ C$ lead to same conclusions. There are multivariate extensions using matrix square roots.
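
For instance, for the weighted equation $ U(\beta)=\sum w_i(Y_i-\mu_i)$ above, with $ \log(\mu_i) = x_i\beta$ and weights not depending on $ \beta$, the theorem would use

$\displaystyle B(\beta) = \sum w_i x_i \mu_i, \qquad
A(\beta) = \sum w_i^2 {\rm Var}(Y_i), \qquad
C(\beta) = \frac{\left(\sum w_i x_i \mu_i\right)^2}{\sum w_i^2 {\rm Var}(Y_i)} \, ;
$

taking $ w_i=x_i$ and $ {\rm Var}(Y_i)=\mu_i$ gives $ C(\beta)=\sum x_i^2\mu_i$, recovering the asymptotic variance $ 1/\sum x_i^2\mu_i$ found for Poisson regression above.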

Richard Lockhart
2001-03-05