
STAT 801: Mathematical Statistics

Convergence in Distribution

Undergraduate version of the central limit theorem: if $ X_1,\ldots,X_n$ are iid from a population with mean $ \mu$ and standard deviation $ \sigma$ then $ n^{1/2}(\bar{X}-\mu)/\sigma$ has approximately a standard normal distribution.

Similarly, a Binomial$ (n,p)$ random variable has approximately a $ N(np,np(1-p))$ distribution.

What is the precise meaning of statements like ``$ X$ and $ Y$ have approximately the same distribution''?

Desired meaning: $ X$ and $ Y$ have nearly the same cdf.

But care needed.

Q1) If $ n$ is a large number is the $ N(0,1/n)$ distribution close to the distribution of $ X\equiv 0$?

Q2) Is $ N(0,1/n)$ close to the $ N(1/n,1/n)$ distribution?

Q3) Is $ N(0,1/n)$ close to $ N(1/\sqrt{n},1/n)$ distribution?

Q4) If $ X_n\equiv 2^{-n}$ is the distribution of $ X_n$ close to that of $ X\equiv 0$?

The answers depend on how close ``close'' needs to be, so it is a matter of definition.

In practice the usual sort of approximation we want to make is to say that some random variable $ X$ has nearly some continuous distribution, like $ N(0,1)$.

So we want to know that probabilities like $ P(X>x)$ are nearly $ P(N(0,1) > x)$.

The real difficulty arises for discrete random variables or infinite-dimensional settings; these cases are not treated in this course.

Mathematicians' meaning of close:

Either they can provide an upper bound on the distance between the two things or they are talking about taking a limit.

In this course we take limits.

Definition: A sequence of random variables $ X_n$ converges in distribution to a random variable $ X$ if

$\displaystyle E(g(X_n)) \to E(g(X))
$

for every bounded continuous function $ g$.

Theorem 1   The following are equivalent:
  1. $ X_n$ converges in distribution to $ X$.
  2. $ P(X_n \le x) \to P(X \le x)$ for each $ x$ such that $ P(X=x)=0$, that is, at each continuity point of the cdf of $ X$.
  3. The limit of the characteristic functions of $ X_n$ is the characteristic function of $ X$:

    $\displaystyle E(e^{itX_n}) \to E(e^{itX})
$

    for every real $ t$.
These are all implied by

$\displaystyle M_{X_n}(t) \to M_X(t) < \infty
$

for all $ \vert t\vert \le \epsilon$ for some positive $ \epsilon$.
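
For instance, the Binomial approximation above can be checked numerically via criterion 3: the characteristic function of the standardized Binomial should approach $ e^{-t^2/2}$. Here is a minimal Python sketch (the values of $ n$, $ p$ and $ t$ are arbitrary choices for illustration):

    import numpy as np
    from scipy.stats import binom

    def std_binom_cf(t, n, p):
        # E(exp(itZ)) for Z = (X - np)/sqrt(np(1-p)), X ~ Binomial(n, p)
        k = np.arange(n + 1)
        z = (k - n * p) / np.sqrt(n * p * (1 - p))
        return np.sum(binom.pmf(k, n, p) * np.exp(1j * t * z))

    p, ts = 0.3, [0.5, 1.0, 2.0]
    for n in [10, 100, 1000]:
        errs = [abs(std_binom_cf(t, n, p) - np.exp(-t ** 2 / 2)) for t in ts]
        print(n, [round(e, 4) for e in errs])

The printed errors should shrink as $ n$ grows, in line with criterion 3.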

Now let's go back to the questions I asked:

[Figure: convergence1.ps]

[Figure: convergence3.ps]

[Figure: convergence5.ps]
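
The figures are not reproduced here, but the same comparison can be made numerically: the sketch below (with arbitrary choices of $ n$ and of evaluation points) computes the cdfs appearing in Q1--Q4 at a few points and compares them with the cdf of $ X\equiv 0$.

    import numpy as np
    from scipy.stats import norm

    xs = np.array([-0.5, -0.1, 0.0, 0.1, 0.5])
    target = (xs >= 0).astype(float)                    # cdf of X = 0
    for n in [4, 25, 400]:
        s = 1 / np.sqrt(n)                              # sd of N(., 1/n)
        q1 = norm.cdf(xs, loc=0, scale=s)               # Q1: N(0, 1/n)
        q2 = norm.cdf(xs, loc=1 / n, scale=s)           # Q2: N(1/n, 1/n)
        q3 = norm.cdf(xs, loc=1 / np.sqrt(n), scale=s)  # Q3: N(1/sqrt(n), 1/n)
        q4 = (xs >= 2.0 ** (-n)).astype(float)          # Q4: point mass at 2^(-n)
        print(n, q1.round(3), q2.round(3), q3.round(3), q4, target)

The output can be compared with the target cdf at each $ x$ with $ P(X=x)=0$, i.e. each $ x\neq 0$.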


Summary: to derive approximate distributions:

Show sequence of rvs $ X_n$ converges to some $ X$.

The limit distribution (i.e. the distribution of $ X$) should be non-trivial, say $ N(0,1)$.

Don't say: $ X_n$ is approximately $ N(1/n,1/n)$.

Do say: $ n^{1/2}X_n$ converges to $ N(0,1)$ in distribution.

The Central Limit Theorem

If $ X_1, X_2, \cdots$ are iid with mean 0 and variance 1 then $ n^{1/2}\bar{X}$ converges in distribution to $ N(0,1)$. That is,

$\displaystyle P(n^{1/2}\bar{X} \le x ) \to \frac{1}{\sqrt{2\pi}} \int_{-\infty}^x e^{-y^2/2} dy
\, .
$

Proof: As before

$\displaystyle E(e^{itn^{1/2}\bar{X}}) \to e^{-t^2/2}
$

This is the characteristic function of a $ N(0,1)$ random variable so we are done by our theorem.
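
A quick Monte Carlo check of the theorem in Python (the exponential population, the sample size and the number of replications are arbitrary choices for illustration):

    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(801)
    n, reps = 30, 100000
    x = rng.exponential(size=(reps, n)) - 1        # X_i = Exp(1) - 1: mean 0, variance 1
    t = np.sqrt(n) * x.mean(axis=1)                # n^(1/2) * Xbar, one value per replication
    for q in [-1.0, 0.0, 1.0, 2.0]:
        print(q, round((t <= q).mean(), 4), round(norm.cdf(q), 4))

Each line compares the simulated value of $ P(n^{1/2}\bar{X}\le x)$ with $ \Phi(x)$.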

Edgeworth expansions

In fact if $ \gamma=E(X^3)$ then, keeping one more term in the expansion of the characteristic function of $ X$,

$\displaystyle \phi(t) \approx 1 -t^2/2 -i\gamma t^3/6 + \cdots
$

Then

$\displaystyle \log(\phi(t)) =\log(1+u)
$

where

$\displaystyle u=-t^2/2 -i \gamma t^3/6 + \cdots
$

Use $ \log(1+u) = u-u^2/2 + \cdots$ to get

$\displaystyle \log(\phi(t)) \approx [-t^2/2 -i\gamma t^3/6 +\cdots]
- [-t^2/2 -i\gamma t^3/6 +\cdots]^2/2 +\cdots
$

which, since the squared term contributes only terms of degree 4 and higher in $ t$, gives

$\displaystyle \log(\phi(t)) \approx -t^2/2 -i\gamma t^3/6 + \cdots
$

Now apply this calculation to $ T=n^{1/2}\bar{X}$:

$\displaystyle \log(\phi_T(t)) \approx -t^2/2 -i E(T^3) t^3/6 + \cdots
$

Remember $ E(T^3) = \gamma/\sqrt{n}$ and exponentiate to get

$\displaystyle \phi_T(t) \approx e^{-t^2/2} \exp\{-i\gamma t^3/(6\sqrt{n}) + \cdots\}
$

You can do a Taylor expansion of the second exponential around 0, because its argument is small (of order $ n^{-1/2}$), and get

$\displaystyle \phi_T(t) \approx e^{-t^2/2} (1-i\gamma t^3/(6\sqrt{n}))
$

neglecting higher order terms. This approximation to the characteristic function of $ T$ can be inverted to get an Edgeworth approximation to the density (or distribution) of $ T$ which looks like

$\displaystyle f_T(x) \approx \frac{1}{\sqrt{2\pi}} e^{-x^2/2} [1+\gamma
(x^3-3x)/(6\sqrt{n}) + \cdots]
$

Remarks:

  1. The error using the central limit theorem to approximate a density or a probability is proportional to $ n^{-1/2}$

  2. This is improved to $ n^{-1}$ for symmetric densities for which $ \gamma=0$.

  3. These expansions are asymptotic. This means that the series indicated by $ \cdots$ usually does not converge. When $ n=25$, say, it may help to take the second term, but the approximation may get worse if you include the third or fourth term or more.

  4. You can integrate the expansion above for the density to get an approximation for the cdf.
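
The expansion can be checked directly when the exact density is available. For $ X_i$ iid Exp(1)$ {}-1$ (so $ \gamma=2$) the sum of the unshifted variables is Gamma, so the density of $ T=n^{1/2}\bar{X}$ is known exactly. The following Python sketch (with an arbitrary choice of $ n$ and of evaluation points) compares the absolute errors of the plain normal density and the Edgeworth-corrected density; in line with Remark 3 the correction usually, though not uniformly, helps.

    import numpy as np
    from scipy.stats import gamma, norm

    n, g = 10, 2.0                 # g = E(X^3) = 2 for X = Exp(1) - 1
    xs = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
    # T = sqrt(n)*(Ybar - 1) with Y_i ~ Exp(1); sum(Y_i) ~ Gamma(n, 1), so
    # the exact density of T is sqrt(n) * f_Gamma(n + sqrt(n) x)
    exact = np.sqrt(n) * gamma.pdf(n + np.sqrt(n) * xs, a=n)
    clt = norm.pdf(xs)
    edge = norm.pdf(xs) * (1 + g * (xs ** 3 - 3 * xs) / (6 * np.sqrt(n)))
    print(np.abs(clt - exact).round(4))    # error of the normal approximation
    print(np.abs(edge - exact).round(4))   # error of the Edgeworth approximation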

Multivariate convergence in distribution

Definition: $ X_n\in R^p$ converges in distribution to $ X\in R^p$ if

$\displaystyle E(g(X_n)) \to E(g(X))
$

for each bounded continuous real valued function $ g$ on $ R^p$.

This is equivalent to either of

Cramér Wold Device: $ a^tX_n$ converges in distribution to $ a^t X$ for each $ a \in R^p$

or

Convergence of characteristic functions:

$\displaystyle E(e^{ia^tX_n}) \to E(e^{ia^tX})
$

for each $ a \in R^p$.

Extensions of the CLT


  1. If $ Y_1,Y_2,\cdots$ are iid in $ R^p$ with mean $ \mu$ and variance-covariance matrix $ \Sigma$ then $ n^{1/2}(\bar{Y}-\mu) $ converges in distribution to $ MVN(0,\Sigma)$; a simulation sketch follows this list.


  2. Lyapunov CLT: for each $ n$ let $ X_{n1},\ldots,X_{nn}$ be independent rvs with

    $\displaystyle E(X_{ni}) =0, \qquad {\rm Var}\Big(\sum_i X_{ni}\Big) = 1, \qquad \sum_i E(\vert X_{ni}\vert^3) \to 0.
$

    Then $ \sum_i X_{ni}$ converges in distribution to $ N(0,1)$.


  3. Lindeberg CLT: the first two conditions of the Lyapunov CLT hold and

    $\displaystyle \sum E(X_{ni}^2 1(\vert X_{ni}\vert > \epsilon)) \to 0
$

    for each $ \epsilon > 0$. Then $ \sum_i X_{ni}$ converges in distribution to $ N(0,1)$. (Lyapunov's condition implies Lindeberg's.)


  4. Non-independent rvs: $ m$-dependent CLT, martingale CLT, CLT for mixing processes.


  5. Not sums: Slutsky's theorem, $ \delta$ method.
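
Before turning to Slutsky's theorem, here is a small Python sketch of item 1 combined with the Cramér-Wold device: for iid vectors $ Y_i$ the projection $ a^t n^{1/2}(\bar{Y}-\mu)$ should be approximately $ N(0, a^t\Sigma a)$. The bivariate population (independent Exp(1) and Poisson(2) coordinates), the direction $ a$ and the sample sizes are arbitrary choices.

    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(801)
    n, reps = 50, 20000
    mu = np.array([1.0, 2.0])                       # means of Exp(1) and Poisson(2)
    sigma = np.diag([1.0, 2.0])                     # variance-covariance matrix
    y = np.stack([rng.exponential(size=(reps, n)),
                  rng.poisson(2.0, size=(reps, n))], axis=2)
    z = np.sqrt(n) * (y.mean(axis=1) - mu)          # sqrt(n)*(Ybar - mu), one row per replication
    a = np.array([1.0, -1.0])                       # an arbitrary direction for Cramer-Wold
    proj = z @ a
    v = a @ sigma @ a                               # a^t Sigma a = 3
    for q in [-1.0, 0.0, 1.0]:
        print(q, round((proj <= q * np.sqrt(v)).mean(), 4), round(norm.cdf(q), 4))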

Slutsky's Theorem: If $ X_n$ converges in distribution to $ X$ and $ Y_n$ converges in distribution (or in probability) to $ c$, a constant, then $ X_n+Y_n$ converges in distribution to $ X+c$. More generally, if $ f(x,y)$ is continuous then $ f(X_n,Y_n) \Rightarrow f(X,c)$.

Warning: the hypothesis that the limit of $ Y_n$ be constant is essential.
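
A standard application is the studentized mean. If the $ X_i$ are iid with mean 0 and variance $ \sigma^2$ then $ n^{1/2}\bar{X} \Rightarrow N(0,\sigma^2)$ and the sample standard deviation converges in probability to $ \sigma$, so Slutsky's theorem with $ f(x,y)=x/y$ gives $ n^{1/2}\bar{X}/s \Rightarrow N(0,1)$. A Python sketch (uniform population and sample size chosen arbitrarily):

    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(801)
    n, reps = 30, 100000
    x = rng.uniform(-1, 1, size=(reps, n))          # mean 0, variance 1/3
    # X_n := sqrt(n)*Xbar => N(0, 1/3); Y_n := s -> 1/sqrt(3) in probability
    t = np.sqrt(n) * x.mean(axis=1) / x.std(axis=1, ddof=1)
    for q in [-2.0, -1.0, 0.0, 1.0, 2.0]:
        print(q, round((t <= q).mean(), 4), round(norm.cdf(q), 4))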

Definition: We say $ Y_n$ converges to $ Y$ in probability if

$\displaystyle P(\vert Y_n-Y\vert > \epsilon) \to 0
$

for each $ \epsilon > 0$.

In fact, for $ Y$ constant, convergence in distribution and convergence in probability are the same. In general convergence in probability implies convergence in distribution. Both of these are weaker than almost sure convergence:

Definition: We say $ Y_n$ converges to $ Y$ almost surely if

$\displaystyle P(\{\omega\in \Omega: \lim_{n \to \infty} Y_n(\omega) = Y(\omega) \}) = 1
\, .
$

The delta method: Suppose:

  1. There are random variables $ Y_n$, a constant $ y$, and constants $ a_n \to \infty$ such that $ a_n(Y_n-y)$ converges in distribution to a random variable $ X$.
  2. The function $ f$ is differentiable at $ y$.

Then $ a_n(f(Y_n)-f(y))$ converges in distribution to $ f^\prime(y) X$.

If $ X_n\in R^p$ and $ f: R^p\mapsto R^q$ then $ f^\prime$ is $ q\times p$ matrix of first derivatives of components of $ f$.

Example: Suppose $ X_1,\ldots,X_n$ are a sample from a population with mean $ \mu$, variance $ \sigma^2$, and third and fourth central moments $ \mu_3$ and $ \mu_4$. Then

$\displaystyle n^{1/2}(s^2-\sigma^2) \Rightarrow N(0,\mu_4-\sigma^4)
$

where $ \Rightarrow $ is notation for convergence in distribution. For simplicity I define $ s^2 = \overline{X^2} -{\bar{X}}^2$.

Take $ Y_n =(\overline{X^2},\bar{X})$. Then $ Y_n$ converges in probability (by the law of large numbers) to $ y=(\mu^2+\sigma^2,\mu)$. Take $ a_n = n^{1/2}$. Then

$\displaystyle n^{1/2}(Y_n-y)
$

converges in distribution to $ MVN(0,\Sigma)$ with

$\displaystyle \Sigma = \left[\begin{array}{cc} \mu_4-\sigma^4 & \mu_3 -\mu(\mu^2+\sigma^2)\\
\mu_3-\mu(\mu^2+\sigma^2) & \sigma^2 \end{array} \right]
$

Define $ f(x_1,x_2) = x_1-x_2^2$. Then $ s^2 = f(Y_n)$. The gradient of $ f$ has components $ (1,-2x_2)$. This leads to

$\displaystyle n^{1/2}(s^2-\sigma^2) \approx
n^{1/2}[1, -2\mu]
\left[\begin{array}{c} \overline{X^2} - (\mu^2 + \sigma^2)\\
\bar{X} -\mu \end{array}\right]
$

which converges in distribution to $ (1,-2\mu) Y$ where $ Y\sim MVN(0,\Sigma)$. This rv is $ N(0,a^t \Sigma a)=N(0, \mu_4-\sigma^4)$ where $ a=(1,-2\mu)^t$.

Remark: In this sort of problem it is best to learn to recognize that the sample variance is unaffected by subtracting $ \mu$ from each $ X$. Thus there is no loss in assuming $ \mu=0$ which simplifies $ \Sigma$ and $ a$.

Special case: if the observations are $ N(\mu,\sigma^2)$ then $ \mu_3 =0$ and $ \mu_4=3\sigma^4$. Our calculation gives

$\displaystyle n^{1/2} (s^2-\sigma^2) \Rightarrow N(0,2\sigma^4)
$

You can divide through by $ \sigma^2$ and get

$\displaystyle n^{1/2}(\frac{s^2}{\sigma^2}-1) \Rightarrow N(0,2)
$

In fact, with $ s^2$ taken to be the usual sample variance (divisor $ n-1$), $ (n-1)s^2/\sigma^2$ has a $ \chi_{n-1}^2$ distribution and so the usual central limit theorem shows that

$\displaystyle (n-1)^{-1/2} [(n-1)s^2/\sigma^2 - (n-1)] \Rightarrow N(0,2)
$

(using mean of $ \chi^2_1$ is 1 and variance is 2). Factoring out $ n-1$ gives the assertion that

$\displaystyle (n-1)^{1/2}(s^2/\sigma^2-1) \Rightarrow N(0,2)
$

which is our $ \delta$ method calculation except for using $ n-1$ instead of $ n$. This difference is unimportant as can be checked using Slutsky's theorem.
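
The $ \delta$ method answer can also be checked by simulation. The sketch below (normal data, with $ n$, $ \sigma$ and the number of replications arbitrary) compares the empirical variance of $ n^{1/2}(s^2-\sigma^2)$ with $ 2\sigma^4$; agreement is up to Monte Carlo error and the $ n$ versus $ n-1$ effect just discussed.

    import numpy as np

    rng = np.random.default_rng(801)
    n, reps, sigma = 50, 50000, 2.0
    x = rng.normal(0.0, sigma, size=(reps, n))
    s2 = (x ** 2).mean(axis=1) - x.mean(axis=1) ** 2    # s^2 = mean(X^2) - Xbar^2
    stat = np.sqrt(n) * (s2 - sigma ** 2)
    print(round(stat.var(), 2), 2 * sigma ** 4)         # empirical variance vs 2*sigma^4 = 32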

Richard Lockhart
2001-01-26