
STAT 801: Mathematical Statistics

Monte Carlo

Suppose we are given random variables $ X_1,\ldots,X_n$ whose joint density $ f$ (or distribution) is specified, and a statistic $ T(X_1,\ldots,X_n)$ whose distribution we want to know. To compute something like $ P(T > t)$:

  1. Generate $ X_1,\ldots,X_n$ from the density $ f$.


  2. Compute $ T_1 = T(X_1,\ldots,X_n)$.


  3. Repeat $ N$ times getting $ T_1, \ldots,T_N$.


  4. Estimate $ p=P(T>t)$ as $ \hat p =M/N$, where $ M$ is the number of repetitions in which $ T_i >t$.


  5. Estimate accuracy of $ \hat p$ using $ \sqrt{\hat p ( 1 - \hat p)/N}$.

Notice that the standard error is inversely proportional to $ \sqrt{N}$. There are a number of tricks to make the method more accurate, but they only change the constant of proportionality; the standard error is still inversely proportional to the square root of the number of repetitions.
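To make the recipe concrete, here is a minimal sketch in Python (using NumPy); taking $ f$ to be the standard exponential density and $ T$ to be the sample mean of $ n=10$ observations is just an illustrative choice.

import numpy as np

rng = np.random.default_rng(801)

def monte_carlo_tail_prob(t, n=10, N=10_000):
    """Estimate p = P(T > t) where T is the mean of n standard exponentials."""
    # Steps 1-3: generate N samples of size n from f and compute T each time.
    T = np.array([rng.exponential(size=n).mean() for _ in range(N)])
    # Step 4: estimate p by the fraction M/N of repetitions with T_i > t.
    p_hat = np.mean(T > t)
    # Step 5: estimate the accuracy of p-hat.
    se = np.sqrt(p_hat * (1 - p_hat) / N)
    return p_hat, se

p_hat, se = monte_carlo_tail_prob(t=1.5)
print(p_hat, se)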

Generating the Sample

Transformation

Most computer languages have a facility for generating pseudo-random uniform numbers, that is, variables $ U$ which have (approximately, of course) a Uniform$ [0,1]$ distribution. Other distributions are generated by transformation:

Exponential: $ X=-\log U$ has an exponential distribution:

$\displaystyle P(X > x)$ $\displaystyle = P(-\log(U) > x)$    
  $\displaystyle = P(U \le e^{-x})=e^{-x}$    

Random uniforms generated on the computer often carry only 6 or 7 significant digits, which can make the tail of your distribution grainy. If $ U$ were actually a multiple of $ 10^{-6}$, for instance, then the largest possible value of $ X$ would be $ 6\log(10) \approx 13.8$. This problem can be ameliorated.

General technique: inverse probability integral transform.

If $ X$ is to have cdf $ F$:

Generate $ U\sim Uniform[0,1]$.

Take $ X=F^{-1}(U)$:

$\displaystyle P(X \le x)$ $\displaystyle = P(F^{-1}(U) \le x)$
  $\displaystyle = P(U \le F(x))= F(x)$

Example: $ X$ exponential. Then $ F(x)=1-e^{-x}$ and $ F^{-1}(u) = -\log(1-u)$, so $ X=-\log(1-U)$.

Compare to the previous method: since $ 1-U$ is also Uniform$ [0,1]$, we may use $ U$ in place of $ 1-U$ and recover $ X=-\log U$.
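A minimal sketch of the exponential case in Python (NumPy):

import numpy as np

rng = np.random.default_rng(801)

U = rng.uniform(size=100_000)      # U ~ Uniform[0,1]
X = -np.log(1 - U)                 # X = F^{-1}(U) for F(x) = 1 - exp(-x)
# Since 1-U is also Uniform[0,1], X = -np.log(U) has the same distribution.

print(X.mean(), X.var())           # both should be close to 1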

Normal: $ F=\Phi$ (common notation for standard normal cdf).

No closed form for $ F^{-1}$.

One way: use numerical algorithm to compute $ F^{-1}$.

Alternative: the Box-Muller transformation.

Generate $ U_1,U_2$ two independent Uniform[0,1] variables.

Define

$\displaystyle Y_1 = \sqrt{-2\log(U_1)} \cos(2\pi U_2)$

and

$\displaystyle Y_2 = \sqrt{-2\log(U_1)} \sin(2\pi U_2)\,.$

Exercise: (use change of variables) $ Y_1$ and $ Y_2$ are independent $ N(0,1)$ variables.
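A minimal sketch in Python (NumPy):

import numpy as np

rng = np.random.default_rng(801)

U1 = rng.uniform(size=50_000)
U2 = rng.uniform(size=50_000)
R = np.sqrt(-2 * np.log(U1))
Y1 = R * np.cos(2 * np.pi * U2)
Y2 = R * np.sin(2 * np.pi * U2)

# Y1 and Y2 should each look N(0,1) and be essentially uncorrelated.
print(Y1.mean(), Y1.std(), np.corrcoef(Y1, Y2)[0, 1])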

Acceptance Rejection

Suppose: can't calculate $ F^{-1}$ but know $ f$.

Find density $ g$ and constant $ c$ such that

  1. $ f(x) \le cg(x)$ for each $ x$ and

  2. $ G^{-1}$ is computable or can generate observations $ W_1, W_2, \ldots$ independently from $ g$.

Algorithm:

  1. Generate $ W_1$.

  2. Compute $ p=f(W_1)/(cg(W_1)) \le 1$.

  3. Generate uniform[0,1] random variable $ U_1$ independent of all $ W$s.

  4. Let $ Y=W_1$ if $ U_1 \le p$.

  5. Otherwise get new $ W, U$; repeat until you find $ U_i \le f(W_i)/(cg(W_i))$.

  6. Let $ Y$ be the last $ W$ generated.

This $ Y$ has density $ f$.
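A minimal sketch in Python (NumPy); the Beta(2,2) target $ f(x)=6x(1-x)$ on $ [0,1]$, the Uniform$ [0,1]$ envelope $ g$, and the bound $ c=1.5$ are illustrative choices.

import numpy as np

rng = np.random.default_rng(801)

def f(x):                     # target density: Beta(2,2) on [0,1]
    return 6 * x * (1 - x)

c = 1.5                       # f(x) <= c * g(x) with g the Uniform[0,1] density

def draw_one():
    while True:
        W = rng.uniform()     # candidate from g
        U = rng.uniform()     # independent Uniform[0,1]
        if U <= f(W) / c:     # accept with probability f(W)/(c g(W))
            return W          # the accepted W has density f

Y = np.array([draw_one() for _ in range(10_000)])
print(Y.mean(), Y.var())      # Beta(2,2): mean 0.5, variance 0.05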

Markov Chain Monte Carlo

Recently popular tactic, particularly for generating multivariate observations.

Theorem: Suppose $ W_1, W_2, \ldots$ is an (ergodic) Markov chain with stationary transitions, and suppose the stationary distribution of the chain has density $ f$. Then, starting the chain from any initial distribution,

$\displaystyle \frac{1}{n} \sum_{i=1}^n g(W_i) \to \int g(x) f(x) dx \, .$

Estimate things like $ \int_A f(x) dx$ by computing the fraction of the $ W_i$ which land in $ A$.

Many versions of this technique including Gibbs Sampling and Metropolis-Hastings algorithm.

Technique invented in 1950s: Metropolis et al.

One of the authors was Edward Teller ``father of the hydrogen bomb''.
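As one concrete instance, here is a minimal random-walk Metropolis sketch in Python (NumPy); the standard normal target density and the normal proposal step are illustrative choices.

import numpy as np

rng = np.random.default_rng(801)

def log_f(x):                       # log of the target density, up to a constant
    return -0.5 * x**2              # here f is the N(0,1) density

n = 50_000
W = np.empty(n)
w = 0.0                             # arbitrary starting value
for i in range(n):
    proposal = w + rng.normal(scale=1.0)
    # Accept with probability min(1, f(proposal)/f(w)).
    if np.log(rng.uniform()) <= log_f(proposal) - log_f(w):
        w = proposal
    W[i] = w

# Ergodic averages approximate integrals against f, e.g. E(W^2) = 1.
print(np.mean(W**2))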

Importance Sampling

If you want to compute

$\displaystyle \theta \equiv E(T(X)) = \int T(x) f(x) dx$

you can generate observations $ X_1,\ldots,X_n$ from a different density $ g$ and then compute

$\displaystyle \hat\theta = n^{-1} \sum T(X_i) f(X_i)/g(X_i)$

Then

$\displaystyle E(\hat\theta)$ $\displaystyle = n^{-1} \sum E\left\{T(X_i) f(X_i)/g(X_i)\right\}$    
  $\displaystyle = \int \{T(x) f(x)/g(x)\} g(x) dx$    
  $\displaystyle = \int T(x) f(x) dx$    
  $\displaystyle = \theta$    
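A minimal sketch in Python (NumPy); taking $ T(x)=\vert x\vert$, $ f$ standard normal, and $ g$ standard Cauchy is an illustrative choice (the heavier-tailed $ g$ keeps the weights $ f/g$ bounded).

import numpy as np

rng = np.random.default_rng(801)

def f(x):                                  # target density: standard normal
    return np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)

def g(x):                                  # sampling density: standard Cauchy
    return 1 / (np.pi * (1 + x**2))

def T(x):                                  # statistic whose mean under f we want
    return np.abs(x)

X = rng.standard_cauchy(size=100_000)      # observations drawn from g, not f
theta_hat = np.mean(T(X) * f(X) / g(X))    # importance-weighted average
print(theta_hat)                           # E|Z| = sqrt(2/pi), about 0.798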

Variance reduction

Consider the problem of estimating the distribution of the sample mean for a Cauchy random variable. The Cauchy density is

$\displaystyle f(x) = \frac{1}{\pi(1+x^2)}$

We generate uniforms $ U_1, \ldots, U_n$ and then define $ X_i = \tan(\pi(U_i-1/2))$, the inverse of the Cauchy cdf applied to $ U_i$. Then we compute $ T=\bar{X}$. Now to estimate $ p=P(T>t)$ we would use

$\displaystyle \hat p = \sum_{i=1}^N 1(T_i > t) /N$

after generating $ N$ samples of size $ n$. This estimate is unbiased and has standard error $ \sqrt{p(1-p)/N}$.

We can improve this estimate by remembering that $ -X_i$ also has a standard Cauchy distribution. Take $ S_i=-T_i$; then $ S_i$ has the same distribution as $ T_i$. Then we try (for $ t>0$)

$\displaystyle \tilde p = \left[\sum_{i=1}^N 1(T_i > t) + \sum_{i=1}^N 1(S_i > t) \right]/(2N)$

which is the average of two estimates like $ \hat p$. The variance of $ \tilde p$ is

$\displaystyle (4N)^{-1} Var\bigl(1(T_i > t)+1(S_i > t)\bigr) = (4N)^{-1} Var\bigl(1(\vert T\vert > t)\bigr),$

since for $ t>0$ the events $ \{T_i > t\}$ and $ \{S_i > t\} = \{T_i < -t\}$ are disjoint with union $ \{\vert T_i\vert > t\}$. Because $ P(\vert T \vert > t) = 2p$, this variance is

$\displaystyle \frac{2p(1-2p)}{4N} = \frac{p(1-2p)}{2N}\,.$

Notice that the variance has an extra 2 in the denominator and that the numerator is also smaller, particularly for $ p$ near 1/2. So this method of variance reduction means that fewer than half as many replications are needed for the same accuracy.
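A sketch of both estimates in Python (NumPy); the sample size $ n=5$, the number of replications $ N$, and $ t=1$ are illustrative choices.

import numpy as np

rng = np.random.default_rng(801)

n, N, t = 5, 10_000, 1.0
U = rng.uniform(size=(N, n))
X = np.tan(np.pi * (U - 0.5))     # Cauchy observations via the inverse cdf
T = X.mean(axis=1)                # sample mean of each Cauchy sample
S = -T                            # same distribution as T

p_hat = np.mean(T > t)                           # plain estimate
p_tilde = (np.mean(T > t) + np.mean(S > t)) / 2  # antithetic estimate
print(p_hat, p_tilde)   # the mean of n Cauchys is again Cauchy, so p = 1/4 here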

Regression estimates

Suppose we want to compute

$\displaystyle \theta = E(\vert Z\vert)$

where $ Z$ is standard normal. We generate $ N$ iid $ N(0,1)$ variables $ Z_1,\ldots,Z_N$ and compute $ \hat\theta = \sum\vert Z_i\vert/N$. But we know that $ E(Z_i^2) = 1$ and can see easily that $ \hat\theta$ is positively correlated with $ \sum Z_i^2 / N$. So we consider using

$\displaystyle \tilde\theta = \hat\theta -c\left(\sum Z_i^2/N-1\right)$

Notice that $ E(\tilde\theta) = \theta$ and

$\displaystyle Var(\tilde\theta) = Var(\hat\theta) -2c\, {\rm Cov}\Bigl(\hat\theta, \sum Z_i^2/N\Bigr) +c^2 Var\Bigl(\sum Z_i^2/N\Bigr)$

The value of $ c$ which minimizes this is

$\displaystyle c=\frac{{\rm Cov}\bigl(\hat\theta, \sum Z_i^2/N\bigr)}{Var\bigl(\sum Z_i^2/N\bigr)}$

and this value can be estimated by regressing the $ \vert Z_i\vert$ on the $ Z_i^2$!
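A minimal sketch in Python (NumPy), estimating the slope $ c$ from the same sample:

import numpy as np

rng = np.random.default_rng(801)

N = 100_000
Z = rng.normal(size=N)
A = np.abs(Z)                     # terms of theta-hat
B = Z**2                          # control variate with known mean E(Z^2) = 1

theta_hat = A.mean()
c_hat = np.cov(A, B)[0, 1] / np.var(B, ddof=1)   # regression slope of |Z| on Z^2
theta_tilde = theta_hat - c_hat * (B.mean() - 1)

print(theta_hat, theta_tilde)     # true value is sqrt(2/pi), about 0.7979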

Richard Lockhart
2001-01-29