Postscript version of this file

STAT 801 Lecture 9

Goals of Today's Lecture:

• State Slutsky's Theorem.

• Introduce the method.

• Discuss Monte Carlo techniques

Central Limit Theorems

1.
If are iid in Rp with mean and then converges in distribution to .

2.
If for each n we have a set of independent mean 0 random variables and E(Xni) =0 and and

then converges in distribution, as , to standard normal. This is the Lyapunov central limit theorem.

3.
As in the Lyapunov CLT but replace the third moment condition with

for each then again converges in distribution to N(0,1). This is the Lindeberg central limit theorem. (Lyapunov's condition implies Lindeberg's.)

4.
There are extensions to random variables which are not independent. Examples include the m-dependent central limit theorem, the martingale central limit theorem, the central limit theorem for mixing processes.

5.
Many important random variables are not sums of independent random variables. We handle these with Slutsky's theorem and the method.

Slutsky's Theorem: If Xn converges in distribution to Xand Yn converges in distribution (or in probability) to c, a constant, then Xn+Yn converges in distribution to X+c. More generally, if f(x,y) is continuous then .

Warning: the hypothesis that the limit of Yn be constant is essential.

Definition: We say Yn converges to Y in probability if

for each .

The fact is that for Y constant convergence in distribution and in probability are the same. In general convergence in probability implies convergence in distribution. Both of these are weaker than almost sure convergence:

Definition: We say Yn converges to Y almost surely if

The delta method: Suppose a sequence Yn of random variables converges to some y a constant and that if we define Xn = an(Yn-y) then Xn converges in distribution to some random variable X. Suppose that f is a differentiable function on the range of Yn. Then an(f(Yn)-f(y)) converges in distribution to . If Xn is in Rp and f maps Rp to Rq then is the matrix of first derivatives of components of f.

Example: Suppose are a sample from a population with mean , variance , and third and fourth central moments and . Then

where is notation for convergence in distribution. For simplicity I define .

We take Yn to be the vector with components . Then Yn converges to . Take an = n1/2. Then

n1/2(Yn-y)

converges in distribution to with

Define f(x1,x2) = x1-x22. Then s2 = f(Yn). The gradient of f has components (1,-2x2). This leads to

which converges in distribution to which is where . This boils down to .

Remark: In this sort of problem it is best to learn to recognize that the sample variance is unaffected by subtracting from each X. Thus there is no loss in assuming which simplifies and a.

Special case: if the observations are then and . Our calculation has

You can divide through by and get

In fact has a distribution and so the usual central limit theorem shows that

(using mean of is 1 and variance is 2). Factoring out n-1 gives the assertion that

which is our method calculation except for using n-1 instead of n. This difference is unimportant as can be checked using Slutsky's theorem.

Monte Carlo

The last method of distribution theory that I will review is that of Monte Carlo simulation. Suppose you have some random variables whose joint distribution is specified and a statistic whose distribution you want to know. To compute something like P(T > t) for some specific value of t we appeal to the limiting relative frequency interpretation of probability: P(T>t) is the limit of the proportion of trials in a long sequence of trials in which Toccurs. We use a (pseudo) random number generator to generate a sample and then calculate the statistic getting T1. Then we generate a new sample (independently of our first, say) and calculate T2. We repeat this a large number of times say N and just count up how many of the Tk are larger than t. If there are M such Tkwe estimate that P(T>t) =M/N.

The quantity M has a Binomial( N,p=P(T>t)) distribution. The standard error of M/N is then p(1-p)/N which is estimated by M(N-M)/N3. This permits us to guess the accuracy of our study.

Notice that the standard deviation of M/N is so that to improve the accuracy by a factor of 2 requires 4 times as many samples. This makes Monte Carlo a relatively time consuming method of calculation. There are a number of tricks to make the method more accurate (though they only change the constant of proportionality - the SE is still inversely proportional to the square root of the sample size).

Generating the Sample

Most computer languages have a facility for generating pseudo uniform random numbers, that is, variables U which have (approximately of course) a Uniform[0,1] distribution. Other distributions are generated by transformation:

Exponential: has an exponential distribution:

Random uniforms generated on the computer sometimes have only 6 or 7 digits or so of detail. This can make the tail of your distribution grainy. If U were actually a multiple of 10-6 for instance then the largest possible value of X is . This problem can be ameliorated by the following algorithm:

• Generate U a Uniform[0,1] variable.

• Pick a small like 10-3 say. If take .

• If remember that the conditional distribution of Y-y given Y>y is exponential. You use this by generating a new and computing . Then take . The resulting Y has an exponential distribution. You should check this by computing P(Y > y).

Normal: In general if F is a continuous cdf and U is Uniform[0,1] then Y=F-1(U) has cdf F because

This is almost the technique we used above for the exponential distribution. For the normal distribution ( is a common notation for the standard normal cdf) there is no closed form for F-1. You could use a numerical algorithm to compute F-1 or you could use the following Box Müller trick. Generate U1,U2 two independent Uniform[0,1] variables. Define and . Then you can check using the change of variables formula that Y1 and Y2 are independent N(0,1) variables.

Acceptance Rejection

If you can't easily calculate F-1 but you know f you can try the acceptance rejection method. Find a density g and a constant c such that for each x and G-1 is computable or you otherwise know how to generate observations independently from g. Generate W1. Compute . Generate a uniform[0,1] random variable U1 independent of all the Ws and let Y=W1 if . Otherwise get a new W and a new U and repeat until you find a . You make Y be the last W you generated. This Y has density f.

Markov Monte Carlo

In the last 10 years the following tactic has become popular, particularly for generating multivariate observations. If is an (ergodic) Markov chain with stationary transitions and the stationary initial distribution of W has density f then you can get random variables which have the marginal density f by starting off the Markov chain and letting it run for a long time. The marginal distributions of the Wi converge to f. So you can estimate things like by computing the fraction of the Wi which land in A.

There are now many versions of this technique including Gibbs Sampling and the Metropolis-Hastings algorithm. (The technique was invented in the 1950s by physicists: Metropolis et al. One of the authors of the paper was Edward Teller father of the hydrogen bomb''.)

Importance Sampling

If you want to compute

you can generate observations from a different density g and then compute

Then

Variance reduction

Consider the problem of estimating the distribution of the sample mean for a Cauchy random variable. The Cauchy density is

We generate uniforms and then define . Then we compute . Now to estimate p=P(T > t) we would use

after generating N samples of size n. This estimate is unbiased and has standard error .

We can improve this estimate by remembering that -Xi also has Cauchy distribution. Take Si=-Ti. Remember that Si has the same distribution as Ti. Then we try (for t>0)

which is the average of two estimates like . The variance of is

(4N)-1 Var(1(Ti > t)+1(Si > t)) = (4N)-1 Var( 1(|T| > t))

which is

Notice that the variance has an extra 2 in the denominator and that the numerator is also smaller - particularly for p near 1/2. So this method of variance reduction has resulted in a need for only half the sample size to get the same accuracy.

Regression estimates

Suppose we want to compute

where Z is standard normal. We generate N iid N(0,1) variables and compute . But we know that E(Zi2) = 1 and can see easily that is positively correlated with . So we consider using

Notice that and

The value of c which minimizes this is

and this value can be estimated by regressing the |Zi| on the Zi2!

Statistical Inference NEXT

Richard Lockhart
2000-02-02