
STAT 801: Mathematical Statistics

Unbiased Estimation

The problem above illustrates a general phenomenon. An estimator can be good for some values of $ \theta$ and bad for others. When comparing $ \hat\theta$ and $ \tilde\theta$, two estimators of $ \theta$, we will say that $ \hat\theta$ is better than $ \tilde\theta$ if it has uniformly smaller MSE:

$\displaystyle MSE_{\hat\theta}(\theta) \le MSE_{\tilde\theta}(\theta)
$

for all $ \theta$. Normally we also require that the inequality be strict for at least one $ \theta$.

The definition raises the question of the existence of a best estimator, one which is better than every other estimator. There is no such estimator. Suppose $ \hat\theta$ were such a best estimator. Fix a $ \theta^*$ in $ \Theta$ and let $ \tilde\theta\equiv \theta^*$ be the constant estimator. Then the MSE of $ \tilde\theta$ is 0 when $ \theta=\theta^*$. Since $ \hat\theta$ is better than $ \tilde\theta$ we must have

$\displaystyle MSE_{\hat\theta}(\theta^*) = 0
$

so that $ \hat\theta=\theta^*$ with probability equal to 1. This makes $ \hat\theta=\tilde\theta$. If there are actually two different possible values of $ \theta$ this gives a contradiction; so no such $ \hat\theta$ exists.

Principle of Unbiasedness: A good estimate is unbiased, that is,

$\displaystyle E_\theta(\hat\theta) \equiv \theta \, .
$

WARNING: In my view the Principle of Unbiasedness is a load of hogwash.

For an unbiased estimate the MSE is just the variance.

Definition: An estimator $ \hat\phi$ of a parameter $ \phi=\phi(\theta)$ is Uniformly Minimum Variance Unbiased (UMVU) if, whenever $ \tilde\phi$ is an unbiased estimate of $ \phi$, we have

$\displaystyle {\rm Var}_\theta(\hat\phi) \le {\rm Var}_\theta(\tilde\phi)
$

for all $ \theta$. We call $ \hat\phi$ the UMVUE. (`E' is for Estimator.)

The point of introducing $ \phi(\theta)$ is to cover problems such as estimating $ \mu$ when the model has two parameters, say $ \mu$ and $ \sigma$.

Cramér-Rao Inequality

If $ \phi(\theta)=\theta$ we can derive some information from the identity

$\displaystyle E_\theta(T) \equiv\theta
$

When we worked with the score function we derived some information from the identity

$\displaystyle \int f(x,\theta) dx \equiv 1
$

by differentiation and we do the same here. If $ T=T(X)$ is some function of the data $ X$ which is unbiased for $ \theta$ then

$\displaystyle E_\theta(T) = \int T(x) f(x,\theta) dx \equiv \theta
$

Differentiate both sides to get

\begin{align*}
1 &= \frac{d}{d\theta} \int T(x) f(x,\theta)\, dx \\
&= \int T(x) \frac{\partial}{\partial\theta} f(x,\theta)\, dx \\
&= \int T(x) \frac{\partial}{\partial\theta} \log(f(x,\theta))\, f(x,\theta)\, dx \\
&= E_\theta( T(X) U(\theta))
\end{align*}

where $ U$ is the score function. Since the score has mean 0,

$\displaystyle {\rm Cov}_\theta(T(X),U(\theta)) = 1
$

Remember that correlations lie between $-1$ and 1, so

$\displaystyle 1=\vert{\rm Cov}_\theta(T(X),U(\theta))\vert \le \sqrt{{\rm Var}_\theta(T) {\rm Var}_\theta(U(\theta))}
$

Squaring, and using the fact that $ {\rm Var}_\theta(U(\theta)) = I(\theta)$, gives the inequality

$\displaystyle {\rm Var}_\theta(T) \ge \frac{1}{I(\theta)}
$

which is called the Cramér-Rao Lower Bound. The inequality is strict unless the correlation is $\pm 1$, which would require that

$\displaystyle U(\theta) = A(\theta) T(X) + B(\theta)
$

for non-random constants $ A$ and $ B$ (which may depend on $ \theta$). Integrating the score with respect to $ \theta$ would then prove that

$\displaystyle \ell(\theta) = A^*(\theta) T(X) + B^*(\theta) + C(X)
$

for some further constants $ A^*$ and $ B^*$ and finally

$\displaystyle f(x,\theta) = h(x) e^{A^*(\theta)T(x)+B^*(\theta)}
$

for $ h=e^C$.

Summary of Implications: any unbiased estimator $ T$ of $ \theta$ has variance at least $ 1/I(\theta)$, and the bound is attained only when the density has the exponential family form above with $ T(x)$ as the natural statistic.

What can we do to find UMVUEs when the CRLB is a strict inequality?

Example: Suppose $ X$ has a Binomial($ n,p$) distribution. The score function is

$\displaystyle U(p) =
\frac{1}{p(1-p)} X - \frac{n}{1-p}
$

The CRLB can be attained only if $ T$ is an affine function $ aX+b$ of $ X$. If we are trying to estimate $ p$ then the only unbiased choice is $ a=n^{-1}$, $ b=0$; the estimate $ \hat p = X/n$ achieves the CRLB, so it is UMVU.
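As a quick numerical check (not part of the original notes) that $ \hat p = X/n$ attains the bound, here is a minimal Python sketch; the values of $ n$ and $ p$ are arbitrary choices.

# Check that Var_p(X/n) equals the Cramer-Rao lower bound 1/I(p)
# for X ~ Binomial(n, p), where I(p) = n / (p(1-p)).
n, p = 20, 0.3                       # arbitrary illustrative values

var_phat = p * (1 - p) / n           # Var(X/n) = Var(X)/n^2 = np(1-p)/n^2
fisher_info = n / (p * (1 - p))      # I(p) for Binomial(n, p)

print(var_phat, 1 / fisher_info)     # both print 0.0105: the bound is attained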

Different tactic: Suppose $ T(X)$ is any function of $ X$ which is unbiased for $ p$. Then we have

$\displaystyle E_p(T(X)-X/n) \equiv 0$

because $ \hat p = X/n$ is also unbiased. If $ h(k) = T(k)-k/n$ then

$\displaystyle E_p(h(X)) = \sum_{k=0}^n h(k)
\dbinom{n}{k} p^k (1-p)^{n-k}
\equiv 0
$

The left hand side of the $ \equiv$ sign is a polynomial function of $ p$ which is identically 0. Thus if the left hand side is expanded out, the coefficient of each power of $ p$ must be 0. The constant term occurs only in the term $ k=0$ and its coefficient is

$\displaystyle h(0)
\dbinom{n}{0}= h(0)
$

Thus $ h(0) = 0$. Now $ p^1=p$ occurs only in the term $ k=1$ with coefficient $ nh(1)$ so $ h(1)=0$. Since the terms with $ k=0$ or $ 1$ are 0 the quantity $ p^2$ occurs only in the term with $ k=2$ with coefficient

$\displaystyle n(n-1)h(2)/2$

so $ h(2)=0$. We can continue in this way to see that in fact $ h(k)=0$ for each $ k$, and so the only unbiased estimate of $ p$ which is a function of $ X$ is $ X/n$.
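The coefficient argument can be mimicked symbolically. The following minimal Python sketch (not part of the original notes, and assuming the sympy package is available) imposes $ E_p(h(X))\equiv 0$ for $ n=3$ and solves for $ h(0),\ldots,h(3)$.

# Symbolic version of the coefficient argument for n = 3:
# E_p(h(X)) = sum_k h(k) C(n,k) p^k (1-p)^(n-k) is identically 0 in p
# only if every h(k) is 0.
import sympy as sp

n = 3
p = sp.symbols('p')
h = sp.symbols('h0:%d' % (n + 1))        # unknowns h(0), ..., h(n)

expectation = sum(h[k] * sp.binomial(n, k) * p**k * (1 - p)**(n - k)
                  for k in range(n + 1))

# Expand in powers of p and set every coefficient to zero.
coeffs = sp.Poly(sp.expand(expectation), p).all_coeffs()
print(sp.solve(coeffs, h))               # {h0: 0, h1: 0, h2: 0, h3: 0}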

A Binomial random variable is a sum of $ n$ iid Bernoulli$ (p)$ rvs. If $ Y_1,\ldots,Y_n$ are iid Bernoulli($ p$) then $ X=\sum Y_i$ is Binomial($ n,p$). Could we do better than $ \hat p = X/n$ by trying $ T(Y_1,\ldots,Y_n)$ for some other function $ T$?

Try $ n=2$. There are 4 possible values for $ Y_1,Y_2$. If $ h(Y_1,Y_2) = T(Y_1,Y_2) - [Y_1+Y_2]/2$ then

$\displaystyle E_p(h(Y_1,Y_2)) \equiv 0
$

and we have
$\displaystyle E_p( h(Y_1,Y_2))$ $\displaystyle =$ $\displaystyle h(0,0)(1-p)^2$  
    $\displaystyle +
[h(1,0)+h(0,1)]p(1-p)$  
    $\displaystyle + h(1,1) p^2 \, .$  

This can be rewritten in the form

$\displaystyle \sum_{k=0}^n w(k)
\dbinom{n}{k}
p^k(1-p)^{n-k}
$

where

$\displaystyle w(0)$ $\displaystyle =h(0,0)$    
$\displaystyle 2w(1)$ $\displaystyle =h(1,0)+h(0,1)$    
$\displaystyle w(2)$ $\displaystyle = h(1,1)\, .$    

So, as before $ w(0)=w(1)=w(2)=0$. This argument can be used to prove that for any unbiased estimate $ T(Y_1,\ldots,Y_n)$ we have that the average value of $ T(y_1,\ldots,y_n)$ over vectors $ y_1,\ldots,y_n$ which have exactly $ k$ 1s and $ n-k$ 0s is $ k/n$. Now let's look at the variance of $ T$:

\begin{align*}
{\rm Var}(T) &= E_p( [T(Y_1,\ldots,Y_n) - p]^2) \\
&= E_p( [T(Y_1,\ldots,Y_n) -X/n+X/n-p]^2) \\
&= E_p( [T(Y_1,\ldots,Y_n) -X/n]^2) \\
&\quad + 2E_p( [T(Y_1,\ldots,Y_n) -X/n][X/n-p]) \\
&\quad + E_p([X/n-p]^2)
\end{align*}

We claim the cross product term is 0, which will prove that the variance of $ T$ is the variance of $ X/n$ plus a non-negative quantity (which will be positive unless $ T(Y_1,\ldots,Y_n) \equiv X/n$). Compute the cross product term by writing

$\displaystyle E_p( [T(Y_1,\ldots,Y_n)-X/n][X/n-p])
=\sum_{y_1,\ldots,y_n}
[T(y_1,\ldots,y_n)-\sum y_i/n][\sum y_i/n -p]
\times p^{\sum y_i} (1-p)^{n-\sum
y_i}
$

Sum over those $ y_1,\ldots,y_n$ whose sum is an integer $ x$; then sum over $ x$:

\begin{align*}
E_p( [T(Y_1,\ldots,Y_n)-X/n][X/n-p])
&= \sum_{x=0}^n \sum_{\sum y_i=x} [T(y_1,\ldots,y_n)-\sum y_i/n] [\sum y_i/n -p]\, p^{\sum y_i} (1-p)^{n-\sum y_i} \\
&= \sum_{x=0}^n \left[ \sum_{\sum y_i=x} [T(y_1,\ldots,y_n)-x/n]\right][x/n -p] \times p^{x} (1-p)^{n-x}
\end{align*}

We have already shown that the sum in $ []$ is 0!

This long, algebraically involved, method of proving that $ \hat p = X/n$ is the UMVUE of $ p$ is one special case of a general tactic.

To get more insight rewrite

$\displaystyle E_p\{T(Y_1,\ldots,Y_n)\}$ $\displaystyle = \sum_{x=0}^n \sum_{\sum y_i = x} T(y_1,\ldots,y_n) \times P(Y_1=y_1,\ldots,Y_n=y_n)$    
  $\displaystyle = \sum_{x=0}^n \sum_{\sum y_i = x} T(y_1,\ldots,y_n) \times P(Y_1=y_1,\ldots,Y_n=y_n\vert X=x) P(X=x)$    
  $\displaystyle = \sum_{x=0}^n \frac{\sum_{\sum y_i = x} T(y_1,\ldots,y_n)}{ \dbinom{n}{x}} \dbinom{n}{x} p^x(1-p)^{n-x}$    

The large fraction in this formula is the average value of $ T$ over values of $ y$ when $ \sum y_i$ is held fixed at $ x$, and the weights in this average do not depend on $ p$. This average is actually

$\displaystyle E\{T(Y_1,\ldots,Y_n)\vert X=x\}
= \sum_{y_1,\ldots,y_n} T(y_1,\ldots,y_n)
\times
P(Y_1=y_1,\ldots,Y_n=y_n\vert X=x)
$

Notice that the conditional probabilities do not depend on $ p$. In a sequence of Bernoulli trials, if I tell you that 5 of 17 were heads and the rest tails, then the set of trial numbers of the 5 heads is equally likely to be any of the $ \dbinom{17}{5}$ possible subsets of $ \{1,\ldots,17\}$; this chance does not depend on $ p$.
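A small simulation can illustrate this point. In the minimal Python sketch below (not part of the original notes, assuming numpy is available; the values of $ n$, $ x$, the two $ p$'s and the number of replications are arbitrary), the conditional chance that trial 1 was a head, given $ X=x$, comes out as $ x/n$ whichever $ p$ generated the data.

# Given X = sum(Y_i) = x, the conditional chance that trial 1 was a success
# should be x/n no matter which p generated the data.
import numpy as np

rng = np.random.default_rng(0)
n, x, reps = 17, 5, 200_000

for p in (0.2, 0.5):                  # two quite different values of p
    Y = rng.random((reps, n)) < p     # reps samples of n Bernoulli(p) trials
    keep = Y.sum(axis=1) == x         # condition on X = x
    print(p, Y[keep, 0].mean())       # both close to x/n = 5/17, about 0.294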

Notice: with data $ Y_1,\ldots,Y_n$ log likelihood is

$\displaystyle \ell(p) = \sum Y_i \log(p) + (n-\sum Y_i) \log(1-p)
$

and

$\displaystyle U(p) =
\frac{1}{p(1-p)} X - \frac{n}{1-p}
$

as before. Again the CRLB can be attained only by affine functions of $ X$; since the only such function that is unbiased for $ p$ is $ \hat p = X/n$, the UMVUE of $ p$ is $ \hat p$.

Sufficiency

In the binomial situation the conditional distribution of the data $ Y_1,\ldots,Y_n$ given $ X$ is the same for all values of $ \theta$; we say this conditional distribution is free of $ \theta$.

Defn: Statistic $ T(X)$ is sufficient for the model $ \{ P_\theta;\theta \in \Theta\}$ if conditional distribution of data $ X$ given $ T=t$ is free of $ \theta$.

Intuition: Data tell us about $ \theta$ if different values of $ \theta$ give different distributions to $ X$. If two different values of $ \theta$ correspond to same density or cdf for $ X$ we cannot distinguish these two values of $ \theta$ by examining $ X$. Extension of this notion: if two values of $ \theta$ give same conditional distribution of $ X$ given $ T$ then observing $ T$ in addition to $ X$ doesn't improve our ability to distinguish the two values.

Mathematically precise version of this intuition: Suppose $ T(X)$ is a sufficient statistic and $ S(X)$ is any estimate or confidence interval or ... If you only know the value of $ T$ then you can:

  1. Generate a new data set $ X^*$ from the conditional distribution of $ X$ given $ T$; this is possible because that conditional distribution does not involve $ \theta$.

  2. Compute $ S(X^*)$. Since $ X^*$ has the same distribution as $ X$ for every $ \theta$, the procedure $ S(X^*)$ performs exactly as well as $ S(X)$.

You can carry out the first step only if the statistic $ T$ is sufficient; otherwise you need to know the true value of $ \theta$ to generate $ X^*$.

Example 1: $ Y_1,\ldots,Y_n$ iid Bernoulli($ p$). Given $ \sum Y_i = y$ the indexes of the $ y$ successes have the same chance of being any one of the $ \dbinom{n}{y}$ possible subsets of $ \{1,\ldots,n\}$. Chance does not depend on $ p$ so $ T(Y_1,\ldots,Y_n)
= \sum Y_i$ is sufficient statistic.

Example 2: $ X_1,\ldots,X_n$ iid $ N(\mu,1)$. Joint distribution of $ X_1,\ldots,X_n,\overline{X}$ is MVN. All entries of mean vector are $ \mu$. Variance covariance matrix partitioned as

$\displaystyle \left[\begin{array}{cc} I_{n \times n} & {\bf 1}_n /n
\\
{\bf 1}_n^t /n & 1/n \end{array}\right]
$

where $ {\bf 1}_n$ is column vector of $ n$ 1s and $ I_{n \times n}$ is $ n \times n$ identity matrix.

Compute conditional means and variances of $ X_i$ given $ \overline{X}$; use fact that conditional law is MVN. Conclude conditional law of data given $ \overline{X} = x$ is MVN. Mean vector has all entries $ x$. Variance-covariance matrix is $ I_{n\times n} - {\bf 1}_n{\bf 1}_n^t /n $. No dependence on $ \mu$ so $ \overline{X}$ is sufficient.
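A quick numerical check of the stated conditional covariance matrix, using the standard partitioned-MVN formula $ \Sigma_{11}-\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}$: a minimal Python sketch (not part of the original notes, assuming numpy; the value of $ n$ is an arbitrary choice).

# Conditional covariance of (X_1,...,X_n) given Xbar in the N(mu,1) model:
# Sigma_11 - Sigma_12 Sigma_22^{-1} Sigma_21 should equal I - 1 1^t / n,
# with no dependence on mu.
import numpy as np

n = 5
ones = np.ones((n, 1))
Sigma11 = np.eye(n)              # Var of (X_1, ..., X_n)
Sigma12 = ones / n               # Cov(X_i, Xbar) = 1/n
Sigma22 = np.array([[1.0 / n]])  # Var(Xbar) = 1/n

cond_cov = Sigma11 - Sigma12 @ np.linalg.inv(Sigma22) @ Sigma12.T
print(np.allclose(cond_cov, np.eye(n) - ones @ ones.T / n))   # True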

WARNING: Whether or not statistic is sufficient depends on density function and on $ \Theta$.

Theorem: [Rao-Blackwell] Suppose $ S(X)$ is a sufficient statistic for model $ \{P_\theta,\theta\in\Theta\}$. If $ T$ is an estimate of $ \phi(\theta)$ then:

  1. $ E(T\vert S)$ is a statistic.

  2. $ E(T\vert S)$ has the same bias as $ T$; if $ T$ is unbiased so is $ E(T\vert S)$.

  3. $ {\rm Var}_\theta(E(T\vert S)) \le {\rm Var}_\theta(T)$ and the inequality is strict unless $ T$ is a function of $ S$.

  4. MSE of $ E(T\vert S)$ is no more than MSE of $ T$.

Proof: We first review conditional distributions. The abstract definition of conditional expectation is:

Defn: $ E(Y\vert X)$ is any function of $ X$ such that

$\displaystyle E\left[R(X)E(Y\vert X)\right] = E\left[R(X) Y\right]
$

for any function $ R(X)$. $ E(Y\vert X=x)$ is a function $ g(x)$ such that

$\displaystyle g(X) = E(Y\vert X)
$

Fact: If $ X,Y$ has joint density $ f_{X,Y}(x,y)$ and conditional density $ f(y\vert x)$ then

$\displaystyle g(x) = \int y f(y\vert x) dy
$

satisfies these definitions.

Proof:

$\displaystyle E(R(X)g(X))$ $\displaystyle = \int R(x) g(x)f_X(x) dx$    
  $\displaystyle = \int\int R(x) y f_X(x) f(y\vert x) dy dx$    
  $\displaystyle = \int\int R(x)y f_{X,Y}(x,y) dy dx$    
  $\displaystyle = E(R(X)Y)$    

Think of $ E(Y\vert X)$ as average $ Y$ holding $ X$ fixed. Behaves like ordinary expected value but functions of $ X$ only are like constants:

$\displaystyle E(\sum A_i(X) Y_i \vert X) = \sum A_i(X)E(Y_i\vert X)
$

Example: $ Y_1,\ldots,Y_n$ iid Bernoulli($ p$); then $ X=\sum Y_i$ is Binomial($ n,p$). Summary of conclusions: for any unbiased estimate $ T$ of $ p$, the conditional expectation $ E(T\vert X)$ does not depend on $ p$, is again unbiased, and has variance no larger than that of $ T$; since $ X/n$ is the only unbiased function of $ X$, it is the UMVUE.

This proof that $ \hat p = X/n$ is the UMVUE of $ p$ is a special case of a general tactic.

Proof of the Rao Blackwell Theorem

Step 1: The definition of sufficiency is that the conditional distribution of $ X$ given $ S$ does not depend on $ \theta$. This means that $ E(T(X)\vert S)$ does not depend on $ \theta$, so it can be computed from the data alone; that is, it is a statistic.

Step 2: This step hinges on the following identity (called Adam's law by Jerzy Neyman - he used to say it comes before all the others)

$\displaystyle E[E(Y\vert X)] =E(Y)
$

which is just the definition of $ E(Y\vert X)$ with $ R(X) \equiv 1$.

From this we deduce that

$\displaystyle E_\theta[E(T\vert S)] = E_\theta(T)
$

so that $ E(T\vert S)$ and $ T$ have the same bias. If $ T$ is unbiased then

$\displaystyle E_\theta[E(T\vert S)] = E_\theta(T) = \phi(\theta)
$

so that $ E(T\vert S)$ is unbiased for $ \phi$.

Step 3: This relies on the following very useful decomposition. (In regression courses we say that the total sum of squares is the sum of the regression sum of squares plus the residual sum of squares.)

$\displaystyle {\rm Var}(Y) = {\rm Var}(E(Y\vert X)) + E[{\rm Var}(Y\vert X)]
$

The conditional variance means

$\displaystyle {\rm Var}(Y\vert X) = E[ (Y-E(Y\vert X))^2\vert X]
$

Write out the two terms on the right hand side:
$\displaystyle {\rm Var}(E(Y\vert X))$ $\displaystyle =$ $\displaystyle E[(E(Y\vert X)-E[E(Y\vert X)])^2]$  
  $\displaystyle =$ $\displaystyle E[(E(Y\vert X)-E(Y))^2]$  

and

$\displaystyle E[{\rm Var}(Y\vert X)] = E[(Y-E(Y\vert X))^2]
$

Adding these together and expanding the squares gives

$\displaystyle E\left[Y^2 -2YE[Y\vert X]+2(E[Y\vert X])^2 -2E(Y)E[Y\vert X] + E^2(Y)\right]
$

Simplify, remembering that $ E(Y\vert X)$ is a function of $ X$ and so acts as a constant when conditioning on $ X$. So

$\displaystyle E[Y\vert X]E[Y\vert X] = E[YE(Y\vert X)\vert X]\;
$

taking expectations gives
$\displaystyle E[(E[Y\vert X])^2]$ $\displaystyle =$ $\displaystyle E[E[YE(Y\vert X)\vert X]]$  
  $\displaystyle =$ $\displaystyle E[YE(Y\vert X)]$  

So 3rd term above cancels with 2nd term.

Fourth term simplifies

$\displaystyle E[E(Y)E[Y\vert X]] = E(Y) E[E[Y\vert X]] =E^2(Y)
$

so that

$\displaystyle {\rm Var}(E(Y\vert X)) + E[{\rm Var}(Y\vert X)] = E[Y^2] - E^2(Y) = {\rm Var}(Y)
$

Apply to Rao Blackwell theorem to get

$\displaystyle {\rm Var}_\theta(T) = {\rm Var}_\theta(E(T\vert S)) +
E[(T-E(T\vert S))^2]
$

Second term $ \ge 0$ so variance of $ E(T\vert S)$ is no more than that of $ T$; will be strictly less unless $ T=E(T\vert S)$. This would mean that $ T$ is already a function of $ S$. Adding the squares of the biases of $ T$ (or of $ E(T\vert S)$) gives the inequality for MSE.
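As a sanity check on the decomposition, here is a minimal Python sketch (not part of the original notes) that verifies it by direct enumeration in the Bernoulli example, taking $ Y=Y_1$ and $ X=Y_1+Y_2$ with an arbitrary value of $ p$.

# Verify Var(Y) = Var(E(Y|X)) + E[Var(Y|X)] by enumeration,
# with Y = Y1 and X = Y1 + Y2 for Y1, Y2 iid Bernoulli(p).
from itertools import product

p = 0.3                                        # arbitrary illustrative value

# joint distribution of (y1, y2)
points = [((y1, y2), p**(y1 + y2) * (1 - p)**(2 - y1 - y2))
          for y1, y2 in product((0, 1), repeat=2)]

def cond_moments(x):
    """P(X = x), and the mean and variance of Y1 given Y1 + Y2 = x."""
    cell = [(y[0], w) for y, w in points if sum(y) == x]
    total = sum(w for _, w in cell)
    mean = sum(y1 * w for y1, w in cell) / total
    var = sum((y1 - mean)**2 * w for y1, w in cell) / total
    return total, mean, var

pieces = [cond_moments(x) for x in (0, 1, 2)]
var_cond_mean = sum(t * (m - p)**2 for t, m, v in pieces)   # Var(E(Y|X)); E(Y) = p
mean_cond_var = sum(t * v for t, m, v in pieces)            # E[Var(Y|X)]
print(p * (1 - p), var_cond_mean + mean_cond_var)           # both approximately 0.21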

Examples:

In the binomial problem $ Y_1(1-Y_2)$ is an unbiased estimate of $ p(1-p)$. We improve this by computing

$\displaystyle E(Y_1(1-Y_2)\vert X)
$

We do this in two steps. First compute

$\displaystyle E(Y_1(1-Y_2)\vert X=x)
$

Notice that the random variable $ Y_1(1-Y_2)$ is either 1 or 0 so its expected value is just the probability it is equal to 1:
\begin{align*}
E(Y_1(1-Y_2)\vert X=x)
&= P(Y_1(1-Y_2) =1 \vert X=x) \\
&= P(Y_1 =1, Y_2 = 0\vert Y_1+Y_2+\cdots+Y_n=x) \\
&= \frac{P(Y_1 =1, Y_2 = 0,Y_1+\cdots+Y_n=x)}{P(Y_1+Y_2+\cdots+Y_n=x)} \\
&= \frac{P(Y_1 =1, Y_2 = 0,Y_3 +\cdots + Y_n = x-1)}{\dbinom{n}{x} p^x (1-p)^{n-x}} \\
&= \frac{p(1-p) \dbinom{n-2}{x-1} p^{x-1} (1-p)^{(n-2)-(x-1)}}{\dbinom{n}{x} p^x (1-p)^{n-x}} \\
&= \frac{\dbinom{n-2}{x-1}}{\dbinom{n}{x}} \\
&= \frac{x(n-x)}{n(n-1)}
\end{align*}

This is simply $ n\hat p(1-\hat p)/(n-1)$ (can be bigger than $ 1/4$, the maximum value of $ p(1-p)$).
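A quick exact check (not part of the original notes) that the improved estimator really is unbiased for $ p(1-p)$; in this minimal Python sketch the values of $ n$ and $ p$ are arbitrary.

# Exact check that E_p[ X(n-X) / (n(n-1)) ] = p(1-p) for X ~ Binomial(n, p).
from math import comb

n, p = 10, 0.3                     # arbitrary illustrative values

pmf = [comb(n, x) * p**x * (1 - p)**(n - x) for x in range(n + 1)]
estimate_mean = sum(x * (n - x) / (n * (n - 1)) * pmf[x] for x in range(n + 1))
print(estimate_mean, p * (1 - p))  # both approximately 0.21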

Example: If $ X_1,\ldots,X_n$ are iid $ N(\mu,1)$ then $ \bar{X}$ is sufficient and $ X_1$ is an unbiased estimate of $ \mu$. Now

$\displaystyle E(X_1\vert\bar{X})$ $\displaystyle = E[X_1-\bar{X}+\bar{X}\vert\bar{X}]$    
  $\displaystyle = E[X_1-\bar{X}\vert\bar{X}] + \bar{X}$    
  $\displaystyle = \bar{X}$    

Here $ E[X_1-\bar{X}\vert\bar{X}]=0$: by symmetry it is the same for every $ i$, and the $ n$ terms sum to $ E[\sum(X_i-\bar{X})\vert\bar{X}]=0$. So $ E(X_1\vert\bar{X})=\bar{X}$, which is the UMVUE.

Finding Sufficient Statistics

Binomial$ (n,p)$: the log likelihood $ \ell(p)$ (the part depending on $ p$) is a function of $ X$ alone, not of $ Y_1,\ldots,Y_n$ as well.

Normal example: $ \ell(\mu)$ is, ignoring terms not containing $ \mu$,

$\displaystyle \ell(\mu) = \mu \sum X_i - n\mu^2/2 = n\mu\bar{X} -n\mu^2/2 \, .
$

The Factorization Criterion:

Theorem: If the model for data $ X$ has density $ f(x,\theta)$ then the statistic $ S(X)$ is sufficient if and only if the density can be factored as

$\displaystyle f(x,\theta) = g(S(x),\theta)h(x)
$

Proof: Find statistic $ T(X)$ such that $ X$ is a one to one function of the pair $ S,T$. Apply change of variables to the joint density of $ S$ and $ T$. If the density factors then

$\displaystyle f_{S,T}(s,t) =g(s,\theta) h(x(s,t)) J(s,t)
$

where $ J$ is the jacobian, so conditional density of $ T$ given $ S=s$ does not depend on $ \theta$. Thus the conditional distribution of $ (S,T)$ given $ S$ does not depend on $ \theta$ and finally the conditional distribution of $ X$ given $ S$ does not depend on $ \theta$.

Conversely, if $ S$ is sufficient then the conditional density $ f_{T\vert S}$ has no $ \theta$ in it, so the joint density of $ S,T$ is

$\displaystyle f_S(s,\theta) f_{T\vert S} (t\vert s)
$

Apply change of variables formula to get

$\displaystyle f_X(x) = f_S(S(x),\theta) f_{T\vert S} (t(x)\vert S(x)) J(x)
$

where $ J$ is the Jacobian. This factors.

Example: If $ X_1,\ldots,X_n$ are iid $ N(\mu,\sigma^2)$ then the joint density is

\begin{multline*}
(2\pi)^{-n/2} \sigma^{-n} \times \\ \exp\{-\sum X_i^2/(2\sigma^2) +\mu\sum X_i/\sigma^2
-n\mu^2/(2\sigma^2)\}
\end{multline*}

which is evidently a function of

$\displaystyle \sum X_i^2, \sum X_i
$

This pair is a sufficient statistic. You can write this pair as a bijective function of $ \bar{X}, \sum (X_i-\bar{X})^2$ so that this pair is also sufficient.
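The bijection rests on the identity $ \sum(X_i-\bar{X})^2 = \sum X_i^2 - n\bar{X}^2$; here is a minimal numerical sketch (not part of the original notes, assuming numpy; the sample is an arbitrary made-up one).

# The pair (sum X_i^2, sum X_i) determines (Xbar, sum (X_i - Xbar)^2) and
# vice versa, via sum (X_i - Xbar)^2 = sum X_i^2 - n * Xbar^2.
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(loc=2.0, scale=3.0, size=8)     # arbitrary N(mu, sigma^2) sample

sum_sq, sum_x = (x**2).sum(), x.sum()
xbar = sum_x / len(x)
ss_about_mean = sum_sq - len(x) * xbar**2

print(np.isclose(ss_about_mean, ((x - xbar)**2).sum()))   # True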

Example: If $ Y_1,\ldots,Y_n$ are iid Bernoulli$ (p)$ then

$\displaystyle f(y_1,\ldots,y_n;p)$ $\displaystyle = \prod p^{y_i}(1-p)^{1-y_i}$
  $\displaystyle = p^{\sum y_i} (1-p)^{n-\sum y_i}$    

Define $ g(x,p) = p^x(1-p)^{n-x}$ and $ h\equiv 1$ to see that $ X=\sum Y_i$ is sufficient by the factorization criterion.

Minimal Sufficiency

In any model $ S(X)\equiv X$ is sufficient. (Apply the factorization criterion.) In any iid model the vector $ X_{(1)}, \ldots, X_{(n)}$ of order statistics is sufficient. (Apply the factorization criterion.) In $ N(\mu,1)$ model we have 3 sufficient statistics:

  1. $ S_1 = (X_1,\ldots,X_n)$.

  2. $ S_2 = (X_{(1)}, \ldots, X_{(n)})$.

  3. $ S_3 = \bar{X}$.

Notice that I can calculate $ S_3$ from the values of $ S_1$ or $ S_2$ but not vice versa and that I can calculate $ S_2$ from $ S_1$ but not vice versa. It turns out that $ \bar{X}$ is a minimal sufficient statistic meaning that it is a function of any other sufficient statistic. (You can't collapse the data set any more without losing information about $ \mu$.)

Recognize minimal sufficient statistics from $ \ell$:

Fact: If you fix some particular $ \theta^*$ then the log likelihood ratio function

$\displaystyle \ell(\theta)-\ell(\theta^*)
$

is minimal sufficient. WARNING: the statistic is the whole function of $ \theta$, not its value at any single $ \theta$.

Subtraction of $ \ell(\theta^*)$ gets rid of irrelevant constants in $ \ell$. In $ N(\mu,1)$ example:

$\displaystyle \ell(\mu) = -n\log(2\pi)/2 - \sum X_i^2/2 + \mu\sum X_i -n\mu^2/2
$

depends on $ \sum X_i^2$, not needed for sufficient statistic. Take $ \mu^*=0$ and get

$\displaystyle \ell(\mu) -\ell(\mu^*) = \mu\sum X_i -n\mu^2/2
$

This function of $ \mu$ is minimal sufficient. Notice that from $ \sum X_i$ you can compute this minimal sufficient statistic and vice versa; thus $ \sum X_i$ is also minimal sufficient.
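To illustrate (not part of the original notes): two samples with the same value of $ \sum X_i$ give exactly the same function $ \mu\sum X_i - n\mu^2/2$ even though the raw data differ. A minimal Python sketch, assuming numpy, with arbitrary made-up samples:

# Two different N(mu,1) samples with the same sum give the same
# log likelihood ratio mu * sum(x) - n * mu^2 / 2 for every mu.
import numpy as np

x1 = np.array([0.0, 1.0, 2.0, 3.0])       # sum = 6
x2 = np.array([1.5, 1.5, 1.5, 1.5])       # same sum = 6, different data

def loglik_ratio(mu, x):
    """ell(mu) - ell(0) for the N(mu, 1) model."""
    return mu * x.sum() - len(x) * mu**2 / 2

mus = np.linspace(-2, 2, 9)
print(np.allclose(loglik_ratio(mus, x1), loglik_ratio(mus, x2)))   # True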

Completeness

In the Binomial$ (n,p)$ example only one function of $ X$ is unbiased for $ p$. Rao-Blackwell shows that the UMVUE, if it exists, will be a function of any sufficient statistic. Can there be more than one unbiased function of the sufficient statistic? In general yes, but not for some models, like the binomial.

Definition: A statistic $ T$ is complete for a model $ \{P_\theta;\theta\in\Theta\}$ if

$\displaystyle E_\theta(h(T)) = 0
$

for all $ \theta$ implies $ h(T)=0$.

We have already seen that $ X$ is complete in the Binomial$ (n,p)$ model. In the $ N(\mu,1)$ model suppose

$\displaystyle E_\mu(h(\bar{X})) \equiv 0
$

Since $ \bar{X}$ has a $ N(\mu,1/n)$ distribution we find that

$\displaystyle E(h(\bar{X})) = \frac{\sqrt{n}e^{-n\mu^2/2}}{\sqrt{2\pi}}
\int_{-\infty}^\infty h(x) e^{-nx^2/2} e^{n\mu x} dx
$

so that

$\displaystyle \int_{-\infty}^\infty h(x) e^{-nx^2/2} e^{n\mu x} dx \equiv 0
$

This is the so-called (two-sided) Laplace transform of the function $ h(x)e^{-nx^2/2}$, evaluated at $ n\mu$. It is a theorem that a Laplace transform is identically 0 only if the function is 0 (because you can invert the transform). Hence $ h\equiv 0$.

How to Prove Completeness

There is only one general tactic. Suppose $ X$ has density

$\displaystyle f(x,\theta) = h(x) \exp\{\sum_1^p a_i(\theta)S_i(x)+c(\theta)\}
$

If the range of the function $ (a_1(\theta),\ldots,a_p(\theta))$ as $ \theta$ varies over $ \Theta$ contains a (hyper-) rectangle in $ R^p$ then the statistic

$\displaystyle ( S_1(X), \ldots, S_p(X))
$

is complete and sufficient.

You prove the sufficiency by the factorization criterion and the completeness using the properties of Laplace transforms and the fact that the joint density of $ S_1,\ldots,S_p$ has the form

$\displaystyle g(s_1,\ldots,s_p;\theta) = h^*(s) \exp\{\sum a_k(\theta)s_k+c^*(\theta)\}
$

Example: $ N(\mu,\sigma^2)$ model density has form

$\displaystyle \frac{1}{\sqrt{2\pi}} \exp\left\{
\left(-\frac{1}{2\sigma^2}\right)x^2 + \left(\frac{\mu}{\sigma^2}\right)x
-\frac{\mu^2}{2\sigma^2} -\log \sigma \right\}
$

which is an exponential family with

$\displaystyle h(x) = \frac{1}{\sqrt{2\pi}}
$

$\displaystyle a_1(\theta) = -\frac{1}{2\sigma^2}
$

$\displaystyle S_1(x) = x^2
$

$\displaystyle a_2(\theta) = \frac{\mu}{\sigma^2}
$

$\displaystyle S_2(x) = x
$

and

$\displaystyle c(\theta) = -\frac{\mu^2}{2\sigma^2}-\log \sigma \, .
$

It follows that

$\displaystyle (\sum X_i^2, \sum X_i)
$

is a complete sufficient statistic.

Remark: The statistic $ (s^2, \bar{X})$ is a one to one function of $ (\sum X_i^2, \sum X_i)$ so it must be complete and sufficient, too. Any function of the latter statistic can be rewritten as a function of the former and vice versa.

FACT: A complete sufficient statistic is also minimal sufficient.

The Lehmann-Scheffé Theorem

Theorem: If $ S$ is a complete sufficient statistic for some model and $ h(S)$ is an unbiased estimate of some parameter $ \phi(\theta)$ then $ h(S)$ is the UMVUE of $ \phi(\theta)$.

Proof: Suppose $ T$ is another unbiased estimate of $ \phi$. According to Rao-Blackwell, $ T$ is improved by $ E(T\vert S)$ so if $ h(S)$ is not UMVUE then there must exist another function $ h^*(S)$ which is unbiased and whose variance is smaller than that of $ h(S)$ for some value of $ \theta$. But

$\displaystyle E_\theta(h^*(S)-h(S)) \equiv 0
$

so, by completeness of $ S$, in fact $ h^*(S) = h(S)$, a contradiction.

Example: In the $ N(\mu,\sigma^2)$ example the random variable $ (n-1)s^2/\sigma^2$ has a $ \chi^2_{n-1}$ distribution. It follows that

$\displaystyle E\left[\frac{\sqrt{n-1}s}{\sigma}\right] =
\frac{\int_0^\infty
x^{1/2} \left(\frac{x}{2}\right)^{(n-1)/2-1} e^{-x/2}
dx}{{2\Gamma((n-1)/2)}}
$

Make the substitution $ y=x/2$ and get

$\displaystyle E(s) = \frac{\sigma}{\sqrt{n-1}}\frac{\sqrt{2}}{\Gamma((n-1)/2)}
\int_0^\infty y^{n/2-1} e^{-y} dy
$

Hence

$\displaystyle E(s) = \sigma\frac{\sqrt{2}\Gamma(n/2)}{\sqrt{n-1}\Gamma((n-1)/2)}
$

The UMVUE of $ \sigma$ is then

$\displaystyle s\frac{\sqrt{n-1}\Gamma((n-1)/2)}{\sqrt{2}\Gamma(n/2)}
$

by the Lehmann-Scheffé theorem.
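A Monte Carlo sanity check of this correction factor (not part of the original notes): a minimal Python sketch assuming numpy, with arbitrary choices of $ n$, $ \sigma$ and the number of replications.

# Check that c_n * s, with c_n = sqrt(n-1) Gamma((n-1)/2) / (sqrt(2) Gamma(n/2)),
# is (nearly) unbiased for sigma, while s itself is biased downward.
import numpy as np
from math import sqrt, gamma

rng = np.random.default_rng(2)
n, sigma, reps = 5, 2.0, 200_000

c_n = sqrt(n - 1) * gamma((n - 1) / 2) / (sqrt(2) * gamma(n / 2))

x = rng.normal(0.0, sigma, size=(reps, n))
s = x.std(axis=1, ddof=1)                  # usual sample standard deviation

print(s.mean(), (c_n * s).mean(), sigma)   # s.mean() < 2, corrected mean close to 2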


Richard Lockhart
2001-03-05