
STAT 801: Mathematical Statistics

Bayesian estimation

Focus on the problem of estimating a one-dimensional parameter.

Mean Squared Error corresponds to using

$\displaystyle L(d,\theta) = (d-\theta)^2 \, .
$

Risk function of procedure (estimator) $ \hat\theta$ is

$\displaystyle R_{\hat\theta}(\theta) =E_\theta[(\hat\theta-\theta)^2 ]
$
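
For intuition, here is a minimal numerical sketch (not from the original notes) that approximates this risk by Monte Carlo; the $ N(\theta,1)$ model, the sample size and the use of the sample mean as $ \hat\theta$ are illustrative assumptions only.

    import numpy as np

    rng = np.random.default_rng(0)

    def mc_risk(theta, n=10, reps=100000):
        # Draw `reps` samples of size n and average the squared error
        # of the sample mean, approximating E_theta[(thetahat - theta)^2].
        x = rng.normal(theta, 1.0, size=(reps, n))
        theta_hat = x.mean(axis=1)
        return np.mean((theta_hat - theta) ** 2)

    print(mc_risk(0.0))   # roughly 1/n = 0.1
    print(mc_risk(2.5))   # roughly the same: this risk does not depend on theta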

Now consider prior with density $ \pi(\theta)$.

Bayes risk of $ \hat\theta$ is

$\displaystyle r_\pi = \int R_{\hat\theta}(\theta)\pi(\theta)\, d\theta
= \int \int (\hat\theta(x) - \theta)^2 f(x;\theta) \pi(\theta)\, dx\, d\theta
$

Choose $ \hat\theta$ to minimize $ r_\pi$?

Recognize that $ f(x;\theta)\pi(\theta)$ is really a joint density

$\displaystyle \int\int f(x;\theta)\pi(\theta)dx d\theta= 1
$

For this joint density: conditional density of $ X$ given $ \theta$ is just the model $ f(x;\theta)$.

Justifies notation $ f(x\vert\theta)$.

Compute $ r_\pi$ another way by factoring the joint density differently:

$\displaystyle f(x\vert\theta)\pi(\theta) = \pi(\theta\vert x)f(x)
$

where now $ f(x)$ is the marginal density of $ x$ and $ \pi(\theta\vert x)$ denotes the conditional density of $ \theta$ given $ X$.

Call $ \pi(\theta\vert x)$ the posterior density.

Found via Bayes theorem (which is why this is Bayesian statistics):

$\displaystyle \pi(\theta\vert x ) = \frac{f(x\vert\theta)\pi(\theta)}{\int f(x\vert\phi)\pi(\phi)
d\phi}
$

With this notation we can write

$\displaystyle r_\pi(\hat\theta) = \int \left[ \int (\hat\theta(x) - \theta)^2
\pi(\theta\vert x) d\theta \right] f(x) dx
$

Can choose $ \hat\theta(x)$ separately for each $ x$ to minimize the quantity in square brackets (as in the NP lemma).

Quantity in square brackets is a quadratic function of $ \hat\theta(x)$; minimized by

$\displaystyle \hat\theta(x) = \int \theta \pi(\theta\vert x) d\theta
$

which is

$\displaystyle E(\theta\vert X)
$

and is called the posterior mean of $ \theta$.
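
A small grid-based sketch (not part of the notes) may help here: it computes a posterior by Bayes' theorem on a grid and checks numerically that the posterior mean minimizes the posterior expected squared error. The $ N(0,1)$ prior, $ N(\theta,1)$ likelihood and observed value are illustrative assumptions.

    import numpy as np

    theta = np.linspace(-5, 5, 2001)            # grid of parameter values
    prior = np.exp(-theta ** 2 / 2)             # unnormalized N(0,1) prior
    x_obs = 1.7                                 # a single observation
    like = np.exp(-(x_obs - theta) ** 2 / 2)    # N(theta,1) likelihood at x_obs

    post = prior * like
    post /= post.sum()                          # Bayes' theorem on the grid

    post_mean = (theta * post).sum()            # approximates E(theta | X)

    def expected_loss(d):
        # posterior expected squared error of the guess d
        return ((d - theta) ** 2 * post).sum()

    print(post_mean)                            # about 0.85 = x_obs / 2
    print(expected_loss(post_mean))             # smallest value
    print(expected_loss(post_mean + 0.3))       # strictly larger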

Example: estimating normal mean $ \mu$.

Imagine, for example that $ \mu$ is the true speed of sound.

I think this is around 330 metres per second and am pretty sure that I am within 30 metres per second of the truth with that guess.

I might summarize my opinion by saying that I think $ \mu$ has a normal distribution with mean $ \nu=$330 and standard deviation $ \tau=10$.

That is, I take a prior density $ \pi$ for $ \mu$ to be $ N(\nu,\tau^2)$.

Before I make any measurements, my best guess of $ \mu$ minimizes

$\displaystyle \int (\hat\mu - \mu)^2 \frac{1}{\tau\sqrt{2\pi}}
\exp\{-(\mu-\nu)^2/(2\tau^2)\} d\mu
$

This quantity is minimized by the prior mean of $ \mu$, namely,

$\displaystyle \hat\mu = E_\pi(\mu) = \int \mu \pi(\mu) d\mu =\nu\, .
$

Now collect 25 measurements of the speed of sound.

Assume: the measurements are unbiased for $ \mu$ and the standard deviation of the measurement errors is $ \sigma=15$, which we take to be known.

So model is: given $ \mu$, $ X_1,\ldots,X_n$ iid $ N(\mu,\sigma^2)$.

The joint density of the data and $ \mu$ is then

$\displaystyle (2\pi)^{-n/2} \sigma^{-n} \exp\{-\sum (X_i-\mu)^2/(2\sigma^2)\}
\times
(2\pi)^{-1/2} \tau^{-1} \exp\{-(\mu-\nu)^2/(2\tau^2)\}.
$

Thus $ (X_1,\ldots,X_n,\mu)\sim
MVN$. Conditional distribution of $ \mu$ given $ X_1,\ldots,X_n$ is normal.

Use standard MVN formulas to calculate conditional means and variances.

Alternatively: exponent in joint density has form

$\displaystyle -\frac{1}{2} \left[ \mu^2 /\gamma^2 - 2 \mu \psi/\gamma^2\right]
$

plus terms not involving $ \mu$ where

$\displaystyle \frac{1}{\gamma^2} = \left(\frac{n}{\sigma^2} + \frac{1}{\tau^2}\right)
$

and

$\displaystyle \frac{\psi}{\gamma^2} = \frac{\sum X_i}{\sigma^2} +\frac{\nu}{\tau^2}
$

So: conditional of $ \mu$ given data is $ N(\psi,\gamma^2)$.

In other words the posterior mean of $ \mu$ is

$\displaystyle \frac{\frac{n}{\sigma^2} \bar{X}+ \frac{1}{\tau^2} \nu}{
\frac{n}{\sigma^2} + \frac{1}{\tau^2} }
$

which is a weighted average of the prior mean $ \nu$ and the sample mean $ \bar{X}$.

Notice: weight on data is large when $ n$ is large or $ \sigma$ is small (precise measurements) and small when $ \tau$ is small (precise prior opinion).
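
A short sketch of the speed-of-sound calculation, using the numbers above ($ \nu=330$, $ \tau=10$, $ \sigma=15$, $ n=25$); the simulated measurements themselves are of course made up purely for illustration.

    import numpy as np

    nu, tau = 330.0, 10.0                 # prior mean and standard deviation
    sigma, n = 15.0, 25                   # measurement sd and sample size

    rng = np.random.default_rng(1)
    x = rng.normal(340.0, sigma, size=n)  # hypothetical measurements
    xbar = x.mean()

    prec = n / sigma ** 2 + 1 / tau ** 2  # 1 / gamma^2
    gamma2 = 1 / prec                     # posterior variance
    psi = (n / sigma ** 2 * xbar + nu / tau ** 2) / prec   # posterior mean

    w = (n / sigma ** 2) / prec           # weight on the data
    print(psi, gamma2)
    print(w * xbar + (1 - w) * nu)        # equals psi: a weighted average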

Improper priors: When the prior ``density'' does not have a finite integral we can still follow the machinery of Bayes' formula to derive a posterior.

Example: $ N(\mu,\sigma^2)$; consider prior density

$\displaystyle \pi(\mu) \equiv 1. $

This ``density'' integrates to $ \infty$; using Bayes' theorem to compute the posterior would give

$\displaystyle \pi(\mu\vert X) =
\frac{ (2\pi)^{-n/2} \sigma^{-n} \exp\{-\sum
(X_i-\mu)^2/(2\sigma^2)\}}{ \int (2\pi)^{-n/2} \sigma^{-n} \exp\{-\sum
(X_i-\xi)^2/(2\sigma^2)\}\, d\xi}
$

After cancelling the constants this is just the $ \tau \to \infty$ limit of the posterior found above: a $ N(\bar{X},\sigma^2/n)$ density.

I.e., Bayes estimate of $ \mu$ for this improper prior is $ \bar{X}$.
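
A quick numerical check (illustrative numbers only) that letting $ \tau\to\infty$ in the weighted-average formula pushes the posterior mean to $ \bar{X}$, matching the flat-prior answer.

    nu, sigma, n, xbar = 330.0, 15.0, 25, 338.2   # illustrative values

    for tau in (10.0, 100.0, 1e4, 1e8):
        prec = n / sigma ** 2 + 1 / tau ** 2
        psi = (n / sigma ** 2 * xbar + nu / tau ** 2) / prec
        print(tau, psi)                           # approaches xbar = 338.2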

Admissibility: Bayes procedures corresponding to proper priors are admissible. It follows that for each $ w\in (0,1)$ and each real $ \nu$ the estimate

$\displaystyle w\bar{X} +(1-w)\nu
$

is admissible. That this is also true for $ w=1$, that is, that $ \bar{X}$ is admissible is much harder to prove.

Minimax estimation: The risk function of $ \bar{X}$ is simply $ \sigma^2/n$. That is, the risk function is constant since it does not depend on $ \mu$. Were $ \bar{X}$ Bayes for a proper prior this would prove that $ \bar{X}$ is minimax. In fact this is also true but hard to prove.

Example: Given $ p$, $ X$ has a Binomial$ (n,p)$ distribution.

Give $ p$ a Beta $ (\alpha,\beta)$ prior density

$\displaystyle \pi(p) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}
p^{\alpha-1} (1-p)^{\beta-1}
$

The joint ``density'' of $ X$ and $ p$ is

$\displaystyle \dbinom{n}{X} p^X(1-p)^{n-X}
\frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}
p^{\alpha-1} (1-p)^{\beta-1}\, ;
$

posterior density of $ p$ given $ X$ is of the form

$\displaystyle cp^{X+\alpha-1}(1-p)^{n-X+\beta-1}
$

for a suitable normalizing constant $ c$.

This is Beta $ (X+\alpha,n-X+\beta)$ density. Mean of Beta $ (\alpha,\beta)$ distribution is $ \alpha/(\alpha+\beta)$.

So Bayes estimate of $ p$ is

$\displaystyle \frac{X+\alpha}{n+\alpha+\beta} = w\hat{p} +(1-w)
\frac{\alpha}{\alpha+\beta}
$

where $ \hat{p} =X/n$ is the usual mle.

Notice: again weighted average of prior mean and mle.
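
A tiny sketch of the Beta-Binomial update with illustrative values of $ \alpha$, $ \beta$, $ n$ and $ X$, checking that the posterior mean equals the weighted average above.

    alpha, beta = 2.0, 2.0                 # prior parameters (illustrative)
    n, X = 20, 13                          # trials and observed successes

    post_a, post_b = X + alpha, n - X + beta      # Beta(X+alpha, n-X+beta) posterior
    bayes_est = post_a / (post_a + post_b)        # posterior mean

    w = n / (n + alpha + beta)
    p_hat = X / n                                 # the usual mle
    prior_mean = alpha / (alpha + beta)

    print(bayes_est)
    print(w * p_hat + (1 - w) * prior_mean)       # the same number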

Notice: prior is proper for $ \alpha>0$ and $ \beta>0$.

To get $ w=1$ take $ \alpha=\beta=0$; use improper prior

$\displaystyle \frac{1}{p(1-p)}
$

Again: each $ w\hat{p} + (1-w) p_o$ is admissible for $ w\in (0,1)$.

Again: it is true that $ \hat{p}$ is admissible but our theorem is not adequate to prove this fact.

The risk function of $ w\hat{p} +(1-w) p_0$ is

$\displaystyle R(p) = E[(w\hat{p} +(1-w) p_0-p)^2]
$

which is

$\displaystyle w^2 {\rm Var}(\hat{p}) + (wp+(1-w)p_0-p)^2
=
w^2 p(1-p)/n +(1-w)^2(p-p_0)^2.
$

Risk function constant if coefficients of $ p^2$ and $ p$ in risk are 0.

Coefficient of $ p^2$ is

$\displaystyle -w^2/n +(1-w)^2
$

so $ w=n^{1/2}/(1+n^{1/2})$.

Coefficient of $ p$ is then

$\displaystyle w^2/n -2p_0(1-w)^2
$

which vanishes if $ 2p_0=1$ or $ p_0=1/2$.

Working backwards: to get these values for $ w$ and $ p_0$ we need $ \alpha=\beta$. Moreover

$\displaystyle w^2/(1-w)^2 = n
$

gives

$\displaystyle n/(\alpha+\beta) =\sqrt{n}
$

or $ \alpha =\beta = \sqrt{n}/2$. Minimax estimate of $ p$ is

$\displaystyle \frac{\sqrt{n}}{1+\sqrt{n}} \hat{p} + \frac{1}{1+\sqrt{n}} \frac{1}{2}
$
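
A brief check (with an arbitrary $ n$) that this estimator's risk really is constant in $ p$, while the risk $ p(1-p)/n$ of the mle is not.

    import numpy as np

    n = 25
    w = np.sqrt(n) / (1 + np.sqrt(n))
    p0 = 0.5

    p = np.linspace(0.01, 0.99, 9)
    risk_minimax = w ** 2 * p * (1 - p) / n + (1 - w) ** 2 * (p - p0) ** 2
    risk_mle = p * (1 - p) / n

    print(risk_minimax)   # constant: 1 / (4 * (1 + sqrt(n)) ** 2)
    print(risk_mle)       # depends on p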

Example: $ X_1,\ldots,X_n$ iid $ MVN(\mu,\Sigma)$ with $ \Sigma$ known.

Take improper prior for $ \mu$ which is constant.

Posterior of $ \mu$ given $ X$ is then $ MVN(\bar{X},\Sigma/n)$.

Multivariate estimation: common to extend the notion of squared error loss by defining

$\displaystyle L(\hat\theta,\theta) = \sum (\hat\theta_i-\theta_i)^2 =
(\hat\theta-\theta)^t (\hat\theta-\theta) \, .
$

For this loss risk is sum of MSEs of individual components.

Bayes estimate is again posterior mean. Thus $ \bar{X}$ is Bayes for an improper prior in this problem.

It turns out that $ \bar{X}$ is minimax; its risk function is the constant $ {\rm trace}(\Sigma)/n$.

If the dimension $ p$ of $ \mu$ is 1 or 2 then $ \bar{X}$ is also admissible but if $ p \ge 3$ then it is inadmissible.

This fact was first demonstrated by James and Stein, who produced an estimate which is better, in terms of this risk function, for every $ \mu$.

The so-called James-Stein estimator is, nevertheless, essentially never used.
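
For completeness, here is a hedged simulation sketch (not from the notes) comparing $ \bar{X}$ with a James-Stein type estimator that shrinks toward the origin; the shrink-toward-zero form, the dimension $ p=5$ and all numerical values are assumptions made purely for illustration.

    import numpy as np

    rng = np.random.default_rng(2)
    p, n, sigma = 5, 10, 1.0              # dimension, sample size, known sd
    mu = np.full(p, 0.5)                  # one particular true mean vector
    reps = 50000

    x = rng.normal(mu, sigma, size=(reps, n, p))
    xbar = x.mean(axis=1)                 # shape (reps, p)

    norm2 = (xbar ** 2).sum(axis=1)
    shrink = 1 - (p - 2) * (sigma ** 2 / n) / norm2
    js = shrink[:, None] * xbar           # James-Stein type shrinkage toward 0

    print(np.mean(((xbar - mu) ** 2).sum(axis=1)))  # about p * sigma^2 / n = 0.5
    print(np.mean(((js - mu) ** 2).sum(axis=1)))    # noticeably smaller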

Richard Lockhart
2001-04-06