
STAT 801: Mathematical Statistics

Bayesian estimation

Focus on the problem of estimating a one-dimensional parameter.

Mean Squared Error corresponds to using

$\displaystyle L(d,\theta) = (d-\theta)^2 \, .
$

Risk function of procedure (estimator) $ \hat\theta$ is

$\displaystyle R_{\hat\theta}(\theta) =E_\theta[(\hat\theta-\theta)^2 ]
$
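
For intuition, here is a minimal numerical sketch (not from the original notes) that approximates this risk by Monte Carlo; the $ N(\theta,1)$ model, the sample size and the use of the sample mean as $ \hat\theta$ are illustrative assumptions only.

    import numpy as np

    rng = np.random.default_rng(0)

    def mc_risk(theta, n=10, reps=100000):
        # Draw `reps` samples of size n and average the squared error
        # of the sample mean, approximating E_theta[(thetahat - theta)^2].
        x = rng.normal(theta, 1.0, size=(reps, n))
        theta_hat = x.mean(axis=1)
        return np.mean((theta_hat - theta) ** 2)

    print(mc_risk(0.0))   # roughly 1/n = 0.1
    print(mc_risk(2.5))   # roughly the same: this risk does not depend on theta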

Now consider prior with density $ \pi(\theta)$.

Bayes risk of $ \hat\theta$ is

$\displaystyle r_\pi = \int R_{\hat\theta}(\theta)\pi(\theta)\, d\theta
= \int \int (\hat\theta(x) - \theta)^2 f(x;\theta) \pi(\theta)\, dx\, d\theta
$

Choose $ \hat\theta$ to minimize $ r_\pi$?

Recognize that $ f(x;\theta)\pi(\theta)$ is really a joint density

$\displaystyle \int\int f(x;\theta)\pi(\theta)dx d\theta= 1
$

For this joint density: conditional density of $ X$ given $ \theta$ is just the model $ f(x;\theta)$.

Justifies notation $ f(x\vert\theta)$.

Compute $ r_\pi$ another way by factoring the joint density differently:

$\displaystyle f(x\vert\theta)\pi(\theta) = \pi(\theta\vert x)f(x)
$

where now $ f(x)$ is the marginal density of $ x$ and $ \pi(\theta\vert x)$ denotes the conditional density of $ \theta$ given $ X$.

Call $ \pi(\theta\vert x)$ the posterior density.

Found via Bayes theorem (which is why this is Bayesian statistics):

$\displaystyle \pi(\theta\vert x ) = \frac{f(x\vert\theta)\pi(\theta)}{\int f(x\vert\phi)\pi(\phi)
d\phi}
$

With this notation we can write

$\displaystyle r_\pi(\hat\theta) = \int \left[ \int (\hat\theta(x) - \theta)^2
\pi(\theta\vert x) d\theta \right] f(x) dx
$

Can choose $ \hat\theta(x)$ separately for each $ x$ to minimize the quantity in square brackets (as in the NP lemma).

Quantity in square brackets is a quadratic function of $ \hat\theta(x)$; minimized by

$\displaystyle \hat\theta(x) = \int \theta \pi(\theta\vert x) d\theta
$

which is

$\displaystyle E(\theta\vert X)
$

and is called the posterior mean of $ \theta$.
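
A small grid-based sketch (not part of the notes) may help here: it computes a posterior by Bayes' theorem on a grid and checks numerically that the posterior mean minimizes the posterior expected squared error. The $ N(0,1)$ prior, $ N(\theta,1)$ likelihood and observed value are illustrative assumptions.

    import numpy as np

    theta = np.linspace(-5, 5, 2001)            # grid of parameter values
    prior = np.exp(-theta ** 2 / 2)             # unnormalized N(0,1) prior
    x_obs = 1.7                                 # a single observation
    like = np.exp(-(x_obs - theta) ** 2 / 2)    # N(theta,1) likelihood at x_obs

    post = prior * like
    post /= post.sum()                          # Bayes' theorem on the grid

    post_mean = (theta * post).sum()            # approximates E(theta | X)

    def expected_loss(d):
        # posterior expected squared error of the guess d
        return ((d - theta) ** 2 * post).sum()

    print(post_mean)                            # about 0.85 = x_obs / 2
    print(expected_loss(post_mean))             # smallest value
    print(expected_loss(post_mean + 0.3))       # strictly larger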

Example: estimating normal mean $ \mu$.

Imagine, for example that $ \mu$ is the true speed of sound.

I think this is around 330 metres per second and am pretty sure that I am within 30 metres per second of the truth with that guess.

I might summarize my opinion by saying that I think $ \mu$ has a normal distribution with mean $ \nu=$330 and standard deviation $ \tau=10$.

That is, I take a prior density $ \pi$ for $ \mu$ to be $ N(\nu,\tau^2)$.

Before I make any measurements, my best guess of $ \mu$ minimizes

$\displaystyle \int (\hat\mu - \mu)^2 \frac{1}{\tau\sqrt{2\pi}}
\exp\{-(\mu-\nu)^2/(2\tau^2)\} d\mu
$

This quantity is minimized by the prior mean of $ \mu$, namely,

$\displaystyle \hat\mu = E_\pi(\mu) = \int \mu \pi(\mu) d\mu =\nu\, .
$

Now collect 25 measurements of the speed of sound.

Assume: the measurements are unbiased for $ \mu$ and the standard deviation of the measurement errors is $ \sigma=15$, which we take to be known.

So model is: given $ \mu$, $ X_1,\ldots,X_n$ iid $ N(\mu,\sigma^2)$.

The joint density of the data and $ \mu$ is then

$\displaystyle (2\pi)^{-n/2} \sigma^{-n} \exp\{-\sum (X_i-\mu)^2/(2\sigma^2)\}
\times
(2\pi)^{-1/2} \tau^{-1} \exp\{-(\mu-\nu)^2/(2\tau^2)\}.
$

Thus $ (X_1,\ldots,X_n,\mu)\sim
MVN$. Conditional distribution of $ \mu$ given $ X_1,\ldots,X_n$ is normal.

Use standard MVN formulas to calculate conditional means and variances.

Alternatively: exponent in joint density has form

$\displaystyle -\frac{1}{2} \left[ \mu^2 /\gamma^2 - 2 \mu \psi/\gamma^2\right]
$

plus terms not involving $ \mu$ where

$\displaystyle \frac{1}{\gamma^2} = \left(\frac{n}{\sigma^2} + \frac{1}{\tau^2}\right)
$

and

$\displaystyle \frac{\psi}{\gamma^2} = \frac{\sum X_i}{\sigma^2} +\frac{\nu}{\tau^2}
$

So: conditional of $ \mu$ given data is $ N(\psi,\gamma^2)$.

In other words the posterior mean of $ \mu$ is

$\displaystyle \frac{\frac{n}{\sigma^2} \bar{X}+ \frac{1}{\tau^2} \nu}{
\frac{n}{\sigma^2} + \frac{1}{\tau^2} }
$

which is a weighted average of the prior mean $ \nu$ and the sample mean $ \bar{X}$.

Notice: weight on data is large when $ n$ is large or $ \sigma$ is small (precise measurements) and small when $ \tau$ is small (precise prior opinion).
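
A short sketch of the speed-of-sound calculation, using the numbers above ($ \nu=330$, $ \tau=10$, $ \sigma=15$, $ n=25$); the simulated measurements themselves are of course made up purely for illustration.

    import numpy as np

    nu, tau = 330.0, 10.0                 # prior mean and standard deviation
    sigma, n = 15.0, 25                   # measurement sd and sample size

    rng = np.random.default_rng(1)
    x = rng.normal(340.0, sigma, size=n)  # hypothetical measurements
    xbar = x.mean()

    prec = n / sigma ** 2 + 1 / tau ** 2  # 1 / gamma^2
    gamma2 = 1 / prec                     # posterior variance
    psi = (n / sigma ** 2 * xbar + nu / tau ** 2) / prec   # posterior mean

    w = (n / sigma ** 2) / prec           # weight on the data
    print(psi, gamma2)
    print(w * xbar + (1 - w) * nu)        # equals psi: a weighted average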

Improper priors: When the prior ``density'' does not have a finite integral we can still follow the machinery of Bayes' formula to derive a posterior.

Example: $ N(\mu,\sigma^2)$; consider prior density

$\displaystyle \pi(\mu) \equiv 1. $

This ``density'' integrates to $ \infty$; using Bayes' theorem to compute the posterior would give

$\displaystyle \pi(\mu\vert X) =
\frac{ (2\pi)^{-n/2} \sigma^{-n} \exp\{-\sum
(X_i-\mu)^2/(2\sigma^2)\}}{ \int (2\pi)^{-n/2} \sigma^{-n} \exp\{-\sum
(X_i-\xi)^2/(2\sigma^2)\}\, d\xi}
$

After cancelling the constants this is just the $ \tau \to \infty$ limit of the posterior found above: a $ N(\bar{X},\sigma^2/n)$ density.

I.e., Bayes estimate of $ \mu$ for this improper prior is $ \bar{X}$.
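
A quick numerical check (illustrative numbers only) that letting $ \tau\to\infty$ in the weighted-average formula pushes the posterior mean to $ \bar{X}$, matching the flat-prior answer.

    nu, sigma, n, xbar = 330.0, 15.0, 25, 338.2   # illustrative values

    for tau in (10.0, 100.0, 1e4, 1e8):
        prec = n / sigma ** 2 + 1 / tau ** 2
        psi = (n / sigma ** 2 * xbar + nu / tau ** 2) / prec
        print(tau, psi)                           # approaches xbar = 338.2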

Admissibility: Bayes procedures corresponding to proper priors are admissible. It follows that for each $ w\in (0,1)$ and each real $ \nu$ the estimate

$\displaystyle w\bar{X} +(1-w)\nu
$

is admissible. That this is also true for $ w=1$, that is, that $ \bar{X}$ is admissible is much harder to prove.

Minimax estimation: The risk function of $ \bar{X}$ is simply $ \sigma^2/n$. That is, the risk function is constant since it does not depend on $ \mu$. Were $ \bar{X}$ Bayes for a proper prior this would prove that $ \bar{X}$ is minimax. In fact this is also true but hard to prove.

Example: Given $ p$, $ X$ has a Binomial$ (n,p)$ distribution.

Give $ p$ a Beta $ (\alpha,\beta)$ prior density

$\displaystyle \pi(p) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}
p^{\alpha-1} (1-p)^{\beta-1}
$

The joint ``density'' of $ X$ and $ p$ is

$\displaystyle \dbinom{n}{X} p^X(1-p)^{n-X}
\frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}
p^{\alpha-1} (1-p)^{\beta-1}\, ;
$

posterior density of $ p$ given $ X$ is of the form

$\displaystyle cp^{X+\alpha-1}(1-p)^{n-X+\beta-1}
$

for a suitable normalizing constant $ c$.

This is Beta $ (X+\alpha,n-X+\beta)$ density. Mean of Beta $ (\alpha,\beta)$ distribution is $ \alpha/(\alpha+\beta)$.

So Bayes estimate of $ p$ is

$\displaystyle \frac{X+\alpha}{n+\alpha+\beta} = w\hat{p} +(1-w)
\frac{\alpha}{\alpha+\beta}
$

where $ \hat{p} =X/n$ is the usual mle.

Notice: again weighted average of prior mean and mle.
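
A tiny sketch of the Beta-Binomial update with illustrative values of $ \alpha$, $ \beta$, $ n$ and $ X$, checking that the posterior mean equals the weighted average above.

    alpha, beta = 2.0, 2.0                 # prior parameters (illustrative)
    n, X = 20, 13                          # trials and observed successes

    post_a, post_b = X + alpha, n - X + beta      # Beta(X+alpha, n-X+beta) posterior
    bayes_est = post_a / (post_a + post_b)        # posterior mean

    w = n / (n + alpha + beta)
    p_hat = X / n                                 # the usual mle
    prior_mean = alpha / (alpha + beta)

    print(bayes_est)
    print(w * p_hat + (1 - w) * prior_mean)       # the same number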

Notice: prior is proper for $ \alpha>0$ and $ \beta>0$.

To get $ w=1$ take $ \alpha=\beta=0$; use improper prior

$\displaystyle \frac{1}{p(1-p)}
$

Again: each $ w\hat{p} + (1-w) p_o$ is admissible for $ w\in (0,1)$.

Again: it is true that $ \hat{p}$ is admissible but our theorem is not adequate to prove this fact.

The risk function of $ w\hat{p} +(1-w) p_0$ is

$\displaystyle R(p) = E[(w\hat{p} +(1-w) p_0-p)^2]
$

which is

$\displaystyle w^2 {\rm Var}(\hat{p}) + (wp+(1-w)p_0-p)^2
=
w^2 p(1-p)/n +(1-w)^2(p-p_0)^2.
$

Risk function constant if coefficients of $ p^2$ and $ p$ in risk are 0.

Coefficient of $ p^2$ is

$\displaystyle -w^2/n +(1-w)^2
$

so $ w=n^{1/2}/(1+n^{1/2})$.

Coefficient of $ p$ is then

$\displaystyle w^2/n -2p_0(1-w)^2
$

which vanishes if $ 2p_0=1$ or $ p_0=1/2$.

Working backwards: to get these values for $ w$ and $ p_0$ we need $ \alpha=\beta$. Moreover

$\displaystyle w^2/(1-w)^2 = n
$

gives

$\displaystyle n/(\alpha+\beta) =\sqrt{n}
$

or $ \alpha =\beta = \sqrt{n}/2$. Minimax estimate of $ p$ is

$\displaystyle \frac{\sqrt{n}}{1+\sqrt{n}} \hat{p} + \frac{1}{1+\sqrt{n}} \frac{1}{2}
$
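
A brief check (with an arbitrary $ n$) that this estimator's risk really is constant in $ p$, while the risk $ p(1-p)/n$ of the mle is not.

    import numpy as np

    n = 25
    w = np.sqrt(n) / (1 + np.sqrt(n))
    p0 = 0.5

    p = np.linspace(0.01, 0.99, 9)
    risk_minimax = w ** 2 * p * (1 - p) / n + (1 - w) ** 2 * (p - p0) ** 2
    risk_mle = p * (1 - p) / n

    print(risk_minimax)   # constant: 1 / (4 * (1 + sqrt(n)) ** 2)
    print(risk_mle)       # depends on p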

Example: $ X_1,\ldots,X_n$ iid $ MVN(\mu,\Sigma)$ with $ \Sigma$ known.

Take improper prior for $ \mu$ which is constant.

Posterior of $ \mu$ given $ X$ is then $ MVN(\bar{X},\Sigma/n)$.

Multivariate estimation: common to extend the notion of squared error loss by defining

$\displaystyle L(\hat\theta,\theta) = \sum (\hat\theta_i-\theta_i)^2 =
(\hat\theta-\theta)^t (\hat\theta-\theta) \, .
$

For this loss risk is sum of MSEs of individual components.

Bayes estimate is again posterior mean. Thus $ \bar{X}$ is Bayes for an improper prior in this problem.

It turns out that $ \bar{X}$ is minimax; its risk function is the constant $ {\rm trace}(\Sigma)/n$.

If the dimension $ p$ of $ \mu$ is 1 or 2 then $ \bar{X}$ is also admissible but if $ p \ge 3$ then it is inadmissible.

This fact was first demonstrated by James and Stein, who produced an estimate which is better, in terms of this risk function, for every $ \mu$.

The so-called James-Stein estimator is, nevertheless, essentially never used.
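
For completeness, here is a hedged simulation sketch (not from the notes) comparing $ \bar{X}$ with a James-Stein type estimator that shrinks toward the origin; the shrink-toward-zero form, the dimension $ p=5$ and all numerical values are assumptions made purely for illustration.

    import numpy as np

    rng = np.random.default_rng(2)
    p, n, sigma = 5, 10, 1.0              # dimension, sample size, known sd
    mu = np.full(p, 0.5)                  # one particular true mean vector
    reps = 50000

    x = rng.normal(mu, sigma, size=(reps, n, p))
    xbar = x.mean(axis=1)                 # shape (reps, p)

    norm2 = (xbar ** 2).sum(axis=1)
    shrink = 1 - (p - 2) * (sigma ** 2 / n) / norm2
    js = shrink[:, None] * xbar           # James-Stein type shrinkage toward 0

    print(np.mean(((xbar - mu) ** 2).sum(axis=1)))  # about p * sigma^2 / n = 0.5
    print(np.mean(((js - mu) ** 2).sum(axis=1)))    # noticeably smaller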

Richard Lockhart
2001-04-06