
STAT 801 Lecture 25

Reading for Today's Lecture:

Goals of Today's Lecture:

Bayesian estimation

Now let's focus on the problem of estimating a one-dimensional parameter. Mean squared error corresponds to the loss function

\begin{displaymath}L(d,\theta) = (d-\theta)^2 \, .
\end{displaymath}

The risk function of a procedure (estimator) $\hat\theta$ is

\begin{displaymath}R_{\hat\theta}(\theta) =E_\theta[(\hat\theta-\theta)^2 ]
\end{displaymath}

Now consider using a prior with density $\pi(\theta)$. The Bayes risk of $\hat\theta$ is
\begin{align*}r_\pi & = \int R_{\hat\theta}(\theta)\pi(\theta) d\theta
\\
& = \int \int (\hat\theta(x) - \theta)^2 f(x;\theta) \pi(\theta) dx d\theta
\end{align*}

How should we choose $\hat\theta$ to minimize $r_\pi$? The solution lies in recognizing that $f(x;\theta)\pi(\theta)$ is really a joint density

\begin{displaymath}\int\int f(x;\theta)\pi(\theta)dx d\theta= 1
\end{displaymath}

For this joint density the conditional density of X given $\theta$ is just the model $f(x;\theta)$. From now on I write the model as $f(x\vert\theta)$ to emphasize this fact. We can now compute $r_\pi$ another way by factoring the joint density differently:

\begin{displaymath}f(x\vert\theta)\pi(\theta) = \pi(\theta\vert x)f(x)
\end{displaymath}

where now f(x) is the marginal density of X and $\pi(\theta\vert x)$ denotes the conditional density of $\theta$ given X. We call $\pi(\theta\vert x)$ the posterior density. It is found via Bayes' theorem (which is why this is Bayesian statistics):

\begin{displaymath}\pi(\theta\vert x ) = \frac{f(x\vert\theta)\pi(\theta)}{\int f(x\vert\phi)\pi(\phi)
d\phi}
\end{displaymath}

With this notation we can write

\begin{displaymath}r_\pi(\hat\theta) = \int \left[ \int (\hat\theta(x) - \theta)^2
\pi(\theta\vert x) d\theta \right] f(x) dx
\end{displaymath}

Now we can choose $\hat\theta(x)$ separately for each x to minimize the quantity in square brackets (as in the NP lemma). The quantity in square brackets is a quadratic function of $\hat\theta(x)$ and can be seen to be minimized by

\begin{displaymath}\hat\theta(x) = \int \theta \pi(\theta\vert x) d\theta
\end{displaymath}

which is

\begin{displaymath}E(\theta\vert X)
\end{displaymath}

and is called the posterior mean of $\theta$.
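To check this minimization explicitly, expand the quadratic around the posterior mean: for each fixed x,

\begin{displaymath}\int (\hat\theta(x) - \theta)^2 \pi(\theta\vert x) d\theta
= \left(\hat\theta(x) - E(\theta\vert x)\right)^2 + {\rm Var}(\theta\vert x)
\end{displaymath}

so the minimizing choice is $\hat\theta(x)=E(\theta\vert x)$ and the minimum value of the quantity in square brackets is the posterior variance.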

Example: Consider first the problem of estimating a normal mean $\mu$. Imagine, for example, that $\mu$ is the true speed of sound. I think this is around 330 metres per second and am pretty sure that my guess is within 30 metres per second of the truth. I might summarize my opinion by saying that I think $\mu$ has a normal distribution with mean $\nu=330$ and standard deviation $\tau=10$. That is, I take the prior density $\pi$ for $\mu$ to be $N(\nu,\tau^2)$. Before I make any measurements my best guess of $\mu$ minimizes

\begin{displaymath}\int (\hat\mu - \mu)^2 \frac{1}{\tau\sqrt{2\pi}}
\exp\{-(\mu-\nu)^2/(2\tau^2)\} d\mu
\end{displaymath}

This quantity is minimized by the prior mean of $\mu$, namely,

\begin{displaymath}\hat\mu = E_\pi(\mu) = \int \mu \pi(\mu) d\mu =\nu\, .
\end{displaymath}

Now suppose we collect 25 measurements of the speed of sound. I will assume that the measurements are unbiased and that the standard deviation of the measurement errors is $\sigma=15$, which I assume is known. Thus the model is that, conditional on $\mu$, $X_1,\ldots,X_n$ are iid $N(\mu,\sigma^2)$. The joint density of the data and $\mu$ is then
\begin{multline*}\frac{\exp\{-\sum (X_i-\mu)^2/(2\sigma^2)\}}{
(2\pi)^{n/2} \sigma^n}
\\
\times
\frac{\exp\{-(\mu-\nu)^2/(2\tau^2)\}}{
(2\pi)^{1/2} \tau}
\end{multline*}
Thus $(X_1,\ldots,X_n,\mu)$ has a multivariate normal distribution, so the conditional distribution of $\mu$ given $X_1,\ldots,X_n$ is normal.

Use standard MVN formulas to calculate conditional means and variances.
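As a reminder, the standard formulas are these: if $(Y,Z)$ is multivariate normal with mean vector $(\mu_Y,\mu_Z)$ and covariance matrix partitioned into blocks $\Sigma_{YY}$, $\Sigma_{YZ}$, $\Sigma_{ZY}$, $\Sigma_{ZZ}$, then

\begin{displaymath}E(Y\vert Z) = \mu_Y + \Sigma_{YZ}\Sigma_{ZZ}^{-1}(Z-\mu_Z)
\qquad
{\rm Var}(Y\vert Z) = \Sigma_{YY} - \Sigma_{YZ}\Sigma_{ZZ}^{-1}\Sigma_{ZY} \, .
\end{displaymath}

Applying these with $Y=\mu$ and $Z=(X_1,\ldots,X_n)$ gives the posterior mean and variance found below.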

Alternatively the exponent in the joint density is of the form

\begin{displaymath}-\frac{1}{2} \left[ \mu^2 /\gamma^2 - 2 \mu \psi/\gamma^2\right]
\end{displaymath}

plus terms not involving $\mu$ where

\begin{displaymath}\frac{1}{\gamma^2} = \left(\frac{n}{\sigma^2} + \frac{1}{\tau^2}\right)
\end{displaymath}

and

\begin{displaymath}\frac{\psi}{\gamma^2} = \frac{\sum X_i}{\sigma^2} +\frac{\nu}{\tau^2}
\end{displaymath}

This means that the conditional density of $\mu$ given the data is $N(\psi,\gamma^2)$. In other words the posterior mean of $\mu$ is

\begin{displaymath}\frac{\frac{n}{\sigma^2} \bar{X}+ \frac{1}{\tau^2} \nu}{
\frac{n}{\sigma^2} + \frac{1}{\tau^2} }
\end{displaymath}

which is a weighted average of the prior mean $\nu$ and the sample mean $\bar{X}$. Notice that the weight on the data is large when n is large or $\sigma$ is small (precise measurements) and small when $\tau$ is small (precise prior opinion).
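Plugging in the numbers from this example ($n=25$, $\sigma=15$, $\tau=10$, $\nu=330$) gives

\begin{displaymath}\frac{n}{\sigma^2} = \frac{25}{225} = \frac{1}{9}, \qquad
\frac{1}{\tau^2} = \frac{1}{100}, \qquad
\frac{n/\sigma^2}{n/\sigma^2 + 1/\tau^2} = \frac{100}{109} \approx 0.92
\end{displaymath}

so the posterior mean is roughly $0.92\bar{X} + 0.08\times 330$ and the posterior variance is $\gamma^2 = 900/109 \approx 8.3$, a posterior standard deviation of about 2.9 metres per second.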

Improper priors: When the density does not integrate to 1 we can still follow the machinery of Bayes' formula to derive a posterior. For instance in the $N(\mu,\sigma^2)$ example consider the prior density $\pi(\mu) \equiv 1$. This ``density'' integrates to $\infty$ but using Bayes' theorem to compute the posterior would give
\begin{multline*}\pi(\mu\vert X) =
\\
\frac{ (2\pi)^{-n/2} \sigma^{-n} \exp\{-\sum
(X_i-\mu)^2/(2\sigma^2)\}}{\int (2\pi)^{-n/2} \sigma^{-n} \exp\{-\sum
(X_i-\phi)^2/(2\sigma^2)\} d\phi}
\end{multline*}
The normal constants cancel and the posterior is just the $\tau\to\infty$ limit of the proper-prior case done above: a $N(\bar{X},\sigma^2/n)$ density. That is, the Bayes estimate of $\mu$ for this improper prior is $\bar{X}$.
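Explicitly, letting $\tau\to\infty$ in the posterior mean and variance found above gives

\begin{displaymath}\psi = \frac{\frac{n}{\sigma^2}\bar{X} + \frac{1}{\tau^2}\nu}{\frac{n}{\sigma^2} + \frac{1}{\tau^2}} \to \bar{X}
\qquad \mbox{and} \qquad
\gamma^2 = \frac{1}{\frac{n}{\sigma^2} + \frac{1}{\tau^2}} \to \frac{\sigma^2}{n} \, .
\end{displaymath}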

Admissibility: Bayes procedures corresponding to proper priors are admissible. It follows that for each $w\in (0,1)$ and each real $\nu$ the estimate

\begin{displaymath}w\bar{X} +(1-w)\nu
\end{displaymath}

is admissible. That this is also true for $w=1$, that is, that $\bar{X}$ is admissible, is much harder to prove.
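The connection with the prior is this: the posterior mean computed above has exactly the form $w\bar{X}+(1-w)\nu$ with

\begin{displaymath}w = \frac{n/\sigma^2}{n/\sigma^2 + 1/\tau^2}
\end{displaymath}

and as $\tau$ ranges over $(0,\infty)$ this weight sweeps out all of $(0,1)$.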

Minimax estimation: The risk function of $\bar{X}$ is simply $\sigma^2/n$. That is, the risk function is constant since it does not depend on $\mu$. Were $\bar{X}$ Bayes for a proper prior this would prove that $\bar{X}$ is minimax. In fact $\bar{X}$ is minimax, but this is hard to prove.

Example: Given $p$, $X$ has a Binomial$(n,p)$ distribution. Give $p$ a Beta$(\alpha,\beta)$ prior density

\begin{displaymath}\pi(p) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}
p^{\alpha-1} (1-p)^{\beta-1}
\end{displaymath}

The joint ``density'' of X and p is

\begin{displaymath}\dbinom{n}{X} p^X(1-p)^{n-X}
\frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}
p^{\alpha-1} (1-p)^{\beta-1}\, ;
\end{displaymath}

so the posterior density of $p$ given $X$ is of the form

\begin{displaymath}cp^{X+\alpha-1}(1-p)^{n-X+\beta-1}
\end{displaymath}

for a suitable normalizing constant $c$. This is the Beta$(X+\alpha,n-X+\beta)$ density. The mean of a Beta$(\alpha,\beta)$ distribution is $\alpha/(\alpha+\beta)$, so the Bayes estimate of $p$ is

\begin{displaymath}\frac{X+\alpha}{n+\alpha+\beta} = w\hat{p} +(1-w)
\frac{\alpha}{\alpha+\beta}
\end{displaymath}

where $\hat{p} = X/n$ is the usual mle and $w = n/(n+\alpha+\beta)$. Notice that this is again a weighted average of the prior mean and the mle, and that the prior is proper for $\alpha>0$ and $\beta>0$. To get $w=1$ take $\alpha=\beta=0$, which corresponds to the improper prior

\begin{displaymath}\frac{1}{p(1-p)}
\end{displaymath}

Again we learn that each $w\hat{p} + (1-w) p_0$ is admissible for $w\in (0,1)$. Again $\hat{p}$ is admissible but our theorem is not adequate to prove this fact.

The risk function of $w\hat{p} +(1-w) p_0$ is

\begin{displaymath}R(p) = E[(w\hat{p} +(1-w) p_0-p)^2]
\end{displaymath}

which is
\begin{multline*}w^2 Var(\hat{p}) + (wp+(1-w)p_0-p)^2
\\
=
\\
w^2 p(1-p)/n +(1-w)^2(p-p_0)^2
\end{multline*}
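Collecting the risk as a polynomial in $p$ makes the next step clearer:

\begin{displaymath}R(p) = \left[(1-w)^2 - \frac{w^2}{n}\right] p^2
+ \left[\frac{w^2}{n} - 2p_0(1-w)^2\right] p + (1-w)^2 p_0^2 \, .
\end{displaymath}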
The risk function is constant if the coefficients of $p^2$ and $p$ in the risk are 0. The coefficient of $p^2$ is

\begin{displaymath}-w^2/n +(1-w)^2
\end{displaymath}

so $w=\sqrt{n}/(1+\sqrt{n})$. The coefficient of $p$ is then

\begin{displaymath}w^2/n -2p_0(1-w)^2
\end{displaymath}

which vanishes if $2p_0=1$, or $p_0=1/2$. Working backwards we find that to get these values for $w$ and $p_0$ we require $\alpha=\beta$. Moreover

\begin{displaymath}w^2/(1-w)^2 = n
\end{displaymath}

gives

\begin{displaymath}n/(\alpha+\beta) =\sqrt{n}
\end{displaymath}

or $\alpha = \beta = \sqrt{n}/2$. The minimax estimate of $p$ is therefore

\begin{displaymath}\frac{\sqrt{n}}{1+\sqrt{n}} \hat{p} + \frac{1}{1+\sqrt{n}} \frac{1}{2}
\end{displaymath}
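As a check that the risk really is constant, substitute $w=\sqrt{n}/(1+\sqrt{n})$ and $p_0=1/2$, so that $w^2/n = (1-w)^2 = 1/(1+\sqrt{n})^2$, into the risk computed above:

\begin{displaymath}R(p) = \frac{p(1-p) + (p-1/2)^2}{(1+\sqrt{n})^2}
= \frac{1}{4(1+\sqrt{n})^2}
\end{displaymath}

which does not depend on $p$.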

Example: Now suppose that $X_1,\ldots,X_n$ are iid $MVN(\mu,\Sigma)$ with $\Sigma$ known. Take as improper prior for $\mu$ the constant density. The posterior density of $\mu$ given $X$ is then $MVN(\bar{X},\Sigma/n)$.

For multivariate estimation it is common to extend the notion of squared error loss by defining

\begin{displaymath}L(\hat\theta,\theta) = \sum (\hat\theta_i-\theta_i)^2 =
(\hat\theta-\theta)^t (\hat\theta-\theta) \, .
\end{displaymath}

For this loss function the risk is the sum of the MSEs of the individual components and the Bayes estimate is again the posterior mean. Thus $\bar{X}$ is Bayes for an improper prior in this problem. It turns out that $\bar{X}$ is minimax; its risk function is the constant $trace(\Sigma)/n$. If the dimension p of $\theta$ is 1 or 2 then $\bar{X}$ is also admissible, but if $p \ge 3$ then it is inadmissible. This fact was first demonstrated by James and Stein, who produced an estimate which is better, in terms of this risk function, for every $\mu$. The ``improved'' estimator, called the James-Stein estimator, is essentially never used.
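For reference only (the general form is not derived in these notes, and the case of a general $\Sigma$ is more complicated): when $\Sigma=\sigma^2 I$ the James-Stein estimator shrinks $\bar{X}$ toward the origin,

\begin{displaymath}\hat\mu_{JS} = \left(1 - \frac{(p-2)\sigma^2}{n \Vert\bar{X}\Vert^2}\right) \bar{X} \, ,
\end{displaymath}

with a data-dependent amount of shrinkage.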

Hypothesis Testing and Decision Theory

Decision analysis of hypothesis testing takes $D=\{0,1\}$ and

\begin{displaymath}L(d,\theta)=1(\mbox{make an error})
\end{displaymath}

or more generally $L(0,\theta) = \ell_1 1(\theta\in\Theta_1)$ and $L(1,\theta) = \ell_0 1(\theta\in\Theta_0)$ for two positive constants $\ell_0$ and $\ell_1$. We make the decision space convex by allowing a decision to be a probability measure on D. Any such measure can be specified by $\delta=P(\mbox{reject})$ so ${\cal D} =[0,1]$. The loss function of $\delta\in[0,1]$ is

\begin{displaymath}L(\delta,\theta) = (1-\delta)\ell_1 1(\theta\in\Theta_1) + \delta \ell_0
1(\theta\in\Theta_0) \, .
\end{displaymath}

Simple hypotheses: The prior is $\pi_0>0$ and $\pi_1>0$ with $\pi_0+\pi_1=1$.

A procedure is a map from the sample space to $\cal D$, that is, a test function $\phi$.

Risk function of procedure $\phi(X)$ is a pair of numbers:

\begin{displaymath}R_\phi(\theta_0) = E_0(L(\phi(X),\theta_0))
\end{displaymath}

and

\begin{displaymath}R_\phi(\theta_1) = E_1(L(\phi(X),\theta_1))
\end{displaymath}

We find

\begin{displaymath}R_\phi(\theta_0) = \ell_0 E_0(\phi(X)) =\ell_0\alpha
\end{displaymath}

and

\begin{displaymath}R_\phi(\theta_1) = \ell_1 E_1(1-\phi(X)) = \ell_1 \beta
\end{displaymath}

The Bayes risk of $\phi$ is

\begin{displaymath}\pi_0\ell_0 \alpha+ \pi_1\ell_1\beta
\end{displaymath}

We saw in the hypothesis testing section that this is minimized by

\begin{displaymath}\phi(X) = 1(f_1(X)/f_0(X) > \pi_0\ell_0/(\pi_1\ell_1))
\end{displaymath}

which is a likelihood ratio test. These tests are Bayes and admissible. The risk is constant if $\beta\ell_1 = \alpha\ell_0$; you can use this to find the minimax test in this context.
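To see where this test comes from, rewrite the Bayes risk as an integral over the sample space:

\begin{displaymath}\pi_0\ell_0 \alpha+ \pi_1\ell_1\beta
= \pi_1\ell_1 + \int \phi(x)\left[\pi_0\ell_0 f_0(x) - \pi_1\ell_1 f_1(x)\right] dx
\end{displaymath}

which is minimized by taking $\phi(x)=1$ exactly when the term in square brackets is negative, that is, when $f_1(x)/f_0(x) > \pi_0\ell_0/(\pi_1\ell_1)$.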


Richard Lockhart
2000-03-21