
STAT 801 Lecture 24

Reading for Today's Lecture:

Goals of Today's Lecture:


Decision Theory and Bayesian Methods

Statistical Decision Theory

Statistical problems have another ingredient, the data. We observe X, a random variable taking values in, say, $\cal X$. We may make our decision d depend on X. A decision rule is a function $\delta(X)$ from $\cal X$ to D. We want $L(\delta(X),\theta)$ to be small for all $\theta$. Since X is random we quantify this by averaging over X and compare procedures $\delta$ in terms of the risk function

\begin{displaymath}R_\delta(\theta) = E_\theta(L(\delta(X),\theta))
\end{displaymath}

To compare two procedures we must compare two functions of $\theta$ and pick ``the smaller one''. But typically the two functions will cross each other and there won't be a unique `smaller one'.

Example: In estimation theory to estimate a real parameter $\theta$ we used $D=\Theta$,

\begin{displaymath}L(d,\theta) = (d-\theta)^2
\end{displaymath}

and find that the risk of an estimator $\hat\theta(X)$ is

\begin{displaymath}R_{\hat\theta}(\theta) = E[(\hat\theta-\theta)^2]
\end{displaymath}

which is just the Mean Squared Error of $\hat\theta$. We have already seen that there is no unique best estimator in the sense of MSE. How do we compare risk functions in general? We extend the minimax and Bayes criteria from losses to risks.
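
As a numerical aside (my own sketch, not part of the notes), the following Python snippet approximates by simulation the risk functions of two estimators of a normal mean: the sample mean and the shrunken estimate $0.8\bar{X}$. The sample size, $\sigma$, the shrinkage factor and the grid of $\theta$ values are arbitrary choices. The two risk curves cross, which is why neither estimator dominates the other.

\begin{verbatim}
# Monte Carlo sketch (illustrative only): risk functions, under squared error
# loss, of the sample mean and of the shrunken estimate 0.8*Xbar.
import numpy as np

rng = np.random.default_rng(0)
n, sigma, reps = 10, 1.0, 20000
thetas = np.linspace(-2.0, 2.0, 21)

for theta in thetas:
    x = rng.normal(theta, sigma, size=(reps, n))
    xbar = x.mean(axis=1)
    r_mean = np.mean((xbar - theta) ** 2)          # MSE of the sample mean
    r_shrunk = np.mean((0.8 * xbar - theta) ** 2)  # MSE of 0.8 * sample mean
    print(f"theta={theta:+.1f}  R(mean)={r_mean:.4f}  R(shrunk)={r_shrunk:.4f}")
\end{verbatim}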

Definition: A decision rule $\delta$ is inadmissible if there is a rule $\delta^*$ such that

\begin{displaymath}R_{\delta^*}(\theta) \le R_\delta(\theta)
\end{displaymath}

for all $\theta$ and there is at least one value of $\theta$ where the inequality is strict. A rule which is not inadmissible is called admissible.

Bayesian estimation

Now let us focus on the problem of estimating a one-dimensional parameter. Mean Squared Error corresponds to using

\begin{displaymath}L(d,\theta) = (d-\theta)^2 \, .
\end{displaymath}

The risk function of a procedure (estimator) $\hat\theta$ is

\begin{displaymath}R_{\hat\theta}(\theta) =E_\theta[(\hat\theta-\theta)^2 ]
\end{displaymath}

Now consider using a prior with density $\pi(\theta)$. The Bayes risk of $\hat\theta$ is
\begin{align*}r_\pi & = \int R_{\hat\theta}(\theta)\pi(\theta) d\theta
\\
& = \int \int (\hat\theta(x) - \theta)^2 f(x;\theta) \pi(\theta) dx d\theta
\end{align*}

How should we choose $\hat\theta$ to minimize $r_\pi$? The solution lies in recognizing that $f(x;\theta)\pi(\theta)$ is really the joint density of the pair $(X,\theta)$:

\begin{displaymath}f_{X,\theta}(x,\theta) = f(x;\theta)\pi(\theta) \, .
\end{displaymath}
For this joint density the conditional density of X given $\theta$ is just the model $f(x;\theta)$. From now on I write the model as $f(x\vert\theta)$ to emphasize this fact. We can now compute $r_\pi$ another way by factoring the joint density in the opposite order:

\begin{displaymath}f(x\vert\theta)\pi(\theta) = \pi(\theta\vert x)f(x)
\end{displaymath}

where now f(x) is the marginal density of X and $\pi(\theta\vert x)$ denotes the conditional density of $\theta$ given X. We call $\pi(\theta\vert x)$ the posterior density. It is found via Bayes' theorem (which is why this is Bayesian statistics):

\begin{displaymath}\pi(\theta\vert x ) = \frac{f(x\vert\theta)\pi(\theta)}{\int f(x\vert\phi)\pi(\phi)
d\phi}
\end{displaymath}
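
As a rough numerical aside (a sketch I am adding, not part of the notes), the posterior can be computed on a grid directly from this formula by multiplying the likelihood by the prior and normalizing. The $N(\theta,1)$ model, the $N(0,4)$ prior and the observed value of x are illustrative choices.

\begin{verbatim}
# Sketch: posterior density on a grid via Bayes' theorem.
import numpy as np

def normal_pdf(z, mean, sd):
    return np.exp(-(z - mean) ** 2 / (2 * sd ** 2)) / (sd * np.sqrt(2 * np.pi))

x = 1.3                                # a single (made up) observation
theta = np.linspace(-10, 10, 4001)     # grid of parameter values
d = theta[1] - theta[0]                # grid spacing

prior = normal_pdf(theta, 0.0, 2.0)    # pi(theta)
like = normal_pdf(x, theta, 1.0)       # f(x|theta) as a function of theta

post = like * prior
post /= post.sum() * d                 # normalize by int f(x|phi) pi(phi) dphi
print("E(theta|x) is approximately", (theta * post).sum() * d)
\end{verbatim}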

With this notation we can write

\begin{displaymath}r_\pi(\hat\theta) = \int \left[ \int (\hat\theta(x) - \theta)^2
\pi(\theta\vert x) d\theta \right] f(x) dx
\end{displaymath}

Now we can choose $\hat\theta(x)$ separately for each x to minimize the quantity in square brackets (as in the Neyman-Pearson lemma). The quantity in square brackets is a quadratic function of $\hat\theta(x)$ and can be seen to be minimized by

\begin{displaymath}\hat\theta(x) = \int \theta \pi(\theta\vert x) d\theta
\end{displaymath}

which is

\begin{displaymath}E(\theta\vert X)
\end{displaymath}

and is called the posterior mean of $\theta$.
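
To see the minimization explicitly, complete the square in $\hat\theta(x)$: the quantity in square brackets is

\begin{displaymath}\int (\hat\theta(x) - \theta)^2 \pi(\theta\vert x) d\theta
= \left(\hat\theta(x) - E(\theta\vert X=x)\right)^2 + {\rm Var}(\theta\vert X=x)
\end{displaymath}

which is smallest when $\hat\theta(x) = E(\theta\vert X=x)$.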

Example: Consider first the problem of estimating a normal mean $\mu$. Imagine, for example, that $\mu$ is the true speed of sound. I think this is around 330 metres per second and am pretty sure that I am within 30 metres per second of the truth with that guess. I might summarize my opinion by saying that I think $\mu$ has a normal distribution with mean $\nu=330$ and standard deviation $\tau=10$. That is, I take the prior density $\pi$ for $\mu$ to be the $N(\nu,\tau^2)$ density. Before I make any measurements my best guess of $\mu$ minimizes

\begin{displaymath}\int (\hat\mu - \mu)^2 \frac{1}{\tau\sqrt{2\pi}}
\exp\{-(\mu-\nu)^2/(2\tau^2)\} d\mu
\end{displaymath}

This quantity is minimized by the prior mean of $\mu$, namely,

\begin{displaymath}\hat\mu = E_\pi(\mu) = \int \mu \pi(\mu) d\mu =\nu\, .
\end{displaymath}

Now suppose we collect 25 measurements of the speed of sound. I will assume that the measurements are unbiased for $\mu$ and that the standard deviation of the measurement errors is $\sigma=15$, which I assume we know. Thus the model is that, conditional on $\mu$, $X_1,\ldots,X_n$ are iid $N(\mu,\sigma^2)$. The joint density of the data and $\mu$ is then

\begin{displaymath}(2\pi)^{-n/2}\sigma^{-n} \exp\left\{-\sum (X_i-\mu)^2/(2\sigma^2)\right\}
\times \frac{1}{\tau\sqrt{2\pi}} \exp\left\{-(\mu-\nu)^2/(2\tau^2)\right\} \, .
\end{displaymath}
This is a multivariate normal joint density for $(X_1,\ldots,X_n,\mu)$, so the conditional distribution of $\mu$ given $X_1,\ldots,X_n$ is normal. Standard multivariate normal formulas can be used to calculate the conditional mean and variance. Alternatively, the exponent in the joint density is of the form

\begin{displaymath}-\frac{1}{2} \left[ \mu^2 /\gamma^2 - 2 \mu \psi/\gamma^2\right]
\end{displaymath}

plus terms not involving $\mu$ where

\begin{displaymath}\frac{1}{\gamma^2} = \left(\frac{n}{\sigma^2} + \frac{1}{\tau^2}\right)
\end{displaymath}

and

\begin{displaymath}\frac{\psi}{\gamma^2} = \frac{\sum X_i}{\sigma^2} +\frac{\nu}{\tau^2}
\end{displaymath}

This means that the conditional density of $\mu$ given the data is $N(\psi,\gamma^2)$. In other words the posterior mean of $\mu$ is

\begin{displaymath}\frac{\frac{n}{\sigma^2} \bar{X}+ \frac{1}{\tau^2} \nu}{
\frac{n}{\sigma^2} + \frac{1}{\tau^2} }
\end{displaymath}

which is a weighted average of the prior mean $\nu$ and the sample mean $\bar{X}$. Notice that the weight on the data is large when n is large or $\sigma$ is small (precise measurements) and small when $\tau$ is small (precise prior opinion).
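
As an illustration (my own sketch; the 25 measurements are simulated since the notes give no data), the posterior mean and variance for the speed-of-sound example can be computed directly from these formulas.

\begin{verbatim}
# Sketch: normal prior, normal model, known sigma; posterior mean and variance.
import numpy as np

rng = np.random.default_rng(1)
nu, tau = 330.0, 10.0                   # prior mean and sd for mu
sigma, n = 15.0, 25                     # known measurement sd, sample size
x = rng.normal(340.0, sigma, size=n)    # hypothetical measurements
xbar = x.mean()

prec = n / sigma**2 + 1 / tau**2        # 1/gamma^2
psi = (n / sigma**2 * xbar + nu / tau**2) / prec   # posterior mean
gamma2 = 1 / prec                                  # posterior variance

print(f"xbar = {xbar:.2f}")
print(f"posterior mean = {psi:.2f}, posterior sd = {gamma2 ** 0.5:.2f}")
\end{verbatim}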

Improper priors: When the prior density does not integrate to 1 we can still follow the machinery of Bayes' formula to derive a posterior. For instance, in the $N(\mu,\sigma^2)$ example consider the prior density $\pi(\mu) \equiv 1$. This ``density'' integrates to $\infty$, but using Bayes' theorem to compute the posterior gives

\begin{displaymath}\pi(\mu\vert X) = \frac{ (2\pi)^{-n/2} \sigma^{-n} \exp\{-\sum
(X_i-\mu)^2/(2\sigma^2)\}}{\int (2\pi)^{-n/2} \sigma^{-n} \exp\{-\sum
(X_i-\phi)^2/(2\sigma^2)\} d\phi}
\end{displaymath}

It is easy to see that this simplifies to the $\tau \to \infty$ limit of the proper-prior case above, namely a $N(\bar{X},\sigma^2/n)$ density. That is, the Bayes estimate of $\mu$ for this improper prior is $\bar{X}$.

Admissibility: Bayes procedures for proper priors are admissible. Since the posterior mean computed above is $w\bar{X}+(1-w)\nu$ with $w=(n/\sigma^2)/(n/\sigma^2+1/\tau^2)$, and every $w\in(0,1)$ arises from some $\tau>0$, it follows that for each $w\in (0,1)$ and each real $\nu$ the estimate

\begin{displaymath}w\bar{X} +(1-w)\nu
\end{displaymath}

is admissible. That this is also true for w=1, that is, that $\bar{X}$ itself is admissible, is much harder to prove.

Minimax estimation: The risk function of $\bar{X}$ is simply $\sigma^2/n$. That is, the risk function is constant since it does not depend on $\mu$. Were $\bar{X}$ Bayes for a proper prior this would prove that $\bar{X}$ is minimax. In fact this is also true but hard to prove.

Example: Suppose that, given p, X has a Binomial(n,p) distribution. We will give p a Beta$(\alpha,\beta)$ prior density

\begin{displaymath}\pi(p) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}
p^{\alpha-1}(1-p)^{\beta-1} \, .
\end{displaymath}
The joint ``density'' of X and p is

\begin{displaymath}\binom{n}{X} p^X (1-p)^{n-X} \,
\frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}
p^{\alpha-1}(1-p)^{\beta-1}
\end{displaymath}
so that the posterior density of p given X is of the form

\begin{displaymath}cp^{X+\alpha-1}(1-p)^{n-X+\beta-1}
\end{displaymath}

for a suitable normalizing constant c. But this is a Beta $(X+\alpha,n-X+\beta)$ density. The mean of a Beta $(\alpha,\beta)$ distribution is $\alpha/(\alpha+\beta)$. Thus the Bayes estimate of p is

\begin{displaymath}\frac{X+\alpha}{n+\alpha+\beta} = w\hat{p} +(1-w)
\frac{\alpha}{\alpha+\beta}
\end{displaymath}

where $\hat{p} =X/n$ is the usual MLE. Notice that this is again a weighted average of the prior mean and the MLE, with weight $w=n/(n+\alpha+\beta)$ on the MLE. Notice also that the prior is proper for $\alpha>0$ and $\beta>0$. To get w=1 we would take $\alpha=\beta=0$ and use the improper prior

\begin{displaymath}\frac{1}{p(1-p)}
\end{displaymath}

Again we learn that each $w\hat{p} + (1-w) p_0$ is admissible for $w\in (0,1)$. Again it is true that $\hat{p}$ is admissible, but our theorem is not adequate to prove this fact.
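
A small numerical check (mine, with made-up values of n, X, $\alpha$ and $\beta$) that the Bayes estimate $(X+\alpha)/(n+\alpha+\beta)$ equals the weighted average $w\hat{p}+(1-w)\alpha/(\alpha+\beta)$ with $w=n/(n+\alpha+\beta)$:

\begin{verbatim}
# Sketch: the two forms of the Bayes estimate of a binomial p agree.
n, X = 40, 13                 # hypothetical data
alpha, beta = 2.0, 2.0        # hypothetical prior parameters

bayes = (X + alpha) / (n + alpha + beta)
w = n / (n + alpha + beta)
phat = X / n
weighted = w * phat + (1 - w) * alpha / (alpha + beta)
print(bayes, weighted)        # identical values
\end{verbatim}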

The risk function of $w\hat{p} +(1-w) p_0$ is

\begin{displaymath}R(p) = E[(w\hat{p} +(1-w) p_0-p)^2]
\end{displaymath}

which is

\begin{displaymath}w^2 Var(\hat{p}) + (wp+(1-w)p_0-p)^2 = w^2 p(1-p)/n +(1-w)^2(p-p_0)^2
\end{displaymath}
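
As a check (my own sketch, with arbitrary choices of n, w and $p_0$), the closed form can be compared with a Monte Carlo estimate of the risk.

\begin{verbatim}
# Sketch: simulated versus exact risk of w*phat + (1-w)*p0.
import numpy as np

rng = np.random.default_rng(2)
n, w, p0 = 25, 0.7, 0.5
for p in (0.1, 0.3, 0.5, 0.9):
    X = rng.binomial(n, p, size=200000)
    est = w * X / n + (1 - w) * p0
    mc = np.mean((est - p) ** 2)
    exact = w**2 * p * (1 - p) / n + (1 - w)**2 * (p - p0) ** 2
    print(f"p={p}: simulated {mc:.5f}, formula {exact:.5f}")
\end{verbatim}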

This risk function will be constant if the coefficients of both $p^2$ and $p$ in the risk are 0. The coefficient of $p^2$ is

\begin{displaymath}-w^2/n +(1-w)^2
\end{displaymath}

so $w=\sqrt{n}/(1+\sqrt{n})$. The coefficient of $p$ is then

\begin{displaymath}w^2/n -2p_0(1-w)^2
\end{displaymath}

which will vanish if $2p_0=1$, that is, $p_0=1/2$. Working backwards we find that to get these values for $w$ and $p_0$ we require $\alpha=\beta$. Moreover, since $w=n/(n+\alpha+\beta)$, the equation

\begin{displaymath}w^2/(1-w)^2 = n
\end{displaymath}

gives

\begin{displaymath}(\alpha+\beta)^2 = n
\end{displaymath}

or $\alpha =\beta = \sqrt{n}/2$. The minimax estimate of p is

\begin{displaymath}\frac{X+\sqrt{n}/2}{n+\sqrt{n}} =
\frac{\sqrt{n}}{1+\sqrt{n}}\,\hat{p} + \frac{1}{1+\sqrt{n}} \cdot \frac{1}{2} \, .
\end{displaymath}
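
A quick numerical check (my own) that this estimate has constant risk: plugging $w=\sqrt{n}/(1+\sqrt{n})$ and $p_0=1/2$ into the risk formula gives $[p(1-p)+(p-1/2)^2]/(1+\sqrt{n})^2 = 1/\{4(1+\sqrt{n})^2\}$, free of p.

\begin{verbatim}
# Sketch: the risk of the minimax estimate does not depend on p.
import numpy as np

n = 25
w = np.sqrt(n) / (1 + np.sqrt(n))
for p in np.linspace(0.05, 0.95, 10):
    risk = w**2 * p * (1 - p) / n + (1 - w)**2 * (p - 0.5) ** 2
    print(f"p={p:.2f}  risk={risk:.6f}")   # always 1/(4*(1+sqrt(n))^2)
\end{verbatim}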


Example: Now suppose that $X_1,\ldots,X_n$ are iid $MVN(\mu,\Sigma)$ with $\Sigma$ known. Take as the improper prior for $\mu$ the constant density $\pi(\mu)\equiv 1$. The posterior density of $\mu$ given the data is then $MVN(\bar{X},\Sigma/n)$.

For multivariate estimation it is common to extend the notion of squared error loss by defining

\begin{displaymath}L(\hat\theta,\theta) = \sum (\hat\theta_i-\theta_i)^2 =
(\hat\theta-\theta)^t (\hat\theta-\theta) \, .
\end{displaymath}

For this loss function the risk is the sum of the MSEs of the individual components, and the Bayes estimate is again the posterior mean. Thus $\bar{X}$ is Bayes for an improper prior in this problem. It turns out that $\bar{X}$ is minimax; its risk function is the constant $trace(\Sigma)/n$. If the dimension p of $\mu$ is 1 or 2 then $\bar{X}$ is also admissible, but if $p \ge 3$ then it is inadmissible. This fact was first demonstrated by James and Stein, who produced an estimate which is better, in terms of this risk function, for every $\mu$. The ``improved'' estimator, called the James-Stein estimator, is essentially never used.
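
To illustrate the inadmissibility claim (my own sketch; the notes do not give the James-Stein formula), the following simulation compares the risk of $\bar{X}$ with that of a James-Stein type estimator which shrinks $\bar{X}$ toward 0, assuming $\Sigma=\sigma^2 I$ with $\sigma$ known and $p\ge 3$. The true $\mu$, the dimension and the sample size are arbitrary choices.

\begin{verbatim}
# Sketch: Monte Carlo risks of Xbar and of the James-Stein type estimate
# (1 - (p-2)*(sigma^2/n)/||Xbar||^2) * Xbar, for Sigma = sigma^2 * I.
import numpy as np

rng = np.random.default_rng(3)
p, n, sigma, reps = 5, 10, 1.0, 20000
mu = np.full(p, 0.5)                        # arbitrary true mean vector

x = rng.normal(mu, sigma, size=(reps, n, p))
xbar = x.mean(axis=1)                       # shape (reps, p)
norm2 = (xbar ** 2).sum(axis=1)
shrink = 1 - (p - 2) * sigma**2 / n / norm2
js = shrink[:, None] * xbar

risk_xbar = np.mean(((xbar - mu) ** 2).sum(axis=1))
risk_js = np.mean(((js - mu) ** 2).sum(axis=1))
print(f"risk of Xbar        : {risk_xbar:.4f} (exact value {p * sigma**2 / n})")
print(f"risk of James-Stein : {risk_js:.4f}")
\end{verbatim}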





Richard Lockhart
1998-11-30