
STAT 801 Lecture 24

Reading for Today's Lecture:

Goals of Today's Lecture:


Decision Theory and Bayesian Methods

Statistical Decision Theory

Statistical problems have another ingredient, the data. We observe X, a random variable taking values in, say, $\cal X$. We may make our decision d depend on X. A decision rule is a function $\delta(X)$ from $\cal X$ to D. We want $L(\delta(X),\theta)$ to be small for all $\theta$. Since X is random we quantify this by averaging over X and compare procedures $\delta$ in terms of the risk function

\begin{displaymath}R_\delta(\theta) = E_\theta(L(\delta(X),\theta))
\end{displaymath}

To compare two procedures we must compare two functions of $\theta$ and pick ``the smaller one''. But typically the two functions will cross each other and there won't be a unique `smaller one'.

Example: In estimation theory to estimate a real parameter $\theta$ we used $D=\Theta$,

\begin{displaymath}L(d,\theta) = (d-\theta)^2
\end{displaymath}

and find that the risk of an estimator $\hat\theta(X)$ is

\begin{displaymath}R_{\hat\theta}(\theta) = E[(\hat\theta-\theta)^2]
\end{displaymath}

which is just the Mean Squared Error of $\hat\theta$. We have already seen that there is no unique best estimator in the sense of MSE. How do we compare risk functions in general? We extend the minimax and Bayes criteria from losses to risks.
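
As a numerical aside (my own sketch, not part of the notes), the following Python snippet approximates by simulation the risk functions of two estimators of a normal mean: the sample mean and the shrunken estimate $0.8\bar{X}$. The sample size, $\sigma$, the shrinkage factor and the grid of $\theta$ values are arbitrary choices. The two risk curves cross, which is why neither estimator dominates the other.

\begin{verbatim}
# Monte Carlo sketch (illustrative only): risk functions, under squared error
# loss, of the sample mean and of the shrunken estimate 0.8*Xbar.
import numpy as np

rng = np.random.default_rng(0)
n, sigma, reps = 10, 1.0, 20000
thetas = np.linspace(-2.0, 2.0, 21)

for theta in thetas:
    x = rng.normal(theta, sigma, size=(reps, n))
    xbar = x.mean(axis=1)
    r_mean = np.mean((xbar - theta) ** 2)          # MSE of the sample mean
    r_shrunk = np.mean((0.8 * xbar - theta) ** 2)  # MSE of 0.8 * sample mean
    print(f"theta={theta:+.1f}  R(mean)={r_mean:.4f}  R(shrunk)={r_shrunk:.4f}")
\end{verbatim}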

Definition: A decision rule $\delta$ is inadmissible if there is a rule $\delta^*$ such that

\begin{displaymath}R_{\delta^*}(\theta) \le R_\delta(\theta)
\end{displaymath}

for all $\theta$ and there is at least one value of $\theta$ where the inequality is strict. A rule which is not inadmissible is called admissible.

Bayesian estimation

Now let us focus on the problem of estimating a one-dimensional parameter. Mean Squared Error corresponds to using

\begin{displaymath}L(d,\theta) = (d-\theta)^2 \, .
\end{displaymath}

The risk function of a procedure (estimator) $\hat\theta$ is

\begin{displaymath}R_{\hat\theta}(\theta) =E_\theta[(\hat\theta-\theta)^2 ]
\end{displaymath}

Now consider using a prior with density $\pi(\theta)$. The Bayes risk of $\hat\theta$ is
\begin{align*}r_\pi & = \int R_{\hat\theta}(\theta)\pi(\theta) d\theta
\\
& = \int \int (\hat\theta(x) - \theta)^2 f(x;\theta) \pi(\theta) dx d\theta
\end{align*}

How should we choose $\hat\theta$ to minimize $r_\pi$? The solution lies in recognizing that $f(x;\theta)\pi(\theta)$ is really the joint density of the pair $(X,\theta)$:

\begin{displaymath}f_{X,\theta}(x,\theta) = f(x;\theta)\pi(\theta) \, .
\end{displaymath}
For this joint density the conditional density of X given $\theta$ is just the model $f(x;\theta)$. From now on I write the model as $f(x\vert\theta)$ to emphasize this fact. We can now compute $r_\pi$ another way by factoring the joint density in the opposite order:

\begin{displaymath}f(x\vert\theta)\pi(\theta) = \pi(\theta\vert x)f(x)
\end{displaymath}

where now f(x) is the marginal density of X and $\pi(\theta\vert x)$ denotes the conditional density of $\theta$ given X. We call $\pi(\theta\vert x)$ the posterior density. It is found via Bayes' theorem (which is why this is Bayesian statistics):

\begin{displaymath}\pi(\theta\vert x ) = \frac{f(x\vert\theta)\pi(\theta)}{\int f(x\vert\phi)\pi(\phi)
d\phi}
\end{displaymath}
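
As a rough numerical aside (a sketch I am adding, not part of the notes), the posterior can be computed on a grid directly from this formula by multiplying the likelihood by the prior and normalizing. The $N(\theta,1)$ model, the $N(0,4)$ prior and the observed value of x are illustrative choices.

\begin{verbatim}
# Sketch: posterior density on a grid via Bayes' theorem.
import numpy as np

def normal_pdf(z, mean, sd):
    return np.exp(-(z - mean) ** 2 / (2 * sd ** 2)) / (sd * np.sqrt(2 * np.pi))

x = 1.3                                # a single (made up) observation
theta = np.linspace(-10, 10, 4001)     # grid of parameter values
d = theta[1] - theta[0]                # grid spacing

prior = normal_pdf(theta, 0.0, 2.0)    # pi(theta)
like = normal_pdf(x, theta, 1.0)       # f(x|theta) as a function of theta

post = like * prior
post /= post.sum() * d                 # normalize by int f(x|phi) pi(phi) dphi
print("E(theta|x) is approximately", (theta * post).sum() * d)
\end{verbatim}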

With this notation we can write

\begin{displaymath}r_\pi(\hat\theta) = \int \left[ \int (\hat\theta(x) - \theta)^2
\pi(\theta\vert x) d\theta \right] f(x) dx
\end{displaymath}

Now we can choose $\hat\theta(x)$ separately for each x to minimize the quantity in square brackets (as in the Neyman-Pearson lemma). The quantity in square brackets is a quadratic function of $\hat\theta(x)$ and can be seen to be minimized by

\begin{displaymath}\hat\theta(x) = \int \theta \pi(\theta\vert x) d\theta
\end{displaymath}

which is

\begin{displaymath}E(\theta\vert X)
\end{displaymath}

and is called the posterior mean of $\theta$.
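
To see the minimization explicitly, complete the square in $\hat\theta(x)$: the quantity in square brackets is

\begin{displaymath}\int (\hat\theta(x) - \theta)^2 \pi(\theta\vert x) d\theta
= \left(\hat\theta(x) - E(\theta\vert X=x)\right)^2 + {\rm Var}(\theta\vert X=x)
\end{displaymath}

which is smallest when $\hat\theta(x) = E(\theta\vert X=x)$.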

Example: Consider first the problem of estimating a normal mean $\mu$. Imagine, for example, that $\mu$ is the true speed of sound. I think this is around 330 metres per second and am pretty sure that I am within 30 metres per second of the truth with that guess. I might summarize my opinion by saying that I think $\mu$ has a normal distribution with mean $\nu=330$ and standard deviation $\tau=10$. That is, I take the prior density $\pi$ for $\mu$ to be the $N(\nu,\tau^2)$ density. Before I make any measurements my best guess of $\mu$ minimizes

\begin{displaymath}\int (\hat\mu - \mu)^2 \frac{1}{\tau\sqrt{2\pi}}
\exp\{-(\mu-\nu)^2/(2\tau^2)\} d\mu
\end{displaymath}

This quantity is minimized by the prior mean of $\mu$, namely,

\begin{displaymath}\hat\mu = E_\pi(\mu) = \int \mu \pi(\mu) d\mu =\nu\, .
\end{displaymath}

Now suppose we collect 25 measurements of the speed of sound. I will assume that the measurements are unbiased for $\mu$ and that the standard deviation of the measurement errors is $\sigma=15$, which I assume we know. Thus the model is that, conditional on $\mu$, $X_1,\ldots,X_n$ are iid $N(\mu,\sigma^2)$. The joint density of the data and $\mu$ is then

\begin{displaymath}(2\pi)^{-n/2}\sigma^{-n} \exp\left\{-\sum (X_i-\mu)^2/(2\sigma^2)\right\}
\times \frac{1}{\tau\sqrt{2\pi}} \exp\left\{-(\mu-\nu)^2/(2\tau^2)\right\} \, .
\end{displaymath}
This is a multivariate normal joint density for $(X_1,\ldots,X_n,\mu)$, so the conditional distribution of $\mu$ given $X_1,\ldots,X_n$ is normal. Standard multivariate normal formulas can be used to calculate the conditional mean and variance. Alternatively, the exponent in the joint density is of the form

\begin{displaymath}-\frac{1}{2} \left[ \mu^2 /\gamma^2 - 2 \mu \psi/\gamma^2\right]
\end{displaymath}

plus terms not involving $\mu$ where

\begin{displaymath}\frac{1}{\gamma^2} = \left(\frac{n}{\sigma^2} + \frac{1}{\tau^2}\right)
\end{displaymath}

and

\begin{displaymath}\frac{\psi}{\gamma^2} = \frac{\sum X_i}{\sigma^2} +\frac{\nu}{\tau^2}
\end{displaymath}

This means that the conditional density of $\mu$ given the data is $N(\psi,\gamma^2)$. In other words the posterior mean of $\mu$ is

\begin{displaymath}\frac{\frac{n}{\sigma^2} \bar{X}+ \frac{1}{\tau^2} \nu}{
\frac{n}{\sigma^2} + \frac{1}{\tau^2} }
\end{displaymath}

which is a weighted average of the prior mean $\nu$ and the sample mean $\bar{X}$. Notice that the weight on the data is large when n is large or $\sigma$ is small (precise measurements) and small when $\tau$ is small (precise prior opinion).
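
As an illustration (my own sketch; the 25 measurements are simulated since the notes give no data), the posterior mean and variance for the speed-of-sound example can be computed directly from these formulas.

\begin{verbatim}
# Sketch: normal prior, normal model, known sigma; posterior mean and variance.
import numpy as np

rng = np.random.default_rng(1)
nu, tau = 330.0, 10.0                   # prior mean and sd for mu
sigma, n = 15.0, 25                     # known measurement sd, sample size
x = rng.normal(340.0, sigma, size=n)    # hypothetical measurements
xbar = x.mean()

prec = n / sigma**2 + 1 / tau**2        # 1/gamma^2
psi = (n / sigma**2 * xbar + nu / tau**2) / prec   # posterior mean
gamma2 = 1 / prec                                  # posterior variance

print(f"xbar = {xbar:.2f}")
print(f"posterior mean = {psi:.2f}, posterior sd = {gamma2 ** 0.5:.2f}")
\end{verbatim}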

Improper priors: When the prior density does not integrate to 1 we can still follow the machinery of Bayes' formula to derive a posterior. For instance, in the $N(\mu,\sigma^2)$ example consider the prior density $\pi(\mu) \equiv 1$. This ``density'' integrates to $\infty$, but using Bayes' theorem to compute the posterior gives

\begin{displaymath}\pi(\mu\vert X) = \frac{ (2\pi)^{-n/2} \sigma^{-n} \exp\{-\sum
(X_i-\mu)^2/(2\sigma^2)\}}{\int (2\pi)^{-n/2} \sigma^{-n} \exp\{-\sum
(X_i-\phi)^2/(2\sigma^2)\} d\phi}
\end{displaymath}

It is easy to see that this simplifies to the $\tau \to \infty$ limit of the proper-prior case above, namely a $N(\bar{X},\sigma^2/n)$ density. That is, the Bayes estimate of $\mu$ for this improper prior is $\bar{X}$.

Admissibility: Bayes procedures for proper priors are admissible. Since the posterior mean computed above is $w\bar{X}+(1-w)\nu$ with $w=(n/\sigma^2)/(n/\sigma^2+1/\tau^2)$, and every $w\in(0,1)$ arises from some $\tau>0$, it follows that for each $w\in (0,1)$ and each real $\nu$ the estimate

\begin{displaymath}w\bar{X} +(1-w)\nu
\end{displaymath}

is admissible. That this is also true for w=1, that is, that $\bar{X}$ itself is admissible, is much harder to prove.

Minimax estimation: The risk function of $\bar{X}$ is simply $\sigma^2/n$. That is, the risk function is constant since it does not depend on $\mu$. Were $\bar{X}$ Bayes for a proper prior this would prove that $\bar{X}$ is minimax. In fact this is also true but hard to prove.

Example: Suppose that, given p, X has a Binomial(n,p) distribution. We will give p a Beta$(\alpha,\beta)$ prior density

\begin{displaymath}\pi(p) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}
p^{\alpha-1}(1-p)^{\beta-1} \, .
\end{displaymath}
The joint ``density'' of X and p is

\begin{displaymath}\binom{n}{X} p^X (1-p)^{n-X} \,
\frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}
p^{\alpha-1}(1-p)^{\beta-1}
\end{displaymath}
so that the posterior density of p given X is of the form

\begin{displaymath}cp^{X+\alpha-1}(1-p)^{n-X+\beta-1}
\end{displaymath}

for a suitable normalizing constant c. But this is a Beta $(X+\alpha,n-X+\beta)$ density. The mean of a Beta $(\alpha,\beta)$ distribution is $\alpha/(\alpha+\beta)$. Thus the Bayes estimate of p is

\begin{displaymath}\frac{X+\alpha}{n+\alpha+\beta} = w\hat{p} +(1-w)
\frac{\alpha}{\alpha+\beta}
\end{displaymath}

where $\hat{p} =X/n$ is the usual MLE. Notice that this is again a weighted average of the prior mean and the MLE, with weight $w=n/(n+\alpha+\beta)$ on the MLE. Notice also that the prior is proper for $\alpha>0$ and $\beta>0$. To get w=1 we would take $\alpha=\beta=0$ and use the improper prior

\begin{displaymath}\frac{1}{p(1-p)}
\end{displaymath}

Again we learn that each $w\hat{p} + (1-w) p_0$ is admissible for $w\in (0,1)$. Again it is true that $\hat{p}$ is admissible, but our theorem is not adequate to prove this fact.
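
A small numerical check (mine, with made-up values of n, X, $\alpha$ and $\beta$) that the Bayes estimate $(X+\alpha)/(n+\alpha+\beta)$ equals the weighted average $w\hat{p}+(1-w)\alpha/(\alpha+\beta)$ with $w=n/(n+\alpha+\beta)$:

\begin{verbatim}
# Sketch: the two forms of the Bayes estimate of a binomial p agree.
n, X = 40, 13                 # hypothetical data
alpha, beta = 2.0, 2.0        # hypothetical prior parameters

bayes = (X + alpha) / (n + alpha + beta)
w = n / (n + alpha + beta)
phat = X / n
weighted = w * phat + (1 - w) * alpha / (alpha + beta)
print(bayes, weighted)        # identical values
\end{verbatim}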

The risk function of $w\hat{p} +(1-w) p_0$ is

\begin{displaymath}R(p) = E[(w\hat{p} +(1-w) p_0-p)^2]
\end{displaymath}

which is

\begin{displaymath}w^2 Var(\hat{p}) + (wp+(1-w)p_0-p)^2 = w^2 p(1-p)/n +(1-w)^2(p-p_0)^2
\end{displaymath}
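
As a check (my own sketch, with arbitrary choices of n, w and $p_0$), the closed form can be compared with a Monte Carlo estimate of the risk.

\begin{verbatim}
# Sketch: simulated versus exact risk of w*phat + (1-w)*p0.
import numpy as np

rng = np.random.default_rng(2)
n, w, p0 = 25, 0.7, 0.5
for p in (0.1, 0.3, 0.5, 0.9):
    X = rng.binomial(n, p, size=200000)
    est = w * X / n + (1 - w) * p0
    mc = np.mean((est - p) ** 2)
    exact = w**2 * p * (1 - p) / n + (1 - w)**2 * (p - p0) ** 2
    print(f"p={p}: simulated {mc:.5f}, formula {exact:.5f}")
\end{verbatim}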

This risk function will be constant if the coefficients of both $p^2$ and $p$ in the risk are 0. The coefficient of $p^2$ is

\begin{displaymath}-w^2/n +(1-w)^2
\end{displaymath}

so $w=\sqrt{n}/(1+\sqrt{n})$. The coefficient of $p$ is then

\begin{displaymath}w^2/n -2p_0(1-w)^2
\end{displaymath}

which will vanish if $2p_0=1$, that is, $p_0=1/2$. Working backwards we find that to get these values for $w$ and $p_0$ we require $\alpha=\beta$. Moreover, since $w=n/(n+\alpha+\beta)$, the equation

\begin{displaymath}w^2/(1-w)^2 = n
\end{displaymath}

gives

\begin{displaymath}(\alpha+\beta)^2 = n
\end{displaymath}

or $\alpha =\beta = \sqrt{n}/2$. The minimax estimate of p is

\begin{displaymath}\frac{X+\sqrt{n}/2}{n+\sqrt{n}} =
\frac{\sqrt{n}}{1+\sqrt{n}}\,\hat{p} + \frac{1}{1+\sqrt{n}} \cdot \frac{1}{2} \, .
\end{displaymath}
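
A quick numerical check (my own) that this estimate has constant risk: plugging $w=\sqrt{n}/(1+\sqrt{n})$ and $p_0=1/2$ into the risk formula gives $[p(1-p)+(p-1/2)^2]/(1+\sqrt{n})^2 = 1/\{4(1+\sqrt{n})^2\}$, free of p.

\begin{verbatim}
# Sketch: the risk of the minimax estimate does not depend on p.
import numpy as np

n = 25
w = np.sqrt(n) / (1 + np.sqrt(n))
for p in np.linspace(0.05, 0.95, 10):
    risk = w**2 * p * (1 - p) / n + (1 - w)**2 * (p - 0.5) ** 2
    print(f"p={p:.2f}  risk={risk:.6f}")   # always 1/(4*(1+sqrt(n))^2)
\end{verbatim}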


Example: Now suppose that $X_1,\ldots,X_n$ are iid $MVN(\mu,\Sigma)$ with $\Sigma$ known. Take as the improper prior for $\mu$ the constant density $\pi(\mu)\equiv 1$. The posterior density of $\mu$ given the data is then $MVN(\bar{X},\Sigma/n)$.

For multivariate estimation it is common to extend the notion of squared error loss by defining

\begin{displaymath}L(\hat\theta,\theta) = \sum (\hat\theta_i-\theta_i)^2 =
(\hat\theta-\theta)^t (\hat\theta-\theta) \, .
\end{displaymath}

For this loss function the risk is the sum of the MSEs of the individual components, and the Bayes estimate is again the posterior mean. Thus $\bar{X}$ is Bayes for an improper prior in this problem. It turns out that $\bar{X}$ is minimax; its risk function is the constant $trace(\Sigma)/n$. If the dimension p of $\mu$ is 1 or 2 then $\bar{X}$ is also admissible, but if $p \ge 3$ then it is inadmissible. This fact was first demonstrated by James and Stein, who produced an estimate which is better, in terms of this risk function, for every $\mu$. The ``improved'' estimator, called the James-Stein estimator, is essentially never used.
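
To illustrate the inadmissibility claim (my own sketch; the notes do not give the James-Stein formula), the following simulation compares the risk of $\bar{X}$ with that of a James-Stein type estimator which shrinks $\bar{X}$ toward 0, assuming $\Sigma=\sigma^2 I$ with $\sigma$ known and $p\ge 3$. The true $\mu$, the dimension and the sample size are arbitrary choices.

\begin{verbatim}
# Sketch: Monte Carlo risks of Xbar and of the James-Stein type estimate
# (1 - (p-2)*(sigma^2/n)/||Xbar||^2) * Xbar, for Sigma = sigma^2 * I.
import numpy as np

rng = np.random.default_rng(3)
p, n, sigma, reps = 5, 10, 1.0, 20000
mu = np.full(p, 0.5)                        # arbitrary true mean vector

x = rng.normal(mu, sigma, size=(reps, n, p))
xbar = x.mean(axis=1)                       # shape (reps, p)
norm2 = (xbar ** 2).sum(axis=1)
shrink = 1 - (p - 2) * sigma**2 / n / norm2
js = shrink[:, None] * xbar

risk_xbar = np.mean(((xbar - mu) ** 2).sum(axis=1))
risk_js = np.mean(((js - mu) ** 2).sum(axis=1))
print(f"risk of Xbar        : {risk_xbar:.4f} (exact value {p * sigma**2 / n})")
print(f"risk of James-Stein : {risk_js:.4f}")
\end{verbatim}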





Richard Lockhart
1998-11-30