
STAT 801 Lecture 25

Reading for Today's Lecture:

Goals of Today's Lecture:

Bayesian estimation

Now let's focus on the problem of estimating a one-dimensional parameter. Mean squared error corresponds to the loss function

\begin{displaymath}L(d,\theta) = (d-\theta)^2 \, .
\end{displaymath}

The risk function of a procedure (estimator) $\hat\theta$ is

\begin{displaymath}R_{\hat\theta}(\theta) =E_\theta[(\hat\theta-\theta)^2 ]
\end{displaymath}

Now consider using a prior with density $\pi(\theta)$. The Bayes risk of $\hat\theta$ is
\begin{align*}r_\pi & = \int R_{\hat\theta}(\theta)\pi(\theta) d\theta
\\
& = \int \int (\hat\theta(x) - \theta)^2 f(x;\theta) \pi(\theta) dx d\theta
\end{align*}

How should we choose $\hat\theta$ to minimize $r_\pi$? The solution lies in recognizing that $f(x;\theta)\pi(\theta)$ is really a joint density

\begin{displaymath}\int\int f(x;\theta)\pi(\theta)dx d\theta= 1
\end{displaymath}

For this joint density the conditional density of X given $\theta$ is just the model $f(x;\theta)$. From now on I write the model as $f(x\vert\theta)$ to emphasize this fact. We can now compute $r_\pi$ another way by factoring the joint density differently:

\begin{displaymath}f(x\vert\theta)\pi(\theta) = \pi(\theta\vert x)f(x)
\end{displaymath}

where now f(x) is the marginal density of X and $\pi(\theta\vert x)$ denotes the conditional density of $\theta$ given X. We call $\pi(\theta\vert x)$ the posterior density. It is found via Bayes' theorem (which is why this is Bayesian statistics):

\begin{displaymath}\pi(\theta\vert x ) = \frac{f(x\vert\theta)\pi(\theta)}{\int f(x\vert\phi)\pi(\phi)
d\phi}
\end{displaymath}

With this notation we can write

\begin{displaymath}r_\pi(\hat\theta) = \int \left[ \int (\hat\theta(x) - \theta)^2
\pi(\theta\vert x) d\theta \right] f(x) dx
\end{displaymath}

Now we can choose $\hat\theta(x)$ separately for each x to minimize the quantity in square brackets (as in the NP lemma). The quantity in square brackets is a quadratic function of $\hat\theta(x)$ and can be seen to be minimized by

\begin{displaymath}\hat\theta(x) = \int \theta \pi(\theta\vert x) d\theta
\end{displaymath}

which is

\begin{displaymath}E(\theta\vert X)
\end{displaymath}

and is called the posterior mean of $\theta$.
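To check this minimization explicitly, expand the quadratic around the posterior mean: for each fixed x,

\begin{displaymath}\int (\hat\theta(x) - \theta)^2 \pi(\theta\vert x) d\theta
= \left(\hat\theta(x) - E(\theta\vert x)\right)^2 + {\rm Var}(\theta\vert x)
\end{displaymath}

so the minimizing choice is $\hat\theta(x)=E(\theta\vert x)$ and the minimum value of the quantity in square brackets is the posterior variance.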

Example: Consider first the problem of estimating a normal mean $\mu$. Imagine, for example, that $\mu$ is the true speed of sound. I think this is around 330 metres per second and am pretty sure that my guess is within 30 metres per second of the truth. I might summarize my opinion by saying that I think $\mu$ has a normal distribution with mean $\nu=330$ and standard deviation $\tau=10$. That is, I take the prior density $\pi$ for $\mu$ to be $N(\nu,\tau^2)$. Before I make any measurements my best guess of $\mu$ minimizes

\begin{displaymath}\int (\hat\mu - \mu)^2 \frac{1}{\tau\sqrt{2\pi}}
\exp\{-(\mu-\nu)^2/(2\tau^2)\} d\mu
\end{displaymath}

This quantity is minimized by the prior mean of $\mu$, namely,

\begin{displaymath}\hat\mu = E_\pi(\mu) = \int \mu \pi(\mu) d\mu =\nu\, .
\end{displaymath}

Now suppose we collect 25 measurements of the speed of sound. I will assume that the measurements are unbiased and that the standard deviation of the measurement errors is $\sigma=15$, which I assume is known. Thus the model is that, conditional on $\mu$, $X_1,\ldots,X_n$ are iid $N(\mu,\sigma^2)$. The joint density of the data and $\mu$ is then
\begin{multline*}\frac{\exp\{-\sum (X_i-\mu)^2/(2\sigma^2)\}}{
(2\pi)^{n/2} \sigma^n}
\\
\times
\frac{\exp\{-(\mu-\nu)^2/(2\tau^2)\}}{
(2\pi)^{1/2} \tau}
\end{multline*}
Thus $(X_1,\ldots,X_n,\mu)$ has a multivariate normal distribution, so the conditional distribution of $\mu$ given $X_1,\ldots,X_n$ is normal.

Use standard MVN formulas to calculate conditional means and variances.
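As a reminder, the standard formulas are these: if $(Y,Z)$ is multivariate normal with mean vector $(\mu_Y,\mu_Z)$ and covariance matrix partitioned into blocks $\Sigma_{YY}$, $\Sigma_{YZ}$, $\Sigma_{ZY}$, $\Sigma_{ZZ}$, then

\begin{displaymath}E(Y\vert Z) = \mu_Y + \Sigma_{YZ}\Sigma_{ZZ}^{-1}(Z-\mu_Z)
\qquad
{\rm Var}(Y\vert Z) = \Sigma_{YY} - \Sigma_{YZ}\Sigma_{ZZ}^{-1}\Sigma_{ZY} \, .
\end{displaymath}

Applying these with $Y=\mu$ and $Z=(X_1,\ldots,X_n)$ gives the posterior mean and variance found below.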

Alternatively the exponent in the joint density is of the form

\begin{displaymath}-\frac{1}{2} \left[ \mu^2 /\gamma^2 - 2 \mu \psi/\gamma^2\right]
\end{displaymath}

plus terms not involving $\mu$ where

\begin{displaymath}\frac{1}{\gamma^2} = \left(\frac{n}{\sigma^2} + \frac{1}{\tau^2}\right)
\end{displaymath}

and

\begin{displaymath}\frac{\psi}{\gamma^2} = \frac{\sum X_i}{\sigma^2} +\frac{\nu}{\tau^2}
\end{displaymath}

This means that the conditional density of $\mu$ given the data is $N(\psi,\gamma^2)$. In other words the posterior mean of $\mu$ is

\begin{displaymath}\frac{\frac{n}{\sigma^2} \bar{X}+ \frac{1}{\tau^2} \nu}{
\frac{n}{\sigma^2} + \frac{1}{\tau^2} }
\end{displaymath}

which is a weighted average of the prior mean $\nu$ and the sample mean $\bar{X}$. Notice that the weight on the data is large when n is large or $\sigma$ is small (precise measurements) and small when $\tau$ is small (precise prior opinion).
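Plugging in the numbers from this example ($n=25$, $\sigma=15$, $\tau=10$, $\nu=330$) gives

\begin{displaymath}\frac{n}{\sigma^2} = \frac{25}{225} = \frac{1}{9}, \qquad
\frac{1}{\tau^2} = \frac{1}{100}, \qquad
\frac{n/\sigma^2}{n/\sigma^2 + 1/\tau^2} = \frac{100}{109} \approx 0.92
\end{displaymath}

so the posterior mean is roughly $0.92\bar{X} + 0.08\times 330$ and the posterior variance is $\gamma^2 = 900/109 \approx 8.3$, a posterior standard deviation of about 2.9 metres per second.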

Improper priors: When the density does not integrate to 1 we can still follow the machinery of Bayes' formula to derive a posterior. For instance in the $N(\mu,\sigma^2)$ example consider the prior density $\pi(\mu) \equiv 1$. This ``density'' integrates to $\infty$ but using Bayes' theorem to compute the posterior would give
\begin{multline*}\pi(\mu\vert X) =
\\
\frac{ (2\pi)^{-n/2} \sigma^{-n} \exp\{-\sum
(X_i-\mu)^2/(2\sigma^2)\}}{\int (2\pi)^{-n/2} \sigma^{-n} \exp\{-\sum
(X_i-\phi)^2/(2\sigma^2)\} d\phi}
\end{multline*}
The normal constants cancel and the posterior is just the $\tau\to\infty$ limit of the proper-prior case done above: a $N(\bar{X},\sigma^2/n)$ density. That is, the Bayes estimate of $\mu$ for this improper prior is $\bar{X}$.
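Explicitly, letting $\tau\to\infty$ in the posterior mean and variance found above gives

\begin{displaymath}\psi = \frac{\frac{n}{\sigma^2}\bar{X} + \frac{1}{\tau^2}\nu}{\frac{n}{\sigma^2} + \frac{1}{\tau^2}} \to \bar{X}
\qquad \mbox{and} \qquad
\gamma^2 = \frac{1}{\frac{n}{\sigma^2} + \frac{1}{\tau^2}} \to \frac{\sigma^2}{n} \, .
\end{displaymath}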

Admissibility: Bayes procedures corresponding to proper priors are admissible. It follows that for each $w\in (0,1)$ and each real $\nu$ the estimate

\begin{displaymath}w\bar{X} +(1-w)\nu
\end{displaymath}

is admissible. That this is also true for $w=1$, that is, that $\bar{X}$ is admissible, is much harder to prove.
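The connection with the prior is this: the posterior mean computed above has exactly the form $w\bar{X}+(1-w)\nu$ with

\begin{displaymath}w = \frac{n/\sigma^2}{n/\sigma^2 + 1/\tau^2}
\end{displaymath}

and as $\tau$ ranges over $(0,\infty)$ this weight sweeps out all of $(0,1)$.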

Minimax estimation: The risk function of $\bar{X}$ is simply $\sigma^2/n$. That is, the risk function is constant since it does not depend on $\mu$. Were $\bar{X}$ Bayes for a proper prior this would prove that $\bar{X}$ is minimax. In fact $\bar{X}$ is minimax, but this is hard to prove.

Example: Given $p$, $X$ has a Binomial$(n,p)$ distribution. Give $p$ a Beta$(\alpha,\beta)$ prior density

\begin{displaymath}\pi(p) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}
p^{\alpha-1} (1-p)^{\beta-1}
\end{displaymath}

The joint ``density'' of X and p is

\begin{displaymath}\dbinom{n}{X} p^X(1-p)^{n-X}
\frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}
p^{\alpha-1} (1-p)^{\beta-1}\, ;
\end{displaymath}

so the posterior density of $p$ given $X$ is of the form

\begin{displaymath}cp^{X+\alpha-1}(1-p)^{n-X+\beta-1}
\end{displaymath}

for a suitable normalizing constant $c$. This is the Beta$(X+\alpha,n-X+\beta)$ density. The mean of a Beta$(\alpha,\beta)$ distribution is $\alpha/(\alpha+\beta)$, so the Bayes estimate of $p$ is

\begin{displaymath}\frac{X+\alpha}{n+\alpha+\beta} = w\hat{p} +(1-w)
\frac{\alpha}{\alpha+\beta}
\end{displaymath}

where $\hat{p} = X/n$ is the usual mle and $w = n/(n+\alpha+\beta)$. Notice that this is again a weighted average of the prior mean and the mle, and that the prior is proper for $\alpha>0$ and $\beta>0$. To get $w=1$ take $\alpha=\beta=0$, which corresponds to the improper prior

\begin{displaymath}\frac{1}{p(1-p)}
\end{displaymath}

Again we learn that each $w\hat{p} + (1-w) p_0$ is admissible for $w\in (0,1)$. Again $\hat{p}$ is admissible but our theorem is not adequate to prove this fact.

The risk function of $w\hat{p} +(1-w) p_0$ is

\begin{displaymath}R(p) = E[(w\hat{p} +(1-w) p_0-p)^2]
\end{displaymath}

which is
\begin{multline*}w^2 Var(\hat{p}) + (wp+(1-w)p_0-p)^2
\\
=
\\
w^2 p(1-p)/n +(1-w)^2(p-p_0)^2
\end{multline*}
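Collecting the risk as a polynomial in $p$ makes the next step clearer:

\begin{displaymath}R(p) = \left[(1-w)^2 - \frac{w^2}{n}\right] p^2
+ \left[\frac{w^2}{n} - 2p_0(1-w)^2\right] p + (1-w)^2 p_0^2 \, .
\end{displaymath}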
The risk function is constant if the coefficients of $p^2$ and $p$ in the risk are 0. The coefficient of $p^2$ is

\begin{displaymath}-w^2/n +(1-w)^2
\end{displaymath}

so $w=\sqrt{n}/(1+\sqrt{n})$. The coefficient of $p$ is then

\begin{displaymath}w^2/n -2p_0(1-w)^2
\end{displaymath}

which vanishes if $2p_0=1$, or $p_0=1/2$. Working backwards we find that to get these values for $w$ and $p_0$ we require $\alpha=\beta$. Moreover

\begin{displaymath}w^2/(1-w)^2 = n
\end{displaymath}

gives

\begin{displaymath}n/(\alpha+\beta) =\sqrt{n}
\end{displaymath}

or $\alpha = \beta = \sqrt{n}/2$. The minimax estimate of $p$ is therefore

\begin{displaymath}\frac{\sqrt{n}}{1+\sqrt{n}} \hat{p} + \frac{1}{1+\sqrt{n}} \frac{1}{2}
\end{displaymath}
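As a check that the risk really is constant, substitute $w=\sqrt{n}/(1+\sqrt{n})$ and $p_0=1/2$, so that $w^2/n = (1-w)^2 = 1/(1+\sqrt{n})^2$, into the risk computed above:

\begin{displaymath}R(p) = \frac{p(1-p) + (p-1/2)^2}{(1+\sqrt{n})^2}
= \frac{1}{4(1+\sqrt{n})^2}
\end{displaymath}

which does not depend on $p$.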

Example: Now suppose that $X_1,\ldots,X_n$ are iid $MVN(\mu,\Sigma)$ with $\Sigma$ known. Take as improper prior for $\mu$ the constant density. The posterior density of $\mu$ given $X$ is then $MVN(\bar{X},\Sigma/n)$.

For multivariate estimation it is common to extend the notion of squared error loss by defining

\begin{displaymath}L(\hat\theta,\theta) = \sum (\hat\theta_i-\theta_i)^2 =
(\hat\theta-\theta)^t (\hat\theta-\theta) \, .
\end{displaymath}

For this loss function the risk is the sum of the MSEs of the individual components and the Bayes estimate is again the posterior mean. Thus $\bar{X}$ is Bayes for an improper prior in this problem. It turns out that $\bar{X}$ is minimax; its risk function is the constant $trace(\Sigma)/n$. If the dimension p of $\theta$ is 1 or 2 then $\bar{X}$ is also admissible, but if $p \ge 3$ then it is inadmissible. This fact was first demonstrated by James and Stein, who produced an estimate which is better, in terms of this risk function, for every $\mu$. The ``improved'' estimator, called the James-Stein estimator, is essentially never used.
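For reference only (the general form is not derived in these notes, and the case of a general $\Sigma$ is more complicated): when $\Sigma=\sigma^2 I$ the James-Stein estimator shrinks $\bar{X}$ toward the origin,

\begin{displaymath}\hat\mu_{JS} = \left(1 - \frac{(p-2)\sigma^2}{n \Vert\bar{X}\Vert^2}\right) \bar{X} \, ,
\end{displaymath}

with a data-dependent amount of shrinkage.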

Hypothesis Testing and Decision Theory

Decision analysis of hypothesis testing takes $D=\{0,1\}$ and

\begin{displaymath}L(d,\theta)=1(\mbox{make an error})
\end{displaymath}

or more generally $L(0,\theta) = \ell_1 1(\theta\in\Theta_1)$ and $L(1,\theta) = \ell_0 1(\theta\in\Theta_0)$ for two positive constants $\ell_0$ and $\ell_1$. We make the decision space convex by allowing a decision to be a probability measure on D. Any such measure can be specified by $\delta=P(\mbox{reject})$ so ${\cal D} =[0,1]$. The loss function of $\delta\in[0,1]$ is

\begin{displaymath}L(\delta,\theta) = (1-\delta)\ell_1 1(\theta\in\Theta_1) + \delta \ell_0
1(\theta\in\Theta_0) \, .
\end{displaymath}

Simple hypotheses: The prior is $\pi_0>0$ and $\pi_1>0$ with $\pi_0+\pi_1=1$.

A procedure is a map from the sample space to $\cal D$, that is, a test function $\phi$.

Risk function of procedure $\phi(X)$ is a pair of numbers:

\begin{displaymath}R_\phi(\theta_0) = E_0(L(\phi(X),\theta_0))
\end{displaymath}

and

\begin{displaymath}R_\phi(\theta_1) = E_1(L(\phi(X),\theta_1))
\end{displaymath}

We find

\begin{displaymath}R_\phi(\theta_0) = \ell_0 E_0(\phi(X)) =\ell_0\alpha
\end{displaymath}

and

\begin{displaymath}R_\phi(\theta_1) = \ell_1 E_1(1-\phi(X)) = \ell_1 \beta
\end{displaymath}

The Bayes risk of $\phi$ is

\begin{displaymath}\pi_0\ell_0 \alpha+ \pi_1\ell_1\beta
\end{displaymath}

We saw in the hypothesis testing section that this is minimized by

\begin{displaymath}\phi(X) = 1(f_1(X)/f_0(X) > \pi_0\ell_0/(\pi_1\ell_1))
\end{displaymath}

which is a likelihood ratio test. These tests are Bayes and admissible. The risk is constant if $\beta\ell_1 = \alpha\ell_0$; you can use this to find the minimax test in this context.
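To see where this test comes from, rewrite the Bayes risk as an integral over the sample space:

\begin{displaymath}\pi_0\ell_0 \alpha+ \pi_1\ell_1\beta
= \pi_1\ell_1 + \int \phi(x)\left[\pi_0\ell_0 f_0(x) - \pi_1\ell_1 f_1(x)\right] dx
\end{displaymath}

which is minimized by taking $\phi(x)=1$ exactly when the term in square brackets is negative, that is, when $f_1(x)/f_0(x) > \pi_0\ell_0/(\pi_1\ell_1)$.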


Richard Lockhart
2000-03-21