
STAT 804: Notes on Lecture 9

Fitting higher order autoregressions

For the model

$\displaystyle X_t- \mu = \sum_1^p a_i(X_{t-i}-\mu) + \epsilon_t
$

we will use conditional likelihood again. Let $ \phi$ denote the vector $ (a_1,\ldots, a_p)^t$. Now we condition on the first $ p$ values of $ X$ and use

$\displaystyle \ell_c(\phi,\mu,\sigma) = -\frac{1}{2\sigma^2}
\sum_p^{T-1} \left[ X_t-\mu - \sum_1^p a_i(X_{t-i} - \mu)\right]^2
-(T-p)\log(\sigma)
$

If we estimate $ \mu$ using $ \bar{X}$ we find that we are trying to maximize

$\displaystyle -\frac{1}{2\sigma^2}
\sum_p^{T-1} \left[ X_t-\bar{X} - \sum_1^p a_i(X_{t-i} - \bar{X})\right]^2
-(T-p)\log(\sigma)
$

To estimate $ a_1,\ldots,a_p$ we then merely minimize the sum of squares

$\displaystyle \sum_p^{T-1} \hat\epsilon_t^2 = \sum_p^{T-1} \left[ X_t-\bar{X} - \sum_1^p
a_i(X_{t-i} - \bar{X})\right]^2
$

This is a straightforward regression problem. We regress the response vector

$\displaystyle \left[\begin{array}{c} X_p -\bar{X} \\  \vdots \\  X_{T-1} -\bar{X} \end{array} \right]
$

on the design matrix

$\displaystyle \left[\begin{array}{ccc}
X_{p-1} -\bar{X}& \cdots & X_0 -\bar{X}
\\
\vdots & & \vdots
\\
X_{T-2} -\bar{X}& \cdots & X_{T-p-1} -\bar{X}
\end{array}\right]
$
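
A minimal numerical sketch of this regression, assuming numpy is available; the function name ar_fit_cls and its interface are illustrative, not part of the notes.

import numpy as np

def ar_fit_cls(x, p):
    """Conditional least squares for an AR(p), centering at the sample mean.

    Regresses X_t - Xbar (t = p, ..., T-1) on the p lagged centered values.
    """
    x = np.asarray(x, dtype=float)
    T = len(x)
    xc = x - x.mean()                         # centre at Xbar
    y = xc[p:]                                # response: X_p - Xbar, ..., X_{T-1} - Xbar
    # Design matrix: column i holds X_{t-i} - Xbar for t = p, ..., T-1
    Z = np.column_stack([xc[p - i: T - i] for i in range(1, p + 1)])
    a_hat, *_ = np.linalg.lstsq(Z, y, rcond=None)
    resid = y - Z @ a_hat
    sigma_hat = np.sqrt(np.sum(resid ** 2) / (T - p))
    return a_hat, sigma_hat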

An alternative to estimating $ \mu$ by $ \bar{X}$ is to define $ \alpha = \mu(1-\sum a_i)$ and then recognize that

$\displaystyle \ell(\alpha,\phi,\sigma) = -\frac{1}{2\sigma^2}
\sum_p^{T-1} \left[ X_t-\alpha - \sum_1^p a_iX_{t-i}\right]^2
-(T-p)\log(\sigma)
$

is maximized by regressing the vector

$\displaystyle \left[\begin{array}{c} X_p \\  \vdots \\  X_{T-1} \end{array} \right]
$

on the design matrix

$\displaystyle \left[\begin{array}{ccc}
X_{p-1} & \cdots & X_0
\\
\vdots &
\vdots & \vdots
\\
X_{T-2} & \cdots & X_{T-p-1}
\end{array}\right]
$

From $ \hat\alpha$ and $ \hat\phi$ we would get an estimate for $ \mu$ by

$\displaystyle \hat\mu = \frac{\hat\alpha}{1-\sum \hat{a}_i}$
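
A sketch of this alternative fit, again assuming numpy; the name ar_fit_intercept is illustrative. It regresses the raw series on an intercept and the raw lags and then recovers $ \hat\mu$ from $ \hat\alpha$ and $ \hat\phi$.

import numpy as np

def ar_fit_intercept(x, p):
    """Fit X_t = alpha + sum_i a_i X_{t-i} + eps_t by least squares,
    then recover mu_hat = alpha_hat / (1 - sum_i a_i_hat)."""
    x = np.asarray(x, dtype=float)
    T = len(x)
    y = x[p:]
    Z = np.column_stack([np.ones(T - p)] +
                        [x[p - i: T - i] for i in range(1, p + 1)])
    coef, *_ = np.linalg.lstsq(Z, y, rcond=None)
    alpha_hat, a_hat = coef[0], coef[1:]
    mu_hat = alpha_hat / (1.0 - a_hat.sum())
    return mu_hat, a_hat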

Notice that if we put

$\displaystyle Z=\left[\begin{array}{ccc}
X_{p-1} -\bar{X}& \cdots & X_0 -\bar{X}
\\
\vdots & & \vdots
\\
X_{T-2} -\bar{X}& \cdots & X_{T-p-1} -\bar{X}
\end{array}\right]
$

then

$\displaystyle Z^tZ \approx T \left[\begin{array}{cccc}
\hat{C}(0) & \hat{C}(1) & \cdots & \hat{C}(p-1)
\\
\hat{C}(1) & \hat{C}(0) & \cdots & \hat{C}(p-2)
\\
\vdots & \vdots & \ddots & \vdots
\\
\hat{C}(p-1) & \hat{C}(p-2) & \cdots & \hat{C}(0)
\end{array}\right]
$

and if

$\displaystyle Y=\left[\begin{array}{c} X_p -\bar{X} \\  \vdots \\  X_{T-1} -\bar{X}
\end{array} \right]
$

then

$\displaystyle Z^tY \approx T \left[\begin{array}{c} \hat{C}(1) \\  \vdots \\  \hat{C}(p)
\end{array}\right]
$

so that the normal equations (from least squares)

$\displaystyle Z^t Z \phi = Z^t Y
$

are nearly the Yule-Walker equations again.
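
Equivalently, one can work directly with the divisor-$ T$ sample autocovariances $ \hat{C}(k)$ and solve the Yule-Walker system; a sketch assuming numpy (both function names are illustrative).

import numpy as np

def autocov(x, max_lag):
    """Sample autocovariances C_hat(0), ..., C_hat(max_lag), divisor T."""
    x = np.asarray(x, dtype=float)
    T = len(x)
    xc = x - x.mean()
    return np.array([np.sum(xc[k:] * xc[:T - k]) / T for k in range(max_lag + 1)])

def yule_walker(x, p):
    """Solve the Yule-Walker equations for an AR(p)."""
    C = autocov(x, p)
    # Toeplitz matrix with (i, j) entry C_hat(|i - j|), i.e. Z^t Z / T above
    G = np.array([[C[abs(i - j)] for j in range(p)] for i in range(p)])
    phi = np.linalg.solve(G, C[1:])
    sigma2 = C[0] - phi @ C[1:]               # innovation variance estimate
    return phi, sigma2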

Full maximum likelihood

To compute a full MLE of $ \theta=(\mu,\phi,\sigma)$ one generally begins by finding preliminary estimates $ \hat\theta$, say by one of the conditional likelihood methods above, and then iterates via Newton-Raphson or some other scheme for numerical maximization of the log-likelihood.
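
The pattern is to build an objective function, start from the preliminary fit, and hand the problem to a numerical optimizer. The sketch below, assuming numpy and scipy, uses the conditional log-likelihood as the objective for brevity; a genuine full-likelihood fit would substitute the exact Gaussian log-likelihood, which also involves the stationary distribution of the first $ p$ observations.

import numpy as np
from scipy.optimize import minimize

def neg_cond_loglik(theta, x, p):
    """Negative conditional log-likelihood of an AR(p); theta = (mu, a_1, ..., a_p, log sigma)."""
    mu, a, log_sigma = theta[0], theta[1:p + 1], theta[-1]
    sigma = np.exp(log_sigma)                 # parametrize by log sigma to keep sigma > 0
    T = len(x)
    xc = x - mu
    resid = xc[p:] - sum(a[i - 1] * xc[p - i: T - i] for i in range(1, p + 1))
    return np.sum(resid ** 2) / (2 * sigma ** 2) + (T - p) * log_sigma

def ar_fit_numeric(x, p, theta0):
    """Numerically maximize the log-likelihood starting from a preliminary estimate theta0."""
    res = minimize(neg_cond_loglik, theta0, args=(np.asarray(x, dtype=float), p),
                   method="BFGS")
    return res.x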

Fitting $ MA(q)$ models

Here we consider the model with known mean (generally this will mean we estimate $ \hat\mu=\bar{X}$ and subtract the mean from all the observations):

$\displaystyle X_t = \epsilon_t - b_1 \epsilon_{t-1} - \cdots - b_q\epsilon_{t-q}
$

In general $ X$ has a $ MVN(0,\Sigma)$ distribution and, letting $ \psi$ denote the vector of $ b_i$s, we find

$\displaystyle \ell(\psi,\sigma) = -\frac{1}{2}\left[\log(\det(\Sigma))
+ X^t \Sigma^{-1} X\right]
$

Here $ X$ denotes the column vector of all the data. As an example consider $ q=1$ so that

$\displaystyle \Sigma = \left[\begin{array}{ccccc}
\sigma^2(1+b_1^2) & -b_1\sigma^2 & 0 & \cdots & 0
\\
-b_1\sigma^2 & \sigma^2(1+b_1^2) & -b_1\sigma^2 & \cdots & 0
\\
\vdots & \ddots & \ddots & \ddots & \vdots
\\
0 & \cdots & \cdots & -b_1\sigma^2& \sigma^2(1+b_1^2)
\end{array}\right]
$
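
For modest $ T$ this banded matrix can simply be built and the Gaussian log-likelihood evaluated numerically; a sketch assuming numpy (the function name is illustrative, and the additive $ \log(2\pi)$ constant is kept for completeness).

import numpy as np

def ma1_loglik_exact(x, b1, sigma):
    """Exact Gaussian log-likelihood of a mean-zero MA(1): X_t = eps_t - b1 eps_{t-1}."""
    x = np.asarray(x, dtype=float)
    T = len(x)
    # Banded covariance matrix: C(0) on the diagonal, C(1) = -b1 sigma^2 beside it
    Sigma = sigma ** 2 * ((1 + b1 ** 2) * np.eye(T)
                          - b1 * (np.eye(T, k=1) + np.eye(T, k=-1)))
    sign, logdet = np.linalg.slogdet(Sigma)
    quad = x @ np.linalg.solve(Sigma, x)
    return -0.5 * (logdet + quad + T * np.log(2 * np.pi))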

It is not so easy to work with the determinant and inverse of matrices like this analytically, however, or for long series. Instead we try to mimic the conditional inference approach above but with a twist; we now condition on something we haven't observed, namely $ \epsilon_{-1}$.

Notice that

$\displaystyle X_0$ $\displaystyle = \epsilon_0 - b\epsilon_{-1}$    
$\displaystyle X_1$ $\displaystyle = \epsilon_1 - b\epsilon_0$    
  $\displaystyle = \epsilon_1 - b(X_0+b\epsilon_{-1})$    
$\displaystyle X_2$ $\displaystyle = \epsilon_2 - b\epsilon_1$    
  $\displaystyle = \epsilon_2 - b(X_1+b(X_0+b\epsilon_{-1}))$    
$\displaystyle \vdots$      
$\displaystyle X_{T-1}$ $\displaystyle = \epsilon_{T-1} - b(X_{T-2} + b (X_{T-3} + \cdots + b(X_0 + b\epsilon_{-1})\cdots))$    
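
Equivalently, the $ \epsilon$s can be reconstructed recursively from the data and an assumed value of $ \epsilon_{-1}$ using $ \epsilon_t = X_t + b\epsilon_{t-1}$; a sketch assuming numpy, with the illustrative helper name ma1_residuals reused in the sketches below.

import numpy as np

def ma1_residuals(x, b, eps_init=0.0):
    """Reconstruct eps_0, ..., eps_{T-1} for X_t = eps_t - b eps_{t-1},
    given an assumed starting value eps_{-1} = eps_init."""
    eps = np.empty(len(x))
    prev = eps_init
    for t, xt in enumerate(x):
        eps[t] = xt + b * prev                # eps_t = X_t + b eps_{t-1}
        prev = eps[t]
    return eps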

Now imagine that the data were actually

$\displaystyle \epsilon_{-1}, X_0,\ldots,X_{T-1}
$

Then the same idea we used for an $ AR(1)$ would give

$\displaystyle \ell(b,\sigma)$ $\displaystyle = \log(f(\epsilon_{-1},\sigma)) + \log(f(X_0,\ldots,X_{T-1}\vert\epsilon_{-1},b,\sigma))$    
  $\displaystyle = \log(f(\epsilon_{-1},\sigma)) + \sum_0^{T-1} \log(f(X_t\vert X_{t-1},\ldots,X_0,\epsilon_{-1},b,\sigma))$    

The parameters are listed in the conditions in this formula merely to indicate which terms depend on which parameters. For Gaussian $ \epsilon$s the terms in this likelihood are squares as usual (plus logarithms of $ \sigma$) leading to

$\displaystyle \ell(b,\sigma) =$ $\displaystyle -\frac{\epsilon_{-1}^2}{2\sigma^2} - \log(\sigma)$    
  $\displaystyle - \sum_0^{T-1} \left[\frac{1}{2\sigma^2} \left(X_t + b X_{t-1} + b^2 X_{t-2} + \cdots + b^{t+1} \epsilon_{-1}\right)^2+\log(\sigma)\right]$    

We will estimate the parameters by maximizing this function after getting rid of $ \epsilon_{-1}$ somehow.

Method A: Put $ \epsilon_{-1}=0$ since 0 is the most probable value and maximize

$\displaystyle -T\log(\sigma) - \frac{1}{2\sigma^2} \sum_0^{T-1} \left[X_t + b X_{t-1}
+ b^2 X_{t-2} + \cdots + b^{t}X_0\right]^2
$

Notice that for large $ T$ the coefficient $ b^{t+1}$ of $ \epsilon_{-1}$ is close to 0 for most $ t$ (provided $ \vert b\vert < 1$) and the remaining few terms are negligible relative to the total.
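
Method A therefore amounts to minimizing the conditional sum of squares of the reconstructed residuals with $ \epsilon_{-1}=0$; a sketch assuming scipy and the ma1_residuals helper above, with the search restricted to the invertible range $ \vert b\vert <1$.

import numpy as np
from scipy.optimize import minimize_scalar

def ma1_fit_conditional(x):
    """Method A: set eps_{-1} = 0 and minimize the conditional sum of squares over b."""
    x = np.asarray(x, dtype=float)

    def css(b):
        return np.sum(ma1_residuals(x, b, eps_init=0.0) ** 2)

    res = minimize_scalar(css, bounds=(-0.999, 0.999), method="bounded")
    b_hat = res.x
    sigma_hat = np.sqrt(css(b_hat) / len(x))  # concentrated estimate of sigma
    return b_hat, sigma_hat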

Method B: Backcasting is the process of guessing $ \epsilon_{-1}$ on the basis of the data; we replace $ \epsilon_{-1}$ in the log likelihood by

   E$\displaystyle (\epsilon_{-1}\vert X_0,\ldots,X_{T-1})\, .
$

The problem is that this quantity depends on $ b$ and $ \sigma$.

We will use the EM algorithm to solve this problem. The algorithm can be applied when we have (real or imaginary) missing data. Suppose the data we have is $ X$, some other data we didn't observe is $ Y$, and $ Z=(X,Y)$. It often happens that we can think of a $ Y$ we didn't observe in such a way that the likelihood for the whole data set $ Z$ would be simple. In that case we can try to maximize the likelihood for $ X$ by following an algorithm first discussed in detail by Dempster, Laird and Rubin. This algorithm has two steps:

  1. The E or Expectation step. We ``estimate'' the missing data $ Y$ by computing E$ (Y\vert X)$. Technically, we are supposed to estimate the likelihood function based on $ Z$. Factor the density of $ Z$ as

    $\displaystyle f_Z = f_{Y\vert X}f_X
$

    and take logs to get

    $\displaystyle \ell(\theta\vert Z) = \log(f_{Y\vert X}) + \ell(\theta\vert X)
$

    We actually estimate the log conditional density (which is a function of $ \theta$) by computing

       E$\displaystyle _{\theta_0}(\log(f_{Y\vert X})\vert X)
$

    Notice the subscript $ \theta_0$ on E. This indicates that you have to know the parameter to compute the conditional expectation. Notice too that there is another $ \theta$ in the conditional expectation - the log conditional density has a parameter in it.

  2. The M or Maximization step. We then maximize our estimate of $ \ell(\theta\vert Z)$ to get a new value $ \theta_1$ for $ \theta$. Go back to step 1 with this $ \theta_1$ replacing $ \theta_0$ and iterate.

To get started we need a preliminary estimate. Now look at our problem. In our case the quantity $ Y$ is $ \epsilon_{-1}$. Rather than work with the log-likelihood directly we work with $ Y$. Our preliminary estimate of $ Y$ is 0. We use this value to estimate $ \theta$ as above getting an estimate $ \theta_0$. Then we compute E$ _{\theta_0}(\epsilon_{-1}\vert X)$ and replace $ \epsilon_{-1}$ in the log-likelihood above by this conditional expectation. Then iterate. This process of guessing $ \epsilon_{-1}$ is called backcasting.
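
A sketch of this backcasting iteration for the MA(1), assuming numpy and scipy and the illustrative helpers ma1_residuals and ma1_fit_conditional defined above. It follows the imputation scheme described here (replace $ \epsilon_{-1}$ by its conditional mean, refit, repeat) rather than a full E-step over the log-likelihood; the backcast uses joint normality, under which E$ (\epsilon_{-1}\vert X)= c^t\Sigma^{-1}X$ with $ c = (-b\sigma^2, 0,\ldots,0)^t$.

import numpy as np
from scipy.optimize import minimize_scalar

def backcast_eps(x, b, sigma):
    """E(eps_{-1} | X_0, ..., X_{T-1}) for the MA(1), via joint normality."""
    T = len(x)
    # Banded covariance matrix of X, as displayed earlier
    Sigma = sigma ** 2 * ((1 + b ** 2) * np.eye(T)
                          - b * (np.eye(T, k=1) + np.eye(T, k=-1)))
    c = np.zeros(T)
    c[0] = -b * sigma ** 2                    # Cov(eps_{-1}, X_0); later covariances vanish
    return c @ np.linalg.solve(Sigma, x)

def ma1_fit_backcast(x, n_iter=20):
    """Alternate between refitting (b, sigma) for the current guess of eps_{-1}
    and re-guessing eps_{-1}; start from eps_{-1} = 0 as in Method A."""
    x = np.asarray(x, dtype=float)
    b_hat, sigma_hat = ma1_fit_conditional(x)     # preliminary estimate
    for _ in range(n_iter):
        eps_init = backcast_eps(x, b_hat, sigma_hat)
        css = lambda b: np.sum(ma1_residuals(x, b, eps_init) ** 2)
        res = minimize_scalar(css, bounds=(-0.999, 0.999), method="bounded")
        b_hat = res.x
        sigma_hat = np.sqrt(css(b_hat) / len(x))
    return b_hat, sigma_hat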

Summary





Richard Lockhart
2001-09-30