
STAT 804: Notes on Lecture 8

Fitting $ ARIMA(p,d,q)$ models to data

Fitting the $ I$ part is easy: we simply difference $ d$ times. The same observation applies to the seasonal multiplicative model. Thus to fit an ARIMA$ (p,d,q)$ model to $ X$ you compute $ Y =(I-B)^d X$ (shortening your data set by $ d$ observations) and then fit an $ ARMA(p,q)$ model to $ Y$. So from now on we assume that $ d=0$.
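
For concreteness, here is a minimal Python sketch of the differencing step; the series x and the order d below are illustrative only, not part of the notes.

import numpy as np

# Hypothetical integrated series; in practice x would hold the observed data X_0, ..., X_{T-1}.
x = np.cumsum(np.random.randn(200))   # a random walk, i.e. d = 1
d = 1
y = np.diff(x, n=d)                   # Y = (I-B)^d X; len(y) == len(x) - d
# An ARMA(p,q) model would now be fitted to y.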

Simplest case: fitting the AR(1) model

$\displaystyle X_t = \mu+\rho(X_{t-1}-\mu) + \epsilon_t
$

We must estimate 3 parameters: $ \mu, \rho$ and $ \sigma^2=$Var$ (\epsilon_t)$.

Our basic strategy will be as follows: the full likelihood is generally rather complicated, so we will use conditional likelihoods and ad hoc estimates of some parameters to simplify the situation.

The likelihood: Gaussian data

If the errors $ \epsilon$ are normal then so is the series $ X$. In general the vector $ X=(X_0,\ldots,X_{T-1})^t$ has a $ MVN({\bf\mu},\Sigma)$ distribution where $ \Sigma_{ij} = C(i-j)$ and $ {\bf\mu}$ is a vector all of whose entries are $ \mu$. The joint density of $ X$ is

$\displaystyle f_X(x) = \frac{1}{(2\pi)^{T/2} \det(\Sigma)^{1/2}} \exp\left\{-\frac{1}{2}
(x-{\bf\mu})^t \Sigma^{-1} (x-{\bf\mu})\right\}
$

so that the log likelihood is

$\displaystyle \ell(\mu,a_1,\ldots,a_p,b_1,\ldots,b_q,\sigma) =
-\frac{1}{2}\left[
(x-{\bf\mu})^t \Sigma^{-1} (x-{\bf\mu}) + \log(\det(\Sigma))\right]
$

Here I have indicated precisely (for an $ ARMA(p,q)$) the parameters on which the quantity depends.

It is possible to carry out full maximum likelihood by maximizing the quantity in question numerically. In general this is hard, however.
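
As an illustration of the quantity being maximized, the following Python sketch evaluates the Gaussian log-likelihood above for an AR(1), whose autocovariance is $ C(h)=\sigma^2\rho^{\vert h\vert}/(1-\rho^2)$; the parameter values and simulated data are illustrative only.

import numpy as np

def gaussian_loglik(x, mu, Sigma):
    # -(1/2)[(x - mu)' Sigma^{-1} (x - mu) + log det(Sigma)], dropping the constant -(T/2) log(2 pi)
    r = x - mu
    sign, logdet = np.linalg.slogdet(Sigma)
    return -0.5 * (r @ np.linalg.solve(Sigma, r) + logdet)

# Sigma_{ij} = C(i - j) for an AR(1): C(h) = sigma^2 * rho^|h| / (1 - rho^2)
T, mu, rho, sigma = 100, 0.0, 0.6, 1.0
lags = np.abs(np.subtract.outer(np.arange(T), np.arange(T)))
Sigma = sigma**2 * rho**lags / (1 - rho**2)
x = np.random.multivariate_normal(np.full(T, mu), Sigma)
print(gaussian_loglik(x, mu, Sigma))

Full maximum likelihood would maximize this function over all the parameters numerically, which is exactly the hard optimization referred to above.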

Here I indicate some standard tactics. In your homework I will be asking you to carry through this analysis for one particular model.

The $ AR(1)$ model

Consider the model

$\displaystyle X_t-\mu = \rho(X_{t-1} - \mu) + \epsilon_t
$

This model formula permits us to write down the joint density of $ X$ in a simpler way:

$\displaystyle f_X = f_{X_{T-1}\vert X_{T-2},\ldots,X_0} f_{X_{T-2}\vert X_{T-3},\ldots,X_0}\cdots f_{X_1\vert X_0} f_{X_0}
$

Each of the conditional densities is simply

$\displaystyle f_{X_k\vert X_{k-1},\ldots,X_0} (x_k\vert x_{k-1},\ldots,x_{0}) =
g\left[x_k-\mu-\rho(x_{k-1}-\mu)\right]
$

where $ g$ is the density of an individual $ \epsilon$. For iid $ N(0,\sigma^2)$ errors this gives a log likelihood which is

$\displaystyle \ell(\mu,\rho,\sigma) = - \frac{1}{2\sigma^2}\sum_1^{T-1} \left[x_k-\mu-\rho(x_{k-1}-\mu)\right]^2
-(T-1)\log(\sigma) + \log(f_{X_0}(x_0))
$

Now for a stationary series I showed that $ X_t \sim N(\mu,\sigma^2/(1-\rho^2))$ so that

$\displaystyle \log(f_{X_0}(x_0)) = -\frac{1-\rho^2}{2\sigma^2}(x_0-\mu)^2 -\log(\sigma) + \frac{1}{2}\log(1-\rho^2)
$

This makes

$\displaystyle \ell(\mu,\rho,\sigma)$ $\displaystyle = - \frac{1}{2\sigma^2} \left\{ \sum_1^{T-1}\left[x_k-\mu-\rho(x_{k-1}-\mu)\right]^2 +(1-\rho^2)(x_0-\mu)^2\right\}$    
  $\displaystyle \qquad -T\log(\sigma) + \frac{1}{2}\log(1-\rho^2)$    

We can maximize this over $ \mu$ and $ \sigma$ explicitly. First

$\displaystyle \frac{\partial}{\partial\mu} \ell = \frac{1}{\sigma^2} \left\{
\sum_1^{T-1}\left[x_k-\mu-\rho(x_{k-1}-\mu)\right](1-\rho)
+ (1-\rho^2)(x_0-\mu)\right\}
$

Set this equal to 0 to find

$\displaystyle \hat\mu(\rho)$ $\displaystyle = \frac{ (1-\rho)\sum_1^{T-1}(x_k -\rho x_{k-1}) + (1-\rho^2)x_0}{1-\rho^2 +(1-\rho)^2(T-1)}$    
  $\displaystyle = \frac{ \sum_1^{T-1}(x_k -\rho x_{k-1}) +(1+\rho)x_0}{1+\rho+ (1-\rho)(T-1)}$    

Notice that this estimate is free of $ \sigma$ and that if $ T$ is large we may drop the term $ 1-\rho^2$ in the denominator and the term involving $ x_0$ in the numerator to get

$\displaystyle \hat\mu(\rho) \approx \frac{\sum_1^{T-1}(x_k -\rho x_{k-1})}{(T-1)(1-\rho)}
$

Finally, the numerator is actually

$\displaystyle \sum_0^{T-1} x_k - x_0 -\rho(\sum_0^{T-1} x_k -x_{T-1}) = (1-\rho)
\sum_0^{T-1} x_k -x_0 +\rho x_{T-1}
$

The last two terms here are smaller than the sum; if we neglect them we get

$\displaystyle \hat\mu(\rho) \approx \bar{X} \, .
$

Now compute

$\displaystyle \frac{\partial}{\partial\sigma} \ell = \frac{1}{\sigma^3}\left\{
\sum_1^{T-1}\left[x_k-\mu-\rho(x_{k-1}-\mu)\right]^2 +(1-\rho^2)(x_0-\mu)^2\right\}
-\frac{T}{\sigma}
$

and set this to 0 to find

$\displaystyle \hat\sigma^2(\rho) = \frac{\left\{
\sum_1^{T-1}\left[x_k-\hat\mu(\rho)-\rho(x_{k-1}-\hat\mu(\rho))\right]^2
+(1-\rho^2)(x_0-\hat\mu(\rho))^2\right\}}{T}
$

When $ \rho$ is known it is easy to check that $ (\hat\mu(\rho),\hat\sigma(\rho))$ maximizes $ \ell(\mu,\rho,\sigma)$ over $ \mu$ and $ \sigma$.

To find $ \hat\rho$ you now plug $ \hat\mu(\rho)$ and $ \hat\sigma(\rho)$ into $ \ell$ (getting the so-called profile likelihood $ \ell(\hat\mu(\rho),\rho,\hat\sigma(\rho))$) and maximize over $ \rho$. Having thus found $ \hat\rho$, the mles of $ \mu$ and $ \sigma$ are simply $ \hat\mu(\hat\rho)$ and $ \hat\sigma(\hat\rho)$.
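
A Python sketch of this profile-likelihood calculation, using the closed forms above (the function name and the bound of 0.99 on $ \vert\rho\vert$ are my own choices, and x is an illustrative series):

import numpy as np
from scipy.optimize import minimize_scalar

def neg_profile_loglik(rho, x):
    T = len(x)
    # mu_hat(rho) from the closed form above
    mu = (np.sum(x[1:] - rho * x[:-1]) + (1 + rho) * x[0]) / (1 + rho + (1 - rho) * (T - 1))
    # sigma_hat^2(rho): the bracketed sum of squares divided by T
    ss = np.sum((x[1:] - mu - rho * (x[:-1] - mu)) ** 2) + (1 - rho**2) * (x[0] - mu) ** 2
    sigma2 = ss / T
    # ell(mu_hat(rho), rho, sigma_hat(rho)) = -T/2 - (T/2) log(sigma2) + (1/2) log(1 - rho^2)
    return -(-0.5 * T - 0.5 * T * np.log(sigma2) + 0.5 * np.log(1 - rho**2))

x = np.random.randn(200)   # illustrative observed series
fit = minimize_scalar(neg_profile_loglik, bounds=(-0.99, 0.99), args=(x,), method="bounded")
rho_hat = fit.x
T = len(x)
mu_hat = (np.sum(x[1:] - rho_hat * x[:-1]) + (1 + rho_hat) * x[0]) / (1 + rho_hat + (1 - rho_hat) * (T - 1))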

It is worth observing that fitted residuals can then be calculated:

$\displaystyle \hat\epsilon_t = (X_t-\hat\mu) - \hat\rho(X_{t-1}-\hat\mu)
$

(There are only $ T-1$ of them since you cannot easily estimate $ \epsilon_0$.) Note, too, that the formula for $ \hat\sigma^2$ simplifies to

$\displaystyle \hat\sigma^2 = \frac{ \sum_1^{T-1} \hat\epsilon_t^2
+(1-\hat\rho^2)(x_0-\hat\mu(\hat\rho))^2}{T} \approx \frac{ \sum_1^{T-1}
\hat\epsilon_t^2 }{T} \, .
$
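
In code, once $ \hat\rho$ and $ \hat\mu$ are available (for instance from the profile-likelihood sketch above), the residuals and this variance estimate take one line each:

# rho_hat and mu_hat as computed in the earlier sketch
eps_hat = (x[1:] - mu_hat) - rho_hat * (x[:-1] - mu_hat)   # the T-1 fitted residuals
sigma2_hat = np.sum(eps_hat**2) / len(x)                   # the large-T approximation above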

In general, we simplify the maximum likelihood problem in several ways: we condition on the first few observations, we plug in ad hoc estimates (such as $ \bar{X}$ for $ \mu$) for some parameters, and we neglect terms which are small when $ T$ is large.

Notice that we have made a great many suggestions for simplifications and adjustments. This is typical of statistical research - many ideas, only slightly different from each other, are suggested and compared. In practice it seems likely that there is very little difference between all the methods. I am asking you in a homework problem to investigate the differences between several of these methods on a single data set.

Higher order autoregressions

For the model

$\displaystyle X_t- \mu = \sum_1^p a_i(X_{t-i}-\mu) + \epsilon_t
$

we will use conditional likelihood again. Let $ \phi$ denote the vector $ (a_1,\ldots, a_p)^t$. Now we condition on the first $ p$ values of $ X$ and use

$\displaystyle \ell_c(\phi,\mu,\sigma) = -\frac{1}{2\sigma^2}
\sum_p^{T-1} \left[ X_t-\mu - \sum_1^p a_i(X_{t-i} - \mu)\right]^2
-(T-p)\log(\sigma)
$

If we estimate $ \mu$ using $ \bar{X}$ we find that we are trying to maximize

$\displaystyle -\frac{1}{2\sigma^2}
\sum_p^{T-1} \left[ X_t-\bar{X} - \sum_1^p a_i(X_{t-i} - \bar{X})\right]^2
-(T-p)\log(\sigma)
$

To estimate $ a_1,\ldots,a_p$ then we merely minimize the sum of squares

$\displaystyle \sum_p^{T-1} \hat\epsilon_t^2 = \sum_p^{T-1} \left[ X_t-\bar{X} - \sum_1^p
a_i(X_{t-i} - \bar{X})\right]^2
$

This is a straightforward regression problem. We regress the response vector

$\displaystyle \left[\begin{array}{c} X_p -\bar{X} \\  \vdots \\  X_{T-1} -\bar{X} \end{array} \right]
$

on the design matrix

$\displaystyle \left[\begin{array}{ccc}
X_{p-1} -\bar{X} & \cdots & X_0 -\bar{X}
\\
\vdots & & \vdots
\\
X_{T-2} -\bar{X} & \cdots & X_{T-p-1} -\bar{X}
\end{array}\right]
$
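
In Python this regression is a single least-squares call once the response and design matrix are assembled; the function name below is my own, but the design matrix is exactly the one displayed above.

import numpy as np

def fit_ar_cls(x, p):
    # Conditional least squares for an AR(p): regress X_t - xbar on the p mean-centered lags.
    xbar = x.mean()
    xc = x - xbar
    y = xc[p:]                                                    # X_p - xbar, ..., X_{T-1} - xbar
    Z = np.column_stack([xc[p - i:-i] for i in range(1, p + 1)])  # column i holds X_{t-i} - xbar
    phi, *_ = np.linalg.lstsq(Z, y, rcond=None)
    sigma2 = np.mean((y - Z @ phi) ** 2)
    return phi, xbar, sigma2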

An alternative to estimating $ \mu$ by $ \bar{X}$ is to define $ \alpha = \mu(1-\sum a_i)$ and then recognize that

$\displaystyle \ell(\alpha,\phi,\sigma) = -\frac{1}{2\sigma^2}
\sum_p^{T-1} \left[ X_t-\alpha - \sum_1^p a_iX_{t-i}\right]^2
-(T-p)\log(\sigma)
$

is maximized by regressing the vector

$\displaystyle \left[\begin{array}{c} X_p \\  \vdots \\  X_{T-1} \end{array} \right]
$

on the design matrix

$\displaystyle \left[\begin{array}{ccc}
X_{p-1} & \cdots & X_0
\\
\vdots &
\vdots & \vdots
\\
X_{T-2} & \cdots & X_{T-p-1}
\end{array}\right]
$

From $ \hat\alpha$ and $ \hat\phi$ we would get an estimate for $ \mu$ by

$\displaystyle \hat\mu = \frac{\hat\alpha}{1-\sum \hat{a}_i}$
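
A sketch of the same regression with an explicit intercept column (again the helper name is my own):

import numpy as np

def fit_ar_intercept(x, p):
    # Regress X_t on (1, X_{t-1}, ..., X_{t-p}) for t = p, ..., T-1.
    y = x[p:]
    Z = np.column_stack([np.ones(len(x) - p)] + [x[p - i:-i] for i in range(1, p + 1)])
    coef, *_ = np.linalg.lstsq(Z, y, rcond=None)
    alpha_hat, a_hat = coef[0], coef[1:]
    mu_hat = alpha_hat / (1 - a_hat.sum())   # recover mu from alpha = mu(1 - sum a_i)
    return a_hat, mu_hat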

Notice that if we put

$\displaystyle Z=\left[\begin{array}{ccc}
X_{p-1} -\bar{X} & \cdots & X_0 -\bar{X}
\\
\vdots & & \vdots
\\
X_{T-2} -\bar{X} & \cdots & X_{T-p-1} -\bar{X}
\end{array}\right]
$

then

$\displaystyle Z^tZ \approx T \left[\begin{array}{cccc}
\hat{C}(0) & \hat{C}(1) & \cdots & \hat{C}(p-1)
\\
\hat{C}(1) & \hat{C}(0) & \cdots & \hat{C}(p-2)
\\
\vdots & & \ddots & \vdots
\\
\hat{C}(p-1) & \cdots & \hat{C}(1) & \hat{C}(0)
\end{array}\right]
$

and if

$\displaystyle Y=\left[\begin{array}{c} X_p -\bar{X} \\  \vdots \\  X_{T-1} -\bar{X}
\end{array} \right]
$

then

$\displaystyle Z^tY \approx T \left[\begin{array}{c} \hat{C}(1) \\  \vdots \\  \hat{C}(p)
\end{array}\right]
$

so that the normal equations (from least squares)

$\displaystyle Z^t Z \phi = Z^t Y
$

are nearly the Yule-Walker equations again.
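
The connection can be made concrete with a short sketch that solves the Yule-Walker system built from the sample autocovariances (the function name is my own; toeplitz and solve come from scipy):

import numpy as np
from scipy.linalg import toeplitz, solve

def yule_walker_ar(x, p):
    # Sample autocovariances C_hat(0), ..., C_hat(p)
    T = len(x)
    xc = x - x.mean()
    chat = np.array([np.dot(xc[:T - h], xc[h:]) / T for h in range(p + 1)])
    # Solve the p x p Toeplitz system: C_hat * phi = (C_hat(1), ..., C_hat(p))'
    return solve(toeplitz(chat[:p]), chat[1:])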

Full maximum likelihood

To compute a full mle of $ \theta=(\mu,\phi,\sigma)$ you generally begin by finding preliminary estimates $ \hat\theta$, say by one of the conditional likelihood methods above, and then iterate via Newton-Raphson or some other scheme for numerical maximization of the log-likelihood.
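
A sketch of this two-stage idea for the AR(1), reusing gaussian_loglik from the earlier sketch; I use scipy's BFGS optimizer rather than Newton-Raphson, and mu0, rho0, sigma0 stand for preliminary (conditional-likelihood) estimates.

import numpy as np
from scipy.optimize import minimize

def exact_neg_loglik(theta, x):
    mu, z, log_sigma = theta
    rho, sigma = np.tanh(z), np.exp(log_sigma)     # keep |rho| < 1 and sigma > 0
    T = len(x)
    lags = np.abs(np.subtract.outer(np.arange(T), np.arange(T)))
    Sigma = sigma**2 * rho**lags / (1 - rho**2)    # exact AR(1) covariance matrix
    return -gaussian_loglik(x, mu, Sigma)

# Start from the preliminary estimates and refine numerically.
theta0 = np.array([mu0, np.arctanh(rho0), np.log(sigma0)])
res = minimize(exact_neg_loglik, theta0, args=(x,), method="BFGS")
mu_mle, rho_mle, sigma_mle = res.x[0], np.tanh(res.x[1]), np.exp(res.x[2])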

Fitting $ MA(q)$ models

Here we consider the model with known mean (in practice this means we estimate $ \hat\mu = \bar{X}$ and subtract it from all the observations):

$\displaystyle X_t = \epsilon_t - b_1 \epsilon_{t-1} - \cdots - b_q\epsilon_{t-q}
$

In general $ X$ has a $ MVN(0,\Sigma)$ distribution and, letting $ \psi$ denote the vector of $ b_i$s we find

$\displaystyle \ell(\psi,\sigma) = -\frac{1}{2}\left[\log(\det(\Sigma))
+ X^t \Sigma^{-1} X\right]
$

Here $ X$ denotes the column vector of all the data. As an example consider $ q=1$ so that

$\displaystyle \Sigma = \left[\begin{array}{ccccc}
\sigma^2(1+b_1^2) & -b_1\sigma^2 & 0 & \cdots & 0
\\
-b_1\sigma^2 & \sigma^2(1+b_1^2) & -b_1\sigma^2 & \cdots & 0
\\
\vdots & & & \ddots & \vdots
\\
0 & \cdots & \cdots & -b_1\sigma^2 & \sigma^2(1+b_1^2)
\end{array}\right]
$

It is not so easy to work with the determinant and inverse of matrices like this. Instead we try to mimic the conditional inference approach above but with a twist; we now condition on something we haven't observed, namely $ \epsilon_{-1}$.

Notice that

$\displaystyle X_0$ $\displaystyle = \epsilon_0 - b\epsilon_{-1}$    
$\displaystyle X_1$ $\displaystyle = \epsilon_1 - b\epsilon_0$    
  $\displaystyle = \epsilon_1 - b(X_0+b\epsilon_{-1})$    
$\displaystyle X_2$ $\displaystyle = \epsilon_2 - b\epsilon_1$    
  $\displaystyle = \epsilon_2 - b(X_1+b(X_0+b\epsilon_{-1}))$    
$\displaystyle \vdots$      
$\displaystyle X_{T-1}$ $\displaystyle = \epsilon_{T-1} - b(X_{T-2} + b (X_{T-3} + \cdots + b(X_0 + b\epsilon_{-1})\cdots))$    

Now imagine that the data were actually

$\displaystyle \epsilon_{-1}, X_0,\ldots,X_{T-1}
$

Then the same idea we used for an $ AR(1)$ would give

$\displaystyle \ell(b,\sigma)$ $\displaystyle = \log(f(\epsilon_{-1},\sigma)) + \log(f(X_0,\ldots,X_{T-1}\vert\epsilon_{-1},b,\sigma))$    
  $\displaystyle = \log(f(\epsilon_{-1},\sigma)) + \sum_0^{T-1} \log(f(X_t\vert X_{t-1},\ldots,X_0,\epsilon_{-1},b,\sigma))$    

The parameters are listed in the conditions in this formula merely to indicate which terms depend on which parameters. For Gaussian $ \epsilon$s the terms in this likelihood are squares as usual (plus logarithms of $ \sigma$) leading to

$\displaystyle \ell(b,\sigma) =$ $\displaystyle -\frac{\epsilon_{-1}^2}{2\sigma^2} - \log(\sigma)$    
  $\displaystyle - \sum_0^{T-1} \left[\frac{1}{2\sigma^2} \left(X_t + b X_{t-1} + b^2 X_{t-2} + \cdots + b^t X_0 + b^{t+1} \epsilon_{-1}\right)^2+\log(\sigma)\right]$    

We will estimate the parameters by maximizing this function after getting rid of $ \epsilon_{-1}$ somehow.

Method A: Put $ \epsilon_{-1}=0$ since 0 is the most probable value and maximize

$\displaystyle -T\log(\sigma) - \frac{1}{2\sigma^2} \sum_0^{T-1} \left[X_t + b X_{t-1}
+ b^2 X_{t-2} + \cdots + b^{t}X_0\right]^2
$

Notice that for large $ T$ the coefficient $ b^{t+1}$ of $ \epsilon_{-1}$ is close to 0 for most $ t$ (provided $ \vert b\vert < 1$, that is, the model is invertible), and the few remaining terms are negligible relative to the total.
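
A Python sketch of Method A: with $ \epsilon_{-1}=0$ the prediction errors can be built recursively via $ \epsilon_t = X_t + b\epsilon_{t-1}$, and since maximizing over $ \sigma$ gives $ \hat\sigma^2=\sum\hat\epsilon_t^2/T$, all that remains is to minimize the conditional sum of squares over $ b$ (the bound of 0.99 and the names are my own; x is an illustrative mean-corrected series):

import numpy as np
from scipy.optimize import minimize_scalar

def css_ma1(b, x):
    # Conditional sum of squares for X_t = eps_t - b*eps_{t-1}, starting from eps_{-1} = 0
    eps_prev, ss = 0.0, 0.0
    for xt in x:
        eps_prev = xt + b * eps_prev   # eps_t = X_t + b*eps_{t-1}
        ss += eps_prev ** 2
    return ss

x = np.random.randn(200)   # illustrative mean-corrected series
fit = minimize_scalar(css_ma1, bounds=(-0.99, 0.99), args=(x,), method="bounded")
b_hat = fit.x
sigma2_hat = css_ma1(b_hat, x) / len(x)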

Method B: Backcasting is the process of guessing $ \epsilon_{-1}$ on the basis of the data; we replace $ \epsilon_{-1}$ in the log likelihood by

$\displaystyle E(\epsilon_{-1}\vert X_0,\ldots,X_{T-1})\, .
$

The problem is that this quantity depends on $ b$ and $ \sigma$.

We will use the EM algorithm to solve this problem.





Richard Lockhart
2001-09-30