Mixed Models Theory

The MIXED Procedure

Mixed Models Theory

This section provides an overview of a likelihood-based approach to general linear mixed models. This approach simplifies and unifies many common statistical analyses, including those involving repeated measures, random effects, and random coefficients. The basic assumption is that the data are linearly related to unobserved multivariate normal random variables. Extensions to nonlinear and nonnormal situations are possible but are not discussed here. Additional theory with examples is provided in Littell et al. (1996) and Verbeke and Molenberghs (1997).

Matrix Notation

Suppose that you observe n data points y₁, ... , y_n and that you want to explain them using n values for each of p explanatory variables x₁₁, ... , x_1p, x₂₁, ... , x_2p, ... , x_n1, ... , x_np. The x_ij values may be either regression-type continuous variables or dummy variables indicating class membership. The standard linear model for this setup is

$y_i = \sum_{j=1}^p x_{ij} \beta_j + \epsilon_i i=1, ... ,n$

where $\beta_1, ... , \beta_p$ are unknown fixed-effects parameters to be estimated and $\epsilon_1, ... , \epsilon_n$ are unknown independent and identically distributed normal (Gaussian) random variables with mean 0 and variance $\sigma^2$ .

The preceding equations can be written simultaneously using vectors and a matrix, as follows:

$[y_1 \ y_2 \ \vdots \ y_n ] = [x_{11} & x_{12} & ... & x_{1p} \ x_{21} & x_{... ...beta_2 \ \vdots \ \beta_p ] + [\epsilon_1 \ \epsilon_2 \ \vdots \ \epsilon_n ]$

For convenience, simplicity, and extendibility, this entire system is written as

$y= X{\beta}+ {\epsilon}$

where y denotes the vector of observed y_i's, X is the known matrix of x_ij's, ${\beta}$ is the unknown fixed-effects parameter vector, and ${\epsilon}$ is the unobserved vector of independent and identically distributed Gaussian random errors.

In addition to denoting data, random variables, and explanatory variables in the preceding fashion, the subsequent development makes use of basic matrix operators such as transpose ('), inverse (^-1), generalized inverse (^-), determinant ( $|\cdot|$ ), and matrix multiplication. Refer to Searle (1982) for details on these and other matrix techniques.

Formulation of the Mixed Model

The previous general linear model is certainly a useful one (Searle 1971), and it is the one fitted by the GLM procedure. However, many times the distributional assumption about $\epsilon$ is too restrictive. The mixed model extends the general linear model by allowing a more flexible specification of the covariance matrix of $\epsilon$ . In other words, it allows for both correlation and heterogeneous variances, although you still assume normality.

The mixed model is written as

$y= X{\beta}+ Z{\gamma}+ {\epsilon}$

where everything is the same as in the general linear model except for the addition of the known design matrix, Z, and the vector of unknown random-effects parameters, ${\gamma}$ . The matrix Z can contain either continuous or dummy variables, just like X. The name mixed model comes from the fact that the model contains both fixed-effects parameters, ${\beta}$ , and random-effects parameters, ${\gamma}$ . Refer to Henderson (1990) and Searle, Casella, and McCulloch (1992) for historical developments of the mixed model.

A key assumption in the foregoing analysis is that ${\gamma}$ and ${\epsilon}$ are normally distributed with

$E[ {\gamma}\ {\epsilon}] & = & [0 \ 0 ] \ {Var}[ {\gamma}\ {\epsilon}] & = & [G& 0 \ 0 & R ]$

The variance of y is, therefore, V = ZGZ' + R. You can model V by setting up the random-effects design matrix Z and by specifying covariance structures for G and R.

Note that this is a general specification of the mixed model, in contrast to many texts and articles that discuss only simple random effects. Simple random effects are a special case of the general specification with Z containing dummy variables, G containing variance components in a diagonal structure, and $R= \sigma^2{I}_n$ , where I_n denotes the n ×n identity matrix. The general linear model is a further special case with Z = 0 and $R= \sigma^2{I}_n$ .

The following two examples illustrate the most common formulations of the general linear mixed model.

Example: Growth Curve with Compound Symmetry

Suppose that you have three growth curve measurements for s individuals and that you want to fit an overall linear trend in time. Your X matrix is as follows:

$X= [ 1 & 1 \ 1 & 2 \ 1 & 3 \ \vdots & \vdots \ 1 & 1 \ 1 & 2 \ 1 & 3 \ ]$

The first column (coded entirely with 1s) fits an intercept, and the second column (coded with times of 1,2,3) fits a slope. Here, n = 3s and p = 2.

Suppose further that you want to introduce a common correlation among the observations from a single individual, with correlation being the same for all individuals. One way of setting this up in the general mixed model is to eliminate the Z and G matrices and let the R matrix be block diagonal with blocks corresponding to the individuals and with each block having the compound-symmetry structure. This structure has two unknown parameters, one modeling a common covariance and the other a residual variance. The form for R would then be as follows:

$R= [ \sigma^2_1 + \sigma^2 & \sigma^2_1 & \sigma^2_1 & & & & \\sigma^2_1 & \sigm... ...sigma^2 & \sigma^2_1 \& & & & \sigma^2_1 & \sigma^2_1 & \sigma^2_1 + \sigma^2 \]$

where blanks denote zeroes. There are 3s rows and columns altogether, and the common correlation is $\sigma^2_1/(\sigma^2_1 + \sigma^2)$ .

The PROC MIXED code to fit this model is as follows:

   proc mixed;
      class indiv;
      model y = time;
      repeated / type=cs subject=indiv;
   run;

Here, indiv is a classification variable indexing individuals. The MODEL statement fits a straight line for time; the intercept is fit by default just as in PROC GLM. The REPEATED statement models the R matrix: TYPE=CS specifies the compound symmetry structure, and SUBJECT=INDIV specifies the blocks of R.

An alternative way of specifying the common intra-individual correlation is to let

$Z&=& [ 1 & & & \ 1 & & & \ 1 & & & \ & 1 & & \ & 1 & & \ & 1 & & \ & & \ddots & ... ...G&=& [ \sigma^2_1 & & & \ & \sigma^2_1 & & \ & & \ddots & \ & & & \sigma^2_1 \ ]$

and $R= \sigma^2{I}_n$ . The Z matrix has 3s rows and s columns, and G is s ×s.

You can set up this model in PROC MIXED in two different but equivalent ways:

   proc mixed;
      class indiv;
      model y = time;
      random indiv;
   run;

   proc mixed;
      class indiv;
      model y = time;
      random intercept / subject=indiv;
   run;

Both of these specifications fit the same model as the previous one that used the REPEATED statement; however, the RANDOM specifications constrain the correlation to be positive whereas the REPEATED specification leaves the correlation unconstrained.

Example: Split-Plot Design

The split-plot design involves two experimental treatment factors, A and B, and two different sizes of experimental units to which they are applied (refer to Winer 1971, Snedecor and Cochran 1980, and Milliken and Johnson 1992). The levels of A are randomly assigned to the larger sized experimental unit, called whole plots, whereas the levels of B are assigned to the smaller sized experimental unit, the subplots. The subplots are assumed to be nested within the whole plots, so that a whole plot consists of a cluster of subplots and a level of A is applied to the entire cluster.

Such an arrangement is often necessary by nature of the experiment, the classical example being the application of fertilizer to large plots of land and different crop varieties planted in subdivisions of the large plots. For this example, fertilizer is the whole plot factor A and variety is the subplot factor B.

The first example is a split-plot design for which the whole plots are arranged in a randomized block design. The appropriate PROC MIXED code is as follows:

   proc mixed;
      class a b block;
      model y = a|b;
      random block a*block;
   run;

Here

$R= \sigma^2 I_{24}$

and X, Z, and G have the following form:

$X= [ 1 & 1 & & & 1 & & 1 & & & & & \ 1 & 1 & & & & 1 & & 1 & & & & \ 1 & & 1 & &... ... & & 1 & & \ 1 & & & 1 & 1 & & & & & & 1 & \ 1 & & & 1 & & 1 & & & & & & 1 \ ] \$

$Z\, = [ 1 & & & & 1 & & & & & & & & & & & \ 1 & & & & 1 & & & & & & & & & & & \ ... ... & \ & & & 1 & & & & & & & & & & & & 1 \ & & & 1 & & & & & & & & & & & & 1 \ ] \$

$G= [ \sigma^2_B & & & & & & & \ & \sigma^2_B & & & & & & \ & & \sigma^2_B & & & ... ...& & & \sigma^2_{AB} & & \ & & & & & & \ddots & \ & & & & & & & \sigma^2_{AB} \ ]$

where $\sigma^2_B$ is the variance component for Block and $\sigma^2_{AB}$ is the variance component for A*Block. Changing the RANDOM statement to

   random int a / subject=block;

fits the same model, but with Z and G sorted differently.

$Z&=& [ 1 & 1 & & & & & & & & & & & & & & \ 1 & 1 & & & & & & & & & & & & & & \ 1... ...a^2_{AB} & & \ & & & & & & & \sigma^2_{AB} & \ & & & & & & & & \sigma^2_{AB} \ ]$

Estimating G and R in the Mixed Model

Estimation is more difficult in the mixed model than in the general linear model. Not only do you have ${\beta}$ as in the general linear model, but you have unknown parameters in ${\gamma}$ , G, and R as well. Least squares is no longer the best method. Generalized least squares (GLS) is more appropriate, minimizing

$(y- X{\beta})'V^{-1}(y- X{\beta})$

However, it requires knowledge of V and, therefore, knowledge of G and R. Lacking such information, one approach is to use estimated GLS, in which you insert some reasonable estimate for V into the minimization problem. The goal thus becomes finding a reasonable estimate of G and R.

In many situations, the best approach is to use likelihood-based methods, exploiting the assumption that ${\gamma}$ and ${\epsilon}$ are normally distributed (Hartley and Rao 1967; Patterson and Thompson 1971; Harville 1977; Laird and Ware 1982; Jennrich and Schluchter 1986). PROC MIXED implements two likelihood-based methods: maximum likelihood (ML) and restricted/residual maximum likelihood (REML). A favorable theoretical property of ML and REML is that they accommodate data that are missing at random (Rubin 1976; Little 1995). PROC MIXED constructs an objective function associated with ML or REML and maximizes it over all unknown parameters. Using calculus, it is possible to reduce this maximization problem to one over only the parameters in G and R. The corresponding log-likelihood functions are as follows:

${ML:} \;\;\;\;\;\; l(G,R) &=& -\frac{1}2 \log |{V}| - \frac{1}2 r'V^{-1}r - \f... ...}2 \log |{X}'V^{-1}X| \ & & - \frac{1}2 r'V^{-1}r - \frac{n-p}2 \log (2 \pi)\}$

where r = y- X(X'V^-1X)^-X'V^-1y and p is the rank of X. PROC MIXED actually minimizes -2 times these functions using a ridge-stabilized Newton-Raphson algorithm. Lindstrom and Bates (1988) provide reasons for preferring Newton-Raphson to the Expectation-Maximum (EM) algorithm described in Dempster, Laird, and Rubin (1977) and Laird, Lange, and Stram (1987), as well as analytical details for implementing a QR-decomposition approach to the problem. Wolfinger, Tobias, and Sall (1994) present the sweep-based algorithms that are implemented in PROC MIXED.

One advantage of using the Newton-Raphson algorithm is that the second derivative matrix of the objective function evaluated at the optima is available upon completion. Denoting this matrix H, the asymptotic theory of maximum likelihood (refer to Serfling 1980) shows that 2H^-1 is an asymptotic variance-covariance matrix of the estimated parameters of G and R. Thus, tests and confidence intervals based on asymptotic normality can be obtained. However, these can be unreliable in small samples, especially for parameters such as variance components which have sampling distributions that tend to be skewed to the right.

If a residual variance $\sigma^2$ is a part of your mixed model, it can usually be profiled out of the likelihood. This means solving analytically for the optimal $\sigma^2$ and plugging this expression back into the likelihood formula (refer to Wolfinger, Tobias, and Sall 1994). This reduces the number of optimization parameters by one and can improve convergence properties. PROC MIXED profiles the residual variance out of the log likelihood whenever it appears reasonable to do so. This includes the case when R equals $\sigma^2 I$ and when it has blocks with a compound symmetry, time series, or spatial structure. PROC MIXED does not profile the log likelihood when R has unstructured

blocks, when you use the HOLD= or NOITER option in the PARMS statement, or when you use the NOPROFILE option in the PROC MIXED statement.

Instead of ML or REML, you can use the noniterative MIVQUE0 method to estimate G and R (Rao 1972; LaMotte 1973; Wolfinger, Tobias, and Sall 1994). In fact, by default PROC MIXED uses MIVQUE0 estimates as starting values for the ML and REML procedures. For variance component models, another estimation method involves equating Type I, II, or III expected mean squares to their observed values and solving the resulting system. However, Swallow and Monahan (1984) present simulation evidence favoring REML and ML over MIVQUE0 and other method-of-moment estimators.

Estimating $\beta$ and $\gamma$ in the Mixed Model

ML, REML, MIVQUE0, or Type1 -Type3 provide estimates of G and R, which are denoted $\hat{G}$ and $\hat{R}$ ,respectively. To obtain estimates of ${\beta}$ and ${\gamma}$ , the standard method is to solve the mixed model equations (Henderson 1984):

$[X'\hat{R}^{-1}X& X'\hat{R}^{-1}Z\* Z'\hat{R}^{-1}X& Z'\hat{R}^{-1}Z+ \hat{G}^{-1} ] [\hat{{\beta}} \ \hat{{\gamma}} ] = [X'\hat{R}^{-1}y\ Z'\hat{R}^{-1}y ]$

The solutions can also be written as

$\hat{{\beta}} &=& (X'\hat{V}^{-1}X)^{-} X'\hat{V}^{-1}y\ \hat{{\gamma}} &=& \hat{G}Z'\hat{V}^{-1} (y- X\hat{{\beta}})$

and have connections with empirical Bayes estimators (Laird and Ware 1982).

Note that the mixed model equations are extended normal equations and that the preceding expression assumes that $\hat{G}$ is nonsingular. For the extreme case when the eigenvalues of $\hat{G}$ are very large, $\hat{G}^{-1}$ contributes very little to the equations and $\hat{{\gamma}}$ is close to what it would be if ${\gamma}$ actually contained fixed-effects parameters. On the other hand, when the eigenvalues of $\hat{G}$ are very small, $\hat{G}^{-1}$ dominates the equations and $\hat{{\gamma}}$ is close to 0. For intermediate cases, $\hat{G}^{-1}$ can be viewed as shrinking the fixed-effects estimates of ${\gamma}$ towards 0 (Robinson 1991).

If $\hat{G}$ is singular, then the mixed model equations are modified (Henderson 1984) as follows:

$[{X'\hat{R}}^{-1}X& {X'\hat{R}}^{-1} {Z\hat{L}}\* {\hat{L}'Z'\hat{R}}^{-1}X &... ...{\beta}} \ \hat{{\tau}} ] = [{X'\hat{R}}^{-1}y\ {\hat{L}'Z'\hat{R}}^{-1}y ]$

where $\hat{L}$ is the lower-triangular Cholesky root of $\hat{G}$ , satisfying $\hat{G} = \hat{L}\hat{L}'$ . Both $\hat{{\tau}}$ and a generalized inverse of the left-hand-side coefficient matrix are then transformed using $\hat{L}$ to determine $\hat{{\gamma}}$ .

An example of when the singular form of the equations is necessary is when a variance component estimate falls on the boundary constraint of 0.

Model Selection

The previous section on estimation assumes the specification of a mixed model in terms of X, Z, G, and R. Even though X and Z have known elements, their specific form and construction is flexible, and several possibilities may present themselves for a particular data set. Likewise, several different covariance structures for G and R might be reasonable.

Space does not permit a thorough discussion of model selection, but a few brief comments and references are in order. First, subject matter considerations and objectives are of great importance when selecting a model; refer to Diggle (1988) and Lindsey (1993).

Second, when the data themselves are looked to for guidance, many of the graphical methods and diagnostics appropriate for the general linear model extend to the mixed model setting as well (Christensen, Pearson, and Johnson 1992).

Finally, a likelihood-based approach to the mixed model provides several statistical measures for model adequacy as well. The most common of these are the likelihood ratio test and Akaike's and Schwarz's criteria (Bozdogan 1987; Wolfinger 1993).

Statistical Properties

If G and R are known, $\hat{{\beta}}$ is the best linear unbiased estimator (BLUE) of ${\beta}$ , and $\hat{{\gamma}}$ is the best linear unbiased predictor (BLUP) of ${\gamma}$ (Searle 1971; Harville 1988, 1990; Robinson 1991; McLean, Sanders, and Stroup 1991). Here, "best" means minimum mean squared error. The covariance matrix of $(\hat{{\beta}} - {\beta},\hat{{\gamma}} - {\gamma})$ is

$C= [X'R^{-1}X& X'R^{-1}Z\* Z'R^{-1}X& Z'R^{-1}Z+ G^{-1} ]^{-}$

where ^- denotes a generalized inverse (refer to Searle 1971).

However, G and R are usually unknown and are estimated using one of the aforementioned methods. These estimates, $\hat{G}$ and $\hat{R}$ , are therefore simply substituted into the preceding expression to obtain

$\hat{C} = [X'\hat{R}^{-1}X& X'\hat{R}^{-1}Z\* Z'\hat{R}^{-1}X& Z'\hat{R}^{-1}Z+ \hat{G}^{-1} ]^{-}$

as the approximate variance-covariance matrix of $(\hat{{\beta}} - {\beta},\hat{{\gamma}} - {\gamma}$ ). In this case, the BLUE and BLUP acronyms no longer apply, but the word empirical is often added to indicate such an approximation. The appropriate acronyms thus become EBLUE and EBLUP. McLean and Sanders (1988) show that $\hat{C}$ can also be written as

$\hat{C} = [\hat{C}_{11} & \hat{C}_{21}' \ \hat{C}_{21} & \hat{C}_{22} ]$

where

$\hat{C}_{11} &=& (X'\hat{V}^{-1}X)^{-} \ \hat{C}_{21} &=& -\hat{G}Z'\hat{V}^{-1}... ...(Z'\hat{R}^{-1}Z+ \hat{G}^{-1})^{-1} - \hat{C}_{21}X'\hat{V}^{-1}Z \hat{G}$

Note that $\hat{C}_{11}$ is the familiar estimated generalized least-squares formula for the variance-covariance matrix of $\hat{{\beta}}$ .

As a cautionary note, $\hat{C}$ tends to underestimate the true sampling variability of ( $\hat{{\beta}}$ $\hat{{\gamma}}$ ) because no account is made for the uncertainty in estimating G and R. Although inflation factors have been proposed (Kackar and Harville 1984; Kass and Steffey 1989; Prasad and Rao 1990), they tend to be small for data sets that are fairly well balanced. PROC MIXED does not compute any inflation factors by default, but rather accounts for the downward bias by using the approximate t and F statistics described subsequently. The DDFM=KENWARDROGER option in the MODEL statement prompts PROC MIXED to compute a specific inflation factor along with Satterthwaite-based degrees of freedom.

Inference and Test Statistics

For inferences concerning the covariance parameters in your model, you can use likelihood-based statistics. One common likelihood-based statistic is the Wald Z, which is computed as the parameter estimate divided by its asymptotic standard error. The asymptotic standard errors are computed from the inverse of the second derivative matrix of the likelihood with respect to each of the covariance parameters. The Wald Z is valid for large samples, but it can be unreliable for small data sets and for parameters such as variance components, which are known to have a skewed or bounded sampling distribution.

A better alternative is the likelihood ratio $\chi^2$ . This statistic compares two covariance models, one a special case of the other. To compute it, you must run PROC MIXED twice, once for each of the two models, and then subtract the corresponding values of -2 times the log likelihoods. You can use either ML or REML to construct this statistic, which tests whether the full model is necessary beyond the reduced model.

As long as the reduced model does not occur on the boundary of the covariance parameter space, the $\chi^2$ statistic computed in this fashion has a large-sample sampling distribution that is $\chi^2$ with degrees of freedom equal to the difference in the number of covariance parameters between the two models. If the reduced model does occur on the boundary of the covariance parameter space, the asymptotic distribution becomes a mixture of $\chi^2$ distributions (Self and Liang 1987). A common example of this is when you are testing that a variance component equals its lower boundary constraint of 0.

A final possibility for obtaining inferences concerning the covariance parameters is to simulate or resample data from your model and construct empirical sampling distributions of the parameters. The SAS macro language and the ODS system are useful tools in this regard.

For inferences concerning the fixed- and random-effects parameters in the mixed model, consider estimable linear combinations of the following form:

$L[{\beta}\ {\gamma} ]$

The estimability requirement (Searle 1971) applies only to the ${\beta}$ -portion of L, as any linear combination of ${\gamma}$ is estimable. Such a formulation in terms of a general L matrix encompasses a wide variety of common inferential procedures such as those employed with Type I -Type III tests and LS-means. The CONTRAST and ESTIMATE statements in PROC MIXED enable you to specify your own L matrices. Typically, inference on fixed-effects is the focus, and, in this case, the ${\gamma}$ -portion of L is assumed to contain all 0s.

Statistical inferences are obtained by testing the hypothesis

$H: L[{\beta}\ {\gamma} ] = 0$

or by constructing point and interval estimates.

When L consists of a single row, a general t-statistic can be constructed as follows (refer to McLean and Sanders 1988, Stroup 1989a):

$t = \frac{L[\hat{{\beta}} \ \hat{{\gamma}} ]}{\sqrt{L\hat{C}L'}}$

Under the assumed normality of ${\gamma}$ and $\epsilon$ , t has an exact t-distribution only for data exhibiting certain types of balance and for some special unbalanced cases. In general, t is only approximately t-distributed, and its degrees of freedom must be estimated. See the DDFM= option for a description of the various degrees-of-freedom methods available in PROC MIXED. With $\hat{\nu}$ being the approximate degrees of freedom, the associated confidence interval is

$L[\hat{{\beta}} \ \hat{{\gamma}} ] +- t_{\hat{\nu},\alpha/2} \sqrt{L\hat{C}L'}$

where $t_{\hat{\nu},\alpha/2}$ is the $(1 - \alpha/2)100$ th percentile of the $t_{\hat{\nu}}$ -distribution.

When the rank of L is greater than 1, PROC MIXED constructs the following general F-statistic:

$F = \frac{ [\hat{{\beta}} \ \hat{{\gamma}} ]'L'(L'\hat{C}L)^{-1} L[\hat{{\beta}} \ \hat{{\gamma}} ]} {{\rm rank}(L)}$

Analogous to t, F in general has an approximate F-distribution with rank(L) numerator degrees of freedom and $\hat{\nu}$ denominator degrees of freedom.

The t- and F-statistics enable you to make inferences about your fixed effects, which account for the variance-covariance model you select. An alternative is the $\chi^2$ statistic associated with the likelihood ratio test. This statistic compares two fixed-effects models, one a special case of the other. It is computed just as when comparing different covariance models, although you should use ML and not REML here because the penalty term associated with restricted likelihoods depends upon the fixed-effects specification.

Chapter Contents
Previous
Next
Top