The GLM Procedure

Multiple Comparisons

When comparing more than two means, an ANOVA F-test tells you whether the means are significantly different from each other, but it does not tell you which means differ from which other means. Multiple comparison procedures (MCPs), also called mean separation tests, give you more detailed information about the differences among the means. The goal in multiple comparisons is to compare the average effects of three or more "treatments" (for example, drugs, groups of subjects) to decide which treatments are better, which ones are worse, and by how much, while controlling the probability of making an incorrect decision. A variety of multiple comparison methods are available with the MEANS and LSMEANS statements in the GLM procedure.

The following classification is due to Hsu (1996). Multiple comparison procedures can be categorized in two ways: by the comparisons they make and by the strength of inference they provide. With respect to which comparisons are made, the GLM procedure offers two types: all pairwise comparisons, and comparisons with a control.

The strength of inference says what can be inferred about the structure of the means when a test is significant; it is related to the type of error rate the MCP controls. MCPs available in the GLM procedure provide one of the following types of inference, in order from weakest to strongest: individual, inhomogeneity, inequalities, and intervals, as listed in the "Strength of Inference" columns of Table 30.3 and Table 30.4.

Methods that control only individual error rates are not true MCPs at all. Methods that yield the strongest level of inference, simultaneous confidence intervals, are usually preferred, since they enable you not only to say which means are different but also to put confidence bounds on how much they differ, making it easier to assess the practical significance of a difference. They are also less likely to lead nonstatisticians to the invalid conclusion that nonsignificantly different sample means imply equal population means. Interval MCPs are available for both arithmetic means and LS-means via the MEANS and LSMEANS statements, respectively.

Table 30.3 and Table 30.4 display MCPs available in PROC GLM for all pairwise comparisons and comparisons with a control, respectively, along with associated strength of inference and the syntax (when applicable) for both the MEANS and the LSMEANS statements.

Table 30.3: Multiple Comparisons Procedures for All Pairwise Comparisons

  Method                  Strength of     Syntax
                          Inference       MEANS      LSMEANS
  ----------------------  -------------   --------   ---------------------
  Student's t             Individual      T          PDIFF ADJUST=T
  Duncan                  Individual      DUNCAN
  Student-Newman-Keuls    Inhomogeneity   SNK
  REGWQ                   Inequalities    REGWQ
  Tukey-Kramer            Intervals       TUKEY      PDIFF ADJUST=TUKEY
  Bonferroni              Intervals       BON        PDIFF ADJUST=BON
  Sidak                   Intervals       SIDAK      PDIFF ADJUST=SIDAK
  Scheffé                 Intervals       SCHEFFE    PDIFF ADJUST=SCHEFFE
  SMM                     Intervals       SMM        PDIFF ADJUST=SMM
  Gabriel                 Intervals       GABRIEL
  Simulation              Intervals                  PDIFF ADJUST=SIMULATE

Table 30.4: Multiple Comparisons Procedures for Comparisons with a Control

  Method                  Strength of     Syntax
                          Inference       MEANS      LSMEANS
  ----------------------  -------------   --------   -----------------------------
  Student's t             Individual                 PDIFF=CONTROL ADJUST=T
  Dunnett                 Intervals       DUNNETT    PDIFF=CONTROL ADJUST=DUNNETT
  Bonferroni              Intervals                  PDIFF=CONTROL ADJUST=BON
  Sidak                   Intervals                  PDIFF=CONTROL ADJUST=SIDAK
  Scheffé                 Intervals                  PDIFF=CONTROL ADJUST=SCHEFFE
  SMM                     Intervals                  PDIFF=CONTROL ADJUST=SMM
  Simulation              Intervals                  PDIFF=CONTROL ADJUST=SIMULATE

Note: One-sided Dunnett's tests are also available from the MEANS statement with the DUNNETTL and DUNNETTU options and from the LSMEANS statement with PDIFF=CONTROLL and PDIFF=CONTROLU.

Details of these multiple comparison methods are given in the following sections.

Pairwise Comparisons

All the methods discussed in this section depend on the standardized pairwise differences t_{ij} = (\bar{y}_i - \bar{y}_j)/\hat{\sigma}_{ij}, where \bar{y}_i and \bar{y}_j are the means of the ith and jth groups and \hat{\sigma}_{ij} is the estimated standard error of their difference. Furthermore, all of the methods are discussed in terms of significance tests of the form

|t_{ij}| \geq c(\alpha)

where c(\alpha) is some constant depending on the significance level. Such tests can be inverted to form confidence intervals of the form

(\bar{y}_i - \bar{y}_j) - \hat{\sigma}_{ij} c(\alpha) \leq \mu_i - \mu_j \leq (\bar{y}_i - \bar{y}_j) + \hat{\sigma}_{ij} c(\alpha)

The simplest approach to multiple comparisons is to do a t test on every pair of means (the T option in the MEANS statement, ADJUST=T in the LSMEANS statement). For the ith and jth means, you can reject the null hypothesis that the population means are equal if

|t_{ij}| \geq t(\alpha; \nu)

where \alpha is the significance level, \nu is the number of error degrees of freedom, and t(\alpha; \nu) is the two-tailed critical value from a Student's t distribution. If the cell sizes are all equal to, say, n, the preceding formula can be rearranged to give

|\bar{y}_i - \bar{y}_j| \geq t(\alpha; \nu) \, s \sqrt{\frac{2}{n}}

the value of the right-hand side being Fisher's least significant difference (LSD).

There is a problem with repeated t tests, however. Suppose there are ten means and each t test is performed at the 0.05 level. There are 10(10-1)/2=45 pairs of means to compare, each with a 0.05 probability of a type 1 error (a false rejection of the null hypothesis). The chance of making at least one type 1 error is much higher than 0.05. It is difficult to calculate the exact probability, but you can derive a pessimistic approximation by assuming that the comparisons are independent, giving an upper bound to the probability of making at least one type 1 error (the experimentwise error rate) of

1 - (1 - 0.05)^{45} = 0.90

The actual probability is somewhat less than 0.90, but as the number of means increases, the chance of making at least one type 1 error approaches 1.
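The bound above is easy to reproduce. This short Python sketch computes 1 - (1 - \alpha)^c for the c = k(k-1)/2 pairwise comparisons among k means, using the independence approximation from the text:

```python
def eer_upper_bound(k, alpha=0.05):
    """Upper bound (assuming independent comparisons) on the experimentwise
    error rate for all k*(k-1)/2 pairwise t tests at level alpha."""
    c = k * (k - 1) // 2
    return 1 - (1 - alpha) ** c

print(round(eer_upper_bound(10), 2))   # reproduces the 0.90 figure for k = 10
```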

If you decide to control the individual type 1 error rates for each comparison, you are controlling the individual or comparisonwise error rate. On the other hand, if you want to control the overall type 1 error rate for all the comparisons, you are controlling the experimentwise error rate. It is up to you to decide whether to control the comparisonwise error rate or the experimentwise error rate, but there are many situations in which the experimentwise error rate should be held to a small value. Statistical methods for comparing three or more means while controlling the probability of making at least one type 1 error are called multiple comparisons procedures.

It has been suggested that the experimentwise error rate can be held to the \alpha level by performing the overall ANOVA F-test at the \alpha level and making further comparisons only if the F-test is significant, as in Fisher's protected LSD. This assertion is false if there are more than three means (Einot and Gabriel 1975). Consider again the situation with ten means. Suppose that one population mean differs from the others by a large enough amount that the power (probability of correctly rejecting the null hypothesis) of the F-test is near 1, but that all the other population means are equal to each other. There will be 9(9-1)/2 = 36 t tests of true null hypotheses, with an upper limit of 0.84 on the probability of at least one type 1 error. Thus, you must distinguish between the experimentwise error rate under the complete null hypothesis, in which all population means are equal, and the experimentwise error rate under a partial null hypothesis, in which some means are equal but others differ. The following abbreviations are used in the discussion:

CER
comparisonwise error rate

EERC
experimentwise error rate under the complete null hypothesis

MEER
maximum experimentwise error rate under any complete or partial null hypothesis

These error rates are associated with the different strengths of inference: individual tests control the CER; tests for inhomogeneity of means control the EERC; tests that yield confidence inequalities or confidence intervals control the MEER. A preliminary F-test controls the EERC but not the MEER.

You can control the MEER at the \alpha level by setting the CER to a sufficiently small value. The Bonferroni inequality (Miller 1981) has been widely used for this purpose. If
{\rm CER} = \frac{\alpha}{c}

where c is the total number of comparisons, then the MEER is less than \alpha. Bonferroni t tests (the BON option in the MEANS statement, ADJUST=BON in the LSMEANS statement) with {\rm MEER} < \alpha declare two means to be significantly different if

|t_{ij}| \geq t(\epsilon; \nu)

where

\epsilon = \frac{2\alpha}{k(k-1)}

for comparison of k means.

Sidak (1967) has provided a tighter bound, showing that
{\rm CER} = 1 - (1 - \alpha)^{1/c}

also ensures that {\rm MEER} \leq \alpha for any set of c comparisons. A Sidak t test (Games 1977), provided by the SIDAK option, is thus given by

|t_{ij}| \geq t(\epsilon; \nu)

where

\epsilon = 1 - (1 - \alpha)^{\frac{2}{k(k-1)}}

for comparison of k means.
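The two per-comparison levels are easy to compare numerically. This Python sketch computes both for k means; because the Sidak bound is tighter, its per-comparison \epsilon is always slightly larger than the Bonferroni \alpha/c, which is why the Sidak test is slightly more powerful:

```python
def bonferroni_eps(k, alpha=0.05):
    """Per-comparison level alpha/c for c = k(k-1)/2 pairwise tests."""
    return alpha / (k * (k - 1) // 2)

def sidak_eps(k, alpha=0.05):
    """Per-comparison level 1 - (1 - alpha)^(1/c)."""
    return 1 - (1 - alpha) ** (1 / (k * (k - 1) // 2))

# Sidak's per-comparison level exceeds Bonferroni's for every k > 2.
for k in range(3, 20):
    assert sidak_eps(k) > bonferroni_eps(k)
```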

You can use the Bonferroni additive inequality and the Sidak multiplicative inequality to control the MEER for any set of contrasts or other hypothesis tests, not just pairwise comparisons. The Bonferroni inequality can provide simultaneous inferences in any statistical application requiring tests of more than one hypothesis. Other methods discussed in this section for pairwise comparisons can also be adapted for general contrasts (Miller 1981).

Scheffé (1953, 1959) proposes another method to control the MEER for any set of contrasts or other linear hypotheses in the analysis of linear models, including pairwise comparisons, obtained with the SCHEFFE option. Two means are declared significantly different if

|t_{ij}| \geq \sqrt{(k-1) \, F(\alpha; k-1, \nu)}

where F(\alpha; k-1, \nu) is the \alpha-level critical value of an F distribution with k-1 numerator degrees of freedom and \nu denominator degrees of freedom.

Scheffé's test is compatible with the overall ANOVA F-test in that Scheffé's method never declares a contrast significant if the overall F-test is nonsignificant. Most other multiple comparison methods can find significant contrasts when the overall F-test is nonsignificant and, therefore, suffer a loss of power when used with a preliminary F-test.

Scheffé's method may be more powerful than the Bonferroni or Sidak methods if the number of comparisons is large relative to the number of means. For pairwise comparisons, Sidak t tests are generally more powerful.

Tukey (1952, 1953) proposes a test designed specifically for pairwise comparisons based on the studentized range, sometimes called the "honestly significant difference test," that controls the MEER when the sample sizes are equal. Tukey (1953) and Kramer (1956) independently propose a modification for unequal cell sizes. The Tukey or Tukey-Kramer method is provided by the TUKEY option in the MEANS statement and the ADJUST=TUKEY option in the LSMEANS statement. This method has fared extremely well in Monte Carlo studies (Dunnett 1980). In addition, Hayter (1984) gives a proof that the Tukey-Kramer procedure controls the MEER for means comparisons, and Hayter (1989) describes the extent to which the Tukey-Kramer procedure has been proven to control the MEER for LS-means comparisons. The Tukey-Kramer method is more powerful than the Bonferroni, Sidak, or Scheffé methods for pairwise comparisons. Two means are considered significantly different by the Tukey-Kramer criterion if
|t_{ij}| \geq \frac{q(\alpha; k, \nu)}{\sqrt{2}}

where q(\alpha; k, \nu) is the \alpha-level critical value of a studentized range distribution of k independent normal random variables with \nu degrees of freedom.

Hochberg (1974) devised a method (the GT2 or SMM option) similar to Tukey's, but it uses the studentized maximum modulus instead of the studentized range and employs Sidak's (1967) uncorrelated t inequality. It is proven to hold the MEER at a level not exceeding \alpha with unequal sample sizes. It is generally less powerful than the Tukey-Kramer method and always less powerful than Tukey's test for equal cell sizes. Two means are declared significantly different if
|t_{ij}| \geq m(\alpha; c, \nu)
where m(\alpha;c,\nu) is the \alpha-level critical value of the studentized maximum modulus distribution of c independent normal random variables with \nu degrees of freedom and c = k(k-1)/2.

Gabriel (1978) proposes another method (the GABRIEL option) based on the studentized maximum modulus. This method is applicable only to arithmetic means. It rejects if
\frac{|\bar{y}_i - \bar{y}_j|}{s \left( \frac{1}{\sqrt{2n_i}} + \frac{1}{\sqrt{2n_j}} \right)} \geq m(\alpha; k, \nu)

For equal cell sizes, Gabriel's test is equivalent to Hochberg's GT2 method. For unequal cell sizes, Gabriel's method is more powerful than GT2 but may become liberal with highly disparate cell sizes (refer also to Dunnett 1980). Gabriel's test is the only method for unequal sample sizes that lends itself to a graphical representation as intervals around the means. Assuming \bar{y}_i > \bar{y}_j, you can rewrite the preceding inequality as

\bar{y}_i - m(\alpha;k,\nu) \frac{s}{\sqrt{2n_i}} \geq \bar{y}_j + m(\alpha;k,\nu) \frac{s}{\sqrt{2n_j}}

The expression on the left does not depend on j, nor does the expression on the right depend on i. Hence, you can form what Gabriel calls an (l,u)-interval around each sample mean and declare two means to be significantly different if their (l,u)-intervals do not overlap. See Hsu (1996, section 5.2.1.1) for a discussion of other methods of graphically representing all pair-wise comparisons.
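The (l,u)-interval logic can be sketched in a few lines of Python. The critical value m and the data below are arbitrary placeholders (computing m(\alpha;k,\nu) itself requires the studentized maximum modulus distribution, which is not implemented here); the point is only that non-overlap of the intervals is algebraically equivalent to the direct test:

```python
import math

def lu_interval(ybar, n, s, m):
    """Gabriel's (l,u)-interval: ybar -/+ m * s / sqrt(2n)."""
    h = m * s / math.sqrt(2 * n)
    return ybar - h, ybar + h

def gabriel_reject(yi, ni, yj, nj, s, m):
    """Direct form of Gabriel's test for one pair of means."""
    return abs(yi - yj) / (s * (1 / math.sqrt(2 * ni) + 1 / math.sqrt(2 * nj))) >= m

# placeholder pooled s, critical value m, and (mean, n) data
s, m = 2.0, 2.8
data = [(10.0, 5), (11.5, 8), (14.0, 3)]
for a in range(len(data)):
    for b in range(len(data)):
        if data[a][0] > data[b][0]:
            li, _ = lu_interval(*data[a], s, m)    # lower end of larger mean
            _, uj = lu_interval(*data[b], s, m)    # upper end of smaller mean
            assert gabriel_reject(*data[a], *data[b], s, m) == (li >= uj)
```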

Comparing All Treatments to a Control

One special case of means comparison is that in which the only comparisons that need to be tested are between a set of new treatments and a single control. In this case, you can achieve better power by using a method that is restricted to test only comparisons to the single control mean. Dunnett (1955) proposes a test for this situation that declares a mean significantly different from the control if
|t_{i0}| \geq d(\alpha; k, \nu, \rho_1, \ldots, \rho_{k-1})

where \bar{y}_0 is the control mean and d(\alpha; k, \nu, \rho_1, \ldots, \rho_{k-1}) is the critical value of the "many-to-one t statistic" (Miller 1981; Krishnaiah and Armitage 1966) for k means to be compared to a control, with \nu error degrees of freedom and correlations \rho_1, \ldots, \rho_{k-1}, where \rho_i = n_i/(n_0 + n_i). The correlation terms arise because each of the treatment means is being compared to the same control. Dunnett's test holds the MEER to a level not exceeding the stated \alpha.
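The correlation structure is easy to see in a small Monte Carlo sketch (Python; the sample sizes below are arbitrary placeholders). Each simulated difference shares the same control mean, and the empirical correlation between two such differences matches \sqrt{\rho_i \rho_j}:

```python
import math
import random

random.seed(7)
n0, n1, n2 = 20, 10, 40          # placeholder control and treatment sizes
reps = 20000

d1, d2 = [], []
for _ in range(reps):
    # sample means for the control and two treatments (sigma = 1)
    y0 = random.gauss(0, 1 / math.sqrt(n0))
    d1.append(random.gauss(0, 1 / math.sqrt(n1)) - y0)
    d2.append(random.gauss(0, 1 / math.sqrt(n2)) - y0)

m1, m2 = sum(d1) / reps, sum(d2) / reps
cov = sum((a - m1) * (b - m2) for a, b in zip(d1, d2)) / reps
var1 = sum((a - m1) ** 2 for a in d1) / reps
var2 = sum((b - m2) ** 2 for b in d2) / reps
est = cov / math.sqrt(var1 * var2)

rho = lambda ni: ni / (n0 + ni)
theory = math.sqrt(rho(n1) * rho(n2))   # correlation implied by the text
assert abs(est - theory) < 0.03
```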

Approximate and Simulation-based Methods

Both Tukey's and Dunnett's tests are based on the same general quantile calculation:
q^t(\alpha, \nu, R) = \{ q : P(\max(|t_1|, \ldots, |t_n|) > q) = \alpha \}

where the t_i have a joint multivariate t distribution with \nu degrees of freedom and correlation matrix R. In general, evaluating q^t(\alpha,\nu,R) requires repeated numerical calculation of an (n+1)-fold integral. This is usually intractable, but the problem reduces to a feasible 2-fold integral when R has a certain symmetry in the case of Tukey's test, and a factor analytic structure (cf. Hsu 1992) in the case of Dunnett's test. The R matrix has the required symmetry for exact computation of Tukey's test if the t_i are studentized differences between pairs of uncorrelated means with equal variances, that is, with equal sample sizes. Refer to Hsu (1992, 1996) for more information. The R matrix has the factor analytic structure for exact computation of Dunnett's test if the t_i are studentized differences between k-1 treatment means and a single control mean.

However, there are other important situations that do not result in a correlation matrix R with the structure needed for exact computation, such as all pairwise comparisons with unequal sample sizes, and comparisons with a control when R does not have a factor analytic structure.

In these situations, exact calculation of q^t(\alpha,\nu,R) is intractable in general. Most of the preceding methods can be viewed as using various approximations for q^t(\alpha,\nu,R). When the sample sizes are unequal, the Tukey-Kramer test is equivalent to another approximation. For comparisons with a control when the correlation R does not have a factor analytic structure, Hsu (1992) suggests approximating R with a matrix R^* that does have such a structure and correspondingly approximating q^t(\alpha,\nu,R) with q^t(\alpha,\nu,R^*). When you request Dunnett's test for LS-means (the PDIFF=CONTROL and ADJUST=DUNNETT options), the GLM procedure automatically uses Hsu's approximation when appropriate.

Finally, Edwards and Berry (1987) suggest calculating q^t(\alpha,\nu,R) by simulation. Multivariate t vectors are sampled from a distribution with the appropriate \nu and R parameters, and Edwards and Berry (1987) suggest estimating q^t(\alpha,\nu,R) by \hat{q}, the \alpha percentile of the observed values of \max(|t_1|, \ldots, |t_n|). Sufficient samples are generated for the true P(\max(|t_1|, \ldots, |t_n|) > \hat{q}) to be within a certain accuracy radius \gamma of \alpha with accuracy confidence 100(1-\epsilon)%. You can approximate q^t(\alpha,\nu,R) by simulation for comparisons between LS-means by specifying ADJUST=SIM (with either PDIFF=ALL or PDIFF=CONTROL). By default, \gamma=0.005 and \epsilon=0.01, so that the tail area of \hat{q} is within 0.005 of \alpha with 99% confidence. You can use the ACC= and EPS= options with ADJUST=SIM to reset \gamma and \epsilon, or you can use the NSAMP= option to set the sample size directly. You can also control the random number sequence with the SEED= option.
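The crude simulation idea can be sketched in a few lines. The Python example below uses the special case \nu = \infty and R = I, where the t_i reduce to independent standard normals and the exact quantile is available in closed form as a check; ADJUST=SIM itself handles general \nu and R, which this sketch does not attempt:

```python
import random
import statistics

random.seed(42)
n, alpha, reps = 3, 0.05, 200_000

# crude Edwards-Berry-style estimate: upper alpha percentile of max |t_i|
maxes = sorted(max(abs(random.gauss(0, 1)) for _ in range(n)) for _ in range(reps))
q_hat = maxes[int((1 - alpha) * reps)]

# exact value for independent normals: solve (2*Phi(q) - 1)^n = 1 - alpha
exact = statistics.NormalDist().inv_cdf(0.5 + (1 - alpha) ** (1 / n) / 2)
assert abs(q_hat - exact) < 0.05
```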

Hsu and Nelson (1998) suggest a more accurate simulation method for estimating q^t(\alpha,\nu,R), using a control variate adjustment technique. The same independent, standardized normal variates that are used to generate multivariate t vectors from a distribution with the appropriate \nu and R parameters are also used to generate multivariate t vectors from a distribution for which the exact value of q^t(\alpha,\nu,R) is known. \max(| t_1|, ... ,| t_n|) for the second sample is used as a control variate for adjusting the quantile estimate based on the first sample; refer to Hsu and Nelson (1998) for more details. The control variate adjustment has the drawback that it takes somewhat longer than the crude technique of Edwards and Berry (1987), but it typically yields an estimate that is many times more accurate. In most cases, if you are using ADJUST=SIM, then you should specify ADJUST=SIM(CVADJUST). You can also specify ADJUST=SIM(CVADJUST REPORT) to display a summary of the simulation that includes, among other things, the actual accuracy radius \gamma, which should be substantially smaller than the target accuracy radius (0.005 by default).

Multiple-Stage Tests

You can use all of the methods discussed so far to obtain simultaneous confidence intervals (Miller 1981). By sacrificing the facility for simultaneous estimation, you can obtain simultaneous tests with greater power using multiple-stage tests (MSTs). MSTs come in both step-up and step-down varieties (Welsch 1977). The step-down methods, which have been more widely used, are available in SAS/STAT software.

Step-down MSTs first test the homogeneity of all of the means at a level \gamma_k. If the test results in a rejection, then each subset of k-1 means is tested at level \gamma_{k-1}; otherwise, the procedure stops. In general, if the hypothesis of homogeneity of a set of p means is rejected at the \gamma_p level, then each subset of p-1 means is tested at the \gamma_{p-1} level; otherwise, the set of p means is considered not to differ significantly and none of its subsets are tested. The many varieties of MSTs that have been proposed differ in the levels \gamma_p and the statistics on which the subset tests are based. Clearly, the EERC of a step-down MST is not greater than \gamma_k, and the CER is not greater than \gamma_2, but the MEER is a complicated function of \gamma_p, p = 2, ... ,k.

With unequal cell sizes, PROC GLM uses the harmonic mean of the cell sizes as the common sample size. However, since the resulting operating characteristics can be undesirable, MSTs are recommended only for the balanced case. When the sample sizes are equal and if the range statistic is used, you can arrange the means in ascending or descending order and test only contiguous subsets. But if you specify the F statistic, this shortcut cannot be taken. For this reason, only range-based MSTs are implemented. It is common practice to report the results of an MST by writing the means in such an order and drawing lines parallel to the list of means spanning the homogeneous subsets. This form of presentation is also convenient for pairwise comparisons with equal cell sizes.

The best known MSTs are the Duncan (the DUNCAN option) and Student-Newman-Keuls (the SNK option) methods (Miller 1981). Both use the studentized range statistic and, hence, are called multiple range tests. Duncan's method is often called the "new" multiple range test despite the fact that it is one of the oldest MSTs in current use.

The Duncan and SNK methods differ in the \gamma_p values used. For Duncan's method, they are

\gamma_p = 1 - (1 - \alpha)^{p-1}

whereas the SNK method uses

\gamma_p = \alpha
Duncan's method controls the CER at the \alpha level. Its operating characteristics appear similar to those of Fisher's unprotected LSD or repeated t tests at level \alpha (Petrinovich and Hardyck 1969). Since repeated t tests are easier to compute, easier to explain, and applicable to unequal sample sizes, Duncan's method is not recommended. Several published studies (for example, Carmer and Swanson 1973) have claimed that Duncan's method is superior to Tukey's because of greater power without considering that the greater power of Duncan's method is due to its higher type 1 error rate (Einot and Gabriel 1975).
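The consequence of Duncan's choice of stage levels is easy to quantify. This Python sketch evaluates \gamma_p = 1 - (1-\alpha)^{p-1} and shows how quickly the per-stage level grows with p, which is why the method controls only the CER:

```python
def duncan_gamma(p, alpha=0.05):
    """Stage level used by Duncan's multiple range test for a subset of p means."""
    return 1 - (1 - alpha) ** (p - 1)

# gamma_2 equals alpha, but gamma_10 is roughly 0.37
assert abs(duncan_gamma(2) - 0.05) < 1e-12
assert round(duncan_gamma(10), 2) == 0.37
```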

The SNK method holds the EERC to the \alpha level but does not control the MEER (Einot and Gabriel 1975). Consider ten population means that occur in five pairs such that means within a pair are equal, but there are large differences between pairs. If you make the usual sampling assumptions and also assume that the sample sizes are very large, all subset homogeneity hypotheses for three or more means are rejected. The SNK method then comes down to five independent tests, one for each pair, each at the \alpha level. Letting \alpha be 0.05, the probability of at least one false rejection is
1 - (1 - 0.05)^5 = 0.23
As the number of means increases, the MEER approaches 1. Therefore, the SNK method cannot be recommended.

A variety of MSTs that control the MEER have been proposed, but these methods are not as well known as those of Duncan and SNK. An approach developed by Ryan (1959, 1960), Einot and Gabriel (1975), and Welsch (1977) sets
\gamma_p = \begin{cases} 1 - (1 - \alpha)^{p/k} & \text{for } p < k - 1 \\ \alpha & \text{for } p \geq k - 1 \end{cases}
You can use range statistics, leading to what is called the REGWQ method after the authors' initials. If you assume that the sample means have been arranged in descending order from \bar{y}_1 through \bar{y}_k, the homogeneity of means \bar{y}_i, \ldots, \bar{y}_j, i < j, is rejected by REGWQ if

\bar{y}_i - \bar{y}_j \geq q(\gamma_p; p, \nu) \frac{s}{\sqrt{n}}

where p = j - i + 1 (Einot and Gabriel 1975). To ensure that the MEER is controlled, the current implementation checks whether q(\gamma_p; p, \nu) is monotonically increasing in p. If not, then a set of critical values that are increasing in p is substituted instead.
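The REGWQ stage levels can be sketched directly from the \gamma_p definition above (Python; k = 10 and \alpha = 0.05 are illustrative values). Unlike Duncan's levels, every \gamma_p stays at or below \alpha, and the levels are nondecreasing in p:

```python
def regwq_gamma(p, k, alpha=0.05):
    """Ryan-Einot-Gabriel-Welsch stage level for a subset of p out of k means."""
    return alpha if p >= k - 1 else 1 - (1 - alpha) ** (p / k)

k = 10
gammas = [regwq_gamma(p, k) for p in range(2, k + 1)]
assert all(g <= 0.05 + 1e-12 for g in gammas)            # never above alpha
assert all(a <= b for a, b in zip(gammas, gammas[1:]))   # nondecreasing in p
```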

REGWQ appears to be the most powerful step-down MST in the current literature (for example, Ramsey 1978). Use of a preliminary F-test decreases the power of all the other multiple comparison methods discussed previously except for Scheffé's test.

Bayesian Approach

Waller and Duncan (1969) and Duncan (1975) take an approach to multiple comparisons that differs from all the methods previously discussed in minimizing the Bayes risk under additive loss rather than controlling type 1 error rates. For each pair of population means \mu_i and \mu_j, null (H_0^{ij}) and alternative (H_a^{ij}) hypotheses are defined:

H_0^{ij}: \mu_i - \mu_j \leq 0
H_a^{ij}: \mu_i - \mu_j > 0

For any i, j pair, let d_0 indicate a decision in favor of H_0^{ij} and d_a indicate a decision in favor of H_a^{ij}, and let \delta = \mu_i - \mu_j. The loss function for the decision on the i, j pair is

L(d_0 \mid \delta) = \begin{cases} 0 & \text{if } \delta \leq 0 \\ \delta & \text{if } \delta > 0 \end{cases}

L(d_a \mid \delta) = \begin{cases} -k\delta & \text{if } \delta \leq 0 \\ 0 & \text{if } \delta > 0 \end{cases}
where k represents a constant that you specify rather than the number of means. The loss for the joint decision involving all pairs of means is the sum of the losses for each individual decision. The population means are assumed to have a normal prior distribution with unknown variance, the logarithm of the variance of the means having a uniform prior distribution. For the i, j pair, the null hypothesis is rejected if
\bar{y}_i - \bar{y}_j \geq t_B \, s \sqrt{\frac{2}{n}}

where t_B is the Bayesian t value (Waller and Kemp 1976), which depends on k, the F statistic for the one-way ANOVA, and the degrees of freedom for F. The value of t_B is a decreasing function of F, so the Waller-Duncan test (specified by the WALLER option) becomes more liberal as F increases.

Recommendations

In summary, if you are interested in several individual comparisons and are not concerned about the effects of multiple inferences, you can use repeated t tests or Fisher's unprotected LSD. If you are interested in all pairwise comparisons or all comparisons with a control, you should use Tukey's or Dunnett's test, respectively, in order to make the strongest possible inferences. If you have weaker inferential requirements and, in particular, if you don't want confidence intervals for the mean differences, you should use the REGWQ method. Finally, if you agree with the Bayesian approach and Waller and Duncan's assumptions, you should use the Waller-Duncan test.

Interpretation of Multiple Comparisons

When you interpret multiple comparisons, remember that failure to reject the hypothesis that two or more means are equal should not lead you to conclude that the population means are, in fact, equal. Failure to reject the null hypothesis implies only that the difference between population means, if any, is not large enough to be detected with the given sample size. A related point is that nonsignificance is nontransitive: that is, given three sample means, the largest and smallest may be significantly different from each other, while neither is significantly different from the middle one. Nontransitive results of this type occur frequently in multiple comparisons.

Multiple comparisons can also lead to counter-intuitive results when the cell sizes are unequal. Consider four cells labeled A, B, C, and D, with sample means in the order A>B>C>D. If A and D each have two observations, and B and C each have 10,000 observations, then the difference between B and C may be significant, while the difference between A and D is not.
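This counterintuitive behavior follows directly from the standard errors. A small Python sketch (the means, pooled s, and cell sizes are made-up numbers matching the scenario in the text):

```python
import math

s = 1.0                                   # assumed pooled standard deviation
cells = {"A": (10.2, 2), "B": (10.1, 10_000),
         "C": (9.9, 10_000), "D": (9.8, 2)}   # (sample mean, cell size)

def t_stat(a, b):
    """|t| for the difference between two cell means with pooled s."""
    (ya, na), (yb, nb) = cells[a], cells[b]
    return abs(ya - yb) / (s * math.sqrt(1 / na + 1 / nb))

# A - D is the larger observed difference, but B - C has the larger t statistic
# because its tiny standard error comes from the huge cell sizes.
assert cells["A"][0] - cells["D"][0] > cells["B"][0] - cells["C"][0]
assert t_stat("B", "C") > t_stat("A", "D")
```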


Copyright © 1999 by SAS Institute Inc., Cary, NC, USA. All rights reserved.