Cochran-Mantel-Haenszel Statistics

The FREQ Procedure

For n-way crosstabulation tables, consider the following example:

   proc freq;
      tables A*B*C*D / cmh;
   run;

The CMH option in the TABLES statement gives a stratified statistical analysis of the relationship between C and D, after controlling for A and B. The stratified analysis provides a way to adjust for the possible confounding effects of A and B without being forced to estimate parameters for them. The analysis produces Cochran-Mantel-Haenszel statistics, and for 2 × 2 tables, it includes estimation of the common odds ratio, common relative risks, and the Breslow-Day test for homogeneity of the odds ratios.

Let the number of strata be denoted by q, indexing the strata by h = 1, 2, ... , q. Each stratum contains a contingency table with X representing the row variable and Y representing the column variable. For table h, denote the cell frequency in row i and column j by n_hij, with corresponding row and column marginal totals denoted by n_hi· and n_h·j, and the overall stratum total by n_h.

Because the formulas for the Cochran-Mantel-Haenszel statistics are more easily defined in terms of matrices, the following notation is used. Vectors are presumed to be column vectors unless they are transposed (').

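$\mathbf{n}_{hi}' = (n_{hi1}, n_{hi2}, \ldots, n_{hiC})$   (1 × C)
$\mathbf{n}_h' = (\mathbf{n}_{h1}', \mathbf{n}_{h2}', \ldots, \mathbf{n}_{hR}')$   (1 × RC)
$p_{hi \cdot} = n_{hi \cdot} / n_h$   (1 × 1)
$p_{h \cdot j} = n_{h \cdot j} / n_h$   (1 × 1)
$P_{h * \cdot}' = (p_{h1 \cdot}, p_{h2 \cdot}, \ldots, p_{hR \cdot})$   (1 × R)
$P_{h \cdot *}' = (p_{h \cdot 1}, p_{h \cdot 2}, \ldots, p_{h \cdot C})$   (1 × C)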

Assume that the strata are independent and that the marginal totals of each stratum are fixed. The null hypothesis, H₀, is that there is no association between X and Y in any of the strata. The corresponding model is the multiple hypergeometric; this implies that, under H₀, the expected value and covariance matrix of the frequencies are, respectively,

$\mathbf{m}_h = E[\mathbf{n}_h \mid H_0] = n_h (P_{h \cdot *} \otimes P_{h * \cdot})$

and

$\mathrm{Var}[\mathbf{n}_h \mid H_0] = c \left( (D_{P_{h \cdot *}} - P_{h \cdot *} P_{h \cdot *}') \otimes (D_{P_{h * \cdot}} - P_{h * \cdot} P_{h * \cdot}') \right)$

where

$c = \frac{n_h^2}{n_h - 1}$

and where $\otimes$ denotes Kronecker product multiplication and D_a is a diagonal matrix with elements of a on the main diagonal.
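As an illustration of these two moments, the following Python sketch computes the expected frequencies and the null covariances for a single stratum element by element, instead of forming the Kronecker products. It is a minimal sketch under the fixed-margins assumption stated above, not PROC FREQ's implementation; the function name `null_moments` and the 2 × 2 counts are hypothetical.

```python
def null_moments(table):
    """Return (m, cov) for one stratum under H0 with fixed margins.

    m[i][j] is the expected count n_h * p_hi. * p_h.j, and cov(i, j, k, l)
    is the null covariance between cells (i, j) and (k, l).
    """
    n_rows, n_cols = len(table), len(table[0])
    n = sum(sum(row) for row in table)                  # stratum total n_h (> 1)
    p_row = [sum(row) / n for row in table]             # p_hi.
    p_col = [sum(row[j] for row in table) / n for j in range(n_cols)]  # p_h.j
    m = [[n * p_row[i] * p_col[j] for j in range(n_cols)] for i in range(n_rows)]
    c = n * n / (n - 1)

    def cov(i, j, k, l):
        # One element of (D_P - P P') for the rows times one for the columns.
        row_part = (p_row[i] if i == k else 0.0) - p_row[i] * p_row[k]
        col_part = (p_col[j] if j == l else 0.0) - p_col[j] * p_col[l]
        return c * row_part * col_part

    return m, cov


# Hypothetical stratum: rows are levels of X, columns are levels of Y.
m, cov = null_moments([[10, 5], [4, 11]])
print(m[0][0])          # expected count for cell (1, 1)
print(cov(0, 0, 0, 0))  # null variance of n_h11
```

For a 2 × 2 table, the variance returned for cell (1, 1) agrees with the usual hypergeometric variance, the product of the four marginal totals divided by n_h²(n_h - 1).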

The generalized CMH statistic (Landis, Heyman, and Koch 1978) is defined as

$Q_{CMH} = G' V_G^{-1} G$

where

$G = \sum_h B_h (\mathbf{n}_h - \mathbf{m}_h)$

$V_G = \sum_h B_h \, \mathrm{Var}(\mathbf{n}_h \mid H_0) \, B_h'$

and where

$B_h = C_h \otimes R_h$

is a matrix of fixed constants based on column scores C_h and row scores R_h. When the null hypothesis is true, the CMH statistic has an asymptotic chi-square distribution with degrees of freedom equal to the rank of B_h. If V_G is found to be singular, PROC FREQ prints a message and sets the value of the CMH statistic to missing.

PROC FREQ computes three CMH statistics using this formula for the generalized CMH statistic, with different row and column score definitions for each statistic. The CMH statistics that PROC FREQ computes are the correlation statistic, the ANOVA (row mean scores) statistic, and the general association statistic. These statistics test the null hypothesis of no association against different alternative hypotheses. The following sections describe the computation of these CMH statistics.

Caution: The CMH statistics have low power for detecting an association in which the patterns of association for some of the strata are in the opposite direction of the patterns displayed by other strata. Thus, a nonsignificant CMH statistic suggests either that there is no association or that no pattern of association has enough strength or consistency to dominate any other pattern.
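To make the caution concrete, the following Python sketch evaluates the CMH statistic for stratified 2 × 2 tables. In that case the generalized formula reduces to a scalar: the squared sum of (n_h11 - m_h11) over strata, divided by the sum of the hypergeometric variances of n_h11. This is an illustrative sketch for the 2 × 2 case only, not PROC FREQ's code; `cmh_2x2` and both sets of counts are hypothetical.

```python
def cmh_2x2(strata):
    """CMH statistic (1 df) for a list of 2 x 2 tables [[n11, n12], [n21, n22]].

    Assumes every stratum total exceeds 1.
    """
    g = 0.0    # sum over strata of n_h11 - m_h11
    v = 0.0    # sum over strata of var(n_h11 | H0)
    for (a, b), (c, d) in strata:
        n = a + b + c + d
        r1, r2, c1, c2 = a + b, c + d, a + c, b + d
        g += a - r1 * c1 / n
        v += r1 * r2 * c1 * c2 / (n * n * (n - 1))
    if v == 0:
        return None    # singular V_G: the statistic is treated as missing
    return g * g / v


same_direction = [[[10, 5], [4, 11]], [[8, 12], [3, 17]]]
opposite_direction = [[[10, 5], [4, 11]], [[5, 10], [11, 4]]]
print(cmh_2x2(same_direction))      # deviations add up across strata
print(cmh_2x2(opposite_direction))  # deviations cancel in G
```

In the second data set each stratum shows an equally strong association, but in opposite directions, so the deviations cancel and the statistic is zero.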

Correlation Statistic

The correlation statistic, popularized by Mantel and Haenszel (1959) and Mantel (1963), has one degree of freedom and is known as the Mantel-Haenszel statistic.

The alternative hypothesis for the correlation statistic is that there is a linear association between X and Y in at least one stratum. If either X or Y does not lie on an ordinal (or interval) scale, then this statistic is not meaningful.

To compute the correlation statistic, PROC FREQ uses the formula for the generalized CMH statistic with the row and column scores determined by the SCORES= option in the TABLES statement. See the section "Scores" for more information on the available score types. The matrix of row scores R_h has dimension 1 × R, and the matrix of column scores C_h has dimension 1 × C.

When there is only one stratum, this CMH statistic reduces to (n-1)r², where r is the Pearson correlation coefficient between X and Y. When nonparametric (RANK or RIDIT) scores are specified, then the statistic reduces to (n-1)r_s², where r_s is the Spearman rank correlation coefficient between X and Y. When there is more than one stratum, then this CMH statistic becomes a stratum-adjusted correlation statistic.
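The single-stratum reduction can be sketched in Python as follows. The function computes the Pearson correlation between the row and column scores, weighted by the cell counts, and returns (n-1)r². It is a sketch of that identity rather than of PROC FREQ's computation; `correlation_statistic`, the 3 × 3 counts, and the integer scores 1, 2, 3 are assumptions made for illustration.

```python
def correlation_statistic(table, row_scores, col_scores):
    """Return ((n - 1) * r**2, r) for one table with the given scores."""
    n = sum(sum(row) for row in table)
    cells = [
        (row_scores[i], col_scores[j], count)
        for i, row in enumerate(table)
        for j, count in enumerate(row)
    ]
    mean_x = sum(x * w for x, _, w in cells) / n
    mean_y = sum(y * w for _, y, w in cells) / n
    sxx = sum(w * (x - mean_x) ** 2 for x, _, w in cells)
    syy = sum(w * (y - mean_y) ** 2 for _, y, w in cells)
    sxy = sum(w * (x - mean_x) * (y - mean_y) for x, y, w in cells)
    r = sxy / (sxx * syy) ** 0.5
    return (n - 1) * r * r, r


# Hypothetical 3 x 3 table with ordinal rows and columns.
table = [[12, 6, 2], [7, 9, 5], [3, 8, 11]]
q_c, r = correlation_statistic(table, [1, 2, 3], [1, 2, 3])
print(q_c, r)
```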

ANOVA (Row Mean Scores) Statistic

The ANOVA statistic can be used only when the column variable Y lies on an ordinal (or interval) scale so that the mean score of Y is meaningful. For the ANOVA statistic, the mean score is computed for each row of the table, and the alternative hypothesis is that, for at least one stratum, the mean scores of the R rows are unequal. In other words, the statistic is sensitive to location differences among the R distributions of Y.

The matrix of column scores C_h has dimension 1 × C; the column scores are determined by the SCORES= option.

The matrix of row scores R_h has dimension (R-1) × R and is created internally by PROC FREQ as

$R_h = [ I_{R-1}, -J_{R-1} ]$

where I_{R-1} is an identity matrix of rank R-1, and J_{R-1} is an (R-1) × 1 vector of ones. This matrix has the effect of forming R-1 independent contrasts of the R mean scores.

When there is only one stratum, this CMH statistic is essentially an analysis of variance (ANOVA) statistic in the sense that it is a function of the variance ratio F statistic that would be obtained from a one-way ANOVA on the dependent variable Y. If nonparametric scores are specified in this case, then the ANOVA statistic is a Kruskal-Wallis test.
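The link to the F statistic can be sketched numerically. For one stratum, the row mean scores statistic is commonly written as (n-1) times the ratio of the between-row sum of squares to the total sum of squares of the column scores, and the one-way ANOVA F ratio follows from it algebraically. The Python sketch below assumes that form; `row_mean_score_statistic`, the counts, and the scores are hypothetical, and this is not PROC FREQ's code.

```python
def row_mean_score_statistic(table, col_scores):
    """Return (Q_S, F) for one table, where Q_S has R - 1 degrees of freedom."""
    n_rows = len(table)
    n = sum(sum(row) for row in table)
    col_totals = [sum(row[j] for row in table) for j in range(len(col_scores))]
    grand_mean = sum(a * t for a, t in zip(col_scores, col_totals)) / n
    ss_total = sum(t * (a - grand_mean) ** 2 for a, t in zip(col_scores, col_totals))
    ss_between = 0.0
    for row in table:
        row_n = sum(row)    # assumed to be positive for every row
        row_mean = sum(a * count for a, count in zip(col_scores, row)) / row_n
        ss_between += row_n * (row_mean - grand_mean) ** 2
    q_s = (n - 1) * ss_between / ss_total
    # The same sums of squares give the one-way ANOVA variance ratio.
    f_ratio = (q_s / (n_rows - 1)) / ((n - 1 - q_s) / (n - n_rows))
    return q_s, f_ratio


# Hypothetical 3 x 3 table; columns are scored 1, 2, 3.
q_s, f_ratio = row_mean_score_statistic([[12, 6, 2], [7, 9, 5], [3, 8, 11]], [1, 2, 3])
print(q_s, f_ratio)
```

With only two rows, the result coincides with the correlation statistic computed from the same column scores.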

If there is more than one stratum, then this CMH statistic corresponds to a stratum-adjusted ANOVA or Kruskal-Wallis test. In the special case where there is one subject per row and one subject per column in the contingency table of each stratum, this CMH statistic is identical to Friedman's chi-square. See Example 28.8 for an illustration.

General Association Statistic

The alternative hypothesis for the general association statistic is that, for at least one stratum, there is some kind of association between X and Y. This statistic is always interpretable because it does not require an ordinal scale for either X or Y.

For the general association statistic, the matrix R_h is the same as the one used for the ANOVA statistic. The matrix C_h is defined similarly as

$C_h = [ I_{C-1}, -J_{C-1} ]$

PROC FREQ generates both score matrices internally. When there is only one stratum, then the general association CMH statistic reduces to Q_P (n-1)/n, where Q_P is the Pearson chi-square statistic. When there is more than one stratum, then the CMH statistic becomes a stratum-adjusted Pearson chi-square statistic. Note that a similar adjustment can be made by summing the Pearson chi-squares across the strata. However, the latter statistic requires a large sample size in each stratum to support the resulting chi-square distribution with q(R-1)(C-1) degrees of freedom. The CMH statistic requires only a large overall sample size since it has only (R-1)(C-1) degrees of freedom.
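The single-stratum identity and the degrees-of-freedom comparison can be sketched in Python as follows. The helper names and the counts are hypothetical, and the code only illustrates the relationships stated above; it is not PROC FREQ's computation.

```python
def pearson_chi_square(table):
    """Return (Q_P, n) for one R x C table with positive marginal totals."""
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(row[j] for row in table) for j in range(len(table[0]))]
    q_p = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            q_p += (observed - expected) ** 2 / expected
    return q_p, n


def general_association_single_stratum(table):
    """General association statistic for one stratum: Q_P * (n - 1) / n."""
    q_p, n = pearson_chi_square(table)
    return q_p * (n - 1) / n


def degrees_of_freedom(q, n_rows, n_cols):
    """Return (df of summed Pearson chi-squares, df of the CMH statistic)."""
    per_table = (n_rows - 1) * (n_cols - 1)
    return q * per_table, per_table


print(general_association_single_stratum([[10, 5], [4, 11]]))
print(degrees_of_freedom(4, 3, 3))  # four strata of 3 x 3 tables
```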

Refer to Cochran (1954); Mantel and Haenszel (1959); Mantel (1963); Birch (1965); Landis, Heyman, and Koch (1978).

Adjusted Odds Ratio and Relative Risk Estimates

The CMH option provides adjusted odds ratio and relative risk estimates for stratified 2 × 2 tables. For each of these measures, PROC FREQ computes the Mantel-Haenszel estimate and the logit estimate. These estimates apply to n-way table requests in the TABLES statement, when the row and column variables both have only two levels.

For example,

   proc freq;
      tables A*B*C*D / cmh;
   run;

In this example, if the row and column variables C and D both have two levels, PROC FREQ provides odds ratio and relative risk estimates, adjusting for the confounding variables A and B.

The choice of an appropriate measure depends on the study design. For case-control (retrospective) studies, the odds ratio is appropriate. For cohort (prospective) or cross-sectional studies, the relative risk is appropriate. See the section "Odds Ratio and Relative Risks for 2×2 Tables" for more information on these measures.

Throughout this section, z denotes the $100(1 - \alpha/2)$ percentile of the standard normal distribution.
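For reference, z can be obtained from the inverse normal distribution function. A minimal Python sketch using the standard library follows; `z_value` is a hypothetical helper name.

```python
from statistics import NormalDist


def z_value(alpha):
    """Return the 100(1 - alpha/2) percentile of the standard normal distribution."""
    return NormalDist().inv_cdf(1 - alpha / 2)


z = z_value(0.05)    # used for 95% confidence limits
print(round(z, 2))   # about 1.96
```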

Odds Ratio, Case-Control Studies

Mantel-Haenszel Estimator

The Mantel-Haenszel estimate of the common odds ratio is computed as

$OR_{MH} = \frac{\sum_h n_{h11} n_{h22} / n_h}{\sum_h n_{h12} n_{h21} / n_h}$

It is always computed unless the denominator is zero. Refer to Mantel and Haenszel (1959) and Agresti (1990).

Using the estimated variance for ln(OR_MH) given by Robins, Breslow, and Greenland (1986), PROC FREQ computes the corresponding $100(1 - \alpha)\%$ confidence limits for the odds ratio as

$( OR_{MH} \cdot \exp(-z \hat{\sigma}), \; OR_{MH} \cdot \exp(z \hat{\sigma}) )$

where

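$\hat{\sigma}^2 = \widehat{\mathrm{var}}[\ln(OR_{MH})]$

$= \frac{\sum_h (n_{h11} + n_{h22}) (n_{h11} n_{h22}) / n_h^2}{2 ( \sum_h n_{h11} n_{h22} / n_h )^2} + \frac{\sum_h [ (n_{h11} + n_{h22}) (n_{h12} n_{h21}) + (n_{h12} + n_{h21}) (n_{h11} n_{h22}) ] / n_h^2}{2 ( \sum_h n_{h11} n_{h22} / n_h ) ( \sum_h n_{h12} n_{h21} / n_h )} + \frac{\sum_h (n_{h12} + n_{h21}) (n_{h12} n_{h21}) / n_h^2}{2 ( \sum_h n_{h12} n_{h21} / n_h )^2}$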

Note that the Mantel-Haenszel odds ratio estimator is less sensitive to small n_h than the logit estimator.
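The Mantel-Haenszel estimate and its confidence limits can be sketched in Python as follows. The variance is written in the three-term form in which the Robins, Breslow, and Greenland (1986) estimator is usually presented; treat that form, the function name `mh_odds_ratio`, and the counts as assumptions of this sketch, not as PROC FREQ's implementation.

```python
import math
from statistics import NormalDist


def mh_odds_ratio(strata, alpha=0.05):
    """Return (OR_MH, lower, upper, var_log) for a list of 2 x 2 tables."""
    sum_r = sum_s = 0.0                    # sums of n11*n22/n and n12*n21/n
    sum_pr = sum_ps_qr = sum_qs = 0.0      # numerators of the three variance terms
    for (a, b), (c, d) in strata:
        n = a + b + c + d
        r, s = a * d / n, b * c / n
        p, q = (a + d) / n, (b + c) / n
        sum_r += r
        sum_s += s
        sum_pr += p * r
        sum_ps_qr += p * s + q * r
        sum_qs += q * s
    if sum_s == 0:
        return None                        # denominator is zero: not computed
    or_mh = sum_r / sum_s
    if sum_r == 0:
        return or_mh, None, None, None     # log scale undefined when OR_MH is 0
    var_log = (sum_pr / (2 * sum_r ** 2)
               + sum_ps_qr / (2 * sum_r * sum_s)
               + sum_qs / (2 * sum_s ** 2))
    half_width = NormalDist().inv_cdf(1 - alpha / 2) * math.sqrt(var_log)
    return or_mh, or_mh * math.exp(-half_width), or_mh * math.exp(half_width), var_log


# Two hypothetical strata.
or_mh, lower, upper, var_log = mh_odds_ratio([[[10, 5], [4, 11]], [[8, 12], [3, 17]]])
print(or_mh, lower, upper)
```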

Logit Estimator

The adjusted logit estimate of the odds ratio (Woolf 1955) is computed as

$OR_L = \exp \left( \frac{\sum_h w_h \ln(OR_h)}{\sum_h w_h} \right)$

and the corresponding $100(1 - \alpha)\%$ confidence limits are

$\left( OR_L \cdot \exp \left( \frac{-z}{\sqrt{\sum_h w_h}} \right), \; OR_L \cdot \exp \left( \frac{z}{\sqrt{\sum_h w_h}} \right) \right)$

where OR_h is the odds ratio for stratum h, and

$w_h = \frac{1}{\mathrm{var}(\ln OR_h)}$

If any cell frequency in a stratum h is zero, then PROC FREQ adds 0.5 to each cell of the stratum before computing OR_h and w_h (Haldane 1955), and prints a warning.
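A Python sketch of the logit estimator, including the 0.5 adjustment for strata that contain a zero cell, is shown below. It assumes the usual large-sample variance of a log odds ratio, the sum of the reciprocals of the four cell counts, which this section does not state explicitly; the function names and counts are hypothetical, and the warning that PROC FREQ prints is not reproduced.

```python
import math
from statistics import NormalDist


def stratum_log_or(table):
    """Return (ln OR_h, w_h) for one 2 x 2 table, adding 0.5 if any cell is zero."""
    (a, b), (c, d) = table
    if 0 in (a, b, c, d):
        a, b, c, d = a + 0.5, b + 0.5, c + 0.5, d + 0.5
    log_or = math.log(a * d / (b * c))
    variance = 1 / a + 1 / b + 1 / c + 1 / d    # assumed form of var(ln OR_h)
    return log_or, 1 / variance


def logit_odds_ratio(strata, alpha=0.05):
    """Return (OR_L, lower, upper) for a list of 2 x 2 tables."""
    pairs = [stratum_log_or(table) for table in strata]
    sum_w = sum(w for _, w in pairs)
    pooled = sum(w * log_or for log_or, w in pairs) / sum_w
    half_width = NormalDist().inv_cdf(1 - alpha / 2) / math.sqrt(sum_w)
    return math.exp(pooled), math.exp(pooled - half_width), math.exp(pooled + half_width)


or_l, lower, upper = logit_odds_ratio([[[10, 5], [4, 11]], [[8, 12], [3, 17]]])
print(or_l, lower, upper)
```

Because the estimate is a weighted average on the log scale, it always lies between the smallest and largest stratum-specific odds ratios.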

Relative Risks, Cohort Studies

Mantel-Haenszel Estimator

The Mantel-Haenszel estimate of the common relative risk for column 1 is computed as

$RR_{MH} = \frac{\sum_h n_{h11} n_{h2 \cdot} / n_h}{\sum_h n_{h21} n_{h1 \cdot} / n_h}$

It is always computed unless the denominator is zero. Refer to Mantel and Haenszel (1959) and Agresti (1990).

Using the estimated variance for ln(RR_MH) given by Greenland and Robins (1985), PROC FREQ computes the corresponding $100(1 - \alpha)\%$ confidence limits for the relative risk as

$( RR_{MH} \cdot \exp(-z \hat{\sigma}), \; RR_{MH} \cdot \exp(z \hat{\sigma}) )$

where

$\hat{\sigma}^2 = \widehat{\mathrm{var}}[\ln(RR_{MH})]$

$= \frac{\sum_h ( n_{h1 \cdot} n_{h2 \cdot} n_{h \cdot 1} - n_{h11} n_{h21} n_h ) / n_h^2}{( \sum_h n_{h11} n_{h2 \cdot} / n_h ) ( \sum_h n_{h21} n_{h1 \cdot} / n_h )}$
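The Mantel-Haenszel relative risk and its limits can be sketched in Python as follows. The variance uses the form in which the Greenland and Robins (1985) estimator is usually written, with numerator terms (n_h1· n_h2· n_h·1 - n_h11 n_h21 n_h) / n_h²; treat that form, the name `mh_relative_risk`, and the counts as assumptions of the sketch, not as PROC FREQ's code.

```python
import math
from statistics import NormalDist


def mh_relative_risk(strata, alpha=0.05):
    """Return (RR_MH, lower, upper, var_log) for column 1 of stratified 2 x 2 tables."""
    numer = denom = var_numer = 0.0
    for (a, b), (c, d) in strata:
        r1, r2, c1 = a + b, c + d, a + c    # row totals and column 1 total
        n = r1 + r2
        numer += a * r2 / n
        denom += c * r1 / n
        var_numer += (r1 * r2 * c1 - a * c * n) / (n * n)
    if denom == 0:
        return None                         # denominator is zero: not computed
    rr_mh = numer / denom
    if numer == 0:
        return rr_mh, None, None, None      # log scale undefined when RR_MH is 0
    var_log = var_numer / (numer * denom)
    half_width = NormalDist().inv_cdf(1 - alpha / 2) * math.sqrt(var_log)
    return rr_mh, rr_mh * math.exp(-half_width), rr_mh * math.exp(half_width), var_log


# Two hypothetical strata.
rr_mh, lower, upper, var_log = mh_relative_risk([[[10, 5], [4, 11]], [[8, 12], [3, 17]]])
print(rr_mh, lower, upper)
```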

Logit Estimator

The adjusted logit estimate of the common relative risk for column 1 is computed as

$RR_L = \exp \left( \frac{\sum_h w_h \ln(RR_h)}{\sum_h w_h} \right)$

and the corresponding $100(1 - \alpha)\%$ confidence limits are

$\left( RR_L \cdot \exp \left( \frac{-z}{\sqrt{\sum_h w_h}} \right), \; RR_L \cdot \exp \left( \frac{z}{\sqrt{\sum_h w_h}} \right) \right)$

where RR_h is the column 1 relative risk estimate for stratum h, and

$w_h = \frac{1}{\mathrm{var}(\ln RR_h)}$

If n_h11 or n_h21 is zero, then PROC FREQ adds 0.5 to each cell of the stratum before computing RR_h and w_h, and prints a warning. Refer to Kleinbaum, Kupper, and Morgenstern (1982, Sections 17.4 and 17.5).
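A Python sketch of the logit relative risk follows. It assumes the usual large-sample variance of a log relative risk, 1/n_h11 - 1/n_h1· + 1/n_h21 - 1/n_h2·, which this section does not state explicitly. The function names and counts are hypothetical, and strata in which that variance is zero are not handled.

```python
import math
from statistics import NormalDist


def stratum_log_rr(table):
    """Return (ln RR_h, w_h) for column 1, adding 0.5 if n11 or n21 is zero."""
    (a, b), (c, d) = table
    if a == 0 or c == 0:
        a, b, c, d = a + 0.5, b + 0.5, c + 0.5, d + 0.5
    r1, r2 = a + b, c + d
    log_rr = math.log((a / r1) / (c / r2))
    variance = 1 / a - 1 / r1 + 1 / c - 1 / r2    # assumed form of var(ln RR_h)
    return log_rr, 1 / variance


def logit_relative_risk(strata, alpha=0.05):
    """Return (RR_L, lower, upper) for a list of 2 x 2 tables."""
    pairs = [stratum_log_rr(table) for table in strata]
    sum_w = sum(w for _, w in pairs)
    pooled = sum(w * log_rr for log_rr, w in pairs) / sum_w
    half_width = NormalDist().inv_cdf(1 - alpha / 2) / math.sqrt(sum_w)
    return math.exp(pooled), math.exp(pooled - half_width), math.exp(pooled + half_width)


rr_l, lower, upper = logit_relative_risk([[[10, 5], [4, 11]], [[8, 12], [3, 17]]])
print(rr_l, lower, upper)
```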

Breslow-Day Test for Homogeneity of the Odds Ratios

When you specify the CMH option, PROC FREQ computes the Breslow-Day test for stratified analysis of 2 × 2 tables. It tests the null hypothesis that the odds ratios for the q strata are all equal. When the null hypothesis is true, the statistic has an asymptotic chi-square distribution with q-1 degrees of freedom.

The Breslow-Day statistic is computed as

$Q_{BD} = \sum_h \frac{( n_{h11} - E(n_{h11} \mid OR_{MH}) )^2}{\mathrm{var}(n_{h11} \mid OR_{MH})}$

where E and var denote expected value and variance, respectively. The summation does not include any table with a zero row or column. If OR_MH equals zero or if it is undefined, then PROC FREQ does not compute the statistic and prints a warning message.
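The computation can be sketched in Python under the usual large-sample construction, which this section does not spell out. That construction takes the expected n_h11 to be the count that reproduces OR_MH given the stratum margins, found as the feasible root of a quadratic, and takes the variance to be the reciprocal of the sum of the reciprocals of the four expected cell counts. The function names and counts are hypothetical, and this is not PROC FREQ's code.

```python
import math


def expected_n11(table, odds_ratio):
    """Expected n11 for fixed margins when the odds ratio equals odds_ratio."""
    (a, b), (c, d) = table
    r1, r2, c1 = a + b, c + d, a + c
    if abs(odds_ratio - 1.0) < 1e-12:
        return r1 * c1 / (r1 + r2)
    # e solves e * (r2 - c1 + e) = OR * (r1 - e) * (c1 - e), a quadratic in e.
    qa = odds_ratio - 1.0
    qb = -(odds_ratio * (r1 + c1) + (r2 - c1))
    qc = odds_ratio * r1 * c1
    root_disc = math.sqrt(qb * qb - 4.0 * qa * qc)
    low, high = max(0, c1 - r2), min(r1, c1)
    for e in ((-qb - root_disc) / (2 * qa), (-qb + root_disc) / (2 * qa)):
        if low - 1e-9 <= e <= high + 1e-9:
            return e
    raise ValueError("no root inside the range allowed by the margins")


def breslow_day(strata):
    """Return Q_BD, or None when OR_MH is zero or undefined."""
    numer = sum(a * d / (a + b + c + d) for (a, b), (c, d) in strata)
    denom = sum(b * c / (a + b + c + d) for (a, b), (c, d) in strata)
    if numer == 0 or denom == 0:
        return None
    or_mh = numer / denom
    q_bd = 0.0
    for table in strata:
        (a, b), (c, d) = table
        r1, r2, c1, c2 = a + b, c + d, a + c, b + d
        if 0 in (r1, r2, c1, c2):
            continue    # skip tables with a zero row or column
        e = expected_n11(table, or_mh)
        variance = 1 / (1 / e + 1 / (r1 - e) + 1 / (c1 - e) + 1 / (r2 - c1 + e))
        q_bd += (a - e) ** 2 / variance
    return q_bd


q_bd = breslow_day([[[10, 5], [4, 11]], [[8, 12], [3, 17]]])
print(q_bd)
```

When every stratum has the same odds ratio, the expected counts equal the observed counts and the statistic is zero.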

Caution: Unlike the Cochran-Mantel-Haenszel statistics, the Breslow-Day test requires a large sample size within each stratum, and this limits its usefulness. In addition, the validity of the CMH tests does not depend on any assumption of homogeneity of the odds ratios; therefore, the Breslow-Day test should never be used as an indicator of their validity.

Refer to Breslow and Day (1994).
