The CLUSTER Procedure

Miscellaneous Formulas

The root-mean-square standard deviation of a cluster C_K is

\mathrm{RMSSTD} = \sqrt{\frac{W_K}{v(N_K - 1)}}
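The RMSSTD formula can be sketched in a few lines of Python. This is a minimal illustration, not SAS code: it assumes W_K is the sum of squared Euclidean distances from each observation in the cluster to the cluster mean, v is the number of variables, and N_K is the number of observations in the cluster. The function names `within_ss` and `rmsstd` are hypothetical.

```python
import math


def within_ss(rows):
    """Assumed W_K: sum of squared distances from each row to the mean of rows."""
    n = len(rows)
    v = len(rows[0])
    mean = [sum(row[j] for row in rows) / n for j in range(v)]
    return sum((row[j] - mean[j]) ** 2 for row in rows for j in range(v))


def rmsstd(cluster):
    """Root-mean-square standard deviation: sqrt(W_K / (v * (N_K - 1)))."""
    n_k = len(cluster)
    v = len(cluster[0])
    if n_k < 2:
        raise ValueError("RMSSTD needs at least two observations in the cluster")
    return math.sqrt(within_ss(cluster) / (v * (n_k - 1)))


# Three observations on two variables; the mean is (2, 3) and W_K = 8,
# so RMSSTD = sqrt(8 / (2 * 2)) = sqrt(2).
cluster = [(1.0, 2.0), (3.0, 2.0), (2.0, 5.0)]
print(rmsstd(cluster))
```

Dividing by v(N_K - 1) rather than N_K - 1 pools the variance over all variables, so the result is on the scale of a single-variable standard deviation.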
The R^2 statistic for a given level of the hierarchy is
R^2 = 1 - \frac{P_G}{T}
The squared semipartial correlation for joining clusters C_K and C_L is
\text{semipartial } R^2 = \frac{B_{KL}}{T}
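A small numeric sketch may help relate these two statistics. It assumes definitions that are not given in this section: T is the total sum of squares about the overall mean, P_G is the sum of the within-cluster sums of squares over the G clusters at the current level, and B_KL = W_M - W_K - W_L, where C_M is the union of C_K and C_L. All function names and the toy data are hypothetical.

```python
def within_ss(rows):
    """Assumed W: sum of squared distances from each row to the mean of rows."""
    n = len(rows)
    v = len(rows[0])
    mean = [sum(row[j] for row in rows) / n for j in range(v)]
    return sum((row[j] - mean[j]) ** 2 for row in rows for j in range(v))


def r_squared(clusters):
    """R^2 = 1 - P_G / T for one level of the hierarchy."""
    all_rows = [row for c in clusters for row in c]
    t = within_ss(all_rows)                     # T, total sum of squares
    p_g = sum(within_ss(c) for c in clusters)   # P_G, pooled within-cluster SS
    return 1.0 - p_g / t


def semipartial_r_squared(c_k, c_l, all_rows):
    """Semipartial R^2 = B_KL / T for joining clusters C_K and C_L."""
    b_kl = within_ss(c_k + c_l) - within_ss(c_k) - within_ss(c_l)
    return b_kl / within_ss(all_rows)


# Toy one-variable data with three clusters.
c1 = [(0.0,), (2.0,)]
c2 = [(4.0,), (6.0,)]
c3 = [(20.0,), (22.0,)]
all_rows = c1 + c2 + c3

r2_three = r_squared([c1, c2, c3])      # level with G = 3
r2_two = r_squared([c1 + c2, c3])       # level after joining c1 and c2
spr = semipartial_r_squared(c1, c2, all_rows)
print(r2_three, r2_two, spr)
```

Under these assumptions the semipartial R^2 equals the drop in R^2 caused by the join, which is why `r2_three - r2_two` matches `spr` in this example.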
The bimodality coefficient is
b = \frac{m_3^2 + 1}{m_4 + \frac{3(n-1)^2}{(n-2)(n-3)}}
where m_3 is skewness and m_4 is kurtosis. Values of b greater than 0.555 (the value for a uniform population) may indicate bimodal or multimodal marginal distributions. The maximum of 1.0 is obtained for a population with only two distinct values (the Bernoulli distribution). Very heavy-tailed distributions have small values of b regardless of the number of modes.
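The coefficient can be illustrated with a short Python sketch. It makes an assumption that this section does not state: m_3 and m_4 are taken to be the bias-corrected sample skewness and sample excess kurtosis (the choice under which a uniform population gives roughly 0.555). The function name `bimodality_coefficient` and the two samples are hypothetical.

```python
import math


def bimodality_coefficient(x):
    """b = (m3^2 + 1) / (m4 + 3(n-1)^2 / ((n-2)(n-3))).

    Assumes m3 is the bias-corrected sample skewness and m4 is the
    bias-corrected sample excess kurtosis.
    """
    n = len(x)
    if n < 4:
        raise ValueError("need at least four observations")
    mean = sum(x) / n
    s = math.sqrt(sum((xi - mean) ** 2 for xi in x) / (n - 1))
    if s == 0.0:
        raise ValueError("all values are identical")
    z = [(xi - mean) / s for xi in x]
    correction = 3.0 * (n - 1) ** 2 / ((n - 2) * (n - 3))
    m3 = n / ((n - 1) * (n - 2)) * sum(zi ** 3 for zi in z)
    m4 = n * (n + 1) / ((n - 1) * (n - 2) * (n - 3)) * sum(zi ** 4 for zi in z) - correction
    return (m3 ** 2 + 1.0) / (m4 + correction)


# A sample with only two distinct values: b is close to the maximum of 1.
two_valued = [0.0] * 50 + [1.0] * 50

# A symmetric, single-peaked sample (binomial-shaped counts): b is well below 0.555.
bell = []
for value, count in enumerate([1, 6, 15, 20, 15, 6, 1]):
    bell.extend([float(value)] * count)

print(bimodality_coefficient(two_valued), bimodality_coefficient(bell))
```

With a finite sample the two-valued case falls slightly short of 1.0 because of the small-sample corrections in the estimators.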

Formulas for the cubic clustering criterion and approximate expected R^2 are given in Sarle (1983).

The pseudo F statistic for a given level is

\text{pseudo } F = \frac{(T - P_G)/(G - 1)}{P_G/(n - G)}
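As a rough sketch, the pseudo F statistic can be computed directly from the cluster memberships. The code assumes T is the total sum of squares about the overall mean, P_G is the pooled within-cluster sum of squares for the G clusters, and n is the total number of observations. The names `within_ss` and `pseudo_f` and the toy data are hypothetical.

```python
def within_ss(rows):
    """Assumed W: sum of squared distances from each row to the mean of rows."""
    n = len(rows)
    v = len(rows[0])
    mean = [sum(row[j] for row in rows) / n for j in range(v)]
    return sum((row[j] - mean[j]) ** 2 for row in rows for j in range(v))


def pseudo_f(clusters):
    """pseudo F = ((T - P_G) / (G - 1)) / (P_G / (n - G))."""
    g = len(clusters)
    all_rows = [row for c in clusters for row in c]
    n = len(all_rows)
    if g < 2 or n <= g:
        raise ValueError("need 2 <= G < n")
    t = within_ss(all_rows)                     # T
    p_g = sum(within_ss(c) for c in clusters)   # P_G
    if p_g == 0.0:
        raise ValueError("P_G is zero; pseudo F is undefined")
    return ((t - p_g) / (g - 1)) / (p_g / (n - g))


# Toy one-variable data: T = 454, P_3 = 6, G = 3, n = 6.
clusters = [
    [(0.0,), (2.0,)],
    [(4.0,), (6.0,)],
    [(20.0,), (22.0,)],
]
print(pseudo_f(clusters))  # ((454 - 6) / 2) / (6 / 3) = 112
```

Large values indicate that the between-cluster variation is large relative to the within-cluster variation at that level.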
The pseudo t^2 statistic for joining C_K and C_L is
\text{pseudo } t^2 = \frac{B_{KL}}{(W_K + W_L)/(N_K + N_L - 2)}
The pseudo F and t^2 statistics may be useful indicators of the number of clusters, but they are not distributed as F and t^2 random variables. If the data are independently sampled from a multivariate normal distribution with a scalar covariance matrix and if the clustering method allocates observations to clusters randomly (which no clustering method actually does), then the pseudo F statistic is distributed as an F random variable with v(G - 1) and v(n - G) degrees of freedom. Under the same assumptions, the pseudo t^2 statistic is distributed as an F random variable with v and v(N_K + N_L - 2) degrees of freedom. The pseudo t^2 statistic differs computationally from Hotelling's T^2 in that the latter uses a general symmetric covariance matrix instead of a scalar covariance matrix. The pseudo F statistic was suggested by Calinski and Harabasz (1974). The pseudo t^2 statistic is related to the J_e(2)/J_e(1) statistic of Duda and Hart (1973) by
\frac{J_e(2)}{J_e(1)} = \frac{W_K + W_L}{W_M} = \frac{1}{1 + \frac{t^2}{N_K + N_L - 2}}
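The relation between the pseudo t^2 statistic and the J_e(2)/J_e(1) ratio can be checked numerically. This sketch assumes B_KL = W_M - W_K - W_L, where C_M is the cluster formed by joining C_K and C_L and each W is a within-cluster sum of squares about the cluster mean. The function names and toy data are hypothetical.

```python
def within_ss(rows):
    """Assumed W: sum of squared distances from each row to the mean of rows."""
    n = len(rows)
    v = len(rows[0])
    mean = [sum(row[j] for row in rows) / n for j in range(v)]
    return sum((row[j] - mean[j]) ** 2 for row in rows for j in range(v))


def pseudo_t2(c_k, c_l):
    """pseudo t^2 = B_KL / ((W_K + W_L) / (N_K + N_L - 2))."""
    w_k, w_l = within_ss(c_k), within_ss(c_l)
    b_kl = within_ss(c_k + c_l) - w_k - w_l     # assumed B_KL = W_M - W_K - W_L
    dof = len(c_k) + len(c_l) - 2
    if dof <= 0 or w_k + w_l == 0.0:
        raise ValueError("pseudo t^2 is undefined for these clusters")
    return b_kl / ((w_k + w_l) / dof)


def je_ratio(c_k, c_l):
    """J_e(2) / J_e(1) = (W_K + W_L) / W_M."""
    return (within_ss(c_k) + within_ss(c_l)) / within_ss(c_k + c_l)


# Toy one-variable clusters: W_K = W_L = 2, W_M = 20, so B_KL = 16.
c_k = [(0.0,), (2.0,)]
c_l = [(4.0,), (6.0,)]

t2 = pseudo_t2(c_k, c_l)                        # 16 / (4 / 2) = 8
lhs = je_ratio(c_k, c_l)                        # 4 / 20 = 0.2
rhs = 1.0 / (1.0 + t2 / (len(c_k) + len(c_l) - 2))
print(t2, lhs, rhs)
```

The two sides agree because W_M = W_K + W_L + B_KL under the assumed definition, so dividing numerator and denominator of (W_K + W_L)/W_M by W_K + W_L gives the right-hand form.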
See Milligan and Cooper (1985) and Cooper and Milligan (1988) regarding the performance of these statistics in estimating the number of population clusters. Conservative tests for the number of clusters using the pseudo F and t^2 statistics can be obtained by the Bonferroni approach (Hawkins, Muller, and ten Krooden 1982, pp. 337-340).


Copyright © 1999 by SAS Institute Inc., Cary, NC, USA. All rights reserved.