Miscellaneous Formulas
The root-mean-square standard deviation of a cluster $C_K$ is
\[ \mathrm{RMSSTD} = \sqrt{\frac{W_K}{v(N_K - 1)}} \]
The $R^2$ statistic for a given level of the hierarchy is
\[ R^2 = 1 - \frac{P_G}{T} \]
The squared semipartial correlation for
joining clusters $C_K$ and $C_L$ is
\[ \text{semipartial } R^2 = \frac{B_{KL}}{T} \]
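As a rough illustration of the two formulas above, the following Python sketch computes $R^2$ and the semipartial $R^2$ from raw coordinates. It assumes definitions that are not restated in this section: $T$ is the total sum of squares about the overall mean, $W_J$ is the within-cluster sum of squares of cluster $C_J$, $P_G$ is the sum of $W_J$ over the $G$ clusters at the given level, and $B_{KL} = W_M - W_K - W_L$ where $C_M$ is the union of $C_K$ and $C_L$. The function names are hypothetical and are not part of any SAS interface.

```python
def sum_of_squares(points):
    """Sum of squared Euclidean distances from each point to the centroid."""
    n = len(points)
    dims = len(points[0])
    centroid = [sum(p[d] for p in points) / n for d in range(dims)]
    return sum((p[d] - centroid[d]) ** 2 for p in points for d in range(dims))


def r_squared(clusters):
    """R^2 = 1 - P_G / T, where clusters is a list of lists of points."""
    all_points = [p for c in clusters for p in c]
    total_ss = sum_of_squares(all_points)                  # T (assumed definition)
    pooled_ss = sum(sum_of_squares(c) for c in clusters)   # P_G (assumed definition)
    return 1.0 - pooled_ss / total_ss


def semipartial_r_squared(cluster_k, cluster_l, all_points):
    """Semipartial R^2 = B_KL / T for joining cluster_k and cluster_l."""
    w_m = sum_of_squares(cluster_k + cluster_l)            # W_M of the merged cluster
    b_kl = w_m - sum_of_squares(cluster_k) - sum_of_squares(cluster_l)
    return b_kl / sum_of_squares(all_points)


# Two well-separated clusters of two points each.
clusters = [[(0.0, 0.0), (0.0, 2.0)], [(10.0, 0.0), (10.0, 2.0)]]
everything = clusters[0] + clusters[1]
print(r_squared(clusters))
print(semipartial_r_squared(clusters[0], clusters[1], everything))
```

In this toy example both values equal 100/104, because joining the only two clusters reduces $R^2$ from its two-cluster value to zero, and under the assumed definitions the semipartial $R^2$ is exactly that decrease.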
The bimodality coefficient is
\[ b = \frac{m_3^2 + 1}{m_4 + \dfrac{3(n-1)^2}{(n-2)(n-3)}} \]
where $m_3$ is skewness and $m_4$ is kurtosis.
Values of b greater than 0.555 (the value
for a uniform population) may indicate
bimodal or multimodal marginal distributions.
The maximum of 1.0 is obtained for a population with only
two distinct values, such as the Bernoulli distribution.
Very heavy-tailed distributions have small
values of b regardless of the number of modes.
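The bimodality coefficient can be sketched in Python as follows. This is an illustration under stated assumptions, not the procedure's implementation: the kurtosis is taken to be excess kurtosis (which is what makes a uniform population give about 0.555), and the helper `sample_skewness_kurtosis` uses the common bias-corrected sample estimators, which may differ from the estimators actually used. Both function names are hypothetical.

```python
import math


def bimodality_coefficient(m3, m4, n):
    """b = (m3^2 + 1) / (m4 + 3(n-1)^2 / ((n-2)(n-3))); requires n > 3."""
    if n <= 3:
        raise ValueError("n must be greater than 3")
    correction = 3.0 * (n - 1) ** 2 / ((n - 2) * (n - 3))
    return (m3 ** 2 + 1.0) / (m4 + correction)


def sample_skewness_kurtosis(values):
    """Bias-corrected sample skewness and excess kurtosis (assumed estimators)."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in values) / (n - 1))
    z = [(x - mean) / std for x in values]
    m3 = n / ((n - 1) * (n - 2)) * sum(t ** 3 for t in z)
    m4 = (n * (n + 1) / ((n - 1) * (n - 2) * (n - 3)) * sum(t ** 4 for t in z)
          - 3.0 * (n - 1) ** 2 / ((n - 2) * (n - 3)))
    return m3, m4


# Population-like values: uniform (skewness 0, excess kurtosis -1.2)
# and symmetric two-point (skewness 0, excess kurtosis -2), with large n.
print(bimodality_coefficient(0.0, -1.2, 10**6))
print(bimodality_coefficient(0.0, -2.0, 10**6))

# A finite sample with only two distinct values.
two_valued = [0.0] * 50 + [1.0] * 50
m3, m4 = sample_skewness_kurtosis(two_valued)
print(bimodality_coefficient(m3, m4, len(two_valued)))
```

With these assumed estimators, the 100-observation two-valued sample gives a value of roughly 0.95 rather than exactly 1.0, because the finite-sample corrections do not cancel; the limits 0.555 and 1.0 are approached as n grows.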
Formulas for the cubic-clustering criterion and
approximate expected $R^2$ are given in Sarle (1983).
The pseudo F statistic for a given level is
\[ \text{pseudo } F = \frac{(T - P_G)/(G - 1)}{P_G/(n - G)} \]
The pseudo $t^2$ statistic for joining $C_K$ and $C_L$ is
\[ \text{pseudo } t^2 = \frac{B_{KL}}{(W_K + W_L)/(N_K + N_L - 2)} \]
The pseudo F and $t^2$ statistics may be useful
indicators of the number of clusters, but they are not
distributed as F and $t^2$ random variables.
If the data are independently sampled from a multivariate
normal distribution with a scalar covariance matrix and if
the clustering method allocates observations to clusters
randomly (which no clustering method actually does), then the
pseudo F statistic is distributed as an F random
variable with $v(G - 1)$ and $v(n - G)$ degrees of freedom.
Under the same assumptions, the pseudo $t^2$ statistic
is distributed as an F random variable with
$v$ and $v(N_K + N_L - 2)$ degrees of freedom.
The pseudo $t^2$ statistic differs computationally from
Hotelling's $T^2$ in that the latter uses a general symmetric
covariance matrix instead of a scalar covariance matrix.
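The pseudo F and pseudo $t^2$ formulas above can be sketched in Python as follows. The sketch assumes definitions that are not restated in this section: $T$ is the total sum of squares, $P_G$ is the sum of the within-cluster sums of squares over the $G$ clusters, $W_K$ and $W_L$ are the within-cluster sums of squares of $C_K$ and $C_L$, and $B_{KL} = W_M - W_K - W_L$ for the merged cluster $C_M$. The function names are hypothetical.

```python
def sum_of_squares(points):
    """Sum of squared Euclidean distances from each point to the centroid."""
    n = len(points)
    dims = len(points[0])
    centroid = [sum(p[d] for p in points) / n for d in range(dims)]
    return sum((p[d] - centroid[d]) ** 2 for p in points for d in range(dims))


def pseudo_f(clusters):
    """Pseudo F = ((T - P_G) / (G - 1)) / (P_G / (n - G)); needs 1 < G < n."""
    n = sum(len(c) for c in clusters)
    g = len(clusters)
    total_ss = sum_of_squares([p for c in clusters for p in c])   # T
    pooled_ss = sum(sum_of_squares(c) for c in clusters)          # P_G
    return ((total_ss - pooled_ss) / (g - 1)) / (pooled_ss / (n - g))


def pseudo_t2(cluster_k, cluster_l):
    """Pseudo t^2 = B_KL / ((W_K + W_L) / (N_K + N_L - 2))."""
    w_k = sum_of_squares(cluster_k)
    w_l = sum_of_squares(cluster_l)
    b_kl = sum_of_squares(cluster_k + cluster_l) - w_k - w_l
    return b_kl / ((w_k + w_l) / (len(cluster_k) + len(cluster_l) - 2))


clusters = [[(0.0, 0.0), (0.0, 2.0)], [(10.0, 0.0), (10.0, 2.0)]]
print(pseudo_f(clusters))                    # 50.0
print(pseudo_t2(clusters[0], clusters[1]))   # 50.0
```

The two values coincide here only because the example has exactly two clusters, so the two-cluster level and the join of $C_K$ and $C_L$ describe the same split. The sketch does not guard against division by zero, which occurs when $G = 1$, when $G = n$, or when the within-cluster sums of squares are all zero.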
The pseudo F statistic was suggested
by Calinski and Harabasz (1974).
The pseudo $t^2$ statistic is related to the
$J_e(2)/J_e(1)$ statistic of Duda and Hart (1973) by
\[ \frac{J_e(2)}{J_e(1)} = \frac{W_K + W_L}{W_M} = \frac{1}{1 + \dfrac{t^2}{N_K + N_L - 2}} \]
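The second equality can be checked in a few lines from the definition of the pseudo $t^2$ statistic. The derivation below assumes that $B_{KL} = W_M - W_K - W_L$, where $C_M$ is the cluster formed by joining $C_K$ and $C_L$; that definition is not restated in this section.

```latex
\begin{align*}
\frac{t^2}{N_K + N_L - 2} &= \frac{B_{KL}}{W_K + W_L} \\
1 + \frac{t^2}{N_K + N_L - 2} &= \frac{W_K + W_L + B_{KL}}{W_K + W_L}
  = \frac{W_M}{W_K + W_L} \\
\frac{1}{1 + \dfrac{t^2}{N_K + N_L - 2}} &= \frac{W_K + W_L}{W_M}
\end{align*}
```

Under that assumption, a large pseudo $t^2$ corresponds to a ratio near zero, and a pseudo $t^2$ of zero corresponds to a ratio of one.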
See Milligan and Cooper (1985) and Cooper and Milligan
(1988) regarding the performance of these statistics
in estimating the number of population clusters.
Conservative tests for the number of clusters using the pseudo
F and $t^2$ statistics can be obtained by the Bonferroni approach
(Hawkins, Muller, and ten Krooden 1982, pp. 337-340).
Copyright © 1999 by SAS Institute Inc., Cary, NC, USA. All rights reserved.