Displayed Output
Unless the SHORT or SUMMARY option
is specified, PROC FASTCLUS displays
- Initial Seeds, cluster seeds selected
after one pass through the data
- Change in Cluster Seeds for each iteration,
if you specify MAXITER=n>1
If you specify the LEAST=p option with 1 < p < 2, and you
omit the IRLS option, an additional column is displayed in the
Iteration History table. The column contains a character to identify
the method used in each iteration. PROC FASTCLUS chooses the most
efficient method to cluster the data at each iterative step, given the
condition of the data. Thus, the method chosen is data dependent. The
possible values are described as follows:
| Value | Method |
|---|---|
| N | Newton's method |
| I or L | iteratively weighted least squares (IRLS) |
| 1 | IRLS step, halved once |
| 2 | IRLS step, halved twice |
| 3 | IRLS step, halved three times |
PROC FASTCLUS displays a Cluster Summary,
giving the following for each cluster:
- Cluster number
- Frequency, the number of observations in the cluster
- Weight, the sum of the weights of the observations in the
cluster, if you specify the WEIGHT statement
- RMS Std Deviation, the root mean square across
variables of the cluster standard deviations,
which is equal to the root mean square
distance between observations in the cluster
- Maximum Distance from Seed to Observation,
the maximum distance from the
cluster seed to any observation in the cluster
- Nearest Cluster, the number of the cluster with mean
closest to the mean of the current cluster
- Centroid Distance, the distance between the centroids
(means) of the current cluster and the nearest other cluster
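To illustrate how the Cluster Summary quantities relate to the data, the following Python sketch recomputes them from a list of observations and their cluster assignments. It is not PROC FASTCLUS code. The function name cluster_summary is hypothetical, and the sketch assumes that the final seed of each cluster is its mean and that cluster variances use the divisor n_k - 1; WEIGHT and FREQ handling is omitted.

```python
import math

def cluster_summary(points, labels):
    """Recompute Cluster Summary style quantities for each cluster.

    Assumptions (not taken from the procedure itself): the cluster seed
    is the cluster mean, and variances use the divisor n_k - 1.
    """
    nvar = len(points[0])

    # Group the observations by cluster number.
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)

    # Cluster means (centroids), one coordinate per variable.
    means = {
        k: [sum(p[j] for p in obs) / len(obs) for j in range(nvar)]
        for k, obs in clusters.items()
    }

    summary = {}
    for k, obs in clusters.items():
        n_k = len(obs)

        # RMS Std Deviation: root mean square, across variables, of the
        # within-cluster standard deviations (undefined for one observation).
        if n_k > 1:
            variances = [
                sum((p[j] - means[k][j]) ** 2 for p in obs) / (n_k - 1)
                for j in range(nvar)
            ]
            rms_std = math.sqrt(sum(variances) / nvar)
        else:
            rms_std = None

        # Maximum distance from the seed (assumed to be the mean) to any
        # observation in the cluster.
        max_dist = max(math.dist(p, means[k]) for p in obs)

        # Nearest cluster and the distance between the two centroids.
        others = [(math.dist(means[k], means[o]), o) for o in means if o != k]
        centroid_distance, nearest = min(others) if others else (None, None)

        summary[k] = {
            "frequency": n_k,
            "rms_std": rms_std,
            "max_distance": max_dist,
            "nearest": nearest,
            "centroid_distance": centroid_distance,
        }
    return summary
```

For example, cluster_summary([(0, 0), (0, 2), (10, 0), (10, 4)], [1, 1, 2, 2]) returns one dictionary for each of the two clusters, keyed by cluster number.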
A table of statistics for each variable is
displayed unless you specify the SUMMARY option.
The table contains
- Total STD, the total standard deviation
- Within STD, the pooled within-cluster standard deviation
- R-Squared, the $R^2$ for predicting
the variable from the cluster
- RSQ/(1 - RSQ), the ratio of between-cluster variance to
within-cluster variance ($R^2/(1 - R^2)$)
- OVER-ALL, all of the previous quantities
pooled across variables
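The per-variable statistics can be reproduced from total and within-cluster sums of squares. The following Python sketch shows one way to do so. The function names are hypothetical, and the divisors n - 1 (total) and n - c (within) are assumptions about the variance definition rather than a statement of what the procedure does internally.

```python
import math

def _row(sst, ssw, n, c, nvar=1):
    """Build one table row from total (sst) and within-cluster (ssw)
    sums of squares. Divisors n - 1 and n - c are assumed."""
    if sst > 0:
        r2 = 1.0 - ssw / sst
        ratio = r2 / (1.0 - r2) if ssw > 0 else math.inf
    else:
        r2 = ratio = math.nan  # constant variable: R-squared undefined
    return {
        "total_std": math.sqrt(sst / (nvar * (n - 1))),
        "within_std": math.sqrt(ssw / (nvar * (n - c))),
        "r_squared": r2,
        "rsq_ratio": ratio,
    }

def variable_statistics(points, labels):
    """Return (rows, overall): one row per variable, plus the row that
    pools the sums of squares across variables (the OVER-ALL line)."""
    n = len(points)
    nvar = len(points[0])
    cluster_ids = sorted(set(labels))
    c = len(cluster_ids)
    if n <= c:
        raise ValueError("need more observations than clusters")

    rows = []
    sst_all = ssw_all = 0.0
    for j in range(nvar):
        column = [p[j] for p in points]
        grand_mean = sum(column) / n
        sst = sum((x - grand_mean) ** 2 for x in column)

        # Within-cluster sum of squares about each cluster mean.
        ssw = 0.0
        for k in cluster_ids:
            values = [x for x, lab in zip(column, labels) if lab == k]
            mean_k = sum(values) / len(values)
            ssw += sum((x - mean_k) ** 2 for x in values)

        rows.append(_row(sst, ssw, n, c))
        sst_all += sst
        ssw_all += ssw

    overall = _row(sst_all, ssw_all, n, c, nvar=nvar)
    return rows, overall
```

In this sketch the overall row pools sums of squares before dividing, so its $R^2$ is one minus the ratio of the summed within-cluster to the summed total sums of squares.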
PROC FASTCLUS also displays
- Pseudo F Statistic,
\[ \frac{R^2 / (c - 1)}{(1 - R^2) / (n - c)} \]
where $R^2$ is the observed overall $R^2$, c is the
number of clusters, and n is the number of observations.
The pseudo F statistic was suggested
by Calinski and Harabasz (1974).
Refer to Milligan and Cooper (1985) and Cooper and
Milligan (1988) regarding the use of the pseudo
F statistic in estimating the number of clusters.
See Example 23.2 in Chapter 23, "The CLUSTER Procedure," for a
comparison of pseudo F statistics.
- Observed Overall R-Squared, if you specify the SUMMARY option
- Approximate Expected Overall R-Squared,
the approximate expected value of the overall
$R^2$ under the uniform null hypothesis
assuming that the variables are uncorrelated.
The value is missing if the number of clusters is
greater than one-fifth the number of observations.
- Cubic Clustering Criterion, computed under the
assumption that the variables are uncorrelated.
The value is missing if the number of clusters is
greater than one-fifth the number of observations.
If you are interested in the approximate expected
$R^2$ or the cubic clustering criterion but your
variables are correlated, you should cluster
principal component scores from the PRINCOMP procedure.
Both of these statistics are described by Sarle (1983).
The performance of the cubic clustering criterion
in estimating the number of clusters is examined by
Milligan and Cooper (1985) and Cooper and Milligan (1988).
- Distances Between Cluster Means, if you specify the DISTANCE option
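As a worked illustration of the pseudo F statistic defined in the list above, the following Python sketch evaluates the formula from the overall $R^2$, the number of clusters c, and the number of observations n. It also encodes the stated rule that the approximate expected $R^2$ and the cubic clustering criterion are missing when the number of clusters exceeds one-fifth the number of observations. Both function names are hypothetical and are not part of any SAS interface.

```python
def pseudo_f(r_squared, n_clusters, n_obs):
    """Pseudo F statistic: [R^2 / (c - 1)] / [(1 - R^2) / (n - c)]."""
    if n_clusters < 2 or n_obs <= n_clusters:
        raise ValueError("requires 2 <= c < n")
    if not 0.0 <= r_squared < 1.0:
        raise ValueError("requires 0 <= R^2 < 1")
    between = r_squared / (n_clusters - 1)
    within = (1.0 - r_squared) / (n_obs - n_clusters)
    return between / within

def ccc_is_reported(n_clusters, n_obs):
    """True unless the number of clusters is greater than one-fifth
    the number of observations (the condition under which the
    approximate expected R^2 and the CCC are set to missing)."""
    return 5 * n_clusters <= n_obs
```

With an overall $R^2$ of 0.8, 3 clusters, and 103 observations, the numerator is 0.8/2 = 0.4 and the denominator is 0.2/100 = 0.002, so the statistic is approximately 200.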
Unless you specify the SHORT or SUMMARY option,
PROC FASTCLUS displays
- Cluster Means for each variable
- Cluster Standard Deviations for each variable
Copyright © 1999 by SAS Institute Inc., Cary, NC, USA. All rights reserved.