Using PROC FASTCLUS

The FASTCLUS Procedure

Before using PROC FASTCLUS, decide whether your variables should be standardized in some way, since variables with large variances tend to have more effect on the resulting clusters than those with small variances. If all variables are measured in the same units, standardization may not be necessary. Otherwise, some form of standardization is strongly recommended. The STANDARD procedure can standardize all variables to mean zero and variance one. The FACTOR or PRINCOMP procedures can compute standardized principal component scores. The ACECLUS procedure can transform the variables according to an estimated within-cluster covariance matrix.
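As a rough sketch of the ACECLUS route mentioned above, the following assumes a hypothetical input data set named RAW with variables v1-v10. The PROPORTION= value is purely illustrative and would need to be chosen for real data. The canonical variables written by PROC ACECLUS (named CAN1, CAN2, and so on by default) are then clustered in place of the original variables.

```sas
/* Assumption: RAW is a hypothetical input data set with variables v1-v10. */
/* PROPORTION=.03 is an illustrative value, not a recommendation.          */
proc aceclus data=raw out=ace proportion=.03;
   var v1-v10;
run;

/* Cluster the transformed canonical variables instead of v1-v10. */
proc fastclus data=ace out=clust maxclusters=3;
   var can1-can10;
run;
```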

Nonlinear transformations of the variables may change the number of population clusters and should, therefore, be approached with caution. For most applications, the variables should be transformed so that equal differences are of equal practical importance. An interval scale of measurement is required. Ordinal or ranked data are generally not appropriate.

PROC FASTCLUS produces relatively little output. In most cases you should create an output data set and use other procedures such as PRINT, PLOT, CHART, MEANS, DISCRIM, or CANDISC to study the clusters. It is usually desirable to try several values of the MAXCLUSTERS= option. Macros are useful for running PROC FASTCLUS repeatedly with other procedures.
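One way to try several values of MAXCLUSTERS= is a small macro loop. The sketch below is only an illustration: the macro name TRYK, its parameters, and the output data set names CLUST2, CLUST3, and so on are hypothetical, and it assumes a standardized data set STAN with variables v1 and v2.

```sas
/* Hypothetical macro: run PROC FASTCLUS once for each cluster count */
/* from &lo to &hi and summarize each solution with PROC MEANS.      */
%macro tryk(data=stan, vars=v1 v2, lo=2, hi=4);
   %local k;
   %do k = &lo %to &hi;

      /* Save each solution in its own data set: clust2, clust3, ... */
      proc fastclus data=&data out=clust&k maxclusters=&k;
         var &vars;
      run;

      /* Per-cluster counts, means, and standard deviations */
      proc means data=clust&k n mean std;
         class cluster;
         var &vars;
      run;

   %end;
%mend tryk;

%tryk(data=stan, vars=v1 v2, lo=2, hi=4)
```

Keeping a separate output data set for each value of MAXCLUSTERS= makes it possible to compare the solutions afterward with other procedures.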

A simple application of PROC FASTCLUS with two variables to examine the 2- and 3-cluster solutions may proceed as follows:

   proc standard mean=0 std=1 out=stan;
      var v1 v2;
   run;

   proc fastclus data=stan out=clust maxclusters=2;
      var v1 v2;
   run;

   proc plot;
      plot v2*v1=cluster;
   run;

   proc fastclus data=stan out=clust maxclusters=3;
      var v1 v2;
   run;

   proc plot;
      plot v2*v1=cluster;
   run;

If you have more than two variables, you can use the CANDISC procedure to compute canonical variables for plotting the clusters. For example:

   proc standard mean=0 std=1 out=stan;
      var v1-v10;
   run;

   proc fastclus data=stan out=clust maxclusters=3;
      var v1-v10;
   run;

   proc candisc out=can;
      var v1-v10;
      class cluster;
   run;

   proc plot;
      plot can2*can1=cluster;
   run;

If the data set is not too large, it may also be helpful to use

   proc sort;
      by cluster distance;
   run;
   proc print;
      by cluster;
   run;

to list the clusters. By examining the values of DISTANCE, you can determine whether any observations are unusually far from their cluster seeds.

It is often advisable, especially if the data set is large or contains outliers, to make a preliminary PROC FASTCLUS run with a large number of clusters, perhaps 20 to 100. Use MAXITER=0 and OUTSEED=SAS-data-set. You can save time on subsequent runs by selecting cluster seeds from this output data set using the SEED= option.

You should check the preliminary clusters for outliers, which often appear as clusters with only one member. Use a DATA step to delete outliers from the data set created by the OUTSEED= option before using it as a SEED= data set in later runs. If there are severe outliers, the subsequent PROC FASTCLUS runs should specify the STRICT option to prevent the outliers from distorting the clusters.
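The preliminary-run workflow described above might look roughly like the following. This is a sketch under stated assumptions: the data set names STAN, PRELIM, and SEEDS, the choice of 50 preliminary clusters, the rule that treats one-member clusters as outliers, and the STRICT= value of 3 are all illustrative rather than recommended settings.

```sas
/* Preliminary pass: many clusters, no iterations, seeds saved in PRELIM. */
proc fastclus data=stan maxclusters=50 maxiter=0 outseed=prelim;
   var v1-v10;
run;

/* Assumed outlier rule: drop seeds of clusters with only one member. */
data seeds;
   set prelim;
   if _freq_ > 1;
run;

/* Final run: take initial seeds from SEEDS. STRICT=3 is a hypothetical */
/* cutoff; observations farther than this from every seed are not      */
/* assigned to a cluster.                                               */
proc fastclus data=stan seed=seeds maxclusters=3 strict=3 out=clust;
   var v1-v10;
run;
```

A numeric value is given for STRICT= here because the sketch does not set RADIUS=; an appropriate cutoff depends on the scale of the data.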

You can use the OUTSEED= data set with the PLOT procedure to plot _GAP_ by _FREQ_. An overlay of _RADIUS_ by _FREQ_ provides a baseline against which to compare the values of _GAP_. Outliers appear in the upper left area of the plot, with large values of _GAP_ and small _FREQ_ values. Good clusters appear in the upper right area, with large values of both _GAP_ and _FREQ_. Good potential cluster seeds appear in the lower right, as well as in the upper right, since large _FREQ_ values indicate high-density regions. Small _FREQ_ values in the left part of the plot indicate poor cluster seeds because the points are in low-density regions. It often helps to remove all clusters with small frequencies even though the clusters may not be remote enough to be considered outliers. Removing points in low-density regions improves cluster separation and provides visually sharper cluster outlines in scatter plots.
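A minimal sketch of the diagnostic plot described above follows. The data set names STAN and PRELIM, the 50 preliminary clusters, and the plotting symbols 'G' and 'R' are assumptions made for illustration.

```sas
/* Preliminary run that writes the cluster seeds to PRELIM (assumed name). */
proc fastclus data=stan maxclusters=50 maxiter=0 outseed=prelim;
   var v1-v10;
run;

/* _GAP_ by _FREQ_, plotted with 'G', overlaid with _RADIUS_ by _FREQ_, */
/* plotted with 'R', as a baseline for judging the gaps.                */
proc plot data=prelim;
   plot _gap_*_freq_='G' _radius_*_freq_='R' / overlay;
run;
```

In the resulting plot, a 'G' that sits well above the 'R' points at a similar frequency marks a cluster that is unusually far from its nearest neighbor.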
