The MODECLUS Procedure

Getting Started

This section illustrates how PROC MODECLUS can be used to examine the clusters of data in the following artificial data set.

   data example;
      input x y @@;
      datalines;
   18 18  20 22  21 20  12 23  17 12  23 25  25 20  16 27
   20 13  28 22  80 20  75 19  77 23  81 26  55 21  64 24
   72 26  70 35  75 30  78 42  18 52  27 57  41 61  48 64
   59 72  69 72  80 80  31 53  51 69  72 81
   ;
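
Before plotting, a quick summary can confirm that the data were read as intended. The following is a minimal sketch that uses only the data set and variable names defined above; it should report 30 observations for each variable, with values no larger than 81:

```sas
/* Sanity check: count, minimum, and maximum of each variable */
proc means data=example n min max;
   var x y;
run;
```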

It is good practice to plot the data before the analysis to check for obvious clusters or pathologies. The interactive graphics of the SAS/INSIGHT product are effective for visualizing clusters. In this example, with only two variables and a small sample size, the GPLOT procedure is adequate. The following statements produce Figure 42.1:

   axis1 label=(angle=90 rotate=0) minor=none
         order=(0 to 80 by 20);
   axis2 minor=none;
   proc gplot data=example;
      plot y*x / frame cframe=ligr vaxis=axis1 haxis=axis2;
   run;
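
If SAS/GRAPH is not available, a comparable scatter plot can be sketched with ODS Graphics. This alternative is not part of the original example; it assumes a SAS release that includes PROC SGPLOT (SAS 9.2 or later), and it does not reproduce the axis and frame settings used for Figure 42.1:

```sas
/* Alternative scatter plot using ODS Graphics (assumes PROC SGPLOT is available) */
proc sgplot data=example;
   scatter x=x y=y;
run;
```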

The plot suggests three clusters. Of these clusters, the one in the lower left corner is the most compact, while the lower right cluster is more dispersed.

The upper cluster is elongated and would be difficult for most clustering algorithms to identify as a single cluster. The plot also suggests that a Euclidean distance of 10 or 20 is a good initial guess for the neighborhood size in density estimation and clustering.

Figure 42.1: Scatter Plot of Data

To obtain a cluster analysis, you must specify the METHOD= option; for most purposes, METHOD=1 is recommended. The cluster analysis can be performed with a list of radii (R=10 15 35), as illustrated in the following PROC MODECLUS step. An output data set containing the cluster membership is created with the OUT= option and then used by PROC GPLOT to display the membership. The following statements produce Figure 42.2 through Figure 42.5:

   proc modeclus data=example method=1 r=10 15 35 out=out;
   run;
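
The cluster sizes for each radius can also be tabulated directly from the OUT= data set. The following sketch assumes only that the OUT= data set contains the variables _R_ and CLUSTER, both of which are used by the PROC GPLOT step later in this section:

```sas
/* Cross-tabulate cluster membership by radius; counts only */
proc freq data=out;
   tables _r_*cluster / norow nocol nopercent;
run;
```

Each row of the resulting table corresponds to one radius, and the cell counts should agree with the Frequency column in the cluster statistics for that radius.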

For each cluster solution, PROC MODECLUS produces a table of cluster statistics that includes the cluster number, the number of observations in the cluster, the maximum estimated density within the cluster, the number of observations in the cluster having a neighbor that belongs to a different cluster (the boundary frequency), and the estimated saddle density of the cluster. The results are displayed in Figure 42.2, Figure 42.3, and Figure 42.4 for the three radii. The smallest radius (R=10) yields the largest number of clusters (6), as displayed in Figure 42.2; the largest radius (R=35) places all observations in a single cluster, as displayed in Figure 42.4. All clusters in these three figures are "isolated" because their boundary frequencies are all 0. Therefore, all the estimated saddle densities are missing.

The MODECLUS Procedure
R=10 METHOD=1

Cluster Statistics

   Cluster  Frequency  Maximum Estimated  Boundary   Estimated Saddle
                       Density            Frequency  Density
   -------  ---------  -----------------  ---------  ----------------
   1        10         0.00106103         0          .
   2        9          0.00084883         0          .
   3        7          0.00031831         0          .
   4        2          0.00021221         0          .
   5        1          0.0001061          0          .
   6        1          0.0001061          0          .

Figure 42.2: Results from PROC MODECLUS for METHOD=1 and R=10

The MODECLUS Procedure
R=15 METHOD=1

Cluster Statistics

   Cluster  Frequency  Maximum Estimated  Boundary   Estimated Saddle
                       Density            Frequency  Density
   -------  ---------  -----------------  ---------  ----------------
   1        10         0.00047157         0          .
   2        10         0.00042441         0          .
   3        10         0.00023579         0          .

Figure 42.3: Results from PROC MODECLUS for METHOD=1 and R=15

The MODECLUS Procedure
R=35 METHOD=1

Cluster Statistics

   Cluster  Frequency  Maximum Estimated  Boundary   Estimated Saddle
                       Density            Frequency  Density
   -------  ---------  -----------------  ---------  ----------------
   1        30         0.00012126         0          .

Figure 42.4: Results from PROC MODECLUS for METHOD=1 and R=35
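
The maximum estimated densities reported above are numerically consistent with a uniform-kernel estimate: the number of observations within radius R of a point, divided by n times the area of a circle of radius R. This reading is inferred from the reported values and is not stated in this section. The following sketch recomputes a few of the densities under that assumption; the neighbor counts in the data lines are back-solved from the reported densities, not taken from the procedure output, and the data set name density_check is arbitrary:

```sas
/* Assumed estimate: density = count / (n * pi * r**2)              */
/* count values are inferred from the reported densities (assumption) */
data density_check;
   n = 30;                       /* number of observations in EXAMPLE */
   input r count reported;
   computed = count / (n * constant('pi') * r**2);
   format reported computed 10.8;
   datalines;
10 10 0.00106103
10  1 0.0001061
15 10 0.00047157
35 14 0.00012126
;

proc print data=density_check noobs;
run;
```

If the assumption holds, the computed and reported columns should agree to the precision shown in the figures.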

A table summarizing each cluster solution is then produced, as displayed in Figure 42.5.

The MODECLUS Procedure

Cluster Summary

   R   Number of  Frequency of
       Clusters   Unclassified Objects
   --  ---------  --------------------
   10  6          0
   15  3          0
   35  1          0

Figure 42.5: Summary Table

The OUT= data set contains a complete copy of the input data set for each cluster solution. Using a BY statement in the following PROC GPLOT step, you can examine the differences in cluster memberships for each radius. The following statements produce Figure 42.6 through Figure 42.8:

   symbol1 v='1' font=swiss c=white; symbol2 v='2' font=swiss c=yellow;
   symbol3 v='3' font=swiss c=cyan;  symbol4 v='4' font=swiss c=green;
   symbol5 v='5' font=swiss c=orange;symbol6 v='6' font=swiss c=blue;
   symbol7 v='7' font=swiss c=black;
   proc gplot data=out;
      plot y*x=cluster / frame cframe=ligr nolegend vaxis=axis1
           haxis=axis2;
      by _r_;
   run;
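
Because the OUT= data set holds one copy of the data per radius, a single solution can be selected with a WHERE statement. The following sketch summarizes the location of each cluster for the R=15 solution; it assumes the same _R_ and CLUSTER variables used in the preceding step:

```sas
/* Per-cluster location summary for the R=15 solution only */
proc means data=out mean min max maxdec=1;
   where _r_ = 15;
   class cluster;
   var x y;
run;
```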

Figure 42.6: Scatter Plots of Cluster Memberships with _R_=10

Figure 42.7: Scatter Plots of Cluster Memberships with _R_=15

Figure 42.8: Scatter Plots of Cluster Memberships with _R_=35


Copyright © 1999 by SAS Institute Inc., Cary, NC, USA. All rights reserved.