The CLUSTER Procedure

Example 23.3: Cluster Analysis of Fisher Iris Data

The iris data published by Fisher (1936) have been widely used for examples in discriminant analysis and cluster analysis. The sepal length, sepal width, petal length, and petal width are measured in millimeters on fifty iris specimens from each of three species, Iris setosa, I. versicolor, and I. virginica. Mezzich and Solomon (1980) discuss a variety of cluster analyses of the iris data.

This example analyzes the iris data by Ward's method and two-stage density linkage and then illustrates how the FASTCLUS procedure can be used in combination with PROC CLUSTER to analyze large data sets.

   title 'Cluster Analysis of Fisher (1936) Iris Data';
   proc format;
      value specname
         1='Setosa    '
         2='Versicolor'
         3='Virginica ';
   run;

   data iris;
      input SepalLength SepalWidth PetalLength PetalWidth Species @@;
      format Species specname.;
      label SepalLength='Sepal Length in mm.'
            SepalWidth ='Sepal Width in mm.'
            PetalLength='Petal Length in mm.'
            PetalWidth ='Petal Width in mm.';
      symbol = put(species, specname10.);
      datalines;
   50 33 14 02 1 64 28 56 22 3 65 28 46 15 2 67 31 56 24 3
   63 28 51 15 3 46 34 14 03 1 69 31 51 23 3 62 22 45 15 2
   59 32 48 18 2 46 36 10 02 1 61 30 46 14 2 60 27 51 16 2
   65 30 52 20 3 56 25 39 11 2 65 30 55 18 3 58 27 51 19 3
   68 32 59 23 3 51 33 17 05 1 57 28 45 13 2 62 34 54 23 3
   77 38 67 22 3 63 33 47 16 2 67 33 57 25 3 76 30 66 21 3
   49 25 45 17 3 55 35 13 02 1 67 30 52 23 3 70 32 47 14 2
   64 32 45 15 2 61 28 40 13 2 48 31 16 02 1 59 30 51 18 3
   55 24 38 11 2 63 25 50 19 3 64 32 53 23 3 52 34 14 02 1
   49 36 14 01 1 54 30 45 15 2 79 38 64 20 3 44 32 13 02 1
   67 33 57 21 3 50 35 16 06 1 58 26 40 12 2 44 30 13 02 1
   77 28 67 20 3 63 27 49 18 3 47 32 16 02 1 55 26 44 12 2
   50 23 33 10 2 72 32 60 18 3 48 30 14 03 1 51 38 16 02 1
   61 30 49 18 3 48 34 19 02 1 50 30 16 02 1 50 32 12 02 1
   61 26 56 14 3 64 28 56 21 3 43 30 11 01 1 58 40 12 02 1
   51 38 19 04 1 67 31 44 14 2 62 28 48 18 3 49 30 14 02 1
   51 35 14 02 1 56 30 45 15 2 58 27 41 10 2 50 34 16 04 1
   46 32 14 02 1 60 29 45 15 2 57 26 35 10 2 57 44 15 04 1
   50 36 14 02 1 77 30 61 23 3 63 34 56 24 3 58 27 51 19 3
   57 29 42 13 2 72 30 58 16 3 54 34 15 04 1 52 41 15 01 1
   71 30 59 21 3 64 31 55 18 3 60 30 48 18 3 63 29 56 18 3
   49 24 33 10 2 56 27 42 13 2 57 30 42 12 2 55 42 14 02 1
   49 31 15 02 1 77 26 69 23 3 60 22 50 15 3 54 39 17 04 1
   66 29 46 13 2 52 27 39 14 2 60 34 45 16 2 50 34 15 02 1
   44 29 14 02 1 50 20 35 10 2 55 24 37 10 2 58 27 39 12 2
   47 32 13 02 1 46 31 15 02 1 69 32 57 23 3 62 29 43 13 2
   74 28 61 19 3 59 30 42 15 2 51 34 15 02 1 50 35 13 03 1
   56 28 49 20 3 60 22 40 10 2 73 29 63 18 3 67 25 58 18 3
   49 31 15 01 1 67 31 47 15 2 63 23 44 13 2 54 37 15 02 1
   56 30 41 13 2 63 25 49 15 2 61 28 47 12 2 64 29 43 13 2
   51 25 30 11 2 57 28 41 13 2 65 30 58 22 3 69 31 54 21 3
   54 39 13 04 1 51 35 14 03 1 72 36 61 25 3 65 32 51 20 3
   61 29 47 14 2 56 29 36 13 2 69 31 49 15 2 64 27 53 19 3
   68 30 55 21 3 55 25 40 13 2 48 34 16 02 1 48 30 14 01 1
   45 23 13 03 1 57 25 50 20 3 57 38 17 03 1 51 38 15 03 1
   55 23 40 13 2 66 30 44 14 2 68 28 48 14 2 54 34 17 02 1
   51 37 15 04 1 52 35 15 02 1 58 28 51 24 3 67 30 50 17 2
   63 33 60 25 3 53 37 15 02 1
   ;

The following macro, SHOW, is used in the subsequent analyses to display cluster results. It invokes the FREQ procedure to crosstabulate clusters and species. The CANDISC procedure computes canonical variables for discriminating among the clusters, and the first two canonical variables are plotted to show cluster membership. See Chapter 21, "The CANDISC Procedure," for a canonical discriminant analysis of the iris species.

   %macro show;
   proc freq;
      tables cluster*species;
   run;
   proc candisc noprint out=can;
      class cluster;
      var petal: sepal:;
   run;
   legend1 frame cframe=ligr cborder=black 
           position=center value=(justify=center);
   axis1 label=(angle=90 rotate=0) minor=none;
   axis2 minor=none;
   proc gplot;
      plot can2*can1=cluster / 
         frame cframe=ligr legend=legend1 vaxis=axis1 haxis=axis2;    
   run;
   %mend;

The first analysis clusters the iris data by Ward's method and plots the CCC and the pseudo F and t² statistics. The CCC has a local peak at 3 clusters but a higher peak at 5 clusters. The pseudo F statistic indicates 3 clusters, while the pseudo t² statistic suggests 3 or 6 clusters. For large numbers of clusters, Version 6 of the SAS System produces somewhat different results from those of previous versions of PROC CLUSTER because of changes in the treatment of ties. Results are identical for 5 or fewer clusters.

The TREE procedure creates an output data set containing the 3-cluster partition for use by the SHOW macro. The FREQ procedure reveals 16 misclassifications. The results are shown in Output 23.3.1.

   title2 'By Ward''s Method';
   proc cluster data=iris method=ward print=15 ccc pseudo;
      var petal: sepal:;
      copy species;
   run;
   legend1 frame cframe=ligr cborder=black 
           position=center value=(justify=center);
   axis1 label=(angle=90 rotate=0) minor=none order=(0 to 600 by 100);
   axis2 minor=none order=(1 to 30 by 1);
   axis3 label=(angle=90 rotate=0) minor=none order=(0 to 7 by 1);

   proc gplot;
      plot _ccc_*_ncl_  /
         frame cframe=ligr legend=legend1 vaxis=axis3 haxis=axis2;
      plot _psf_*_ncl_  _pst2_*_ncl_  /overlay 
         frame cframe=ligr legend=legend1 vaxis=axis1 haxis=axis2;
   run;

   proc tree noprint ncl=3 out=out;
      copy petal: sepal: species;
   run;

   %show;

Output 23.3.1: Cluster Analysis of Fisher Iris Data: CLUSTER with METHOD=WARD

Cluster Analysis of Fisher (1936) Iris Data
By Ward's Method

The CLUSTER Procedure
Ward's Minimum Variance Cluster Analysis

Eigenvalues of the Covariance Matrix
      Eigenvalue    Difference    Proportion    Cumulative
1     422.824171    398.557096        0.9246        0.9246
2      24.267075     16.446125        0.0531        0.9777
3       7.820950      5.437441        0.0171        0.9948
4       2.383509                      0.0052        1.0000

Root-Mean-Square Total-Sample Standard Deviation = 10.69224

Root-Mean-Square Distance Between Observations = 30.24221

Cluster History
NCL   Clusters Joined   FREQ    SPRSQ    RSQ   ERSQ    CCC   PSF   PST2   Tie
 15   CL24    CL28        15   0.0016   .971   .958   5.93   324    9.8
 14   CL21    CL53         7   0.0019   .969   .955   5.85   329    5.1
 13   CL18    CL48        15   0.0023   .967   .953   5.69   334    8.9
 12   CL16    CL23        24   0.0023   .965   .950   4.63   342    9.6
 11   CL14    CL43        12   0.0025   .962   .946   4.67   353    5.8
 10   CL26    CL20        22   0.0027   .959   .942   4.81   368   12.9
  9   CL27    CL17        31   0.0031   .956   .936   5.02   387   17.8
  8   CL35    CL15        23   0.0031   .953   .930   5.44   414   13.8
  7   CL10    CL47        26   0.0058   .947   .921   5.43   430   19.1
  6   CL8     CL13        38   0.0060   .941   .911   5.81   463   16.3
  5   CL9     CL19        50   0.0105   .931   .895   5.82   488   43.2
  4   CL12    CL11        36   0.0172   .914   .872   3.99   515   41.0
  3   CL6     CL7         64   0.0301   .884   .827   4.33   558   57.2
  2   CL4     CL3        100   0.1110   .773   .697   3.83   503    116
  1   CL5     CL2        150   0.7726   .000   .000   0.00     .    503
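
As a rough check on the PSF column, the pseudo F statistic can be recomputed from the R-squared value by the usual formula PSF = (RSQ/(c-1)) / ((1-RSQ)/(n-c)), where c is the number of clusters and n is the number of observations. The following DATA step is a minimal sketch and is not part of the original example. The variable names are chosen for illustration, and the inputs are the rounded values from the NCL=3 line of the cluster history, so the result agrees with the listed value of 558 only approximately.

   data _null_;
      n   = 150;     /* number of observations in the iris data   */
      ncl = 3;       /* number of clusters (NCL=3 line above)     */
      rsq = 0.884;   /* RSQ as printed, rounded to three decimals */
      psf = (rsq / (ncl - 1)) / ((1 - rsq) / (n - ncl));
      put 'Pseudo F from rounded RSQ: ' psf 8.1;
   run;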


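[Plot: cubic clustering criterion (_CCC_) against number of clusters (_NCL_)]

[Plot: pseudo F (_PSF_) and pseudo t² (_PST2_) against number of clusters (_NCL_), overlaid]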

Cluster Analysis of Fisher (1936) Iris Data

The FREQ Procedure

Table of CLUSTER by Species

CLUSTER                Setosa   Versicolor   Virginica     Total
1         Frequency         0           49          15        64
          Percent        0.00        32.67       10.00     42.67
          Row Pct        0.00        76.56       23.44
          Col Pct        0.00        98.00       30.00

2         Frequency         0            1          35        36
          Percent        0.00         0.67       23.33     24.00
          Row Pct        0.00         2.78       97.22
          Col Pct        0.00         2.00       70.00

3         Frequency        50            0           0        50
          Percent       33.33         0.00        0.00     33.33
          Row Pct      100.00         0.00        0.00
          Col Pct      100.00         0.00        0.00

Total     Frequency        50           50          50       150
          Percent       33.33        33.33       33.33    100.00


[Plot: second canonical variable (Can2) against the first (Can1), by cluster]
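
The count of 16 misclassifications can be read off the crosstabulation by treating the most frequent species in each cluster as that cluster's label and counting every other observation as misclassified. The following steps are a sketch of that tally under this majority-label assumption and are not part of the original example. The data set name COUNTS and the column names TOTAL, MAJORITY, and MISCLASSIFIED are hypothetical; OUT is the data set created by PROC TREE above.

   proc freq data=out noprint;
      tables cluster*species / out=counts;   /* one row per nonempty cell */
   run;

   proc sql;
      /* per cluster: size minus the count of its most frequent species */
      select sum(total - majority) as Misclassified
         from (select cluster,
                      sum(count) as total,
                      max(count) as majority
                  from counts
                  group by cluster);
   quit;

For the Ward's method partition above, the three clusters contribute 15 + 1 + 0 = 16.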

The second analysis uses two-stage density linkage. The raw data suggest 2 or 6 modes instead of 3:

k       modes
3          12
4-6         6
7           4
8           3
9-50        2
51+         1

However, the ACECLUS procedure can be used to reveal 3 modes. This analysis uses K=8 to produce 3 clusters for comparison with other analyses. There are only 6 misclassifications. The results are shown in Output 23.3.2.

   title2 'By Two-Stage Density Linkage';
   proc cluster data=iris method=twostage k=8 print=15 ccc pseudo;
      var petal: sepal:;
      copy species;
   run;

   proc tree noprint ncl=3 out=out;
      copy petal: sepal: species;
   run;

   %show;

Output 23.3.2: Cluster Analysis of Fisher Iris Data: CLUSTER with METHOD=TWOSTAGE

Cluster Analysis of Fisher (1936) Iris Data
By Two-Stage Density Linkage

The CLUSTER Procedure
Two-Stage Density Linkage Clustering

Eigenvalues of the Covariance Matrix
      Eigenvalue    Difference    Proportion    Cumulative
1     422.824171    398.557096        0.9246        0.9246
2      24.267075     16.446125        0.0531        0.9777
3       7.820950      5.437441        0.0171        0.9948
4       2.383509                      0.0052        1.0000

K = 8

Root-Mean-Square Total-Sample Standard Deviation = 10.69224

Cluster History
                                                                       Normalized    Maximum Density
                                                                           Fusion    in Each Cluster
NCL   Clusters Joined  FREQ    SPRSQ    RSQ   ERSQ     CCC   PSF   PST2   Density    Lesser   Greater   Tie
 15   CL17    OB127      44   0.0025   .916   .958     -11   105    3.4    0.3903    0.2066    3.5156
 14   CL16    OB137      50   0.0023   .913   .955     -11   110    5.6    0.3637    0.1837     100.0
 13   CL15    OB74       45   0.0029   .910   .953     -10   116    3.7    0.3553    0.2130    3.5156
 12   CL28    OB49       46   0.0036   .907   .950    -8.0   122    5.2    0.3223    0.1736    8.3678   T
 11   CL12    OB85       47   0.0036   .903   .946    -7.6   130    4.8    0.3223    0.1736    8.3678
 10   CL11    OB98       48   0.0033   .900   .942    -7.1   140    4.1    0.2879    0.1479    8.3678
  9   CL13    OB24       46   0.0037   .896   .936    -6.5   152    4.4    0.2802    0.2005    3.5156
  8   CL10    OB25       49   0.0019   .894   .930    -5.5   171    2.2    0.2699    0.1372    8.3678
  7   CL8     OB121      50   0.0035   .891   .921    -4.5   194    4.0    0.2586    0.1372    8.3678
  6   CL9     OB45       47   0.0042   .886   .911    -3.3   225    4.6    0.1412    0.0832    3.5156
  5   CL6     OB39       48   0.0049   .882   .895    -1.7   270    5.0     0.107    0.0605    3.5156
  4   CL5     OB21       49   0.0049   .877   .872    0.35   346    4.7    0.0969    0.0541    3.5156
  3   CL4     OB90       50   0.0047   .872   .827    3.28   500    4.1    0.0715    0.0370    3.5156
  2   CL3     CL7       100   0.0993   .773   .697    3.83   503   91.9    2.6277    3.5156    8.3678

3 modal clusters have been formed.


Cluster Analysis of Fisher (1936) Iris Data

The FREQ Procedure

Table of CLUSTER by Species

CLUSTER                Setosa   Versicolor   Virginica     Total
1         Frequency        50            0           0        50
          Percent       33.33         0.00        0.00     33.33
          Row Pct      100.00         0.00        0.00
          Col Pct      100.00         0.00        0.00

2         Frequency         0           47           3        50
          Percent        0.00        31.33        2.00     33.33
          Row Pct        0.00        94.00        6.00
          Col Pct        0.00        94.00        6.00

3         Frequency         0            3          47        50
          Percent        0.00         2.00       31.33     33.33
          Row Pct        0.00         6.00       94.00
          Col Pct        0.00         6.00       94.00

Total     Frequency        50           50          50       150
          Percent       33.33        33.33       33.33    100.00


[Plot: second canonical variable (Can2) against the first (Can1), by cluster]
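
The remark that the ACECLUS procedure can be used to reveal 3 modes could be explored along the following lines. This is a sketch only and is not part of the original example: the PROPORTION= value, the K= value, and the data set name ACE are assumptions chosen for illustration, and the number of modal clusters that results is not verified here. PROC ACECLUS writes the canonical variables Can1 through Can4 to its OUT= data set, and those variables are then clustered in place of the raw measurements.

   proc aceclus data=iris out=ace proportion=.02 noprint;
      var petal: sepal:;
   run;

   /* K=8 is carried over from the analysis above as a starting value only */
   proc cluster data=ace method=twostage k=8 print=15 ccc pseudo;
      var can1-can4;
      copy species;
   run;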

The CLUSTER procedure is not practical for very large data sets because, with most methods, the CPU time varies as the square or cube of the number of observations. The FASTCLUS procedure requires time proportional to the number of observations and can, therefore, be used with much larger data sets than PROC CLUSTER. If you want to hierarchically cluster a very large data set, you can use PROC FASTCLUS for a preliminary cluster analysis producing a large number of clusters and then use PROC CLUSTER to hierarchically cluster the preliminary clusters.

PROC FASTCLUS automatically creates the variables _FREQ_ and _RMSSTD_ in the MEAN= output data set. PROC CLUSTER then uses these variables automatically when it computes various statistics.

The iris data are used to illustrate the process of clustering clusters. In the preliminary analysis, PROC FASTCLUS produces ten clusters, which are then crosstabulated with species. The data set containing the preliminary clusters is sorted in preparation for later merges. The results are shown in Output 23.3.3.

   title2 'Preliminary Analysis by FASTCLUS';
   proc fastclus data=iris summary maxc=10 maxiter=99 converge=0
                 mean=mean out=prelim cluster=preclus;
      var petal: sepal:;
   run;

   proc freq;
      tables preclus*species;
   run;

   proc sort data=prelim;
      by preclus;
   run;

Output 23.3.3: Preliminary Analysis of Fisher Iris Data

Cluster Analysis of Fisher (1936) Iris Data
Preliminary Analysis by FASTCLUS

The FASTCLUS Procedure
Replace=FULL Radius=0 Maxclusters=10 Maxiter=99 Converge=0

Cluster Summary
                                     Maximum Distance                          Distance Between
                         RMS Std            from Seed     Radius    Nearest             Cluster
Cluster   Frequency    Deviation       to Observation   Exceeded    Cluster           Centroids
      1           9       2.7067               8.2027                     5              8.7362
      2          19       2.2001               7.7340                     4              6.2243
      3          18       2.1496               6.2173                     8              7.5049
      4           4       2.5249               5.3268                     2              6.2243
      5           3       2.7234               5.8214                     1              8.7362
      6           7       2.2939               5.1508                     2              9.3318
      7          17       2.0274               6.9576                    10              7.9503
      8          18       2.2628               7.1135                     3              7.5049
      9          22       2.2666               7.5029                     8              9.0090
     10          33       2.0594              10.0033                     7              7.9503

Pseudo F Statistic = 370.58

Observed Over-All R-Squared = 0.95971

Approximate Expected Over-All R-Squared = 0.82928

Cubic Clustering Criterion = 27.077

WARNING: The two values above are invalid for correlated variables.


Cluster Analysis of Fisher (1936) Iris Data
Preliminary Analysis by FASTCLUS

The FREQ Procedure

Table of PRECLUS by Species

PRECLUS(Cluster)               Setosa   Versicolor   Virginica     Total
1                 Frequency         0            0           9         9
                  Percent        0.00         0.00        6.00      6.00
                  Row Pct        0.00         0.00      100.00
                  Col Pct        0.00         0.00       18.00

2                 Frequency         0           19           0        19
                  Percent        0.00        12.67        0.00     12.67
                  Row Pct        0.00       100.00        0.00
                  Col Pct        0.00        38.00        0.00

3                 Frequency         0           18           0        18
                  Percent        0.00        12.00        0.00     12.00
                  Row Pct        0.00       100.00        0.00
                  Col Pct        0.00        36.00        0.00

4                 Frequency         0            3           1         4
                  Percent        0.00         2.00        0.67      2.67
                  Row Pct        0.00        75.00       25.00
                  Col Pct        0.00         6.00        2.00

5                 Frequency         0            0           3         3
                  Percent        0.00         0.00        2.00      2.00
                  Row Pct        0.00         0.00      100.00
                  Col Pct        0.00         0.00        6.00

6                 Frequency         0            7           0         7
                  Percent        0.00         4.67        0.00      4.67
                  Row Pct        0.00       100.00        0.00
                  Col Pct        0.00        14.00        0.00

7                 Frequency        17            0           0        17
                  Percent       11.33         0.00        0.00     11.33
                  Row Pct      100.00         0.00        0.00
                  Col Pct       34.00         0.00        0.00

8                 Frequency         0            3          15        18
                  Percent        0.00         2.00       10.00     12.00
                  Row Pct        0.00        16.67       83.33
                  Col Pct        0.00         6.00       30.00

9                 Frequency         0            0          22        22
                  Percent        0.00         0.00       14.67     14.67
                  Row Pct        0.00         0.00      100.00
                  Col Pct        0.00         0.00       44.00

10                Frequency        33            0           0        33
                  Percent       22.00         0.00        0.00     22.00
                  Row Pct      100.00         0.00        0.00
                  Col Pct       66.00         0.00        0.00

Total             Frequency        50           50          50       150
                  Percent       33.33        33.33       33.33    100.00
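
Before clustering the preliminary clusters, it can help to look at what PROC CLUSTER will receive. The following step is a small sketch and is not part of the original example. It prints the MEAN= data set created above: one observation per preliminary cluster, with the cluster means of the four measurements and the _FREQ_ and _RMSSTD_ variables that PROC CLUSTER picks up automatically.

   proc print data=mean noobs;
      var preclus _freq_ _rmsstd_ petal: sepal:;
   run;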


The following macro, CLUS, clusters the preliminary clusters. Its single argument supplies the METHOD= specification for PROC CLUSTER. The TREE procedure creates an output data set containing the 3-cluster partition, which is sorted and merged with the OUT= data set from PROC FASTCLUS to determine the cluster to which each of the original 150 observations belongs. The SHOW macro is then used to display the results. In this example, the CLUS macro is invoked with Ward's method, which produces 16 misclassifications, and with Wong's hybrid method, which produces 22 misclassifications. The results are shown in Output 23.3.4 and Output 23.3.5.

   %macro clus(method);
   proc cluster data=mean method=&method ccc pseudo;
      var petal: sepal:;
      copy preclus;
   run;
   proc tree noprint ncl=3 out=out;
      copy petal: sepal: preclus;
   run;
   proc sort data=out;
      by preclus;
   run;
   data clus;
      merge prelim out;
      by preclus;
   run;
   %show;
   %mend;

   title2 'Clustering Clusters by Ward''s Method';
   %clus(ward);

   title2 'Clustering Clusters by Wong''s Hybrid Method';
   %clus(twostage hybrid);

Output 23.3.4: Clustering Clusters: PROC CLUSTER with Ward's Method

Cluster Analysis of Fisher (1936) Iris Data
Clustering Clusters by Ward's Method

The CLUSTER Procedure
Ward's Minimum Variance Cluster Analysis

Eigenvalues of the Covariance Matrix
      Eigenvalue    Difference    Proportion    Cumulative
1     416.976349    398.666421        0.9501        0.9501
2      18.309928     14.952922        0.0417        0.9918
3       3.357006      3.126943        0.0076        0.9995
4       0.230063                      0.0005        1.0000

Root-Mean-Square Total-Sample Standard Deviation = 10.69224

Root-Mean-Square Distance Between Observations = 30.24221

Cluster History
NCL   Clusters Joined   FREQ    SPRSQ    RSQ   ERSQ    CCC   PSF   PST2   Tie
  9   OB2     OB4         23   0.0019   .958   .932   6.26   400    6.3
  8   OB1     OB5         12   0.0025   .955   .926   6.75   434    5.8
  7   CL9     OB6         30   0.0069   .948   .918   6.28   438   19.5
  6   OB3     OB8         36   0.0074   .941   .907   6.21   459   26.0
  5   OB7     OB10        50   0.0104   .931   .892   6.15   485   42.2
  4   CL8     OB9         34   0.0162   .914   .870   4.28   519   39.3
  3   CL7     CL6         66   0.0318   .883   .824   4.39   552   59.7
  2   CL4     CL3        100   0.1099   .773   .695   3.94   503    113
  1   CL2     CL5        150   0.7726   .000   .000   0.00     .    503


Cluster Analysis of Fisher (1936) Iris Data

The FREQ Procedure

Table of CLUSTER by Species

CLUSTER                Setosa   Versicolor   Virginica     Total
1         Frequency         0           50          16        66
          Percent        0.00        33.33       10.67     44.00
          Row Pct        0.00        75.76       24.24
          Col Pct        0.00       100.00       32.00

2         Frequency         0            0          34        34
          Percent        0.00         0.00       22.67     22.67
          Row Pct        0.00         0.00      100.00
          Col Pct        0.00         0.00       68.00

3         Frequency        50            0           0        50
          Percent       33.33         0.00        0.00     33.33
          Row Pct      100.00         0.00        0.00
          Col Pct      100.00         0.00        0.00

Total     Frequency        50           50          50       150
          Percent       33.33        33.33       33.33    100.00


[Plot: second canonical variable (Can2) against the first (Can1), by cluster]

Output 23.3.5: Clustering Clusters: PROC CLUSTER with Wong's Hybrid Method

Cluster Analysis of Fisher (1936) Iris Data
Clustering Clusters by Wong's Hybrid Method

The CLUSTER Procedure
Two-Stage Density Linkage Clustering

Eigenvalues of the Covariance Matrix
      Eigenvalue    Difference    Proportion    Cumulative
1     416.976349    398.666421        0.9501        0.9501
2      18.309928     14.952922        0.0417        0.9918
3       3.357006      3.126943        0.0076        0.9995
4       0.230063                      0.0005        1.0000

Root-Mean-Square Total-Sample Standard Deviation = 10.69224

Cluster History
                                                                       Normalized    Maximum Density
                                                                           Fusion    in Each Cluster
NCL   Clusters Joined  FREQ    SPRSQ    RSQ   ERSQ     CCC   PSF   PST2   Density    Lesser   Greater   Tie
  9   OB10    OB7        50   0.0104   .949   .932    3.81   330   42.2     40.24   58.2179     100.0
  8   OB3     OB8        36   0.0074   .942   .926    3.22   329   26.0    27.981   39.4511   48.4350
  7   OB2     OB4        23   0.0019   .940   .918    4.24   373    6.3    23.775    8.9675   46.3026
  6   CL8     OB9        58   0.0194   .921   .907    2.13   334   46.3    20.724   46.8846   48.4350
  5   CL7     OB6        30   0.0069   .914   .892    3.09   383   19.5    13.303   17.6360   46.3026
  4   CL6     OB1        67   0.0292   .884   .870    1.21   372   41.0    8.4137   10.8758   48.4350
  3   CL4     OB5        70   0.0138   .871   .824    3.33   494   12.3    5.1855    6.2890   48.4350
  2   CL3     CL5       100   0.0979   .773   .695    3.94   503   89.5    19.513   46.3026   48.4350
  1   CL2     CL9       150   0.7726   .000   .000    0.00     .    503    1.3337   48.4350     100.0

3 modal clusters have been formed.


Cluster Analysis of Fisher (1936) Iris Data

The FREQ Procedure

Table of CLUSTER by Species

CLUSTER                Setosa   Versicolor   Virginica     Total
1         Frequency        50            0           0        50
          Percent       33.33         0.00        0.00     33.33
          Row Pct      100.00         0.00        0.00
          Col Pct      100.00         0.00        0.00

2         Frequency         0           21          49        70
          Percent        0.00        14.00       32.67     46.67
          Row Pct        0.00        30.00       70.00
          Col Pct        0.00        42.00       98.00

3         Frequency         0           29           1        30
          Percent        0.00        19.33        0.67     20.00
          Row Pct        0.00        96.67        3.33
          Col Pct        0.00        58.00        2.00

Total     Frequency        50           50          50       150
          Percent       33.33        33.33       33.33    100.00


[Plot: second canonical variable (Can2) against the first (Can1), by cluster]


Copyright © 1999 by SAS Institute Inc., Cary, NC, USA. All rights reserved.