The ACECLUS Procedure

Example 16.1: Transformation and Cluster Analysis of Fisher Iris Data

The iris data published by Fisher (1936) have been widely used for examples in discriminant analysis and cluster analysis. The sepal length, sepal width, petal length, and petal width are measured in millimeters on fifty iris specimens from each of three species, Iris setosa, I. versicolor, and I. virginica. Mezzich and Solomon (1980) discuss a variety of cluster analyses of the iris data.

In this example, PROC ACECLUS transforms the data and PROC FASTCLUS then clusters the transformed data. Compare the results with the example in Chapter 27, "The FASTCLUS Procedure." The crosstabulation from the FREQ procedure shows fewer misclassifications when PROC ACECLUS is used: only four of the 150 specimens fall in a cluster dominated by another species (Output 16.1.5). The following statements produce Output 16.1.1 through Output 16.1.5.

   proc format;
      value specname
         1='Setosa    '
         2='Versicolor'
         3='Virginica ';
   run;

   data iris;
      title 'Fisher (1936) Iris Data';
      input SepalLength SepalWidth PetalLength PetalWidth Species @@;
      format Species specname.;
      label SepalLength='Sepal Length in mm.'
            SepalWidth ='Sepal Width in mm.'
            PetalLength='Petal Length in mm.'
            PetalWidth ='Petal Width in mm.';
      symbol = put(species, specname10.);
      datalines;
   50 33 14 02 1 64 28 56 22 3 65 28 46 15 2 67 31 56 24 3
   63 28 51 15 3 46 34 14 03 1 69 31 51 23 3 62 22 45 15 2
   59 32 48 18 2 46 36 10 02 1 61 30 46 14 2 60 27 51 16 2
   65 30 52 20 3 56 25 39 11 2 65 30 55 18 3 58 27 51 19 3
   68 32 59 23 3 51 33 17 05 1 57 28 45 13 2 62 34 54 23 3
   77 38 67 22 3 63 33 47 16 2 67 33 57 25 3 76 30 66 21 3
   49 25 45 17 3 55 35 13 02 1 67 30 52 23 3 70 32 47 14 2
   64 32 45 15 2 61 28 40 13 2 48 31 16 02 1 59 30 51 18 3
   55 24 38 11 2 63 25 50 19 3 64 32 53 23 3 52 34 14 02 1
   49 36 14 01 1 54 30 45 15 2 79 38 64 20 3 44 32 13 02 1
   67 33 57 21 3 50 35 16 06 1 58 26 40 12 2 44 30 13 02 1
   77 28 67 20 3 63 27 49 18 3 47 32 16 02 1 55 26 44 12 2
   50 23 33 10 2 72 32 60 18 3 48 30 14 03 1 51 38 16 02 1
   61 30 49 18 3 48 34 19 02 1 50 30 16 02 1 50 32 12 02 1
   61 26 56 14 3 64 28 56 21 3 43 30 11 01 1 58 40 12 02 1
   51 38 19 04 1 67 31 44 14 2 62 28 48 18 3 49 30 14 02 1
   51 35 14 02 1 56 30 45 15 2 58 27 41 10 2 50 34 16 04 1
   46 32 14 02 1 60 29 45 15 2 57 26 35 10 2 57 44 15 04 1
   50 36 14 02 1 77 30 61 23 3 63 34 56 24 3 58 27 51 19 3
   57 29 42 13 2 72 30 58 16 3 54 34 15 04 1 52 41 15 01 1
   71 30 59 21 3 64 31 55 18 3 60 30 48 18 3 63 29 56 18 3
   49 24 33 10 2 56 27 42 13 2 57 30 42 12 2 55 42 14 02 1
   49 31 15 02 1 77 26 69 23 3 60 22 50 15 3 54 39 17 04 1
   66 29 46 13 2 52 27 39 14 2 60 34 45 16 2 50 34 15 02 1
   44 29 14 02 1 50 20 35 10 2 55 24 37 10 2 58 27 39 12 2
   47 32 13 02 1 46 31 15 02 1 69 32 57 23 3 62 29 43 13 2
   74 28 61 19 3 59 30 42 15 2 51 34 15 02 1 50 35 13 03 1
   56 28 49 20 3 60 22 40 10 2 73 29 63 18 3 67 25 58 18 3
   49 31 15 01 1 67 31 47 15 2 63 23 44 13 2 54 37 15 02 1
   56 30 41 13 2 63 25 49 15 2 61 28 47 12 2 64 29 43 13 2
   51 25 30 11 2 57 28 41 13 2 65 30 58 22 3 69 31 54 21 3
   54 39 13 04 1 51 35 14 03 1 72 36 61 25 3 65 32 51 20 3
   61 29 47 14 2 56 29 36 13 2 69 31 49 15 2 64 27 53 19 3
   68 30 55 21 3 55 25 40 13 2 48 34 16 02 1 48 30 14 01 1
   45 23 13 03 1 57 25 50 20 3 57 38 17 03 1 51 38 15 03 1
   55 23 40 13 2 66 30 44 14 2 68 28 48 14 2 54 34 17 02 1
   51 37 15 04 1 52 35 15 02 1 58 28 51 24 3 67 30 50 17 2
   63 33 60 25 3 53 37 15 02 1
   ;

   proc aceclus data=iris out=ace p=.02 outstat=score;
      var SepalLength SepalWidth PetalLength PetalWidth ;
   run;

   legend1 frame cframe=ligr cborder=black position=center
           value=(justify=center);
   axis1 label=(angle=90 rotate=0) minor=none;
   axis2 minor=none;
   proc gplot data=ace;
      plot can2*can1=Species/
         frame cframe=ligr legend=legend1 vaxis=axis1 haxis=axis2;
      format Species specname. ;
   run;
   proc fastclus data=ace maxc=3 maxiter=10 conv=0 out=clus;
      var can:;
   run;
   proc freq data=clus;
      tables cluster*Species;
   run;
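
For the comparison with clustering of the untransformed data mentioned above, a minimal sketch follows. It assumes the same PROC FASTCLUS options as in this example (the Chapter 27 example may use different ones), and the data set name rawclus is hypothetical. Its output is not shown here.

```sas
/* Sketch only: cluster the original measurements without the      */
/* ACECLUS transformation. The name rawclus is an assumption.      */
proc fastclus data=iris maxc=3 maxiter=10 conv=0 out=rawclus;
   var SepalLength SepalWidth PetalLength PetalWidth;
run;

/* Crosstabulate the raw-data clusters against the known species.  */
proc freq data=rawclus;
   tables cluster*Species;
run;
```

Comparing this crosstabulation with Output 16.1.5 shows how much the transformation changes the agreement between clusters and species.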

Output 16.1.1: Using PROC ACECLUS to Transform Fisher's Iris Data

Fisher (1936) Iris Data

The ACECLUS Procedure

Approximate Covariance Estimation for Cluster Analysis

Observations    150    Proportion    0.0200
Variables         4    Converge      0.00100

Means and Standard Deviations
Variable          Mean   Standard Deviation   Label
SepalLength    58.4333               8.2807   Sepal Length in mm.
SepalWidth     30.5733               4.3587   Sepal Width in mm.
PetalLength    37.5800              17.6530   Petal Length in mm.
PetalWidth     11.9933               7.6224   Petal Width in mm.

Initial Within-Cluster Covariance Estimate = Full Covariance Matrix

COV: Total Sample Covariances
  SepalLength SepalWidth PetalLength PetalWidth
SepalLength 68.5693512 -4.2434004 127.4315436 51.6270694
SepalWidth -4.2434004 18.9979418 -32.9656376 -12.1639374
PetalLength 127.4315436 -32.9656376 311.6277852 129.5609396
PetalWidth 51.6270694 -12.1639374 129.5609396 58.1006264

Threshold = 0.334211

Iteration History
Iteration   RMS Distance   Distance Cutoff   Pairs Within Cutoff   Convergence Measure
        1          2.828             0.945                 408.0              0.465775
        2         11.905             3.979                 559.0              0.013487
        3         13.152             4.396                 940.0              0.029499
        4         13.439             4.491                1506.0              0.046846
        5         13.271             4.435                2036.0              0.046859
        6         12.591             4.208                2285.0              0.025027
        7         12.199             4.077                2366.0              0.009559
        8         12.121             4.051                2402.0              0.003895
        9         12.064             4.032                2417.0              0.002051
       10         12.047             4.026                2429.0              0.000971

Algorithm converged.

ACE: Approximate Covariance Estimate Within Clusters
  SepalLength SepalWidth PetalLength PetalWidth
SepalLength 11.73342939 5.47550432 4.95389049 2.02902429
SepalWidth 5.47550432 6.91992590 2.42177851 1.74125154
PetalLength 4.95389049 2.42177851 6.53746398 2.35302594
PetalWidth 2.02902429 1.74125154 2.35302594 2.05166735

Output 16.1.2: Eigenvalues, Raw Canonical Coefficients, and Standardized Canonical Coefficients

The ACECLUS Procedure

Approximate Covariance Estimation for Cluster Analysis

Initial Within-Cluster Covariance Estimate = Full Covariance Matrix

Eigenvalues of Inv(ACE)*(COV-ACE)
     Eigenvalue   Difference   Proportion   Cumulative
1       63.7716      61.1593       0.9367       0.9367
2        2.6123       1.5561       0.0384       0.9751
3        1.0562       0.4167       0.0155       0.9906
4        0.6395                   0.00939       1.0000
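
The Proportion and Cumulative columns can be checked by hand: each proportion is an eigenvalue divided by the sum of all four eigenvalues. The following DATA step is a sketch of that arithmetic, with the eigenvalues copied from the table above. The variable names are illustrative and are not part of the PROC ACECLUS output.

```sas
data _null_;
   /* Eigenvalues copied from the table above */
   array ev[4] _temporary_ (63.7716 2.6123 1.0562 0.6395);
   total = sum(of ev[*]);
   cum = 0;
   do i = 1 to 4;
      prop = ev[i] / total;   /* share of the eigenvalue sum     */
      cum  = cum + prop;      /* running total of the shares     */
      put i= prop= 8.4 cum= 8.4;
   end;
run;
```

The values written to the log should agree with the Proportion and Cumulative columns up to rounding.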

Eigenvectors (Raw Canonical Coefficients)
Variable      Label                      Can1       Can2       Can3       Can4
SepalLength   Sepal Length in mm.    -.012009   -.098074   -.059852   0.402352
SepalWidth    Sepal Width in mm.     -.211068   -.000072   0.402391   -.225993
PetalLength   Petal Length in mm.    0.324705   -.328583   0.110383   -.321069
PetalWidth    Petal Width in mm.     0.266239   0.870434   -.085215   0.320286

Standardized Canonical Coefficients
Variable      Label                      Can1       Can2       Can3       Can4
SepalLength   Sepal Length in mm.    -0.09944   -0.81211   -0.49562    3.33174
SepalWidth    Sepal Width in mm.     -0.91998   -0.00031    1.75389   -0.98503
PetalLength   Petal Length in mm.     5.73200   -5.80047    1.94859   -5.66782
PetalWidth    Petal Width in mm.      2.02937    6.63478   -0.64954    2.44134
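
The printed values suggest that each standardized coefficient is the raw coefficient multiplied by the variable's standard deviation (for example, 0.324705 x 17.6530 is approximately 5.7320). As a further check, the sketch below recomputes the first canonical variable from the raw coefficients. It assumes that the scores are centered at the overall variable means, which is consistent with the cluster means of the canonical variables summing to zero but is not stated in this example. The names can1check, Can1Check, and Diff are hypothetical.

```sas
data can1check;
   set ace;   /* OUT= data set from PROC ACECLUS: original variables plus Can1-Can4 */
   /* Raw Can1 coefficients and variable means copied from the output. */
   /* Centering at the overall means is an assumption of this sketch.  */
   Can1Check = -0.012009 * (SepalLength - 58.4333)
               -0.211068 * (SepalWidth  - 30.5733)
               +0.324705 * (PetalLength - 37.5800)
               +0.266239 * (PetalWidth  - 11.9933);
   Diff = Can1 - Can1Check;
run;

/* If the assumption holds, Diff stays near zero for every observation, */
/* apart from rounding in the printed coefficients and means.           */
proc means data=can1check min max;
   var Diff;
run;
```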

Output 16.1.3: Plot of Transformed Iris Data: PROC GPLOT
[Figure: plot of Can2 against Can1, with points identified by Species]

Output 16.1.4: Clustering of Transformed Iris Data: Partial Output from PROC FASTCLUS

The FASTCLUS Procedure
Replace=FULL Radius=0 Maxclusters=3 Maxiter=10 Converge=0

Cluster Summary
Cluster   Frequency   RMS Std     Maximum Distance   Radius     Nearest   Distance Between
                      Deviation   from Seed          Exceeded   Cluster   Cluster Centroids
                                  to Observation
1                50      1.1016             5.2768                    3             13.2845
2                50      1.8880             6.8298                    3              5.8580
3                50      1.4138             5.3152                    2              5.8580

Statistics for Variables
Variable Total STD Within STD R-Square RSQ/(1-RSQ)
Can1 8.04808 1.48537 0.966394 28.756658
Can2 1.90061 1.85646 0.058725 0.062389
Can3 1.43395 1.32518 0.157417 0.186826
Can4 1.28044 1.27550 0.021025 0.021477
OVER-ALL 4.24499 1.50298 0.876324 7.085666
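
The columns of this table and the pseudo F statistic reported with it are related by simple arithmetic. The sketch below reproduces the Can1 row and the pseudo F value from the printed numbers, taking n = 150 observations and c = 3 clusters. The formulas are inferred from the printed values and the usual definitions of these statistics, so treat them as a consistency check and not as a statement of how PROC FASTCLUS computes them internally.

```sas
data _null_;
   n = 150;                  /* number of observations              */
   c = 3;                    /* number of clusters                  */
   total_std  = 8.04808;     /* Can1 Total STD from the table       */
   within_std = 1.48537;     /* Can1 Within STD from the table      */

   /* R-square from the within and total sums of squares */
   rsq   = 1 - ((n - c) * within_std**2) / ((n - 1) * total_std**2);
   ratio = rsq / (1 - rsq);

   /* Pseudo F from the OVER-ALL RSQ/(1-RSQ) value */
   overall_ratio = 7.085666;
   pseudo_f = overall_ratio * (n - c) / (c - 1);

   put rsq= 8.6 ratio= 10.6 pseudo_f= 8.2;
run;
```

The logged values should match the Can1 row of the table and the reported pseudo F statistic up to rounding.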

Pseudo F Statistic = 520.80

Approximate Expected Over-All R-Squared = 0.80391

Cubic Clustering Criterion = 5.179

WARNING: The two above values are invalid for correlated variables.

Cluster Means
Cluster Can1 Can2 Can3 Can4
1 -10.67516964 0.06706906 0.27068819 0.11164209
2 8.12988211 0.52566663 0.51836499 0.14915404
3 2.54528754 -0.59273569 -0.78905317 -0.26079612

Cluster Standard Deviations
Cluster Can1 Can2 Can3 Can4
1 0.953761025 0.931943571 1.398456061 1.058217627
2 1.799159552 2.743869556 1.270344142 1.370523175
3 1.572366584 1.393565864 1.303411851 1.372050319

Output 16.1.5: Crosstabulation of Cluster by Species for Fisher's Iris Data: PROC FREQ

The FREQ Procedure

Table of CLUSTER by Species

Cell contents, top to bottom: Frequency, Percent, Row Pct, Col Pct

CLUSTER(Cluster)   Species
                   Setosa   Versicolor   Virginica    Total
1                      50            0           0       50
                    33.33         0.00        0.00    33.33
                   100.00         0.00        0.00
                   100.00         0.00        0.00

2                       0            2          48       50
                     0.00         1.33       32.00    33.33
                     0.00         4.00       96.00
                     0.00         4.00       96.00

3                       0           48           2       50
                     0.00        32.00        1.33    33.33
                     0.00        96.00        4.00
                     0.00        96.00        4.00

Total                  50           50          50      150
                    33.33        33.33       33.33   100.00
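
The misclassification count can be read from the crosstabulation by treating the largest cell in each cluster as correctly classified. The sketch below does this with the nonzero cell counts entered by hand. The data set names crosstab and majority and the column names are hypothetical, and Species is coded 1, 2, 3 as in the specname format.

```sas
/* Nonzero cell counts copied from the crosstabulation */
data crosstab;
   input Cluster Species Count;
   datalines;
1 1 50
2 2 2
2 3 48
3 2 48
3 3 2
;
run;

proc sql;
   /* Largest cell and total size for each cluster */
   create table majority as
      select Cluster,
             max(Count) as Majority,
             sum(Count) as Size
      from crosstab
      group by Cluster;

   /* Observations outside the majority species of their cluster */
   select sum(Size) as Total,
          sum(Size) - sum(Majority) as Misclassified,
          (sum(Size) - sum(Majority)) / sum(Size) as Rate format=percent8.2
   from majority;
quit;
```

With the counts above, four of the 150 observations fall outside the majority species of their cluster.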

Copyright © 1999 by SAS Institute Inc., Cary, NC, USA. All rights reserved.