Example 23.3: Cluster Analysis of Fisher Iris Data

The CLUSTER Procedure

The iris data published by Fisher (1936) have been widely used for examples in discriminant analysis and cluster analysis. The sepal length, sepal width, petal length, and petal width are measured in millimeters on fifty iris specimens from each of three species, Iris setosa, I. versicolor, and I. virginica. Mezzich and Solomon (1980) discuss a variety of cluster analyses of the iris data.

This example analyzes the iris data by Ward's method and two-stage density linkage and then illustrates how the FASTCLUS procedure can be used in combination with PROC CLUSTER to analyze large data sets.

   title 'Cluster Analysis of Fisher (1936) Iris Data';
   proc format;
      value specname
         1='Setosa    '
         2='Versicolor'
         3='Virginica ';
   run;

   data iris;
      input SepalLength SepalWidth PetalLength PetalWidth Species @@;
      format Species specname.;
      label SepalLength='Sepal Length in mm.'
            SepalWidth ='Sepal Width in mm.'
            PetalLength='Petal Length in mm.'
            PetalWidth ='Petal Width in mm.';
      symbol = put(species, specname10.);
      datalines;
   50 33 14 02 1 64 28 56 22 3 65 28 46 15 2 67 31 56 24 3
   63 28 51 15 3 46 34 14 03 1 69 31 51 23 3 62 22 45 15 2
   59 32 48 18 2 46 36 10 02 1 61 30 46 14 2 60 27 51 16 2
   65 30 52 20 3 56 25 39 11 2 65 30 55 18 3 58 27 51 19 3
   68 32 59 23 3 51 33 17 05 1 57 28 45 13 2 62 34 54 23 3
   77 38 67 22 3 63 33 47 16 2 67 33 57 25 3 76 30 66 21 3
   49 25 45 17 3 55 35 13 02 1 67 30 52 23 3 70 32 47 14 2
   64 32 45 15 2 61 28 40 13 2 48 31 16 02 1 59 30 51 18 3
   55 24 38 11 2 63 25 50 19 3 64 32 53 23 3 52 34 14 02 1
   49 36 14 01 1 54 30 45 15 2 79 38 64 20 3 44 32 13 02 1
   67 33 57 21 3 50 35 16 06 1 58 26 40 12 2 44 30 13 02 1
   77 28 67 20 3 63 27 49 18 3 47 32 16 02 1 55 26 44 12 2
   50 23 33 10 2 72 32 60 18 3 48 30 14 03 1 51 38 16 02 1
   61 30 49 18 3 48 34 19 02 1 50 30 16 02 1 50 32 12 02 1
   61 26 56 14 3 64 28 56 21 3 43 30 11 01 1 58 40 12 02 1
   51 38 19 04 1 67 31 44 14 2 62 28 48 18 3 49 30 14 02 1
   51 35 14 02 1 56 30 45 15 2 58 27 41 10 2 50 34 16 04 1
   46 32 14 02 1 60 29 45 15 2 57 26 35 10 2 57 44 15 04 1
   50 36 14 02 1 77 30 61 23 3 63 34 56 24 3 58 27 51 19 3
   57 29 42 13 2 72 30 58 16 3 54 34 15 04 1 52 41 15 01 1
   71 30 59 21 3 64 31 55 18 3 60 30 48 18 3 63 29 56 18 3
   49 24 33 10 2 56 27 42 13 2 57 30 42 12 2 55 42 14 02 1
   49 31 15 02 1 77 26 69 23 3 60 22 50 15 3 54 39 17 04 1
   66 29 46 13 2 52 27 39 14 2 60 34 45 16 2 50 34 15 02 1
   44 29 14 02 1 50 20 35 10 2 55 24 37 10 2 58 27 39 12 2
   47 32 13 02 1 46 31 15 02 1 69 32 57 23 3 62 29 43 13 2
   74 28 61 19 3 59 30 42 15 2 51 34 15 02 1 50 35 13 03 1
   56 28 49 20 3 60 22 40 10 2 73 29 63 18 3 67 25 58 18 3
   49 31 15 01 1 67 31 47 15 2 63 23 44 13 2 54 37 15 02 1
   56 30 41 13 2 63 25 49 15 2 61 28 47 12 2 64 29 43 13 2
   51 25 30 11 2 57 28 41 13 2 65 30 58 22 3 69 31 54 21 3
   54 39 13 04 1 51 35 14 03 1 72 36 61 25 3 65 32 51 20 3
   61 29 47 14 2 56 29 36 13 2 69 31 49 15 2 64 27 53 19 3
   68 30 55 21 3 55 25 40 13 2 48 34 16 02 1 48 30 14 01 1
   45 23 13 03 1 57 25 50 20 3 57 38 17 03 1 51 38 15 03 1
   55 23 40 13 2 66 30 44 14 2 68 28 48 14 2 54 34 17 02 1
   51 37 15 04 1 52 35 15 02 1 58 28 51 24 3 67 30 50 17 2
   63 33 60 25 3 53 37 15 02 1
   ;

The following macro, SHOW, is used in the subsequent analyses to display cluster results. It invokes the FREQ procedure to crosstabulate clusters and species. The CANDISC procedure computes canonical variables for discriminating among the clusters, and the first two canonical variables are plotted to show cluster membership. See Chapter 21, "The CANDISC Procedure," for a canonical discriminant analysis of the iris species.

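   %macro show;
   /* Crosstabulate clusters and species, using the most recently created data set */
   proc freq;
      tables cluster*species;
   run;
   /* Compute canonical variables that discriminate among the clusters */
   proc candisc noprint out=can;
      class cluster;
      var petal: sepal:;
   run;
   legend1 frame cframe=ligr cborder=black
           position=center value=(justify=center);
   axis1 label=(angle=90 rotate=0) minor=none;
   axis2 minor=none;
   /* Plot the first two canonical variables, marking cluster membership */
   proc gplot;
      plot can2*can1=cluster /
         frame cframe=ligr legend=legend1 vaxis=axis1 haxis=axis2;
   run;
   %mend;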

The first analysis clusters the iris data by Ward's method and plots the CCC and the pseudo F and t² statistics. The CCC has a local peak at 3 clusters but a higher peak at 5 clusters. The pseudo F statistic indicates 3 clusters, while the pseudo t² statistic suggests 3 or 6 clusters. For large numbers of clusters, Version 6 of the SAS System produces somewhat different results from those of previous versions of PROC CLUSTER because of changes in the treatment of ties. Results are identical for 5 or fewer clusters.

The TREE procedure creates an output data set containing the 3-cluster partition for use by the SHOW macro. The FREQ procedure reveals 16 misclassifications. The results are shown in Output 23.3.1.

   title2 'By Ward''s Method';
   proc cluster data=iris method=ward print=15 ccc pseudo;
      var petal: sepal:;
      copy species;
   run;
   legend1 frame cframe=ligr cborder=black 
           position=center value=(justify=center);
   axis1 label=(angle=90 rotate=0) minor=none order=(0 to 600 by 100);
   axis2 minor=none order=(1 to 30 by 1);
   axis3 label=(angle=90 rotate=0) minor=none order=(0 to 7 by 1);

   proc gplot;
      plot _ccc_*_ncl_  /
         frame cframe=ligr legend=legend1 vaxis=axis3 haxis=axis2;
      plot _psf_*_ncl_  _pst2_*_ncl_  /overlay 
         frame cframe=ligr legend=legend1 vaxis=axis1 haxis=axis2;
   run;

   proc tree noprint ncl=3 out=out;
      copy petal: sepal: species;
   run;

   %show;

Output 23.3.1: Cluster Analysis of Fisher Iris Data: CLUSTER with METHOD=WARD

Cluster Analysis of Fisher (1936) Iris Data

By Ward's Method

The CLUSTER Procedure

Ward's Minimum Variance Cluster Analysis

Eigenvalues of the Covariance Matrix
	Eigenvalue	Difference	Proportion	Cumulative
1	422.824171	398.557096	0.9246	0.9246
2	24.267075	16.446125	0.0531	0.9777
3	7.820950	5.437441	0.0171	0.9948
4	2.383509		0.0052	1.0000

Root-Mean-Square Total-Sample Standard Deviation = 10.69224

Root-Mean-Square Distance Between Observations = 30.24221

Cluster History
NCL	Clusters Joined		FREQ	SPRSQ	RSQ	ERSQ	CCC	PSF	PST2	Tie
15	CL24	CL28	15	0.0016	.971	.958	5.93	324	9.8
14	CL21	CL53	7	0.0019	.969	.955	5.85	329	5.1
13	CL18	CL48	15	0.0023	.967	.953	5.69	334	8.9
12	CL16	CL23	24	0.0023	.965	.950	4.63	342	9.6
11	CL14	CL43	12	0.0025	.962	.946	4.67	353	5.8
10	CL26	CL20	22	0.0027	.959	.942	4.81	368	12.9
9	CL27	CL17	31	0.0031	.956	.936	5.02	387	17.8
8	CL35	CL15	23	0.0031	.953	.930	5.44	414	13.8
7	CL10	CL47	26	0.0058	.947	.921	5.43	430	19.1
6	CL8	CL13	38	0.0060	.941	.911	5.81	463	16.3
5	CL9	CL19	50	0.0105	.931	.895	5.82	488	43.2
4	CL12	CL11	36	0.0172	.914	.872	3.99	515	41.0
3	CL6	CL7	64	0.0301	.884	.827	4.33	558	57.2
2	CL4	CL3	100	0.1110	.773	.697	3.83	503	116
1	CL5	CL2	150	0.7726	.000	.000	0.00	.	503

Cluster Analysis of Fisher (1936) Iris Data

The FREQ Procedure

Cell contents: Frequency, Percent, Row Pct, Col Pct

Table of CLUSTER by Species
CLUSTER	Setosa	Versicolor	Virginica	Total
1	0 0.00 0.00 0.00	49 32.67 76.56 98.00	15 10.00 23.44 30.00	64 42.67
2	0 0.00 0.00 0.00	1 0.67 2.78 2.00	35 23.33 97.22 70.00	36 24.00
3	50 33.33 100.00 100.00	0 0.00 0.00 0.00	0 0.00 0.00 0.00	50 33.33
Total	50 33.33	50 33.33	50 33.33	150 100.00
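
The count of 16 misclassifications can be read off the table by treating the most frequent species in each cluster as that cluster's label: (64 − 49) + (36 − 35) + (50 − 50) = 16. The following sketch automates that tally. It assumes that the data set OUT created by PROC TREE is still available and that "misclassified" means "not in the majority species of its cluster"; the data set name COUNTS and the column names N, NMAX, and Misclassified are illustrative and do not come from the original example.

```sas
/* Sketch: tally misclassifications in the 3-cluster partition.       */
/* Assumption: OUT holds CLUSTER and SPECIES, as produced above.      */
proc freq data=out noprint;
   /* COUNTS gets one row per nonempty cell; COUNT is the frequency   */
   tables cluster*species / out=counts;
run;

proc sql;
   /* Within each cluster, every observation outside the majority     */
   /* species is counted as misclassified (an assumed definition).    */
   select sum(n - nmax) as Misclassified
      from (select cluster,
                   sum(count) as n,
                   max(count) as nmax
               from counts
               group by cluster);
quit;
```

The same two steps can be pointed at any later partition in this example by changing the DATA= data set.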

The second analysis uses two-stage density linkage. The number of modes found depends on the value of K=; for the raw data, the results suggest 2 or 6 modes instead of 3:

k		modes
3		12
4-6		6
7		4
8		3
9-50		2
51+		1

However, the ACECLUS procedure can be used to reveal 3 modes. This analysis uses K=8 to produce 3 clusters for comparison with other analyses. There are only 6 misclassifications. The results are shown in Output 23.3.2.

   title2 'By Two-Stage Density Linkage';
   proc cluster data=iris method=twostage k=8 print=15 ccc pseudo;
      var petal: sepal:;
      copy species;
   run;

   proc tree noprint ncl=3 out=out;
      copy petal: sepal: species;
   run;

   %show;

Output 23.3.2: Cluster Analysis of Fisher Iris Data: CLUSTER with METHOD=TWOSTAGE

Cluster Analysis of Fisher (1936) Iris Data

By Two-Stage Density Linkage

The CLUSTER Procedure

Two-Stage Density Linkage Clustering

Eigenvalues of the Covariance Matrix
	Eigenvalue	Difference	Proportion	Cumulative
1	422.824171	398.557096	0.9246	0.9246
2	24.267075	16.446125	0.0531	0.9777
3	7.820950	5.437441	0.0171	0.9948
4	2.383509		0.0052	1.0000

K = 8

Root-Mean-Square Total-Sample Standard Deviation = 10.69224

Cluster History
NCL	Clusters Joined		FREQ	SPRSQ	RSQ	ERSQ	CCC	PSF	PST2	Normalized Fusion Density	Maximum Density in Each Cluster (Lesser)	Maximum Density in Each Cluster (Greater)	Tie
15	CL17	OB127	44	0.0025	.916	.958	-11	105	3.4	0.3903	0.2066	3.5156
14	CL16	OB137	50	0.0023	.913	.955	-11	110	5.6	0.3637	0.1837	100.0
13	CL15	OB74	45	0.0029	.910	.953	-10	116	3.7	0.3553	0.2130	3.5156
12	CL28	OB49	46	0.0036	.907	.950	-8.0	122	5.2	0.3223	0.1736	8.3678	T
11	CL12	OB85	47	0.0036	.903	.946	-7.6	130	4.8	0.3223	0.1736	8.3678
10	CL11	OB98	48	0.0033	.900	.942	-7.1	140	4.1	0.2879	0.1479	8.3678
9	CL13	OB24	46	0.0037	.896	.936	-6.5	152	4.4	0.2802	0.2005	3.5156
8	CL10	OB25	49	0.0019	.894	.930	-5.5	171	2.2	0.2699	0.1372	8.3678
7	CL8	OB121	50	0.0035	.891	.921	-4.5	194	4.0	0.2586	0.1372	8.3678
6	CL9	OB45	47	0.0042	.886	.911	-3.3	225	4.6	0.1412	0.0832	3.5156
5	CL6	OB39	48	0.0049	.882	.895	-1.7	270	5.0	0.107	0.0605	3.5156
4	CL5	OB21	49	0.0049	.877	.872	0.35	346	4.7	0.0969	0.0541	3.5156
3	CL4	OB90	50	0.0047	.872	.827	3.28	500	4.1	0.0715	0.0370	3.5156
2	CL3	CL7	100	0.0993	.773	.697	3.83	503	91.9	2.6277	3.5156	8.3678

3 modal clusters have been formed.

Cluster Analysis of Fisher (1936) Iris Data

The FREQ Procedure

Cell contents: Frequency, Percent, Row Pct, Col Pct

Table of CLUSTER by Species
CLUSTER	Setosa	Versicolor	Virginica	Total
1	50 33.33 100.00 100.00	0 0.00 0.00 0.00	0 0.00 0.00 0.00	50 33.33
2	0 0.00 0.00 0.00	47 31.33 94.00 94.00	3 2.00 6.00 6.00	50 33.33
3	0 0.00 0.00 0.00	3 2.00 6.00 6.00	47 31.33 94.00 94.00	50 33.33
Total	50 33.33	50 33.33	50 33.33	150 100.00
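
The remark that the ACECLUS procedure can be used to reveal 3 modes could be tried along the following lines. This is only a sketch: the text does not give the settings, so P=.02 and K=8 are illustrative guesses, and whether a particular K= value then yields 3 modal clusters has to be checked in the PROC CLUSTER output. The data set name ACE is also an assumption.

```sas
/* Sketch only: P=.02 and K=8 are illustrative values, not from the text */
proc aceclus data=iris out=ace p=.02 noprint;
   var petal: sepal:;
run;

/* Cluster on the canonical variables Can1-Can4 that PROC ACECLUS       */
/* adds to the OUT= data set; carry the original variables along.       */
proc cluster data=ace method=twostage k=8 print=15 ccc pseudo;
   var can1-can4;
   copy petal: sepal: species;
run;

proc tree noprint ncl=3 out=out;
   copy petal: sepal: species;
run;

%show;
```

Because the original measurements are copied through to OUT, the SHOW macro can be reused without modification.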

The CLUSTER procedure is not practical for very large data sets because, with most methods, the CPU time varies as the square or cube of the number of observations. The FASTCLUS procedure requires time proportional to the number of observations and can, therefore, be used with much larger data sets than PROC CLUSTER. If you want to hierarchically cluster a very large data set, you can use PROC FASTCLUS for a preliminary cluster analysis producing a large number of clusters and then use PROC CLUSTER to hierarchically cluster the preliminary clusters.

PROC FASTCLUS automatically creates the variables _FREQ_ and _RMSSTD_ in the MEAN= output data set. PROC CLUSTER then uses these variables automatically when computing various statistics.

The iris data are used to illustrate the process of clustering clusters. In the preliminary analysis, PROC FASTCLUS produces ten clusters, which are then crosstabulated with species. The data set containing the preliminary clusters is sorted in preparation for later merges. The results are shown in Output 23.3.3.

   title2 'Preliminary Analysis by FASTCLUS';
   proc fastclus data=iris summary maxc=10 maxiter=99 converge=0
                 mean=mean out=prelim cluster=preclus;
      var petal: sepal:;
   run;

   proc freq;
      tables preclus*species;
   run;

   proc sort data=prelim;
      by preclus;
   run;

Output 23.3.3: Preliminary Analysis of Fisher Iris Data

Cluster Analysis of Fisher (1936) Iris Data

Preliminary Analysis by FASTCLUS

The FASTCLUS Procedure

Replace=FULL Radius=0 Maxclusters=10 Maxiter=99 Converge=0

Cluster Summary
Cluster	Frequency	RMS Std Deviation	Maximum Distance from Seed to Observation	Radius Exceeded	Nearest Cluster	Distance Between Cluster Centroids
1	9	2.7067	8.2027		5	8.7362
2	19	2.2001	7.7340		4	6.2243
3	18	2.1496	6.2173		8	7.5049
4	4	2.5249	5.3268		2	6.2243
5	3	2.7234	5.8214		1	8.7362
6	7	2.2939	5.1508		2	9.3318
7	17	2.0274	6.9576		10	7.9503
8	18	2.2628	7.1135		3	7.5049
9	22	2.2666	7.5029		8	9.0090
10	33	2.0594	10.0033		7	7.9503

Pseudo F Statistic =	370.58

Observed Over-All R-Squared =	0.95971

Approximate Expected Over-All R-Squared =	0.82928

Cubic Clustering Criterion =	27.077

WARNING: The two values above are invalid for correlated variables.

Cluster Analysis of Fisher (1936) Iris Data

Preliminary Analysis by FASTCLUS

The FREQ Procedure

Cell contents: Frequency, Percent, Row Pct, Col Pct

Table of PRECLUS by Species
PRECLUS(Cluster)	Setosa	Versicolor	Virginica	Total
1	0 0.00 0.00 0.00	0 0.00 0.00 0.00	9 6.00 100.00 18.00	9 6.00
2	0 0.00 0.00 0.00	19 12.67 100.00 38.00	0 0.00 0.00 0.00	19 12.67
3	0 0.00 0.00 0.00	18 12.00 100.00 36.00	0 0.00 0.00 0.00	18 12.00
4	0 0.00 0.00 0.00	3 2.00 75.00 6.00	1 0.67 25.00 2.00	4 2.67
5	0 0.00 0.00 0.00	0 0.00 0.00 0.00	3 2.00 100.00 6.00	3 2.00
6	0 0.00 0.00 0.00	7 4.67 100.00 14.00	0 0.00 0.00 0.00	7 4.67
7	17 11.33 100.00 34.00	0 0.00 0.00 0.00	0 0.00 0.00 0.00	17 11.33
8	0 0.00 0.00 0.00	3 2.00 16.67 6.00	15 10.00 83.33 30.00	18 12.00
9	0 0.00 0.00 0.00	0 0.00 0.00 0.00	22 14.67 100.00 44.00	22 14.67
10	33 22.00 100.00 66.00	0 0.00 0.00 0.00	0 0.00 0.00 0.00	33 22.00
Total	50 33.33	50 33.33	50 33.33	150 100.00
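
To see what PROC CLUSTER receives from the preliminary analysis, the MEAN= data set can be listed. The sketch below assumes that the data set MEAN from the PROC FASTCLUS step above still exists. The _FREQ_ and _RMSSTD_ columns should correspond to the Frequency and RMS Std Deviation columns of the Cluster Summary table.

```sas
/* Sketch: list the cluster-level records that PROC CLUSTER reads from MEAN. */
/* PRECLUS is the cluster identifier named by the CLUSTER= option above.     */
proc print data=mean;
   var preclus _freq_ _rmsstd_ petal: sepal:;
run;
```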

The following macro, CLUS, clusters the preliminary clusters. Its single argument supplies the METHOD= specification to be used by PROC CLUSTER. The TREE procedure creates an output data set containing the 3-cluster partition, which is sorted and merged with the OUT= data set from PROC FASTCLUS to determine the cluster to which each of the original 150 observations belongs. The SHOW macro is then used to display the results. In this example, the CLUS macro is invoked with Ward's method, which produces 16 misclassifications, and with Wong's hybrid method, which produces 22 misclassifications. The results are shown in Output 23.3.4 and Output 23.3.5.

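   %macro clus(method);
   /* Hierarchically cluster the preliminary cluster means from PROC FASTCLUS */
   proc cluster data=mean method=&method ccc pseudo;
      var petal: sepal:;
      copy preclus;
   run;
   /* Cut the tree at 3 clusters, keeping PRECLUS as the merge key */
   proc tree noprint ncl=3 out=out;
      copy petal: sepal: preclus;
   run;
   proc sort data=out;
      by preclus;
   run;
   /* Attach the final cluster to each of the original observations */
   data clus;
      merge prelim out;
      by preclus;
   run;
   %show;
   %mend;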

   title2 'Clustering Clusters by Ward''s Method';
   %clus(ward);

   title2 'Clustering Clusters by Wong''s Hybrid Method';
   %clus(twostage hybrid);

Output 23.3.4: Clustering Clusters: PROC CLUSTER with Ward's Method

Cluster Analysis of Fisher (1936) Iris Data

Clustering Clusters by Ward's Method

The CLUSTER Procedure

Ward's Minimum Variance Cluster Analysis

Eigenvalues of the Covariance Matrix
	Eigenvalue	Difference	Proportion	Cumulative
1	416.976349	398.666421	0.9501	0.9501
2	18.309928	14.952922	0.0417	0.9918
3	3.357006	3.126943	0.0076	0.9995
4	0.230063		0.0005	1.0000

Root-Mean-Square Total-Sample Standard Deviation = 10.69224

Root-Mean-Square Distance Between Observations = 30.24221

Cluster History
NCL	Clusters Joined		FREQ	SPRSQ	RSQ	ERSQ	CCC	PSF	PST2	Tie
9	OB2	OB4	23	0.0019	.958	.932	6.26	400	6.3
8	OB1	OB5	12	0.0025	.955	.926	6.75	434	5.8
7	CL9	OB6	30	0.0069	.948	.918	6.28	438	19.5
6	OB3	OB8	36	0.0074	.941	.907	6.21	459	26.0
5	OB7	OB10	50	0.0104	.931	.892	6.15	485	42.2
4	CL8	OB9	34	0.0162	.914	.870	4.28	519	39.3
3	CL7	CL6	66	0.0318	.883	.824	4.39	552	59.7
2	CL4	CL3	100	0.1099	.773	.695	3.94	503	113
1	CL2	CL5	150	0.7726	.000	.000	0.00	.	503

Cluster Analysis of Fisher (1936) Iris Data

The FREQ Procedure

Cell contents: Frequency, Percent, Row Pct, Col Pct

Table of CLUSTER by Species
CLUSTER	Setosa	Versicolor	Virginica	Total
1	0 0.00 0.00 0.00	50 33.33 75.76 100.00	16 10.67 24.24 32.00	66 44.00
2	0 0.00 0.00 0.00	0 0.00 0.00 0.00	34 22.67 100.00 68.00	34 22.67
3	50 33.33 100.00 100.00	0 0.00 0.00 0.00	0 0.00 0.00 0.00	50 33.33
Total	50 33.33	50 33.33	50 33.33	150 100.00

Output 23.3.5: Clustering Clusters: PROC CLUSTER with Wong's Hybrid Method

Cluster Analysis of Fisher (1936) Iris Data

Clustering Clusters by Wong's Hybrid Method

The CLUSTER Procedure

Two-Stage Density Linkage Clustering

Eigenvalues of the Covariance Matrix
	Eigenvalue	Difference	Proportion	Cumulative
1	416.976349	398.666421	0.9501	0.9501
2	18.309928	14.952922	0.0417	0.9918
3	3.357006	3.126943	0.0076	0.9995
4	0.230063		0.0005	1.0000

Root-Mean-Square Total-Sample Standard Deviation = 10.69224

Cluster History
NCL	Clusters Joined		FREQ	SPRSQ	RSQ	ERSQ	CCC	PSF	PST2	Normalized Fusion Density	Maximum Density in Each Cluster (Lesser)	Maximum Density in Each Cluster (Greater)	Tie
9	OB10	OB7	50	0.0104	.949	.932	3.81	330	42.2	40.24	58.2179	100.0
8	OB3	OB8	36	0.0074	.942	.926	3.22	329	26.0	27.981	39.4511	48.4350
7	OB2	OB4	23	0.0019	.940	.918	4.24	373	6.3	23.775	8.9675	46.3026
6	CL8	OB9	58	0.0194	.921	.907	2.13	334	46.3	20.724	46.8846	48.4350
5	CL7	OB6	30	0.0069	.914	.892	3.09	383	19.5	13.303	17.6360	46.3026
4	CL6	OB1	67	0.0292	.884	.870	1.21	372	41.0	8.4137	10.8758	48.4350
3	CL4	OB5	70	0.0138	.871	.824	3.33	494	12.3	5.1855	6.2890	48.4350
2	CL3	CL5	100	0.0979	.773	.695	3.94	503	89.5	19.513	46.3026	48.4350
1	CL2	CL9	150	0.7726	.000	.000	0.00	.	503	1.3337	48.4350	100.0

3 modal clusters have been formed.

Cluster Analysis of Fisher (1936) Iris Data

The FREQ Procedure

Cell contents: Frequency, Percent, Row Pct, Col Pct

Table of CLUSTER by Species
CLUSTER	Setosa	Versicolor	Virginica	Total
1	50 33.33 100.00 100.00	0 0.00 0.00 0.00	0 0.00 0.00 0.00	50 33.33
2	0 0.00 0.00 0.00	21 14.00 30.00 42.00	49 32.67 70.00 98.00	70 46.67
3	0 0.00 0.00 0.00	29 19.33 96.67 58.00	1 0.67 3.33 2.00	30 20.00
Total	50 33.33	50 33.33	50 33.33	150 100.00
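
The merge step in the CLUS macro assigns every observation the final cluster of its preliminary cluster. One way to inspect that assignment is to crosstabulate the two cluster variables. The sketch below assumes that the data set CLUS left by the most recent %CLUS call still exists. Since the final cluster is determined entirely by PRECLUS, each row of the resulting table should have exactly one nonzero cell.

```sas
/* Sketch: show how the 10 preliminary clusters map onto the 3 final clusters. */
/* Assumption: CLUS is the merged data set from the last %clus invocation.     */
proc freq data=clus;
   tables preclus*cluster / norow nocol nopercent;
run;
```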
