The CLUSTER Procedure
The following example shows the analysis of a data set in which size information is detrimental to the classification. Imagine that an archaeologist of the future is excavating a 20th-century grocery store. The archaeologist has discovered a large number of boxes of various sizes, shapes, and colors and wants to do a preliminary classification based on simple external measurements: height, width, depth, weight, and the predominant color of the box. It is known that a given product may have been sold in packages of different sizes, so the archaeologist wants to remove the effect of size from the classification. It is not known whether color is relevant to the use of the products, so the analysis should be done both with and without color information.
Unknown to the archaeologist, the boxes actually fall into six general categories according to the use of the product: breakfast cereals, crackers, laundry detergents, Little Debbie snacks, tea, and toothpaste. These categories are shown in the analysis so that you can evaluate the effectiveness of the classification.
Since there is no reason for the archaeologist to assume that the true categories have equal sample sizes or variances, the centroid method is used to avoid undue bias. Each analysis is done with Euclidean distances after suitable transformations of the data. Color is coded as five dummy variables with values of 0 or 1. The DATA step is as follows:
options ls=120;
title 'Cluster Analysis of Grocery Boxes';
data grocery2;
length name $35 /* name of product */
class $16 /* category of product */
unit $1 /* unit of measurement for weights:
g=gram
o=ounce
l=lb
all weights are converted to grams */
color $8 /* predominant color of box */
height 8 /* height of box in cm. */
width 8 /* width of box in cm. */
depth 8 /* depth of box (front to back) in cm. */
weight 8 /* weight of box in grams */
c_white c_yellow c_red c_green c_blue 4;
/* dummy variables */
retain class;
drop unit;
/*--- read name with possible embedded blanks ---*/
input name & @;
/*--- if name starts with "---", ---*/
/*--- it's really a category value ---*/
if substr(name,1,3) = '---' then do;
class = substr(name,4,index(substr(name,4),'-')-1);
delete;
return;
end;
/*--- read the rest of the variables ---*/
input height width depth weight unit color;
/*--- convert weights to grams ---*/
select (unit);
when ('l') weight = weight * 454;
when ('o') weight = weight * 28.3;
when ('g') ;
otherwise put 'Invalid unit ' unit;
end;
/*--- use 0/1 coding for dummy variables for colors ---*/
c_white = (color = 'w');
c_yellow = (color = 'y');
c_red = (color = 'r');
c_green = (color = 'g');
c_blue = (color = 'b');
datalines;
---Breakfast cereals---
Cheerios  32.5 22.4 8.4 567 g y
Cheerios  30.3 20.4 7.2 425 g y
Cheerios  27.5 19 6.2 283 g y
Cheerios  24.1 17.2 5.3 198 g y
Special K  30.1 20.5 8.5 18 o w
Special K  29.6 19.2 6.7 12 o w
Special K  23.4 16.6 5.7 7 o w
Corn Flakes  33.7 25.4 8 24 o w
Corn Flakes  30.2 20.6 8.4 18 o w
Corn Flakes  30 19.1 6.6 12 o w
Grape Nuts  21.7 16.3 4.9 680 g w
Shredded Wheat  19.7 19.9 7.5 283 g y
Shredded Wheat, Spoon Size  26.6 19.6 5.6 510 g r
All-Bran  21.1 14.3 5.2 13.8 o y
Froot Loops  30.2 20.8 8.5 19.7 o r
Froot Loops  25 17.7 6.4 11 o r
---Crackers---
Wheatsworth  11.1 25.2 5.5 326 g w
Ritz  23.1 16 5.3 340 g r
Ritz  23.1 20.7 5.2 454 g r
Premium Saltines  11 25 10.7 454 g w
Waverly Wafers  14.4 22.5 6.2 454 g g
---Detergent---
Arm & Hammer Detergent  38.8 30 16.9 25 l y
Arm & Hammer Detergent  39.5 25.8 11 14.2 l y
Arm & Hammer Detergent  33.7 22.8 7 7 l y
Arm & Hammer Detergent  27.8 19.4 6.3 4 l y
Tide  39.4 24.8 11.3 9.2 l r
Tide  32.5 23.2 7.3 4.5 l r
Tide  26.5 19.9 6.3 42 o r
Tide  19.3 14.6 4.7 17 o r
---Little Debbie---
Figaroos  13.5 18.6 3.7 12 o y
Swiss Cake Rolls  10.1 21.8 5.8 13 o w
Fudge Brownies  11 30.8 2.5 12 o w
Marshmallow Supremes  9.4 32 7 10 o w
Apple Delights  11.2 30.1 4.9 15 o w
Snack Cakes  13.4 32 3.4 13 o b
Nutty Bar  13.2 18.5 4.2 12 o y
Lemon Stix  13.2 18.5 4.2 9 o w
Fudge Rounds  8.1 28.3 5.4 9.5 o w
---Tea---
Celestial Seasonings Mint Magic  7.8 13.8 6.3 49 g b
Celestial Seasonings Cranberry Cove  7.8 13.8 6.3 46 g r
Celestial Seasonings Sleepy Time  7.8 13.8 6.3 37 g g
Celestial Seasonings Lemon Zinger  7.8 13.8 6.3 56 g y
Bigelow Lemon Lift  7.7 13.4 6.9 40 g y
Bigelow Plantation Mint  7.7 13.4 6.9 35 g g
Bigelow Earl Grey  7.7 13.4 6.9 35 g b
Luzianne  8.9 22.8 6.4 6 o r
Luzianne  18.4 20.2 6.9 8 o r
Luzianne Decaffeinated  8.9 22.8 6.4 5.25 o g
Lipton Tea Bags  17.1 20 6.7 8 o r
Lipton Tea Bags  11.5 14.4 6.6 3.75 o r
Lipton Tea Bags  6.7 10 5.7 1.25 o r
Lipton Family Size Tea Bags  13.7 24 9 12 o r
Lipton Family Size Tea Bags  8.7 20.8 8.2 6 o r
Lipton Family Size Tea Bags  8.9 11.1 8.2 3 o r
Lipton Loose Tea  12.7 10.9 5.4 8 o r
---Paste, Tooth---
Colgate  4.4 22 3.5 7 o r
Colgate  3.6 15.6 3.3 3 o r
Colgate  4.2 18.3 3.5 5 o r
Crest  4.3 21.7 3.7 6.4 o w
Crest  4.3 17.4 3.6 4.6 o w
Crest  3.5 15.2 3.2 2.7 o w
Crest  3.0 10.9 2.8 .85 o w
Arm & Hammer  4.4 17 3.7 5 o w
;
data grocery;
length name $16;
set grocery2;
run;
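As a quick check that is not part of the original example, steps such as the following could confirm that every weight ended up in grams and that the 0/1 color coding agrees with the color codes. The sketch assumes only the GROCERY data set created above; the choice of PROC MEANS and PROC FREQ here is illustrative.

```sas
/* Hypothetical sanity check, not part of the original example */
proc means data=grocery n min max maxdec=1;
   class class;                    /* one block of statistics per true category */
   var height width depth weight;  /* weight should now be in grams throughout  */
run;

proc freq data=grocery;
   /* c_white should be 1 only in the row for color='w' */
   tables color*c_white / norow nocol nopercent;
run;
```

An unexpectedly small maximum weight within a category would suggest that a unit code was not converted.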
The FORMAT procedure is used to define two formats that make the output easier to read. The STARS. format is used for graphical crosstabulations in the TABULATE procedure. The $COLOR. format displays the names of the colors instead of just the first letter.
/*------ formats and macros for displaying ------*/
/*------ cluster results ------*/
proc format; value stars
0='               '
1='              #'
2='             ##'
3='            ###'
4='           ####'
5='          #####'
6='         ######'
7='        #######'
8='       ########'
9='      #########'
10='     ##########'
11='    ###########'
12='   ############'
13='  #############'
14=' ##############'
15-high='>##############';
run;
proc format; value $color
'w'='White'
'y'='Yellow'
'r'='Red'
'g'='Green'
'b'='Blue';
run;
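To see what the two formats do before they are used in the crosstabulations, a short illustration along these lines may help. It is a sketch rather than part of the original example, and it assumes the GROCERY data set and the STARS. and $COLOR. formats defined above.

```sas
/* Illustrative only: list a few boxes with full color names */
proc print data=grocery(obs=5);
   var name class color;
   format color $color.;
run;

/* Illustrative only: write a few counts to the log next to their STARS. bars */
data _null_;
   do n = 0, 3, 14, 20;
      put n 2. ' ' n stars.;   /* 20 falls in the 15-high range */
   end;
run;
```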
Since a full display of the results of each cluster analysis would be very long, a macro is used with five macro variables to select parts of the output. The macro variables are set to select only the PROC CLUSTER output and the crosstabulation of clusters and true categories for the first two analyses. The example could be run with different settings of the macro variables to show the full output or other selected parts.
%let cluster=1; /* 1=show CLUSTER output, 0=don't */
%let tree=0; /* 1=print TREE diagram, 0=don't */
%let list=0; /* 1=list clusters, 0=don't */
%let crosstab=1; /* 1=crosstabulate clusters and classes,
0=don't */
%let crosscol=0; /* 1=crosstabulate clusters and colors,
0=don't */
/*--- define macro with options for TREE ---*/
%macro treeopt;
%if &tree %then h page=1;
%else noprint;
%mend;
/*--- define macro with options for CLUSTER ---*/
%macro clusopt;
%if &cluster %then pseudo ccc p=20;
%else noprint;
%mend;
/*------ macro for showing cluster results ------*/
%macro show(n); /* n=number of clusters
to show results for */
proc tree data=tree %treeopt n=&n out=out;
id name;
copy class height width depth weight color;
run;
%if &list %then %do;
proc sort;
by cluster;
run;
proc print;
var class name height width depth weight color;
by cluster clusname;
run;
%end;
%if &crosstab %then %do;
proc tabulate noseps /* formchar=' ' */;
class class cluster;
table cluster, class*n=' '*f=stars./rts=10 misstext=' ';
run;
%end;
%if &crosscol %then %do;
proc tabulate noseps /* formchar=' ' */;
class color cluster;
table cluster, color*n=' '*f=stars./rts=10 misstext=' ';
format color $color.;
run;
%end;
%mend;
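The macro variables can be changed between analyses to display more of the results. The following sketch is not part of the original example; it assumes that a preceding PROC CLUSTER step has already written an OUTTREE=TREE data set, and it uses the MPRINT system option only to echo the statements that the macros generate.

```sas
/* Illustrative only: rerun the display step with fuller output.    */
/* Assumes a preceding PROC CLUSTER step has written OUTTREE=TREE.  */
options mprint;      /* echo the statements that %SHOW generates */
%let tree=1;         /* print the tree diagram                   */
%let list=1;         /* list the members of each cluster         */
%show(10);
options nomprint;
%let tree=0;         /* restore the settings used in the text    */
%let list=0;
```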
The first analysis uses the variables height, width, depth, and weight in standardized form to show the effect of including size information. The CCC, pseudo F, and pseudo t² statistics indicate 10 clusters. Most of the clusters do not correspond closely to the true categories, and four of the clusters have only one or two observations.
/**********************************************************/
/* */
/* Analysis 1: standardized box measurements */
/* */
/**********************************************************/
title2 'Analysis 1: Standardized data';
proc cluster data=grocery m=cen std %clusopt outtree=tree;
var height width depth weight;
id name;
copy class color;
run;
%show(10);
Output 23.6.1: Analysis of Standardized Data
/**********************************************************/
/* */
/* Analysis 2: standardized row-centered logarithms */
/* */
/**********************************************************/
title2 'Row-centered logarithms';
data shape;
set grocery;
array x height width depth weight;
array l l_height l_width l_depth l_weight;
/* logarithms */
weight=weight**(1/3); /* take cube root to conform with
the other linear measurements */
do over l; /* take logarithms */
l=log(x);
end;
mean=mean( of l(*)); /* find row mean of logarithms */
do over l;
l=l-mean; /* center row */
end;
run;
title2 'Analysis 2: Standardized row-centered logarithms';
proc standard data=shape out=shapstan m=0 s=1;
var l_height l_width l_depth l_weight;
run;
proc cluster data=shapstan m=cen %clusopt outtree=tree;
var l_height l_width l_depth l_weight;
id name;
copy class height width depth weight color;
run;
%show(8);
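Why row-centering the logarithms removes size can be seen from a short calculation. This is a sketch under an assumption that the example does not state: two packages of the same product differ only by a uniform scale factor k and have the same density, so that weight scales as k³. The symbols h, w, d, m, and k are introduced here for illustration.

```latex
\begin{align*}
x &= \bigl(\log h,\ \log w,\ \log d,\ \log m^{1/3}\bigr), \qquad
\bar{x} = \tfrac14 \sum_{j=1}^{4} x_j \\
(h, w, d, m) &\mapsto (kh,\ kw,\ kd,\ k^{3}m)
\;\Longrightarrow\;
x'_j = x_j + \log k, \quad \bar{x}' = \bar{x} + \log k \\
x'_j - \bar{x}' &= x_j - \bar{x} \qquad (j = 1,\dots,4)
\end{align*}
```

Under these assumptions the centered values are identical for the two packages. Real packages satisfy the assumptions only approximately, so different sizes of one product would be expected to lie close together rather than coincide.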
The results of the second analysis are shown for eight clusters. Clusters 1 through 4 correspond fairly well to tea, toothpaste, breakfast cereals, and detergents. Crackers and Little Debbie products are scattered among several clusters.
Output 23.6.2: Analysis of Standardized Row-Centered Logarithms
/**********************************************************/
/* */
/* Analysis 3: standardized row-standardized logarithms */
/* */
/**********************************************************/
%let list=1;
%let crosscol=1;
title2 'Row-standardized logarithms';
data std;
set grocery;
array x height width depth weight;
array l l_height l_width l_depth l_weight;
/* logarithms */
weight=weight**(1/3); /* take cube root to conform with
the other linear measurements */
do over l;
l=log(x); /* take logarithms */
end;
mean=mean( of l(*)); /* find row mean of logarithms */
std=std( of l(*)); /* find row standard deviation */
do over l;
l=(l-mean)/std; /* standardize row */
end;
run;
title2 'Analysis 3: Standardized row-standardized logarithms';
proc standard data=std out=stdstan m=0 s=1;
var l_height l_width l_depth l_weight;
run;
proc cluster data=stdstan m=cen %clusopt outtree=tree;
var l_height l_width l_depth l_weight;
id name;
copy class height width depth weight color;
run;
%show(7);
The output from the third analysis shows that cluster 1 contains 9 of the 17 teas. Cluster 2 contains all of the detergents plus Grape Nuts, a very heavy cereal. Cluster 3 includes all of the toothpastes and one Little Debbie product that is of very similar shape, although roughly twice as large. Cluster 4 has most of the cereals, Ritz crackers (which come in a box very similar to most of the cereal boxes), and Lipton Loose Tea (all the other teas in the sample come in tea bags). Clusters 5 and 6 each contain several Luzianne and Lipton teas and one or two miscellaneous items. Cluster 7 includes most of the Little Debbie products and two types of crackers. Thus, the crackers are not identified and the teas are broken up into three clusters, but the other categories correspond to single clusters. This analysis classifies toothpaste and Little Debbie products slightly better than the second analysis.
Output 23.6.3: Analysis of Standardized Row-Standardized Logarithms
CLUSTER=4 CLUSNAME=CL13
| Obs | class | name | height | width | depth | weight | color |
| 28 | Breakfast cereal | Cheerios | 27.5 | 19.0 | 6.2 | 6.56541 | y |
| 29 | Breakfast cereal | Froot Loops | 25.0 | 17.7 | 6.4 | 6.77735 | r |
| 30 | Breakfast cereal | Special K | 30.1 | 20.5 | 8.5 | 7.98644 | w |
| 31 | Breakfast cereal | Corn Flakes | 30.2 | 20.6 | 8.4 | 7.98644 | w |
| 32 | Breakfast cereal | Special K | 29.6 | 19.2 | 6.7 | 6.97679 | w |
| 33 | Breakfast cereal | Corn Flakes | 30.0 | 19.1 | 6.6 | 6.97679 | w |
| 34 | Breakfast cereal | Froot Loops | 30.2 | 20.8 | 8.5 | 8.23034 | r |
| 35 | Breakfast cereal | Cheerios | 30.3 | 20.4 | 7.2 | 7.51847 | y |
| 36 | Breakfast cereal | Cheerios | 24.1 | 17.2 | 5.3 | 5.82848 | y |
| 37 | Breakfast cereal | Corn Flakes | 33.7 | 25.4 | 8.0 | 8.79021 | w |
| 38 | Breakfast cereal | Special K | 23.4 | 16.6 | 5.7 | 5.82946 | w |
| 39 | Breakfast cereal | Cheerios | 32.5 | 22.4 | 8.4 | 8.27677 | y |
| 40 | Breakfast cereal | Shredded Wheat, | 26.6 | 19.6 | 5.6 | 7.98957 | r |
| 41 | Crackers | Ritz | 23.1 | 16.0 | 5.3 | 6.97953 | r |
| 42 | Breakfast cereal | All-Bran | 21.1 | 14.3 | 5.2 | 7.30951 | y |
| 43 | Tea | Lipton Loose Tea | 12.7 | 10.9 | 5.4 | 6.09479 | r |
| 44 | Crackers | Ritz | 23.1 | 20.7 | 5.2 | 7.68573 | r |
CLUSTER=5 CLUSNAME=CL10
| Obs | class | name | height | width | depth | weight | color |
| 45 | Tea | Luzianne | 8.9 | 22.8 | 6.4 | 5.53748 | r |
| 46 | Tea | Luzianne Decaffe | 8.9 | 22.8 | 6.4 | 5.29641 | g |
| 47 | Crackers | Premium Saltines | 11.0 | 25.0 | 10.7 | 7.68573 | w |
| 48 | Tea | Lipton Family Si | 8.7 | 20.8 | 8.2 | 5.53748 | r |
| 49 | Little Debbie | Marshmallow Supr | 9.4 | 32.0 | 7.0 | 6.56541 | w |
| 50 | Tea | Lipton Family Si | 13.7 | 24.0 | 9.0 | 6.97679 | r |
CLUSTER=6 CLUSNAME=CL9
| Obs | class | name | height | width | depth | weight | color |
| 51 | Tea | Luzianne | 18.4 | 20.2 | 6.9 | 6.09479 | r |
| 52 | Tea | Lipton Tea Bags | 17.1 | 20.0 | 6.7 | 6.09479 | r |
| 53 | Breakfast cereal | Shredded Wheat | 19.7 | 19.9 | 7.5 | 6.56541 | y |
| 54 | Tea | Lipton Tea Bags | 11.5 | 14.4 | 6.6 | 4.73448 | r |
CLUSTER=7 CLUSNAME=CL8
| Obs | class | name | height | width | depth | weight | color |
| 55 | Crackers | Wheatsworth | 11.1 | 25.2 | 5.5 | 6.88239 | w |
| 56 | Little Debbie | Swiss Cake Rolls | 10.1 | 21.8 | 5.8 | 7.16545 | w |
| 57 | Little Debbie | Figaroos | 13.5 | 18.6 | 3.7 | 6.97679 | y |
| 58 | Little Debbie | Nutty Bar | 13.2 | 18.5 | 4.2 | 6.97679 | y |
| 59 | Little Debbie | Apple Delights | 11.2 | 30.1 | 4.9 | 7.51552 | w |
| 60 | Little Debbie | Lemon Stix | 13.2 | 18.5 | 4.2 | 6.33884 | w |
| 61 | Little Debbie | Fudge Brownies | 11.0 | 30.8 | 2.5 | 6.97679 | w |
| 62 | Little Debbie | Snack Cakes | 13.4 | 32.0 | 3.4 | 7.16545 | b |
| 63 | Crackers | Waverly Wafers | 14.4 | 22.5 | 6.2 | 7.68573 | g |
Since dummy variables drastically violate the normality assumption on which the CCC depends, the CCC tends to indicate an excessively large number of clusters.
/************************************************************/
/* */
/* Analyses 4-7: standardized row-standardized logs & color */
/* */
/************************************************************/
%let list=0;
%let crosscol=1;
title2
'Analysis 4: Standardized row-standardized
logarithms and color (s=.2)';
proc standard data=stdstan out=stdstan m=0 s=.2;
var c_:;
run;
proc cluster data=stdstan m=cen %clusopt outtree=tree;
var l_height l_width l_depth l_weight c_:;
id name;
copy class height width depth weight color;
run;
%show(7);
title2
'Analysis 5: Standardized row-standardized
logarithms and color (s=.3)';
proc standard data=stdstan out=stdstan m=0 s=.3;
var c_:;
run;
proc cluster data=stdstan m=cen %clusopt outtree=tree;
var l_height l_width l_depth l_weight c_:;
id name;
copy class height width depth weight color;
run;
%show(6);
title2
'Analysis 6: Standardized row-standardized
logarithms and color (s=.4)';
proc standard data=stdstan out=stdstan m=0 s=.4;
var c_:;
run;
proc cluster data=stdstan m=cen %clusopt outtree=tree;
var l_height l_width l_depth l_weight c_:;
id name;
copy class height width depth weight color;
run;
%show(3);
title2
'Analysis 7: Standardized row-standardized
logarithms and color (s=.8)';
proc standard data=stdstan out=stdstan m=0 s=.8;
var c_:;
run;
proc cluster data=stdstan m=cen %clusopt outtree=tree;
var l_height l_width l_depth l_weight c_:;
id name;
copy class height width depth weight color;
run;
%show(10);
Using PROC STANDARD on the dummy variables with S=0.2 causes four of the Little Debbie products to join the toothpastes. Using S=0.3 causes one of the tea clusters to merge with the breakfast cereals while three cereals defect to the detergents. Using S=0.4 produces three clusters consisting of (1) cereals and detergents, (2) Little Debbie products and toothpaste, and (3) teas, with crackers divided among all three clusters and a few other misclassifications. With S=0.8, ten clusters are indicated, each entirely monochrome. So, S=0.2 or S=0.3 degrades the classification, S=0.4 yields a good but perhaps excessively coarse classification, and higher values of the S= option produce clusters that are determined mainly by color.
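Why the S= value controls the influence of color can be sketched as follows. PROC STANDARD with M=0 and S=s rescales each dummy variable to mean 0 and standard deviation s. For two boxes of different colors a and b, only the dummies for a and b differ, each by one unit before scaling. The symbols s, sd_j, p_j (the proportion of boxes with color j), and n (the number of boxes) are notation introduced here, not part of the original example.

```latex
\begin{align*}
c_j^{*} &= s\,\frac{c_j - \bar{c}_j}{\mathrm{sd}_j}, \qquad
\mathrm{sd}_j^{2} = \frac{n}{n-1}\,p_j\,(1 - p_j) \\
\Delta^{2}_{\text{color}}
 &= \Bigl(\frac{s}{\mathrm{sd}_a}\Bigr)^{2} + \Bigl(\frac{s}{\mathrm{sd}_b}\Bigr)^{2}
  = s^{2}\Bigl(\frac{1}{\mathrm{sd}_a^{2}} + \frac{1}{\mathrm{sd}_b^{2}}\Bigr)
\end{align*}
```

The color contribution to the squared Euclidean distance therefore grows as s², while the contribution of the four shape variables stays fixed. This is consistent with color dominating the clusters at S=0.8. The formula also suggests that rarer colors, which have smaller standard deviations, separate boxes more strongly than common ones.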
Output 23.6.4: Analysis of Standardized Row-Standardized Logarithms and Color
Copyright © 1999 by SAS Institute Inc., Cary, NC, USA. All rights reserved.