The CLUSTER Procedure
If, at some level of the cluster history, there is a tie for minimum distance between clusters, then one or more levels of the sample cluster tree are not uniquely determined. This example shows how the degree of indeterminacy can be assessed.
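A tie of this kind is easy to reproduce on a toy data set. The following sketch is not part of the original example: the data set names TIE_DEMO and TIE_DEMO_REV and the three points are hypothetical, chosen so that the pairs A-B and B-C are equally close. It runs the same average linkage analysis on the observations in two different orders.

```sas
/* Hypothetical toy data: A-B and B-C are both 1 unit apart, */
/* so the first merge has two equally good candidates.       */
data tie_demo;
   input name $ x;
   datalines;
A 0
B 1
C 2
;

/* Same observations in reverse order */
proc sort data=tie_demo out=tie_demo_rev;
   by descending name;
run;

proc cluster data=tie_demo method=average nonorm outtree=_null_;
   var x;
   id name;
run;

proc cluster data=tie_demo_rev method=average nonorm outtree=_null_;
   var x;
   id name;
run;
```

If the two cluster histories join different pairs at the first step, that level of the tree depends on the order of the observations and is therefore not uniquely determined. PROC CLUSTER is expected to mark tied steps in the Tie column of the cluster history.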
Mammals have four kinds of teeth: incisors, canines, premolars, and molars. The following data set gives the number of teeth of each kind on one side of the top and bottom jaws for 32 mammals.
Since all eight variables are measured in the same units, it is not strictly necessary to rescale the data. However, the canines have much less variance than the other kinds of teeth and, therefore, have little effect on the analysis if the variables are not standardized. An average linkage cluster analysis is run with and without standardization to allow comparison of the results. The results are shown in Output 23.4.1 and Output 23.4.2.
title 'Hierarchical Cluster Analysis of Mammals'' Teeth Data';
title2 'Evaluating the Effects of Ties';
data teeth;
input mammal $ 1-16
@21 (v1-v8) (1.);
label v1='Top incisors'
v2='Bottom incisors'
v3='Top canines'
v4='Bottom canines'
v5='Top premolars'
v6='Bottom premolars'
v7='Top molars'
v8='Bottom molars';
datalines;
BROWN BAT           23113333
MOLE                32103333
SILVER HAIR BAT     23112333
PIGMY BAT           23112233
HOUSE BAT           23111233
RED BAT             13112233
PIKA                21002233
RABBIT              21003233
BEAVER              11002133
GROUNDHOG           11002133
GRAY SQUIRREL       11001133
HOUSE MOUSE         11000033
PORCUPINE           11001133
WOLF                33114423
BEAR                33114423
RACCOON             33114432
MARTEN              33114412
WEASEL              33113312
WOLVERINE           33114412
BADGER              33113312
RIVER OTTER         33114312
SEA OTTER           32113312
JAGUAR              33113211
COUGAR              33113211
FUR SEAL            32114411
SEA LION            32114411
GREY SEAL           32113322
ELEPHANT SEAL       21114411
REINDEER            04103333
ELK                 04103333
DEER                04003333
MOOSE               04003333
;
proc cluster data=teeth method=average nonorm outtree=_null_;
var v1-v8;
id mammal;
title3 'Raw Data';
run;
proc cluster data=teeth std method=average nonorm outtree=_null_;
var v1-v8;
id mammal;
title3 'Standardized Data';
run;
Output 23.4.1: Average Linkage Analysis of Mammals' Teeth Data: Raw Data
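The statement that the canine counts vary much less than the other kinds of teeth can be checked directly. The step below is a minimal sketch that is not part of the original example; it assumes only the TEETH data set created above.

```sas
/* Not in the original example: compare the variances of   */
/* the eight teeth variables in the TEETH data set.        */
proc means data=teeth n mean var maxdec=3;
   var v1-v8;
run;
```

Variables with a small variance (here V3 and V4, the canines) contribute little to the distances computed from the raw data. The STD option in the second PROC CLUSTER step standardizes each variable to mean 0 and standard deviation 1, which removes that imbalance.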
One way to assess the importance of the ties in the analysis is to repeat the analysis on several random permutations of the observations and then to see to what extent the results are consistent at the interesting levels of the cluster history. Three macros are presented to facilitate this process.
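Before the macros are introduced, it may help to see a single permutation done by hand. This is only a sketch of the idea: the data set names _ONE_ and _ONE_PERM_, the variable _KEY_, and the seed are hypothetical and do not appear in the original example.

```sas
/* Attach a uniform random key to each observation */
data _one_;
   set teeth;
   _key_ = ranuni(12345);   /* arbitrary seed, an assumption */
run;

/* Sorting by the random key puts the observations in random order */
proc sort data=_one_ out=_one_perm_;
   by _key_;
run;

/* Cluster the permuted data exactly as before */
proc cluster data=_one_perm_ method=average nonorm outtree=_null_;
   var v1-v8;
   id mammal;
run;
```

The CLUSPERM macro automates this step for any number of permutations and saves a tree data set for each one.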
/* --------------------------------------------------------- */
/* */
/* The macro CLUSPERM randomly permutes observations and */
/* does a cluster analysis for each permutation. */
/* The arguments are as follows: */
/* */
/* data data set name */
/* var list of variables to cluster */
/* id id variable for proc cluster */
/* method clustering method (and possibly other options) */
/* nperm number of random permutations. */
/* */
/* --------------------------------------------------------- */
%macro CLUSPERM(data,var,id,method,nperm);
/* ------CREATE TEMPORARY DATA SET WITH RANDOM NUMBERS------ */
data _temp_;
set &data;
array _random_ _ran_1-_ran_&nperm;
do over _random_;
_random_=ranuni(835297461);
end;
run;
/* ------PERMUTE AND CLUSTER THE DATA----------------------- */
%do n=1 %to &nperm;
proc sort data=_temp_(keep=_ran_&n &var &id) out=_perm_;
by _ran_&n;
run;
proc cluster method=&method noprint outtree=_tree_&n;
var &var;
id &id;
run;
%end;
%mend;
/* --------------------------------------------------------- */
/* */
/* The macro PLOTPERM plots various cluster statistics */
/* against the number of clusters for each permutation. */
/* The arguments are as follows: */
/* */
/* stat names of variables from tree data set */
/* nclus maximum number of clusters to be plotted */
/* nperm number of random permutations. */
/* */
/* --------------------------------------------------------- */
%macro PLOTPERM(stat,nclus,nperm);
/* ---CONCATENATE TREE DATA SETS FOR NCLUS OR FEWER CLUSTERS--- */
data _plot_;
set %do n=1 %to &nperm; _tree_&n(in=_in_&n) %end; ;
if _ncl_<=&nclus;
%do n=1 %to &nperm;
if _in_&n then _perm_=&n;
%end;
label _perm_='permutation number';
keep _ncl_ &stat _perm_;
run;
/* ---PLOT THE REQUESTED STATISTICS BY NUMBER OF CLUSTERS--- */
proc plot;
plot (&stat)*_ncl_=_perm_ /vpos=26;
title2 'Symbol is value of _PERM_';
run;
%mend;
/* --------------------------------------------------------- */
/* */
/* The macro TREEPERM generates cluster-membership variables */
/* for a specified number of clusters for each permutation. */
/* PROC PRINT lists the objects in each cluster-combination, */
/* and PROC TABULATE gives the frequencies and means. The */
/* arguments are as follows: */
/* */
/* var list of variables to cluster */
/* (no "-" or ":" allowed) */
/* id id variable for proc cluster */
/* meanfmt format for printing means in PROC TABULATE */
/* nclus number of clusters desired */
/* nperm number of random permutations. */
/* */
/* --------------------------------------------------------- */
%macro TREEPERM(var,id,meanfmt,nclus,nperm);
/* ------CREATE DATA SETS GIVING CLUSTER MEMBERSHIP--------- */
%do n=1 %to &nperm;
proc tree data=_tree_&n noprint n=&nclus
out=_out_&n(drop=clusname
rename=(cluster=_clus_&n));
copy &var;
id &id;
run;
proc sort;
by &id &var;
run;
%end;
/* ------MERGE THE CLUSTER VARIABLES------------------------ */
data _merge_;
merge
%do n=1 %to &nperm;
_out_&n
%end; ;
by &id &var;
length all_clus $ %eval(3*&nperm);
%do n=1 %to &nperm;
substr( all_clus, %eval(1+(&n-1)*3), 3) =
put( _clus_&n, 3.);
%end;
run;
/* ------PRINT AND TABULATE CLUSTER COMBINATIONS------------ */
proc sort;
by _clus_:;
run;
proc print;
var &var;
id &id;
by all_clus notsorted;
run;
proc tabulate order=data formchar=' ';
class all_clus;
var &var;
table all_clus, n='FREQ'*f=5. mean*f=&meanfmt*(&var) /
rts=%eval(&nperm*3+1);
run;
%mend;
To use these macros, it is convenient first to define a macro variable, VLIST, that lists the teeth variables, since the forms V1-V8 or V: cannot be used with the TABULATE procedure in the TREEPERM macro:
/* -TABULATE does not accept hyphens or colons in VAR lists- */
%let vlist=v1 v2 v3 v4 v5 v6 v7 v8;
The CLUSPERM macro is then called to analyze ten random permutations. The PLOTPERM macro plots the pseudo F and t² statistics and the cubic clustering criterion. Since the data are discrete, the pseudo F statistic and the cubic clustering criterion can be expected to increase as the number of clusters increases, so local maxima or large jumps in these statistics are more relevant than the global maximum in determining the number of clusters. For the raw data, only the pseudo t² statistic indicates the possible presence of clusters, with the 4-cluster level being suggested. Hence, the TREEPERM macro is used to analyze the results at the 4-cluster level:
title3 'Raw Data';
/* ------CLUSTER RAW DATA WITH AVERAGE LINKAGE-------------- */
%clusperm( teeth, &vlist, mammal, average, 10);
/* -----PLOT STATISTICS FOR THE LAST 20 LEVELS-------------- */
%plotperm( _psf_ _pst2_ _ccc_, 20, 10);
/* ------ANALYZE THE 4-CLUSTER LEVEL------------------------ */
%treeperm( &vlist, mammal, 9.1, 4, 10);
The results are shown in Output 23.4.3.
Output 23.4.3: Analysis of Ten Random Permutations of Raw Mammals' Teeth Data: Indeterminacy at the 4-Cluster Level
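PROC PLOT produces line-printer output. On a SAS release that includes ODS Graphics (an assumption, since the original example does not use it), the _PLOT_ data set left behind by the PLOTPERM macro could also be displayed with PROC SGPLOT. The following is a sketch for the pseudo t² statistic only; the axis labels are illustrative.

```sas
/* Assumes %plotperm has already run, so that _PLOT_ contains */
/* _NCL_, _PST2_, and _PERM_. Not part of the original macros. */
proc sgplot data=_plot_;
   scatter x=_ncl_ y=_pst2_ / group=_perm_;
   xaxis label='Number of clusters';
   yaxis label='Pseudo t-squared';
run;
```

Points from different permutations that coincide at a given number of clusters indicate levels where the ties do not affect the statistic; scattered points indicate indeterminacy.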
Next, the analysis is repeated with the standardized data. The pseudo F and t² statistics indicate 3 or 4 clusters, while the cubic clustering criterion shows a sharp rise up to 4 clusters and then levels off up to 6 clusters. So the TREEPERM macro is used again at the 4-cluster level. In this case, there is no indeterminacy, as the same four clusters are obtained with every permutation, although in different orders. It must be emphasized, however, that lack of indeterminacy in no way indicates validity. The results are shown in Output 23.4.4.
title3 'Standardized Data';
/*------CLUSTER STANDARDIZED DATA WITH AVERAGE LINKAGE------*/
%clusperm( teeth, &vlist, mammal, average std, 10);
/*------PLOT STATISTICS FOR THE LAST 20 LEVELS--------------*/
%plotperm( _psf_ _pst2_ _ccc_, 20, 10);
/*------ANALYZE THE 4-CLUSTER LEVEL-------------------------*/
%treeperm( &vlist, mammal, 9.1, 4, 10);
Output 23.4.4: Analysis of Ten Random Permutations of Standardized Mammals' Teeth Data: No Indeterminacy at the 4-Cluster Level
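Because cluster numbers are assigned in a different order for each permutation, agreement between two permutations is easier to see in a cross-tabulation than in the raw cluster numbers. The step below is a sketch that is not part of the original macros; it assumes %treeperm has just run, so that the _MERGE_ data set contains the membership variables _CLUS_1 and _CLUS_2.

```sas
/* Cross-tabulate cluster membership from permutations 1 and 2. */
/* Assumes _MERGE_ was created by the preceding %treeperm call. */
proc freq data=_merge_;
   tables _clus_1*_clus_2 / norow nocol nopercent;
run;
```

If the two permutations yield the same clusters under different labels, every row and every column of the table has exactly one nonzero cell. More than one nonzero cell in a row or column means that the partitions differ.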
Copyright © 1999 by SAS Institute Inc., Cary, NC, USA. All rights reserved.