The VARCLUS Procedure
This example demonstrates how you can use the VARCLUS procedure to create hierarchical, unidimensional clusters of variables.
The following data, from Hand et al. (1994), represent the amounts of protein consumed from nine food groups in each of 25 European countries. The nine food groups are red meat (RedMeat), white meat (WhiteMeat), eggs (Eggs), milk (Milk), fish (Fish), cereal (Cereal), starch (Starch), nuts (Nuts), and fruits and vegetables (FruitVeg).
Suppose you want to simplify interpretation of the data by reducing the number of variables to a smaller set of variable cluster components. You can use the VARCLUS procedure for this type of variable reduction.
The following DATA step creates the SAS data set Protein:
data Protein;
   input Country $18. RedMeat WhiteMeat Eggs Milk
         Fish Cereal Starch Nuts FruitVeg;
   datalines;
Albania             10.1   1.4   0.5   8.9   0.2  42.3   0.6   5.5   1.7
Austria              8.9  14.0   4.3  19.9   2.1  28.0   3.6   1.3   4.3
Belgium             13.5   9.3   4.1  17.5   4.5  26.6   5.7   2.1   4.0
Bulgaria             7.8   6.0   1.6   8.3   1.2  56.7   1.1   3.7   4.2
Czechoslovakia       9.7  11.4   2.8  12.5   2.0  34.3   5.0   1.1   4.0
Denmark             10.6  10.8   3.7  25.0   9.9  21.9   4.8   0.7   2.4
E Germany            8.4  11.6   3.7  11.1   5.4  24.6   6.5   0.8   3.6
Finland              9.5   4.9   2.7  33.7   5.8  26.3   5.1   1.0   1.4
France              18.0   9.9   3.3  19.5   5.7  28.1   4.8   2.4   6.5
Greece              10.2   3.0   2.8  17.6   5.9  41.7   2.2   7.8   6.5
Hungary              5.3  12.4   2.9   9.7   0.3  40.1   4.0   5.4   4.2
Ireland             13.9  10.0   4.7  25.8   2.2  24.0   6.2   1.6   2.9
Italy                9.0   5.1   2.9  13.7   3.4  36.8   2.1   4.3   6.7
Netherlands          9.5  13.6   3.6  23.4   2.5  22.4   4.2   1.8   3.7
Norway               9.4   4.7   2.7  23.3   9.7  23.0   4.6   1.6   2.7
Poland               6.9  10.2   2.7  19.3   3.0  36.1   5.9   2.0   6.6
Portugal             6.2   3.7   1.1   4.9  14.2  27.0   5.9   4.7   7.9
Romania              6.2   6.3   1.5  11.1   1.0  49.6   3.1   5.3   2.8
Spain                7.1   3.4   3.1   8.6   7.0  29.2   5.7   5.9   7.2
Sweden               9.9   7.8   3.5   4.7   7.5  19.5   3.7   1.4   2.0
Switzerland         13.1  10.1   3.1  23.8   2.3  25.6   2.8   2.4   4.9
UK                  17.4   5.7   4.7  20.6   4.3  24.3   4.7   3.4   3.3
USSR                 9.3   4.6   2.1  16.6   3.0  43.6   6.4   3.4   2.9
W Germany           11.4  12.5   4.1  18.8   3.4  18.6   5.2   1.5   3.8
Yugoslavia           4.4   5.0   1.2   9.5   0.6  55.9   3.0   5.7   3.2
;
The data set Protein contains the character variable Country and the nine numeric variables representing the food groups. The $18. in the INPUT statement specifies that the variable Country is a character variable with a length of 18.
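Before clustering, it can help to confirm that the DATA step read all 25 observations and that the values of Country (some of which contain an embedded blank, such as E Germany) were read intact. The following is a minimal check, assuming the data set Protein was created by the preceding DATA step:

```sas
/* List the variable names, types, and lengths in Protein */
proc contents data=Protein;
run;

/* Print the first five observations as a quick visual check */
proc print data=Protein(obs=5);
run;
```

The PROC CONTENTS output should report one character variable (Country) and nine numeric variables.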
The following statements create the variable clusters.
proc varclus data=Protein outtree=tree centroid maxclusters=4;
   var RedMeat--FruitVeg;
run;
The DATA= option specifies the SAS data set Protein as input. The OUTTREE= option creates the output SAS data set Tree to contain the tree structure information. When you specify this option, you are implicitly requiring the clusters to be hierarchical rather than disjoint.
The CENTROID option specifies the centroid method of clustering. This means that the calculated cluster components are the unweighted averages of the standardized variables. The MAXCLUSTERS=4 option specifies that no more than four clusters be computed.
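The VAR statement lists the numeric variables (RedMeat--FruitVeg) to be used in the analysis. The double-dash list selects every variable from RedMeat through FruitVeg in the order in which the variables are stored in the data set.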
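As a sketch of what the statements above amount to, the name range list can be written out in full, and the OUTTREE= data set can be printed to see what it contains. The explicit variable list below assumes the variable order defined in the INPUT statement of the Protein DATA step.

```sas
/* Same analysis with the nine variables listed explicitly */
proc varclus data=Protein outtree=tree centroid maxclusters=4;
   var RedMeat WhiteMeat Eggs Milk Fish Cereal Starch Nuts FruitVeg;
run;

/* Inspect the tree structure information written by OUTTREE= */
proc print data=tree;
run;
```

Writing the list out removes the dependence on variable order, which can matter if the data set is later restructured.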
The results of this analysis are displayed in the following figures.
Although PROC VARCLUS displays output for each step in the clustering process, the following figures display only the final analysis for four clusters. Figure 68.1 displays the final cluster summary.
For each cluster, Figure 68.1 displays the number of variables in the cluster, the cluster variation, the total explained variation, and the proportion of the total variance explained by the variables in the cluster. The variance explained by the variables in a cluster is similar to the variance explained by a factor in common factor analysis, but it includes contributions only from the variables in the cluster rather than from all variables.
The line labeled "Total variation explained" in Figure 68.1 gives the sum of the explained variation over all clusters. The final "Proportion" represents the total explained variation divided by the sum of cluster variation. This value, 0.6920, indicates that about 69% of the total variation in the data can be accounted for by the four clusters.
Figure 68.2 shows how the variables are clustered. The first cluster represents animal protein (RedMeat, WhiteMeat, Eggs, and Milk), the second cluster contains the variables Cereal and Nuts, the third cluster is composed of the variables Fish and Starch, and the last cluster contains the single variable representing fruits and vegetables (FruitVeg).
Figure 68.2 also displays the $R^2$ value of each variable with its own cluster and the $R^2$ value with its nearest cluster. The $R^2$ value for a variable with the nearest cluster should be low if the clusters are well separated. The last column displays the ratio $(1 - R^2_{\text{own}}) / (1 - R^2_{\text{nearest}})$ for each variable. Small values of this ratio indicate good clustering.
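Written as a display equation, the ratio in the last column is as follows. The numbers in the second line are hypothetical values chosen only to illustrate the arithmetic; they are not taken from Figure 68.2.

```latex
\[
  \text{ratio} \;=\; \frac{1 - R^2_{\text{own}}}{1 - R^2_{\text{nearest}}}
\]
% Hypothetical illustration: R^2_own = 0.80 and R^2_nearest = 0.20
\[
  \frac{1 - 0.80}{1 - 0.20} \;=\; \frac{0.20}{0.80} \;=\; 0.25
\]
```

A variable that is strongly related to its own cluster and weakly related to the nearest one therefore has a ratio close to zero.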
Figure 68.3 displays the cluster structure and the intercluster correlations. The structure table displays the correlation of each variable with each cluster component. This gives an indication of how and to what extent the cluster represents the variable. The table of intercorrelations contains the correlations between the cluster components.
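Because the CENTROID option defines each cluster component as the unweighted average of the standardized variables in the cluster, the components of the four clusters described above can be approximated by hand. The following is only a sketch under that assumption, not the procedure's internal computation; the data set names ProteinStd and Components and the variable names Clus1 through Clus4 are hypothetical.

```sas
/* Standardize each food-group variable to mean 0 and standard deviation 1 */
proc standard data=Protein mean=0 std=1 out=ProteinStd;
   var RedMeat--FruitVeg;
run;

/* Average the standardized variables within each of the four clusters */
data Components;
   set ProteinStd;
   Clus1 = mean(RedMeat, WhiteMeat, Eggs, Milk);
   Clus2 = mean(Cereal, Nuts);
   Clus3 = mean(Fish, Starch);
   Clus4 = FruitVeg;
run;

/* Correlations between the hand-computed cluster components */
proc corr data=Components nosimple;
   var Clus1-Clus4;
run;

/* Correlation of each variable with each hand-computed component */
proc corr data=Components nosimple;
   var Clus1-Clus4;
   with RedMeat--FruitVeg;
run;
```

If the assumption holds, the first PROC CORR step should resemble the table of intercluster correlations, and the second should resemble the cluster structure table.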
PROC VARCLUS next displays the summary table of statistics for the cluster history (Figure 68.4). The first three columns give the number of clusters, the total variation explained by clusters, and the proportion of variation explained by clusters.
As displayed in Figure 68.4, when the number of allowable clusters is two, the total variation explained is 3.9607, and the cumulative proportion of variation explained by two clusters is 0.4401. When the number of clusters increases to three, the proportion of explained variation increases to 0.5880. When four clusters are computed, the proportion of explained variation is 0.6920.
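The relationship between the two columns can be checked with a short calculation. This sketch assumes that the analysis uses standardized variables, so that each of the nine variables contributes a variance of 1 and the total variation is 9; the two-cluster figures quoted above are consistent with that assumption. The last line is derived from the reported proportion and is approximate.

```latex
\[
  \text{proportion explained}
  \;=\; \frac{\text{total variation explained}}{\text{total variation}}
\]
\[
  \text{two clusters:}\quad \frac{3.9607}{9} \approx 0.4401
\]
\[
  \text{four clusters:}\quad 0.6920 \times 9 \approx 6.228
\]
```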
Figure 68.4 also displays the minimum proportion of variance explained by a cluster, the minimum $R^2$ for a variable, and the maximum $(1 - R^2)$ ratio for a variable. The last quantity is the ratio of the value $1 - R^2$ for a variable's own cluster to the value $1 - R^2$ for its nearest cluster.
The following statements produce a tree diagram of the cluster structure created by PROC VARCLUS. First, the AXIS1 and AXIS2 statements are defined. The ORDER= option in the AXIS2 statement specifies the data values in the order in which they should appear on the axis.
axis1 label=(angle=90 rotate=0) minor=none;
axis2 minor=none order=(0 to 1 by .2);

proc tree data=tree horizontal vaxis=axis1 haxis=axis2;
   height _propor_;
run;
Next, the TREE procedure is invoked. The procedure uses the SAS data set Tree, created by the OUTTREE= option in the preceding PROC VARCLUS statement. The HORIZONTAL option orients the tree diagram horizontally. The VAXIS= and HAXIS= options specify the AXIS1 and AXIS2 statements, respectively, to customize the axes of the tree diagram. The HEIGHT statement specifies the use of the variable _PROPOR_ (the proportion of variance explained) as the height variable.
Figure 68.5 shows how the clusters are created. The ordered variable names are displayed on the vertical axis. The horizontal axis displays the proportion of variance explained at each clustering level.
As you look from left to right in the diagram, objects and clusters are progressively joined until a single, all-encompassing cluster is formed at the right (or root) of the diagram. Clusters exist at each level of the diagram, and every vertical line connects leaves and branches into progressively larger clusters.
For example, when the variables are formed into three clusters, one cluster contains the variables RedMeat, WhiteMeat, Eggs, and Milk; the second cluster contains the variables Fish and Starch; the third cluster contains the variables Cereal, Nuts, and FruitVeg. The proportion of variance explained at that level is 0.5880 (from Figure 68.4). At the next stage of clustering, the third cluster is split as the variable FruitVeg forms the fourth cluster; the proportion of variance explained is 0.6920.
Copyright © 1999 by SAS Institute Inc., Cary, NC, USA. All rights reserved.