The VARCLUS Procedure

Getting Started

This example demonstrates how you can use the VARCLUS procedure to create hierarchical, unidimensional clusters of variables.

The following data, from Hand et al. (1994), represent the amounts of protein consumed from nine food groups in each of 25 European countries. The nine food groups are red meat (RedMeat), white meat (WhiteMeat), eggs (Eggs), milk (Milk), fish (Fish), cereal (Cereal), starch (Starch), nuts (Nuts), and fruits and vegetables (FruitVeg).

Suppose you want to simplify interpretation of the data by reducing the number of variables to a smaller set of variable cluster components. You can use the VARCLUS procedure for this type of variable reduction.

The following DATA step creates the SAS data set Protein:

   data Protein;                                                        
      input Country $18. RedMeat WhiteMeat Eggs Milk
         Fish Cereal Starch Nuts FruitVeg;
      datalines;
   Albania        10.1  1.4  0.5   8.9  0.2  42.3  0.6  5.5  1.7
   Austria         8.9 14.0  4.3  19.9  2.1  28.0  3.6  1.3  4.3   
   Belgium        13.5  9.3  4.1  17.5  4.5  26.6  5.7  2.1  4.0
   Bulgaria        7.8  6.0  1.6   8.3  1.2  56.7  1.1  3.7  4.2
   Czechoslovakia  9.7 11.4  2.8  12.5  2.0  34.3  5.0  1.1  4.0
   Denmark        10.6 10.8  3.7  25.0  9.9  21.9  4.8  0.7  2.4
   E Germany       8.4 11.6  3.7  11.1  5.4  24.6  6.5  0.8  3.6
   Finland         9.5  4.9  2.7  33.7  5.8  26.3  5.1  1.0  1.4
   France         18.0  9.9  3.3  19.5  5.7  28.1  4.8  2.4  6.5
   Greece         10.2  3.0  2.8  17.6  5.9  41.7  2.2  7.8  6.5
   Hungary         5.3 12.4  2.9   9.7  0.3  40.1  4.0  5.4  4.2
   Ireland        13.9 10.0  4.7  25.8  2.2  24.0  6.2  1.6  2.9
   Italy           9.0  5.1  2.9  13.7  3.4  36.8  2.1  4.3  6.7
   Netherlands     9.5 13.6  3.6  23.4  2.5  22.4  4.2  1.8  3.7
   Norway          9.4  4.7  2.7  23.3  9.7  23.0  4.6  1.6  2.7
   Poland          6.9 10.2  2.7  19.3  3.0  36.1  5.9  2.0  6.6
   Portugal        6.2  3.7  1.1   4.9 14.2  27.0  5.9  4.7  7.9
   Romania         6.2  6.3  1.5  11.1  1.0  49.6  3.1  5.3  2.8
   Spain           7.1  3.4  3.1   8.6  7.0  29.2  5.7  5.9  7.2
   Sweden          9.9  7.8  3.5  24.7  7.5  19.5  3.7  1.4  2.0
   Switzerland    13.1 10.1  3.1  23.8  2.3  25.6  2.8  2.4  4.9
   UK             17.4  5.7  4.7  20.6  4.3  24.3  4.7  3.4  3.3
   USSR            9.3  4.6  2.1  16.6  3.0  43.6  6.4  3.4  2.9
   W Germany      11.4 12.5  4.1  18.8  3.4  18.6  5.2  1.5  3.8
   Yugoslavia      4.4  5.0  1.2   9.5  0.6  55.9  3.0  5.7  3.2
   ;

The data set Protein contains the character variable Country and the nine numeric variables representing the food groups. The $18. in the INPUT statement specifies that the variable Country is a character variable with a length of 18.
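
Before clustering, it can help to confirm that the DATA step read all 25 countries and all nine numeric variables. The following check is a minimal sketch that is not part of the original example; it assumes only that the Protein data set was created as shown above.

   /* Count nonmissing and missing values and show the range of each food group */
   proc means data=Protein n nmiss min max;
      var RedMeat--FruitVeg;
   run;

If the data were read correctly, each variable should report N=25 and no missing values.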

The following statements create the variable clusters.

   proc varclus data=Protein outtree=tree centroid maxclusters=4;
      var RedMeat--FruitVeg;
   run;

The DATA= option specifies the SAS data set Protein as input. The OUTTREE= option creates the output SAS data set Tree to contain the tree structure information. When you specify this option, you are implicitly requiring the clusters to be hierarchical rather than disjoint.

The CENTROID option specifies the centroid method of clustering. This means that the calculated cluster components are the unweighted averages of the standardized variables. The MAXCLUSTERS=4 option specifies that no more than four clusters be computed.

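The VAR statement lists the numeric variables to be used in the analysis. The double-dash name range RedMeat--FruitVeg selects all variables from RedMeat through FruitVeg in the order in which they are defined in the data set.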

The results of this analysis are displayed in the following figures.

Although PROC VARCLUS displays output for each step in the clustering process, the following figures display only the final analysis for four clusters. Figure 68.1 displays the final cluster summary.

Oblique Centroid Component Cluster Analysis

Cluster summary for 4 clusters
Cluster	Members	Cluster Variation	Variation Explained	Proportion Explained
1	4	4	2.173024	0.5433
2	2	2	1.650997	0.8255
3	2	2	1.403853	0.7019
4	1	1	1	1.0000

Total variation explained = 6.227874 Proportion = 0.6920

Figure 68.1: Final Cluster Summary from the VARCLUS Procedure

For each cluster, Figure 68.1 displays the number of variables in the cluster, the cluster variation, the variation explained, and the proportion of the cluster variation that is explained (for example, 2.173024 / 4 = 0.5433 for the first cluster). The variance explained by the variables in a cluster is similar to the variance explained by a factor in common factor analysis, but it includes contributions only from the variables in the cluster rather than from all variables.

The line labeled "Total variation explained" in Figure 68.1 gives the sum of the explained variation over all clusters. The final "Proportion" represents the total explained variation divided by the sum of cluster variation. This value, 0.6920, indicates that about 69% of the total variation in the data can be accounted for by the four clusters.
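
The arithmetic behind the summary line can be checked directly. The following DATA step is an illustrative sketch, not output of PROC VARCLUS; the values are copied from Figure 68.1, and the variable names total, clusvar, and prop are arbitrary.

   data _null_;
      /* Variation explained by each of the four clusters (Figure 68.1) */
      total = 2.173024 + 1.650997 + 1.403853 + 1;
      /* Sum of the Cluster Variation column: one unit per standardized variable */
      clusvar = 4 + 2 + 2 + 1;
      /* Proportion of the total variation explained by the four clusters */
      prop = total / clusvar;
      put total= 8.6 prop= 6.4;
   run;

The values written to the log should agree with the total of 6.227874 and the proportion of 0.6920 shown in Figure 68.1.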

Figure 68.2 shows how the variables are clustered. The first cluster represents animal protein (RedMeat, WhiteMeat, Eggs, and Milk), the second cluster contains the variables Cereal and Nuts, the third cluster is composed of the variables Fish and Starch, and the last cluster contains the single variable representing fruits and vegetables (FruitVeg).

Oblique Centroid Component Cluster Analysis

Cluster	Variable	R-squared with Own Cluster	R-squared with Next Closest	1-R**2 Ratio
Cluster 1	RedMeat	0.4375	0.1518	0.6631
	WhiteMeat	0.6302	0.3331	0.5545
	Eggs	0.7024	0.4902	0.5837
	Milk	0.4288	0.2721	0.7847
Cluster 2	Cereal	0.8255	0.3983	0.2900
	Nuts	0.8255	0.5901	0.4257
Cluster 3	Fish	0.7019	0.1365	0.3452
	Starch	0.7019	0.3075	0.4304
Cluster 4	FruitVeg	1.0000	0.0578	0.0000

Figure 68.2: R-square Values from the VARCLUS Procedure

Figure 68.2 also displays the R² value of each variable with its own cluster and the R² value with its nearest cluster. The R² value for a variable with the nearest cluster should be low if the clusters are well separated. The last column displays the ratio (1 - R²_own) / (1 - R²_nearest) for each variable. Small values of this ratio indicate good clustering.
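
As a worked illustration of the ratio, the following sketch recomputes it for three variables from the rounded R² values in Figure 68.2. The data set name Ratio and the variable names Own and Nearest are hypothetical and were chosen for this example only.

   data Ratio;
      input Variable $ Own Nearest;
      /* Unexplained share in the own cluster over unexplained share in the nearest cluster */
      Ratio = (1 - Own) / (1 - Nearest);
      datalines;
   RedMeat   0.4375 0.1518
   Cereal    0.8255 0.3983
   FruitVeg  1.0000 0.0578
   ;

   proc print data=Ratio noobs;
      format Ratio 6.4;
   run;

Because the inputs are rounded to four decimal places, the computed ratios can differ from the last column of Figure 68.2 in the final digit.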

Figure 68.3 displays the cluster structure and the intercluster correlations. The structure table displays the correlation of each variable with each cluster component. This gives an indication of how and to what extent the cluster represents the variable. The table of intercorrelations contains the correlations between the cluster components.
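
To make the idea of a centroid component concrete, the following sketch builds the component for the first cluster by hand and correlates it with every variable. It is an approximation under stated assumptions, not the internal computation of PROC VARCLUS: the data set names ProteinStd and Centroid and the variable name Clus1 are hypothetical, and the cluster membership is taken from Figure 68.2. Because correlations do not depend on how the component is scaled, the result should be comparable to the first column of the cluster structure table.

   /* Standardize each variable to mean 0 and standard deviation 1 */
   proc standard data=Protein mean=0 std=1 out=ProteinStd;
      var RedMeat--FruitVeg;
   run;

   data Centroid;
      set ProteinStd;
      /* Unweighted average of the four standardized variables in cluster 1 */
      Clus1 = mean(RedMeat, WhiteMeat, Eggs, Milk);
   run;

   /* Correlation of the hand-built component with each original variable */
   proc corr data=Centroid nosimple noprob;
      var RedMeat--FruitVeg;
      with Clus1;
   run;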

Oblique Centroid Component Cluster Analysis

Cluster Structure
Cluster	1	2	3	4
RedMeat	0.66145	-0.38959	0.06450	-0.34109
WhiteMeat	0.79385	-0.57715	0.04760	-0.06132
Eggs	0.83811	-0.70012	0.30902	-0.04552
Milk	0.65483	-0.52163	0.16805	-0.26096
Fish	-0.08108	-0.36947	0.83781	0.26614
Cereal	-0.58070	0.90857	-0.63111	0.04655
Starch	0.41593	-0.55448	0.83781	0.08441
Nuts	-0.76817	0.90857	-0.37089	0.37497
FruitVeg	-0.24045	0.23197	0.20920	1.00000

Inter-Cluster Correlations
Cluster	1	2	3	4
1	1.00000	-0.74230	0.19984	-0.24045
2	-0.74230	1.00000	-0.55141	0.23197
3	0.19984	-0.55141	1.00000	0.20920
4	-0.24045	0.23197	0.20920	1.00000

Figure 68.3: Cluster Correlations and Intercorrelations

PROC VARCLUS next displays the summary table of statistics for the cluster history (Figure 68.4). The first three columns give the number of clusters, the total variation explained by clusters, and the proportion of variation explained by clusters.

As displayed in Figure 68.4, when the number of allowable clusters is two, the total variation explained is 3.9607, and the cumulative proportion of variation explained by two clusters is 0.4401. When the number of clusters increases to three, the proportion of explained variation increases to 0.5880. When four clusters are computed, the proportion of explained variation is 0.6920.

Oblique Centroid Component Cluster Analysis

Number of Clusters	Total Variation Explained by Clusters	Proportion of Variation Explained by Clusters	Minimum Proportion Explained by a Cluster	Minimum R-squared for a Variable	Maximum 1-R**2 Ratio for a Variable
1	0.732343	0.0814	0.0814	0.0875
2	3.960717	0.4401	0.3743	0.1007	1.0213
3	5.291887	0.5880	0.5433	0.3928	0.7978
4	6.227874	0.6920	0.5433	0.4288	0.7847

Figure 68.4: Final Cluster Summary Table from the VARCLUS Procedure

Figure 68.4 also displays the minimum proportion of variance explained by a cluster, the minimum R² for a variable, and the maximum (1-R²) ratio for a variable. The last quantity is the ratio of the value 1-R² for a variable's own cluster to the value 1-R² for its nearest cluster.

The following statements produce a tree diagram of the cluster structure created by PROC VARCLUS. First, the AXIS1 and AXIS2 statements are defined. In the AXIS2 statement, the ORDER= option specifies the data values in the order in which they should appear on the axis.

   axis1 label=(angle=90 rotate=0) minor=none;
   axis2 minor=none order=(0 to 1 by .2);   
   proc tree data=tree horizontal vaxis=axis1 haxis=axis2;
      height _propor_;
   run;

Next, the TREE procedure is invoked. The procedure uses the SAS data set Tree, created by the OUTTREE= option in the preceding PROC VARCLUS statement. The HORIZONTAL option orients the tree diagram horizontally. The VAXIS= and HAXIS= options specify the AXIS1 and AXIS2 statements, respectively, to customize the axes of the tree diagram. The HEIGHT statement specifies the use of the variable _PROPOR_ (the proportion of variance explained) as the height variable.
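
To see what PROC TREE reads, you can list the OUTTREE= data set before plotting it. The following statements are a minimal sketch that assumes the Tree data set from the preceding PROC VARCLUS step exists; no particular output is implied beyond the presence of the _PROPOR_ variable named in the HEIGHT statement.

   /* List the variables stored in the OUTTREE= data set */
   proc contents data=tree;
   run;

   /* Print one row per node of the cluster tree */
   proc print data=tree;
   run;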

Figure 68.5 shows how the clusters are created. The ordered variable names are displayed on the vertical axis. The horizontal axis displays the proportion of variance explained at each clustering level.

Figure 68.5: Horizontal Tree Diagram from PROC TREE

As you look from left to right in the diagram, objects and clusters are progressively joined until a single, all-encompassing cluster is formed at the right (or root) of the diagram. Clusters exist at each level of the diagram, and every vertical line connects leaves and branches into progressively larger clusters.

For example, when the variables are formed into three clusters, one cluster contains the variables RedMeat, WhiteMeat, Eggs, and Milk; the second cluster contains the variables Fish and Starch; the third cluster contains the variables Cereal, Nuts, and FruitVeg. The proportion of variance explained at that level is 0.5880 (from Figure 68.4). At the next stage of clustering, the third cluster is split as the variable FruitVeg forms the fourth cluster; the proportion of variance explained is 0.6920.
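
If the goal is to replace the nine variables with four cluster components, one possible approach is to save the scoring coefficients and apply them with PROC SCORE. The following statements are a hedged sketch rather than part of the original example. The data set names Stats, Coef4, and Scores are hypothetical, the HIERARCHY option is added on the assumption that it keeps the clusters hierarchical as in the analysis above, and the subsetting step assumes that the OUTSTAT= data set identifies each solution by the variable _NCL_, with a missing value for statistics that apply to all solutions.

   /* Rerun the analysis quietly and save the statistics for every solution */
   proc varclus data=Protein centroid hierarchy maxclusters=4
                outstat=Stats noprint;
      var RedMeat--FruitVeg;
   run;

   /* Keep the common statistics and the rows for the four-cluster solution */
   data Coef4;
      set Stats;
      if _ncl_ = . or _ncl_ = 4;
      drop _ncl_;
   run;

   /* Compute one cluster component score per country and cluster */
   proc score data=Protein score=Coef4 out=Scores;
      var RedMeat--FruitVeg;
   run;

The Scores data set should then contain the original variables together with one score variable for each of the four clusters.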
