PROC VARCLUS Statement

PROC VARCLUS < options >;

Table 68.1 summarizes the options available in the PROC VARCLUS statement.

Table 68.1: Options Available in the PROC VARCLUS Statement

Task	Options
Specify data sets	DATA= OUTSTAT= OUTTREE=
Determine the number of clusters	MAXCLUSTERS= MINCLUSTERS= MAXEIGEN= PROPORTION=
Specify cluster formation	CENTROID COVARIANCE HIERARCHY INITIAL= MAXITER= MAXSEARCH= MULTIPLEGROUP RANDOM=
Control output	CORR NOPRINT SHORT SIMPLE SUMMARY TRACE
Omit intercept	NOINT
Specify divisor for variances	VARDEF=

CENTROID

uses centroid components rather than principal components. You should specify centroid components if you want the cluster components to be unweighted averages of the standardized variables (the default) or the unstandardized variables (if you specify the COV option). It is possible to obtain locally optimal clusterings in which a variable is not assigned to the cluster component with which it has the highest squared correlation. You cannot specify the CENTROID option with the MAXEIGEN= option.
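As a minimal sketch of the CENTROID option, the following step requests centroid components on the covariance matrix. The data set `work.survey` and the variables `q1-q10` are hypothetical names used only for illustration. Because CENTROID cannot be combined with MAXEIGEN=, the number of clusters is limited here with MAXCLUSTERS= instead.

```sas
/* Hypothetical input: work.survey with numeric variables q1-q10 */
proc varclus data=work.survey centroid cov maxclusters=4;
   var q1-q10;   /* variables to be clustered */
run;
```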

CORR

displays the correlation matrix.

COVARIANCE

COV

analyzes the covariance matrix rather than the correlation matrix.

DATA=SAS-data-set

specifies the input data set to be analyzed. The data set can be an ordinary SAS data set or TYPE=CORR, UCORR, COV, UCOV, FACTOR, or SSCP. If you do not specify the DATA= option, the most recently created SAS data set is used. See Appendix A, "Special SAS Data Sets," for more information on types of SAS data sets.

HIERARCHY

requires the clusters at different levels to maintain a hierarchical structure.

INITIAL=GROUP

INITIAL=INPUT

INITIAL=RANDOM

INITIAL=SEED

specifies the method for initializing the clusters. If the INITIAL= option is omitted and the MINCLUSTERS= option is greater than 1, the initial cluster components are obtained by extracting the required number of principal components and performing an orthoblique rotation. The following list describes the values for the INITIAL= option:

GROUP: specifies that clusters be initialized by group. You can use this option if the input data set is a TYPE=CORR, UCORR, COV, UCOV, or FACTOR data set. The cluster membership of each variable is obtained from an observation with _TYPE_='GROUP', which contains an integer for each variable ranging from one to the number of clusters. You can use a data set created either by a previous run of PROC VARCLUS or in a DATA step.
INPUT: specifies that the input data set is a TYPE=CORR, UCORR, COV, UCOV, or FACTOR data set, in which case scoring coefficients are read from observations where _TYPE_='SCORE'. You can use scoring coefficients from the FACTOR procedure or a previous run of PROC VARCLUS, or you can enter other coefficients in a DATA step.
RANDOM: assigns variables randomly to clusters. If you specify INITIAL=RANDOM without the CENTROID option, it is recommended that you specify MAXSEARCH=5, although the CPU time required is substantially increased.
SEED: initializes clusters according to the variables named in the SEED statement. Each variable listed in the SEED statement becomes the sole member of a cluster, and the other variables remain unassigned. If you do not specify the SEED statement, the first MINCLUSTERS= variables in the VAR statement are used as seeds.
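The INITIAL=SEED method described above might be used as in the following sketch. The data set `work.survey`, the variables `q1-q10`, and the choice of `q1`, `q5`, and `q9` as seeds are assumptions made for illustration only.

```sas
/* Hypothetical input: work.survey with numeric variables q1-q10 */
proc varclus data=work.survey initial=seed minclusters=3;
   var q1-q10;
   seed q1 q5 q9;   /* each seed variable starts as the sole member of its own cluster */
run;
```

MINCLUSTERS=3 is set here to match the three seed variables.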

MAXCLUSTERS=n

MAXC=n

specifies the largest number of clusters desired. The default value is the number of variables.

MAXEIGEN=n

specifies the largest permissible value of the second eigenvalue in each cluster. If you do not specify either the PROPORTION= or the MAXCLUSTERS= option, the default value is the average of the diagonal elements of the matrix being analyzed. This value is either the average variance if a covariance matrix is analyzed, or 1 if the correlation matrix is analyzed (unless some of the variables are constant, in which case the value is the number of nonconstant variables divided by the number of variables). Otherwise, the default is 0. The MAXEIGEN= option cannot be used with the CENTROID option.
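For example, when the correlation matrix is analyzed, a threshold below the default of 1 makes the procedure keep splitting until the second eigenvalue of every cluster is at most that threshold. The following is a sketch with hypothetical data set and variable names; the value 0.7 is an arbitrary illustration, not a recommendation.

```sas
/* Hypothetical input: work.survey with numeric variables q1-q10 */
/* Split clusters until every second eigenvalue is <= 0.7       */
proc varclus data=work.survey maxeigen=0.7;
   var q1-q10;
run;
```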

MAXSEARCH=n

specifies the maximum number of iterations during the search phase. The default is 10 if you specify the CENTROID option; the default is 0 otherwise.

MINCLUSTERS=n

MINC=n

specifies the smallest number of clusters desired. The default value is 2 if INITIAL=RANDOM or INITIAL=SEED; otherwise, the procedure begins with one cluster and tries to split it in accordance with the PROPORTION= or MAXEIGEN= option.

MULTIPLEGROUP

performs a multiple group component analysis (refer to Harman 1976). The input data set must be TYPE=CORR, UCORR, COV, UCOV, FACTOR, or SSCP and must contain an observation with _TYPE_='GROUP' defining the variable groups. Specifying the MULTIPLEGROUP option is equivalent to specifying all of the following options: MINC=1, MAXITER=0, MAXSEARCH=0, MAXEIGEN=0, PROPORTION=0, and INITIAL=GROUP.

NOINT

requests that no intercept be used; covariances or correlations are not corrected for the mean. If you specify the NOINT option, the OUTSTAT= data set is TYPE=UCORR.

NOPRINT

suppresses the output. Note that this option temporarily disables the Output Delivery System (ODS). For more information, see Chapter 15, "Using the Output Delivery System."

OUTSTAT=SAS-data-set

creates an output data set to contain statistics including means, standard deviations, correlations, cluster scoring coefficients, and the cluster structure. If you want to create a permanent SAS data set, you must specify a two-level name. The OUTSTAT= data set is TYPE=UCORR if the NOINT option is specified. For more information on permanent SAS data sets, refer to "SAS Files" and "DATA Step Concepts" in SAS Language Reference: Concepts. For information on types of SAS data sets, see Appendix A.
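A sketch of saving the statistics to a permanent data set follows. It assumes that a libref named `mylib` has already been assigned; `mylib`, `vcstat`, `work.survey`, and `q1-q10` are all hypothetical names.

```sas
/* Assumes the libref mylib was assigned earlier in the session */
proc varclus data=work.survey outstat=mylib.vcstat maxclusters=3 short;
   var q1-q10;
run;

/* Inspect the saved statistics */
proc print data=mylib.vcstat;
run;
```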

OUTTREE=SAS-data-set

creates an output data set to contain information on the tree structure that can be used by the TREE procedure to print a tree diagram. The OUTTREE= option implies the HIERARCHY option. See Example 68.1 for use of the OUTTREE= option. If you want to create a permanent SAS data set, you must specify a two-level name. For more information on permanent SAS data sets, refer to "SAS Files" and "DATA Step Concepts" in SAS Language Reference: Concepts.
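The following sketch shows one way the OUTTREE= data set might be passed to the TREE procedure. The data set and variable names are hypothetical, and the HORIZONTAL option of PROC TREE is used only to orient the diagram.

```sas
/* Hypothetical input: work.survey with numeric variables q1-q10 */
proc varclus data=work.survey outtree=work.tree;   /* OUTTREE= implies HIERARCHY */
   var q1-q10;
run;

/* Draw the tree diagram from the saved tree structure */
proc tree data=work.tree horizontal;
run;
```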

PROPORTION=n

PERCENT=n

gives the proportion or percentage of variation that must be explained by the cluster component. Values greater than 1.0 are considered to be percentages, so PROPORTION=0.75 and PERCENT=75 are equivalent. If you specify the CENTROID option, the default value is 0.75; otherwise, the default value is 0.
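To illustrate the equivalence stated above, the two steps in this sketch request the same criterion: each cluster component must explain at least 75% of the variation in its cluster. The data set and variable names are hypothetical.

```sas
/* Hypothetical input: work.survey with numeric variables q1-q10 */
proc varclus data=work.survey proportion=0.75;
   var q1-q10;
run;

/* Same criterion, written as a percentage */
proc varclus data=work.survey percent=75;
   var q1-q10;
run;
```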

MAXITER=n

specifies the maximum number of iterations during the alternating least-squares phase. The default value is 1 if you specify the CENTROID option; the default is 10 otherwise.

RANDOM=n

specifies a positive integer as a starting value for the pseudo-random number sequence used with INITIAL=RANDOM. If you do not specify the RANDOM= option, the time of day is used to initialize the pseudo-random number sequence.
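A sketch of a reproducible random start follows. It combines INITIAL=RANDOM with a fixed RANDOM= value and, because CENTROID is not specified, sets MAXSEARCH=5 as recommended for INITIAL=RANDOM. The seed value 12345 and the data set and variable names are arbitrary assumptions.

```sas
/* Hypothetical input: work.survey with numeric variables q1-q10 */
proc varclus data=work.survey initial=random random=12345
             minclusters=3 maxsearch=5;
   var q1-q10;
run;
```

Rerunning the step with the same RANDOM= value should reproduce the same initial random assignment.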

SHORT

suppresses printing of the cluster structure, scoring coefficient, and intercluster correlation matrices.

SIMPLE

displays means and standard deviations.

SUMMARY

suppresses all default output except the final summary table.

TRACE

lists the cluster to which each variable is assigned during the iterations.

VARDEF=DF

VARDEF=N

VARDEF=WDF

VARDEF=WEIGHT | WGT

specifies the divisor to be used in the calculation of variances and covariances. The default value is VARDEF=DF. The values and associated divisors are displayed in the following table.

Value	Divisor	Formula
DF	degrees of freedom	$n-i$
N	number of observations	$n$
WDF	sum of weights minus one	$(\sum_j w_j)-1$
WEIGHT | WGT	sum of weights	$\sum_j w_j$

In the preceding table, $i=0$ if the NOINT option is specified, and $i=1$ otherwise.
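As a sketch of a nondefault divisor, the following step uses a WEIGHT statement together with VARDEF=WDF, so that variances and covariances are divided by the sum of the weights minus one. The data set `work.survey`, the variables `q1-q10`, and the weight variable `w` are hypothetical.

```sas
/* Hypothetical input: work.survey with variables q1-q10 and weight variable w */
proc varclus data=work.survey vardef=wdf;
   var q1-q10;
   weight w;   /* observation weights w_j used in the divisor */
run;
```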