The CORR Procedure

Concepts

Interpreting Correlation Coefficients
Correlation coefficients contain information on both the strength and direction of a linear relationship between two numeric random variables. If one variable x is an exact linear function of another variable y, a positive relationship exists when the correlation is 1 and an inverse relationship exists when the correlation is -1. If there is no linear predictability between the two variables, the correlation is 0. If the variables are jointly normal and the correlation is 0, the two variables are independent. However, correlation does not imply causality because, in some cases, an underlying causal relationship may not exist.
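
The two boundary cases described above can be checked numerically. The following is a minimal Python sketch, not PROC CORR output; the helper name pearson and the sample values are assumptions made for illustration.

```python
import math

def pearson(xs, ys):
    """Pearson product-moment correlation of two equal-length sequences.

    Hypothetical helper written for this illustration; it is not part of SAS.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Made-up sample values for y.
y = [1.0, 2.0, 3.0, 4.0, 5.0]
x_pos = [2 * v + 1 for v in y]    # x is an exact increasing linear function of y
x_neg = [-3 * v + 10 for v in y]  # x is an exact decreasing linear function of y

r_pos = pearson(x_pos, y)  # positive relationship: correlation of 1
r_neg = pearson(x_neg, y)  # inverse relationship: correlation of -1
print(r_pos, r_neg)
```

The slope and intercept of the linear function do not matter; only the sign of the slope determines whether the coefficient is 1 or -1.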

The scatterplots in Examining Correlations Using Scatterplots depict the relationship between two numeric random variables.

Examining Correlations Using Scatterplots

[IMAGE]

When the relationship between two variables is nonlinear or when outliers are present, the correlation coefficient incorrectly estimates the strength of the relationship. Plotting the data before computing a correlation coefficient enables you to verify the linear relationship and to identify the potential outliers.
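
A small numeric illustration may make both cautions concrete. The sketch below is Python rather than SAS, uses made-up data, and defines its own hypothetical pearson helper. It shows an exact quadratic relationship with a correlation of 0 and a weak relationship that appears strong once a single outlier is added.

```python
import math

def pearson(xs, ys):
    """Pearson correlation; hypothetical helper for this illustration only."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Nonlinear case: y is completely determined by x, yet the linear
# correlation is 0 because the curve is symmetric about x = 0.
x_curve = [-3, -2, -1, 0, 1, 2, 3]
y_curve = [x * x for x in x_curve]
r_curve = pearson(x_curve, y_curve)

# Outlier case: five weakly related points, then the same points plus
# one extreme observation at (30, 30).
x_base = [1, 2, 3, 4, 5]
y_base = [2, 4, 1, 5, 3]
r_without = pearson(x_base, y_base)
r_with = pearson(x_base + [30], y_base + [30])

print(r_curve, r_without, r_with)
```

In both cases a scatterplot would reveal the problem immediately, whereas the coefficient alone would not.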

Determining Computer Resources
The only factor limiting the number of variables that you can analyze is the amount of available memory. The computer resources that PROC CORR requires depend on which statements and options you specify. To determine the computer resources that you need, use

N number of observations in the data set.

C number of correlation types (1 to 4).

V number of VAR statement variables.

W number of WITH statement variables.

P number of PARTIAL statement variables.

so that

T = V+W+P

K = V*W          when W>0
K = V*(V+1)/2    when W=0

L = K            when P=0
L = T*(T+1)/2    when P>0
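
The definitions of T, K, and L can be evaluated mechanically. The following Python sketch is an illustration only; the function name derived_counts is an assumption and is not part of SAS.

```python
def derived_counts(v, w=0, p=0):
    """Return (T, K, L) from the numbers of VAR, WITH, and PARTIAL variables.

    Hypothetical helper that transcribes the definitions in the text.
    """
    t = v + w + p
    # K counts the correlations requested: V*W with a WITH statement,
    # otherwise the V*(V+1)/2 distinct pairs among the VAR variables.
    k = v * w if w > 0 else v * (v + 1) // 2
    # L equals K unless a PARTIAL statement is used.
    l = k if p == 0 else t * (t + 1) // 2
    return t, k, l

# Two VAR variables, three WITH variables, one PARTIAL variable.
print(derived_counts(2, 3, 1))
# Three VAR variables only.
print(derived_counts(3))
```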

For small N and large K, the CPU time varies as K for all types of correlations. For large N, the CPU time depends on the type of correlation. To estimate how CPU time grows, use

K*N with PEARSON (default)

T*N*log N with SPEARMAN

K*N*log N with HOEFFDING or KENDALL
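
These expressions give relative growth rather than seconds, so they are most useful for comparing options. The Python sketch below is an illustration under assumptions: the function name relative_cpu_cost is hypothetical, and the natural logarithm is assumed because the text does not state a base (the base changes results only by a constant factor).

```python
import math

def relative_cpu_cost(method, n, t, k):
    """Unitless growth term for large N, transcribed from the text.

    Hypothetical helper; the natural logarithm is an assumption.
    """
    method = method.lower()
    if method == "pearson":
        return k * n
    if method == "spearman":
        return t * n * math.log(n)
    if method in ("kendall", "hoeffding"):
        return k * n * math.log(n)
    raise ValueError(f"unknown correlation type: {method}")

# Three VAR variables (T=3, K=6) and 10,000 observations.
for name in ("pearson", "spearman", "kendall"):
    print(name, relative_cpu_cost(name, n=10_000, t=3, k=6))
```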

You can reduce CPU time by specifying NOMISS. With NOMISS, processing is much faster when most observations do not contain missing values.

The options and statements you use in the procedure require different amounts of storage to process the data. For Pearson correlations, the amount of temporary storage in bytes (M) is

40T+16L with NOMISS and NOSIMPLE

40T+16L+56T with NOMISS

40T+16L+56K with NOSIMPLE

40T+16L+56K+56T with no options

Using a PARTIAL statement increases the amount of temporary storage by 12T bytes. Using the ALPHA option increases the amount of temporary storage by 32V+16 bytes.

The following example uses a PARTIAL statement, which invokes NOMISS.

proc corr;
   var x1 x2;
   with y1 y2 y3;
   partial z1;
run;

Therefore, using 40T+16L+56T+12T, the minimum temporary storage equals 984 bytes (T=2+3+1=6 and L=T(T+1)/2=21, so 240+336+336+72=984).
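
The Pearson storage formulas can be collected into one calculation. The Python sketch below is an illustration only: the function name pearson_storage_bytes and its parameters are assumptions, and it treats a PARTIAL statement as implying NOMISS, as the text states.

```python
def pearson_storage_bytes(v, w=0, p=0, nomiss=False, nosimple=False, alpha=False):
    """Temporary storage in bytes for Pearson correlations, per the text.

    Hypothetical helper; v, w, and p are the numbers of VAR, WITH, and
    PARTIAL statement variables.
    """
    t = v + w + p
    k = v * w if w > 0 else v * (v + 1) // 2
    l = k if p == 0 else t * (t + 1) // 2
    if p > 0:
        nomiss = True          # a PARTIAL statement invokes NOMISS
    m = 40 * t + 16 * l
    if not nomiss:
        m += 56 * k            # term that NOMISS removes
    if not nosimple:
        m += 56 * t            # term that NOSIMPLE removes
    if p > 0:
        m += 12 * t            # extra storage for a PARTIAL statement
    if alpha:
        m += 32 * v + 16       # extra storage for the ALPHA option
    return m

# The PARTIAL example: 2 VAR, 3 WITH, and 1 PARTIAL variable.
print(pearson_storage_bytes(2, 3, 1))
```

Calling the function with p=0 and toggling nomiss or nosimple reproduces the four cases listed for Pearson correlations.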

Using the SPEARMAN, KENDALL, or HOEFFDING option requires additional temporary storage for each observation. For the most time-efficient processing, the amount of temporary storage in bytes is

40T+8K+8L*C+12T*N+28N+QS+QP+QK
where

QS = 0      with NOSIMPLE
QS = 68T    otherwise

QP = 56K    with PEARSON and without NOMISS
QP = 0      otherwise

QK = 32N    with KENDALL or HOEFFDING
QK = 0      otherwise.

The following example uses KENDALL:

  proc corr kendall;
     var x1 x2 x3;
  run;

Therefore, the minimum temporary storage in bytes is

40*3+8*6+8*6*1+12*3N+28N+3*68+32N = 420+96N

where N is the number of observations.
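
The same calculation can be written as a function of N. The Python sketch below is an illustration only; the function name rank_storage_bytes and its keyword parameters are assumptions, and the formula is transcribed from the text above.

```python
def rank_storage_bytes(n, v, w=0, p=0, c=1, pearson=False,
                       kendall_or_hoeffding=False,
                       nomiss=False, nosimple=False):
    """Temporary storage in bytes when SPEARMAN, KENDALL, or HOEFFDING is used.

    Hypothetical helper; n is the number of observations and c is the
    number of correlation types requested (1 to 4).
    """
    t = v + w + p
    k = v * w if w > 0 else v * (v + 1) // 2
    l = k if p == 0 else t * (t + 1) // 2
    qs = 0 if nosimple else 68 * t
    qp = 56 * k if (pearson and not nomiss) else 0
    qk = 32 * n if kendall_or_hoeffding else 0
    return 40 * t + 8 * k + 8 * l * c + 12 * t * n + 28 * n + qs + qp + qk

# The KENDALL example with three VAR variables, for several data set sizes.
for n_obs in (100, 1_000, 10_000):
    print(n_obs, rank_storage_bytes(n_obs, v=3, kendall_or_hoeffding=True))
```

For this example the result equals 420+96N for every N, which matches the expression derived above.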

If M bytes are not available, PROC CORR must process the data multiple times to compute all the statistics. This reduces the minimum temporary storage you need by 12(T-2)N bytes. When this occurs, PROC CORR prints a note suggesting a larger memory region.
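
The size of that reduction is easy to compute. The Python sketch below is an illustration only; the function name multipass_minimum_bytes is an assumption, and it simply subtracts 12(T-2)N from a storage figure obtained from the formulas above.

```python
def multipass_minimum_bytes(full_bytes, t, n):
    """Minimum storage when the data are processed multiple times.

    Hypothetical helper; full_bytes is the storage needed for the most
    time-efficient processing, t is V+W+P, and n is the number of
    observations.
    """
    return full_bytes - 12 * (t - 2) * n

# KENDALL example with T=3 and N=100: 420+96N = 10020 bytes in full.
print(multipass_minimum_bytes(10020, t=3, n=100))
```

With T=2 the reduction is zero, so the saving matters only when three or more variables are analyzed.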
