The CORR Procedure

Statistical Computations

PROC CORR computes several parametric and nonparametric correlation statistics as measures of association. The formulas for computing these measures and the associated probabilities follow.


Pearson Product-Moment Correlation
The Pearson product-moment correlation is a parametric measure of association for two continuous random variables. The formula for the true Pearson product-moment correlation, denoted $\rho$, is

\[ \rho = \frac{\mathrm{Cov}(x,y)}{\sqrt{\mathrm{Var}(x)\,\mathrm{Var}(y)}} = \frac{E\big[(x - E(x))\,(y - E(y))\big]}{\sqrt{E\big[(x - E(x))^2\big]\;E\big[(y - E(y))^2\big]}} \]

The sample correlation, such as a Pearson product-moment correlation or weighted product-moment correlation, estimates the true correlation. The formula for the Pearson product-moment correlation is

\[ r = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2 \, \sum_i (y_i - \bar{y})^2}} \]

where $\bar{x}$ is the sample mean of $x$ and $\bar{y}$ is the sample mean of $y$.
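As a concrete illustration of the sample formula, the following Python sketch computes the Pearson correlation directly from the definition. The function name `pearson` is chosen for this example; this is not the PROC CORR implementation.

```python
def pearson(x, y):
    """Sample Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # numerator: corrected sum of crossproducts
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    # denominator: square root of the product of corrected sums of squares
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return sxy / (sxx * syy) ** 0.5

print(pearson([1, 2, 3, 4], [2, 4, 6, 8]))  # exactly linear, so r = 1.0
```

For perfectly linear data the formula returns exactly 1 (or -1 for a decreasing relationship), since the numerator and denominator coincide up to sign.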

The formula for a weighted Pearson product-moment correlation is

\[ r_w = \frac{\sum_i w_i (x_i - \bar{x}_w)(y_i - \bar{y}_w)}{\sqrt{\sum_i w_i (x_i - \bar{x}_w)^2 \, \sum_i w_i (y_i - \bar{y}_w)^2}} \]

where

\[ \bar{x}_w = \frac{\sum_i w_i x_i}{\sum_i w_i} \qquad \bar{y}_w = \frac{\sum_i w_i y_i}{\sum_i w_i} \]

Note that $\bar{x}_w$ is the weighted mean of $x$, $\bar{y}_w$ is the weighted mean of $y$, and $w_i$ is the weight.
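A minimal sketch of the weighted formula (the name `weighted_pearson` is illustrative, not a SAS routine). Setting every weight to 1 recovers the unweighted correlation, which makes a convenient sanity check.

```python
def weighted_pearson(x, y, w):
    """Weighted Pearson correlation; weights of 1 recover the unweighted formula."""
    sw = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, x)) / sw  # weighted mean of x
    my = sum(wi * yi for wi, yi in zip(w, y)) / sw  # weighted mean of y
    sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
    sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
    syy = sum(wi * (yi - my) ** 2 for wi, yi in zip(w, y))
    return sxy / (sxx * syy) ** 0.5
```

Note that unequal weights change the means and sums of squares but leave a perfectly linear relationship at correlation $\pm 1$.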

When one variable is dichotomous (0,1) and the other variable is continuous, a Pearson correlation is equivalent to a point biserial correlation. When both variables are dichotomous, a Pearson correlation coefficient is equivalent to the phi coefficient.


Spearman Rank-Order Correlation
Spearman rank-order correlation is a nonparametric measure of association based on the ranks of the data values. The formula is

\[ \theta = \frac{\sum_i (R_i - \bar{R})(S_i - \bar{S})}{\sqrt{\sum_i (R_i - \bar{R})^2 \, \sum_i (S_i - \bar{S})^2}} \]

where $R_i$ is the rank of the $i$th $x$ value, $S_i$ is the rank of the $i$th $y$ value, $\bar{R}$ is the mean of the $R_i$ values, and $\bar{S}$ is the mean of the $S_i$ values.

PROC CORR computes Spearman's correlation by ranking the data and using the ranks in the Pearson product-moment correlation formula. In case of ties, averaged ranks are used.
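The rank-then-correlate procedure can be sketched as follows. The helper names (`tied_ranks`, `spearman`) are chosen for this example; tied values receive the average of the ranks they would otherwise occupy, matching the averaged-rank convention described above.

```python
def tied_ranks(values):
    """Ranks 1..n; tied values receive the average of the ranks they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # find the extent of the current tied group
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    return ranks

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def spearman(x, y):
    """Spearman rank-order correlation: Pearson formula applied to the ranks."""
    return pearson(tied_ranks(x), tied_ranks(y))
```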


Kendall's tau-b
Kendall's tau-b is a nonparametric measure of association based on the number of concordances and discordances in paired observations. Concordance occurs when paired observations vary together, and discordance occurs when paired observations vary differently. The formula for Kendall's tau-b is

\[ \tau = \frac{\sum_{i<j} \mathrm{sgn}(x_i - x_j)\,\mathrm{sgn}(y_i - y_j)}{\sqrt{(T_0 - T_1)(T_0 - T_2)}} \]

where

\[ T_0 = \frac{n(n-1)}{2} \qquad T_1 = \frac{\sum_k t_k(t_k - 1)}{2} \qquad T_2 = \frac{\sum_l u_l(u_l - 1)}{2} \]

and where $t_k$ is the number of tied $x$ values in the $k$th group of tied $x$ values, $u_l$ is the number of tied $y$ values in the $l$th group of tied $y$ values, $n$ is the number of observations, and $\mathrm{sgn}(z)$ is defined as

\[ \mathrm{sgn}(z) = \begin{cases} \phantom{-}1 & \text{if } z > 0 \\ \phantom{-}0 & \text{if } z = 0 \\ -1 & \text{if } z < 0 \end{cases} \]

PROC CORR computes Kendall's correlation by ranking the data and using a method similar to that of Knight (1966). The data are double sorted by ranking observations according to values of the first variable and reranking the observations according to values of the second variable. PROC CORR computes Kendall's tau-b from the number of interchanges of the first variable and corrects for tied pairs (pairs of observations with equal values of X or equal values of Y).
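For illustration, the definition can be evaluated directly by iterating over all pairs, which is O(n²); PROC CORR's sort-based method is faster but gives the same value. The function names here are hypothetical, not SAS routines.

```python
from math import sqrt
from collections import Counter

def sgn(z):
    return (z > 0) - (z < 0)

def kendall_tau_b(x, y):
    """Kendall's tau-b via the O(n^2) pairwise definition, with tie corrections."""
    n = len(x)
    # s = sum over pairs i < j of sgn(x_i - x_j) * sgn(y_i - y_j)
    s = sum(sgn(x[i] - x[j]) * sgn(y[i] - y[j])
            for i in range(n) for j in range(i + 1, n))
    t0 = n * (n - 1) // 2
    t1 = sum(t * (t - 1) // 2 for t in Counter(x).values())  # tied x pairs
    t2 = sum(u * (u - 1) // 2 for u in Counter(y).values())  # tied y pairs
    return s / sqrt((t0 - t1) * (t0 - t2))
```

With no ties, the denominator reduces to $T_0 = n(n-1)/2$ and tau-b equals the simple Kendall tau.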


Hoeffding's Measure of Dependence, D
Hoeffding's measure of dependence, D, is a nonparametric measure of association that detects more general departures from independence. The statistic approximates a weighted sum over observations of chi-square statistics for two-by-two classification tables (Hoeffding 1948). Each set of $(x, y)$ values is a set of cut points for the classification. The formula for Hoeffding's D is

\[ D = 30\,\frac{(n-2)(n-3)D_1 + D_2 - 2(n-2)D_3}{n(n-1)(n-2)(n-3)(n-4)} \]

where

\[ D_1 = \sum_i (Q_i - 1)(Q_i - 2) \qquad D_2 = \sum_i (R_i - 1)(R_i - 2)(S_i - 1)(S_i - 2) \qquad D_3 = \sum_i (R_i - 2)(S_i - 2)(Q_i - 1) \]

$R_i$ is the rank of $x_i$, $S_i$ is the rank of $y_i$, and $Q_i$ (also called the bivariate rank) is 1 plus the number of points with both $x$ and $y$ values less than the $i$th point. A point that is tied on only the $x$ value or $y$ value contributes 1/2 to $Q_i$ if the other value is less than the corresponding value for the $i$th point. A point that is tied on both $x$ and $y$ contributes 1/4 to $Q_i$.

PROC CORR obtains the $Q_i$ values by first ranking the data. The data are then double sorted by ranking observations according to values of the first variable and reranking the observations according to values of the second variable. Hoeffding's D statistic is computed using the number of interchanges of the first variable.

When no ties occur among data set observations, the D statistic values are between -0.5 and 1, with 1 indicating complete dependence. However, when ties occur, the D statistic may take a smaller value: for a pair of variables with identical values, Hoeffding's D statistic may be less than 1. With a large number of ties in a small data set, the D statistic may be less than -0.5. For more information on Hoeffding's D, see Hollander and Wolfe (1973, p. 228).


Partial Correlation
A partial correlation measures the strength of a relationship between two variables, while controlling the effect of one or more additional variables. The Pearson partial correlation for a pair of variables may be defined as the correlation of errors after regression on the controlling variables. Let $\mathbf{y} = (y_1, y_2, \ldots, y_v)$ be the set of variables to correlate and let $\mathbf{z} = (z_1, z_2, \ldots, z_p)$ be the set of controlling variables. Also let $\boldsymbol{\beta}_0$ and $\boldsymbol{\beta}$ be sets of regression parameters, where $\boldsymbol{\beta}_0$ is the intercept and $\boldsymbol{\beta}$ is the slope. Suppose

\[ E(\mathbf{y}) = \boldsymbol{\beta}_0 + \mathbf{z}\boldsymbol{\beta} \]

is a regression model for $\mathbf{y}$ given $\mathbf{z}$. The population Pearson partial correlation between the $i$th and the $j$th variables of $\mathbf{y}$ given $\mathbf{z}$ is defined as the correlation between the errors $(y_i - E(y_i))$ and $(y_j - E(y_j))$.

If the exact values of $\boldsymbol{\beta}_0$ and $\boldsymbol{\beta}$ are unknown, you can use a sample Pearson partial correlation to estimate the population Pearson partial correlation. For a given sample of observations, you estimate the sets of unknown parameters $\boldsymbol{\beta}_0$ and $\boldsymbol{\beta}$ using the least-squares estimators $\hat{\boldsymbol{\beta}}_0$ and $\hat{\boldsymbol{\beta}}$. Then the fitted least-squares regression model is

\[ \hat{\mathbf{y}} = \hat{\boldsymbol{\beta}}_0 + \mathbf{z}\hat{\boldsymbol{\beta}} \]

The partial corrected sums of squares and crossproducts (CSSCP) of $\mathbf{y}$ given $\mathbf{z}$ are the corrected sums of squares and crossproducts of the residuals $\mathbf{y} - \hat{\mathbf{y}}$. Using these partial corrected sums of squares and crossproducts, you can calculate the partial variances, partial covariances, and partial correlations.
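The residual definition can be illustrated for a single controlling variable: regress each variable on the controlling variable by least squares and correlate the residuals. This is a sketch of the definition with hypothetical helper names, not PROC CORR's Cholesky-based computation.

```python
def residuals(y, z):
    """Residuals of a least-squares regression of y on one controlling variable z (with intercept)."""
    n = len(z)
    mz, my = sum(z) / n, sum(y) / n
    slope = (sum((zi - mz) * (yi - my) for zi, yi in zip(z, y))
             / sum((zi - mz) ** 2 for zi in z))
    intercept = my - slope * mz
    return [yi - (intercept + slope * zi) for zi, yi in zip(z, y)]

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def partial_corr(x, y, z):
    """Partial correlation of x and y controlling for z: correlate the regression residuals."""
    return pearson(residuals(x, z), residuals(y, z))
```

For a single controlling variable this residual-based value agrees with the first-order partial correlation formula given later in this section.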

PROC CORR derives the partial corrected sums of squares and crossproducts matrix by applying the Cholesky decomposition algorithm to the CSSCP matrix. For Pearson partial correlations, let $S$ be the partitioned CSSCP matrix between two sets of variables, $\mathbf{z}$ and $\mathbf{y}$:

\[ S = \begin{pmatrix} S_{zz} & S_{zy} \\ S_{zy}' & S_{yy} \end{pmatrix} \]

PROC CORR calculates $S_{yy \cdot z}$, the partial CSSCP matrix of $\mathbf{y}$ after controlling for $\mathbf{z}$, by applying the Cholesky decomposition algorithm sequentially on the rows associated with $\mathbf{z}$, the variables being partialled out.

After applying the Cholesky decomposition algorithm to each row associated with variables $\mathbf{z}$, PROC CORR checks all higher-numbered diagonal elements associated with $\mathbf{z}$ for singularity. After the Cholesky decomposition, a variable is considered singular if the value of the corresponding diagonal element is less than $\varepsilon$ times the original unpartialled corrected sum of squares of that variable. You can specify the singularity criterion $\varepsilon$ using the SINGULAR= option. For Pearson partial correlations, a controlling variable $z_k$ is considered singular if the $R^2$ for predicting this variable from the variables that are already partialled out exceeds $1 - \varepsilon$. When this happens, PROC CORR excludes the variable from the analysis. Similarly, a variable is considered singular if the $R^2$ for predicting this variable from the controlling variables exceeds $1 - \varepsilon$. When this happens, its associated diagonal element and all higher-numbered elements in this row or column are set to zero.

After the Cholesky decomposition algorithm is performed on all rows associated with $\mathbf{z}$, the resulting matrix has the form

\[ T = \begin{pmatrix} T_{zz} & T_{zy} \\ 0 & S_{yy \cdot z} \end{pmatrix} \]

where $T_{zz}$ is an upper triangular matrix with

\[ T_{zz}'\,T_{zz} = S_{zz} \qquad T_{zz}'\,T_{zy} = S_{zy} \qquad S_{yy \cdot z} = S_{yy} - T_{zy}'\,T_{zy} \]

If $S_{zz}$ is positive definite, then the partial CSSCP matrix $S_{yy \cdot z}$ is identical to the matrix derived from the formula

\[ S_{yy \cdot z} = S_{yy} - S_{zy}'\,S_{zz}^{-1} S_{zy} \]

The partial variance-covariance matrix is calculated with the variance divisor (VARDEF= option). PROC CORR can then use the standard Pearson correlation formula on the partial variance-covariance matrix to calculate the Pearson partial correlation matrix. Another way to calculate Pearson partial correlation is by applying the Cholesky decomposition algorithm directly to the correlation matrix and by using the correlation formula on the resulting matrix.

To derive the corresponding Spearman partial rank-order correlations and Kendall partial tau-b correlations, PROC CORR applies the Cholesky decomposition algorithm to the Spearman rank-order correlation matrix and the Kendall tau-b correlation matrix and uses the correlation formula. The singularity criterion for nonparametric partial correlations is identical to that for Pearson partial correlations, except that PROC CORR uses a matrix of nonparametric correlations and sets a singular variable's associated correlations to missing. The partial tau-b correlations range from -1 to 1. However, the sampling distribution of this partial tau-b is unknown; therefore, the probability values are not available.

When a correlation matrix (Pearson, Spearman, or Kendall tau-b correlation matrix) is positive definite, the resulting partial correlation between variables $x$ and $y$ after adjusting for a single variable $z$ is identical to that obtained from the first-order partial correlation formula

\[ r_{xy \cdot z} = \frac{r_{xy} - r_{xz}\,r_{yz}}{\sqrt{(1 - r_{xz}^2)(1 - r_{yz}^2)}} \]

where $r_{xy}$, $r_{xz}$, and $r_{yz}$ are the appropriate correlations.
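The first-order formula is simple enough to evaluate directly from the three pairwise correlations (the function name is illustrative):

```python
from math import sqrt

def first_order_partial(r_xy, r_xz, r_yz):
    """First-order partial correlation of x and y given z from pairwise correlations."""
    return (r_xy - r_xz * r_yz) / sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))
```

For example, with $r_{xy} = r_{xz} = r_{yz} = 0.5$ the formula gives $(0.5 - 0.25)/0.75 = 1/3$: part of the apparent association between $x$ and $y$ is explained by their common association with $z$.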

The formula for higher-order partial correlations is a straightforward extension of the above first-order formula. For example, when the correlation matrix is positive definite, the partial correlation between $x$ and $y$ controlling for both $z_1$ and $z_2$ is identical to the second-order partial correlation formula

\[ r_{xy \cdot z_1 z_2} = \frac{r_{xy \cdot z_1} - r_{xz_2 \cdot z_1}\,r_{yz_2 \cdot z_1}}{\sqrt{(1 - r_{xz_2 \cdot z_1}^2)(1 - r_{yz_2 \cdot z_1}^2)}} \]

where $r_{xy \cdot z_1}$, $r_{xz_2 \cdot z_1}$, and $r_{yz_2 \cdot z_1}$ are first-order partial correlations among variables $x$, $y$, and $z_2$ given $z_1$.


Cronbach's Coefficient Alpha
Analyzing latent constructs such as job satisfaction, motor ability, sensory recognition, or customer satisfaction requires instruments to accurately measure the constructs. Interrelated items may be summed to obtain an overall score for each participant. Cronbach's coefficient alpha estimates the reliability of this type of scale by determining the internal consistency of the test or the average correlation of items within the test (Cronbach 1951).

When a value is recorded, the observed value contains some degree of measurement error. Two sets of measurements on the same variable for the same individual may not have identical values. However, repeated measurements for a series of individuals will show some consistency. Reliability measures internal consistency from one set of measurements to another. The observed value Y is divided into two components, a true value T and a measurement error E. The measurement error is assumed to be independent of the true value, that is,

\[ Y = T + E \qquad \text{and} \qquad \mathrm{Cov}(T, E) = 0 \]

The reliability coefficient of a measurement test is defined as the squared correlation between the observed value Y and the true value T, that is,

\[ \rho_{YT}^2 = \frac{\mathrm{var}(T)}{\mathrm{var}(Y)} \]

which is the proportion of the observed variance due to true differences among individuals in the sample. If Y is the sum of several observed variables measuring the same feature, you can estimate var(T). Cronbach's coefficient alpha, based on a lower bound for var(T), is an estimate of the reliability coefficient.

Suppose $p$ variables are used with $Y_i = T_i + E_i$ for $i = 1, 2, \ldots, p$, where $Y_i$ is the observed value, $T_i$ is the true value, and $E_i$ is the measurement error. The measurement errors ($E_i$) are independent of the true values ($T_i$) and are also independent of each other. Let $Y_0 = \sum_{i=1}^p Y_i$ be the total observed score and $T_0 = \sum_{i=1}^p T_i$ be the total true score. Because

\[ \sum_{i \neq j} \mathrm{Cov}(Y_i, Y_j) = \sum_{i \neq j} \mathrm{Cov}(T_i, T_j) \]

a lower bound for $\mathrm{var}(T_0)$ is given by

\[ \frac{p}{p-1} \sum_{i \neq j} \mathrm{Cov}(Y_i, Y_j) \]

With $\mathrm{var}(Y_0) = \sum_i \mathrm{var}(Y_i) + \sum_{i \neq j} \mathrm{Cov}(Y_i, Y_j)$, a lower bound for the reliability coefficient is then given by Cronbach's coefficient alpha:

\[ \alpha = \frac{p}{p-1}\left(1 - \frac{\sum_{i=1}^p \mathrm{var}(Y_i)}{\mathrm{var}(Y_0)}\right) \]

If the variances of the items vary widely, you can standardize the items to a standard deviation of 1 before computing the coefficient alpha. If the variables are dichotomous (0,1), the coefficient alpha is equivalent to the Kuder-Richardson 20 (KR-20) reliability measure.
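The coefficient can be sketched directly from its definition: compute the variance of each item, the variance of the total score, and combine them. The function name and the list-of-item-scores layout are assumptions for this example.

```python
def cronbach_alpha(items):
    """Cronbach's coefficient alpha; `items` is a list of p item-score lists, aligned by respondent."""
    p = len(items)

    def sample_var(v):
        m = sum(v) / len(v)
        return sum((x - m) ** 2 for x in v) / (len(v) - 1)

    # Y0: the total observed score for each respondent
    total = [sum(scores) for scores in zip(*items)]
    return (p / (p - 1)) * (1 - sum(sample_var(it) for it in items) / sample_var(total))
```

Two identical items yield alpha = 1, the maximum; items that covary weakly relative to their own variances drive alpha toward (or below) zero.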

When the correlation between each pair of variables is 1, the coefficient alpha has a maximum value of 1. With negative correlations between some variables, the coefficient alpha can have a value less than zero. The larger the overall alpha coefficient, the more likely that items contribute to a reliable scale. Nunnally (1978) suggests .70 as an acceptable reliability coefficient; smaller reliability coefficients are seen as inadequate. However, this varies by discipline.

To determine how each item reflects the reliability of the scale, you calculate a coefficient alpha after deleting each variable independently from the scale. The Cronbach's coefficient alpha from all variables except the $i$th variable is given by

\[ \alpha_i = \frac{p-1}{p-2}\left(1 - \frac{\sum_{j \neq i} \mathrm{var}(Y_j)}{\mathrm{var}\!\left(\sum_{j \neq i} Y_j\right)}\right) \]

If the reliability coefficient increases after deleting an item from the scale, you can assume that the item is not highly correlated with other items in the scale. Conversely, if the reliability coefficient decreases, you can assume that the item is highly correlated with other items in the scale. See SAS Communications, 4th Quarter 1994, for more information on how to interpret Cronbach's coefficient alpha.

Listwise deletion of observations with missing values is necessary to correctly calculate Cronbach's coefficient alpha. PROC CORR does not automatically use listwise deletion when you specify ALPHA. Therefore, use the NOMISS option if the data set contains missing values. Otherwise, PROC CORR prints a warning message in the SAS log indicating the need to use NOMISS with ALPHA.


Probability Values
Probability values for the Pearson and Spearman correlations are computed by treating

\[ t = r\,\sqrt{\frac{n-2}{1-r^2}} \]

as coming from a t distribution with $n-2$ degrees of freedom, where $r$ is the appropriate correlation.
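The t statistic itself is a one-liner (the name `corr_t_stat` is illustrative); to obtain the p-value you would then compare it with a t distribution with $n-2$ degrees of freedom, for example via a t CDF routine if one is available in your environment.

```python
from math import sqrt

def corr_t_stat(r, n):
    """t statistic for testing a Pearson or Spearman correlation r from n observations."""
    return r * sqrt((n - 2) / (1 - r ** 2))
```

For example, $r = 0.6$ with $n = 27$ gives $t = 0.6\sqrt{25/0.64} = 3.75$ on 25 degrees of freedom.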

Probability values for the Pearson and Spearman partial correlations are computed by treating

\[ t = r\,\sqrt{\frac{n-k-2}{1-r^2}} \]

as coming from a t distribution with $n-k-2$ degrees of freedom, where $r$ is the appropriate partial correlation and $k$ is the number of variables being partialled out.

Probability values for Kendall correlations are computed by treating

\[ \frac{s}{\sqrt{\mathrm{var}(s)}} \]

as coming from a normal distribution when

\[ s = \sum_{i<j} \mathrm{sgn}(x_i - x_j)\,\mathrm{sgn}(y_i - y_j) \]

and where $x_i$ are the values of the first variable, $y_i$ are the values of the second variable, and the function $\mathrm{sgn}(z)$ is defined as

\[ \mathrm{sgn}(z) = \begin{cases} \phantom{-}1 & \text{if } z > 0 \\ \phantom{-}0 & \text{if } z = 0 \\ -1 & \text{if } z < 0 \end{cases} \]

The formula for the variance of $s$, $\mathrm{var}(s)$, is computed as

\[ \mathrm{var}(s) = \frac{v_0 - v_t - v_u}{18} + \frac{v_1}{2n(n-1)} + \frac{v_2}{9n(n-1)(n-2)} \]

where
\[ v_0 = n(n-1)(2n+5) \]
\[ v_t = \sum_k t_k(t_k-1)(2t_k+5) \]
\[ v_u = \sum_l u_l(u_l-1)(2u_l+5) \]
\[ v_1 = \Big(\sum_k t_k(t_k-1)\Big)\Big(\sum_l u_l(u_l-1)\Big) \]
\[ v_2 = \Big(\sum_k t_k(t_k-1)(t_k-2)\Big)\Big(\sum_l u_l(u_l-1)(u_l-2)\Big) \]

The sums are over tied groups of values, where $t_k$ is the number of tied $x$ values and $u_l$ is the number of tied $y$ values (Noether 1967). The sampling distribution of Kendall's partial tau-b is unknown; therefore, the probability values are not available.

The probability values for Hoeffding's D statistic are computed using the asymptotic distribution computed by Blum, Kiefer, and Rosenblatt (1961). The formula is

\[ \frac{(n-1)\pi^4}{60}\,D + \frac{\pi^4}{72} \]

which comes from the asymptotic distribution. When the sample size is less than 10, see the tables for the distribution of D in Hollander and Wolfe (1973).



Copyright 1999 by SAS Institute Inc., Cary, NC, USA. All rights reserved.