The SURVEYREG Procedure

Computational Method

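For a stratified clustered sample design, observations are represented by an n×(p+2) matrix
(w, y, X) = (w_{hij}, y_{hij}, x_{hij})
where
- w denotes the sampling weight vector
- y denotes the dependent variable
- X denotes the n×p design matrix
- h = 1, 2, ..., H is the stratum index, with a total of H strata
- i = 1, 2, ..., n_h is the cluster index within stratum h, with a total of n_h clusters
- j = 1, 2, ..., m_{hi} is the unit index within cluster i of stratum h, with a total of m_{hi} units
- p is the total number of parameters
- n = \sum_{h=1}^H \sum_{i=1}^{n_h} m_{hi} is the total number of observations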

Also, f_h denotes the sampling rate for stratum h. You can use the TOTAL= option or the RATE= option to input population totals or sampling rates. See the section "Specification of Population Totals and Sampling Rates" for details. If you input stratum totals, PROC SURVEYREG computes f_h as the ratio of the stratum sample size to the stratum total. If you input stratum sampling rates, PROC SURVEYREG uses these values directly for f_h. If you do not specify the TOTAL= option or the RATE= option, then the procedure assumes that the stratum sampling rates f_h are negligible, and a finite population correction is not used when computing variances.
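The three cases for f_h described above can be sketched in a few lines of Python. This is only an illustration of the rule, not PROC SURVEYREG code; the function name `stratum_sampling_rates` and its dictionary arguments are hypothetical stand-ins for the TOTAL= and RATE= inputs.

```python
def stratum_sampling_rates(sample_sizes, totals=None, rates=None):
    """Return a dict mapping each stratum label to its sampling rate f_h.

    sample_sizes: dict of stratum label -> stratum sample size
    totals:       optional dict of stratum label -> stratum total (like TOTAL=)
    rates:        optional dict of stratum label -> sampling rate (like RATE=)
    All three argument names are hypothetical.
    """
    if rates is not None:
        # Rates supplied directly: use them as given.
        return dict(rates)
    if totals is not None:
        # Totals supplied: f_h = stratum sample size / stratum total.
        return {h: sample_sizes[h] / totals[h] for h in sample_sizes}
    # Neither supplied: rates are treated as negligible, so no finite
    # population correction enters the variance formulas.
    return {h: 0.0 for h in sample_sizes}
```

With f_h = 0 the factor (1 - f_h) in the variance formula equals 1, which is what "no finite population correction" means in practice.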

Regression Coefficients

PROC SURVEYREG solves the normal equations X'WX\beta = X'Wy by using a modified sweep routine that produces a generalized (g2) inverse (X'WX)^- and a solution (Pringle and Raynor 1971)
\hat{\beta} = (X'WX)^- X'Wy
where W is the diagonal matrix constructed from the WEIGHT variable values.

For models with class variables, there are more design matrix columns than there are degrees of freedom (DF) for the effect. Thus, there are linear dependencies among the columns. In this case, the parameters are not estimable; there is an infinite number of least-squares solutions. PROC SURVEYREG uses a generalized (g2) inverse to obtain values for the estimates. The solution values are not displayed unless you specify the SOLUTION option in the MODEL statement. The solution has the characteristic that an estimate is 0 whenever the design column for that parameter is a linear combination of previous columns. (Strictly speaking, the solution values should not be called estimates.) With this full parameterization, hypothesis tests are constructed to test linear functions of the parameters that are estimable.
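The behavior of the solution for a rank-deficient design can be illustrated with a small NumPy sketch. It is not the modified sweep routine itself. Instead it reproduces the stated property (a zero for every column that is a linear combination of previous columns) by a simpler route: find the linearly independent columns from left to right, solve the weighted normal equations on those, and set the remaining solution values to 0. The function name `weighted_solution` and the tolerance are assumptions, and the weights are assumed to be positive.

```python
import numpy as np

def weighted_solution(X, y, w, tol=1e-10):
    """Solve X'WX b = X'Wy, returning 0 for columns dependent on earlier ones."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    w = np.asarray(w, dtype=float)

    # Keep a column only if it raises the rank of the columns kept so far.
    keep = []
    for k in range(X.shape[1]):
        trial = keep + [k]
        if np.linalg.matrix_rank(X[:, trial], tol=tol) == len(trial):
            keep.append(k)

    Xk = X[:, keep]
    A = Xk.T @ (w[:, None] * Xk)      # X'WX restricted to the kept columns
    b = Xk.T @ (w * y)                # X'Wy restricted to the kept columns

    beta = np.zeros(X.shape[1])       # dependent columns keep the value 0
    beta[keep] = np.linalg.solve(A, b)
    return beta, keep

# Intercept plus a two-level class variable: column 3 = column 1 - column 2.
X = [[1, 1, 0], [1, 1, 0], [1, 0, 1], [1, 0, 1]]
beta, keep = weighted_solution(X, y=[1, 3, 5, 7], w=[1, 3, 2, 2])
# beta is approximately [6.0, -3.5, 0.0]: the last level is the reference.
```

The individual values depend on which columns come first, but estimable functions such as the difference between the two level effects do not.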

Variance Estimation

PROC SURVEYREG uses Taylor series expansion theory to estimate the variance-covariance matrix of the estimated regression coefficients (Fuller 1975). Let
{r=y-X\hat{{\beta}}}
where the (h,i,j)th element is r_{hij}. Compute the 1×p row vectors
e_{hij} = w_{hij} r_{hij} x_{hij}
e_{hi\cdot} = \sum_{j=1}^{m_{hi}} e_{hij}
\bar{e}_{h\cdot\cdot} = \frac{1}{n_h} \sum_{i=1}^{n_h} e_{hi\cdot}
and calculate the p×p matrix
G = \frac{n-1}{n-p} \sum_{h=1}^H \frac{n_h(1-f_h)}{n_h-1} \sum_{i=1}^{n_h} (e_{hi\cdot}-\bar{e}_{h\cdot\cdot})' (e_{hi\cdot}-\bar{e}_{h\cdot\cdot})
PROC SURVEYREG computes the covariance matrix of \hat{\beta} as
{\hat V = (X'WX)^-G(X'WX)^-}
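The steps above (residuals, the row vectors e_{hij}, cluster totals, stratum means, the matrix G, and the sandwich product) can be traced in a short NumPy sketch. It is an illustration under stated assumptions rather than the procedure's implementation: the function name `taylor_covariance` and its arguments are hypothetical, `np.linalg.pinv` (the Moore-Penrose inverse) stands in for the g2 inverse and coincides with it only when X has full column rank, and every stratum is assumed to contain at least two clusters.

```python
import numpy as np

def taylor_covariance(X, y, w, strata, clusters, rates=None):
    """Return (beta_hat, V_hat) for a stratified clustered design.

    strata, clusters: per-observation labels (clusters nested in strata)
    rates: optional mapping of stratum label -> f_h; None means f_h = 0
    """
    X, y, w = (np.asarray(a, dtype=float) for a in (X, y, w))
    strata, clusters = np.asarray(strata), np.asarray(clusters)
    n, p = X.shape

    A_inv = np.linalg.pinv(X.T @ (w[:, None] * X))   # stands in for (X'WX)^-
    beta = A_inv @ (X.T @ (w * y))
    r = y - X @ beta                                  # residuals r_hij
    e = (w * r)[:, None] * X                          # rows e_hij (1 x p each)

    G = np.zeros((p, p))
    for h in np.unique(strata):
        in_h = strata == h
        f_h = 0.0 if rates is None else rates[h]
        # Cluster totals e_hi. for every cluster i in stratum h.
        e_hi = np.array([e[in_h & (clusters == c)].sum(axis=0)
                         for c in np.unique(clusters[in_h])])
        n_h = len(e_hi)                               # assumed >= 2
        d = e_hi - e_hi.mean(axis=0)                  # e_hi. minus stratum mean
        G += n_h * (1.0 - f_h) / (n_h - 1) * (d.T @ d)
    G *= (n - 1) / (n - p)

    return beta, A_inv @ G @ A_inv
```

As a sanity check, for an intercept-only model with unit weights, one stratum, and one observation per cluster, the result reduces to the familiar s²/n for the variance of a sample mean.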

Testing Effects

For each effect in the model, PROC SURVEYREG computes an L matrix such that every element of L\beta is estimable; the L matrix has the maximum possible rank associated with the effect. To test the effect, the procedure uses the Wald F statistic for the hypothesis H_0\colon L\beta = 0. The Wald F statistic equals
F_{\rm Wald} = \frac{(L\hat{\beta})' (L\hat{V}L')^{-1} (L\hat{\beta})}{{\rm rank}(L)}
with numerator degrees of freedom equal to rank(L) and denominator degrees of freedom equal to the number of clusters minus the number of strata (unless you have specified the denominator degrees of freedom with the DF= option in the MODEL statement; see the section "Denominator Degrees of Freedom"). It is possible that the L matrix cannot be constructed for an effect, in which case that effect is not testable. For more information on how the matrix L is constructed, see the discussion in Chapter 12, "The Four Types of Estimable Functions."
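A minimal NumPy sketch of this statistic follows. It assumes that L is stored with one estimable function per row and has full row rank, so that the covariance matrix of L\hat{\beta} is L\hat{V}L' and can be inverted. The function name `wald_f` and its arguments are hypothetical, and the default denominator degrees of freedom are used (no DF= override).

```python
import numpy as np

def wald_f(L, beta, V, n_clusters, n_strata):
    """Return (F, numerator DF, denominator DF) for H0: L beta = 0."""
    L = np.atleast_2d(np.asarray(L, dtype=float))
    beta = np.asarray(beta, dtype=float)
    V = np.asarray(V, dtype=float)

    Lb = L @ beta                              # estimate of L beta
    cov_Lb = L @ V @ L.T                       # covariance of L beta-hat
    rank = int(np.linalg.matrix_rank(L))
    # Quadratic form (Lb)' cov^{-1} (Lb), computed without an explicit inverse.
    F = float(Lb @ np.linalg.solve(cov_Lb, Lb)) / rank
    return F, rank, n_clusters - n_strata

# Test the second coefficient of a two-parameter model.
F, num_df, den_df = wald_f([[0, 1]], beta=[1.0, 2.0],
                           V=np.diag([1.0, 0.5]),
                           n_clusters=10, n_strata=3)
```

For a single-row L the statistic is the squared estimate divided by its variance, here 2² / 0.5.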

Multiple R-squared

PROC SURVEYREG computes a multiple R-squared for the weighted regression as
R^2 = 1 - \frac{{\rm SS_{error}}}{{\rm SS_{total}}}
where SSerror is the error sum of squares in the ANOVA table
SSerror = r'Wr
and SStotal is the total sum of squares
{\rm SS_{total}} =
 \begin{cases}
 y'Wy & \text{if no intercept} \\
 y'Wy - \left( \sum_{h=1}^H \sum_{i=1}^{n_h} \sum_{j=1}^{m_{hi}} w_{hij} y_{hij} \right)^2 / w_{\cdot\cdot\cdot} & \text{if there is an intercept}
 \end{cases}
where w··· is the sum of the sampling weights over all observations.
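The two sums of squares translate directly into code. The sketch below uses plain Python lists; the function name `weighted_r_squared` and the `fitted` argument (the fitted values, so that the residual is y minus fitted) are hypothetical.

```python
def weighted_r_squared(y, fitted, w, intercept=True):
    """Weighted multiple R-squared: 1 - SS_error / SS_total."""
    # SS_error = r'Wr, with residuals r = y - fitted.
    ss_error = sum(wi * (yi - fi) ** 2 for wi, yi, fi in zip(w, y, fitted))

    # SS_total = y'Wy, corrected for the weighted mean when an intercept is present.
    ss_total = sum(wi * yi ** 2 for wi, yi in zip(w, y))
    if intercept:
        w_sum = sum(w)                                   # sum of all weights
        ss_total -= sum(wi * yi for wi, yi in zip(w, y)) ** 2 / w_sum

    return 1 - ss_error / ss_total
```

With an intercept, a model whose fitted values all equal the weighted mean of y gives an R-squared of 0, and a perfect fit gives 1.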

Root Mean Square Errors

PROC SURVEYREG computes the square root of mean square errors as
\sqrt{\rm MSE} = \sqrt{ \frac{n \, {\rm SS_{error}}}{(n-p) \, w_{\cdot\cdot\cdot}} }
where w··· is the sum of the sampling weights over all observations.
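As a quick illustration, the formula is a one-liner in Python. The function name `root_mse` and its argument names are hypothetical.

```python
import math

def root_mse(ss_error, n, p, w_sum):
    """Root MSE = sqrt(n * SS_error / ((n - p) * w_sum)).

    ss_error: weighted error sum of squares r'Wr
    n, p:     number of observations and number of parameters
    w_sum:    sum of the sampling weights over all observations
    """
    return math.sqrt(n * ss_error / ((n - p) * w_sum))
```

When the weights sum to n (for example, unit weights), the factor n / w_sum is 1 and the expression reduces to the usual sqrt(SS_error / (n - p)).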

Design Effect

If you specify the DEFF option in the MODEL statement, PROC SURVEYREG calculates the design effects for the regression coefficients. The design effect of an estimate is the ratio of the actual variance to the variance computed under the assumption of simple random sampling.
{\rm DEFF} = \frac{\text{Variance under the Sample Design}}{\text{Variance under Simple Random Sampling}}

Refer to Kish (1965, p. 258). PROC SURVEYREG computes the numerator as described in the section "Variance Estimation." The denominator is computed under the assumption that the sample design is simple random sampling, with no stratification and no clustering.

To compute the variance under the assumption of simple random sampling, PROC SURVEYREG calculates the sampling rate as follows. If you specify both sampling weights and sampling rates (or population totals) for the analysis, then the sampling rate under simple random sampling is calculated as

f_{\rm SRS} = n / w_{\cdot\cdot\cdot}
where n is the sample size and w··· (the sum of the weights over all observations) estimates the population size. If the sum of the weights is less than the sample size, f_{\rm SRS} is set to zero. If you specify sampling rates for the analysis but not sampling weights, then PROC SURVEYREG computes the sampling rate under simple random sampling as the average of the stratum sampling rates.
f_{\rm SRS} = \frac{1}{H} \sum_{h=1}^H f_h
If you do not specify sampling rates (or population totals) for the analysis, then the sampling rate under simple random sampling is assumed to be zero.
f_{\rm SRS} = 0
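The three cases can be collected into one small function. This is a sketch of the decision rule only; the function name `srs_sampling_rate` and its arguments are hypothetical, and `stratum_rates` is assumed to already hold the f_h values (whether they were supplied as rates or derived from population totals).

```python
def srs_sampling_rate(n, weights=None, stratum_rates=None):
    """Sampling rate assumed under simple random sampling for the DEFF denominator.

    n:             sample size
    weights:       optional list of sampling weights
    stratum_rates: optional list of stratum sampling rates f_h
    """
    if stratum_rates is None:
        # No rates or totals specified: the rate is taken to be zero.
        return 0.0
    if weights is not None:
        # Weights and rates both specified: n over the estimated population size.
        w_sum = sum(weights)
        return 0.0 if w_sum < n else n / w_sum
    # Rates but no weights: average of the stratum sampling rates.
    return sum(stratum_rates) / len(stratum_rates)
```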

Sampling Rate of the Pooled Stratum from Collapse

Assuming that PROC SURVEYREG collapses single-unit strata h1, h2, ... , hc into the pooled stratum, the procedure calculates the sampling rate for the pooled stratum as
f_{\rm Pooled\ Stratum} =
 \begin{cases}
 0 & \text{if any of } f_{h_l} = 0 \text{ where } l = 1, 2, \ldots, c \\
 \left( \sum_{l=1}^c n_{h_l} f_{h_l}^{-1} \right)^{-1} \sum_{l=1}^c n_{h_l} & \text{otherwise}
 \end{cases}
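Because n_{h_l} / f_{h_l} is the population total implied for stratum h_l, the nonzero case is the pooled sample size divided by the pooled implied total. A small Python sketch, with the hypothetical function name `pooled_sampling_rate`:

```python
def pooled_sampling_rate(sizes, rates):
    """Sampling rate for a pooled stratum built from collapsed strata.

    sizes: list of n_{h_l}, the sample sizes of the collapsed strata
    rates: list of f_{h_l}, their sampling rates (same order)
    """
    if any(f == 0 for f in rates):
        # Any zero rate makes the pooled rate zero.
        return 0.0
    implied_total = sum(n / f for n, f in zip(sizes, rates))   # sum of n / f
    return sum(sizes) / implied_total
```

For example, two single-unit strata with rates 0.5 and 0.25 imply totals of 2 and 4, so the pooled rate is 2 / 6.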

Contrasts

You can use the CONTRAST statement to perform custom hypothesis tests. If the hypothesis is testable in the univariate case, the Wald F statistic for H_0\colon L\beta = 0 is computed as
F_{\rm Wald} = \frac{(L_{\rm Full}\hat{\beta})' (L_{\rm Full}\hat{V}L_{\rm Full}')^{-1} (L_{\rm Full}\hat{\beta})}{{\rm rank}(L)}
where L is the contrast vector or matrix you specify, \beta is the vector of regression parameters, \hat{\beta} = (X'WX)^- X'Wy, \hat{V} is the estimated covariance matrix of \hat{\beta}, rank(L) is the rank of L, and L_{\rm Full} is a matrix such that
- L_{\rm Full} has the same number of columns as L
- L_{\rm Full} has full row rank
- the rank of L_{\rm Full} equals the rank of the L matrix
- all rows of L_{\rm Full} are estimable functions
- the Wald F statistic computed using the L_{\rm Full} matrix is equivalent to the Wald F statistic computed using the L matrix with any row deleted that is a linear combination of previous rows
If L is a full-rank matrix, and all rows of L are estimable functions, then L_{\rm Full} is the same as L. It is possible that the L_{\rm Full} matrix cannot be constructed for contrasts in a CONTRAST statement, in which case the contrasts are not testable.
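The last property suggests one simple way to picture a full-row-rank matrix derived from L: walk through the rows in order and drop any row that is a linear combination of the rows already kept. The NumPy sketch below does only that. It is not a description of how PROC SURVEYREG builds the matrix, and it does not check estimability of the rows; the function name `drop_dependent_rows` and the tolerance are assumptions.

```python
import numpy as np

def drop_dependent_rows(L, tol=1e-10):
    """Return the rows of L that are not linear combinations of earlier rows."""
    L = np.atleast_2d(np.asarray(L, dtype=float))
    kept = []
    for row in L:
        trial = np.array(kept + [row])
        # Keep the row only if it raises the rank of the rows kept so far.
        if np.linalg.matrix_rank(trial, tol=tol) == len(trial):
            kept.append(row)
    return np.array(kept)

# The third contrast is the sum of the first two, so it is dropped.
L = [[1, -1, 0],
     [0, 1, -1],
     [1, 0, -1]]
L_reduced = drop_dependent_rows(L)
```

The reduced matrix has the same number of columns and the same rank as L, and it has full row rank, which matches the first three properties in the list above.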


Copyright © 1999 by SAS Institute Inc., Cary, NC, USA. All rights reserved.