Statistical Computation

The SURVEYMEANS Procedure

The SURVEYMEANS procedure uses the Taylor expansion method to estimate sampling errors of estimators based on complex sample designs. This method obtains a linear approximation for the estimator and then uses the variance estimate for this approximation to estimate the variance of the estimate itself (Woodruff 1971, Fuller 1975). When there are clusters, or PSUs, in the sample design, the procedure estimates variance from the variation among PSUs. When the design is stratified, the procedure pools stratum variance estimates to compute the overall variance estimate. For t tests of the estimates, the degrees of freedom equals the number of clusters minus the number of strata in the sample design.

For a multistage sample design, the variance estimation method depends only on the first stage of the sample design. So, the required input includes only first-stage cluster (PSU) and first-stage stratum identification. You do not need to input design information about any additional stages of sampling. This variance estimation method assumes that the first-stage sampling fraction is small, or the first-stage sample is drawn with replacement, as it often is in practice.

For more information on the analysis of sample survey data, see Lee, Forthofer, and Lorimor (1989), Cochran (1977), Kish (1965), and Hansen, Hurwitz, and Madow (1953).

Definitions and Notation

For a stratified clustered sample design, together with the sampling weights, the sample can be represented by an n × (P+1) matrix

$(w, Y) = ( w_{hij}, y_{hij} ) = ( w_{hij}, y_{hij}^{(1)}, y_{hij}^{(2)}, \ldots , y_{hij}^{(P)} )$

where

h = 1, 2, ... , H is the stratum number, with a total of H strata
i = 1, 2, ... , n_h is the cluster number within stratum h, with a total of n_h clusters
j = 1, 2, ... , m_hi is the unit number within cluster i of stratum h, with a total of m_hi units
p = 1, 2, ... , P is the analysis variable number, with a total of P variables
$n=\sum_{h=1}^H \sum_{i=1}^{n_h} {m_{hi}}$ is the total number of observations in the sample
w_hij denotes the sampling weight for observation j in cluster i of stratum h
y_hij = ( y_hij^(1), y_hij^(2), ... , y_hij^(P) ) are the observed values of the analysis variables for observation j in cluster i of stratum h, including both the values of numerical variables and the values of indicator variables for levels of categorical variables.

For a categorical variable C, let l denote the number of levels of C, and denote the level values as c₁, c₂, ... , c_l. Then there are l indicator variables associated with these levels. That is, for level C=c_k (k = 1, 2, ... , l), a y^(q) $(q\in\{1, 2, ... , P\})$ contains the values of the indicator variable for the category C=c_k, with the value of observation j in cluster i of stratum h:

$y_{hij}^{(q)} = I_{\{C=c_k\}}(h,i,j) = \begin{cases} 1 & \text{if } C_{hij}=c_k \\ 0 & \text{otherwise} \end{cases}$

Therefore, the total number of analysis variables, P, is the total number of numerical variables plus the total number of levels of all categorical variables.
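The count of analysis variables can be illustrated with a minimal Python sketch. This is not SAS code and not the procedure's implementation; the helper name `indicator_columns`, the variable `region`, and the assumed count of two numerical variables are all hypothetical.

```python
# Hypothetical helper: expand a categorical variable C into one 0/1
# indicator column per level c_1, ..., c_l, following the definition above.
def indicator_columns(values):
    levels = sorted(set(values))  # the distinct level values of C
    return {c: [1 if v == c else 0 for v in values] for c in levels}

region = ["east", "west", "east", "north"]  # made-up observed values of C
cols = indicator_columns(region)

num_numeric = 2                  # assumed number of numerical variables
P = num_numeric + len(cols)      # numerical variables plus levels of C
```

Each observation has exactly one indicator equal to 1, so the indicator columns for one categorical variable sum to 1 row by row.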

Also, f_h denotes the sampling rate for stratum h. You can use the TOTAL= option or the RATE= option to input population totals or sampling rates. See the section "Specification of Population Totals and Sampling Rates" for details. If you input stratum totals, PROC SURVEYMEANS computes f_h as the ratio of the stratum sample size to the stratum total. If you input stratum sampling rates, PROC SURVEYMEANS uses these values directly for f_h. If you do not specify the TOTAL= option or the RATE= option, then the procedure assumes that the stratum sampling rates f_h are negligible, and a finite population correction is not used when computing variances.
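The three cases for f_h can be sketched in Python as follows. The function `sampling_rates` and its arguments are hypothetical stand-ins for the TOTAL= and RATE= inputs, and raising an error when both are supplied is a choice made for this sketch, not documented behavior.

```python
def sampling_rates(sample_sizes, totals=None, rates=None):
    """Return a dict mapping stratum -> f_h (illustrative only).

    sample_sizes: stratum -> stratum sample size
    totals:       stratum -> stratum total (stands in for TOTAL=)
    rates:        stratum -> sampling rate (stands in for RATE=)
    """
    if totals is not None and rates is not None:
        # Design choice for this sketch only.
        raise ValueError("give totals or rates, not both")
    if totals is not None:
        # f_h is the ratio of the stratum sample size to the stratum total.
        return {h: sample_sizes[h] / totals[h] for h in sample_sizes}
    if rates is not None:
        # Rates are used directly.
        return {h: rates[h] for h in sample_sizes}
    # Neither given: rates are treated as negligible, so no
    # finite population correction is applied.
    return {h: 0.0 for h in sample_sizes}

sizes = {1: 10, 2: 5}                                   # made-up sample sizes
f_from_totals = sampling_rates(sizes, totals={1: 100, 2: 20})
f_default = sampling_rates(sizes)
```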

This notation is also applicable to other sample designs. For example, for a sample design without stratification, you can let H=1; for a sample design without clusters, you can let m_hi=1 for every h and i.

Mean

When you specify the keyword MEAN, the procedure computes the estimate of the mean (mean per element) from the survey data. Also, the procedure computes the mean by default if you do not specify any statistic-keywords in the PROC SURVEYMEANS statement.

PROC SURVEYMEANS computes the estimate of the mean as

$\hat{\bar{Y}} = ( \sum_{h=1}^H\sum_{i=1}^{n_h} \sum_{j=1}^{m_{hi}} w_{hij} y_{hij} ) / w_{\cdot\cdot\cdot}$

where

$w_{\cdot\cdot\cdot} = \sum_{h=1}^H\sum_{i=1}^{n_h} \sum_{j=1}^{m_{hi}} w_{hij}$

is the sum of the weights over all observations in the sample.
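As a small numerical illustration, the estimator can be written in a few lines of Python. The data, the tuple layout, and the function name `weighted_mean` are made up for this sketch; it is not the procedure's implementation.

```python
# Made-up sample: (stratum h, cluster i, weight w_hij, value y_hij)
sample = [
    (1, "A", 2.0, 1.0), (1, "A", 2.0, 3.0), (1, "B", 2.0, 5.0),
    (2, "C", 1.0, 2.0), (2, "D", 1.0, 4.0), (2, "D", 1.0, 6.0),
]

def weighted_mean(rows):
    # Denominator: sum of the weights over all observations.
    w_total = sum(w for _, _, w, _ in rows)
    # Numerator: weighted sum of the observed values.
    return sum(w * y for _, _, w, y in rows) / w_total

mean = weighted_mean(sample)  # weighted sum 30 divided by weight sum 9
```

The stratum and cluster labels play no role in the point estimate; they matter only for the variance.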

Variance and Standard Error of the Mean

When you specify the keyword STDERR, the procedure computes the standard error of the mean. Also, the procedure computes the standard error by default if you specify the keyword MEAN, or if you do not specify any statistic-keywords in the PROC SURVEYMEANS statement. The keyword VAR requests the variance of the mean.

PROC SURVEYMEANS uses the Taylor series expansion theory to estimate the variance of the mean $\hat{\bar{Y}}$. The procedure computes the estimated variance as

$\hat{V}(\hat{\bar{Y}}) = \sum_{h=1}^H { \frac{n_h(1-f_h)}{n_h-1} \sum_{i=1}^{n_h} {(e_{hi\cdot}-\bar{e}_{h\cdot\cdot})^2}}$

where

$e_{hi\cdot} = ( \sum_{j=1}^{m_{hi}} w_{hij}(y_{hij}- \hat{\bar{Y}}) ) / w_{\cdot\cdot\cdot}$

$\bar{e}_{h\cdot\cdot} = ( \sum_{i=1}^{n_h} e_{hi\cdot} ) / n_h$

The standard error of the mean is the square root of the estimated variance.

${StdErr}(\hat{{\bar{Y}}})= \sqrt{\hat{V}(\hat{\bar{Y}})}$
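The variance formula can be traced step by step with a short Python sketch. The data and the function name `variance_of_mean` are hypothetical, the sketch assumes every stratum contains at least two clusters, and it is meant only to mirror the formulas in this section, not to reproduce the procedure's code.

```python
from collections import defaultdict
from math import sqrt

# Made-up sample: (stratum h, cluster i, weight w_hij, value y_hij)
sample = [
    (1, "A", 2.0, 1.0), (1, "A", 2.0, 3.0), (1, "B", 2.0, 5.0),
    (2, "C", 1.0, 2.0), (2, "D", 1.0, 4.0), (2, "D", 1.0, 6.0),
]

def variance_of_mean(rows, rates=None):
    """Taylor-linearized variance of the weighted mean.

    rates maps stratum -> f_h; a missing stratum is treated as f_h = 0.
    Assumes at least two clusters per stratum (n_h - 1 > 0).
    """
    rates = rates or {}
    w_total = sum(w for _, _, w, _ in rows)
    mean = sum(w * y for _, _, w, y in rows) / w_total
    # e_hi. : weighted residuals summed within each cluster, over w_...
    e = defaultdict(lambda: defaultdict(float))
    for h, i, w, y in rows:
        e[h][i] += w * (y - mean) / w_total
    variance = 0.0
    for h, clusters in e.items():
        n_h = len(clusters)
        e_bar = sum(clusters.values()) / n_h          # e-bar_h..
        ss = sum((v - e_bar) ** 2 for v in clusters.values())
        variance += n_h * (1.0 - rates.get(h, 0.0)) / (n_h - 1) * ss
    return variance

var_mean = variance_of_mean(sample)   # no rates given: f_h = 0 everywhere
std_err = sqrt(var_mean)
```

Note that the variation is measured among cluster-level quantities within each stratum, which is why only first-stage identifiers are needed.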

t Test for the Mean

If you specify the keyword T, PROC SURVEYMEANS computes the t value for testing that the population mean equals zero, $H_0: \bar{Y} = 0$. The test statistic equals

$t(\hat{{\bar{Y}}}) = {\hat{{\bar{Y}}}} / {{StdErr}(\hat{{\bar{Y}}})}$

The two-sided p-value for this test is

$\mathrm{Prob}(|T| > |t(\hat{\bar{Y}})|)$

where T is a random variable with the t distribution with df degrees of freedom.

PROC SURVEYMEANS calculates the degrees of freedom for the t test as the number of clusters minus the number of strata. If there are no clusters, then df equals the number of observations minus the number of strata. If the design is not stratified, then df equals the number of clusters minus one. The procedure displays df for the t test if you specify the keyword DF in the PROC SURVEYMEANS statement.
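The degrees-of-freedom rule and the test statistic can be summarized in a brief Python sketch. The function names and the numeric inputs are hypothetical; an unstratified design is represented by a single stratum, as in the section "Definitions and Notation".

```python
def degrees_of_freedom(n_strata, n_clusters=None, n_obs=None):
    """df = clusters minus strata; with no clusters, observations minus strata."""
    units = n_clusters if n_clusters is not None else n_obs
    return units - n_strata

def t_statistic(mean, std_err):
    # Test statistic for H0: population mean equals zero.
    return mean / std_err

df_clustered = degrees_of_freedom(n_strata=2, n_clusters=4)
df_no_clusters = degrees_of_freedom(n_strata=2, n_obs=6)
df_unstratified = degrees_of_freedom(n_strata=1, n_clusters=4)

t_value = t_statistic(3.0, 1.5)   # made-up mean and standard error
```

The p-value additionally requires the cumulative t distribution, which is left out here to keep the sketch free of external libraries.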

Confidence Limits for the Mean

If you specify the keyword CLM, the procedure computes confidence limits for the mean. Also, the procedure includes the confidence limits by default if you do not specify any statistic-keywords in the PROC SURVEYMEANS statement.

The confidence coefficient is determined by the value of the ALPHA= option, which by default equals 0.05 and produces 95% confidence limits. The confidence limits are computed as

$\hat{\bar{Y}} \pm {StdErr}(\hat{\bar{Y}}) \cdot t_{df,\,\alpha/2}$

where $\hat{\bar{Y}}$ is the estimate of the mean, ${StdErr}(\hat{{\bar{Y}}})$ is the standard error of the mean, and $t_{df,\,\,\alpha/2}$ is the $100(1-\alpha/2)$ percentile of the t distribution with df calculated as described in the section "t Test for the Mean".
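A minimal Python sketch of the interval computation follows. The function name and the inputs are hypothetical, and the critical value is passed in rather than computed: 4.303 is the familiar table value for df = 2 and alpha = 0.05, used here only as an example.

```python
def confidence_limits(estimate, std_err, t_crit):
    """Return (lower, upper) = estimate -/+ std_err * t_crit."""
    half_width = std_err * t_crit
    return estimate - half_width, estimate + half_width

# Made-up estimate and standard error; t_crit taken from a t table
# for df = 2 and alpha = 0.05 (two-sided).
lower, upper = confidence_limits(estimate=10.0, std_err=2.0, t_crit=4.303)
```

With few clusters the degrees of freedom are small, so the critical value, and therefore the interval, can be much wider than the normal-theory value of about 1.96 would suggest.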

Coefficient of Variation

If you specify the keyword CV, PROC SURVEYMEANS computes the coefficient of variation, which is the ratio of the standard error of the mean to the estimated mean.

$cv = {StdErr}(\hat{{\bar{Y}}}) /\hat{{\bar{Y}}}$

Proportions

If you specify the keyword MEAN for a categorical variable, PROC SURVEYMEANS estimates the proportion, or relative frequency, for each level of the categorical variable. If you do not specify any statistic-keywords in the PROC SURVEYMEANS statement, the procedure estimates the proportions for levels of the categorical variables, together with their standard errors and confidence limits.

The procedure estimates the proportion in level c_k for variable C as

$\hat p=\frac{\sum_{h=1}^H\sum_{i=1}^{n_h} \sum_{j=1}^{m_{hi}} w_{hij} y_{hij}^{(q)}} {\sum_{h=1}^H\sum_{i=1}^{n_h} \sum_{j=1}^{m_{hi}} w_{hij}}$

where y_hij^(q) is the value of the indicator function for level C=c_k, defined in the section "Definitions and Notation": y_hij^(q) equals 1 if the observed value of variable C equals c_k, and 0 otherwise. Because the proportion estimator is an estimator of the mean of an indicator variable, the procedure computes its variance and standard error according to the method outlined in the section "Variance and Standard Error of the Mean". Similarly, the procedure computes confidence limits for proportions as described in the section "Confidence Limits for the Mean".
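To make the link between proportions and means concrete, here is a short Python sketch with made-up weights and level values. The function name `proportion` is hypothetical and the code is illustrative only.

```python
weights = [2.0, 2.0, 2.0, 1.0, 1.0, 1.0]   # made-up sampling weights
levels = ["a", "b", "a", "a", "b", "b"]    # made-up values of variable C

def proportion(weights, values, level):
    # Indicator variable: 1 if the observation is in the level, else 0.
    indicator = [1.0 if v == level else 0.0 for v in values]
    # Weighted mean of the indicator is the estimated proportion.
    return sum(w * y for w, y in zip(weights, indicator)) / sum(weights)

p_a = proportion(weights, levels, "a")
p_b = proportion(weights, levels, "b")
```

Because every observation falls in exactly one level, the estimated proportions over all levels of a variable add up to 1.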

Total

If you specify the keyword SUM, the procedure computes the estimate of the population total from the survey data. The estimate of the total is the weighted sum over the sample.

$\hat{Y} = \sum_{h=1}^H\sum_{i=1}^{n_h} \sum_{j=1}^{m_{hi}} w_{hij} y_{hij}$

For a categorical variable level, $\hat{Y}$ estimates its total frequency in the population.

Variance and Standard Deviation of the Total

When you specify the keyword STD or the keyword SUM, the procedure estimates the standard deviation of the total. The keyword VARSUM requests the variance of the total.

PROC SURVEYMEANS estimates the variance of the total as

$\hat{V}(\hat{Y}) = \sum_{h=1}^H { \frac{n_h(1-f_h)}{n_h-1} \sum_{i=1}^{n_h} {(y_{hi\cdot}-\bar{y}_{h\cdot\cdot})^2}}$

where

$y_{hi\cdot} = \sum_{j=1}^{m_{hi}} w_{hij} y_{hij}$

$\bar{y}_{h\cdot\cdot} = ( \sum_{i=1}^{n_h} y_{hi\cdot} ) / n_h$

The standard deviation of the total equals

${Std}(\hat{Y})= \sqrt{\hat{V}(\hat{Y})}$
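The total and its variance can be worked through on a tiny made-up sample in Python. The function name `total_and_variance` and the data are hypothetical, the sketch assumes at least two clusters in every stratum, and it is not the procedure's implementation.

```python
from collections import defaultdict
from math import sqrt

# Made-up sample: (stratum h, cluster i, weight w_hij, value y_hij)
sample = [
    (1, "A", 2.0, 1.0), (1, "A", 2.0, 3.0), (1, "B", 2.0, 5.0),
    (2, "C", 1.0, 2.0), (2, "D", 1.0, 4.0), (2, "D", 1.0, 6.0),
]

def total_and_variance(rows, rates=None):
    """Return (estimated total, estimated variance of the total).

    rates maps stratum -> f_h; a missing stratum is treated as f_h = 0.
    """
    rates = rates or {}
    # y_hi. : weighted total of each cluster
    y = defaultdict(lambda: defaultdict(float))
    for h, i, w, v in rows:
        y[h][i] += w * v
    total = sum(sum(c.values()) for c in y.values())
    variance = 0.0
    for h, clusters in y.items():
        n_h = len(clusters)
        y_bar = sum(clusters.values()) / n_h          # y-bar_h..
        ss = sum((t - y_bar) ** 2 for t in clusters.values())
        variance += n_h * (1.0 - rates.get(h, 0.0)) / (n_h - 1) * ss
    return total, variance

total, var_total = total_and_variance(sample)
std_total = sqrt(var_total)
# Same data with an assumed sampling rate of 0.5 in stratum 1:
_, var_fpc = total_and_variance(sample, rates={1: 0.5})
```

Supplying a nonzero f_h shrinks that stratum's contribution by the factor (1 - f_h), which is the finite population correction.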

Confidence Limits of a Total

If you specify the keyword CLSUM, the procedure computes confidence limits for the total. The confidence coefficient is determined by the value of the ALPHA= option, which by default equals 0.05 and produces 95% confidence limits. The confidence limits are computed as

$\hat{Y} \pm {Std}(\hat{Y}) \cdot t_{df,\,\alpha/2}$

where $\hat{Y}$ is the estimate of the total, ${Std}(\hat{Y})$ is the estimated standard deviation, and $t_{df,\,\,\alpha/2}$ is the $100(1-\alpha/2)$ percentile of the t distribution with df calculated as described in the section "t Test for the Mean".
