The SURVEYSELECT Procedure

PROC SURVEYSELECT Statement

PROC SURVEYSELECT options ;
The PROC SURVEYSELECT statement invokes the procedure and optionally identifies input and output data sets. If you do not name a DATA= input data set, the procedure selects the sample from the most recently created SAS data set. If you do not name an OUT= output data set to contain the sample of selected units, the procedure still creates an output data set and names it according to the DATAn convention.

The PROC SURVEYSELECT statement also specifies the sample selection method, the sample size, and other sample design parameters. If you do not specify a selection method, PROC SURVEYSELECT uses simple random sampling (METHOD=SRS) if there is no SIZE statement. If you specify a SIZE statement but do not specify a selection method, PROC SURVEYSELECT uses probability proportional to size selection without replacement (METHOD=PPS). You must specify the sample size or sampling rate unless you request a method that selects two units from each stratum (METHOD=PPS_BREWER or METHOD=PPS_MURTHY).
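As a minimal sketch of the options described so far, the following invocation draws a simple random sample. The data set names Customers and SampleSRS and the seed value are hypothetical placeholders, and the frame is assumed to contain at least 100 observations.

```sas
/* Customers is a hypothetical sampling frame with one row per unit. */
proc surveyselect data=Customers   /* input data set (sampling frame)    */
                  method=srs       /* simple random sampling, explicit   */
                  sampsize=100     /* number of units to select          */
                  seed=1953        /* fixed seed so the draw can be rerun */
                  out=SampleSRS;   /* output data set with the sample    */
run;
```

Because there is no SIZE statement, omitting METHOD=SRS here should give the same design by default.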

You can use the SAMPSIZE=n option to specify the sample size, or you can use the SAMPSIZE=SAS-data-set option to name a secondary input data set that contains stratum sample sizes. You can also specify stratum sampling rates, minimum size measures, maximum size measures, and certainty size measures in the secondary input data set. See the descriptions of the SAMPSIZE=, SAMPRATE=, MINSIZE=, MAXSIZE=, and CERTSIZE= options. You can name only one secondary input data set in each invocation of the procedure.

The following table lists the options available with the PROC SURVEYSELECT statement. Descriptions follow in alphabetical order.

Table 63.1: PROC SURVEYSELECT Statement Options

Task                               Options
Specify the input data set         DATA=
Specify output data sets           OUT=
                                   OUTSORT=
Suppress displayed output          NOPRINT
Specify selection method           METHOD=
Specify sample size                SAMPSIZE=
Specify sampling rate              SAMPRATE=
                                   NMIN=
                                   NMAX=
Specify number of replicates       REP=
Adjust size measures               MINSIZE=
                                   MAXSIZE=
Specify certainty size measures    CERTSIZE=
Specify sorting type               SORT=
Specify random number seed         SEED=
Control OUT= contents              JTPROBS
                                   OUTHITS
                                   OUTSIZE
                                   STATS


You can specify the following options in the PROC SURVEYSELECT statement.

CERTSIZE
requests automatic selection of those units with size measures greater than or equal to the stratum certainty size measures, which you provide in the secondary input data set variable _CERTSIZE_. Use the CERTSIZE option when you have already named the secondary input data set in another option, such as SAMPSIZE=SAS-data-set, SAMPRATE=SAS-data-set, MAXSIZE=SAS-data-set, or MINSIZE=SAS-data-set. You can name only one secondary input data set in each invocation of the procedure.

If any size measure is greater than or equal to the certainty size measure for its stratum, then PROC SURVEYSELECT selects this unit with certainty. Each certainty size measure must be a positive number. The CERTSIZE option is available for METHOD=PPS and METHOD=PPS_SAMPFORD.

If you want to specify a single certainty size measure in the PROC SURVEYSELECT statement, use the CERTSIZE=certain option.

CERTSIZE=certain
specifies the certainty size measure. PROC SURVEYSELECT selects with certainty any unit with size measure greater than or equal to the value certain. The certainty size measure must be a positive number. This option is available for METHOD=PPS and METHOD=PPS_SAMPFORD.

If you request a stratified sample design with a STRATA statement and specify the CERTSIZE= option, PROC SURVEYSELECT uses the certainty size certain for all strata. If you do not want to use the same certainty size for all strata, use the CERTSIZE=SAS-data-set option to specify a certainty size for each stratum.

CERTSIZE=SAS-data-set
names a SAS data set that contains the certainty size measures for the strata. PROC SURVEYSELECT selects with certainty any unit with size measure greater than or equal to the certainty size measure for its stratum. Each certainty size measure must be a positive number. This option is available for METHOD=PPS and METHOD=PPS_SAMPFORD.

The CERTSIZE= input data set should contain all the STRATA variables, with the same type and length as in the DATA= data set. The STRATA groups should appear in the same order in the CERTSIZE= data set as in the DATA= data set. The CERTSIZE= data set should have a variable named _CERTSIZE_ that contains the certainty size measure for each stratum.
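The following sketch shows one way a CERTSIZE= data set might be set up. The frame Hospitals, its variables Region (assumed to be character with length 8) and Beds, the certainty values, and the output names are all hypothetical.

```sas
/* Hypothetical stratum certainty sizes; Region must match the frame's */
/* STRATA variable in type and length (assumed here to be $8).         */
data CertSizes;
   input Region $ _CERTSIZE_;
   datalines;
East 500
West 750
;

/* The frame is sorted so the strata appear in the same order as above. */
proc sort data=Hospitals;
   by Region;
run;

proc surveyselect data=Hospitals method=pps sampsize=10
                  certsize=CertSizes seed=39281 out=HospSample;
   strata Region;   /* stratification variable     */
   size Beds;       /* size measure for PPS        */
run;
```

Under these assumptions, any East hospital with 500 or more beds and any West hospital with 750 or more beds would be selected with certainty.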

CERTSIZE=P=p
specifies the certainty proportion. PROC SURVEYSELECT selects with certainty any unit with size measure greater than or equal to the proportion p of the total size for all units in the stratum. The procedure repeats this process with the remaining units until no more certainty units are selected. This option is available for METHOD=PPS and METHOD=PPS_SAMPFORD.

The certainty proportion must be a positive number. You can specify p as a number between 0 and 1. Or you can specify p in percentage form as a number between 1 and 100, and PROC SURVEYSELECT converts that number to a proportion. The procedure treats the value 1 as 100%, and not the percentage form 1%.

If you request a stratified sample design with a STRATA statement and specify the CERTSIZE=P= option, PROC SURVEYSELECT uses the same certainty proportion p for all strata.
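A short sketch of the certainty proportion form follows. The data set Hospitals and the size variable Beds are hypothetical, and the proportion 0.10 is chosen only for illustration.

```sas
/* Hospitals and Beds are hypothetical names. Units holding at least   */
/* 10% of the total size measure are taken with certainty.             */
proc surveyselect data=Hospitals method=pps sampsize=20
                  certsize=p=0.10 seed=4701 out=HospSampleP;
   size Beds;
run;
```

Following the percentage rule described above, CERTSIZE=P=10 should request the same 10% proportion, whereas CERTSIZE=P=1 would be read as 100%.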

DATA=SAS-data-set
names the SAS data set from which PROC SURVEYSELECT selects the sample. If you omit the DATA= option, the procedure uses the most recently created SAS data set. In sampling terminology, the input data set is the sampling frame, or list of units from which the sample is selected.

JTPROBS
includes joint probabilities of selection in the OUT= output data set. This option is available for the following probability proportional to size selection methods: METHOD=PPS, METHOD=PPS_SAMPFORD, and METHOD=PPS_WR. By default, PROC SURVEYSELECT outputs joint selection probabilities for METHOD=PPS_BREWER and METHOD=PPS_MURTHY, which select two units per stratum. For more information on the contents of the output data set, see the section "Output Data Set".

MAXSIZE
requests size measure adjustment by stratum maximum size measures, which you provide in the secondary input data set variable _MAXSIZE_. Use the MAXSIZE option when you have already named the secondary input data set in another option, such as SAMPSIZE=SAS-data-set, SAMPRATE=SAS-data-set, MINSIZE=SAS-data-set, or CERTSIZE=SAS-data-set. You can name only one secondary input data set in each invocation of the procedure.

If any size measure exceeds the maximum size measure for its stratum, then PROC SURVEYSELECT adjusts this size measure downward to equal the maximum size measure. Each maximum size measure must be a positive number. The MAXSIZE option is available whenever you specify a SIZE statement for probability proportional to size selection and a STRATA statement for stratification.

If you want to specify a single maximum size value in the PROC SURVEYSELECT statement, use the MAXSIZE=max option.

MAXSIZE=max
specifies the maximum allowable size measure. If any size measure exceeds the value max, then PROC SURVEYSELECT adjusts this size measure to equal max. The maximum size measure must be a positive number. This option is available whenever you specify a SIZE statement for selection with probability proportional to size.

If you request a stratified sample design with a STRATA statement and specify the MAXSIZE= option, PROC SURVEYSELECT uses the maximum size max for all strata. If you do not want to use the same maximum size for all strata, use the MAXSIZE=SAS-data-set option to specify a maximum size for each stratum.

MAXSIZE=SAS-data-set
names a SAS data set that contains the maximum allowable size measures for the strata. If any size measure exceeds the maximum size measure for its stratum, then PROC SURVEYSELECT adjusts this size measure downward to equal the maximum size measure. Each maximum size measure must be a positive number. This option is available whenever you specify a SIZE statement for probability proportional to size selection and a STRATA statement for stratified selection.

The MAXSIZE= input data set should contain all the STRATA variables, with the same type and length as in the DATA= data set. The STRATA groups should appear in the same order in the MAXSIZE= data set as in the DATA= data set. The MAXSIZE= data set should have a variable named _MAXSIZE_ that contains the maximum size measure for each stratum.

METHOD=name
M=name
specifies the method for sample selection. If you do not specify the METHOD= option, by default PROC SURVEYSELECT uses simple random sampling (METHOD=SRS) if there is no SIZE statement. If you specify a SIZE statement, the default selection method is probability proportional to size without replacement (METHOD=PPS). Valid values for name are as follows:

PPS
requests selection with probability proportional to size and without replacement. See the section "PPS Sampling without Replacement" for details. If you specify METHOD=PPS, you must name the size measure variable in the SIZE statement.

PPS_BREWER | BREWER
requests selection according to Brewer's method. Brewer's method selects two units from each stratum with probability proportional to size and without replacement. See the section "Brewer's PPS Method" for details. If you specify METHOD=PPS_BREWER, you must name the size measure variable in the SIZE statement. You do not need to specify the sample size with the SAMPSIZE= option, since Brewer's method selects two units from each stratum.

PPS_MURTHY | MURTHY
requests selection according to Murthy's method. Murthy's method selects two units from each stratum with probability proportional to size and without replacement. See the section "Murthy's PPS Method" for details. If you specify METHOD=PPS_MURTHY, you must name the size measure variable in the SIZE statement. You do not need to specify the sample size with the SAMPSIZE= option, since Murthy's method selects two units from each stratum.

PPS_SAMPFORD | SAMPFORD
requests selection according to Sampford's method. Sampford's method selects units with probability proportional to size and without replacement. See the section "Sampford's PPS Method" for details. If you specify METHOD=PPS_SAMPFORD, you must name the size measure variable in the SIZE statement.

PPS_SEQ | CHROMY
requests sequential selection with probability proportional to size and with minimum replacement. This method is also known as Chromy's method. See the section "PPS Sequential Sampling" for details. If you specify METHOD=PPS_SEQ, you must name the size measure variable in the SIZE statement.

PPS_SYS
requests systematic selection with probability proportional to size. See the section "PPS Systematic Sampling" for details on this method. If you specify METHOD=PPS_SYS, you must name the size measure variable in the SIZE statement.

PPS_WR
requests selection with probability proportional to size and with replacement. See the section "PPS Sampling with Replacement" for details on this method. If you specify METHOD=PPS_WR, you must name the size measure variable in the SIZE statement.

SEQ
requests sequential selection according to Chromy's method. If you specify METHOD=SEQ and do not specify a size measure with the SIZE statement, PROC SURVEYSELECT uses sequential zoned selection with equal probability and without replacement. See the section "Sequential Random Sampling" for details on this method. If you specify METHOD=SEQ and also name a size measure in the SIZE statement, PROC SURVEYSELECT uses METHOD=PPS_SEQ, which is sequential selection with probability proportional to size and with minimum replacement. See the section "PPS Sequential Sampling" for details on this method.

SRS
requests simple random sampling, which is selection with equal probability and without replacement. See the section "Simple Random Sampling" for details. This method is the default if you do not specify the METHOD= option and also do not specify a SIZE statement.

SYS
requests systematic random sampling. If you specify METHOD=SYS and do not specify a size measure with the SIZE statement, PROC SURVEYSELECT uses systematic selection with equal probability. See the section "Systematic Random Sampling" for details on this method. If you specify METHOD=SYS and also name a size measure in the SIZE statement, PROC SURVEYSELECT uses METHOD=PPS_SYS, which is systematic selection with probability proportional to size. See the section "PPS Systematic Sampling" for details.

URS
requests unrestricted random sampling, which is selection with equal probability and with replacement. See the section "Unrestricted Random Sampling" for details.
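To illustrate a method that fixes the sample size itself, the sketch below requests Brewer's method without a SAMPSIZE= option. The names Hospitals, Region, Beds, and BrewerSample are hypothetical, and each stratum is assumed to contain at least two units with size measures suitable for the method.

```sas
/* Hypothetical frame, sorted by the stratification variable. */
proc sort data=Hospitals;
   by Region;
run;

/* No SAMPSIZE= option: Brewer's method selects two units per stratum. */
proc surveyselect data=Hospitals method=pps_brewer
                  seed=8812 out=BrewerSample;
   strata Region;
   size Beds;      /* required for every PPS method */
run;
```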

MINSIZE
requests size measure adjustment by the stratum minimum size measures, which you provide in the secondary input data set variable _MINSIZE_. Use the MINSIZE option when you have already named the secondary input data set in another option, such as SAMPSIZE=SAS-data-set, SAMPRATE=SAS-data-set, MAXSIZE=SAS-data-set, or CERTSIZE=SAS-data-set. You can name only one secondary input data set in each invocation of the procedure.

If any size measure is less than the minimum size measure for its stratum, then PROC SURVEYSELECT adjusts this size measure upward to equal the minimum size measure. Each minimum size measure must be a positive number. The MINSIZE option is available whenever you specify a SIZE statement for probability proportional to size selection and a STRATA statement for stratification.

If you want to specify a single minimum size value in the PROC SURVEYSELECT statement, use the MINSIZE=min option.

MINSIZE=min
specifies the minimum allowable size measure. If any size measure is less than the value min, then PROC SURVEYSELECT adjusts this size measure upward to equal min. The minimum size measure must be a positive number. This option is available whenever you specify a SIZE statement for selection with probability proportional to size.

If you request a stratified sample design with a STRATA statement and specify the MINSIZE= option, PROC SURVEYSELECT uses the minimum size min for all strata. If you do not want to use the same minimum size for all strata, use the MINSIZE=SAS-data-set option to specify a minimum size for each stratum.

MINSIZE=SAS-data-set
names a SAS data set that contains the minimum allowable size measures for the strata. If any size measure is less than the minimum size measure for its stratum, then PROC SURVEYSELECT adjusts this size measure upward to equal the minimum size measure. Each minimum size measure must be a positive number. This option is available whenever you specify a SIZE statement for probability proportional to size selection and a STRATA statement for stratified selection.

The MINSIZE= input data set should contain all the STRATA variables, with the same type and length as in the DATA= data set. The STRATA groups should appear in the same order in the MINSIZE= data set as in the DATA= data set. The MINSIZE= data set should have a variable named _MINSIZE_ that contains the minimum size measure for each stratum.
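Because only one secondary input data set can be named per invocation, stratum sample sizes and size measure bounds can share a single data set. The sketch below assumes a hypothetical frame Hospitals with variables Region (character, length 8) and Beds; the data set Design and all values in it are illustrative.

```sas
/* One secondary data set carrying several stratum-level parameters. */
data Design;
   input Region $ _NSIZE_ _MINSIZE_ _MAXSIZE_;
   datalines;
East 10 20 800
West 15 25 900
;

proc sort data=Hospitals;
   by Region;
run;

/* SAMPSIZE= names the secondary data set; the MINSIZE and MAXSIZE   */
/* keywords (no value) read _MINSIZE_ and _MAXSIZE_ from that data set. */
proc surveyselect data=Hospitals method=pps
                  sampsize=Design minsize maxsize
                  seed=2718 out=HospSample2;
   strata Region;
   size Beds;
run;
```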

NMAX=n
specifies the maximum stratum sample size n for the SAMPRATE= option. When you specify the SAMPRATE= option, PROC SURVEYSELECT calculates the desired stratum sample size from the specified sampling rate and the total number of units in the stratum. If this sample size is greater than the value NMAX=n, then PROC SURVEYSELECT selects the maximum of n units.

The maximum sample size n must be a positive integer. The NMAX= option is available only with the SAMPRATE= option, which may be used with equal probability selection methods (METHOD=SRS, METHOD=URS, METHOD=SYS, and METHOD=SEQ).

NMIN=n
specifies the minimum stratum sample size n for the SAMPRATE= option. When you specify the SAMPRATE= option, PROC SURVEYSELECT calculates the desired stratum sample size from the specified sampling rate and the total number of units in the stratum. If this sample size is less than the value NMIN=n, then PROC SURVEYSELECT selects the minimum of n units.

The minimum sample size n must be a positive integer. The NMIN= option is available only with the SAMPRATE= option, which may be used with equal probability selection methods (METHOD=SRS, METHOD=URS, METHOD=SYS, and METHOD=SEQ).
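The following sketch combines a sampling rate with stratum-level bounds. Customers and State are hypothetical names, and the rate and bounds are arbitrary; each stratum is assumed to contain at least 10 units.

```sas
proc sort data=Customers;
   by State;
run;

/* 5% of each stratum, but never fewer than 10 or more than 200 units. */
proc surveyselect data=Customers method=srs
                  samprate=0.05 nmin=10 nmax=200
                  seed=60221 out=RateSample;
   strata State;
run;
```

Under these settings, a stratum of 100 units would yield a computed size of 5, which NMIN=10 raises to 10; a stratum of 10,000 units would yield 500, which NMAX=200 caps at 200.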

NOPRINT
suppresses the display of all output. You can use the NOPRINT option when you want only to create an output data set. Note that this option temporarily disables the Output Delivery System (ODS). For more information, see Chapter 15, "Using the Output Delivery System."

OUT=SAS-data-set
names the output data set that contains the sample. If you omit the OUT= option, the data set is named DATAn, where n is the smallest integer that makes the name unique.

The output data set contains the units selected for the sample, as well as design information and selection statistics, depending on the selection method and output options you specify. See the descriptions for the options JTPROBS, OUTHITS, OUTSIZE, and STATS. For information on the contents of the output data set, see the section "Output Data Set".

OUTHITS
includes a separate observation in the output data set for each selection when the same unit is selected more than once. By default, the output data set contains only one observation for each selected unit, even if it is selected more than once, and the variable NumberHits contains the number of hits or selections for that unit. The OUTHITS option is available for selection methods that select with replacement or with minimum replacement (METHOD=URS, METHOD=PPS_WR, METHOD=PPS_SYS, and METHOD=PPS_SEQ).
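A brief sketch of the OUTHITS option with a with-replacement method follows; Customers and UrsSample are hypothetical data set names.

```sas
/* With OUTHITS, a unit selected three times contributes three rows  */
/* to UrsSample instead of one row with NumberHits=3.                */
proc surveyselect data=Customers method=urs sampsize=50
                  outhits seed=31415 out=UrsSample;
run;
```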

OUTSIZE
includes additional design and sampling frame parameters in the output data set. If you specify the OUTSIZE option, PROC SURVEYSELECT includes the sample size or sampling rate in the output data set. When you request the OUTSIZE option and also specify the SIZE statement, the procedure outputs the size measure total for the sampling frame. If you do not specify the SIZE statement, the procedure outputs the total number of sampling units in the frame. Also, PROC SURVEYSELECT includes the minimum size measure if you specify the MINSIZE= option, the maximum size measure if you specify the MAXSIZE= option, and the certainty size measure if you specify the CERTSIZE= option.

If you have a stratified design, the output data set includes the stratum-level values of these parameters. Otherwise, the output data set includes the overall population-level values.

For information on the contents of the output data set, see the section "Output Data Set".

OUTSORT=SAS-data-set
names an output data set that contains the sorted input data set. This option is available when you specify a CONTROL statement for systematic or sequential selection methods (METHOD=SYS, METHOD=PPS_SYS, METHOD=SEQ, and METHOD=PPS_SEQ). PROC SURVEYSELECT sorts the input data set by the CONTROL variables within strata before selecting the sample.

If you specify CONTROL variables but do not name an output data set with the OUTSORT= option, then the sorted data set replaces the input data set.

REP=nrep
specifies the number of sample replicates. If you specify the REP= option, PROC SURVEYSELECT selects nrep independent samples, each with the same specified sample size or sampling rate and the same sample design.

You can use replicated sampling to provide a simple method of variance estimation for any form of statistic, as well as to evaluate variable nonsampling errors such as interviewer differences. Refer to Kish (1965), Kish (1987), and Kalton (1983) for information on replicated sampling.
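As an illustration of replicated sampling, the sketch below requests four independent simple random samples of 25 units each. The data set names and seed are hypothetical.

```sas
/* Four independent replicates, each an SRS of 25 units. */
proc surveyselect data=Customers method=srs sampsize=25
                  rep=4 seed=16180 out=RepSamples;
run;
```

The output data set should then hold 100 observations in total; see the section "Output Data Set" for how the replicates are identified.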

SAMPRATE=r
RATE=r
specifies the sampling rate, which is the proportion of units selected for the sample. The sampling rate r must be a positive number. You can specify r as a number between 0 and 1. Or you can specify r in percentage form as a number between 1 and 100, and PROC SURVEYSELECT converts that number to a proportion. The procedure treats the value 1 as 100%, and not the percentage form 1%.

The SAMPRATE= option is available only for equal probability selection methods (METHOD=SRS, METHOD=URS, METHOD=SYS, and METHOD=SEQ). For systematic random sampling (METHOD=SYS), PROC SURVEYSELECT uses the inverse of the sampling rate r as the interval. See the section "Systematic Random Sampling" for details. For other selection methods, PROC SURVEYSELECT converts the sampling rate r to the sample size before selection, multiplying the rate by the number of units in the stratum or frame and rounding up to the nearest integer.

If you request a stratified sample design with a STRATA statement and specify the SAMPRATE=r option, PROC SURVEYSELECT uses the sampling rate r for each stratum. If you do not want to use the same sampling rate for each stratum, use the SAMPRATE=(values) option or the SAMPRATE=SAS-data-set option to specify a sampling rate for each stratum.

SAMPRATE=(values)
RATE=(values)
specifies sampling rates for the strata. You can separate values with blanks or commas. The number of SAMPRATE= values must equal the number of strata in the input data set.

List the stratum sampling rate values in the order in which the strata appear in the input data set. If you use the SAMPRATE=(values) option, the input data set must be sorted by the STRATA variables in ascending order. You cannot use the DESCENDING or NOTSORTED options in the STRATA statement.

Each stratum sampling rate value must be a positive number. You can specify each value as a number between 0 and 1. Or you can specify a value in percentage form as a number between 1 and 100, and PROC SURVEYSELECT converts that number to a proportion. The procedure treats the value 1 as 100%, and not the percentage form 1%.

The SAMPRATE= option is available only for equal probability selection methods (METHOD=SRS, METHOD=URS, METHOD=SYS, and METHOD=SEQ). For systematic random sampling (METHOD=SYS), PROC SURVEYSELECT uses the inverse of the stratum sampling rate as the interval for the stratum. See the section "Systematic Random Sampling" for details on systematic sampling. For other selection methods, PROC SURVEYSELECT converts the stratum sampling rate to a stratum sample size before selection, multiplying the rate by the number of units in the stratum and rounding up to the nearest integer.
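The sketch below lists one sampling rate per stratum. It assumes a hypothetical frame Customers whose STRATA variable State takes exactly three values; the rates are assigned to the strata in ascending sort order.

```sas
/* The frame must be sorted in ascending order of the STRATA variables. */
proc sort data=Customers;
   by State;
run;

/* Three strata are assumed, so exactly three rates are listed. */
proc surveyselect data=Customers method=srs
                  samprate=(0.10 0.25 0.50)
                  seed=9973 out=StratRateSample;
   strata State;
run;
```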

SAMPRATE=SAS-data-set
RATE=SAS-data-set
names a SAS data set that contains sampling rates for the strata. This input data set should contain all the STRATA variables, with the same type and length as in the DATA= data set. The STRATA groups should appear in the same order in the SAMPRATE= data set as in the DATA= data set. The SAMPRATE= data set should have a variable _RATE_ that contains the sampling rate for each stratum.

Each sampling rate value must be a positive number. You can specify each value as a number between 0 and 1. Or you can specify a value in percentage form as a number between 1 and 100, and PROC SURVEYSELECT converts that number to a proportion. The procedure treats the value 1 as 100%, and not the percentage form 1%.

The SAMPRATE= option is available only for equal probability selection methods (METHOD=SRS, METHOD=URS, METHOD=SYS, and METHOD=SEQ). For systematic random sampling (METHOD=SYS), PROC SURVEYSELECT uses the inverse of the stratum sampling rate as the interval for the stratum. See the section "Systematic Random Sampling" for details. For other selection methods, PROC SURVEYSELECT converts the stratum sampling rate to the stratum sample size before selection, multiplying the rate by the number of units in the stratum and rounding up to the nearest integer.

SAMPSIZE=n
N=n
specifies the sample size, which is the number of units selected for the sample. The sample size n must be a positive integer. For methods that select without replacement, the sample size n must not exceed the number of units in the input data set.

If you request a stratified sample design with a STRATA statement and specify the SAMPSIZE=n option, PROC SURVEYSELECT selects n units from each stratum. For methods that select without replacement, the sample size n must not exceed the number of units in any stratum. If you do not want to select the same number of units from each stratum, use the SAMPSIZE=(values) option or the SAMPSIZE=SAS-data-set option to specify different sample sizes for the strata.

SAMPSIZE=(values)
N=(values)
specifies sample sizes for the strata. You can separate values with blanks or commas. The number of SAMPSIZE= values must equal the number of strata in the input data set.

List the stratum sample size values in the order in which the strata appear in the input data set. If you use the SAMPSIZE=(values) option, the input data set must be sorted by the STRATA variables in ascending order. You cannot use the DESCENDING or NOTSORTED options in the STRATA statement.

Each stratum sample size value must be a positive integer. For methods that select without replacement, the sample size for a stratum must not exceed the number of units in that stratum.

SAMPSIZE=SAS-data-set
N=SAS-data-set
names a SAS data set that contains the sample sizes for the strata. This input data set should contain all the STRATA variables, with the same type and length as in the DATA= data set. The STRATA groups should appear in the same order in the SAMPSIZE= data set as in the DATA= data set. The SAMPSIZE= data set should have a variable _NSIZE_ that contains the sample size for each stratum. Each sample size value must be a positive integer. For methods that select without replacement, the stratum sample size must not exceed the number of units in the stratum.

SEED=number
specifies the initial seed for random number generation. The value of the SEED= option must be a positive integer. If you do not specify the SEED= option, PROC SURVEYSELECT uses the time of day from the computer's clock to obtain the initial seed.

SORT=NEST | SERP
specifies the type of sorting by CONTROL variables. The option SORT=NEST requests nested sorting, and SORT=SERP requests hierarchic serpentine sorting. The default is SORT=SERP. See the section "Sorting by CONTROL Variables" for descriptions of serpentine and nested sorting. When there is only one CONTROL variable, the two types of sorting are equivalent.

This option is available when you specify a CONTROL statement for systematic or sequential selection methods (METHOD=SYS, METHOD=PPS_SYS, METHOD=SEQ, and METHOD=PPS_SEQ). PROC SURVEYSELECT sorts the input data set by the CONTROL variables within strata before selecting the sample.
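The sketch below pairs the SORT= option with OUTSORT= for a systematic sample. The frame Customers and the CONTROL variables Type and Usage are hypothetical.

```sas
/* Nested sorting by two CONTROL variables before systematic selection. */
/* OUTSORT= keeps the sorted copy separate, so Customers is not replaced. */
proc surveyselect data=Customers method=sys samprate=0.02
                  sort=nest outsort=CustomersSorted
                  seed=1234 out=SysSample;
   control Type Usage;
run;
```

Omitting SORT=NEST would fall back to the default serpentine ordering.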

STATS
includes selection probabilities and sampling weights in the OUT= output data set for equal probability selection methods when you do not specify a STRATA statement. This option is available for the following equal probability selection methods: METHOD=SRS, METHOD=URS, METHOD=SYS, and METHOD=SEQ. For PPS selection methods and stratified designs, the output data set contains selection probabilities and sampling weights by default. For more information on the contents of the output data set, see the section "Output Data Set".


Copyright © 1999 by SAS Institute Inc., Cary, NC, USA. All rights reserved.