Output Data Sets

The VARIOGRAM Procedure

The VARIOGRAM procedure produces three data sets: the OUTVAR=SAS-data-set, the OUTPAIR=SAS-data-set, and the OUTDIST=SAS-data-set. These data sets are described in the following sections.

OUTVAR=SAS-data-set

The OUTVAR= data set contains the standard and robust versions of the sample semivariogram, the covariance, and other information at each lag class.

The details of the computation of the variogram, the robust variogram, and the covariance are described in the section "Theoretical and Computational Details of the Semivariogram".

The OUTVAR= data set contains the following variables:

ANGLE, which is the angle class value (clockwise from N-S)
ATOL, which is the angle tolerance for the lag/angle class
AVERAGE, which is the average variable value for the lag/angle class
BANDW, which is the band width for the lag/angle class
COUNT, which is the number of pairs in the lag/angle class
COVAR, which is the covariance value for the lag/angle class
DISTANCE, which is the average lag distance for the lag/angle class
LAG, which is the lag class value (in LAGDISTANCE= units)
RVARIO, which is the sample robust variogram value for the lag/angle class
VARIOG, which is the sample variogram value for the lag/angle class
VARNAME, which is the name of the current VAR= variable

The bandwidth variable, BANDW, is not included in the data set if no bandwidth specification is given in the COMPUTE statement or in a DIRECTIONS statement.

OUTDIST=SAS-data-set

The OUTDIST= data set contains counts for a modified histogram showing the distribution of pairwise distances. The purpose of this data set is to enable you to choose a value for the LAGDISTANCE= option in the COMPUTE statement in subsequent runs of PROC VARIOGRAM.

For plotting and estimation purposes, it is desirable to have as many points as possible for a variogram plot. However, a rule of thumb used in computing sample semivariograms is to use at least 30 points in each interval whenever possible. Hence, there is a lower limit to the value of the LAGDISTANCE= option.

Since the distribution of pairwise distances is seldom known in advance, the information contained in the OUTDIST= data set enables you to choose, in an iterative fashion, a value for the LAGDISTANCE= parameter. The value you choose is a compromise between the number of pairs making up each variogram point and the number of variogram points.

In some cases, the pattern of measured points may result in some lag or distance classes having a small number of pairs, while the remaining classes have a large number of pairs. By adjusting the value of the LAGDISTANCE= option to honor the rule of thumb (at least 30 pairs), you are "wasting" pairs in the other distance classes.

One strategy for solving this problem is to use fewer than 30 pairs for these distance classes. Then, either delete the corresponding variogram points or use them and accept the increased uncertainty. Unfortunately, the deficient distance classes are usually those close to the origin (h=0). This is the crucial portion of the experimental variogram curve for determining the form of the theoretical variogram and for detecting the presence of a nugget effect.

Another alternative is to force distance classes to contain approximately the same number of pairs. This results in distance classes of unequal widths.

While PROC VARIOGRAM does not produce such distance classes directly, the OUTPAIR= data set, described in the section "OUTPAIR=SAS-data-set", contains information on all distinct pairs of points. You can use this data set, along with the RANK procedure, to produce experimental variograms based on equal numbers of pairs in each distance class.
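
The equal-count idea can be illustrated outside SAS with a minimal Python sketch. It assumes the pairwise distances have already been extracted (for example, from the DISTANCE variable of the OUTPAIR= data set) and splits the sorted distances into groups of nearly equal size. The function name `equal_count_classes` and the sample distances are hypothetical; this is not how the RANK procedure is implemented.

```python
def equal_count_classes(distances, n_classes):
    """Split sorted distances into n_classes groups of nearly equal size.

    Returns a list of (lower, upper, count) tuples, one per nonempty class.
    """
    ordered = sorted(distances)
    n = len(ordered)
    classes = []
    for k in range(n_classes):
        # Integer arithmetic spreads any remainder across the groups.
        start = k * n // n_classes
        stop = (k + 1) * n // n_classes
        group = ordered[start:stop]
        if group:
            classes.append((group[0], group[-1], len(group)))
    return classes


# Hypothetical pairwise distances, clustered unevenly.
sample = [0.5, 0.7, 1.0, 1.1, 1.2, 1.3, 2.0, 2.1, 4.0, 4.5, 6.0, 9.0]
equal_classes = equal_count_classes(sample, 3)
for lower, upper, count in equal_classes:
    print(f"[{lower}, {upper}] -> {count} pairs")
```

The resulting classes have unequal widths but equal counts, which is the trade-off described above.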

To request an OUTDIST= data set, you specify the OUTDIST= data set in the PROC VARIOGRAM statement and the NOVARIOGRAM option in the COMPUTE statement. The NOVARIOGRAM option prevents any variogram or covariance computation from being performed.

Computation of the Distribution Distance Classes

The simplest way of determining the distribution of pairwise distances is to determine the maximum distance h_max between pairs and divide this distance by some number N of intervals to produce distance classes of length $\delta = \frac{h_{max}}{N}$. The distance between each pair of points P₁, P₂, denoted | P₁P₂ |, is computed, and the pair P₁P₂ is counted in the kth distance class if $| P_1P_2 | \in [(k-1)\delta, k\delta)$ for k = 1, ..., N.
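
As an illustration only, this simple scheme can be sketched in Python. The function name `simple_distance_classes` and the sample points are hypothetical. Because the maximum distance itself lies on the open right edge of the Nth class, the sketch assumes it should be counted in the last class; the text above does not specify this.

```python
import math
from itertools import combinations


def simple_distance_classes(points, n_classes):
    """Count pairs whose distance falls in [(k-1)*delta, k*delta), k = 1..N."""
    dists = [math.dist(p, q) for p, q in combinations(points, 2)]
    h_max = max(dists)
    delta = h_max / n_classes
    counts = [0] * n_classes
    for d in dists:
        # Assumption: the maximum distance is assigned to the last class.
        k = min(int(d // delta), n_classes - 1)
        counts[k] += 1
    return delta, counts


# Hypothetical coordinates chosen so that all distances are whole numbers.
sample_points = [(0, 0), (3, 4), (6, 8), (0, 8)]
delta, counts = simple_distance_classes(sample_points, 5)
print(delta, counts)
```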

The actual computation is a slight variation of this. A bound, rather than the actual maximum distance, is computed. This bound is the length of the diagonal of a bounding rectangle for the data points. This bounding rectangle is found by using the maximum and minimum x and y coordinates, x_max, x_min, y_max, y_min, and forming the rectangle determined by the points

(x_max, y_max), (x_max, y_min), (x_min, y_min), (x_min, y_max)

See Figure 70.16 for an illustration of the bounding rectangle.

Figure 70.16: Bounding Rectangle to Determine Maximum Pairwise Distance

The pairwise distance bound, denoted by h_b, is given by

h_b² = (x_max-x_min)² + (y_max-y_min)²
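
A minimal Python sketch of this bound, with a hypothetical function name and sample coordinates, shows that the diagonal of the bounding rectangle can be found in one pass over the coordinates and is never smaller than the actual maximum pairwise distance:

```python
import math
from itertools import combinations


def pairwise_distance_bound(points):
    """Length of the diagonal of the bounding rectangle of the points."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return math.hypot(max(xs) - min(xs), max(ys) - min(ys))


# Hypothetical coordinates; no single pair spans the full rectangle.
pts = [(0.0, 0.0), (3.0, 1.0), (1.0, 4.0)]
h_b = pairwise_distance_bound(pts)
actual_max = max(math.dist(p, q) for p, q in combinations(pts, 2))
print(h_b, actual_max)
```

Computing the bound needs only the coordinate extremes, whereas the exact maximum requires examining every pair.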

Using h_b, the interval (0,h_b] is divided into N+1 subintervals, where N is the value of the NHCLASSES= option specified in the COMPUTE statement, or N=10 if the NHCLASSES= option is not specified. The basic distance unit is $h_0 = \frac{h_b}{N}$; the distance intervals are centered on h₀, 2h₀, ..., Nh₀, with a distance tolerance of $\pm\frac{h_0}{2}$. The extra subinterval is (0,h₀/2), corresponding to the 0th lag. It is half the length of the remaining subintervals, and it often contains the smallest number of pairs. This method of partitioning the interval (0,h_b] is identical to what is done when you actually compute the sample variogram.

The lag classes corresponding to h₀=1 are shown in Figure 70.17.

Figure 70.17: Lag Classes Corresponding to h₀=1
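
The partition into lag classes can be written out as a short Python sketch. The function name `lag_class_bounds` is hypothetical, and the default of 10 classes mirrors the NHCLASSES= default stated above. The bounds follow the description literally: lag 0 covers (0, h₀/2), and lag k is centered on kh₀ with tolerance h₀/2. How PROC VARIOGRAM treats the interval endpoints is not specified here.

```python
def lag_class_bounds(h_b, n_classes=10):
    """Return (lag, lower, upper) for lags 0..n_classes.

    h0 = h_b / n_classes is the basic distance unit.
    """
    h0 = h_b / n_classes
    bounds = [(0, 0.0, h0 / 2)]  # the 0th lag is half as wide as the others
    for k in range(1, n_classes + 1):
        bounds.append((k, (k - 0.5) * h0, (k + 0.5) * h0))
    return bounds


# With h_b = 10 and the default 10 classes, h0 = 1.
bounds = lag_class_bounds(10.0)
for lag, lower, upper in bounds:
    print(lag, lower, upper)
```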

By increasing or decreasing the value of the NHCLASSES= option, you can adjust the lag or distance class with the smallest count so that this count is around 30 or some other value that you judge appropriate.
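
The following Python sketch illustrates this adjustment under stated assumptions: it counts pairs per lag class for a candidate number of classes and searches for the largest number of classes whose smallest count is still at least 30. The names `class_counts` and `largest_n_with_min_count`, the search limit of 50 classes, the half-open class boundaries, and the evenly spread sample distances are all hypothetical; in practice you would rerun PROC VARIOGRAM with different NHCLASSES= values and inspect the OUTDIST= data set.

```python
def class_counts(distances, h_b, n_classes):
    """Count distances per lag class 0..n_classes (lag k centered on k*h0)."""
    h0 = h_b / n_classes
    counts = [0] * (n_classes + 1)
    for d in distances:
        # Nearest multiple of h0; distances below h0/2 fall in lag 0.
        # Assumption: a distance exactly on a boundary goes to the upper class.
        k = min(int(d / h0 + 0.5), n_classes)
        counts[k] += 1
    return counts


def largest_n_with_min_count(distances, h_b, min_pairs=30, max_classes=50):
    """Largest number of classes whose smallest class count is >= min_pairs."""
    for n in range(max_classes, 0, -1):
        if min(class_counts(distances, h_b, n)) >= min_pairs:
            return n
    return None


# Hypothetical distances spread evenly over (0, 10).
distances = [i * 0.01 + 0.005 for i in range(1000)]
best_n = largest_n_with_min_count(distances, h_b=10.0)
print(best_n, class_counts(distances, 10.0, best_n))
```

More classes give more variogram points but fewer pairs per point, so the search stops at the finest partition that still honors the rule of thumb.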

Once you determine an appropriate value for the NHCLASSES= option, you can use the width of the lag classes as a candidate value for the LAGDISTANCE= option in the COMPUTE statement. The width of the lag classes is determined by the upper bound (UB) and lower bound (LB) variables.

For example, read the observation from the OUTDIST= data set corresponding to lag 1 and compute the quantity UB-LB. Use this value for the LAGDISTANCE= option in the COMPUTE statement.

Note: Do not use the 0th lag class; it is half the length of the other intervals. Use lag 1 instead.
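
As a small illustration of this arithmetic, the Python sketch below reads the lag 1 row from a few hypothetical OUTDIST-style records and computes UB-LB. The row values and the function name `candidate_lag_distance` are invented for the example; only the variable names LAG, LB, UB, and COUNT come from the OUTDIST= data set.

```python
# Hypothetical rows in the shape of OUTDIST= observations.
outdist_rows = [
    {"LAG": 0, "LB": 0.0, "UB": 2.5, "COUNT": 12},
    {"LAG": 1, "LB": 2.5, "UB": 7.5, "COUNT": 48},
    {"LAG": 2, "LB": 7.5, "UB": 12.5, "COUNT": 95},
]


def candidate_lag_distance(rows):
    """Width UB - LB of the lag 1 class (lag 0 is only half as wide)."""
    lag1 = next(row for row in rows if row["LAG"] == 1)
    return lag1["UB"] - lag1["LB"]


lag_distance = candidate_lag_distance(outdist_rows)
print(lag_distance)
```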

Variables in the OUTDIST= data set

The following variables are written to the OUTDIST= data set:

COUNT, which is the number of pairs falling into this lag class
LAG, which is the lag class value
LB, which is the lower bound of the lag class interval
UB, which is the upper bound of the lag class interval
PER, which is the percent of all pairs falling in this lag class
VARNAME, which is the name of the current VAR= variable

OUTPAIR=SAS-data-set

The OUTPAIR= data set contains one observation for each distinct pair of points P₁, P₂ in the original data set, unless you specify the OUTPDISTANCE= option in the COMPUTE statement.

If you specify OUTPDISTANCE=D_max in the COMPUTE statement, all pairs P₁, P₂ in the original data set that satisfy the relation $| P_1P_2 | \le D_{max}$ are written to the OUTPAIR= data set.

Note that the OUTPAIR= data set can be very large even for a moderately sized DATA= data set.

For example, if the DATA= data set has NOBS=500, the OUTPAIR= data set has NOBS(NOBS-1)/2 = 124,750 observations if no OUTPDISTANCE= restriction is given in the COMPUTE statement.
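
The growth in the number of pairs, and the effect of a distance cutoff, can be checked with a short Python sketch. The function names `n_distinct_pairs` and `pairs_within` and the sample points are hypothetical; the cutoff plays the role of D_max in the relation above.

```python
import math
from itertools import combinations


def n_distinct_pairs(n_obs):
    """Number of distinct pairs among n_obs points: n(n-1)/2."""
    return n_obs * (n_obs - 1) // 2


def pairs_within(points, d_max):
    """Distinct pairs whose distance does not exceed d_max."""
    return [(p, q) for p, q in combinations(points, 2)
            if math.dist(p, q) <= d_max]


total_pairs = n_distinct_pairs(500)

# Hypothetical coordinates; three of the six pairs are 5 units apart.
sample_points = [(0, 0), (3, 4), (6, 8), (0, 8)]
close_pairs = pairs_within(sample_points, 5.5)
print(total_pairs, len(close_pairs))
```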

The OUTPAIR= data set contains information on the distance and orientation for each point pair, and you can use it for specialized continuity measure calculations.

The OUTPAIR= data set contains the following variables:

AC, which is the angle class value
COS, which is the cosine of the angle between pairs
DC, which is the distance (lag) class
DISTANCE, which is the distance between pairs
V1, which is the variable value for the first point in the pair
V2, which is the variable value for the second point in the pair
VARNAME, which is the name of the current VAR= variable
X1, which is the x coordinate of the first point in the pair
X2, which is the x coordinate of the second point in the pair
Y1, which is the y coordinate of the first point in the pair
Y2, which is the y coordinate of the second point in the pair
