Chapter Contents
Chapter Contents
Previous
Previous
Next
Next
The VARIOGRAM Procedure

Preliminary Variogram Analysis

Recall that the goal of this example is spatial prediction. In particular, you would like to produce a contour map or surface plot on a regular grid of predicted values based on ordinary kriging. Ordinary kriging requires the complete specification of the spatial covariance or semivariogram.

You can use PROC VARIOGRAM, along with a DATA step and PROC GPLOT, to estimate visually a reasonable semivariogram model (both the form and associated parameters) for the thickness data.

Before proceeding with this estimation, consider the formula for the empirical or experimental semivariogram \gamma_z(h).Denote the coal seam thickness process by \{Z(r), r \in D \subset \mathcal{R}^2 \}.You have measurements (Z(ri), i = 1, ... ,75). The standard formula for \gamma_z(h) (isotropic case) is

2\gamma_z(h) = \frac{1}{| N(h) |}\sum_{N(h)}(Z(r_i)-Z(r_j))^2
where N(h) is given by

N(h) = {i,j: | ri-rj | = h}
and | N(h) | is the number of such pairs (i,j).

For actual data, it is unlikely that any pair (i,j) would exactly satisfy | ri-rj | = h, so typically a range of pairwise distances, | r_i-r_j | \in [h-\delta h,h+\delta h),is used to group pairs (ri,rj) for a single term in the expression for \gamma_z(h). Using this range, N(h) is modified by

N(h,\delta h) = \{i,j: | r_i-r_j | \in [h-\delta h,h+\delta h)\}

PROC VARIOGRAM performs this grouping with two required options for variogram computation: the LAGDISTANCE= and MAXLAGS= options.

The meaning of the required LAGDISTANCE= option is as follows. Classify all pairs of points into intervals according to their pairwise distance. The width of the distance interval is the LAGDISTANCE= value. The meaning of the required MAXLAGS= option is simply the number of intervals.

The problem is that a surface plot of the original data, or the scatter plot of the measurement locations, is not very helpful in determining the distribution of these pairwise distances; it is not clear what values to give to the LAGDISTANCE= and MAXLAGS= options.

You use PROC VARIOGRAM with the OUTDISTANCE= option to produce a modified histogram of the pairwise distances in order to find reasonable values for the LAGDISTANCE= and MAXLAGS= options. In the following analysis, you use the NOVARIOGRAM option in the COMPUTE statement and the OUTDISTANCE= option in the PROC VARIOGRAM statement. You need the NOVARIOGRAM option to keep an error message from being issued due to the absence of the LAGDISTANCE= and MAXLAGS= options.

The DATA step after the PROC VARIOGRAM statement computes the midpoint of each distance interval. This midpoint is then used in the GCHART procedure. Since the number of distance intervals is not specified by using the NHCLASSES= option in the COMPUTE statement, the default of 10 is used.

   proc variogram data=thick outdistance=outd;
      compute novariogram;
      coordinates xc=east yc=north;
      var thick;
   run;

   title 'OUTDISTANCE= Data Set Showing Distance Intervals';
   proc print data=outd;
   run;

   data outd; set outd;
      mdpt=round((lb+ub)/2,.1);
      label mdpt = 'Midpoint of Interval'; 
   run;

   axis1 minor=none;
   axis2 minor=none label=(angle=90 rotate=0);
   title 'Distribution of Pairwise Distances';
   proc gchart data=outd;
      vbar mdpt / type=sum sumvar=count discrete frame 
                  cframe=ligr gaxis=axis1 raxis=axis2 nolegend;
   run;

OUTDISTANCE= Data Set Showing Distance Intervals

Obs VARNAME LAG LB UB COUNT PER
1 thick 0 0.000 6.969 45 0.01622
2 thick 1 6.969 20.907 263 0.09477
3 thick 2 20.907 34.845 383 0.13802
4 thick 3 34.845 48.783 436 0.15712
5 thick 4 48.783 62.720 495 0.17838
6 thick 5 62.720 76.658 525 0.18919
7 thick 6 76.658 90.596 412 0.14847
8 thick 7 90.596 104.534 179 0.06450
9 thick 8 104.534 118.472 35 0.01261
10 thick 9 118.472 132.410 2 0.00072
11 thick 10 132.410 146.348 0 0.00000

Figure 70.3: OUTDISTANCE= Data Set Showing Distance Intervals

varg1d.gif (5198 bytes)

Figure 70.4: Distribution of Pairwise Distances

For plotting and estimations purposes, it is desirable to have as many points as possible for the plot of \gamma_z(h) against h. This corresponds to having as many distance intervals as possible, that is, having a small value for the LAGDISTANCE= option.

However, a rule of thumb used in computing sample semivariograms is to use at least 30 point pairs in computing a single value of the empirical or experimental semivariogram.

If the LAGDISTANCE= value is set too small, there may be too few points in one or more of the intervals. On the other hand, if the LAGDISTANCE= value is set to a large value, the number of point pairs in the distance intervals may be much greater than that needed for estimation precision, thereby "wasting" point pairs at the expense of variogram points.

Hence, there is a tradeoff between the number of distance intervals and the number of point pairs within each interval.

As discussed in the section "OUTDIST=SAS-data-set " the first few distance intervals, corresponding to lag 0 and lag 1, are typically the limiting intervals. This is particularly true for lag 0 since it is half the width of the remaining intervals. For the default of NHCLASSES=10, the lag 0 class contains 45 points, which is reasonably close to 30, but the lag 1 class contains 263 points.

If you rerun PROC VARIOGRAM with NHCLASSES=20, these numbers become 8 and 83 for lags 0 and 1, respectively. Because of the asymmetrical nature of lag 0, you are willing to violate the rule of thumb for the 0th lag. You will, however, have sufficient numbers in lag 1 and above.

   proc variogram data=thick outdistance=outd;
      compute nhc=20 novariogram;
      coordinates xc=east yc=north;
      var thick;
   run;

   title 'OUTDISTANCE= Data Set Showing Distance Intervals';
   proc print data=outd;
   run;

   data outd; set outd;
      mdpt=round((lb+ub)/2,.1);
      label mdpt = 'Midpoint of Interval';
   run;

   axis1 minor=none;
   axis2 minor=none label=(angle=90 rotate=0);
   title 'Distribution of Pairwise Distances';     
   proc gchart data=outd;
      vbar mdpt / type=sum sumvar=count discrete frame 
                  cframe=ligr gaxis=axis1 raxis=axis2 nolegend;
   run;

OUTDISTANCE= Data Set Showing Distance Intervals

Obs VARNAME LAG LB UB COUNT PER
1 thick 0 0.000 3.484 8 0.00288
2 thick 1 3.484 10.453 83 0.02991
3 thick 2 10.453 17.422 143 0.05153
4 thick 3 17.422 24.391 167 0.06018
5 thick 4 24.391 31.360 198 0.07135
6 thick 5 31.360 38.329 197 0.07099
7 thick 6 38.329 45.298 203 0.07315
8 thick 7 45.298 52.267 235 0.08468
9 thick 8 52.267 59.236 234 0.08432
10 thick 9 59.236 66.205 284 0.10234
11 thick 10 66.205 73.174 264 0.09514
12 thick 11 73.174 80.143 236 0.08505
13 thick 12 80.143 87.112 221 0.07964
14 thick 13 87.112 94.081 165 0.05946
15 thick 14 94.081 101.050 75 0.02703
16 thick 15 101.050 108.018 41 0.01477
17 thick 16 108.018 114.987 15 0.00541
18 thick 17 114.987 121.956 5 0.00180
19 thick 18 121.956 128.925 1 0.00036
20 thick 19 128.925 135.894 0 0.00000
21 thick 20 135.894 142.863 0 0.00000

Figure 70.5: OUTDISTANCE= Data Set Showing Distance Intervals

varg1f.gif (5472 bytes)

Figure 70.6: Distribution of Pairwise Distances

The length of the lag 1 class (3.484,10.453) is 6.969. You round off and use LAGDISTANCE=7.0 in the COMPUTE statement.

The use of the MAXLAGS= option is more difficult. From Figure 70.5, note that up to a pairwise distance of 101, you have a sufficient number of pairs. With your choice of LAGDISTANCE=7.0, this yields a maximum number of lags \frac{101}7 \approx 14.

The problem with using the maximum lag value is that it includes pairs of points so far apart that they are likely to be independent. Using pairs of points that are independent adds nothing to the empirical semivariogram plot; they are essentially added noise.

If there is an estimate of correlation length, perhaps from a prior geologic study of a similar site, you can specify the MAXLAGS= value so that the maximum pairwise distance does not exceed two or three correlation lengths. If there is no estimate of correlation length, you can use the following rule of thumb: use (1/2) to (3/4) of the "diameter" of the region containing the data. A MAXLAGS= value of 10 is within this range.

You now rerun PROC VARIOGRAM with these values.

Chapter Contents
Chapter Contents
Previous
Previous
Next
Next
Top
Top

Copyright © 1999 by SAS Institute Inc., Cary, NC, USA. All rights reserved.