The LOESS Procedure

Scatter Plot Smoothing

The following data from the Connecticut Tumor Registry present age-adjusted numbers of melanoma incidences per 100,000 people for the 37 years from 1936 to 1972 (Houghton, Flannery, and Viola, 1980).

   data Melanoma;
      input  Year Incidences @@;
      format Year d4.0;
   datalines;
   1936    0.9   1937   0.8  1938   0.8  1939   1.3
   1940    1.4   1941   1.2  1942   1.7  1943   1.8
   1944    1.6   1945   1.5  1946   1.5  1947   2.0
   1948    2.5   1949   2.7  1950   2.9  1951   2.5
   1952    3.1   1953   2.4  1954   2.2  1955   2.9
   1956    2.5   1957   2.6  1958   3.2  1959   3.8
   1960    4.2   1961   3.9  1962   3.7  1963   3.3
   1964    3.7   1965   3.9  1966   4.1  1967   3.8
   1968    4.7   1969   4.4  1970   4.8  1971   4.8
   1972    4.8
   ;

The following PROC GPLOT statements produce the simple scatter plot of these data displayed in Figure 38.1.

 
   symbol1 color=black value=dot;
   proc gplot data=Melanoma;
      title1 'Scatter Plot of Melanoma Data';
      plot Incidences*Year;
   run;

Figure 38.1: Scatter Plot of Incidences versus Year for the Melanoma Data

Suppose that you want to smooth the response variable Incidences as a function of the variable Year. The following PROC LOESS statements request this analysis:

   proc loess data=Melanoma;
      model Incidences=Year/details(OutputStatistics);
   run;

You use the PROC LOESS statement to invoke the procedure and specify the data set. The MODEL statement names the dependent and independent variables. You use options in the MODEL statement to specify fitting parameters and control the displayed output. For example, the MODEL statement option DETAILS(OutputStatistics) requests that the "Output Statistics" table be included in the displayed output. By default, this table is not displayed.

The results are displayed in Figure 38.2 and Figure 38.3.

 
Loess Fit of Melanoma Data

The LOESS Procedure

Independent Variable Scaling
Scaling applied: None
Statistic          Year
Minimum Value      1936
Maximum Value      1972

Loess Fit of Melanoma Data

The LOESS Procedure
Smoothing Parameter: 0.5
Dependent Variable: Incidences

Output Statistics
Obs   Year   Incidences   Predicted Incidences
  1   1936          0.9                0.79168
  2   1937          0.8                0.90451
  3   1938          0.8                1.01734
  4   1939          1.3                1.13103
  5   1940          1.4                1.24472
  6   1941          1.2                1.36308
  7   1942          1.7                1.48143
  8   1943          1.8                1.59978
  9   1944          1.6                1.73162
 10   1945          1.5                1.86345
 11   1946          1.5                1.97959
 12   1947          2.0                2.09573
 13   1948          2.5                2.21187
 14   1949          2.7                2.30363
 15   1950          2.9                2.39539
 16   1951          2.5                2.48929
 17   1952          3.1                2.58320
 18   1953          2.4                2.68985
 19   1954          2.2                2.79649
 20   1955          2.9                2.89805
 21   1956          2.5                2.99960
 22   1957          2.6                3.10116
 23   1958          3.2                3.20623
 24   1959          3.8                3.31130
 25   1960          4.2                3.43311
 26   1961          3.9                3.55493
 27   1962          3.7                3.67934
 28   1963          3.3                3.80375
 29   1964          3.7                3.91434
 30   1965          3.9                4.02493
 31   1966          4.1                4.13552
 32   1967          3.8                4.24475
 33   1968          4.7                4.35398
 34   1969          4.4                4.46846
 35   1970          4.8                4.58293
 36   1971          4.8                4.70316
 37   1972          4.8                4.82338

Figure 38.2: Output from PROC LOESS

 
Loess Fit of Melanoma Data

The LOESS Procedure
Smoothing Parameter: 0.5
Dependent Variable: Incidences

Fit Summary
Fit Method                      Interpolation
Number of Observations          37
Number of Fitting Points        17
kd Tree Bucket Size             3
Degree of Local Polynomials     1
Smoothing Parameter             0.50000
Points in Local Neighborhood    18
Residual Sum of Squares         3.97047

Figure 38.3: Output from PROC LOESS continued
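
The "Points in Local Neighborhood" entry can be related to the smoothing parameter with a little arithmetic. The following DATA step is only a sketch: it assumes that the neighborhood size is the smoothing parameter times the number of observations, truncated to an integer. That assumption is consistent with the value 18 shown in Figure 38.3 (0.5 x 37 = 18.5), but it is not taken from the procedure's documentation. The variable names n, s, and q are illustrative.

```sas
   /* Sketch only: assumes neighborhood size = floor(s * n). */
   data _null_;
      n = 37;                       /* number of observations in Melanoma */
      do s = 0.1, 0.2, 0.3, 0.4, 0.5;
         q = floor(s*n);            /* assumed number of points in each local fit */
         put s= q=;                 /* writes one line per value to the SAS log */
      end;
   run;
```

Under this assumption, small smoothing parameters leave very few points in each local regression, which is why the fit follows the data much more closely as the smoothing parameter decreases.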

Usually, such displayed results are of limited use. Most frequently the results are needed in an output data set so that they can be displayed graphically and analyzed further. For example, to place the "Output Statistics" table shown in Figure 38.2 in an output data set, you use the ODS OUTPUT statement as follows:

   proc loess data=Melanoma;
      model Incidences=Year;
      ods output OutputStatistics=Results;
   run;

The statement

    ods output OutputStatistics=Results;
requests that the "Output Statistics" table that appears in Figure 38.2 be placed in a SAS data set named Results. Note also that the DETAILS(OutputStatistics) option that caused this table to be included in the displayed output need not be specified.

The PRINT procedure displays the first five observations of this data set:

   title1 'First 5 Observations of the Results Data Set'; 
   proc print data=Results(obs=5); 
     id obs;
   run;

 
First 5 Observations of the Results Data Set

Obs   SmoothingParameter   Dependent    Year   DepVar      Pred
  1                  0.5   Incidences   1936      0.9   0.79168
  2                  0.5   Incidences   1937      0.8   0.90451
  3                  0.5   Incidences   1938      0.8   1.01734
  4                  0.5   Incidences   1939      1.3   1.13103
  5                  0.5   Incidences   1940      1.4   1.24472

Figure 38.4: PROC PRINT Output of the Results Data Set
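
The ODS OUTPUT statement shown earlier refers to the table by its ODS name, OutputStatistics. If you are unsure which table names a procedure produces, one common approach is to bracket the step with ODS TRACE statements, as in the following sketch. The trace records, which include each output object's name, label, template, and path, are written to the SAS log.

```sas
   ods trace on;                     /* start listing output objects in the log */
   proc loess data=Melanoma;
      model Incidences=Year/details(OutputStatistics);
   run;
   ods trace off;                    /* stop tracing */
```

Any name reported in the log can then be used on the left-hand side of an ODS OUTPUT statement.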

You can now produce a scatter plot including the fitted loess curve as follows:

   symbol1 color=black value=dot;  
   symbol2 color=black interpol=join value=none;

   /* macro variable holding plot options reused in subsequent examples */
   %let opts=vaxis=axis1 hm=3 vm=3 overlay;

   axis1 label=(angle=90 rotate=0);

   proc gplot data=Results; 
     title1 'Melanoma Data with Default LOESS Fit';
     plot  DepVar*Year Pred*Year/ &opts;
   run;

Figure 38.5: Default Loess Fit for Melanoma Data

The loess fit shown in Figure 38.5 was obtained with the default value of the smoothing parameter, which is 0.5. It is evident that this results in a loess fit that is too smooth for the Melanoma data. The loess fit captures the increasing trend in the data but does not reflect the periodic pattern in the data, which is related to an 11-year sunspot activity cycle. By using the SMOOTH= option in the MODEL statement, you can obtain loess fits for a range of smoothing parameters as follows:

   proc loess data=Melanoma;
      model Incidences=Year/smooth=0.1 0.2 0.3 0.4 residual;
      ods output OutputStatistics=Results;
   run;

The RESIDUAL option causes the residuals to be added to the "Output Statistics" table. PROC PRINT displays the first five observations of this data set:

 
   proc print data=Results(obs=5); 
     id obs;
   run;

 
First 5 Observations of the Results Data Set

Obs   SmoothingParameter   Dependent    Year   DepVar      Pred   Residual
  1                  0.1   Incidences   1936      0.9   0.90000          0
  2                  0.1   Incidences   1937      0.8   0.80000          0
  3                  0.1   Incidences   1938      0.8   0.80000          0
  4                  0.1   Incidences   1939      1.3   1.30000          0
  5                  0.1   Incidences   1940      1.4   1.40000          0

Figure 38.6: PROC PRINT Output of the Results Data Set

Note that the fits for all the smoothing parameters are placed in a single data set and that ODS has added a SmoothingParameter variable to this data set that you can use to distinguish each fit.
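
As a small illustration of using the SmoothingParameter variable, the following sketch prints only the rows that belong to one of the fits. The tolerance-based comparison is an illustrative precaution against floating-point representation differences, not something the procedure requires.

```sas
   title1 'Results for Smoothing Parameter 0.2';
   /* Sketch: select the rows for a single fit from the combined data set */
   proc print data=Results(where=(abs(SmoothingParameter-0.2) < 1e-8));
      id obs;
   run;
```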

The following statements display the loess fits obtained in a 2 by 2 plot grid:

   goptions  nodisplay;
   proc gplot data=Results;
      by SmoothingParameter;
      plot DepVar*Year=1 Pred*Year/ &opts name='fit';
   run; quit;

   goptions display;
   proc greplay nofs tc=sashelp.templt template=l2r2;
       igout gseg;
       treplay 1:fit 2:fit2 3:fit1 4:fit3;
   run; quit;

Figure 38.7: Loess Fits with a Range of Smoothing Parameters

If you examine the plots in Figure 38.7, you see that a good fit is obtained with smoothing parameter 0.2. You can gain further insight into how to choose the smoothing parameter by examining scatter plots of the fit residuals versus the year. To aid the interpretation of these scatter plots, you can again use PROC LOESS to smooth the response Residual as a function of Year:

   proc loess data=Results;
      by SmoothingParameter;
      ods output OutputStatistics=residout;
      model Residual=Year/smooth=0.3;                
   run; 

   axis1 label = (angle=90 rotate=0)
         order = (-0.8 to 0.8 by 0.4); 
   goptions nodisplay;
   proc gplot data=residout;
      by SmoothingParameter; 
      plot  DepVar*Year Pred*Year / &opts vref=0 lv=2 vm=1 
                                    name='resids';
   run; quit;

   goptions display;
   proc greplay nofs tc=sashelp.templt template=l2r2;
       igout gseg;
       treplay 1:resids 2:resids2 3:resids1 4:resids3;
   run; quit;

Figure 38.8: Scatter Plots of Residuals versus Year

The scatter plots in Figure 38.8 confirm that the choice of smoothing parameter 0.2 is reasonable. With smoothing parameter 0.1, there is gross overfitting in the sense that the original data are exactly interpolated. The loess fits on the Residual versus Year scatter plots for smoothing parameters 0.3 and 0.4 reveal a periodic trend in the residuals that is much weaker when the smoothing parameter is 0.2. This suggests that when the smoothing parameter is 0.3 or above, an overly smooth fit is obtained that misses essential features in the original data.
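
A numerical summary can complement these residual plots. The following sketch assumes that the Results data set still holds the fits for smoothing parameters 0.1 through 0.4 together with the Residual variable, and it uses PROC MEANS to compute the uncorrected sum of squares (USS) of the residuals for each fit.

```sas
   /* Sketch: residual sum of squares for each smoothing parameter */
   proc means data=Results n uss;
      class SmoothingParameter;      /* one summary row per fit */
      var Residual;
   run;
```

The residual sum of squares tends to shrink as the smoothing parameter decreases, so it should not be used on its own to select a fit; the exactly interpolating fit at 0.1 has residuals of zero yet is clearly overfitted.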

Having now decided on a loess fit, you may want to obtain confidence limits for your model predictions. This is done by adding the CLM option in the MODEL statement. By default 95% limits are produced, but this can be changed by using the ALPHA= option in the MODEL statement. The following statements add 90% confidence limits to the Results data set and display the results graphically:

 
   proc loess data=Melanoma;
      model Incidences=Year/smooth=0.2 residual clm 
                            alpha=0.1;
      ods output OutputStatistics=Results;
   run;

   symbol3 color=green interpol=join value=none;
   symbol4 color=green interpol=join value=none;
   axis1 label = (angle=90 rotate=0)
         order = (0 to 6); 
   title1 'Age-adjusted Melanoma Incidences for 37 Years'; 

   proc gplot data=Results; 
     plot  DepVar*Year Pred*Year LowerCL*Year UpperCL*Year 
                        /  &opts;
   run;

Figure 38.9: Loess Fit of Melanoma Data with 90% Confidence Bands
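
To see which observations fall outside the bands in Figure 38.9, you could subset the Results data set as in the following sketch. The data set name Outside is illustrative; the variable names are the ones used in the preceding PLOT statement.

```sas
   /* Sketch: keep observations whose observed value lies outside the 90% limits */
   data Outside;
      set Results;
      if DepVar < LowerCL or DepVar > UpperCL;
   run;

   title1 'Observations Outside the 90% Confidence Limits';
   proc print data=Outside;
      id obs;
      var Year DepVar Pred LowerCL UpperCL;
   run;
```

Because the CLM option produces confidence limits for the mean predicted value rather than prediction limits for individual observations, some observations outside the bands are to be expected and do not by themselves indicate a poor fit.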


Copyright © 1999 by SAS Institute Inc., Cary, NC, USA. All rights reserved.