Robust Regression Examples
Example 9.2: LMS and LTS: Stackloss Data
This example presents the results for Brownlee's (1965) stackloss
data, which is also used for documenting the L1 regression module.
The three explanatory variables correspond to measurements for
a plant oxidizing ammonia to nitric acid on 21 consecutive days.
- x1 air flow to the plant
- x2 cooling water inlet temperature
- x3 acid concentration
The response variable yi gives the
permillage of ammonia lost (stackloss).
These data are also given in Rousseeuw and
Leroy (1987, p.76) and Osborne (1985, p.267):
print "Stackloss Data";
aa = { 1 80 27 89 42,
1 80 27 88 37,
1 75 25 90 37,
1 62 24 87 28,
1 62 22 87 18,
1 62 23 87 18,
1 62 24 93 19,
1 62 24 93 20,
1 58 23 87 15,
1 58 18 80 14,
1 58 18 89 14,
1 58 17 88 13,
1 58 18 82 11,
1 58 19 93 12,
1 50 18 89 8,
1 50 18 86 7,
1 50 19 72 8,
1 50 19 79 8,
1 50 20 80 9,
1 56 20 82 15,
1 70 20 91 15 };
Rousseeuw and Leroy (1987, p. 76) cite a large number
of papers in which this data set was previously analyzed.
They state that most researchers "concluded that
observations 1, 3, 4, and 21 were outliers" and that
some people also reported observation 2 as an outlier.
Consider 2000 Random Subsets
For N=21 and n=4 (three explanatory variables
plus an intercept), there is a total of 5985
different subsets of 4 observations out of 21.
If you do not specify optn[6], the LMS and LTS
algorithms draw Nrep=2000 random subsets.
Since a large number of these subsets lead to singular
linear systems whose messages you do not want printed,
you can choose optn[2]=2 for reduced printed output.
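As a quick check on the subset count, the binomial coefficient C(21,4) can be computed directly. The following Python sketch is only an illustration of the arithmetic; it is not part of the SAS/IML example:

```python
from math import comb

# Number of distinct elemental subsets of n = 4 observations
# drawn from N = 21 cases: C(21, 4)
n_subsets = comb(21, 4)
print(n_subsets)  # 5985
```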
title2 "***Use 2000 Random Subsets***";
a = aa[,2:4]; b = aa[,5];
optn = j(8,1,.);
optn[2]= 2; /* ipri */
optn[3]= 3; /* ilsq */
optn[8]= 3; /* icov */
call lms(sc,coef,wgt,optn,b,a);
Output 9.2.1: Some Simple Statistics
Median and Mean

              Median          Mean
  VAR1        58              60.428571429
  VAR2        20              21.095238095
  VAR3        87              86.285714286
  Intercep     1               1
  Response    15              17.523809524

Dispersion and Standard Deviation

              Dispersion      StdDev
  VAR1        5.930408874      9.1682682584
  VAR2        2.965204437      3.160771455
  VAR3        4.4478066555     5.3585712381
  Intercep    0                0
  Response    5.930408874     10.171622524
The following are the results of LS regression.
Output 9.2.2: Table of Unweighted LS Regression
Unweighted Least-Squares Estimation

LS Parameter Estimates
  Variable    Estimate     Approx Std Err   t Value   Pr > |t|   Lower WCI    Upper WCI
  VAR1          0.7156402    0.13485819       5.31     <.0001      0.45132301    0.97995739
  VAR2          1.29528612   0.36802427       3.52     0.0026      0.57397182    2.01660043
  VAR3         -0.1521225    0.15629404      -0.97     0.3440     -0.4584532     0.15420818
  Intercep    -39.919674    11.8959969       -3.36     0.0038     -63.2354     -16.603949

Sum of Squares = 178.8299616
LS Scale Estimate = 3.2433639182

Cov Matrix of Parameter Estimates
              VAR1            VAR2            VAR3            Intercep
  VAR1         0.0181867302   -0.036510675    -0.007143521      0.2875871057
  VAR2        -0.036510675     0.1354418598    0.0000104768    -0.651794369
  VAR3        -0.007143521     0.0000104768    0.024427828     -1.676320797
  Intercep     0.2875871057   -0.651794369    -1.676320797    141.51474107

F(3,17) Statistic = 59.9022259
Probability = 3.0163272E-9
The following are the LMS results for the 2000 random subsets.
Output 9.2.3: Iteration History and Optimization Results
Random Subsampling for LMS
  Subset   Singular   Best Criterion   Percent
     500         23         0.163262        25
    1000         55         0.140519        50
    1500         79         0.140519        75
    2000        103         0.126467       100

Minimum Criterion = 0.1264668282
Least Median of Squares (LMS) Method
Minimizing 13th Ordered Squared Residual.
Highest Possible Breakdown Value = 42.86 %
Random Selection of 2103 Subsets
Among 2103 subsets 103 are singular.

Observations of Best Subset
  15   11   19   10

Estimated Coefficients
  VAR1   VAR2   VAR3   Intercep
  0.75   0.5    0      -39.25

LMS Objective Function = 0.75
Preliminary LMS Scale = 1.0478510755
Robust R Squared = 0.96484375
Final LMS Scale = 1.2076147288
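The quantile 13 in "Minimizing 13th Ordered Squared Residual" and the breakdown value of 42.86% both follow from N=21 and n=4. The following Python sketch evaluates the usual formulas from Rousseeuw and Leroy (1987); it is an illustration of the arithmetic, not part of the SAS/IML example:

```python
N, n = 21, 4  # observations; parameters including the intercept

# LMS minimizes the h-th ordered squared residual
h = (N + n + 1) // 2                       # 13

# highest possible breakdown value of the estimator
breakdown = ((N - n) // 2 + 1) / N         # 9/21
print(h, round(100 * breakdown, 2))        # 13 42.86
```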
For LMS, observations 1, 3, 4, and 21 have
scaled residuals larger than 2.5 (output not
shown), and they are considered outliers.
The following are the corresponding WLS results.
Output 9.2.4: Table of Weighted LS Regression
Weighted Least-Squares Estimation

RLS Parameter Estimates Based on LMS
  Variable    Estimate     Approx Std Err   t Value   Pr > |t|   Lower WCI    Upper WCI
  VAR1          0.79768556   0.06743906      11.83     <.0001      0.66550742    0.9298637
  VAR2          0.57734046   0.16596894       3.48     0.0041      0.25204731    0.9026336
  VAR3         -0.0670602    0.06160314      -1.09     0.2961     -0.1878001     0.05367975
  Intercep    -37.652459     4.73205086      -7.96     <.0001     -46.927108   -28.37781

Weighted Sum of Squares = 20.400800254
RLS Scale Estimate = 1.2527139846

Cov Matrix of Parameter Estimates
              VAR1            VAR2            VAR3            Intercep
  VAR1         0.0045480273   -0.007921409    -0.001198689     0.0015681747
  VAR2        -0.007921409     0.0275456893   -0.00046339     -0.065017508
  VAR3        -0.001198689    -0.00046339      0.0037949466   -0.246102248
  Intercep     0.0015681747   -0.065017508    -0.246102248    22.392305355

Weighted R-squared = 0.9750062263
F(3,13) Statistic = 169.04317954
Probability = 1.158521E-10
There are 17 points with nonzero weight.
Average Weight = 0.8095238095
The prilmts() subroutine, which is in the
robustmc.sas file contained in the sample library,
can be called to print the output summary:
call prilmts(3,sc,coef,wgt);
Output 9.2.5: First Part of Output Generated by prilmts()
Results of the Least Median Squares Estimation
  Quantile:                        13
  Number of Subsets:               2103
  Number of Singular Subsets:      103
  Number of Nonzero Weights:       17
  Objective Function:              0.75
  Preliminary Scale Estimate:      1.0478511
  Final Scale Estimate:            1.2076147
  Robust R-Squared:                0.9648438
  Asymptotic Consistency Factor:   1.1413664
  RLS Scale Estimate:              1.252714
  Weighted Sum of Squares:         20.4008
  Weighted R-Squared:              0.9750062
  F Statistic:                     169.04318
Output 9.2.6: Second Part of Output Generated by prilmts()
Estimated LMS Coefficients
  0.75        0.5         0           -39.25

Indices of Best Sample
  15          11          19          10

Estimated WLS Coefficients
  0.7976856   0.5773405   -0.06706    -37.65246

Standard Errors
  0.0674391   0.1659689   0.0616031   4.7320509

T Values
  11.828242   3.4786054   -1.088584   -7.956901

Probabilities
  2.4838E-8   0.004078    0.2961071   2.3723E-6

Lower Wald CI
  0.6655074   0.2520473   -0.1878     -46.92711

Upper Wald CI
  0.9298637   0.9026336   0.0536798   -28.37781
Output 9.2.7: Third Part of Output Generated by prilmts()
LMS Residuals
   6.4176097   2.2772163   6.21059     7.2456884  -0.20702    -0.621059
  -0.20702     0.621059   -0.621059    0.621059    0.621059    0.2070197
  -1.863177   -1.449138    0.621059   -0.20702     0.2070197   0.2070197
   0.621059    1.863177   -6.831649

Diagnostics
  10.448052    7.9317507  10          11.666667    2.7297297   3.4864865
   4.7297297   4.2432432   3.6486486   3.7598351   4.6057675   4.9251688
   3.8888889   4.5864209   5.2970297   4.009901    6.679576    4.3053404
   4.0199755   3          11

WLS Residuals
   4.9634454   0.9185794   5.1312163   6.5250478  -0.535877   -0.996749
  -0.338162    0.4601047  -0.844485    0.286883    0.7686702   0.377432
  -2.000854   -1.074607    1.0731966   0.1143341  -0.297718    0.0770058
   0.4679328   1.544008   -6.888934
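The outlier classification stated earlier applies a cutoff of 2.5 to the scaled residuals. The following Python sketch illustrates that rule on the LMS residuals from Output 9.2.7; it is an illustration only, not part of the SAS/IML example:

```python
# Scaled LMS residuals from Output 9.2.7, observations 1..21
lms_residuals = [
     6.4176097,  2.2772163,  6.21059,    7.2456884, -0.20702,   -0.621059,
    -0.20702,    0.621059,  -0.621059,   0.621059,   0.621059,   0.2070197,
    -1.863177,  -1.449138,   0.621059,  -0.20702,    0.2070197,  0.2070197,
     0.621059,   1.863177,  -6.831649,
]

# An observation is considered an outlier (and receives zero weight in
# the subsequent weighted LS step) when |scaled residual| > 2.5
outliers = [i for i, r in enumerate(lms_residuals, start=1) if abs(r) > 2.5]
print(outliers)  # [1, 3, 4, 21]
```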
You now want to report the results
of LTS for the 2000 random subsets:
title2 "***Use 2000 Random Subsets***";
a = aa[,2:4]; b = aa[,5];
optn = j(8,1,.);
optn[2]= 2; /* ipri */
optn[3]= 3; /* ilsq */
optn[8]= 3; /* icov */
call lts(sc,coef,wgt,optn,b,a);
Output 9.2.8: Iteration History and Optimization Results
Random Subsampling for LTS
  Subset   Singular   Best Criterion   Percent
     500         23         0.099507        25
    1000         55         0.087814        50
    1500         79         0.084061        75
    2000        103         0.084061       100

Minimum Criterion = 0.0840614072
Least Trimmed Squares (LTS) Method
Minimizing Sum of 13 Smallest Squared Residuals.
Highest Possible Breakdown Value = 42.86 %
Random Selection of 2103 Subsets
Among 2103 subsets 103 are singular.

Observations of Best Subset
  10   11   7   15

Estimated Coefficients
  VAR1   VAR2           VAR3   Intercep
  0.75   0.3333333333   0      -35.70512821

LTS Objective Function = 0.4985185153
Preliminary LTS Scale = 1.0379336739
Robust R Squared = 0.9719626168
Final LTS Scale = 1.0407755737
In addition to observations 1, 3, 4, and 21, which
were considered outliers by LMS, observation 2 has
an LTS scaled residual considerably larger than 2.5
(output not shown) and is also considered an outlier.
Therefore, the WLS results based on LTS
differ from those based on LMS.
Output 9.2.9: Table of Weighted LS Regression
Weighted Least-Squares Estimation

RLS Parameter Estimates Based on LTS
  Variable    Estimate     Approx Std Err   t Value   Pr > |t|   Lower WCI    Upper WCI
  VAR1          0.75694055   0.07860766       9.63     <.0001      0.60287236    0.91100874
  VAR2          0.45353029   0.13605033       3.33     0.0067      0.18687654    0.72018405
  VAR3         -0.05211      0.05463722      -0.95     0.3607     -0.159197      0.054977
  Intercep    -34.05751      3.82881873      -8.90     <.0001     -41.561857   -26.553163

Weighted Sum of Squares = 10.273044977
RLS Scale Estimate = 0.9663918355

Cov Matrix of Parameter Estimates
              VAR1            VAR2            VAR3            Intercep
  VAR1         0.0061791648   -0.005776855    -0.002300587    -0.034290068
  VAR2        -0.005776855     0.0185096933    0.0002582502   -0.069740883
  VAR3        -0.002300587     0.0002582502    0.0029852254   -0.131487406
  Intercep    -0.034290068    -0.069740883    -0.131487406    14.659852903

Weighted R-squared = 0.9622869127
F(3,11) Statistic = 93.558645037
Probability = 4.1136826E-8
There are 15 points with nonzero weight.
Average Weight = 0.7142857143
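The "Average Weight" lines in Outputs 9.2.4 and 9.2.9 are simply the fraction of the 21 observations that keep nonzero weight, since downweighted points receive weight 0 and the rest weight 1. A Python sketch of that arithmetic (an illustration, not part of the SAS/IML example):

```python
N = 21
# LMS-based WLS: 17 of 21 points keep nonzero weight
avg_weight_lms = 17 / N
# LTS-based WLS: 15 of 21 points keep nonzero weight
avg_weight_lts = 15 / N
print(round(avg_weight_lms, 10), round(avg_weight_lts, 10))
# 0.8095238095 0.7142857143
```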
Consider All 5985 Subsets
You now report the results of LMS for all 5985 different subsets:
title2 "*** Use All 5985 Subsets***";
a = aa[,2:4]; b = aa[,5];
optn = j(8,1,.);
optn[2]= 2; /* ipri */
optn[3]= 3; /* ilsq */
optn[6]= -1; /* nrep: all 5985 subsets */
optn[8]= 3; /* icov */
call lms(sc,coef,wgt,optn,b,a);
Output 9.2.10: Iteration History and Optimization Results for LMS
Complete Enumeration for LMS
  Subset   Singular   Best Criterion   Percent
    1497         36         0.185899        25
    2993         87         0.158268        50
    4489        149         0.140519        75
    5985        266         0.126467       100

Minimum Criterion = 0.1264668282
Least Median of Squares (LMS) Method
Minimizing 13th Ordered Squared Residual.
Highest Possible Breakdown Value = 42.86 %
Selection of All 5985 Subsets of 4 Cases Out of 21
Among 5985 subsets 266 are singular.

Observations of Best Subset
  8   10   15   19

Estimated Coefficients
  VAR1   VAR2   VAR3          Intercep
  0.75   0.5    1.29526E-16   -39.25

LMS Objective Function = 0.75
Preliminary LMS Scale = 1.0478510755
Robust R Squared = 0.96484375
Final LMS Scale = 1.2076147288
Next, report the results of LTS for all different subsets:
title2 "*** Use All 5985 Subsets***";
a = aa[,2:4]; b = aa[,5];
optn = j(8,1,.);
optn[2]= 2; /* ipri */
optn[3]= 3; /* ilsq */
optn[6]= -1; /* nrep: all 5985 subsets */
optn[8]= 3; /* icov */
call lts(sc,coef,wgt,optn,b,a);
Output 9.2.11: Iteration History and Optimization Results for LTS
Complete Enumeration for LTS
  Subset   Singular   Best Criterion   Percent
    1497         36         0.135449        25
    2993         87         0.107084        50
    4489        149         0.081536        75
    5985        266         0.081536       100

Minimum Criterion = 0.0815355299
Least Trimmed Squares (LTS) Method
Minimizing Sum of 13 Smallest Squared Residuals.
Highest Possible Breakdown Value = 42.86 %
Selection of All 5985 Subsets of 4 Cases Out of 21
Among 5985 subsets 266 are singular.

Observations of Best Subset
  5   12   17   18

Estimated Coefficients
  VAR1           VAR2           VAR3           Intercep
  0.7291666667   0.4166666667   2.220446E-16   -36.22115385

LTS Objective Function = 0.4835390299
Preliminary LTS Scale = 1.0067458407
Robust R Squared = 0.9736222371
Final LTS Scale = 1.009470149
Copyright © 1999 by SAS Institute Inc., Cary, NC, USA. All rights reserved.