Robust Regression Examples
Example 9.1: LMS and LTS with Substantial Leverage Points: Hertzsprung-Russell Star Data
The following data are reported in Rousseeuw
and Leroy (1987, p. 27) and are based on
Humphrey (1978) and Vansina and De Greve (1982).
The 47 observations correspond to the 47 stars of
the CYG OB1 cluster in the direction of Cygnus.
The regressor variable x (column 2) is the logarithm
of the effective temperature at the surface of the
star (Te), and the response variable y (column 3)
is the logarithm of its light intensity (L/L0).
The results for LS and LMS on page 28 of Rousseeuw
and Leroy (1987) are based on a more precise
(five decimal places) version of the data set.
This data set is remarkable in that it contains four
substantial leverage points (giant stars), corresponding to
observations 11, 20, 30, and 34, which greatly affect the
results of L2 and even L1 regression. The following statements
store the data in the matrix ab and extract the regressor and
response columns:
ab = { 1 4.37 5.23, 2 4.56 5.74, 3 4.26 4.93,
4 4.56 5.74, 5 4.30 5.19, 6 4.46 5.46,
7 3.84 4.65, 8 4.57 5.27, 9 4.26 5.57,
10 4.37 5.12, 11 3.49 5.73, 12 4.43 5.45,
13 4.48 5.42, 14 4.01 4.05, 15 4.29 4.26,
16 4.42 4.58, 17 4.23 3.94, 18 4.42 4.18,
19 4.23 4.18, 20 3.49 5.89, 21 4.29 4.38,
22 4.29 4.22, 23 4.42 4.42, 24 4.49 4.85,
25 4.38 5.02, 26 4.42 4.66, 27 4.29 4.66,
28 4.38 4.90, 29 4.22 4.39, 30 3.48 6.05,
31 4.38 4.42, 32 4.56 5.10, 33 4.45 5.22,
34 3.49 6.29, 35 4.23 4.34, 36 4.62 5.62,
37 4.53 5.10, 38 4.45 5.22, 39 4.53 5.18,
40 4.43 5.57, 41 4.38 4.62, 42 4.45 5.06,
43 4.50 5.34, 44 4.45 5.34, 45 4.55 5.54,
46 4.45 4.98, 47 4.42 4.50 } ;
a = ab[,2]; b = ab[,3];
The following code specifies that most of the output be printed.
print "*** Hertzsprung-Russell Star Data: Do LMS ***";
optn = j(8,1,.);
optn[2]= 3; /* ipri */
optn[3]= 3; /* ilsq */
optn[8]= 3; /* icov */
call lms(sc,coef,wgt,optn,b,a);
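The call also returns its results in the output arguments, so they can be inspected or reused after the run. The following lines are a minimal sketch, not part of the original example, assuming the usual layout of the LMS call's output arguments (sc holds the scalar results, coef the estimated coefficients, and wgt the weights and residuals):

/* Sketch: print the matrices returned by the LMS call */
print sc, coef, wgt;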
Output 9.1.1: Some Simple Statistics

                     Median and Mean
                        Median            Mean
    VAR1                  4.42            4.31
    Intercep                 1               1
    Response               5.1    5.0121276596

             Dispersion and Standard Deviation
                    Dispersion          StdDev
    VAR1           0.163086244    0.2908234187
    Intercep                 0               0
    Response      0.6671709983    0.5712493409

Partial output for LS regression is shown in Output 9.1.2.
Output 9.1.2: Table of Unweighted LS Regression

                     Unweighted Least-Squares Estimation

                           LS Parameter Estimates
                            Approx                          Lower        Upper
    Variable   Estimate     Std Err     t Value  Pr > |t|   WCI          WCI
    VAR1      -0.4133039    0.28625748   -1.44   0.1557     -0.9743582   0.14775048
    Intercep   6.7934673    1.23651563    5.49   <.0001      4.3699412   9.21699339

    Sum of Squares = 14.346394626
    LS Scale Estimate = 0.5646315343

          Cov Matrix of Parameter Estimates
                         VAR1        Intercep
    VAR1         0.0819433428    -0.353175807
    Intercep     -0.353175807    1.5289708954

    F(1,45) Statistic = 2.0846120667
    Probability = 0.1557164396

Looking at the Best Criterion column in the iteration
history table, you can see that, with complete
enumeration, the optimal solution is found very early.
Output 9.1.3: History of Iteration Process

              Complete Enumeration for LMS
    Subset    Singular    Best Criterion    Percent
       271           5          0.392791         25
       541           8          0.392791         50
       811          27          0.392791         75
      1081          45          0.392791        100

    Minimum Criterion = 0.3927910898

    Least Median of Squares (LMS) Method
    Minimizing 25th Ordered Squared Residual.
    Highest Possible Breakdown Value = 48.94 %
    Selection of All 1081 Subsets of 2 Cases Out of 47
    Among 1081 subsets 45 are singular.

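The quantities in this summary can be verified by hand for n = 47 observations and p = 2 parameters: there are comb(47,2) = 1081 elemental subsets of two cases, LMS minimizes the h-th ordered squared residual with h = floor((n+p+1)/2) = 25, and the highest possible breakdown value is (floor((n-p)/2)+1)/n = 23/47, or about 48.94 percent. The following lines are a small verification sketch; the two formulas are the usual LMS defaults and are stated here as an assumption rather than taken from this output:

n = 47;  p = 2;                        /* observations and parameters        */
nSubsets  = comb(n, p);                /* 1081 subsets of 2 cases out of 47   */
h         = floor((n+p+1)/2);          /* 25th ordered squared residual       */
breakdown = (floor((n-p)/2) + 1) / n;  /* 0.4894, printed as 48.94 %          */
print nSubsets h breakdown;

The same h and breakdown value apply to the LTS run below, which minimizes the sum of the 25 smallest squared residuals.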
Output 9.1.4: Results of Optimization

    Observations of Best Subset
         2    29

    Estimated Coefficients
            VAR1        Intercep
    3.9705882353    -12.62794118

    LMS Objective Function = 0.2620588235
    Preliminary LMS Scale = 0.3987301586
    Robust R Squared = 0.5813148789
    Final LMS Scale = 0.3645644492

The output for WLS regression follows. Because of the
size of their scaled residuals, six observations
(numbers 7, 9, 11, 20, 30, and 34) were assigned zero
weights in the following WLS analysis.
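These observations can also be identified from the reported LMS fit itself: a scaled residual is the raw residual divided by the LMS scale estimate, and observations whose absolute scaled residual exceeds a cutoff receive zero weight. The following sketch uses the coefficients and final LMS scale from Output 9.1.4 together with the conventional cutoff of 2.5 (the cutoff is an assumption here, not taken from the output):

/* Sketch: flag observations with large LMS scaled residuals          */
r       = b - (-12.62794118 + 3.9705882353 * a);  /* raw residuals    */
sres    = r / 0.3645644492;                       /* scaled residuals */
zeroIdx = loc(abs(sres) > 2.5);                   /* cutoff 2.5       */
print zeroIdx;   /* should list observations 7, 9, 11, 20, 30, 34     */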
Output 9.1.5: Table of Weighted LS Regression

                      Weighted Least-Squares Estimation

                     RLS Parameter Estimates Based on LMS
                            Approx                          Lower        Upper
    Variable   Estimate     Std Err     t Value  Pr > |t|   WCI          WCI
    VAR1       3.04615694   0.43733923    6.97   <.0001      2.18898779  3.90332608
    Intercep  -8.5000549    1.92630783   -4.41   <.0001    -12.275549   -4.7245609

    Weighted Sum of Squares = 4.52819451
    RLS Scale Estimate = 0.3407455818

          Cov Matrix of Parameter Estimates
                         VAR1        Intercep
    VAR1         0.1912656038    -0.842128459
    Intercep     -0.842128459    3.7106618752

    Weighted R-squared = 0.5543573521
    F(1,39) Statistic = 48.514065776
    Probability = 2.3923178E-8
    There are 41 points with nonzero weight.
    Average Weight = 0.8723404255

The LTS regression leads to similar results:
print "*** Hertzsprung-Russell Star Data: Do LTS ***";
optn = j(8,1,.);
optn[2]= 3; /* ipri */
optn[3]= 3; /* ilsq */
optn[8]= 3; /* icov */
call lts(sc,coef,wgt,optn,b,a);
Output 9.1.6: History of Iteration Process

              Complete Enumeration for LTS
    Subset    Singular    Best Criterion    Percent
       271           5          0.274268         25
       541           8          0.274268         50
       811          27          0.274268         75
      1081          45          0.274268        100

    Minimum Criterion = 0.274268243

    Least Trimmed Squares (LTS) Method
    Minimizing Sum of 25 Smallest Squared Residuals.
    Highest Possible Breakdown Value = 48.94 %
    Selection of All 1081 Subsets of 2 Cases Out of 47
    Among 1081 subsets 45 are singular.

Output 9.1.7: Results of Optimization

    Observations of Best Subset
        22    36

    Estimated Coefficients
            VAR1        Intercep
    4.2424242424    -13.72633939

    LTS Objective Function = 0.1829838175
    Preliminary LTS Scale = 0.4525412929
    Robust R Squared = 0.4038660847
    Final LTS Scale = 0.3743418666

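As noted above, the LTS results are similar to the LMS results. One quick check is to evaluate both reported regression lines at the observed log temperatures; the coefficients below are copied from Output 9.1.4 and Output 9.1.7, and the comparison itself is a sketch rather than part of the original example:

/* Sketch: compare the LMS and LTS lines reported above                 */
lmsFit  = -12.62794118 + 3.9705882353 * a;  /* LMS fit, Output 9.1.4    */
ltsFit  = -13.72633939 + 4.2424242424 * a;  /* LTS fit, Output 9.1.7    */
maxDiff = max(abs(lmsFit - ltsFit));        /* largest gap between fits */
print maxDiff;                       /* roughly 0.16 on the log scale   */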