MCD and MVE Calls

Language Reference

MCD and MVE Calls

finds the minimum covariance determinant estimator and the minimum volume ellipsoid estimator

CALL MCD( sc, coef, dist, opt, x<, s>);

The MDC call is the robust (resistent) estimation of multivariate location and scatter, defined by minimizing the determinant of the covariance matrix computed from h points.

CALL MVE( sc, coef, dist, opt, x<, s>);

The MVE call is the robust (resistent) estimation of multivariate location and scatter, defined by minimizing the volume of an ellipsoid containing h points.

The MVE and MCD subroutines compute the minimum volume ellipsoid estimator and the minimum covariance determinant estimator. These robust locations and covariance matrices can be used to detect multivariate outliers and leverage points. For this purpose, the MVE and MCD subroutines provide a table of robust distances.

The inputs to the MCD and MVE subroutine are as follows:

opt

refers to an options vector with the following components (missing values are treated as default values):

opt[1]

specifies the amount of printed output. Higher option values request additional output and include the output of lower values.

opt[1]=0: prints no output except error messages.
opt[1]=1: prints most of the output.
opt[1]=2: additionally prints case numbers of the observations in the best subset and some basic history of the optimization process.
opt[1]=3: additionally prints how many subsets result in singular linear systems.

The default is opt[1]=0.

opt[2]

specifies whether the classical, initial, and final robust covariance matrices are printed. The default is opt[2]=0. Note that the final robust covariance matrix is always returned in coef.

opt[3]

specifies whether the classical, initial, and final robust correlation matrices are printed or returned:

opt[3]=0: does not return or print.
opt[3]=1: prints the robust correlation matrix.
opt[3]=2: returns the final robust correlation matrix in coef.
opt[3]=3: prints and returns the final robust correlation matrix.

opt[4]

specifies the quantile h used in the objective function. The default is opt[5]= h = [(N+n+1)/2]. If the value of h is specified outside the range $\frac{N}2+1 \leq h \leq \frac{3N}4 + \frac{n+1}4$ , it is reset to the closest boundary of this region.

opt[5]

specifies the number N_Rep of subset generations. This option is the same as described previously for the LMS and LTS subroutines. Due to computer time restrictions, not all subset combinations can be inspected for larger values of N and n. If opt[6] is zero or missing, the default number of subsets is taken from the following table.

n	1	2	3	4	5	6	7	8	9	10
N_lower	500	50	22	17	15	14	0	0	0	0
N_upper	1000000	1414	182	71	43	32	27	24	23	22
N_Rep	500	1000	1500	2000	2500	3000	3000	3000	3000	3000

n	11	12	13	14	15
N_lower	0	0	0	0	0
N_upper	22	22	22	23	23
N_Rep	3000	3000	3000	3000	3000

: If the number of cases (observations) N is smaller than N_lower, then all possible subsets are used; otherwise, N_Rep subsets are drawn randomly. This means that an exhaustive search is performed for opt[6]=-1. If N is larger than N_upper, a note is printed in the log file indicating how many subsets exist.

x

refers to an N ×n matrix X of regressors.

s

refers to an n vector containing n observation numbers of a subset for which the objective function should be evaluated. This subset can be the start for a pairwise exchange algorithm if opt[4] is specified.

Missing values are not permitted in x. Missing values in opt cause the default value to be used.

The MCD and MVE subroutines return the following values:

sc

is a column vector containing the following scalar information:

sc[1]

the quantile h used in the objective function

sc[2]

number of subsets generated

sc[3]

of subsets with singular linear systems

sc[4]

number of nonzero weights w_i

sc[5]

lowest value of the objective function F_MVE attained (volume of smallest ellipsoid found)

sc[6]

Mahalanobis-like distance used in the computation of the lowest value of the objective function F_MVE

sc[7]

the cutoff value used for the outlier decision

sc[8]

the correction factor c(N,n) used in scaling the initial MVE scatter matrix

sc[9]

scaling factor for the initial MVE scatter matrix

f = [(c²(N,n))/ CINV(0.5,n)]

coef

is a matrix with n columns containing the following results in its rows:

coef[1]: location of ellipsoid center
coef[2]: eigenvalues of final robust scatter matrix
coef[3:2+n]: the final robust scatter matrix for opt[2]=1 or opt[2]=3

dist

is a matrix with N columns containing the following results in its rows:

dist[1]: Mahalanobis distances
dist[2]: robust distances based on the final estimates
dist[3]: weights (=1 for small, =0 for large robust distances)

Example

Consider results for Brownlee's (1965) stackloss data. The three explanatory variables correspond to measurements for a plant oxidizing ammonia to nitric acid on 21 consecutive days.

x₁ air flow to the plant
x₂ cooling water inlet temperature
x₃ acid concentration

The response variable y_i gives the permillage of ammonia lost (stackloss). These data are also given by Rousseeuw & Leroy (1987, p.76).

 print "Stackloss Data";
 aa = { 1  80  27  89  42,
        1  80  27  88  37,
        1  75  25  90  37,
        1  62  24  87  28,
        1  62  22  87  18,
        1  62  23  87  18,
        1  62  24  93  19,
        1  62  24  93  20,
        1  58  23  87  15,
        1  58  18  80  14,
        1  58  18  89  14,
        1  58  17  88  13,
        1  58  18  82  11,
        1  58  19  93  12,
        1  50  18  89   8,
        1  50  18  86   7,
        1  50  19  72   8,
        1  50  19  79   8,
        1  50  20  80   9,
        1  56  20  82  15,
        1  70  20  91  15 };

Rousseeuw & Leroy (1987, p.76) cite a large number of papers where this data set was analyzed before and state that most researchers "concluded that observations 1, 3, 4, and 21 were outliers"; some people also reported observation 2 as an outlier.

By default, subroutine MVE tries only 2000 randomly selected subsets in its search. There are in total 5985 subsets of 4 cases out of 21 cases.

  a = aa[,2:4];
  optn = j(8,1,.);
  optn[1]= 2;              /* ipri */
  optn[2]= 1;              /* pcov: print COV */
  optn[3]= 1;              /* pcor: print CORR */
  optn[5]= -1;             /* nrep: use all subsets */

  CALL MVE(sc,xmve,dist,optn,a);

The first part of the output shows the classical scatter and correlation matrix:

         Minimum Volume Ellipsoid (MVE) Estimation

          Consider Ellipsoids Containing 12 Cases.


                Classical Covariance Matrix

                   VAR1              VAR2              VAR3

   VAR1    84.057142857     22.657142857    24.571428571
   VAR2    22.657142857     9.9904761905    6.6214285714
   VAR3    24.571428571     6.6214285714    28.714285714

                Classical Correlation Matrix

                   VAR1            VAR2            VAR3

   VAR1               1     0.781852333    0.5001428749
   VAR2     0.781852333               1    0.3909395378
   VAR3    0.5001428749    0.3909395378               1

                       Classical Mean

                     VAR1    60.428571429
                     VAR2    21.095238095
                     VAR3    86.285714286

       There are 5985 subsets of 4 cases out of 21 cases.
           All 5985 subsets will be considered.

The second part of the output shows the results of the optimization (complete subset sampling):

                Complete Enumeration for MVE



                                        Best
          Subset    Singular       Criterion     Percent

            1497          22      253.312431          25
            2993          46      224.084073          50
            4489          77      165.830053          75
            5985         156      165.634363         100

                Minimum Criterion= 165.63436284

              Among 5985 subsets 156 are singular.


                   Observations of Best Subset

          7                10                14                20

                     Initial MVE Location
                           Estimates

                     VAR1              58.5
                     VAR2             20.25
                     VAR3                87

                   Initial MVE Scatter Matrix

                   VAR1            VAR2            VAR3

   VAR1    34.829014749    28.413143611     62.32560534
   VAR2    28.413143611    38.036950318    58.659393261
   VAR3     62.32560534    58.659393261    267.63348175

The third part of the output shows the optimization results after local improvement:

         Final MVE Estimates (Using Local Improvement)

            Number of Points with Nonzero Weight=17


                      Robust MVE Location
                           Estimates

                    VAR1      56.705882353
                    VAR2      20.235294118
                    VAR3      85.529411765

                   Robust MVE Scatter Matrix

                     VAR1              VAR2              VAR3

   VAR1      23.470588235      7.5735294118      16.102941176
   VAR2      7.5735294118      6.3161764706      5.3676470588
   VAR3      16.102941176      5.3676470588      32.389705882

                     Eigenvalues of Robust
                         Scatter Matrix

                     VAR1      46.597431018
                     VAR2      12.155938483
                     VAR3       3.423101087

                   Robust Correlation Matrix

                     VAR1              VAR2              VAR3

   VAR1                 1      0.6220269501      0.5840361335
   VAR2      0.6220269501                 1       0.375278187
   VAR3      0.5840361335       0.375278187                 1

The final output presents a table containing the classical Mahalanobis distances, the robust distances, and the weights identifying the outlying observations (that is leverage points when explaining y with these three regressor variables):

         Classical Distances and Robust (Rousseeuw) Distances
                  Unsquared Mahalanobis Distance and
           Unsquared Rousseeuw Distance of Each Observation
                 Mahalanobis          Robust
           N       Distances       Distances          Weight

           1        2.253603        5.528395               0
           2        2.324745        5.637357               0
           3        1.593712        4.197235               0
           4        1.271898        1.588734        1.000000
           5        0.303357        1.189335        1.000000
           6        0.772895        1.308038        1.000000
           7        1.852661        1.715924        1.000000
           8        1.852661        1.715924        1.000000
           9        1.360622        1.226680        1.000000
          10        1.745997        1.936256        1.000000
          11        1.465702        1.493509        1.000000
          12        1.841504        1.913079        1.000000
          13        1.482649        1.659943        1.000000
          14        1.778785        1.689210        1.000000
          15        1.690241        2.230109        1.000000
          16        1.291934        1.767582        1.000000
          17        2.700016        2.431021        1.000000
          18        1.503155        1.523316        1.000000
          19        1.593221        1.710165        1.000000
          20        0.807054        0.675124        1.000000
          21        2.176761        3.657281               0

                   Distribution of Robust Distances

                 MinRes           1st Qu.            Median

           0.6751244996      1.5084120761      1.7159242054

                   Mean           3rd Qu.            MaxRes

           2.2282960174      2.0831826658      5.6373573538

                     Cutoff Value = 3.0575159206

       The cutoff value is the square root of 
         the 0.975 quantile of the chi square
        distribution with 3 degrees of freedom.

   There are 4 points with large robust distances receiving 
   zero weights. These may include boundary cases. Only points 
   whose robust distances are substantially larger than the cutoff 
   value should be considered outliers.

References

Brownlee, K.A. (1965), Statistical Theory and Methodology in Science and Engineering, New York: John Wiley & Sons, Inc.
Davies, L. (1992), ``The Asymptotics of Rousseeuw's Minimum Volume Ellipsoid Estimator,'' The Annals of Statistics, 20, 1828 -1843.
Rousseeuw, P.J. (1984), "Least Median of Squares Regression," Journal of the American Statistical Association, 79, 871 -880.
Rousseeuw, P.J. (1985), "Multivariate Estimation with High Breakdown Point," in Mathematical Statistics and Applications, ed. by W. Grossmann, G. Pflug, I. Vincze, and W. Wertz, Dordrecht: Reidel Publishing Company, 283 -297.
Rousseeuw, P.J. and Croux, C. (1993), "Alternatives to the Median Absolute Deviation," Journal of the American Statistical Association, 88, 1273 -1283.
Rousseeuw, P.J. and Hubert, M. (1997), "Recent developments in PROGRESS," in L₁-Statistical Procedures and Related Topics, ed. by Y. Dodge, IMS Lecture Notes - Monograph Series, No. 31, 201 -214.
Rousseeuw, P.J. and Leroy, A.M. (1987), Robust Regression and Outlier Detection, New York: John Wiley & Sons, Inc.
Rousseeuw, P.J. and Van Driessen, K. (1997), ``A fast Algorithm for the Minimum Covariance Determinant Estimator,'' submitted for publication.
Rousseeuw, P.J. and Van Zomeren, B.C. (1990), "Unmasking Multivariate Outliers and Leverage Points," Journal of the American Statistical Association, 85, 633 -639.

Chapter Contents
Previous
Next
Top