See discussions, stats, and author profiles for this publication at: https://www.researchgate.net/publication/50416849
All content following this page was uploaded by Daniel Peña on 27 May 2014.
In this article, we present a simple multivariate outlier-detection procedure and a robust estimator for the covariance matrix, based on the use of information obtained from projections onto the directions that maximize and minimize the kurtosis coefficient of the projected data. The properties of this estimator (computational cost, bias) are analyzed and compared with those of other robust estimators described in the literature through simulation studies. The performance of the outlier-detection procedure is analyzed by applying it to a set of well-known examples.
The detection of outliers in multivariate data is recognized to be an important and difficult problem in the physical, chemical, and engineering sciences. Whenever multiple measurements are obtained, there is always the possibility that changes in the measurement process will generate clusters of outliers. Most standard multivariate analysis techniques rely on the assumption of normality and require the use of estimates for both the location and scale parameters of the distribution. The presence of outliers may distort arbitrarily the values of these estimators and render meaningless the results of the application of these techniques. According to Rocke and Woodruff (1996), the problem of the joint estimation of location and shape is one of the most difficult in robust statistics.

Wilks (1963) proposed identifying sets of outliers of size j in normal multivariate data by checking the minimum values of the ratios |A(I)|/|A|, where |A(I)| is the internal scatter of a modified sample in which the set of observations I of size j has been deleted and |A| is the internal scatter of the complete sample. The internal scatter is proportional to the determinant of the covariance matrix, and the ratios are computed for all possible sets of size j. Wilks computed the distribution of the statistic for j equal to 1 and 2. It is well known that this procedure is a likelihood ratio test and that for j = 1 the method is equivalent to selecting the observation with the largest Mahalanobis distance from the center of the data.

Because a direct extension of this idea to sets of outliers larger than 2 or 3 is not practical, Gnanadesikan and Kettenring (1972) proposed to reduce the multivariate detection problem to a set of univariate problems by looking at projections of the data onto some direction. They chose the direction of maximum variability of the data and, therefore, they proposed to obtain the principal components of the data and search for outliers in these directions. Although this method provides the correct solution when the outliers are located close to the directions of the principal components, it may fail to identify outliers in the general case.

An alternative approach is to use robust location and scale estimators. Maronna (1976) studied affinely equivariant M estimators for covariance matrices, and Campbell (1980) proposed using the Mahalanobis distance computed using M estimators for the mean and covariance matrix. Stahel (1981) and Donoho (1982) proposed to solve the dimensionality problem by computing the weights for the robust estimators from the projections of the data onto some directions. These directions were chosen to maximize distances based on robust univariate location and scale estimators, and the optimal values for the distances could also be used to weigh each point in the computation of a robust covariance matrix. To ensure a high breakdown point, one global optimization problem with discontinuous derivatives had to be solved for each data point, and the associated computational cost became prohibitive for large high-dimensional datasets. This computational cost can be reduced if the directions are generated by a resampling procedure of the original data, but the number of directions to consider still grows exponentially with the dimension of the problem.

A different procedure was proposed by Rousseeuw (1985) based on the computation of the ellipsoid with the smallest volume or with the smallest covariance determinant that would encompass at least half of the data points. This procedure has been analyzed and extended in a large number of articles; see, for example, Hampel, Ronchetti, Rousseeuw, and Stahel (1986), Rousseeuw and Leroy (1987), Davies (1987), Rousseeuw and van Zomeren (1990), Tyler (1991), Cook, Hawkins, and Weisberg (1993), Rocke and Woodruff (1993, 1996), Maronna and Yohai (1995), Agulló (1996), Hawkins and Olive (1999), Becker and Gather (1999), and Rousseeuw and Van Driessen (1999). Public-domain codes implementing these procedures can be found in STATLIB, for example, the code FSAMVE from Hawkins (1994) and MULTOUT from Rocke and Woodruff (1993, 1996). FAST-MCD from Rousseeuw and Van Driessen is implemented as the "mcd.cov" function of S-PLUS.

© 2001 American Statistical Association and the American Society for Quality
TECHNOMETRICS, AUGUST 2001, VOL. 43, NO. 3

MULTIVARIATE OUTLIER DETECTION AND ROBUST COVARIANCE MATRIX ESTIMATION
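For j = 1, Wilks's ratio test described above amounts to flagging the observation with the largest Mahalanobis distance from the sample mean. The sketch below is our own illustration of that baseline (toy data and helper names are hypothetical), not code from the paper:

```python
# Illustrative sketch (not the authors' code): for j = 1, Wilks's procedure
# reduces to flagging the observation with the largest Mahalanobis distance
# from the sample mean, computed with the classical (nonrobust) estimates.

def mean_vec(xs):
    n = len(xs)
    return [sum(x[k] for x in xs) / n for k in range(2)]

def cov2(xs):
    # 2x2 sample covariance matrix (divisor n - 1)
    n = len(xs)
    m = mean_vec(xs)
    s = [[0.0, 0.0], [0.0, 0.0]]
    for x in xs:
        d = [x[0] - m[0], x[1] - m[1]]
        for a in range(2):
            for b in range(2):
                s[a][b] += d[a] * d[b] / (n - 1)
    return m, s

def mahalanobis_sq(x, m, s):
    # Explicit inverse of the 2x2 covariance matrix
    det = s[0][0] * s[1][1] - s[0][1] * s[1][0]
    inv = [[s[1][1] / det, -s[0][1] / det],
           [-s[1][0] / det, s[0][0] / det]]
    d = [x[0] - m[0], x[1] - m[1]]
    return (d[0] * (inv[0][0] * d[0] + inv[0][1] * d[1])
            + d[1] * (inv[1][0] * d[0] + inv[1][1] * d[1]))

data = [(0.0, 0.1), (0.5, -0.2), (-0.3, 0.4), (0.2, 0.0),
        (-0.4, -0.5), (0.1, 0.3), (8.0, 8.0)]   # last point is a planted outlier
m, s = cov2(data)
dists = [mahalanobis_sq(x, m, s) for x in data]
suspect = max(range(len(data)), key=dists.__getitem__)
print(suspect)   # 6, the index of the planted outlier
```

With a cluster of outliers, the classical estimates themselves become distorted and this naive rule can fail (masking), which is what motivates the robust procedures discussed in the article.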
Because these procedures are based on the minimization of certain nonconvex and nondifferentiable criteria, these estimators are computed by resampling. For example, Rousseeuw (1993) proposed selecting p observations from the original sample and computing the direction orthogonal to the hyperplane defined by these observations. The maximum over this finite set of directions is used as an approximation to the exact solution. Unfortunately, the number of candidate solutions grows exponentially with the size of the problem and, as a consequence, the corresponding procedures become computationally expensive for even moderately sized problems. Hadi (1992, 1994), Atkinson (1994), Hawkins and Olive (1999), and Rousseeuw and Van Driessen (1999) presented methods to compute approximations for these estimates requiring reasonable computation times.

In this article, we present an alternative procedure, based on the analysis of the projections of the sample points onto a certain set of 2p directions, where p is the dimension of the sample space. These directions are obtained by maximizing and minimizing the kurtosis coefficient of the projections. The proposed procedure can be seen as an empirically successful and faster way of implementing the Stahel–Donoho (SD) algorithm. The justification for using these directions is presented in Section 1. Section 2 describes the proposed procedure and illustrates its behavior on an example. Section 3 compares it to other procedures by a simulation study. It is shown that the proposed procedure works well in practice, is simple to implement, and requires reasonable computation times, even for large problems. Finally, Section 4 presents some conclusions.

1. KURTOSIS AND OUTLIERS

The idea of using projections to identify outliers is the basis for several outlier-detection procedures. These procedures rely on the fact that in multivariate contaminated samples each outlier must be an extreme point along the direction from the mean of the uncontaminated data to the outlier. Unfortunately, high-breakdown-point methods developed to date along these lines, such as the SD algorithm, require projecting the data onto randomly generated directions and need very large numbers of directions to be successful. The efficiency of these methods could be significantly improved, at least from a computational point of view, if a limited number of appropriate directions would suffice to identify the outliers. Our proposal is to choose these directions based on the values of the kurtosis coefficients of the projected observations.

In this section, we study the impact of the presence of outliers on the kurtosis values and the use of this moment coefficient to identify them. We start by considering the univariate case, in which different types of outliers produce different effects on the kurtosis coefficient. Outliers generated by the usual symmetric contaminated model increase the kurtosis coefficient of the observed data. A small proportion of outliers generated by an asymmetric contaminated model also increases the kurtosis coefficient of the observed data. These two results suggest that for multivariate data outliers may be revealed on univariate projections onto directions obtained by maximizing the kurtosis coefficient of the projected data. However, a large proportion of outliers generated by an asymmetric contamination model can make the kurtosis coefficient of the data very small, close to its minimum possible value. This result suggests searching for outliers also using directions obtained by minimizing the kurtosis of the projections. Therefore, a procedure that would search for outliers by projecting the data onto the directions that maximize or minimize the kurtosis of the projected points would seem promising.

In univariate normal data, outliers have often been associated with large kurtosis values, and some well-known tests of normality are based on the asymmetry and kurtosis coefficients. These ideas have also been used to test for multivariate normality (see Malkovich and Afifi 1973). Additionally, some projection indices that have been applied in projection pursuit algorithms are related to the third and fourth moments (Jones and Sibson 1987; Posse 1995). Hampel (1985) derived the relationship between the critical value and the breakdown point of the kurtosis coefficient in univariate samples. He also showed that two-point distributions are the least favorable for detecting univariate outliers using the kurtosis coefficient.

To understand the effect of different types of outliers on the kurtosis coefficient, suppose that we have a sample of univariate data from a random variable that has a distribution F with finite moments (the uncontaminated sample). We assume without loss of generality that μ_F = ∫ x dF(x) = 0, and we will use the notation m_F(j) = ∫ x^j dF(x). The sample is contaminated by a fraction α < 1/2 of outliers generated from some contaminating distribution G, with μ_G = ∫ x dG(x), and we will denote the centered moments of this distribution by m_G(j) = ∫ (x − μ_G)^j dG(x). Therefore, the resulting observed random variable X follows a mixture of two distributions, (1 − α)F + αG. The signal-to-noise ratio will be given by r² = μ_G²/m_F(2), and the ratio of variances of the two distributions will be v² = m_G(2)/m_F(2). The third- and fourth-order moment coefficients for the mixture and the original and contaminating distributions will be denoted by a_i = m_i(3)/m_i(2)^{3/2} and γ_i = m_i(4)/m_i(2)² for i = X, F, G, respectively. The ratio of the kurtosis coefficients of G and F will be κ = γ_G/γ_F.

Some conditions must be introduced on F and G to ensure that this is a reasonable model for outliers. The first condition is that, for any values of the distribution parameters, the standard distribution F has a bounded kurtosis coefficient, γ_F. This bound avoids the situation in which the tails of the standard distribution are so heavy that extreme observations, which cannot be distinguished from outliers, will appear with significant probability. Note that the most often used distributions (normal, Student's t, gamma, beta, ...) satisfy this condition. The second condition is that the contaminating distribution G is such that

    μ_G m_G(3) ≥ 0;    (1)

that is, if the distribution is not symmetric, the relevant tail of G for the generation of outliers and the mean of G both lie on the same side with respect to the mean of F. This second assumption avoids situations in which most of the observations generated from G might not be outliers.

To analyze the effect of outliers on the kurtosis coefficient, we write its value for the contaminated population as (see the appendix for the derivation)

    γ_X = [γ_F + α(1 − α)(c4 + 4r c3 + 6r² c2 + r⁴ c0)] / (h0 + h1 r² + h2 r⁴),    (2)
where c4 = γ_F(κv⁴ − 1)/(1 − α), c3 = a_G v³ − a_F, c2 = α + (1 − α)v², c0 = α³ + (1 − α)³, h0 = (1 + α(v² − 1))², h1 = 2α(1 − α)h0^{1/2}, and h2 = α²(1 − α)². We consider the two following cases:

1. The centered case, in which we suppose that both F and G have the same mean, and as a consequence μ_G = 0 and r = 0. From (2), we obtain for this case

    γ_X = γ_F (1 + α(κv⁴ − 1)) / (1 + α(v² − 1))².

Note that the kurtosis coefficient increases due to the presence of outliers; that is, γ_X ≥ γ_F whenever κv⁴ − 2v² + 1 ≥ α(v² − 1)². This holds if κ ≥ 1, or equivalently if γ_G ≥ γ_F, and under this condition the kurtosis will increase for any value of α. Thus in the usual situation in which the outlier model is built by using a contaminating distribution of the same family as the original distribution [as in the often-used normal scale-contaminated model; see, for instance, Box and Tiao (1968)] or with heavier tails, the kurtosis coefficient of the observed data is expected to be larger than that of the original distribution.

2. Consider now the noncentered case, in which both distributions are arbitrary and we assume that the means of G and F are different (μ_G ≠ 0). A reasonable condition to ensure that G will generate outliers for F is that the signal-to-noise ratio r is large enough. If we let r → ∞ in (2) (and we assume that the moment coefficients in the expression remain bounded), we obtain

    γ_X → (α³ + (1 − α)³) / (α(1 − α)).

This result agrees with the one obtained by Hampel (1985). Note that if α = 0.5 the kurtosis coefficient of the observed data will be equal to 1, the minimum possible value. On the other hand, if α → 0 the kurtosis coefficient increases without bound and will become larger than γ_F, which is bounded. Therefore, in the asymmetric case, if the contamination is very large the kurtosis coefficient will be very small, whereas if the contamination is small the kurtosis coefficient will be large.

The preceding results agree with the dual interpretation of the standard fourth-moment coefficient of kurtosis (see Ruppert 1987; Balanda and MacGillivray 1988) as measuring tail heaviness and lack of bimodality. A small number of outliers will produce heavy tails and a larger kurtosis coefficient. But, if we increase the amount of outliers, we can start introducing bimodality and the kurtosis coefficient may decrease.

Kurtosis and Projections

The preceding discussion centered on the behavior of the kurtosis coefficient in the univariate case as an indicator for the presence of outliers. A multivariate method to take advantage of these properties would proceed through two stages—determining a set of projection directions to obtain univariate samples and then conducting an analysis of these samples to determine if any outlier may be present in the original sample. As indicated in the introduction, this is the approach developed by Stahel and Donoho. In this section we will show how to find interesting directions to detect outliers.

The study of the univariate kurtosis coefficient indicates that the presence of outliers in the projected data will imply particularly large (or small) values for the kurtosis coefficient. As a consequence, it would be reasonable to use as projection directions those that maximize or minimize the kurtosis coefficient of the projected data. For a standard multivariate contamination model, we will show that these directions are able to identify a set of outliers.

Consider a p-dimensional random variable X following a (contaminated normal) distribution given as a mixture of normals of the form (1 − α)N(0, I) + αN(δe1, λI), where e1 denotes the first unit vector. This contamination model is particularly difficult to analyze for many outlier-detection procedures (e.g., see Maronna and Yohai 1995). Moreover, the analysis in the preceding section indicates that this model, for noncentered contaminating distributions, may correspond to an unfavorable situation from the point of view of the kurtosis coefficient.

Since the kurtosis coefficient is invariant to affine transformations, we will center and scale the variable to ensure that it has mean 0 and covariance matrix equal to the identity. This transformed variable, Y, will follow a distribution of the form (1 − α)N(m1, S) + αN(m2, λS), where

    m1 = −αδ S^{1/2} e1,    m2 = (1 − α)δ S^{1/2} e1,

    α1 = 1 − α(1 − λ),    α2 = δ²α(1 − α) / (α1 + δ²α(1 − α)),

    S = (1/α1)(I − α2 e1 e1'),    (3)

and α1 and α2 denote auxiliary parameters, introduced to simplify the expressions (see the appendix for a derivation of these values).

We wish to study the behavior of the univariate projections for this variable and their kurtosis coefficient values. Consider an arbitrary projection direction u. Using the affine invariance of the kurtosis coefficient, we will assume ‖u‖ = 1. The projected univariate random variable Z = u'Y will follow a distribution (1 − α)N(m1'u, u'Su) + αN(m2'u, λu'Su), with E(Z) = 0 and E(Z²) = 1. The kurtosis coefficient of Z will be given by

    γ_Z(ν) = a(α, δ, λ) + b(α, δ, λ)ν² + c(α, δ, λ)ν⁴,    (4)

where the coefficients a, b, and c correspond to

    a(α, δ, λ) = 3(1 − α + αλ²)/α1²,

    b(α, δ, λ) = (6α2/α1²)(1 − λ)(α²λ − (1 − α)²),

    c(α, δ, λ) = α2² [ 3(1 − α + αλ²)/α1² − 6(α + λ(1 − α))/α1 + (α³ + (1 − α)³)/(α(1 − α)) ],    (5)

and ν ≡ u1 = e1'u (see the appendix for details on this derivation).

We wish to study the relationship between the direction to the outliers, e1 in our model, and the directions u that correspond to extremes for the projected kurtosis coefficient.
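Because Z is a mixture of two univariate normals, expression (4) with the coefficients (5) can be cross-checked against a direct moment computation. The following sketch is our own numerical check, not code from the paper; it uses the parameterization of α1, α2, and ν given above:

```python
def gamma_z_closed_form(alpha, delta, lam, nu):
    # Equation (4) with the coefficients a, b, c of equation (5).
    a1 = 1.0 - alpha * (1.0 - lam)
    a2 = delta**2 * alpha * (1 - alpha) / (a1 + delta**2 * alpha * (1 - alpha))
    a = 3.0 * (1 - alpha + alpha * lam**2) / a1**2
    b = (6.0 * a2 / a1**2) * (1 - lam) * (alpha**2 * lam - (1 - alpha)**2)
    c = a2**2 * (3.0 * (1 - alpha + alpha * lam**2) / a1**2
                 - 6.0 * (alpha + lam * (1 - alpha)) / a1
                 + (alpha**3 + (1 - alpha)**3) / (alpha * (1 - alpha)))
    return a + b * nu**2 + c * nu**4

def gamma_z_direct(alpha, delta, lam, nu):
    # Kurtosis of Z = u'Y computed from exact normal-mixture moments:
    # Z has components N(-alpha*tau, s2) and N((1-alpha)*tau, lam*s2).
    a1 = 1.0 - alpha * (1.0 - lam)
    a2 = delta**2 * alpha * (1 - alpha) / (a1 + delta**2 * alpha * (1 - alpha))
    tau = delta * nu * ((1 - a2) / a1) ** 0.5
    s2 = (1.0 - a2 * nu**2) / a1
    mu1, mu2 = -alpha * tau, (1 - alpha) * tau
    ez2 = (1 - alpha) * (mu1**2 + s2) + alpha * (mu2**2 + lam * s2)
    ez4 = ((1 - alpha) * (mu1**4 + 6 * mu1**2 * s2 + 3 * s2**2)
           + alpha * (mu2**4 + 6 * mu2**2 * lam * s2 + 3 * (lam * s2)**2))
    return ez4 / ez2**2   # ez2 equals 1 up to rounding

cases = [(0.2, 4.0, 0.5, 0.6), (0.05, 5.0, 2.0, 1.0), (0.45, 3.0, 0.3, 0.25)]
max_diff = max(abs(gamma_z_closed_form(*cs) - gamma_z_direct(*cs)) for cs in cases)
print(max_diff < 1e-9)   # True
```

The two computations agree to floating-point precision, which is a convenient way to validate an implementation of (5) before using it.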
The optimization problem defining these extreme directions would be either

    max_ν γ_Z(ν)    subject to    −1 ≤ ν ≤ 1    (6)

or the equivalent minimization problem.

From the first-order optimality conditions for (6), the extremes may correspond either to a point in the interval (−1, 1) such that γ_Z'(ν) = 0 or to the extreme points of the interval, ν = ±1. Since γ_Z' = 4cν³ + 2bν, the points that make this derivative equal to 0 are ν = 0 and ν = ±√(−b/(2c)). We now analyze in detail the nature of each of these three possible extreme points.

1. ν = ±1—that is, the direction of the outliers. This direction corresponds to a local maximizer whenever 4c + 2b > 0 (the derivative at ν = 1) and to a minimizer whenever 4c + 2b < 0 (the case in which 4c + 2b = 0 will be treated when considering the third candidate to an extreme point). The expression for 4c + 2b from (5) is quite complex to analyze in the general case, but for the case of small contamination levels (α → 0) it holds that α1 → 1, α2/α → δ², and

    lim_{α→0} (4c + 2b)/α = 4δ⁴ + 12δ²(λ − 1),

implying that, since this value is positive except for very small values of δ and λ (λ ≤ 1 − δ²/3), for small values of α the direction of the outliers will be a maximizer for the kurtosis coefficient. Moreover, for large contamination levels (α → 1/2), after some manipulation of the expressions in (5) it holds that

    lim_{α→1/2} (4c + 2b) = −8δ² [3(λ − 1)² + δ²(λ + 1)] / [(λ + 1)(2 + 2λ + δ²)²] < 0,

and, as a consequence, if α is large we always have a minimizer along the direction of the outliers.

2. ν = 0, a direction orthogonal to the outliers. Along this direction, as γ_Z''(0) = 2b, the kurtosis coefficient has a maximizer whenever b < 0. For small contaminations, from lim_{α→0} b/α = 6δ²(λ − 1), the kurtosis coefficient has a maximizer for λ < 1 and a minimizer for λ > 1. Comparing the kurtosis coefficient values when the direction to the outliers and a direction orthogonal to it are both local maximizers (when λ < 1), from (4) and

    lim_{α→0} (γ_Z(±1) − γ_Z(0))/α = δ²(δ² − 6(1 − λ)),

it follows that ν = ±1 corresponds to the global maximizer, except for very small values of δ. For large contaminations, from

    lim_{α→1/2} b = −6δ² ((λ − 1)/(λ + 1))² (1/(δ² + 2λ + 2)),

the kurtosis coefficient has a maximizer at ν = 0.

3. ν = ±√(−b/(2c)), an intermediate direction, if this value lies in the interval [−1, 1]—that is, if 0 < −b/(2c) < 1. For small contamination levels (α → 0), it holds that

    lim_{α→0} −b/(2c) = 3(1 − λ)/δ²,

and this local extreme point exists whenever 1 − δ²/3 < λ < 1—that is, basically when the dispersion of the contamination is smaller than 1. Additionally, since at this point γ_Z'' = −4b and b < 0 whenever λ < 1, it follows that γ_Z'' > 0. Consequently, for small concentrated contaminations this additional extreme point exists and is a minimizer. For large contamination levels, it holds that

    lim_{α→1/2} −b/(2c) = 3 ((λ − 1)/δ)² (2 + 2λ + δ²)/(1 − 10λ + λ²).

For λ ≥ 0 and any δ, this expression is either negative or larger than 1. As a consequence, no extreme intermediate direction exists if the contamination is large.

Table 1. Extreme Directions for the Concentrated Contamination Model

                            Small contamination
    Direction               λ < 1          λ > 1          Large contamination
    ν = ±1                  Global max.    Global max.    Global min.
    ν = 0                   Local max.     Global min.    Global max.
    ν = ±√(−b/(2c))         Global min.    —              —

Table 1 provides a brief summary of the preceding results. Entries "Global max." and "Global min." indicate if a given direction is the global maximizer or the global minimizer of the kurtosis coefficient, respectively. "Local max." indicates the case in which the direction orthogonal to the outliers is a local maximizer for the kurtosis coefficient.

To detect the outliers, the procedure should be able to compute the direction to the outliers (ν = ±1) from Problem (6). However, from Table 1, to obtain this direction we would need both the global minimizer and the global maximizer of (6). In practice this cannot be done efficiently; as an alternative solution, we compute one local minimizer and one local maximizer. These computations can be done with reasonable computational effort, as we describe later on. This would not ensure obtaining the direction to the outliers, because in some cases these two directions might correspond to a direction orthogonal to the outliers and the intermediate direction, for example, but we are assured of obtaining either the direction to the outliers or a direction orthogonal to it. The computation procedure is then continued by projecting the data onto a subspace orthogonal to the computed directions, and additional directions are obtained by solving the resulting optimization problems. In summary, we will compute one minimizer and one maximizer for the projected kurtosis coefficient, project the data onto an orthogonal subspace, and repeat this procedure until 2p directions have been computed. For the case analyzed previously, this procedure should ensure that the direction to the outliers is one of these 2p directions.

Note that, although we have considered a special contamination pattern, this suggested procedure also seems reasonable in those cases in which the contamination patterns are different—for example, when more than one cluster of contaminating observations is present.

2. DESCRIPTION OF THE ALGORITHM

We assume that we are given a sample (x1, ..., xn) of a p-dimensional vector random variable X.
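The qualitative pattern summarized in Table 1 can be reproduced numerically by evaluating the projection kurtosis over a grid of ν values. The sketch below is our own illustration, with arbitrary parameter values; the kurtosis is computed from the exact moments of the two-component mixture rather than from the authors' code:

```python
def gamma_z(alpha, delta, lam, nu):
    # Projection kurtosis for the contaminated normal model of Section 1;
    # nu is the cosine between the projection direction u and e1.
    a1 = 1.0 - alpha * (1.0 - lam)
    a2 = delta**2 * alpha * (1 - alpha) / (a1 + delta**2 * alpha * (1 - alpha))
    tau = delta * nu * ((1 - a2) / a1) ** 0.5
    s2 = (1.0 - a2 * nu**2) / a1
    mu1, mu2 = -alpha * tau, (1 - alpha) * tau
    return ((1 - alpha) * (mu1**4 + 6 * mu1**2 * s2 + 3 * s2**2)
            + alpha * (mu2**4 + 6 * mu2**2 * lam * s2 + 3 * (lam * s2)**2))

grid = [i / 100.0 for i in range(-100, 101)]

# Small concentrated contamination (alpha = 0.05, lambda < 1):
# the global maximizer is the direction of the outliers, nu = +/-1.
nu_max_small = max(grid, key=lambda nu: gamma_z(0.05, 5.0, 0.5, nu))

# Large contamination (alpha = 0.45): the maximizer moves to nu = 0
# and the direction of the outliers becomes the global minimizer.
nu_max_large = max(grid, key=lambda nu: gamma_z(0.45, 5.0, 0.5, nu))
nu_min_large = min(grid, key=lambda nu: gamma_z(0.45, 5.0, 0.5, nu))

print(abs(nu_max_small), nu_max_large, abs(nu_min_large))   # 1.0 0.0 1.0
```

This is exactly the behavior that motivates computing both a maximizer and a minimizer of the projected kurtosis in the algorithm that follows.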
DANIEL PEÑA AND FRANCISCO J. PRIETO

The proposed procedure is based on projecting each observation onto a set of 2p directions and then analyzing the univariate projections onto these directions in a similar manner to the SD algorithm. These directions are obtained as the solutions of 2p simple smooth optimization problems, as follows:

1. The original data are rescaled and centered. Let x̄ denote the mean and S the covariance matrix of the original data; then the points are transformed using

    y_i = S^{−1/2}(x_i − x̄),    i = 1, ..., n.    (7)

2. Compute p orthogonal directions and projections maximizing the kurtosis coefficient.

a. Set y_i^{(1)} = y_i and the iteration index j = 1.

b. The direction that maximizes the coefficient of kurtosis is obtained as the solution of the problem

    d_j = arg max_d (1/n) Σ_{i=1}^n (d'y_i^{(j)})⁴
    s.t. d'd = 1.    (8)

c. The sample points are projected onto a lower-dimension subspace, orthogonal to the direction d_j. Define

    v_j = d_j − e1,    Q_j = I − v_j v_j'/(v_j'd_j) if v_j'd_j ≠ 0, and Q_j = I otherwise,

where e1 denotes the first unit vector. The resulting matrix Q_j is orthogonal, and we compute the new values

    u_i^{(j)} ≡ (z_i^{(j)}, y_i^{(j+1)}) = Q_j y_i^{(j)},    i = 1, ..., n,

where z_i^{(j)} is the first component of u_i^{(j)}, which satisfies z_i^{(j)} = d_j'y_i^{(j)} (the univariate projection values), and y_i^{(j+1)} corresponds to the remaining p − j components of u_i^{(j)}. We set j = j + 1 and, if j < p, we go back to step 2b. Otherwise, we let z_i^{(p)} = y_i^{(p)}.

3. We compute another set of p orthogonal directions and projections minimizing the kurtosis coefficient.

a. Reset y_i^{(p+1)} = y_i and j = p + 1.

b. The preceding steps 2b and 2c are repeated, but now instead of (8) we solve the minimization problem

    d_j = arg min_d (1/n) Σ_{i=1}^n (d'y_i^{(j)})⁴
    s.t. d'd = 1    (9)

to compute the projection directions.

4. To determine if z_i^{(j)} is an outlier in any one of the 2p directions, we compute a univariate "measure of outlyingness" for each observation as

    r_i = max_{1≤j≤2p} |z_i^{(j)} − median(z^{(j)})| / MAD(z^{(j)}).    (10)

5. These measures r_i are used to test if a given observation is considered to be an outlier. If r_i > β_p, then observation i is suspected of being an outlier and labeled as such. The cutoff values β_p are chosen to ensure a reasonable level of Type I errors and depend on the sample space dimension p.

6. If the condition in Step 5 were satisfied for some i, a new sample composed of all observations i such that r_i ≤ β_p is formed, and the procedure is applied again to the reduced sample. This is repeated until either no additional observations satisfy r_i > β_p or the number of remaining observations would be less than (n + p + 1)/2.

7. Finally, a Mahalanobis distance is computed for all observations labeled as outliers in the preceding steps, using the data (mean and covariance matrix) from the remaining observations. Let U denote the set of all observations not labeled as outliers. The algorithm computes

    m̃ = (1/|U|) Σ_{i∈U} x_i,

    S̃ = (1/(|U| − 1)) Σ_{i∈U} (x_i − m̃)(x_i − m̃)',

and

    v_i = (x_i − m̃)'S̃^{−1}(x_i − m̃)    for all i ∉ U.

Those observations i ∉ U such that v_i < χ²_{p,0.99} are considered not to be outliers and are included in U. The process is repeated until no more such observations are found (or U becomes the set of all observations).

Table 2. Cutoff Values for Univariate Projections

    Sample space dimension p      5      10     20
    Cutoff value β_p              4.0    6.9    10.8

The values of β_p in Step 5 of the algorithm have been obtained from simulation experiments to ensure that, in the absence of outliers, the percentage of correct observations mislabeled as outliers is approximately equal to 5%. Table 2 shows the values used for several sample-space dimensions. The values for other dimensions could be obtained by interpolating log β_p linearly in log p.

2.1 Computation of the Projection Directions

The main computational effort in the application of the preceding algorithm is associated with the determination of local solutions for either (8) or (9). This computation can be conducted in several ways:

1. Applying a modified version of Newton's method.

2. Obtaining the solution directly from the first-order optimality conditions. The optimality conditions for both problems are

    4 Σ_{i=1}^n (d'y_i^{(j)})³ y_i^{(j)} − 2λd = 0,
    d'd = 1.

Multiplying the first equation by d and replacing the constraint, we obtain the value of λ. The resulting condition is

    Σ_{i=1}^n (d'y_i^{(j)})² y_i^{(j)} y_i^{(j)'} d = Σ_{i=1}^n (d'y_i^{(j)})⁴ d.    (11)

This equation indicates that the optimal d will be a unit eigenvector of the matrix

    M(d) ≡ Σ_{i=1}^n (d'y_i^{(j)})² y_i^{(j)} y_i^{(j)'},

that is, of a weighted covariance matrix for the sample, with positive weights (depending on d).
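Steps 4 and 5 above can be sketched in a few lines. The code below is our own illustration, not the authors' implementation; the toy projection values are invented and the cutoff is only indicative, in the spirit of Table 2:

```python
def median(xs):
    s = sorted(xs)
    n = len(s)
    mid = n // 2
    return s[mid] if n % 2 else 0.5 * (s[mid - 1] + s[mid])

def outlyingness(projections):
    # projections[j][i]: projection of observation i onto direction j.
    # Implements r_i = max_j |z_ij - median(z_j)| / MAD(z_j), equation (10).
    n = len(projections[0])
    r = [0.0] * n
    for z in projections:
        med = median(z)
        mad = median([abs(v - med) for v in z])
        for i, v in enumerate(z):
            r[i] = max(r[i], abs(v - med) / mad)
    return r

# One direction, five observations; the last value is far from the bulk.
z = [[0.1, -0.2, 0.0, 0.3, 8.0]]
r = outlyingness(z)
beta = 4.0                      # illustrative cutoff (Table 2 gives 4.0 for p = 5)
flags = [ri > beta for ri in r]
print(flags)   # [False, False, False, False, True]
```

Because the median and the MAD are barely affected by the outlying value, the planted observation stands out with r ≈ 39.5, while the regular observations stay well below the cutoff.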
Moreover, since the eigenvalue at the solution is the value of the fourth moment, we are interested in computing the eigenvector corresponding to the largest or smallest eigenvalue.

In summary, an iterative procedure to compute the direction d proceeds through the following steps:

1. Select an initial direction d̄_0 such that ‖d̄_0‖_2 = 1.
2. In iteration l + 1, compute d̄_{l+1} as the unit eigenvector associated with the largest (smallest) eigenvalue of M(d̄_l).
3. Terminate whenever ‖d̄_{l+1} − d̄_l‖ < ε, and set d_j = d̄_{l+1}.

Another relevant issue is the definition of the initial direction d̄_0. Our choice has been to start with the direction corresponding to the largest (when computing maximizers for the kurtosis) or the smallest (when minimizing) principal components of the normalized observations y_i^{(j)}/‖y_i^{(j)}‖. These directions have the property that, once the observations have been standardized, they are affine equivariant. They would also correspond to directions along which the observations projected onto the unit hypersphere seem to present some relevant structure; this would provide a reasonable starting point when the outliers are concentrated, for example. A more detailed discussion on the motivation for this choice of initial directions was given by Juan and Prieto (1997).

Table 3. Scaling Factors for the Robust Covariance Estimator

    Sample space dimension p      5       10      20
    Scaling factor k_p            0.98    0.95    0.92

2.2 Robust Covariance Matrix Estimation

Once the observations have been labeled either as outliers or as part of the uncontaminated sample, following the procedure described previously, it is possible to generate robust estimates for the mean and covariance of the data as the mean and covariance of the uncontaminated observations. This approach, as opposed to the use of weight functions, seems reasonable given the reduced Type I errors associated with the procedure. Nevertheless, note that it is necessary to correct the covariance estimator to account for the bias associated with these errors.

Figure 1. Scatterplots for a Dataset With α = 0.1. The axes correspond to projections onto (a) the direction to the outliers (x axis) and an orthogonal direction (y axis), (b) the direction maximizing the kurtosis coefficient (x axis) and an orthogonal direction (y axis), (c) the direction minimizing the kurtosis coefficient (x axis) and an orthogonal direction (y axis).
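The fixed-point iteration of Section 2.1 can be sketched as follows. This is our own minimal implementation for p = 2, with the eigenvector step done by power iteration; the sphering step (7), the deflation step 2c, and the principal-component starting direction are omitted for brevity, and the data are synthetic:

```python
import random

def unit(v):
    nrm = (v[0] ** 2 + v[1] ** 2) ** 0.5
    return [v[0] / nrm, v[1] / nrm]

def matvec(m, v):
    return [m[0][0] * v[0] + m[0][1] * v[1],
            m[1][0] * v[0] + m[1][1] * v[1]]

def top_eigvec(m, v0, iters=200):
    # Power iteration for the dominant eigenvector of a symmetric
    # positive-semidefinite 2x2 matrix.
    v = unit(v0)
    for _ in range(iters):
        v = unit(matvec(m, v))
    return v

def weighted_scatter(ys, d):
    # M(d) = sum_i (d'y_i)^2 y_i y_i'
    m = [[0.0, 0.0], [0.0, 0.0]]
    for y in ys:
        w = (d[0] * y[0] + d[1] * y[1]) ** 2
        for a in range(2):
            for b in range(2):
                m[a][b] += w * y[a] * y[b]
    return m

def kurtosis_direction(ys, d0, iters=50):
    # Fixed-point iteration of Section 2.1: d_{l+1} is the unit eigenvector
    # of M(d_l) associated with the largest eigenvalue (kurtosis maximizer).
    d = unit(d0)
    for _ in range(iters):
        d = top_eigvec(weighted_scatter(ys, d), d)
    return d

def proj_kurtosis(ys, u):
    zs = [u[0] * y[0] + u[1] * y[1] for y in ys]
    n = len(zs)
    m = sum(zs) / n
    m2 = sum((z - m) ** 2 for z in zs) / n
    m4 = sum((z - m) ** 4 for z in zs) / n
    return m4 / m2 ** 2

# Synthetic data (ours): a standard normal cloud plus a small cluster near (10, 10).
random.seed(1)
ys = [[random.gauss(0, 1), random.gauss(0, 1)] for _ in range(90)]
ys += [[10 + random.gauss(0, 0.1), 10 + random.gauss(0, 0.1)] for _ in range(10)]

d = kurtosis_direction(ys, [1.0, 0.0])
perp = [-d[1], d[0]]
print(proj_kurtosis(ys, d) > proj_kurtosis(ys, perp))   # True
```

For this small concentrated contamination, the iteration settles on a direction close to the one pointing at the outlier cluster, and the projection onto it has a much larger kurtosis than the projection onto the orthogonal direction, as Table 1 predicts.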
The proposed estimators become

    m̃ = (1/|U|) Σ_{i∈U} x_i

and

    S̃_c = (1/((|U| − 1)k_d)) Σ_{i∈U} (x_i − m̃)(x_i − m̃)',

where U is the set of all observations not labeled as outliers, |U| denotes the number of observations in this set, and k_d is a constant that has been estimated to ensure that the trace of the estimated matrix is unbiased. The values of k_d have been obtained through a simulation experiment for several sample space dimensions and are given in Table 3. The values for other dimensions could be obtained by interpolating log k_p linearly in log p.

2.3 Examples

To illustrate the procedure and the relevance of choosing projection directions in the manner described previously, we show the results from the computation of the projection directions for a few simple cases. The first ones are based on generating 100 observations from a model of the form (1 − α)N(0, I) + αN(10e, I) in dimension 2, where e = (1, 1)', for different values of α.

Consider Figure 1, corresponding to the preceding model with α = 0.1. This figure shows the scatterplots of the data, where the axes have been chosen as (1) in Figure 1(a), the direction to the outliers (e) and a direction orthogonal to it; (2) in Figure 1(b), the direction giving a maximizer for the kurtosis coefficient (the x axis) and a direction orthogonal to it; and (3) in Figure 1(c), the direction corresponding to a minimizer for the kurtosis coefficient (also for the x axis) and a direction orthogonal to it. In this case, the direction maximizing the kurtosis coefficient allows the correct identification of the outliers, in agreement with the results in Table 1 for the case with small α.

Figure 2 shows another dataset, this time corresponding to α = 0.3, in the same format as Figure 1. As the analysis in the preceding section showed and the figure illustrates, here the relevant direction to identify the outliers is the one minimizing the kurtosis coefficient, given the large contamination present.

Figure 2. Scatterplots for a Dataset With α = 0.3. The axes correspond to projections onto (a) the direction to the outliers (x axis) and an orthogonal direction (y axis), (b) the direction maximizing the kurtosis coefficient (x axis) and an orthogonal direction (y axis), (c) the direction minimizing the kurtosis coefficient (x axis) and an orthogonal direction (y axis).

Finally, Figure 3 presents a dataset generated from the model using α = 0.2 in the same format as the preceding figures. It is remarkable in this case that both the direction maximizing the kurtosis coefficient and the direction minimizing it are not the best ones for identifying the outliers; instead, the direction orthogonal to that maximizing the kurtosis corresponds now to the direction to the outliers. The optimization procedure has computed a direction orthogonal to the outliers as the maximizer and an intermediate direction

Figure 3. Scatterplots for a Dataset With α = 0.2. The axes correspond to projections onto (a) the direction to the outliers (x axis) and an orthogonal direction (y axis), (b) the direction maximizing the kurtosis coefficient (x axis) and an orthogonal direction (y axis), (c) the direction minimizing the kurtosis coefficient (x axis) and an orthogonal direction (y axis).

tions onto four of the directions obtained from the application of the proposed procedure. Each plot gives the value of the projection onto one of the directions for each observation in the sample. The outliers are the last 40 observations and have been plotted using the symbols "+," "o," "#," and " " for each of the clusters, while the uncontaminated observations are the first 60 in the set and have been plotted using the
as the minimizer. As a consequence, the direction to the symbol “ü .”
outliers is obtained once Problem (6) is solved for the obser- Figure 4(a) shows the projections onto the kurtosis maxi-
vations projected onto the direction maximizing the kurtosis. mization direction. This direction is able to isolate the observa-
This result justi es that in some cases (for intermediate tions corresponding to one of the clusters of outliers in the data
contamination levels) it is important to compute directions (the one labeled as “#”) but not the remaining outliers. The
orthogonal to those corresponding to extremes in the kurtosis next direction, which maximizes the kurtosis on a subspace
coef cient, and this effect becomes even more signi cant as orthogonal to the preceding direction, reveals the outliers indi-
the sample-space dimension increases. cated as “C,” as shown in Figure 4(b). This process is repeated
Consider a nal example in higher dimension. A sample of until eight additional directions, maximizing the kurtosis onto
100 observations has been obtained by generating 60 obser- the corresponding orthogonal subspaces, are generated. The
vations from an N401 I5 distribution in dimension 10, and next direction obtained in this way (the third one maximizing
10 observations each from N410di 1 I5 distributions for i D the kurtosis) is not able to reveal any outliers, but the fourth,
11 : : : 1 4, where di were distributed uniformly on the unit shown in Figure 4(c), allows the identi cation of the outliers
hypersphere. Figure 4 shows the projections of these observa- shown as “o.” The remaining kurtosis maximization directions
TECHNOMETRICS, AUGUST 2001, VOL. 43, NO. 3
294 DANIEL PEÑA AND FRANCISCO J. PRIETO
(a) (b)
* ****** ****** *** ***+++++o oo
1 x 1
x
*
** **** ** ********************+++ +ooooooo * **************************** ************* oo x xx
oooooooo## ##xxxxxxx
x
xxx xxx ****** ** *** ** *** * *
0 ** * * * ** * + x 0 #
####
#
* ***
–1 –1
# #
–2 –2
#
# +
### +++ +++
–3 –3 ++
## # +
–4 –4
0 50 100 0 50 100
(c) (d)
4 2
++
oo oo o # #x x
oo ++ + ooooo ##### x
++ o# x
oo o 1 + +o o # x xx
o + x
2 # xx
0 * ** * o
* **
* * * ** ** ++++ +
+ #
* * ** *
0 ** *
** *** * ***** ******** *** +++ ####### xxxx
xxx *** **** **** ********* ****** **
** ** *********** * **
* ** * # # xx
***** ******** ** * ** ***** *
+ x
–1
* *
–2 –2
0 50 100 0 50 100
Figure 4. Univariate Projections Onto Directions Generated by the Algorithm for a Dataset in Dimension 10. The x axis represents the obser-
vation number, while the y axis corresponds to the projections of each observation onto (a) the ’ rst maximization direction, (b) the second
maximization direction, (c) the fourth maximization direction, (d) the third minimization direction.
(not shown in the gure) are not able to reveal any additional this property, and the values of the projections are af ne
groups of outliers. invariant. Note also that the initial point for the optimization
To detect the outliers labeled as “ ,” the kurtosis minimiza- algorithm is not affected by af ne transformations.
tion directions must be used. The rst two of these are again As a consequence of the analysis in Section 1.1, we con-
unable to reveal the presence of any outliers. On the other clude that the algorithm is expected to work properly if the
hand, the third minimization direction, shown in Figure 4(d), directions computed are those to the outliers or orthogonal to
allows the identi cation of (nearly) all the outliers at once (it them since additional orthogonal directions will be computed
is a direction on the subspace generated by the four directions in later iterations. It might fail if one of the computed direc-
to the centers of the outlier clusters). The remaining directions tions is the one corresponding to the intermediate extreme
are not very useful. direction (whenever it exists). This intermediate direction will
This example illustrates the importance of using both mini- correspond either to a maximizer or to a minimizer, depending
mization and maximization directions and in each case relying on the values of 1 „, and ‹. Because the projection step does
not just on the rst optimizer but on computing a full set of not affect the values of or ‹, if we assume that „ ! ˆ,
orthogonal directions.
this intermediate direction would be found either as part of
the set of p directions maximizing the kurtosis coef cient or
3. PROPERTIES OF THE ESTIMATOR as part of the p minimizers, but it cannot appear on both sets.
The computation of directions maximizing the kurtosis As a consequence, if this intermediate direction appears as a
coef cient is af ne equivariant. Note that the standardization maximizer (minimizer), the set of minimizing (maximizing)
of the data in Step 1 of the algorithm ensures that the resulting directions will include only the directions corresponding to
data are invariant to af ne transformations, except for a rota- — D 0 or — D 1 and, therefore, the true direction to the
tion. The computation of the projection directions preserves outliers will always be a member of one of these two sets.
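The two-dimensional examples discussed above can be imitated numerically. The sketch below is only an illustration of the idea, not the authors' algorithm: it brute-forces the direction maximizing the kurtosis coefficient over a grid of angles (feasible only in dimension 2), on a sample with small contamination, and checks that the maximizer points at the outliers. The normalization of e is an assumption made here for convenience.

```python
import numpy as np

rng = np.random.default_rng(0)
# Sample from (1 - a)N(0, I) + a N(10e, I) in dimension 2, with a = 0.1 and
# e = (1, 1)'/sqrt(2) (normalizing e is this sketch's assumption).
a, n = 0.1, 100
e = np.array([1.0, 1.0]) / np.sqrt(2.0)
k = int(a * n)
x = np.vstack([rng.standard_normal((n - k, 2)),
               10.0 * e + rng.standard_normal((k, 2))])

def proj_kurtosis(x, u):
    """Kurtosis coefficient of the data projected onto the unit vector u."""
    z = x @ u
    z = (z - z.mean()) / z.std()
    return float(np.mean(z ** 4))

# Brute-force search over directions; feasible only in dimension 2.
angles = np.linspace(0.0, np.pi, 1800, endpoint=False)
dirs = np.column_stack([np.cos(angles), np.sin(angles)])
kurt = [proj_kurtosis(x, u) for u in dirs]
u_max = dirs[int(np.argmax(kurt))]
print(abs(float(u_max @ e)))  # near 1: for small a the maximizer points at the outliers
```

For larger contamination levels the same search applied to the minimizer (argmin) illustrates the regime change described in the text.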
MULTIVARIATE OUTLIER DETECTION AND ROBUST COVARIANCE MATRIX ESTIMATION 295
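The affine-invariance claim above can be checked numerically. In this sketch (written from scratch, reproducing only the whitening idea of a Step-1-style standardization, not the paper's full algorithm), two affinely related samples are centered and whitened; the whitened versions differ only by a rotation Q, and the projected kurtosis coefficients agree once a direction is mapped by Q.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_normal((200, 3))
x[:20] += 5.0  # some contamination

def standardize(x):
    # Center and whiten with the sample covariance; any square root of S^{-1}
    # works, since different choices differ only by a rotation.
    xc = x - x.mean(axis=0)
    S = np.cov(xc, rowvar=False)
    L = np.linalg.cholesky(np.linalg.inv(S))
    return xc @ L, L

def proj_kurtosis(y, u):
    z = y @ u
    z = (z - z.mean()) / z.std()
    return float(np.mean(z ** 4))

A = rng.standard_normal((3, 3)) + 3.0 * np.eye(3)  # an invertible affine map
b = np.array([1.0, -2.0, 0.5])

y1, L1 = standardize(x)
y2, L2 = standardize(x @ A.T + b)
Q = np.linalg.inv(L1) @ A.T @ L2  # orthogonal: relates the two whitened samples
u = np.array([0.3, -0.5, 0.8]); u /= np.linalg.norm(u)
print(proj_kurtosis(y2, u) - proj_kurtosis(y1, Q @ u))  # ~0: affine invariant
```

The rotation Q drops out when optimizing the kurtosis over all directions, which is why the computed extreme directions, and the projections onto them, are affine invariant.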
Table 4. Results Obtained by the Proposed Algorithm, Using Both Maximization and Minimization Directions, on Some Small Datasets
Table 5. Results Obtained by the Proposed Algorithm, Using Only Maximization Directions, on Some Small Datasets
Table 6. Success Rates for the Detection of Outliers Forming One Cluster (columns: p and λ; success counts for FAST-MCD, SD, Kurtosis1, and Kurtosis2, under δ = 10 and under δ = 100)
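Experiments of the kind summarized in Table 6 have the following shape. The sketch below is only a stand-in: it uses a naive Mahalanobis rule with the true clean parameters instead of the estimators actually compared in the paper, and the parameter values are picked for illustration. A "success" means every planted outlier is flagged.

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(7)

def one_trial(n=100, p=5, a=0.2, delta=10.0, lam=0.1):
    # One replication: (1-a) clean N(0, I) points plus a cluster of outliers
    # at distance delta along the first coordinate axis, with variance lam.
    k = int(a * n)
    x = rng.standard_normal((n - k, p))
    out = delta * np.eye(p)[0] + np.sqrt(lam) * rng.standard_normal((k, p))
    x = np.vstack([x, out])
    # Naive detector: flag points whose squared distance from the (known)
    # clean center exceeds a chi-square cutoff.  Real methods must estimate
    # the center and scatter robustly, which is the hard part.
    flagged = np.sum(x ** 2, axis=1) > chi2.ppf(0.999, p)
    return bool(flagged[-k:].all())

successes = int(sum(one_trial() for _ in range(100)))
print(successes)  # with delta = 10 this concentrated cluster is far out, so near 100
```

Replacing the known parameters by an estimator (FAST-MCD, SD, or the kurtosis procedure) turns this harness into the actual comparison reported in the tables.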
Table 6 gives the number of samples in which all the outliers have been correctly identified for each set of parameter values and both the proposed algorithms (kurtosis1 and kurtosis2) and the FAST-MCD and SD algorithms. To limit the size of the table, we have shown only those cases in which one of the algorithms scored less than 95 successes.

The proposed method (kurtosis1) seems to perform much better than FAST-MCD for concentrated contaminations, while its behavior is worse for those cases in which the shape of the contamination is similar to that of the original data (λ = 1). From the results in Section 1, this case tends to be one of the most difficult ones for the kurtosis algorithm because the objective function is nearly constant for all directions, and for finite samples it tends to present many local minimizers, particularly along directions that are nearly orthogonal to the outliers. Nevertheless, this behavior, closely associated with the value λ = 1, disappears as λ moves away from 1. For example, for p = 10 and α = 0.3 the number of successes in 100 trials goes up from 23 for λ = 1 to 61 for √λ = 0.8 and 64 for √λ = 1.25. In any case, we have included the values for λ = 1 to show the worst-case behavior of the algorithm.

Regarding the SD algorithm, the proposed method behaves better for large space dimensions and large contamination levels, showing that it is advantageous to study the data on a small set of reasonably chosen projection directions, particularly in those situations in which a random choice of directions would appear to be inefficient.

The variant of the algorithm that uses only maximization directions (kurtosis2) presents much worse results than kurtosis1 when the contamination level is high and the contamination is concentrated. As the analysis in Section 1 suggested, in those cases the minimization directions are important.

The case analyzed in Table 6 covers a particular contamination model, the one analyzed in Section 1. It is interesting to study the behavior of the algorithm on other possible contamination models, for example when the outliers form several clusters. We have simulated cases with two and four clusters of outliers, constructed to contain the same number of observations (100α/k), with centers that lie at a distance δ = 10√p from the origin (the center of the uncontaminated observations) along random uniformly distributed directions. The variability inside each cluster is the same λ for all of them. Table 7 gives the results of these simulations, in the same format as Table 6.

Table 7. Success Rate for the Detection of Outliers Forming Two and Four Clusters (columns: p and λ; success counts for FAST-MCD, SD, Kurtosis1, and Kurtosis2, for 2 clusters and for 4 clusters)

The results are similar to those in Table 6. The proposed method works much better than FAST-MCD for small values of λ and worse for values of λ close to 1. Regarding the SD algorithm, the random choice of directions works better as the number of clusters increases. Nevertheless, note that, as the sample space dimension and the contamination level increase, the preceding results seem to indicate that the SD algorithm may start to become less efficient.

The results in Tables 6 and 7 show that the 2p directions obtained as extremes of the kurtosis coefficient of the projections can be computed in a few seconds and perform in most cases better than the thousands of directions randomly generated by the SD estimator, requiring a much larger computational time. Moreover, the minimization directions play a significant role for large concentrated contaminations. These results suggest that the SD estimator can be easily improved while preserving its good theoretical properties by including these 2p directions in addition to the other randomly selected directions. From the same results, we also see that the FAST-MCD code performs very well in situations in which the kurtosis procedure fails and vice versa. Again, a combination of these two procedures can be very fruitful.

The preceding tables have presented information related to the behavior of the procedures with respect to Type II errors. To complement this information, Type I errors have also been studied. Table 8 shows the average number of observations that are labeled as outliers by both procedures when 100 observations are generated from an N(0, I) distribution. Each value is based on 100 repetitions.

Table 8. Percentage of Normal Observations Mislabeled as Outliers

Dimension  FAST-MCD  SD    Kurtosis1  Kurtosis2
5          9.9       8.4   6.9        6.9
10         22.9      0.2   9.9        11.2
20         36.2      0.0   7.6        7.2

The kurtosis algorithm is able to limit the size of these errors through a proper choice of the constants β_p in Step 5 of the algorithm. The SD algorithm could also be adjusted in this way, although, in the implementation used, the cutoff for the observations has been chosen as √χ²_{p,0.95}, following the suggestion of Maronna and Yohai (1995).

A second important application of these procedures is the robust estimation of the covariance matrix. The same simulation experiments described previously have been repeated but now measuring the bias in the estimation of the covariance matrix. The chosen measure has been the average of the logarithms of the condition numbers for the robust covariance matrix estimators obtained using the three methods—FAST-MCD, SD, and kurtosis. Given the sample generation process, a value close to 0 would indicate a small bias in this condition number. Tables 9 and 10 show the average values for these estimates for the settings in Tables 6 and 7, respectively. To limit the size of the tables, only two values for the contamination level (α = 0.1, 0.3) have been considered.

The abnormally large entries in these tables correspond to situations in which the algorithm is not able to identify the outliers properly. An interesting result is that the kurtosis procedure does a very good job regarding this measure of performance in the estimation of the covariance matrix, at least whenever it is able to identify the outliers properly. Note in particular how well it compares to FAST-MCD, a procedure that should perform very well, particularly for small contamination levels or large dimensions. Its performance is even better when compared to SD, showing again the advantages of a nonrandom choice of projection directions.

Regarding computational costs, comparisons are not simple to carry out because the FAST-MCD code is a FORTRAN code, while the kurtosis procedure has been written in Matlab. Tables 4 and 5 include running times for some small datasets. Table 11 presents some running times for larger datasets, constructed in the same manner as those included in Tables 6–10. All cases correspond to α = 0.2, δ = 10√p, and λ = 0.1. The SD code used 15,000 replications for p = 30 and 30,000 for p = 40. All other values have been fixed as indicated for Table 6. The times correspond to the analysis of a single dataset and are based on the average of the running times for 10 random datasets. They have been obtained on a 450 MHz Pentium PC under Windows 98.

These times compare quite well with those of SD and FAST-MCD. Since the version of FAST-MCD we have used is a FORTRAN code, this should imply additional advantages if a FORTRAN implementation of the proposed procedure were developed. A Matlab implementation of the proposed procedures is available at http://halweb.uc3m.es/fjp/download.html.

4. CONCLUSIONS

A method to identify outliers in multivariate samples, based on the analysis of univariate projections onto directions that correspond to extremes for the kurtosis coefficient, has been motivated and developed. In particular, a detailed analysis has been conducted on the properties of the kurtosis coefficient in contaminated univariate samples and on the relationship between directions to outliers and extremes for the kurtosis in the multivariate case.

The method is affine equivariant, and it shows a very satisfactory practical performance, especially for large sample
Table 9. Average Logarithm of the Condition Numbers for Covariance Matrix Estimates, Outliers Forming One Cluster (columns: p and λ; entries for FAST-MCD, SD, Kurtosis1, and Kurtosis2, under δ = 10 and under δ = 100)
space dimensions and concentrated contaminations. In this sense, it complements the practical properties of MCD-based methods such as the FAST-MCD procedure. The method also produces good robust estimates for the covariance matrix, with low bias.

The associate editor of this article suggested a generalization of this method based on using the measure of multivariate kurtosis introduced by Arnold (1964) and discussed by Mardia (1970) and selecting h ≤ p directions at a time to maximize (or minimize) the h-variate kurtosis. A second set of h directions orthogonal to the first can then be obtained and the procedure can be repeated as in the proposed algorithm. This idea seems very promising for further research on this problem.

There are also practical problems in which the affine equivariance property may not be very relevant. For example, in many engineering problems arbitrary linear combinations of the design variables have no particular meaning. For these cases, and especially in the presence of concentrated contaminations, we have found that adding those directions that maximize the fourth central moment of the data results in a more powerful procedure.

The results presented in this article emphasize the advantages of combining random and specific directions. It can be expected that, if we have a large set of random uniformly distributed outliers in high dimension, a method that computes a very large set of random directions will be more powerful than another one that computes a small number of specific
Table 10. Average Logarithm of the Condition Numbers for Covariance Matrix Estimates, Outliers Forming Two and Four Clusters (columns: p and λ; entries for FAST-MCD, SD, Kurtosis1, and Kurtosis2, for 2 clusters and for 4 clusters)
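The bias measure behind Tables 9 and 10, the logarithm of the condition number of a covariance estimate, is simple to compute directly: since the clean data are generated with covariance I, a value near 0 indicates little shape distortion. A minimal numpy sketch (with illustrative contamination settings chosen here, not the paper's):

```python
import numpy as np

rng = np.random.default_rng(6)
n, p = 500, 5
x = rng.standard_normal((n, p))  # clean data, true covariance I

# Log condition number of the plain sample covariance: near 0 for clean data.
S = np.cov(x, rowvar=False)
log_cond_clean = float(np.log(np.linalg.cond(S)))
print(log_cond_clean)

# A concentrated contamination inflates one eigenvalue and blows this up,
# which is what the "abnormally large entries" in the tables reflect.
x_bad = x.copy(); x_bad[:100, 0] += 20.0
S_bad = np.cov(x_bad, rowvar=False)
log_cond_bad = float(np.log(np.linalg.cond(S_bad)))
print(log_cond_bad)
```

A robust estimator that correctly downweights the contaminated rows would bring the second value back near the first.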
Table 11. Running Times (in s.) on Large Synthetic Datasets

Combining these results and rearranging terms, we have

m_X(2) = m_F(2)(1 + α(v − 1) + α(1 − α)r²).

For the fourth moment,

m_X(4) = (1 − α) ∫ (x − αμ_G)⁴ dF(x) + α ∫ (x − αμ_G)⁴ dG(x),

where

∫ (x − αμ_G)⁴ dF(x) = m_F(4) − 4αμ_G m_F(3) + 6α²μ_G² m_F(2) + α⁴μ_G⁴,

and

∫ (x − αμ_G)⁴ dG(x) = m_G(4) + 4(1 − α)μ_G m_G(3) + 6(1 − α)²μ_G² m_G(2) + (1 − α)⁴μ_G⁴,

and where z̄_i and σ_i denote the mean and standard deviation of Z_i. Letting γ = e_1′u, from (A.2) and (3) it follows that

z̄_1 = m_1′u = −αδγ / √(1 + δ²α(1 − α)),

z̄_2 = m_2′u = (1 − α)δγ / √(1 + δ²α(1 − α)),

and from (A.1)

σ_1² = σ_2²/λ = u′Su = 1 − δ²α(1 − α)γ² / (1 + δ²α(1 − α)).

Replacing these values in (A.3), we have

κ_Z = 3(1 − α + αλ²)σ_1⁴ + 6(α + λ(1 − α))σ_1² δ²α(1 − α)γ² / (1 + δ²α(1 − α)) + δ⁴α(1 − α)(α³ + (1 − α)³)γ⁴ / (1 + δ²α(1 − α))².
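Two algebraic steps in this appendix can be checked mechanically. The sketch below verifies the binomial expansion of the fourth moment about αμ_G for a mean-zero discrete distribution, and the coefficient identity (α + λ(1 − α)) − (1 − α + αλ²) = (1 − λ)(αλ + 2α − 1) that arises when collecting the γ² terms (the variables a and l below stand for α and λ):

```python
import numpy as np

# Check E_F[(x - c)^4] = m_F(4) - 4 c m_F(3) + 6 c^2 m_F(2) + c^4 when m_F(1) = 0,
# on a discrete mean-zero F (points -2 and 1 with probabilities 1/3 and 2/3).
pts = np.array([-2.0, 1.0])
pr = np.array([1.0 / 3.0, 2.0 / 3.0])
mF = lambda k: float(np.sum(pr * pts ** k))
c = 0.7
lhs = float(np.sum(pr * (pts - c) ** 4))
rhs = mF(4) - 4 * c * mF(3) + 6 * c ** 2 * mF(2) + c ** 4
print(abs(lhs - rhs))  # ~0

# Check (a + l(1-a)) - (1 - a + a l^2) = (1 - l)(a l + 2 a - 1) over random a, l.
rng = np.random.default_rng(2)
a = rng.uniform(0.0, 0.5, 1000)
l = rng.uniform(0.1, 5.0, 1000)
gap = (a + l * (1 - a)) - (1 - a + a * l ** 2) - (1 - l) * (a * l + 2 * a - 1)
print(float(np.max(np.abs(gap))))  # ~0
```

The same style of check (Monte Carlo moments of a two-component normal mixture) can be used to validate the m_X(2) and m_X(4) expressions, at the cost of sampling noise.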
Grouping all the terms that correspond to the same powers of γ and using (α + λ(1 − α)) − (1 − α + αλ²) = (1 − λ)(αλ + 2α − 1), the result in (4) is obtained.

[Received March 1999. Revised June 2000.]

REFERENCES

Agulló, J. (1996), "Exact Iterative Computation of the Multivariate Minimum Volume Ellipsoid Estimator With a Branch and Bound Algorithm," in Proceedings in Computational Statistics, ed. A. Prat, Heidelberg: Physica-Verlag, pp. 175–180.
Arnold, H. J. (1964), "Permutation Support for Multivariate Techniques," Biometrika, 51, 65–70.
Atkinson, A. C. (1994), "Fast Very Robust Methods for the Detection of Multiple Outliers," Journal of the American Statistical Association, 89, 1329–1339.
Balanda, K. P., and MacGillivray, H. L. (1988), "Kurtosis: A Critical Review," The American Statistician, 42, 111–119.
Becker, C., and Gather, U. (1999), "The Masking Breakdown Point of Multivariate Outlier Identification Rules," Journal of the American Statistical Association, 94, 947–955.
Box, G. E. P., and Tiao, G. C. (1968), "A Bayesian Approach to Some Outlier Problems," Biometrika, 55, 119–129.
Campbell, N. A. (1980), "Robust Procedures in Multivariate Analysis I: Robust Covariance Estimation," Applied Statistics, 29, 231–237.
——— (1989), "Bushfire Mapping Using NOAA AVHRR Data," technical report, CSIRO, North Ryde, Australia.
Cook, R. D., Hawkins, D. M., and Weisberg, S. (1993), "Exact Iterative Computation of the Robust Multivariate Minimum Volume Ellipsoid Estimator," Statistics and Probability Letters, 16, 213–218.
Davies, P. L. (1987), "Asymptotic Behavior of S-Estimators of Multivariate Location Parameters and Dispersion Matrices," The Annals of Statistics, 15, 1269–1292.
Donoho, D. L. (1982), "Breakdown Properties of Multivariate Location Estimators," unpublished Ph.D. qualifying paper, Harvard University, Dept. of Statistics.
Gnanadesikan, R., and Kettenring, J. R. (1972), "Robust Estimates, Residuals, and Outliers Detection with Multiresponse Data," Biometrics, 28, 81–124.
Hadi, A. S. (1992), "Identifying Multiple Outliers in Multivariate Data," Journal of the Royal Statistical Society, Ser. B, 54, 761–771.
——— (1994), "A Modification of a Method for the Detection of Outliers in Multivariate Samples," Journal of the Royal Statistical Society, Ser. B, 56, 393–396.
Hampel, F. R. (1985), "The Breakdown Point of the Mean Combined With Some Rejection Rules," Technometrics, 27, 95–107.
Hampel, F. R., Ronchetti, E. M., Rousseeuw, P. J., and Stahel, W. A. (1986), Robust Statistics: The Approach Based on Influence Functions, New York: Wiley.
Hawkins, D. M. (1994), "The Feasible Solution Algorithm for the Minimum Covariance Determinant Estimator in Multivariate Data," Computational Statistics and Data Analysis, 17, 197–210.
Hawkins, D. M., Bradu, D., and Kass, G. V. (1984), "Location of Several Outliers in Multiple Regression Data Using Elemental Sets," Technometrics, 26, 197–208.
Hawkins, D. M., and Olive, D. J. (1999), "Improved Feasible Solution Algorithms for High Breakdown Estimation," Computational Statistics and Data Analysis, 30, 1–11.
Jones, M. C., and Sibson, R. (1987), "What Is Projection Pursuit?" Journal of the Royal Statistical Society, Ser. A, 150, 29–30.
Juan, J., and Prieto, F. J. (1997), "Identification of Point-Mass Contaminations in Multivariate Samples," Working Paper 97–13, Statistics and Econometrics Series, Universidad Carlos III de Madrid.
Malkovich, J. F., and Afifi, A. A. (1973), "On Tests for Multivariate Normality," Journal of the American Statistical Association, 68, 176–179.
Mardia, K. V. (1970), "Measures of Multivariate Skewness and Kurtosis with Applications," Biometrika, 57, 519–530.
Maronna, R. A. (1976), "Robust M-Estimators of Multivariate Location and Scatter," The Annals of Statistics, 4, 51–67.
Maronna, R. A., and Yohai, V. J. (1995), "The Behavior of the Stahel–Donoho Robust Multivariate Estimator," Journal of the American Statistical Association, 90, 330–341.
Posse, C. (1995), "Tools for Two-Dimensional Exploratory Projection Pursuit," Journal of Computational and Graphical Statistics, 4, 83–100.
Rocke, D. M., and Woodruff, D. L. (1993), "Computation of Robust Estimates of Multivariate Location and Shape," Statistica Neerlandica, 47, 27–42.
——— (1996), "Identification of Outliers in Multivariate Data," Journal of the American Statistical Association, 91, 1047–1061.
Rousseeuw, P. J. (1985), "Multivariate Estimators With High Breakdown Point," in Mathematical Statistics and its Applications (vol. B), eds. W. Grossmann, G. Pflug, I. Vincze, and W. Wertz, Dordrecht: Reidel, pp. 283–297.
——— (1993), "A Resampling Design for Computing High-Breakdown Point Regression," Statistics and Probability Letters, 18, 125–128.
Rousseeuw, P. J., and Leroy, A. M. (1987), Robust Regression and Outlier Detection, New York: Wiley.
Rousseeuw, P. J., and van Driessen, K. (1999), "A Fast Algorithm for the Minimum Covariance Determinant Estimator," Technometrics, 41, 212–223.
Rousseeuw, P. J., and van Zomeren, B. C. (1990), "Unmasking Multivariate Outliers and Leverage Points," Journal of the American Statistical Association, 85, 633–639.
Ruppert, D. (1987), "What Is Kurtosis," The American Statistician, 41, 1–5.
Stahel, W. A. (1981), "Robuste Schätzungen: Infinitesimale Optimalität und Schätzungen von Kovarianzmatrizen," unpublished Ph.D. thesis, Eidgenössische Technische Hochschule, Zurich.
Tyler, D. E. (1991), "Some Issues in the Robust Estimation of Multivariate Location and Scatter," in Directions in Robust Statistics and Diagnostics, Part II, eds. W. Stahel and S. Weisberg, New York: Springer-Verlag, pp. 327–336.
Wilks, S. S. (1963), "Multivariate Statistical Outliers," Sankhya, Ser. A, 25, 407–426.
Discussion
David M. Rocke David L. Woodruff
© 2001 American Statistical Association and the American Society for Quality

Peña and Prieto present a new method for robust multivariate estimation of location and shape and identification of multivariate outliers. These problems are intimately connected in that identifying the outliers correctly automatically allows excellent robust estimation results and vice versa.

Some types of outliers are easy to find and some are difficult. In general, the previous literature concludes that problems are more difficult when the fraction of outliers is large. Moreover, widely scattered outliers are relatively easy, whereas
concentrated outliers can be very difficult (Rocke and Woodruff 1996). This article is aimed squarely at the case in which the outliers form a single cluster, separated from the main data and of the same shape (but possibly different size) as the main data, an especially difficult case for many outlier-detection methods (Rocke and Woodruff 1996). We believe that it is entirely appropriate that special methods like those in the present article be developed to handle this case, which is often too difficult for general-purpose outlier-detection and robust-estimation methods.

In this discussion, we make some comparisons of the estimators kurtosis1, FAST-MCD, SD, and also certain M and S estimators; then we point out the connection to cluster analysis (Rocke and Woodruff 2001). The MCD, MVE, SD, and S estimators with hard redescending influence functions are known to have maximum breakdown. Kurtosis1 probably also has maximum breakdown, although this article has no formal proof. M estimators are sometimes thought to be of breakdown 1/(p + 1), but this is actually incorrect. Work by Maronna (1976), Donoho (1982), and Stahel (1981) showed that whenever the amount of contamination exceeds 1/(p + 1) a root of the estimating equations that can be carried over all bounds exists. In fact, a root may exist that remains bounded. S estimators are a subclass of M estimators, as shown by Lopuhaä (1989), and have maximal breakdown when hard redescending ψ functions are used and the parameters are correctly chosen. This provides an example of a class of M estimators that have maximal breakdown.

M estimators can be highly statistically efficient and are easy to compute by iteratively reweighted least squares but need a high-breakdown initial estimator to avoid converging on the "bad" root (this also applies to S estimators, which are computed by solving the related constrained M estimation problem). MULTOUT (Rocke and Woodruff 1996) combines a robust initial estimator with an M estimator to yield a high-breakdown, statistically efficient methodology. Note also that identifying outliers by distance and reweighting points with weight 0 if the point is declared an outlier and weight 1 otherwise is a type of M estimator with a ψ function that is constant until the outlier rejection cutoff point. This is used by many methods (e.g., FAST-MCD, kurtosis1) as a final step, which improves statistical efficiency.

Thus, the key step in outlier identification and robust estimation is to use an initial procedure that gives a sufficiently good starting place. The gap can be quite large between the theoretical breakdown of an estimator and its "practical breakdown," the amount and type of contamination such that success is unlikely. Consider, for example, the case in Table 6 in which n = 100, p = 20, α = 0.3, λ = 1, and δ = 100. The amount of contamination is well below the breakdown point (40%), and the contamination is at a great distance from the main data, but none of the methods used in this study are very successful.

The study reported in Table 6 has some difficulties:

1. The number of data points is held fixed at 100 as the dimension rises, which leads to a very sparse data problem in dimension 20. Many of these methods could perhaps do much better with an adequate amount of data. This is particularly true of FAST-MCD and MULTOUT, which are designed to handle large amounts of data.

2. When n = 100 and p = 20, the maximum breakdown of any equivariant estimator is .4. Of course, this does not mean that the estimator must break down whatever the configuration of the outliers, but it does show that at α = 0.4 and p = 20 and for every dataset generated there is an outlier configuration such that no equivariant estimator will work.

3. The case λ = 5 is one in which the outliers are actually scattered rather than concentrated. This is especially true when δ = 10. With standard normal data in 20 dimensions, almost all observations lie at a distance less than the .001 point of a χ²₂₀ distribution, which is 6.73. The outliers are placed at a distance of 10, and the .999 sphere for these has a radius of (5)(6.73) = 33.65. Thus, the outliers actually surround the good data, rather than lying on one side. This explains why FAST-MCD gets all of the λ = 5 cases regardless of other parameter settings.

4. When n = 100 and p = 20, the preceding computation shows that standard normal data will almost all lie in a sphere of radius 6.73 around the center. If the outliers are displaced by 10 with the same covariance (λ = 1), these spheres overlap considerably, and in some instances no method can identify all "outliers" because some of them lie within the good data. Shrunken outliers (λ = 0.1) do not have this problem since the radius of the outliers is only .67, so a displacement of 10 prevents overlap. Expanded outliers (λ = 5) are relatively easy because the outliers surround the main data, and the chance of one falling in the relatively small sphere of the good data is small.

The results are better for two and four clusters (Table 8) than for one and would be even better for individually scattered outliers. The important insight is that many methods work well except in one hard case—when the outliers lie in one or two clusters. Methods such as FAST-MCD and MULTOUT must be supplemented by methods specifically designed to find clustered or concentrated outliers. These clustered outlier methods are not replacements for the other methods because they may perform poorly when there are significant nonclustered outliers.

The remaining question concerns appropriate methods to supplement the more general estimation techniques and allow detection of clustered/concentrated outliers. The methods of Peña and Prieto are aimed at that task, by trying to find the direction in which the clustered outliers lie. The performance of these methods is documented in the article under discussion. Another attempt in this direction was given by Juan and Prieto (2001), who tried to find outliers by looking at the angles subtended by clusters of points projected on an ellipsoid. They showed reasonable performance of this method but provided computations only for very concentrated outliers (λ = 0.01).

Rocke and Woodruff (2001) presented another approach. If the most problematic case for methods like FAST-MCD and MULTOUT is when the outliers form clusters, why not apply methods of cluster analysis to identify them? There are many important aspects to this problem that cannot be treated in the limited space here, but we can look at a simple version
of our procedure. We search for clusters using a model-based clustering framework and heuristic search optimization methods, then apply an M estimator to the largest identified cluster (for details, see Coleman and Woodruff 2000; Coleman et al. 1999; Reiners 1998; Rocke and Woodruff 2001).

To illustrate our point, we ran a slightly altered version of the simulation study in Table 6 from Peña and Prieto. First, we placed the outliers on a diagonal rather than on a coordinate axis. If u = (1, 1, …, 1)', then the mean of the outliers was taken to be p^(−1/2) δu, which lies at a distance of δ from 0. After generating the data matrix X otherwise in the same way as did Peña and Prieto, we centered the data to form a new matrix X* and then generated "sphered" data in the following way: Let X* = UDV', the singular value decomposition (SVD) in which D is the diagonal matrix of singular values. Then the matrix X̃ = X*VD⁻¹V' = UV' is an affine transformed version of the data with observed mean vector 0 and observed covariance matrix I. The purpose of these manipulations is to ensure that methods that may use nonaffine-equivariant methods do not have an undue advantage. Pushing the outliers on the diagonal avoids giving an advantage to componentwise methods. The SVD method of sphering, unlike the usual Cholesky factor approach, preserves the direction of displacement of the outliers.
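The centering-and-sphering step can be sketched in a few lines of numpy (a minimal illustration; the data-generation settings here are placeholders, not the exact simulation design):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 100 points in 5 dimensions, with a small cluster of outliers
# pushed along the diagonal (placeholder settings).
n, p, delta = 100, 5, 10.0
X = rng.standard_normal((n, p))
X[:10] += delta * np.ones(p) / np.sqrt(p)   # outlier mean at distance delta from 0

# Center, then sphere via the SVD X* = U D V'.
Xc = X - X.mean(axis=0)
U, d, Vt = np.linalg.svd(Xc, full_matrices=False)

# X~ = X* V D^{-1} V' = U V' has column means 0 and X~'X~ = I, and (unlike
# Cholesky-based sphering) uses the symmetric transformation V D^{-1} V',
# which preserves the direction of displacement of the outliers.
X_sphered = U @ Vt

print(np.allclose(X_sphered.T @ X_sphered, np.eye(p)))   # True
```

The identity X~'X~ = I follows because U has orthonormal columns and V is orthogonal.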
We ran all the cases from the simulation of Peña and Prieto. Time constraints prevented us from running the 100 repetitions, but out of the five repetitions we did run, we succeeded in 100% of the cases in identifying all of the outliers, except for the cases in which p = 20 and λ = 1; these required more data for our methods to work. We could identify all of the outliers in 100% of the cases when n = 500, for example, instead of n = 100. The successful trials included all of the other cases in which no other method reported in Table 6 was very successful.

This insight transforms part of the outlier-identification problem into a cluster-analysis problem. However, the latter is not necessarily easy (Rocke and Woodruff 2001). For example, we tried the same set of simulation trials using the standard clustering methods available in S-PLUS. In this case, we ran 10 trials of each method. The methods used were two hierarchical agglomeration methods, mclust (Banfield and Raftery 1993) and agnes (Kaufman and Rousseeuw 1990; Struyf, Hubert, and Rousseeuw 1997); diana, a divisive hierarchical clustering method; fanny, a fuzzy clustering method; pam, clustering around medoids (Kaufman and Rousseeuw 1990; Struyf et al. 1997); and k-means (Hartigan 1975). First, it should be noted that all of these methods succeed almost all of the time for separated clusters (λ = 0.1 or λ = 1) if the data are not sphered. All of these methods except mclust use Euclidean distance and make no attempt at affine equivariance. Mclust uses an affine equivariant objective function based on a mixture model, but the initialization steps that are crucial to its performance are not equivariant.

Over the 72 cases considered (with sphered data), none of these methods achieved as much as 50% success. The overall success rates were agnes 36%, diana 48%, fanny 41%, k-means 10%, mclust 37%, and pam 12% (compared to MCD 75%, SD 77%, kurtosis1 83%, and our clustering method 89%). For shrunken outliers (λ = 0.1), the most successful were fanny 81% and mclust 37%, and the least successful were agnes 1% and k-means 2% (compared to MCD 41%, SD 70%, kurtosis1 88%, and our clustering method 100%). For expanded outliers (λ = 5), the most successful were agnes 97%, diana 96%, mclust 73%, and fanny 39%, and the least successful was pam 0% (compared to MCD 100%, SD 88%, kurtosis1 100%, and our clustering method 100%). This case is quite easy for robust multivariate estimation methods; FAST-MCD and MULTOUT each get 100% of these. Again, the most difficult case is shift outliers (Rocke and Woodruff 1996; Hawkins 1980, p. 104), in which λ = 1. The best performance among these clustering methods is diana 34%, with the next best being k-means at 16% (compared to MCD 82%, SD 73%, kurtosis1 62%, and our clustering method 67%).

The poor performance of these clustering methods is probably due to two factors. First, use of the Euclidean metric is devastating when the shape of the clusters is very nonspherical. Since this can certainly occur in practice, such methods should at least not be the only method of attack. Second, some of these problems require more extensive computation than these methods allow, at least in the implementation in S-PLUS. Performance of many descents of the algorithm from a wide variety of starting points, including random ones, can obtain better solutions than a single descent from a single plausible starting point. This is particularly true when the data are extremely sparse, as they are in these simulations. Many of these methods could, of course, be modified to use multiple starting points and might show enormously improved performance.

We confirmed this by using two additional model-based clustering methods that permit control of the amount of computation. The first of these was EMMIX (McLachlan and Basford 1988; McLachlan and Peel 2000). The second was EMSamp (Rocke and Dai 2001). For both programs, the performance was similar to that of our clustering method; success was usual if enough random starting points were used, except for shift outliers in dimension 20 in the present n = 100 case.

Since different methods are better at different types of outliers, and since once the outliers have been identified by any method they can be said to stay identified, an excellent strategy when using real data instead of simulated data (where the structure is known) is to use more than one method. Among the methods tested by Peña and Prieto, the best combination is FAST-MCD plus kurtosis1. Together these can improve the overall rate of identification from 75% and 83%, respectively, to 90%, since FAST-MCD is better for shift outliers and kurtosis1 is better for shrunken outliers. An even better combination is our clustering method plus FAST-MCD, which gets an estimated 95% of the cases.

We can summarize the most important points we have tried to make here as follows:

1. General-purpose robust-estimation and outlier-detection methods work well in the presence of scattered outliers or multiple clusters of outliers.

2. To deal with one or two clusters of outliers in difficult cases, methods specifically meant for this purpose, such as those of Peña and Prieto and Juan and Prieto (2001), are needed to supplement the more general methods.
3. An alternative approach for this case is to use clustering methods to supplement the general-purpose robust methods.

4. Choice of clustering method and computational implementation are important determinants of success. We have found that model-based clustering together with heuristic search technology can provide high-quality methods (Coleman and Woodruff 2000; Coleman et al. 1999; Reiners 1998; Rocke and Dai 2001; Rocke and Woodruff 2001).

5. The greatest chance of success comes from use of multiple methods, at least one of which is a general-purpose method such as FAST-MCD and MULTOUT, and at least one of which is meant for clustered outliers, such as kurtosis1, the angle method of Juan and Prieto (2001), or our clustering method (Rocke and Woodruff 2001).

REFERENCES

Banfield, J. D., and Raftery, A. E. (1993), "Model-Based Gaussian and Non-Gaussian Clustering," Biometrics, 49, 803–821.
Coleman, D. A., Dong, X., Hardin, J., Rocke, D. M., and Woodruff, D. L. (1999), "Some Computational Issues in Cluster Analysis With No a Priori Metric," Computational Statistics and Data Analysis, 31, 1–11.
Coleman, D. A., and Woodruff, D. L. (2000), "Cluster Analysis for Large Datasets: Efficient Algorithms for Maximizing the Mixture Likelihood," Journal of Computational and Graphical Statistics, 9, 672–688.
Hartigan, J. A. (1975), Clustering Algorithms, New York: Wiley.
Hawkins, D. M. (1980), The Identification of Outliers, London: Chapman and Hall.
Juan, J., and Prieto, F. J. (2001), "Using Angles to Identify Concentrated Multivariate Outliers," Technometrics, 43, 311–322.
Kaufman, L., and Rousseeuw, P. J. (1990), Finding Groups in Data, New York: Wiley.
Lopuhaä, H. P. (1989), "On the Relation Between S-Estimators and M-Estimators of Multivariate Location and Covariance," The Annals of Statistics, 17, 1662–1683.
McLachlan, G. J., and Basford, K. E. (1988), Mixture Models: Inference and Application to Clustering, New York: Marcel Dekker.
McLachlan, G. J., and Peel, D. (2000), Finite Mixture Models, New York: Wiley.
Reiners, T. (1998), "Maximum Likelihood Clustering of Data Sets Using a Multilevel, Parallel Heuristic," unpublished Master's thesis, Technische Universität Braunschweig, Germany, Dept. of Economics, Business Computer Science and Information Management.
Rocke, D. M., and Dai, J. (2001), "Sampling and Subsampling for Cluster Analysis in Data Mining With Application to Sky Survey Data," unpublished manuscript.
Rocke, D. M., and Woodruff, D. L. (2001), "A Synthesis of Outlier Detection and Cluster Identification," unpublished manuscript.
Struyf, A., Hubert, M., and Rousseeuw, P. J. (1997), "Integrating Robust Clustering Techniques in S-PLUS," Computational Statistics and Data Analysis, 26, 17–37.
Discussion
Mia Hubert
© 2001 American Statistical Association and the American Society for Quality

Peña and Prieto propose a new algorithm to detect multivariate outliers. As a byproduct, the population scatter matrix is estimated by the classical empirical covariance matrix of the remaining data points.

The interest in outlier-detection procedures is growing fast since data mining has become a standard analysis tool in both industry and research. Once information ("data") is gathered, marketing people or researchers are not only interested in the behavior of the regular clients or measurements (the "good data points") but they also want to learn about the anomalous observations (the "outliers").

As pointed out by the authors, many different procedures have already been proposed over the last decades. However, none of them has a superior performance at all kinds of contamination patterns. The high-breakdown MCD covariance estimator of Rousseeuw (1984) is probably the most well known and most respected procedure. There are several reasons for this. First, the MCD has good statistical properties since it is affine equivariant and asymptotically normally distributed. It is also a highly robust estimator, achieving a breakdown value of 50% and a bounded influence function at any elliptical distribution (Croux and Haesbroeck 1999). Another advantage is the availability of a fast and efficient algorithm, called FAST-MCD (Rousseeuw and Van Driessen 1999), which is currently incorporated in S-PLUS and SAS. Therefore, the MCD can be used to robustify many multivariate techniques such as discriminant analysis (Hawkins and McLachlan 1997), principal-component analysis (PCA) (Croux and Haesbroeck 2000), and factor analysis (Pison, Rousseeuw, Filzmoser, and Croux 2000).

Whereas the primary goal of the MCD is to robustly estimate the multivariate location and scatter matrix, it also can be used to detect outliers by looking at the (squared) robust distances RD²(x_i) = (x_i − T_MCD)' S_MCD⁻¹ (x_i − T_MCD). Here, T_MCD and S_MCD stand for the MCD location and scatter estimates. One can compare these robust distances with the quantiles of the χ²_p distribution. Since this rejection rule often leads to an inflated Type II error, Hardin and Rocke (2000) developed more precise cutoff values to improve the MCD outlier-detection method.
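A rough numerical sketch of these robust distances (numpy only; the trimmed mean/covariance below is a crude one-step stand-in for the MCD estimates, not the FAST-MCD algorithm, and the data settings are placeholders):

```python
import math
import numpy as np

rng = np.random.default_rng(1)

# Toy bivariate data with a small outlying cluster (placeholder settings).
X = rng.standard_normal((100, 2))
X[:10] += np.array([6.0, 6.0])

# Stand-in for (T_MCD, S_MCD): mean/covariance of the half of the sample
# with the smallest classical Mahalanobis distances.  This is only a crude
# sketch of the MCD idea, not the FAST-MCD algorithm.
m, S = X.mean(axis=0), np.cov(X, rowvar=False)
d2 = np.einsum('ij,jk,ik->i', X - m, np.linalg.inv(S), X - m)
half = np.argsort(d2)[: len(X) // 2]
T, Srob = X[half].mean(axis=0), np.cov(X[half], rowvar=False)

# Squared robust distances RD^2(x_i) = (x_i - T)' Srob^{-1} (x_i - T),
# compared with the 0.975 quantile of chi-squared with p = 2 df,
# which equals -2 log(0.025) exactly for 2 degrees of freedom.
RD2 = np.einsum('ij,jk,ik->i', X - T, np.linalg.inv(Srob), X - T)
cutoff = -2.0 * math.log(0.025)
flagged = np.where(RD2 > cutoff)[0]
print(len(flagged))
```

As the text notes, this raw χ²_p rule tends to flag too many good points; the cutoff corrections of Hardin and Rocke (2000) address exactly that.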
If one is mainly interested in finding the outliers, it is less important to estimate the shape of the good points with great accuracy. As the authors explain, it seems natural to try to find (for each data point) the direction along which the relative distance to the uncontaminated observations is largest. This was the idea behind the Stahel–Donoho estimator. Since the center of the good data is unknown, this estimator projects the dataset on all possible directions and then considers a robust univariate distance [Formula (10)] for each data point. The maximal distance of the point over all directions is then defined as its outlyingness.

This projection pursuit (PP) strategy has the advantage of being applicable even when p > n. This situation often occurs in chemometrics, genetics, and engineering. It is therefore not surprising that many multivariate methods in high dimensions have been robustified using the PP principle (e.g., see Li and Chen 1985; Croux and Ruiz-Gazen 1996, 2000; Hubert, Rousseeuw, and Verboven 2000). The computation time of these estimators would be huge if all possible directions were to be scanned. To reduce the search space, Stahel (1981) proposed taking the directions orthogonal to random p-subsets. In PCA, where only orthogonal equivariance is required, Croux and Ruiz-Gazen (1996) and Hubert et al. (2000) looked at the n directions from the spatial median of the data to each data point. However, it is clear that the precision of these algorithms would improve if one were able to find the most interesting directions. This is exactly what the authors try to do in this article, which I have therefore read with great interest.

However, several key aspects of their proposal can be criticized.

1. The standardization in Step 1. First a standardization is performed [Formula (7)] using the classical mean and covariance matrix. It is well known that these estimators are extremely sensitive to outliers, which often leads to masking (large Type II errors, by labeling outliers as good data points) and swamping (large Type I errors, by flagging good points as outliers). Rousseeuw and Leroy (1987, pp. 271–273) gave an example of how classical prestandardization can destroy the robustness of the final results.

2. Maximizing and minimizing the kurtosis in Steps 2 and 3. The theoretical discussion in Section 1.1 indicates that the kurtosis is maximal (respectively, minimal) in the direction of the outliers when the contamination is concentrated and small (respectively, large). However, this is not always true for an intermediate level of contamination. To exemplify this, I have generated 100 univariate data points according to the asymmetric contamination model of Section 1.1: (1 − α)N(0, 1) + αN(δ, λ). Figure 1, (a) and (b), confirms the authors' conclusions. For α = 5%, the kurtosis increases rapidly, whereas at α = 40% the kurtosis decreases to 1.17. Figure 1(c) shows the kurtosis for α = 20%, λ = 0.2, and δ ranging from 1 to 20. The kurtosis first attains a minimum of 2.45 at δ = 3 and then increases. But it stays close to 3, the kurtosis of the standard normal distribution. To construct a confidence interval for the kurtosis at a normal sample of size 100, I generated 10,000 samples of size 100 and computed the .025 and .975 quantiles of their kurtosis, resulting in the 95% confidence interval [2.27, 4.05]. We see that none of the contaminated samples had a kurtosis outside this confidence interval.

Moreover, Section 1.1 deals with only concentrated outliers. What if the outliers are not concentrated in one place? As an example, I generated symmetric contamination according to the model

  (1 − α)N(0, 1) + (α/2)N(δ, λ) + (α/2)N(−δ, λ).

Figure 1(d) shows the kurtosis as a function of δ for α = 20% and λ = 0.2. Here again it can be seen that the kurtosis does not become extremely large or small, except for δ close to 2.

3. Taking 2p orthogonal directions in Steps 2 and 3. In Section 1.1, only a very particular kind of contamination is studied. For this pattern, it indeed seems reasonable to consider orthogonal directions, but why should this also work at other contamination patterns?

Moreover, I do not understand why first only p directions that maximize the kurtosis are considered and then p directions that minimize the kurtosis. Why is this better than, for example, alternating between a direction that maximizes and one that minimizes the kurtosis? The actual implementation gives me the impression that, although inspired by the behavior of the kurtosis, the chosen directions are still rather arbitrary.

To verify this, I conducted the following experiment. Using the authors' Matlab code, I changed their orthogonal directions obtained in Steps 2 and 3 into two random sets of p orthogonal directions. (A random set of p orthogonal directions was obtained by taking the principal components of 100 randomly generated p-variate standard normal observations.) Let us call this new estimator rand2p. Then I repeated some of the simulations that are described in Section 3.1 and that are reported in Table 6. My Table 1 lists the results obtained with the authors' proposed method kurtosis1, as well as with rand2p and FAST-MCD. For each set of parameters, the first line indicates the success rate of the algorithms, which (following the authors' convention) is defined as the number of samples in which all the outliers have been correctly identified. (In my opinion, it would be better to count how many points were flagged as outliers.) The second line shows the results obtained by the authors.

The third line lists the average number of iterations needed in Step 6 of the kurtosis1 and the rand2p algorithms. Although the authors claim that they consider projections in only 2p directions, this is somewhat exaggerated, since in Step 6 of the algorithm it is said that whenever some outliers are found the procedure is applied again to the reduced sample, so typically several sets of 2p directions are considered. Table 1 thus indicates how many times 2p directions were constructed on average.

Table 1 shows that the rand2p algorithm based on random directions unrelated to the data has the same or even better performance than the authors' kurtosis1. The number of iterations that occur in rand2p is larger than in kurtosis1, but since the former algorithm does not need any optimization step, its computation time is still lower than that of kurtosis1. This shows that selecting directions based on the kurtosis does not add value.
TECHNOMETRICS, AUGUST 2001, VOL. 43, NO. 3
Figure 1. The Behavior of the Kurtosis at Different Contamination Patterns: (a) 5% Asymmetric Contamination, (b) 40% Asymmetric Contamination, (c) 20% Asymmetric Contamination, (d) 20% Symmetric Contamination.
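The qualitative behavior summarized in Figure 1 is easy to reproduce; a small simulation in numpy (illustrative choices of δ and λ, not Hubert's exact settings):

```python
import numpy as np

rng = np.random.default_rng(2)

def kurtosis(x):
    """Fourth standardized moment m4 / m2^2 (equal to 3 for a normal sample)."""
    z = x - x.mean()
    return np.mean(z**4) / np.mean(z**2)**2

def sample(n, a, delta, lam):
    """Draw from (1 - a) N(0, 1) + a N(delta, lam); lam is a variance here."""
    k = rng.binomial(n, a)
    good = rng.standard_normal(n - k)
    bad = delta + np.sqrt(lam) * rng.standard_normal(k)
    return np.concatenate([good, bad])

x_small = sample(100_000, 0.05, 10.0, 0.1)   # small contamination level
x_large = sample(100_000, 0.40, 10.0, 0.1)   # large contamination level
print(kurtosis(x_small), kurtosis(x_large))
# A small concentrated contamination inflates the kurtosis well above 3;
# heavy contamination drives it below 3 (toward a bimodal shape).
```

At intermediate contamination levels the two effects roughly cancel, which is Hubert's point: the kurtosis stays near its normal value of 3.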
On the other hand, note that, in spite of the iterations, the total number of directions considered is still very low compared with the large number of random directions used by the original Stahel–Donoho algorithm (1,000, 2,000, and 5,000 for p = 5, 10, and 20).

In the next column, note a large discrepancy between my results obtained for FAST-MCD and the authors' results. I do not know how to explain this difference. Did the authors not use the default algorithm with 50% breakdown value? For p = 20, α = 0.2, and λ = 0.1, we note that FAST-MCD has much better performance when n is 500 rather than 100. This is due to the curse of dimensionality that occurs when n/p is too small. Therefore I also considered n = 500 for the other simulations in p = 20 dimensions. Note that the authors' results printed in parentheses are still based on samples of size n = 100.

4. Choice of the cutoff values β_p in Step 5. The authors did not explain how they obtained the cutoff values β_p in Table 2. Since these values are not distribution-free, I suppose they were determined by generating many samples from the normal distribution and applying the first four steps from the kurtosis1 algorithm. If so, I am very surprised that the estimated Type I errors listed in Table 7 are quite far from 5%.

5. Sequential determination of outliers in Step 7. The authors' version of outlyingness is only used as a first step in the outlier detection. In Step 7 the mean and covariance matrix of the good data points are computed and used to decide which outliers can still be reclassified as good observations. This procedure is repeated until no more outliers can be reallocated. My first question is why the cutoff value χ²_{p−1, 0.99} is used for p-dimensional data. Second, it seems very dangerous to repeat this procedure in case the outliers are forming a chain. This is illustrated in Figure 2. This is a rather artificial example, but clearly shows what happens when the outliers are determined sequentially. The smallest ellipse is the 97.5% tolerance region based on the mean and covariance matrix of the good data points obtained after Step 6. Applying the whole kurtosis1 algorithm to this dataset yields the larger tolerance ellipse, which has been attracted by the chain of outliers. I therefore suggest that Step 7 be applied only once.
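The iteration Hubert is describing has roughly this shape (a schematic sketch, not the authors' Matlab implementation; the chain construction and the χ² cutoff value are illustrative):

```python
import math
import numpy as np

rng = np.random.default_rng(3)

# Toy data: 2-d sample plus a chain of outliers (placeholder construction).
good = rng.standard_normal((100, 2))
chain = np.column_stack([np.linspace(1.5, 8, 15), np.linspace(1.5, 8, 15)])
X = np.vstack([good, chain])

outlier = np.zeros(len(X), dtype=bool)
outlier[100:] = True                      # suppose Step 6 flagged the chain
cutoff = -2.0 * math.log(0.01)            # chi-squared_2 quantile at 0.99

# Step-7-style loop: recompute mean/covariance of the "good" points and
# reclassify flagged points whose distance falls under the cutoff, until
# no more points are reallocated.  With a chain, each pass can swallow the
# nearest outliers and drag the tolerance ellipse further out.
while True:
    m = X[~outlier].mean(axis=0)
    S = np.cov(X[~outlier], rowvar=False)
    d2 = np.einsum('ij,jk,ik->i', X - m, np.linalg.inv(S), X - m)
    reclassify = outlier & (d2 < cutoff)
    if not reclassify.any():
        break
    outlier[reclassify] = False

print(outlier.sum())   # how many of the 15 chained points stay flagged
```

Running the loop on chained outliers shows at least the inner end of the chain being absorbed, which is the failure mode motivating the suggestion to apply Step 7 only once.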
Response
Daniel Peña and Francisco J. Prieto
We first want to thank the editor for organizing this discussion and the two discussants for their stimulating comments and valuable insights. We are grateful to them for giving us an opportunity to think harder about the problem and for clarifying several interesting issues.

Hubert indicates that, in her opinion, the MCD covariance estimator of Rousseeuw (1984) is the most respected procedure for multivariate outlier detection. We agree with her that this estimator has some excellent properties, but it also has some problems. For instance, Adrover and Yohai (2001) computed numerically the maximum bias of this estimate and showed that, when the dimension increases, the maximum bias of the MCD grows almost exponentially. This is in agreement with the result in our article in which we found that the code FAST-MCD performs badly for concentrated contamination.
Hubert raises several specific criticisms about our method, and we thank her for this opportunity to clarify some of these issues as follows.

1. Hubert worries that the standardization using nonrobust estimators may lead to difficulties with the resulting procedure, as illustrated, for example, by Rousseeuw and Leroy (1987). This problem cannot appear in our method. First, note that the algorithm we propose is affine equivariant, independently of the initial standardization. The kurtosis coefficient is invariant to translations and scaling of the data, and a rotation will not affect the maximizers or minimizers. Moreover, we have tried to be careful when defining the operations to generate the successive directions, as well as in the choice of an initial direction for the optimization problems. As a consequence, any (invertible) linear transformation of the data, including the one used in the proposed algorithm, should not affect its results. Second, the problem illustrated by Rousseeuw and Leroy (1987) cannot happen to the class of estimators based on weighted versions of the sample mean and covariance matrix in which the suspected outliers are assigned a zero weight. The one used in the article belongs to this class. If x_i denotes the original data and y_i = A x_i + b denotes the data after a (any) linear transformation, we would define location and scale estimators for the modified data as

  m_y = Σ_i w_i y_i,    S_y = Σ_i w_i (y_i − m_y)(y_i − m_y)',

where Σ_i w_i = 1. Undoing the transformations, we have

  m_x = A⁻¹(m_y − b) = Σ_i w_i x_i

and

  S_x = A⁻¹ S_y (A⁻¹)' = Σ_i w_i (x_i − m_x)(x_i − m_x)'.

Thus, if the outliers are properly identified in the transformed data, the use of the sample mean and covariance to standardize it would never cause a breakdown of the estimators, even if these sample moments (the coefficients in the transformation) become arbitrarily large due to the presence of outliers.

2. As we describe in the article, and in accordance with Hubert's comments, the behavior of the kurtosis coefficient is particularly useful to reveal the presence of outliers in the cases of small and large contaminations, and this agrees with the standard interpretation of the kurtosis coefficient as measuring both the presence of outliers and the bimodality of the distribution. What is remarkable is that in intermediate cases with α = 0.3 the procedure does not break down completely and its performance improves with the sample size, as we show in Table 3. Although we cannot claim to understand perfectly well the behavior of the algorithm in these cases, we think that some explanations for the observed results can be found in the iterative application of the technique. If a few of the outliers are detected in some iterations (by chance or because they form a small cluster), the remaining outliers will constitute a smaller percentage of the remaining sample, and they will be easier to detect by the procedure. This seems to indicate that a divide-and-conquer strategy (that we have not implemented) could be of help to improve the practical behavior of the algorithm.

We do not see the relevance to our procedure of the univariate confidence interval for the kurtosis coefficient at this point of the discussion. A multivariate kurtosis test could be built by using the results of Machado (1983), who derived the distribution of the maxima and minima of the kurtosis of the projected data along u, where u is a unit norm vector, for multivariate normal data, and Baringhaus and Henze (1991), who derived the asymptotic distribution of the maximum of the projected kurtosis for elliptically symmetric distributions.

3. The 2p maximization and minimization directions generated by the algorithm are considered simultaneously. In this sense, Hubert's proposal to alternate between directions would be of interest if we used a procedure that stops once a significant direction is computed. The algorithm we describe does not make use of this feature, and in this sense it is a simpler one to describe and understand, although it may be more expensive to implement.

Another question raised by Hubert is the choice of orthogonal directions. Our motivation to use these orthogonal directions is twofold. On the one hand, we wish the algorithm to be able to identify contamination patterns that have more than one cluster of outliers. For example, this would be the case if we have individually scattered outliers. If these clusters are apart in space, a single projection direction would be able to find only one of the clusters, at most. Clearly, it would be of interest to try to identify several of these clusters in a single pass, by using more than one direction. To ensure that these directions do not provide overlapping information, we have chosen them to be orthogonal. The second motivation arises from a property of the kurtosis, described in Section 1.1 of the article, that implies that in some cases the directions of interest are those orthogonal to the maximization or minimization directions.

Hubert carried out an experiment to determine the relevance of different choices for the projection directions, based on the use of random orthogonal directions. Unfortunately, the generation of random directions, as she suggested, is not an affine equivariant procedure. Rather than discussing the importance of outlier-identification procedures being affine equivariant, we prefer to illustrate the impact of using a procedure that does not satisfy this property. We carried out the same experiment described by Hubert, but we standardized the data in advance. Since our procedure is affine equivariant, the results we obtain are the same (apart from statistical variability) in both cases, whereas the results of the procedure used by Hubert change completely.

Table 1. Success Rates for Outliers in One Cluster: Random Directions

   p     α    √λ      δ      n    Kurtosis   Rand2p
   5    .3    .1     10    100      100         2
  10    .3    .1    100    100       97         0
               1    100    100       35        25
  20    .2    .1    100    100       94         0
              .1    100    500      100         0
               1     10    500       30        41
        .4    .1     10    500      100         0
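A random set of p orthonormal directions of the kind rand2p uses can be drawn directly (a QR-based alternative to Hubert's principal-components construction; illustrative, not her exact code):

```python
import numpy as np

rng = np.random.default_rng(5)

def random_orthogonal_directions(p):
    """Return a p x p matrix whose columns are random orthonormal directions."""
    Q, R = np.linalg.qr(rng.standard_normal((p, p)))
    # Fix the column signs so the distribution is invariant under rotations.
    return Q * np.sign(np.diag(R))

D = random_orthogonal_directions(5)
print(np.allclose(D.T @ D, np.eye(5)))   # True
```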
Figure 1. Cones of Random Directions for the Identification of Outliers: (a) Data Used in the Simulations; (b) Standardized Data.
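The equivariance identities invoked in point 1 of the response (m_x = A⁻¹(m_y − b) and S_x = A⁻¹S_y(A⁻¹)') can be verified numerically; a quick check with arbitrary A, b, and weights:

```python
import numpy as np

rng = np.random.default_rng(4)

n, p = 50, 3
X = rng.standard_normal((n, p))
w = rng.random(n)
w /= w.sum()                       # weights summing to 1
A = rng.standard_normal((p, p))    # any invertible transformation
b = rng.standard_normal(p)
Y = X @ A.T + b                    # y_i = A x_i + b

# Weighted location/scatter of the transformed data.
m_y = w @ Y
S_y = (Y - m_y).T @ ((Y - m_y) * w[:, None])

# Undoing the transformation recovers the weighted estimates
# of the original data.
Ainv = np.linalg.inv(A)
m_x = Ainv @ (m_y - b)
S_x = Ainv @ S_y @ Ainv.T

assert np.allclose(m_x, w @ X)
assert np.allclose(S_x, (X - m_x).T @ ((X - m_x) * w[:, None]))
print("equivariance identities hold")
```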
Table 1 shows that the use of random directions is a very poor choice for this data.

The reason the random directions are so efficient for the data used in the simulation made by Hubert is that the outliers were located quite far out from the uncontaminated data. Figure 1 attempts to illustrate this point. In Figure 1(a), we represent data generated as in the simulation experiment and the cones of directions that would be able to separate the outliers in the projections. The useful directions are those outside the narrow cones. Figure 1(b) presents the same data after a linear transformation to standardize it. It is readily apparent that after the transformation the set of useful directions is much smaller than before.

It could be shown that the probability of the projection of the center of the outliers being beyond the z_{1−α} quantile of the standard normal, for the data used in the simulation experiment, will be given by

  P( |Y| ≥ z_{1−α} √(p − 1) / √(δ²p − z²_{1−α}) ),

where Y follows a Student's t_{p−1} distribution. For α = 0.01, δ = 10, and p = 5, this probability is approximately equal to 0.83, and for δ = 10 and p = 20, the corresponding value is 0.80. These values change significantly after a linear transformation is applied to the data. We prefer to use an affine equivariant algorithm based on the maximization and minimization of the kurtosis coefficient, even if in some cases it may behave worse than other alternatives.

The different results obtained with FAST-MCD are due to the different values used for the variance of the contaminating distribution. We must apologize to Hubert for introducing a source of confusion in the previous version of the article, in which in Section 1.1 λ was the variance of the contaminating distribution, whereas in the simulations λ was the standard deviation. This has been corrected in the published version of the article. Thanks to her careful checking of our results, which we appreciate very much, a possible source of confusion for the readers of the article has been removed. Table 2 gives the values we obtain for FAST-MCD for several values of λ, which basically correspond to those indicated by Hubert and show that, in agreement with previous results, FAST-MCD works better for less concentrated contamination.

4. The procedure used to compute the values in Table 3 of the article is in fact the one mentioned by Hubert in her comments. The results are unfortunately not totally satisfactory; the reason is the large variability in these values in the simulations. This variability has two main effects: it is difficult to find correct values (huge numbers of replications would be required), and for any set of 100 replications there is a high probability that the resulting values will be far from the expected one. Nevertheless, we agree that these values could be estimated with greater detail, although they do not seem to be very significant for the behavior of the algorithm, except for contaminations very close to the original sample.

5. We again thank Hubert for detecting a misprint in the article that we have corrected in this printed version. In the version she discusses, it was wrongly written that the cutoff value was χ²_{p−1} instead of χ²_p. Actually, in the code that we have made public the percentiles are obtained from the Hotelling T² approximation instead of the χ²_p, which behaves better for small sample sizes.

Table 2. Success Rates for Outliers in One Cluster: FAST-MCD

   p     α    √λ       δ      n    FAST-MCD
   5    .3    1       10    100        0
              3.2     10    100       55
  10    .3    1      100    100        0
              3.2    100    100       94
              1      100    100      100
  20    .2    1      100    100        0
              3.2    100    100       31
              1      100    500        0
              3.2    100    500       57
              1       10    500      100
        .4    1       10    500        0
              3.2     10    500        0
RESPONSE 309
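The tail-probability formula given earlier for the projection of the outlier center can be checked numerically. A minimal sketch (assuming SciPy is available; `tail_probability` is our name for the quantity, not a function from the article, and the small differences from the .83 and .80 quoted above presumably come from the exact cutoff convention used):

```python
import numpy as np
from scipy import stats

def tail_probability(alpha, delta, p):
    """P(|Y| >= z_{1-alpha} * sqrt(p-1) / sqrt(delta^2 * p - z_{1-alpha}^2)),
    where Y follows a Student's t distribution with p - 1 degrees of freedom."""
    z = stats.norm.ppf(1.0 - alpha)
    cutoff = z * np.sqrt(p - 1.0) / np.sqrt(delta**2 * p - z**2)
    # Two-sided tail of the t distribution, as in the formula.
    return 2.0 * stats.t.sf(cutoff, df=p - 1)

# Values used in the response: alpha = .01, delta = 10, p = 5 and p = 20.
print(round(tail_probability(0.01, 10, 5), 2))   # close to 0.84
print(round(tail_probability(0.01, 10, 20), 2))  # close to 0.82
```

As the formula suggests, the probability depends on δ and p only weakly once δ²p is large compared with z²_{1−α}.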
Table 3. Success Rates for One Cluster of Outliers in p = 20 Dimensions: Different Sample Sizes

 α   √λ    δ      n     FAST-MCD   Kurtosis
.2   .1   100    200        0         99
                 500        0         99
               1,000        0         98
               2,000        0         98
.3    1   100    200       29          1
                 500       40         52
               1,000       52         99
               2,000       50        100
.4    1   100    200        2         59
                 500        1        100
               1,000        2        100
               2,000        0        100

Regarding the suggestion of applying Step 7 only once, we have found that this requires a very careful calibration of the cutoff values, and as a consequence it becomes very sensitive to deviations from normality in the data. We are a little surprised by her general criticism of procedures that determine the outliers sequentially. The statistical literature is full of examples of very successful sequential procedures and, to indicate just one, Peña and Yohai (1999) presented a sequential procedure for outlier detection in large regression problems that performs much better than other nonsequential procedures.

6. Regarding breakdown properties, we agree with Hubert that it would be of great interest to prove a high-breakdown-point property of the proposed algorithm to confirm the good behavior that it shows in the simulations.

Rocke and Woodruff mention some difficulties arising from the specific contamination patterns used in the simulation experiment:

We agree that in dimension 20 a sample of size 100 is small. To check the effect of increasing the sample size on FAST-MCD and on our proposed algorithm, we have performed a small simulation experiment, reported in Table 3. As expected, the success rate improves with the sample size. It is interesting to stress that in the worst case for the kurtosis algorithm, α = .3 and λ = 1, the performance improves very fast with the sample size. This suggests that our algorithm can be a fast and simple alternative for detecting outliers and computing robust estimates in large datasets in high dimensions.

We regret that Rocke and Woodruff may have been misled by the notation we used in Section 1.1 and in the description of the simulation experiment. In this experiment, the outliers have been generated to have mean δe, where e denotes a vector of ones (that is, the same vector that these authors denote as u in their comments). As a consequence, the distance between the means of both distributions is δ√p (rather than δ).

This situation modifies some of the conclusions from the Rocke–Woodruff analysis. For example, the first case has a standard deviation of the outliers equal to 5, δ = 10, and p = 20, and the .999 sphere with radius 33.65 containing the outliers would be centered on a point at a distance 10√20 = 44.72 from the center of the uncontaminated data. As a consequence, the outliers in this case would lie on one side of the original data, and would not surround it with high probability. The same analysis would also imply that for λ = 1 the spheres around the original data and the contaminating cluster would not overlap.

We agree with Rocke and Woodruff that methods like FAST-MCD or MULTOUT must be supplemented by other methods for concentrated contaminations. The proposal of Juan and Prieto (1997) would be very useful in this regard because it is designed just for this specific objective. However, we do not think that kurtosis1 is useful only for concentrated outliers. As we have shown in the simulations, it works as well as FAST-MCD when √λ = 5, whereas it leads to a much smaller proportion of good observations mislabeled as outliers when the dimension grows; see Table 6 in the article. We have conducted another limited simulation experiment for the worst-case amount of contamination, α = .3, and different concentrations of the contamination, presented in Table 4. From Table 4 we see that the reduction in performance tends to concentrate closely around √λ = 1, and for values of √λ on the order of .5–.7 and 1.5, the performance seems to be back to reasonable levels. However, this reduction of performance disappears for large sample size, as was shown in Table 3.

Table 4. Success Rates for Outliers in One Cluster: Different Concentrations

 p    α   √λ     δ     n    Kurtosis
10   .3   .3   100   100       99
          .5   100   100       95
          .7   100   100       54
          .9   100   100       26
           1   100   100       32
         1.2   100   100       69
         1.5   100   100       98
           2   100   100      100
20   .3   .3   100   500       99
          .5   100   500      100
          .7   100   500      100
          .9   100   500       88
           1   100   500       75
         1.2   100   500       86
         1.5   100   500      100
           2   100   500      100

However, since kurtosis1 does not work well for values very close to λ = 1 and small sample sizes, it would be interesting to find an optimal combination of this code and some of the other available codes. Since we have not included comparisons with MULTOUT in the article, we will use only the results from Tables 6 and 8 for FAST-MCD and SD. The best option from these tables seems to be a combination of SD and kurtosis1. Thus the directions obtained by the kurtosis algorithm can be used to compute a fast version of the Donoho–Stahel estimate. It is also possible to combine these directions with others obtained by resampling, searching for specific as well as random directions. Peña and Yohai (1999)
suggested that this type of combination could be very successful in robust regression, and we think that the empirical evidence available suggests it would also be very promising for detecting multivariate outliers.

We agree completely with Rocke and Woodruff in the interest of using clustering algorithms when a concentrated contamination is suspected to be present (a case very closely related to the standard cluster identification problem) or as a complement to other robust outlier-identification procedures. The results that they mention seem very promising, and we are looking forward to reading the two manuscripts referenced in their report with their new results. Note also that their comments can be interpreted to work in the opposite direction, that is, that an algorithm that is efficient for the detection of concentrated contaminations may be a reasonable basis for a clustering algorithm. Along these lines we have carried out some work (Peña and Prieto 2001) to improve on the performance of the kurtosis-based method in a cluster-analysis setting. Basically, we have combined the idea of projections onto interesting directions with an analysis of the spacings of the projections to identify groups in the data. Both the theoretical analysis and the simulation results seem to indicate that this approach may be of interest, at least for linearly separable clusters. One theoretical motivation for the procedure (Peña and Prieto 2000) is that if we have two clusters from two normal populations with different means but the same covariance matrix, the direction that minimizes the kurtosis of the projection is the Fisher linear discriminant function. Thus we can obtain the optimal discriminant direction without knowing the means or the common covariance matrix, which allows its application to cluster analysis. We have also proved (Peña and Prieto 2001) that if we have a sample generated by two elliptical distributions with different means and covariance matrices, the directions that maximize or minimize the kurtosis coefficient belong to the admissible linear classification rules as defined by Anderson and Bahadur (1962). We conduct a search for groups along each of these directions using spacings. The performance of the resulting cluster procedure is very encouraging.

We would like to finish this response by thanking both discussants for their thought-provoking comments that have helped us to understand the procedure better. We agree with Rocke and Woodruff that a combination of different procedures and approaches seems to be the most promising strategy to find multivariate outliers. This suggests exploring ideas from cluster analysis, robust statistics, and outlier-detection methods using both classical and Bayesian inference. The mixture of different cultures and approaches has been the main source of change and improvement in our world, and we believe that this is also true for statistics.

REFERENCES

Adrover, J., and Yohai, V. (2001), "Projection Estimates of Multivariate Location," unpublished manuscript.
Anderson, T. W., and Bahadur, R. R. (1962), "Classification Into Two Multivariate Normal Distributions With Different Covariance Matrices," The Annals of Mathematical Statistics, 33, 420–431.
Baringhaus, L., and Henze, N. (1991), "Limit Distributions for Measures of Multivariate Skewness and Kurtosis Based on Projections," Journal of Multivariate Analysis, 38, 51–69.
Machado, S. G. (1983), "Two Tests for Testing Multivariate Normality," Biometrika, 70, 713–718.
Peña, D., and Prieto, F. J. (2000), "The Kurtosis Coefficient and the Linear Discriminant Function," Statistics and Probability Letters, 49, 257–261.
——— (2001), "Cluster Identification Using Projections," working paper, Universidad Carlos III de Madrid, Dept. of Statistics and Econometrics.
Peña, D., and Yohai, V. (1999), "A Fast Procedure for Outlier Diagnostics in Large Regression Problems," Journal of the American Statistical Association, 94, 434–445.
Rousseeuw, P. J. (1984), "Least Median of Squares Regression," Journal of the American Statistical Association, 79, 871–880.
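The theoretical motivation discussed above (for two normal clusters with a common covariance matrix, the projection direction that minimizes the kurtosis coefficient is the Fisher discriminant direction) can be illustrated numerically. A minimal sketch, using NumPy and a crude grid search over directions that is purely illustrative, not the authors' algorithm; with identity covariance, the Fisher direction reduces to the mean-separation direction (1, 0):

```python
import numpy as np

rng = np.random.default_rng(0)

# Two normal clusters with the same (identity) covariance, different means.
n = 2000
X = np.vstack([
    rng.normal([0.0, 0.0], 1.0, size=(n, 2)),
    rng.normal([4.0, 0.0], 1.0, size=(n, 2)),
])

def proj_kurtosis(theta):
    """Kurtosis coefficient of the data projected onto (cos theta, sin theta)."""
    y = X @ np.array([np.cos(theta), np.sin(theta)])
    z = y - y.mean()
    return np.mean(z**4) / np.mean(z**2) ** 2

# Crude grid search over directions in [0, pi).
thetas = np.linspace(0.0, np.pi, 361)
best = min(thetas, key=proj_kurtosis)
direction = np.array([np.cos(best), np.sin(best)])

# The minimizing direction is close to +-(1, 0); the projection onto it is
# bimodal, so its kurtosis is well below the Gaussian value of 3.
print(direction)
```

The balanced two-cluster projection along the separating direction has kurtosis near 43/25 = 1.72 for this configuration, versus 3 in the orthogonal direction, which is what the grid search picks up.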