
Journal of Applied Statistics

ISSN: 0266-4763 (Print) 1360-0532 (Online) Journal homepage: http://www.tandfonline.com/loi/cjas20

Outlier detection with Mahalanobis square distance: incorporating small sample correction factor

Meltem Ekiz & O.Ufuk Ekiz

To cite this article: Meltem Ekiz & O.Ufuk Ekiz (2016): Outlier detection with Mahalanobis
square distance: incorporating small sample correction factor, Journal of Applied Statistics,
DOI: 10.1080/02664763.2016.1255313

To link to this article: http://dx.doi.org/10.1080/02664763.2016.1255313

Published online: 14 Nov 2016.


Outlier detection with Mahalanobis square distance: incorporating small sample correction factor

Meltem Ekiz and O. Ufuk Ekiz
Department of Statistics, University of Gazi, Faculty of Science, Teknikokullar, Ankara, Turkey

ABSTRACT
Mahalanobis square distances (MSDs) based on robust estimators improve outlier detection performance in multivariate data. However, the unbiasedness of robust estimators is not guaranteed when the sample size is small, and this reduces their performance in outlier detection. In this study, we propose a framework that uses MSDs with an incorporated small sample correction factor (c) and show its impact on performance when the sample size is small. This is achieved by using two prototypes, the minimum covariance determinant estimator and S-estimators with bi-weight and t-biweight functions. The results from simulations show that the distribution of MSDs for non-extreme observations is more likely to fit a chi-square distribution with p degrees of freedom, and the MSDs of the extreme observations fit an F distribution, when c is incorporated into the model. However, without c, the distributions deviate significantly from the chi-square and F distributions observed for the case with incorporated c. These results are even more prominent for S-estimators. We present seven distinct comparison methods with robust estimators and various cut-off values and test their outlier detection performance with simulated data. We also present an application of some of these methods to real data.

ARTICLE HISTORY
Received 25 March 2016; Accepted 19 October 2016

KEYWORDS
Outlier; MCD estimators; S-estimators; bi-weight; t-biweight; Mahalanobis square distance

AMS SUBJECT CLASSIFICATION
62F35; 62H10

1. Introduction
There is growing interest in the literature on detecting outliers in multivariate data. This problem was first addressed by using classical statistical methods [1,17,23]. However, these classical methods fail in detection, since the sample mean and covariance are highly sensitive to outliers, and the existence of multiple outliers in the data causes masking and swamping problems. These issues are then addressed by using Mahalanobis square distances (MSDs) based on robust estimators of the location and scale parameters.
In the literature, there is a vast amount of work on the details and the performance of the proposed robust estimators [4,5,9,10,14,22,25,27,28,33]. In particular, Becker and Gather [2], Cerioli [7], Hadi [13], Hardin and Rocke [16], Herwindiati et al. [18], Rocke [26], and Rousseeuw and van Zomeren [29] have most commonly used MSDs based on the minimum covariance determinant (MCD) and S-estimators with bi-weight and t-biweight functions. Moreover, they investigate the distributions of MSDs based on these proposed robust estimators.

CONTACT Meltem Ekiz ozmeltem@gazi.edu.tr Department of Statistics, University of Gazi, Faculty of Science, Teknikokullar, 06500 Ankara, Turkey

© 2016 Informa UK Limited, trading as Taylor & Francis Group
The formal definition of the MSD is given by

D_i(X_i, μ̂, Σ̂) = (X_i − μ̂)′ Σ̂⁻¹ (X_i − μ̂),

where ′ represents the transpose, i = 1, 2, . . . , n and n is the sample size. In the formula above, X_i is drawn from a multivariate normal distribution, X_i ∼ N_p(μ, Σ), and μ̂ and Σ̂ are the estimates of the mean μ and covariance Σ. The MSD computed by using robust estimators of the location and scale parameters is known to be distributed asymptotically as χ²_p [16], where p is the number of variables. However, the approximation remains weak in cases where the sample size is small [7,16,30].
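The MSD formula above can be sketched numerically as follows; the function name and the use of NumPy are our own illustration, not part of the original paper.

```python
import numpy as np

def mahalanobis_sq(X, mu_hat, sigma_hat):
    # D_i = (X_i - mu_hat)' Sigma_hat^{-1} (X_i - mu_hat), one value per row of X
    diff = X - mu_hat
    return np.einsum('ij,jk,ik->i', diff, np.linalg.inv(sigma_hat), diff)

# with mu_hat = 0 and Sigma_hat = I, the MSD reduces to the squared norm
X = np.array([[1.0, 0.0], [3.0, 4.0]])
print(mahalanobis_sq(X, np.zeros(2), np.eye(2)))  # -> [ 1. 25.]
```

Any location and scatter estimates (classical or robust) can be plugged in for `mu_hat` and `sigma_hat`.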
In order to overcome this problem, for the multivariate normal distribution, the consistency factor (k) is used to ensure the consistency of the robust estimators of the scale parameters [7,24]. However, when the sample size is small, using k does not necessarily guarantee the unbiasedness of the robust estimators [7,24]. If the random variable in Equation (3) is Wishart distributed, then m_MCD E(k Σ̂_MCD) = m_MCD Σ [21]. On the other hand, in the case of a small sample size, E(k Σ̂_MCD) ≠ Σ, and there exists a c that satisfies E(c k Σ̂_MCD) = Σ.
Hence, in [7,11,24] the authors suggested using a small sample correction factor (c) that is determined by simulations for various sample sizes and numbers of variables.
The aim of this study is to show the effect of the small sample correction factor (c) on the distribution of MSDs based on MCD and S-estimators when the sample size is small. To achieve this, we compute c values from simulated data with various n and p and compare the number of outliers detected by MSDs (with incorporated c) based on various robust estimators. In the analysis, we show the sensitivity of the results to p/n, the ratio of the number of variables to the sample size.
In Section 2, we introduce MCD and S-estimators with bi-weight and t-biweight functions and present the obtained c values. Then, the distributions of the MSDs based on these estimators (with incorporated c values) are examined in Section 3. In Section 4, we summarize seven different methods to detect outliers from MSDs based on various robust estimators and cut-off values. We also illustrate the simulation results that compare their performance in estimating the number of outliers. Finally, we present an example application of the proposed methods to real data.

2. MCD and S-estimators with bi-weight and t-biweight functions


In this section, we introduce the MCD and S-estimators used to detect outliers in data. Before presenting the details of the estimators, we need to define an outlier and the breakdown point (BP) of an estimator.
We define an outlier in the following way. An alternative hypothesis to H₀: x_i ∈ G, {i = 1, . . . , n} is a hypothesis of the form H₁: x_i ∈ F, {i = 1, . . . , n}, where F = (1 − λ)G + λH, a proportion 1 − λ of the observations is generated by G (e.g. a normal distribution), and a proportion λ (λ < 0.5) is generated by an unknown distribution H (e.g. a shifted normal distribution); that is, outliers reflect the chance that λ of the observations arise from the distribution H. This alternative hypothesis is called a mixture alternative [1,22].

For the definition of the BP of an estimator, we follow the definition given in [22]. The BP of an estimate θ̂ at G, denoted by ε*(θ̂, G), is the largest ε* ∈ (0, 1) such that, for λ < ε*, θ̂∞((1 − λ)G + λH) as a function of H remains bounded.

2.1. MCD estimators


For a finite sample of observations (x₁, x₂, . . . , x_n) in ℝ^p, the MCD is determined by selecting the subset {x₁, x₂, . . . , x_h} of size h, with 1 ≤ h ≤ n, which minimizes the generalized variance (the determinant of the covariance matrix computed from the subset) among all possible subsets of size h. The location estimator for MCD is defined as

μ̂_MCD = (1/h) ∑_{j=1}^{h} x_j

and the estimator of scatter is given by

Σ̂_MCD = k (1/h) ∑_{j=1}^{h} (x_j − μ̂_MCD)(x_j − μ̂_MCD)′,

where k is a consistency factor. h is commonly chosen as h = [(n + p + 1)/2] ≈ 0.5n, since it yields the highest BP [8,20]. However, when the sample size is small, k is not sufficient to guarantee unbiasedness; thus, the small-sample correction factor for MCD (c_MCD) is proposed in [24] (and also suggested by Cerioli [7]).
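For very small n, the subset search that defines the MCD can be carried out exhaustively. The following sketch is our own illustration (practical implementations such as FAST-MCD use concentration steps instead) and omits the k and c corrections described above.

```python
import itertools
import numpy as np

def exact_mcd(X, h=None):
    # brute-force MCD: return the mean and scatter of the h-subset whose
    # covariance matrix has the smallest determinant (exponential cost!)
    n, p = X.shape
    if h is None:
        h = (n + p + 1) // 2
    best_det, best = np.inf, None
    for idx in itertools.combinations(range(n), h):
        sub = X[list(idx)]
        mu = sub.mean(axis=0)
        S = (sub - mu).T @ (sub - mu) / h
        det = np.linalg.det(S)
        if det < best_det:
            best_det, best = det, (mu, S)
    return best  # (mu_MCD, raw Sigma_MCD before multiplying by k and c)

rng = np.random.default_rng(0)
X = np.vstack([rng.standard_normal((10, 2)), [[100.0, 100.0]]])
mu_mcd, _ = exact_mcd(X)  # the far point at (100, 100) is left out of the subset
```

Because every h-subset containing the planted point at (100, 100) has a huge determinant, the minimizing subset excludes it and the location estimate stays near the bulk of the data.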

2.2. S-estimators with bi-weight and t-biweight


S-estimators behave reasonably well in terms of both robustness and statistical efficiency and are relatively popular in the robustness literature (see, e.g. [6,8–10,22,26]). S-estimators of μ and Σ are defined by Davies [9,10] as follows.
Let κ denote a function with the properties:

(1) κ(0) = 1;
(2) κ : ℝ⁺ → [0, 1] is non-increasing, continuous at 0 and continuous on the left;
(3) κ(u) > 0 for 0 ≤ u < e, and κ(u) = 0 for u > e, for some e > 0.

Given ε, 0 < ε < 1, the S-estimators (μ̂_S, Σ̂_S) of (μ, Σ) are defined to be the solution (α*, A*) of the following minimization problem. Choose α ∈ ℝ^p and a positive definite symmetric matrix A_{p×p} that minimize |A| (the determinant of the matrix A) subject to

(1/n) ∑_{i=1}^{n} κ((X_i − α)′ A⁻¹ (X_i − α)) ≥ 1 − ε,    (1)

and it is shown in [9,10] that

1 − ε = E(κ((X − μ)′ Σ⁻¹ (X − μ))).


The BP of the S-estimators is ε* = min(ε, 1 − ε). For a given ε, one may ensure the Fisher consistency of the estimates by choosing a function κ that has the properties given above and a continuous third derivative. If we replace the κ function with a non-decreasing function ρ : ℝ → [0, ∞), we obtain results similar to those presented in [19].
If Tukey's bi-weight function,

κ_b(u) = (1 − (u/e)²)²  for 0 ≤ u ≤ e,  and  κ_b(u) = 0  for u > e,

is used to obtain the S-estimators, then e can be obtained from

1 − ε = E(κ_b((X − μ)′ Σ⁻¹ (X − μ))) = (π^{p/2} / Γ(p/2)) ∫₀^e κ_b(u) f(u) u^{p/2−1} du,    (2)

where 1 − ε = (n − p + 1)/(2n) [10,19].
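The constant e in Equation (2) can be found numerically. The sketch below is our own illustration: it writes the expectation as an integral against the χ²_p density (equivalent to the π^{p/2}/Γ(p/2) form above) and solves for e with the asymptotic choice 1 − ε = 0.5; for p = 2 it reproduces e ≈ 2.777 from Table 2.

```python
import numpy as np
from scipy import integrate, optimize, stats

def biweight_expectation(e, p):
    # E[kappa_b(U)] with U ~ chi^2_p and kappa_b(u) = (1 - (u/e)^2)^2 on [0, e]
    val, _ = integrate.quad(
        lambda u: (1 - (u / e) ** 2) ** 2 * stats.chi2.pdf(u, df=p), 0, e)
    return val

def solve_e(p, one_minus_eps=0.5):
    # the expectation increases with e, so a simple root bracket suffices
    return optimize.brentq(
        lambda e: biweight_expectation(e, p) - one_minus_eps, 0.1, 200.0)
```

For finite n the paper's choice 1 − ε = (n − p + 1)/(2n) can be passed as `one_minus_eps` instead of the asymptotic 0.5.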


As an alternative, the t-biweight function was introduced by Rocke [26] and is given by

κ_t(u) = 1  for 0 ≤ u < e₁,
κ_t(u) = (1 − ((u − e₁)/e₂)²)²  for e₁ ≤ u < e₁ + e₂,
κ_t(u) = 0  for u > e₁ + e₂.

This function contains two parameters (e₁, e₂) that satisfy the given values of the BP and the asymptotic rejection probability (for large values of p, Rocke [26] showed that e₂ = p/e₁). The constants e₁ and e₂ = p/e₁ are determined from Equation (2), and they satisfy E(κ_t((X − μ)′ Σ⁻¹ (X − μ))) = 1 − ε. e₁ is computed by replacing κ_b with κ_t and a proper partition of the integral boundaries; e₂ is then found from e₂ = p/e₁. Henceforth, we write S-estimators when we refer to S-estimators with bi-weight and t-biweight functions.
In our previous study [11], we concluded that it is necessary to use a small sample correction factor for S-estimators. To achieve this, after computing the robust estimators explained above, we multiply the estimated covariance by the small sample correction factor: c_MCD (mentioned in [7]) for MCD, and c_b and c_t for S-estimators with bi-weight and t-biweight functions, respectively. Since the robust estimators are scale invariant, without loss of generality we assume that the true value of the covariance is Σ = I, where I represents the identity matrix. In order to choose the proper c values, we run r simulations for each estimator, compute the value of Σ̂ in each run, and take the 1/p-th power of the determinant of the average across the total number of runs [7,11]. This gives

c = 1 / | (1/r) ∑_{j=1}^{r} Σ̂^{(j)} |^{1/p}.

c takes values on the interval [0, 1]; when the sample size is small, it satisfies |c Σ̂| ≈ 1, and as the sample size increases, c approaches 1. Table 1 illustrates the c values for the robust estimators investigated in this study. Table 2 presents the e values (for bi-weight) and e₁ values (for t-biweight; e₂ can be computed directly from e₁) obtained for various p.

Table 1. The small-sample correction factors.

p    estimator   n = 50   n = 100   n = 500   n = 1000
4    c_MCD       0.9050   0.9350    0.9850    0.9900
     c_b         0.7700   0.8800    0.9700    0.9900
     c_t         0.7700   0.8350    0.9500    0.9850
10   c_MCD       0.8410   0.8900    0.9700    0.9850
     c_b         0.7350   0.8700    0.9650    0.9800
     c_t         0.7150   0.8200    0.9450    0.9700
20   c_MCD       0.7720   0.8450    0.9520    0.9900
     c_b         0.6350   0.8150    0.9600    0.9650
     c_t         0.6560   0.7630    0.9250    0.9550

Table 2. e and e₁ values for various p.

p     2        4       6        8        10       20
e     2.777    6.637   10.482   14.298   18.092   36.860
e₁    1.0975   3.004   5.092    7.133    9.158    19.214
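The Monte-Carlo recipe for c can be sketched as follows. The scatter estimator below is a crude stand-in (trimming by distance from the coordinatewise median, rescaled with the usual χ² consistency factor), not the MCD or S-estimators of the paper, so the resulting value only illustrates the computation, not the entries of Table 1.

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(1)

def trimmed_scatter(X, h):
    # crude robust scatter: covariance of the h points closest to the
    # coordinatewise median, rescaled by a chi-square consistency factor
    n, p = X.shape
    d = np.sum((X - np.median(X, axis=0)) ** 2, axis=1)
    sub = X[np.argsort(d)[:h]]
    mu = sub.mean(axis=0)
    S = (sub - mu).T @ (sub - mu) / h
    alpha = h / n
    k = alpha / chi2.cdf(chi2.ppf(alpha, p), p + 2)  # consistency factor
    return k * S

def small_sample_c(n, p, r=200):
    # c = 1 / |average of r simulated scatter matrices|^(1/p), with Sigma = I
    h = (n + p + 1) // 2
    avg = sum(trimmed_scatter(rng.standard_normal((n, p)), h) for _ in range(r)) / r
    return 1.0 / np.linalg.det(avg) ** (1.0 / p)
```

Replacing `trimmed_scatter` with an actual MCD or S-estimator routine would reproduce the procedure used for Table 1.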

3. Distribution of MSDs based on the MCD and S-estimators in the small sample case
In the rest of the study, we will use the fact that if X_i is multivariate normal, m Σ̂ is Wishart (W(m, Σ)) distributed, and Σ̂ is independent of X_i, then D_i(X_i, μ, Σ̂) and D_i(X_i, X̄, Σ̂) are approximately F distributed [21,31].
Let X₁, X₂, . . . , X_n be an i.i.d. random sample (X_i ∼ N_p(μ, Σ), i = 1, 2, . . . , n); then the elements of X_i that are excluded in the computation of the MCD are independent of Σ̂_MCD, and

m_MCD k Σ̂_MCD ∼ W_p(m_MCD, Σ),    (3)

[16]. Here, m_MCD is the unknown degrees of freedom, and k is the consistency factor.
Additionally, since μ̂_MCD → μ as n → ∞, the distribution of MSDs for extreme observations based on the MCD estimators becomes

D(X_i, μ̂_MCD, k Σ̂_MCD) ∼ (k p m_MCD / (m_MCD − p + 1)) F_{p, m_MCD−p+1}

[16]. The degrees of freedom of F is then estimated by

m̂_MCD = 2 / Ĉ²,    (4)

where Ĉ is the coefficient of variation of the diagonal elements of Σ̂_MCD.


However, it is also shown in [16] that the MSDs (based on MCD) of the observations that are used to find the MCD estimator fit χ²_p.
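Equation (4) can be checked directly: for m Σ̂ ∼ W_p(m, I), the diagonal elements of Σ̂ are χ²_m/m variables with coefficient of variation √(2/m). The sketch below (our own illustration) pools the diagonals of simulated scatter matrices and recovers m̂ = 2/Ĉ².

```python
import numpy as np

rng = np.random.default_rng(2)

def estimate_m(sigma_hats):
    # m_hat = 2 / C_hat^2, with C_hat the coefficient of variation of the
    # pooled diagonal elements of the simulated scatter matrices (Eq. (4))
    diags = np.concatenate([np.diag(S) for S in sigma_hats])
    C = diags.std(ddof=1) / diags.mean()
    return 2.0 / C ** 2

# sanity check against a Wishart with known degrees of freedom m = 30
m_true, p, r = 30, 5, 2000
sims = []
for _ in range(r):
    G = rng.standard_normal((m_true, p))
    sims.append(G.T @ G / m_true)  # Sigma_hat with m * Sigma_hat ~ W_p(m, I)
```

In the paper's setting the simulated Σ̂ matrices come from the robust estimators rather than from an exact Wishart draw, so m̂ is an estimate of the unknown Wishart degrees of freedom in Equation (3).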
These expressions are also valid for S-estimators [15]. The authors in [15] validated the expressions by generating two different samples from the standard multivariate normal distribution: the first is an i.i.d. sample from the standard multivariate normal, and the second is generated from three truncated multivariate normal distributions. Denote by Y₁, Y₂, . . . , Y_n the first and by X₁, X₂, . . . , X_n the second group of the random sample. The second sample is independent of the first and is generated as follows [15,16]. Given 1 − ε < δ < 1,

(1) let (n₁, n₂, n₃) come from the multinomial (n, (1 − ε), δ − (1 − ε), 1 − δ) distribution;
(2) X₁, X₂, . . . , X_{n₁} is a sample from a truncated normal distribution N_p(μ, Σ), where the truncation is on the condition (X − μ)′ Σ⁻¹ (X − μ) < χ²_{p,1−ε};
(3) X_{n₁+1}, X_{n₁+2}, . . . , X_{n₁+n₂} is a sample from a truncated normal distribution N_p(μ, Σ), where the truncation is based on the conditions χ²_{p,1−ε} < (X − μ)′ Σ⁻¹ (X − μ) < χ²_{p,δ};
(4) X_{n₁+n₂+1}, X_{n₁+n₂+2}, . . . , X_{n₁+n₂+n₃=n} is a sample from a truncated normal distribution N_p(μ, Σ), where the truncation is based on the condition (X − μ)′ Σ⁻¹ (X − μ) > χ²_{p,δ}.

Then, three elliptical regions are defined as

E₁ = {X ∈ ℝ^p : (X − μ)′ Σ⁻¹ (X − μ) < χ²_{p,1−ε}},
E₂ = {X ∈ ℝ^p : χ²_{p,1−ε} < (X − μ)′ Σ⁻¹ (X − μ) < χ²_{p,δ}},
E₃ = {X ∈ ℝ^p : (X − μ)′ Σ⁻¹ (X − μ) > χ²_{p,δ}}.

Here, E₃ contains the extreme observations of the second sample. Therefore, under the conditions above, the second sample X₁, X₂, . . . , X_n is an i.i.d. sample from the N_p(μ, Σ) distribution.
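The sampling scheme in steps (1)–(4) can be sketched by rejection sampling. This is our own illustration; with μ = 0 and Σ = I the truncation conditions reduce to bounds on the squared norm.

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(3)

def truncated_normal(size, p, lo=0.0, hi=np.inf):
    # N_p(0, I) vectors conditioned on lo < ||x||^2 < hi, by rejection
    out = []
    while len(out) < size:
        x = rng.standard_normal(p)
        if lo < x @ x < hi:
            out.append(x)
    return np.array(out).reshape(size, p)

def second_sample(n, p, eps=0.5, delta=0.9):
    # steps (1)-(4): multinomial split, then three truncated sub-samples
    q1, q2 = chi2.ppf(1 - eps, p), chi2.ppf(delta, p)
    n1, n2, n3 = rng.multinomial(n, [1 - eps, delta - (1 - eps), 1 - delta])
    return np.vstack([truncated_normal(n1, p, hi=q1),
                      truncated_normal(n2, p, lo=q1, hi=q2),
                      truncated_normal(n3, p, lo=q2)])
```

The concatenated sample has the same distribution as n i.i.d. draws from N_p(0, I), while its last n₃ rows lie in the extreme region E₃ by construction.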

Remark 3.1: In this study, the data are also generated under the conditions presented above, and in what follows we show the impact of incorporating c on the distribution of the MSDs based on the MCD and S-estimators.

With the small sample correction factor (c), as μ̂_MCD → μ, the distribution of MSDs (based on MCD) for extreme observations becomes

D(X_i, μ̂_MCD, c_MCD k Σ̂_MCD) ∼ (c_MCD k p m_MCD / (m_MCD − p + 1)) F_{p, m_MCD−p+1}.

Similarly, when S-estimators with bi-weight and t-biweight functions are used, as μ̂_b → μ, the distribution of MSDs for extreme observations becomes

D(X_i, μ̂_b, c_b Σ̂_b) ∼ (c_b p m_b / (m_b − p + 1)) F_{p, m_b−p+1},

and as μ̂_t → μ,

D(X_i, μ̂_t, c_t Σ̂_t) ∼ (c_t p m_t / (m_t − p + 1)) F_{p, m_t−p+1},

respectively. These equations are similar to those given in [16], but they differ by the multiplicative c values.
When c is incorporated, the probability that an observation (used to calculate μ̂ and Σ̂) falls into the region E₂ ∪ E₃, and especially into E₃, approaches zero as the sample size increases. This is in line with the results given in [16]. Hence, in our analysis below, we use the MCD and S-estimators computed from the second sample, which are by definition independent of both the extreme values (of the second sample) and the first sample; that is, μ̂ and c Σ̂ are independent of X_{n₁+n₂+1}, . . . , X_n and Y₁, . . . , Y_n.

Figure 1. The horizontal axis denotes the Monte-Carlo estimates of the expectations of the order statistics of χ²_p, and the vertical axis shows the Monte-Carlo estimates of the ordered MSDs based on the robust estimators (MCD and S-estimators with bi-weight and t-biweight functions). The simulations are performed for 5000 repetitions with n = 50, p = 10, ε = 0.50 and δ = 0.90. Green plus signs: Monte-Carlo estimates of the ordered MSDs for the first group of samples Y₁, . . . , Y_n, based on μ̂ and c Σ̂ obtained from X₁, . . . , X_n. Red diamonds: Monte-Carlo estimates of the ordered MSDs for the second group of samples X₁, . . . , X_n, based on μ̂ and c Σ̂ obtained from X₁, . . . , X_n. Blue circles: expected values of the order statistics of F. Solid black line: expected values of the order statistics of χ²_p plotted against themselves.

Table 3. m̂_MCD, m̂_b and m̂_t values when c is used.

n               p = 2      p = 4      p = 6      p = 8      p = 10     p = 20
50    m̂_MCD    11.1937    15.1005    17.7910    20.0573    21.6551    32.7940
      m̂_b       6.9763    11.4209    18.0501    21.3938    24.7369    32.3936
      m̂_t       7.3579    12.7987    14.0492    18.2142    21.6777    30.3938
100   m̂_MCD    15.2131    23.3081    28.4823    32.3689    36.5351    48.6523
      m̂_b      17.5839    26.4124    41.0005    44.2126    48.3432    61.7918
      m̂_t       8.0839    17.5355    25.5680    28.7255    38.1135    49.8677
500   m̂_MCD    50.6167    82.7396   104.6648   124.9625   134.8462   176.7611
      m̂_b     107.0562   141.5303   222.8444   246.1443   304.7836   375.1362
      m̂_t      47.6882    65.1780    91.1678   109.5918   127.7914   167.6284
1000  m̂_MCD    91.9343   145.7057   191.8972   243.1629   241.2236   337.4993
      m̂_b     176.7452   326.2297   427.7858   500.0111   585.1014   745.2888
      m̂_t      77.7145   120.3194   168.8665   194.2897   233.3980   337.5019
In order to illustrate the impact of using the small sample correction factor (c), we present the simulation results in Figure 1. In the simulations, we compute m̂ from Equation (4). As described in [16], since the diagonal elements of Σ̂ are identically distributed and uncorrelated, we simulate r independent copies of the Σ̂ matrix from n data points of each independent sample, and m is estimated from the coefficient of variation of the rp diagonal elements. In Table 3, we present the m̂ values obtained from r = 5000 simulations for various n and p values, where ε = 0.50 and δ = 0.90.
For the observations in region E₁, the distribution of MSDs based on μ̂ and c Σ̂ fits χ²_p with p degrees of freedom (Figure 1, (a,1), (b,1), (c,1)). However, in the absence of c, this is not necessarily the case when the sample size is small (Figure 1, (a,2), (b,2), (c,2)). The results are even more prominent for S-estimators. Moreover, the distribution based on the extreme observations (region E₃) deviates more significantly from χ²_p when the sample size is small.
We also generate the distributions of MSDs for the observations Y₁, . . . , Y_n based on μ̂ and c Σ̂ and observe that they fit an F distribution (see Figure 1, (a,1), (b,1), (c,1)). The distribution of the MSDs of the extreme observations (in E₃) fits the tail of the same F. Nevertheless, the degrees of freedom of the F found in our simulations with c differ from those of the F distributions observed in [16] without c (see Figure 1, (a,2), (b,2), (c,2)).

4. Testing MSD performances in outlier detection with the small sample correction factor (c)
In this part of the paper, we summarize our results in order to validate the proposed framework presented in the preceding section. We consider seven different comparison methods using MSDs based on various estimators and distinct cut-off values. Then, we show the results from the simulations and discuss their performance in outlier detection when the sample size is small. Finally, in the second part of the section, we present an example application to real data and compare our findings with previous results from the literature.

4.1. Simulations to test MSD performances


We perform simulation studies by incorporating c for the seven comparisons below:

(1) MSDs based on maximum likelihood estimators, using a cut-off value of χ²_{p,1−α};
(2) MSDs based on MCD estimators, using a cut-off value of F_{p,m̂_MCD−p+1,1−α};
(3) MSDs based on MCD estimators, using a cut-off value of χ²_{p,1−α};
(4) MSDs based on S-estimators with the bi-weight function, using a cut-off value of F_{p,m̂_b−p+1,1−α};
(5) MSDs based on S-estimators with the bi-weight function, using a cut-off value of χ²_{p,1−α};
(6) MSDs based on S-estimators with the t-biweight function, using a cut-off value of F_{p,m̂_t−p+1,1−α};
(7) MSDs based on S-estimators with the t-biweight function, using a cut-off value of χ²_{p,1−α}.

These are compared in terms of their performance in detecting the number of outlying
points when the sample size is small.
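The chi-square versus scaled-F cut-offs that distinguish the methods above can be sketched as follows. This is our own illustration; the scaled-F form follows the pm/(m − p + 1) factor of Section 3 and omits the c and k constants, which are assumed to be already absorbed into the MSDs.

```python
import numpy as np
from scipy.stats import chi2, f as f_dist

def flag_outliers(msd, p, m_hat=None, alpha=0.025):
    # chi-square cut-off (as in methods 1, 3, 5, 7); if an estimated degrees
    # of freedom m_hat is given, use the scaled-F cut-off (methods 2, 4, 6)
    if m_hat is None:
        cutoff = chi2.ppf(1 - alpha, p)
    else:
        cutoff = (p * m_hat / (m_hat - p + 1)) * f_dist.ppf(1 - alpha, p, m_hat - p + 1)
    return msd > cutoff

msd = np.array([1.0, 5.0, 50.0])
print(flag_outliers(msd, p=4))  # chi^2_{4, 0.975} is about 11.14
```

The estimated number of outliers ν̂ is then simply the number of flagged distances.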
As the number of variables increases, the presence of outliers in the data causes swamping and masking problems. These problems are investigated by generating data with a small sample size and multiple outliers. To achieve this, we construct a simulation study to generate data from multivariate normal distributions; the steps of the simulation are listed below:

(1) ν₁ is a randomly selected integer from the uniform distribution U(1, ν), where ν denotes the number of outliers, and ν₂ is determined so that ν = ν₁ + ν₂ is satisfied;
(2) ν₁ outliers are then generated from N_p(μ₁, I), and the remaining ν₂ = ν − ν₁ outliers are generated from N_p(μ₂, I); here, the elements of μ₁ and μ₂ are randomly generated from U(−10, 10) such that 25 ≤ μ₁′μ₁ and 25 ≤ μ₂′μ₂;
(3) finally, (n − ν) samples are generated from the standard multivariate normal distribution.
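The three steps above can be sketched as follows (our own illustration; function and variable names are not from the paper):

```python
import numpy as np

rng = np.random.default_rng(4)

def contaminated_sample(n, p, nu):
    # steps (1)-(3): split nu outliers between two shifted normal clusters
    nu1 = int(rng.integers(1, nu + 1))   # nu1 ~ U(1, nu), nu2 = nu - nu1
    nu2 = nu - nu1

    def shifted_mean():
        # elements from U(-10, 10), redrawn until 25 <= mu' mu
        while True:
            mu = rng.uniform(-10.0, 10.0, size=p)
            if mu @ mu >= 25.0:
                return mu

    out1 = rng.standard_normal((nu1, p)) + shifted_mean()
    out2 = rng.standard_normal((nu2, p)) + shifted_mean()
    clean = rng.standard_normal((n - nu, p))
    return np.vstack([out1, out2, clean])
```

The constraint μ′μ ≥ 25 guarantees that each outlier cluster is shifted by at least 5 standard deviations from the bulk of the data.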

Hence, the data are used to test the performance of the seven methods presented earlier in detecting outliers. The subsets of methods used to detect the ν outliers are listed below (Figure 2):

(1) we applied methods 2 and 3, and the results are given in plots (a,1), (a,2) and (a,3), respectively, for p/n > 0.20, p/n = 0.20 and p/n < 0.20;
(2) we applied methods 1, 4 and 5, and the results are given in plots (b,1), (b,2) and (b,3), respectively, for p/n > 0.20, p/n = 0.20 and p/n < 0.20;
(3) we applied methods 1, 6 and 7, and the results are given in plots (c,1), (c,2) and (c,3), respectively, for p/n > 0.20, p/n = 0.20 and p/n < 0.20.

Figure 2. For 5000 repetitions, the horizontal axis denotes the number of outliers, ν, and the vertical axis shows the means of the detected outlier ratios (ν̂/n) determined using the procedures given in 1–7. In all of the subplots, the black solid line is the line l with slope 1/n. Moreover, the dash-dot line in subplots (b1)–(c3) represents the results gathered from Method 1. In subplots (a1)–(a3), blue and red represent the results for Methods 2 and 3, respectively. In subplots (b1)–(b3), brown and green are used for Methods 4 and 5, respectively. Finally, in subplots (c1)–(c3), purple and cyan are used for Methods 6 and 7, respectively.

The horizontal axis in these plots denotes the number of outliers ν, and the vertical axis denotes the Monte-Carlo means of the detected outlier rates (ν̂/n) obtained by repeating the simulations r times for each method. Here, the solid line (l) in the plots represents the line with slope 1/n, such that each point on the vertical axis is equal to ν/n for the real ν values. The performance of a given method is judged 'successful' if the estimated ν̂/n fits l.

Table 4. Monte-Carlo estimates of the number of observations used in the calculations of the S-estimators (h).

n           p = 2     p = 4     p = 6     p = 8     p = 10    p = 20
50    h_b   34.3350   36.3600   37.0700   37.5300   37.6350   40.1300
      h_t   26.6800   26.9300   27.6200   28.2400   29.6000   35.2100
100   h_b   70.7700   77.2200   81.9300   84.3600   85.1700   86.9500
      h_t   53.0300   52.5700   52.5300   53.8700   54.3500   59.8100

Please note that ν̂/n is a Monte-Carlo mean of the ratio of observations that the methods detect as outliers. Whether an observation is correctly classified as an outlier is not taken into account.
The results gathered from the analysis were similar for various p values. Hence, in order to avoid repetition, we plot only p = 20 in Figure 2. We focus on three cases: p/n > 0.20, p/n = 0.20 and p/n < 0.20. This is because, in all of the repeated simulations with various p/n, we observe the most dramatic change in behavior for values below and above p/n = 0.20. For instance, one may conclude from plots (a,1) for p/n > 0.20 and (a,3) for p/n < 0.20 in Figure 2 that the results differ significantly, and methods 2 and 3 perform better in approaching line l in the latter case.
Furthermore, plots (b,i) and (c,i), i = 1, 2, 3, show that the BP of the maximum likelihood estimators (method 1) is very low. From the plots, one may observe that as the number of outliers increases, the estimates move further away from line l.
For a general conclusion, if we consider p/n < 0.20 and compare all seven methods (see (a,3), (b,3) and (c,3) in Figure 2), it is clear that method 2 performs best in determining the number of outliers correctly. We also conclude that the behaviors of the methods in each pair (2, 3), (4, 5) and (6, 7) become similar to each other as the sample size increases. This similarity occurs rapidly for methods 2 and 3 and more slowly for methods 6 and 7. These conclusions parallel the well-known fact that the distribution of the extreme values of MSDs fits a chi-square with p degrees of freedom as n increases [30].
When p/n > 0.20, it is obvious from plots (a,1), (b,1) and (c,1) in Figure 2 that the results gathered from method 4 are very satisfactory, and we conclude that it performs best for p/n > 0.20, detecting 10% of the outliers in the data with great accuracy.
h_b and h_t represent the Monte-Carlo estimates of the number of observations used to calculate the S-estimators, where the subscripts b and t represent the bi-weight and t-biweight functions, respectively. These estimates are given in Table 4. One may observe from the table that the h_b values are always larger than h_t for every n and p pair. This shows that S-estimators with the bi-weight function are more efficient than S-estimators with the t-biweight function when e₂ = p/e₁.
In what follows, we illustrate an example application for the proposed framework.

4.2. An example application to real data


Here, we will investigate the Coleman data [28] as an example application. In the data, the number of variables is 5 and the number of observations is 20. The observations are enumerated from 1 to 20.
In the literature, the data have been examined by de Boer and Feltkamp [3], Filzmoser [12], and Rousseeuw and Leroy [28]. In [3], the data are investigated using two methods, named Projection and Kosinsky. The Projection and Kosinsky methods detected observations (2, 3, 6, 10, 11, 12, 15, 18) and (2, 6, 9, 10, 11, 15, 17) as outliers, respectively (Figure 3).

Figure 3. Dashed and solid lines are the cut-off values of chi-square and F, respectively. Blue circles are the MSDs based on bi-weight S-estimators improved by c_b, the green diamonds are the MSDs based on S-estimators, and the red crosses are the MSDs based on maximum likelihood estimators.
In the second study, performed by Filzmoser [12], observations (1, 6, 9, 10, 11, 15, 18) were determined to be the outlying points. In fact, when the MSDs based on the classical maximum likelihood estimators are used, no outliers are detected in the data (see Figure 3 or [28]).
In this study, we apply two methods (with c) to the same data (Methods 4 and 5; refer to Section 4.1) that use MSDs based on S-estimators with the bi-weight function. Methods 4 and 5 have cut-off values F_{p,m̂_b−p+1,1−α} and χ²_{p,1−α}, respectively. As illustrated in Figure 3, Methods 4 and 5 detect (6, 9, 10, 11) and (1, 2, 6, 9, 10, 11, 13, 15, 18) as outliers, respectively. From Figure 3, we conclude that the methods (with c) based on χ²_{p,1−α} detect too many outliers, and F should be preferred. This validates our simulation studies in the preceding section, in which we show that when the small sample correction factor is incorporated, the MSDs of extreme observations fit an F distribution. However, this F differs significantly from the F obtained without using c.

5. Conclusions
MSDs based on robust estimators perform well in outlier detection for large sample sizes. However, when the sample size is small, robust estimators are biased. To address this problem, we introduce a small sample correction factor for the distributions of MSDs based on robust estimators. Two widely studied estimators, MCD and S-estimators, are chosen as prototypes for testing the outlier detection performance with incorporated c when the sample size is small. We show the details of formally incorporating c into the existing model.

First, we use simulated data to investigate the impact of incorporating c. The results from the simulations show that the distribution of MSDs for non-extreme observations is more likely to fit a chi-square with p degrees of freedom, and the MSDs of the extreme observations fit an F distribution, when c is incorporated into the model. However, without c, the distributions deviate significantly from the chi-square and F distributions observed for the case with incorporated c.
Second, we introduce seven different methods to analyze their outlier detection performance. The methods use MSDs based on MCD and S-estimators with various cut-off values. The simulations reveal that the performance of the methods is highly affected by the ratio of the number of variables to the sample size. In our simulation set-up, the most distinct change in performance occurs below and above the 0.2 threshold. We conclude that below the threshold, the comparison of MSDs based on MCD with an F cut-off value gives the best performance. However, above the threshold, the comparison of MSDs based on S-estimators with the bi-weight function against the F distribution gives more satisfactory results when the sample size is small.
Finally, these results are validated by applying some of these comparison methods to real data. We investigate the well-known Coleman data, which has 5 variables and 20 observations in total. We observe that the method with the χ²_{p,1−α} cut-off value detects too many outliers, while the F cut-off value gives more reasonable results.
This study particularly focuses on the effect of c on the MSDs based on MCD and S-estimators. However, there exist studies in the literature that use MSDs based on different estimators (e.g. [32]), and the comparison of those methods with the MSDs mentioned in this study is the subject of future work.

Disclosure statement
No potential conflict of interest was reported by the authors.

References
[1] V. Barnett and T. Lewis, Outliers in Statistical Data, ISBN 0-471-93094-6, John Wiley & Sons,
Chichester, 1994.
[2] C. Becker and U. Gather, The largest nonidentifiable outlier: A comparison of multivariate
simultaneous outlier identification rules, Comput. Statist. Data Anal. 36 (2001), pp. 119–127.
[3] P. de Boer and V. Feltkamp, Robust multivariate outlier detection, Statistics Netherlands,
Project number 80820, BPA number 324-00-RMS/INTERN, 2000.
[4] R.W. Butler, P.L. Davies, and M. Jhun, Asymptotics for the minimum covariance determinant
estimator, Ann. Statist. 21 (1993), pp. 1385–1400.
[5] N.A. Campbell, Robust procedures in multivariate analysis I: Robust covariance estimation, Appl.
Stat. 29 (1980), pp. 231–237.
[6] N.A. Campbell, H.P. Lopuhaä, and P.J. Rousseeuw, On the calculation of a robust S-estimator of
a covariance matrix, Stat. Med. 17 (1998), pp. 2685–2695.
[7] A. Cerioli, Multivariate outlier detection with high-breakdown estimators, J. Amer. Statist. Assoc.
105 (2010), pp. 147–156.
[8] C. Croux and G. Haesbroeck, Influence function and efficiency of the minimum covariance
determinant scatter matrix estimator, J. Multivariate Anal. 71 (1999), pp. 161–190.
[9] P.L. Davies, Asymptotic behaviour of S-estimates of multivariate location parameters and
dispersion matrices, Ann. Statist. 15 (1987), pp. 1269–1292.
[10] L. Davies, The asymptotics of S-estimators in the linear regression model, Ann. Statist. 18 (1990),
pp. 1651–1675.

[11] O.U. Ekiz and M. Ekiz, A small-sample correction factor for S-estimators, J. Stat. Comput. Simul.
85 (2013), pp. 794–801.
[12] P. Filzmoser, Identification of multivariate outliers: A performance study, Aust. J. Stat. 34 (2005),
pp. 127–138.
[13] A.S. Hadi, Identifying multiple outliers in multivariate data, J. R. Stat. Soc. B 54 (1992),
pp. 761–771.
[14] F.R. Hampel, E.M. Ronchetti, P.J. Rousseeuw, and W.A. Stahel, Robust Statistics, The Approach
Based on Influence Functions, John Wiley & Sons, New York, 1986.
[15] J.S. Hardin, Multivariate outlier detection and robust clustering with minimum covariance
determinant estimation and S-estimation, Ph.D. thesis, University of California, 2000.
[16] J. Hardin and D.M. Rocke, The distribution of robust distances, J. Comput. Graph. Statist. 14
(2005), pp. 928–946.
[17] D.M. Hawkins, Identification of Outliers, Vol. 11, Chapman and Hall, London, 1980.
[18] D.E. Herwindiati, M.A. Djauhari, and M. Mashuri, Robust multivariate outlier labeling, Comm.
Statist. Simulation Comput. 36 (2007), pp. 1287–1294.
[19] H.P. Lopuhaä, On the relation between S-estimators and M-estimators of multivariate location
and covariance, Ann. Statist. 17 (1989), pp. 1662–1683.
[20] H.P. Lopuhaä and P.J. Rousseeuw, Breakdown points of affine equivariant estimators of multivariate
location and covariance matrices, Ann. Statist. 19 (1991), pp. 229–248.
[21] K.V. Mardia, J.T. Kent, and J.M. Bibby, Multivariate Analysis, Academic Press, New York, 1979.
[22] R.A. Maronna, R.D. Martin, and V.J. Yohai, Robust Statistics: Theory and Methods, John Wiley
& Sons, New York, 2006.
[23] K.I. Penny and I.T. Jolliffe, A comparison of multivariate outlier detection methods for clinical
laboratory safety data, The Statistician 50 (2001), pp. 295–307.
[24] G. Pison, S. Van Aelst, and G. Willems, Small sample corrections for LTS and MCD, Metrika 55
(2002), pp. 111–123.
[25] M. Riani, A.C. Atkinson, and A. Cerioli, Finding an unknown number of multivariate outliers,
J. R. Stat. Soc. B 71 (2009), pp. 447–466.
[26] D.M. Rocke, Robustness properties of S-estimators of multivariate location and shape in high
dimension, Ann. Statist. 24 (1996), pp. 1327–1345.
[27] P.J. Rousseeuw, Multivariate estimation with high breakdown point, in Mathematical Statistics
and Applications, W. Grossman, G. Pflug, I. Vincze, and W. Wertz, eds., Reidel Publishing
Company, Dordrecht, 1985, pp. 283–297.
[28] P.J. Rousseeuw and A.M. Leroy, Robust Regression and Outlier Detection, Wiley, New York,
1987.
[29] P.J. Rousseeuw and B.C. van Zomeren, Unmasking multivariate outliers and leverage points
(with discussion), J. Amer. Statist. Assoc. 85 (1990), pp. 633–639.
[30] P.J. Rousseeuw and B.C. van Zomeren, Robust distances: Simulations and cutoff values, in Direc-
tions in Robust Statistics and Diagnostics, W. Stahel and S. Weisberg, eds., Springer, New York,
1991, pp. 195–203.
[31] R.J. Serfling, Approximation Theorems of Mathematical Statistics, Wiley, New York, 1980.
[32] R. Todeschini, D. Ballabio, V. Consonni, F. Sahigara, and P. Filzmoser, Locally centred Maha-
lanobis distance: A new distance measure with salient features towards outlier detection, Anal.
Chim. Acta 787 (2013), pp. 1–9.
[33] R.R. Wilcox, Introduction to Robust Estimation and Hypothesis Testing, 2nd ed., Elsevier
Academic Press, London, 2005.
