Grubbs Test for outlier detection

Published by: Prakash Chowdary on Jun 21, 2012
Copyright:Attribution Non-commercial

Short Communication
A recursive version of Grubbs' test for detecting multiple outliers in environmental and chemical data
Ram B. Jain
Centers for Disease Control and Prevention, 4770 Buford Highway, Chamblee, GA 30341, USA
Article info
Article history: Received 16 January 2010; Received in revised form 18 March 2010; Accepted 27 April 2010; Available online 21 May 2010
Keywords: Outliers; Extreme Studentized Deviate Statistic; Simulation; Grubbs

Abstract
Objective: To compare the performance of Grubbs' outlier detection procedure with the recursive Extreme Studentized Deviate (ESD) outlier detection procedure.
Design and methods: Using simulated data, the powers of Grubbs' and ESD procedures were evaluated.
Results: Except when the sample contained exactly one outlier, the power of the ESD procedure was higher than that of Grubbs' procedure.
Conclusion: The ESD recursive procedure is the procedure of choice to detect multiple outliers in environmental and chemical data.
© 2010 The Canadian Society of Clinical Chemists. Published by Elsevier Inc. All rights reserved.
Introduction
In any laboratory where data quality is important, certain quality control schemes are practiced. These schemes include discarding certain observations, called outliers, which are beyond certain standard deviations (SD) from their respective means [1]. These outliers must be detected and treated so that they will not have a disproportionate influence on data analysis. Outliers are defined as the observations that do not fit into the pattern of the remaining observations, called inliers [2].

Many procedures to detect outliers have been proposed [3]. In a consecutive procedure like the one by Grubbs [4], only one observation at a time can be tested as an outlier. In the presence of multiple outliers, these procedures must be used repeatedly until no further outliers can be detected. In this case, the total Type I error may be much higher than the intended α of 1% or 5%. This is of concern since chemical data do have multiple outliers [5]. Grubbs' [4] procedure (GRBP) is widely accepted. Among several variations of Grubbs' statistics, the statistics

$\tau_{(N)} = (X_{(N)} - \bar{X})/s$ and $\tau_{(1)} = (\bar{X} - X_{(1)})/s$

(where $\bar{X}$ = mean of the sample, $s$ = SD of the sample, $X_{(N)}$ = the largest observation in the sample, and $X_{(1)}$ = the smallest observation in the sample) are in use today and the tables of critical values are available [4]. Recently, critical values for sample sizes up to 30,000 have become available [6].

Recursive outlier detection procedures [7] were designed to eliminate some of the shortcomings associated with consecutive procedures. These procedures, even if used to detect the presence of S suspected outliers, can detect the presence of any number of outliers, from zero to S, and can control Type I error to the intended level. Even though critical values for these procedures are available [3], there is a set of tables of critical values for each combination of N and S.

The ESD procedure [7] (ESDP) is the recursive version of GRBP. If $S_{(i)}$ is an ordered sample containing observations $X_{(1S_i)}, \ldots, X_{(NS_i)}$ such that $X_{(1S_i)} \le X_{(2S_i)} \le \ldots \le X_{(N-1\,S_i)} \le X_{(NS_i)}$ in sample $S_{(i)}$, then to detect S suspected outliers in the sample, the statistics are

$\mathrm{ESD}_i = \max\left((X_{(NS_i)} - M_{S_i}),\ (M_{S_i} - X_{(1S_i)})\right)/\mathrm{SD}_{S_i}$

($X_{(NS_i)}$ = largest observation in sample $S_{(i)}$, $X_{(1S_i)}$ = smallest observation in sample $S_{(i)}$, $M_{S_i}$ = mean of the sample $S_{(i)}$, and $\mathrm{SD}_{S_i}$ = SD of sample $S_{(i)}$, $i = 1, \ldots, S$). In other words, $\mathrm{ESD}_i$ is the extreme studentized deviate for sample $S_{(i)}$. The null hypothesis of no outliers is rejected if $\mathrm{ESD}_i$ is greater than its critical value. The critical values of ESD for N = 20 to 100, for S = 2 to 5, are available [3,8]. However, ESDP does require an estimate of S.

The powers of consecutive as well as recursive procedures are negatively affected by the masking and swamping effects when the actual number of outliers, A, is different from S. The masking effect, defined as the inability to detect an outlier in the presence of another outlier, is present when S < A. The swamping effect, defined as the detection of inliers as outliers in addition to outliers, is present when S > A. Details about masking and swamping effects are presented as Supplemental Information (SI) S1.

In this paper, I explain the algorithm used to compute sample statistics in the ESDP, develop a model to compute the critical values for ESDP, and compare the performance of ESDP with GRBP by conducting a simulation study.
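As a rough illustration of the recursion described in this section, the following Python sketch computes ESD_i for i = 1, ..., S by repeatedly removing the most extreme observation from the sample. The function names `esd_statistics` and `count_outliers` are hypothetical, and the n-1 divisor for the SD is an assumption. The decision rule (declare as many outliers as the largest i whose ESD_i exceeds its critical value) is also an assumption, based on the usual description of the ESD many-outlier procedure. The critical values must be supplied by the caller.

```python
import numpy as np


def esd_statistics(x, s):
    """Return [(ESD_i, index of the extreme observation), ...] for i = 1..s.

    ESD_i = max(X_max - mean, mean - X_min) / SD, computed on the sample
    that remains after the previous i - 1 extreme observations are removed.
    """
    values = np.asarray(x, dtype=float)
    remaining = list(range(len(values)))  # indices still in the sample
    stats = []
    for _ in range(s):
        sub = values[remaining]
        mean = sub.mean()
        sd = sub.std(ddof=1)              # sample SD (n - 1 divisor, assumed)
        dev = np.abs(sub - mean)          # max of this equals the max(...) above
        k = int(dev.argmax())
        stats.append((float(dev[k] / sd), remaining[k]))
        del remaining[k]                  # drop the extreme value and recurse
    return stats


def count_outliers(stats, critical_values):
    """Largest i such that ESD_i exceeds its critical value (assumed rule)."""
    n_out = 0
    for i, ((esd, _), crit) in enumerate(zip(stats, critical_values), start=1):
        if esd > crit:
            n_out = i
    return n_out
```

Because every ESD_i is compared with its own critical value and the largest exceeding i is taken, a large ESD_2 can still flag two outliers when ESD_1 falls short, which is how a recursive procedure can counter masking.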
Clinical Biochemistry 43 (2010) 1030–1033
The findings and conclusions in this report are those of the author[s] and do not necessarily represent the views of the Centers for Disease Control and Prevention.
Fax: +1 770 488 0181.
E-mail address: rij0@cdc.gov.
0009-9120/$ - see front matter © 2010 The Canadian Society of Clinical Chemists. Published by Elsevier Inc. All rights reserved.
doi:10.1016/j.clinbiochem.2010.04.071
Journal homepage: www.elsevier.com/locate/clinbiochem
Materials and methods
Critical values of the ESD recursive outlier detection procedure
I fitted non-linear regression models for $\mathrm{ESD}_i$ as a function of N and S, where N is the sample size of the original sample, for Type I error α = 0.01 and 0.05. The models for $\mathrm{ESD}_i$ were:

$\mathrm{ESD}_i = \alpha + \beta_1 N + \beta_2 N^2 + \beta_3 N^3 + \beta_4 S$

where N is the sample size and, within the model, α denotes the intercept. The estimated values of the regression coefficients are given in Table 1. These models can be used to obtain critical values for N from 20 to 100 and 2 ≤ S ≤ 5. An example illustrating the computation is given in SI S2. The differences between the tabulated critical values provided elsewhere [3] and those computed here are given in SI S3.
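A minimal sketch of how the fitted model could be evaluated is shown below. The coefficients are transcribed from Table 1. The function name `esd_critical_value` is hypothetical, and entering β2 as negative is an assumption of this sketch (with a positive β2 the value for N = 100 would exceed 5, which is implausibly large for a studentized deviate). It is not the worked example of SI S2.

```python
# (Type I error, i) -> (intercept, beta1, beta2, beta3, beta4), from Table 1.
# beta2 is entered as negative: an assumption of this sketch.
COEF = {
    (0.05, 1): (2.36272, 0.02164, -0.00011202, 0.0, 0.05400),
    (0.05, 2): (2.08321, 0.02360, -0.00024696, 9.940952e-07, 0.03929),
    (0.05, 3): (2.12080, 0.01562, -0.00014331, 5.330823e-07, 0.02643),
    (0.05, 4): (2.08897, 0.01364, -0.00013213, 5.284038e-07, 0.02143),
    (0.05, 5): (2.26316, 0.00626, -0.00002207, 0.0, 0.00000),
    (0.01, 1): (2.21006, 0.05135, -0.00062151, 2.680000e-06, 0.04271),
    (0.01, 2): (2.31375, 0.02378, -0.00024152, 9.337330e-07, 0.03900),
    (0.01, 3): (2.36188, 0.01382, -0.00010112, 2.566791e-07, 0.02143),
    (0.01, 4): (2.27294, 0.01538, -0.00017094, 7.268129e-07, 0.02000),
    (0.01, 5): (2.40764, 0.01020, -0.00012227, 6.338063e-07, 0.00000),
}


def esd_critical_value(n, s, i, alpha=0.05):
    """Model-based critical value for ESD_i: a + b1*N + b2*N^2 + b3*N^3 + b4*S."""
    if not (20 <= n <= 100 and 2 <= s <= 5 and 1 <= i <= s):
        raise ValueError("model covers N = 20..100, S = 2..5 and i = 1..S")
    a, b1, b2, b3, b4 = COEF[(alpha, i)]
    return a + b1 * n + b2 * n**2 + b3 * n**3 + b4 * s


if __name__ == "__main__":
    # Critical values for ESD_1 and ESD_2 when N = 20, S = 2, alpha = 0.05
    print([round(esd_critical_value(20, 2, i), 3) for i in (1, 2)])
```

The range check mirrors the stated validity of the model. Values outside N = 20 to 100 or S = 2 to 5 would be extrapolations that the text does not support.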
Simulation study
In the simulation study for ESD, S for each sample was arbitrarily specified as 2 ≤ S ≤ 5. I generated 500 random samples each of size 20, 30, 40, 50, and 100 from a normal distribution N(0,1). Up to five outliers were then introduced in each sample randomly, with the lowest and/or the largest observations in the sample being increased or reduced by a randomly determined value that varied between 1 SD and 3 SD. The number of outliers introduced in each sample was randomly determined with the restriction that no sample had more than 10% outliers. The total number of outliers on the lower and the upper ends of the sample were also randomly determined. The step-by-step details of the simulation algorithm are given in SI S4.
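The sampling scheme described above might be sketched as follows. This is not the algorithm of SI S4, which is not reproduced here. The function name `simulate_sample`, the uniform draws, the choice of at least one outlier per sample, and the use of the sample SD as the shift unit are all assumptions made for illustration.

```python
import numpy as np


def simulate_sample(n, rng, max_outliers=5):
    """Draw one N(0,1) sample of size n and plant A outliers at its ends.

    Returns (sample, A). Lower-end outliers are pushed further down and
    upper-end outliers further up by 1 to 3 SD (assumed uniform).
    """
    x = np.sort(rng.normal(0.0, 1.0, size=n))
    cap = min(max_outliers, n // 10)        # at most five and at most 10% of n
    a = int(rng.integers(1, cap + 1))       # actual number of outliers (assumed >= 1)
    n_low = int(rng.integers(0, a + 1))     # how many fall on the lower end
    n_high = a - n_low
    sd = x.std(ddof=1)                      # shift unit: sample SD (assumed)
    x[:n_low] -= rng.uniform(1.0, 3.0, size=n_low) * sd
    x[n - n_high:] += rng.uniform(1.0, 3.0, size=n_high) * sd
    return x, a


if __name__ == "__main__":
    rng = np.random.default_rng(2010)       # arbitrary seed
    for size in (20, 30, 40, 50, 100):
        samples = [simulate_sample(size, rng) for _ in range(500)]
        print(size, max(a for _, a in samples))
```

Slicing with `x[n - n_high:]` rather than `x[-n_high:]` keeps the case of zero upper-end outliers from selecting the whole array.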
Results
Simulation study
When α = 0.05 and S = A, the power, or the percent probability of detecting the exact number of outliers, varied from 83.6% to 99.8% for the ESDP and from 54% to 81% for GRBP (Fig. 1, Panel A).
Table 1
Intercepts and model coefficients to compute critical values for ESD statistics.

Type I   ESD         Estimated regression coefficients
error    statistic   α         β1        β2            β3             β4
0.05     ESD_1       2.36272   0.02164   -0.00011202   0.000000E+00   0.05400
         ESD_2       2.08321   0.02360   -0.00024696   9.940952E-07   0.03929
         ESD_3       2.12080   0.01562   -0.00014331   5.330823E-07   0.02643
         ESD_4       2.08897   0.01364   -0.00013213   5.284038E-07   0.02143
         ESD_5       2.26316   0.00626   -0.00002207   0.000000E+00   0.00000
0.01     ESD_1       2.21006   0.05135   -0.00062151   2.680000E-06   0.04271
         ESD_2       2.31375   0.02378   -0.00024152   9.337330E-07   0.03900
         ESD_3       2.36188   0.01382   -0.00010112   2.566791E-07   0.02143
         ESD_4       2.27294   0.01538   -0.00017094   7.268129E-07   0.02000
         ESD_5       2.40764   0.01020   -0.00012227   6.338063E-07   0.00000
Fig. 1. (A) Probability of detecting exact number of outliers for Grubbs' and ESD procedures when S = A. (B) Probability of detecting less than the exact number of outliers for Grubbs' and ESD procedures when S = A. (C) Probability of detecting exact number of outliers for Grubbs' and ESD procedures when S < A. (D) Probability of detecting less than the exact number of outliers for Grubbs' and ESD procedures when S < A. The probabilities for Grubbs' procedure are displayed as dotted lines. The probabilities for ESD procedure are displayed as solid lines.
The power of ESDP was as much as 38.6% higher than the power of GRBP. The probability of detecting less than the actual number of outliers was as high as 47.2% for GRBP (Fig. 1, Panel B). Thus, when there is exact knowledge of how many outliers are present in the sample, ESDP is the procedure of choice.

When S < A, the ESDP performed much better than GRBP. The power of ESDP was higher by as much as 42.2% than GRBP for N = 50, A = 5, and S = 2 (Fig. 1, Panel C). The probability of detecting more outliers than A ("high") for GRBP was several-fold higher than for ESDP. For example, for N = 40, A = 4, and S = 3, "high" was 44.6% for GRBP and 4% for ESDP (Fig. 1, Panel D). Thus, ESDP was the procedure of choice. However, the closer the values of S and A were, the better was the power of ESDP.

When S > A and A = 1 (Fig. 2, Panel A), the power of GRBP was higher by as much as 35% than ESDP. However, as the difference between S and A decreased and sample size increased, ESDP performed progressively better and eventually outperformed GRBP. For example, when N = 40, A = 4, and S = 5, the power of ESDP was 80% and the power of GRBP was 50.8% (Fig. 2, Panel B). While GRBP performed better than ESDP in quite a few cases, it was still quite sensitive to the masking effect (Fig. 2, Panels C and D) as the difference between S and A decreased and sample size increased. For example, for N = 30, A = 3, and S = 5, "high" was 41% for GRBP and 6.4% for ESDP (Fig. 2, Panel C). The choice of a better procedure when S > A is difficult and depends upon the difference between S and A.

In addition to Tukey's exploratory procedure, other procedures to estimate A are available [9] but require use of expected values of normal order statistics, which are easily available [10]. A statistic RDM that can be used to estimate the number of outliers [9] is briefly discussed as SI S5.
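The percentages reported in this section (exact detection, fewer than A, and more than A, the last being "high") can be tallied from per-sample results with a small helper such as the one below. The function name `detection_rates` and the dictionary keys are hypothetical, and the inputs are assumed to be the detected and actual outlier counts for each simulated sample.

```python
def detection_rates(detected, actual):
    """Percent of samples with exact, too few ("low") and too many ("high") detections."""
    if len(detected) != len(actual) or len(actual) == 0:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(actual)
    pairs = list(zip(detected, actual))
    exact = sum(d == a for d, a in pairs)
    low = sum(d < a for d, a in pairs)
    high = sum(d > a for d, a in pairs)
    return {
        "exact": 100.0 * exact / n,   # power, as the term is used in this paper
        "low": 100.0 * low / n,
        "high": 100.0 * high / n,
    }
```

The three percentages always sum to 100, so a procedure with low power fails either by finding too few outliers or by finding too many.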
Discussion
Overall, ESDP works better than GRBP in all situations when S ≤ A. However, when A = 1, GRBP may be better. While the ESDP performs satisfactorily when S > A, its performance may be degraded when the difference between S and A is large. The impact of this can be minimized by having a "good" estimate of S. We found that Tukey's inner fences do not always work satisfactorily (data not shown). In my opinion, procedures for estimating (as compared to detecting) S such as those given elsewhere [9] can provide better results.

The issue of what to do with outliers once they have been detected is complicated. It probably will depend upon the source of outliers. A good discussion is given by Barnett and Lewis [3] in Chapter 2. Finally, an example that demonstrates computations of ESD is presented as SI S6.
Fig. 2. (A) Probability of detecting exact number of outliers for Grubbs' and ESD procedures when 2 ≤ S ≤ 5 and A = 1. (B) Probability of detecting the exact number of outliers for Grubbs' and ESD procedures when 4 ≤ S ≤ 5 and 2 ≤ A ≤ 4. (C) Probability of detecting less than the exact number of outliers for Grubbs' and ESD procedures when 4 ≤ S ≤ 5 and 2 ≤ A ≤ 4. (D) Probability of detecting less than the exact number of outliers for Grubbs' and ESD procedures when 2 ≤ S ≤ 5 and A = 1. The probabilities for Grubbs' procedure are displayed as dotted lines. The probabilities for ESD procedure are displayed as solid lines.
