An Enhanced Monte Carlo Outlier Detection Method


Liangxiao Zhang,*[a,b,c,d] Peiwu Li,*[a,c,d,e] Jin Mao,[a,b,c] Fei Ma,[a,d] Xiaoxia Ding,[a,c]
and Qi Zhang[a,e]

Outlier detection is crucial in building a highly predictive model. In this study, we proposed an enhanced Monte Carlo outlier detection method by establishing cross-prediction models based on determinate normal samples and analyzing the distribution of prediction errors individually for dubious samples. One simulated and three real datasets were used to illustrate and validate the performance of our method, and the results indicated that this method outperformed Monte Carlo outlier detection in outlier diagnosis. After the detected outliers were removed from the Kovats retention index dataset used for validation, the root mean square error of prediction decreased from 3.195 to 1.655, and the average cross-validation prediction error decreased from 2.0341 to 1.2780. This method helps establish a good model by eliminating outliers. © 2015 Wiley Periodicals, Inc.

DOI: 10.1002/jcc.24026

[a] L. Zhang, P. Li, J. Mao, F. Ma, X. Ding, Q. Zhang
    Oil Crops Research Institute, Chinese Academy of Agricultural Sciences, Wuhan 430062, China
    E-mail: liangxiao_zhang@hotmail.com (or) peiwuli@oilcrops.cn
[b] L. Zhang, J. Mao
    Key Laboratory of Biology and Genetic Improvement of Oil Crops, Ministry of Agriculture, Wuhan 430062, China
[c] L. Zhang, P. Li, J. Mao, X. Ding
    Laboratory of Risk Assessment for Oilseeds Products (Wuhan), Ministry of Agriculture, Wuhan 430062, China
[d] P. Li, F. Ma, L. Zhang
    Quality Inspection and Test Center for Oilseeds Products, Ministry of Agriculture, Wuhan 430062, China
[e] P. Li, Q. Zhang
    Key Laboratory of Detection for Mycotoxins, Ministry of Agriculture, Wuhan 430062, China

Contract grant sponsor: National Key Technologies R&D Program; Contract grant number: 2012BAK08B03; Contract grant sponsor: National Nature Foundation Committee of P.R. China; Contract grant number: 21205118; Contract grant sponsor: Earmarked Fund for China Agriculture Research System; Contract grant number: CARS-13

Introduction

Outlier detection is a primary step in data modeling and is important for identifying and subsequently eliminating atypical observations from a given set of data during the establishment of a high-performance model.[1,2] Both univariate and multivariate methods can be used for outlier detection.[3] Most early univariate methods for outlier detection were designed under the assumption of identically and independently distributed data, such as Chauvenet's criterion, Peirce's criterion, Grubbs' test,[4] the Tietjen–Moore test,[5] and the generalized extreme Studentized deviate test.[6] These methods are unsuitable for high-dimensional datasets and for arbitrary datasets without prior knowledge of the underlying data distribution.[7]

Multivariate outlier detection methods include statistical and data mining methods. Statistical methods aim at identifying the observations relatively far from the center of the data distribution.[8] Among them, the Mahalanobis distance is a well-known criterion that depends on the estimated parameters of a multivariate normal distribution.[9] The detection result is satisfactory for datasets with only one calibration outlier, but not for those with multiple outliers, because outliers can distort the mean value (MV) or the covariance matrix and thereby generate a masking effect, which renders the leverage values from the hat matrix unreliable. To mitigate the masking effect, many methods have been proposed to detect outliers, including the minimum volume ellipsoid,[10] ellipsoidal multivariate trimming,[11] the minimum covariance determinant,[12] and resampling by half-means and smallest half volume.[13] The key to these methods is to find the main body of the observation matrix and identify the outliers that differ significantly from the majority of the dataset.

Data mining methods are designed to manage large databases from high-dimensional spaces and include distance-based, clustering, and spatial methods. To effectively use the dependent variable and detect more kinds of outliers, the Monte Carlo outlier detection (MCOD) method was developed as a feasible way to detect different kinds of outliers by establishing many cross-prediction models.[2,14,15] The core idea of an MC outlier detector is that the predictions for an X outlier far from the center of the sample space vary considerably across the Monte Carlo sampling subset models, whereas a y outlier is usually difficult to predict accurately. Thus, the distribution of prediction errors can be used for multiple outlier detection. However, due to the masking effect, the boundary between normal and abnormal samples is unclear in the plot of the variance of residuals versus the mean of residuals.

In this study, we proposed a new strategy that uses MCOD to identify normal samples from the plot of variance versus mean of residuals and then checks the dubious samples individually. Because each dubious sample is checked individually, no more than one potential outlier is involved in each test, the masking effect does not occur, and it is therefore easier to detect outliers. In addition, if the dubious samples are normal, their prediction errors decrease to an acceptable range because they are no longer influenced by the outliers.


Theories and Methods

Dataset

Dataset 1, a simulated dataset, was designed to illustrate our method. It consists of 100 samples with a predictor matrix X (100 × 10) and a response vector y (100 × 1) carrying normally distributed noise. The matrix X contains independent columns representing molecular descriptors, and the dependent column y is related to X by the equation y = f(X). Two types of outliers (y and X) are then added to this dataset as follows: (1) 20 additional y outliers with threefold noise are added; the independent variables of these y outliers are drawn from the main body of the 100 normal samples. (2) 20 additional X outliers with a large Mahalanobis distance (leverage at least twice the average leverage value of the normal samples) are added; these X outliers follow the same functional relationship y = f(X) as the 100 normal samples.
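The following short Python sketch illustrates one way to generate data with this structure. The dimensions, the threefold noise of the y outliers, and the shifted X block follow the description above, while the linear form of f, the noise level, and the size of the shift are illustrative assumptions.

    import numpy as np

    rng = np.random.default_rng(0)

    # 100 normal samples: X (100 x 10), y = f(X) + noise, with f assumed linear here
    n, p = 100, 10
    X = rng.normal(size=(n, p))
    beta = rng.normal(size=p)                      # assumed coefficients of f
    sigma = 0.1                                    # assumed noise level
    y = X @ beta + rng.normal(scale=sigma, size=n)

    # 20 y outliers: predictors drawn from the main body, response noise inflated threefold
    X_yout = rng.normal(size=(20, p))
    y_yout = X_yout @ beta + rng.normal(scale=3 * sigma, size=20)

    # 20 X outliers: shifted away from the data centre (large Mahalanobis distance)
    # but obeying the same relationship y = f(X)
    X_xout = rng.normal(size=(20, p)) + 3.0        # assumed shift
    y_xout = X_xout @ beta + rng.normal(scale=sigma, size=20)

    X_all = np.vstack([X, X_yout, X_xout])         # shape (140, 10)
    y_all = np.concatenate([y, y_yout, y_xout])    # shape (140,)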
Dataset 2, the stack loss dataset for the oxidation of ammonia to nitric acid, provides operational data of a plant; it includes 21 observations on three independent variables (cooling air flow, cooling water inlet temperature, and acid concentration) and a dependent variable, the stack loss.[16,17] Among all the samples, the outliers are No. 1, 3, 4, and 21, and No. 2 is a good leverage point.

Dataset 3, the Hawkins–Bradu–Kass data, is another classic dataset for outlier detection and robust regression. The first 14 of its 75 observations are outliers.[18]

Dataset 4, a dataset of Kovats retention indices (KI), was chosen according to Refs. [19–21]. It comprises 177 methylalkanes covering a range of carbon chain lengths. Only the last two digits of each KI were recorded in Ref. [21]; they were obtained by subtracting 100 times the number of carbons in the main chain from the KI.

Figure 1. Flow chart for enhanced Monte Carlo outlier detection.

MC outlier detection

Traditional outlier detection methods usually analyze the distribution of samples in the sample space. The actual purpose of outlier detection, however, is to build the best prediction model, and a model-based method is therefore more effective. Recently, Monte Carlo cross-validation was developed to detect outliers by studying the distribution of prediction errors of each sample obtained from the original dataset.[2] The hypothesis of this outlier detection method is that the normal samples at the center of the sample space are less influenced by fluctuating model parameters. As the detailed algorithm has been described elsewhere,[2] this study only gives the outline; Figure 1 (left) presents a flow chart of the complete algorithm. The number of principal components was first determined by cross-validation in partial least squares or principal component regression modeling. Then, the whole dataset was randomly divided into training and validation sets. The prediction model was built on the training set and used to predict the validation set so as to obtain the prediction error for each validation sample. After N cycles, a prediction error distribution was obtained for each sample, and the MV and standard deviation (STD) of this distribution were used to detect outliers. The distribution of prediction errors generated by many models contains more information about whether a sample is an outlier or not. The error distribution of a normal sample varies little as long as normal samples form the main body of the whole dataset. The predictive residuals of a y outlier, however, have a large expectation value, whereas an X outlier (a good leverage point) far from the main body of the samples has a small expectation value of the predictive residuals but a large STD. According to the hypothesis of this outlier detection method, normal samples have small MVs and STDs of their prediction errors and therefore lie at the lower left of the MV/STD plot; the upper left holds the sample (X) outliers, which have small MVs but large STDs; the lower right displays the y outliers or model outliers, which have large MVs but small STDs. The MV/STD plot thus provides a visual diagnosis for direct outlier detection. Validated on several datasets, this method was proven effective for outlier detection.[2]
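A compact re-implementation of this Monte Carlo loop is sketched below in Python (the published tool is in MATLAB, so this is only an illustration). It uses partial least squares regression from scikit-learn, assumes the number of latent components has already been chosen by cross-validation, and accumulates absolute prediction errors; whether signed or absolute residuals are summarized is an implementation choice.

    import numpy as np
    from sklearn.cross_decomposition import PLSRegression

    def mcod_errors(X, y, n_components=3, n_models=10000, ratio=0.8, seed=0):
        """Monte Carlo outlier detection: MV and STD of prediction errors per sample."""
        rng = np.random.default_rng(seed)
        n = X.shape[0]
        errors = [[] for _ in range(n)]            # prediction errors collected per sample
        for _ in range(n_models):
            train = rng.choice(n, size=int(ratio * n), replace=False)
            valid = np.setdiff1d(np.arange(n), train)
            model = PLSRegression(n_components=n_components).fit(X[train], y[train])
            pred = model.predict(X[valid]).ravel()
            for i, e in zip(valid, np.abs(pred - y[valid])):
                errors[i].append(e)
        mv = np.array([np.mean(e) for e in errors])    # mean value (MV) of the errors
        std = np.array([np.std(e) for e in errors])    # standard deviation (STD)
        return mv, std

Plotting std against mv then reproduces the MV/STD diagnosis described above.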


Enhanced Monte Carlo outlier detection

As the number of outliers in a calibration dataset increases, the probability of drawing a Monte Carlo training subset that contains no outlier observations becomes relatively small. In this case, the masking effect arising from the interaction of the outlier observations makes it unclear how to separate outliers from normal samples, and a visual diagnosis based on the distribution of prediction errors alone is insufficient and complicated. To overcome the masking effect, an enhanced Monte Carlo outlier detection (EMCOD) method was developed to obtain better outlier detection results. The method rests on the fact that the normal samples with the smallest MV and STD of prediction errors are easily determined. As shown in Figure 1, the EMCOD procedure is similar to MCOD and contains the following steps: (1) use MCOD to obtain the prediction error distribution for each sample; (2) select the 40–60% of the samples with the smallest MV and STD of prediction errors, and treat the remaining samples as dubious; (3) randomly divide the selected normal samples (Ns) into training and validation sets; (4) after the number of principal components is determined by cross-validation, build the prediction model with the training set and use it to predict the dubious samples along with the validation set, so as to obtain their prediction errors; (5) after N cycles, obtain the prediction error distribution for each dubious sample; and (6) use the MV and STD of these error distributions to test whether the dubious samples are outliers. According to the hypothesis of this outlier detection method, the MVs and STDs of the prediction errors of normal dubious samples decrease, while those of the outliers increase to some extent. Because the masking effect is eliminated in this way, EMCOD can provide better results than MCOD.
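Steps (1)–(6) can be condensed into the following sketch, which builds on the mcod_errors helper defined above. The rule used here to rank samples by combined MV and STD, and the default keep fraction, are illustrative choices within the 40–60% window stated in the text.

    import numpy as np
    from sklearn.cross_decomposition import PLSRegression

    def emcod(X, y, n_components=3, n_models=10000, ratio=0.8, keep=0.5, seed=0):
        """Enhanced MCOD: model a normal core, then score the dubious samples."""
        rng = np.random.default_rng(seed)
        n = X.shape[0]
        # Step 1: ordinary MCOD gives per-sample error statistics
        mv, std = mcod_errors(X, y, n_components, n_models, ratio, seed)
        # Step 2: keep the samples closest to the origin of the MV/STD plot as the
        # determinate normal core (40-60% of the data); the rest are dubious
        score = mv / mv.max() + std / std.max()        # assumed ranking rule
        normal = np.argsort(score)[: int(keep * n)]
        dubious = np.setdiff1d(np.arange(n), normal)
        # Steps 3-5: Monte Carlo models trained only on the normal core are used to
        # predict the dubious samples
        errors = {i: [] for i in dubious}
        for _ in range(n_models):
            train = rng.choice(normal, size=int(ratio * len(normal)), replace=False)
            model = PLSRegression(n_components=n_components).fit(X[train], y[train])
            pred = model.predict(X[dubious]).ravel()
            for i, e in zip(dubious, np.abs(pred - y[dubious])):
                errors[i].append(e)
        # Step 6: the MV and STD of the new error distributions flag the outliers
        mv_d = np.array([np.mean(errors[i]) for i in dubious])
        std_d = np.array([np.std(errors[i]) for i in dubious])
        return dubious, mv_d, std_d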
Data processing and analysis

All programs were coded in MATLAB 2011a for Windows, and all calculations were performed on a personal computer. The MATLAB implementation of EMCOD is available from http://www.mathworks.com/matlabcentral/fileexchange/52023-emcod.

Results and Discussion

Enhanced Monte Carlo outlier detection method

Outlier detection is an important step in building a highly predictive model. MCOD was recently developed to provide a feasible means of detecting different kinds of outliers by establishing many predictive models and an MV/STD plot of prediction errors for all samples. This outlier detection method depends on the graphic MV/STD plot, so the key is to determine the visual boundary between normal and abnormal samples.

To illustrate our method, a simulated dataset was designed that contains 100 normal samples, 20 X outliers, and 20 y outliers. MCOD was initially conducted to detect the outliers. As shown in Figure 2 (right), the two kinds of outliers show a clear tendency to separate from the normal samples: the y outliers have larger prediction errors than the normal samples, while the X outliers (good leverage points) have larger STD values. However, the boundary between the outliers and the normal samples is indistinct, making it difficult to determine whether a sample far from the origin is an outlier or not. EMCOD was therefore performed to detect the outliers in this simulated dataset. As an enhancement of MCOD, the MVs and STDs of prediction errors acquired from MCOD were used to select 60 normal samples with the smallest MVs and STDs. With the number (N) of Monte Carlo models and the sampling ratio set to 10,000 and 0.8, respectively, the MVs and STDs of the resulting prediction errors were used to determine whether the dubious samples are outliers. As shown in Figure 2 (left), the samples in this simulated dataset are noticeably classified into four groups. The distances between outliers and normal samples increase significantly, and the 20 X outliers and 20 y outliers can easily be identified from the MV/STD plot of prediction errors. EMCOD thus gives a clearly better result in correctly detecting the outliers.

Figure 2. Mean/STD plot of prediction errors for Dataset 1: enhanced Monte Carlo outlier detection (left) and Monte Carlo outlier detection (right).
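As a usage illustration, the sketches above can be chained on simulated data of this kind. X_all, y_all and emcod are the illustrative names introduced earlier; keep = 60/140 mimics the 60-sample normal core, and n_models is reduced here purely to keep the example fast.

    import matplotlib.pyplot as plt

    # X_all, y_all come from the Dataset 1 sketch; emcod from the EMCOD sketch.
    dubious, mv_d, std_d = emcod(X_all, y_all, n_components=3,
                                 n_models=1000, ratio=0.8, keep=60 / 140)

    # MV/STD plot of the dubious samples: with the construction used above, the
    # y outliers are rows 100-119 and the X outliers rows 120-139, and they should
    # separate from the normal dubious samples lying near the origin.
    plt.scatter(mv_d, std_d)
    for i, m, s in zip(dubious, mv_d, std_d):
        plt.annotate(str(i), (m, s))
    plt.xlabel("Mean of prediction errors (MV)")
    plt.ylabel("STD of prediction errors")
    plt.show()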


Method validation

Dataset 2 is the stack loss dataset of a plant. In MCOD, the number (N) of Monte Carlo models and the sampling ratio were set to 10,000 and 0.8, respectively. The MV/STD plot of the prediction errors for the 21 samples is shown on the right of Figure 3. Without prior information about this commonly used dataset, it is hard to determine the boundary for outlier detection. To obtain a clearer result, EMCOD was used to detect the outliers in this dataset. As shown in Figure 3, samples 20, 5, 16, 18, 19, 13, 14, 8, 15, 10, and 17 were normal samples with the smallest mean and STD values. We established MC prediction models using these 11 samples and used these models to assess the other samples; the number (N) of Monte Carlo models and the sampling ratio were again set to 10,000 and 0.8, respectively. According to the hypothesis that models built with only normal samples give lower prediction errors for normal samples but higher prediction errors for outliers, the distances between normal samples and outliers should become larger. The result, shown in Figure 3 (left), illustrates that EMCOD gives a better result, as the outliers are correctly detected.

With the help of the MC outlier detector, the normal samples with the smallest MVs and STDs of prediction errors could be easily identified, even though it was hard to determine the boundary between normal samples and outliers. We therefore selected the normal samples with the smallest MVs and STDs of prediction errors and then tested whether each of the other samples was an outlier, one after another.

Figure 3. Mean/STD plot of prediction errors for Dataset 2: enhanced Monte Carlo outlier detection (left) and Monte Carlo outlier detection (right).
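This one-after-another check can be written as a small helper in the same style as the sketches above; it scores a single candidate sample against Monte Carlo models trained only on the selected normal samples. The function name, the absolute-error summary, and the use of 0-based indices are illustrative conventions, not part of the published code.

    import numpy as np
    from sklearn.cross_decomposition import PLSRegression

    def test_one_sample(X, y, normal_idx, candidate, n_components=3,
                        n_models=10000, ratio=0.8, seed=0):
        """MV and STD of the prediction errors of one candidate sample,
        using models built only from the samples in normal_idx."""
        rng = np.random.default_rng(seed)
        errs = np.empty(n_models)
        for k in range(n_models):
            train = rng.choice(normal_idx, size=int(ratio * len(normal_idx)),
                               replace=False)
            model = PLSRegression(n_components=n_components).fit(X[train], y[train])
            errs[k] = abs(float(model.predict(X[[candidate]])[0, 0]) - y[candidate])
        return errs.mean(), errs.std()

For the stack loss data, normal_idx would hold the 11 samples listed above (converted to 0-based indices), and each remaining sample would be passed in turn as candidate; a candidate whose mean error and error spread stand out from those of the normal samples is reported as an outlier.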

Dataset 3 is the Hawkins–Bradu–Kass dataset. As shown on the right of Figure 4, the MV/STD plot from MCOD indicates that 14 samples (No. 1–14) are outliers. The 52 samples with the lowest STDs of prediction errors (<0.5) were selected as normal samples, and the other 23 samples were then detected and tested one by one with the MC prediction models established from the 52 samples. As shown on the left of Figure 4, the prediction errors of the 9 normal samples decrease and the prediction errors of the 14 outliers greatly increase, so the distances between normal samples and outliers increase significantly. Moreover, this method is insensitive to the number of normal samples used to build the prediction models (data not shown).

Figure 4. Mean/STD plot of prediction errors for Dataset 3: enhanced Monte Carlo outlier detection (left) and Monte Carlo outlier detection (right).

Dataset 4 is the dataset of Kovats retention indices; detailed information is provided in the previous studies.[19–22] EMCOD was conducted on Dataset 4. As shown in Figure 5, the MV/STD plot indicates that 18 samples lie far from the origin and can be regarded as outliers. The models built with all 177 samples were compared with those built with the remaining 159 samples: the root mean square error of prediction decreased from 3.195 to 1.655, so the accuracy of the model improved significantly after these outliers were removed. The average cross-validation prediction error also dropped from 2.0341 to 1.2780, which is clearly better than the values reported in the previous study (4.6 and 4.3, respectively).[21]

Figure 5. Mean/STD plot of prediction errors for Dataset 4.
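The before-and-after figures quoted here are standard root-mean-square errors of prediction. A minimal way to reproduce such a comparison is sketched below, assuming hypothetical arrays X and y for the 177 samples and a boolean mask flagged marking the 18 detected outliers; the 10-fold cross-validation and the number of components are illustrative, not the exact protocol of the paper.

    import numpy as np
    from sklearn.cross_decomposition import PLSRegression
    from sklearn.model_selection import cross_val_predict

    def rmsep(X, y, n_components=3, cv=10):
        """Cross-validated root mean square error of prediction."""
        model = PLSRegression(n_components=n_components)
        pred = cross_val_predict(model, X, y, cv=cv).ravel()
        return float(np.sqrt(np.mean((pred - y) ** 2)))

    # flagged: boolean array of length 177, True for the 18 detected outliers
    # print(rmsep(X, y), rmsep(X[~flagged], y[~flagged]))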
Conclusion

In this study, we proposed EMCOD, which establishes cross-prediction models using determinate normal samples and individually analyzes the distribution of prediction errors for dubious samples. Four datasets were used to illustrate and validate our method. The results indicated that EMCOD increases the distances between outliers and normal samples, making it easier to detect outliers.

Keywords: outlier detection · enhanced Monte Carlo outlier detection · validation

Received: 18 June 2015
Accepted: 1 July 2015
Published online on 31 July 2015

How to cite this article: L. Zhang, P. Li, J. Mao, F. Ma, X. Ding, Q. Zhang, J. Comput. Chem. 2015, 36, 1902–1906. DOI: 10.1002/jcc.24026

References

[1] R. Todeschini, D. Ballabio, V. Consonni, F. Sahigara, Anal. Chim. Acta 2013, 787, 1.
[2] D. S. Cao, Y. Z. Liang, Q. S. Xu, H. D. Li, X. Chen, J. Comput. Chem. 2010, 31, 592.
[3] G. J. Williams, R. A. Baxter, H. X. He, S. Hawkins, L. Gu, In Proceedings of the IEEE International Conference on Data Mining (ICDM'02); CSIRO Technical Report CMIS-02/102; Maebashi City, Japan, 2002.
[4] F. Grubbs, Technometrics 1969, 11, 1.
[5] W. Stefansky, Technometrics 1972, 14, 469.
[6] B. Rosner, Technometrics 1983, 25, 165.
[7] C. Zhu, H. Kitagawa, S. Papadimitriou, C. Faloutsos, J. Intell. Inf. Syst. 2011, 36, 217.
[8] E. M. Knorr, R. T. Ng, In Proceedings of the VLDB Conference; New York, 1998; pp. 392-403.
[9] I. Ben-Gal, In Data Mining and Knowledge Discovery Handbook: A Complete Guide for Practitioners and Researchers; O. Maimon, L. Rokach, Eds.; Kluwer Academic Publishers: Dordrecht, the Netherlands, 2005.
[10] R. Gnanadesikan, J. R. Kettenring, Biometrics 1972, 28, 81.
[11] D. M. Rocke, D. L. Woodruff, J. Am. Stat. Assoc. 1996, 91, 1047.
[12] P. J. Rousseeuw, K. Van Driessen, Technometrics 1999, 41, 212.
[13] W. J. Egan, S. L. Morgan, Anal. Chem. 1998, 70, 2372.
[14] D. S. Cao, Y. Z. Liang, Q. S. Xu, Y. F. Yun, H. D. Li, J. Comput. Aided Mol. Des. 2011, 25, 67.
[15] H. D. Li, Y. Z. Liang, Q. S. Xu, D. S. Cao, J. Chemom. 2010, 24, 418.
[16] K. A. Brownlee, Statistical Theory and Methodology in Science and Engineering; Academic Press: New York, 1965; pp. 491-500.
[17] R. A. Becker, J. M. Chambers, A. R. Wilks, The New S Language; Wadsworth & Brooks/Cole: Pacific Grove, CA, 1988.
[18] D. M. Hawkins, D. Bradu, G. V. Kass, Technometrics 1984, 26, 197.
[19] D. A. Carlson, U. R. Bernier, B. D. Sutton, J. Chem. Ecol. 1998, 24, 1845.
[20] Y. V. Kissin, G. P. Feulmer, J. Chromatogr. Sci. 1986, 24, 53.
[21] A. R. Katritzky, K. Chen, U. Maran, D. A. Carlson, Anal. Chem. 2000, 72, 101.
[22] Y. Z. Liang, D. L. Yuan, Q. S. Xu, O. M. Kvalheim, J. Chemom. 2008, 22, 23.