
Monitoring High-Dimensional Data for Failure Detection and Localization in Large-Scale Computing Systems

Haifeng Chen, Guofei Jiang, and Kenji Yoshihira
Abstract—It is a major challenge to process high-dimensional measurements for failure detection and localization in large-scale computing systems. However, it is observed that in information systems, those measurements are usually located in a low-dimensional structure that is embedded in the high-dimensional space. From this perspective, a novel approach is proposed to model the geometry of underlying data generation and detect anomalies based on that model. We consider both linear and nonlinear data generation models. Two statistics, the Hotelling $T^2$ and the squared prediction error (SPE), are used to reflect data variations within and outside the model. We track the probabilistic density of the extracted statistics to monitor the system's health. After a failure has been detected, a localization process is also proposed to find the most suspicious attributes related to the failure. Experimental results on both synthetic data and a real e-commerce application demonstrate the effectiveness of our approach in detecting and localizing failures in computing systems.

Index Terms—Failure detection, manifold learning, statistics, data mining, information system, Internet applications.

1 INTRODUCTION

Detecting failures promptly in large-scale Internet services is becoming increasingly critical. A single hour of downtime of services such as Google, MSN, and Yahoo! can result in millions of dollars of lost revenue, bad publicity, and click-over to competitors.
One significant problem in building such detection tools is
the high dimensionality of measurements collected from
the large-scale computing infrastructures. For example,
commercial frameworks such as HP's OpenView [10] and IBM's Tivoli [19] aggregate attributes from a variety of
sources, including hardware, networking, operating sys-
tems, and application servers. It is hard to extract
meaningful information from those data to distinguish
anomalous situations from normal ones.
In many circumstances, however, the system measure-
ments are not truly high dimensional. Rather, they can
efficiently be summarized in a space with much lower
dimension, because many of the attributes are correlated.
For instance, a multitier e-commerce system may serve a large number of user requests every day, and many internal attributes accordingly react to the volume of user requests as the requests flow through the system [22]. Such
internal correlations among system attributes motivate us to
develop a new approach for monitoring the high-dimen-
sional data for system failure detection. We discover the
underlying low-dimensional structure of the monitoring data and extract two statistics, the Hotelling $T^2$ score and the squared prediction error (SPE), from each measurement to express its variations within and outside the
discovered model. Failure detection is then carried out by
tracking the probabilistic density of these statistics along
time. Each time a new measurement comes in, we calculate
its related statistics and then update their probabilistic
density based on the newly computed values. A large
deviation of the density distribution before and after
updating is regarded as the indication of system failure.
We start with the situation where the monitoring data is
generated from a low-dimensional linear structure. Singular
value decomposition (SVD) is employed to discover the
linear subspace that contains the majority of data. The
Hotelling $T^2$ and SPE are then derived from the geometric features of each measurement with respect to that subspace.
After that, we extend our work to the case where the
underlying data structure is nonlinear, which is often
encountered in information systems due to the nonlinear
mechanisms such as caching, queuing, and resource pooling
in the system. Unlike the linear model, however, there are
many new challenges in the extraction of geometric features
from nonlinear data. For example, there are no parametric
equations for globally describing the nonlinear model. Our
approach is based on the assumption that the data lies on a
nonlinear (Riemannian) manifold. That is, even though the
measurements are globally nonlinear, they are often smooth
and approximately linear in a local region. In the last few
years, many manifold reconstruction algorithms have been
proposed: locally linear embedding (LLE) [32], isometric
feature mapping (ISOMAP) [35], and so on. However, these
algorithms all focus on the problem of dimension reduction,
which determines the low-dimensional embedding vector $\mathbf{y}_i \in \mathbb{R}^m$ of the original measurement $\mathbf{x}_i \in \mathbb{R}^p$. In this paper, in order to derive the nonlinear version of the Hotelling $T^2$ and SPE, we need more geometric information about the data
distribution aside from their low-dimensional representations $\mathbf{y}_i$. For instance, the projection $\hat{\mathbf{x}}_i$ of each measurement $\mathbf{x}_i$ on the underlying manifold in the original space is required to compute the SPE value of $\mathbf{x}_i$. Furthermore, it
has been noted that current manifold reconstruction algo-
rithms are sensitive to noise and outliers [4], [7]. To solve
these issues, we propose a novel approach to discovering
the underlying geometry of nonlinear manifold for the
purpose of failure detection. We first use the linear error-in-
variables (EIV) model [14] in each local region to estimate
the locally smoothed values of each point in that region.
Since the local regions overlap, each measurement
may have several locally smoothed values. We then propose
a fusion process to combine those locally smoothed values
and obtain a globally smoothed value for each measure-
ment. Those globally smoothed values are regarded as the
projections of original measurements on the underlying
manifold and then fed to the current manifold learning
algorithms such as LLE to obtain their low-dimensional
representations. Note that instead of directly using the
original data for manifold reconstruction, we propose the
EIV model and a fusion process to preprocess the data and
hence achieve more robust and accurate reconstruction of
the underlying manifold. As a by-product, we also obtain
the projections of original measurements on the underlying
manifold in the original space, which are necessary for computing the SPE value of each measurement.
We also present a statistical test algorithm to decide
whether the linear or nonlinear model is suitable, given the
measurement data. After modeling, we compute the values
of two statistics of each measurement and estimate their
probabilistic density. The failure detection is based on the
deviation of newly calculated statistics of each coming
measurement with respect to the learned density. Once a
failure is detected, a localization procedure is proposed to
reveal the most suspicious attributes based on the values of
violated statistics. We use both synthetic data and measure-
ments from a real e-commerce application to test the
effectiveness of our proposed approach. The purpose of
using synthetic data is to demonstrate the advantages of the
EIV model and fusion in reconstructing the manifold. We
compare the results of the LLE algorithm on the original
measurements and the data that has been preprocessed by
the EIV model and fusion. The results show that our two
proposed procedures are both necessary to achieve an
accurate reconstruction of the nonlinear manifold. Then, we
test our failure detection and localization methods in a
J2EE-based Web application. We collect measurements
during normal system operations and apply both linear
and nonlinear algorithms to learn the underlying data
structure. We then inject a variety of failures into the system
and compare the performances of failure detectors based on
linear and nonlinear models. It shows that both linear and
nonlinear models can detect many failure incidents. How-
ever, the nonlinear model produces more accurate results
compared with the linear model.
2 BACKGROUND AND RELATED WORK
The purpose of this paper is to model the normal behavior
of a system and highlight any significant divergence from
normality to indicate the onset of unknown failures. In data
mining, this is called anomaly detection or novelty
detection, and there is a large body of literature on the topic. Markou and Singh provided a detailed review of those techniques [27], [28], covering both statistical and neural network-based approaches. The statistical approaches [26], [11] model the data based on their statistical properties and use such information to test whether new samples come from the same distribution, whereas the neural network-based methods [25], [8] train a network to implicitly reveal the unknown data distribution for detecting novelties.
However, much of the early work makes implicit
assumptions of the low dimensionality of data and does
not work well for high-dimensional measurements. For
example, it is quite difficult for statistical methods to model
the density of data with hundreds or thousands of attributes.
The computational complexity of neural networks is also an
important consideration for high-dimensional data. To
address these issues, Aggarwal and Yu [1] proposed a
projection-based method to find the best subsets of attributes
that can reveal data anomalies. Bodik et al. [6] used a Naive
Bayes approach, assuming the independent distribution of
attributes, to model the probabilistic density of high-
dimensional data. Tax and Duin [34] proposed a support
vector-based approach to identifying a minimal hyper-
sphere that surrounds the normal data. The samples located
outside the hypersphere are considered as faulty measure-
ments. In this paper, our solution is based on the observation
that in information systems, the high-dimensional data are
usually located on a low-dimensional underlying structure
that is embedded in high-dimensional space. Unlike the
Naive Bayes approach, we discover the correlations among
data attributes and extract the low-dimensional structure
that generates the data. Furthermore, our approach is carried
out in the original data space, without any data mappings
into their kernel feature space, as in the support vector-based
methods. As a consequence, we can directly analyze the
suspicious attributes once a failure has been detected.
Detecting and localizing failures promptly is crucial to
mission critical information systems. However, some
specific features of those systems introduce challenges for
the detection task. For instance, a large percentage of actual
failures in computing systems are partial failures, which
only break down part of service functions and do not affect
the operational statistics such as response time. Such partial
failures cannot easily be detected by traditional tools such
as pings and heartbeats [2]. To solve this issue, statistical
learning approaches have recently received a lot of attention
due to their capabilities in mining large quantities of
measurements for interesting patterns that can directly be
related to high-level system behavior. For instance, Ide and
Kashima [20] treated the Web-based system as a weighted
graph and applied graph mining techniques to monitor the
graph sequences for failure detection. In the Magpie project
[5], Barham et al. used the stochastic context-free grammar
to model the requests control flow across multiple
machines for detecting component failures and localizing
performance bottlenecks. The Pinpoint project [9], a close
relative to Magpie, proposed two features for system failure
detection: request path shapes and component interactions.
For the former, the set of seen traces was modeled with a
probabilistic context-free grammar (PCFG). The latter feature was used to build a profile of each component's interactions and compare the current distribution with the profile using the $\chi^2$ test. In the same context of request
shape analysis as in [9], Jiang et al. [21] put forward a
multiresolution abnormal trace detection algorithm using
variable-length N-grams and automata. Bodik et al. [6] made
use of the user access behavior as an evidence of system
health and applied several statistical approaches such as the
Naive Bayes to mine such information for detecting
application-level failures.
We notice that in most circumstances, the attributes
collected from information systems are highly correlated.
For instance, the interaction profile of one component, as
described in [9], is usually correlated with those of other
components due to the implicit business logic or other
system constraints. Similarly, there exist high correlations
between different user Web access behaviors [6]. From this
perspective, we believe that the high-dimensional measure-
ments can be summarized in a space with much lower
dimension. We explore such underlying low-dimensional
structure to detect system failures. Based on the learned
structure of the data, the values of the Hotelling $T^2$ and SPE are
calculated for each sample in the online monitoring. The
failure is then detected based on the deviation of those
statistics with respect to their distributions. Note that the
Hotelling $T^2$ and SPE have already been used in chemometrics to understand chemical systems and processes [30], [33], [24]. For example, the Soft Independent Modeling of Class Analogy (SIMCA), a widely used approach in chemometrics, employs the Hotelling $T^2$ and SPE to
identify the classes of system measurements with similar
hidden phenomena [33]. Kourti and MacGregor [24] pre-
sented these statistics to monitor the chemical processes and
diagnose performance anomalies. However, those methods
all assume that the data are located on a hyperplane in the original space and rely on linear methods such as principal component analysis (PCA) [23] or partial least squares (PLS) [18] to compute the values of the Hotelling $T^2$ and
SPE. In this paper, we provide a general framework for
calculating these statistics, which considers both linear and
nonlinear data generation models. For the nonlinear model,
we propose a novel algorithm to extract the Hotelling $T^2$ and SPE from the underlying manifold learned from training data. Furthermore, we have applied the Hotelling $T^2$ and
SPE statistics to the failure detection in distributed comput-
ing systems. Experimental results show that these two
statistics are effective in detecting a variety of injected
failures in our Web-based test system.
3 MODELING THE NORMAL DATA
When system measurements contain hundreds of attributes,
our solution for modeling their normal behavior is based on
the fact that actual measurements are usually generated
from a structure with much lower dimension. Fig. 1 uses
2D examples to illustrate such situations. In Fig. 1a, the
normal data (marked as x) are generated from a 1D line
with certain noise in the 2D space. Similarly, Fig. 1b shows
the data generated from an underlying 1D nonlinear
manifold. Two abnormal measurements are also plotted in
each figure (marked as

). As we can see, the abnormal


sample o
1
is deviated from the underlying structure.
Although the abnormal sample o
2
may be located in the
structure, its position in the structure is too far from those of
normal measurements.
It is observed in Fig. 1 that the implicit structure captures
important information about data distribution. Such prop-
erty can be exploited to distinguish abnormal and normal
samples. That is, we discover the underlying geometric
model and reveal data variations within and outside the
model as features to detect failures. Two statistics, the
Hotelling $T^2$ and SPE, are utilized to represent such data
variations. We calculate these two statistics for each
measurement and then build their probabilistic distribution
based on the computed values. In the monitoring process,
we compute the statistics of new measurements and check
their values with respect to the learned distribution to detect
failures. Fig. 2 provides the workflow of our normal data
modeling. We consider both linear and nonlinear data
generation models, which are described in Sections 3.1 and
3.2, respectively. Section 3.3 then presents a criterion to
determine whether the linear or nonlinear model is suitable
for the available data.
3.1 Linear Model
If the measurements $\mathbf{x}_i \in \mathbb{R}^p$, with $i = 1, \ldots, n$, are generated from a low-dimensional hyperplane in $\mathbb{R}^p$, we apply the SVD to the data matrix $X = [\mathbf{x}_1, \ldots, \mathbf{x}_n]$:

$$X = U \Lambda V^\top = U_s \Lambda_s V_s^\top + U_r \Lambda_r V_r^\top, \quad (1)$$

where

$$\Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_m, \lambda_{m+1}, \ldots, \lambda_r) \in \mathbb{R}^{p \times n},$$
Fig. 1. Two-dimensional data examples. (a) Linear case. (b) Nonlinear
case.
Fig. 2. The workflow of normal data modeling.
and $\lambda_1 \geq \cdots \geq \lambda_m \gg \lambda_{m+1} \geq \cdots \geq \lambda_r$, with $r = \min\{p, n\}$. The two orthogonal matrices $U$ and $V$ are called the left and right eigenmatrices of $X$. Based on the magnitude of the singular values $\lambda_i$, the space of $X$ is decomposed into signal and noise subspaces. The left $m$ columns of $U$, $U_s = [\mathbf{u}_1, \mathbf{u}_2, \ldots, \mathbf{u}_m]$, form the basis of the signal subspace, and $\Lambda_s = \mathrm{diag}(\lambda_1, \ldots, \lambda_m)$. Any vector $\mathbf{x} \in \mathbb{R}^p$ can be represented by the summation of two projection vectors from the two subspaces:
$$\mathbf{x} = \hat{\mathbf{x}} + \tilde{\mathbf{x}}, \quad (2)$$
where $\hat{\mathbf{x}} = U_s U_s^\top \mathbf{x}$ is the projection of $\mathbf{x}$ on the hyperplane expressed in the original space, which represents the signal part of $\mathbf{x}$, and $\tilde{\mathbf{x}}$ contains the noise. Meanwhile, we can also obtain the low-dimensional representation of $\mathbf{x}$ in the signal subspace:

$$\mathbf{y} = U_s^\top \hat{\mathbf{x}}. \quad (3)$$
The vector $\mathbf{y} \in \mathbb{R}^m$ is called the principal component vector of $\mathbf{x}$, which represents the $m$-dimensional coordinates of $\mathbf{x}$ in the signal subspace. The $i$th element of $\mathbf{y}$ is called the $i$th principal component of $\mathbf{x}$. Since the principal components are uncorrelated and the variance of the $i$th component $y_i$ is $\lambda_i^2$ [16], the covariance of $\mathbf{y}$ is $C_y = \mathrm{diag}(\lambda_1^2, \ldots, \lambda_m^2)$. Note that for ease of explanation, we assume that the data $\mathbf{x}_i$ are centered. In real situations, we need to center the data before the above calculations.
Two statistics are defined for each sample $\mathbf{x}$ to represent its variations within and outside the signal subspace. One is the Hotelling $T^2$ score [23], which is expressed as the Mahalanobis distance from $\mathbf{x}$'s principal components $\mathbf{y}$ to the mean of the principal component vectors from the training data:

$$T^2 = (\mathbf{y} - \bar{\mathbf{y}})^\top C_y^{-1} (\mathbf{y} - \bar{\mathbf{y}}), \quad (4)$$

where $C_y^{-1}$ is the inverse matrix of $C_y$. Since $\bar{\mathbf{y}} = 0$, we can simplify (4) as $T^2 = \mathbf{y}^\top C_y^{-1} \mathbf{y}$. Another statistic, the SPE [23], indicates how well the sample $\mathbf{x}$ conforms to the hyperplane, measured by the Euclidean distance between $\mathbf{x}$ and its projection $\hat{\mathbf{x}}$ on the hyperplane expressed in the original space:

$$SPE = \|\tilde{\mathbf{x}}\|^2 = \|\mathbf{x} - \hat{\mathbf{x}}\|^2. \quad (5)$$
The intuition of using these two statistics for failure detection is illustrated in Fig. 3, in which 2D normal samples, marked as 'x', are generated from a line (1D subspace) with certain noise. Through subspace decomposition, we obtain the direction of the line (the signal subspace) and then project each sample $\mathbf{x}$ onto that line to get $\hat{\mathbf{x}}$. In this case, the Hotelling $T^2$ score represents the Mahalanobis distance from $\hat{\mathbf{x}}$ to the origin (0, 0). The value of SPE is the squared distance between $\mathbf{x}$ and $\hat{\mathbf{x}}$. Two abnormal samples, marked as 'o', are also shown in Fig. 3a. Since the sample $o_1$ is far from the line, its SPE is much larger than those of the other points. Although the sample $o_2$ has a reasonable SPE value, its Hotelling $T^2$ score is very large, since its projection on the line, $o_2'$, is far from the cluster of projected normal samples. We plot the histograms of the Hotelling $T^2$ score and the SPE value for all the samples in Figs. 3b and 3c, respectively. Based on these histograms, we conclude that by defining suitable boundaries for normal samples in the extracted statistics, we can find abnormal measurements and hence detect failures.
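To make the linear model concrete, the following minimal Python sketch (not the authors' implementation) fits the signal subspace by SVD and computes the Hotelling $T^2$ and SPE of (4) and (5) for a new sample; the covariance convention $C_y = \mathrm{diag}(\lambda_1^2, \ldots, \lambda_m^2)$ follows the text.

```python
# A minimal sketch of Section 3.1 (not the authors' code): SVD on the centered
# training data, then Hotelling T^2 and SPE for a new measurement.
import numpy as np

def fit_linear_model(X, m):
    """X: n x p training matrix; m: dimension of the signal subspace."""
    mu = X.mean(axis=0)
    U, lam, _ = np.linalg.svd((X - mu).T, full_matrices=False)
    U_s = U[:, :m]                      # basis of the signal subspace
    C_y = np.diag(lam[:m] ** 2)         # covariance of the principal components
    return mu, U_s, C_y

def t2_and_spe(x, mu, U_s, C_y):
    xc = x - mu
    y = U_s.T @ xc                      # principal component vector, Eq. (3)
    x_hat = U_s @ y                     # projection on the hyperplane
    t2 = y @ np.linalg.solve(C_y, y)    # Hotelling T^2, Eq. (4)
    spe = float(np.sum((xc - x_hat) ** 2))  # SPE, Eq. (5)
    return t2, spe
```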
3.2 Nonlinear Model
When the measurements $\mathbf{x}_i$ are generated from a nonlinear structure, we still desire statistics that serve the same purpose as the Hotelling $T^2$ and SPE in the linear model.
To derive the nonlinear version of these two statistics, our
approach is based on the assumption that the data lies on a
nonlinear (Riemannian) manifold. We discover the under-
lying manifold of high-dimensional measurements and
then define the corresponding statistics based on the
geometric features of each sample with respect to the
discovered manifold.
According to the original definitions of the Hotelling $T^2$ in (4) and the SPE in (5), we need the following information to obtain their nonlinear estimates:

. the low-dimensional embedding vector $\mathbf{y}$ of the original measurement $\mathbf{x}$, where $\mathbf{y}$ represents the low-dimensional coordinates of $\mathbf{x}$ on the underlying manifold instead of on a linear hyperplane, and

. the projection $\hat{\mathbf{x}}$ of measurement $\mathbf{x}$ on the manifold in the original space.

If we have the values of these variables, the nonlinear versions of the Hotelling $T^2$ and SPE for each sample $\mathbf{x}$ can be defined in the same form as in the linear situation. For instance, the nonlinear $T^2$ is expressed as in (4), except that $\mathbf{y}$ is computed from the underlying manifold, $\bar{\mathbf{y}}$ is the mean of $\mathbf{y}$ over all the training data, and $C_y$ denotes the sample covariance matrix:

$$C_y = \frac{1}{n-1} \sum_{j=1}^{n} (\mathbf{y}_j - \bar{\mathbf{y}})(\mathbf{y}_j - \bar{\mathbf{y}})^\top. \quad (6)$$
Fig. 3. The role of the extracted statistics in failure detection. (a) The 2D normal data together with two outliers. (b) The histogram of the Hotelling $T^2$ of all the samples. (c) The histogram of the SPE of all the samples.
Similarly, the nonlinear SPE for sample $\mathbf{x}$ is also defined as in (5), with $\hat{\mathbf{x}}$ representing the projection of $\mathbf{x}$ on the manifold.

In the last few years, many manifold reconstruction algorithms have been proposed, such as LLE [32] and ISOMAP [35]. However, these algorithms all focus on the problem of dimension reduction, which only outputs the low-dimensional embedding vector $\mathbf{y}$ of the original measurement $\mathbf{x}$ and does not directly compute the projection $\hat{\mathbf{x}}$ of $\mathbf{x}$ on the manifold. Furthermore, in practical situations, the measurements are always noisy, and it has been noted that the LLE and ISOMAP algorithms are sensitive to noise [4], [7]. As a consequence, we use two steps to obtain the necessary estimates:

$$\mathbf{x}_i \xrightarrow{(1)} \hat{\mathbf{x}}_i \xrightarrow{(2)} \mathbf{y}_i, \quad i = 1, \ldots, n. \quad (7)$$
We start by calculating the projection $\hat{\mathbf{x}}_i$ of each sample $\mathbf{x}_i$ on the underlying manifold, followed by the estimation of the $\mathbf{y}_i$ based on the projected variables. Note that here, we use the estimate $\hat{\mathbf{x}}_i$ as an intermediary for obtaining $\mathbf{y}_i$. The accuracy of the estimate $\hat{\mathbf{x}}_i$ is therefore important to the computation of the two statistics. To get reliable $\hat{\mathbf{x}}_i$, we first apply the linear EIV model, as described in Section 3.2.1, in the neighborhood region of each sample $\mathbf{x}_i$ to compute the locally smoothed value of every sample in that neighborhood. Since the local regions overlap, each sample usually has several locally smoothed values. We then present a fusion process in Section 3.2.2 to combine all the locally smoothed values of $\mathbf{x}_i$ and hence obtain its projection $\hat{\mathbf{x}}_i$ on the manifold in the original space. Once the projections $\hat{\mathbf{x}}_i$ are available, they are input to current manifold learning algorithms such as LLE to estimate the low-dimensional embedding vectors $\mathbf{y}_i$, which is described in Section 3.2.3.
3.2.1 Local Error-in-Variables (EIV) Model

We start with the $k$-nearest neighbor search for each sample $\mathbf{x}_i$ to define its local region. Given a set of $p$-dimensional vectors $\mathbf{x}_{i_1}, \ldots, \mathbf{x}_{i_k}$ located in the neighborhood of $\mathbf{x}_i$, a local smoothing of those vectors is performed based on the geometry of the local region that comprises the points. For simplicity, we use $\mathbf{x}_1, \ldots, \mathbf{x}_k$ to denote the neighborhood points of $\mathbf{x}_i$. In current manifold reconstruction algorithms such as LLE, the local geometry is represented by a weight vector $\mathbf{w} = [w_{i1}, \ldots, w_{ik}]^\top$ that best reconstructs $\mathbf{x}_i$ from its neighbors. By minimizing the reconstruction error

$$\varepsilon = \left\| \mathbf{x}_i - \sum_{j=1}^{k} w_{ij} \mathbf{x}_j \right\|^2 = \sum_{l=1}^{p} \left( x_{il} - \sum_{j=1}^{k} w_{ij} x_{jl} \right)^2 \quad (8)$$

subject to $\sum_j w_{ij} = 1$, where $x_{il}$ ($x_{jl}$) represents the $l$th element of vector $\mathbf{x}_i$ ($\mathbf{x}_j$), LLE obtains the least squares solution of $\mathbf{w}$ for the local region surrounding $\mathbf{x}_i$.
Equation (8) assumes that all the neighbors of $\mathbf{x}_i$ are free of noise and that only the measurement $\mathbf{x}_i$ is noisy. This is frequently unrealistic, since usually, all the samples are corrupted by noise. As a result, the solution of (8) is biased. To remedy this problem, we use the following EIV model [14] by minimizing

$$\varepsilon' = \left\| \mathbf{x}_i - \sum_{j=1}^{k} w_{ij} \hat{\mathbf{x}}_j \right\|^2 + \sum_{j=1}^{k} \| \mathbf{x}_j - \hat{\mathbf{x}}_j \|^2 \quad (9)$$
subject to $\sum_j w_{ij} = 1$, where $\hat{\mathbf{x}}_j$ is the local noise-free estimate of sample $\mathbf{x}_j$ in the region surrounding $\mathbf{x}_i$. Equation (9) can also be represented as

$$\varepsilon' = \| \mathbf{x}_i - \hat{\mathbf{x}}_i \|^2 + \sum_{j=1}^{k} \| \mathbf{x}_j - \hat{\mathbf{x}}_j \|^2 \quad (10)$$

by taking into account that $\hat{\mathbf{x}}_i = \sum_j w_{ij} \hat{\mathbf{x}}_j$. Define the matrices $B = [\mathbf{x}_i\ \mathbf{x}_1\ \mathbf{x}_2\ \cdots\ \mathbf{x}_k] \in \mathbb{R}^{p \times (k+1)}$ and $\hat{B} = [\hat{\mathbf{x}}_i\ \hat{\mathbf{x}}_1\ \hat{\mathbf{x}}_2\ \cdots\ \hat{\mathbf{x}}_k] \in \mathbb{R}^{p \times (k+1)}$, where $\hat{B}$ contains the locally smoothed estimates of those points, with $p \gg k$ for common cases where the nonlinear manifold is embedded in a high-dimensional space. Problem (10) can be reformulated as
$$\min \| B - \hat{B} \|^2 \quad (11)$$

subject to

$$\hat{B} \boldsymbol{\theta} = 0, \quad (12)$$

where $\boldsymbol{\theta} = [1, -w_{i1}, -w_{i2}, \ldots, -w_{ik}]^\top$. From (12), the rank of $\hat{B}$ is $k$. Therefore, the estimate $\hat{B}$ is the rank-$k$ approximation of matrix $B$. If the SVD of $B$ is $B = \sum_{j=1}^{k+1} \lambda_j \mathbf{u}_j \mathbf{v}_j^\top$, with $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_{k+1}$, we obtain the noise-free sample matrix $\hat{B}$ from the Eckart-Young-Mirsky theorem [12], [29] as

$$\hat{B} = \sum_{j=1}^{k} \lambda_j \mathbf{u}_j \mathbf{v}_j^\top. \quad (13)$$
The weight vector $\mathbf{w}$ can also be estimated from the SVD of matrix $B$. Since our purpose here is to find the local noise-free estimate $\hat{B}$, we do not discuss it in detail. For an in-depth treatment of the EIV model and its solutions, see [14], [36]. According to [36], we can also obtain a first-order approximation of the covariance $\hat{C}_w$ of the parameter $\mathbf{w}$, which is proportional to the estimate of the noise variance $\hat{\sigma}_\varepsilon^2 = \lambda_{k+1}^2 / (p - k)$.
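The rank-$k$ truncation of (13) is direct to implement. Below is a minimal sketch, under the stated assumption $p \gg k$, of the local EIV smoothing of one neighborhood; it is a plain restatement of (11)-(13), not the regularized variant used later for low-dimensional data.

```python
# A minimal sketch of the local EIV smoothing (Eqs. 11-13): B collects x_i and
# its k neighbors column-wise; the rank-k SVD truncation gives B_hat.
import numpy as np

def local_eiv_smooth(B, k):
    """B: p x (k+1) matrix [x_i, x_1, ..., x_k], with p >> k.
    Returns the locally smoothed B_hat and the noise-variance estimate."""
    U, lam, Vt = np.linalg.svd(B, full_matrices=False)
    B_hat = U[:, :k] @ np.diag(lam[:k]) @ Vt[:k, :]   # Eq. (13)
    p = B.shape[0]
    sigma2 = lam[k] ** 2 / (p - k)      # sigma_eps^2 = lambda_{k+1}^2 / (p - k)
    return B_hat, sigma2
```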
3.2.2 Fusion
We apply the EIV model in the neighborhood region of every sample $\mathbf{x}_i$ to obtain the locally smoothed value of every point in that region. Since each sample $\mathbf{x}_i$ is usually included in the neighborhoods of several other points as well as in its own local region, it has more than one local noise-free estimate. Given those different estimates $\{\hat{\mathbf{x}}_i^1, \hat{\mathbf{x}}_i^2, \ldots, \hat{\mathbf{x}}_i^k\}$ with $k \geq 1$, our goal is to find a global noise-free estimate $\hat{\mathbf{x}}_i$ of $\mathbf{x}_i$ from its many local values. Such a global noise-free estimate $\hat{\mathbf{x}}_i$ can be regarded as the projection of $\mathbf{x}_i$ onto the manifold in the original space.
Due to the variation of the curvature of the underlying manifold, the linear model presented in Section 3.2.1 may not always succeed in discovering local structures. For instance, in a local region with a large curvature, the noise-free estimate $\hat{\mathbf{x}}_i^j$ is not reliable. Suppose we have the covariance matrix $C_j$ of $\hat{\mathbf{x}}_i^j$ to characterize the uncertainty of the linear fitting through which $\hat{\mathbf{x}}_i^j$ is obtained. Then, the global estimate of the noise-free value $\hat{\mathbf{x}}_i$ can be found by minimizing the sum of the following Mahalanobis distances:

$$\hat{\mathbf{x}}_i = \arg\min_{\hat{\mathbf{x}}_i} \sum_{j=1}^{k} (\hat{\mathbf{x}}_i - \hat{\mathbf{x}}_i^j)^\top C_j^{-1} (\hat{\mathbf{x}}_i - \hat{\mathbf{x}}_i^j). \quad (14)$$
The solution of (14) is

$$\hat{\mathbf{x}}_i = \left( \sum_{j=1}^{k} C_j^{-1} \right)^{-1} \sum_{j=1}^{k} C_j^{-1} \hat{\mathbf{x}}_i^j. \quad (15)$$

That is, the global estimate $\hat{\mathbf{x}}_i$ is the covariance-weighted average of the $\hat{\mathbf{x}}_i^j$. The more uncertain a local estimate $\hat{\mathbf{x}}_i^j$ is (the inverse of its covariance has a smaller norm), the less it contributes to the global value $\hat{\mathbf{x}}_i$.
The covariance $C_j$ can be estimated from the error propagation of the covariance of $\mathbf{w}$ during the calculation of $\hat{\mathbf{x}}_i^j$. However, since the dimension of $\hat{\mathbf{x}}_i^j$ is high, it is not feasible to directly use $C_j$. Instead, we use the determinant of $C_j$ to approximate (15):

$$\hat{\mathbf{x}}_i = \left( \sum_{j=1}^{k} |C_j|^{-1} \right)^{-1} \sum_{j=1}^{k} |C_j|^{-1} \hat{\mathbf{x}}_i^j, \quad (16)$$

where $|C_j|$ is approximated by $|C_j| \approx \kappa \hat{\sigma}_\varepsilon^2$, and $\kappa$ is a constant that does not contribute to the calculation of (16).
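A minimal sketch of the fusion rule (16) follows; the determinant $|C_j|$ is replaced by the scalar noise-variance estimate from the EIV step, as the approximation above suggests, and a small guard constant against division by zero is our addition, not part of the paper.

```python
# A minimal sketch of the fusion step (Eq. 16): combine the local estimates of
# one point, each weighted by the inverse of its (scalar) uncertainty.
import numpy as np

def fuse_local_estimates(local_estimates, noise_vars, eps=1e-12):
    """local_estimates: list of p-vectors, one per local region containing the
    point; noise_vars: matching sigma_eps^2 values standing in for |C_j|."""
    w = 1.0 / (np.asarray(noise_vars) + eps)   # |C_j|^{-1} up to a constant
    w /= w.sum()
    return np.sum(w[:, None] * np.asarray(local_estimates), axis=0)
```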
3.2.3 Manifold Reconstruction

Once we have the estimates $\hat{\mathbf{x}}_i$, we feed them to current manifold reconstruction algorithms such as LLE and ISOMAP to obtain the low-dimensional embedding vectors $\mathbf{y}_i$. In our approach, the EIV modeling and the fusion process actually serve as preprocessing procedures for those manifold learning algorithms in order to achieve a more robust and accurate reconstruction of the nonlinear manifold. One by-product is that we also obtain the projection of each measurement on the manifold in the original space. We do not describe the LLE or ISOMAP algorithms here; readers may refer to [32], [35] for details.

When the projections $\hat{\mathbf{x}}_i$ and the low-dimensional embedding vectors $\mathbf{y}_i$ are available, the nonlinear Hotelling $T^2$ and SPE of the $\mathbf{x}_i$ are calculated in the same way as in (4) and (5). We then use a Gaussian mixture model to estimate the density distribution of the two statistics. Section 4 will show that by monitoring these statistics along time, we can detect abnormal points that deviate from the underlying manifold.
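Putting the pieces of (7) together, the sketch below chains the two preceding sketches with an off-the-shelf LLE (scikit-learn's LocallyLinearEmbedding, our substitution for the LLE of [32]) to produce both the projections $\hat{\mathbf{x}}_i$ and the embeddings $\mathbf{y}_i$.

```python
# A minimal sketch of the Section 3.2 pipeline (Eq. 7): EIV-smooth every local
# region, fuse the per-region estimates of each point, then embed with LLE.
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.manifold import LocallyLinearEmbedding

def denoise_and_embed(X, k=12, m=2):
    """X: n x p measurements; returns projections X_hat and embeddings Y."""
    n = X.shape[0]
    _, idx = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)
    est = [[] for _ in range(n)]
    var = [[] for _ in range(n)]
    for i in range(n):                       # one EIV fit per local region
        B_hat, s2 = local_eiv_smooth(X[idx[i]].T, k)
        for col, j in enumerate(idx[i]):     # record each point's local estimate
            est[j].append(B_hat[:, col])
            var[j].append(s2)
    X_hat = np.array([fuse_local_estimates(e, v) for e, v in zip(est, var)])
    Y = LocallyLinearEmbedding(n_neighbors=k, n_components=m).fit_transform(X_hat)
    return X_hat, Y
```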
3.3 Linear versus Nonlinear
The linear model is easy to understand and simple to
implement. On the other hand, nonlinear models are more
accurate when the nonlinearities of the underlying structure
are too pronounced to be approximated by linear models. To
make the correct choice of model, given a set of measurements, we first estimate the intrinsic dimension of the data and then apply a statistical test to decide whether the linear or the nonlinear model should be used.
We use the method proposed in [17] to estimate the intrinsic dimension. It is based on the observation that for an $m$-dimensional data set embedded in the $p$-dimensional space, $\mathbf{x}_i \in \mathbb{R}^p$, with $i = 1, \ldots, n$, the number of pairs of points closer to each other than $r$ is proportional to $r^m$. We define

$$H_n(r) = \frac{2}{n(n-1)} \sum_{i<j}^{n} \mathbf{1}\{\| \mathbf{x}_i - \mathbf{x}_j \| < r\}, \quad (17)$$

where $\mathbf{1}\{E\}$ is the indicator function of the event $E$. The intrinsic dimension $m$ is defined as

$$m = \lim_{r \to 0} \lim_{n \to \infty} \frac{\log H_n(r)}{\log r}. \quad (18)$$

In practice, we compute $H_n(r)$ for different values $r_j$ and then fit a line through $(\log r_j, \log H_n(r_j))$ to obtain $m$.
The estimated dimension $m$ is then used to test whether the linear model is sufficient for discovering the geometry of the data samples. We perform the SVD of the data matrix, as described in Section 3.1, and check whether the linear subspace with dimension $m$ covers enough of the variance of the original space. To do this, we define

$$\eta = \left( \frac{\lambda_1^2 + \lambda_2^2 + \cdots + \lambda_m^2}{\lambda_1^2 + \lambda_2^2 + \cdots + \lambda_r^2} \right)^{1/2}, \quad (19)$$

where the $\lambda_i$ are the singular values from the SVD, and $r = \min\{p, n\}$. If the value of $\eta$ is larger than a predefined threshold, we use the linear model to characterize the normal data. Otherwise, the nonlinear model is applied. In this paper, the threshold of $\eta$ is 0.98, which is determined from the evaluation of various $\eta$ values of the linear model contaminated with different levels of noise.
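As a minimal sketch of this selection procedure, the code below estimates $m$ by the log-log fit of (17)-(18) and evaluates the ratio $\eta$ of (19); the choice of probe radii is an assumption left to the user.

```python
# A minimal sketch of Section 3.3: correlation-dimension estimate (Eqs. 17-18)
# and the linear-vs-nonlinear test via the variance ratio eta (Eq. 19).
import numpy as np
from scipy.spatial.distance import pdist

def intrinsic_dimension(X, radii):
    d = pdist(X)                                    # all pairwise distances
    H = np.array([np.mean(d < r) for r in radii])   # H_n(r), Eq. (17)
    slope, _ = np.polyfit(np.log(radii), np.log(H), 1)
    return int(np.ceil(slope))                      # fitted slope approximates m

def linear_model_suffices(X, m, threshold=0.98):
    lam = np.linalg.svd(X - X.mean(axis=0), compute_uv=False)
    eta = np.sqrt(np.sum(lam[:m] ** 2) / np.sum(lam ** 2))   # Eq. (19)
    return eta > threshold
```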
4 ONLINE DETECTION

Section 3 demonstrates that the Hotelling $T^2$ and SPE can serve as metrics to distinguish abnormal and normal samples. We use a Gaussian mixture model to estimate their probabilistic density from the training data. That is, we denote the Hotelling $T^2$ and SPE as a vector $\mathbf{z}$ and represent the probability density of $\mathbf{z}$ by a mixture of $k$ Gaussians:

$$p(\mathbf{z} \mid \Theta) = \sum_{i=1}^{k} c_i\, p(\mathbf{z} \mid \boldsymbol{\mu}_i, \Sigma_i), \quad (20)$$

where $c_i \geq 0$, $\sum_{i=1}^{k} c_i = 1$, and each $p(\mathbf{z} \mid \boldsymbol{\mu}_i, \Sigma_i)$ is a two-dimensional Gaussian distribution with density specified by the mean $\boldsymbol{\mu}_i$ and the covariance matrix $\Sigma_i$. The number of Gaussians $k$ is chosen between 2 and 5 in usual cases. In the online monitoring, every time a new measurement $\mathbf{x}_t$ comes in, we calculate its related statistics $\mathbf{z}_t$. If the system behavior is relatively static, failures can be detected by directly checking the newly computed statistics $\mathbf{z}_t$ against the distribution (20) learned from the training data, for example, comparing the probability $p(\mathbf{z}_t \mid \Theta)$ with a certain threshold to generate alarms. However, we observe that in Web-based computing systems, the system behavior is not always fixed, due to user workload variations, Web content updating, and so on. As a consequence, the distribution of the two statistics may shift during the monitoring period as a result of variations in system behavior. To deal with such systems
with changing behavior, Section 4.2 provides another
solution to detect failures. We employ an online updating
algorithm to sequentially update the distribution of two
statistics for every sample in the monitoring process. The
failure is then detected based on the statistical deviation of
density distribution before and after the updating.
For the linear model, it is easy to compute the Hotelling $T^2$ and SPE for each new sample $\mathbf{x}_t$ based on
(4) and (5). However, calculating those statistics online is
not straightforward in the nonlinear case, since there are no
global formula or parameters to describe the nonlinear
manifold. Therefore, Section 4.1 focuses on the online
statistics computing for the nonlinear model. Once the
newly computed statistics are available, Section 4.2 then
presents the way of updating their density and detecting
anomaly based on the updated density.
4.1 Online Computing of the Two Statistics

Fig. 4 presents the algorithm for the online computation of the Hotelling $T^2$ and SPE for the nonlinear model. Given a new measurement $\mathbf{x}_t$, the first step is to find the nearest patch to $\mathbf{x}_t$ on the discovered manifold from the training data. We start by locating the nearest point $\mathbf{x}^*$ to $\mathbf{x}_t$ in the training data together with its nearest neighbors (including $\mathbf{x}^*$ itself) $\mathbf{x}_1^*, \mathbf{x}_2^*, \ldots, \mathbf{x}_k^*$, followed by retrieving their projections on the manifold, $\hat{\mathbf{x}}_1^*, \ldots, \hat{\mathbf{x}}_k^*$. The plane spanned by the $\hat{\mathbf{x}}_j^*$ is then regarded as the nearest patch to $\mathbf{x}_t$ on the manifold.
Note that the nearest neighbors of $\mathbf{x}^*$ and their projections have already been calculated during the training phase, and no extra calculations are needed. The only issue is finding the nearest point $\mathbf{x}^*$ to $\mathbf{x}_t$ in the training data. For high-dimensional $\mathbf{x}_t$, the complexity of a nearest neighbor query is practically linear in the training data size. To speed up the computation, a locality-sensitive hashing (LSH) [15] data structure is constructed for the training data to approximate the nearest neighbor search. As a consequence, the nearest neighbor query time has a sublinear dependence on the data size.
Step 2 in Fig. 4 then projects $\mathbf{x}_t$ onto its nearest patch. We assume that the patch spanned by the $\hat{\mathbf{x}}_j^*$ is linear and build the matrix $A^* = [\hat{\mathbf{x}}_1^*, \ldots, \hat{\mathbf{x}}_k^*]$. By solving the equation $A^* \mathbf{w} = \mathbf{x}_t$ by least squares, we get the estimate $\mathbf{w} = (A^{*\top} A^*)^{-1} A^{*\top} \mathbf{x}_t$ and the projection of $\mathbf{x}_t$ on the manifold:

$$\hat{\mathbf{x}}_t = A^* \mathbf{w} = A^* (A^{*\top} A^*)^{-1} A^{*\top} \mathbf{x}_t. \quad (22)$$

Once we obtain the weights $\mathbf{w}$, the low-dimensional embedding vector $\mathbf{y}_t$ of $\mathbf{x}_t$ is calculated by (21), based on the observation that the local geometry in the original space should be equally valid for local patches on the manifold. Accordingly, the Hotelling $T^2$ and SPE values of the new sample $\mathbf{x}_t$ are calculated.
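Step 2 of Fig. 4 reduces to one least-squares solve. The sketch below implements (22); the line computing $\mathbf{y}_t$ assumes, as one plausible reading of (21), that the same weights applied to the neighbors' stored embeddings reproduce the local geometry.

```python
# A minimal sketch of Step 2 in Fig. 4: project a new measurement x_t onto the
# patch spanned by the neighbors' manifold projections A* (Eq. 22).
import numpy as np

def online_statistics(x_t, A_star, Y_star):
    """A_star: p x k columns are the projections x_hat*_j of the nearest patch;
    Y_star: m x k columns are the stored embeddings of the same neighbors."""
    w, *_ = np.linalg.lstsq(A_star, x_t, rcond=None)  # solve A* w ~= x_t
    x_hat_t = A_star @ w                              # projection, Eq. (22)
    spe = float(np.sum((x_t - x_hat_t) ** 2))
    y_t = Y_star @ w      # assumed reading of Eq. (21): reuse the weights
    return y_t, x_hat_t, spe
```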
4.2 Density-Based Detection

Once the newly calculated Hotelling $T^2$ and SPE are available, we use the sequentially discounting expectation-maximization (SDEM) algorithm [26], as described in Fig. 5, to update the density (20). Based on the original expectation-maximization (EM) algorithm for Gaussian mixture models [31], SDEM utilizes an exponentially weighted moving average (EWMA) filter to adapt to frequent system changes. For instance, given a set of observations $\{x_1, x_2, \ldots, x_n, \ldots\}$, an online EWMA filter of the mean $\mu$ is expressed as

$$\mu^{(n+1)} = (1 - \rho)\mu^{(n)} + \rho\, x_{n+1}, \quad (23)$$

where the forgetting parameter $\rho$ dictates the degree of discounting of previous examples. Intuitively, the larger $\rho$ is, the faster the algorithm ages out past examples. Note that there is another parameter $\alpha$ in Fig. 5, set between [1.0, 2.0], in the estimation of $\Sigma_i$ in order to improve the stability of the solution.
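For illustration, the EWMA update of (23) is a one-line rule; a minimal sketch applied to a component mean is shown below (the full SDEM of [26] applies the same discounting to the mixing weights and covariances).

```python
# A minimal sketch of the EWMA update in Eq. (23).
def ewma_update(mu, x_new, rho=0.01):
    """rho: forgetting parameter; a larger rho ages out past samples faster."""
    return (1.0 - rho) * mu + rho * x_new
```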
The anomaly is then determined based on the statistical deviation of the density distribution before and after the new statistics $\mathbf{z}_t$ are obtained. If we denote the two distributions as $p_{t-1}(\mathbf{z})$ and $p_t(\mathbf{z})$, respectively, our metric, called the Hellinger score, is defined by

$$s_H(\mathbf{z}_t) = \int \left( \sqrt{p_t(\mathbf{z})} - \sqrt{p_{t-1}(\mathbf{z})} \right)^2 d\mathbf{z}. \quad (24)$$

Intuitively, this score measures how much the probability density function $p_t(\mathbf{z})$ has moved from $p_{t-1}(\mathbf{z})$ after learning $\mathbf{z}_t$. A higher score indicates that $\mathbf{z}_t$ is an outlier with high probability. For the efficient computation of the Hellinger score, see [26].
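The Hellinger integral in (24) has no closed form for Gaussian mixtures; [26] gives an efficient computation, and the sketch below instead approximates it by Monte Carlo, using the identity $\int (\sqrt{p_t} - \sqrt{p_{t-1}})^2 d\mathbf{z} = 2 - 2\,\mathbb{E}_{p_t}[\sqrt{p_{t-1}/p_t}]$. The `sample`/`pdf` interface of the mixture objects is hypothetical.

```python
# A minimal Monte Carlo sketch of the Hellinger score, Eq. (24); not the
# efficient computation of [26].
import numpy as np

def hellinger_score(p_t, p_prev, n_samples=5000):
    """p_t, p_prev: mixture densities after/before the update, assumed to
    expose .sample(n) -> (n, d) array and .pdf(z) -> (n,) array."""
    z = p_t.sample(n_samples)
    ratio = p_prev.pdf(z) / np.maximum(p_t.pdf(z), 1e-300)
    return 2.0 - 2.0 * float(np.mean(np.sqrt(ratio)))
```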
Fig. 4. Online computation of the Hotelling $T^2$ and SPE for the nonlinear model.
Fig. 5. The SDEM algorithm for updating the mixing probability $c_i^t$, the mean $\boldsymbol{\mu}_i^t$, and the covariance $\Sigma_i^t$ of the $k$ Gaussian functions in (20), given the new statistics $\mathbf{z}_t$.
5 FAILURE LOCALIZATION
This section discusses how to find a set of the most suspicious attributes, after a failure has been detected, based on their relation to the failure. Although these returned attributes may not reveal the exact root cause of the failure, they still contain useful clues and can thereby greatly help system operators narrow down the debugging scope to a few highly probable components.

We collect and analyze the failure measurement $\mathbf{x}_f$ to determine which of its statistics has gone wrong by comparing each statistic with its marginal distribution derived from the joint density (20). If the SPE statistic deviates from its distribution, we use the ranking method in Section 5.1 to return the most suspicious variables. Similarly, if the Hotelling $T^2$ goes wrong, we use the method in Section 5.2 to rank the variables. If both statistics go wrong, the union of the two sets of the most suspicious variables is returned.
5.1 Variable Ranking from the SPE

Given the failure measurement $\mathbf{x}_f$, according to (5), its SPE represents the squared norm of the residual vector $\tilde{\mathbf{x}}_f = \mathbf{x}_f - \hat{\mathbf{x}}_f$. Our method for finding the most suspicious attributes from a deviated SPE is based on the absolute value of each element of the residual $\tilde{\mathbf{x}}_f$: if the $i$th element of $\tilde{\mathbf{x}}_f$ has a large absolute value, then the $i$th attribute is one of the most suspicious attributes.

However, the direct comparison of the elements of $\tilde{\mathbf{x}}_f$ is not reliable, since the attributes have different scales. In order to prevent one attribute from outweighing the others, we record the absolute values of each element of the residual vectors of the training samples and calculate their mean $\tilde{\mathbf{m}}$ and standard deviation $\tilde{\boldsymbol{\sigma}}$. The $i$th element of $\tilde{\mathbf{x}}_f$ is then transformed to a new variable:

$$z_i = \frac{|\tilde{x}_{fi}| - \tilde{m}_i}{\tilde{\sigma}_i}, \quad (25)$$

where $\tilde{m}_i$ and $\tilde{\sigma}_i$ represent the $i$th elements of $\tilde{\mathbf{m}}$ and $\tilde{\boldsymbol{\sigma}}$, respectively. A large $z_i$ value indicates the high importance of the $i$th attribute to the detected failure.
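A minimal sketch of this ranking follows; the training residuals are assumed to be collected as rows of a matrix during the modeling phase.

```python
# A minimal sketch of the SPE-based ranking (Eq. 25): standardize the absolute
# residual of the failure sample by the training-residual statistics.
import numpy as np

def rank_variables_spe(residual_f, train_residuals, top=10):
    """residual_f: p-vector x_f - x_hat_f; train_residuals: n x p matrix."""
    abs_train = np.abs(train_residuals)
    z = (np.abs(residual_f) - abs_train.mean(axis=0)) / abs_train.std(axis=0)
    return np.argsort(z)[::-1][:top]   # indices of the most suspicious attributes
```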
5.2 Variable Ranking from the Hotelling $T^2$

For the linear model, the Hotelling $T^2$ in (4) can be simplified as $T^2 = \mathbf{y}^\top C_y^{-1} \mathbf{y} = \sum_{i=1}^{m} y_i^2 / \lambda_i^2$. We define

$$\mathbf{v}_t = \left[ \frac{y_1^2}{\lambda_1^2}, \ldots, \frac{y_m^2}{\lambda_m^2} \right]^\top, \quad (26)$$

in which the $i$th element represents the importance of the $i$th principal component $y_i$ to the $T^2$ statistic. However, the principal components usually do not have any physical meaning. To reveal failure evidence at the attribute level, we compute the contribution of the original variable $x_i$ to each principal component in terms of extracted variances and denote it as $\mathbf{v}_i$. It is expected that the variable whose $\mathbf{v}_i$ has the same element distribution as $\mathbf{v}_t$ is the most suspicious. Therefore, we compute

$$t_i = \mathbf{v}_i^\top \mathbf{v}_t \quad (27)$$

for the $i$th attribute and reveal the suspicious attributes based on the $t_i$. If the value of $t_i$ is large, then the $i$th attribute needs more attention in the debugging.
In order to calculate $\mathbf{v}_i$ for each variable $x_i$, we consider the principal component loading matrix $L \in \mathbb{R}^{p \times m}$ [3], in which each element $l_{ij}$ represents the correlation between the $i$th variable $x_i$ and the $j$th principal component $y_j$:

$$l_{ij} = \mathrm{corr}(x_i, y_j). \quad (28)$$

The matrix $L$ can be computed from the SVD of the data matrix [3], and the square of its element, $l_{ij}^2$, tells the proportion of variance in the original variable $x_i$ explained by the principal component $y_j$. Therefore, we define a matrix $M$ with elements $M_{ij} = l_{ij}^2\, \mathrm{var}(x_i)$ to represent the actual variance in $x_i$ that is explained by $y_j$. The summation of the $j$th column of $M$, $f_j = \sum_i M_{ij}$, represents the total variance extracted by the $j$th principal component $y_j$. We divide each element $M_{ij}$ by $f_j$ to obtain a new matrix $\tilde{M}$ with elements $\tilde{M}_{ij} = M_{ij} / f_j$. The $i$th row of matrix $\tilde{M}$ then represents the contribution of the original variable $x_i$ to each principal component in terms of extracted variances:

$$\mathbf{v}_i = [\tilde{M}_{i1}, \ldots, \tilde{M}_{im}]^\top. \quad (29)$$
For the nonlinear model, however, the low-dimensional embedding vector $\mathbf{y}$ cannot be expressed as a linear combination of the original attributes. In order to apply the above variable ranking method to nonlinear situations, we perform the SVD on the nearest patch of $\mathbf{x}_f$, $A^* = [\hat{\mathbf{x}}_1^*, \ldots, \hat{\mathbf{x}}_k^*]$, as discovered in Section 4.1. By doing so, we obtain $\mathbf{x}_f$'s local principal components $\mathbf{y}_f^*$. The loading matrix $L$ and, hence, the $\mathbf{v}_i$ are then calculated from the data matrix $A^*$ and $\mathbf{y}_f^*$, and $\mathbf{v}_t$ is computed from $\mathbf{y}_f^*$. Accordingly, the $t_i$ values in (27) can be calculated to rank the variables.
6 EXPERIMENTAL RESULTS
In this section, we first use some synthetic data to
demonstrate the advantages of the local EIV modeling
and fusion process, as described in Sections 3.2.1 and 3.2.2,
in achieving a more accurate reconstruction of the nonlinear
manifold. We also evaluate both the linear and nonlinear
methods in detecting a set of generated outliers. Then, we
apply our proposed high-dimensional data monitoring
approach to a real J2EE-based application to detect a
variety of injected system failures.
6.1 Synthetic Data
In addition to providing an original framework for monitoring high-dimensional measurements in information systems, this paper also makes novel algorithmic contributions to nonlinear manifold recovery. We have
proposed the local EIV model and a fusion process to
reduce the noise in the original measurements and hence
achieve a more accurate reconstruction of the underlying
manifold. In Section 6.1.1, we use synthetic data to
demonstrate the effectiveness of our proposed algorithms.
In addition, we generate some outliers to evaluate the
detection performance of both linear and nonlinear models
in Section 6.1.2.
6.1.1 Manifold Reconstruction
For ease of visualization, a 1D manifold (curve) is used in this example. The 400 data points are generated by $g(t) = [t\cos t,\ t\sin t]^\top$ plus a certain amount of Gaussian noise, where $t$ is uniformly sampled in the interval $[0, 4\pi]$. Fig. 6a shows such a curve under noise with standard deviation $\sigma = 0.3$. Since in this example the data dimension $p = 2$ is smaller than the neighborhood size $k$ (= 12 for this data set), we use a regularized solution of the EIV model [13] to calculate the matrix $\hat{B}$ in (13).
The reconstruction accuracy for the 1D manifold is measured by the relationship between the recovered manifold coordinate $\tilde{t}$ and the centered arc length $\bar{t}$ defined as

$$\bar{t} = \int_{t_0}^{t} \| J_g(\tau) \|\, d\tau, \quad (30)$$

where $J_g(t)$ is the Jacobian of $g(t)$:

$$J_g(t) = [\cos t - t\sin t,\ \sin t + t\cos t]^\top. \quad (31)$$
The more accurate the manifold reconstruction is, the more linear the relationship between $\tilde{t}$ and $\bar{t}$ becomes. Figs. 6b, 6c, and 6d show the relationship curves generated by the LLE algorithm, LLE with EIV modeling, and LLE with both EIV modeling and fusion. It is obvious that LLE with both EIV modeling and fusion outperforms the other two algorithms. Note that in the LLE algorithm with only EIV modeling, the estimate of the projection $\hat{\mathbf{x}}$ is taken as the locally smoothed value from the region that surrounds $\mathbf{x}$.

A further comparison of the performance of the three algorithms is carried out by random simulations on manifolds in a 100-dimensional space. We generate the same 2D data $g(t_i)$ as in the first example, followed by transforming the data into 100-dimensional vectors by the orthogonal transformation $\mathbf{x}_i = Q g(t_i)$, where $Q \in \mathbb{R}^{100 \times 2}$ is a random orthonormal matrix. We add different levels of noise to the data, with the standard deviation $\sigma$ ranging from 0.1 to 1. At each noise level, 100 trials are run. We use the correlation coefficient between the recovered manifold coordinate $\tilde{t}$ and the centered arc length $\bar{t}$,

$$\rho = \frac{\mathrm{cov}(\tilde{t}, \bar{t})}{\sqrt{\mathrm{var}(\tilde{t})\,\mathrm{var}(\bar{t})}}, \quad (32)$$
to measure the strength of their linear relationship. Fig. 7 shows the mean and standard deviation of the correlation coefficient $\rho$ obtained by LLE, LLE with EIV modeling, and LLE with EIV modeling and fusion over 100 trials under different noise levels, with the standard deviation $\sigma$ ranging from 0.1 to 1. The vertical bars in the figure mark one standard deviation from the mean. The results illustrate that both the EIV modeling and the fusion are beneficial in reducing the noise of the data samples.
6.1.2 Outlier Detection
We use data from a highly nonlinear structure to illustrate the effectiveness of the manifold reconstruction and the two statistics in detecting outliers. We generate 1,000 3D data points $\mathbf{z}_i$ from a 2D Swiss roll, shown in Fig. 8a. The 3D data are then transformed into 100-dimensional vectors $\mathbf{x}_i$ by an orthogonal transformation $\mathbf{x}_i = Q\mathbf{z}_i$, where $Q \in \mathbb{R}^{100 \times 3}$
Fig. 6. (a) Samples from a noisy 1D manifold. (b) The centered arc length $\bar{t}$ versus the manifold coordinate $\tilde{t}$ recovered by LLE. (c) The centered arc length $\bar{t}$ versus the manifold coordinate $\tilde{t}$ recovered by LLE with EIV modeling. (d) The centered arc length $\bar{t}$ versus the manifold coordinate $\tilde{t}$ recovered by LLE with EIV modeling and fusion.
Fig. 7. Performance comparison of LLE, LLE with EIV, and LLE with EIV
and fusion in noise reduction. The vertical bars mark one standard
deviation from the mean.
Fig. 8. The comparison of the linear and nonlinear models in outlier detection. (a) The 3D view of the training data. (b) The 3D view of the test samples (marked as 'o') and the training data (marked as '.'). (c) The scatterplot of the SPE and $T^2$ statistics of the test samples computed by the linear method. The normal test samples are marked as '.' and the outliers are marked as 'x'. (d) The scatterplot of the SPE and $T^2$ statistics of the test samples computed by the nonlinear method. The normal test samples are marked as '.' and the outliers are marked as 'x'.
is a random orthonormal matrix. Gaussian noise with standard deviation 0.5 is also added to the 100D vectors. Meanwhile, we also generate 200 test samples, half of which are normal data generated in the same way as the training data, and the other half are randomly generated outliers. Fig. 8b presents a 3D view of the test data (marked as 'o') and the training data.
We apply both linear PCA and the proposed manifold
reconstruction algorithms to reconstruct the underlying
structure from the training data. Based on the learned
structure, the $T^2$ and SPE statistics are computed for the training data and the test samples. Figs. 8c and 8d present the scatterplots of the two statistics of the test data computed from the linear and nonlinear methods, respectively, in which the normal data are plotted as '.' and the outliers are marked as 'x'. In Fig. 8c, we see that the normal samples and
outliers are highly overlapped. The linear method shows
poor performance in detecting the outliers because of the
high nonlinearity of data structure. On the other hand, the
results from the nonlinear method present a clear separa-
tion between normal samples and outliers, which is shown
in Fig. 8d. In this figure, we notice that the SPE statistic plays an important role in distinguishing the outliers from the normal data. This is reasonable, because the outliers are usually randomly generated, and it is less likely for them to
However, the T
2
statistic, which measures the in-model
distances among samples, is also useful in the outlier
identification. As shown in Fig. 8d, two outliers (8.7, 86.9)
and (8.4, 108.1) exhibit small o11 values but large
T
2
scores. We check those two points in the original
3D data and find that they are actually located in the
manifold of inliers but far from the cluster of the inlier data.
In this case, the T
2
statistic provides good evidence about
the existence of outliers.
6.2 Real Test Bed
Our algorithms have been tested on a real e-commerce
application, which is based on the J2EE multitiered
architecture. J2EE is a widely adopted platform standard
for constructing enterprise applications based on deployable
Java components, called Enterprise Java Beans (EJBs). The
architecture of our testbed system is shown in Fig. 9. We use
Apache as a Web server. The application server consists of
the Web container (Tomcat) and the EJB container (JBoss).
MySQL runs at the back end to provide persistent storage of data. PetStore 1.3.2 is deployed as our testbed
application. Its functionality consists of storefront, shopping
cart, purchase tracking, and so on. There are 47 components
in PetStore, including EJBs, Servlets, and JSPs. We build a
client emulator to generate a workload similar to that created
by typical user behavior. The emulator produces a varying
number of concurrent client connections, with each client
simulating a session based on some common scenarios,
which consists of a series of requests such as creating new
accounts, searching by keywords, browsing for item details,
updating user profiles, placing orders, and checking out.
The monitored data are collected from the three servers (Web server, application server, and database server) in our testbed system. Each server generates measurements from a
variety of sources such as CPU, disk, network, and
operating systems. Fig. 10 lists all these attributes, which
are divided into eight categories. The three right columns in
this figure give the number of attributes in each category
generated in three servers, respectively. In total, there are
111 attributes contained in each measurement. We manu-
ally check these attributes and observe that many of them
are correlated. Fig. 11 presents an example of four highly
correlated attributes. It suggests that our proposed ap-
proach is feasible to this type of data.
We collect the measurements every 5 seconds under
system normal operations, with the magnitude of workload
randomly generated between 0 and 100 user requests per
second. In total, 5,000 data samples are gathered as the
training data. To determine whether the linear or nonlinear
model best characterizes the data set, we calculate $H_n(r_j)$ for different values $r_j$, as described in Section 3.3, fit a line between their log values, $\log H_n(r_j) = 5.83 \log r_j - 15.14$, and
Fig. 9. The architecture of the testbed system.
Fig. 10. The list of attributes from the testbed.
Fig. 11. Example of four correlated attributes.
get the intrinsic dimension $m = \lceil 5.83 \rceil = 6$. Since the calculated $\eta = 0.92$ from (19) is smaller than the threshold 0.98, it is suggested that the nonlinear model be applied to the data. In the following, we will confirm this conclusion by
comparing the performances of the linear and nonlinear
models in detecting a variety of injected failures in the
system.
We modify the codes in some EJB components of the
PetStore application to simulate a number of real system
failures. Five types of faults are injected into various
components, with different intensities, to demonstrate the
robustness of our approach.
Memory Leaking. We simulate three memory leaking
failures by repeatedly allocating three different sizes
(1 Kbyte, 10 Kbytes, and 100 Kbytes) of heap memory into
the ShoppingCartEJB of the PetStore application. Since that EJB object is always referenced by other objects, the Java garbage collector never notices this memory leak. Hence, the PetStore application will
gradually exhaust the supply of virtual memory pages,
which leads to severe performance issues and makes the
accomplishment of client requests much slower.
File Missing. In the packaging process of Java Web
applications, it might happen that a file is improperly
dropped from the required composition, which will result
in failures of invoking a correct system response, and may
eventually cause service malfunction, which makes the user
come across strange Web pages. Here, we simulate five such
failures by dropping different JSP files from the PetStore application to mimic an operator's mistakes during system maintenance.
Busy Loop. Request slowdown can have quite a few actual causes, such as a spinlock fault among synchronized threads. We simulate the phenomenon of slowdown by adding a busy-loop procedure to the code. Depending on the number of loops in the instrumentation, the severity of the simulated slowdown differs. In this section, we
simulate five different busy-loop failures by allocating 30,
65, 100, 150, and 300 loops in the ShoppingCartLocalEJB of
the PetStore application, respectively.
Expected Exception. The expected exception [9] happens when a method declaring exceptions (which appear in the method's signature) is invoked. In this situation, an exception is thrown without the method's code being executed. As a consequence, the user may encounter strange Web pages. We inject this fault into two different EJBs of PetStore, that is, CatalogEJB and AddressEJB, to generate two expected-exception failures.
Null Call. The null-call fault [9] causes all methods in the affected component to return a null value without executing the method's code. It is usually caused by errors in allocating system resources, failed lookups, and so on. Similar to the expected exception, the null-call failure results in strange Web pages. We inject this fault into two different EJBs of PetStore, that is, CatalogEJB and AddressEJB, to generate two failure cases.
In total, 17 failure cases are simulated across the five different types. Note that the system is
restarted before every failure injection in order to remove
the impact of previous injected failures. In addition, the
workloads are dynamically generated with much random-
ness so that we never get a similar workload twice in the
experiments. We randomly collect a certain number of
measurements from each failure case and in total obtain
425 abnormal measurements. We also collect 575 normal
samples to make the test data set contain 1,000 samples.
The linear and nonlinear models are compared in
representing the training data and detecting failure samples
from the test data. Figs. 12a and 12b present the scatterplots of the Hotelling T² and SPE statistics of the test data produced by the linear and nonlinear models, respectively. The normal test data and the failure data are plotted with distinct markers in the figures. For the linear model shown in Fig. 12a, there is an overlap between the normal data distribution and that of the failure samples. In addition, there are four normal points with very large SPE values (around 120). We check those points in Fig. 12b, provided by the nonlinear method, and find that among those four points, three are located in the cluster of normal samples, and only one point, (39.5, 2.1), is hard to separate from the outliers. Compared with the linear model, the nonlinear model produces a clearer separation between normal and abnormal samples in the generated statistics.
We also notice that for the nonlinear model, the SPE statistic plays a dominant role in detecting the outliers. Based on the similar observation obtained in Fig. 8d for the synthetic data, we can explain this by two factors: 1) the nonlinear method correctly identifies the underlying data structure, and 2) in the experiment, most failure points are located outside the discovered manifold. Despite the importance of SPE, the T² statistic is also useful in failure detection, especially when the SPE values fall in the ambiguity region, for example, between 27 and 37 for the nonlinear model. For the linear method, the ambiguity region of SPE is wider, from 15 to 35, as shown in Fig. 12a, because of its linear assumption about the underlying data structure. In order to quantitatively compare the performances of the two detectors, we use the method described in Section 4.2 to build the joint density of T² and SPE based on values computed from the training data and calculate the Hellinger score (24) for every test sample. Based on these scores, the ROC curves of the two models are plotted in Fig. 13. It shows that both the linear and nonlinear methods obtain acceptable results in detecting the failure samples, due to the moderate nonlinearity of the data generated in the experiment; however, the nonlinear model produces more accurate results than the linear model.
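To relate the score to a concrete computation, the following is a minimal sketch assuming the score in (24) is based on the standard Hellinger distance, evaluated here on a binned (discretized) estimate of the joint density of T² and SPE; the binning scheme and the exact form of (24) in the paper may differ.

// Sketch of the Hellinger distance between two discretized densities, e.g.,
// binned estimates of the joint (T^2, SPE) density before and after updating.
public final class Hellinger {
    // p and q must be probability vectors of equal length, each summing to 1
    public static double distance(double[] p, double[] q) {
        double sum = 0.0;
        for (int i = 0; i < p.length; i++) {
            double d = Math.sqrt(p[i]) - Math.sqrt(q[i]);
            sum += d * d;
        }
        return Math.sqrt(sum) / Math.sqrt(2.0); // normalized to lie in [0, 1]
    }
}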
A further investigation reveals more advantages of the nonlinear model over the linear method. We find that the values of T² and SPE calculated by the nonlinear model can also provide useful clues about the significance
Fig. 12. The scatterplots of SPE and T² of the test data produced by (a) the linear model and (b) the nonlinear model. The normal test data and the failure data are plotted with distinct markers.
of injected failures. Fig. 14 uses the SPE values on the busy-loop failure to demonstrate this fact. Fig. 14b shows the histogram of the SPE values of normal test data generated by the nonlinear model. Figs. 14d, 14f, and 14h present the SPE of test samples from three busy-loop failure cases with different impacts, in which 30, 65, and 100 busy loops are injected into an EJB component, respectively. The results show that the SPE values for these failures are well separated, and a failure with stronger significance produces larger SPE values. The SPEs computed by the linear model are also shown in Figs. 14a, 14c, 14e, and 14g. Compared with those from the nonlinear model, the SPE values from the linear model overlap and lack strong evidence about the significance of the injected failures.
Our failure localization procedure also produces satisfactory results. Here, we use the variable ranking from SPE to demonstrate this. We randomly select 200 samples from the failure measurements whose SPE values are affected. Among the 200 selected data, each type of failure occupies 40 samples, which are continuously indexed. We apply the attribute ranking method described in Section 5 and output a vector v_i, which contains the indices of the five most suspicious attributes. In total, we generate 200 such vectors v_i, i = 1, ..., 200. In order to see whether these attribute ranking results really tell any evidence about the injected failure, we perform hierarchical clustering on the v_i's, in which the Jaccard coefficient J(i, j) = |v_i ∩ v_j| / |v_i ∪ v_j| is used to calculate the similarity between v_i and v_j. We start by assigning each vector v_i to its own cluster and then merge the clusters with the largest average similarity until five clusters are obtained, as sketched below.
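The similarity computation itself is straightforward; the following minimal sketch treats each suspicious-attribute vector as a set of attribute indices (the agglomerative average-linkage clustering step is standard and omitted here).

import java.util.HashSet;
import java.util.Set;

// Sketch of the Jaccard similarity used when clustering attribute-ranking vectors.
public final class JaccardSimilarity {
    public static double jaccard(Set<Integer> vi, Set<Integer> vj) {
        Set<Integer> inter = new HashSet<>(vi);
        inter.retainAll(vj);                 // |v_i ∩ v_j|
        Set<Integer> union = new HashSet<>(vi);
        union.addAll(vj);                    // |v_i ∪ v_j|
        return (double) inter.size() / union.size();
    }
}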
Results show that vectors v_i from the same type of failure provide consistent variable ranking results. Table 1 shows the cluster indices associated with each failure measurement. We see that all the vectors belonging to the memory-leaking (indices 1-40) and file-missing (41-80) failures form separate clusters. Most of the vectors from the busy-loop failure (81-120) form one cluster, except for five noisy points. However, it is hard to separate the null-call (121-160) and expected-exception (161-200) failures. This is actually reasonable, since these two types of failures are generated by similar mechanisms. Therefore, our proposed failure localization method provides consistent and useful evidence about the failure. It is our future work to look further into the clustered vectors and reveal the signature of each type of failure based on its suspicious attributes. By doing so, we can quickly identify and resolve recurrent system failures by retrieving similar signatures from historical failures.
7 CONCLUSIONS
This paper has presented a method for monitoring high-
dimensional data in information systems based on the
observation that the high-dimensional measurements are
usually located in a low-dimensional structure embedded in
the original space. We have developed both linear and
nonlinear algorithms to discover the underlying low-
dimensional structure of data. Two statistics, the Hotelling
T² and SPE, have been used to represent the data variations
within and outside the revealed structure. Based on the
probabilistic density of these statistics, we have successfully
detected a variety of simulated failures in a J2EE-based Web
application. In addition, we have discovered a list of
suspicious attributes for each detected failure, which are
helpful in finding the failure root cause.
Fig. 14. (a) and (b) The histograms of SPE of normal test samples produced by the linear and nonlinear models, respectively. (c) and (d) The histograms of SPE by the linear and nonlinear models for samples from the busy-loop failure with 30 loops injected. (e) and (f) The same for 65 loops injected. (g) and (h) The same for 100 loops injected.
TABLE 1
The Clustering Results of Failure Measurements Based
on the Outcomes of Attribute Ranking
Fig. 13. The ROC curves for failure detectors based on the linear and
nonlinear models.
REFERENCES
[1] C.C. Aggarwal and P.S. Yu, "Outlier Detection for High-Dimensional Data," Proc. ACM SIGMOD '01, pp. 37-46, 2001.
[2] M.K. Aguilera, W. Chen, and S. Toueg, "Using the Heartbeat Failure Detector for Quiescent Reliable Communication and Consensus in Partitionable Networks," Theoretical Computer Science, special issue on distributed algorithms, vol. 220, pp. 3-30, 1999.
[3] T.W. Anderson, An Introduction to Multivariate Statistical Analysis, second ed. Wiley, 1984.
[4] M. Balasubramanian and E.L. Schwartz, "The Isomap Algorithm and Topological Stability," Science, vol. 295, no. 7, 2002.
[5] P. Barham, R. Isaacs, R. Mortier, and D. Narayanan, "Magpie: Real-Time Modeling and Performance-Aware Systems," Proc. Ninth Workshop Hot Topics in Operating Systems (HotOS '03), May 2003.
[6] P. Bodik et al., "Combining Visualization and Statistical Analysis to Improve Operator Confidence and Efficiency for Failure Detection and Localization," Proc. Second Int'l Conf. Autonomic Computing (ICAC '05), pp. 89-100, June 2005.
[7] M. Brand, "Charting a Manifold," Advances in Neural Information Processing Systems 15, MIT Press, 2003.
[8] T. Brotherton and T. Johnson, "Anomaly Detection for Advanced Military Aircraft Using Neural Networks," Proc. IEEE Aerospace Conf., pp. 3113-3123, 2001.
[9] M. Chen, E. Kiciman, E. Fratkin, A. Fox, and E. Brewer, "Pinpoint: Problem Determination in Large Dynamic Systems," Proc. Int'l Performance and Dependability Symp. (IPDS '02), June 2002.
[10] HP OpenView, HP Corp., http://www.openview.hp.com/, 2007.
[11] M.J. Desforges, P.J. Jacob, and J.E. Cooper, "Applications of Probability Density Estimation to the Detection of Abnormal Conditions in Engineering," Proc. Inst. of Mechanical Eng. Part C: J. Mechanical Eng. Science, vol. 212, pp. 687-703, 1998.
[12] C. Eckart and G. Young, "The Approximation of One Matrix by Another of Lower Rank," Psychometrika, vol. 1, pp. 211-218, 1936.
[13] R.D. Fierro, G.H. Golub, P.C. Hansen, and D.P. O'Leary, "Regularization by Truncated Total Least Squares," SIAM J. Scientific Computing, vol. 18, pp. 1223-1241, 1997.
[14] W. Fuller, Measurement Error Models. John Wiley & Sons, 1987.
[15] A. Gionis, P. Indyk, and R. Motwani, "Similarity Search in High Dimensions via Hashing," Proc. 25th Int'l Conf. Very Large Data Bases (VLDB '99), pp. 518-529, 1999.
[16] G.H. Golub and C.F. Van Loan, Matrix Computations, third ed. Johns Hopkins Univ. Press, 1996.
[17] P. Grassberger and I. Procaccia, "Measuring the Strangeness of Strange Attractors," Physica D, vol. 9, pp. 189-208, 1983.
[18] A. Höskuldsson, "PLS Regression Methods," J. Chemometrics, vol. 2, no. 3, pp. 211-228, 1988.
[19] Tivoli Business System Manager, IBM, http://www.tivoli.com/, 2007.
[20] T. Ide and H. Kashima, "Eigenspace-Based Anomaly Detection in Computer Systems," Proc. ACM SIGKDD '04, pp. 440-449, Aug. 2004.
[21] G. Jiang, H. Chen, C. Ungureanu, and K. Yoshihira, "Multi-Resolution Abnormal Trace Detection Using Varied-Length n-Grams and Automata," Proc. Second Int'l Conf. Autonomic Computing (ICAC '05), pp. 111-122, June 2005.
[22] G. Jiang, H. Chen, and K. Yoshihira, "Discovering Likely Invariants of Distributed Transaction Systems for Autonomic System Management," Proc. Third Int'l Conf. Autonomic Computing (ICAC '06), pp. 199-208, June 2006.
[23] I.T. Jolliffe, Principal Component Analysis. Springer-Verlag, 1986.
[24] T. Kourti and J.F. MacGregor, "Recent Developments in Multivariate SPC Methods for Monitoring and Diagnosing Process and Product Performance," J. Quality Technology, vol. 28, no. 4, pp. 409-428, 1996.
[25] R. Kozma, M. Kitamura, M. Sakuma, and Y. Yokoyama, "Anomaly Detection by Neural Network Models and Statistical Time Series Analysis," Proc. IEEE World Congress on Computational Intelligence '94, pp. 3207-3210, 1994.
[26] K. Yamanishi, J. Takeuchi, G. Williams, and P. Milne, "On-Line Unsupervised Outlier Detection Using Finite Mixtures with Discounting Learning Algorithms," Proc. Sixth ACM SIGKDD '00, pp. 320-324, 2000.
[27] M. Markou and S. Singh, "Novelty Detection: A Review - Part 1: Statistical Approaches," Signal Processing, vol. 83, pp. 2481-2497, 2003.
[28] M. Markou and S. Singh, "Novelty Detection: A Review - Part 2: Neural Network Based Approaches," Signal Processing, vol. 83, pp. 2499-2521, 2003.
[29] L. Mirsky, "Symmetric Gauge Functions and Unitarily Invariant Norms," Quarterly J. Math. Oxford, vol. 11, pp. 50-59, 1960.
[30] M.J. Piovoso, K.A. Kosanovich, and J.P. Yuk, "Process Data Chemometrics," IEEE Trans. Instrumentation and Measurement, vol. 41, no. 2, pp. 262-268, 1992.
[31] R.A. Redner and H.F. Walker, "Mixture Densities, Maximum Likelihood and the EM Algorithm," SIAM Rev., vol. 26, pp. 195-239, 1984.
[32] S. Roweis and L. Saul, "Nonlinear Dimensionality Reduction by Locally Linear Embedding," Science, vol. 290, pp. 2323-2326, 2000.
[33] N.K. Shah and P.J. Gemperline, "Combination of the Mahalanobis Distance and Residual Variance Pattern Recognition Techniques for Classification of Near-Infrared Reflectance Spectra," J. Am. Chemical Soc., vol. 62, no. 5, pp. 465-470, 1990.
[34] D.M.J. Tax and R.P.W. Duin, "Support Vector Domain Description," Pattern Recognition Letters, vol. 20, pp. 1191-1199, 1999.
[35] J.B. Tenenbaum, V. de Silva, and J.C. Langford, "A Global Geometric Framework for Nonlinear Dimensionality Reduction," Science, vol. 290, pp. 2319-2323, 2000.
[36] S. Van Huffel and J. Vandewalle, The Total Least Squares Problem: Computational Aspects and Analysis. Soc. for Industrial and Applied Math., 1991.
Haifeng Chen received the BEng and MEng
degrees in automation from the Southeast
University, China, in 1994 and 1997, respec-
tively, and the PhD degree in computer engi-
neering from Rutgers University, New Jersey, in
2004. He was a researcher at the Chinese
National Research Institute of Power Automa-
tion. He is currently a research staff member at
the NEC Laboratories America, Princeton, New
Jersey. His research interests include data
mining, autonomic computing, pattern recognition, and robust statistics.
Guofei Jiang received the BS and PhD degrees
in electrical and computer engineering from
Beijing Institute of Technology, Beijing, in 1993
and 1998, respectively. From 1998 to 2000, he
was a postdoctoral fellow in computer engineer-
ing at Dartmouth College, New Hampshire. He is
currently a senior research staff member with
the Robust and Secure Systems Group, NEC
Laboratories America, Princeton, New Jersey.
His current research interests include distributed
systems, dependable and secure computing, and system and informa-
tion theory. He has published nearly 50 technical papers in these areas.
He is an associate editor for IEEE Security and Privacy and has served
in the program committees of many prestigious conferences.
Kenji Yoshihira received the BE degree in
electrical engineering from the University of
Tokyo in 1996 and the MS degree in computer
science from New York University in 2004. For
five years, he was with Hitachi, where he
designed processor chips for enterprise compu-
ters. Until 2002, he was a chief technical officer
(CTO) at Investoria Inc., Japan, where he
developed an Internet service system for finan-
cial information distribution. He is currently a
research staff member with the Robust and Secure Systems Group,
NEC Laboratories America, Inc., New Jersey. His current research
interests include distributed systems and autonomic computing.