DATA MINING IN
TIME SERIES DATABASES
SERIES IN MACHINE PERCEPTION AND ARTIFICIAL INTELLIGENCE*
Editors: H. Bunke (Univ. Bern, Switzerland)
P. S. P. Wang (Northeastern Univ., USA)
Vol. 43: Agent Engineering
(Eds. Jiming Liu, Ning Zhong, Yuan Y. Tang and Patrick S. P. Wang)
Vol. 44: Multispectral Image Processing and Pattern Recognition
(Eds. J. Shen, P. S. P. Wang and T. Zhang)
Vol. 45: Hidden Markov Models: Applications in Computer Vision
(Eds. H. Bunke and T. Caelli)
Vol. 46: Syntactic Pattern Recognition for Seismic Oil Exploration
(K. Y. Huang)
Vol. 47: Hybrid Methods in Pattern Recognition
(Eds. H. Bunke and A. Kandel)
Vol. 48: Multimodal Interface for Human-Machine Communications
(Eds. P. C. Yuen, Y. Y. Tang and P. S. P. Wang)
Vol. 49: Neural Networks and Systolic Array Design
(Eds. D. Zhang and S. K. Pal)
Vol. 50: Empirical Evaluation Methods in Computer Vision
(Eds. H. I. Christensen and P. J. Phillips)
Vol. 51: Automatic Diatom Identification
(Eds. H. du Buf and M. M. Bayer)
Vol. 52: Advances in Image Processing and Understanding
A Festschrift for Thomas S. Huang
(Eds. A. C. Bovik, C. W. Chen and D. Goldgof)
Vol. 53: Soft Computing Approach to Pattern Recognition and Image Processing
(Eds. A. Ghosh and S. K. Pal)
Vol. 54: Fundamentals of Robotics — Linking Perception to Action
(M. Xie)
Vol. 55: Web Document Analysis: Challenges and Opportunities
(Eds. A. Antonacopoulos and J. Hu)
Vol. 56: Artificial Intelligence Methods in Software Testing
(Eds. M. Last, A. Kandel and H. Bunke)
Vol. 57: Data Mining in Time Series Databases
(Eds. M. Last, A. Kandel and H. Bunke)
Vol. 58: Computational Web Intelligence: Intelligent Technology for
Web Applications
(Eds. Y. Zhang, A. Kandel, T. Y. Lin and Y. Yao)
Vol. 59: Fuzzy Neural Network Theory and Application
(P. Liu and H. Li)
*For the complete list of titles in this series, please write to the Publisher.
Series in Machine Perception and Artificial Intelligence — Vol. 57
DATA MINING IN
TIME SERIES DATABASES
Editors
Mark Last
Ben-Gurion University of the Negev, Israel
Abraham Kandel
Tel-Aviv University, Israel
University of South Florida, Tampa, USA
Horst Bunke
University of Bern, Switzerland
World Scientific
NEW JERSEY • LONDON • SINGAPORE • BEIJING • SHANGHAI • HONG KONG • TAIPEI • CHENNAI
British Library Cataloguing-in-Publication Data
A catalogue record for this book is available from the British Library.
For photocopying of material in this volume, please pay a copying fee through the Copyright
Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, USA. In this case permission to
photocopy is not required from the publisher.
ISBN 981-238-290-9
Typeset by Stallion Press
Email: enquiries@stallionpress.com
All rights reserved. This book, or parts thereof, may not be reproduced in any form or by any means,
electronic or mechanical, including photocopying, recording or any information storage and retrieval
system now known or to be invented, without written permission from the Publisher.
Copyright © 2004 by World Scientific Publishing Co. Pte. Ltd.
Published by
World Scientific Publishing Co. Pte. Ltd.
5 Toh Tuck Link, Singapore 596224
USA office: 27 Warren Street, Suite 401-402, Hackensack, NJ 07601
UK office: 57 Shelton Street, Covent Garden, London WC2H 9HE
Printed in Singapore by World Scientific Printers (S) Pte Ltd
DATA MINING IN TIME SERIES DATABASES
Series in Machine Perception and Artificial Intelligence (Vol. 57)
April 22, 2004 16:59 WSPC/Trim Size: 9in x 6in for Review Volume fm
Dedicated to
The Honorable Congressman C. W. Bill Young
House of Representatives
For his vision and continuous support in creating the National Institute
for Systems Test and Productivity at the Computer Science and
Engineering Department, University of South Florida
Preface
Traditional data mining methods are designed to deal with "static"
databases, i.e. databases where the ordering of records (or other database
objects) has nothing to do with the patterns of interest. Though the
assumption of order irrelevance may be sufficiently accurate in some
applications, there are certainly many other cases where sequential
information, such as a timestamp associated with every record, can
significantly enhance our knowledge about the mined data. One example is a
series of stock values: a specific closing price recorded yesterday has a
completely different meaning than the same value a year ago. Since most of
today's databases already include temporal data in the form of "date
created", "date modified", and other time-related fields, the only problem
is how to exploit this valuable information to our benefit. In other words,
the question we are currently facing is: How to mine time series data?
The purpose of this volume is to present some recent advances in
preprocessing, mining, and interpretation of temporal data that is stored
by modern information systems. Adding the time dimension to a database
produces a Time Series Database (TSDB) and introduces new aspects and
challenges to the tasks of data mining and knowledge discovery. These new
challenges include: finding the most efficient representation of time
series data, measuring similarity of time series, detecting change points
in time series, and time series classification and clustering. Some of
these problems have been treated in the past by experts in time series
analysis. However, statistical methods of time series analysis are focused
on sequences of values representing a single numeric variable (e.g., price
of a specific stock). In a real-world database, a time-stamped record may
include several numerical and nominal attributes, which may depend not
only on the time dimension but also on each other. To make the data mining
task even more complicated, the objects in a time series may represent
some complex graph structures rather than vectors of feature values.
Our book covers the state-of-the-art research in several areas of time
series data mining. Specific problems challenged by the authors of this
volume are as follows.
Representation of Time Series. Efficient and effective representation
of time series is a key to successful discovery of time-related patterns.
The most frequently used representation of single-variable time series is
piecewise linear approximation, where the original points are reduced to
a set of straight lines ("segments"). Chapter 1 by Eamonn Keogh, Selina
Chu, David Hart, and Michael Pazzani provides an extensive and comparative
overview of existing techniques for time series segmentation. In view of
the shortcomings of existing approaches, the same chapter introduces
an improved segmentation algorithm called SWAB (Sliding Window and
Bottom-up).
Indexing and Retrieval of Time Series. Since each time series is
characterized by a large, potentially unlimited number of points, finding
two identical time series for any phenomenon is hopeless. Thus, researchers
have been looking for sets of similar data sequences that differ only
slightly from each other. The problem of retrieving similar series arises
in many areas such as marketing and stock data analysis, meteorological
studies, and medical diagnosis. An overview of current methods for
efficient retrieval of time series is presented in Chapter 2 by Magnus Lie
Hetland. Chapter 3 (by Eugene Fink and Kevin B. Pratt) presents a new
method for fast compression and indexing of time series. A robust
similarity measure for retrieval of noisy time series is described and
evaluated by Michail Vlachos, Dimitrios Gunopulos, and Gautam Das in
Chapter 4.
Change Detection in Time Series. The problem of change point detection
in a sequence of values has been studied in the past, especially in the
context of time series segmentation (see above). However, the nature of
real-world time series may be much more complex, involving multivariate
and even graph data. Chapter 5 (by Gil Zeira, Oded Maimon, Mark Last,
and Lior Rokach) covers the problem of change detection in a classification
model induced by a data mining algorithm from time series data. A change
detection procedure for detecting abnormal events in time series of graphs
is presented by Horst Bunke and Miro Kraetzl in Chapter 6. The procedure
is applied to abnormal event detection in a computer network.
Classification of Time Series. Rather than partitioning a time series
into segments, one can see each time series, or any other sequence of data
points, as a single object. Classification and clustering of such complex
"objects" may be particularly beneficial for the areas of process control,
intrusion detection, and character recognition. In Chapter 7, Carlos
J. Alonso González and Juan J. Rodríguez Diez present a new method for
early classification of multivariate time series. Their method is capable
of learning from series of variable length and of providing a
classification when only part of the series is presented to the classifier.
A novel concept of representing time series by median strings (see
Chapter 8, by Xiaoyi Jiang, Horst Bunke, and Janos Csirik) opens new
opportunities for applying classification and clustering methods of data
mining to sequential data.
As indicated above, the area of mining time series databases still
includes many unexplored and insufficiently explored issues. Specific
suggestions for future research can be found in individual chapters. In
general, we believe that interesting and useful results can be obtained by
applying the methods described in this book to real-world sets of
sequential data.
Acknowledgments
The preparation of this volume was partially supported by the National
Institute for Systems Test and Productivity at the University of South
Florida under U.S. Space and Naval Warfare Systems Command grant number
N00039-01-1-2248.
We also would like to acknowledge the generous support and cooperation of:
Ben-Gurion University of the Negev, Department of Information Systems
Engineering; University of South Florida, Department of Computer Science
and Engineering; Tel-Aviv University, College of Engineering; The Fulbright
Foundation; The US-Israel Educational Foundation.
January 2004 Mark Last
Abraham Kandel
Horst Bunke
Contents
Preface ............................................................. vii

Chapter 1  Segmenting Time Series: A Survey and Novel Approach ........ 1
           E. Keogh, S. Chu, D. Hart and M. Pazzani

Chapter 2  A Survey of Recent Methods for Efficient Retrieval
           of Similar Time Sequences .................................. 23
           M. L. Hetland

Chapter 3  Indexing of Compressed Time Series ........................ 43
           E. Fink and K. B. Pratt

Chapter 4  Indexing Time-Series under Conditions of Noise ............ 67
           M. Vlachos, D. Gunopulos and G. Das

Chapter 5  Change Detection in Classification Models Induced
           from Time Series Data .................................... 101
           G. Zeira, O. Maimon, M. Last and L. Rokach

Chapter 6  Classification and Detection of Abnormal Events
           in Time Series of Graphs ................................. 127
           H. Bunke and M. Kraetzl

Chapter 7  Boosting Interval-Based Literals: Variable Length
           and Early Classification ................................. 149
           C. J. Alonso González and J. J. Rodríguez Diez

Chapter 8  Median Strings: A Review ................................. 173
           X. Jiang, H. Bunke and J. Csirik
CHAPTER 1
SEGMENTING TIME SERIES: A SURVEY AND
NOVEL APPROACH
Eamonn Keogh
Computer Science & Engineering Department, University of California —
Riverside, Riverside, California 92521, USA
Email: eamonn@cs.ucr.edu
Selina Chu, David Hart, and Michael Pazzani
Department of Information and Computer Science, University of California,
Irvine, California 92697, USA
Email: {selina, dhart, pazzani}@ics.uci.edu
In recent years, there has been an explosion of interest in mining time
series databases. As with most computer science problems, representation
of the data is the key to efficient and effective solutions. One of the
most commonly used representations is piecewise linear approximation.
This representation has been used by various researchers to support
clustering, classification, indexing and association rule mining of time
series data. A variety of algorithms have been proposed to obtain this
representation, with several algorithms having been independently
rediscovered several times. In this chapter, we undertake the first
extensive review and empirical comparison of all proposed techniques. We
show that all these algorithms have fatal flaws from a data mining
perspective. We introduce a novel algorithm that we empirically show to be
superior to all others in the literature.
Keywords: Time series; data mining; piecewise linear approximation;
segmentation; regression.
1. Introduction
In recent years, there has been an explosion of interest in mining time
series databases. As with most computer science problems, representation
of the data is the key to efficient and effective solutions. Several high-level
Fig. 1. Two time series and their piecewise linear representation. (a) Space Shuttle
Telemetry. (b) Electrocardiogram (ECG).
representations of time series have been proposed, including Fourier
Transforms [Agrawal et al. (1993), Keogh et al. (2000)], Wavelets [Chan and
Fu (1999)], Symbolic Mappings [Agrawal et al. (1995), Das et al. (1998),
Perng et al. (2000)] and Piecewise Linear Representation (PLR). In this
work, we confine our attention to PLR, perhaps the most frequently used
representation [Ge and Smyth (2001), Last et al. (2001), Hunter and
McIntosh (1999), Koski et al. (1995), Keogh and Pazzani (1998), Keogh and
Pazzani (1999), Keogh and Smyth (1997), Lavrenko et al. (2000), Li et al.
(1998), Osaki et al. (1999), Park et al. (2001), Park et al. (1999), Qu
et al. (1998), Shatkay (1995), Shatkay and Zdonik (1996), Vullings et al.
(1997), Wang and Wang (2000)].
Intuitively, Piecewise Linear Representation refers to the approximation
of a time series T, of length n, with K straight lines (hereafter known as
segments). Figure 1 contains two examples. Because K is typically much
smaller than n, this representation makes the storage, transmission and
computation of the data more efficient. Specifically, in the context of
data mining, the piecewise linear representation has been used to:
• Support fast exact similarity search [Keogh et al. (2000)].
• Support novel distance measures for time series, including "fuzzy
queries" [Shatkay (1995), Shatkay and Zdonik (1996)], weighted queries
[Keogh and Pazzani (1998)], multiresolution queries [Wang and Wang (2000),
Li et al. (1998)], dynamic time warping [Park et al. (1999)] and relevance
feedback [Keogh and Pazzani (1999)].
• Support concurrent mining of text and time series [Lavrenko et al.
(2000)].
• Support novel clustering and classification algorithms [Keogh and
Pazzani (1998)].
• Support change point detection [Sugiura and Ogden (1994), Ge and
Smyth (2001)].
Surprisingly, in spite of the ubiquity of this representation, with the
exception of [Shatkay (1995)], there has been little attempt to understand
and compare the algorithms that produce it. Indeed, there does not even
appear to be a consensus on what to call such an algorithm. For clarity, we
will refer to these types of algorithms, which take a time series as input
and return a piecewise linear representation, as segmentation algorithms.
The segmentation problem can be framed in several ways.
• Given a time series T, produce the best representation using only K
segments.
• Given a time series T, produce the best representation such that the
maximum error for any segment does not exceed some user-specified
threshold, max_error.
• Given a time series T, produce the best representation such that the
combined error of all segments is less than some user-specified threshold,
total_max_error.
As we shall see in later sections, not all algorithms can support all these
specifications.
Segmentation algorithms can also be classified as batch or online. This is
an important distinction because many data mining problems are inherently
dynamic [Vullings et al. (1997), Koski et al. (1995)].
Data mining researchers, who needed to produce a piecewise linear
approximation, have typically either independently rediscovered an
algorithm or used an approach suggested in related literature, for
example, from the fields of cartography or computer graphics [Douglas and
Peucker (1973), Heckbert and Garland (1997), Ramer (1972)].
In this chapter, we review the three major segmentation approaches
in the literature and provide an extensive empirical evaluation on a very
heterogeneous collection of datasets from finance, medicine, manufacturing
and science. The major result of these experiments is that the only online
algorithm in the literature produces very poor approximations of the data,
and that the only algorithm that consistently produces high quality results
and scales linearly in the size of the data is a batch algorithm. These
results motivated us to introduce a new algorithm that scales linearly in
the size of the data set, is online, and produces high quality
approximations.
The rest of the chapter is organized as follows. In Section 2, we provide
an extensive review of the algorithms in the literature. We explain the
basic approaches, and the various modifications and extensions by data
miners. In Section 3, we provide a detailed empirical comparison of all the
algorithms.
We will show that the most popular algorithms used by data miners can in
fact produce very poor approximations of the data. The results will be used
to motivate the need for a new algorithm that we will introduce and
validate in Section 4. Section 5 offers conclusions and directions for
future work.
2. Background and Related Work
In this section, we describe the three major approaches to time series
segmentation in detail. Almost all the algorithms have two- and
three-dimensional analogues, which ironically seem to be better understood.
A discussion of the higher dimensional cases is beyond the scope of this
chapter. We refer the interested reader to [Heckbert and Garland (1997)],
which contains an excellent survey.
Although appearing under different names and with slightly different
implementation details, most time series segmentation algorithms can be
grouped into one of the following three categories:
• Sliding Windows: A segment is grown until it exceeds some error bound.
The process repeats with the next data point not included in the newly
approximated segment.
• Top-Down: The time series is recursively partitioned until some stopping
criterion is met.
• Bottom-Up: Starting from the finest possible approximation, segments
are merged until some stopping criterion is met.
Table 1 contains the notation used in this chapter.

Table 1. Notation.

T                    A time series in the form t1, t2, . . . , tn.
T[a:b]               The subsection of T from a to b, ta, ta+1, . . . , tb.
Seg_TS               A piecewise linear approximation of a time series of
                     length n with K segments. Individual segments can be
                     addressed with Seg_TS(i).
create_segment(T)    A function that takes in a time series and returns a
                     linear segment approximation of it.
calculate_error(T)   A function that takes in a time series and returns the
                     approximation error of the linear segment approximation
                     of it.
Given that we are going to approximate a time series with straight lines,
there are at least two ways we can find the approximating line.
• Linear Interpolation: Here the approximating line for the subsequence
T[a:b] is simply the line connecting ta and tb. This can be obtained in
constant time.
• Linear Regression: Here the approximating line for the subsequence
T[a:b] is taken to be the best fitting line in the least squares sense
[Shatkay (1995)]. This can be obtained in time linear in the length of the
segment.
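The two fitting techniques can be sketched in a few lines of Python. This is our illustration, not code from the chapter; the names `interpolation_line` and `regression_line` are ours, and indices follow Python's 0-based convention.

```python
def interpolation_line(t, a, b):
    """Line through the endpoints (a, t[a]) and (b, t[b]); O(1) time."""
    slope = (t[b] - t[a]) / (b - a)
    return slope, t[a] - slope * a

def regression_line(t, a, b):
    """Least-squares best-fit line over t[a..b]; O(b - a) time."""
    xs = range(a, b + 1)
    n = b - a + 1
    mx = sum(xs) / n                      # mean of the x (index) values
    my = sum(t[a:b + 1]) / n              # mean of the y values
    slope = (sum((x - mx) * (t[x] - my) for x in xs)
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx
```

On a perfectly linear subsequence the two fits coincide; on noisy data the regression line minimizes the residual error, at the cost of linear rather than constant time.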
The two techniques are illustrated in Figure 2. Linear interpolation
tends to closely align the endpoints of consecutive segments, giving the
piecewise approximation a "smooth" look. In contrast, piecewise linear
regression can produce a very disjointed look on some datasets. The
aesthetic superiority of linear interpolation, together with its low
computational complexity, has made it the technique of choice in computer
graphics applications [Heckbert and Garland (1997)]. However, the quality
of the approximating line, in terms of Euclidean distance, is generally
inferior to the regression approach.
In this chapter, we deliberately keep our descriptions of algorithms at a
high level, so that either technique can be imagined as the approximation
technique. In particular, the pseudocode function create_segment(T) can
be imagined as using interpolation, regression or any other technique.
All segmentation algorithms also need some method to evaluate the
quality of fit for a potential segment. A measure commonly used in
conjunction with linear regression is the sum of squares, or the residual
error. This is calculated by taking all the vertical differences between
the best-fit line and the actual data points, squaring them and then
summing them together. Another commonly used measure of goodness of fit is
the distance between the best-fit line and the data point furthest away in
the vertical direction
Fig. 2. Two 10-segment approximations of electrocardiogram data. The
approximation created using linear interpolation has a smooth,
aesthetically appealing appearance because all the endpoints of the
segments are aligned. Linear regression, in contrast, produces a slightly
disjointed appearance but a tighter approximation in terms of residual
error.
(i.e. the L∞ norm between the line and the data). As before, we have
kept our descriptions of the algorithms general enough to encompass any
error measure. In particular, the pseudocode function calculate_error(T)
can be imagined as using the sum of squares, the furthest point, or any
other measure.
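Both error measures can be written directly from their definitions. The following Python sketch (our illustration; the function names are ours) evaluates a candidate line y = slope·x + intercept over t[a..b]:

```python
def residual_error(t, a, b, slope, intercept):
    """Sum of squared vertical differences (used with regression)."""
    return sum((t[x] - (slope * x + intercept)) ** 2
               for x in range(a, b + 1))

def furthest_point_error(t, a, b, slope, intercept):
    """L-infinity norm: the largest vertical deviation from the line."""
    return max(abs(t[x] - (slope * x + intercept))
               for x in range(a, b + 1))
```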
2.1. The Sliding Window Algorithm
The Sliding Window algorithm works by anchoring the left point of a
potential segment at the first data point of a time series, then attempting
to approximate the data to the right with increasingly longer segments. At
some point i, the error for the potential segment is greater than the
user-specified threshold, so the subsequence from the anchor to i − 1 is
transformed into a segment. The anchor is moved to location i, and the
process repeats until the entire time series has been transformed into a
piecewise linear approximation. The pseudocode for the algorithm is shown
in Table 2.
The Sliding Window algorithm is attractive because of its great
simplicity, intuitiveness and particularly the fact that it is an online
algorithm. Several variations and optimizations of the basic algorithm have
been proposed. Koski et al. noted that on ECG data it is possible to speed
up the algorithm by incrementing the variable i by "leaps of length k"
instead of 1. For k = 15 (at 400 Hz), the algorithm is 15 times faster with
little effect on the output accuracy [Koski et al. (1995)].
Depending on the error measure used, there may be other optimizations
possible. Vullings et al. noted that since the residual error is
monotonically nondecreasing with the addition of more data points, one does
not have to test every value of i from 2 to the final chosen value
[Vullings et al. (1997)]. They suggest initially setting i to s, where s is
the mean length of the previous segments. If the guess was pessimistic (the
measured error
Table 2. The generic Sliding Window algorithm.

Algorithm Seg_TS = Sliding_Window(T, max_error)
    anchor = 1;
    while not finished segmenting time series
        i = 2;
        while calculate_error(T[anchor: anchor + i]) < max_error
            i = i + 1;
        end;
        Seg_TS = concat(Seg_TS, create_segment(T[anchor: anchor + (i - 1)]));
        anchor = anchor + i;
    end;
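As one concrete, runnable rendering of the generic Sliding Window scheme, here is a Python sketch (our code, not the authors'). It fixes the open choices as linear interpolation for the fitted line and residual error for the quality measure, uses 0-based indices, and returns segments as (start, end) index pairs with shared endpoints:

```python
def calculate_error(t):
    """Residual error of the interpolation line through t's endpoints."""
    if len(t) < 2:
        return 0.0
    slope = (t[-1] - t[0]) / (len(t) - 1)
    return sum((t[i] - (t[0] + slope * i)) ** 2 for i in range(len(t)))

def sliding_window(t, max_error):
    """Grow each segment until its error would exceed max_error."""
    segments, anchor = [], 0
    while anchor < len(t) - 1:
        end = anchor + 1                      # smallest segment: two points
        while (end + 1 < len(t) and
               calculate_error(t[anchor:end + 2]) < max_error):
            end += 1                          # extend while still acceptable
        segments.append((anchor, end))
        anchor = end                          # consecutive segments share a point
    return segments
```

For example, a series that rises and then falls with a sharp peak is split at the peak, while a perfectly linear series comes back as a single segment.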
is still less than max_error) then the algorithm continues to increment i
as in the classic algorithm. Otherwise they begin to decrement i until the
measured error is less than max_error. This optimization can greatly speed
up the algorithm if the mean length of segments is large in relation to
the standard deviation of their length. The monotonically nondecreasing
property of residual error also allows a binary search for the length of
the segment. Surprisingly, no one we are aware of has suggested this.
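A sketch of that binary-search idea (our illustration, assuming the supplied `calculate_error` is monotonically nondecreasing in segment length, as residual error is, and that the two-point segment at the anchor is always acceptable):

```python
def longest_valid_segment(t, anchor, max_error, calculate_error):
    """Largest index end with calculate_error(t[anchor:end + 1]) < max_error."""
    lo, hi = anchor + 1, len(t) - 1
    while lo < hi:
        mid = (lo + hi + 1) // 2          # bias upward so the loop terminates
        if calculate_error(t[anchor:mid + 1]) < max_error:
            lo = mid                      # mid is feasible; search higher
        else:
            hi = mid - 1                  # mid is too long; search lower
    return lo
```

Each segment is then located with O(log n) error evaluations instead of O(n).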
The Sliding Window algorithm can give pathologically poor results
under some circumstances, particularly if the time series in question
contains abrupt level changes. Most researchers have not reported this
[Qu et al. (1998), Wang and Wang (2000)], perhaps because they tested the
algorithm on stock market data, and its relative performance is best on
noisy data. Shatkay (1995), in contrast, does notice the problem and gives
elegant examples and explanations [Shatkay (1995)]. They consider three
variants of the basic algorithm, each designed to be robust to a certain
case, but they underline the difficulty of producing a single variant of
the algorithm that is robust to arbitrary data sources.
Park et al. (2001) suggested modifying the algorithm to create
"monotonically changing" segments [Park et al. (2001)]. That is, all
segments consist of data points of the form t1 ≤ t2 ≤ · · · ≤ tn or
t1 ≥ t2 ≥ · · · ≥ tn. This modification worked well on the smooth synthetic
dataset it was demonstrated on, but on real-world datasets with any amount
of noise, the approximation is greatly over-fragmented.
Variations on the Sliding Window algorithm are particularly popular
with the medical community (where it is known as FAN or SAPA), since
patient monitoring is inherently an online task [Ishijima et al. (1983),
Koski et al. (1995), McKee et al. (1994), Vullings et al. (1997)].
2.2. The Top-Down Algorithm
The Top-Down algorithm works by considering every possible partitioning
of the time series and splitting it at the best location. Both subsections
are then tested to see if their approximation error is below some
user-specified threshold. If not, the algorithm recursively continues to
split the subsequences until all the segments have approximation errors
below the threshold. The pseudocode for the algorithm is shown in Table 3.
Variations on the Top-Down algorithm (including the two-dimensional
case) were independently introduced in several fields in the early 1970s.
In cartography, it is known as the Douglas-Peucker algorithm [Douglas and
Table 3. The generic Top-Down algorithm.

Algorithm Seg_TS = Top_Down(T, max_error)
    best_so_far = inf;
    for i = 2 to length(T) - 2                  // Find the best splitting point.
        improvement_in_approximation = improvement_splitting_here(T, i);
        if improvement_in_approximation < best_so_far
            breakpoint = i;
            best_so_far = improvement_in_approximation;
        end;
    end;
    // Recursively split the left segment if necessary.
    if calculate_error(T[1:breakpoint]) > max_error
        Seg_TS = Top_Down(T[1:breakpoint], max_error);
    end;
    // Recursively split the right segment if necessary.
    if calculate_error(T[breakpoint + 1:length(T)]) > max_error
        Seg_TS = Top_Down(T[breakpoint + 1:length(T)], max_error);
    end;
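A runnable Python rendering of the Top-Down scheme (ours, not the authors'). For simplicity it scores a split by the combined interpolation-line residual error of the two halves rather than a separate `improvement_splitting_here` routine, and it returns 0-based (start, end) index pairs into the full series:

```python
def calculate_error(t):
    """Residual error of the interpolation line through t's endpoints."""
    if len(t) < 2:
        return 0.0
    slope = (t[-1] - t[0]) / (len(t) - 1)
    return sum((t[i] - (t[0] + slope * i)) ** 2 for i in range(len(t)))

def top_down(t, max_error, offset=0):
    """Recursively split t until every segment's error is within max_error."""
    if len(t) <= 2 or calculate_error(t) <= max_error:
        return [(offset, offset + len(t) - 1)]
    # Choose the split that minimizes the combined error of the two halves.
    best = min(range(1, len(t) - 1),
               key=lambda i: calculate_error(t[:i + 1]) + calculate_error(t[i:]))
    return (top_down(t[:best + 1], max_error, offset) +
            top_down(t[best:], max_error, offset + best))
```

On a series with a single sharp peak the recursion splits exactly at the peak and stops, since both halves are then nearly linear.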
Peucker (1973)]; in image processing, it is known as Ramer's algorithm
[Ramer (1972)]. Most researchers in the machine learning/data mining
community are introduced to the algorithm in the classic textbook by Duda
and Hart, which calls it "Iterative End-Points Fits" [Duda and Hart (1973)].
In the data mining community, the algorithm has been used by [Li et al.
(1998)] to support a framework for mining sequence databases at multiple
abstraction levels. Shatkay and Zdonik use it (after considering
alternatives such as Sliding Windows) to support approximate queries in
time series databases [Shatkay and Zdonik (1996)].
Park et al. introduced a modification where they first perform a scan
over the entire dataset marking every peak and valley [Park et al. (1999)].
These extreme points are used to create an initial segmentation, and the
Top-Down algorithm is applied to each of the segments (in case the error on
an individual segment was still too high). They then use the segmentation
to support a special case of dynamic time warping. This modification worked
well on the smooth synthetic dataset it was demonstrated on, but on
real-world data sets with any amount of noise, the approximation is greatly
over-fragmented.
Lavrenko et al. use the Top-Down algorithm to support the concurrent
mining of text and time series [Lavrenko et al. (2000)]. They attempt to
discover the influence of news stories on financial markets. Their
algorithm contains some interesting modifications, including a novel
stopping criterion based on the t-test.
Finally, Smyth and Ge use the algorithm to produce a representation
that can support a Hidden Markov Model approach to both change point
detection and pattern matching [Ge and Smyth (2001)].
2.3. The Bottom-Up Algorithm
The Bottom-Up algorithm is the natural complement to the Top-Down
algorithm. The algorithm begins by creating the finest possible
approximation of the time series, so that n/2 segments are used to
approximate the n-length time series. Next, the cost of merging each pair
of adjacent segments is calculated, and the algorithm begins to iteratively
merge the lowest cost pair until a stopping criterion is met. When the pair
of adjacent segments i and i + 1 are merged, the algorithm needs to perform
some bookkeeping. First, the cost of merging the new segment with its right
neighbor must be calculated. In addition, the cost of merging segment i − 1
with its new larger neighbor must be recalculated. The pseudocode for the
algorithm is shown in Table 4.
Two- and three-dimensional analogues of this algorithm are common in
the field of computer graphics, where they are called decimation methods
[Heckbert and Garland (1997)]. In data mining, the algorithm has been
used extensively by two of the current authors to support a variety of time
series data mining tasks [Keogh and Pazzani (1999), Keogh and Pazzani
(1998), Keogh and Smyth (1997)]. In medicine, the algorithm was used
by Hunter and McIntosh to provide the high level representation for their
medical pattern matching system [Hunter and McIntosh (1999)].
Table 4. The generic Bottom-Up algorithm.

Algorithm Seg_TS = Bottom_Up(T, max_error)
    for i = 1 : 2 : length(T)                        // Create initial fine approximation.
        Seg_TS = concat(Seg_TS, create_segment(T[i : i + 1]));
    end;
    for i = 1 : length(Seg_TS) - 1                   // Find merging costs.
        merge_cost(i) = calculate_error([merge(Seg_TS(i), Seg_TS(i + 1))]);
    end;
    while min(merge_cost) < max_error                // While not finished.
        p = min(merge_cost);                         // Find "cheapest" pair to merge.
        Seg_TS(p) = merge(Seg_TS(p), Seg_TS(p + 1)); // Merge them.
        delete(Seg_TS(p + 1));                       // Update records.
        merge_cost(p) = calculate_error(merge(Seg_TS(p), Seg_TS(p + 1)));
        merge_cost(p - 1) = calculate_error(merge(Seg_TS(p - 1), Seg_TS(p)));
    end;
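As a concrete illustration of the logic in Table 4, here is a minimal Python sketch of Bottom-Up segmentation. This is our own illustrative code, not the authors' implementation: segments are tracked as (start, end) index pairs, and the per-segment error is the residual sum of squares of a least-squares line computed with numpy.polyfit.

```python
import numpy as np

def segment_error(T, start, end):
    """Sum-of-squares residual of the best-fit line over T[start:end+1]."""
    y = np.asarray(T[start:end + 1], dtype=float)
    if len(y) < 3:
        return 0.0                     # a line fits two points exactly
    x = np.arange(len(y), dtype=float)
    a, b = np.polyfit(x, y, 1)
    return float(np.sum((y - (a * x + b)) ** 2))

def bottom_up(T, max_error):
    """Greedily merge adjacent segments until every remaining merge
    would exceed max_error.  Returns a list of (start, end) pairs."""
    n = len(T)
    # Finest possible approximation: n/2 segments of two points each.
    segments = [(i, min(i + 1, n - 1)) for i in range(0, n, 2)]
    costs = [segment_error(T, segments[i][0], segments[i + 1][1])
             for i in range(len(segments) - 1)]
    while costs and min(costs) < max_error:
        p = costs.index(min(costs))            # cheapest adjacent pair
        segments[p] = (segments[p][0], segments[p + 1][1])   # merge
        del segments[p + 1]
        del costs[p]
        # Bookkeeping: recompute the merge costs with the new,
        # larger segment's right and left neighbors.
        if p < len(segments) - 1:
            costs[p] = segment_error(T, segments[p][0], segments[p + 1][1])
        if p > 0:
            costs[p - 1] = segment_error(T, segments[p - 1][0], segments[p][1])
    return segments
```

A linear scan for the cheapest pair keeps the sketch short; as the complexity analysis below notes, a heap brings the per-merge bookkeeping down to O(log n).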
2.4. Feature Comparison of the Major Algorithms
We have deliberately deferred the discussion of the running times of the
algorithms until now, when the reader's intuition for the various approaches
is more developed. The running time for each approach is data dependent.
For that reason, we discuss both a worst-case time that gives an upper
bound and a best-case time that gives a lower bound for each approach.
We use the standard notation of Ω(f(n)) for a lower bound, O(f(n)) for
an upper bound, and Θ(f(n)) for a function that is both a lower and upper
bound.
Definitions and Assumptions. The number of data points is n, the
number of segments we plan to create is K, and thus the average segment
length is L = n/K. The actual length of segments created by an algorithm
varies, and we will refer to the lengths as L_i.
All algorithms, except Top-Down, perform considerably worse if we allow
any of the L_i to become very large (say n/4), so we assume that the
algorithms limit the maximum length L_i to some multiple of the average
length. It is trivial to code the algorithms to enforce this, so the time
analysis that follows is exact when the algorithm includes this limit.
Empirical results show, however, that the segments generated (with no limit
on length) are tightly clustered around the average length, so this limit
has little effect in practice.
We assume that for each set S of points, we compute a best segment
and compute the error in Θ(n) time. This reflects the way these algorithms
are coded in practice, which is to use a packaged algorithm or function to
do linear regression. We note, however, that we believe one can produce
asymptotically faster algorithms if one custom codes linear regression (or
other best-fit algorithms) to reuse computed values so that the computation
is done in less than O(n) time in subsequent steps. We leave that as a topic
for future work. In what follows, all computations of best segment and error
are assumed to be Θ(n).
Top-Down. The best time for Top-Down occurs if each split occurs at
the midpoint of the data. The first iteration computes, for each split point
i, the best line for points [1, i] and for points [i + 1, n]. This takes Θ(n) for
each split point, or Θ(n²) total for all split points. The next iteration finds
split points for [1, n/2] and for [n/2 + 1, n]. This gives a recurrence T(n) =
2T(n/2) + Θ(n²), where we have T(2) = c, and this solves to T(n) = Ω(n²).
This is a lower bound because we assumed the data has the best possible
split points.
The worst time occurs if the computed split point is always at one side
(leaving just 2 points on one side), rather than the middle. The recurrence
is T(n) = T(n − 2) + Θ(n²). We must stop after K iterations, giving a time
of O(n²K).
Sliding Windows. For this algorithm, we compute best segments for
larger and larger windows, going from 2 up to at most cL (by the assumption
we discussed above). The maximum time to compute a single segment is
Σ_{i=2}^{cL} Θ(i) = Θ(L²). The number of segments can be as few as
n/cL = K/c or as many as K. The time is thus Θ(L²K) or Θ(Ln). This is
both a best-case and worst-case bound.
Bottom-Up. The first iteration computes the segment through each
pair of points and the costs of merging adjacent segments. This is easily
seen to take O(n) time. In the following iterations, we look up the
minimum-error pair i and i + 1 to merge; merge the pair into a new segment
S_new; delete from a heap (keeping track of costs is best done with a heap)
the costs of merging segments i − 1 and i and merging segments i + 1 and
i + 2; compute the costs of merging S_new with S_{i−1} and with S_{i+2};
and insert these costs into our heap of costs. The time to look up the best
cost is Θ(1) and the time to add and delete costs from the heap is O(log n).
(The time to construct the heap is O(n).)
In the best case, the merged segments always have about equal length,
and the final segments have length L. The time to merge a set of length-2
segments, which will end up being one length-L segment, into half as many
segments is Θ(L) (for the time to compute the best segment for every pair
of merged segments), not counting heap operations. Each iteration takes
the same time; repeating Θ(log L) times gives a segment of size L.
The number of times we produce length-L segments is K, so the total
time is Ω(KL log L) = Ω(n log(n/K)). The heap operations may take as
much as O(n log n). For a lower bound we have proven just Ω(n log(n/K)).
In the worst case, the merges always involve a short and a long segment,
and the final segments are mostly of length cL. The time to compute the
cost of merging a length-2 segment with a length-i segment is Θ(i), and the
time to reach a length-cL segment is Σ_{i=2}^{cL} Θ(i) = Θ(L²). There are
at most n/cL such segments to compute, so the time is n/cL × Θ(L²) =
O(Ln). (Time for heap operations is inconsequential.) This complexity
study is summarized in Table 5.
In addition to the time complexity there are other features a practitioner
might consider when choosing an algorithm. First there is the question of
Table 5. A feature summary for the 3 major algorithms.

    Algorithm        User can specify(1)   Online   Complexity
    Top-Down         E, ME, K              No       O(n²K)
    Bottom-Up        E, ME, K              No       O(Ln)
    Sliding Window   E                     Yes      O(Ln)

    (1) KEY: E → maximum error for a given segment, ME →
    maximum error for the entire time series, K → number of
    segments.
whether the algorithm is online or batch. Secondly, there is the question
of how the user can specify the quality of the desired approximation. With
trivial modifications, the Bottom-Up algorithm allows the user to specify
the desired value of K, the maximum error per segment, or the total error
of the approximation. A (non-recursive) implementation of Top-Down can
also be made to support all three options. However, Sliding Window only
allows the maximum error per segment to be specified.
3. Empirical Comparison of the Major Segmentation Algorithms
In this section, we provide an extensive empirical comparison of the
three major algorithms. It is possible to create artificial datasets that allow
one of the algorithms to achieve zero error (by any measure), but force
the other two approaches to produce arbitrarily poor approximations. In
contrast, testing on purely random data forces all the algorithms to produce
essentially the same results. To overcome the potential for biased
results, we tested the algorithms on a very diverse collection of datasets.
These datasets were chosen to represent the extremes along the following
dimensions: stationary/non-stationary, noisy/smooth, cyclical/non-cyclical,
symmetric/asymmetric, etc. In addition, the datasets represent
the diverse areas in which data miners apply their algorithms, including
finance, medicine, manufacturing and science. Figure 3 illustrates the
10 datasets used in the experiments.
3.1. Experimental Methodology
For simplicity and brevity, we only include the linear regression versions
of the algorithms in our study. Since linear regression minimizes the sum
of squares error, it also minimizes the Euclidean distance (the Euclidean
distance is just the square root of the sum of squares). Euclidean distance,
or some measure derived from it, is by far the most common metric
used in data mining of time series [Agrawal et al. (1993), Agrawal et al.
(1995), Chan and Fu (1999), Das et al. (1998), Keogh et al. (2000), Keogh
and Pazzani (1999), Keogh and Pazzani (1998), Keogh and Smyth (1997),
Qu et al. (1998), Wang and Wang (2000)]. The linear interpolation versions
of the algorithms, by definition, will always have a greater sum of
squares error.

Fig. 3. The 10 datasets used in the experiments. (i) Radio Waves. (ii) Exchange
Rates. (iii) Tickwise II. (iv) Tickwise I. (v) Water Level. (vi) Manufacturing. (vii) ECG.
(viii) Noisy Sine Cubed. (ix) Sine Cubed. (x) Space Shuttle.
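The relationship between the two error measures can be made explicit: the Euclidean distance is the square root of the sum of squares, and since the square root is monotonic, any fit that minimizes one minimizes the other. A two-function Python sketch (the function names are ours, purely for illustration):

```python
import math

def sum_of_squares(a, b):
    """Sum of squared differences; what linear regression minimizes."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def euclidean(a, b):
    """Square root of the sum of squares; sqrt is monotonic, so
    minimizing one measure minimizes the other."""
    return math.sqrt(sum_of_squares(a, b))
```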
We immediately encounter a problem when attempting to compare the
algorithms. We cannot compare them for a fixed number of segments, since
Sliding Windows does not allow one to specify the number of segments.
Instead we give each of the algorithms a fixed max_error and measure the
total error of the entire piecewise approximation.
The performance of the algorithms depends on the value of max_error.
As max_error goes to zero, all the algorithms have the same performance,
since they would produce n/2 segments with no error. At the opposite end,
as max_error becomes very large, the algorithms once again will all have
the same performance, since they all simply approximate T with a single
best-fit line. Instead, we must test the relative performance for some
reasonable value of max_error, a value that achieves a good trade-off between
compression and fidelity. Because this "reasonable value" is subjective and
dependent on the data mining application and the data itself, we did the
following. We chose what we considered a "reasonable value" of max_error
for each dataset, and then we bracketed it with 6 values separated by powers
of two. The lowest of these values tends to produce an over-fragmented
approximation, and the highest tends to produce a very coarse approximation.
So in general, the performance in the mid-range of the 6 values should be
considered most important. Figure 4 illustrates this idea.
Fig. 4. We are most interested in comparing the segmentation algorithms at the
setting of the user-defined threshold max_error that produces an intuitively correct
level of approximation. Since this setting is subjective, we chose a value for E such
that max_error = E × 2^i (i = 1 to 6) brackets the range of reasonable
approximations, from too fine an approximation, through the "correct" approximation,
to too coarse an approximation.
Since we are only interested in the relative performance of the algorithms,
for each setting of max_error on each dataset, we normalized the
performance of the 3 algorithms by dividing by the error of the worst
performing approach.
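The experimental protocol just described, six max_error settings per dataset and per-experiment errors normalized by the worst performer, can be sketched in a few lines (the helper names are ours, not from the chapter):

```python
def bracketed_thresholds(E, count=6):
    """The settings max_error = E * 2^i for i = 1..count, bracketing
    a subjectively 'reasonable' base value E."""
    return [E * 2 ** i for i in range(1, count + 1)]

def normalize_by_worst(errors):
    """Scale the per-algorithm total errors so the worst one is 1.0,
    giving the relative performance plotted in the figures."""
    worst = max(errors)
    return [e / worst for e in errors]
```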
3.2. Experimental Results
The experimental results are summarized in Figure 5. The most obvious
result is the generally poor quality of the Sliding Windows algorithm. With
a few exceptions, it is the worst performing algorithm, usually by a large
amount.
Comparing the results for Sine Cubed and Noisy Sine Cubed supports our
conjecture that the noisier a dataset, the less difference one can expect between
algorithms. This suggests that one should exercise caution in attempting
to generalize the performance of an algorithm that has only been
demonstrated on a single noisy dataset [Qu et al. (1998), Wang and Wang (2000)].
Top-Down does occasionally beat Bottom-Up, but only by a small amount.
On the other hand, Bottom-Up often significantly outperforms Top-Down,
especially on the ECG, Manufacturing and Water Level datasets.
Fig. 5. A comparison of the three major time series segmentation algorithms, on ten
diverse datasets (Space Shuttle, Sine Cubed, Noisy Sine Cubed, ECG, Manufacturing,
Water Level, Tickwise 1, Tickwise 2, Exchange Rate, Radio Waves), over a range of
parameters (max_error = E × 2^1 to E × 2^6). Each experimental result (i.e. a triplet
of histogram bars) is normalized by dividing by the performance of the worst algorithm
on that experiment.

4. A New Approach
Given the noted shortcomings of the major segmentation algorithms, we
investigated alternative techniques. The main problem with the Sliding
Windows algorithm is its inability to look ahead, lacking the global view
of its offline (batch) counterparts. The Bottom-Up and the Top-Down
approaches produce better results, but are offline and require scanning
the entire dataset. This is impractical, or may even be infeasible, in
a data mining context, where the data are on the order of terabytes or arrive
in continuous streams. We therefore introduce a novel approach in which
we capture the online nature of Sliding Windows and yet retain the
superiority of Bottom-Up. We call our new algorithm SWAB (Sliding Window
and Bottom-up).
4.1. The SWAB Segmentation Algorithm
The SWAB algorithm keeps a buffer of size w. The buffer size should
initially be chosen so that there is enough data to create about 5 or 6 segments.
Bottom-Up is applied to the data in the buffer and the leftmost segment
is reported. The data corresponding to the reported segment is removed
from the buffer and more datapoints are read in. The number of datapoints
read in depends on the structure of the incoming data. This process is
performed by the Best_Line function, which is basically just classic Sliding
Windows. These points are incorporated into the buffer and Bottom-Up is
applied again. This process of applying Bottom-Up to the buffer, reporting
the leftmost segment, and reading in the next "best fit" subsequence is
repeated as long as data arrives (potentially forever).
The intuition behind the algorithm is this. The Best_Line function
finds data corresponding to a single segment using the (relatively poor)
Sliding Windows and gives it to the buffer. As the data moves through the
buffer, the (relatively good) Bottom-Up algorithm is given a chance to refine
the segmentation, because it has a "semi-global" view of the data. By the
time the data is ejected from the buffer, the segmentation breakpoints are
usually the same as the ones the batch version of Bottom-Up would have
chosen. Table 6 shows the pseudocode for the algorithm.
Table 6. The SWAB (Sliding Window and Bottom-up) algorithm.

Algorithm Seg_TS = SWAB(max_error, seg_num)     // seg_num is a small integer, i.e. 5 or 6.
    read in w number of data points             // Enough to approximate seg_num segments.
    lower_bound = w / 2;
    upper_bound = 2 * w;
    while data at input
        T = Bottom_Up(w, max_error)             // Call the Bottom-Up algorithm.
        Seg_TS = CONCAT(Seg_TS, T(1));
        w = TAKEOUT(w, w');                     // Delete the w' points in T(1) from w.
        if data at input                        // Add w'' points from BEST_LINE() to w.
            w = CONCAT(w, BEST_LINE(max_error));
            {check upper and lower bound, adjust if necessary}
        else                                    // Flush approximated segments from buffer.
            Seg_TS = CONCAT(Seg_TS, (T - T(1)))
        end;
    end;

Function S = BEST_LINE(max_error)               // Returns S points to approximate
    while error ≤ max_error                     // the next potential segment.
        read in one additional data point, d, into S
        S = CONCAT(S, d);
        error = approx_segment(S);
    end while;
    return S;
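The pseudocode above can be sketched in Python. This is an illustrative reconstruction under simplifying assumptions, not the authors' code: the buffer bookkeeping omits the explicit upper and lower bounds on the buffer size, the compact bottom_up re-scans all merge costs each pass (a heap would be faster), and all names are ours.

```python
import numpy as np

def seg_error(points):
    """Residual sum of squares around the best-fit line."""
    y = np.asarray(points, dtype=float)
    if len(y) < 3:
        return 0.0                      # a line fits two points exactly
    x = np.arange(len(y), dtype=float)
    a, b = np.polyfit(x, y, 1)
    return float(np.sum((y - (a * x + b)) ** 2))

def bottom_up(points, max_error):
    """Compact batch Bottom-Up over a list of values; returns a list
    of segments, each a list of values."""
    segs = [points[i:i + 2] for i in range(0, len(points), 2)]
    while len(segs) > 1:
        costs = [seg_error(segs[i] + segs[i + 1])
                 for i in range(len(segs) - 1)]
        p = costs.index(min(costs))
        if costs[p] >= max_error:
            break
        segs[p:p + 2] = [segs[p] + segs[p + 1]]
    return segs

def swab(stream, max_error, seg_num=6):
    """Online SWAB: run Bottom-Up over a buffer holding roughly
    seg_num segments' worth of data, emitting the leftmost segment
    each round (a generator of lists of values)."""
    stream = iter(stream)

    def best_line():
        # Classic Sliding Windows: grow until the fit exceeds max_error.
        piece = []
        for d in stream:
            piece.append(d)
            if len(piece) > 2 and seg_error(piece) > max_error:
                break
        return piece

    buf = []
    for _ in range(seg_num):            # initial buffer fill
        buf.extend(best_line())
    while buf:
        T = bottom_up(buf, max_error)
        piece = best_line()
        if piece:                       # more data: emit leftmost only
            yield T[0]
            buf = buf[len(T[0]):] + piece
        else:                           # stream exhausted: flush buffer
            for seg in T:
                yield seg
            buf = []
```

On a stream that is one perfect line, the whole input is emitted as a single segment; on a stream with a sharp corner, the breakpoint lands at the corner, as the batch algorithm would choose.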
Using the buffer allows us to gain a "semi-global" view of the data set for
Bottom-Up. However, it is important to impose upper and lower bounds on
the size of the window. A buffer that is allowed to grow arbitrarily large will
revert our algorithm to pure Bottom-Up, but a small buffer will deteriorate
it to Sliding Windows, and excessive fragmentation may occur. In our
algorithm, we used an upper (and lower) bound of twice (and half) the
initial buffer size.
Our algorithm can be seen as operating on a continuum between the
two extremes of Sliding Windows and Bottom-Up. The surprising result
(demonstrated below) is that by allowing the buffer to contain just 5 or
6 times the data normally contained in a single segment, the algorithm
produces essentially the same results as Bottom-Up, yet is able to process
a never-ending stream of data. Our new algorithm requires only a small,
constant amount of memory, and the time complexity is a small constant
factor worse than that of the standard Bottom-Up algorithm.
4.2. Experimental Validation
We repeated the experiments in Section 3, this time comparing the new
algorithm with pure (batch) Bottom-Up and classic Sliding Windows. The
result, summarized in Figure 6, is that the new algorithm produces results
that are essentially identical to Bottom-Up. The reader may be surprised
that SWAB can sometimes be slightly better than Bottom-Up. The reason
this can occur is that SWAB is exploring a slightly larger search
space. Every segment in Bottom-Up must have an even number of
datapoints, since it was created by merging other segments that also had an
even number of datapoints. The only possible exception is the rightmost
segment, which can have an odd number of datapoints if the original time
series had an odd length. Since this happens multiple times for SWAB, it is
effectively searching a slightly larger search space.
5. Conclusions and Future Directions
We have seen the first extensive review and empirical comparison of time
series segmentation algorithms from a data mining perspective. We have
shown that the most popular approach, Sliding Windows, generally produces
very poor results, and that while the second most popular approach,
Top-Down, can produce reasonable results, it does not scale well. In contrast,
the least well known, Bottom-Up, approach produces excellent results and
scales linearly with the size of the dataset.
Fig. 6. A comparison of the SWAB algorithm with pure (batch) Bottom-Up and classic
Sliding Windows, on ten diverse datasets (Space Shuttle, Sine Cubed, Noisy Sine Cubed,
ECG, Manufacturing, Water Level, Tickwise 1, Tickwise 2, Exchange Rate, Radio
Waves), over a range of parameters (max_error = E × 2^1 to E × 2^6). Each
experimental result (i.e. a triplet of histogram bars) is normalized by dividing by the
performance of the worst algorithm on that experiment.
In addition, we have introduced SWAB, a new online algorithm, which
scales linearly with the size of the dataset, requires only constant space and
produces high quality approximations of the data.
There are several directions in which this work could be expanded.
• The performance of Bottom-Up is particularly surprising, given that it
explores a smaller space of representations. Because the initialization
phase of the algorithm begins with all line segments having length two,
all merged segments will also have even lengths. In contrast, the two
other algorithms allow segments to have odd or even lengths. It would be
interesting to see if removing this limitation of Bottom-Up can improve
its performance further.
• For simplicity and brevity, we have assumed that the inner loop of the
SWAB algorithm simply invokes the Bottom-Up algorithm each time.
This clearly results in some computational redundancy. We believe we may
be able to reuse calculations from previous invocations of Bottom-Up,
thus achieving speedup.
Reproducible Results Statement: In the interests of competitive
scientific inquiry, all datasets and code used in this work are freely available
at the University of California Riverside, Time Series Data Mining Archive
{www.cs.ucr.edu/~eamonn/TSDMA/index.html}.
References
1. Agrawal, R., Faloutsos, C., and Swami, A. (1993). Efficient Similarity Search
in Sequence Databases. Proceedings of the 4th Conference on Foundations of
Data Organization and Algorithms, pp. 69–84.
2. Agrawal, R., Lin, K.-I., Sawhney, H.S., and Shim, K. (1995). Fast Similarity
Search in the Presence of Noise, Scaling, and Translation in Time-Series
Databases. Proceedings of the 21st International Conference on Very Large
Data Bases, pp. 490–501.
3. Chan, K. and Fu, W. (1999). Efficient Time Series Matching by Wavelets.
Proceedings of the 15th IEEE International Conference on Data Engineering,
pp. 126–133.
4. Das, G., Lin, K.-I., Mannila, H., Renganathan, G., and Smyth, P. (1998).
Rule Discovery from Time Series. Proceedings of the 4th International
Conference on Knowledge Discovery and Data Mining, pp. 16–22.
5. Douglas, D.H. and Peucker, T.K. (1973). Algorithms for the Reduction of the
Number of Points Required to Represent a Digitized Line or its Caricature.
Canadian Cartographer, 10(2), December, pp. 112–122.
6. Duda, R.O. and Hart, P.E. (1973). Pattern Classification and Scene Analysis.
Wiley, New York.
7. Ge, X. and Smyth, P. (2001). Segmental Semi-Markov Models for Endpoint
Detection in Plasma Etching. IEEE Transactions on Semiconductor
Engineering.
8. Heckbert, P.S. and Garland, M. (1997). Survey of Polygonal Surface
Simplification Algorithms, Multiresolution Surface Modeling Course.
Proceedings of the 24th International Conference on Computer Graphics and
Interactive Techniques.
9. Hunter, J. and McIntosh, N. (1999). Knowledge-Based Event Detection in
Complex Time Series Data. Artificial Intelligence in Medicine, Springer,
pp. 271–280.
10. Ishijima, M., et al. (1983). Scan-Along Polygonal Approximation for Data
Compression of Electrocardiograms. IEEE Transactions on Biomedical
Engineering, BME-30(11), 723–729.
11. Koski, A., Juhola, M., and Meriste, M. (1995). Syntactic Recognition of ECG
Signals by Attributed Finite Automata. Pattern Recognition, 28(12),
1927–1940.
12. Keogh, E., Chakrabarti, K., Pazzani, M., and Mehrotra, S. (2000).
Dimensionality Reduction for Fast Similarity Search in Large Time Series
Databases. Journal of Knowledge and Information Systems, 3(3), 263–286.
13. Keogh, E. and Pazzani, M. (1998). An Enhanced Representation of Time
Series which Allows Fast and Accurate Classification, Clustering and
Relevance Feedback. Proceedings of the 4th International Conference on
Knowledge Discovery and Data Mining, AAAI Press, pp. 239–241.
14. Keogh, E. and Pazzani, M. (1999). Relevance Feedback Retrieval of Time
Series Data. Proceedings of the 22nd Annual International ACM-SIGIR
Conference on Research and Development in Information Retrieval,
pp. 183–190.
15. Keogh, E. and Smyth, P. (1997). A Probabilistic Approach to Fast Pattern
Matching in Time Series Databases. Proceedings of the 3rd International
Conference on Knowledge Discovery and Data Mining, pp. 24–30.
16. Last, M., Klein, Y., and Kandel, A. (2001). Knowledge Discovery in Time
Series Databases. IEEE Transactions on Systems, Man, and Cybernetics,
31B(1), 160–169.
17. Lavrenko, V., Schmill, M., Lawrie, D., Ogilvie, P., Jensen, D., and Allan, J.
(2000). Mining of Concurrent Text and Time Series. Proceedings of the 6th
International Conference on Knowledge Discovery and Data Mining,
pp. 37–44.
18. Li, C., Yu, P., and Castelli, V. (1998). MALM: A Framework for Mining
Sequence Database at Multiple Abstraction Levels. Proceedings of the 7th
International Conference on Information and Knowledge Management,
pp. 267–272.
19. McKee, J.J., Evans, N.E., and Owens, F.J. (1994). Efficient Implementation
of the Fan/SAPA-2 Algorithm Using Fixed Point Arithmetic. Automedica,
16, 109–117.
20. Osaki, R., Shimada, M., and Uehara, K. (1999). Extraction of Primitive
Motion for Human Motion Recognition. Proceedings of the 2nd International
Conference on Discovery Science, pp. 351–352.
21. Park, S., Kim, S.W., and Chu, W.W. (2001). Segment-Based Approach for
Subsequence Searches in Sequence Databases. Proceedings of the 16th ACM
Symposium on Applied Computing, pp. 248–252.
22. Park, S., Lee, D., and Chu, W.W. (1999). Fast Retrieval of Similar
Subsequences in Long Sequence Databases. Proceedings of the 3rd IEEE
Knowledge and Data Engineering Exchange Workshop.
23. Pavlidis, T. (1976). Waveform Segmentation Through Functional
Approximation. IEEE Transactions on Computers, pp. 689–697.
24. Perng, C., Wang, H., Zhang, S., and Parker, S. (2000). Landmarks: A New
Model for Similarity-Based Pattern Querying in Time Series Databases.
Proceedings of the 16th International Conference on Data Engineering,
pp. 33–45.
25. Qu, Y., Wang, C., and Wang, S. (1998). Supporting Fast Search in
Time Series for Movement Patterns in Multiple Scales. Proceedings of the
7th International Conference on Information and Knowledge Management,
pp. 251–258.
26. Ramer, U. (1972). An Iterative Procedure for the Polygonal Approximation
of Planar Curves. Computer Graphics and Image Processing, 1, 244–256.
27. Shatkay, H. (1995). Approximate Queries and Representations for Large
Data Sequences. Technical Report cs-95-03, Department of Computer
Science, Brown University.
28. Shatkay, H. and Zdonik, S. (1996). Approximate Queries and
Representations for Large Data Sequences. Proceedings of the 12th IEEE
International Conference on Data Engineering, pp. 546–553.
29. Sugiura, N. and Ogden, R.T. (1994). Testing Change-Points with Linear
Trend. Communications in Statistics B: Simulation and Computation, 23,
287–322.
30. Vullings, H.J.L.M., Verhaegen, M.H.G., and Verbruggen, H.B. (1997). ECG
Segmentation Using Time-Warping. Proceedings of the 2nd International
Symposium on Intelligent Data Analysis, pp. 275–286.
31. Wang, C. and Wang, S. (2000). Supporting Content-Based Searches on Time
Series Via Approximation. Proceedings of the 12th International Conference
on Scientific and Statistical Database Management, pp. 69–81.
CHAPTER 2
A SURVEY OF RECENT METHODS FOR EFFICIENT
RETRIEVAL OF SIMILAR TIME SEQUENCES
Magnus Lie Hetland
Norwegian University of Science and Technology
Sem Sælands vei 7–9
NO-7491 Trondheim, Norway
Email: magnus@hetland.org
Time sequences occur in many applications, ranging from science and
technology to business and entertainment. In many of these applications,
searching through large, unstructured databases based on sample
sequences is often desirable. Such similarity-based retrieval has attracted
a great deal of attention in recent years. Although several different
approaches have appeared, most are based on the common premise of
dimensionality reduction and spatial access methods. This chapter gives
an overview of recent research and shows how the methods fit into a
general context of signature extraction.
Keywords: Information retrieval; sequence databases; similarity search;
spatial indexing; time sequences.
1. Introduction
Time sequences arise in many applications—any applications that involve
storing sensor inputs, or sampling a value that changes over time. A problem
which has received an increasing amount of attention lately is the problem
of similarity retrieval in databases of time sequences, so-called "query by
example." Some uses of this are [Agrawal et al. (1993)]:
• Identifying companies with similar patterns of growth.
• Determining products with similar selling patterns.
• Discovering stocks with similar movement in stock prices.
• Finding out whether a musical score is similar to one of a set of
copyrighted scores.
• Finding portions of seismic waves that are not similar to spot geological
irregularities.
Applications range from medicine, through economy, to scientific
disciplines such as meteorology and astrophysics [Faloutsos et al. (1994), Yi and
Faloutsos (2000)].
The running times of simple algorithms for comparing time sequences
are generally polynomial in the length of both sequences, typically linear or
quadratic. To find the correct offset of a query in a large database, a naive
sequential scan will require a number of such comparisons that is linear in
the length of the database. This means that, given a query of length m and
a database of length n, the search will have a time complexity of O(nm),
or even O(nm²) or worse. For large databases this is clearly unacceptable.
Many methods are known for performing this sort of query in the domain
of strings over ﬁnite alphabets, but with time sequences there are a few extra
issues to deal with:
• The range of values is not generally ﬁnite, or even discrete.
• The sampling rate may not be constant.
• The presence of noise in various forms makes it necessary to support very
ﬂexible similarity measures.
This chapter describes some of the recent advances that have been made
in this field: methods that allow for indexing of time sequences using flexible
similarity measures that are invariant under a wide range of transformations
and error sources.
The chapter is structured as follows: Section 2 gives a more formal
presentation of the problem of similarity-based retrieval and the so-called
dimensionality curse; Section 3 describes the general approach of
signature-based retrieval, or shrink and search, as well as three specific
methods using this approach; Section 4 shows some other approaches, while
Section 5 concludes the chapter. Finally, the Appendix gives an overview of
some basic distance measures.(1)

(1) The term "distance" is used loosely in this paper. A distance measure is simply the
inverse of a similarity measure and is not required to obey the metric axioms.
1.1. Terminology and Notation
A time sequence x = x_1, ..., x_n is an ordered collection of elements
x_i = (v_i, t_i), each consisting of a value v_i and a timestamp t_i.
Abusing the notation slightly, the value of x_i may be referred to as x_i.
For some retrieval methods, the values may be taken from a finite class
of values [Mannila and Ronkainen (1997)], or may have more than one
dimension [Lee et al. (2000)], but it is generally assumed that the values
are real numbers. This assumption is a requirement for most of the methods
described in this chapter.
The only requirement of the timestamps is that they be non-decreasing
(or, in some applications, strictly increasing) with respect to the sequence
indices:

    t_i ≤ t_j  ⇔  i ≤ j.    (1)
In some methods, an additional assumption is that the elements are
equispaced: for every two consecutive elements x_i and x_{i+1} we have

    t_{i+1} − t_i = ∆, (2)

where ∆ (the sampling interval of x) is a (positive) constant. If the actual
sampling rate is not important, ∆ may be normalized to 1, and t_1 to 0. It
is also possible to resample the sequence to make the elements equispaced,
when required.
The length of a time sequence x is its cardinality, written as |x|. The
contiguous subsequence of x containing elements x_i to x_j (inclusive) is
written x_{i:j}. A signature of a sequence x is some structure that somehow
represents x, yet is simpler than x. In the context of this chapter, such
a signature will always be a vector of fixed size k. (For a more thorough
discussion of signatures, see Section 3.) Such a signature is written x̃. For
a summary of the notation, see Table 1.
Table 1. Notation.

    x        A sequence
    x̃        A signature of x
    x_i      Element number i of x
    x_{i:j}  Elements i to j (inclusive) of x
    |x|      The length of x
2. The Problem
The problem of retrieving similar time sequences may be stated as follows:
Given a sequence q, a set of time sequences X, a (non-negative) distance
measure d, and a tolerance threshold ε, find the set R of sequences closer
to q than ε, or, more precisely:

    R = {x ∈ X : d(q, x) ≤ ε}. (3)

Alternatively, one might wish to find the k nearest neighbours of q, which
amounts to setting ε so that |R| = k. The parameter ε is typically supplied
by the user, while the distance function d is domain-dependent. Several
distance measures will be described rather informally in this chapter. For
more formal definitions, see the Appendix.
Figure 1 illustrates the problem for Euclidean distance in two dimensions.
In this example, the vector x will be included in the result set R,
while y will not.
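Expressed in code, the range query in (3) and its nearest-neighbour variant reduce to a filter and a sort. The following is a minimal sketch (the function names and the choice of Euclidean distance as the base distance d are illustrative, not taken from any particular method in this chapter):

```python
import math

def euclidean(q, x):
    # Base distance d: Euclidean distance between equal-length sequences.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(q, x)))

def range_query(q, X, d, eps):
    # R = {x in X : d(q, x) <= eps}
    return [x for x in X if d(q, x) <= eps]

def knn_query(q, X, d, k):
    # Equivalent to choosing eps so that |R| = k.
    return sorted(X, key=lambda x: d(q, x))[:k]

X = [[0.0, 0.0], [1.0, 1.0], [3.0, 4.0]]
q = [0.0, 0.0]
R = range_query(q, X, euclidean, 2.0)   # keeps the two closest sequences
nn = knn_query(q, X, euclidean, 1)      # the single nearest neighbour
```

Both queries here scan X linearly; the point of the indexing methods surveyed below is to avoid exactly this scan.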
A useful variation of the problem is to find a set of subsequences of the
sequences in X. This, in the basic case, requires comparing q not only to
all elements of X, but to all possible subsequences.²
If a method retrieves a subset S of R, the wrongly dismissed sequences
in R −S are called false dismissals. Conversely, if S is a superset of R, the
sequences in S − R are called false alarms.
Fig. 1. Similarity retrieval.
² Except in the description of LCS in the Appendix, subsequence means contiguous
subsequence, or segment.
2.1. Robust Distance Measures
The choice of distance measure is highly domain-dependent, and in some
cases a simple L_p norm such as the Euclidean distance may be sufficient.
However, in many cases this may be too brittle [Keogh and Pazzani (1999b)],
since it does not tolerate such transformations as scaling, warping, or
translation along either axis. Many of the newer retrieval methods focus on using
more robust distance measures, which are invariant under such transformations
as time warping (see the Appendix for details), without loss of performance.
2.2. Good Indexing Methods
Faloutsos et al. (1994) list the following desirable properties for an indexing
method:
(i) It should be faster than a sequential scan.
(ii) It should incur little space overhead.
(iii) It should allow queries of various lengths.
(iv) It should allow insertions and deletions without rebuilding the index.
(v) It should be correct: No false dismissals must occur.
To achieve high performance, the number of false alarms should also be
low. Keogh et al. (2001b) add the following criteria to the list above:
(vi) It should be possible to build the index in reasonable time.
(vii) The index should preferably be able to handle more than one distance
measure.
2.3. Spatial Indices and the Dimensionality Curse
The general problem of similarity-based retrieval is well known in the field of
information retrieval, and many indexing methods exist to process queries
efficiently [Baeza-Yates and Ribeiro-Neto (1999)]. However, certain properties
of time sequences make the standard methods unsuitable. The fact
that the value ranges of the sequences usually are continuous, and that
the elements may not be equispaced, makes it difficult to use standard
text-indexing techniques such as suffix trees. One of the most promising
techniques is multidimensional indexing (R-trees [Guttman (1984)], for
instance), in which the objects in question are multidimensional vectors,
and similar objects can be retrieved in sublinear time. One requirement of
such spatial access methods is that the distance measure must be monotonic
in all dimensions, usually satisfied through the somewhat stricter requirement
of the triangle inequality (d(x, z) ≤ d(x, y) + d(y, z)).
One important problem that occurs when trying to index sequences with
spatial access methods is the so-called dimensionality curse: spatial indices
typically work only when the number of dimensions is low [Chakrabarti
and Mehrotra (1999)]. This makes it infeasible to code the entire sequence
directly as a vector in an indexed space.
The general solution to this problem is dimensionality reduction: to
condense the original sequences into signatures in a signature space of low
dimensionality, in a manner that, to some extent, preserves the distances
between them. One can then index the signature space.
3. Signature-Based Similarity Search
A time sequence x of length n can be considered a vector or point in an
n-dimensional space. Techniques exist (spatial access methods, such as the
R-tree and its variants [Chakrabarti and Mehrotra (1999), Wang and Perng
(2001), Sellis et al. (1987)]) for indexing such data. The problem is that
the performance of such methods degrades considerably even for relatively
low dimensionalities [Chakrabarti and Mehrotra (1999)]; the number of
dimensions that can be handled is usually several orders of magnitude lower
than the number of data points in a typical time sequence.
A general solution described by Faloutsos et al. (1994; 1997) is to extract
a low-dimensional signature from each sequence, and to index the signature
space. This shrink and search approach is illustrated in Figure 2.
Fig. 2. The signature-based approach.
An important result given by Faloutsos et al. (1994) is the proof that in
order to guarantee completeness (no false dismissals), the distance function
used in the signature space must underestimate the true distance measure,
or:

    d_k(x̃, ỹ) ≤ d(x, y). (4)

This requirement is called the bounding lemma. Assuming that (4)
holds, an intuitive way of stating the resulting situation is: “if two signatures
are far apart, we know the corresponding [sequences] must also be far
apart” [Faloutsos et al. (1997)]. This, of course, means that there will be
no false dismissals. To minimise the number of false alarms, we want d_k to
approximate d as closely as possible. The bounding lemma is illustrated in
Figure 3.
This general method of dimensionality reduction may be summed up as
follows [Keogh et al. (2001b)]:
1. Establish a distance measure d from a domain expert.
2. Design a dimensionality reduction technique to produce signatures of
length k, where k can be efficiently handled by a standard spatial access
method.
3. Produce a distance measure d_k over the k-dimensional signature space,
and prove that it obeys the bounding condition (4).
In some applications, the requirement in (4) is relaxed, allowing for a
small number of false dismissals in exchange for increased performance.
Such methods are called approximate.
The dimensionality reduction may in itself be used to speed up the
sequential scan, and some methods (such as the piecewise linear approximation
of Keogh et al., which is described in Section 4.2) rely only on this,
without using any index structure.
Fig. 3. An intuitive view of the bounding lemma.
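The resulting filter-and-refine search can be sketched as follows. This is an illustrative fragment, not any particular published method; the prefix signature used here anticipates the toy example of Section 3.1, and any signature/distance pair satisfying condition (4) could be substituted:

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def signature(x, k=2):
    # Toy signature: the length-k prefix of the sequence.
    return x[:k]

def search(q, X, eps, k=2):
    q_sig = signature(q, k)
    result = []
    for x in X:
        # Filter: if the signature distance already exceeds eps, the
        # bounding condition (4) guarantees d(q, x) > eps too, so x can
        # be dismissed without computing the full distance.
        if euclidean(q_sig, signature(x, k)) > eps:
            continue
        # Refine: remove false alarms with the true distance.
        if euclidean(q, x) <= eps:
            result.append(x)
    return result
```

In a real system the filtering step would be answered by a spatial index over the signatures rather than a loop, but the correctness argument is the same.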
Methods exist for finding signatures of arbitrary objects, given the distances
between them [Faloutsos and Lin (1995), Wang et al. (1999)], but
in the following I will concentrate on methods that exploit the structure of
the time series to achieve good approximations.
3.1. A Simple Example
As an example of the signature-based scheme, consider the two sequences
shown in Figure 4.
The sequences, x and y, are compared using the L_1 measure (Manhattan
distance), which is simply the sum of the absolute differences between each
aligned pair of values. A simple signature in this scheme is the prefix of
length 2, as indicated by the shaded area in the figure. As shown in Figure 5,
these signatures may be interpreted as points in a two-dimensional plane,
which can be indexed with some standard spatial indexing method. It is
also clear that the signature distance will underestimate the real distance
between the sequences, since the remaining summands of the real distance
must all be non-negative.
Fig. 4. Comparing two sequences.
Fig. 5. A simple signature distance.
Fig. 6. An example time sequence.
Although correct, this simple signature extraction technique is not particularly
precise. The signature extraction methods introduced in the following
sections take into account more information about the full sequence
shape, and therefore lead to fewer false alarms.
Figure 6 shows a time series containing measurements of atmospheric
pressure. In the following three sections, the methods described will be
applied to this sequence, and the resulting simplified sequence (reconstructed
from the extracted signature) will be shown superimposed on the
original.
3.2. Spectral Signatures
Some of the methods presented in this section are not very recent, but
introduce some of the main concepts used by newer approaches.
Agrawal et al. (1993) introduce a method called the F-index, in which a
signature is extracted from the frequency domain of a sequence. Underlying
their approach are two key observations:
• Most real-world time sequences can be faithfully represented by their
strongest Fourier coefficients.
• Euclidean distance is preserved in the frequency domain (Parseval’s
Theorem [Shatkay (1995)]).
Based on this, they suggest performing the Discrete Fourier Transform
on each sequence, and using a vector consisting of the sequence’s k first
amplitude coefficients as its signature. Euclidean distance in the signature
space will then underestimate the real Euclidean distance between
the sequences, as required.
Figure 7 shows an approximated time sequence, reconstructed from a
signature consisting of the original sequence’s ten first Fourier components.
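As an illustration of this idea (a sketch, not the original implementation), one can keep the first k complex coefficients of a unitary DFT, so that Parseval's Theorem applies directly and dropping coefficients can only shrink the distance:

```python
import numpy as np

def dft_signature(x, k):
    # Keep the first k Fourier coefficients; norm='ortho' makes the
    # transform unitary, so distances are preserved exactly before
    # any coefficients are discarded.
    return np.fft.fft(np.asarray(x, dtype=float), norm='ortho')[:k]

def signature_distance(sx, sy):
    # Euclidean distance between the (complex) signature vectors.
    return np.linalg.norm(sx - sy)

x = np.sin(np.linspace(0, 10, 64))
y = np.cos(np.linspace(0, 10, 64))
d_true = np.linalg.norm(x - y)
d_sig = signature_distance(dft_signature(x, 5), dft_signature(y, 5))
assert d_sig <= d_true + 1e-9   # the bounding condition (4)
```

Keeping complex coefficients rather than amplitudes is a simplification on my part; the inequality above holds either way because a sum of squares can only lose terms.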
This basic method allows only for whole-sequence matching. In 1994,
Faloutsos et al. introduce the ST-index, an improvement on the F-index
Fig. 7. A sequence reconstructed from a spectral signature.
that makes subsequence matching possible. The main steps of the approach
are as follows:
1. For each position in the database, extract a window of length w, and
create a spectral signature (a point) for it.
Each point will be close to the previous, because the contents of the
sliding window change slowly. The points for one sequence will therefore
constitute a trail in signature space.
2. Partition the trails into suitable (multidimensional) Minimal Bounding
Rectangles (MBRs), according to some heuristic.
3. Store the MBRs in a spatial index structure.
To search for subsequences similar to a query q of length w, simply
look up all MBRs that intersect a hypersphere with radius ε around the
signature point q̃. This is guaranteed not to produce any false dismissals,
because if a point is within a radius of ε of q̃, it cannot possibly be contained
in an MBR that does not intersect the hypersphere.
To search for sequences longer than w, split the query into w-length
segments, search for each of them, and intersect the result sets. Because
a sequence in the result set R cannot be closer to the full query sequence
than it is to any one of the window signatures, it has to be close to all of
them, that is, contained in all the result sets.
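The trail construction of step 1 can be sketched as follows (illustrative only; the MBR partitioning heuristic of step 2 and the index of step 3 are omitted):

```python
import numpy as np

def window_signatures(x, w, k):
    # One k-dimensional spectral point per window position. Consecutive
    # points are close because the window contents change slowly, so
    # together they form a trail in signature space.
    x = np.asarray(x, dtype=float)
    return [np.fft.fft(x[i:i + w], norm='ortho')[:k]
            for i in range(len(x) - w + 1)]
```

For a sequence of length n this yields n − w + 1 points, which is why the trails are grouped into bounding rectangles instead of being stored individually.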
These two papers [Agrawal et al. (1993) and Faloutsos et al. (1994)]
are seminal; several newer approaches are based on them. For example,
Raﬁei and Mendelzon (1997) show how the method can be made more
robust by allowing various transformations in the comparison, and Chan
and Fu (1999) show how the Discrete Wavelet Transform (DWT) can be
used instead of the Discrete Fourier Transform (DFT), and that the DWT
method is empirically superior. See Wu et al. (2000) for a comparison
between similarity search based on DFT and DWT.
3.3. Piecewise Constant Approximation
An approach independently introduced by Yi and Faloutsos (2000) and
Keogh et al. [Keogh et al. (2001b), Keogh and Pazzani (2000)] is to divide
each sequence into k segments of equal length, and to use the average value
of each segment as a coordinate of a k-dimensional signature vector. Keogh
et al. call the method Piecewise Constant Approximation, or PCA. This
deceptively simple dimensionality reduction technique has several advantages
[Keogh et al. (2001b)]: the transform itself is faster than most other
transforms, it is easy to understand and implement, it supports more flexible
distance measures than Euclidean distance, and the index can be built
in linear time.
Figure 8 shows an approximated time sequence, reconstructed from a
ten-dimensional PCA signature.
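The transform itself is only a few lines. The sketch below assumes, for simplicity, that k divides the sequence length; Euclidean distance between such signatures, scaled by sqrt(n/k), is known to underestimate the true Euclidean distance, as condition (4) requires:

```python
def pca_signature(x, k):
    # Piecewise Constant Approximation: split x into k equal-length
    # segments and record each segment's mean value.
    n = len(x)
    assert n % k == 0, "for simplicity, assume k divides the length"
    seg = n // k
    return [sum(x[i * seg:(i + 1) * seg]) / seg for i in range(k)]

pca_signature([1, 2, 3, 5, 7, 9], 3)   # -> [1.5, 4.0, 8.0]
```

Both the linear build time and the ease of implementation claimed above are visible directly in this sketch.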
Yi and Faloutsos (2000) also show that this signature can be used with
arbitrary L_p norms without changing the index structure, which is something
no previous method [such as Agrawal et al. (1993; 1995), Faloutsos
et al. (1994; 1997), Rafiei and Mendelzon (1997), or Yi et al. (1998)] could
accomplish. This means that the distance measure may be specified by the
user. Preprocessing to make the index more robust in the face of such
transformations as offset translation, amplitude scaling, and time scaling can also
be performed.
Keogh et al. demonstrate that the representation can also be used with
the so-called weighted Euclidean distance, where each part of the sequence
has a different weight.
Empirically, the PCA methods seem promising: Yi and Faloutsos
demonstrate up to a tenfold speedup over methods based on the discrete
wavelet transform. Keogh et al. do not achieve similar speedups, but point
to the fact that the structure allows for more flexible distance measures
than many of the competing methods.
Keogh et al. (2001a) later propose an improved version of the PCA, the
so-called Adaptive Piecewise Constant Approximation, or APCA. This is
Fig. 8. A sequence reconstructed from a PCA signature.
similar to the PCA, except that the segments need not be of equal length.
Thus regions with great ﬂuctuations may be represented with several short
segments, while reasonably featureless regions may be represented with
fewer, long segments. The main contribution of this representation is that
it is a more eﬀective compression than the PCA, while still representing the
original faithfully.
Two distance measures are developed for the APCA: one that is guaranteed
to underestimate the Euclidean distance, and one that can be calculated
more efficiently, but that may generate some false dismissals. It
is also shown that this technique, like the PCA, can handle arbitrary L_p
norms. The empirical data suggest that the APCA outperforms both methods
based on the discrete Fourier transform and methods based on the discrete
wavelet transform, with a speedup of one to two orders of magnitude.
In a recent paper, Keogh (2002) develops a distance measure that is
a lower bound for dynamic time warping, and uses the PCA approach to
index it. The distance measure is based on the assumption that the allowed
warping is restricted, which is often the case in real applications. Under this
assumption, Keogh constructs two warped versions of the sequence to be
indexed: an upper and a lower limit. The PCA signatures of these limits are
then extracted and, together with Keogh’s distance measure, form an exact
index (one with no false dismissals) with high precision. Keogh performs
extensive empirical experiments, and his method clearly outperforms the
other existing methods for indexing time warping.
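The envelope construction can be sketched as follows (an illustrative reading of the description above, with a warping window of r positions on either side, not Keogh's own implementation):

```python
import math

def envelope(x, r):
    # Upper and lower limits: the running max/min of x within a
    # warping window of r positions on either side.
    n = len(x)
    U = [max(x[max(0, i - r):min(n, i + r + 1)]) for i in range(n)]
    L = [min(x[max(0, i - r):min(n, i + r + 1)]) for i in range(n)]
    return U, L

def lb_keogh(x, y, r):
    # Sum only the parts of y (same length as x) that fall outside x's
    # envelope; this underestimates the band-restricted warping distance.
    U, L = envelope(x, r)
    total = 0.0
    for yi, u, l in zip(y, U, L):
        if yi > u:
            total += (yi - u) ** 2
        elif yi < l:
            total += (l - yi) ** 2
    return math.sqrt(total)
```

Because each term charges only the distance to the nearest envelope boundary, the result can never exceed the Euclidean distance either, which is what makes it usable as a cheap filter.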
3.4. Landmark Methods
In 1997, Keogh and Smyth introduce a probabilistic method for sequence
retrieval, where the features extracted are characteristic parts of the
sequence, so-called feature shapes. Keogh (1997) uses a similar landmark-based
technique. Both of these methods also use the dimensionality reduction
technique of piecewise linear approximation (see Section 4.2) as a preprocessing
step. The methods are based on finding similar landmark features
(or shapes) in the target sequences, ignoring shifting and scaling within
given limits. The technique is shown to be significantly faster than sequential
scanning (about an order of magnitude), which may be accounted for by
the compression of the piecewise linear approximation. One of the contributions
of the method is that it is one of the first that allows some longitudinal
scaling.
A more recent paper by Perng et al. (2000) introduces a more general
landmark model. In its most general form, the model allows any point of
Fig. 9. A landmark approximation.
great importance to be identified as a landmark. The specific form used
in the paper defines an nth-order landmark of a one-dimensional function
to be a point where the function’s nth derivative is zero. Thus, first-order
landmarks are extrema, second-order landmarks are inflection points, and so
forth. A smoothing technique is also introduced, which lets certain landmarks
be overshadowed by others. For instance, local extrema representing small
fluctuations may not be as important as a global maximum or minimum.
Figure 9 shows an approximated time sequence, reconstructed from a
twelve-dimensional landmark signature.
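As an illustration, first-order landmarks of a discrete sequence (sign changes of the first difference, i.e. local extrema) might be extracted as follows. This is a sketch only; Perng et al. work with smoothed, real-valued functions, and plateaus are ignored here:

```python
def first_order_landmarks(x):
    # First-order landmarks: points where the first difference changes
    # sign, i.e. strict local minima and maxima of the sequence.
    marks = []
    for i in range(1, len(x) - 1):
        if (x[i] - x[i - 1]) * (x[i + 1] - x[i]) < 0:
            marks.append((i, x[i]))
    return marks

first_order_landmarks([0, 2, 1, 3, 0])   # -> [(1, 2), (2, 1), (3, 3)]
```

Each landmark is recorded as a (time, value) pair, so a fixed number of them yields a fixed-size signature vector of the kind used throughout Section 3.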
One of the main contributions of Perng et al. (2000) is to show that for
suitable selections of landmark features, the model is invariant with respect
to the following transformations:
• Shifting
• Uniform amplitude scaling
• Uniform time scaling
• Non-uniform time scaling (time warping)
• Non-uniform amplitude scaling
It is also possible to allow for several of these transformations at once,
by using the intersection of the features allowed for each of them. This
makes the method quite ﬂexible and robust, although as the number of
transformations allowed increases, the number of features will decrease;
consequently, the index will be less precise.
A particularly simple landmark-based method (which can be seen as a
special case of the general landmark method) is introduced by Kim et al.
(2001). They show that by extracting the minimum, the maximum, and the
first and last elements of a sequence, one gets a (rather crude) signature
that is invariant to time warping. However, since the time warping distance
does not obey the triangle inequality [Yi et al. (1998)], it cannot be used
directly. This problem is solved by developing a new distance measure that
underestimates the time warping distance while simultaneously satisfying
the triangle inequality. Note that this method does not achieve results
comparable to those of Keogh (2002).
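A sketch of this four-feature signature, with a distance of the kind described (the max-of-componentwise-differences form is my illustrative reading of Kim et al.'s measure, not a quotation of it):

```python
def kim_signature(x):
    # First element, last element, global minimum, global maximum:
    # all four are invariant under time warping (stuttering).
    return (x[0], x[-1], min(x), max(x))

def kim_distance(sx, sy):
    # The largest componentwise difference. As a Chebyshev-style
    # distance it satisfies the triangle inequality, unlike the time
    # warping distance it is meant to underestimate.
    return max(abs(a - b) for a, b in zip(sx, sy))
```

Since repeating elements changes none of the four features, two warpings of the same sequence always receive distance zero, which is the invariance claimed above.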
4. Other Approaches
Not all recent methods rely on spatial access methods. This section contains
a sampling of other approaches.
4.1. Using Suffix Trees to Avoid Redundant Computation
Baeza-Yates and Gonnet (1999) and Park et al. (2000) independently introduce
the idea of using suffix trees [Gusfield (1997)] to avoid duplicate calculations
when using dynamic programming to compare a query sequence
with other sequences in a database. Baeza-Yates and Gonnet use edit distance
(see the Appendix for details), while Park et al. use time warping.
The basic idea of the approach is as follows: when comparing two
sequences x and y with dynamic programming, a subtask will be to compare
their prefixes x_{1:i} and y_{1:j}. If two other sequences are compared that have
prefixes identical to these (for instance, the query and another sequence
from the database), the same calculations will have to be performed again.
If a sequential search for subsequence matches is performed, the cost may
easily become prohibitive.
To avoid this, all the sequences in the database are indexed with a suffix
tree. A suffix tree stores all the suffixes of a sequence, with identical prefixes
stored only once. By performing a depth-first traversal of the suffix
tree, one can access every suffix (which is equivalent to each possible subsequence
position) and backtrack to reuse the calculations that have already
been performed for the prefix that the current and the next candidate subsequence
share.
Baeza-Yates and Gonnet assume that the sequences are strings over
a finite alphabet; Park et al. avoid this assumption by classifying each
sequence element into one of a finite set of categories. Both methods achieve
subquadratic running times.
4.2. Data Reduction through Piecewise Linear Approximation
Keogh et al. have introduced a dimensionality reduction technique using
piecewise linear approximation of the original sequence data [Keogh (1997),
Keogh and Pazzani (1998), Keogh and Pazzani (1999a), Keogh and Pazzani
(1999b), Keogh and Smyth (1997)]. This reduces the number of data
points by a compression factor typically in the range from 10 to 600 for
real data [Keogh (1997)], outperforming methods based on the Discrete
Fourier Transform by one to three orders of magnitude [Keogh and Pazzani
(1999b)]. This approximation is shown to be valid under several distance
measures, including the dynamic time warping distance [Keogh and Pazzani
(1999b)]. An enhanced representation is introduced in [Keogh and Pazzani
(1998)], where every line segment in the approximation is augmented with
a weight representing its relative importance; for instance, a combined
sequence may be constructed representing a class of sequences, and some
line segments may be more representative of the class than others.
4.3. Search Space Pruning through Subsequence Hashing
Keogh and Pazzani (1999a) describe an indexing method based on hashing,
in addition to the piecewise linear approximation. An equispaced template
grid window is moved across the sequence, and for each position a hash key
is generated to decide into which bin the corresponding subsequence is put.
The hash key is simply a binary string, where 1 means that the sequence
is predominantly increasing in the corresponding part of the template grid,
while 0 means that it is decreasing. These bin keys may then be used during
a search to prune away entire bins without examining their contents. To get
more benefit from the bin pruning, the bins are arranged in a best-first order.
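The key generation can be sketched as follows ("predominantly increasing" is operationalised here, as an assumption of mine, by comparing each part's endpoints):

```python
def hash_key(window, bits):
    # Split the window into `bits` equal parts; emit 1 where the part is
    # predominantly increasing (ends no lower than it starts), else 0.
    seg = len(window) // bits
    key = ''
    for b in range(bits):
        part = window[b * seg:(b + 1) * seg]
        key += '1' if part[-1] >= part[0] else '0'
    return key

hash_key([0, 1, 2, 3, 2, 1, 1, 0], 2)   # -> '10'
```

A query is hashed the same way, and only bins whose keys are compatible with the query's key need to be opened.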
5. Conclusion
This chapter has sought to give an overview of recent advances in the field of
similarity-based retrieval in time sequence databases. First, the problem of
similarity search and the desired properties of robust distance measures and
good indexing methods were outlined. Then, the general approach of signature-based
similarity search was described. Following the general description,
three specific signature extraction approaches were discussed: spectral
signatures, based on Fourier components (or wavelet components); piecewise
constant approximation, and the related method of adaptive piecewise
constant approximation; and landmark methods, based on the extraction of
significant points in a sequence. Finally, some methods that are not based
on signature extraction were mentioned.
Although the field of time sequence indexing has received much attention
and is now a relatively mature field [Keogh et al. (2002)], there are still
areas where further research might be warranted. Two such areas are (1)
thorough empirical comparisons and (2) applications in data mining.
The published methods have undergone thorough empirical tests that
evaluate their performance (usually by comparing them to sequential scan
and, in some cases, to the basic spectral signature methods), but comparing
the results is not a trivial task. In most cases it might not even be very
meaningful, since variations in performance may be due to implementation
details, available hardware, and several other factors that are not inherent
in the indexing methods themselves. Implementing several of the most
promising methods and testing them on real-world problems (under similar
conditions) might lead to new insights, not only about their relative performance
in general, but also about which methods are best suited for which
problems. Although some comparisons have been made [such as in Wu et al.
(2000) and, in the more general context of spatial similarity search, in Weber
et al. (1998)], little research seems to have been published on this topic.
Data mining in time series databases is a relatively new ﬁeld [Keogh
et al. (2002)]. Most current mining methods are based on a full, linear scan
of the sequence data. While this may seem unavoidable, constructing an
index of the data could make it possible to perform this full data traversal
only once, and later perform several data mining passes that only use the
index to perform their work. It has been argued that data mining should
be interactive [Das et al. (1998)], in which case such techniques could prove
useful. Some publications can be found about using time sequence indexing
for data mining purposes [such as Keogh et al. (2002), where a method is
presented for mining patterns using a suffix tree index], but there is still a
potential for combining existing sequence mining techniques with existing
methods for similarity-based retrieval.
Appendix: Distance Measures
Faloutsos et al. (1997) describe a general framework for sequence distance
measures [a similar framework can be found in Jagadish et al. (1995)].
They show that many common distance measures can be expressed in the
following form:

    d(x, y) = min { min_{T_1, T_2 ∈ T} { c(T_1) + c(T_2) + d(T_1(x), T_2(y)) }, d_0(x, y) }. (5)
Here T is a set of allowable transformations, c(T_i) is the cost of performing
the transformation T_i, T_i(x) is the sequence resulting from performing the
transformation T_i on x, and d_0 is a so-called base distance, typically calculated
in linear time. For instance, the L_p norms (such as Manhattan distance
and Euclidean distance) result when T = ∅ and

    d_0(x, y) = L_p(x, y) = (Σ_{i=1}^{l} |x_i − y_i|^p)^{1/p}, (6)

where |x| = |y| = l.
Editing distance (or Levenshtein distance) is the weight of the minimum
sequence of editing operations needed to transform one sequence into
another [Sankoff and Kruskal (1999)]. It is usually defined on strings (or
equispaced time sequences), but Mannila and Ronkainen (1997) show how
to generalise this measure to general (non-equispaced) time sequences. In
the framework given above, editing distance may be defined as:

    d_ed(x, y) = min { c(del(x_1)) + d_ed(x_{2:m}, y),
                       c(del(y_1)) + d_ed(x, y_{2:n}),
                       c(sub(x_1, y_1)) + d_ed(x_{2:m}, y_{2:n}) }, (7)

where m = |x|, n = |y|, del(x_1) and del(y_1) stand for deleting the first
elements of x and y, respectively, and sub(x_1, y_1) stands for substituting
the first element of x with the first element of y.
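The recurrence in (7) can be evaluated bottom-up with dynamic programming. The sketch below uses a unit deletion cost and a simple 0/1 substitution cost, both illustrative choices:

```python
def edit_distance(x, y, c_del=1.0, c_sub=None):
    # Table-based evaluation of (7): D[i][j] = d_ed(x_{1:i}, y_{1:j}).
    if c_sub is None:
        c_sub = lambda a, b: 0.0 if a == b else 1.0
    m, n = len(x), len(y)
    D = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        D[i][0] = i * c_del
    for j in range(1, n + 1):
        D[0][j] = j * c_del
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            D[i][j] = min(c_del + D[i - 1][j],          # delete x_i
                          c_del + D[i][j - 1],          # delete y_j
                          c_sub(x[i - 1], y[j - 1]) + D[i - 1][j - 1])
    return D[m][n]

edit_distance("kitten", "sitting")   # -> 3.0
```

With these costs the function reduces to the classical Levenshtein distance; Mannila and Ronkainen's generalisation replaces the costs with functions of the timestamped elements.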
A distance function with time warping allows non-uniform scaling along
the time axis, or, in sequence terms, stuttering. Stuttering occurs when
an element from one of the sequences is repeated several times. A typical
distance measure is:

    d_tw(x, y) = d_0(x_1, y_1) + min { d_tw(x, y_{2:n})        (x-stutter),
                                       d_tw(x_{2:m}, y)        (y-stutter),
                                       d_tw(x_{2:m}, y_{2:n})  (no stutter) }. (8)
Both d_ed and d_tw can be computed in quadratic time (O(mn)) using
dynamic programming [Cormen et al. (1993), Sankoff and Kruskal (1999)]:
an m × n table D is filled iteratively so that D[i, j] = d(x_{1:i}, y_{1:j}). The final
distance, d(x, y), is found in D[m, n].
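For d_tw, the same table-filling scheme evaluates the recurrence in (8); the base distance d_0(a, b) = |a − b| below is an illustrative choice:

```python
def dtw(x, y, d0=lambda a, b: abs(a - b)):
    # Table-based evaluation of (8): D[i][j] = d_tw(x_{1:i}, y_{1:j}).
    m, n = len(x), len(y)
    INF = float('inf')
    D = [[INF] * (n + 1) for _ in range(m + 1)]
    D[0][0] = 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            D[i][j] = d0(x[i - 1], y[j - 1]) + min(
                D[i][j - 1],      # x-stutter: x_i is repeated
                D[i - 1][j],      # y-stutter: y_j is repeated
                D[i - 1][j - 1])  # no stutter
    return D[m][n]
```

Here dtw([0, 0, 1, 2], [0, 1, 1, 2]) is zero even though the sequences differ elementwise, which is exactly the tolerance to stuttering described above.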
The Longest Common Subsequence (LCS) measure [Cormen et al.
(1993)], d_lcs(x, y), is the length of the longest sequence s which is a
(possibly non-contiguous) subsequence of both x and y, in other words:

    d_lcs(x, y) = max { |s| : s ⊆ x, s ⊆ y }. (9)

In some applications the measure is normalised by dividing by
max(|x|, |y|), giving a distance in the range [0, 1]. d_lcs(x, y) may be calculated
using dynamic programming, in a manner quite similar to d_ed.
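The LCS table is filled in the same fashion as the edit distance table (a minimal sketch):

```python
def lcs(x, y):
    # D[i][j] = d_lcs(x_{1:i}, y_{1:j}), filled like the edit distance table.
    m, n = len(x), len(y)
    D = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if x[i - 1] == y[j - 1]:
                D[i][j] = D[i - 1][j - 1] + 1
            else:
                D[i][j] = max(D[i - 1][j], D[i][j - 1])
    return D[m][n]

lcs("ABCBDAB", "BDCABA")   # -> 4
```

Note that, unlike the other measures in this appendix, larger values of d_lcs mean more similar sequences, which is why it is often converted to a normalised distance as described above.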
References
1. Agrawal, R., Faloutsos, C., and Swami, A.N. (1993). Eﬃcient Similarity
Search in Sequence Databases. Proc. 4th Int. Conf. on Foundations of Data
Organization and Algorithms (FODO), pp. 69–84.
2. Agrawal, R., Lin, K., Sawhney, H.S., and Shim, K. (1995). Fast Similarity
Search in the Presence of Noise, Scaling, and Translation in Time-Series
Databases. Proc. 21st Int. Conf. on Very Large Databases (VLDB), pp. 490–
501.
3. Baeza-Yates, R. and Gonnet, G.H. (1999). A Fast Algorithm on Average for
All-Against-All Sequence Matching. Proc. 6th String Processing and Information
Retrieval Symposium (SPIRE), pp. 16–23.
4. Baeza-Yates, R. and Ribeiro-Neto, B. (1999). Modern Information Retrieval.
ACM Press/Addison–Wesley Longman Limited.
5. Chakrabarti, K. and Mehrotra, S. (1999). The Hybrid Tree: An Index Structure
for High Dimensional Feature Spaces. Proc. 15th Int. Conf. on Data
Engineering (ICDE), pp. 440–447.
6. Chan, K. and Fu, A.W. (1999). Eﬃcient Time Series Matching by Wavelets.
Proc. 15th Int. Conf. on Data Engineering (ICDE), pp. 126–133.
7. Cormen, T.H., Leiserson, C.E., and Rivest, R.L. (1993). Introduction to
Algorithms, MIT Press.
8. Das, G., Lin, K., Mannila, H., Renganathan, G., and Smyth, P. (1998). Rule
Discovery from Time Series. Proc. 4th Int. Conf. on Knowledge Discovery
and Data Mining (KDD), pp. 16–22.
9. Faloutsos, C., Jagadish, H.V., Mendelzon, A.O., and Milo, T. (1997). A Signature
Technique for Similarity-Based Queries. Proc. Compression and Complexity
of Sequences (SEQUENCES).
10. Faloutsos, C. and Lin, K.I. (1995). FastMap: A Fast Algorithm for Indexing,
Data-Mining and Visualization of Traditional and Multimedia Datasets.
Proc. of the 1995 ACM SIGMOD Int. Conf. on Management of Data,
pp. 163–174.
11. Faloutsos, C., Ranganathan, M., and Manolopoulos, Y. (1994). Fast Subsequence
Matching in Time-Series Databases. Proc. of the 1994 ACM SIGMOD
Int. Conf. on Management of Data, pp. 419–429.
12. Gusﬁeld, D. (1997). Algorithms on Strings, Trees and Sequences, Cambridge
University Press.
13. Guttman, A. (1984). R-trees: A Dynamic Index Structure for Spatial
Searching. Proc. 1984 ACM SIGMOD Int. Conf. on Management of Data,
pp. 47–57.
14. Jagadish, H.V., Mendelzon, A.O., and Milo, T. (1995). Similarity-Based
Queries. Proc. 14th Symposium on Principles of Database Systems (PODS),
pp. 36–45.
15. Keogh, E.J. (1997). A Fast and Robust Method for Pattern Matching in Time
Series Databases. Proc. 9th Int. Conf. on Tools with Artiﬁcial Intelligence
(ICTAI), pp. 578–584.
April 22, 2004 16:58 WSPC/Trim Size: 9in x 6in for Review Volume Chap02
A Survey of Recent Methods for Eﬃcient Retrieval of Similar Time Sequences 41
16. Keogh, E.J. (2002). Exact Indexing of Dynamic Time Warping. Proc. 28th
Int. Conf. on Very Large Data Bases (VLDB), pp. 406–417.
17. Keogh, E.J. and Pazzani, M.J. (1998). An Enhanced Representation of Time Series which Allows Fast and Accurate Classification, Clustering and Relevance Feedback. Proc. 4th Int. Conf. on Knowledge Discovery and Data Mining (KDD), pp. 239–243.
18. Keogh, E.J. and Pazzani, M.J. (1999a). An Indexing Scheme for Fast Similarity Search in Large Time Series Databases. Proc. 11th Int. Conf. on Scientific and Statistical Database Management (SSDBM), pp. 56–67.
19. Keogh, E.J. and Pazzani, M.J. (1999b). Scaling up Dynamic Time Warping
to Massive Datasets. Proc. 3rd European Conf. on Principles of Data Mining
and Knowledge Discovery (PKDD), pp. 1–11.
20. Keogh, E.J. and Pazzani, M.J. (2000). A Simple Dimensionality Reduction Technique for Fast Similarity Search in Large Time Series Databases. Proc. of the 4th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), pp. 122–133.
21. Keogh, E.J. and Smyth, P. (1997). A Probabilistic Approach to Fast Pattern Matching in Time Series Databases. Proc. 3rd Int. Conf. on Knowledge Discovery and Data Mining (KDD), pp. 24–30.
22. Keogh, E.J., Chakrabarti, K., Mehrotra, S., and Pazzani, M.J. (2001a).
Locally Adaptive Dimensionality Reduction for Indexing Large Time Series
Databases. Proc. 2001 ACM SIGMOD Conf. on Management of Data,
pp. 151–162.
23. Keogh, E.J., Chakrabarti, K., Pazzani, M.J., and Mehrotra, S. (2001b).
Dimensionality Reduction for Fast Similarity Search in Large Time
Series Databases. Journal of Knowledge and Information Systems 3(3),
263–286.
24. Keogh, E.J., Lonardi, S., and Chiu, B. (2002). Finding Surprising Patterns in
a Time Series Database in Linear Time and Space. Proc. 8th ACM SIGKDD
Int. Conf. on Knowledge Discovery and Data Mining (KDD), pp. 550–556.
25. Kim, S., Park, S., and Chu, W.W. (2001). An Index-Based Approach for Similarity Search Supporting Time Warping in Large Sequence Databases. Proc. 17th Int. Conf. on Data Engineering (ICDE), pp. 607–614.
26. Lee, S., Chun, S., Kim, D., Lee, J., and Chung, C. (2000). Similarity Search for Multidimensional Data Sequences. Proc. 16th Int. Conf. on Data Engineering (ICDE), pp. 599–609.
27. Mannila, H. and Ronkainen, P. (1997). Similarity of Event Sequences. Proc.
4th Int. Workshop on Temporal Representation and Reasoning, TIME, pp.
136–139.
28. Park, S., Chu, W.W., Yoon, J., and Hsu, C. (2000). Eﬃcient Search for
Similar Subsequences of Diﬀerent Lengths in Sequence Databases. Proc. 16th
Int. Conf. on Data Engineering (ICDE), pp. 23–32.
29. Perng, C., Wang, H., Zhang, S.R., and Parker, D.S. (2000). Landmarks: A
New Model for SimilarityBased Pattern Querying in Time Series Databases.
Proc. 16th Int. Conf. on Data Engineering (ICDE), pp. 33–42.
30. Rafiei, D. and Mendelzon, A. (1997). On Similarity-Based Queries for Time Series Data. SIGMOD Record, 26(2), 13–25.
31. Sankoff, D. and Kruskal, J., eds. (1999). Time Warps, String Edits, and Macromolecules: The Theory and Practice of Sequence Comparison. CSLI Publications, reissue edition.
32. Sellis, T.K., Roussopoulos, N., and Faloutsos, C. (1987). The R+-Tree: A Dynamic Index for Multi-Dimensional Objects. Proc. 13th Int. Conf. on Very Large Databases (VLDB), pp. 507–518.
33. Shatkay, H. (1995). The Fourier Transform: A Primer. Technical Report CS-95-37, Brown University.
34. Wang, H. and Perng, C. (2001). The S²-Tree: An Index Structure for Subsequence Matching of Spatial Objects. Proc. of the 5th Pacific-Asia Conf. on Knowledge Discovery and Data Mining (PAKDD), pp. 312–323.
35. Wang, J.T.L., Wang, X., Lin, K.I., Shasha, D., Shapiro, B.A., and Zhang, K. (1999). Evaluating a Class of Distance-Mapping Algorithms for Data Mining and Clustering. Proc. of the 1999 ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, pp. 307–311.
36. Weber, R., Schek, H., and Blott, S. (1998). A Quantitative Analysis and Performance Study for Similarity-Search Methods in High-Dimensional Spaces. Proc. 24th Int. Conf. on Very Large Databases (VLDB), pp. 194–205.
37. Wu, Y., Agrawal, D., and Abbadi, A.E. (2000). A Comparison of DFT and DWT Based Similarity Search in Time-Series Databases. Proc. 9th Int. Conf. on Information and Knowledge Management (CIKM), pp. 488–495.
38. Yi, B. and Faloutsos, C. (2000). Fast Time Sequence Indexing for Arbitrary Lp Norms. Proc. 26th Int. Conf. on Very Large Databases (VLDB), pp. 385–394.
39. Yi, B., Jagadish, H.V., and Faloutsos, C. (1998). Efficient Retrieval of Similar Time Sequences Under Time Warping. Proc. 14th Int. Conf. on Data Engineering (ICDE), pp. 201–208.
CHAPTER 3
INDEXING OF COMPRESSED TIME SERIES
Eugene Fink
Computer Science and Engineering, University of South Florida
Tampa, Florida 33620, USA
Email: eugene@csee.usf.edu
Kevin B. Pratt
Harris Corporation, 1025 NASA Blvd., W3/9708
Melbourne, Florida 32919, USA
Email: kpratt01@harris.com
We describe a procedure for identifying major minima and maxima of a time series, and present two applications of this procedure. The first application is fast compression of a series, by selecting major extrema and discarding the other points. The compression algorithm runs in linear time and takes constant memory. The second application is indexing of compressed series by their major extrema, and retrieval of series similar to a given pattern. The retrieval procedure searches for the series whose compressed representation is similar to the compressed pattern. It allows the user to control the trade-off between the speed and accuracy of retrieval. We show the effectiveness of the compression and retrieval for stock charts, meteorological data, and electroencephalograms.

Keywords: Time series; compression; fast retrieval; similarity measures.
1. Introduction
We view a time series as a sequence of values measured at equal intervals;
for example, the series in Figure 1 includes the values 20, 22, 25, 22, and
so on. We describe a compression procedure based on extraction of certain
important minima and maxima from a series. For example, we can compress
the series in Figure 1 by extracting the circled minima and maxima, and
discarding the other points. We also propose a measure of similarity between
Fig. 1. Example of a time series.
Table 1. Data sets used in the experiments.

Data set        Description               Number of   Points        Total number
                                          series      per series    of points
Stock prices    98 stocks, 2.3 years      98          610           60,000
Air and sea     68 buoys, 18 years,       136         1,800–6,600   450,000
temperatures    2 sensors per buoy
Wind speeds     12 stations, 18 years     12          6,570         79,000
EEG             64 electrodes, 1 second   64          256           16,000
series and show that it works well with compressed data. Finally, we present
a technique for indexing and retrieval of compressed series; we have tested
it on four data sets (Table 1), which are publicly available through the
Internet.
Stock prices: We have used stocks from the Standard and Poor's 100 listing of large companies for the period from January 1998 to April 2000. We have downloaded daily prices from America Online, discarded newly listed and delisted stocks, and used ninety-eight stocks in the experiments.
Air and sea temperatures: We have experimented with daily temperature readings by sixty-eight buoys in the Pacific Ocean, from 1980 to 1998, downloaded from the Knowledge Discovery archive at the University of California at Irvine (kdd.ics.uci.edu).
Wind speeds: We have used daily wind speeds from twelve sites in
Ireland, from 1961 to 1978, obtained from an archive at Carnegie Mellon
University (lib.stat.cmu.edu/datasets).
Electroencephalograms: We have utilized EEG obtained by Henri Begleiter at the Neurodynamics Laboratory of the SUNY Health Center at Brooklyn. These data are from sixty-four electrodes at standard points on the scalp; we have downloaded them from an archive at the University of California at Irvine (kdd.ics.uci.edu).
2. Previous Work
We review related work on the comparison and indexing of time series.
Feature sets. Researchers have considered various feature sets for compressing time series and measuring similarity between them. They have extensively studied Fourier transforms, which allow fast compression [Singh and McAtackney (1998), Sheikholeslami et al. (1998), Stoffer (1999), Yi et al. (2000)]; however, this technique has several disadvantages. In particular, it smoothes local extrema, which may lead to a loss of important information, and it does not work well for erratic series [Ikeda et al. (1999)]. Chan and his colleagues applied Haar wavelet transforms to time series and showed several advantages of this technique over Fourier transforms [Chan and Fu (1999), Chan et al. (2003)].
Guralnik and Srivastava (1999) considered the problem of detecting a change in the trend of a data stream, and developed a technique for finding "change points" in a series. Last et al. (2001) proposed a general framework for knowledge discovery in time series, which included representation of a series by its key features, such as slope and signal-to-noise ratio. They described a technique for computing these features and identifying the points of change in the feature values.
Researchers have also studied the use of small alphabets for compression of time series, and applied string matching to the pattern search [Agrawal et al. (1995), Huang and Yu (1999), André-Jönsson and Badal (1997), Lam and Wong (1998), Park et al. (1999), Qu et al. (1998)]. For example, Guralnik et al. (1997) compressed stock prices using a nine-letter alphabet. Singh and McAtackney (1998) represented stock prices, particle dynamics, and stellar light intensity using a three-letter alphabet. Lin and Risch (1998) used a two-letter alphabet to encode major spikes in a series. Das et al. (1998) utilized an alphabet of primitive shapes for efficient compression. These techniques give a high compression rate, but their descriptive power is limited, which makes them inapplicable in many domains.
Perng et al. (2000) investigated a compression technique based on extraction of "landmark points," which included local minima and maxima. Keogh and Pazzani (1997; 1998) used the endpoints of best-fit line segments to compress a series. Keogh et al. (2001) reviewed the compression techniques based on approximation of a time series by a sequence of straight segments. We describe an alternative compression technique, based on selection of important minima and maxima.
Similarity measures. Several researchers have defined similarity as the distance between points in a feature space. For example, Caraça-Valente and López-Chavarrías (2000) used Euclidean distance between feature vectors containing angle of knee movement and muscle strength, and Lee et al. (2000) applied Euclidean distance to compare feature vectors containing color, texture, and shape of video data. This technique works well when all features have the same units of scale [Goldin and Kanellakis (1995)], but it is often ineffective for combining disparate features.
An alternative definition of similarity is based on bounding rectangles; two series are similar if their bounding rectangles are similar. It allows fast pruning of clearly dissimilar series [Perng et al. (2000), Lee et al. (2000)], but it is less effective for selecting the most similar series. The envelope-count technique is based on dividing a series into short segments, called envelopes, and defining a yes/no similarity for each envelope. Two series are similar within an envelope if their point-by-point differences are within a certain threshold. The overall similarity is measured by the number of envelopes where the series are similar [Agrawal et al. (1996)]. This measure allows fast computation of similarity, and it can be adapted for noisy and missing data [Das et al. (1997), Bollobas et al. (1997)].
Finally, we can measure a point-by-point similarity of two series and then aggregate these measures, which often requires interpolation of missing points. For example, Keogh and Pazzani (1998) used linear interpolation with this technique, and Perng et al. (2000) applied cubic approximation. Keogh and Pazzani (2000) also described a point-by-point similarity with modified Euclidean distance, which does not require interpolation.
Indexing and retrieval. Researchers have studied a variety of techniques for indexing of time series. For example, Deng (1998) applied k-d trees to arrange series by their significant features, Chan and Fu (1999) combined wavelet transforms with R-trees, and Bozkaya and her colleagues used vantage-point trees for indexing series by numeric features [Bozkaya et al. (1997), Bozkaya and Özsoyoglu (1999)]. Park et al. (2001) indexed series by their local extrema and by properties of the segments between consecutive extrema. Li et al. (1998) proposed a retrieval technique based on a multilevel abstraction hierarchy of features. Aggarwal and Yu (2000) considered grid structures, but found that the grid performance is often no better than exhaustive search. They also showed that exhaustive search among compressed series is often faster than sophisticated indexing techniques.
3. Important Points

We compress a time series by selecting some of its minima and maxima, and dropping the other points (Figure 2). The intuitive idea is to discard minor fluctuations and keep major minima and maxima. We control the compression rate with a parameter R, which is always greater than one; an increase of R leads to selection of fewer points. A point a_m of a series a_1, ..., a_n is an important minimum if there are indices i and j, where i ≤ m ≤ j, such that

• a_m is the minimum among a_i, ..., a_j, and
• a_i/a_m ≥ R and a_j/a_m ≥ R.

Intuitively, a_m is the minimal value of some segment a_i, ..., a_j, and the endpoint values of this segment are much larger than a_m (Figure 3). Similarly, a_m is an important maximum if there are indices i and j, where i ≤ m ≤ j, such that

• a_m is the maximum among a_i, ..., a_j, and
• a_m/a_i ≥ R and a_m/a_j ≥ R.
In Figure 4, we give a procedure for selecting important points, which
takes linear time and constant memory. It outputs the values and indices
of all important points, as well as the ﬁrst and last point of the series.
This procedure can process new points as they arrive, without storing the
Fig. 2. Important points for 91% compression (left) and 94% compression (right).

Fig. 3. Important minimum (left) and important maximum (right).
April 22, 2004 16:58 WSPC/Trim Size: 9in x 6in for Review Volume Chap03
48 E. Fink and K. B. Pratt
IMPORTANT-POINTS — Top-level function for finding important points.
The input is a time series a_1, ..., a_n; the output is the values and indices of the selected important points.

  output (a_1, 1)
  i = FIND-FIRST
  if i < n and a_i > a_1 then i = FIND-MAXIMUM(i)
  while i < n do
    i = FIND-MINIMUM(i)
    i = FIND-MAXIMUM(i)
  output (a_n, n)

FIND-FIRST — Find the first important point.

  iMin = 1; iMax = 1; i = 2
  while i < n and a_i/a_iMin < R and a_iMax/a_i < R do
    if a_i < a_iMin then iMin = i
    if a_i > a_iMax then iMax = i
    i = i + 1
  if iMin < iMax
  then output (a_iMin, iMin)
  else output (a_iMax, iMax)
  return i

FIND-MINIMUM(i) — Find the first important minimum after the i-th point.

  iMin = i
  while i < n and a_i/a_iMin < R do
    if a_i < a_iMin then iMin = i
    i = i + 1
  if i < n or a_iMin < a_i then output (a_iMin, iMin)
  return i

FIND-MAXIMUM(i) — Find the first important maximum after the i-th point.

  iMax = i
  while i < n and a_iMax/a_i < R do
    if a_i > a_iMax then iMax = i
    i = i + 1
  if i < n or a_iMax > a_i then output (a_iMax, iMax)
  return i

Fig. 4. Compression procedure. We process a series a_1, ..., a_n and use a global variable n to represent its size. The procedure outputs the values and indices of the selected points.
original series; for example, it can compress a live electroencephalogram
without waiting until the end of the data collection. We have implemented
it in Visual Basic 6 and tested on a 300MHz PC; for an npoint series, the
compression time is 14 · n microseconds.
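As a concrete illustration, the procedure of Figure 4 can be transcribed into Python roughly as follows. This is our own sketch, not the authors' code: the function names and the list-of-pairs output format are our choices, and indices are zero-based. It assumes a series of positive values, as the ratio-based definitions require.

```python
def important_points(a, R):
    """Select important minima and maxima of a positive series a,
    using a compression parameter R > 1; larger R keeps fewer points.
    Returns (value, index) pairs, always including both endpoints."""
    n = len(a)
    out = [(a[0], 0)]

    def find_first():
        # Scan until the series moves by a factor of R in either direction.
        i_min = i_max = 0
        i = 1
        while i < n - 1 and a[i] / a[i_min] < R and a[i_max] / a[i] < R:
            if a[i] < a[i_min]: i_min = i
            if a[i] > a[i_max]: i_max = i
            i += 1
        out.append((a[i_min], i_min) if i_min < i_max else (a[i_max], i_max))
        return i

    def find_minimum(i):
        # Find the first important minimum after the i-th point.
        i_min = i
        while i < n - 1 and a[i] / a[i_min] < R:
            if a[i] < a[i_min]: i_min = i
            i += 1
        if i < n - 1 or a[i_min] < a[i]:
            out.append((a[i_min], i_min))
        return i

    def find_maximum(i):
        # Find the first important maximum after the i-th point.
        i_max = i
        while i < n - 1 and a[i_max] / a[i] < R:
            if a[i] > a[i_max]: i_max = i
            i += 1
        if i < n - 1 or a[i_max] > a[i]:
            out.append((a[i_max], i_max))
        return i

    i = find_first()
    if i < n - 1 and a[i] > a[0]:
        i = find_maximum(i)
    while i < n - 1:
        i = find_minimum(i)
        i = find_maximum(i)
    out.append((a[n - 1], n - 1))

    # find_first may re-emit the first point; drop consecutive duplicates.
    dedup = [out[0]]
    for p in out[1:]:
        if p != dedup[-1]:
            dedup.append(p)
    return dedup
```

With R = 1.5, for example, the minor peak in [10, 11, 10, 30, 10] is discarded while the major extrema and the endpoints survive, matching the behavior described above.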
We have applied the compression procedure to the data sets in Table 1,
and compared it with two simple techniques: equally spaced points and
randomly selected points. We have experimented with diﬀerent compression
rates, which are deﬁned as the percentage of points removed from a series.
For example, "eighty-percent compression" means that we select 20% of the points and discard the other 80%.
For each compression technique, we have measured the difference between the original series and the compressed series. We have considered three measures of difference between the original series, a_1, ..., a_n, and the series interpolated from the compressed version, b_1, ..., b_n.

Mean difference: (1/n) · Σ_{i=1}^{n} |a_i − b_i|.

Maximum difference: max_{i ∈ [1,...,n]} |a_i − b_i|.

Root mean square difference: sqrt( (1/n) · Σ_{i=1}^{n} (a_i − b_i)² ).
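These three measures translate directly into code. In the following sketch (the function names are ours), `a` is the original series and `b` is the series interpolated from the compressed version, both of length n.

```python
import math

def mean_difference(a, b):
    # (1/n) * sum over i of |a_i - b_i|
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

def max_difference(a, b):
    # max over i of |a_i - b_i|
    return max(abs(x - y) for x, y in zip(a, b))

def rms_difference(a, b):
    # sqrt((1/n) * sum over i of (a_i - b_i)^2)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))
```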
We summarize the results in Table 2, which shows that important points
are signiﬁcantly more accurate than the two simple methods.
4. Similarity Measures

We define similarity between time series, which underlies the retrieval procedure. We measure similarity on a zero-to-one scale; zero means no likeness and one means perfect likeness. We review three basic measures of similarity and then propose a new measure. First, we define similarity between two numbers, a and b:

sim(a, b) = 1 − |a − b| / (a + b).

The mean similarity between two series, a_1, ..., a_n and b_1, ..., b_n, is the mean of their point-by-point similarities:

(1/n) · Σ_{i=1}^{n} sim(a_i, b_i).

We also define the root mean square similarity:

sqrt( (1/n) · Σ_{i=1}^{n} sim(a_i, b_i)² ).
In addition, we consider the correlation coeﬃcient, which is a standard
statistical measure of similarity. It ranges from −1 to 1, but we can convert
Table 2. Accuracy of three compression techniques. We give the average difference between an original series and its compressed version using the three difference measures; smaller differences correspond to more accurate compression.

               Mean difference          Maximum difference       Root mean square diff.
               important fixed  random  important fixed  random  important fixed  random
               points    points points  points    points points  points    points points
Eighty percent compression
Stock prices   0.02      0.03   0.04    0.70      1.70   1.60    0.05      0.14   0.14
Air temp.      0.01      0.03   0.03    0.33      0.77   0.72    0.03      0.10   0.10
Sea temp.      0.01      0.03   0.03    0.35      0.81   0.75    0.03      0.10   0.10
Wind speeds    0.02      0.03   0.03    0.04      1.09   1.01    0.04      0.05   0.05
EEG            0.03      0.06   0.07    0.68      1.08   1.00    0.10      0.18   0.17
Ninety percent compression
Stock prices   0.03      0.06   0.07    1.10      1.70   1.70    0.08      0.21   0.21
Air temp.      0.02      0.05   0.05    0.64      0.80   0.78    0.08      0.16   0.14
Sea temp.      0.01      0.04   0.05    0.60      0.83   0.82    0.07      0.16   0.14
Wind speeds    0.03      0.04   0.04    0.06      1.09   1.03    0.05      0.06   0.06
EEG            0.08      0.13   0.12    0.82      1.10   1.09    0.17      0.27   0.24
Ninety-five percent compression
Stock prices   0.05      0.10   0.12    1.30      1.80   1.80    0.11      0.32   0.30
Air temp.      0.03      0.09   0.08    0.74      0.83   0.83    0.12      0.23   0.21
Sea temp.      0.03      0.08   0.08    0.78      0.85   0.85    0.12      0.23   0.21
Wind speeds    0.05      0.04   0.04    0.08      1.09   1.10    0.07      0.08   0.08
EEG            0.13      0.17   0.16    0.90      1.10   1.10    0.24      0.31   0.28
Fig. 5. Peak similarity of numbers a and b, where a ≥ b.
it to the zero-to-one scale by adding one and dividing by two. For series a_1, ..., a_n and b_1, ..., b_n, with mean values m_a = (a_1 + ··· + a_n)/n and m_b = (b_1 + ··· + b_n)/n, the correlation coefficient is

Σ_{i=1}^{n} (a_i − m_a) · (b_i − m_b) / sqrt( Σ_{i=1}^{n} (a_i − m_a)² · Σ_{i=1}^{n} (b_i − m_b)² ).
We next define a new similarity measure, called the peak similarity. For numbers a and b, their peak similarity is

psim(a, b) = 1 − |a − b| / (2 · max(a, b)).

In Figure 5, we show the meaning of this definition for a ≥ b, based on the illustrated triangle. We draw the vertical line through b to the intersection with the triangle's side; the ordinate of the intersection is the similarity of a and b. The peak similarity of two series is the mean peak similarity of their points, (psim(a_1, b_1) + ··· + psim(a_n, b_n))/n.
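The four measures — mean similarity, root mean square similarity, the rescaled correlation coefficient, and peak similarity — can be sketched as follows. The function names are ours, and the series are assumed positive and of equal length.

```python
import math

def sim(a, b):
    # similarity of two positive numbers, on a zero-to-one scale
    return 1 - abs(a - b) / (a + b)

def psim(a, b):
    # peak similarity of two positive numbers
    return 1 - abs(a - b) / (2 * max(a, b))

def mean_similarity(xs, ys, point_sim=sim):
    # mean of the point-by-point similarities
    return sum(point_sim(x, y) for x, y in zip(xs, ys)) / len(xs)

def rms_similarity(xs, ys, point_sim=sim):
    # root mean square of the point-by-point similarities
    return math.sqrt(sum(point_sim(x, y) ** 2 for x, y in zip(xs, ys)) / len(xs))

def scaled_correlation(xs, ys):
    # correlation coefficient mapped from [-1, 1] to the zero-to-one scale
    n = len(xs)
    ma, mb = sum(xs) / n, sum(ys) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(xs, ys))
    norm = math.sqrt(sum((x - ma) ** 2 for x in xs) * sum((y - mb) ** 2 for y in ys))
    return (cov / norm + 1) / 2
```

Note that the point similarities agree at equal values (sim(a, a) = psim(a, a) = 1) but diverge as the values drift apart; the peak similarity penalizes the drift less.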
We next give an empirical comparison of the four similarity measures. For each series, we have found the five most similar series, and then determined the mean difference between the given series and the other five. In Table 3, we summarize the results and compare them with the perfect exhaustive-search selection and with random selection. The results show that the peak similarity performs better than the other measures, and that the correlation coefficient is the least effective.
We have also used the four similarity measures to identify close matches for each series, and compared the results with ground-truth neighborhoods. For stocks, we have considered small neighborhoods formed by industry subgroups, as well as large neighborhoods formed by industry groups, according to Standard and Poor's classification. For air and sea temperatures, we have used geographic proximity to define two ground-truth neighborhoods. The first neighborhood is the 1 × 5 rectangle in the grid of buoys, and the second is the 3 × 5 rectangle. For wind speeds, we have also used geographic proximity; the first neighborhood includes all sites within 70 miles,
Table 3. Differences between selected similar series. For each given series, we have selected the five most similar series, and then measured the mean difference between the given series and the other five. Smaller differences correspond to better selection. We also show the running times of selecting similar series. For each data set, the columns give the mean difference, maximum difference, and time in seconds.

Similarity measure,      Stock prices       Sea temperatures   Air temperatures   Wind speeds        EEG
comp. rate               Mean  Max   Time   Mean  Max   Time   Mean  Max   Time   Mean  Max   Time   Mean  Max   Time
Perfect selection        0.094 0.437 —      0.016 0.072 —      0.024 0.121 —      0.021 0.136 —      0.038 0.170 —
Random selection         0.287 1.453 —      0.078 0.215 —      0.070 0.235 —      0.029 0.185 —      0.072 0.370 —
Peak similarity, 90%     0.103 0.429 0.024  0.018 0.068 0.021  0.029 0.103 0.022  0.023 0.138 0.016  0.052 0.241 0.015
Peak similarity, 95%     0.110 0.534 0.022  0.019 0.073 0.019  0.030 0.136 0.020  0.023 0.148 0.016  0.063 0.306 0.015
Mean similarity, 90%     0.110 0.525 0.026  0.026 0.092 0.022  0.031 0.134 0.022  0.023 0.137 0.017  0.055 0.279 0.016
Mean similarity, 95%     0.126 0.570 0.024  0.033 0.112 0.021  0.037 0.152 0.022  0.025 0.152 0.017  0.066 0.323 0.014
RMS similarity, 90%      0.103 0.497 0.026  0.024 0.090 0.022  0.030 0.133 0.022  0.023 0.134 0.017  0.051 0.261 0.016
RMS similarity, 95%      0.115 0.588 0.024  0.031 0.106 0.021  0.035 0.147 0.022  0.023 0.153 0.017  0.064 0.317 0.014
Correlation coeff., 90%  0.206 1.019 0.048  0.054 0.162 0.044  0.051 0.214 0.046  0.024 0.138 0.042  0.056 0.281 0.030
Correlation coeff., 95%  0.210 1.101 0.045  0.063 0.179 0.042  0.051 0.224 0.043  0.024 0.154 0.033  0.068 0.349 0.028
and the second includes the sites within 140 miles. For electroencephalograms, the first neighborhood is the 3 × 3 rectangle in the grid of electrodes, and the second is the 5 × 5 rectangle.
For each series, we have found the ﬁve closest matches, and then
determined the average number of matches that belong to the same
neighborhood. In Table 4, we give the results and compare them with the
perfect selection and random selection; larger numbers correspond to better
selections.
The results show that the peak similarity is usually more effective than the other three similarities. If we use the 90% compression, the peak similarity gives better selection than the other similarity measures for the stock prices, air and sea temperatures, and EEG; however, it gives worse results for the wind speeds. If we use the 95% compression, the peak similarity outperforms the other measures on the stocks, temperatures, EEG, and large-neighborhood selection for the wind speeds; however, it loses to the mean similarity and correlation coefficient on the small-neighborhood selection for the wind speeds.
We have also checked how well the peak similarity of original series
correlates with the peak similarity of their compressed versions (Figure 6).
We have observed a good linear correlation, which gracefully degrades with
an increase of compression rate.
5. Pattern Retrieval
We give an algorithm that inputs a pattern series and retrieves similar series
from a database. It includes three steps: identifying a “prominent feature”
of the pattern, ﬁnding similar features in the database, and comparing the
pattern with each series that has a similar feature.
We begin by deﬁning a leg of a series, which is the segment between
two consecutive important points. For each leg, we store the values listed
in Figure 7, denoted vl, vr, il, ir, ratio, and length; we give an example of
these values in Figure 8. The prominent leg of a pattern is the leg with the
greatest endpoint ratio.
The retrieval procedure inputs a pattern and searches for similar segments in a database (Figure 9). First, it finds the pattern leg with the greatest endpoint ratio, denoted ratio_p, and determines the length of this leg, length_p. Next, it identifies all legs in the database that have a similar endpoint ratio and length. A leg is considered similar to the pattern leg if its ratio is between ratio_p/C and ratio_p · C, and its length is between
Table 4. Finding members of the same neighborhood. For each series, we have found the five closest matches, and then determined the average number of the series among them that belong to the same ground-truth neighborhood. For each data set, columns 1 and 2 correspond to the first (smaller) and second (larger) neighborhood.

Similarity measure,      Stocks       Sea temp.    Air temp.    Wind speeds  EEG
comp. rate               1     2      1     2      1     2      1     2      1     2
Perfect selection        5.00  5.00   5.00  5.00   5.00  5.00   5.00  5.00   5.00  5.00
Random selection         0.07  0.29   0.11  0.40   0.10  0.40   0.74  2.27   0.35  1.03
Peak similarity, 90%     0.22  0.62   0.54  1.09   0.49  0.89   1.16  2.83   1.81  2.81
Peak similarity, 95%     0.21  0.55   0.65  1.18   0.48  0.82   1.50  2.83   1.59  2.25
Mean similarity, 90%     0.18  0.55   0.28  0.85   0.34  0.77   1.33  2.92   1.05  1.98
Mean similarity, 95%     0.12  0.47   0.17  0.75   0.25  0.65   1.58  2.66   0.36  0.90
RMS similarity, 90%      0.14  0.53   0.32  0.88   0.34  0.83   1.50  2.92   1.20  2.19
RMS similarity, 95%      0.17  0.35   0.20  0.77   0.26  0.71   1.33  2.75   0.36  0.90
Correlation coeff., 90%  0.15  0.39   0.25  0.82   0.49  0.74   1.33  2.92   1.16  2.24
Correlation coeff., 95%  0.19  0.50   0.29  0.72   0.34  0.60   1.51  2.75   0.68  1.65
Fig. 6. Correlation between the peak similarity of original series and the peak similarity of their compressed versions, for each of the five data sets. We show the correlation for three compression rates: 80%, 90%, and 95%; the per-panel correlation coefficients range from .57 to .98.
vl      value of the left important point of the leg
vr      value of the right important point of the leg
il      index of the left important point
ir      index of the right important point
ratio   ratio of the endpoints, defined as vr/vl
length  length of the leg, defined as ir − il

Fig. 7. Basic data for a leg.
length_p/D and length_p · D, where C and D are parameters for controlling the search.
We index all legs in the database by their ratio and length using a range
tree, which is a standard structure for indexing points by two coordinates
[Edelsbrunner (1981), Samet (1990)]. If the total number of legs is l, and
the number of retrieved legs with an appropriate ratio and length is k, then
the retrieval time is O(k + lg l).
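For illustration, the leg records of Figure 7 and the ratio/length selection can be sketched as follows. The `Leg` class and `candidate_legs` function are our own names; the linear scan stands in for the range-tree query, which would answer the same question in O(k + lg l) time.

```python
from dataclasses import dataclass

@dataclass
class Leg:
    vl: float  # value of the left important point
    vr: float  # value of the right important point
    il: int    # index of the left important point
    ir: int    # index of the right important point

    @property
    def ratio(self):
        # ratio of the endpoints, vr / vl
        return self.vr / self.vl

    @property
    def length(self):
        # length of the leg, ir - il
        return self.ir - self.il

def candidate_legs(legs, pattern_leg, C, D):
    """Select legs whose ratio lies in [ratio_p / C, ratio_p * C] and
    whose length lies in [length_p / D, length_p * D]."""
    rp, lp = pattern_leg.ratio, pattern_leg.length
    return [leg for leg in legs
            if rp / C <= leg.ratio <= rp * C
            and lp / D <= leg.length <= lp * D]
```

For the example leg of Figure 8 (vl = 0.6, vr = 1.4, il = 9, ir = 15) used as the pattern leg, only database legs with a comparably steep upward ratio and a length near 6 survive the filter.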
Finally, the procedure identifies the segments that contain the selected legs (Figure 10) and computes their similarity to the pattern. If the similarity is above a given threshold T, the procedure outputs the segment as a match. In Figure 11, we give an example of a stock-price pattern and similar segments retrieved from the stock database.
Fig. 8. Example legs. We show the basic data for the two legs marked by thick lines: the first leg has vl = 0.6, vr = 1.4, il = 9, ir = 15, ratio = 2.33, length = 6; the second has vl = 1.6, vr = 1.0, il = 21, ir = 27, ratio = 0.62, length = 6.
PATTERN-RETRIEVAL
The procedure inputs a pattern series and searches a time-series database; the output is a list of segments from the database that match the pattern.

  Identify the pattern leg p with the greatest endpoint ratio, denoted ratio_p.
  Determine the length of this pattern leg, denoted length_p.
  Find all legs in the database that satisfy the following conditions:
  • their endpoint ratios are between ratio_p/C and ratio_p · C, and
  • their lengths are between length_p/D and length_p · D.
  For each leg in the set of selected legs:
    Identify the segment corresponding to the pattern (Figure 10).
    Compute the similarity between the segment and the pattern.
    If the similarity is above the threshold T, then output the segment.

Fig. 9. Search for segments similar to a given pattern. We use three parameters to control the search: maximal ratio deviation C, maximal length deviation D, and similarity threshold T.
Fig. 10. Identifying a segment that may match the pattern: (a) prominent leg in a pattern; (b) similar leg in a series; (c) align the right end of these legs; (d) identify the respective segment in the series.
April 22, 2004 16:58 WSPC/Trim Size: 9in x 6in for Review Volume Chap03
Indexing of Compressed Time Series 57
Fig. 11. Example of retrieved stock charts.

Fig. 12. Example of extended legs: (a) prominent leg of a pattern; (b) dissimilar legs; (c) extended legs. The pattern (a) matches the series (b), but the pattern's prominent leg has no equivalent in the series. If we identify the extended legs (c), the prominent leg matches one of them.
The described procedure can miss a matching segment that does not
have a leg corresponding to the pattern’s prominent leg. We illustrate
this problem in Figure 12, where the prominent leg of the pattern has no
equivalent in the matching series. To avoid this problem, we introduce the
notion of an extended leg, which is a segment that would be a leg under a
higher compression rate (Figure 12c). Formally, points a_i and a_j of a series a_1, . . . , a_n form an extended upward leg if
• a_i is a local minimum, and a_j is a local maximum, and
• for every m ∈ (i, j), we have a_i < a_m < a_j.
The definition of an extended downward leg is similar.
We identify all extended legs, and index them in the same way as normal legs. The advantage of this approach is more accurate retrieval, and the disadvantage is a larger indexing structure. In Figure 13, we give an algorithm for identifying upward extended legs; the procedure for finding downward extended legs is similar. We assume that normal upward legs in the input series are numbered from 1 to l. First, the procedure processes important maxima; for each maximum ir_k, it identifies the next larger maximum and stores its index in next[k]. Second, it uses this information to identify extended legs. The running time of the first part is linear in the
58 E. Fink and K. B. Pratt
EXTENDED-LEGS
The input is the list of legs in a series; the output is a list of all extended legs.
  initialize an empty stack S of leg indices
  PUSH(S, 1)
  for k = 2 to l do
    while S is not empty and ir_TOP(S) < ir_k do
      next[TOP(S)] = k; POP(S)
    PUSH(S, k)
  while S is not empty do
    next[TOP(S)] = NIL; POP(S)
  initialize an empty list of extended legs
  for k = 1 to l − 1 do
    m = next[k]
    while m is not NIL do
      add (il_k, ir_m) to the list of extended legs
      m = next[m]
Fig. 13. Identifying extended legs. We assume that normal legs are numbered 1 to l.
number of normal legs, and the time of the second part is linear in the
number of extended legs.
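The two passes of Fig. 13 can be sketched in Python as below. The representation is our assumption: each leg is an (il, ir) pair of indices into the series, legs are ordered left to right, and right endpoints are compared by their series values.

```python
# Sketch of the EXTENDED-LEGS procedure of Fig. 13 (upward legs only).
# Each leg k is a pair (il, ir) of indices into the series.

def extended_legs(series, legs):
    """Return (il_k, ir_m) index pairs for all extended upward legs."""
    l = len(legs)
    nxt = [None] * l
    stack = [0]            # indices of legs awaiting a larger maximum
    # First pass: for each right endpoint, find the next larger maximum.
    for k in range(1, l):
        while stack and series[legs[stack[-1]][1]] < series[legs[k][1]]:
            nxt[stack.pop()] = k
        stack.append(k)
    # Legs left on the stack have no larger maximum to their right
    # (their next[] entries stay None, i.e. NIL).
    # Second pass: chain the next[] links to enumerate extended legs.
    result = []
    for k in range(l - 1):
        m = nxt[k]
        while m is not None:
            result.append((legs[k][0], legs[m][1]))
            m = nxt[m]
    return result

# Two upward legs; the second maximum (value 4) exceeds the first (value 2),
# so the minimum of leg 1 and the maximum of leg 2 form an extended leg.
print(extended_legs([3, 0, 2, 1, 4], [(1, 2), (3, 4)]))   # -> [(1, 4)]
```

The first pass is the classic stack-based "next greater element" scan, which matches the linear-time claim above; the second pass emits one pair per extended leg.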
To evaluate the retrieval accuracy, we have compared the retrieval
results with the matches identiﬁed by a slow exhaustive search. We have
ranked the matches found by the retrieval algorithm from most to least
similar. In Figures 14 and 15, we plot the ranks of matches found by the
fast algorithm versus the ranks of exhaustive-search matches. For instance,
if the fast algorithm has found only three among seven closest matches, the
graph includes the point (3, 7). Note that the fast algorithm never returns
“false positives” since it veriﬁes each candidate match.
The retrieval time grows linearly with the pattern length and with the
number of candidate segments identiﬁed at the initial step of the retrieval
algorithm. If we increase C and D, the procedure ﬁnds more candidates and
misses fewer matches, but the retrieval takes more time. In Table 5, we give
the mean number of candidate segments in the retrieval experiments, for
three diﬀerent values of C and D; note that this number does not depend
on the pattern length.
We have measured the speed of a Visual Basic implementation on a 300 MHz PC. If the pattern has m legs, and the procedure identifies k candidate matches, then the retrieval time is 70 · m · k microseconds. For the stock
[Figure 14 plots fast ranking (horizontal axes) against perfect ranking (vertical axes) for (a) four-leg, (b) six-leg, and (c) thirty-leg patterns, each at C = D = 1.5, 2, and 5; the corresponding retrieval times range from 0.07 to 2.50 seconds.]
Fig. 14. Retrieval of stock charts. The horizontal axes show the ranks of matches
retrieved by the fast algorithm. The vertical axes are the ranks assigned to the same
matches by the exhaustive search. If the fast algorithm has found all matches, the graph
is a forty-five-degree line; otherwise, it is steeper.
database with 60,000 points, the retrieval takes from 0.1 to 2.5 seconds. For
the database of air and sea temperatures, which includes 450,000 points,
the time is between 1 and 10 seconds.
6. Concluding Remarks
The main results include a procedure for compressing time series, indexing
of compressed series by their prominent features, and retrieval of series
[Figure 15 plots fast ranking (horizontal axes) against perfect ranking (vertical axes) for (a) twelve-leg patterns of air and sea temperatures, (b) nine-leg patterns of wind speeds, and (c) twenty-leg electroencephalogram patterns, each at C = D = 1.5, 2, and 5; the corresponding retrieval times range from 0.05 to 9.43 seconds.]
Fig. 15. Retrieval of weather and electroencephalogram patterns. The horizontal axes
show the similarity ranks assigned by the fast algorithm, and the vertical axes are the
exhaustive-search ranks.
whose compressed representation is similar to the compressed pattern.
The experiments have shown the effectiveness of this technique for indexing of stock prices, weather data, and electroencephalograms. We plan to apply it to other time-series domains and study the factors that affect its effectiveness.
We are working on an extended version of the compression procedure,
which will assign diﬀerent importance levels to the extrema of time series,
Table 5. Average number of the candidate segments that match a pattern’s prominent leg. The retrieval algorithm identifies these candidates and then compares them with the pattern. The number of candidates depends on the search parameters C and D.

Search         Stock    Air and sea     Wind      EEG
parameters     prices   temperatures    speeds
C = D = 1.5      270       1,300           970      40
C = D = 2        440       2,590         1,680      70
C = D = 5      1,090      11,230         2,510     220
and allow construction of a hierarchical indexing structure [Gandhi (2003)].
We also aim to extend the developed technique to finding patterns that are stretched over time, and to apply it to identifying periodic patterns, such as weather cycles.
Acknowledgements
We are grateful to Mark Last, Eamonn Keogh, Dmitry Goldgof, and Rafael
Perez for their valuable comments and suggestions. We also thank Savvas
Nikiforou for his comments and help with software installations.
References
1. Aggarwal, C.C. and Yu, Ph.S. (2000). The IGrid Index: Reversing the Dimensionality Curse for Similarity Indexing in High-Dimensional Space. Proceedings of the Sixth ACM International Conference on Knowledge Discovery and Data Mining, pp. 119–129.
2. Agrawal, R., Psaila, G., Wimmers, E.L., and Zait, M. (1995). Querying Shapes of Histories. Proceedings of the Twenty-First International Conference on Very Large Data Bases, pp. 502–514.
3. Agrawal, R., Mehta, M., Shafer, J.C., Srikant, R., Arning, A., and Bollinger,
T. (1996). The Quest Data Mining System. Proceedings of the Second ACM
International Conference on Knowledge Discovery and Data Mining, pp.
244–249.
4. André-Jönsson, H. and Badal, D.Z. (1997). Using Signature Files for Querying Time-Series Data. Proceedings of the First European Symposium on Principles of Data Mining and Knowledge Discovery, pp. 211–220.
5. Bollobas, B., Das, G., Gunopulos, D., and Mannila, H. (1997). Time-Series Similarity Problems and Well-Separated Geometric Sets. Proceedings of the Thirteenth Annual Symposium on Computational Geometry, pp. 454–456.
6. Bozkaya, T. and Özsoyoglu, Z.M. (1999). Indexing Large Metric Spaces for Similarity Search Queries. ACM Transactions on Database Systems, 24(3), 361–404.
7. Bozkaya, T., Yazdani, N., and Özsoyoglu, Z.M. (1997). Matching and Indexing Sequences of Different Lengths. Proceedings of the Sixth International Conference on Information and Knowledge Management, pp. 128–135.
8. Brockwell, P.J. and Davis, R.A. (1996). Introduction to Time Series and Forecasting. Springer-Verlag, Berlin, Germany.
9. Burrus, C.S., Gopinath, R.A., and Guo, H. (1997). Introduction to Wavelets and Wavelet Transforms: A Primer. Prentice Hall, Englewood Cliffs, NJ.
10. Caraca-Valente, J.P. and Lopez-Chavarrias, I. (2000). Discovering Similar Patterns in Time Series. Proceedings of the Sixth ACM International Conference on Knowledge Discovery and Data Mining, pp. 497–505.
11. Chan, K.-P. and Fu, A.W.-C. (1999). Efficient Time Series Matching by Wavelets. Proceedings of the Fifteenth International Conference on Data Engineering, pp. 126–133.
12. Chan, K.-P., Fu, A.W.-C., and Yu, C. (2003). Haar Wavelets for Efficient Similarity Search of Time-Series: With and Without Time Warping. IEEE Transactions on Knowledge and Data Engineering, Forthcoming.
13. Chi, E.H.-H., Barry, Ph., Shoop, E., Carlis, J.V., Retzel, E., and Riedl, J. (1995). Visualization of Biological Sequence Similarity Search Results. Proceedings of the IEEE Conference on Visualization, pp. 44–51.
14. Cortes, C., Fisher, K., Pregibon, D., and Rogers, A. (2000). Hancock: A Language for Extracting Signatures from Data Streams. Proceedings of the Sixth ACM International Conference on Knowledge Discovery and Data Mining, pp. 9–17.
15. Das, G., Gunopulos, D., and Mannila, H. (1997). Finding Similar Time Series.
Proceedings of the First European Conference on Principles of Data Mining
and Knowledge Discovery, pp. 88–100.
16. Das, G., Lin, K.-I., Mannila, H., Renganathan, G., and Smyth, P. (1998). Rule Discovery from Time Series. Proceedings of the Fourth ACM International Conference on Knowledge Discovery and Data Mining, pp. 16–22.
17. Deng, K. (1998). OMEGA: On-Line Memory-Based General Purpose System Classifier. PhD thesis, Robotics Institute, Carnegie Mellon University. Technical Report CMU-RI-TR-98-33.
18. Domingos, P. and Hulten, G. (2000). Mining High-Speed Data Streams. Proceedings of the Sixth ACM International Conference on Knowledge Discovery and Data Mining, pp. 71–80.
19. Edelsbrunner, H. (1981). A Note on Dynamic Range Searching. Bulletin of the European Association for Theoretical Computer Science, 15, 34–40.
20. Fountain, T., Dietterich, T.G., and Sudyka, B. (2000). Mining IC Test Data to Optimize VLSI Testing. Proceedings of the Sixth ACM International Conference on Knowledge Discovery and Data Mining, pp. 18–25.
21. Franses, P.H. (1998). Time Series Models for Business and Economic Forecasting. Cambridge University Press, Cambridge, United Kingdom.
22. Galka, A. (2000). Topics in Nonlinear Time Series Analysis with Implications for EEG Analysis. World Scientific, Singapore.
23. Gandhi, H.S. (2003). Important Extrema of Time Series: Theory and Applications. Master’s thesis, Computer Science and Engineering, University of South Florida. Forthcoming.
24. Gavrilov, M., Anguelov, D., Indyk, P., and Motwani, R. (2000). Mining the Stock Market: Which Measure is Best? Proceedings of the Sixth ACM International Conference on Knowledge Discovery and Data Mining, pp. 487–496.
25. Geva, A.B. (1999). Hierarchical-Fuzzy Clustering of Temporal Patterns and its Application for Time-Series Prediction. Pattern Recognition Letters, 20(14), 1519–1532.
26. Goldin, D.Q. and Kanellakis, P.C. (1995). On Similarity Queries for Time-Series Data: Constraint Specification and Implementation. Proceedings of the First International Conference on Principles and Practice of Constraint Programming, pp. 137–153.
27. Graps, A.L. (1995). An Introduction to Wavelets. IEEE Computational Science and Engineering, 2(2), 50–61.
28. Grenander, U. (1996). Elements of Pattern Theory, Johns Hopkins University
Press, Baltimore, MD.
29. Guralnik, V. and Srivastava, J. (1999). Event Detection from Time Series Data. Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 33–42.
30. Guralnik, V., Wijesekera, D., and Srivastava, J. (1998). Pattern-Directed Mining of Sequence Data. Proceedings of the Fourth ACM International Conference on Knowledge Discovery and Data Mining, pp. 51–57.
31. Gusfield, D. (1997). Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology. Cambridge University Press, Cambridge, United Kingdom.
32. Han, J., Gong, W., and Yin, Y. (1998). Mining Segment-Wise Periodic Patterns in Time-Related Databases. Proceedings of the Fourth ACM International Conference on Knowledge Discovery and Data Mining, pp. 214–218.
33. Haslett, J. and Raftery, A.E. (1989). Space-Time Modeling with Long-Memory Dependence: Assessing Ireland’s Wind Power Resource. Applied Statistics, 38, 1–50.
34. Huang, Y.W. and Yu, P.S. (1999). Adaptive Query Processing for Time-Series Data. Proceedings of the Fifth International Conference on Knowledge Discovery and Data Mining, pp. 282–286.
35. Ikeda, K., Vaughn, B.V., and Quint, S.R. (1999). Wavelet Decomposition of Heart Period Data. Proceedings of the IEEE First Joint BMES/EMBS Conference, pp. 3–11.
36. Keogh, E.J. and Pazzani, M.J. (1998). An Enhanced Representation of Time Series which Allows Fast and Accurate Classification, Clustering and Relevance Feedback. Proceedings of the Fourth ACM International Conference on Knowledge Discovery and Data Mining, pp. 239–243.
37. Keogh, E.J., and Pazzani, M.J. (2000). Scaling up Dynamic Time
Warping for Data Mining Applications. Proceedings of the Sixth ACM
International Conference on Knowledge Discovery and Data Mining,
pp. 285–289.
38. Keogh, E.J., Chu, S., Hart, D., and Pazzani, M.J. (2001). An Online Algorithm for Segmenting Time Series. Proceedings of the IEEE International Conference on Data Mining, pp. 289–296.
39. Keogh, E.J. (1997). Fast Similarity Search in the Presence of Longitudinal Scaling in Time Series Databases. Proceedings of the Ninth IEEE International Conference on Tools with Artificial Intelligence, pp. 578–584.
40. Kim, S.-W., Park, S., and Chu, W.W. (2001). An Index-Based Approach for Similarity Search Supporting Time Warping in Large Sequence Databases. Proceedings of the Seventeenth International Conference on Data Engineering, pp. 607–614.
41. Lam, S.K. and Wong, M.H. (1998). A Fast Projection Algorithm for Sequence
Data Searching. Data and Knowledge Engineering, 28(3), 321–339.
42. Last, M., Klein, Y., and Kandel, A. (2001). Knowledge Discovery in Time
Series Databases. IEEE Transactions on Systems, Man, and Cybernetics,
Part B, 31(1), 160–169.
43. Lee, S.-L., Chun, S.-J., Kim, D.-H., Lee, J.-H., and Chung, C.-W. (2000). Similarity Search for Multidimensional Data Sequences. Proceedings of the Sixteenth International Conference on Data Engineering, pp. 599–608.
44. Li, C.-S., Yu, P.S., and Castelli, V. (1998). MALM: A Framework for Mining Sequence Database at Multiple Abstraction Levels. Proceedings of the Seventh International Conference on Information and Knowledge Management, pp. 267–272.
45. Lin, L. and Risch, T. (1998). Querying Continuous Time Sequences. Proceedings of the Twenty-Fourth International Conference on Very Large Data Bases, pp. 170–181.
46. Lu, H., Han, J., and Feng, L. (1998). Stock Movement Prediction and N-Dimensional Inter-Transaction Association Rules. Proceedings of the ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery, pp. 12:1–7.
47. Maurer, K. and Dierks, T. (1991). Atlas of Brain Mapping: Topographic Mapping of EEG and Evoked Potentials. Springer-Verlag, Berlin, Germany.
48. Niedermeyer, E. and Lopes da Silva, F. (1999). Electroencephalography: Basic Principles, Clinical Applications, and Related Fields. Lippincott, Williams and Wilkins, Baltimore, MD, fourth edition.
49. Park, S., Lee, D., and Chu, W.W. (1999). Fast Retrieval of Similar Subsequences in Long Sequence Databases. Proceedings of the Third IEEE Knowledge and Data Engineering Exchange Workshop.
50. Park, S., Chu, W.W., Yoon, J., and Hsu, C. (2000). Efficient Searches for Similar Subsequences of Different Lengths in Sequence Databases. Proceedings of the Sixteenth International Conference on Data Engineering, pp. 23–32.
51. Park, S., Kim, S.-W., and Chu, W.W. (2001). Segment-Based Approach for Subsequence Searches in Sequence Databases. Proceedings of the Sixteenth ACM Symposium on Applied Computing, pp. 248–252.
52. Perng, C.-S., Wang, H., Zhang, S.R., and Scott Parker, D. (2000). Landmarks: A New Model for Similarity-Based Pattern Querying in Time Series Databases. Proceedings of the Sixteenth International Conference on Data Engineering, pp. 33–42.
53. Policker, S. and Geva, A.B. (2000). Non-Stationary Time Series Analysis by Temporal Clustering. IEEE Transactions on Systems, Man, and Cybernetics, Part B, 30(2), 339–343.
54. Pratt, K.B. and Fink, E. (2002). Search for Patterns in Compressed Time Series. International Journal of Image and Graphics, 2(1), 89–106.
55. Pratt, K.B. (2001). Locating Patterns in Discrete Time Series. Master’s thesis, Computer Science and Engineering, University of South Florida.
56. Qu, Y., Wang, C., and Wang, X.S. (1998). Supporting Fast Search in Time Series for Movement Patterns in Multiple Scales. Proceedings of the Seventh International Conference on Information and Knowledge Management, pp. 251–258.
57. Sahoo, P.K., Soltani, S., Wong, A.K.C., and Chen, Y.C. (1988). A Survey of
Thresholding Techniques. Computer Vision, Graphics and Image Processing,
41, 233–260.
58. Samet, H. (1990). The Design and Analysis of Spatial Data Structures. Addison-Wesley, Reading, MA.
59. Sheikholeslami, G., Chatterjee, S., and Zhang, A. (1998). WaveCluster: A Multi-Resolution Clustering Approach for Very Large Spatial Databases. Proceedings of the Twenty-Fourth International Conference on Very Large Data Bases, pp. 428–439.
60. Singh, S. and McAtackney, P. (1998). Dynamic Time-Series Forecasting Using Local Approximation. Proceedings of the Tenth IEEE International Conference on Tools with Artificial Intelligence, pp. 392–399.
61. Stoffer, D.S. (1999). Detecting Common Signals in Multiple Time Series Using the Spectral Envelope. Journal of the American Statistical Association, 94, 1341–1356.
62. Yaffee, R.A. and McGee, M. (2000). Introduction to Time Series Analysis and Forecasting. Academic Press, San Diego, CA.
63. Yi, B.-K., Sidiropoulos, N.D., Johnson, T., Jagadish, H.V., Faloutsos, C., and Biliris, A. (2000). Online Data Mining for Co-Evolving Time Series. Proceedings of the Sixteenth International Conference on Data Engineering, pp. 13–22.
April 26, 2004 16:8 WSPC/Trim Size: 9in x 6in for Review Volume chap04
CHAPTER 4
INDEXING TIME-SERIES UNDER CONDITIONS OF NOISE
Michail Vlachos and Dimitrios Gunopulos
Department of Computer Science and Engineering
Bourns College of Engineering
University of California, Riverside, Riverside, CA 92521, USA
Email: {mvlachos, dg}@cs.ucr.edu
Gautam Das
Microsoft Research, One Microsoft Way, Redmond, WA 98052, USA
Email: gautamd@microsoft.com
We present techniques for the analysis and retrieval of time-series under conditions of noise. This is an important topic because the data obtained using various sensors (examples include GPS data or video tracking data) are typically noisy. The performance of previously used measures is generally degraded under noisy conditions. Here we formalize non-metric similarity functions based on the Longest Common Subsequence that are very robust to noise. Furthermore, they provide an intuitive notion of similarity between time-series by giving more weight to the similar portions of the sequences. Stretching of sequences in time is allowed, as well as global translation of the sequences in space. Efficient approximate algorithms that compute these similarity measures are also provided. We compare these new methods to the widely used Euclidean and Time Warping distance functions (for real and synthetic data) and show the superiority of our approach, especially under the strong presence of noise. We prove a weaker version of the triangle inequality and employ it in an indexing structure to answer nearest neighbor queries. Finally, we present experimental results that validate the accuracy and efficiency of our approach.
Keywords: Longest Common Subsequence; time-series; spatio-temporal; outliers; time warping.
1. Introduction
We consider the problem of discovering similar time-series, especially under the presence of noise. Time-series data come up in a variety of domains, including stock market analysis, environmental data, telecommunication data, medical and financial data. Web data that count the number of clicks on given sites, or model the usage of different pages, are also modeled as time series.
In the last few years, the advances in mobile computing, sensor and GPS technology have made it possible to collect large amounts of spatio-temporal data, and there is increasing interest in performing data analysis tasks over these data [4]. For example, in mobile computing, users equipped with mobile devices move in space and register their location at different time instants to spatio-temporal databases via wireless links. In environmental information systems, tracking animals and weather conditions is very common, and large datasets can be created by storing locations of observed objects over time. Other examples of the data that we are interested in include features extracted from video clips, sign language recognition, and so on.
Data analysis on such data includes determining and finding objects that moved in a similar way or followed a certain motion pattern. Therefore, the objective is to cluster different objects into similar groups, or to classify an object based on a set of known examples. The problem is hard, because the similarity model should allow for imprecise matches.
In general, the time-series will be obtained during a tracking procedure, with the aid of devices such as GPS transmitters, data gloves, etc. Here also
[Figure 1 plots the X-movement over time of two video-tracked series, labeled 'athens 1' and 'athens 2', with outliers marked at the start and end of each series.]
Fig. 1. Examples of video-tracked data representing two instances of the word ‘athens’. The start and end contain many outliers.
lies the main obstacle of such data: they may contain a significant number of outliers (Figure 1), in other words, incorrect data measurements. We argue that the use of non-metric distance functions is preferable, since such functions can effectively ‘ignore’ the noisy sections of the sequences.
The rest of the chapter is organized as follows. In Section 2 we review the most prevalent similarity measures used in time-series databases, demonstrate the need for non-metric distance functions, and consider related work. In Section 3 we formalize the new similarity functions by extending the Longest Common Subsequence (LCSS) model. Section 4 demonstrates efficient algorithms to compute these functions, and Section 5 elaborates on the indexing structure. Section 6 provides the experimental validation of the accuracy and efficiency of the proposed approach. Finally, Section 7 concludes the chapter.
2. Background
Indexing of time-series has attracted great attention in the research community, especially in the last decade, and this can partly be attributed to the explosion of database sizes. Characteristic examples are environmental data collected on a daily basis or satellite image databases of the earth.¹ If we would like to allow the user to explore these vast databases, we should organize the data in such a way that one can retrieve the data of interest accurately and efficiently.
More specifically, our objective is the automatic classification of time-series using Nearest Neighbor Classification (NNC). In NNC the time-series query is classified according to the majority of its nearest neighbors. The NNC is conceptually a simple technique, but provides very good results in practical situations. This classification technique is particularly suited for our setting, since we are going to use a non-metric distance function. Furthermore, the technique has good theoretical properties; it has been shown that the one-nearest-neighbor rule has an asymptotic error rate that is at most twice the Bayes error rate [13].
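The NNC scheme described above reduces to a linear scan over the labeled series under any distance function, metric or not. A minimal sketch (the function names and the Manhattan toy distance are illustrative assumptions, not part of the chapter):

```python
from collections import Counter

def knn_classify(query, labeled_series, dist, k=1):
    """Classify `query` by majority vote among its k nearest neighbors.
    `labeled_series` is a list of (series, label) pairs; `dist` may be
    any distance function -- metric or not."""
    neighbors = sorted(labeled_series, key=lambda sl: dist(query, sl[0]))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Toy usage with the Manhattan distance on equal-length series.
manhattan = lambda a, b: sum(abs(x - y) for x, y in zip(a, b))
data = [([0, 1, 2], 'rising'), ([2, 1, 0], 'falling'), ([0, 2, 4], 'rising')]
print(knn_classify([1, 2, 3], data, manhattan, k=3))   # -> 'rising'
```

Because the classifier only ranks distances, swapping in a non-metric function such as an LCSS-based distance requires no change to this loop.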
So, the problem we consider is: “Given a database D of time-series and a query Q (not already in the database), find the sequence T that is closest to Q.” We need to define the following:
1. A realistic distance function that will match the user’s perception of
what is considered similar.
¹ http://www.terraserver.com
2. An eﬃcient indexing scheme, which will speed up the user queries.
We will brieﬂy discuss some issues associated with these two topics.
2.1. Time Series Similarity Measures
The simplest approach to define the distance between two sequences is to map each sequence into a vector and then use a p-norm to calculate their distance. The p-norm distance between two n-dimensional vectors x and y is defined as:

L_p(x, y) = ( Σ_{i=1}^{n} |x_i − y_i|^p )^{1/p}

For p = 2 it is the well-known Euclidean distance, and for p = 1 the Manhattan distance.
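The formula above translates directly into code; a small sketch (the function name is ours):

```python
def lp_distance(x, y, p):
    """p-norm distance between two equal-length sequences:
    (sum_i |x_i - y_i|^p)^(1/p)."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)

x, y = [1.0, 2.0, 3.0], [4.0, 6.0, 3.0]
print(lp_distance(x, y, 1))   # Manhattan: |1-4| + |2-6| + |3-3| = 7.0
print(lp_distance(x, y, 2))   # Euclidean: sqrt(9 + 16 + 0) = 5.0
```

Note that the definition assumes both sequences have the same length n, which is one of the limitations discussed next.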
Most of the related work on time-series has concentrated on the use of some metric L_p norm. The advantage of this simple model is that it allows efficient indexing by a dimensionality reduction technique [2,15,19,44]. On the other hand, the model cannot deal well with outliers and is very sensitive to small distortions in the time axis. There are a number of interesting extensions to the above model to support various transformations such as scaling [10,36], shifting [10,21], normalization [21] and moving average [36]. Other recent works on indexing time series data for similarity queries assuming the Euclidean model include [25,26]. A domain-independent framework for defining queries in terms of similarity of objects is presented in [24]. In [29], Lee et al. propose methods to index sequences of multidimensional points. The similarity model is based on the Euclidean distance, and they extend the ideas presented by Faloutsos et al. in [16] by computing the distances between multidimensional Minimum Bounding Rectangles.
Another, more flexible way to describe similar but out-of-phase sequences can be achieved by using Dynamic Time Warping (DTW) [38]. Berndt and Clifford [5] were the first to introduce this measure in the data-mining community. Recent works on DTW are [27,31,45]. DTW was first used to match signals in speech recognition [38]. DTW between two sequences A and B is described by the following equation:

DTW(A, B) = D_base(a_m, b_n) + min{DTW(Head(A), Head(B)), DTW(Head(A), B), DTW(A, Head(B))}

where D_base is some L_p norm and Head(A) of a sequence A is all the elements of A except for a_m, the last one.
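The recurrence above can be evaluated bottom-up with dynamic programming instead of recursion. A sketch, using the absolute difference as our choice of D_base:

```python
def dtw(A, B, dbase=lambda a, b: abs(a - b)):
    """Bottom-up evaluation of the DTW recurrence.
    D[i][j] is the DTW distance between the first i elements of A
    and the first j elements of B; `dbase` is the base distance."""
    INF = float('inf')
    m, n = len(A), len(B)
    D = [[INF] * (n + 1) for _ in range(m + 1)]
    D[0][0] = 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            # min over shrinking A, shrinking B, or both (the Head terms)
            D[i][j] = dbase(A[i - 1], B[j - 1]) + min(
                D[i - 1][j - 1], D[i - 1][j], D[i][j - 1])
    return D[m][n]

# An out-of-phase repetition costs nothing under DTW:
print(dtw([1, 2, 3], [1, 2, 2, 3]))   # -> 0.0
```

The table has O(mn) entries, each filled in constant time, which is the usual quadratic cost of DTW.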
Fig. 2. The support of out-of-phase matching is important. However, DTW matches all points (so, the outliers as well), thereby distorting the true distance between sequences. The LCSS model can support time-shifting and efficiently ignore the noisy parts.
The difficulties imposed by DTW include the fact that it is a non-metric distance function and also that its performance deteriorates in the presence of a large amount of outliers (Figure 2). Although the flexibility provided by DTW is very important, it is nonetheless not the appropriate distance function for noisy data, since by matching all the points, it also matches the outliers, distorting the true distance between the sequences.
Another technique to describe the similarity is to find the longest common subsequence (LCSS) of two sequences and then define the distance using the length of this subsequence [3,7,12,41]. The LCSS shows how well the two sequences can match one another if we are allowed to stretch them but cannot rearrange the sequence of values. Since the values are real numbers, we typically allow approximate matching, rather than exact matching. A technique similar to our work is that of Agrawal et al. [3]. In that paper the authors propose an algorithm for finding an approximation of the LCSS between two time-series, allowing local scaling (that is, different parts of each sequence can be scaled by a different factor before matching). In [7,12] fast probabilistic algorithms to compute the LCSS of two time series are presented.
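The basic ε-approximate LCSS described in this paragraph can be sketched with the standard dynamic program; this is the plain version only, not the extended model of Section 3, and the names are ours:

```python
def lcss(A, B, eps):
    """Length of the longest common subsequence of two real-valued
    series, where two elements match when they differ by at most eps."""
    m, n = len(A), len(B)
    L = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if abs(A[i - 1] - B[j - 1]) <= eps:   # approximate match
                L[i][j] = L[i - 1][j - 1] + 1
            else:
                L[i][j] = max(L[i - 1][j], L[i][j - 1])
    return L[m][n]

# A noisy spike in the middle of B costs only the unmatched point:
print(lcss([1.0, 2.0, 3.0], [1.1, 50.0, 2.1, 3.05], eps=0.2))   # -> 3
```

Unlike DTW, the outlier 50.0 is simply skipped rather than forcibly matched, which is exactly the robustness property motivating the LCSS-based measures of this chapter.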
Other techniques to define time series similarity are based on extracting certain features (landmarks [32] or signatures [14]) from each time-series and then using these features to define the similarity. An interesting approach to represent a time series using the direction of the sequence at regular time intervals is presented in [35]. Ge and Smyth [18] present an alternative approach for sequence similarity that is based on probabilistic matching, using Hidden Markov Models. However, the scalability properties of this approach have not been investigated yet. A recent work that proposes a method to cluster trajectory data is due to Gaffney and Smyth [17]. They use a variation of the EM (expectation maximization) algorithm to cluster small sets of trajectories. However, their method is a model-based approach that usually has scalability problems.
The paper most related to our work is that of Bozkaya et al. [8]. They discuss how to define similarity measures for sequences of multidimensional points using a restricted version of the edit distance, which is equivalent to the LCSS. Also, they present two efficient methods to index the sequences for similarity retrieval. However, they focus on sequences of feature vectors extracted from images, and they do not discuss transformations or approximate methods to compute the similarity.
Lately, there has been some work on indexing moving objects to answer spatial proximity queries (range and nearest neighbor queries) [1,28,39]. Also, in [33], Pfoser et al. present index methods to answer topological and navigational queries in a database that stores trajectories of moving objects. However, these works do not consider a global similarity model between sequences; they concentrate on finding objects that are close to query locations during a time instant or time period that is also specified by the query.
2.2. Indexing Time Series
Indexing refers to the organization of data in such a way as not only to group together similar items, but also to facilitate the pruning of irrelevant data. The second part is greatly affected by whether our distance function is metric or not.
Definition 1. A distance function d(x, y) between two objects x and y is
a metric when it satisfies the following properties:
1. Positivity, d(x, y) ≥ 0 and d(x, y) = 0 iﬀ x = y.
2. Symmetry, d(x, y) = d(y, x).
3. Triangle Inequality, d(x, y) + d(y, z) ≥ d(x, z).
Example: The Euclidean distance is a metric distance function. The
pruning power of such a function can be shown in the following example
(Figure 3).

Fig. 3. Pruning power of the triangle inequality. (Pairwise distances:
D(Seq1, Seq2) = 20, D(Seq1, Seq3) = 110, D(Seq2, Seq3) = 90.)

Assume we have a set of sequences, including sequences Seq1,
Seq2, and Seq3, and we have already recorded their pairwise distances in a
table. Suppose we pose a query Q, and we want to ﬁnd the closest sequence
to Q. Assume we have already found a sequence, best-match-so-far, whose
distance from Q is 20. Comparing the query sequence Q to Seq2 yields
D(Q, Seq2) = 150. However, because D(Seq2,Seq1) = 20 and it is true that:
D(Q,Seq1) ≥ D(Q,Seq2) − D(Seq2,Seq1) ⇒ D(Q,Seq1) ≥ 150 − 20 = 130
we can safely prune Seq1, since it is not going to oﬀer a better solution.
Similarly, we can prune Seq3.
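The pruning logic of this example can be sketched as follows (an illustrative fragment; the function name is ours and the distances are the hypothetical values of Figure 3):

```python
def can_prune(d_query_seen, d_seen_other, best_so_far):
    """Lower-bound D(Q, other) via the triangle inequality:
    D(Q, other) >= |D(Q, seen) - D(seen, other)|.  If even this lower
    bound exceeds the best match found so far, 'other' can be skipped
    without ever computing D(Q, other)."""
    return abs(d_query_seen - d_seen_other) > best_so_far

# Values from the example: D(Q, Seq2) = 150, best-match-so-far distance = 20.
print(can_prune(150, 20, 20))   # Seq1: bound 150 - 20 = 130 > 20 -> prune
print(can_prune(150, 90, 20))   # Seq3: bound 150 - 90 = 60  > 20 -> prune
```

Note that the bound only requires the precomputed pairwise table and one distance computation against the query, which is exactly what makes metric indexes efficient.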
We are going to use non-metric distance functions. This choice imposes
difficulties on how to prune data in an index, since we cannot apply the
technique outlined above. As a result, it is much more difficult to design
efficient indexing techniques for non-metric distance functions. In the next
section we show that, despite this fundamental drawback, it is important to
use non-metric distance functions under conditions of noise. In Section 5 we
will also show that the proper choice of a distance measure can overcome
this drawback to some degree.
2.3. Motivation for Non-Metric Distance Functions
Distance functions that are robust to extremely noisy data will typically
violate the triangle inequality. These functions achieve this by not
considering the most dissimilar parts of the objects. Moreover, they are
useful because they accurately model human perception: when comparing any
kind of data (images, time series, etc.), we mostly focus
on the portions that are similar and are willing to pay less attention to
regions of great dissimilarity.
Non-metric distances are nowadays used in many domains, such as string
(DNA) matching, collaborative filtering (where customers are matched
with stored 'prototypical' customers), and retrieval of similar images from
databases. Furthermore, psychology research suggests that human similarity
judgments are also non-metric.
For this kind of data we need distance functions that can address the
following issues:
• Different sampling rates or different speeds. The time series that
we obtain are not guaranteed to be the outcome of sampling at fixed
time intervals. The sensors collecting the data may fail for some period
of time, leading to inconsistent sampling rates. Moreover, two time series
moving in exactly the same way, but with one moving at twice the speed of
the other, will most probably result in a very large Euclidean distance.
• Outliers. Noise might be introduced due to an anomaly in the sensor
collecting the data, or can be attributed to human 'failure' (e.g. jerky
movement during a tracking process). In this case the Euclidean distance
will fail completely and yield a very large distance, even though the
difference may be confined to only a few points.
• Different lengths. The Euclidean distance is defined for time series of
equal length. For different lengths we have to decide whether to
truncate the longer series, pad the shorter one with zeros, etc. In general
its use becomes complicated and the notion of distance more vague.
• Efficiency. The similarity model has to be sufficiently complex to express
the user's notion of similarity, yet simple enough to allow efficient
computation of the similarity.
To cope with these challenges we use the Longest Common Subsequence
(LCSS) model. The LCSS is a variation of the edit distance [30,37]. The
basic idea is to match two sequences by allowing them to stretch, without
rearranging the sequence of the elements but allowing some elements to be
unmatched.
A simple extension of the LCSS model is not sufficient, because (for
example) it cannot deal with parallel movements. Therefore, we extend it
to address such problems: in our similarity model we consider a set of
translations, and we find the translation that yields the optimal solution
to the LCSS problem.
3. Similarity Measures Based on LCSS
3.1. Original Notion of LCSS
Suppose that we have two time series A = (a_1, a_2, . . . , a_n) and
B = (b_1, b_2, . . . , b_m). For A, let Head(A) = (a_1, a_2, . . . , a_{n-1}),
and similarly for B.
Definition 2. The LCSS between A and B is defined as follows:

               ( 0                              if A or B is empty,
LCSS(A, B) =   ( 1 + LCSS(Head(A), Head(B))     if a_n = b_m,
               ( max(LCSS(Head(A), B),
               (     LCSS(A, Head(B)))          otherwise.
The above definition is recursive and would require exponential time to
compute. However, a better solution can be obtained in O(m · n) time,
using dynamic programming.
Dynamic Programming Solution [11,42]
The LCSS problem can easily be solved in quadratic time and space. The
basic idea behind this solution is that the sequence-matching problem can
be dissected into smaller subproblems, which can be combined after they
are solved optimally. So, we solve a smaller instance of the problem (with
fewer points) and then continue by adding new points to the sequences,
updating the LCSS accordingly.
Now the solution can be found by solving the following equation using
dynamic programming (Figure 4):
              ( 0                                  if i = 0 or j = 0,
LCSS[i, j] =  ( 1 + LCSS[i-1, j-1]                 if a_i = b_j,
              ( max(LCSS[i-1, j], LCSS[i, j-1])    otherwise,
where LCSS[i, j] denotes the length of the longest common subsequence
between the first i elements of sequence A and the first j elements of
sequence B. Finally, LCSS[n, m] gives us the length of the longest common
subsequence between the two sequences A and B.
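The recurrence translates directly into a bottom-up dynamic program. A minimal sketch for exact element matching (the δ, ε extension follows in Section 3.2; the function name is ours):

```python
def lcss(a, b):
    """Length of the longest common subsequence of a and b,
    computed bottom-up in O(n*m) time and space."""
    n, m = len(a), len(b)
    L = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if a[i - 1] == b[j - 1]:
                L[i][j] = 1 + L[i - 1][j - 1]
            else:
                L[i][j] = max(L[i - 1][j], L[i][j - 1])
    return L[n][m]

print(lcss("ABCBDAB", "BDCABA"))  # -> 4 (e.g. "BCBA")
```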
The same dynamic programming technique can be employed in order to
ﬁnd the Warping Distance between two sequences.
Fig. 4. Solving the LCSS problem using dynamic programming. The gray area indicates
the elements that are examined if we confine our search window; the solution provided
is still the same.
3.2. Extending the LCSS Model
Having seen that there is an efficient way to compute the LCSS between
two sequences, we extend this notion to define a new, more flexible,
similarity measure. The LCSS model matches exact values; in our model,
however, we want to allow a more flexible matching between two sequences,
where values match when they are within a certain range. Moreover, in
certain applications, the stretching provided by the LCSS algorithm also
needs to stay within a certain range.
We assume that the measurements of the time series are taken at fixed and
discrete time intervals. If this is not the case, we can use interpolation
[23,34].
Definition 3. Given an integer δ and a real positive number ε, we define
LCSS_{δ,ε}(A, B) as follows:

                    ( 0                                  if A or B is empty,
                    ( 1 + LCSS_{δ,ε}(Head(A), Head(B))
LCSS_{δ,ε}(A, B) =  (     if |a_n - b_m| < ε and |n - m| ≤ δ,
                    ( max(LCSS_{δ,ε}(Head(A), B),
                    (     LCSS_{δ,ε}(A, Head(B)))        otherwise.
Fig. 5. The notion of the LCSS matching within a region of δ & ε for a sequence. The
points of the two sequences within the gray region can be matched by the extended LCSS
function.
The constant δ controls how far in time we can go in order to match a
given point from one sequence to a point in another sequence. The constant
ε is the matching threshold (see Figure 5).
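One straightforward way to realize Definition 3 is a plain O(n · m) dynamic program with matches restricted to the δ window and the ε region, as sketched below (for illustration only; the faster O(δ(n + m)) algorithm referenced in Section 4.1 is more involved, and the function name is ours):

```python
def lcss_de(a, b, delta, eps):
    """LCSS_{delta,eps}: a[i] and b[j] may match only when
    |a[i] - b[j]| < eps and the index offset |i - j| <= delta."""
    n, m = len(a), len(b)
    L = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if abs(i - j) <= delta and abs(a[i - 1] - b[j - 1]) < eps:
                L[i][j] = 1 + L[i - 1][j - 1]
            else:
                L[i][j] = max(L[i - 1][j], L[i][j - 1])
    return L[n][m]

# Two noisy versions of the same shape: all five points match within eps.
print(lcss_de([1.0, 2.0, 3.0, 2.0, 1.0],
              [1.1, 2.1, 2.9, 2.05, 1.0], delta=1, eps=0.2))  # -> 5
```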
Our first similarity function is based on the LCSS, and the idea is to
allow time stretching: objects that are close in space at different
time instants can be matched if the time instants are also close.
Definition 4. We define the similarity function S1 between two sequences
A and B, given δ and ε, as follows:

    S1(δ, ε, A, B) = LCSS_{δ,ε}(A, B) / min(n, m).
Essentially, under this measure, whenever a point finds a matching point
within the region defined by δ and ε, the LCSS is increased by one.
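A sketch of S1 (the LCSS computation is inlined so that the fragment is self-contained; 1 − S1 then gives the corresponding distance):

```python
def s1(a, b, delta, eps):
    """S1 = LCSS_{delta,eps}(a, b) / min(n, m); ranges in [0, 1]."""
    n, m = len(a), len(b)
    L = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if abs(i - j) <= delta and abs(a[i - 1] - b[j - 1]) < eps:
                L[i][j] = 1 + L[i - 1][j - 1]
            else:
                L[i][j] = max(L[i - 1][j], L[i][j - 1])
    return L[n][m] / min(n, m)

a = [1.0, 2.0, 3.0, 2.0]
b = [1.05, 2.1, 2.95, 2.0, 7.0]   # same shape plus one trailing outlier
print(s1(a, b, delta=1, eps=0.2))      # S1 -> 1.0: the outlier stays unmatched
print(1 - s1(a, b, delta=1, eps=0.2))  # distance -> 0.0
```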
We use the function S1 to define another, more flexible, similarity measure.
First, we consider the set of translations. A translation simply shifts a
sequence vertically, either up or down. Let F be the family of translations.
A function f_c belongs to F if f_c(A) = (a_1 + c, . . . , a_n + c). Next,
we define a second notion of similarity based on the above family of
functions.
Definition 5. Given δ, ε and the family F of translations, we define the
similarity function S2 between two sequences A and B as follows:

    S2(δ, ε, A, B) = max_{f_c ∈ F} S1(δ, ε, A, f_c(B)).
Fig. 6. Translation of sequence A.
The similarity functions S1 and S2 thus range from 0 to 1. Therefore, we
can define distance functions between two sequences as follows:
Definition 6. Given δ, ε and two sequences A and B, we define the following
distance functions:
D1(δ, ε, A, B) = 1 −S1(δ, ε, A, B)
and
D2(δ, ε, A, B) = 1 −S2(δ, ε, A, B)
Note that D1 and D2 are symmetric: LCSS_{δ,ε}(A, B) is equal to
LCSS_{δ,ε}(B, A), and the transformation that we use in D2 is translation,
which preserves the symmetry property.
By allowing translations, we can detect similarities between movements
that are parallel, but not identical. In addition, the LCSS model allows
stretching and displacement in time, so we can detect similarities in move
ments that happen with diﬀerent speeds, or at diﬀerent times. In Figure 6
we show an example where a sequence A matches another sequence B after
a translation is applied.
The similarity function S2 is a significant improvement over S1,
because: (i) we can now detect parallel movements, and (ii) normalization
does not guarantee that we will get the best match between two time series.
Usually, because of the significant amount of noise, the average value
and/or the standard deviation used in the normalization process can be
distorted, leading to improper translations.
3.3. Differences between DTW and LCSS
Time Warping and the LCSS share many similarities. Here, we argue that
the LCSS is a better similarity function for correctly identifying noisy
sequences, for the following reasons:
1. Considering that a large portion of the sequences may be just outliers,
we need a similarity function that is robust under noisy conditions and
does not match the incorrect parts. This property of the LCSS is depicted
in Figure 7. Time Warping, by matching all elements, will also try to
match the outliers, which will most likely distort the real distance
between the examined sequences.
In Figure 8 we can see an example of a hierarchical clustering produced
by the DTW and the LCSS distances between four time series.
Fig. 7. Using the LCSS we only match the similar portions, avoiding the outliers.
Fig. 8. Hierarchical clustering of time series with signiﬁcant amount of outliers. Left:
The presence of many outliers in the beginning and the end of the sequences leads
to incorrect clustering. DTW is not robust under noisy conditions. Right: The LCSS
focusing on the common parts achieves the correct clustering.
Fig. 9. Left: Two sequences and their mean values. Right: After normalization.
Obviously, an even better matching can be found for the two sequences.
The sequences represent data collected through a video tracking process
(see Section 6). The DTW fails to distinguish the two classes of words, due
to the great number of outliers, especially at the beginning and the end of
the sequences. Using the Euclidean distance we obtain even worse results.
Using the LCSS similarity measure we obtain the most intuitive clustering,
as shown in the same figure. Even though the ending portion of the
Boston 2 time series differs significantly from the Boston 1 sequence, the
LCSS correctly focuses on the start of the sequence, therefore producing
the correct grouping of the four time series.
2. Simply normalizing the time series (by subtracting the average value)
does not guarantee that we will achieve the best match (Figure 9).
However, we show in the following section that we can try a set of
translations that provably gives us the optimal matching (or close to
optimal, within some user-defined error bound).
4. Efficient Algorithms to Compute the Similarity
4.1. Computing the Similarity Function S1
To compute the similarity function S1 we have to run an LCSS computation
for the two sequences. The LCSS can be computed by a dynamic programming
algorithm in O(n · m) time. However, we only allow matchings when the
difference in the indices is at most δ, and this allows the use of a
faster algorithm. The following lemma has been shown in [12].
Lemma 1. Given two sequences A and B, with |A| = n and |B| = m, we
can find the LCSS_{δ,ε}(A, B) in O(δ(n + m)) time.
If δ is small, the dynamic programming algorithm is very eﬃcient.
However, for some applications δ may need to be large. In this case, we
can speedup the above computation using random sampling. Given two
sequences A and B, we compute two subsets RA and RB by sampling each
sequence. Then we use the dynamic programming algorithm to compute
the LCSS on RA and RB. We can show that, with high probability, the
result of the algorithm over the samples is a good approximation of the
actual value. We describe this technique in detail in [40].
4.2. Computing the Similarity Function S2
We now consider the more complex similarity function S2. Here, given
two sequences A, B and constants δ, ε, we have to find the translation f_c
that maximizes the length of the longest common subsequence of A and
f_c(B), i.e. LCSS_{δ,ε}(A, f_c(B)), over all possible translations.
A one-dimensional translation f_c is a function that adds a constant to
all the elements of a 1-dimensional sequence:
f_c(x_1, . . . , x_m) = (x_1 + c, . . . , x_m + c).
Let the lengths of sequences A and B be n and m respectively. Let us
also assume that f_{c1} is the translation that, when applied to B, gives
a longest common subsequence LCSS_{δ,ε}(A, f_{c1}(B)) = a, and that it is
also the translation that maximizes the length of the longest common
subsequence:

    LCSS_{δ,ε}(A, f_{c1}(B)) = max_{c ∈ R} LCSS_{δ,ε}(A, f_c(B)).
The key observation is that, although there is an infinite number of
translations that we can apply to B, each translation f_c results in a
longest common subsequence between A and f_c(B), and there is only a finite
set of possible longest common subsequences. In this section we show that
we can efficiently enumerate a finite set of translations, such that this
set provably includes a translation that maximizes the length of the
longest common subsequence of A and f_c(B).
A translation by c, applied to B, can be thought of as a linear
transformation of the form f(b_i) = b_i + c. Such a transformation will
allow b_i to be matched to all a_j for which |i - j| < δ and
a_j - ε ≤ f(b_i) ≤ a_j + ε.
It is instructive to view this as a stabbing problem: consider the
O(δ(n + m)) vertical line segments ((b_i, a_j - ε), (b_i, a_j + ε)), where
|i - j| < δ (Figure 10).
These line segments lie on a two-dimensional plane, where on the x-axis
we put elements of B and on the y-axis we put elements of A. For every
Fig. 10. An example of a translation. By applying the translation to a sequence and
executing LCSS for δ = 1 we can achieve perfect matching.
pair of elements b_i, a_j of B and A that are within δ positions of each
other (and can therefore be matched by the LCSS algorithm if their values
are within ε), we create a vertical line segment that is centered at the
point (b_i, a_j) and extends ε above and below this point. Since each
element of A can be matched with at most 2δ + 1 elements of B, the total
number of such line segments is O(δn).
A translation f_c in one dimension is a function of the form
f_c(b_i) = b_i + c. Therefore, in the plane we described above, f_c
corresponds to a line of slope 1. After translating B by f_c, an element
b_i of B can be matched to an element a_j of A if and only if the line
f_c(x) = x + c intersects the line segment ((b_i, a_j - ε), (b_i, a_j + ε)).
Therefore, each line of slope 1 defines a set of possible matchings between
the elements of sequences A and B. The number of intersected line segments
is actually an upper bound on the length of the longest common
subsequence, because the ordering of the elements is ignored. However, two
different translations can result in different longest common subsequences
only if the respective lines intersect different sets of line segments. For
example, the translations f_0(x) = x and f_2(x) = x + 2 in Figure 10
intersect different sets of line segments and result in longest common
subsequences of different lengths.
The following lemma gives a bound on the number of possible diﬀerent
longest common subsequences by bounding the number of possible diﬀerent
sets of line segments that are intersected by lines of slope 1.
Lemma 2. Given two one-dimensional sequences A and B, there are
O(δ(n + m)) lines of slope 1 that intersect different sets of line segments.
Proof: Let f_c(x) = x + c be a line of slope 1. If we move this line
slightly to the left or to the right, it still intersects the same set of
line segments, unless we cross an endpoint of a line segment. In that case,
the set of intersected line segments increases or decreases by one. There
are O(δ(n + m)) endpoints. A line of slope 1 that sweeps over all the
endpoints will therefore intersect at most O(δ(n + m)) different sets of
line segments during the sweep.
In addition, we can enumerate the O(δ(n + m)) translations that produce
different sets of potential matchings by finding the lines of slope 1 that
pass through the endpoints. Each such translation corresponds to a line
f_c(x) = x + c. This set of O(δ(n + m)) translations gives all possible
matchings for a longest common subsequence of A and B. Since running the
LCSS algorithm takes O(δ(n + m)) time, we have shown the following theorem:
Theorem 1. Given two sequences A and B, with |A| = n and |B| = m,
we can compute S2(δ, ε, A, B) in O((n + m)²δ²) time.
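A direct, unoptimized sketch of this construction: the endpoint values c = a_j − b_i ± ε partition the real line into intervals, and trying one representative translation per interval suffices to find the optimum (function names are ours, and the inner LCSS is the plain quadratic dynamic program rather than the faster algorithm of Lemma 1):

```python
def lcss_de(a, b, delta, eps):
    """LCSS with |i - j| <= delta and |a[i] - b[j]| < eps matching."""
    n, m = len(a), len(b)
    L = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if abs(i - j) <= delta and abs(a[i - 1] - b[j - 1]) < eps:
                L[i][j] = 1 + L[i - 1][j - 1]
            else:
                L[i][j] = max(L[i - 1][j], L[i][j - 1])
    return L[n][m]


def s2_exact(a, b, delta, eps):
    n, m = len(a), len(b)
    # Critical translations: lines of slope 1 through segment endpoints,
    # i.e. c = a[j] - b[i] +/- eps for index pairs within the delta window.
    crit = set()
    for i, bi in enumerate(b):
        for j in range(max(0, i - delta), min(n, i + delta + 1)):
            crit.add(a[j] - bi - eps)
            crit.add(a[j] - bi + eps)
    cs = sorted(crit)
    # One representative c from each open interval between critical values.
    trials = [cs[0] - 1.0, cs[-1] + 1.0]
    trials += [(cs[k] + cs[k + 1]) / 2.0 for k in range(len(cs) - 1)]
    best = 0
    for c in trials:
        best = max(best, lcss_de(a, [x + c for x in b], delta, eps))
    return best / min(n, m)

print(s2_exact([0.0, 1.0, 2.0], [5.0, 6.0, 7.0], delta=1, eps=0.1))  # -> 1.0
```

The two parallel sequences above are recognized as identical once the translation c = −5 (the midpoint of the interval between the critical values −5.1 and −4.9) is tried.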
4.3. An Eﬃcient Approximate Algorithm
Theorem 1 gives an exact algorithm for computing S2, but this algorithm
runs in quadratic time. In this section we present a much more eﬃcient
approximate algorithm. The key in our technique is that we can bound the
diﬀerence between the sets of line segments that diﬀerent lines of slope 1
intersect, based on how far apart the lines are.
Let us consider the O(δ(n + m)) translations that result in different
sets of intersected line segments. Each translation is a line of the form
f_c(x) = x + c. Let us sort these translations by c. For a given translation
f_c, let L_{f_c} be the set of line segments it intersects. The following
lemma shows that neighboring translations in this order intersect similar
sets of line segments.
Lemma 3. Let f_1(x) = x + c_1, . . . , f_N(x) = x + c_N be the different
translations for sequences A_x and B_x, where c_1 ≤ · · · ≤ c_N. Then the
symmetric difference |L_{f_i} ∆ L_{f_j}| ≤ |i - j|.
Proof: Without loss of generality, assume i < j. The difference between f_i
and f_{i+1} is simply that we cross one additional endpoint. This means that
we have to either add or subtract one line segment to L_{f_i} to get
L_{f_{i+1}}. By repeating this step (j - i) times we create L_{f_j}.
Fig. 11. If we can aﬀord to be within a certain error range then we don’t have to try
all translations.
We can now prove our main theorem:
Theorem 2. Given two sequences A and B, with |A| = n and |B| = m
(suppose m < n), and a constant 0 < β < 1, we can find an approximation
AS2_{δ,β}(A, B) of the similarity S2(δ, ε, A, B) such that
S2(δ, ε, A, B) − AS2_{δ,β}(A, B) < β, in O(nδ²/β) time.
Proof: Let a = S2(δ, ε, A, B). There exists a translation f_i such that
L_{f_i} is a superset of the matches in the optimal LCSS of A and B.
In addition, by the previous lemma, there are 2b translations
(f_{i−b}, . . . , f_{i+b}) that have at most b different matchings from the
optimal. Therefore, if we use the translations f_{ib}, for
i = 1, . . . , δn/b, in the ordering described above, we are within b
different matchings of the optimal matching of A and B (Figure 11). We can
find these translations in O(δn log n) time if we find and sort all the
translations.
Alternatively, we can find these translations in O((δn/b) · δn) time if we
run ⌈δn/b⌉ quantile operations. So we have a total of δn/b translations,
and setting b = βn completes the proof.
Given sequences A and B with lengths n and m respectively (m < n), and
constants δ, β, ε, the approximation algorithm works as follows:

(i) Find the set of all translations for sequences A and B.
(ii) Find the iβn-th quantiles of this set, for 1 ≤ i ≤ 2δ/β.
(iii) Run the LCSS_{δ,ε} algorithm on A and B for the computed set of
2δ/β translations.
(iv) Return the highest result.
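Steps (i)–(iv) might be sketched as follows (a loose illustration: we use the pair differences a_j − b_i as the translation set, and a simple every-k-th selection over the sorted set stands in for the exact quantile computation; names are ours):

```python
def lcss_de(a, b, delta, eps):
    """LCSS with |i - j| <= delta and |a[i] - b[j]| < eps matching."""
    n, m = len(a), len(b)
    L = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if abs(i - j) <= delta and abs(a[i - 1] - b[j - 1]) < eps:
                L[i][j] = 1 + L[i - 1][j - 1]
            else:
                L[i][j] = max(L[i - 1][j], L[i][j - 1])
    return L[n][m]


def s2_approx(a, b, delta, eps, beta):
    n, m = len(a), len(b)
    # (i) candidate translations: one per matchable index pair
    cs = sorted(a[j] - b[i]
                for i in range(m)
                for j in range(max(0, i - delta), min(n, i + delta + 1)))
    # (ii) thin the sorted set, keeping roughly every (beta*n)-th value
    step = max(1, int(beta * n))
    trials = cs[::step]
    # (iii)-(iv) run the LCSS for the surviving translations, keep the best
    best = max(lcss_de(a, [x + c for x in b], delta, eps) for c in trials)
    return best / min(n, m)
```

With a small β the thinning step keeps almost all candidate translations and the result coincides with the exact S2; larger β trades accuracy for speed, per Theorem 2.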
5. Indexing Trajectories for Similarity Retrieval
In this section we show how to use the hierarchical tree of a clustering
algorithm in order to eﬃciently answer nearest neighbor queries in a dataset
of sequences.
The distance function D2 is not a metric, because it does not obey
the triangle inequality. This makes the use of traditional indexing
techniques difficult. An example is shown in Figure 12, where we observe
that S1(δ, ε, A, B) = 1 and S1(δ, ε, B, C) = 1, so the respective distances
are both zero. However, obviously
D2(δ, ε, A, C) > D2(δ, ε, A, B) + D2(δ, ε, B, C) = 0 + 0.
We can, however, prove a weaker version of the triangle inequality, which
can help us avoid examining a large portion of the database objects. First
we define:

    LCSS_{δ,ε,F}(A, B) = max_{f_c ∈ F} LCSS_{δ,ε}(A, f_c(B)).

Clearly,

    D2(δ, ε, A, B) = 1 − LCSS_{δ,ε,F}(A, B) / min(|A|, |B|)
Fig. 12. An example where the triangle inequality does not hold for the new LCSS
model.
(as before, F is the set of translations). Now we can show the following
lemma:

Lemma 4. Given trajectories A, B, C:

    LCSS_{δ,2ε,F}(A, C) ≥ LCSS_{δ,ε,F}(A, B) + LCSS_{δ,ε,F}(B, C) − |B|,

where |B| is the length of sequence B.
Proof: Clearly, if an element of A can match an element of B within ε,
and the same element of B matches an element of C within ε, then the
element of A can also match the element of C within 2ε. Since there are at
least |B| − (|B| − LCSS_{δ,ε,F}(A, B)) − (|B| − LCSS_{δ,ε,F}(B, C))
elements of B that match both with elements of A and with elements of C,
it follows that

    LCSS_{δ,2ε,F}(A, C) ≥ |B| − (|B| − LCSS_{δ,ε,F}(A, B))
                              − (|B| − LCSS_{δ,ε,F}(B, C))
                        = LCSS_{δ,ε,F}(A, B) + LCSS_{δ,ε,F}(B, C) − |B|.
5.1. Indexing Structure
We first partition all the sequences into sets according to length, so that
the longest sequence in each set is at most a times the shortest (typically
we use a = 2). We apply a hierarchical clustering algorithm to each set,
and we use the tree that the algorithm produces as follows:
For every node C of the tree we store the medoid (M_C) of its cluster.
The medoid is the sequence that has the minimum distance (or maximum
LCSS) to every other sequence in the cluster: it achieves
max_{v_i ∈ C} min_{v_j ∈ C} LCSS_{δ,ε,F}(v_i, v_j). So, given the tree and
a query sequence Q, we want to decide whether to follow the subtree that is
rooted at C. From the previous lemma we know that for any sequence B in C:
    LCSS_{δ,ε,F}(B, Q) ≤ |B| + LCSS_{δ,2ε,F}(M_C, Q) − LCSS_{δ,ε,F}(M_C, B)
or, in terms of distance:

    D2(δ, ε, B, Q) = 1 − LCSS_{δ,ε,F}(B, Q) / min(|B|, |Q|)
                   ≥ 1 − |B| / min(|B|, |Q|)
                       − LCSS_{δ,2ε,F}(M_C, Q) / min(|B|, |Q|)
                       + LCSS_{δ,ε,F}(M_C, B) / min(|B|, |Q|).
In order to obtain a lower bound we have to maximize the expression
|B| − LCSS_{δ,ε,F}(M_C, B). Therefore, for every node of the tree, along
with the medoid, we keep the sequence r_C that maximizes this expression.
If the length of the query is smaller than the shortest length of the
sequences we are currently considering, we use that length; otherwise we
use the minimum and maximum lengths to obtain an approximate result.
5.2. Searching the Index Tree for Nearest Trajectories
We assume that we search an index tree that contains time series with
minimum length min_l and maximum length max_l. For simplicity we discuss
the algorithm for the 1-Nearest Neighbor query, where given a query
sequence Q we try to ﬁnd the sequence in the set that is the most similar
to Q. The search procedure takes as input a node N in the tree, the query
Q and the distance to the closest timeseries found so far. For each of the
children C, we check whether the child is a single sequence or a cluster.
If it is a sequence, we just compare its distance to Q with that of the
current nearest sequence. If it is a cluster, we check the length of the
query and choose the appropriate value for min(|B|, |Q|). Then we compute
a lower bound L on the distance of the query to any sequence in the
cluster, and we compare it with the distance mindist of the current nearest
neighbor. We need to examine this cluster only if L is smaller than mindist.
In our scheme we use an approximate algorithm to compute
LCSS_{δ,ε,F}. Consequently, the value of
LCSS_{δ,ε,F}(M_C, B)/min(|B|, |Q|) that we compute can be up to β times
higher than the exact value. Therefore, since we use the approximate
algorithm of Section 4.3 when indexing trajectories, we have to subtract
β · min(|M_C|, |Q|)/min(|B|, |Q|) from the bound we compute for
D2(δ, ε, B, Q).
6. Experimental Evaluation
We implemented the proposed approximation and indexing techniques as
described in the previous sections, and here we present experimental
results evaluating them. We first describe the datasets and then
present the results. The purpose of our experiments is twofold: first, to
evaluate the efficiency and accuracy of the approximation algorithm
presented in Section 4, and second, to evaluate the indexing technique
discussed in the previous section. Our experiments were run on a PC with
a Pentium III at 1 GHz, 128 MB RAM and a 60 GB hard disk.
6.1. Time and Accuracy Experiments
Here we present the results of some experiments using the approximation
algorithm to compute the similarity function S2. Our dataset comes
from marine mammals' satellite tracking data.² It consists of sequences of
geographic locations of various marine animals (dolphins, sea lions,
whales, etc.) tracked over different periods of time, ranging from one to
three months (SEALS dataset). The length of the sequences is close to 100.
In Table 1 we show the computed similarity between a pair of sequences
in the SEALS dataset. We ran the exact and the approximate algorithm
for different values of δ and ε, and we report some indicative results
here. K is the number of times the approximate algorithm invokes the LCSS
procedure (that is, the number of translations c that we try). As we can
see, using only a few translations we get very good results. We obtained
similar results for synthetic datasets. In Table 2 we report the running
times for computing the similarity measure between two sequences of the
same dataset. The approximation algorithm again uses from 15 to 60
different runs. Its running time is much lower, even for K = 60.
As can be observed from the experimental results, the running time of
the approximation algorithm is not proportional to the number of runs (K).
This is achieved by reusing the results of previous translations and by
terminating the execution of the current translation early if it is not
going to yield a better result. The main conclusion of the above
experiments is that the approximation algorithm provides a very tractable
time-versus-accuracy tradeoff for computing the similarity between two
sequences, when the similarity is defined using the LCSS model.
Table 1. Similarity values between two sequences from our SEALS dataset.

                        Similarity                       Error (%)
 δ     ε     Exact      Approximate for K tries         for K = 60
                        K = 15    K = 30    K = 60
 2   0.25   0.71134    0.65974   0.71134   0.696701       1.4639
 2   0.5    0.907216   0.886598  0.893608  0.9            0.7216
 4   0.25   0.71230    0.680619  0.700206  0.698041       1.2094
 4   0.5    0.927835   0.908763  0.865979  0.922577       0.5258
 6   0.25   0.72459    0.669072  0.698763  0.71134        1.325
 6   0.5    0.938144   0.938     0.919794  0.92087        1.7274

² http://whale.wheelock.edu/whalenetstuff/stop cover.html
Table 2. Running time between two sequences from the SEALS dataset.

                     Running Time (sec)                  Error (%)
 δ     ε     Exact      Approximate for K tries         for K = 60
                        K = 15    K = 30    K = 60
 2   0.25   0.06       0.0038    0.006     0.00911        6.586
 2   0.5    0.04       0.0032    0.00441   0.00921        4.343
 4   0.25   0.17       0.0581    0.00822   0.01482       11.470
 4   0.5    0.1        0.00481   0.00641   0.01031        9.699
 6   0.25   0.331      0.00731   0.01151   0.01933       17.123
 6   0.5    0.301      0.0062    0.00921   0.01332       22.597
6.2. Clustering using the Approximation Algorithm
We compare the clustering performance of our method to the widely used
Euclidean and DTW distance functions. Speciﬁcally:
(i) The Euclidean distance is only defined for sequences of the same length
(and the lengths of our sequences vary considerably). We tried to offer
the best possible comparison between every pair of sequences by sliding
the shorter of the two sequences across the longer one and recording
their minimum distance. In this way we get the best possible results
out of the Euclidean method, at the cost of some time overhead. Still,
the computation of the Euclidean distance is the least complicated
one.
(ii) The DTW can be used to match sequences of different lengths. For both
DTW and Euclidean we normalized the data before computing the
distances. Our method does not need any normalization, since it
computes the necessary translations.
(iii) For the LCSS we used the randomized version, with and without
sampling, and for various values of δ. The reported times and numbers
of correct clusterings are averages over 15 runs of the experiment;
this is necessary due to the randomized nature of our approach.
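The sliding comparison described in point (i) can be sketched as follows (a minimal fragment; the function name is ours):

```python
import math

def sliding_euclidean(x, y):
    """Slide the shorter sequence across the longer one and keep the
    minimum Euclidean distance over all offsets."""
    short, long_ = (x, y) if len(x) <= len(y) else (y, x)
    best = float("inf")
    for off in range(len(long_) - len(short) + 1):
        d = math.sqrt(sum((s - long_[off + k]) ** 2
                          for k, s in enumerate(short)))
        best = min(best, d)
    return best

print(sliding_euclidean([1.0, 2.0], [9.0, 1.0, 2.0, 9.0]))  # -> 0.0
```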
6.2.1. Determining the Values for δ and ε
The choice of values for the parameters δ and ε is clearly dependent on
the application and the dataset. For most datasets at our disposal,
we discovered that setting δ to more than 20–30% of the sequence length
did not yield signiﬁcant improvement. Furthermore, after some point the
Fig. 13. In most datasets, the similarity stabilizes around a certain point after a
value of δ. (The plot shows similarity versus δ as a percentage of sequence length,
for the whole sequence and for samples of 80%, 60%, 40%, 30%, 25% and 20%.)
similarity stabilizes to a certain value (Figure 13). The determination of ε
is application dependent. In our experiments we used a value equal to some
fraction (usually 0.3–0.5) of the smaller standard deviation of the two
sequences being examined at any time, which yielded good and intuitive
results. Nevertheless, when we use the index, the value of ε has to be the
same for all pairs of sequences.
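The ε heuristic just described might be coded as follows. This is our sketch; the 0.3–0.5 fraction is the range quoted above, and the function name is illustrative:

```python
import statistics

def choose_eps(a, b, fraction=0.4):
    """Pick the matching threshold eps as a fraction (0.3-0.5 in the text)
    of the smaller standard deviation of the two sequences."""
    return fraction * min(statistics.stdev(a), statistics.stdev(b))
```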
To provide a more accurate estimate of the similarity between two
sequences, we can use weighted matching. This allows us to give more
weight to points that match very closely and less weight to points that
match only marginally within the search area of ε. Such matching
functions can be found in Figure 14. If two sequence points match for a
very small ε, then we increase the LCSS by 1; otherwise we increase it by
some amount in the range r, where 0 ≤ r < 1.
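A minimal sketch of the (δ, ε) LCSS recurrence discussed throughout this section, with an optional weighted increment as described above. This is our code: the `eps_tight` cutoff is a simplification of the authors' weighting functions in Figure 14, and the function names are ours:

```python
def lcss(a, b, delta, eps, r=1.0, eps_tight=None):
    """Dynamic-programming LCSS under the (delta, eps) model: points a[i]
    and b[j] match when |a[i]-b[j]| < eps and |i-j| <= delta.  With
    eps_tight set, matching is weighted: a very close match (within
    eps_tight) scores 1, a marginal match (within eps) scores r < 1."""
    n, m = len(a), len(b)
    # dp[i][j] = LCSS score of the prefixes a[:i] and b[:j]
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diff = abs(a[i - 1] - b[j - 1])
            if diff < eps and abs(i - j) <= delta:
                score = 1.0 if eps_tight is None or diff < eps_tight else r
                dp[i][j] = dp[i - 1][j - 1] + score
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[n][m]

def similarity(a, b, delta, eps):
    """Normalized similarity in [0, 1], as used for the clustering runs."""
    return lcss(a, b, delta, eps) / min(len(a), len(b))
```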
6.2.2. Experiment 1 — Video Tracking Data
These time series represent the X and Y positions of a tracked feature on
a human (e.g. the tip of a finger). In conjunction with a "spelling
program" the user can "write" various words [20]. In this experiment we
used only the X coordinates and kept 3 recordings for each of 5 different
words: 'athens', 'berlin', 'london', 'boston', 'paris'.
Fig. 14. Performing unweighted and weighted matching in order to provide more accuracy in the matching between sequences.
The average length of the series is around 1100 points. The shortest one is
834 points and the longest one 1719 points.
To determine the eﬃciency of each method we performed hierarchical
clustering after computing the N(N −1)/2 pairwise distances for all three
distance functions. We evaluate the total time required by each method, as
well as the quality of the clustering, based on our knowledge of which word
each sequence actually represents. We take all possible pairs of words (in
this case 5 ∗ 4/2 = 10 pairs) and use the clustering algorithm to partition
them into two classes. While at the lower levels of the dendrogram the
clustering is subjective, the top level should provide an accurate division
into two classes. We clustered using single, complete and average linkage.
Since the best results for every distance function are produced using the
complete linkage, we report only the results for this approach (Table 3,
Fig. 15). The same experiment was conducted with the remaining datasets,
for different sample sizes and values of δ (as a percentage of the
original series length).
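The evaluation procedure above, computing all N(N−1)/2 pairwise distances and then agglomerating with complete linkage down to two clusters, can be sketched as follows. This is our illustrative code, not the authors' implementation:

```python
def complete_linkage_two_clusters(dist, n):
    """Agglomerate items 0..n-1 into two clusters using complete linkage,
    given the N(N-1)/2 pairwise distances.
    dist: dict mapping frozenset({i, j}) -> distance."""
    clusters = [{i} for i in range(n)]

    def linkage(c1, c2):
        # complete linkage: distance between the farthest pair across clusters
        return max(dist[frozenset({i, j})] for i in c1 for j in c2)

    while len(clusters) > 2:
        # merge the pair of clusters with the smallest complete-linkage distance
        ci, cj = min(
            ((a, b) for x, a in enumerate(clusters) for b in clusters[x + 1:]),
            key=lambda pair: linkage(*pair))
        clusters.remove(ci)
        clusters.remove(cj)
        clusters.append(ci | cj)
    return clusters
```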
The results with the Euclidean and the Time Warping distance contain
many classification errors. For the LCSS the only real variations in the
clustering occur for sample sizes s ≤ 10% (Figure 16). Still, the average
number of incorrect clusterings in these cases was consistently below 2
(< 1.85). For 15% sampling or more, there were no errors.
6.2.3. Experiment 2 — Australian Sign Language Dataset (ASL)
The dataset (available at http://kdd.ics.uci.edu) consists of various
parameters (such as the X, Y, Z hand position, roll, pitch, yaw, thumb
bend, etc.) tracked while different writers
Table 3. Results using the video tracking data for some values of sample size s and δ.

    Distance function       Time (sec)    Correct clusterings (out of 10),
                                          complete linkage
    Euclidean                31.28          2
    DTW                     227.162         5
    S2:
      s=5%,  δ=25%            2.645         8.45
      s=10%, δ=25%            8.240         9.85
      s=15%, δ=25%           15.789        10
      s=20%, δ=25%           28.113        10
Fig. 15. Time required to compute the pairwise distances for various sample sizes and values of δ.
sign one of the 95 words of the ASL. These series are relatively short
(50–100 points). We used only the X coordinates and collected 5 recordings
of each of the following 10 words: 'Norway', 'cold', 'crazy', 'eat',
'forget', 'happy', 'innocent', 'later', 'lose', 'spend'. This is also the
experiment conducted in [27]. Examples of this dataset can be seen in
Figure 17.
In this experiment none of the distance functions achieves significant
clustering accuracy, because we used only 1 parameter out of the possible
10; accuracy is therefore compromised. Specifically, the performance of
the LCSS in this experiment is similar to that of the DTW (both correctly
recognized 20 clusters). So both present equivalent accuracy levels,
which is expected, since this dataset does not contain excessive noise and
Fig. 16. Average number of correct clusterings (out of 10) for 15 runs of the algorithm
using diﬀerent sample sizes and δ values.
Fig. 17. Three recordings of the word ‘norway’ in the Australian Sign Language. The
graph depicts the x position of the writer’s hand.
furthermore the data seem to be already normalized and rescaled within
the range [−1, 1]. Therefore, in this experiment we also used the
similarity function S1 (no translation), since translations were not
going to achieve any further improvement (Table 4, Figure 18). Sampling
is only performed down to 75% of the series length (these sequences are
already short). As a consequence, even though we do not gain much in
accuracy, our execution time is comparable to that of the Euclidean
distance (without performing any translations).
Table 4. Results for ASL data and ASL with added noise.

    Distance function     Time (sec)   Correct clusterings   Correct clusterings
                                       (out of 45), ASL      (out of 45), ASL with noise
    Euclidean              2.170       15                     1
    DTW                    8.092       20                     2
    S1:
      s=75%,  δ=40%        1.200       15.450                 9.933
      s=75%,  δ=100%       2.297       19.231                10.4667
      s=100%, δ=40%        1.345       16.080                12.00
      s=100%, δ=100%       2.650       20.00                 12.00
Fig. 18. ASL data: Time required to compute the pairwise distances of the 45 combinations (same for ASL and ASL with noise).
6.2.4. Experiment 3 — ASL with Added Noise
We added noise to every sequence of the ASL at a random starting point
and for a duration equal to 15% of the series length. The noise was added
using the function (x_noise, y_noise) = (x, y) + randn · rangeValues,
where randn produces a random number drawn from a normal distribution
with mean zero and variance one, and rangeValues is the range of values
of the X or Y coordinates. In this last experiment we wanted to see how
the addition of noise would affect the performance of the three distance
functions. Again, the running time is the same as with the original ASL
data.
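The noise injection described above can be sketched as follows. This is our code; the `fraction` parameter and the helper name are ours:

```python
import random

def add_noise(seq, fraction=0.15):
    """Corrupt a contiguous segment of `seq` (a list of floats) starting
    at a random position, following x_noise = x + randn * rangeValues."""
    n = len(seq)
    dur = max(1, int(fraction * n))
    start = random.randrange(0, n - dur + 1)
    value_range = max(seq) - min(seq)  # rangeValues: spread of the coordinate
    noisy = list(seq)
    for i in range(start, start + dur):
        noisy[i] += random.gauss(0.0, 1.0) * value_range  # randn ~ N(0, 1)
    return noisy
```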
Fig. 19. Noisy ASL data: The correct clusterings of the LCSS method using complete
linkage.
The LCSS proves to be more robust than the Euclidean and the DTW
under noisy conditions (Table 4, Figures 18 and 19). The Euclidean again
performed poorly, recognizing only 1 cluster, the DTW recognized 2 and
the LCSS up to 12 clusters (consistently recognizing more than 6 clusters).
These results are a strong indication that the models based on LCSS are
generally more robust to noise.
6.3. Evaluating the Quality and Efficiency of the Indexing Technique
In this part of our experiments we evaluated the efficiency and
effectiveness of the proposed indexing scheme. We performed tests over
datasets of different sizes and with different numbers of clusters. To
generate large realistic datasets, we used real sequences (from the SEALS
and ASL datasets) as "seeds" to create larger datasets that follow the
same patterns. For the tests, we used queries that have no exact matches
in the database but are similar to some of the existing sequences. For
each experiment we ran 100 different queries and report the averaged
results.
We tested the index performance for different numbers of clusters in a
dataset consisting of a total of 2000 sequences. We executed a set of
K-Nearest Neighbor (KNN) queries for K = 1, 5, 10, 15 and 20, and we plot
the fraction of the dataset that has to be examined in order to guarantee
Fig. 20. Performance for increasing number of Nearest Neighbors.
Fig. 21. Index performance for variable number of data clusters.
that we have found the best match for the KNN query. Note that in this
fraction we included the medoids that we check during the search since they
are also part of the dataset.
In Figure 20 we show some results for K-Nearest Neighbor queries. We
used datasets with 5, 8 and 10 clusters. As we can see the results indicate
that the algorithm has good performance even for queries with large K.
We also performed similar experiments where we varied the number of
clusters in the datasets (Figure 21). As the number of clusters increased
the performance of the algorithm improved considerably. This behavior is
expected and is similar to that of recently proposed index structures
for high dimensional data [6,9,22]. On the other hand if the dataset has no
clusters, the performance of the algorithm degrades, since the majority of
the sequences have almost the same distance to the query. This behavior
follows again the same pattern of high dimensional indexing methods [6,43].
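The cluster-based search that the index performs can be sketched generically as follows. One caveat: the pruning bound below is the ordinary metric triangle inequality, used here only for illustration; the chapter derives its own weaker inequality for the non-metric LCSS distance. All names are ours:

```python
import heapq

def knn_search(query, clusters, dist, k):
    """Cluster-based K-NN search: visit clusters in order of medoid
    distance and prune members via a lower bound on their distance to
    the query.  clusters: list of (medoid, members), where members is a
    list of (item, distance_to_medoid) pairs; items must be orderable."""
    best = []      # heap of (-distance, item): the current k best
    examined = 0   # items whose distance to the query we actually computed

    def offer(item, d):
        heapq.heappush(best, (-d, item))
        if len(best) > k:
            heapq.heappop(best)

    for medoid, members in sorted(clusters, key=lambda c: dist(query, c[0])):
        d_qm = dist(query, medoid)
        examined += 1              # medoids are part of the dataset too
        offer(medoid, d_qm)
        for item, d_im in members:
            kth = -best[0][0] if len(best) == k else float("inf")
            if abs(d_qm - d_im) > kth:
                continue           # pruned: item cannot beat the kth best
            examined += 1
            offer(item, dist(query, item))
    return sorted((-nd, it) for nd, it in best), examined
```

The `examined` counter corresponds to the fraction of the dataset reported in Figures 20 and 21; the tighter the clusters, the more members are pruned.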
7. Conclusions
We have presented efficient techniques to accurately compute the
similarity between time-series with a significant amount of noise. Our
distance measure is based on the LCSS model and performs very well for
noisy signals. Since the exact computation is inefficient, we presented
approximate algorithms with provable performance bounds. Moreover, we
presented an efficient index structure, based on hierarchical clustering,
for similarity (nearest neighbor) queries. The distance that we use is
not a metric and therefore the triangle inequality does not hold.
However, we prove that a similar (although weaker) inequality holds,
which allows us to prune parts of the dataset without any false
dismissals.
Our experiments indicate that the approximation algorithm can be used
to obtain an accurate and fast estimate of the distance between two
time-series even under noisy conditions. Also, results from the index
evaluation show that we can achieve good speedups when searching for
similar sequences compared with a brute-force linear scan.
We plan to investigate biased sampling to improve the running time of
the approximation algorithms, especially when full rigid transformations
(e.g. shifting, scaling and rotation) are necessary. Another approach to
index time-series for similarity retrieval is to use embeddings and map the
set of sequences to points in a low dimensional Euclidean space [15]. The
challenge of course is to ﬁnd an embedding that approximately preserves the
original pairwise distances and gives good approximate results to similarity
queries.
References
1. Agarwal, P.K., Arge, L., and Erickson, J. (2000). Indexing Moving Points. In
Proc. of the 19th ACM Symp. on Principles of Database Systems (PODS),
pp. 175–186.
2. Agrawal, R., Faloutsos, C., and Swami, A. (1993). Eﬃcient Similarity Search
in Sequence Databases. In Proc. of the 4th FODO, pp. 69–84.
3. Agrawal, R., Lin, K., Sawhney, H.S., and Shim, K. (1995). Fast Similarity
Search in the Presence of Noise, Scaling and Translation in Time-Series
Databases. In Proc. of VLDB, pp. 490–501.
4. Barbara, D. (1999). Mobile Computing and Databases – A Survey. IEEE
TKDE, pp. 108–117.
5. Berndt, D. and Cliﬀord, J. (1994). Using Dynamic Time Warping to Find
Patterns in Time Series. In Proc. of KDD Workshop.
6. Beyer, K., Goldstein, J., Ramakrishnan, R., and Shaft, U. (1999). When is
“Nearest Neighbor” Meaningful? In Proc. of ICDT, Jerusalem, pp. 217–235.
7. Bollobas, B., Das, G., Gunopulos, D., and Mannila, H. (1997). Time-Series
Similarity Problems and Well-Separated Geometric Sets. In Proc. of the 13th
SCG, Nice, France.
8. Bozkaya, T., Yazdani, N., and Ozsoyoglu, M. (1997). Matching And Indexing
Sequences of Diﬀerent Lengths. In Proc. of the CIKM, Las Vegas.
9. Chakrabarti, K. and Mehrotra, S. (2000). Local Dimensionality Reduction:
A New Approach to Indexing High Dimensional Spaces. In Proc. of VLDB,
pp. 89–100.
10. Chu, K. and Wong, M. (1999). Fast Time-Series Searching with Scaling and
Shifting. ACM Principles of Database Systems, pp. 237–248.
11. Cormen, T.H., Leiserson, C.E., and Rivest, R.L. (1990). Introduction to
Algorithms. The MIT Electrical Engineering and Computer Science Series.
MIT Press/McGraw-Hill.
12. Das, G., Gunopulos, D. and Mannila, H. (1997). Finding Similar Time Series.
In Proc. of the First PKDD Symp., pp. 88–100.
13. Duda, R. and Hart, P. (1973). Pattern Classiﬁcation and Scene Analysis.
John Wiley and Sons, Inc.
14. Faloutsos, C., Jagadish, H., Mendelzon, A. and Milo, T. (1997). Signature
Technique for SimilarityBased Queries. In SEQUENCES 97.
15. Faloutsos, C. and Lin, K.I. (May, 1995). FastMap: A Fast Algorithm
for Indexing, Datamining and Visualization of Traditional and Multimedia
Datasets. In Proc. ACM SIGMOD, pp. 163–174.
16. Faloutsos, C., Ranganathan, M., and Manolopoulos, Y. (1994). Fast
Subsequence Matching in Time Series Databases. In Proc. of ACM SIGMOD,
pp. 419–429.
17. Gaﬀney, S. and Smyth, P. (1999). Trajectory Clustering with Mixtures of
Regression Models. In Proc. of the 5th ACM SIGKDD, San Diego, CA,
pp. 63–72.
18. Ge, X. and Smyth, P. (2000). Deformable Markov Model Templates for Time
Series Pattern Matching. In Proc. ACM SIGKDD.
19. Gionis, A., Indyk, P., and Motwani, R. (1999). Similarity Search in High
Dimensions Via Hashing. In Proc. of 25th VLDB, pp. 518–529.
20. Gips, J., Betke, M. and Fleming, P. (July, 2000). The Camera Mouse:
Preliminary investigation of automated visual tracking for computer access.
In Proceedings of the Rehabilitation Engineering and Assistive Technology
Society of North America 2000 Annual Conference, pp. 98–100, Orlando, FL.
21. Goldin, D. and Kanellakis, P. (1995). On Similarity Queries for Time-Series
Data. In Proc. of CP ’95, Cassis, France.
22. Goldstein, J. and Ramakrishnan, R. (2000). Contrast Plots and P-Sphere
Trees: Space vs. Time in Nearest Neighbour Searches. In Proc. of the VLDB,
Cairo, pp. 429–440.
23. Grumbach, S., Rigaux, P., and Segouﬁn, L. (2000). Manipulating Interpolated
Data is Easier than you Thought. In Proc. of the 26th VLDB, pp. 156–165.
24. Jagadish, H.V., Mendelzon, A.O., and Milo, T. (1995). Similarity-Based
Queries. In Proc. of the 14th ACM PODS, pp. 36–45.
25. Kahveci, T. and Singh, A.K. (2001). Variable Length Queries for Time Series
Data. In Proc. of IEEE ICDE, pp. 273–282.
26. Keogh, E., Chakrabarti, K., Mehrotra, S., and Pazzani, M. (2001).
Locally Adaptive Dimensionality Reduction for Indexing Large Time Series
Databases. In Proc. of ACMSIGMOD, pp. 151–162.
27. Keogh, E. and Pazzani, M. (2000). Scaling up Dynamic Time Warping for
Datamining Applications. In Proc. 6th Int. Conf. on Knowledge Discovery
and Data Mining, Boston, MA.
28. Kollios, G., Gunopulos, D., and Tsotras, V. (1999). On Indexing Mobile
Objects. In Proc. of the 18th ACM Symp. on Principles of Database Systems
(PODS), pp. 261–272.
29. Lee, S.L., Chun, S.J., Kim, D.H., Lee, J.H., and Chung, C.W. (2000).
Similarity Search for Multidimensional Data Sequences. In Proc. of ICDE,
pp. 599–608.
30. Levenshtein, V. (1966). Binary Codes Capable of Correcting Deletions,
Insertions, and Reversals. In Soviet Physics — Doklady 10, 10, pp. 707–710.
31. Park, S., Chu, W., Yoon, J., and Hsu, C. (2000) Eﬃcient Searches for Similar
Subsequences of Diﬀerent Lengths in Sequence Databases. In Proc. of ICDE,
pp. 23–32.
32. Perng, S., Wang, H., Zhang, S., and Parker, D.S. (2000). Landmarks: A New
Model for SimilarityBased Pattern Querying in Time Series Databases. In
Proc. of ICDE, pp. 33–42.
33. Pfoser, D., Jensen, C., and Theodoridis, Y. (2000) Novel Approaches in Query
Processing for Moving Objects. In Proc. of VLDB, Cairo, Egypt.
34. Pfoser, D. and Jensen, C.S. (1999). Capturing the Uncertainty of Moving
Object Representations. Lecture Notes in Computer Science, 1651.
35. Qu, Y., Wang, C., and Wang, X. (1998). Supporting Fast Search in Time
Series for Movement Patterns in Multiple Scales. In Proc. of the ACM CIKM,
pp. 251–258.
36. Raﬁei, D. and Mendelzon, A. (2000). Querying Time Series Data Based on
Similarity. IEEE Transactions on Knowledge and Data Engineering, 12(5),
675–693.
37. Rice, S.V., Bunke, H., and Nartker, T. (1997). Classes of Cost Functions for
String Edit Distance. Algorithmica, 18(2), 271–280.
38. Sakoe, H. and Chiba, S. (1978). Dynamic Programming Algorithm
Optimization for Spoken Word Recognition. IEEE Trans. Acoustics, Speech
and Signal Processing (ASSP), 26(1), 43–49.
39. Saltenis, S., Jensen, C., Leutenegger, S., and Lopez, M.A. (2000). Indexing
the Positions of Continuously Moving Objects. In Proc. of the ACM SIGMOD,
pp. 331–342.
40. Vlachos, M. (2001) Eﬃcient Similarity Measures and Indexing Structures for
Multidimensional Trajectories, MSc Thesis, UC Riverside.
41. Vlachos, M., Gunopulos, D., Kollios, G. (2002) Discovering Multidimensional
Trajectories. In Proc. of IEEE Int. Conference in Data Engineering (ICDE).
42. Wagner, R.A. and Fisher, M.J. (1974). The String-to-String Correction
Problem. Journal of the ACM, 21(1), 168–173.
43. Weber, R., Schek, H.J., and Blott, S. (1998). A Quantitative Analysis
and Performance Study for Similarity Search Methods in High-Dimensional
Spaces. In Proc. of the VLDB, NYC, pp. 194–205.
44. Yi, B.K. and Faloutsos, C. (2000). Fast Time Sequence Indexing for
Arbitrary Lp Norms. In Proc. of VLDB, Cairo, Egypt.
45. Yi, B.K., Jagadish, H.V., and Faloutsos, C. (1998). Eﬃcient Retrieval of
Similar Time Sequences Under Time Warping. In Proc. of the 14th ICDE,
pp. 201–208.
April 22, 2004 16:58 WSPC/Trim Size: 9in x 6in for Review Volume Chap05
CHAPTER 5
CHANGE DETECTION IN CLASSIFICATION MODELS
INDUCED FROM TIME SERIES DATA
Gil Zeira* and Oded Maimon†
Department of Industrial Engineering
Tel-Aviv University, Tel-Aviv 69978, Israel
Email: *zeiragil@post.tau.ac.il; †maimon@eng.tau.ac.il

Mark Last
Department of Information Systems Engineering
Ben-Gurion University, Beer-Sheva 84105, Israel
Email: mlast@bgumail.bgu.ac.il

Lior Rokach
Department of Industrial Engineering
Tel-Aviv University, Tel-Aviv 69978, Israel
Email: liorr@eng.tau.ac.il
Most classification methods are based on the assumption that the
historic data involved in building and verifying the model is the best
estimator of what will happen in the future. One important factor that
must not be set aside is the time factor. As more data is accumulated
in the problem domain, incrementally over time, one must examine
whether the new data agrees with the previous datasets and make the
relevant assumptions about the future. This work presents a new change
detection methodology with a set of statistical estimators. These
changes can be detected independently of the data mining algorithm
used for constructing the corresponding model. When the novel approach
was applied to a set of artificially generated datasets, all
significant changes were detected in the relevant periods. The method
produced similar results in the evaluation on real-world datasets.
Keywords: Classification; incremental learning; time series; change
detection; info-fuzzy network.
102 G. Zeira, O. Maimon, M. Last and L. Rokach
1. Introduction
As masses of data are incrementally accumulated in large databases over
time, we tend to believe that the new data "behaves" somewhat like the
prior knowledge we have about the operation or facts that it describes.
Change detection in time series is not a new subject and has always been
a topic of continued interest. For instance, Jones et al. [17] developed
a change detection mechanism for serially correlated multivariate data.
Yao [39] estimated the number of change points in time series using the
BIC criterion. However, change detection in classification models is
still not well elaborated.
There are many algorithms and methods that deal with the incremental
learning problem, which is concerned with updating an induced model upon
receiving new data. These methods are specific to the underlying data
mining model, for example: Utgoff's method for incremental induction of
decision trees (ITI) [35,36], Wei-Min Shen's semi-incremental learning
method (CDL4) [34], David W. Cheung's technique for updating association
rules in large databases [5], Alfonso Gerevini's network constraints
updating technique [12], Byoung-Tak Zhang's method for feed-forward
neural networks (SELF) [40], the simple backpropagation algorithm for
neural networks [27], Liu and Setiono's incremental feature selection
(LVI) [24], and more.
The main topic in most incremental learning theories is how the model
(a set of rules, a decision tree, a neural network, and so on) is refined
or reconstructed efficiently as new amounts of data are encountered. This
problem has been tackled by many of the algorithms mentioned above, and
many of them performed significantly better than running the algorithm
from scratch, generally when the records were received online and the
changes had a low magnitude. An important question that one must examine
whenever a new mass of data is accumulated is: "Is it really wise to
keep on reconstructing or verifying the current model, when everything or
something in our notion of the model could have significantly changed?"
In other words, the main problem is not how to reconstruct better, but
rather how to detect a change in a model based on a time series database.
Some researchers have proposed various representations of the problem
of large time-changing databases and populations, including:
• Defining robustness and discovering robust knowledge from databases
[15], or learning stable concepts [13] in domains with hidden changes
in concepts.
• Identifying and modeling a persistent drift in a database [11].
Change Detection in Classiﬁcation Models Induced from Time Series Data 103
• Adapting to Concept and Population Drift [14,18,19,21,34].
• Activity Monitoring [9].
Rather than addressing the problem of detecting significant changes, the
above methods deal directly with data mining in changing environments.
This chapter introduces a novel methodology for detecting a significant
change in a classification model, by identifying distinct categories of
changes and implementing a set of statistical estimators. The major
contribution of our change detection procedure (described in later
sections) is the ability to make a confident claim that a model pre-built
from a sufficiently large dataset is no longer valid for any future use,
such as prediction or rule induction, and that, consequently, a new model
must be constructed.
The rest of the chapter is organized as follows. In Section 2, we present
our change detection procedure for data mining models, by defining the
characteristics of a data mining classification model (2.1), describing
the variety of possible significant changes (2.2), defining the
hypothesis testing (2.3), and presenting the methodology for the change
detection procedure (2.4). In Section 3, we describe an experimental
evaluation of the change detection methodology by introducing artificial
changes into databases and applying the change detection methodology to
identify these changes. Section 4 describes the evaluation of the change
detection methodology on real-world datasets. Section 5 presents the
validation of the work's assumptions, and Section 6 concludes this
chapter by summing up the main contributions of our method and presenting
several options for future research in the implementation and extension
of our methodology.
2. Change Detection in Classiﬁcation Models
of Data Mining
2.1. Classiﬁcation Model Characteristics
Classification is the task of constructing a model that predicts the
(usually categorical) label of a classifying attribute and using it to
classify new data. The induced classification model can be represented as
a set of rules (see the RISE system [8], the BMA method for rule
induction [6], PARULEL and PARADISER, etc.), a decision tree (see
[34–36]), neural networks (see [35]), information-theoretic connectionist
networks (see IFN [25]), and so on.
Given a database D consisting of a set of records X, the data mining
model M is one that, using an algorithm G, generates a set of hypotheses
H within the available hypothesis space, which generalizes the
relationship between a set of candidate input variables and the target
variable. The following notation is a general description of the data
mining classification modeling task: given a database D containing a
complete set of records X = (A, T), where A is a vector of candidate
input variables (attributes of the examined phenomenon, which might have
some influence on the target concept) and T is a target variable (i.e.
the target concept), find the best set of hypotheses H within the
available hypothesis space, which generalizes the relationship between
the set of candidate input variables and the target variable (i.e. the
model M), using some data mining algorithm. Each record is regarded as a
complete conjunction of attributes and a target concept (or variable),
such as

    X_i = (A_1i = a_1j(1), A_2i = a_2j(2), ..., A_ni = a_nj(n), T_i = t_j).

A = (A_1, A_2, ..., A_n) is a known set of attributes of the desired
phenomenon and T (also denoted in the DM literature as Y) is a known
discrete or continuous target variable or concept. The goal of the
classification task is to generate the best set of hypotheses describing
the model M using an algorithm G. Simplified, this means generating the
following general mapping:

    M_G: A → T̂.    (1)

It is obvious that not all attributes prove to be statistically
significant enough to be included in the set of hypotheses.
In most algorithms, the database D is divided into two parts: a learning
set (D_learn) and a validation set (D_val). The first is supposed to hold
enough information for assembling a statistically significant and stable
model based on the DM algorithm. The second part is supposed to ensure
that the algorithm achieves its goals, by validating the built models on
unseen records. Evaluating the prediction accuracy of any model M built
by a classification algorithm G is commonly performed by estimating the
validation error rate of the examined model.
When the database D is not fixed but is instead accumulated over time,
the classification task should be altered: in every period K, a new set
of records X_K is accumulated into the database. Let d_K denote the set
of records X_K added at the start of period K, and let D_K denote the
accumulated database, D_K = ∪_{k=1..K} d_k. Therefore, given a database
D_K containing a complete set of records X_K, generate the best set of
hypotheses H_K to describe the accumulated model M_K; a new question is
then encountered: does M_{K,G} = M_{K+1,G} hold for every K = 1, ..., k?
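The period-by-period accumulation D_K = d_1 ∪ ... ∪ d_K can be pictured with a trivial sketch (ours, with batches represented as lists):

```python
def accumulate(batches):
    """Given the per-period record batches d_1, ..., d_K, return the
    snapshots D_1, ..., D_K, where each D_K is the union (here,
    concatenation) of all batches up to period K."""
    D, snapshots = [], []
    for d_k in batches:
        D = D + d_k            # D_K = D_{K-1} union d_K
        snapshots.append(D)
    return snapshots
```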
As noted before, most existing methods have dealt with "how can the
model M be updated efficiently when a new period K is encountered?" or
"how can we adapt to the time factor?", rather than asking the following
questions:
• "Was the model significantly changed during period K?"
• "What was the nature of the change?"
• "Should we consider several of the past periods as redundant, or not
required, in order for an algorithm G to generate a better model M?"
Hence, the objective of this work is to define and evaluate a change
detection methodology for identifying a significant change that happened
during period K in a classification model that was incrementally built
over periods 1 to K−1, based on the data accumulated during period K.
2.2. Variety of Changes
There are various significant changes which might occur when inducing
the model M using the algorithm G. There are several possible causes for
significant changes in the data mining model:
(i) A change in the probability distribution of one or more of the input
attributes (A). For example, if a database in periods 1 to K−1 consists
of 45% males and 55% females, while in period K all records represent
males.
(ii) A change in the distribution of the target variable (T). For
example, consider examining the rate of failures in a final exam based on
the characteristics of the students in consecutive years. If in the year
1999 the average failure rate was 20% and in the year 2000 it was 40%,
then a change in the target distribution has occurred.
(iii) A change in the "patterns" (rules) which define the relationship
of the input attributes to the target variable, that is, a change in the
model M derived from a change in the set of hypotheses H. For instance,
in the case of examining the rate of failures in final exams based on the
characteristics of the students over consecutive years, if in 1999 male
students had 60% failures and female students had 5% failures, and in
2000 the situation was the opposite, then there was obviously a change in
the patterns of behavior. This work defines this cause for a significant
change in a data mining model M by the following definition:
A change C is encountered in period K if the validation error of the
model M_{K−1} (the model based on D_{K−1}) on the database D_{K−1} is
significantly different from the validation error rate of the model
M_{K−1} over d_K, where d_K consists of the set of records accumulated in
period K.
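The chapter develops its own set of statistical estimators in the following subsections. Purely as an illustration of the definition above, a standard two-proportion z-test comparing the old model's validation error on the previous data with its error on d_K could look like this (the function names and the choice of test are ours, not the chapter's):

```python
import math

def error_rate(model, records):
    """Fraction of records misclassified by `model` (a callable mapping
    attributes -> predicted label); records are (attributes, target) pairs."""
    errors = sum(1 for attrs, target in records if model(attrs) != target)
    return errors / len(records)

def significant_change(model, d_prev, d_new, z_crit=1.96):
    """Two-proportion z-test sketch: is the old model's error on the new
    period's records d_new significantly different from its error on the
    previous data d_prev?"""
    e1, n1 = error_rate(model, d_prev), len(d_prev)
    e2, n2 = error_rate(model, d_new), len(d_new)
    p = (e1 * n1 + e2 * n2) / (n1 + n2)          # pooled error rate
    se = math.sqrt(p * (1 - p) * (1 / n1 + 1 / n2))
    if se == 0:
        return False                              # identical error rates
    return abs(e1 - e2) / se > z_crit
```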
(iv) Inconsistency of the DM algorithm. The basic assumption is that the
DM algorithms in use were proven to be consistent and are therefore
unlikely to be a major cause of a change in the model. Although this
option should not be dismissed, this work does not intend to deal with
consistency or inconsistency measures of existing DM algorithms. We
merely point out that the DM algorithm chosen for our experiments (IFN)
was shown to be stable in our previous work [21].
As noted in subsection 2.1, the data mining model is generally described
as M_G: A → T̂, that is, a relationship between A and T. The first cause
is expressed as a change in the set of attributes A, which can also cause
a change in the target variable. For example, if the percentage of women
rises from 50% to 80% and women drink more white wine than men, then the
overall percentage of white wine consumption (T) will also increase. This
can also affect the overall error rate of the data mining model if most
rules were generated for men. Also, if the cause of a change in the
relevant period is a change in the target variable (e.g., France has
stopped producing white wine, and the percentage of white wine
consumption has dropped from 50% to 30%), it is possible that the rules
relating the relevant attributes to the target variable(s) will be
affected.
Since it is not trivial to identify the dominant cause of change, this
work claims that a variety of possible changes might occur in a relevant
period. The change can be caused by one of the three causes that were
mentioned above or a simple combination of them. Table 1 presents the
possible combinations of significant causes in a given period.

Table 1. Definition of the variety of changes in a data mining model.

Rules  A  T  Description
  −    −  −  No change.
  −    −  +  A change in the target variable.
  −    +  −  A change in the attribute variable(s).
  −    +  +  A change in the target and in the input variable(s).
  +    −  −  A change in “patterns” (rules) of the data mining model.
  +    −  +  A change in “patterns” (rules) of the data mining model, and a change in the target variable.
  +    +  −  A change in “patterns” (rules) of the data mining model, and a change in the input variable(s).
  +    +  +  A change in “patterns” (rules) of the data mining model, and a change in the target and the input variables.

Change Detection in Classification Models Induced from Time Series Data 107
The definition of the variety of possible changes in a data mining model
is quite a new concept. As noted, several researchers have tended to deal
with concept change, population change, activity monitoring, etc. The notion
that all three major causes interact and affect each other is quite new, and
it is tested and validated in this work for the first time.
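For illustration, the eight combinations of Table 1 can be enumerated mechanically. This is a small sketch of ours, not code from the chapter; the cause names and function names are invented.

```python
from itertools import product

# The three elementary causes of change (names are ours, not the chapter's):
# a change in the "patterns" (rules), in the input variables A, or in the target T.
CAUSES = ("rules", "input variable(s)", "target variable")

def describe(flags):
    """Map a (rules, A, T) triple of booleans to a Table 1 style description."""
    active = [name for name, flag in zip(CAUSES, flags) if flag]
    return "no change" if not active else "change in " + " and ".join(active)

# All eight rows of Table 1, keyed by the (rules, A, T) flag triple
table1 = {flags: describe(flags) for flags in product((False, True), repeat=3)}
```

A lookup such as `table1[(True, False, True)]` then names the combined cause for that row.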
2.3. Statistical Hypothesis Testing
In order to determine whether or not a signiﬁcant change has occurred in
period K, a set of statistical estimators is presented in this chapter. The
use of these estimators is subject to several conditions:
1. Every period contains a sufficient amount of training data to rebuild a
model for that specific period. The decision of whether a period contains
a sufficient number of records should be based on the classification
algorithm in use, the inherent noisiness of the training data, the
acceptable difference between the training and validation error rates,
and so on.
2. The same DM algorithm is used in all periods to build the classification
model (e.g., C4.5 or IFN).
3. The same method is used for estimating the validation error rate in all
periods (e.g., 5-fold or 10-fold cross-validation, 1/3 holdout, etc.).
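Condition 3 can be made concrete with a simple 1/3 holdout estimator. This is a hedged sketch under our own naming; `train_and_classify` is a placeholder for any learner (e.g., IFN or C4.5) and is not part of the chapter's notation.

```python
import random

def holdout_error(records, train_and_classify, holdout_frac=1/3, seed=0):
    """Estimate the validation error rate with a 1/3 holdout split.
    `records` is a list of (features, label) pairs; `train_and_classify`
    fits a model on the training part and returns a classifier function."""
    rng = random.Random(seed)          # fixed seed: the same policy in every period
    shuffled = list(records)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * holdout_frac)
    holdout, training = shuffled[:cut], shuffled[cut:]
    classify = train_and_classify(training)
    errors = sum(1 for features, label in holdout if classify(features) != label)
    return errors / len(holdout)
```

As condition 3 requires, the same estimator (same method V, same split policy) must then be used in every period.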
Detecting a change in “patterns” (rules). “Patterns” (rules) define the
relationship between input and target variables. Since massive data streams
are usually involved in building an incremental model, we can safely assume
that the true validation error rate of a given incremental model is accurately
estimated by the previous K − 1 periods. Therefore, a change in the rules (R)
is encountered during period K if the validation error of the model M_{K−1}
(the model that is based on D_{K−1}) on the database D_{K−1} is significantly
different from the validation error rate of the model M_{K−1} over d_K (the
records obtained during period K).
Therefore, the parameter of interest for the statistical hypothesis testing
is the true validation error rate, and the hypotheses are as follows:

H_0: ê_{M_{K−1},K} = ê_{M_{K−1},K−1},
H_1: ê_{M_{K−1},K} ≠ ê_{M_{K−1},K−1},

where:
ê_{M_{K−1},K−1} is the validation error rate of model M_{K−1} measured on
the D_{K−1} set of records (the standard validation error of the model);
ê_{M_{K−1},K} is the validation error rate of the aggregated model M_{K−1} on
the set of records d_K.
In order to detect a significant difference between the two error rates, the
following statistic is tested (two-sided hypothesis):

d̂ = ê_{M_{K−1},K} − ê_{M_{K−1},K−1},

σ̂²_d = ê_{M_{K−1},K}(1 − ê_{M_{K−1},K})/n_K + ê_{M_{K−1},K−1}(1 − ê_{M_{K−1},K−1})/n_{K−1(val)}.

If |d̂| ≥ z_{1−α/2} · √(σ̂²_d), then reject H_0: a change has occurred in
period K. Here n_{K−1(val)} = |D_{K−1(val)}| is the number of records which were
selected for validation from periods 1, . . . , K − 1, and n_K = |d_K| is the
number of records in period K.

The foundations for the above way of hypothesis testing when comparing
error rates of classification algorithms can be found in [29].
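As a sketch, this test can be computed with the standard normal quantile from the Python standard library. The two error rates in the example are taken from period 3 of the experiments reported later in the chapter (Table 7); the record counts and the function name are purely illustrative assumptions of ours.

```python
from math import sqrt
from statistics import NormalDist

def cd_test(err_k, n_k, err_hist, n_hist_val, alpha=0.05):
    """Two-sided test of H0: the error of model M_{K-1} on the new period's
    records (err_k, over n_k records) equals its standard validation error
    on periods 1..K-1 (err_hist, over n_hist_val validation records)."""
    d = abs(err_k - err_hist)
    # Variance of the difference of two independent proportions
    var_d = (err_k * (1 - err_k) / n_k
             + err_hist * (1 - err_hist) / n_hist_val)
    z = NormalDist().inv_cdf(1 - alpha / 2)   # critical value z_{1 - alpha/2}
    change = d >= z * sqrt(var_d)             # True -> reject H0
    return d, sqrt(var_d), change

# Error rates 32.8% vs. 19.8% as in period 3 of Table 7; the record counts
# (1000 new records, 2000 historical validation records) are invented.
d, sd, change = cd_test(0.328, 1000, 0.198, 2000)
```

With these inputs the difference of 13 percentage points far exceeds the 5% critical threshold, so H_0 is rejected.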
Detecting changes in variable distributions. The second statistical
test is Pearson’s chi-square statistic for comparing multinomial variables
(see [28]). This test examines whether a sample of the variable distribution
is drawn from the matching probability distribution known as the true
distribution of that variable. The objective of this estimator in the change
detection procedure is to validate our assumption that the distribution of
an input or a target variable has changed in a statistically significant sense.
Again, since massive data streams are usually involved in building an
incremental model, it is safe to assume that the stationary distribution of any
variable in a given incremental model can be accurately estimated by the
previous K − 1 periods.
The following null hypothesis is tested for every variable of interest X:

H_0: the variable X’s distribution is stationary (time-invariant).
H_1: otherwise.

The decision is based on the following formula:

X²_p = n_K · Σ_{i=1}^{j} (x_{iK}/n_K − x_{iK−1}/n_{K−1})² / (x_{iK−1}/n_{K−1}),   (2)
where:
n_K is the number of records in the Kth period;
n_{K−1} is the number of records in periods 1, . . . , K − 1;
x_{iK} is the number of records in the ith class of variable X in period K;
x_{iK−1} is the number of records in the ith class of variable X in periods
1, . . . , K − 1.
If X²_p > χ²_{1−α}(j − 1), where j is the number of classes of the tested
variable, then the null hypothesis that the variable X’s distribution has
remained stationary in period K, as in the previous periods, is rejected.
The explanation of Pearson’s statistical hypothesis testing is provided
in [28].
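Equation (2) and the decision rule translate directly into code. In this minimal sketch the class counts are invented, and the hardcoded 5.991 is the 95% chi-square quantile for j − 1 = 2 degrees of freedom.

```python
def xp_statistic(counts_k, counts_hist):
    """Pearson statistic of Eq. (2): compares the class frequencies of a
    variable X in period K (counts_k) with its accumulated frequencies
    from periods 1..K-1 (counts_hist), one list entry per class."""
    n_k, n_hist = sum(counts_k), sum(counts_hist)
    return n_k * sum(
        (xk / n_k - xh / n_hist) ** 2 / (xh / n_hist)
        for xk, xh in zip(counts_k, counts_hist)
    )

# A 3-class variable whose distribution shifts in period K (invented data)
x2 = xp_statistic([60, 25, 15], [400, 300, 300])
change = x2 > 5.991   # chi-square 95% quantile at j - 1 = 2 degrees of freedom
```

For more classes or another α, the appropriate χ²_{1−α}(j − 1) quantile would be substituted for the constant.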
2.4. Methodology
This section describes the algorithmic usage of the previous estimators:
Inputs:
• G is the DM algorithm used for constructing the classiﬁcation model
(e.g., C4.5 or IFN).
• M is the classiﬁcation model constructed by the DM algorithm (e.g., a
decision tree).
• V is the validation method in use (e.g., 5-fold cross-validation).
• K is the cumulative number of periods in a data stream.
• α is the desired signiﬁcance level for the change detection procedure (the
probability of a false alarm when no actual change is present).
Outputs:
• CD(α) is the error-based change detection estimator (1 − p-value).
• XP(α) is Pearson’s chi-square estimator of distribution change
(1 − p-value).
2.5. Change Detection Procedure
Stage 1:
For periods 1, . . . , K − 1, build the model M_{K−1} using the DM algorithm G.
Define the validation data set D_{K−1(val)}.
Count the number of records n_{K−1} = |D_{K−1(val)}|.
Calculate the validation error rate ê_{M_{K−1},K−1} according to the
validation method V.
Calculate x_{iK−1}, n_{K−1} for every input and target variable existing in
periods 1, . . . , K − 1.
Stage 2:
For period K, define the set of records d_K.
Count the number of records n_K = |d_K|.
Calculate ê_{M_{K−1},K} according to the validation method V.
Calculate the difference d̂ = |ê_{M_{K−1},K} − ê_{M_{K−1},K−1}|, the variance σ̂²_d,
and the critical value z_{1−α/2} · √(σ̂²_d) for H_0.
Calculate and return CD(α).

Stage 3:
For every input and target variable existing in periods 1, . . . , K:
Calculate x_{iK}, n_K and X²_p.
Calculate and return XP(α).
It is obvious that the complexity of this procedure is at most O(n_K).
Also, it is very easy to store information about the distributions of target
and input variables in order to simplify the change detection methodology.
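The three stages can be tied together in a single driver, sketched below. Every callable argument is a placeholder of ours for a component the chapter deliberately leaves open: the DM algorithm G, the validation method V, and significance tests built on the CD and XP statistics.

```python
def change_detection_procedure(history, period_k, build_model,
                               validation_error, cd_significant, xp_significant):
    """Skeleton of the three-stage procedure.
    history  : the records of periods 1..K-1 (D_{K-1});
    period_k : the records of the new period K (d_K)."""
    # Stage 1: build M_{K-1} and its standard validation error on D_{K-1}
    model = build_model(history)
    err_hist = validation_error(model, history)
    # Stage 2: error of the old model on the new records d_K, then the CD test
    err_k = validation_error(model, period_k)
    rules_changed = cd_significant(err_k, len(period_k), err_hist, len(history))
    # Stage 3: XP distribution test for every input and target variable
    dist_changed = xp_significant(history, period_k)
    return rules_changed, dist_changed
```

With stub components in place of a real learner, the control flow can be exercised end to end before plugging in an actual algorithm such as IFN.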
Based on the outputs of the change detection procedure, the user can
make a distinction between the eight possible variations of a change in
the data mining classification model (see subsection above). Knowing the
causes of the change (if any), the user of this new methodology can act
in several ways, including: re-applying the algorithm from scratch to the
new data; absorbing the new period and updating the model by using an
incremental algorithm; setting K = K + 1 and performing the change
detection procedure again for the next period; exploring the type of the
change and its magnitude and effect on other characteristics of the DM
model; and incorporating other known methods dealing with the specific
change(s) detected. One may also apply multiple-model approaches such
as boosting, voting, bagging, etc.
The methodology is not restricted to databases with a constant number
of variables. The basic assumption is that if the addition of a new variable
influences the relationship between the target variable and the input
variables in a manner that changes the validation accuracy, it will be
identified as a significant change.
The procedure has three major stages. The first one performs the
initialization of the procedure. The second stage is aimed at detecting
a significant change in the “patterns” (rules) of the pre-built data mining
model, as described in the previous section. The third stage tests whether
the distribution of one or more variable(s) in the set of input or target
variable(s) has changed.
The basic assumption for using this procedure is the availability of
sufficient statistics for a run of the algorithm in every period. As indicated
above, if this assumption is not valid, it is necessary to merge two or more
periods to maintain statistically significant outcomes.
3. Experimental Evaluation
3.1. Design of Experiments
In order to evaluate the change detection algorithm, a set of artificially
generated datasets was built with the following characteristics:
• Predetermined definition and distribution of all variables (candidate
input and target).
• Predetermined set of rules.
• Purely random generation of records.
• Non-correlated datasets (between periods).
• Minimal randomly generated noise.
• No missing data.
In all generated datasets, we have introduced and tested a series of
artificial non-correlated changes of various types.
All datasets were mined with the IFN (Information-Fuzzy Network)
program (version 1.2 beta), based on the Information-Theoretic Fuzzy
Approach to Knowledge Discovery in Databases [Maimon and Last (2000)].
This novel method, developed by Mark Last and Oded Maimon, was shown
to have better dimensionality reduction capability, interpretability, and
stability than other data mining methods [e.g., see Last et al. (2002)] and
was therefore found suitable for this study.
This chapter uses two sets of experiments to evaluate the performance
of the change detection procedure:
• The first set is aimed at estimating the hit rate (also called the “true
positive rate”) of the change detection methodology. Twenty-four different
changes in two different databases were designed under the rules
mentioned above in order to confirm the expected outcomes of the
change detection procedure. Table 2 below summarizes the distribution
of the artificially generated changes in experiments on Database#1 and
Database#2.
• All changes were tested independently at the minimum 5% confidence
level by the following set of hypotheses. All hypotheses were tested
separately with the purpose of evaluating the relationship of all tests.
• Also, 12 non-correlated sets of experiments, which do not contain any
change, were implemented in order to estimate the actual Type I error
rate (α), the “detection” of a change that does not occur. This error rate
is also called the “false positive rate” or “false alarm rate”.
The expected outcomes of the hypothesis testing on the various types
of changes are described in the following tables.
It is expected that every uncorrelated change implemented will lead to
the associated outcome of the relevant test as indicated in Tables 3 and 4.
The second part of the experiments on artificially generated datasets
evaluated the change detection procedure on time series data. During the
third and sixth periods (out of seven consecutive periods), two changes (in
classification “rules”) were introduced into the database, independently of
the previous and subsequent periods. It is expected that
Table 2. Distribution of the artificially generated changes in experiment part 1.

             Change in A  Change in Y  Change in the “Rules”  Total
Database #1       4            2                6              12
Database #2       4            2                6              12
Sum               8            4               12              24
Table 3. Change Detection in Rules (CD).

          Change in A        Change in T        Change in the “patterns” (rules)
CD (5%)   TRUE/FALSE (N/A)   TRUE/FALSE (N/A)   TRUE

Table 4. Pearson’s estimator for comparing distributions of variables (XP).

Database section tested              Change in A        Change in T        Change in the “patterns” (rules)
Candidate variables, XP(5%)          TRUE               TRUE/FALSE (N/A)   TRUE/FALSE (N/A)
Target variables, XP(5%)             TRUE/FALSE (N/A)   TRUE               TRUE/FALSE (N/A)
Candidate & target variables, XP(5%) TRUE               TRUE               TRUE/FALSE (N/A)
the Change Detection Procedure will reveal the changes in these periods
only and not in any other period.
3.2. Results — Part 1 (Hit Rate and False Alarm Rate)
Based on the 24 change detection trials the following accumulated results
have been obtained, as described thoroughly in Tables 5 and 6:
• All artificial changes in the “patterns” (rules) relating the input variables
to the target variable were detected by the CD (5%) procedure. The
average detection rate of CD was 100%, meaning that all changes were
recognized as significant by the procedure.
• Two trials generated by an artiﬁcial change in the target variable were
detected by the CD procedure. The average detection rate of CD in these
trials was 100%.
• According to the XP hypothesis testing, 50% of changes in the candidate
input variables produced a change in the target variable, resulting in an
average detection rate of 100% for input attributes and 98% for the target
attribute.
• All the changes, which were mainly introduced to aﬀect the target vari
able, were recognized as signiﬁcant by the XP procedure.
• All the changes, which were mainly introduced to aﬀect the relationship
between the target and the candidate input variables, were recognized as
signiﬁcant by the XP procedure, with an average of 100% detection rate.
• The actual Type II error (false negative) rate of the change detection
methodology in this experiment is 0%. No change was left undetected by
the methodology.
• The actual Type I error (false alarm) rate of the change detection
methodology in these experiments is up to 6%. When implementing the
methodology in trials which do not include changes, the CD estimator
did not produce false alarms (0% error rate), and the XP estimator failed
to produce an accurate estimation in only five out of 84 tests (5.9%).
To conclude this part of the experiments with various possible changes
in the data mining model, the following assumptions are validated by the
results described above: a major change can cause secondary effects and
generate other changes, and the Change Detection procedure can detect
these changes with theoretically and practically low Type I and Type II
error rates.
Table 5. Average detection rate of the 24 trials (all significant attributes included in XP).

Type of change:         A      T      R      Average
Number of changes:      8      4      12     24
CD                      67%    99%    100%   90%
XP(candidate)           100%                 100%
XP(target)                     100%   100%   100%
XP(target+candidate)    99%    98%    98%    99%

Table 6. Distribution of 5% significance outcomes in the 24 trials.

Type of change:         A      T      R      Sum
Number of changes:      8      4      12     24
CD                      0      3      12     15
XP(candidate)           4                    4
XP(target)                     1      9      10
XP(target+candidate)    4      3      3      10
3.3. Results — Part 2 (Time Series Data)
Table 7 and Figure 1 describe the outcomes of applying the IFN algorithm
to seven consecutive periods from the same artiﬁcially generated database
(Database#1), with two changes introduced, and using the change detection
methodology to detect signiﬁcant changes that have occurred during these
periods.
The results of the experiment can be summarized as follows:
• All changes are detected by the change detection procedure only in the
relevant period where the corresponding change has occurred.
• Whenever CD attained a significant value (at the 5% significance level),
XP also reached a significant value.
• The effect of the change decreases after one period for CD (when the
change occurs in period K, the significance level in period K + 1 is close
to that of period K − 1). Hence, after one period, if a change is not
properly detected, it will be absorbed and discarded.
• The XP estimator of distribution change in input and target variables is
quite sensitive to the effect of any change.
These results support the assumptions that a change in the target vari
able does not necessarily aﬀect the classiﬁcation “rules” of the database and
that a change can mainly be detected in the ﬁrst period after its occurrence.
Table 7. Results of the CD hypothesis testing on an artificially generated time series database.

Period  Change      ê_{M_{K−1},K}  ê_{M_{K−1},K−1}  d       H(95%)  CD (1 − p-value)  XP (1 − p-value)
        introduced
1       No          —              —                —       —       —                 —
2       No          20.30%         24.00%           3.70%   4.30%   91.90%            92.50%
3       Yes         32.80%         19.80%           13.00%  3.20%   100.00%           100.00%
4       No          26.60%         24.60%           2.00%   2.80%   88.20%            100.00%
5       No          26.30%         26.40%           0.10%   2.50%   52.60%            99.90%
6       Yes         18.80%         27.20%           8.40%   2.20%   100.00%           100.00%
7       No          22.10%         22.00%           0.10%   2.00%   53.40%            52.80%
[Figure 1: bar chart of the CD and XP confidence levels for periods 2–7.]
Fig. 1. Summary of implementing the change detection methodology on an artificially
generated time series database (1 − p-value).
Table 8. Influence of discarding the detected change (illustration).

Candidate Variable     Prediction               Match
Products   Holidays    Periods 1–5   Period 6
0          *           1             1          Yes
1          *           1             1          Yes
2          *           1             1          Yes
3          *           1             1          Yes
4          0           0             1          No
4          1           1             1          Yes
5          0           1             1          Yes
5          1           1             1          Yes
6          *           0             0          Yes
7          *           0             0          Yes
8          *           0             0          Yes
If our procedure detects a change in period K of a time series, the
incrementally built classification model should be modified accordingly.
Modification methods could be one of the following [see Last (2002)]:
(a) rebuild a new model based on period K only; (b) combine the DM
model from periods 1, . . . , K with a new model by using a voting schema
(unified, exponential smoothing, any weighted majority, deterministic
choice, etc.) in order to improve predictive accuracy in the future; (c) add
j more periods in order to rebuild a model based on periods K, . . . , K + j.
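Option (b), a weighted-majority schema, can be sketched as follows. The models and weights here are invented placeholders; in practice the weights might come, e.g., from CD confidence levels or exponential smoothing.

```python
def weighted_vote(classifiers, weights, record):
    """Combine per-period models by weighted majority: each classifier
    votes for its predicted label with its weight; the heaviest label wins."""
    tally = {}
    for classify, weight in zip(classifiers, weights):
        label = classify(record)
        tally[label] = tally.get(label, 0.0) + weight
    return max(tally, key=tally.get)

# Three per-period models; the newest one (built after the detected change)
# gets the largest weight, so its vote dominates.
models = [lambda r: 0, lambda r: 0, lambda r: 1]
label = weighted_vote(models, [0.2, 0.2, 0.6], record=None)
```

Shifting the weights toward the older models would instead let the pre-change majority win, which is the trade-off the voting schema controls.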
In order to illustrate the inﬂuence of discarding a detected change, we
look at the rules induced from the database by the IFN algorithm before
and after the second change has occurred (Period 6).
It can be shown by a simple calculation that if all rules have the same
frequency (that is, assuming a uniform distribution of all input variables),
the expected increase in the error rate as a result of the change in Period 6
is 9% (one changed rule out of eleven, i.e., 1:11).
4. A Real-World Case Study
4.1. Dataset Description
In this section, we apply our change detection method to a real-world
time series dataset. The objectives of the case study are: (1) to determine
whether our method detects changes which have occurred during some time
periods; (2) to determine whether our method produces “false alarms” when
the dataset’s characteristics do not change significantly over time.
The dataset has been obtained from a large manufacturing plant in
Israel and represents daily production orders of products. From now on,
this dataset will be referred to as “Manufacturing”.
The candidate input attributes are: Catalog number group (CATGRP),
a discrete categorical variable; Market code group (MRKTCODE), a
discrete categorical variable; Customer code group (CUSTOMERGRP),
a discrete categorical variable; Processing duration (DURATION), a
discrete categorical variable which represents the processing times as
disjoint intervals of variable size; Time left to operate in order to meet
demand (TIME TO OPERATE), a discrete categorical variable which
stands for the amount of time between the starting date of the production
order and its due date, where each value represents a distinct time interval;
and Quantity (QUANTITY), a discrete categorical variable which describes
the quantity of items in a production order, where each value represents
a distinct quantity interval. The target variable indicates whether the order
was delivered on time or not (0 or 1).
The time series database in this case study consists of records of
production orders accumulated over a period of several months. The
‘Manufacturing’ database was extracted from a continuous production
sequence. Without further knowledge of the process or any other relevant
information about the nature of change of that process, we may assume
that no significant changes in the operation characteristics are expected
over such a short period of time.
4.2. Presentation and Analysis of Results
Table 9 and Figure 2 describe the results of applying the IFN algorithm
to six consecutive months in the ‘Manufacturing’ database and using our
change detection methodology to detect significant changes that have
occurred during these months.
The XP statistics, as described in Table 9 and Figure 2, refer only to
the target variable (delivery on time). The magnitude of change in the
candidate input variables as evaluated across the monthly intervals is
shown in Table 10.
Table 9. Results of the CD hypothesis testing on the ‘Manufacturing’ database.

Month  ê_{M_{K−1},K}  ê_{M_{K−1},K−1}  d       H(95%)  CD (1 − p-value)  XP (1 − p-value)
1      —              —                —       —       —                 —
2      14.10%         12.10%           2.00%   4.80%   58.50%            78.30%
3      11.70%         10.40%           1.30%   3.40%   54.40%            98.80%
4      10.60%         9.10%            1.50%   2.90%   68.60%            76.50%
5      11.90%         10.10%           1.80%   2.80%   78.90%            100.00%
6      6.60%          8.90%            2.30%   2.30%   95.00%            63.10%
[Figure 2: bar chart of the CD and XP confidence levels for months 2–6.]
Fig. 2. Summary of implementing the change detection methodology on the
‘Manufacturing’ database (1 − p-value).
Table 10. XP confidence level of all independent and dependent variables
in the ‘Manufacturing’ database (1 − p-value).

          CATGRP  MRKTCODE  Duration  Time to operate  Quantity  CUSTOMERGRP
Domain    18      19        19        19               15        18
Month 2   100%    100%      100%      100%             100%      100%
Month 3   100%    100%      100%      100%             100%      100%
Month 4   100%    99.8%     100%      100%             100%      100%
Month 5   100%    99.9%     100%      100%             100%      100%
Month 6   100%    100%      100%      100%             100%      100%
According to the change detection methodology, during all six
consecutive months there was no significant change in the rules describing
the relationships between the candidate and the target variables (which is
our main interest). Nevertheless, it is easy to notice that major changes
have been revealed by the XP statistic in the distributions of most target
and candidate input variables. One can expect that variables with a large
number of values need larger data sets in order to reduce the variation of
their distribution across periods. However, this phenomenon has not
affected the CD statistic.
An interesting phenomenon is the increasing rate of the CD confidence
level from month 2 to month 6. In order to further investigate whether a
change in frequency distribution has still occurred during the six
consecutive months without resulting in a significant CD confidence level,
we have validated the sixth month on the fifth and the first months.
Table 11 and Figure 3 describe the outcomes of the change detection
methodology.
Implementing the change detection methodology by validating the sixth
month on the fifth and the first month did not produce contradicting results.
That is, the CD confidence level of both months ranges only within ±8%
of the original CD estimation based on all five previous months.
Furthermore, although XP produced extremely high confidence levels,
indicating a drastic change in the distribution of all candidate and target
variables, the data mining model was not affected, and it kept producing
similar validation error rates (which were statistically evaluated by CD).
The following statements summarize the case study’s detailed results:
• Our expectation that in the ‘Manufacturing’ database, there are no sig
niﬁcant changes in the relationship between the candidate input variables
and the target variable over time, is validated by the change detection
Table 11. Outcomes of XP by validating the sixth month on the fifth and the first month
in the ‘Manufacturing’ database (1 − p-value).

Metric XP (1 − p-value)             CATGRP  MRKTCODE  Duration  Time to operate  Quantity  CUSTOMERGRP  Target
Domain                              18      19        19        19               15        18           2
Month 5 validated by month 6        100%    100%      100%      100%             100%      100%         98.4%
Month 1 validated by month 6        100%    100%      100%      100%             100%      100%         100%
Months 1 to 5 validated by month 6  100%    100%      100%      100%             100%      100%         63.1%
[Figure 3: bar chart comparing CD confidence levels (95.0%, 88.5%, 76.1%) across the
three validation settings.]
Fig. 3. CD confidence level (1 − p-value) outcomes of validating the sixth month on
the fifth and the first month in the ‘Manufacturing’ database.
procedure. No results of the change detection metric (CD) exceeding
the 95% confidence level were produced in any period. This means that
no “false alarms” were issued by the procedure.
• Statistically significant changes in the distributions of the candidate
input (independent) variables and the target (dependent) variable across
monthly intervals have not generated a significant change in the rules
induced from the database.
• The CD metric implemented by our method can also be used to determine
whether an incrementally built model is stable. If we apply a stable
data mining algorithm, like the Info-Fuzzy Network, to an accumulating
amount of data, it should produce increasing confidence levels of the CD
metric over the initial periods of the time series, as more data supports
the induced classification model.
Thus, the results obtained from a real-world time series database confirm
the conclusions of the experiments on artificial datasets with respect to the
reliability of the proposed change detection methodology.
5. Conclusions and Future Work
As mentioned above, most methods of batch learning are based on the
assumption that the training data involved in building and verifying the
model is the best estimator of what will happen in the future. An important
factor that must not be set aside is the time factor. As more data is
accumulated in a time series database incrementally over time, one must
examine whether the data in a new period agrees with the data in previous
periods and make the relevant decisions about future learning. This work
presents a novel change detection method for detecting significant changes
in classification models induced by batch learning methods from
continuously accumulated data streams.
The following statements can summarize the major contributions of this
work to the area of data mining and knowledge discovery in databases:
(i) This work defines three main causes for a statistically significant
change in a data mining model:
• A change in the probability distribution of one or more of the
candidate input variables A.
• A change in the distribution of the target variable T.
• A change in the “patterns” (rules), which define the relationship
of the candidate input to the target variable; that is, a change in
the model M.
This work has shown that although there are three main causes for
significant changes in data mining models, it is common for these
main causes to coexist in the same data stream, yielding eight possible
combinations for a significant change in a classification model induced
from time series data. Moreover, these causes affect each other in a
manner and magnitude that depend on the database being mined and
the algorithm in use.
(ii) The change can be detected by the change detection procedure using a
three-stage validation technique. This technique is designed to detect
all possible significant changes.
(iii) The change detection method relies on the implementation of two
statistical tests:
(a) Change Detection hypothesis testing (CD) of every period K,
based on the deﬁnition of a signiﬁcant change C in classiﬁcation
“rules”, with respect to the previous K − 1 periods.
(b) Pearson’s estimator (XP) for testing matching proportions of vari
ables to detect a signiﬁcant change in the probability distribution
of candidate input and target attributes.
(iv) The eﬀect of a change is relevant and can be mainly detected in
the period of change. If not detected, the inﬂuence of a change in
a database can be absorbed in successive periods.
(v) All procedures, hypothesis tests and deﬁnitions, were validated on
artiﬁcially generated databases and on a realworld database.
(vi) The change detection procedure has low computational costs. Its
complexity is O(n_K), where n_K is the total number of validation
records in K periods, since it requires only testing whether the new
data agrees with the model induced from previously aggregated data.
(vii) The CD metric can also be used to determine whether an
incrementally built model is stable. In our real-world experiment, the
stability of the info-fuzzy network has been confirmed by the increasing
confidence levels over the initial periods of the data stream.
The Change Detection Procedure, with the use of the statistical
estimators, can detect significant changes in classification models of data
mining. These changes can be detected independently of the data mining
algorithm used (e.g., C4.5 and ID3 by Quinlan, KS2, ITI and DMTI by
Utgoff, IFN by Last and Maimon, IDTM by Kohavi, Shen’s CDL4, etc.)
or the induced classification model (rules, decision trees, networks, etc.).
As change detection is quite a new application area in the field of data
mining, many future issues could be developed, including the following:
(i) Implementing meta-learning techniques according to the cause(s) and
magnitude(s) of the change(s) detected in period K for combining
several models, using, e.g., exponential smoothing, voting weights
based on the CD confidence level, ignoring old or problematic periods,
etc.
(ii) Increasing the granularity of significant changes. There should be several subtypes of changes, with various magnitudes of change in the model structure and parameters, that could be identified by the change detection procedure in order to give the user extra information about the mined time series.
(iii) Integrating the change detection methodology into an existing data mining algorithm. As indicated above, the change detection procedure’s complexity is only O(n). One can implement this procedure in an existing incremental (online) learning algorithm, which will continue efficiently rebuilding an existing model if the procedure does not indicate a significant change in the newly obtained data. This option is also applicable to meta-learning and multi-model methods.
Change Detection in Classiﬁcation Models Induced from Time Series Data 123
(iv) Using the CD statistical hypothesis testing for continuous monitoring
of speciﬁc attributes.
(v) Using the CD statistical estimator or its variations as a metric for
measuring the stability of an incrementally built data mining model.
(vi) Further analysis of the relationship between the two error rates α (false
positive) and β (false negative) and its impact on the performance of
the change detection procedure.
(vii) Further validation of the Change Detection Procedure on other DM
models, algorithms, and data types.
References
1. Ali, K. and Pazzani, M. (1995). Learning Multiple Descriptions to Improve Classification Accuracy. International Journal on Artificial Intelligence Tools, 4, 1–2.
2. Case, J., Jain, S., Lange, S. and Zeugmann, T. (1999). Incremental Concept Learning for Bounded Data Mining. Information & Computation, 152(1), pp. 74–110.
3. Chan, P.K. and Stolfo, S.J. (1995). A Comparative Evaluation of Voting and Meta-learning on Partitioned Data. Proceedings of the Twelfth International Conference on Machine Learning, pp. 90–98.
4. Chan, P.K. and Stolfo, S. (1996). Sharing Learned Models among Remote Database Partitions by Local Meta-learning. Proc. Second Intl. Conf. on Knowledge Discovery and Data Mining, pp. 2–7.
5. Cheung, D., Han, J., Ng, V., and Wong, C.Y. (1996). Maintenance of Discovered Association Rules in Large Databases: An Incremental Updating Technique. Proceedings of the 1996 International Conference on Data Engineering (ICDE’96), New Orleans, Louisiana, USA, Feb. 1996.
6. Domingos, P. (1997). Bayesian Model Averaging in Rule Induction. Preliminary Papers of the Sixth International Workshop on Artificial Intelligence and Statistics, Ft. Lauderdale, FL: Society for Artificial Intelligence and Statistics, pp. 157–164.
7. Domingos, P. (1997). Knowledge Acquisition from Examples Via Multiple Models. Proceedings of the Fourteenth International Conference on Machine Learning, Nashville, TN: Morgan Kaufmann, pp. 98–106.
8. Domingos, P. (1996). Using Partitioning to Speed Up Specific-to-General Rule Induction. Proceedings of the AAAI-96 Workshop on Integrating Multiple Learned Models, AAAI Press, pp. 29–34.
9. Fawcett, T. and Provost, F. (1999). Activity Monitoring: Noticing Interesting Changes in Behavior. Proceedings of the Fifth International Conference on Knowledge Discovery and Data Mining (KDD-99).
10. Fayyad, U., Piatetsky-Shapiro, G., and Smyth, P. (1996). Knowledge Discovery and Data Mining: Towards a Unifying Framework. Proceedings of the Second ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-96).
11. Freund, Y. and Mansour, Y. (1997). Learning under persistent drift. Proceedings of EuroColt, pp. 109–118.
12. Gerevini, A., Perini, A., and Ricci, F. (1996). Incremental Algorithms for Managing Temporal Constraints. IEEE International Conference on Tools with Artificial Intelligence (ICTAI’96).
13. Harries, M. and Horn, K. (1996). Learning stable concepts in domains with hidden changes in context. Proceedings of the ICML-96 Workshop on Learning in Context-Sensitive Domains, Bari, Italy, July 3, pp. 21–31.
14. Helmbold, D.P. and Long, P.M. (1994). Tracking Drifting Concepts By Minimizing Disagreements. Machine Learning, 14(1), 27–46.
15. Hsu, C.N. and Knoblock, C.A. (1998). Discovering Robust Knowledge from Databases that Change. Journal of Data Mining and Knowledge Discovery, 2(1), 1–28.
16. Hines, W.H. and Montgomery, D.C. (1990). Probability and Statistics in Engineering and Management Science. Third Edition, Wiley.
17. Jones, R.H., Crowell, D.H., and Kapuniai, L.E. (1970). Change detection model for serially correlated multivariate data. Biometrics, 26, 269–280.
18. Hulten, G., Spencer, L., and Domingos, P. (2001). Mining Time-Changing Data Streams. Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-2001).
19. Kelly, M.G., Hand, D.J., and Adams, N.M. (1999). The Impact of Changing Populations on Classifier Performance. Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-99).
20. Kohavi, R. (1995). The Power of Decision Tables. Proceedings of the European Conference on Machine Learning (ECML).
21. Lane, T. and Brodley, C.E. (1998). Approaches to Online and Concept Drift for User Identification in Computer Security. Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining, New York, NY, pp. 259–263.
22. Last, M., Maimon, O., and Minkov, E. (2002). Improving Stability of Decision Trees. International Journal of Pattern Recognition and Artificial Intelligence, 16(2), 145–159.
23. Last, M. (2002). Online Classification of Nonstationary Data Streams. Intelligent Data Analysis, 6(2), 129–147.
24. Liu, H. and Setiono, R. (1998). Incremental Feature Selection. Journal of Applied Intelligence, 9(3), 217–230.
25. Maimon, O. and Last, M. (2000). Knowledge Discovery and Data Mining, the Info-Fuzzy Network (IFN) Methodology, Kluwer.
26. Martinez, T. (1990). Consistency and Generalization in Incrementally Trained Connectionist Networks. Proceedings of the International Symposium on Circuits and Systems, pp. 706–709.
27. Mangasarian, O.L. and Solodov, M.V. (1994). Backpropagation Convergence via Deterministic Nonmonotone Perturbed Minimization. Advances in Neural Information Processing Systems, 6, 383–390.
28. Minium, E.W., Clarke, R.B., and Coladarci, T. (1999). Elements of Statistical Reasoning, Wiley, New York.
29. Mitchell, T.M. (1997). Machine Learning, McGraw Hill.
30. Montgomery, D.C. and Runger, G.C. (1999). Applied Statistics and Probability for Engineers, Second Edition, Wiley.
31. Nouira, R. and Fouet, J.M. (1996). A Knowledge Based Tool for the Incremental Construction, Validation and Refinement of Large Knowledge Bases. Preliminary Proceedings of the Workshop on Validation, Verification and Refinement of KBS (ECAI-96).
32. Ohsie, D., Dean, H.M., Stolfo, S.J., and Silva, S.D. (1995). Performance of Incremental Update in Database Rule Processing.
33. Shen, W.M. (1997). An Active and Semi-Incremental Algorithm for Learning Decision Lists. Technical Report, USC-ISI-97, Information Sciences Institute, University of Southern California, 1997. Available at: http://www.isi.edu/~shen/activecdl4.ps.
34. Shen, W.M. (1997). Bayesian Probability Theory — A General Method for Machine Learning. MCC-Carnot-10193, Microelectronics and Computer Technology Corporation, Austin, TX. Available at: http://www.isi.edu/~shen/BayesML.ps.
35. Utgoff, P.E. and Clouse, J.A. (1996). A Kolmogorov-Smirnoff metric for decision tree induction. Technical Report 96-3, Department of Computer Science, University of Massachusetts. Available at: ftp://ftp.cs.umass.edu/pub/techrept/techreport/1996/UM-CS-1996-003.ps
36. Utgoff, P.E. (1994). An improved algorithm for incremental induction of decision trees. Machine Learning: Proceedings of the Eleventh International Conference, pp. 318–325.
37. Utgoff, P.E. (1995). Decision tree induction based on efficient tree restructuring. Technical Report 95-18, Department of Computer Science, University of Massachusetts. Available at: ftp://ftp.cs.umass.edu/pub/techrept/techreport/1995/UM-CS-1995-018.ps
38. Widmer, G. and Kubat, M. (1996). Learning in the Presence of Concept Drift and Hidden Contexts. Machine Learning, 23(1), 69–101.
39. Yao, Y. (1988). Estimating the number of change points via Schwarz’ criterion. Statistics and Probability Letters, pp. 181–189.
40. Zhang, B.T. (1994). An Incremental Learning Algorithm that Optimizes Network Size and Sample Size in One Trial. Proceedings of the International Conference on Neural Networks (ICNN-94), IEEE, pp. 215–220.
CHAPTER 6
CLASSIFICATION AND DETECTION OF ABNORMAL
EVENTS IN TIME SERIES OF GRAPHS
H. Bunke
Institute of Computer Science and Applied Mathematics
University of Bern, CH-3012 Bern, Switzerland
Email: bunke@iam.unibe.ch
M. Kraetzl
ISR Division, Defense Science and Technology Organization
Edinburgh SA 5111, Australia
Email: mkraetz@nsa.gov
Graphs are widely used in science and engineering. In this chapter, the problem of detecting abnormal events in time series of graphs is investigated. A number of graph similarity measures are introduced. These measures are useful to quantitatively characterize the degree of change between two graphs in a time series. Based on any of the introduced graph similarity measures, an abnormal change is detected if the similarity between two consecutive graphs in a time series falls below a given threshold. The approach proposed in this chapter is not geared towards any particular application. However, to demonstrate its feasibility, its application to abnormal event detection in the context of computer network monitoring is studied.
Keywords: Graph; time series of graphs; graph similarity; abnormal event detection; computer network monitoring.
1. Introduction
Graphs are a powerful and flexible data structure useful for the representation of objects and concepts in many disciplines of science and engineering. In a graph representation, the nodes typically model objects, object parts, or object properties, while the edges describe relations between the nodes, for example, temporal, spatial, or conceptual dependencies between the objects that are modelled through the nodes. Examples of applications
where graph representations have been successfully used include chemical structure analysis [1], molecular biology [2], software engineering [3], and database retrieval [4]. In artificial intelligence, graphs have been successfully applied to case-based reasoning [5], machine learning [6], planning [7], knowledge representation [8], and data mining [9]. Also in computer vision and pattern recognition graphs are widely used. Particular applications include character recognition [10], graphics recognition [11], shape analysis [12], and object classification [13].
In many artificial intelligence applications, object representations in terms of feature vectors or lists of attribute-value pairs are used [14]. On the one hand, graphs have a much higher representational power than feature vectors and attribute-value lists, as they are able to model not only unary, but higher-order relations and dependencies between various entities. On the other hand, many operations on graphs are computationally much more costly than equivalent operations on feature vectors and attribute-value lists. For example, computing the dissimilarity, or distance, of two objects that are represented through feature vectors is an operation that is linear in the number of features, but computing the edit distance of two graphs is exponential in the number of their nodes [15].
In this chapter we focus on a special class of graphs that is characterized by a low computational complexity with regard to a number of operations. Hence this class of graphs offers the high representational power of graphs together with the low computational complexity typically found with feature vector representations. The computational efficiency of the graph operations results from constraining the graphs to have unique node labels. In particular we are concerned with time series of graphs. In such series, each individual graph represents the state of an object, or a system, at a particular point of time. The task under consideration is the detection of abnormal events, i.e. the detection of an abnormal change of the state of the system when going from one point in time to the next.
Data mining is generally concerned with the detection and extraction of meaningful patterns and rules from large amounts of data [16,17]. Classification is considered to be one of the main subfields of data mining [18]. The goal of classification is to assign an unknown object to one out of a given number of classes, or categories. Obviously, the task considered in the present paper is a particular instance of classification, where the transitions between the individual elements of a given sequence of graphs are to be classified as normal or abnormal. From the general point of view, such a classification may serve as the first step of a more complex data mining
and knowledge extraction process that is able to infer hitherto hidden relationships between normal and abnormal events in time series.
The procedures introduced in this chapter do not make any particular assumptions about the underlying problem domain. In other words, the given sequence of graphs may represent any time series of objects or systems. However, to demonstrate the feasibility of the proposed approach, a special problem in the area of computer network monitoring is considered [19]. In this application each graph in a sequence represents a computer network at a certain point of time, for example, at a certain time each day. Changes of the network are captured using a graph distance measure. Hence a sequence of graph similarity values is derived from a given time series of graphs. Each similarity value is an indicator of the degree of change that occurred to the network between two consecutive points of time. In this sequence of similarity values, abnormal change can be detected by means of thresholding and similar techniques. That is, it is assumed that an abnormal change has occurred if the similarity between two consecutive graphs is below a given threshold.
There are other applications where complex systems, which change their behaviour or properties over time, are modelled through graphs. Examples of such systems include electrical power grids [20], regulatory networks controlling the mammalian cell cycle [21], and co-authorship and citation networks in science [22]. The formal tools introduced in this chapter can also be used in these applications to detect abnormal change in the system’s behaviour.
The remainder of this chapter is organized as follows. In Section 2 our
basic terminology is introduced. Graph similarity measures based on graph
spectra are presented in Section 3. Another approach to measuring graph
similarity, using graph edit distance, is discussed in Section 4. The median
of a set, or sequence, of graphs is introduced in Section 5, and its use for
the classiﬁcation of events in a sequence of graphs is discussed in Section 6.
Then in Section 7 the application of the similarity measures introduced in
the previous sections to the detection of abnormal change in communication
networks is discussed. Finally, some conclusions are drawn in Section 8.
2. Preliminaries
A graph G = (V, E) consists of a finite set of vertices V and a finite set of edges E, which are pairs of vertices; a pair of vertices denotes the endpoints of an edge. Two vertices u, v ∈ V are said to be adjacent if they are endpoints of
the same edge [23]. Edges in G can be directed. In this case edge (u, v) ∈ E
originates at node u ∈ V and terminates at node v ∈ V .
Objects such as vertices or edges (or their combinations) associated with a graph are referred to as elements of the graph. A weight is a function whose domain is a set of graph elements in G. The domain can be restricted to that of edge or vertex elements only, where the function is referred to as edge-weight or vertex-weight, respectively. A weight whose domain is all vertices and edges is called total. Values of weight assigned to elements in G may be numerical (e.g. the amount of traffic in a computer network as values of edge-weight), or symbolic (e.g. node identifiers as values of vertex-weight). The set of possible values in the range of the weight function are called attributes. A unique labeling is a one-to-one weight; for a vertex-weight this serves to uniquely identify the vertices of a graph. A graph with a unique labeling is called a labeled graph.
The graph G = (V, E, w_V, w_E) is vertex-labeled with a weight w_V : V → L_V assigning vertex identifier attributes L_V to individual vertices, where the value of an assigned vertex-weight is w_V(u) for vertex u. Furthermore, w_V(u) ≠ w_V(v) for all u, v ∈ V with u ≠ v, since the graphs in our application possess unique vertex-labelings. Edges are also weighted with a weight w_E : E → R^+. The notation w^G_E is used to indicate that edge weight w_E belongs to graph G. The number of vertices in G = (V, E) is denoted by |V|, and likewise the number of edges is denoted by |E|.
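For illustration, a graph of this class can be held in memory as a pair: a set of unique vertex labels and a map from directed edges to positive weights. This representation is not prescribed by the chapter; it is one convenient choice, and the host names below are invented:

```python
def make_graph(edges):
    """Build (V, w_E) from (u, v, weight) triples.  Vertex labels are
    assumed to be unique identifiers, so the vertex correspondence
    between two graphs is immediate; later sections exploit this."""
    vertices, weights = set(), {}
    for u, v, w in edges:
        vertices.update((u, v))
        weights[(u, v)] = w  # edge-weight w_E(u, v), e.g. traffic volume
    return vertices, weights

net = make_graph([("srv1", "srv2", 5.0), ("srv2", "srv3", 2.5)])
```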
3. Analysis of Graph Spectra
One of the known approaches to the measurement of change in a time
series of graphs is through the analysis of graph spectra. Algebraic aspects
of spectral graph theory are useful in the analysis of graphs [24–27]. There
are several ways of associating matrix spectra (sets of all eigenvalues) with
a given weighted graph G. The most obvious way is to investigate the
structure of a ﬁnite digraph G by analyzing the spectrum of its adjacency
matrix A
G
. Note that A
G
is not a symmetric matrix for directed graphs
because weight may not be a symmetric function.
For a given ordering of the set of n vertices V in a graph G = (V, E), one can investigate the spectrum σ(G) = {λ_1, λ_2, . . . , λ_n}, where the λ_i are the eigenvalues of the weighted adjacency matrix A_G. Obviously, σ(G) does not depend on the ordering of V. Also, the matrix spectrum does not change under any nonsingular transformation. Since the adjacency matrices have nonnegative entries, the class of all isospectral graphs (graphs having
identical spectra) must be relatively small. Because of that fact, and because isomorphic graphs are isospectral, one might expect that the classes of isospectral and isomorphic graphs coincide. However, it can be shown that the class of isospectral graphs is larger [24].
It is easy to verify that by an (arbitrary) assignment of the x_i's in an eigenvector x = [x_1, x_2, . . . , x_n] of A_G to the n various vertices in V, the components of x can be interpreted as positive vertex-weight values (attributes) of the corresponding vertices in the digraph G. This property is extremely useful in finding connectivity components of G, and in various clustering, coarsening, or graph condensing implementations [28].
A different approach to graph spectra for undirected graphs [25,29,30] investigates the eigenvalues of the Laplace matrix L_G = D_G − A_G, where the degree matrix D_G of graph G is defined as D_G = diag{Σ_{v∈V} w^G_E(u, v) | u ∈ V_G}. Note that in the unweighted case, the diagonal elements of D_G are simply the vertex degrees of the indexed vertices V. It follows that the Laplacian spectrum is always nonnegative, and that the number of zero eigenvalues of L_G equals the number of connectivity components of G. In the case of digraphs, the Laplacian is defined as L_G = D_G − (A_G + A_G^T) (in order to ensure that σ(L_G) ⊆ {0} ∪ R^+). The second smallest eigenvalue (called the algebraic connectivity of G) provides graph connectivity information and is always smaller than or equal to the vertex connectivity of G [30]. Laplacian spectra of graphs have many applications in graph partitions, isoperimetric problems, semidefinite programming, random walks on graphs, and infinite graphs [25,29].
The relationship between graph spectra and graph distance measures can be established using an eigenvalue interpretation [27]. Consider the weighted graph matching problem for two graphs G = (V_G, E_G) and H = (V_H, E_H), where |V_G| = |V_H| = n, with positive edge-weights w^G_E and w^H_E, which is defined as finding a one-to-one function Φ : V_G → V_H such that the graph distance function

dist(Φ) = Σ_{u,v∈V_G} [w^G_E(u, v) − w^H_E(Φ(u), Φ(v))]²  (3.1)

is minimal. This is equivalent to finding a permutation matrix P that minimises J(P) = ||P A_G P^T − A_H||², where A_G and A_H are the (weighted) adjacency matrices of the two graphs, and ||·|| is the ordinary Euclidean matrix norm. Notice that this graph matching problem is computationally expensive because it is a combinatorial optimisation problem.
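To make the combinatorial nature of the problem concrete, here is a naive exhaustive search minimising dist(Φ) of Eq. (3.1) over all n! vertex mappings. It is a sketch for tiny graphs only; the edge-weight dictionaries and the integer vertex numbering are assumptions of this example:

```python
from itertools import permutations

def match_cost(wg, wh, n, phi):
    """dist(Phi): sum of squared edge-weight differences under the
    mapping phi, where phi[u] is the image of vertex u (a missing
    edge counts as weight 0)."""
    return sum((wg.get((u, v), 0.0) - wh.get((phi[u], phi[v]), 0.0)) ** 2
               for u in range(n) for v in range(n))

def best_match(wg, wh, n):
    """Brute-force optimal mapping: try every permutation of 0..n-1.
    Feasible only for very small n, which is precisely the point."""
    return min(permutations(range(n)),
               key=lambda phi: match_cost(wg, wh, n, phi))

# Mapping (1, 0) carries the single edge (0, 1) onto (1, 0), cost 0.
best_match({(0, 1): 2.0}, {(1, 0): 2.0}, 2)  # -> (1, 0)
```

The factorial growth of the search space is what motivates the spectral shortcut discussed next.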
In the case of two graphs G and H with the same vertex set V, and (different) edge-weights w^G_E and w^H_E, the (Umeyama) graph distance is expressed as:

dist(G, H) = Σ_{u,v∈V} [w^G_E(u, v) − w^H_E(u, v)]².  (3.2)
In spite of the computational difficulty of finding a permutation matrix P which minimises J(P), in the restricted case of orthogonal (permutation) matrices one can use the spectra (eigendecompositions) of the adjacency matrices A_G and A_H. In fact, for Hermitian matrices A, B ∈ M_{n×n}(R) with corresponding spectra σ(A) = {α_1 > α_2 > ··· > α_n} and σ(B) = {β_1 > β_2 > ··· > β_n}, and with eigendecompositions A = U_A D_A U_A^T and B = U_B D_B U_B^T (the U are unitary and the D are diagonal matrices):

Σ_{i=1}^{n} (α_i − β_i)² = min_Q ||Q A Q^T − B||²,  (3.3)
where the minimum is taken over all unitary matrices Q. Equation (3.3)
justiﬁes the use of the adjacency matrix spectra; given two weighted graphs
G and H with respective spectra
σ(A_G) = {λ_1, λ_2, . . . , λ_{n_1}} and σ(A_H) = {µ_1, µ_2, . . . , µ_{n_2}},
the k largest positive eigenvalues are incorporated into the graph distance measure [26] as:

dist(G, H) = sqrt[ Σ_{i=1}^{k} (λ_i − µ_i)² / min(Σ_{i=1}^{k} λ_i², Σ_{j=1}^{k} µ_j²) ],  (3.4)
where k is an arbitrary summation limit, but empirical studies in pattern
recognition and image analysis show that k ≈ 20 is a good choice. Notice
that similar approaches to distance measures can be applied to the case of
Laplacian spectra.
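Assuming undirected graphs (symmetric adjacency matrices), Eq. (3.4) can be sketched in a few lines of NumPy. Taking the k largest eigenvalues, rather than the k largest positive ones, is a simplification made by this example:

```python
import numpy as np

def spectral_distance(a_g, a_h, k=20):
    """Graph distance from adjacency spectra, following Eq. (3.4):
    compare the k largest eigenvalues, normalised by the smaller
    spectral energy.  The default k = 20 follows the chapter."""
    lam = np.sort(np.linalg.eigvalsh(a_g))[::-1]  # eigenvalues, descending
    mu = np.sort(np.linalg.eigvalsh(a_h))[::-1]
    k = min(k, lam.size, mu.size)
    lam, mu = lam[:k], mu[:k]
    num = np.sum((lam - mu) ** 2)
    den = min(np.sum(lam ** 2), np.sum(mu ** 2))
    return float(np.sqrt(num / den))
```

The same construction applies to Laplacian spectra by substituting L_G for A_G.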
4. Graph Edit Distance
In this section we introduce a graph distance measure, termed graph edit
distance ged, that is based on graph edit operations. This distance measure
evaluates the sequence of edit operations required to modify an input graph
such that it becomes isomorphic to some reference graph. This can include
the possible insertion and deletion of edges and vertices, in addition to
weight value substitutions [31]. Generally, ged algorithms assign costs to
each of the edit operations and use eﬃcient tree search techniques to identify
the sequence of edit operations resulting in the lowest total edit cost [15,32].
The resultant lowest total edit cost is a measure of the distance between
the two graphs.
In general graph matching problems with unlabeled graphs, a unique sequence of edit operations does not exist due to the occurrence of multiple possible vertex mappings. The ged algorithms are required to search for the edit sequence that results in a minimum edit cost. However, for the class of graphs introduced in Section 2, vertex-weight value substitution is not a valid edit operation, because vertex-weight values are unique. As a result, the combinatorial search reduces to the simple identification of elements (vertices and edges) inserted or deleted from one graph G to produce the other graph H. The implementation requires linear time in the size of the problem.
If the cost associated with the insertion or deletion of individual elements is one, and edge-weight value substitution is not considered (i.e., we consider the topology only), the edit sequence cost becomes the difference between the total number of elements in both graphs and all graph elements in common.
Using the above cost function, let G = (V_G, E_G, w^G_V, w^G_E) and H = (V_H, E_H, w^H_V, w^H_E) be two graphs the similarity of which is to be evaluated. The graph edit distance d_1(G, H) describing the topological change that takes place when going from graph G to H then becomes:

d_1(G, H) = |V_G| + |V_H| − 2|V_G ∩ V_H| + |E_G| + |E_H| − 2|E_G ∩ E_H|.  (4.1)
Clearly the edit distance, as a measure of topology change, increases with an increasing degree of change. The edit distance d_1(G, H) is bounded below by d_1(G, H) = 0 when H and G are isomorphic (i.e., there is no change), and above by d_1(G, H) = |V_G| + |V_H| + |E_G| + |E_H| when G ∩ H = ∅, the case where the graphs are completely different.
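Because vertex labels are unique, Eq. (4.1) reduces to plain set arithmetic. A minimal sketch, with each graph represented as a pair of a vertex set and an edge set (the pair representation itself is an assumption of this example):

```python
def ged_topology(g, h):
    """Graph edit distance d1 of Eq. (4.1): unit cost per inserted or
    deleted vertex or edge, topology only.  Linear time via hash sets."""
    (vg, eg), (vh, eh) = g, h
    return (len(vg) + len(vh) - 2 * len(vg & vh)
            + len(eg) + len(eh) - 2 * len(eg & eh))

g = ({"a", "b", "c"}, {("a", "b"), ("b", "c")})
h = ({"a", "b", "d"}, {("a", "b")})
ged_topology(g, h)  # -> 3: vertices c and d differ, plus edge (b, c)
```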
In the second graph edit distance measure studied in this paper, we consider not only change in graph topology, but also in edge weight. For this purpose, we assign a cost c to each vertex deletion and insertion, where c > 0 is a constant. The cost of changing weight w^G_E(e) on edge e ∈ E_G into w^H_E(e) on e ∈ E_H is defined as |w^G_E(e) − w^H_E(e)|. To simplify our notation we let our graphs be completely connected (i.e., there is an edge e ∈ E_G between any two vertices in G and an edge e ∈ E_H between any two vertices in H) and assign a weight equal to zero to an edge e ∈ E_G (e ∈ E_H) if this edge does not exist in G (H). Hence substitution of edge weights, edge deletions,
and edge insertions can be treated uniformly. Note that the deletion of an edge e ∈ E_G with weight w^G_E(e) has a cost equal to w^G_E(e). Similarly the insertion of an edge e ∈ E_H has the weight of that edge, w^H_E(e), assigned as its cost. Consequently, the graph edit distance under this cost function becomes
d_2(G, H) = c[|V_G| + |V_H| − 2|V_G ∩ V_H|] + Σ_{e∈E_G∩E_H} |w^G_E(e) − w^H_E(e)| + Σ_{e∈E_G\(E_G∩E_H)} w^G_E(e) + Σ_{e∈E_H\(E_G∩E_H)} w^H_E(e).  (4.2)
The constant c is a parameter that allows us to weight the importance of a vertex deletion or insertion relative to the weight changes on the edges.
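Eq. (4.2) can be sketched the same way, now carrying a dict of edge weights per graph. A missing edge is treated as weight zero exactly as in the text; the representation itself is an assumption of the example:

```python
def ged_weighted(g, h, c=1.0):
    """Graph edit distance d2 of Eq. (4.2): each vertex insertion or
    deletion costs c; each edge contributes the absolute difference of
    its weights in G and H (0 standing in for an absent edge)."""
    (vg, wg), (vh, wh) = g, h
    vertex_cost = c * (len(vg) + len(vh) - 2 * len(vg & vh))
    edge_cost = sum(abs(wg.get(e, 0.0) - wh.get(e, 0.0))
                    for e in set(wg) | set(wh))
    return vertex_cost + edge_cost
```

With a large c, vertex changes dominate the measure; with a small c, it is driven almost entirely by edge-weight (e.g. traffic) changes.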
5. Median Graphs
Intuitively speaking, the median of a sequence of graphs S = (G_1, . . . , G_n) is a single graph that represents the given G_i's in the best possible manner. Using any of the available graph distance measures, for example d_1(G, H) or d_2(G, H) introduced in Section 4, the median of a sequence of graphs S is defined as a graph that minimises the sum of the edit distances to all members of the sequence S. Formally, let U be the family of all graphs that can be constructed using labels from L_V for vertices and real numbers for edges. Then Ḡ is a median graph of the sequence S = (G_1, . . . , G_n) if:

Ḡ = arg min_{G∈U} Σ_{i=1}^{n} d(G, G_i).  (5.1)
If we constrain Ḡ to be a member of S, then the graph that satisfies Eq. (5.1) is called the set median of S. In a general context where node labels are not unique, a set median is usually easier to compute than a median graph. However, in the context of the present paper, where we have uniquely labelled graphs, we will exclusively focus on median graph computation and its application to abnormal change detection. It is to be noted that median graphs need not be unique.
The term “median graph” is used because the median x̄ of an ordered sequence of real numbers (x_1, . . . , x_n), defined as

x̄ = median(x_1, . . . , x_n) = x_{(n+1)/2} if n is odd, and x̄ = x_{n/2} otherwise,  (5.2)

has a similar property as the median of a sequence of graphs: it minimises the sum of distances to all elements of the given sequence, i.e., it minimises the expression Σ_{i=1}^{n} |x̄ − x_i|.
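This minimisation property is easy to confirm numerically; the sample values below are arbitrary:

```python
import statistics

def sum_abs(seq, m):
    """Sum of absolute distances from m to all elements of seq."""
    return sum(abs(m - x) for x in seq)

xs = [1, 2, 4, 9, 20]
med = statistics.median(xs)  # 4 for this odd-length sequence
# No element of the sequence beats the median as a "centre".
assert all(sum_abs(xs, med) <= sum_abs(xs, c) for c in xs)
```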
Next we describe a procedure for the computation of the median Ḡ of a sequence of graphs (G_1, . . . , G_n) using the topological graph distance measure d_1 defined in Eq. (4.1). Let G = (V, E) with V = ∪_{i=1}^{n} V_i and E = ∪_{i=1}^{n} E_i. Furthermore, let C(u) denote the total number of occurrences of vertex u in V_1, . . . , V_n. Formally, C(u) is defined by the following procedure:

C(u) = 0;
for i = 1 to n do
    if u ∈ V_i then C(u) = C(u) + 1

An analogous procedure can be defined for the edges (u, v) ∈ E:

C(u, v) = 0;
for i = 1 to n do
    if (u, v) ∈ E_i then C(u, v) = C(u, v) + 1
Next we define the graph Ḡ = (V̄, Ē), where:

V̄ = {u | u ∈ V ∧ C(u) > n/2},
Ē = {(u, v) | (u, v) ∈ E ∧ C(u, v) > n/2}.
Then it can be proven that Ḡ is a median of the sequence (G_1, . . . , G_n) [33]. Consequently, a practical procedure for computing the median of a sequence S of graphs is as follows. We consider the union of all vertices and all edges in S. For each vertex u a counter C(u), and for each edge (u, v) a counter C(u, v), is defined. For each instance of vertex u (and edge (u, v)) in sequence S, C(u) (and C(u, v)) is incremented by one. Finally, vertex u (edge (u, v)) is included in Ḡ if C(u) > n/2 (C(u, v) > n/2). Obviously, this procedure is linear in the number of vertices and edges, and in the number of graphs in S. In contrast to the general method discussed in [34], which is computationally very costly, it can be expected that the procedure introduced in this paper can handle long sequences of large graphs.
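The counting procedure above translates almost line for line into Python (graphs again as vertex-set/edge-set pairs, an assumption of the example; the majority condition C(·) > n/2 is taken verbatim from the text):

```python
from collections import Counter

def median_graph(graphs):
    """Median of a sequence of uniquely-labelled graphs under d1:
    keep exactly the vertices and edges occurring in more than half
    of the graphs.  Linear in the total number of elements."""
    n = len(graphs)
    c_v, c_e = Counter(), Counter()
    for vertices, edges in graphs:
        c_v.update(vertices)   # C(u)
        c_e.update(edges)      # C(u, v)
    v_bar = {u for u, k in c_v.items() if k > n / 2}
    e_bar = {uv for uv, k in c_e.items() if k > n / 2}
    return v_bar, e_bar
```

Appending the result to the sequence and recomputing returns the same graph, matching the convergence property of the median discussed in this section.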
In [34] it was shown that the median of a set of graphs is not necessarily unique. It is easy to see that the median graph computed by the procedure introduced in this section is always unique. However, note that the conditions C(u) > n/2 and C(u, v) > n/2 can be relaxed by inclusion of a vertex, or an edge, in Ḡ if C(u) = n/2, or C(u, v) = n/2, respectively. Obviously, any graph resulting from this relaxed procedure is also a median of the sequence S. Consequently, for the graph distance measure d_1 used in this section, several medians for a given sequence of graphs may exist. They result from either inclusion or non-inclusion of a vertex u (or an edge (u, v)) with C(u) = n/2 (or C(u, v) = n/2) in Ḡ.
An example of the situation where the median graph of a sequence is not unique is shown in Figure 1. Here we have the choice, for both vertex 0 and vertex 3 and their incident edges, of either including them or not in the median graph. This leads to a total of nine possible median graphs, depicted in Figure 2.
Another property worth mentioning is the convergence property of the median under distance d_1. That is, for any given sequence S = (G_1, . . . , G_n) of graphs,

median(G_1, . . . , G_n) = median(G_1, . . . , G_n, median(G_1, . . . , G_n)).

Clearly, each vertex u and each edge (u, v) occurring in sequence S is included in the median if and only if it occurs in k > n/2 graphs. Hence, adding the median itself to the sequence S will result in a median that is identical to the median of S. For example, Ḡ_1 shown in Figure 2 is a median of (G_1, G_2) in Figure 1, but Ḡ_1 is also a median of the sequence (Ḡ_1, G_1, G_2).
In the remainder of this section we derive a procedure for median graph computation using the generalized graph distance measure d_2 defined in Eq. (4.2). We start again from S = (G_1, . . . , G_n), G = (V, E) with V = V_1 ∪ · · · ∪ V_n and E = E_1 ∪ · · · ∪ E_n, and let C(u) and C(u, v) be the same as introduced before.
Let graph Ĝ = (V̂, Ê, ŵ_E) be defined as follows:

V̂ = {u | u ∈ V ∧ C(u) > n/2},
Ê = {(u, v) | u, v ∈ V̂},
ŵ_E(u, v) = median{w^{G_i}_{E_i}(u, v) | i = 1, . . . , n}.

Then it can be proven that graph Ĝ is a median of sequence S under d_2 [33].
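A sketch of this construction follows; the dict-of-dicts adjacency encoding is our choice, and the scalar median is taken as `statistics.median_low`, one convention that keeps the result unique and which may differ from the chapter's Eq. (5.2):

```python
import statistics

def median_graph_d2(graphs):
    """Median under the generalized distance d2: vertices are kept by the
    same majority count as before, and each edge between kept vertices
    receives the median of its weights over all n graphs, with weight 0
    standing for an absent edge.  Graphs are dicts u -> {v: weight}."""
    n = len(graphs)
    count = {}
    for g in graphs:
        for u in g:
            count[u] = count.get(u, 0) + 1
    v_hat = {u for u, c in count.items() if c > n / 2}
    w_hat = {}
    for u in v_hat:
        for v in v_hat:
            if u == v:
                continue
            weights = [g.get(u, {}).get(v, 0) for g in graphs]
            m = statistics.median_low(weights)
            if m > 0:                       # median weight 0 = no edge
                w_hat[(u, v)] = m
    return v_hat, w_hat
```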
Fig. 1. Two graphs (G_1, G_2).
April 22, 2004 16:59 WSPC/Trim Size: 9in x 6in for Review Volume Chap06
Classiﬁcation and Detection of Abnormal Events in Time Series of Graphs 137
Fig. 2. All possible medians of the sequence (G_1, G_2) from Fig. 1.
Comparing the median graph construction procedure for d_1 with the one for d_2, we notice that the former is a special case of the latter, constraining edge weights to assume only binary values. Edge weight zero (or one) indicates the absence (or presence) of an edge. Including an edge (u, v) in the median graph Ḡ because it occurs in more than n/2 of the given graphs is equivalent to labeling that edge in Ĝ with the median of the weights assigned to it in the given graphs.
The median of a set of numbers according to Eq. (5.2) is unique. Hence, when constructing a median graph under graph distance measure d_2, there will be no ambiguity in edge weights. But the non-uniqueness in the existence of vertices observed for graph distance measure d_1 still exists under d_2. The convergence property also holds for d_2, because for any sequence of edge weights (x_1, . . . , x_k) one has:

median(x_1, . . . , x_k) = median(x_1, . . . , x_k, median(x_1, . . . , x_k)).
138 H. Bunke and M. Kraetzl
Fig. 3. Three graphs (G_1, G_2, G_3).
Fig. 4. The median of (G_1, G_2, G_3) in Fig. 3 using d_2.
We conclude this section with an example of median graph under graph distance d_2. Three different graphs G_1, G_2, and G_3 are shown in Figure 3. Their median is unique and is displayed in Figure 4.
6. Median Graphs and Abnormal Change Detection in
Sequences of Graphs
The measures defined in Sections 3 and 4 can be applied to consecutive graphs in a time series of graphs (G_1, . . . , G_n) to detect abnormal change. That is, values of d(G_{i−1}, G_i) are computed for i = 2, . . . , n, and the change from time i − 1 to i is classified abnormal if d(G_{i−1}, G_i) is larger than a certain threshold. However, it can be argued that measuring network change only between consecutive points of time is potentially vulnerable to noise, i.e., the random appearance or disappearance of some vertices together with some random fluctuation of the edge weights may lead to a graph distance larger than the chosen threshold, though these changes are not really significant.
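This consecutive-pair scheme can be sketched as below. The symmetric-difference distance used here is a stand-in consistent with a topology-only edit distance on graphs with unique vertex labels; the chapter's exact d_1 is defined earlier in the chapter.

```python
def d1(g, h):
    """Stand-in topology-only distance: the number of vertices plus edges
    occurring in exactly one of the two graphs (unique vertex labels)."""
    (v1, e1), (v2, e2) = g, h
    return len(v1 ^ v2) + len(e1 ^ e2)

def abnormal_changes(series, threshold, dist=d1):
    """Indices i with d(G_{i-1}, G_i) above the threshold."""
    return [i for i in range(1, len(series))
            if dist(series[i - 1], series[i]) > threshold]
```

As the text warns, a single noisy day produces two flagged transitions (into and out of the outlier), which motivates the median-based procedures that follow.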
One expects that a more robust change detection procedure can be obtained through the use of median graphs. In statistical signal processing, the median filter is widely used for removing impulsive noise. A median filter is computed by sliding a window of length L over the data values in a time series. At each step, the output of this process is the median of the values within the window. This process is also termed the running median [35]. In the
following we discuss four different approaches to abnormal change detection that utilise median filters. All these approaches assume that a time series of graphs (G_1, . . . , G_n, G_{n+1}, . . .) is given. The median graph of a subsequence of these graphs can be computed using either graph distance measure d_1 or d_2.
6.1. Median vs. Single Graph, Adjacent in Time (msa)
Given the time series of graphs, we compute the median graph in a window of length L, where L is a parameter that is to be specified by the user dependent on the underlying application. Let G̃_n be the median of the sequence (G_{n−L+1}, . . . , G_n). Then d(G̃_n, G_{n+1}) can be used to measure abnormal change. We classify the change between G_n and G_{n+1} as abnormal if d(G̃_n, G_{n+1}) is larger than some threshold.
Increased robustness can be expected if we take the average deviation ϕ of the graphs (G_{n−L+1}, . . . , G_n) into account. We compute

ϕ = (1/L) Σ_{i=n−L+1}^{n} d(G̃_n, G_i)  (6.1)

and classify the change between G_n and G_{n+1} as abnormal if

d(G̃_n, G_{n+1}) ≥ αϕ  (6.2)

where α is a parameter that needs to be determined from examples of normal and abnormal change. Note that the median G̃_n is, by definition, a graph that minimises ϕ in Eq. (6.1).
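Eqs. (6.1) and (6.2) translate directly into the following check (a sketch; `median_fn` and `dist` stand for whatever median construction and graph distance are in use, and are supplied by the caller):

```python
def msa_abnormal(window, next_graph, alpha, median_fn, dist):
    """Median-vs-single-graph test (msa): phi is the average deviation of
    the window's graphs from their median (Eq. (6.1)); the next graph is
    flagged abnormal if its distance to the median is at least alpha*phi
    (Eq. (6.2))."""
    g_med = median_fn(window)
    phi = sum(dist(g_med, g) for g in window) / len(window)
    return dist(g_med, next_graph) >= alpha * phi
```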
In Section 5 it was pointed out that G̃_n is not necessarily unique. If several instances G̃_n^1, . . . , G̃_n^t of G̃_n exist, one can apply Eqs. (6.1) and (6.2) to all of them. This will result in a series of values ϕ_1, . . . , ϕ_t and a series of values d(G̃_n^1, G_{n+1}), . . . , d(G̃_n^t, G_{n+1}). Under a conservative scheme, an abnormal change will be reported if d(G̃_n^1, G_{n+1}) ≥ αϕ_1 ∧ · · · ∧ d(G̃_n^t, G_{n+1}) ≥ αϕ_t.
By contrast, a more sensitive change detector is obtained if a change is reported as soon as there exists at least one i for which

d(G̃_n^i, G_{n+1}) ≥ αϕ_i,  1 ≤ i ≤ t.
6.2. Median vs. Median Graph, Adjacent in Time (mma)
Here we compute two median graphs, G̃_1 and G̃_2, in windows of length L_1 and L_2, respectively, i.e., G̃_1 is the median of the sequence (G_{n−L_1+1}, . . . , G_n) and G̃_2 is the median of the sequence (G_{n+1}, . . . , G_{n+L_2}). We now measure the abnormal change between time n and n + 1 by means of d(G̃_1, G̃_2). That is, we compute ϕ_1 and ϕ_2 for each of the two windows using Eq. (6.1) and classify the change from G_n to G_{n+1} as abnormal if

d(G̃_1, G̃_2) ≥ α (L_1 ϕ_1 + L_2 ϕ_2) / (L_1 + L_2).
Measure mma can be expected to be even more robust against noise and outliers than measure msa. If the considered median graphs are not unique, techniques similar to those discussed for measure msa can be applied.
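The mma criterion, with its length-weighted combination of the two deviations, can be sketched as follows (`median_fn` and `dist` again supplied by the caller):

```python
def mma_abnormal(win1, win2, alpha, median_fn, dist):
    """Median-vs-median test (mma): abnormal if the distance between the
    two window medians is at least alpha times the length-weighted
    average of the two within-window deviations."""
    m1, m2 = median_fn(win1), median_fn(win2)
    phi1 = sum(dist(m1, g) for g in win1) / len(win1)
    phi2 = sum(dist(m2, g) for g in win2) / len(win2)
    l1, l2 = len(win1), len(win2)
    return dist(m1, m2) >= alpha * (l1 * phi1 + l2 * phi2) / (l1 + l2)
```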
6.3. Median vs. Single Graph, Distant in Time (msd)
If graph changes are evolving rather slowly over time, it may be better not to compare two consecutive graphs, G_n and G_{n+1}, with each other, but G_n and G_{n+l}, where l > 1. Instead of msa, as proposed above, we use d(G̃_n, G_{n+l}) as a measure of change between G_n and G_{n+l}, where l is a parameter defined by the user and is dependent on the underlying application.
6.4. Median vs. Median Graph, Distant in Time (mmd)
This measure is a combination of the measures mma and msd. We use G̃_1 as defined for mma, and let G̃_2 = median(G_{n+l+1}, . . . , G_{n+l+L_2}). Then d(G̃_1, G̃_2) can serve as a measure of change between time n and n + l + 1. Obviously, Eqs. (6.1) and (6.2) can be adapted to msd and mmd similarly to the way they are adapted to mma.
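The distant-in-time variants only change which graph (or window) the median is compared against; msd, for instance, can be sketched as:

```python
def msd_abnormal(series, n, L, lag, alpha, median_fn, dist):
    """Median-vs-single-graph, distant in time (msd): the median of the
    window of length L ending at index n (0-based, inclusive) is compared
    with the graph lag steps ahead, using the same alpha*phi criterion
    as msa.  median_fn and dist are supplied by the caller."""
    window = series[n - L + 1:n + 1]
    g_med = median_fn(window)
    phi = sum(dist(g_med, g) for g in window) / len(window)
    return dist(g_med, series[n + lag]) >= alpha * phi
```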
7. Application to Computer Network Monitoring
7.1. Problem Description
In managing large enterprise data networks, the ability to measure network changes in order to detect abnormal trends is an important performance monitoring function [36]. The early detection of abnormal network events and trends can provide advance warning of possible fault conditions [37], or at least assist with identifying the causes and locations of known problems.
Network performance monitoring typically uses statistical techniques to
analyse variations in traﬃc distribution [38,39], or changes in topology [40].
Visualisation techniques are also widely used to monitor changes in network
performance [41]. To complement these approaches, speciﬁc measures of
change at the network level in both logical connectivity and traﬃc variations
are useful in highlighting when and where abnormal events may occur in
the network [42]. Using these measures, other network management tools
may then be focused on problem regions of the network for more detailed
analysis.
In the previous sections, various graph similarity measures were introduced. The aim of the study described in the present section is to identify whether, using these techniques, significant changes in logical connectivity or traffic distributions can be observed between large groups of users communicating over a wide area data network. This data network interconnects some 120,000 users around Australia. For the purposes of this study, a network management probe was attached to a physical link on the wide area network backbone and collected traffic statistics of all data traffic operating over the link. From this information, a logical network of users communicating over the physical link is constructed.
Communications between user groups (business domains) within the logical network over any one day are represented as a directed graph. Edge direction indicates the direction of traffic transmitted between two adjacent nodes (business domains) in the network, with edge weight indicating the amount of traffic carried. A subsequent graph can then describe communications within the same network for the following day. This second graph can then be compared with the original graph, using a measure of graph distance between the two graphs, to indicate the degree of change occurring in the logical network. The more dissimilar the graphs, the greater the graph distance value. By continuing network observations over subsequent days, the graph distance scores provide a trend of the logical network's relative dynamic behaviour as it evolves over time.
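The day-to-graph construction can be sketched as follows; the (source, destination, bytes) record layout is our assumption, as the probe data are not described at this level of detail in the chapter:

```python
from collections import defaultdict

def daily_graph(records):
    """Aggregate one day of traffic records into a weighted directed
    graph, encoded as an adjacency dict src -> {dst: total_bytes}."""
    g = defaultdict(dict)
    for src, dst, nbytes in records:
        g[src][dst] = g[src].get(dst, 0) + nbytes
    return dict(g)
```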
In the study described in this section, log files were collected continuously over the period 9 July 1999 to 24 December 1999. Weekends, public holidays and days where probe data was unavailable were removed to produce a final set of 102 log files representing the successive business days' traffic. The graph distance measures examined in this paper produce a distance score indicating the dissimilarity between two given graphs. Successive graphs derived from the 102 log files of the network data set are compared using the various graph distance measures to produce a set of distance scores representing the change experienced in the network from one day to the next.
7.2. Experimental Results
Figures 5 and 6 represent the outcomes of connectivity and traﬃc (weighted
graph) spectral distances, respectively, as introduced in Section 3, applied
to the time series of graphs derived from the network data. In spite of the
less intuitive interpretation of these graph distance measures, there exists
reasonable correlation with the peaks of other distance measures and far
less sensitivity to daily variations using this approach (see below).
Figure 7 shows results for edit distance applied to consecutive graphs of the time series for the topology-only measure d_1. This measure produces three significant peaks (on days 25, 65 and 90). The figure also shows several secondary peaks that may also indicate events of potential interest. Also, there is significant minor fluctuation throughout the whole data set
Fig. 5. Spectral distance (connectivity).
Fig. 6. Spectral distance (traffic).
Fig. 7. Consecutive day using measure d_1.
Fig. 8. Consecutive day using measure d_2.
appearing as background noise. Figure 8 shows the results for distance measure d_2 applied to consecutive graphs in the time series. Whilst this measure considers both topology and traffic, it does not appear to provide any indicators of change additional to those found using the topology-only measure. The main difference with this measure is that it has increased the amplitude of the first two peaks so that they now appear as major peaks. This suggests that these peaks were the result of network changes consisting of large changes in edge weights.
The findings for this technique indicate that the application of measures to consecutive graphs in a time series is most suited for detecting daily fluctuations in network behaviour. This is especially useful for identifying outliers in a time series of graphs. Ideally, it would be desirable that each point of significant change be validated against known performance anomalies. Unfortunately, the information generated by the network management system of the network under observation is not sufficient for validation purposes, because all data used in the experiments were collected from a network under real operating conditions. That is, the data are unlabelled, as there is no 'super-instance' that would know whether a network change that has occurred is normal or abnormal. This shortfall of useful information for validation of network fault and performance anomalies is common [43]. Visualisation tools have been used to assist with validating some of the major changes by observing the network before and after a significant change event. More work needs to be done in producing a reliable, validated data set.
A number of additional experiments using median graphs have been
reported in [44]. In these experiments it was demonstrated that using the
median graph, computed over a running window of the considered time
series, rather than individual graphs has a smoothing eﬀect, which reduces
the inﬂuence of outliers.
8. Conclusion
Graphs are one of the most popular data structures in computer science
and related ﬁelds. This chapter has examined a special class of graphs that
are characterised by the existence of unique node labels. This property
greatly reduces the computational complexity of many graph operations
and allows a user to deal with large sets of graphs, each consisting of many
nodes and edges. Furthermore, a number of graph distance measures have
been analysed. They are based on eigenvalues from spectral graph theory
and graph edit distance. A novel concept that can be used to measure the
similarity of graphs is median graph. The median of a set, or a sequence, S,
of graphs is a graph that minimises the average edit distance to all members
in S. Thus the median can be regarded as the best single representative of
a given set or sequence of graphs.
Abnormal events in a sequence of graphs representing elements of a
time dependent process can be detected by computing the distance of
consecutive pairs of graphs in the sequence. If the distance is larger than a threshold, then it can be concluded that an abnormal event has occurred.
Alternatively, to make the abnormal change detection procedure more
robust against noise and outliers, one can compute the median of the
time series of graphs over a window of a given span, and compare it to
an individual graph, or the median of a sequence of graphs, following that
window.
None of the formal concepts introduced in this paper is geared towards
a particular application. However, to demonstrate their usefulness in prac
tice, an application in computer network monitoring was studied in this
chapter. Communication transactions, collected by a network management
system between logical nodes occurring over periodic time intervals are
represented as a series of weighted graphs. The graph distance measures
introduced in this chapter are used to assess the changes in communication
between user groups over successive time intervals. Using these data in a
number of experiments, the usefulness of the proposed concepts could be
established. As a practical implication, the proposed methods have the
potential to relieve human network operators from the need to continually monitor a network, if connectivity and traffic patterns can be shown
to be similar to the activity over previous time periods. Alternatively, if
traﬃc volumes are seen to abruptly increase or decrease over the physi
cal link, network operators are able to more readily identify the individ
ual user groups contributing to this change in aggregated traﬃc. Future
work will concentrate on enhancing the present tools for computer network
monitoring and identifying other application areas for the proposed formal
concepts.
References
1. Rouvray, D.H. and Balaban, A.T. (1979). Chemical Applications of Graph
Theory. In R.J. Wilson and L.W. Beineke, eds. Applications of Graph Theory,
pp. 177–221. Academic Press.
2. Baxter, K. and Glasgow, J. (2000). Protein Structure Determination: Combining Inexact Graph Matching and Deformable Templates. Proc. Vision Interface 2000, pp. 179–186.
3. Rodgers, P.J. and King, P.J.H. (1997). A Graph–Rewriting Visual Language
for Database Programming. Journal of Visual Languages and Computing,
8, 641–674.
4. Shearer, K., Bunke, H., and Venkatesh, S. Video Indexing and Similarity
Retrieval by Largest Common Subgraph Detection Using Decision Trees.
Pattern Recognition, 34, 1075–1091.
5. Börner, K., Pippig, E., Tammer, E., and Coulon, C. (1996). Structural Similarity and Adaptation. In I. Smith and B. Faltings, eds. Advances in Case-Based Reasoning, pp. 58–75. LNCS 1168, Springer.
6. Fisher, D.H. (1990). Knowledge Acquisition Via Incremental Conceptual
Clustering. In J.W. Shavlik and T.G. Dietterich, eds. Readings in Machine
Learning, pp. 267–283. Morgan Kaufmann.
7. Sanders, K., Kettler, B., and Hendler, J. (1997). The Case for Graph-Structured Representations. In D. Leake and E. Plaza, eds. Case-Based Reasoning Research and Development, Vol. 1266 of Lecture Notes in Computer Science, pp. 245–254. Springer.
8. Ehrig, H. (1992). Introduction to Graph Grammars with Applications to
Semantic Networks. Computers and Mathematics with Applications, 23,
557–572.
9. Djoko, S., Cook, D.J., and Holder, L.B. (1997). An Empirical Study of
Domain Knowledge and its Beneﬁts to Substructure Discovery. IEEE Trans.
Knowledge and Data Engineering, 9, 575–586.
10. Lu, S.W., Ren, Y., and Suen, C.Y. (1991). Hierarchical Attributed Graph
Representation and Recognition of Handwritten Chinese Characters. Pattern
Recognition, 24, 617–632.
11. Llados, J., Marti, E., and Villanueva, J.J. (2001). Symbol Recognition
by ErrorTolerant Subgraph Matching Between Region Adjacency Graphs.
IEEE Trans. Pattern Analysis and Machine Intelligence, 23, 1144–1151.
12. Shokoufandeh, A. and Dickinson, S. (2001). A Unified Framework for Indexing and Matching Hierarchical Shape Structures. In C. Arcelli, L. Cordella, and G. Sanniti di Baja, eds. Visual Form 2001, pp. 67–84. Springer Verlag, LNCS 2059.
13. Kittler, J. and Ahmadyfard, A. (2001). On Matching Algorithms for the Recognition of Objects in Cluttered Background. In C. Arcelli, L. Cordella, and G. Sanniti di Baja, eds. Visual Form 2001, pp. 51–66. Springer Verlag, LNCS 2059.
14. Mitchell, T.M. (1997). Machine Learning. McGraw-Hill.
15. Messmer, B.T. and Bunke, H. (1998). A New Algorithm for ErrorTolerant
Subgraph Isomorphism Detection. IEEE Trans. Pattern Anal. & Machine
Intell., 20, 493–504.
16. Fayyad, U., Piatetsky-Shapiro, G., Smyth, P., and Uthurusamy, R. (1996). Advances in Knowledge Discovery and Data Mining. AAAI/MIT Press.
17. Kandel, A., Last, M., and Bunke, H. (2001). Data Mining and Computational
Intelligence. Physica Verlag.
18. Chen, M.S., Han, J., and Yu, P.S. (1996). Data Mining: An Overview from
a Database Perspective. IEEE Trans. Data & Knowledge Engineering, 8,
866–883.
19. Bunke, H., Kraetzl, M., Shoubridge, P., and Wallis, W.D. Measuring Abnormal Change in Large Data Networks. Submitted for publication.
20. Strogatz, S.H. (2001). Exploring Complex Networks. Nature, 410, 268–276.
21. Kuhn, K.W. (1999). Molecular Interaction Map of the Mammalian Cell Cycle
Control and DNA Repair System. Mol. Biol. Cell, 10, pp. 2703–2734.
22. Newman, M.E.J. (2001). The Structure of Scientiﬁc Collaboration Networks.
Proc. Nat. Acad. Science USA, 98, 404–409.
23. West, D.B. (1996). Introduction to Graph Theory. Prentice Hall, New Jersey.
24. Cvetkovic, D.M., Doob, M., and Sachs, H. (1980). Spectra of Graphs. Aca
demic Press, New York.
25. Mohar, B. (1992). Laplace Eigenvalues of Graphs — A Survey. Discrete
Math., 109, 171–183.
26. Sarkar, S. and Boyer, K.L. (1998). Quantitative Measures of Change Based
on Feature Organization: Eigenvalues and Eigenvectors. Computer Vision
and Image Understanding, 71, 110–136.
27. Umeyama, S. (1988). An Eigendecomposition Approach to Weighted Graph
Matching Problems. IEEE Trans. Pattern Anal.& Machine Intell., 10(5),
695–703.
28. Guattery, S. and Miller, G.L. (1998). On the Quality of Spectral Separators.
SIAM J. Matrix Anal. Appl., 19(3), 701–719.
29. Chung, F.R.K. (1997). Spectral Graph Theory. CBMS, Regional Conference
Series in Mathematics. American Mathematical Society.
30. Fiedler, M. (1973). Algebraic Connectivity of Graphs. Czech. Math. J.,
23(98), 298–305.
31. Sanfeliu, A. and Fu, K.S. (1983). A Distance Measure Between Attributed
Relational Graphs for Pattern Recognition. IEEE Transactions on Systems
Management and Cybernetics, 13(3), 353–362.
32. Bunke, H. and Messmer, B.T. (1997). Recent Advances in Graph Match
ing. International J. Pattern Recognition and Artiﬁcial Intelligence, 11(1),
169–203.
33. Dickinson, P., Bunke, H., Dadej, A., and Kraetzl, M. (2001). Application of Median Graphs in Detection of Anomalous Change in Communication Networks. In Proc. World Multiconference on Systemics, Cybernetics and Informatics, 5, pp. 194–197. Orlando, US.
34. Jiang, X. and Bunke, H. On Median Graphs: Properties, Algorithms, and Applications. IEEE Trans. Pattern Anal. and Machine Intell., 23(10), 1144–1151.
35. Astola, J.T. and Campbell, T.G. (1989). On Computation of the Running
Median. IEEE Trans. Acoustics, Speech, and Signal Processing, 37(4), 572–
574.
36. Boutaba, R., Guemhioui, K.E., and Dini, P. (1997). An Outlook on Intranet
Management. IEEE Communications Magazine, pp. 92–99.
37. Thottan, M. and Ji, C. (1998). Proactive Anomaly Detection Using Dis
tributed Intelligent Agents. IEEE Network, 12(5), 21–27.
38. Higginbottom, G.N. (1998). Performance Evaluation of Communication Net
works. Artech House, Massachusetts.
39. Jerkins, J.L. and Wang, J.L. (1998). A Close Look at Traﬃc Measurements
from Packet Networks. In IEEE GLOBECOM 98, Sydney, 4, 2405–2411.
40. White, C.C., Sykes, E.A., and Morrow, J.A. (1995). An Analytical Approach
to the Dynamic Topology Problem. Telecommunication Systems, 3, 397–413.
41. Becker, R.A., Eick, S.G., and Wilks, A.R. (1995). Visualizing Network Data.
IEEE Trans. Visualization and Computer Graphics, 1(1), 16–21.
42. Shoubridge, P., Kraetzl, M., and Ray, D. (1999) Detection of Abnormal
Change in Dynamic Networks. In IEEE Information, Decision and Control,
IDC ’99 conference, pp. 557–562. Adelaide, Australia.
43. Hood, C.S. and Ji, C. (1997). Beyond Thresholds: An Alternative Method
for Extracting Information from Network Measurements. In IEEE Global
Telecommunications Conference, GLOBECOM ’97, 1, 487–491.
44. Dickinson, P., Bunke, H., Dadej, A., and Kraetzl, M. (2001). Application of
Median Graphs in Detection of Anomalous Change in Time Series of Net
works. Submitted for publication.
April 26, 2004 16:9 WSPC/Trim Size: 9in x 6in for Review Volume chap07
CHAPTER 7
BOOSTING INTERVAL-BASED LITERALS: VARIABLE LENGTH AND EARLY CLASSIFICATION∗
Carlos J. Alonso González
Dpto. de Informática, Universidad de Valladolid, Spain
Grupo de Sistemas Inteligentes
Email: calonso@infor.uva.es

Juan J. Rodríguez Diez
Lenguajes y Sistemas Informáticos, Universidad de Burgos, Spain
Grupo de Sistemas Inteligentes
Email: jjrodriguez@ubu.es
This work presents a system for supervised time series classification, capable of learning from series of different length and able to provide a classification when only part of the series is presented to the classifier. The induced classifiers consist of a linear combination of literals, obtained by boosting base classifiers that contain only one literal. Nevertheless, these literals are specifically designed for the task at hand, and they test properties of fragments of the time series on temporal intervals. The method had already been developed for fixed length time series. This work exploits the symbolic nature of the classifier to add two new features to it. First, the system has been slightly modified so that it is now able to learn directly from variable length time series. Second, the classifier can be used to classify partial time series. This "early classification" is essential in some tasks, like on-line supervision or diagnosis, where it is necessary to give an alarm signal as soon as possible. Several experiments on different data sets are presented, which illustrate that the proposed method is highly competitive with previous approaches in terms of classification accuracy.
Keywords: Interval based literal; boosting; time series classiﬁcation;
machine learning.
∗This work has been supported by the Spanish CYCIT project TAP 99–0344 and the "Junta de Castilla y León" project VA101/01.
1. Introduction
Multivariate time series classification is useful in those classification tasks where time is an important dimension. Instances of this kind of task may be found in very different domains, for example the analysis of biomedical signals [Kubat et al. (1998)], diagnosis of continuous dynamic systems [Alonso González and Rodríguez Diez (1999)] or data mining in temporal databases [Berndt and Clifford (1996)]. Time series classification may be addressed as a static classification problem, extracting features of the series through some kind of preprocessing and using some conventional machine learning method. However, this approach has several drawbacks [Kadous (1999)]: the preprocessing techniques are usually ad hoc and domain specific, and the descriptions obtained using these features can be hard to understand. The design of specific machine learning methods for the induction of time series classifiers allows for the construction of more comprehensible classifiers in a more efficient way because, firstly, they may manage comprehensible temporal concepts that are difficult to capture by the preprocessing technique (for instance, the concept of permanence in a region for a certain amount of time) and, secondly, there are several heuristics applicable to temporal domains that preprocessing methods fail to exploit.
The method for learning time series classifiers that we propose in this work is based on literals over temporal intervals (such as increases or always in region) and on boosting (a method for the generation of ensembles of classifiers from base, or weak, classifiers) [Schapire (1999)], and was first introduced in [Rodríguez et al. (2000)]. The input for this learning task consists of a set of examples and associated class labels, where each example consists of one or more time series. Although the series are often referred to as variables, since they vary over time, from a machine learning point of view each point of each series is an attribute of the example. The output of the learning task is a weighted combination of literals, reflecting the fact that the base classifiers consist of clauses with only one literal. These base classifiers are inspired by the good results of works using ensembles of very simple classifiers [Schapire (1999)], sometimes named stumps.
Although the method has already been tested over several data sets [Rodríguez et al. (2001)], providing very accurate classifiers, it imposes two restrictions that limit its application to real problems. On the one hand, it requires that all the time series be of the same length, which is not the case in every task, as the Auslan example (Australian sign language, see Section 6.4) [Kadous (1999)] illustrates. On the other hand, it requires as an input the complete time series to be classified. This may be a severe drawback if the classifiers are to be used on line in a dynamic environment and a classification is needed as soon as possible. This work shows how the original method can be extended to cope with these weaknesses.
Regarding variable length time series, it is always possible to preprocess the data set in order to obtain a new one with fixed length series. Nevertheless, this cannot be considered a generic solution because, usually, the length of the series itself provides essential information for its classification. The method can now be used with variable length series because the literals are allowed to abstain (that is, the result of their evaluation can be true, false or an abstention) if the series is not long enough to evaluate some literal. A variant of boosting that can work with base classifiers with abstentions is used.
Regarding the classification of partial examples, or early classification, the system has been modified to add the capacity of assigning a preliminary classification to an incomplete example. This feature is crucial when the examples to classify are being generated dynamically and the time necessary to generate an example is long. For instance, consider a supervision task for a dynamic system. In this case, the example to classify is the current state, considering the historical values of the variables. For this problem, it is necessary to indicate a possible problem as soon as possible. In order to confirm the problem, it will be necessary to wait and see how the variables evolve. Hence, it is necessary to obtain classifications using as input series of different lengths. Again, the capability of the literals to abstain allows tackling this problem.
The rest of the chapter is organized as follows. Section 2 is a brief introduction to boosting, suited to our method. The base classifiers, interval literals, are described in Section 3. Sections 4 and 5 deal, respectively, with variable length series and early classification. Section 6 presents experimental results. Finally, we give some concluding remarks in Section 7.
2. Boosting
At present, an active research topic is the use of ensembles of classifiers. They are obtained by generating and combining base classifiers, constructed using other machine learning methods. The aim of these ensembles is to increase the accuracy with respect to the base classifiers.
One of the most popular methods for creating ensembles is boosting [Schapire (1999)], a family of methods, of which AdaBoost is the most
April 26, 2004 16:9 WSPC/Trim Size: 9in x 6in for Review Volume chap07
152 C. J. Alonso Gonz´alez and J. J. Rodr´ıguez Diez
prominent member. They work by assigning a weight to each example.
Initially, all the examples have the same weight. In each iteration a base
(also named weak) classiﬁer is constructed, according to the distribution of
weights. Afterwards, the weight of each example is readjusted, based on the
correctness of the class assigned to the example by the base classiﬁer. The
ﬁnal result is obtained by weighted votes of the base classiﬁers.
AdaBoost is only for binary problems, but there are several methods of extending AdaBoost to the multiclass case. Figure 1 shows AdaBoost.MH¹ [Schapire and Singer (1998)]. Each instance x_i belongs to a domain X and has an associated label y_i, which belongs to a finite label space Y. AdaBoost.MH associates a weight to each combination of examples and labels. The base learner generates a base hypothesis h_t, according to the weights. A real value, α_t, the weight of the base classifier, is selected. Then, the weights are readjusted. For y, l ∈ Y, y[l] is defined as

    y[l] = +1 if l = y,
           −1 if l ≠ y.
Two open questions are how to select α_t and how to train the weak learner. The first question is addressed in [Schapire and Singer (1998)]. For two-class problems, if the base classifier returns a value in {−1, +1}, then

    Given (x_1, y_1), . . . , (x_m, y_m) where x_i ∈ X, y_i ∈ Y
    Initialize D_1(i, l) = 1/(mk)
    For t = 1, . . . , T:
      • Train weak learner using distribution D_t
      • Get weak hypothesis h_t : X × Y → R
      • Choose α_t ∈ R
      • Update
            D_{t+1}(i, l) = D_t(i, l) · exp(−α_t · y_i[l] · h_t(x_i, l)) / Z_t
        where Z_t is a normalization factor (chosen so that D_{t+1} will be a distribution)
    Output the final hypothesis

    Fig. 1. AdaBoost.MH [Schapire and Singer (1998)].

¹ Although AdaBoost.MH also considers multilabel problems (an example can simultaneously belong to several classes), the version presented here does not include this case.
Boosting IntervalBased Literals: Variable Length and Early Classiﬁcation 153
the best is

    α = (1/2) ln(W_+ / W_−),

where W_+ and W_− are, respectively, the sums of the weights of the well classified and the badly classified examples.
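To make this procedure concrete, the following toy sketch implements the loop of Fig. 1 together with this choice of α. It is an illustrative sketch, not the authors' implementation: the one-dimensional numeric examples and the threshold-stump weak learner are assumptions introduced only so the code is self-contained.

```python
import math

def adaboost_mh(X, y, labels, weak_learner, T=10):
    """Minimal AdaBoost.MH sketch (Fig. 1): keeps one weight per
    (example, label) pair and combines T weak hypotheses."""
    m = len(X)
    k = len(labels)
    # D[i, l]: weight of the (example i, label l) pair, initially uniform
    D = {(i, l): 1.0 / (m * k) for i in range(m) for l in labels}
    ensemble = []
    for _ in range(T):
        h = weak_learner(X, y, labels, D)          # h(x, l) in {-1, +1}
        # y_i[l] = +1 if l is the true label of x_i, else -1
        w_pos = sum(D[i, l] for i in range(m) for l in labels
                    if h(X[i], l) == (1 if y[i] == l else -1))
        w_neg = sum(D.values()) - w_pos
        alpha = 0.5 * math.log((w_pos + 1e-10) / (w_neg + 1e-10))
        # reweight: badly classified pairs gain weight, correct ones lose it
        for i in range(m):
            for l in labels:
                yl = 1 if y[i] == l else -1
                D[i, l] *= math.exp(-alpha * yl * h(X[i], l))
        Z = sum(D.values())                        # normalize to a distribution
        for key in D:
            D[key] /= Z
        ensemble.append((alpha, h))
    # final hypothesis: the label with the largest weighted vote
    def classify(x):
        return max(labels, key=lambda l: sum(a * h(x, l) for a, h in ensemble))
    return classify

def stump_learner(X, y, labels, D):
    """Hypothetical weak learner: the threshold stump with the lowest
    weighted error over the (example, label) pairs."""
    best, best_err = None, float("inf")
    for thr in sorted(set(X)):
        for target in labels:
            def h(x, l, thr=thr, target=target):
                return 1 if (x <= thr) == (l == target) else -1
            err = sum(D[i, l] for i in range(len(X)) for l in labels
                      if h(X[i], l) != (1 if y[i] == l else -1))
            if err < best_err:
                best_err, best = err, h
    return best
```

On a trivially separable set such as `X = [1, 2, 3, 10, 11, 12]` with labels `'a', 'a', 'a', 'b', 'b', 'b'`, the resulting classifier recovers the two classes.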
For the last question, how to train the base learner: it must be a multiclass learner, but we want to use binary learners (only one literal). Then, for each iteration, we train the weak learner using a binary problem: one class (selected randomly) against the others. The output of this weak learner, h_t^B(x), is binary. On the other hand, we do not generate a unique α_t; instead, for each class l, an α_{tl} is selected. These values are selected considering how good the weak learner is at discriminating between the class l and the rest of the classes. This is a binary problem, so the selection of the values can be done as indicated in [Schapire and Singer (1998)]. Now, we can define h_t(x, l) = α_{tl} h_t^B(x) and α_t = 1, and we use AdaBoost.MH.
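This per-class selection can be sketched as follows. The names `h_bin`, `X`, `y` and the pair-indexed distribution `D` are illustrative, and the smoothing term `eps` is an assumption added to avoid division by zero; the formula itself is the α = ½ ln(W_+/W_−) rule above, applied once per class.

```python
import math

def per_class_alphas(h_bin, X, y, labels, D, eps=1e-10):
    """For each label l, measure how well the binary hypothesis h_bin
    separates class l from the rest (a one-vs-rest problem) and set
    alpha_l = 0.5 * ln(W+ / W-) as in Schapire and Singer."""
    alphas = {}
    for l in labels:
        w_pos = w_neg = 0.0
        for i, x in enumerate(X):
            target = 1 if y[i] == l else -1       # one-vs-rest target for label l
            if h_bin(x) == target:
                w_pos += D[i, l]                  # weight of well classified pairs
            else:
                w_neg += D[i, l]                  # weight of badly classified pairs
        alphas[l] = 0.5 * math.log((w_pos + eps) / (w_neg + eps))
    return alphas
```

A stump that separates the classes well for one label receives a large positive weight for that label and a negative one for the label it gets systematically wrong.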
3. Interval Based Literals
Figure 2 shows a classiﬁcation of the predicates used to describe the series.
Point based predicates use only one point of the series:
• point_le(Example, Variable, Point, Threshold). It is true if, for the Example, the value of the Variable at the Point is less than or equal to Threshold.
Note that a learner that only uses this predicate is equivalent to an attribute-value learning algorithm. This predicate is introduced to test the results obtained with boosting without using interval based predicates.
Two kinds of interval predicates are used: relative and region based. Relative predicates consider the differences between the values in the interval. Region based predicates are based on the presence of the values of a variable in a region during an interval. This section only introduces the predicates; [Rodríguez et al. (2001)] gives a more detailed description, including how to select them efficiently.
Predicates
  • Point based: point_le
  • Interval based
      • Relative: increases, decreases, stays
      • Region based: sometimes, always, true_percentage

Fig. 2. Classification of the predicates.
3.1. Relative Predicates
A natural way of describing series is to indicate when they increase, decrease
or stay. These predicates deal with these concepts:
• increases(Example, Variable, Beginning, End, Value). It is true, for the Example, if the difference between the values of the Variable at End and at Beginning is greater than or equal to Value.
• decreases(Example, Variable, Beginning, End, Value).
• stays(Example, Variable, Beginning, End, Value). It is true, for the Example, if the range of values of the Variable in the interval is less than or equal to Value.
Frequently, series are noisy and, hence, a strict definition of increases and decreases in an interval (i.e., the relation holds for all the points in the interval) is not useful. It is possible to filter the series prior to the learning process, but we believe that a system for time series classification must not rely on the assumption that the data is clean. For these two predicates we consider only what happens at the extremes of the interval. The parameter Value is necessary for indicating the amount of change.
For the predicate stays, a strict definition is not useful either. In this case all the points in the interval are considered. The parameter Value is used to indicate the maximum allowed difference between the values in the interval.
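Under the assumption that a series is a plain Python list indexed from 0 (the chapter treats examples and variables abstractly, through predicate arguments), the three relative predicates can be sketched as:

```python
def increases(series, beginning, end, value):
    """True if the series grows by at least `value` between the interval
    extremes; interior points are ignored, which tolerates noise."""
    return series[end] - series[beginning] >= value

def decreases(series, beginning, end, value):
    """Symmetric to `increases`: drop of at least `value` between extremes."""
    return series[beginning] - series[end] >= value

def stays(series, beginning, end, value):
    """True if the range over the whole interval is at most `value`;
    here every point of the interval is considered."""
    window = series[beginning:end + 1]
    return max(window) - min(window) <= value
```

Note how `increases` and `decreases` look only at the extremes, while `stays` inspects every available point, exactly as described above.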
3.2. Region Based Predicates
The selection and definition of these predicates is based on the ones used in a visual rule language for dynamic systems [Alonso González and Rodríguez Diez (1999)]. These predicates are:
• always(Example, Variable, Region, Beginning, End). It is true, for the Example, if the Variable is always in this Region in the interval between Beginning and End.
• sometimes(Example, Variable, Region, Beginning, End).
• true_percentage(Example, Variable, Region, Beginning, End, Percentage). It is true, for the Example, if the percentage of the time between Beginning and End where the variable is in Region is greater than or equal to Percentage.
Once it has been decided to work with temporal intervals, the use and definition of the predicates always and sometimes is natural, since they are the extension of conjunction and disjunction to intervals.
Since one appears too demanding and the other too flexible, a third one has been introduced: true_percentage. It is a "relaxed always" (or a "restricted sometimes"). The additional parameter indicates the degree of flexibility (or restriction).
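A sketch of the three region based predicates, assuming (as an illustrative choice) that a region is given as a closed numeric interval `(lo, hi)` and a series is a plain list:

```python
def always(series, region, beginning, end):
    """True if every point of the interval falls inside the region."""
    lo, hi = region
    return all(lo <= v <= hi for v in series[beginning:end + 1])

def sometimes(series, region, beginning, end):
    """True if at least one point of the interval falls inside the region."""
    lo, hi = region
    return any(lo <= v <= hi for v in series[beginning:end + 1])

def true_percentage(series, region, beginning, end, percentage):
    """Relaxed `always`: true if at least `percentage` percent of the
    points in the interval fall inside the region."""
    lo, hi = region
    window = series[beginning:end + 1]
    inside = sum(1 for v in window if lo <= v <= hi)
    return 100.0 * inside / len(window) >= percentage
```

With `percentage = 100` the third predicate behaves like `always`, and with any positive percentage that one point can satisfy it approaches `sometimes`, which is why the other two can be seen as its special cases.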
Regions. The regions that appear in the previous predicates are intervals in the domain of values of the variable. In some cases the definitions of these regions can be obtained from an expert, as background knowledge. Otherwise, they can be obtained with a discretization preprocess, which produces r disjoint, consecutive intervals. The regions considered are these r intervals (equality tests) and others formed by the union of intervals 1, . . . , i (less-than-or-equal tests).
The reasons for fixing the regions before the classifier induction, instead of obtaining them while inducing, are efficiency and comprehensibility. The literals are easier to understand if the regions are few, fixed and non-overlapping.
3.3. Classiﬁer Example
Table 1 shows a classifier, obtained from the CBF data set (Section 6.1). This classifier is composed of 10 base classifiers. The first column shows the literal. For each class in the data set there is another column, with the weight associated to the literal for that class.
In order to classify a new example, a weight is calculated for each class, and then the example is assigned to the class with the greatest weight. Initially, the weight of each class is 0. For each base classifier, the literal is evaluated. If it is true, then, for each class, its weight is updated by adding the weight of the class for the literal. If the literal is false, then the weight of the class for the literal is subtracted from the weight of the class.

Table 1. Classifier example, for the CBF data set (Section 6.1). For each literal, a weight is associated to every class.

Literal                                         Cylinder     Bell        Funnel
true_percentage(E, x, 1 4, 4, 36, 95)          −0.552084    3.431576   −0.543762
not true_percentage(E, x, 1 4, 16, 80, 40)      1.336955    0.297363   −0.527967
true_percentage(E, x, 3, 49, 113, 25)          −0.783590   −0.624340    1.104166
decreases(E, x, 34, 50, 1.20)                  −0.179658    0.180668    0.771224
not true_percentage(E, x, 4, 14, 78, 15)        0.899025   −0.234799   −0.534271
decreases(E, x, 35, 51, 1.20)                  −0.289225   −0.477789    0.832881
decreases(E, x, 32, 64, 0.60)                  −0.028495   −0.462483    0.733433
not true_percentage(E, x, 3, 12, 76, 15)        0.971446   −0.676715   −0.248722
true_percentage(E, x, 1 4, 18, 34, 95)          0.085362    3.103075   −0.351317
true_percentage(E, x, 1 3, 34, 38, 30)          0.041179    2.053417    0.134683

The first literal in the table, "true_percentage(E, x, 1 4, 4, 36, 95)", has associated the weights −0.55, 3.43 and −0.54 for the three classes. This means that if the literal is true (false), it is likely that the class is (not) the second one. This literal is not useful for discriminating between the first and the third class.
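The voting scheme just described can be sketched as follows; representing each base classifier as a (literal, per-class-weights) pair is an illustrative choice, not the authors' data structure.

```python
def classify(example, base_classifiers, classes):
    """Weighted vote of the base classifiers (as in Table 1): each entry
    is a (literal, weights) pair, where `literal` maps an example to
    True/False and `weights` holds one real weight per class."""
    totals = {c: 0.0 for c in classes}
    for literal, weights in base_classifiers:
        sign = 1 if literal(example) else -1      # true adds, false subtracts
        for c in classes:
            totals[c] += sign * weights[c]
    return max(classes, key=totals.get)           # class with the greatest weight
```

For instance, a single hypothetical literal with weights (−0.5, 3.4, −0.6) votes for the second class when true and against it when false.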
4. Variable Length Series
Due to the method used to select the literals that form the base classifiers, the learning system that we have introduced in the preceding sections requires series of equal length. Consequently, to apply it to series of different lengths, it is necessary to preprocess the data, normalizing the length of the series.
There are several methods, more or less complex, that allow normalizing the length of a set of series. These methods, which preprocess the data set, can be adequate for some domains. However, the use of these methods is not a general solution, because they destroy a piece of information that may be useful for the classification: the length of the series itself. Therefore, it is important that the learning method can deal with variable length series. Of course, it can still be used with preprocessed data sets of uniform length.
To learn from series of variable length, we have opted for a slight modification of the learning algorithm that allows a literal to abstain when there are not enough data for its evaluation. This modification also requires some change in the boosting method. With the basic boosting method, for binary problems, the base classifiers return +1 or −1. Nevertheless, there are variants that use confidence-rated predictions [Schapire and Singer (1998)]. In this case, the base learner returns a real value: the sign indicates the classification and the absolute value the confidence in this prediction. A special case consists of the use of three values: −1, 0 and +1. The value 0 indicates that the base classifier abstains. The value of α is again selected as

    α = (1/2) ln(W_+ / W_−).
Until now, a literal could only be true or false, because all the series had the same length. If the series are of different lengths, there will be literals with intervals that extend past the end of the shortest series. For these cases, the result of the evaluation of the literal could be an abstention, a 0, but due to the nature of the literals it will not always be 0. If the end of the series is before the beginning of the interval, the result will be 0. If the end of the series is inside the interval, the result depends on the predicate:
• For increases and decreases the result will always be 0, because only the extremes of the interval are considered.
• For stays the result will be −1 if the available points are not within the allowed range. Otherwise the result will be 0.
• For always, the result will be −1 if some available point in the interval is out of the region. Otherwise the result will be 0.
• For sometimes, the result will be 1 if some available point in the interval is inside the region. Otherwise the result will be 0.
• For true_percentage, the result will be 1 if the series already has enough points in the interval inside the region. Otherwise the result will be 0.
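As an illustration, a three-valued version of true_percentage following these rules might look like this (list-based series and `(lo, hi)` regions are assumptions carried over from the earlier sketches):

```python
def true_percentage_3v(series, region, beginning, end, percentage):
    """Three-valued true_percentage for variable length series:
    returns +1 (true), -1 (false) or 0 (abstention)."""
    lo, hi = region
    if len(series) <= beginning:                  # series ends before the interval
        return 0
    window = series[beginning:end + 1]
    inside = sum(1 for v in window if lo <= v <= hi)
    needed = percentage / 100.0 * (end - beginning + 1)
    if inside >= needed:                          # already enough points in region
        return 1
    if len(series) > end:                         # interval complete, test failed
        return -1
    return 0                                      # not enough data yet: abstain
```

Note that a short series can already make the literal true (enough points inside the region have been seen) but can never make it false, since the missing points might still satisfy the condition; in that case the literal abstains.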
5. Early Classiﬁcation
The capability of obtaining an initial classification, as soon as the available data give some cue about the possible class they belong to, is a very desirable property if the classifiers are to be used in a dynamic environment where data are generated and analyzed on line. Now, the setting is different from the variable length series learning problem of the previous section: the classifier has already been obtained, maybe from series of different lengths, and the variations occur in the length of the series to be classified, from partial examples to full series.
Somewhat surprisingly, this early classification ability can be obtained without modifying the learning method, exploiting the same ideas that allow learning from series of different lengths. When a partial example is presented to the classifier, some of its literals can be evaluated, because their intervals refer to areas that are already available in the example. Certainly, some literals cannot be evaluated, because the intervals that they refer to are not yet available for the example: their value is still unknown. Given that the classifier consists of a linear combination of literals, the simple omission of literals with unknown value allows obtaining a classification from the available data. The classification given to a partial example will be the linear combination of those literals that have known results.
A very simple approach to identifying literals with unknown values would be to abstain whenever there are point values that are still unknown for the example. Nevertheless, if the values of some points in the interval are known, it is sometimes possible to know the result of the literal. This is done in a similar way to the evaluation of literals for variable length series.
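A sketch of this omission scheme, reusing the (literal, per-class-weights) representation assumed earlier: each literal now returns +1, −1 or 0, and abstentions simply contribute nothing to the linear combination.

```python
def classify_partial(example, base_classifiers, classes):
    """Early classification sketch: literals that cannot be evaluated on
    the partial example abstain (return 0) and are omitted from the
    linear combination; the remaining literals still vote."""
    totals = {c: 0.0 for c in classes}
    for literal, weights in base_classifiers:
        result = literal(example)                 # +1, -1 or 0 (abstention)
        for c in classes:
            totals[c] += result * weights[c]
    return max(classes, key=totals.get)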
6. Experimental Validation
The characteristics of the data sets used to evaluate the behavior of the
learning systems are summarized in Table 2. If a training test partition
it speciﬁed, the results are obtained by averaging 5 experiments. In other
case, 10 fold stratiﬁed cross validation was used.
Table 3 shows the obtained results for diﬀerent data sets with diﬀerent
combination of predicates. The setting named Relative uses the predicates
increases, decreases and stays. The setting Interval also includes these liter
als, and true percentage. It does not include always and sometimes, because
they can be considered as special cases of true percentage.
Normally, 100 literals were used. For some data sets this value is
more than enough, and it shows clearly that the method does not over
ﬁt. Nevertheless, the results of some data set can be improved using more
Table 2. Characteristics of the data sets.
Classes Variables Examples Training/ Length
Test
Min. Average Max.
CBF 3 1 798 10 fold CV 128 128.00 128
CBFVar 3 1 798 10 fold CV 64 95.75 128
Control 6 1 600 10 fold CV 60 60.00 60
ControlVar 6 1 600 10 fold CV 30 44.86 60
Trace 16 4 1600 800/800 67 82.49 99
Auslan 10 8 200 10 fold CV 33 49.34 102
Table 3. Results (error rates) for the diﬀerent data sets using the diﬀerent literals.
Data set Literals Point Relative Always/ True Interval
sometimes percentage
CBF 100 3.51 2.28 1.38 0.63 0.50
CBFVar 100 3.49 1.61 2.75 1.25 1.13
Control 100 4.00 1.17 1.00 1.33 0.83
ControlVar 100 13.00 7.00 5.00 4.83 5.00
Trace 100 72.00 3.00 3.83 8.70 0.78
Trace 200 70.98 0.03 2.65 6.78 0.20
Gloves 100 8.00 5.50 5.50 3.00 4.50
Gloves 200 7.50 6.50 3.50 2.50 2.50
Gloves 300 5.00 4.00 4.50 2.50 1.50
April 26, 2004 16:9 WSPC/Trim Size: 9in x 6in for Review Volume chap07
Boosting IntervalBased Literals: Variable Length and Early Classiﬁcation 159
Table 4. Early Classiﬁcation. Results (error rates) for the diﬀerent data sets using
the diﬀerent literals.
Data set Percentage Point Relative Always/ True Interval
sometimes percentage
CBF 60 4.51 4.65 2.63 1.39 0.88
CBF 80 4.26 2.70 2.01 0.50 0.75
CBFVar 60 4.11 1.86 3.12 1.87 1.61
CBFVar 80 3.49 1.87 2.75 1.25 1.00
Control 60 42.17 17.33 30.17 35.00 34.33
Control 80 11.67 3.50 1.67 1.17 1.00
ControlVar 60 22.50 20.67 20.33 19.00 17.17
ControlVar 80 13.00 7.17 5.00 5.33 5.00
Trace 60 83.33 32.00 55.48 47.73 37.10
Trace 80 70.95 0.45 5.45 7.68 0.63
Gloves 60 5.00 4.00 4.50 3.00 1.50
Gloves 80 5.00 4.00 4.50 2.50 1.50
literals. Table 3 also shows for these data sets the obtained results using
more iterations.
An interesting issue of these results is that the error rate achieved using
point based literals are always the worst; a detailed study of this topic
for ﬁxed length time series can be found in [Rodriguez et al. (2001)]. The
obtained results for the setting named Interval are the better or very close
to the better results. The comparison between relative and region based
literal clearly indicates that its adequacy depends on the data set.
Table 4 shows some results for early classiﬁcation. The considered
lengths are expressed in terms of the percentage of the length of the longest
series in the data set. The table shows early classiﬁcation results for 60%
and 80%. Again, the usefulness of interval literals is clearly conﬁrmed.
The rest of this section contains a detailed discussion for each data set,
including its description, the results for boosting point based and interval
based literals and another known results for the data set.
6.1. CBF (Cylinder, Bell and Funnel)
This is an artiﬁcial problem, introduced in [Saito (1994)]. The learning task
is to distinguish between three classes: cylinder (c), bell (b) or funnel (f).
Examples are generated using the following functions:
c(t) = (6 + η) · χ
[a,b]
(t) + ε(t),
b(t) = (6 + η) · χ
[a,b]
(t) · (t − a)/(b − a) + ε(t),
f(t) = (6 + η) · χ
[a,b]
(t) · (b − t)/(b − a) + ε(t)
April 26, 2004 16:9 WSPC/Trim Size: 9in x 6in for Review Volume chap07
160 C. J. Alonso Gonz´alez and J. J. Rodr´ıguez Diez
Cylinder
–2
0
2
4
6
8
Bell
Funnel
–2
0
2
4
6
8
20 40 60 80 100 120
20 40 60 80 100 120
–2
0
2
4
6
8
20 40 60 80 100 120
Fig. 3. Examples of the CBF data set. Two examples of the same class are shown in
each graph.
April 26, 2004 16:9 WSPC/Trim Size: 9in x 6in for Review Volume chap07
Boosting IntervalBased Literals: Variable Length and Early Classiﬁcation 161
where
χ
[a,b]
=
0 if t < a ∨ t > b,
1 if a ≤ t ≤ b,
and η and ε(t) are obtained from a standard normal distribution N(0, 1),
a is an integer obtained from a uniform distribution in [16, 32] and b − a
is another integer obtained from another uniform distribution in [32, 96].
The examples are generated evaluating those functions for t = 1, 2, . . . , 128.
Figure 3 shows some examples of this data set.
The obtained results are shown in Figure 4(A). Two kinds of graphs
are used. The ﬁrst one shows the error rate as a function of the number of
literals. The second also shows the error rate, but in this case using early
classiﬁcation and the maximum number of considered literals. The error
rate is shown as a function of the percentage of the length of the series.
In each graph, two results are plotted, one using point based literals and
another using interval based literals. The results are also shown in Table 5.
For early classiﬁcation, the results obtained with the 60% of the series
length are very close to those obtained with the full series.
error/series length percentage
0
2
4
6
8
10
12
0 10 20 30 40 50 60 70 80 90 100
0
10
20
30
40
50
60
70
0 10 20 30 40 50 60 70 80 90 100
error/series length percentage
0
2
4
6
8
10
12
0 10 20 30 40 50 60 70 80 90 100
0
10
20
30
40
50
60
70
0 10 20 30 40 50 60 70 80 90 100
(A) ORIGINAL DATA SET
error/number of literals
error/number of literals
points
intervals
points
intervals
points
intervals
points
intervals
(B) VARIABLE LENGTH DATA SET
Fig. 4. Error graphs for the CBF data set.
April 26, 2004 16:9 WSPC/Trim Size: 9in x 6in for Review Volume chap07
162 C. J. Alonso Gonz´alez and J. J. Rodr´ıguez Diez
Table 5. Results for the CBF data set.
10 20 30 40 50 60 70 80 90 100
Error/Number of Literals
Points 8.67 6.02 5.64 4.51 4.39 3.64 3.64 3.15 3.02 3.51
Intervals 3.63 1.64 1.00 0.75 0.75 0.88 0.50 0.75 0.50 0.50
Error/Series Length Percentage
Points 67.17 48.15 35.85 27.10 10.13 4.51 3.39 4.26 3.77 3.51
Intervals 67.19 53.78 34.34 15.44 5.04 0.88 0.88 0.75 0.38 0.50
Table 6. Results for the CBFVar data set.
10 20 30 40 50 60 70 80 90 100
Error/Number of literals
Points 8.75 5.24 4.37 3.25 4.01 3.98 3.99 4.00 4.12 3.49
Intervals 4.11 2.99 2.13 1.75 1.50 1.25 1.13 1.25 1.13 1.13
Error/Series length percentage
Points 68.30 49.02 37.99 22.22 5.62 4.11 3.37 3.49 3.49 3.49
Intervals 65.84 54.38 32.83 16.67 3.25 1.61 1.00 1.00 1.13 1.13
The error reported in [Kadous (1999)] is 1.9, using event extraction,
event clustering and decision trees. Our results using interval based literals
are better than this value using 20 or more literals. Moreover, our method
is able to obtain a better result than 1.9, 0.88, using early classiﬁcation
with 100 literals when only 60% of the series length is available.
Variable Length Version. All the examples of the CBF data set have
the same length. In order to check the method for variable length series, the
data set was altered. For each example, a random number of values were
deleted from its end. The maximum number of values eliminated is 64,
the half of the original length. The obtained results for this new data set,
named CBFVar, are shown in Figure 4(B) and in Table 6. When using a
variable length data set and early classiﬁcation, the series length percentage
in graphs and tables are relative to the length of the longest series in the
data set. For instance, in this data set 25% corresponds to the ﬁrst 32 points
of each series, because the longest series have 128 points.
An interesting issue is that the obtained results with this altered data
set are still better than the results reported in [Kadous (1999)] using the
original data set.
April 26, 2004 16:9 WSPC/Trim Size: 9in x 6in for Review Volume chap07
Boosting IntervalBased Literals: Variable Length and Early Classiﬁcation 163
6.2. Control Charts
In this data set there are six diﬀerent classes of control charts, synthetically
generated by the process in [Alcock and Manolopoulos (1999)]. Each time
series is of length n = 60, and it is deﬁned by y(t), with 1 ≤ t ≤ n:
• Normal: y(t) = m+rs. Where m = 30, s = 2 and r is a random number
in [−3,3].
• Cyclic: y(t) = m + rs + a sin(2πt/T). a and T are in [10,15].
• Increasing: y(t) = m + rs + gt. g is in [0.2,0.5].
• Decreasing: y(t) = m + rs −gt.
• Upward: y(t) = m+rs +kx. x is in [7.5,20] and k = 0 before time t
3
and
1 after this time. t
3
is in [n/3, 2n/3].
• Downward: y(t) = m + rs −kx.
Figure 5 shows two examples of each class. The data used was obtained
from the UCI KDD Archive [Hettich and Bay (1999)]. The results are shown
in Figure 6(A) and in Table 7. The results are clearly better for interval
based literals, with the exception of early classiﬁcation with very short
fragments. This is possibly due to the fact that most of the literals refer to
intervals that are after these initial fragments. In any case, early classiﬁca
tion is not useful with so short fragments, because the error rates are too
large.
Variable Length Version. The Control data set was also altered to a
variable length variant, ControlVar. In this case the resulting series have
lengths between 30 and 60. The results are shown in Figure 6(B) and in
Table 8.
6.3. Trace
This dataset is introduced in [Roverso (2000)]. It is proposed as a bench
mark for classiﬁcation systems of temporal patterns in the process industry.
This data set was generated artiﬁcially. There are four variables, and each
variable has two behaviors, as shown in Figure 7. The combination of the
behaviors of the variables produces 16 diﬀerent classes. For this data set it
is speciﬁed a training/test partition.
The length of the series is between 268 and 394. The running times
of the algorithms depend on the length of the series, so the data set was
preprocessed, averaging the values in groups of four. The lengths of the
examples in the resulting data set have lengths between 67 and 99.
April 26, 2004 16:9 WSPC/Trim Size: 9in x 6in for Review Volume chap07
164 C. J. Alonso Gonz´alez and J. J. Rodr´ıguez Diez
Normal Cyclic
0
10
20
30
40
50
60
70
0 10 20 30 40 50 60
0
10
20
30
40
50
60
70
0 10 20 30 40 50 60
Increasing Decreasing
0
10
20
30
40
50
60
70
0 10 20 30 40 50 60
0
10
20
30
40
50
60
70
0 10 20 30 40 50 60
Upward Downward
0
10
20
30
40
50
60
70
0 10 20 30 40 50 60
0
10
20
30
40
50
60
70
0 10 20 30 40 50 60
Fig. 5. Examples of the Control data set. Two examples of the same class are shown
in each graph.
The results are shown in Figure 8 and in Table 9. For this data set,
boosting point based literals is not able to learn anything useful, the error
is always greater than 70%. Nevertheless, using boosting of interval based
literals and more than 80 literals, the error is less than 1%. For early classi
ﬁcation, the error is under 1% with the 80% of the series length. The result
reported in [Roverso (2000)], using recurrent neural networks and wavelets
is an error of 1.4%, but 4.5% of the examples are not assigned to any class.
April 26, 2004 16:9 WSPC/Trim Size: 9in x 6in for Review Volume chap07
Boosting IntervalBased Literals: Variable Length and Early Classiﬁcation 165
error/number of literals error/series length percentage
0
5
10
15
20
25
30
0 10 20 30 40 50 60 70 80 90 100
points
intervals
0
10
20
30
40
50
60
70
80
0 10 20 30 40 50 60 70 80 90 100
points
intervals
error/number of literals error/series length percentage
0
5
10
15
20
25
30
0 10 20 30 40 50 60 70 80 90 100
points
intervals
0
10
20
30
40
50
60
70
80
0 10 20 30 40 50 60 70 80 90 100
points
intervals
(A) ORIGINAL DATASET
(B) VARIABLE LENGTH DATASET
Fig. 6. Error graphs for the Control data set.
Table 7. Results for the Control data set.
10 20 30 40 50 60 70 80 90 100
Error/Number of literals
Points 20.83 13.83 9.50 7.33 6.50 5.50 4.67 4.17 4.33 4.00
Intervals 4.50 1.67 1.33 1.17 1.17 1.00 1.00 1.00 0.83 0.83
Error/Series length percentage
Points 63.17 54.00 48.00 46.17 45.00 42.17 28.50 11.67 4.67 4.00
Intervals 87.00 56.17 45.17 47.00 41.33 34.33 6.67 1.00 0.67 0.83
Table 8. Results for the ControlVar data set.
10 20 30 40 50 60 70 80 90 100
Error/Number of literals
Points 30.33 22.83 15.83 15.17 13.00 13.67 12.83 12.50 13.17 13.00
Intervals 6.50 6.50 6.00 5.50 5.83 5.67 5.33 5.33 5.50 5.00
Error/Series length percentage
Points 65.00 51.67 43.67 41.50 31.00 22.50 12.83 13.00 13.00 13.00
Intervals 74.67 48.17 41.50 40.00 31.33 17.17 5.50 5.00 5.00 5.00
April 26, 2004 16:9 WSPC/Trim Size: 9in x 6in for Review Volume chap07
166 C. J. Alonso Gonz´alez and J. J. Rodr´ıguez Diez
0
0.2
0.4
0.6
0.8
1
1.2
0 50 100 150 200 250 300 350 400
0
0.2
0.4
0.6
0.8
1
1.2
0 50 100 150 200 250 300 350 400
–1
–0.5
0
0.5
1
0 50 100 150 200 250 300 350 400
–1
–0.5
0
0.5
1
0 50 100 150 200 250 300 350 400
0
0.2
0.4
0.6
0.8
1
0 50 100 150 200 250 300 350 400
0
0.2
0.4
0.6
0.8
1
0 50 100 150 200 250 300 350 400
0
0.2
0.4
0.6
0.8
1
0 50 100 150 200 250 300 350 400
0
0.2
0.4
0.6
0.8
1
0 50 100 150 200 250 300 350 400
V
a
r
i
a
b
l
e
1
V
a
r
i
a
b
l
e
1
V
a
r
i
a
b
l
e
1
V
a
r
i
a
b
l
e
1
Case 1 Case 2
Fig. 7. Trace data set. Each example is composed by four variables, and each variable
has two possible behaviors. In the graphs, two examples of each behavior are shown.
April 26, 2004 16:9 WSPC/Trim Size: 9in x 6in for Review Volume chap07
Boosting IntervalBased Literals: Variable Length and Early Classiﬁcation 167
error/number of literals error/series length percentage
0
10
20
30
40
50
60
70
80
90
0 20 40 60 80 100 120 140 160 180 200
points
intervals
points
intervals
0
10
20
30
40
50
60
70
80
90
100
0 10 20 30 40 50 60 70 80 90 100
Fig. 8. Error graphs for the Trace data set.
Table 9. Results for the Trace data set.
Error/Number of literals
Literals 20 40 60 80 100 120 140 160 180 200
Points 72.63 73.83 72.78 72.40 72.00 72.25 71.70 70.45 70.55 70.98
Intervals 7.30 2.20 1.05 0.98 0.78 0.53 0.43 0.28 0.25 0.20
Error/Series length percentage
Percentage 10 20 30 40 50 60 70 80 90 100
Points 93.63 92.48 89.20 88.28 85.48 83.32 72.13 70.95 70.98 70.98
Intervals 93.75 88.15 75.05 73.05 58.05 37.10 6.90 0.63 0.28 0.20
error/number of literals
0
5
10
15
20
0 50 100 150 200 250 300
points
intervals
0
10
20
30
40
50
60
70
80
90
0 10 20 30 40 50 60 70 80 90 100
points
intervals
error/series length percentage
Fig. 9. Error graphs for the Auslan data set.
6.4. Auslan
Auslan is the Australian sign language, the language of the Australian deaf
community. Instances of the signs were collected using an instrumented
glove [Kadous (1999)]. Each example is composed by 8 series: x, y and z
position, wrist roll, thumb, fore, middle and ring ﬁnger bend. There are
April 26, 2004 16:9 WSPC/Trim Size: 9in x 6in for Review Volume chap07
168 C. J. Alonso Gonz´alez and J. J. Rodr´ıguez Diez
Table 10. Results for the Auslan data set.
Error/Number of literals
Literals 30 60 90 120 150 180 210 240 270 300
Points 16.50 11.50 8.00 8.50 8.00 8.00 7.50 6.50 6.00 5.00
Intervals 11.00 7.50 4.00 4.00 3.50 3.00 3.00 2.50 2.00 1.50
Error/Series length percentage
Percentage 10 20 30 40 50 60 70 80 90 100
Points 76.00 42.00 10.00 5.50 5.00 5.00 5.00 5.00 5.00 5.00
Intervals 82.50 41.50 13.00 3.50 2.00 1.50 1.50 1.50 1.50 1.50
10 classes and 20 examples of each class. The minimum length is 33 and
the maximum is 102.
The results are shown in Figure 9 and in Table 10. The best result
reported in [Kadous (1999)] is an error of 2.50, using event extraction,
event clustering and Na¨ıve Bayes Classiﬁers. Our result is 1.5 using 300
interval based literals.
With respect to the results for early classiﬁcation, it is necessary to
consider two facts. First, for this problem it is not practical to use early
classiﬁcation, because the examples are generated very quickly. The second
fact is that the used percentage is for the longest series, and in this data
set the lengths vary a lot. Hence, the obtained results for a length of 50%,
include many examples that are already completed.
7. Conclusions
This chapter has presented a novel approach to the induction of multivariate
time series classiﬁers. It summarizes a supervised learning method that
works boosting very simple base classiﬁers. From only one literal, the base
classiﬁers, boosting creates classiﬁers consisting of a linear combination of
literals.
The learning method is highly domain independent, because it only
makes use of very general techniques, like boosting, and only employs very
generic descriptions of the time series, interval based literals. In this work we
have resorted to two kind of interval predicates: relative and region based.
Relative predicates consider the evolution of the values in the interval, while
region based predicates consider the occurrence of the values of a variable in
a region during an interval. These kind of predicates, speciﬁcally design for
April 26, 2004 16:9 WSPC/Trim Size: 9in x 6in for Review Volume chap07
Boosting Interval-Based Literals: Variable Length and Early Classification 169
time series although essentially domain independent, allow the assembly
of very accurate classifiers.
Experiments on different data sets show that in terms of error rate
the proposed method is highly competitive with previous approaches. On
several data sets, it achieves better results than all previously reported ones
we are aware of. Moreover, although the strength of the method is based
on boosting, the experimental results using point-based predicates show
that the incorporation of interval predicates can significantly improve the
obtained classifiers, especially when using fewer iterations.
An important aspect of the learning method is that it can deal directly
with variable length series. As we have shown, the simple mechanism of
allowing the evaluation of a literal to result in an abstention when there
are not enough data to evaluate it makes it possible to learn from series of
different lengths. The symbolic nature of the base classifiers facilitates their
capacity to abstain. It also requires the use of a boosting method able to
work with abstentions in the base classifiers.
Another feature of the method is the ability to classify incomplete
examples. This early classification is indispensable for some applications,
where the time necessary to generate an example can be rather long and
where it is not an option to wait until the full example is available. It is
important to notice that this early classification does not influence the
learning process. In particular, it is relevant that in order to obtain early
classifiers, neither additional classifiers nor more complex classifiers are
necessary. This early classification capacity is another consequence of the
symbolic nature of the classifier. Nevertheless, an open issue is whether it
would be possible to obtain better results for early classification by modifying
the learning process. For instance, literals with later intervals could be
somewhat penalized.
An interesting advantage of the method is its simplicity. From a user's
point of view, the method has only one free parameter, the number of
iterations. Moreover, the classifiers created with more iterations include the
previously obtained ones. Hence, it is possible (i) to select only an initial
fragment of the final classifier and (ii) to continue adding literals to a
previously obtained classifier. Although less important, from the programmer's
point of view the method is also rather simple. The implementation of boosting
stumps is one of the easiest among classification methods.
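The pieces summarized above (one-literal base classifiers that may abstain, combined linearly by boosting) can be illustrated as follows. This is a minimal sketch, not the chapter's implementation; the literal form, the names and the interval test are hypothetical:

```python
# Hypothetical sketch: a one-literal base classifier over an interval,
# with abstention (vote 0) when the series is too short to evaluate it.

def make_interval_literal(var, start, end, threshold):
    """Base classifier: +1/-1 vote on one interval-based test, or 0
    (abstention) when there are not enough data for the interval."""
    def literal(series):
        values = series[var]
        if len(values) < end:          # not enough data: abstain
            return 0
        window = values[start:end]
        return 1 if sum(window) / len(window) >= threshold else -1
    return literal

def boosted_classify(example, literals, alphas):
    """Sign of a linear combination of literal votes (binary case)."""
    score = sum(a * lit(example) for lit, a in zip(literals, alphas))
    return 1 if score >= 0 else -1
```

Because abstaining literals contribute 0 to the score, an incomplete example is classified by whatever literals can already be evaluated, which is exactly the early classification behaviour described above.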
The main focus of this work has been classification accuracy, at the
expense of classifier comprehensibility. There are methods that produce
170 C. J. Alonso González and J. J. Rodríguez Diez
more comprehensible models than weighted literals, such as decision trees
or rules, and literals over intervals can also be used with these models.
A first attempt to learn rules of literals is described in [Rodríguez et al.
(2000)], where we obtained less accurate but more comprehensible
classifiers. Nevertheless, the use of these literals with other learning models
requires more research.
Finally, a short remark about the suitability of the method for time
series classification. Since the only properties of the series that are considered
for their classification are the attributes tested by the interval-based
literals, the method will be adequate as long as the series can be discriminated
according to what happens in intervals. Hence, although the experimental
results are rather good, they cannot be generalized to arbitrary data sets.
Nevertheless, this learning framework can be used with other kinds of
literals, better suited to the problem at hand. For instance, [Rodríguez and
Alonso (2000)] uses literals based on distances between series.
References
1. Alcock, R.J. and Manolopoulos, Y. (1999). Time-Series Similarity Queries
Employing a Feature-Based Approach. In Proceedings of the 7th Hellenic
Conference on Informatics, Ioannina, Greece.
2. Alonso González, C.J. and Rodríguez Diez, J.J. (1999). A Graphical
Rule Language for Continuous Dynamic Systems. In International Conference
on Computational Intelligence for Modelling, Control and Automation
(CIMCA'99), Vienna, Austria.
3. Berndt, D. and Clifford, J. (1996). Finding Patterns in Time Series: A
Dynamic Programming Approach. In Fayyad, U., Piatetsky-Shapiro, G.,
Smyth, P., and Uthurusamy, R., eds, Advances in Knowledge Discovery and
Data Mining, pp. 229–248. AAAI Press/MIT Press.
4. Hettich, S. and Bay, S.D. (1999). The UCI KDD Archive.
http://kdd.ics.uci.edu/.
5. Kadous, M.W. (1999). Learning Comprehensible Descriptions of Multivariate
Time Series. In Bratko, I. and Dzeroski, S., eds, Proceedings of the 16th
International Conference on Machine Learning (ICML-99). Morgan Kaufmann.
6. Kubat, M., Koprinska, I., and Pfurtscheller, G. (1998). Learning to Classify
Biomedical Signals. In Michalski, R., Bratko, I., and Kubat, M., eds, Machine
Learning and Data Mining, pp. 409–428. John Wiley & Sons.
7. Rodríguez, J.J. and Alonso, C.J. (2000). Applying Boosting to Similarity
Literals for Time Series Classification. In Proceedings of the 1st International
Workshop on Multiple Classifier Systems (MCS 2000). Springer.
8. Rodríguez, J.J., Alonso, C.J., and Boström, H. (2000). Learning First Order
Logic Time Series Classifiers: Rules and Boosting. In Proceedings of the 4th
European Conference on Principles of Data Mining and Knowledge Discovery
(PKDD 2000). Springer.
9. Rodríguez, J.J., Alonso, C.J., and Boström, H. (2001). Boosting Interval
Based Literals. Intelligent Data Analysis, 5(3), 245–262.
10. Roverso, D. (2000). Multivariate Temporal Classification by Windowed
Wavelet Decomposition and Recurrent Neural Networks. In 3rd ANS
International Topical Meeting on Nuclear Plant Instrumentation, Control and
Human-Machine Interface.
11. Saito, N. (1994). Local Feature Extraction and its Applications Using a
Library of Bases. PhD thesis, Department of Mathematics, Yale University.
12. Schapire, R.E. (1999). A Brief Introduction to Boosting. In 16th International
Joint Conference on Artificial Intelligence (IJCAI-99), pp. 1401–1406.
Morgan Kaufmann.
13. Schapire, R.E. and Singer, Y. (1998). Improved Boosting Algorithms Using
Confidence-Rated Predictions. In 11th Annual Conference on Computational
Learning Theory (COLT 1998), pp. 80–91. ACM.
CHAPTER 8
MEDIAN STRINGS: A REVIEW
Xiaoyi Jiang
Department of Electrical Engineering and Computer Science
Technical University of Berlin
Franklinstrasse 28/29, D-10587 Berlin, Germany
Email: jiang@iam.unibe.ch
Horst Bunke
Department of Computer Science, University of Bern
Neubrückstrasse 10, CH-3012 Bern, Switzerland
Email: bunke@iam.unibe.ch
Janos Csirik
Department of Computer Science, Attila Jozsef University
Árpád tér 2, H-6720 Szeged, Hungary
Time series can be eﬀectively represented by strings. The median con
cept is useful in various contexts. In this chapter its adaptation to the
domain of strings is discussed. We introduce the notion of median string
and provide related theoretical results. Then, we give a review of algo
rithmic procedures for eﬃciently computing median strings. Some exper
imental results will be reported to demonstrate the median concept and
to compare some of the considered algorithms.
Keywords: String distance; set median string; generalized median string;
online handwritten digits.
1. Introduction
Strings provide a simple and yet powerful representation scheme for sequen
tial data. In particular time series can be eﬀectively represented by strings.
Numerous applications have been found in a broad range of ﬁelds including
computer vision [2], speech recognition, and molecular biology [13,34].
A large number of operations and algorithms have been proposed to
deal with strings [1,5,13,34,36]. Some of them are inherent to the special
nature of strings such as the shortest common superstring and the longest
common substring, while others are adapted from other domains.
In data mining, clustering and machine learning, a typical task is to
represent a set of (similar) objects by means of a single prototype. Interest
ing applications of the median concept have been demonstrated in dealing
with 2D shapes [16, 33], binary feature maps [23], 3D rotation [9], geomet
ric features (points, lines, or 3D frames) [32], brain models [12], anatomical
structures [37], and facial images [31]. In this chapter we discuss the
adaptation of the median concept to the domain of strings.
The median concept is useful in various contexts. It represents a funda
mental quantity in statistics. In sensor fusion, multisensory measurements
of some quantity are averaged to produce the best estimate. Averaging the
results of several classiﬁers is used in multiple classiﬁer systems in order to
achieve more reliable classiﬁcations.
The outline of the chapter is as follows. We ﬁrst formally introduce the
median string problem in Section 2 and provide some related theoretical
results in Section 3. Sections 4 and 5 are devoted to algorithmic procedures
for eﬃciently computing set median and generalized median strings. In
Section 6 we report experimental results to demonstrate the median concept
and to compare some of the considered algorithms. Finally, some discussions
conclude this chapter.
2. Median String Problem
Assuming an alphabet Σ of symbols, a string x is simply a sequence of
symbols from Σ, i.e. x = x_1 x_2 · · · x_n where x_i ∈ Σ for i = 1, . . . , n. Given
the space U of all strings over Σ, we need a distance function d(p, q) to
measure the dissimilarity between two strings p, q ∈ U. Let S be a set of
N strings from U. The essential information of S is captured by a string
p̄ ∈ U that minimizes the sum of distances of p̄ to all strings from S, also
called the consensus error E_S(p):

    p̄ = arg min_{p ∈ U} E_S(p), where E_S(p) = Σ_{q ∈ S} d(p, q).
The string p̄ is called a generalized median of S. If the search is
constrained to the given set S, the resultant string

    p̂ = arg min_{p ∈ S} E_S(p)
is called a set median of S. For a given set S, neither the generalized median
nor the set median is necessarily unique. This deﬁnition was introduced by
Kohonen [20]. Note that diﬀerent terminology has been used in the liter
ature. In [13] the set median string and the generalized median string are
termed center string and Steiner string, respectively. In [24] the generalized
median was called consensus sequence.
A different possibility is also mentioned by Kohonen [20]: the parallel
of the mean from elementary statistics. Here we search for a string p* that
minimizes

    Σ_{q ∈ S} d²(p*, q).

Martínez-Hinarejos et al. [27] returned to this definition and investigated
the possibility of using the mean instead of the median.
Note that in a broad sense, the median string concept is related to the
center of gravity of a collection of masses. While the latter represents the
point where all the weight of the masses can be considered to be concen
trated, the median string corresponds to a single representation of strings
based on a string distance.
Several string distance functions have been proposed in the literature.
The most popular one is doubtlessly the Levenshtein edit distance. Let
A = a_1 a_2 · · · a_n and B = b_1 b_2 · · · b_m be two words over Σ. The Levenshtein
edit distance d(A, B) is defined in terms of the elementary edit operations
which are required to transform A into B. Usually, three different types of
edit operations are considered, namely (1) substitution of a symbol a ∈ A by a
symbol b ∈ B, a ≠ b, (2) insertion of a symbol a ∈ Σ in B, and (3) deletion
of a symbol a ∈ A. Symbolically, we write a → b for a substitution, ε → a
for an insertion, and a → ε for a deletion. To model the fact that some
distortions may be more likely than others, costs of edit operations, c(a →
b), c(ε → a), and c(a → ε), are introduced. Let s = l_1 l_2 · · · l_k be a sequence
of edit operations transforming A into B. We define the cost of this sequence
by c(s) = Σ_{i=1}^{k} c(l_i). Given two strings A and B, the Levenshtein edit
distance is given by

    d(A, B) = min{c(s) | s is a sequence of edit operations
                         transforming A into B}.
To illustrate the Levenshtein edit distance, let us consider two words
A = median and B = mean built on the English alphabet. Examples of
sequences of edit operations transforming A into B are:

• s_1 = (d → a, i → n, a → ε, n → ε),
• s_2 = (d → a, i → ε, a → ε),
• s_3 = (d → ε, i → ε).

Under the edit costs c(a → ε) = c(ε → a) = c(a → b) = 1 for a ≠ b, s_3
represents the optimal sequence, with the minimum total cost 2, for
transforming median into mean among all possible transformations. Therefore,
we observe d(median, mean) = 2.
In [38] an algorithm is proposed to compute the Levenshtein edit dis
tance by means of dynamic programming. Other algorithms are discussed in
[13,34,36]. Further string distance functions are known from the literature,
for instance, normalized edit distance [28], maximum posterior probability
distance [20], feature distance [20], and others [8]. The Levenshtein edit
distance is by far the most popular one. Actually, some of the algorithms
we discuss later are tightly coupled to this particular distance function.
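The dynamic-programming computation of this distance can be sketched as follows (a minimal sketch with unit costs by default; not the implementation of [38]):

```python
def levenshtein(a, b, sub_cost=1, ins_cost=1, del_cost=1):
    """Levenshtein edit distance via dynamic programming.
    d[i][j] holds the cost of transforming a[:i] into b[:j]."""
    n, m = len(a), len(b)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = i * del_cost          # delete all of a[:i]
    for j in range(1, m + 1):
        d[0][j] = j * ins_cost          # insert all of b[:j]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d[i][j] = min(
                d[i - 1][j - 1] + (0 if a[i - 1] == b[j - 1] else sub_cost),
                d[i - 1][j] + del_cost,   # delete a[i-1]
                d[i][j - 1] + ins_cost,   # insert b[j-1]
            )
    return d[n][m]
```

For the example above, `levenshtein("median", "mean")` returns 2, matching the optimal sequence s_3.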
3. Theoretical Results
In this section we summarize some theoretical results related to median
strings. The generalized median is a more general concept and usually a
better representation of the given strings than the set median. From a prac
tical point of view, the set median can be regarded an approximate solution
of the generalized median. As such it may serve as the start for an iterative
reﬁnement process to ﬁnd more accurate approximations. Interestingly, we
have the following result (see [13] for a proof):
Theorem 1. Assume that the string distance function satisfies the triangle
inequality. Then E_S(p̂)/E_S(p̄) ≤ 2 − 2/|S|.

That is, the consensus error of the set median relative to S is at most
2 − 2/|S| times the consensus error of the generalized median string.
Independent of the distance function we can always find the set median
of N strings by means of N(N − 1)/2 pairwise distance computations. This
computational burden can be further reduced by making use of special
properties of the distance function (e.g. metric) or by resorting to approximate
procedures. Section 4 will present examples of these approaches.
Compared to set median strings, the computation of generalized median
strings represents a much more demanding task. This is due to the
huge search space which is substantially larger than that for determining
the set median string. This intuitive understanding of the computational
complexity is supported by the following theoretical results. Under the two
conditions that

• every edit operation has cost one, i.e., c(a → b) = c(ε → a) =
  c(a → ε) = 1, and
• the alphabet is not of fixed size,

it is proved in [15] that computing the generalized median string is
NP-hard. Sim and Park [35] proved that the problem is NP-hard for a finite
alphabet and for a metric distance matrix. Another result comes from
computational biology. The optimal evolutionary tree problem there turns out
to be equivalent to the problem of computing generalized median strings
if the tree structure is a star (a tree with n + 1 nodes, n of them being
leaves). In [39] it is proved that in this particular case the optimal evolu
tionary tree problem is NPhard. The distance function used is problem
dependent and does not even satisfy the triangle inequality. All these theo
retical results indicate the inherent diﬃculty in ﬁnding generalized median
strings. Not surprisingly, the algorithms we will discuss in Section 5 are
either exponential or approximate.
4. Fast Computation of Set Median Strings
The naive computation of the set median requires O(N²) distance
computations. Considering the relatively high computational cost of each
individual string distance, this straightforward approach may not be appropriate,
especially in the case of a large number of strings. The problem of fast set
median search can be tackled by making use of properties of metric distance
functions or developing approximate algorithms. Several solutions [19,30]
have been suggested for fast set median search in arbitrary spaces. They
apply to the domain of strings as well.
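The naive O(N²) computation just described can be sketched as follows; `dist` stands for any symmetric string distance (a sketch, exploiting symmetry so each pair is evaluated once):

```python
def set_median(strings, dist):
    """Naive set median: N(N-1)/2 pairwise distance computations."""
    n = len(strings)
    errors = [0.0] * n                 # consensus error E_S(p) per string
    for i in range(n):
        for j in range(i + 1, n):
            d = dist(strings[i], strings[j])   # symmetric: count for both
            errors[i] += d
            errors[j] += d
    best = min(range(n), key=errors.__getitem__)
    return strings[best]
```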
4.1. Exact Set Median Search in Metric Spaces
In many applications the underlying string distance function is a metric
which satisﬁes:
(i) d(p, q) ≥ 0 and d(p, q) = 0 if and only if p = q,
(ii) d(p, q) = d(q, p),
(iii) d(p, q) + d(q, r) ≥ d(p, r).
A property of metrics is the reverse triangle inequality

    |d(p, r) − d(r, q)| ≤ d(p, q),  ∀ p, q, r ∈ S,                  (1)

which can be utilized to reduce the number of string distance computations.
The approach proposed in [19] partitions the input set S into subsets
S_u (used), S_e (eliminated), and S_a (alive). The set S_a keeps track of those
strings that have not been fully evaluated yet; initially S_a = S. A lower
bound g(p) is computed for each string p in S_a, i.e., the consensus error of
p satisfies

    E_S(p) = Σ_{q ∈ S} d(p, q) ≥ g(p).

Clearly, strings with small g values are potentially better candidates for the
set median. For this reason the string p with the smallest g(p) value among
all strings in S_a is transferred from S_a to S_u. Then, the consensus error
E_S(p) is computed and, if necessary, the current best median candidate p*
is updated. Then, the lower bound g is computed for all strings that are
alive, and those whose g is not smaller than E_S(p*) are moved from S_a
to S_e. They will not be considered as median candidates any longer. This
process is repeated until S_a becomes empty.
In each iteration, the consensus error for the string p with the smallest
g value is computed by

    E_S(p) = Σ_{q ∈ S_u} d(p, q) + Σ_{q ∈ S_e ∪ (S_a − {p})} d(p, q).
Using (1), the term d(p, q) in the second summation is estimated by

    d(p, q) ≥ d(p, r) − d(r, q),  ∀ r ∈ S_u.

Taking all strings of S_u into account, we obtain the lower bound

    E_S(p) ≥ Σ_{q ∈ S_u} d(p, q)
           + Σ_{q ∈ S_e ∪ (S_a − {p})} max_{r ∈ S_u} [d(p, r) − d(r, q)] = g(p).   (2)
The critical point here is to see that all the distances in this lower
bound involve p and strings from S_u, and were therefore already
computed before. When strings in S_a are eliminated (moved to S_e), their
consensus errors need not be computed in the future. This fact results in a
saving of distance computations. In addition to (2), two other lower bounds
within the same algorithmic framework are given in [19]. They differ in the
resulting ratio of the number of distance computations to the remaining
overhead, with the lower bound (2) requiring the smallest number of
distance computations.
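The pruning scheme of [19] can be sketched as follows. This is a simplified reading of the algorithm, not the authors' implementation; the bound (2) is clamped at zero (still a valid lower bound, since distances are nonnegative), and all bound evaluations use only cached distances:

```python
def set_median_metric(items, dist):
    """Set median search with the lower bound (2); assumes dist is a
    metric. Bounds use only cached distances, so pruning never triggers
    a new distance computation."""
    n = len(items)
    cache = {}
    def d(i, j):
        key = (min(i, j), max(i, j))
        if key not in cache:
            cache[key] = dist(items[i], items[j])
        return cache[key]
    alive = set(range(n))              # S_a
    used = []                          # S_u (fully evaluated candidates)
    g = {i: 0.0 for i in alive}        # lower bounds
    best, best_err = None, float("inf")
    while alive:
        p = min(alive, key=g.__getitem__)   # smallest lower bound first
        alive.remove(p)
        used.append(p)
        err = sum(d(p, q) for q in range(n) if q != p)  # exact E_S(p)
        if err < best_err:
            best, best_err = items[p], err
        for q in list(alive):          # re-bound and prune S_a -> S_e
            rest = [x for x in range(n) if x != q and x not in used]
            g[q] = sum(d(q, r) for r in used) + sum(
                max(max(d(q, r) - d(r, x) for r in used), 0.0)
                for x in rest)
            if g[q] >= best_err:
                alive.remove(q)        # eliminated
    return best, best_err
```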
Ideally, the distance function is desired to be a metric, in order to match
the human intuition of similarity. The triangle inequality excludes the case
in which d(p, r) and d(r, q) are both small, but d(p, q) is very large. In prac
tice, however, there may exist distance functions which do not satisfy the
triangle inequality. To judge the suitability of these distance functions, the
work [6] suggests quasimetrics with a relaxed triangle inequality. Instead
of the strict triangle inequality, the relation

    d(p, r) + d(r, q) ≥ d(p, q)/(1 + ε)

is required now. Here ε is a small nonnegative constant. As long as ε is not
very large, the relaxed triangle inequality still retains the human intuition
of similarity. Note that the strict triangle inequality is a special case with
ε = 0. The fast set median search procedure [19] sketched above easily
extends to quasimetrics. In this case the relationship (1) is replaced by

    d(p, q) ≥ max{ d(p, r)/(1 + ε) − d(q, r),
                   d(q, r)/(1 + ε) − d(p, r) },  ∀ p, q, r ∈ S,

which can be used in the same manner to establish a lower bound g(p).
4.2. Approximate Set Median Search in Arbitrary Spaces
Another approach to fast set median search makes no assumption on the
distance function and covers therefore nonmetrics as well. The idea of this
approximate algorithm is simple. Instead of computing the sum of distances
of each string to all the other strings of S to select the best one, only a
subset of S is used to obtain an estimation of the consensus error [30]. The
algorithm ﬁrst calculates such estimations and then calculates the exact
consensus errors only for strings that have low estimations.
This approximate algorithm proceeds in two steps. First, a random
subset S_r of N_r strings is selected from S. For each string p of S, the
consensus error E_{S_r}(p) relative to S_r is computed and serves as an
estimation of the consensus error E_S(p). In the second step the N_t strings
with the lowest consensus error estimations are chosen. The exact consensus
error E_S(p) is computed for these N_t strings, and the string with the
minimum E_S(p) is regarded as the (approximate) set median string of S.
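The two-step procedure can be sketched as follows (a sketch; `n_r` and `n_t` correspond to N_r and N_t above):

```python
import random

def approx_set_median(strings, dist, n_r, n_t, seed=0):
    """Two-step approximate set median: estimate consensus errors on a
    random subset of size n_r, then evaluate exactly only the n_t best
    candidates."""
    rng = random.Random(seed)
    subset = rng.sample(strings, n_r)
    # step 1: rank all strings by their estimated consensus error
    ranked = sorted(strings,
                    key=lambda p: sum(dist(p, q) for q in subset))
    # step 2: exact consensus error for the n_t best estimates only
    return min(ranked[:n_t],
               key=lambda p: sum(dist(p, q) for q in strings))
```

With `n_r = n_t = len(strings)` the procedure degenerates to the exact naive search; smaller values trade accuracy for fewer distance computations.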
5. Computation of Generalized Median Strings
While the set median problem is characterized by selecting one particular
member out of a given set of strings, the computation of generalized median
strings is inherently constructive. The theoretical results from Section 3
about computational complexity indicate the fundamental diﬃculties we
are faced with. In the following we describe various algorithms for comput
ing generalized median strings. Not surprisingly, they are either of expo
nential complexity or approximate.
5.1. An Exact Algorithm and Its Variants
An algorithm for the exact computation of generalized median strings under
the Levenshtein distance is given in [21]. Let ε be the empty string and
Σ' = Σ ∪ {ε} the extended alphabet. We define

    δ(r_1, r_2, . . . , r_N) = min_{v ∈ Σ'} [c(v → r_1) + c(v → r_2) + · · · + c(v → r_N)].

The operator δ can be interpreted as a voting function, as it determines
the best value v at a given stage of the computation. Finding an optimal value
of v requires an exhaustive search over Σ' in the most general case, but in
practice the cost function is often simple enough that a shortcut can be taken
and the choice of the optimal v is not costly.
Having defined δ this way, the generalized median string can be
computed by means of dynamic programming in an N-dimensional array,
similarly to string edit distance computation [38]. For the sake of notational
simplicity, we only discuss the case N = 3. Assume the three input strings
to be u_1 u_2 · · · u_l, v_1 v_2 · · · v_m, and w_1 w_2 · · · w_n. A three-dimensional
distance table of dimension l × m × n is constructed as follows:
Initialization: d_{0,0,0} = 0;
Iteration:

    d_{i,j,k} = min { d_{i−1,j−1,k−1} + δ(u_i, v_j, w_k),
                      d_{i−1,j−1,k}   + δ(u_i, v_j, ε),
                      d_{i−1,j,k−1}   + δ(u_i, ε, w_k),
                      d_{i−1,j,k}     + δ(u_i, ε, ε),
                      d_{i,j−1,k−1}   + δ(ε, v_j, w_k),
                      d_{i,j−1,k}     + δ(ε, v_j, ε),
                      d_{i,j,k−1}     + δ(ε, ε, w_k) },

    0 ≤ i ≤ l, 0 ≤ j ≤ m, 0 ≤ k ≤ n;

End: (i = l) ∧ (j = m) ∧ (k = n).
The computation requires O(lmn) time and space. The path in the
distance table that leads from d_{0,0,0} to d_{l,m,n} defines the generalized
median string p̄, with d_{l,m,n} being the consensus error E_S(p̄). Note that
a generalization to arbitrary N is straightforward. If the strings of S are
of length O(n), both the time and space complexity amount to O(n^N) in
this case.
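For N = 3 and unit edit costs, the dynamic programme above can be sketched as follows (a sketch; tie-breaking in δ and along the path is arbitrary, so any one of several optimal medians may be returned):

```python
import itertools

def generalized_median_3(u, v, w):
    """Exact generalized median of three strings under unit edit costs,
    via the three-dimensional dynamic programme sketched above."""
    alphabet = set(u) | set(v) | set(w)
    EPS = ""                                   # the empty symbol ε

    def cost(sym, r):                          # c(sym -> r), unit costs
        return 0 if sym == r else 1

    def delta(r1, r2, r3):
        """Voting function δ: best symbol in Σ ∪ {ε} and its cost."""
        best = min(alphabet | {EPS},
                   key=lambda s: cost(s, r1) + cost(s, r2) + cost(s, r3))
        return cost(best, r1) + cost(best, r2) + cost(best, r3), best

    l, m, n = len(u), len(v), len(w)
    d = {(0, 0, 0): 0}
    back = {}                                  # backpointers for the path
    moves = [(1, 1, 1), (1, 1, 0), (1, 0, 1), (1, 0, 0),
             (0, 1, 1), (0, 1, 0), (0, 0, 1)]
    # lexicographic order guarantees every predecessor cell comes first
    for i, j, k in itertools.product(range(l + 1), range(m + 1), range(n + 1)):
        if (i, j, k) == (0, 0, 0):
            continue
        best_cost, best_cell, best_sym = float("inf"), None, None
        for di, dj, dk in moves:
            pi, pj, pk = i - di, j - dj, k - dk
            if pi < 0 or pj < 0 or pk < 0:
                continue
            step, sym = delta(u[pi] if di else EPS,
                              v[pj] if dj else EPS,
                              w[pk] if dk else EPS)
            if d[pi, pj, pk] + step < best_cost:
                best_cost = d[pi, pj, pk] + step
                best_cell, best_sym = (pi, pj, pk), sym
        d[i, j, k] = best_cost
        back[i, j, k] = (best_cell, best_sym)
    # follow the optimal path back, collecting the non-ε voted symbols
    median, cell = [], (l, m, n)
    while cell != (0, 0, 0):
        cell, sym = back[cell]
        if sym != EPS:
            median.append(sym)
    return "".join(reversed(median)), d[l, m, n]
```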
Despite its mathematical elegance, the exact algorithm above is
impractical because of its exponential complexity. There have been efforts
to shorten the computation time using heuristics or domain-specific
knowledge. Such an approach from [24] assumes that the strings of S are quite
similar. Under reasonable constraints on the cost function (c(a → ε) =
c(ε → a) = 1 and c(a → b) nonnegative), the generalized median string
p̄ satisfies E_S(p̄) ≤ k with k being a small number. In this case the optimal
dynamic programming path must be close to the main diagonal of
the distance table. Therefore only part of the N-dimensional table needs
to be considered. Details of the restricted search can be found in [24]. Its
asymptotic time complexity is O(nk^N). While this remains exponential, k is
typically much smaller than n, resulting in a substantial speedup compared
to the full search of the original algorithm [21].
We may also use any domainspeciﬁc knowledge to limit the search
space. An example is the approach in the context of classiﬁer combination
for handwritten sentence recognition [25]. An ensemble of classifiers provides
multiple classification results for a scanned text. Then, the consensus string
is expected to yield the best overall recognition performance. The input
strings from the individual classiﬁers are associated with additional infor
mation of position, i.e. the location of each individual word in a sequence of
handwritten words. Obviously, it is very unlikely that a word at the begin
ning of a sequence corresponds to a word at the end of another sequence.
More generally, only words at a similar position in the text image are mean
ingful candidates for being matched to each other. The work [25] makes use
of this observation to exclude a large portion of the full Ndimensional
search space from consideration.
5.2. Approximate Algorithms
Because of the NPhardness of generalized median string computation,
eﬀorts have been undertaken to develop approximate approaches which
provide suboptimal solutions in reasonable time. In this section we will
discuss several algorithms of this class.
5.2.1. Greedy Algorithms
The following algorithm was proposed by Casacuberta and de Antoni
[4]. Starting from an empty string, a greedy algorithm constructs an
    p̄_0 = ε;
    for (l = 1; ; l++) {
        a_l = arg min_{a ∈ Σ} E_S(p̄_{l−1} a);
        p̄_l = p̄_{l−1} a_l;
        if (termination criterion fulfilled) break;
    }
    choose prefix of p̄_l for output;

Fig. 1. General framework of greedy algorithms.
approximate generalized median string p̄ symbol by symbol. When we are
going to generate the l-th symbol a_l (l ≥ 1), the substring a_1 · · · a_{l−1} (ε for
l = 1) has already been determined. Then, each symbol from Σ is considered
as a candidate for a_l. All the candidates are evaluated and the final decision
on a_l is made by selecting the best candidate. The process is continued until
a termination criterion is fulfilled.
A general framework of greedy algorithms is given in Figure 1. There
are several possible choices for the termination criterion and the prefix.
The greedy algorithm proposed in [4] stops the iterative construction
process when E_S(p̄_l) > E_S(p̄_{l−1}). Then, p̄_{l−1} is regarded as the
approximate generalized median string. Alternatively, the author of [22] suggests
the termination criterion l = max_{p ∈ S} |p|. The output is the prefix of p̄_l
with the smallest consensus error relative to S. For both variants a suitable
data structure [4] enables a time complexity of O(n²N|Σ|) for the Levenshtein
distance. The space complexity amounts to O(nN|Σ|).
In the general framework in Figure 1 nothing is stated about how to
select a_l if there exist multiple symbols from Σ with the same value of
E_S(p̄_{l−1} a). Besides a simple random choice, the history of the selection
process can be taken into account to make a more reliable decision [22].
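The greedy framework of Figure 1, with the stopping rule of [4], can be sketched as follows (a sketch using the unit-cost Levenshtein distance; the efficient data structure of [4] behind the O(n²N|Σ|) bound is not reproduced, so each consensus error is recomputed from scratch):

```python
def levenshtein(a, b):
    """Unit-cost Levenshtein distance (rolling-array helper)."""
    d = list(range(len(b) + 1))
    for i in range(1, len(a) + 1):
        prev, d[0] = d[0], i
        for j in range(1, len(b) + 1):
            prev, d[j] = d[j], min(prev + (a[i - 1] != b[j - 1]),
                                   d[j] + 1, d[j - 1] + 1)
    return d[len(b)]

def greedy_median(strings, dist=levenshtein):
    """Greedy approximate generalized median: extend one symbol at a
    time, stopping when the consensus error no longer improves."""
    alphabet = sorted({s for x in strings for s in x})
    consensus = lambda p: sum(dist(p, q) for q in strings)
    median = ""
    err = consensus(median)
    while True:
        a = min(alphabet, key=lambda s: consensus(median + s))
        new_err = consensus(median + a)
        if new_err > err:              # E_S(p_l) > E_S(p_{l-1}): stop
            return median, err
        median, err = median + a, new_err
```

Ties among candidate symbols are broken here by alphabetical order, which is the "simple random choice" slot of the framework.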
5.2.2. Genetic Search
Genetic search techniques are motivated by concepts from the theory of
biological evolution. They are general-purpose optimization methods that have
been successfully applied to a large number of complex tasks. In [17] two
genetic algorithms are proposed to construct generalized median strings.
The ﬁrst approach is based on a straightforward representation of strings
in terms of chromosomes by one-dimensional arrays of varying lengths
consisting of symbols from Σ. The string "median", for instance, is coded as
the chromosome

    m e d i a n
Given a set S of N input strings of length O(n) each, the fitness
function is simply defined by the consensus error of a chromosome string.
Roulette wheel selection is used to create offspring. For crossover the
single-point operator is applied. With a mutation probability, either deletion,
insertion or substitution is performed at each position of a chromosome.
The input strings are used as part of the initial population. The rest of
the initial population is filled by randomly selecting some input strings and
applying the mutation operator to them. The population evolution process
terminates if one of the following two criteria is fulfilled. The first criterion
is that the maximal number of generations has been reached. Second,
if the population becomes very homogeneous, the process terminates as well.
The homogeneity is measured by means of the average and the variance of
the fitness values of a population. If their product is smaller than a
prespecified threshold, the population is considered homogeneous enough to
stop the evolution. Let P denote the population size and p the replacement
percentage. When using the Levenshtein edit distance, the time complexity
of this genetic algorithm amounts to O(Nn²pP) per generation.
The genetic algorithm above requires quite a large number of string dis
tance computations due to the ﬁtness function evaluation in each iteration.
For better computational eﬃciency a second genetic algorithm is designed
using a more elaborate chromosome coding scheme of strings; see [17] for
details. The time complexity becomes O(NnmpP), with m ≪ n, implying a
substantial speedup.
5.2.3. PerturbationBased Iterative Reﬁnement
The set median represents an approximation of the generalized median
string. The greedy algorithms and the genetic search techniques also give
approximate solutions. An approximate solution ¯ p can be further improved
by an iterative process of systematic perturbations. This idea was first
suggested in [20], but no algorithmic details are specified there. A concrete
algorithm for realizing systematic perturbations is given in [26]. For each
position i, the following operations are performed:

(i) Build perturbations:
    • Substitution: Replace the i-th symbol of p̄ by each symbol of Σ in
      turn and choose the resulting string x with the smallest consensus
      error relative to S.
    • Insertion: Insert each symbol of Σ in turn at the i-th position of p̄
      and choose the resulting string y with the smallest consensus error
      relative to S.
    • Deletion: Delete the i-th symbol of p̄ to generate z.
(ii) Replace p̄ by the string from {p̄, x, y, z} with the smallest consensus
     error relative to S.

For the Levenshtein distance one global iteration that handles all
positions of the initial p̄ needs O(n³N|Σ|) time. The process is repeated until
no further improvement is possible.
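One global iteration of this perturbation scheme can be sketched as follows (a sketch in the spirit of [26]; it greedily keeps any improving variant rather than taking the per-position argmin over {p̄, x, y, z}, which has the same effect of never increasing the consensus error):

```python
def levenshtein(a, b):
    """Unit-cost Levenshtein distance (same helper as in the sketches above)."""
    d = list(range(len(b) + 1))
    for i in range(1, len(a) + 1):
        prev, d[0] = d[0], i
        for j in range(1, len(b) + 1):
            prev, d[j] = d[j], min(prev + (a[i - 1] != b[j - 1]),
                                   d[j] + 1, d[j - 1] + 1)
    return d[len(b)]

def refine_median(p, strings, alphabet, dist=levenshtein):
    """One global pass of perturbation-based refinement: at every
    position of the initial string, try insertions, substitutions and a
    deletion, keeping any variant with a smaller consensus error."""
    consensus = lambda x: sum(dist(x, q) for q in strings)
    best, best_err = p, consensus(p)
    for i in range(len(p) + 1):
        # insertion at position i
        candidates = [best[:i] + s + best[i:] for s in alphabet]
        if i < len(best):
            # substitution at position i, deletion of position i
            candidates += [best[:i] + s + best[i + 1:] for s in alphabet]
            candidates.append(best[:i] + best[i + 1:])
        for cand in candidates:
            e = consensus(cand)
            if e < best_err:
                best, best_err = cand, e
    return best, best_err
```

As in the text, the pass would be repeated (e.g. starting from the set median) until no further improvement is possible.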
5.3. Dynamic Computation of Generalized Median Strings
In a dynamic context we are faced with the situation of a steady arrival of new data items, represented by strings. At each point of time, t, the set S_t of existing strings is incremented by a new string, resulting in S_{t+1}, and its generalized median string is to be computed. Doubtlessly, a trivial solution consists in applying any of the approaches discussed above to S_{t+1}. By doing this, however, we compute the generalized median string of S_{t+1} from scratch without utilizing any knowledge about S_t, in particular its generalized median string. All algorithms for computing generalized median strings are of such a static nature and thus not optimal in a dynamic context. The work [18] proposes a genuinely dynamic approach, in which the update scheme only considers the generalized median string of S_t together with the new data item, but not the individual members of S_t.
The inspiration for the algorithm comes from a fundamental fact in real space. Under the distance function d(p_i, p_j) = (p_i − p_j) · (p_i − p_j), i.e. the squared Euclidean distance of p_i and p_j, the generalized median of a given set S_t = {p_1, p_2, ..., p_t} of t points is the well-known mean:

    p̄_t = (1/t) · Σ_{i=1}^{t} p_i.
When an additional point p_{t+1} is added to S_t, the resultant new set S_{t+1} = S_t ∪ {p_{t+1}} has the generalized median

    p̄_{t+1} = (1/(t+1)) · Σ_{i=1}^{t+1} p_i = (t/(t+1)) · p̄_t + (1/(t+1)) · p_{t+1},

which is the so-called weighted mean of p̄_t and p_{t+1}, satisfying

    d(p̄_{t+1}, p̄_t) = (1/(t+1)) · d(p̄_t, p_{t+1}),
    d(p̄_{t+1}, p_{t+1}) = (t/(t+1)) · d(p̄_t, p_{t+1}).
Geometrically, p̄_{t+1} is a point on the straight line segment connecting p̄_t and p_{t+1} that has distances 1/(t+1) · d(p̄_t, p_{t+1}) and t/(t+1) · d(p̄_t, p_{t+1}) to p̄_t and p_{t+1}, respectively.
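The real-space fact is easy to verify in code: maintaining the mean via the weighted-mean update reproduces the batch mean exactly, without ever revisiting the old points (a minimal sketch; names are ours):

```python
def dynamic_mean(points):
    """Median (= mean) of a growing point set, maintained with the
    weighted-mean update: new_mean = t/(t+1)*mean + 1/(t+1)*new_point."""
    mean = None
    for t, p in enumerate(points):      # t = number of points seen so far
        if mean is None:
            mean = list(p)
        else:
            w = 1.0 / (t + 1)           # t + 1 = size of the new set
            mean = [(1 - w) * m + w * x for m, x in zip(mean, p)]
    return mean
```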
On a heuristic basis the special case in real space can be extended to the domain of strings. Given a set S_t = {p_1, p_2, ..., p_t} of t strings and its generalized median p̄_t, the generalized median of a new set S_{t+1} = S_t ∪ {p_{t+1}} is estimated by a weighted mean of p̄_t and p_{t+1}, i.e. by a string p̄_{t+1} such that
    d(p̄_{t+1}, p̄_t) = α · d(p̄_t, p_{t+1}),
    d(p̄_{t+1}, p_{t+1}) = (1 − α) · d(p̄_t, p_{t+1}),

where α ∈ [0, 1]. In real space α takes the value 1/(t + 1). For strings, however, α cannot be specified in advance. Therefore, we resort to a search procedure. Remember that our goal is to find p̄_{t+1} that minimizes the consensus error relative to S_{t+1}. To determine the optimal α value, a series of α values 0, 1/k, ..., (k − 1)/k is probed and the α value that results in the smallest consensus error is chosen.
The dynamic algorithm uses the method described in [3] for computing
the weighted mean of two strings. It is an extension of the Levenshtein
distance computation [38].
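The following sketch conveys the idea under strong simplifications that are our own: unit edit costs, and a weighted mean obtained by applying a prefix of one cheapest edit script (the actual method of [3] is more refined). All names are ours, and the consensus error used to select α is evaluated against the full new set:

```python
def edit_ops(x, y):
    """Cheapest edit script from x to y under unit costs (Levenshtein DP
    plus backtrace).  Returns (ops, distance), where ops is a list of
    ('match', a), ('sub', a, b), ('del', a) and ('ins', b) steps."""
    n, m = len(x), len(y)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        D[i][0] = i
    for j in range(m + 1):
        D[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = min(D[i - 1][j] + 1,        # deletion
                          D[i][j - 1] + 1,        # insertion
                          D[i - 1][j - 1] + (x[i - 1] != y[j - 1]))
    ops, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and D[i][j] == D[i - 1][j - 1] + (x[i - 1] != y[j - 1]):
            ops.append(('match', x[i - 1]) if x[i - 1] == y[j - 1]
                       else ('sub', x[i - 1], y[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and D[i][j] == D[i - 1][j] + 1:
            ops.append(('del', x[i - 1]))
            i -= 1
        else:
            ops.append(('ins', y[j - 1]))
            j -= 1
    ops.reverse()
    return ops, D[n][m]

def weighted_mean(x, y, alpha):
    """String 'alpha of the way' from x to y: apply edit operations along
    one cheapest path until cost alpha * d(x, y) has been spent."""
    ops, d = edit_ops(x, y)
    budget, spent, out = alpha * d, 0, []
    for op in ops:
        if op[0] == 'match':
            out.append(op[1])
        elif spent + 1 <= budget + 1e-9:     # apply the y-side outcome
            spent += 1
            if op[0] == 'sub':
                out.append(op[2])
            elif op[0] == 'ins':
                out.append(op[1])
            # applied 'del': emit nothing
        else:                                # keep the x-side outcome
            if op[0] in ('sub', 'del'):
                out.append(op[1])
            # unapplied 'ins': emit nothing
    return ''.join(out)

def dynamic_update(median, new_string, S_new, k=10):
    """Estimate the median of S_new = S_old ∪ {new_string} by probing
    alpha = 0, 1/k, ..., 1 and keeping the best weighted mean."""
    def consensus(p):
        return sum(edit_ops(p, q)[1] for q in S_new)
    cands = [weighted_mean(median, new_string, i / k) for i in range(k + 1)]
    return min(cands, key=consensus)
```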
6. Experimental Evaluation
In this section we report some experimental results to demonstrate the median string concept and to compare some of the computational procedures described above. The data used are online handwritten digits from a subset¹ of the UNIPEN database [14]. An online handwritten digit is a time
¹ Available at ftp://ftp.ics.uci.edu/pub/machine-learning-databases/pendigits/.
sequence of 2D points recorded from a person's writing on a special tablet. Each digit is originally given as a sequence d = (x_1, y_1, t_1), ..., (x_k, y_k, t_k) of points in the xy-plane, where t_k is the time of recording the kth point during writing a digit. In order to transform such a sequence of points into a string, we first resample the given data points such that the distance between any consecutive pair of points has a constant value ∆. That is, d is transformed into a sequence d′ = (x̄_1, ȳ_1), ..., (x̄_l, ȳ_l) where ‖(x̄_{i+1}, ȳ_{i+1}) − (x̄_i, ȳ_i)‖ = ∆ for i = 1, ..., l − 1. Then, a string s = a_1 a_2 ··· a_{l−1} is generated from sequence d′ where a_i is the vector pointing from (x̄_i, ȳ_i) to (x̄_{i+1}, ȳ_{i+1}). In our experiments we fixed ∆ = 7 so that the digit strings have between 40 and 100 points.
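This preprocessing can be sketched as follows (our own illustrative code, not taken from the chapter's experiments; the exact resampling used there may differ in details such as endpoint handling):

```python
import math

def resample(points, delta):
    """Points along the polyline at constant arc-length spacing delta."""
    out = [points[0]]
    residual = delta                  # distance left to the next sample
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        seg = math.hypot(x1 - x0, y1 - y0)
        pos = 0.0                     # arc length consumed on this segment
        while seg - pos >= residual - 1e-12:
            pos += residual
            t = pos / seg
            out.append((x0 + t * (x1 - x0), y0 + t * (y1 - y0)))
            residual = delta
        residual -= seg - pos
    return out

def to_vector_string(points, delta=7.0):
    """Encode a pen trajectory as the string of displacement vectors
    between consecutive resampled points."""
    pts = resample(points, delta)
    return [(b[0] - a[0], b[1] - a[1]) for a, b in zip(pts, pts[1:])]
```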
The costs of the edit operations are defined as follows: c(a → ε) = c(ε → a) = ‖a‖ = ∆, c(a → b) = ‖a − b‖. Notice that the minimum cost of a substitution is equal to zero (if and only if a = b), while the maximum cost is 2∆. The latter case occurs if a and b are parallel and have opposite direction.
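In code, assuming each symbol is a 2D displacement vector of length ∆, these costs read as follows (a direct transcription of the definitions above; function names are ours):

```python
import math

DELTA = 7.0  # resampling distance; every symbol is a vector of length DELTA

def c_indel(a):
    """Cost of inserting or deleting symbol a: its length, i.e. DELTA."""
    return math.hypot(a[0], a[1])

def c_sub(a, b):
    """Substitution cost: length of the difference vector a - b.
    It is 0 iff a == b and reaches 2*DELTA for opposite vectors."""
    return math.hypot(a[0] - b[0], a[1] - b[1])
```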
We conducted experiments using 10 samples each of digits 1, 2 and 3, and 98 samples of digit 6; see Figure 2 for an illustration (for space reasons only ten samples of digit 6 are shown there). The results of four algorithms are shown in Figure 3: the genetic algorithm [17] (GA, Section 5.2.2), the dynamic algorithm [18] (Section 5.3), the greedy algorithm [4] (Section 5.2.1), and the set median. For comparison purposes the consensus error E_S and the computation time t in seconds on a SUN Ultra-60 workstation are also listed in Figure 3. Note that the consensus errors of digit 6 are substantially larger than those of the other digits because of the definition of consensus error as the sum, not the average, of the distances to all input samples.
The best results are achieved by GA, followed by the dynamic approach. Except for digit 1, the greedy algorithm reveals some weakness. Looking at the medians for digits 2, 3 and 6, it seems that the iterative process terminates too early, resulting in a string (digit) much shorter than it should be. The reason lies in the simple termination criterion defined in [4]. It works well for the (short) words used there, but obviously encounters difficulties in dealing with the longer strings occurring in our study. At first glance, the dynamic approach needs more computation time than the greedy algorithm. But one has to take into account that the recorded time is the total time of the dynamic process of adding one sample to the existing set each time, starting from a set consisting of the first sample. Therefore, a total of 9 (97) generalized medians have effectively been computed for digits 1/2/3 (6). Taking the actual computation time for each update step of the whole
Fig. 2. Samples of digits 1, 2, 3 and 6, respectively.
dynamic process into account, the dynamic algorithm is attractive compared to static ones such as the greedy algorithm. If we use the static greedy algorithm to handle the same dynamic process, then the computation for digit 3, for instance, would require 46.10 seconds in contrast to 22.8 seconds for the dynamic approach. Considering computational efficiency alone, the set median is the best among the compared methods. But the generalized median string is a more precise abstraction of the given set of strings, with a smaller consensus error.
Fig. 3. Median computation of digits.
7. Discussion and Conclusion
In this chapter we have reviewed the concept of median string. In particular, we have briefly described several procedures for computing median strings. Experimental results were reported to demonstrate the median concept in dealing with a special case of time series, namely online handwritten digits, and to compare some of the discussed algorithms.
Table 1. Characteristics of median computation algorithms.

Algorithm (original paper)                      | String distance    | Extension to arbitrary distance | Handling weighted median | Handling center median
------------------------------------------------+--------------------+---------------------------------+--------------------------+-----------------------
Exact algorithm and its variants [21,24,25]     | Levenshtein        | No                              | Yes                      | No
Greedy algorithms [4,22]                        | Levenshtein        | Yes                             | Yes                      | Yes
Genetic algorithm [17]                          | Levenshtein        | Yes                             | Yes                      | Yes
Dynamic algorithm [18]                          | Levenshtein        | No                              | Yes                      | No
Perturbation-based iterative refinement [20,26] | Arbitrary distance | N/A                             | Yes                      | Yes
The majority of the algorithms described in this chapter are based on the Levenshtein edit distance. The algorithms' applicability to an arbitrary string distance function is summarized in Table 1. Note that an extension to an arbitrary string distance function usually means a computational complexity different from that for the Levenshtein edit distance.
In the definition of median string, all the input strings have a uniform weight of one. If necessary, this basic definition can be easily extended to the weighted median string to model the situation where each string has an individual importance, confidence, etc. Given the weights w_q, q ∈ S, the weighted generalized median string is simply

    p̄ = arg min_{p∈U} Σ_{q∈S} w_q · d(p, q).

All the computational procedures discussed before can be modified to handle this extension in a straightforward manner.
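Restricting the search from the whole universe U to the members of S themselves yields the weighted analogue of the set median, which any of the procedures above can use as a starting point. A minimal sketch (our own names; the distance function is passed in as a parameter):

```python
def weighted_set_median(S, w, dist):
    """Member of S minimizing the weighted consensus error
    sum_i w[i] * dist(p, S[i]).  Searching over S instead of the whole
    universe U gives the usual set-median style approximation."""
    def weighted_consensus(p):
        return sum(w[i] * dist(p, q) for i, q in enumerate(S))
    return min(S, key=weighted_consensus)
```

Raising the weight of one element pulls the median towards it, which is exactly the intended modeling of individual importance or confidence.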
The generalized median string represents one way of capturing the essential characteristics of a set of strings. There do exist other possibilities. One example is the so-called center string [15] defined by:

    p* = arg min_{p∈U} max_{q∈S} d(p, q).

It is important to note that the same term is used in [13] to denote the set median string. Under the two conditions given in Section 3, it is proved in [15] that computing the center string is NP-hard. Another result is given in [7], where the NP-hardness of the center string problem is proved for the special case of a binary alphabet (i.e., Σ = {0, 1}) and the Hamming string distance. The ability of the algorithms to compute the center string is summarized in Table 1.
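Since the exact problem is NP-hard, a common practical fallback is to restrict the minimization to a finite candidate set. The following sketch (our own names) uses the Hamming distance from the binary-alphabet special case discussed above:

```python
def hamming(a, b):
    """Hamming distance between equal-length strings."""
    assert len(a) == len(b)
    return sum(ca != cb for ca, cb in zip(a, b))

def center_of_candidates(S, dist, candidates=None):
    """Approximate center string: among the candidates (by default the
    members of S themselves) pick the one minimizing the maximum
    distance to the set.  Contrast with the median, which minimizes
    the sum of distances; the exact problem over the whole universe U
    is NP-hard."""
    if candidates is None:
        candidates = S
    return min(candidates, key=lambda p: max(dist(p, q) for q in S))
```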
Another issue of general interest concerns cyclic strings. Several methods have been proposed to efficiently compute the Levenshtein distance of cyclic strings [10,11,29]. It remains an open problem to determine medians of such strings.
Acknowledgments
The authors would like to thank K. Abegglen for her contribution to parts
of the chapter.
References
1. Apostolico, A. and Galil, Z. (eds.) (1997). Pattern Matching Algorithms, Oxford University Press.
2. Bunke, H. (1992). Recent Advances in String Matching, in H. Bunke (ed.), Advances in Structural and Syntactic Pattern Recognition, World Scientific, pp. 3–21.
3. Bunke, H., Jiang, X., Abegglen, K., and Kandel, A. (2002). On the Weighted Mean of a Pair of Strings. Pattern Analysis and Applications, 5(1), 23–30.
4. Casacuberta, F. and de Antoni, M.D. (1997). A Greedy Algorithm for Computing Approximate Median Strings. Proc. of National Symposium on Pattern Recognition and Image Analysis, pp. 193–198, Barcelona, Spain.
5. Crochemore, M. and Rytter, W. (1994). Text Algorithms, Oxford University Press.
6. Fagin, R. and Stockmeyer, L. (1998). Relaxing the Triangle Inequality in Pattern Matching. Int. Journal on Computer Vision, 28(3), 219–231.
7. Frances, M. and Litman, A. (1997). On Covering Problems of Codes. Theory of Computing Systems, 30(2), 113–119.
8. Fred, A.L.N. and Leitão, J.M.N. (1998). A Comparative Study of String Dissimilarity Measures in Structural Clustering. Proc. of Int. Conf. on Document Analysis and Recognition, pp. 385–394.
9. Gramkow, C. (2001). On Averaging Rotations. Int. Journal on Computer Vision, 42(1/2), 7–16.
10. Gregor, J. and Thomason, M.G. (1993). Dynamic Programming Alignment of Sequences Representing Cyclic Patterns. IEEE Trans. on Pattern Analysis and Machine Intelligence, 15(2), 129–135.
11. Gregor, J. and Thomason, M.G. (1996). Efficient Dynamic Programming Alignment of Cyclic Strings by Shift Elimination. Pattern Recognition, 29(7), 1179–1185.
12. Guimond, A., Meunier, J., and Thirion, J.P. (2000). Average Brain Models: A Convergence Study. Computer Vision and Image Understanding, 77(2), 192–210.
13. Gusfield, D. (1997). Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology, Cambridge University Press.
14. Guyon, I., Schomaker, L., Plamondon, R., Liberman, M., and Janet, S. (1994). UNIPEN Project on On-Line Exchange and Recognizer Benchmarks. Proc. of 12th Int. Conf. on Pattern Recognition, pp. 29–33.
15. de la Higuera, C. and Casacuberta, F. (2000). Topology of Strings: Median String is NP-Complete. Theoretical Computer Science, 230(1/2), 39–48.
16. Jiang, X., Schiffmann, L., and Bunke, H. (2000). Computation of Median Shapes. Proc. of 4th Asian Conf. on Computer Vision, pp. 300–305, Taipei.
17. Jiang, X., Abegglen, K., and Bunke, H. (2001). Genetic Computation of Generalized Median Strings. (Submitted for publication.)
18. Jiang, X., Abegglen, K., Bunke, H., and Csirik, J. (2002). Dynamic Computation of Generalized Median Strings. Pattern Analysis and Applications. (Accepted for publication.)
19. Juan, A. and Vidal, E. (1998). Fast Median Search in Metric Spaces, in A. Amin and D. Dori (eds.), Advances in Pattern Recognition, pp. 905–912, Springer-Verlag.
20. Kohonen, T. (1985). Median Strings. Pattern Recognition Letters, 3, 309–313.
21. Kruskal, J.B. (1983). An Overview of Sequence Comparison: Time Warps, String Edits, and Macromolecules. SIAM Review, 25(2), 201–237.
22. Kruzslicz, F. (1999). Improved Greedy Algorithm for Computing Approximate Median Strings. Acta Cybernetica, 14, 331–339.
23. Lewis, T., Owens, R., and Baddeley, A. (1999). Averaging Feature Maps. Pattern Recognition, 32(9), 1615–1630.
24. Lopresti, D. and Zhou, J. (1997). Using Consensus Sequence Voting to Correct OCR Errors. Computer Vision and Image Understanding, 67(1), 39–47.
25. Marti, V. and Bunke, H. (2001). Use of Positional Information in Sequence Alignment for Multiple Classifier Combination, in J. Kittler and F. Roli (eds.), Multiple Classifier Combination, pp. 388–398, Springer-Verlag.
26. Martinez-Hinarejos, C.D., Juan, A., and Casacuberta, F. (2000). Use of Median String for Classification. Proc. of Int. Conf. on Pattern Recognition, pp. 907–910.
27. Martinez-Hinarejos, C.D., Juan, A., and Casacuberta, F. (2001). Improving Classification Using Median String and NN Rules. Proc. of National Symposium on Pattern Recognition and Image Analysis.
28. Marzal, A. and Vidal, E. (1993). Computation of Normalized Edit Distance and Applications. IEEE Trans. on PAMI, 15(9), 926–932.
29. Marzal, A. and Barrachina, S. (2000). Speeding Up the Computation of the Edit Distance for Cyclic Strings. Proc. of Int. Conf. on Pattern Recognition, 2, pp. 895–898.
30. Mico, L. and Oncina, J. (2001). An Approximate Median Search Algorithm in Non-Metric Spaces. Pattern Recognition Letters, 22(10), 1145–1151.
31. O'Toole, A.J., Price, T., Vetter, T., Barlett, J.C., and Blanz, V. (1999). 3D Shape and 2D Surface Textures of Human Faces: The Role of "Averages" in Attractiveness and Age. Image and Vision Computing, 18(1), 9–19.
32. Pennec, X. and Ayache, N. (1998). Uniform Distribution, Distance and Expectation Problems for Geometric Features Processing. Journal of Mathematical Imaging and Vision, 9(1), 49–67.
33. Sanchez, G., Llados, J., and Tombre, K. (2002). A Mean String Algorithm to Compute the Average Among a Set of 2D Shapes. Pattern Recognition Letters, 23(1–3), 203–213.
34. Sankoff, D. and Kruskal, J.B. (eds.) (1983). Time Warps, String Edits, and Macromolecules: The Theory and Practice of String Comparison, Addison-Wesley.
35. Sim, J.S. and Park, K. (2001). The Consensus String Problem for a Metric is NP-Complete. Journal of Discrete Algorithms, 2(1).
36. Stephen, G.A. (1994). String Searching Algorithms, World Scientific.
37. Subramanyan, K. and Dean, D. (2000). A Procedure to Average 3D Anatomical Structures. Medical Image Analysis, 4(4), 317–334.
38. Wagner, R.A. and Fischer, M.J. (1974). The String-to-String Correction Problem. Journal of the ACM, 21(1), 168–173.
39. Wang, L. and Jiang, T. (1994). On the Complexity of Multiple Sequence Alignment. Journal of Computational Biology, 1(4), 337–348.
DATA MINING IN TIME SERIES DATABASES

Editors

Mark Last
Ben-Gurion University of the Negev, Israel

Abraham Kandel
Tel-Aviv University, Israel
University of South Florida, Tampa, USA

Horst Bunke
University of Bern, Switzerland

World Scientific
New Jersey • London • Singapore • Beijing • Shanghai • Hong Kong • Taipei • Chennai
Published by World Scientific Publishing Co. Pte. Ltd.
5 Toh Tuck Link, Singapore 596224
USA office: 27 Warren Street, Suite 401-402, Hackensack, NJ 07601
UK office: 57 Shelton Street, Covent Garden, London WC2H 9HE

British Library Cataloguing-in-Publication Data
A catalogue record for this book is available from the British Library.

DATA MINING IN TIME SERIES DATABASES
Series in Machine Perception and Artificial Intelligence (Vol. 57)

Copyright © 2004 by World Scientific Publishing Co. Pte. Ltd.
All rights reserved. This book, or parts thereof, may not be reproduced in any form or by any means, electronic or mechanical, including photocopying, recording or any information storage and retrieval system now known or to be invented, without written permission from the Publisher.

For photocopying of material in this volume, please pay a copying fee through the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, USA. In this case permission to photocopy is not required from the publisher.

ISBN 9812382909

Typeset by Stallion Press
Email: enquiries@stallionpress.com

Printed in Singapore by World Scientific Printers (S) Pte Ltd
Dedicated to
The Honorable Congressman C. W. Bill Young
House of Representatives
For his vision and continuous support in creating the National Institute for Systems Test and Productivity at the Computer Science and Engineering Department, University of South Florida
Preface

Traditional data mining methods are designed to deal with "static" databases, i.e. databases where the ordering of records (or other database objects) has nothing to do with the patterns of interest. Though the assumption of order irrelevance may be sufficiently accurate in some applications, there are certainly many other cases, where sequential information, such as a timestamp associated with every record, can significantly enhance our knowledge about the mined data. One example is a series of stock values: a specific closing price recorded yesterday has a completely different meaning than the same value a year ago. Since most today's databases already include temporal data in the form of "date created", "date modified", and other time-related fields, the only problem is how to exploit this valuable information to our benefit. In other words, the question we are currently facing is: How to mine time series data?

Adding the time dimension to a database produces a Time Series Database (TSDB) and introduces new aspects and challenges to the tasks of data mining and knowledge discovery. These new challenges include: finding the most efficient representation of time series data, measuring similarity of time series, detecting change points in time series, and time series classification and clustering. Some of these problems have been treated in the past by experts in time series analysis. However, statistical methods of time series analysis are focused on sequences of values representing a single numeric variable (e.g., price of a specific stock). In a real-world database, a timestamped record may include several numerical and nominal attributes, which may depend not only on the time dimension but also on each other. To make the data mining task even more complicated, the objects in a time series may represent some complex graph structures rather than vectors of feature-values.

The purpose of this volume is to present some recent advances in preprocessing, mining, and interpretation of temporal data that is stored by modern information systems.
Our book covers the state-of-the-art research in several areas of time series data mining. Specific problems challenged by the authors of this volume are as follows.

Representation of Time Series. Efficient and effective representation of time series is a key to successful discovery of time-related patterns. The most frequently used representation of single-variable time series is piecewise linear approximation, where the original points are reduced to a set of straight lines ("segments"). Chapter 1 by Eamonn Keogh, Selina Chu, David Hart, and Michael Pazzani provides an extensive and comparative overview of existing techniques for time series segmentation. In the view of shortcomings of existing approaches, the same chapter introduces an improved segmentation algorithm called SWAB (Sliding Window and Bottom-up).

Indexing and Retrieval of Time Series. Since each time series is characterized by a large, potentially unlimited number of points, finding two identical time series for any phenomenon is hopeless. Thus, researchers have been looking for sets of similar data sequences that differ only slightly from each other. The problem of retrieving similar series arises in many areas such as marketing and stock data analysis, meteorological studies, and medical diagnosis. An overview of current methods for efficient retrieval of time series is presented in Chapter 2 by Magnus Lie Hetland. Chapter 3 (by Eugene Fink and Kevin B. Pratt) presents a new method for fast compression and indexing of time series. A robust similarity measure for retrieval of noisy time series is described and evaluated by Michail Vlachos, Dimitrios Gunopulos, and Gautam Das in Chapter 4.

Change Detection in Time Series. The problem of change point detection in a sequence of values has been studied in the past, especially in the context of time series segmentation (see above). However, the nature of real-world time series may be much more complex, involving multivariate and even graph data. Chapter 5 (by Gil Zeira, Oded Maimon, Mark Last, and Lior Rokach) covers the problem of change detection in a classification model induced by a data mining algorithm from time series data. A change detection procedure for detecting abnormal events in time series of graphs is presented by Horst Bunke and Miro Kraetzl in Chapter 6. The procedure is applied to abnormal event detection in a computer network.

Classification of Time Series. Rather than partitioning a time series into segments, one can see each time series, or any other sequence of data points, as a single object. Classification and clustering of such complex
"objects" may be particularly beneficial for the areas of process control, intrusion detection, and character recognition. In Chapter 7, Carlos J. Alonso González and Juan J. Rodríguez Diez present a new method for early classification of multivariate time series. Their method is capable of learning from series of variable length and of providing a classification when only part of the series is presented to the classifier. A novel concept of representing time series by median strings (see Chapter 8, by Xiaoyi Jiang, Horst Bunke, and Janos Csirik) opens new opportunities for applying classification and clustering methods of data mining to sequential data.

As indicated above, the area of mining time series databases still includes many unexplored and insufficiently explored issues. Specific suggestions for future research can be found in individual chapters. In general, we believe that interesting and useful results can be obtained by applying the methods described in this book to real-world sets of sequential data.

Acknowledgments

The preparation of this volume was partially supported by the National Institute for Systems Test and Productivity at the University of South Florida under U.S. Space and Naval Warfare Systems Command grant number N000390112248. We also would like to acknowledge the generous support and cooperation of: Ben-Gurion University of the Negev, Department of Information Systems Engineering; University of South Florida, Department of Computer Science and Engineering, College of Engineering; Tel-Aviv University; The Fulbright Foundation; The US-Israel Educational Foundation.

January 2004

Mark Last
Abraham Kandel
Horst Bunke
Contents

Preface vii

Chapter 1 Segmenting Time Series: A Survey and Novel Approach
E. Keogh, S. Chu, D. Hart and M. Pazzani 1

Chapter 2 A Survey of Recent Methods for Efficient Retrieval of Similar Time Sequences
M. L. Hetland 23

Chapter 3 Indexing of Compressed Time Series
E. Fink and K. B. Pratt 43

Chapter 4 Indexing Time-Series under Conditions of Noise
M. Vlachos, D. Gunopulos and G. Das 67

Chapter 5 Change Detection in Classification Models Induced from Time Series Data
G. Zeira, O. Maimon, M. Last and L. Rokach 101

Chapter 6 Classification and Detection of Abnormal Events in Time Series of Graphs
H. Bunke and M. Kraetzl 127

Chapter 7 Boosting Interval-Based Literals: Variable Length and Early Classification
C. J. Alonso González and J. J. Rodríguez Diez 149

Chapter 8 Median Strings: A Review
X. Jiang, H. Bunke and J. Csirik 173
uci. This representation has been used by various researchers to support clustering. California 92697. indexing and association rule mining of time series data. Irvine. regression. data mining. Introduction In recent years. and Michael Pazzani Department of Information and Computer Science. We show that all these algorithms have fatal ﬂaws from a data mining perspective. Several high level
1
. representation of the data is the key to eﬃcient and eﬀective solutions. California 92521. Keywords: Time series. A variety of algorithms have been proposed to obtain this representation.
1. USA Email: eamonn@cs.edu In recent years. we undertake the ﬁrst extensive review and empirical comparison of all proposed techniques. In this chapter. University of California. Riverside. there has been an explosion of interest in mining time series databases.ucr. We introduce a novel algorithm that we empirically show to be superior to all others in the literature. University of California — Riverside. David Hart. pazzani}@ics.CHAPTER 1 SEGMENTING TIME SERIES: A SURVEY AND NOVEL APPROACH
Eamonn Keogh Computer Science & Engineering Department. USA Email: {selina. As with most computer science problems. As with most computer science problems. there has been an explosion of interest in mining time series databases. with several algorithms having been independently rediscovered several times. representation of the data is the key to eﬃcient and eﬀective solutions. One of the most commonly used representations is piecewise linear approximation. piecewise linear approximation. dhart.edu
Selina Chu. classiﬁcation. segmentation.
(1999). of length n. (2001). (1998). Keogh and Pazzani (1999). Last et al.
. Qu et al. Li et al.
representations of time series have been proposed. Symbolic Mappings [Agrawal et al. Speciﬁcally. (1995). (1998). (1995). Keogh. Vullings et al. (2001). Osaki et al. Chu. 1. Hunter and McIntosh (1999). In this work. (1998). (1999)] and relevance feedback [Keogh and Pazzani (1999)]. multiresolution queries [Wang and Wang (2000). (2000). (1997). including “fuzzy queries” [Shatkay (1995). Pazzani
(a)
(b)
Fig. Intuitively. Keogh and Smyth (1997). Keogh and Pazzani (1998). (2000)]. Koski et al. S. (2000)]. in the context of data mining. Shatkay (1995). Wavelets [Chan and Fu (1999)]. (1998)]. Lavrenko et al. with K straight lines (hereafter known as segments). (b) Electrocardiogram (ECG). Das et al. • Support novel distance measures for time series.2
E. Perng et al. Shatkay and Zdonik (1996)]. Hart and M. Figure 1 contains two examples. Piecewise Linear Representation refers to the approximation of a time series T . the piecewise linear representation has been used to: • Support fast exact similarly search [Keogh et al. transmission and computation of the data more eﬃcient. • Support change point detection [Sugiura and Ogden (1994). Park et al. Li et al. D. Two time series and their piecewise linear representation. Ge and Smyth (2001)]. Park et al. this representation makes the storage. (a) Space Shuttle Telemetry. Wang and Wang (2000)]. (1999). including Fourier Transforms [Agrawal et al. (2000)] and Piecewise Linear Representation (PLR). (1993). (2000)]. weighted queries [Keogh and Pazzani (1998)]. • Support novel clustering and classiﬁcation algorithms [Keogh and Pazzani (1998)]. Keogh et al. Because K is typically much smaller that n. dynamic time warping [Park et al. perhaps the most frequently used representation [Ge and Smyth (2001). • Support concurrent mining of text and time series [Lavrenko et al. Shatkay and Zdonik (1996). we conﬁne our attention to PLR.
Surprisingly, in spite of the ubiquity of this representation, with the exception of [Shatkay (1995)], there has been little attempt to understand and compare the algorithms that produce it. Indeed, there does not even appear to be a consensus on what to call such an algorithm. For clarity, we will refer to these types of algorithm, which input a time series and return a piecewise linear representation, as segmentation algorithms.

The segmentation problem can be framed in several ways:

• Given a time series T, produce the best representation using only K segments.
• Given a time series T, produce the best representation such that the maximum error for any segment does not exceed some user-specified threshold, max_error.
• Given a time series T, produce the best representation such that the combined error of all segments is less than some user-specified threshold, total_max_error.

As we shall see in later sections, not all algorithms can support all these specifications.

Segmentation algorithms can also be classified as batch or online. This is an important distinction because many data mining problems are inherently dynamic [Vullings et al. (1997), Koski et al. (1995)].

Data mining researchers, who needed to produce a piecewise linear approximation, have typically either independently rediscovered an algorithm or used an approach suggested in related literature, for example from the fields of cartography or computer graphics [Douglas and Peucker (1973), Heckbert and Garland (1997), Ramer (1972)].

In this chapter, we review the three major segmentation approaches in the literature and provide an extensive empirical evaluation on a very heterogeneous collection of datasets from finance, medicine, manufacturing and science. The major result of these experiments is that the only online algorithm in the literature produces very poor approximations of the data, and that the only algorithm that consistently produces high quality results and scales linearly in the size of the data is a batch algorithm. These results motivated us to introduce a new algorithm that scales linearly in the size of the data set, is online, and produces high quality approximations.

The rest of the chapter is organized as follows. In Section 2, we provide an extensive review of the algorithms in the literature; we explain the basic approaches and the various modifications and extensions by data miners. In Section 3, we provide a detailed empirical comparison of all the algorithms.
We will show that the most popular algorithms used by data miners can in fact produce very poor approximations of the data. The results will be used to motivate the need for a new algorithm that we will introduce and validate in Section 4. Section 5 offers conclusions and directions for future work.

2. Background and Related Work

In this section, we describe the three major approaches to time series segmentation in detail. Almost all the algorithms have 2 and 3 dimensional analogues, which ironically seem to be better understood. A discussion of the higher dimensional cases is beyond the scope of this chapter; we refer the interested reader to [Heckbert and Garland (1997)], which contains an excellent survey.

Although appearing under different names and with slightly different implementation details, most time series segmentation algorithms can be grouped into one of the following three categories:

• Sliding Windows: A segment is grown until it exceeds some error bound. The process repeats with the next data point not included in the newly approximated segment.
• Top-Down: The time series is recursively partitioned until some stopping criteria is met.
• Bottom-Up: Starting from the finest possible approximation, segments are merged until some stopping criteria is met.

Table 1 contains the notation used in this chapter.

Table 1. Notation.

    T                   A time series in the form t1, t2, ..., tn
    T[a : b]            The subsection of T from a to b, i.e. ta, ta+1, ..., tb
    Seg_TS              A piecewise linear approximation of a time series of length n
                        with K segments. Individual segments can be addressed with
                        Seg_TS(i).
    create_segment(T)   A function that takes in a time series and returns a linear
                        segment approximation of it.
    calculate_error(T)  A function that takes in a time series and returns the
                        approximation error of the linear segment approximation of it.

Given that we are going to approximate a time series with straight lines, there are at least two ways we can find the approximating line:
• Linear Interpolation: Here the approximating line for the subsequence T[a : b] is simply the line connecting ta and tb. This can be obtained in constant time.
• Linear Regression: Here the approximating line for the subsequence T[a : b] is taken to be the best fitting line in the least squares sense [Shatkay (1995)]. This can be obtained in time linear in the length of the segment.

The two techniques are illustrated in Figure 2. Linear interpolation tends to closely align the endpoint of consecutive segments, giving the piecewise approximation a "smooth" look. In contrast, piecewise linear regression can produce a very disjointed look on some datasets. The aesthetic superiority of linear interpolation, together with its low computational complexity, has made it the technique of choice in computer graphics applications [Heckbert and Garland (1997)]. However, the quality of the approximating line, in terms of Euclidean distance, is generally inferior to the regression approach.

Fig. 2. Two 10-segment approximations of electrocardiogram data. The approximation created using linear interpolation has a smooth, aesthetically appealing appearance, because all the endpoints of the segments are aligned; linear regression, in contrast, produces a slightly disjointed appearance but a tighter approximation in terms of residual error.

In this chapter, we deliberately keep our descriptions of algorithms at a high level, so that either technique can be imagined as the approximation technique; in particular, the pseudocode function create_segment(T) can be imagined as using interpolation, regression or any other technique.

All segmentation algorithms also need some method to evaluate the quality of fit for a potential segment. A measure commonly used in conjunction with linear regression is the sum of squares, or the residual error. This is calculated by taking all the vertical differences between the best-fit line and the actual data points, squaring them and then summing them together. Another commonly used measure of goodness of fit is the distance between the best-fit line and the data point furthest away in the vertical direction (i.e. the L∞ norm between the line and the data).
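The two fitting techniques and the residual-error measure can be sketched in a few lines of Python. This is an illustrative reimplementation, not code from the chapter; the function names are ours.

```python
def interp_segment(t):
    # Linear interpolation: the approximating line simply connects the endpoints.
    n = len(t)
    if n == 1:
        return list(t)
    step = (t[-1] - t[0]) / (n - 1)
    return [t[0] + step * i for i in range(n)]

def regress_segment(t):
    # Linear regression: the least-squares best-fit line through the points.
    n = len(t)
    mx = (n - 1) / 2
    my = sum(t) / n
    sxx = sum((i - mx) ** 2 for i in range(n))
    sxy = sum((i - mx) * (y - my) for i, y in enumerate(t))
    slope = sxy / sxx if sxx else 0.0
    return [my + slope * (i - mx) for i in range(n)]

def residual_error(t, fit):
    # Sum of squared vertical differences between the data and the line.
    return sum((a - b) ** 2 for a, b in zip(t, fit))

t = [1.0, 3.0, 2.0, 4.0, 3.0]
e_interp = residual_error(t, interp_segment(t))     # 4.5
e_regress = residual_error(t, regress_segment(t))   # 2.7
assert e_regress <= e_interp   # regression minimizes the residual error
```

On this toy series regression yields a residual of 2.7 versus 4.5 for interpolation, illustrating why the experiments later in the chapter use the regression variants.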
As before, we have kept our descriptions of the algorithms general enough to encompass any error measure; in particular, the pseudocode function calculate_error(T) can be imagined as using the sum of squares, the furthest point, or any other measure.

2.1. The Sliding Window Algorithm

The Sliding Window algorithm works by anchoring the left point of a potential segment at the first data point of a time series, then attempting to approximate the data to the right with increasingly longer segments. At some point i, the error for the potential segment is greater than the user-specified threshold, so the subsequence from the anchor to i − 1 is transformed into a segment. The anchor is moved to location i, and the process repeats until the entire time series has been transformed into a piecewise linear approximation. The pseudocode for the algorithm is shown in Table 2.

Table 2. The generic Sliding Window algorithm.

    Algorithm Seg_TS = Sliding_Window(T, max_error)
    anchor = 1;
    while not finished segmenting time series
        i = 2;
        while calculate_error(T[anchor: anchor + i]) < max_error
            i = i + 1;
        end;
        Seg_TS = concat(Seg_TS, create_segment(T[anchor: anchor + (i - 1)]));
        anchor = anchor + i;
    end;

The Sliding Window algorithm is attractive because of its great simplicity, intuitiveness and particularly the fact that it is an online algorithm. Several variations and optimizations of the basic algorithm have been proposed. Koski et al. noted that on ECG data it is possible to speed up the algorithm by incrementing the variable i by "leaps of length k" instead of 1. For k = 15 (at 400 Hz), the algorithm is 15 times faster with little effect on the output accuracy [Koski et al. (1995)].

Depending on the error measure used, there may be other optimizations possible. Vullings et al. noted that since the residual error is monotonically nondecreasing with the addition of more data points, one does not have to test every value of i from 2 to the final chosen value [Vullings et al. (1997)]. They suggest initially setting i to s, where s is the mean length of the previous segments.
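A minimal Python sketch of the generic Sliding Window algorithm of Table 2 (our own illustrative reimplementation, not the authors' code): segments are returned as (start, end) index pairs, and segment_error is the least-squares residual described above.

```python
def segment_error(t):
    # Residual error of the least-squares line through the points of t.
    n = len(t)
    if n < 3:
        return 0.0
    mx = (n - 1) / 2
    my = sum(t) / n
    sxx = sum((i - mx) ** 2 for i in range(n))
    slope = sum((i - mx) * (y - my) for i, y in enumerate(t)) / sxx
    return sum((y - (my + slope * (i - mx))) ** 2 for i, y in enumerate(t))

def sliding_window(T, max_error):
    # Anchor a segment at the left, grow it one point at a time, and start a
    # new segment at the first point that would push the error past the bound.
    segments, anchor, n = [], 0, len(T)
    while anchor < n - 1:
        end = anchor + 1                  # a two-point segment always fits
        while end + 1 < n and segment_error(T[anchor:end + 2]) < max_error:
            end += 1
        segments.append((anchor, end))
        anchor = end + 1                  # (a final leftover point is ignored)
    return segments

# The break is found exactly at the abrupt level change:
assert sliding_window([0, 1, 2, 3, 10, 20, 30], 0.1) == [(0, 3), (4, 6)]
```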
If the guess was pessimistic (the measured error is still less than max_error), the algorithm continues to increment i as in the classic algorithm; otherwise it begins to decrement i until the measured error is less than max_error. This optimization can greatly speed up the algorithm if the mean length of segments is large in relation to the standard deviation of their length. Surprisingly, although the monotonically nondecreasing property of the residual error would also allow a binary search for the length of the segment, no one we are aware of has suggested this.

The Sliding Window algorithm can give pathologically poor results under some circumstances, particularly if the time series in question contains abrupt level changes. Most researchers have not reported this [Qu et al. (1998), Wang and Wang (2000)], perhaps because they tested the algorithm on stock market data, and its relative performance is best on noisy data. Shatkay (1995), in contrast, does notice the problem and gives elegant examples and explanations [Shatkay (1995)].

Variations on the Sliding Window algorithm are particularly popular with the medical community (where it is known as FAN or SAPA), since patient monitoring is inherently an online task [Ishijima et al. (1983), Koski et al. (1995), McKee et al. (1994), Vullings et al. (1997)]. Koski et al. consider three variants of the basic algorithm, each designed to be robust to a certain case, but they underline the difficulty of producing a single variant of the algorithm that is robust to arbitrary data sources.

Park et al. (2001) suggested modifying the algorithm to create "monotonically changing" segments [Park et al. (2001)]; that is, all segments consist of data points of the form t1 ≤ t2 ≤ ··· ≤ tn or t1 ≥ t2 ≥ ··· ≥ tn. This modification worked well on the smooth synthetic dataset it was demonstrated on, but on real world datasets with any amount of noise, the approximation is greatly overfragmented.

2.2. The Top-Down Algorithm

The Top-Down algorithm works by considering every possible partitioning of the time series and splitting it at the best location. Both subsections are then tested to see if their approximation error is below some user-specified threshold. If not, the algorithm recursively continues to split the subsequences until all the segments have approximation errors below the threshold. The pseudocode for the algorithm is shown in Table 3.

Variations on the Top-Down algorithm (including the 2-dimensional case) were independently introduced in several fields in the early 1970's. In cartography, it is known as the Douglas-Peucker algorithm [Douglas and Peucker (1973)]; in image processing, it is known as Ramer's algorithm [Ramer (1972)].
Most researchers in the machine learning/data mining community are introduced to the algorithm in the classic textbook by Duda and Hart, which calls it "Iterative End-Points Fits" [Duda and Hart (1973)]. In the data mining community, the algorithm has been used by [Li et al. (1998)] to support a framework for mining sequence databases at multiple abstraction levels. Shatkay and Zdonik use it (after considering alternatives such as Sliding Windows) to support approximate queries in time series databases [Shatkay and Zdonik (1996)].

Park et al. introduced a modification where they first perform a scan over the entire dataset marking every peak and valley [Park et al. (1999)]. These extreme points are used to create an initial segmentation, and the Top-Down algorithm is applied to each of the segments (in case the error on an individual segment was still too high). They then use the segmentation to support a special case of dynamic time warping. This modification worked well on the smooth synthetic dataset it was demonstrated on, but on real world data sets with any amount of noise, the approximation is greatly overfragmented.

Lavrenko et al. use the Top-Down algorithm to support the concurrent mining of text and time series [Lavrenko et al. (2000)]. They attempt to discover the influence of news stories on financial markets. Their algorithm contains some interesting modifications, including a novel stopping criteria based on the t-test.

Table 3. The generic Top-Down algorithm.

    Algorithm Seg_TS = Top_Down(T, max_error)
    best_so_far = inf;
    for i = 2 to length(T) - 2                    // Find the best splitting point.
        improvement_in_approximation = improvement_splitting_here(T, i);
        if improvement_in_approximation < best_so_far
            breakpoint = i;
            best_so_far = improvement_in_approximation;
        end;
    end;
    // Recursively split the left segment if necessary.
    if calculate_error(T[1: breakpoint]) > max_error
        Seg_TS = Top_Down(T[1: breakpoint]);
    end;
    // Recursively split the right segment if necessary.
    if calculate_error(T[breakpoint + 1: length(T)]) > max_error
        Seg_TS = Top_Down(T[breakpoint + 1: length(T)]);
    end;
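The Top-Down recursion of Table 3 can be sketched as follows. This is an illustrative reimplementation; for simplicity, the split cost is taken to be the sum of the two halves' residual errors rather than a separate improvement_splitting_here function.

```python
def segment_error(t):
    # Residual error of the least-squares line through the points of t.
    n = len(t)
    if n < 3:
        return 0.0
    mx = (n - 1) / 2
    my = sum(t) / n
    sxx = sum((i - mx) ** 2 for i in range(n))
    slope = sum((i - mx) * (y - my) for i, y in enumerate(t)) / sxx
    return sum((y - (my + slope * (i - mx))) ** 2 for i, y in enumerate(t))

def top_down(T, max_error, offset=0):
    # Split at the breakpoint that minimizes the combined error of the two
    # halves, then recurse on any half whose error still exceeds max_error.
    n = len(T)
    if n < 4 or segment_error(T) < max_error:
        return [(offset, offset + n - 1)]
    best_i = min(range(2, n - 1),           # leave at least 2 points per side
                 key=lambda i: segment_error(T[:i]) + segment_error(T[i:]))
    return (top_down(T[:best_i], max_error, offset)
            + top_down(T[best_i:], max_error, offset + best_i))

assert top_down([0, 1, 2, 3, 10, 20, 30], 0.1) == [(0, 3), (4, 6)]
```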
Finally, Smyth and Ge use the algorithm to produce a representation that can support a Hidden Markov Model approach to both change point detection and pattern matching [Ge and Smyth (2001)].

2.3. The Bottom-Up Algorithm

The Bottom-Up algorithm is the natural complement to the Top-Down algorithm. The algorithm begins by creating the finest possible approximation of the time series, so that n/2 segments are used to approximate the n-length time series. Next, the cost of merging each pair of adjacent segments is calculated, and the algorithm begins to iteratively merge the lowest cost pair until a stopping criteria is met. When the pair of adjacent segments i and i + 1 are merged, the algorithm needs to perform some bookkeeping. First, the cost of merging the new segment with its right neighbor must be calculated. In addition, the cost of merging the i − 1 segment with its new larger neighbor must be recalculated. The pseudocode for the algorithm is shown in Table 4.

Two and three-dimensional analogues of this algorithm are common in the field of computer graphics, where they are called decimation methods [Heckbert and Garland (1997)]. In data mining, the algorithm has been used extensively by two of the current authors to support a variety of time series data mining tasks [Keogh and Pazzani (1999), Keogh and Pazzani (1998), Keogh and Smyth (1997)]. In medicine, the algorithm was used by Hunter and McIntosh to provide the high level representation for their medical pattern matching system [Hunter and McIntosh (1999)].

Table 4. The generic Bottom-Up algorithm.

    Algorithm Seg_TS = Bottom_Up(T, max_error)
    for i = 1 : 2 : length(T)                        // Create initial fine approximation.
        Seg_TS = concat(Seg_TS, create_segment(T[i: i + 1]));
    end;
    for i = 1 : length(Seg_TS) - 1                   // Find merging costs.
        merge_cost(i) = calculate_error(merge(Seg_TS(i), Seg_TS(i + 1)));
    end;
    while min(merge_cost) < max_error                // While not finished:
        p = argmin(merge_cost);                      // find "cheapest" pair to merge,
        Seg_TS(p) = merge(Seg_TS(p), Seg_TS(p + 1)); // merge them,
        delete(Seg_TS(p + 1));                       // and update the records.
        merge_cost(p) = calculate_error(merge(Seg_TS(p), Seg_TS(p + 1)));
        merge_cost(p - 1) = calculate_error(merge(Seg_TS(p - 1), Seg_TS(p)));
    end;
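A sketch of the Bottom-Up algorithm of Table 4 (illustrative only; a linear rescan of merge costs stands in for the heap bookkeeping discussed later in the chapter, so this sketch is quadratic rather than the tuned O(Ln) implementation).

```python
def segment_error(t):
    # Residual error of the least-squares line through the points of t.
    n = len(t)
    if n < 3:
        return 0.0
    mx = (n - 1) / 2
    my = sum(t) / n
    sxx = sum((i - mx) ** 2 for i in range(n))
    slope = sum((i - mx) * (y - my) for i, y in enumerate(t)) / sxx
    return sum((y - (my + slope * (i - mx))) ** 2 for i, y in enumerate(t))

def bottom_up(T, max_error):
    # Start from the finest approximation (2-point segments) and repeatedly
    # merge the adjacent pair whose merged segment has the lowest error.
    segs = [(i, min(i + 1, len(T) - 1)) for i in range(0, len(T), 2)]
    while len(segs) > 1:
        costs = [segment_error(T[segs[i][0]:segs[i + 1][1] + 1])
                 for i in range(len(segs) - 1)]
        p = min(range(len(costs)), key=costs.__getitem__)
        if costs[p] >= max_error:
            break                          # cheapest merge is too costly: stop
        segs[p] = (segs[p][0], segs[p + 1][1])
        del segs[p + 1]
    return segs

assert bottom_up([0, 1, 2, 3, 10, 20, 30], 0.1) == [(0, 3), (4, 6)]
```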
2.4. Feature Comparison of the Major Algorithms

We have deliberately deferred the discussion of the running times of the algorithms until now, when the reader's intuition for the various approaches is more developed. The running time for each approach is data dependent. For that reason, we discuss both a worst-case time that gives an upper bound and a best-case time that gives a lower bound for each approach. We use the standard notation of Ω(f(n)) for a lower bound, O(f(n)) for an upper bound, and θ(f(n)) for a function that is both a lower and upper bound.

Definitions and Assumptions. The number of data points is n, the number of segments we plan to create is K, and thus the average segment length is L = n/K. The actual length of segments created by an algorithm varies, and we will refer to the lengths as Li. All algorithms, except Top-Down, perform considerably worse if we allow any of the Li to become very large (say n/4), so we assume that the algorithms limit the maximum segment length to some multiple cL of the average length. It is trivial to code the algorithms to enforce this, so the time analysis that follows is exact when the algorithm includes this limit. Empirical results show, however, that the segments generated (with no limit on length) are tightly clustered around the average length, so this limit has little effect in practice.

We assume that for each set S of points, we compute a best segment and compute the error in θ(n) time. This reflects the way these algorithms are coded in practice, which is to use a packaged algorithm or function to do linear regression. We note, however, that we believe one can produce asymptotically faster algorithms if one custom codes linear regression (or other best fit algorithms) to reuse computed values, so that the computation is done in less than O(n) time in subsequent steps. We leave that as a topic for future work.

Top-Down. The best time for Top-Down occurs if each split occurs at the midpoint of the data. The first iteration computes, for each split point i, the best line for points [1, i] and for points [i + 1, n]. This takes θ(n) for each split point, or θ(n²) total for all split points. The next iteration finds split points for [1, n/2] and for [n/2 + 1, n]. This gives a recurrence T(n) = 2T(n/2) + θ(n²) with T(2) = c, and this solves to T(n) = Ω(n²). This is a lower bound because we assumed the data has the best possible split points.

The worst time occurs if the computed split point is always at one side (leaving just 2 points on one side), rather than in the middle. The recurrence is T(n) = T(n − 2) + θ(n²). We must stop after K iterations, giving a time of O(n²K).

Sliding Windows. For this algorithm, we compute best segments for larger and larger windows, going from 2 up to at most cL (by the assumption discussed above). The maximum time to compute a single segment is the sum for i = 2 to cL of θ(i), which is θ(L²). The number of segments can be as few as n/cL = K/c or as many as K, so the time is thus θ(L²K), or θ(Ln). This is both a best case and worst case bound.

Bottom-Up. The first iteration computes the segment through each pair of points and the costs of merging adjacent segments; this is easily seen to take O(n) time. In the following iterations, we look up the minimum error pair i and i + 1 to merge, delete from a heap (keeping track of costs is best done with a heap) the costs of merging segments i − 1 and i and merging segments i + 1 and i + 2, merge the pair into a new segment Snew, compute the costs of merging Snew with its new neighbors, and insert these costs into our heap of costs. The time to look up the best cost is θ(1), and the time to add and delete costs from the heap is O(log n). (The time to construct the heap is O(n).)

In the best case, the merged segments always have about equal length. The time to merge a set of length 2 segments, which will end up being one length L segment, into half as many segments is θ(L) (the time to compute the best segment for every pair of merged segments). Each iteration takes the same time, and repeating θ(log L) times gives a segment of size L. The number of times we produce length L segments is K, so the total time is Ω(KL log L) = Ω(n log(n/K)). (Time for heap operations is inconsequential here.)

In the worst case, the merges always involve a short and a long segment, and the final segments are mostly of length cL. The time to compute the cost of merging a length 2 segment with a length i segment is θ(i), and the time to reach a length cL segment is the sum for i = 2 to cL of θ(i), which is θ(L²). There are at most n/cL such segments to compute, so the time is n/cL × θ(L²) = O(Ln), not counting heap operations; the heap operations may take as much as O(n log n). For a lower bound we have proven just Ω(n log(n/K)).

This complexity study is summarized in Table 5. In addition to the time complexity, there are other features a practitioner might consider when choosing an algorithm. First there is the question of whether the algorithm is online or batch.
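The best-case Top-Down recurrence T(n) = 2T(n/2) + θ(n²) discussed above can be checked numerically: each level of the recursion costs half the level above it (2 × (n/2)² = n²/2), so the levels form a geometric series bounded by about twice the top-level cost. A small sketch (with the hidden constant c set to 1):

```python
def t_best(n, c=1.0):
    # Best-case Top-Down recurrence: two half-size subproblems
    # plus a c*n^2 cost to evaluate every candidate split point.
    if n <= 2:
        return c
    return 2 * t_best(n // 2, c) + c * n * n

# The total stays between n^2 (the top level alone) and ~2*n^2
# (the geometric series bound), consistent with the order-n^2 claim.
for n in (64, 256, 1024):
    assert n * n < t_best(n) < 2.1 * n * n
```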
Secondly, there is the question of how the user can specify the quality of the desired approximation. With trivial modifications, the Bottom-Up algorithm allows the user to specify the desired value of K, the maximum error per segment, or the total error of the approximation. A (nonrecursive) implementation of Top-Down can also be made to support all three options. However, Sliding Window only allows the maximum error per segment to be specified.

Table 5. A feature summary for the 3 major algorithms.

    Algorithm         User can specify(1)    Online    Complexity
    Top-Down          E, ME, K               No        O(n²K)
    Bottom-Up         E, ME, K               No        O(Ln)
    Sliding Window    E                      Yes       O(Ln)

    (1) KEY: E → maximum error for a given segment, ME → maximum error for the
    entire time series, K → number of segments.

3. Empirical Comparison of the Major Segmentation Algorithms

In this section, we provide an extensive empirical comparison of the three major algorithms. It is possible to create artificial datasets that allow one of the algorithms to achieve zero error (by any measure), but force the other two approaches to produce arbitrarily poor approximations. In contrast, testing on purely random data forces all the algorithms to produce essentially the same results. To overcome the potential for biased results, we tested the algorithms on a very diverse collection of datasets. These datasets were chosen to represent the extremes along the following dimensions: stationary/nonstationary, noisy/smooth, cyclical/noncyclical, symmetric/asymmetric, etc. In addition, the datasets represent the diverse areas in which data miners apply their algorithms, including finance, medicine, manufacturing and science. Figure 3 illustrates the 10 datasets used in the experiments.

3.1. Experimental Methodology

For simplicity and brevity, we only include the linear regression versions of the algorithms in our study. Since linear regression minimizes the sum of squares error, it also minimizes the Euclidean distance (the Euclidean
distance is just the square root of the sum of squares). Euclidean distance, or some measure derived from it, is by far the most common metric used in data mining of time series [Agrawal et al. (1993), Agrawal et al. (1995), Chan and Fu (1999), Das et al. (1998), Keogh et al. (1998), Keogh and Pazzani (1998), Keogh and Pazzani (1999), Keogh and Smyth (1997), Qu et al. (1998), Wang and Wang (2000)]. The linear interpolation versions of the algorithms, by definition, will always have a greater sum of squares error.

Fig. 3. The 10 datasets used in the experiments: (i) Radio Waves, (ii) Exchange Rates, (iii) Tickwise II, (iv) Tickwise I, (v) Water Level, (vi) Manufacturing, (vii) ECG, (viii) Noisy Sine Cubed, (ix) Sine Cubed, (x) Space Shuttle.

We immediately encounter a problem when attempting to compare the algorithms. We cannot compare them for a fixed number of segments, since Sliding Windows does not allow one to specify the number of segments. Instead we give each of the algorithms a fixed max_error and measure the total error of the entire piecewise approximation. The performance of the algorithms depends on the value of max_error. As max_error goes to zero, all the algorithms have the same performance, since they would produce n/2 segments with no error. At the opposite end, as max_error becomes very large, the algorithms once again all have the same performance, since they all simply approximate T with a single best-fit line. We must therefore test the relative performance for some reasonable value of max_error, a value that achieves a good trade off between compression and fidelity. Because this "reasonable value" is subjective and dependent on the data mining application and the data itself, we did the following. We chose what we considered a "reasonable value" of max_error for each dataset, and then bracketed it with 6 values separated by powers of two, such that max_error = E × 2^i (i = 1 to 6). The lowest of these values tends to produce an over-fragmented approximation, and the highest tends to produce a very coarse approximation, so the performance in the midrange of the 6 values should be considered most important. Figure 4 illustrates this idea.
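For each dataset and each of the six max_error settings, the three algorithms' total errors are then compared relative to the worst performer. A tiny sketch of that bookkeeping, with made-up error values purely for illustration:

```python
def normalize(total_errors):
    # Normalize each algorithm's total error by the worst performer at this
    # max_error setting, so results are comparable across datasets.
    worst = max(total_errors.values())
    return {name: err / worst for name, err in total_errors.items()}

# Hypothetical total errors for one dataset at one setting of max_error:
scores = normalize({"sliding_window": 12.0, "top_down": 6.0, "bottom_up": 4.8})
assert scores["sliding_window"] == 1.0    # the worst algorithm maps to 1.0
```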
Since we are only interested in the relative performance of the algorithms, for each setting of max_error on each data set, we normalized the performance of the 3 algorithms by dividing by the error of the worst performing approach.

Fig. 4. We are most interested in comparing the segmentation algorithms at the setting of the user-defined threshold max_error that produces an intuitively correct level of approximation. Since this setting is subjective, we chose a value for E such that max_error = E × 2^i (i = 1 to 6) brackets the range of reasonable approximations, from too fine an approximation (max_error = E × 2^1) through a "correct" approximation (max_error = E × 2^3) to too coarse an approximation (max_error = E × 2^6).

3.2. Experimental Results

The experimental results are summarized in Figure 5. The most obvious result is the generally poor quality of the Sliding Windows algorithm. With a few exceptions, it is the worst performing algorithm, usually by a large amount.

Comparing the results for Sine Cubed and Noisy Sine Cubed supports our conjecture that the noisier a dataset, the less difference one can expect between algorithms. This suggests that one should exercise caution in attempting to generalize the performance of an algorithm that has only been demonstrated on a single noisy dataset [Qu et al. (1998), Wang and Wang (2000)].

Top-Down does occasionally beat Bottom-Up, but only by a small amount. On the other hand, Bottom-Up often significantly outperforms Top-Down, especially on the ECG, Manufacturing and Water Level data sets.

4. A New Approach

Given the noted shortcomings of the major segmentation algorithms, we investigated alternative techniques. The main problem with the Sliding Windows algorithm is its inability to look ahead, lacking the global view of its offline (batch) counterparts. The Bottom-Up and the Top-Down
approaches produce better results, but are offline and require the scanning of the entire data set. This is impractical or may even be unfeasible in a data mining context, where the data are in the order of terabytes or arrive in continuous streams. We therefore introduce a novel approach in which we capture the online nature of Sliding Windows and yet retain the superiority of Bottom-Up. We call our new algorithm SWAB (Sliding Window and Bottom-up).

Fig. 5. A comparison of the three major time series segmentation algorithms on ten diverse datasets (Space Shuttle, Tickwise 1 and 2, Radio Waves, ECG, Manufacturing, Exchange Rate, Water Level, Sine Cubed and Noisy Sine Cubed), over a range in parameters. Each experimental result (i.e. a triplet of histogram bars) is normalized by dividing by the performance of the worst algorithm on that experiment.

4.1. The SWAB Segmentation Algorithm

The SWAB algorithm keeps a buffer of size w. The buffer size should initially be chosen so that there is enough data to create about 5 or 6 segments.
Table 6. The SWAB (Sliding Window and Bottom-up) algorithm.

    Algorithm Seg_TS = SWAB(max_error, seg_num)    // seg_num is a small integer, i.e. 5 or 6.
    read in w data points                          // Enough to approximate seg_num segments.
    lower_bound = w / 2;
    upper_bound = 2 * w;
    while data at input
        T = Bottom_Up(w, max_error)                // Call the Bottom-Up algorithm.
        Seg_TS = CONCAT(Seg_TS, T(1));
        w = TAKEOUT(w, T(1));                      // Delete the points in T(1) from w.
        if data at input                           // Add points from BEST_LINE() to w,
            w = CONCAT(w, BEST_LINE(max_error));   // checking the upper and lower bound
                                                   // and adjusting if necessary;
        else                                       // else flush the approximated segments.
            Seg_TS = CONCAT(Seg_TS, (T - T(1)))
        end;
    end;

    Function S = BEST_LINE(max_error)              // Returns S points to approximate.
    while error ≤ max_error                        // Next potential segment:
        read in one additional data point, d, into S
        S = CONCAT(S, d);
        error = approx_segment(S);
    end while;
    return S;

Bottom-Up is applied to the data in the buffer and the leftmost segment is reported. The data corresponding to the reported segment is removed from the buffer and more data points are read in. The number of data points read in depends on the structure of the incoming data; this process is performed by the Best_Line function, which is basically just classic Sliding Windows. These points are incorporated into the buffer and Bottom-Up is applied again. This process of applying Bottom-Up to the buffer, reporting the leftmost segment, and reading in the next "best fit" subsequence is repeated as long as data arrives (potentially forever). Table 6 shows the pseudocode for the algorithm.

The intuition behind the algorithm is this. The Best_Line function finds data corresponding to a single segment using the (relatively poor) Sliding Windows and gives it to the buffer. As the data moves through the buffer, the (relatively good) Bottom-Up algorithm is given a chance to refine the segmentation, because it has a "semi-global" view of the data. By the time the data is ejected from the buffer, the segmentation breakpoints are usually the same as the ones the batch version of Bottom-Up would have chosen.
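Putting the pieces together, a compact Python sketch of the SWAB idea (our own illustrative reimplementation, omitting the upper and lower bounds on the buffer size for brevity; best_line plays the role of the Best_Line function, and segments are emitted as lists of values):

```python
def segment_error(t):
    # Residual error of the least-squares line through the points of t.
    n = len(t)
    if n < 3:
        return 0.0
    mx = (n - 1) / 2
    my = sum(t) / n
    sxx = sum((i - mx) ** 2 for i in range(n))
    slope = sum((i - mx) * (y - my) for i, y in enumerate(t)) / sxx
    return sum((y - (my + slope * (i - mx))) ** 2 for i, y in enumerate(t))

def bottom_up(T, max_error):
    # Merge adjacent segments, cheapest first (linear scan in place of a heap).
    segs = [(i, min(i + 1, len(T) - 1)) for i in range(0, len(T), 2)]
    while len(segs) > 1:
        costs = [segment_error(T[segs[i][0]:segs[i + 1][1] + 1])
                 for i in range(len(segs) - 1)]
        p = min(range(len(costs)), key=costs.__getitem__)
        if costs[p] >= max_error:
            break
        segs[p] = (segs[p][0], segs[p + 1][1])
        del segs[p + 1]
    return segs

def best_line(points, max_error):
    # Classic Sliding Windows: take points until one more would break the bound.
    taken = []
    for k, d in enumerate(points):
        if len(taken) >= 2 and segment_error(taken + [d]) > max_error:
            return taken, points[k:]
        taken.append(d)
    return taken, []

def swab(data, max_error, w):
    # Buffer w points, run Bottom-Up on the buffer, report only its leftmost
    # segment, then refill with the next "best fit" subsequence of the input.
    buf, rest = list(data[:w]), list(data[w:])
    emitted = []
    while buf:
        segs = bottom_up(buf, max_error)
        if rest:
            a, b = segs[0]                 # report the leftmost segment only
            emitted.append(buf[a:b + 1])
            buf = buf[b + 1:]
            take, rest = best_line(rest, max_error)
            buf += take
        else:
            # input exhausted: flush all remaining segments from the buffer
            emitted.extend(buf[a:b + 1] for a, b in segs)
            buf = []
    return emitted
```

On a stream with a single abrupt level change, the sketch emits the same two segments the batch Bottom-Up would have chosen.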
Using the buffer allows us to gain a "semi-global" view of the data set for Bottom-Up. However, it is important to impose upper and lower bounds on the size of the window. A buffer that is allowed to grow arbitrarily large will revert our algorithm to pure (batch) Bottom-Up, but a small buffer will deteriorate it to Sliding Windows, allowing excessive fragmentation to occur. In our algorithm, we used an upper (and lower) bound of twice (and half) the initial buffer size.

Our algorithm can be seen as operating on a continuum between the two extremes of Sliding Windows and Bottom-Up. The surprising result (demonstrated below) is that by allowing the buffer to contain just 5 or 6 times the data normally contained by a single segment, the new algorithm produces results that are essentially identical to Bottom-Up, yet is able to process a never-ending stream of data. Our new algorithm requires only a small, constant amount of memory, and the time complexity is a small constant factor worse than that of the standard Bottom-Up algorithm.

4.2. Experimental Validation

We repeated the experiments in Section 3, this time comparing the new algorithm with pure (batch) Bottom-Up and classic Sliding Windows. The result, summarized in Figure 6, is that the new algorithm produces results that are essentially identical to Bottom-Up.

The reader may be surprised that SWAB can sometimes be slightly better than Bottom-Up. The reason why this can occur is that SWAB is exploring a slightly larger search space. Every segment in Bottom-Up must have an even number of datapoints, since it was created by merging other segments that also had an even number of datapoints. The only possible exception is the rightmost segment, which can have an odd number of datapoints if the original time series had an odd length. In contrast, the two other algorithms allow segments to have odd or even lengths. Since this happens multiple times for SWAB, it is effectively searching a slightly larger search space.

5. Conclusions and Future Directions

We have seen the first extensive review and empirical comparison of time series segmentation algorithms from a data mining perspective. We have shown that the most popular approach, Sliding Windows, generally produces very poor results, and that while the second most popular approach, Top-Down, can produce reasonable results, it does not scale well. In contrast, the least well known, Bottom-Up approach produces excellent results and scales linearly with the size of the dataset.
[Figure 6: ten panels of normalized error histograms, one per dataset — Space Shuttle, Sine Cubed, Noisy Sine Cubed, Water Level, ECG, Manufacturing, Exchange Rate, Tickwise 1, Tickwise 2 and Radio Waves.]

Fig. 6. A comparison of the SWAB algorithm with pure (batch) Bottom-Up and classic Sliding Windows, on ten diverse datasets, over a range in parameters. Each experimental result (i.e. a triplet of histogram bars) is normalized by dividing by the performance of the worst algorithm on that experiment.

In addition, we have introduced SWAB, a new online algorithm, which scales linearly with the size of the dataset, requires only constant space and produces high quality approximations of the data. There are several directions in which this work could be expanded.
• The performance of Bottom-Up is particularly surprising given that it explores a smaller space of representations. Because the initialization phase of the algorithm begins with all line segments having length two, all merged segments will also have even lengths. In contrast, the two other algorithms allow segments to have odd or even lengths. It would be
interesting to see if removing this limitation of Bottom-Up can improve its performance further.
• For simplicity and brevity, we have assumed that the inner loop of the SWAB algorithm simply invokes the Bottom-Up algorithm each time. This clearly results in some computation redundancy. We believe we may be able to reuse calculations from previous invocations of Bottom-Up, thus achieving speedup.

Reproducible Results Statement: In the interests of competitive scientific inquiry, all datasets and code used in this work are freely available at the University of California Riverside, Time Series Data Mining Archive {www.cs.ucr.edu/~eamonn/TSDMA/index.html}.

References

1. Agrawal, R., Faloutsos, C. and Swami, A. (1993). Efficient Similarity Search in Sequence Databases. Proceedings of the 4th Conference on Foundations of Data Organization and Algorithms, pp. 69–84.
2. Agrawal, R., Lin, K. I., Sawhney, H. S. and Shim, K. (1995). Fast Similarity Search in the Presence of Noise, Scaling, and Translation in Times-Series Databases. Proceedings of 21th International Conference on Very Large Data Bases, pp. 490–501.
3. Chan, K. and Fu, W. (1999). Efficient Time Series Matching by Wavelets. Proceedings of the 15th IEEE International Conference on Data Engineering, pp. 126–133.
4. Das, G., Lin, K., Mannila, H., Renganathan, G. and Smyth, P. (1998). Rule Discovery from Time Series. Proceedings of the 3rd International Conference of Knowledge Discovery and Data Mining, pp. 16–22.
5. Douglas, D. H. and Peucker, T. K. (1973). Algorithms for the Reduction of the Number of Points Required to Represent a Digitized Line or its Caricature. Canadian Cartographer, 10(2) December, pp. 112–122.
6. Duda, R. O. and Hart, P. E. (1973). Pattern Classification and Scene Analysis. Wiley, New York.
7. Ge, X. and Smyth, P. (2001). Segmental Semi-Markov Models for Endpoint Detection in Plasma Etching. IEEE Transactions on Semiconductor Engineering.
8. Heckbert, P. S. and Garland, M. (1997). Survey of Polygonal Surface Simplification Algorithms. Multiresolution Surface Modeling Course, Proceedings of the 24th International Conference on Computer Graphics and Interactive Techniques.
9. Hunter, J. and McIntosh, N. (1999). Knowledge-Based Event Detection in Complex Time Series Data. Artificial Intelligence in Medicine, Springer, pp. 271–280.
10. Ishijima, M., et al. (1983). Scan-Along Polygonal Approximation for Data Compression of Electrocardiograms. IEEE Transactions on Biomedical Engineering, BME-30(11), pp. 723–729.
11. Keogh, E. and Pazzani, M. (1998). An Enhanced Representation of Time Series which Allows Fast and Accurate Classification, Clustering and Relevance Feedback. Proceedings of the 4th International Conference of Knowledge Discovery and Data Mining, pp. 239–241.
12. Keogh, E. and Pazzani, M. (1999). Relevance Feedback Retrieval of Time Series Data. Proceedings of the 22th Annual International ACM-SIGIR Conference on Research and Development in Information Retrieval, pp. 183–190.
13. Keogh, E., Chakrabarti, K., Pazzani, M. and Mehrotra, S. (2001). Dimensionality Reduction for Fast Similarity Search in Large Time Series Databases. Journal of Knowledge and Information Systems, 3(3), pp. 263–286.
14. Keogh, E. and Smyth, P. (1997). A Probabilistic Approach to Fast Pattern Matching in Time Series Databases. Proceedings of the 3rd International Conference of Knowledge Discovery and Data Mining, pp. 24–30.
15. Koski, A., Juhola, M. and Meriste, M. (1995). Syntactic Recognition of ECG Signals By Attributed Finite Automata. Pattern Recognition, 28(12), pp. 1927–1940.
16. Last, M., Klein, Y. and Kandel, A. (2001). Knowledge Discovery in Time Series Databases. IEEE Transactions on Systems, Man, and Cybernetics, 31B(1), pp. 160–169.
17. Lavrenko, V., Schmill, M., Lawrie, D., Ogilvie, P., Jensen, D. and Allan, J. (2000). Mining of Concurrent Text and Time Series. Proceedings of the 6th International Conference on Knowledge Discovery and Data Mining, pp. 37–44.
18. Li, C., Yu, P. S. and Castelli, V. (1998). MALM: A Framework for Mining Sequence Database at Multiple Abstraction Levels. Proceedings of the 9th International Conference on Information and Knowledge Management, pp. 267–272.
19. McKee, J. J., Evans, N. E. and Owens, F. J. (1994). Efficient Implementation of the Fan/SAPA-2 Algorithm Using Fixed Point Arithmetic. Automedica, 16, pp. 109–117.
20. Park, S., Kim, S. W. and Chu, W. W. (2001). Segment-Based Approach for Subsequence Searches in Sequence Databases. Proceedings of the 16th ACM Symposium on Applied Computing, pp. 248–252.
21. Park, S., Lee, D. and Chu, W. W. (1999). Fast Retrieval of Similar Subsequences in Long Sequence Databases. Proceedings of the 3rd IEEE Knowledge and Data Engineering Exchange Workshop.
22. Pavlidis, T. (1976). Waveform Segmentation Through Functional Approximation. IEEE Transactions on Computers, pp. 689–697.
23. Perng, C., Wang, H., Zhang, S. and Parker, D. S. (2000). Landmarks: A New Model for Similarity-Based Pattern Querying in Time Series Databases. Proceedings of 16th International Conference on Data Engineering, pp. 33–45.
24. Shimada, M. and Uehara, K. (2000). Extraction of Primitive Motion for Human Motion Recognition. Proceedings of the 2nd International Conference on Discovery Science, pp. 351–352.
25. Qu, Y., Wang, C. and Wang, X. S. (1998). Supporting Fast Search in Time Series for Movement Patterns in Multiples Scales. Proceedings of the 7th International Conference on Information and Knowledge Management, pp. 251–258.
26. Ramer, U. (1972). An Iterative Procedure for the Polygonal Approximation of Planar Curves. Computer Graphics and Image Processing, 1, pp. 244–256.
27. Shatkay, H. (1995). Approximate Queries and Representations for Large Data Sequences. Technical Report cs-95-03, Department of Computer Science, Brown University.
28. Shatkay, H. and Zdonik, S. (1996). Approximate Queries and Representations for Large Data Sequences. Proceedings of the 12th IEEE International Conference on Data Engineering, pp. 546–553.
29. Sugiura, N. and Ogden, R. T. (1994). Testing Change-Points with Linear Trend. Communications in Statistics B: Simulation and Computation, 23, pp. 287–322.
30. Vullings, H. J. L. M., Verhaegen, M. and Verbruggen, H. (1997). ECG Segmentation Using Time-Warping. Proceedings of the 2nd International Symposium on Intelligent Data Analysis, pp. 275–286.
31. Wang, C. and Wang, X. S. (2000). Supporting Content-Based Searches on Time Series Via Approximation. Proceedings of the 12th International Conference on Scientific and Statistical Database Management, pp. 69–81.
CHAPTER 2

A SURVEY OF RECENT METHODS FOR EFFICIENT RETRIEVAL OF SIMILAR TIME SEQUENCES

Magnus Lie Hetland
Norwegian University of Science and Technology
Sem Sælands vei 7–9, NO-7491 Trondheim, Norway
Email: magnus@hetland.org

Time sequences occur in many applications, ranging from science and technology to business and entertainment. In many of these applications, searching through large, unstructured databases based on sample sequences is often desirable. Such similarity-based retrieval has attracted a great deal of attention in recent years. Although several different approaches have appeared, most are based on the common premise of dimensionality reduction and spatial access methods. This chapter gives an overview of recent research and shows how the methods fit into a general context of signature extraction.

Keywords: Information retrieval, sequence databases, similarity search, spatial indexing, time sequences.

1. Introduction

Time sequences arise in many applications—any applications that involve storing sensor inputs, or sampling a value that changes over time. A problem which has received an increasing amount of attention lately is the problem of similarity retrieval in databases of time sequences, so-called "query by example." Some uses of this are [Agrawal et al. (1993)]:
• Identifying companies with similar patterns of growth.
• Determining products with similar selling patterns.
• Discovering stocks with similar movement in stock prices.
• Finding out whether a musical score is similar to one of a set of copyrighted scores.
• Finding portions of seismic waves that are not similar to spot geological irregularities.

Applications range from medicine, through economy, to scientific disciplines such as meteorology and astrophysics [Faloutsos et al. (1994), Yi and Faloutsos (2000)]. The running times of simple algorithms for comparing time sequences are generally polynomial in the length of both sequences, typically linear or quadratic. To find the correct offset of a query in a large database, a naive sequential scan will require a number of such comparisons that is linear in the length of the database. This means that, given a query of length m and a database of length n, the search will have a time complexity of O(nm), or even O(nm²) or worse. For large databases this is clearly unacceptable. Many methods are known for performing this sort of query in the domain of strings over finite alphabets, but with time sequences there are a few extra issues to deal with:
• The range of values is not generally finite, or even discrete.
• The sampling rate may not be constant.
• The presence of noise in various forms makes it necessary to support very flexible similarity measures.

This chapter describes some of the recent advances that have been made in this field: methods that allow for indexing of time sequences using flexible similarity measures that are invariant under a wide range of transformations and error sources. The chapter is structured as follows: Section 2 gives a more formal presentation of the problem of similarity-based retrieval and the so-called dimensionality curse. Section 3 describes the general approach of signature based retrieval, as well as three specific methods using this approach. Section 4 shows some other approaches, while Section 5 concludes the chapter. Finally, the Appendix gives an overview of some basic distance measures.¹

¹ The term "distance" is used loosely in this paper. A distance measure is simply the inverse of a similarity measure and is not required to obey the metric axioms.
1.1. Terminology and Notation

A time sequence x = x1 = (v1, t1), ..., xn = (vn, tn) is an ordered collection of elements xi, each consisting of a value vi and a timestamp ti. The values are not restricted, but it is generally assumed that they are real numbers; in some applications, the values may be taken from a finite class of values [Mannila and Ronkainen (1997)], or may have more than one dimension [Lee et al. (2000)]. The only requirement of the timestamps is that they be nondecreasing (or, when required, strictly increasing) with respect to the sequence indices:

    ti ≤ tj ⇔ i ≤ j.    (1)

For some retrieval methods, an additional assumption is that the elements are equispaced: for every two consecutive elements xi and xi+1 we have

    ti+1 − ti = ∆,    (2)

where ∆ (the sampling rate of x) is a (positive) constant. This assumption is a requirement for most of the methods described in this chapter; it is also possible to resample the sequence to make the elements equispaced. If the actual sampling rate is not important, ∆ may be normalized to 1, and t1 to 0. The length of a time sequence x is its cardinality, written as |x|. Abusing the notation slightly, the value of xi may be referred to as xi. The contiguous subsequence of x containing elements xi to xj (inclusive) is written xi:j. A signature of a sequence x is some structure that somehow represents x, yet is simpler than x. (For a more thorough discussion of signatures, see Section 3.) Such a signature is written x̃; in the context of this chapter, such a signature will always be a vector of fixed size k. For a summary of the notation, see Table 1.

Table 1. Notation.

    x       A sequence
    x̃       A signature of x
    xi      Element number i of x
    xi:j    Elements i to j (inclusive) of x
    |x|     The length of x
2. The Problem

The problem of retrieving similar time sequences may be stated as follows: Given a sequence q, a set of time sequences X, a (nonnegative) distance measure d, and a tolerance threshold ε, find the set R of sequences closer to q than ε, or, more precisely:

    R = {x ∈ X | d(q, x) ≤ ε}.    (3)

Figure 1 illustrates the problem for Euclidean distance in two dimensions. In this example, the vector x will be included in the result set R, while y will not. The parameter ε is typically supplied by the user, while the distance function d is domain-dependent. Several distance measures will be described rather informally in this chapter; for more formal definitions, see the Appendix. Alternatively, one might wish to find the k nearest neighbours of q, which amounts to setting ε so that |R| = k. A useful variation of the problem is to find a set of subsequences of the sequences in X. This, in the basic case, requires comparing q not only to all elements of X, but to all possible subsequences.² If a method retrieves a subset S of R, the wrongly dismissed sequences in R − S are called false dismissals. Conversely, if S is a superset of R, the sequences in S − R are called false alarms.

Fig. 1. Similarity retrieval.

² Except in the description of LCS in the Appendix, subsequence means contiguous subsequence, or segment.
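As a baseline, the definition in (3) corresponds to a naive sequential scan. The sketch below is illustrative only, with Euclidean distance standing in for the domain-dependent d:

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def range_query(q, database, eps, d=euclidean):
    """R = {x in X : d(q, x) <= eps}, computed by brute force."""
    return [x for x in database if d(q, x) <= eps]

def knn_query(q, database, k, d=euclidean):
    """k nearest neighbours of q: equivalent to choosing eps so that |R| = k."""
    return sorted(database, key=lambda x: d(q, x))[:k]
```

Both queries cost one distance evaluation per stored sequence, which is exactly the linear scan that the indexing methods surveyed below try to avoid.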
2.1. Spatial Indices and the Dimensionality Curse

The general problem of similarity based retrieval is well known in the field of information retrieval, and many indexing methods exist to process queries efficiently [Baeza-Yates and Ribeiro-Neto (1999)]. However, certain properties of time sequences make the standard methods unsuitable. The fact that the value ranges of the sequences usually are continuous, and that the elements may not be equispaced, makes it difficult to use standard text-indexing techniques such as suffix-trees. One of the most promising techniques is multidimensional indexing (R-trees [Guttman (1984)], for instance), in which the objects in question are multidimensional vectors, and similar objects can be retrieved in sublinear time.

2.2. Good Indexing Methods

Faloutsos et al. (1994) list the following desirable properties for an indexing method:

(i) It should be faster than a sequential scan.
(ii) It should incur little space overhead.
(iii) It should allow queries of various length.
(iv) It should allow insertions and deletions without rebuilding the index.
(v) It should be correct: No false dismissals must occur.

Keogh et al. (2001b) add the following criteria to the list above:

(vi) It should be possible to build the index in reasonable time.
(vii) The index should preferably be able to handle more than one distance measure.

To achieve high performance, the number of false alarms should also be low.

2.3. Robust Distance Measures

The choice of distance measure is highly domain dependent, and in some cases a simple Lp norm such as Euclidean distance may be sufficient. However, in many cases, this may be too brittle [Keogh and Pazzani (1999b)] since it does not tolerate such transformations as scaling, warping, or translation along either axis. Many of the newer retrieval methods focus on using more robust distance measures, which are invariant under such transformations as time warping (see Appendix for details), without loss of performance. One requirement of such spatial access methods is that the distance measure must be monotonic
in all dimensions, usually satisﬁed through the somewhat stricter requirement of the triangle inequality (d(x, z) ≤ d(x, y) + d(y, z)). One important problem that occurs when trying to index sequences with spatial access methods is the socalled dimensionality curse: Spatial indices typically work only when the number of dimensions is low [Chakrabarti and Mehrotra (1999)]. This makes it unfeasible to code the entire sequence directly as a vector in an indexed space. The general solution to this problem is dimensionality reduction: to condense the original sequences into signatures in a signature space of low dimensionality, in a manner which, to some extent, preserves the distances between them. One can then index the signature space.
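As a concrete illustration of such a reduction, the segment-averaging transform used by the piecewise constant methods of Section 3.3 maps a length-n sequence to k coordinates. The sketch below is illustrative; it assumes the k segments have equal length, in which case the scaled signature distance is a lower bound on the true Euclidean distance:

```python
import numpy as np

def paa_signature(seq, k):
    """Condense seq into k coordinates: the mean of each of k segments."""
    parts = np.array_split(np.asarray(seq, dtype=float), k)
    return np.array([p.mean() for p in parts])

def signature_distance(sig_q, sig_x, n, k):
    """Scaled Euclidean distance between signatures. For equal-length segments
    this underestimates the true Euclidean distance, so pruning on it in the
    k-dimensional signature space causes no false dismissals."""
    return float(np.sqrt(n / k) * np.linalg.norm(sig_q - sig_x))
```

The k-dimensional signatures, unlike the raw length-n sequences, are small enough to be stored in a standard spatial index.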
3. Signature Based Similarity Search A time sequence x of length n can be considered a vector or point in an ndimensional space. Techniques exist (spatial access methods, such as the Rtree and variants [Chakrabarti and Mehrotra (1999), Wang and Perng (2001), Sellis et al. (1987)] for indexing such data. The problem is that the performance of such methods degrades considerably even for relatively low dimensionalities [Chakrabarti and Mehrotra (1999)]; the number of dimensions that can be handled is usually several orders of magnitude lower than the number of data points in a typical time sequence. A general solution described by Faloutsos et al. (1994; 1997) is to extract a lowdimensional signature from each sequence, and to index the signature space. This shrink and search approach is illustrated in Figure 2.
Fig. 2.
The signature based approach.
A Survey of Recent Methods for Eﬃcient Retrieval of Similar Time Sequences
29
An important result given by Faloutsos et al. (1994) is the proof that in order to guarantee completeness (no false dismissals), the distance function used in the signature space must underestimate the true distance measure, or:

    dk(x̃, ỹ) ≤ d(x, y).    (4)

This requirement is called the bounding lemma. Assuming that (4) holds, an intuitive way of stating the resulting situation is: "if two signatures are far apart, we know the corresponding [sequences] must also be far apart" [Faloutsos et al. (1997)]. This, of course, means that there will be no false dismissals. To minimise the number of false alarms, we want dk to approximate d as closely as possible. The bounding lemma is illustrated in Figure 3. This general method of dimensionality reduction may be summed up as follows [Keogh et al. (2001b)]:

1. Establish a distance measure d from a domain expert.
2. Design a dimensionality reduction technique to produce signatures of length k, where k can be efficiently handled by a standard spatial access method.
3. Produce a distance measure dk over the k-dimensional signature space, and prove that it obeys the bounding condition (4).

In some applications, the requirement in (4) is relaxed, allowing for a small number of false dismissals in exchange for increased performance. Such methods are called approximate. The dimensionality reduction may in itself be used to speed up the sequential scan, and some methods (such as the piecewise linear approximation of Keogh et al., which is described in Section 4.2) rely only on this, without using any index structure.
Fig. 3.
An intuitive view of the bounding lemma.
Methods exist for ﬁnding signatures of arbitrary objects, given the distances between them [Faloutsos and Lin (1995), Wang et al. (1999)], but in the following I will concentrate on methods that exploit the structure of the time series to achieve good approximations.
3.1. A Simple Example As an example of the signature based scheme, consider the two sequences shown in Figure 4. The sequences, x and y, are compared using the L1 measure (Manhattan distance), which is simply the sum of the absolute distances between each aligning pair of values. A simple signature in this scheme is the preﬁx of length 2, as indicated by the shaded area in the ﬁgure. As shown in Figure 5, these signatures may be interpreted as points in a twodimensional plane, which can be indexed with some standard spatial indexing method. It is also clear that the signature distance will underestimate the real distance between the sequences, since the remaining summands of the real distance must all be positive.
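The prefix-signature scheme just described can be written out in a few lines of Python (a toy sketch, not code from the chapter):

```python
def l1(a, b):
    """Manhattan (L1) distance between two equal-length sequences."""
    return sum(abs(x - y) for x, y in zip(a, b))

def filtered_search(q, database, eps, k=2):
    """Prune on the length-k prefix signature, then verify the survivors.
    The prefix distance is a partial sum of nonnegative terms, so it never
    exceeds the full L1 distance: the filter causes no false dismissals."""
    candidates = [x for x in database if l1(q[:k], x[:k]) <= eps]
    return [x for x in candidates if l1(q, x) <= eps]
```

Sequences pruned in the first pass are guaranteed to lie outside the tolerance eps; only the (possibly false-alarm) candidates pay for a full distance computation.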
Fig. 4.
Comparing two sequences.
Fig. 5.
A simple signature distance.
Fig. 6.
An example time sequence.
Although correct, this simple signature extraction technique is not particularly precise. The signature extraction methods introduced in the following sections take into account more information about the full sequence shape, and therefore lead to fewer false alarms. Figure 6 shows a time series containing measurements of atmospheric pressure. In the following three sections, the methods described will be applied to this sequence, and the resulting simpliﬁed sequence (reconstructed from the extracted signature) will be shown superimposed on the original.
3.2. Spectral Signatures Some of the methods presented in this section are not very recent, but introduce some of the main concepts used by newer approaches. Agrawal et al. (1993) introduce a method called the F index in which a signature is extracted from the frequency domain of a sequence. Underlying their approach are two key observations: • Most realworld time sequences can be faithfully represented by their strongest Fourier coeﬃcients. • Euclidean distance is preserved in the frequency domain (Parseval’s Theorem [Shatkay (1995)]). Based on this, they suggest performing the Discrete Fourier Transform on each sequence, and using a vector consisting of the sequence’s k ﬁrst amplitude coeﬃcients as its signature. Euclidean distance in the signature space will then underestimate the real Euclidean distance between the sequences, as required. Figure 7 shows an approximated time sequence, reconstructed from a signature consisting of the original sequence’s ten ﬁrst Fourier components. This basic method allows only for wholesequence matching. In 1994, Faloutsos et al. introduce the ST index, an improvement on the F index
that makes subsequence matching possible. The main steps of the approach are as follows:

1. For each position in the database, extract a window of length w, and create a spectral signature (a point) for it. Each point will be close to the previous, because the contents of the sliding window change slowly. The points for one sequence will therefore constitute a trail in signature space.
2. Partition the trails into suitable (multidimensional) Minimal Bounding Rectangles (MBRs), according to some heuristic.
3. Store the MBRs in a spatial index structure.

To search for subsequences similar to a query q of length w, simply look up all MBRs that intersect a hypersphere with radius ε around the signature point q̃. This is guaranteed not to produce any false dismissals, because if a point is within a radius of ε of q̃, it cannot possibly be contained in an MBR that does not intersect the hypersphere. To search for sequences longer than w, split the query into w-length segments, search for each of them, and intersect the result sets. Because a sequence in the result set R cannot be closer to the full query sequence than it is to any one of the window signatures, it has to be close to all of them, that is, contained in all the result sets.

These two papers [Agrawal et al. (1993) and Faloutsos et al. (1994)] are seminal; several newer approaches are based on them. Rafiei and Mendelzon (1997) show how the method can be made more robust by allowing various transformations in the comparison, and Chan and Fu (1999) show how the Discrete Wavelet Transform (DWT) can be used instead of the Discrete Fourier Transform (DFT). See Wu et al. (2000) for a comparison between similarity search based on DFT and DWT.

Fig. 7. A sequence reconstructed from a spectral signature.
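The F-index construction can be sketched directly with NumPy. This is a toy illustration: whereas the chapter describes keeping the first amplitude coefficients, the sketch keeps the complex coefficients of an orthonormal DFT, for which the Parseval-based lower bound is immediate:

```python
import numpy as np

def dft_signature(seq, k):
    """Signature: the first k coefficients of the orthonormal DFT of seq."""
    return np.fft.fft(np.asarray(seq, dtype=float), norm="ortho")[:k]

def signature_distance(sig_a, sig_b):
    """Euclidean distance between (complex) signatures. Since the orthonormal
    DFT preserves Euclidean distance (Parseval's Theorem), distance over a
    truncated set of coefficients underestimates the true distance."""
    return float(np.linalg.norm(sig_a - sig_b))
```

Because the signature distance never overestimates, indexing the k coefficients and verifying candidates against the raw sequences yields no false dismissals.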
Yi and Faloutsos (2000) also show that this signature can be used with arbitrary Lp norms without changing the index structure. do not achieve similar speedups. (1993. or Yi et al. the socalled Adaptive Piecewise Constant Approximation. Figure 8 shows an approximated time sequence. Keogh et al.A Survey of Recent Methods for Eﬃcient Retrieval of Similar Time Sequences
33
3. or PCA. and the index can be built in linear time. but point to the fact that the structure allows for more ﬂexible distance measures than many of the competing methods. reconstructed from a tendimensional PCA signature.
A sequence reconstructed from a PCA signature. Preprocessing to make the index more robust in the face of such transformations as oﬀset translation. 1997). which is something no previous method [such as Agrawal et al. Keogh et al. where each part of the sequence has a diﬀerent weight. and to use the average value of each segment as a coordinate of a kdimensional signature vector. Keogh and Pazzani (2000) is to divide each sequence into k segments of equal length. Raﬁei and Mendelzon (1997). and time scaling can also be performed. the PCA methods seem promising: Yi and Faloutsos demonstrate up to a ten times speedup over methods based on the discrete wavelet transform. This is
Fig. Keogh et al. This means that the distance measure may be speciﬁed by the user. Empirically. it is easy to understand and implement. call the method Piecewise Constant Approximation. (1994.3. (1998)] could accomplish.
. Keogh et al. This deceptively simple dimensionality reduction technique has several advantages [Keogh et al. (2001b)]: The transform itself is faster than most other transforms. demonstrate that the representation can also be used with the socalled weighted Euclidean distance. (2001b). (2001a) later propose an improved version of the PCA. 8. amplitude scaling. Faloutsos et al. it supports more ﬂexible distance measures than Euclidean distance. or APCA. Piecewise Constant Approximation An approach independently introduced by Yi and Faloutsos (2000) and Keogh et al. 1995).
similar to the PCA, except that the segments need not be of equal length. Thus, regions with great fluctuations may be represented with several short segments, while reasonably featureless regions may be represented with fewer, long segments. The main contribution of this representation is that it is a more effective compression than the PCA. Two distance measures are developed for the APCA: one which is guaranteed to underestimate Euclidean distance, and one which can be calculated more efficiently, but which may generate some false dismissals. It is also shown that this technique, like the PCA, can handle arbitrary Lp norms. The empirical data suggest that the APCA outperforms both methods based on the discrete Fourier transform and methods based on the discrete wavelet transform, with a speedup of one to two orders of magnitude.

In a recent paper, Keogh (2002) develops a distance measure that is a lower bound for dynamic time warping. The distance measure is based on the assumption that the allowed warping is restricted, which is often the case in real applications. Under this assumption, Keogh constructs two warped versions of the sequence to be indexed: an upper and a lower limit. The PCA signatures of these limits are then extracted, and the PCA approach is used to index them; together with Keogh's distance measure this forms an exact index (one with no false dismissals) with high precision. Keogh performs extensive empirical experiments, and his method clearly outperforms any other existing method for indexing time warping.

A Survey of Recent Methods for Efficient Retrieval of Similar Time Sequences
M. L. Hetland

3.4. Landmark Methods

The methods described in this section are based on finding similar landmark features (or shapes) in the target sequences. In 1997, Keogh and Smyth introduce a probabilistic method for sequence retrieval, where the features extracted are characteristic parts of the sequence, so-called feature shapes. One of the contributions of the method is that it is one of the first that allows some longitudinal scaling. Keogh (1997) uses a similar landmark based technique, ignoring shifting and scaling within given limits. Both these methods also use the dimensionality reduction technique of piecewise linear approximation (see Section 4.2) as a preprocessing step. The technique is shown to be significantly faster than sequential scanning (about an order of magnitude), which may be accounted for by the compression of the piecewise linear approximation.

A more recent paper by Perng et al. (2000) introduces a more general landmark model. In its most general form, the model allows any point of great importance to be identified as a landmark. The specific form used in the paper defines an nth order landmark of a one-dimensional function to be a point where the function's nth derivative is zero. Thus, first-order landmarks are extrema, second-order landmarks are inflection points, and so forth. A smoothing technique is also introduced, which lets certain landmarks be overshadowed by others; for instance, local extrema representing small fluctuations may not be as important as a global maximum or minimum. Figure 9 shows an approximated time sequence, reconstructed from a twelve-dimensional landmark signature.

Fig. 9. A landmark approximation.

One of the main contributions of Perng et al. (2000) is to show that for suitable selections of landmark features, the model is invariant with respect to the following transformations:

• Shifting
• Uniform amplitude scaling
• Uniform time scaling
• Non-uniform time scaling (time warping)
• Non-uniform amplitude scaling

This makes the method quite flexible and robust. It is also possible to allow for several of these transformations at once, by using the intersection of the features allowed for each of them. However, as the number of transformations allowed increases, the number of features will decrease, and consequently the index will be less precise.

A particularly simple landmark based method (which can be seen as a special case of the general landmark method) is introduced by Kim et al. (2001). They show that by extracting the minimum, maximum, and the first and last elements of a sequence, one gets a (rather crude) signature that is invariant to time warping. However, since time warping distance does not obey the triangle inequality [Yi et al. (1998)], it cannot be used directly; this problem is solved by developing a new distance measure that underestimates the time warping distance while simultaneously satisfying the triangle inequality. Note that this method does not achieve results comparable to those of Keogh (2002).
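To make the landmark idea concrete, here is a small Python sketch (ours, not from the survey): extracting first-order landmarks (local extrema, where the first difference changes sign) and the warping-invariant four-element signature of Kim et al.; the function names are hypothetical.

```python
def first_order_landmarks(x):
    """Indices of local extrema: points where the first difference changes sign.

    These play the role of first-order landmarks in the sense of Perng et al.
    (2000), i.e. points where the first derivative of the underlying curve
    is zero.
    """
    marks = []
    for i in range(1, len(x) - 1):
        # A sign change in consecutive differences indicates an extremum.
        if (x[i] - x[i - 1]) * (x[i + 1] - x[i]) < 0:
            marks.append(i)
    return marks


def warping_invariant_signature(x):
    """The signature of Kim et al. (2001): first, last, minimum, and maximum.

    Time warping repeats or stretches elements but preserves these four
    values, so the signature is invariant to time warping.
    """
    return (x[0], x[-1], min(x), max(x))


series = [0, 1, 2, 1, 1.5, 3, 2]
print(first_order_landmarks(series))        # -> [2, 3, 5]
print(warping_invariant_signature(series))  # -> (0, 2, 0, 3)
```

A smoothing step, as in the general landmark model, would then discard those extrema whose fluctuation is below a chosen threshold.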
4. Other Approaches

Not all recent methods rely on spatial access methods. This section contains a sampling of other approaches.

4.1. Using Suffix Trees to Avoid Redundant Computation

Baeza-Yates and Gonnet (1999) and Park et al. (2000) independently introduce the idea of using suffix trees [Gusfield (1997)] to avoid duplicate calculations when using dynamic programming to compare a query sequence with other sequences in a database. Baeza-Yates and Gonnet use edit distance (see Appendix for details), while Park et al. use time warping.

The basic idea of the approach is as follows: When comparing two sequences x and y with dynamic programming, a subtask will be to compare their prefixes x1:i and y1:j. If two other sequences are compared that have identical prefixes to these (for instance, the query and another sequence from the database), the same calculations will have to be performed again. To avoid this, all the sequences in the database are indexed with a suffix tree. A suffix tree stores all the suffixes of a sequence, with identical prefixes stored only once. By performing a depth-first traversal of the suffix tree one can access every suffix (which is equivalent to each possible subsequence position) and backtrack to reuse the calculations that have already been performed for the prefix that the current and the next candidate subsequence share. Baeza-Yates and Gonnet assume that the sequences are strings over a finite alphabet, while Park et al. avoid this assumption by classifying each sequence element into one of a finite set of categories. Both methods achieve subquadratic running times.

4.2. Data Reduction through Piecewise Linear Approximation

If a sequential search for subsequence matches is performed, the cost may easily become prohibitive. To reduce this cost, Keogh et al. have introduced a dimensionality reduction technique using piecewise linear approximation of the original sequence data [Keogh (1997), Keogh and Pazzani (1998), Keogh and Pazzani (1999a), Keogh and Pazzani (1999b), Keogh and Smyth (1997)]. This reduces the number of data points by a compression factor typically in the range from 10 to 600 for real data [Keogh (1997)], while still representing the original faithfully. This approximation is shown to be valid under several distance measures, including dynamic time warping distance [Keogh and Pazzani (1999b)].

An enhanced representation is introduced in [Keogh and Pazzani (1998)], where every line segment in the approximation is augmented with a weight representing its relative importance. Thus, a combined sequence may be constructed representing a class of sequences, and some line segments may be more representative of the class than others.

4.3. Search Space Pruning through Subsequence Hashing

Keogh and Pazzani (1999a) describe an indexing method based on hashing, in addition to the piecewise linear approximation. An equispaced template grid window is moved across the sequence, and for each position a hash key is generated to decide into which bin the corresponding subsequence is put. The hash key is simply a binary string, where 1 means that the sequence is predominantly increasing in the corresponding part of the template grid, while 0 means that it is decreasing. These bin keys may then be used during a search to prune away entire bins without examining their contents. To get more benefit from the bin pruning, the bins are arranged in a best-first order. The method performs well, outperforming methods based on the Discrete Fourier Transform by one to three orders of magnitude [Keogh and Pazzani (1999b)].

5. Conclusion

This chapter has sought to give an overview of recent advances in the field of similarity based retrieval in time sequence databases. First, the problem of similarity search and the desired properties of robust distance measures and good indexing methods were outlined. Following the general description, the general approach of signature based similarity search was described. Then, three specific signature extraction approaches were discussed: spectral signatures, based on Fourier components (or wavelet components); piecewise constant approximation and the related method adaptive piecewise constant approximation; and landmark methods, based on the extraction of significant points in a sequence. Finally, some methods that are not based on signature extraction were mentioned.

Although the field of time sequence indexing has received much attention and is now a relatively mature field [Keogh et al. (2002)], there are still areas where further research might be warranted. Two such areas are (1) thorough empirical comparisons and (2) applications in data mining.
The published methods have undergone thorough empirical tests that evaluate their performance (usually by comparing them to sequential scan, or, in some cases, to the basic spectral signature methods), but comparing the results is not a trivial task; in most cases it might not even be very meaningful, since variations in performance may be due to implementation details, available hardware, and several other factors that may not be inherent in the indexing methods themselves. Although some comparisons have been made [such as in Wu et al. (2000) and, in the more general context of spatial similarity search, in Weber et al. (1998)], little research seems to have been published on this topic. Implementing several of the most promising methods and testing them on real world problems (under similar conditions) might lead to new insights, not only about their relative performances in general, but also about which methods are best suited for which problems.

Data mining in time series databases is a relatively new field [Keogh et al. (2002)]. Some publications can be found about using time sequence indexing for data mining purposes [such as Keogh et al. (2002), where a method is presented for mining patterns using a suffix tree index], but there is still a potential for combining existing sequence mining techniques with existing methods for similarity-based retrieval. Most current mining methods are based on a full, linear scan of the sequence data. While this may seem unavoidable, it has been argued that data mining should be interactive [Das et al. (1998)], in which case such techniques could prove useful: constructing an index of the data could make it possible to perform this full data traversal only once, and later perform several data mining passes that only use the index to perform their work.

Appendix: Distance Measures

Faloutsos et al. (1997) describe a general framework for sequence distance measures [a similar framework can be found in Jagadish et al. (1995)]. They show that many common distance measures can be expressed in the following form:

    d(x, y) = min{ d0(x, y), min_{T1,T2 ∈ T} { c(T1) + c(T2) + d(T1(x), T2(y)) } }    (5)

where T is a set of allowable transformations, c(Ti) is the cost of performing the transformation Ti, Ti(x) is the sequence resulting from performing the transformation Ti on x, and d0 is a so-called base distance, typically calculated in linear time. For instance, Lp norms (such as Manhattan distance
and Euclidean distance) result when T = ∅ and

    d0(x, y) = Lp = ( Σ_{i=1}^{l} |xi − yi|^p )^{1/p}    (6)

where |x| = |y| = l.

Editing distance (or Levenshtein distance) is the weight of the minimum sequence of editing operations needed to transform one sequence into another [Sankoff and Kruskal (1999)]. It is usually defined on strings (or equispaced time sequences), but Mannila and Ronkainen (1997) show how to generalise this measure to general (non-equispaced) time sequences. In the framework given above, editing distance may be defined as:

    ded(x, y) = min{ c(del(x1)) + ded(x2:m, y),
                     c(del(y1)) + ded(x, y2:n),
                     c(sub(x1, y1)) + ded(x2:m, y2:n) }    (7)

where m = |x|, n = |y|, del(x1) and del(y1) stand for deleting the first elements of x and y, respectively, and sub(x1, y1) stands for substituting the first element of x with the first element of y.

A distance function with time warping allows non-uniform scaling along the time axis. Stuttering occurs when an element from one of the sequences is repeated several times. A typical distance measure is:

    dtw(x, y) = d0(x1, y1) + min{ dtw(x, y2:n)        (x-stutter)
                                  dtw(x2:m, y)        (y-stutter)
                                  dtw(x2:m, y2:n) }   (no stutter)    (8)

The Longest Common Subsequence (LCS) measure [Cormen et al. (1993)] is, in sequence terms, the length of the longest sequence s which is a (possibly non-contiguous) subsequence of both x and y; in other words:

    dlcs(x, y) = max{ |s| : s ⊆ x and s ⊆ y }    (9)

Both ded and dtw can be computed in quadratic time (O(mn)) using dynamic programming [Cormen et al. (1993), Sankoff and Kruskal (1999)]: an m × n table D is filled iteratively so that D[i, j] = d(x1:i, y1:j), and the final distance d(x, y) is found in D[m, n]. dlcs(x, y) may be calculated using dynamic programming in a manner quite similar to ded. In some applications the measure is normalised by dividing by max(|x|, |y|), giving a distance in the range [0, 1].
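The recurrences (7) and (8) translate directly into the quadratic-time table-filling computation described above. The following Python sketch is illustrative (ours, not the survey's); the unit deletion cost and the zero/one substitution cost are assumptions.

```python
def ded(x, y, c_del=lambda e: 1.0, c_sub=lambda a, b: 0.0 if a == b else 1.0):
    """Editing distance, eq. (7): fill D so that D[i][j] = ded(x[:i], y[:j])."""
    m, n = len(x), len(y)
    D = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):  # cost of deleting a prefix of x
        D[i][0] = D[i - 1][0] + c_del(x[i - 1])
    for j in range(1, n + 1):  # cost of deleting a prefix of y
        D[0][j] = D[0][j - 1] + c_del(y[j - 1])
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            D[i][j] = min(c_del(x[i - 1]) + D[i - 1][j],
                          c_del(y[j - 1]) + D[i][j - 1],
                          c_sub(x[i - 1], y[j - 1]) + D[i - 1][j - 1])
    return D[m][n]


def dtw(x, y, d0=lambda a, b: abs(a - b)):
    """Time-warping distance, eq. (8); a 'stutter' repeats an element."""
    m, n = len(x), len(y)
    INF = float("inf")
    D = [[INF] * (n + 1) for _ in range(m + 1)]
    D[0][0] = 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            D[i][j] = d0(x[i - 1], y[j - 1]) + min(D[i][j - 1],      # x-stutter
                                                   D[i - 1][j],      # y-stutter
                                                   D[i - 1][j - 1])  # no stutter
    return D[m][n]


print(ded("kitten", "sitting"))         # -> 3.0
print(dtw([0, 0, 1, 2], [0, 1, 1, 2]))  # -> 0.0
```

The dtw example returns zero because the repeated elements of each series can be absorbed by stuttering, which is exactly the non-uniform time scaling that warping permits.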
I. Proc. and Gonnet. 15. 163–174.... MIT Press. Rtrees: A Dynamic Index Structure for Spatial Searching. and Lin. G. T.. Conf.V. 14th Symposium on Principles of Database Systems (PODS). 10. and Milo. Conf. (1998). and Fu. 14. Proc. Jagadish. Lin. (1993). 3. Hetland
References
1. on Tools with Artiﬁcial Intelligence (ICTAI). on Data Engineering (ICDE).40
M. 6. and Shim. Conf.. Faloutsos. Conf. P. Das. Proc. 126–133. Modern Information Retrieval. K. on Management of Data. A. Conf. (1994). Eﬃcient Time Series Matching by Wavelets. 578–584. R. Lin. Faloutsos. FastMap: A Fast Algorithm for Indexing. H. pp. and RibeiroNeto.N. (1999). Introduction to Algorithms. (1993). 419–429. Agrawal. Proc.E. pp. K. (1997)... K. H. Proc. and Smyth. pp. Conf. C. E. Proc. and Swami. A. A Signature Technique for SimilarityBased Queries. A. (1999). pp. Mendelzon. Y. Proc. Proc. The Hybrid Tree: An Index Structure for High Dimensional Feature Spaces. G. 15th Int. Faloutsos.O. on Knowledge Discovery and Data Mining (KDD). (1997).. 47–57. Compression and Complexity of Sequences (SEQUENCES). and Rivest. R. 13. pp. SimilarityBased Queries. Ranganathan. pp. Chan... pp. pp. C. Eﬃcient Similarity Search in Sequence Databases. Conf. of the 1995 ACM SIGMOD Int. ACM Press/Addison–Wesley Longman Limited. (1999). 36–45. B. Scaling. 490– 501.. (1995). Renganathan.H. Faloutsos.W. 8. 9th Int. on Very Large Databases (VLDB).L. and Translation in TimeSeries Database. Cambridge University Press. M. 16–22. Jagadish.J. Proc. 11. H. and Mehrotra. on Management of Data. (1997). 4. S. Cormen. D. 9. Leiserson. Proc. Conf. and Milo.. BaezaYates. (1984). (1995). Mendelzon. 4th Int. Keogh. Fast Subsequence Matching in TimeSeries Databases.S. of the 1994 ACM SIGMOD Int. 69–84. R. BaezaYates. Rule Discovery from Time Series. C. 4th Int. Fast Similarity Search in the Presence of Noise. K. Proc. A... (1999). L.H. 440–447. 5. Guttman. 2. 21st Int. Sawhney. Agrawal. on Foundations of Data Organization and Algorithms (FODO). T. 15th Int.V. DataMining and Visualization of Traditional and Multimedia Datasets. C. G. 6th String Processing and Information Retrieval Symposium (SPIRE). Gusﬁeld. on Data Engineering (ICDE). Chakrabarti. 1984 ACM SIGMOD Int. K. 16–23.O. T.
. R. R. 12. Conf. on Management of Data. pp. Algorithms on Strings.. 7. C. A Fast Algorithm on Average for AllAgainstAll Sequence Matching.. A. K. Mannila. (1995).. pp. and Manolopoulos. Trees and Sequences. pp.. A Fast and Robust Method for Pattern Matching in Time Series Databases. Proc. H.
. 550–556. 24. (2001b). P. Conf. S. Keogh.J.R. Proc. Keogh. 28. A Simple Dimensionality Reduction Technique for Fast Similarity Search in Large Time Series Databases.
. E. and Pazzani. Conf.J. M.W. and Hsu. (1999a). Keogh. (2000). E. W.. Conf. 19. Exact Indexing of Dynamic Time Warping. 17. M. E.J. E. 25. Proc. 16th Int. 1–11.. S. C. 27. Proc.J. Proc. E... Workshop on Temporal Representation and Reasoning. and Pazzani. Perng. on Data Engineering (ICDE). 22. 239–243. K. on Data Engineering (ICDE). 136–139. Proc. Eﬃcient Search for Similar Subsequences of Diﬀerent Lengths in Sequence Databases. Keogh. and Chung. 4th Int.. pp. pp.J. Mehrotra. pp. S. on Knowledge Discovery and Data Mining (KDD). Lonardi. on Very Large Data Bases (VLDB). 23–32. Wang. on Principles of Data Mining and Knowledge Discovery (PKDD). pp. K. S. Keogh.J. Yoon. (1997). (1997). pp. Chakrabarti. 4th Int.. M. Keogh.. Scaling up Dynamic Time Warping to Massive Datasets. and Pazzani. H. Keogh... 3rd European Conf. J. 11th Int. E. Keogh.J. Similarity Search for Multidimensional Data Sequences. M. Lee. (1999b).. Conf. Proc. (2001a). E. Journal of Knowledge and Information Systems 3(3). on Knowledge Discovery and Data Mining (KDD). (2000).A Survey of Recent Methods for Eﬃcient Retrieval of Similar Time Sequences
16. Proc. S.. Conf.W. Park. (2002). pp. 24–30. (2001). pp. and Mehrotra. 2001 ACM SIGMOD Conf.. H. P. on Management of Data. Chu. M. Proc.. on Data Engineering (ICDE). 56–67. Kim. S. (2002). 28th Int. on Data Engineering (ICDE). 151–162. Landmarks: A New Model for SimilarityBased Pattern Querying in Time Series Databases. Clustering and Relevance Feedback. of the 4th PaciﬁcAsia Conference on Knowledge Discovery and Data Mining (PAKDD).J. pp. A Probabilistic Approach to Fast Pattern Matching in Time Series Databases. Proc. D. An Indexing Scheme for Fast Similarity Search in Large Time Series Databases. Lee.J. S. Proc. pp. C. 122–133. 26. and Ronkainen. TIME.J. and Chiu. pp. 3rd Int. S. Dimensionality Reduction for Fast Similarity Search in Large Time Series Databases. Keogh. S. Conf. (2000). J. (2000). 263–286. An IndexBased Approach for Similarity Search Supporting Time Warping in Large Sequence Databases. 20.J. 599–609. Chakrabarti. Kim. Proc. An Enhanced Representation of Time Series which Allows Fast and Accurate Classiﬁcation. Proc.. Locally Adaptive Dimensionality Reduction for Indexing Large Time Series Databases. (1998). Chun. and Chu. Conf. 16th Int. M. Pazzani. 23. W. 21.. Proc. 406–417. pp. and Pazzani. and Pazzani. Conf. 8th ACM SIGKDD Int. 33–42. C.J. and Parker. E. pp. 18. pp. 607–614.J. Park. Conf. on Scientiﬁc and Statistical Database Management (SSDBM).S.J. 29. on Knowledge Discovery and Data Mining (KDD).J. Zhang. D. E. B. Finding Surprising Patterns in a Time Series Database in Linear Time and Space. 17th Int.. Mannila. 16th Int... and Smyth. Similarity of Event Sequences..
(2000).. Eﬃcient Retrieval of Similar Time Sequences Under Time Warping. Weber. 33. pp. K. Time Warps. Shasha. 24th Int. Yi. 507–518. B. 14th Int. The VLDB Journal. and Faloutsos. H.. T. Y. 26(2). Wang. Sellis. Conf. H.. 385–594. D. pp. 194–205. (1987). H. Jagadish. and Macromolecules: The Theory and Practice of Sequence Comparison. N. (2000). D. (1995)..K. Shatkay. pp. On SimilarityBased Queries for Time Series Data. 34. H. 307–311. 13th Int. Conf. A Quantitative Analysis and Performance Study for SimilaritySearch Methods in HighDimensional Spaces. Proc. 32. L. D. pp. 35. pp. Proc. Schek.I. on Knowledge Discovery and Data Mining (PAKDD). 312–323. C.. on Data Engineering (ICDE). (1998). A Comparison of DFT and DWT Based Similarity Search in TimeSeries Databases. S. C. SIGMOD Record. 36. on Very Large Database (VLDB). J. Conf. (1998). Proc. on Very Large Databases (VLDB). pp. Conf.42
M. Lin. on Information and Knowledge Management (CIKM). (2001). Raﬁei. A. D. and Zhang. (1999).A. on Knowledge Discovery and Data Mining. and Faloutsos. Technical Report CS9537. C.
.V. ed. (1997).. (1999).. and Mendelzon. of the 1999 ACM SIGKDD Int. 488–495. and Abbadi. The S 2 Tree.... reissue edition. Wang. and Faloutsos. B. and Kruskal. 37. X. C. and Perng.L. J. String Edits. 38. 31. The R+ Tree: A Dynamic Index for MultiDimensional Objects. A. Sankoﬀ. 201–208. Evaluating a Class of DistanceMapping Algorithms for Data Mining and Clustering. Shapiro. Brown University. and Blott. 39. Proc.E. Conf. Agrawal. 13–25. An Index Structure for Subsequence Matching of Spatial Objects. Roussopoulos. Proc. Wang. Proc. 9th Int.. of the 5th PaciﬁcAsia Conf.T. The fourier transform: A primer. B. Hetland
30. Yi. pp. K. Fast Time Sequence Indexing for Arbitrary Lp Norms. Wu. R... CSLI Publications..
CHAPTER 3

INDEXING OF COMPRESSED TIME SERIES

Eugene Fink
Computer Science and Engineering, University of South Florida
Tampa, Florida 33620, USA
Email: eugene@csee.usf.edu

Kevin B. Pratt
Harris Corporation, 1025 NASA Blvd, W3/9708
Melbourne, Florida 32919, USA
Email: kpratt01@harris.com

We describe a procedure for identifying major minima and maxima of a time series, by selecting major extrema and discarding the other points, and present two applications of this procedure. The first application is fast compression of a series. The second application is indexing of compressed series by their major extrema, and retrieval of series similar to a given pattern. The retrieval procedure searches for the series whose compressed representation is similar to the compressed pattern, and it allows the user to control the tradeoff between the speed and accuracy of retrieval. We show the effectiveness of the compression and retrieval for stock charts, meteorological data, and electroencephalograms.

Keywords: Time series, compression, fast retrieval, similarity measures.

1. Introduction

We view a time series as a sequence of values measured at equal intervals; for example, the series in Figure 1 includes the values 20, 22, 25, 22, and so on. We describe a compression procedure based on extraction of certain important minima and maxima from a series, and discarding the other points; for example, we can compress the series in Figure 1 by extracting the circled minima and maxima. The compression algorithm runs in linear time and takes constant memory. We also propose a measure of similarity between
series and show that it works well with compressed data. Finally, we present a technique for indexing and retrieval of compressed series.

Fig. 1. Example of a time series.

We have tested these techniques on four data sets (Table 1), which are publicly available through the Internet.

Stock prices: We have used stocks from the Standard and Poor's 100 listing of large companies for the period from January 1998 to April 2000. We have downloaded daily prices from America Online, discarded newly listed and delisted stocks, and used ninety-eight stocks in the experiments.

Air and sea temperatures: We have experimented with daily temperature readings by sixty-eight buoys in the Pacific Ocean, from 1980 to 1998, downloaded from the Knowledge Discovery archive at the University of California at Irvine (kdd.ics.uci.edu).

Wind speeds: We have used daily wind speeds from twelve sites in Ireland, from 1961 to 1978, obtained from an archive at Carnegie Mellon University (lib.stat.cmu.edu).

Electroencephalograms: We have utilized EEG obtained by Henri Begleiter at the Neurodynamics Laboratory of the SUNY Health Center at Brooklyn. These data are from sixty-four electrodes at standard points on the scalp; we have downloaded them from an archive at the University of California at Irvine (kdd.ics.uci.edu/datasets).

Table 1. Data sets used in the experiments.

Data set                 | Description                            | Number of series | Points per series | Total number of points
Stock prices             | 98 stocks, 2.3 years                   | 98               | 610               | 60,000
Air and sea temperatures | 68 buoys, 2 sensors per buoy, 18 years | 136              | 1,800–6,570       | 450,000
Wind speeds              | 12 stations, 18 years                  | 12               |                   | 79,000
EEG                      | 64 electrodes, 1 second                | 64               | 256               | 16,000
2. Previous Work

We review related work on the comparison and indexing of time series.

Feature sets. Researchers have considered various feature sets for compressing time series and measuring similarity between them. They have extensively studied Fourier transforms, which allow fast compression [André-Jönsson and Badal (1997), Stoffer (1999)]; however, this technique has several disadvantages: it smoothes local extrema, which may lead to a loss of important information, and it does not work well for erratic series [Ikeda et al. (1999)], which makes it inapplicable in many domains. Chan and his colleagues applied Haar wavelet transforms to time series and showed several advantages of this technique over Fourier transforms [Chan and Fu (1999), Chan et al. (2003)].

Researchers have also studied the use of small alphabets for compression of time series, and applied string matching to the pattern search [Agrawal et al. (1995), Huang and Yu (1999), Lam and Wong (1998), Qu et al. (1998)]. For example, Singh and McAtackney (1998) represented stock prices, particle dynamics, and stellar light intensity using a three-letter alphabet; Guralnik et al. (1997) compressed stock prices using a nine-letter alphabet; Lin and Risch (1998) used a two-letter alphabet to encode major spikes in a series; and Das et al. (1998) utilized an alphabet of primitive shapes for efficient compression. These techniques give a high compression rate, but their descriptive power is limited.

Keogh and Pazzani (1997, 1998) used the endpoints of best-fit line segments to compress a series, and Keogh et al. (2001) reviewed the compression techniques based on approximation of a time series by a sequence of straight segments. Perng et al. (2000) investigated a compression technique based on extraction of "landmark points," which included local minima and maxima. Guralnik and Srivastava (1999) considered the problem of detecting a change in the trend of a data stream, and developed a technique for finding "change points" in a series. Last et al. (2001) proposed a general framework for knowledge discovery in time series, which included representation of a series by its key features, such as slope and signal-to-noise ratio; they described a technique for computing these features and identifying the points of change in the feature values. We describe an alternative compression technique, based on selection of important minima and maxima.
Similarity measures. Several researchers have defined similarity as the distance between points in a feature space. For example, Caraca-Valente and Lopez-Chavarrias (2000) used Euclidean distance between feature vectors containing angle of knee movement and muscle strength, and Lee et al. (2000) applied Euclidean distance to compare feature vectors containing color, texture, and shape of video data. This technique works well when all features have the same units of scale [Goldin and Kanellakis (1995)], but it is often ineffective for combining disparate features.

Alternatively, we can measure a point-by-point similarity of two series and then aggregate these measures, which often requires interpolation of missing points. Keogh and Pazzani (1998) used linear interpolation with this technique, and Lee et al. (2000) applied cubic approximation. Keogh and Pazzani (2000) also described a point-by-point similarity with modified Euclidean distance, which does not require interpolation.

An alternative definition of similarity is based on bounding rectangles: two series are similar if their bounding rectangles are similar. This definition allows fast pruning of clearly dissimilar series [Perng et al. (2000)], but it is less effective for selecting the most similar series. The envelope-count technique is based on dividing a series into short segments, called envelopes, and defining a yes/no similarity for each envelope. Two series are similar within an envelope if their point-by-point differences are within a certain threshold, and the overall similarity is measured by the number of envelopes where the series are similar [Agrawal et al. (1996)]. This measure allows fast computation of similarity, and it can be adapted for noisy and missing data [Das et al. (1997), Bollobas et al. (1997)].

Indexing and retrieval. Researchers have studied a variety of techniques for indexing of time series. For example, Deng (1998) applied kd-trees to arrange series by their significant features; Chan and Fu (1999) combined wavelet transforms with R-trees; and Bozkaya and her colleagues used vantage-point trees for indexing series by numeric features [Bozkaya et al. (1997), Bozkaya and Ozsoyoglu (1999)]. Aggarwal and Yu (2000) considered grid structures, but found that the grid performance is often no better than exhaustive search. Li et al. (1998) proposed a retrieval technique based on a multilevel abstraction hierarchy of features, and Perng et al. (2001) indexed series by their local extrema and by properties of the segments between consecutive extrema. They also showed that exhaustive search among compressed series is often faster than sophisticated indexing techniques.
3. Important Points

We compress a time series by selecting some of its minima and maxima, as well as the first and last point of the series, and dropping the other points (Figure 2). The intuitive idea is to discard minor fluctuations and keep major minima and maxima.

Fig. 2. Important points for 91% compression (left) and 94% compression (right).

We control the compression rate with a parameter R, which is always greater than one; an increase of R leads to selection of fewer points. Intuitively, a point am of a series a1, . . . , an is an important minimum if am is the minimal value of some segment ai, . . . , aj, and the endpoint values of this segment are much larger than am (Figure 3). Formally, am is an important minimum if there are indices i and j, where i ≤ m ≤ j, such that

• am is the minimum among ai, . . . , aj, and
• ai/am ≥ R and aj/am ≥ R.

Similarly, am is an important maximum if there are indices i and j, where i ≤ m ≤ j, such that

• am is the maximum among ai, . . . , aj, and
• am/ai ≥ R and am/aj ≥ R.

Fig. 3. Important minimum (left) and important maximum (right).

In Figure 4, we give a procedure for selecting important points, which takes linear time and constant memory. It outputs the values and indices of all important points. This procedure can process new points as they arrive, without storing the
original series; for example, it can compress a live electroencephalogram without waiting until the end of the data collection. We process a series a1, . . . , an and use a global variable n to represent its size. The input is a time series a1, . . . , an; the output is the values and indices of the selected points.

IMPORTANT-POINTS — Top-level function for finding important points.
    output (a1, 1)
    i = FIND-FIRST
    if i < n and ai > a1 then i = FIND-MAXIMUM(i)
    while i < n do
        i = FIND-MINIMUM(i)
        i = FIND-MAXIMUM(i)
    output (an, n)

FIND-FIRST — Find the first important point.
    iMin = 1; iMax = 1; i = 2
    while i < n and ai/aiMin < R and aiMax/ai < R do
        if ai < aiMin then iMin = i
        if ai > aiMax then iMax = i
        i = i + 1
    if iMin < iMax then output (aiMin, iMin)
    else output (aiMax, iMax)
    return i

FIND-MINIMUM(i) — Find the first important minimum after the ith point.
    iMin = i
    while i < n and ai/aiMin < R do
        if ai < aiMin then iMin = i
        i = i + 1
    if i < n or aiMin < ai then output (aiMin, iMin)
    return i

FIND-MAXIMUM(i) — Find the first important maximum after the ith point.
    iMax = i
    while i < n and aiMax/ai < R do
        if ai > aiMax then iMax = i
        i = i + 1
    if i < n or aiMax > ai then output (aiMax, iMax)
    return i

Fig. 4. Compression procedure.

We have implemented this procedure in Visual Basic 6 and tested it on a 300MHz PC; for an n-point series, the compression time is 14 · n microseconds.
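The chapter's implementation is in Visual Basic; as an unofficial illustration, the procedure of Figure 4 translates almost line for line into Python (1-based indices as in the pseudocode; the values are assumed to be positive, since the ratio tests require it, and R > 1).

```python
def important_points(a, R):
    """Select important minima and maxima of a positive series a, as in Fig. 4.

    Returns (value, index) pairs with 1-based indices; R > 1 controls the
    compression rate.
    """
    n = len(a)
    out = [(a[0], 1)]                      # output (a1, 1)

    def find_first():
        i_min, i_max, i = 1, 1, 2
        while i < n and a[i - 1] / a[i_min - 1] < R and a[i_max - 1] / a[i - 1] < R:
            if a[i - 1] < a[i_min - 1]:
                i_min = i
            if a[i - 1] > a[i_max - 1]:
                i_max = i
            i += 1
        if i_min < i_max:
            out.append((a[i_min - 1], i_min))
        else:
            out.append((a[i_max - 1], i_max))
        return i

    def find_minimum(i):
        i_min = i
        while i < n and a[i - 1] / a[i_min - 1] < R:
            if a[i - 1] < a[i_min - 1]:
                i_min = i
            i += 1
        if i < n or a[i_min - 1] < a[i - 1]:
            out.append((a[i_min - 1], i_min))
        return i

    def find_maximum(i):
        i_max = i
        while i < n and a[i_max - 1] / a[i - 1] < R:
            if a[i - 1] > a[i_max - 1]:
                i_max = i
            i += 1
        if i < n or a[i_max - 1] > a[i - 1]:
            out.append((a[i_max - 1], i_max))
        return i

    i = find_first()
    if i < n and a[i - 1] > a[0]:
        i = find_maximum(i)
    while i < n:
        i = find_minimum(i)
        i = find_maximum(i)
    out.append((a[n - 1], n))              # output (an, n)
    return out


print(important_points([10, 9, 10.5, 12, 8, 13], R=1.25))
# -> [(10, 1), (9, 2), (12, 4), (8, 5), (13, 6)]
```

As in the pseudocode, the scan alternates between searching for the next important minimum and the next important maximum, so it uses constant memory beyond the output.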
We have applied the compression procedure to the data sets in Table 1. We have experimented with different compression rates, which are defined as the percentage of points removed from a series; for example, "eighty-percent compression" means that we select 20% of points and discard the other 80%. For each compression technique, we have measured the difference between the original series and the series interpolated from the compressed version. We have considered three measures of difference between the original series, a1, . . . , an, and the interpolated series, b1, . . . , bn:

Mean difference: (1/n) · Σ_{i=1}^{n} |ai − bi|

Root mean square difference: ( (1/n) · Σ_{i=1}^{n} (ai − bi)^2 )^{1/2}

Maximum difference: max_{i ∈ [1..n]} |ai − bi|

We have compared the compression based on important points with two simple techniques: equally spaced points and randomly selected points. We summarize the results in Table 2, which shows that important points are significantly more accurate than the two simple methods.

4. Similarity Measures

We define similarity between time series; we review three basic measures of similarity and then propose a new measure, which underlies the retrieval procedure. We measure similarity on a zero-to-one scale, where zero means no likeness and one means perfect likeness. First, we define similarity between two numbers, a and b:

sim(a, b) = 1 − |a − b| / (|a| + |b|)

The mean similarity between two series, a1, . . . , an and b1, . . . , bn, is the mean of their point-by-point similarity:

(1/n) · Σ_{i=1}^{n} sim(ai, bi)

We also define the root mean square similarity:

( (1/n) · Σ_{i=1}^{n} sim(ai, bi)^2 )^{1/2}

In addition, we consider the correlation coefficient, which is a standard statistical measure of similarity. It ranges from −1 to 1, but we can convert it to the zero-to-one scale by adding one and dividing by two.
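For concreteness, the difference and similarity measures defined above can be written out in a few lines of Python (our sketch, not the authors' Visual Basic code; the function names are ours):

```python
import math

def sim(a, b):
    """Similarity of two numbers: 1 - |a - b| / (|a| + |b|)."""
    return 1.0 - abs(a - b) / (abs(a) + abs(b))

def mean_similarity(x, y):
    """Mean of the point-by-point similarities of two equal-length series."""
    return sum(sim(a, b) for a, b in zip(x, y)) / len(x)

def rms_similarity(x, y):
    """Root mean square of the point-by-point similarities."""
    return math.sqrt(sum(sim(a, b) ** 2 for a, b in zip(x, y)) / len(x))

def mean_difference(x, y):
    return sum(abs(a - b) for a, b in zip(x, y)) / len(x)

def rms_difference(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)) / len(x))

def max_difference(x, y):
    return max(abs(a - b) for a, b in zip(x, y))

x, y = [1.0, 2.0, 3.0], [1.0, 2.0, 5.0]
print(max_difference(x, y))             # -> 2.0
print(round(mean_similarity(x, y), 4))  # -> 0.9167
```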
Table 2. Accuracy of three compression techniques. We give the average difference between an original series and its compressed version using the three difference measures; smaller differences correspond to more accurate compression. The table covers the compression based on important points, equally spaced ("fixed") points, and randomly selected points, at eighty-, ninety-, and ninety-five-percent compression, for the stock prices, air temperatures, sea temperatures, wind speeds, and EEG data sets.
4. Similarity Measures

We define similarity between time series on a zero-to-one scale. We review three basic measures of similarity (the mean similarity, the root mean square similarity, and the correlation coefficient, which is a standard statistical measure of similarity) and then propose a new measure, called the peak similarity.

For series a1, ..., an and b1, ..., bn, with mean values ma = (a1 + · · · + an)/n and mb = (b1 + · · · + bn)/n, the correlation coefficient is

  Σ_{i=1..n} (ai − ma) · (bi − mb)  /  ( sqrt(Σ_{i=1..n} (ai − ma)²) · sqrt(Σ_{i=1..n} (bi − mb)²) ).

Since the correlation coefficient ranges between minus one and one, we convert it to the zero-to-one scale by adding one and dividing by two.

We next define a new similarity measure, called the peak similarity. For numbers a and b, where a ≥ b, their peak similarity is

  psim(a, b) = 1 − (a − b) / (2 · max(a, b)).

In Figure 5, we show the meaning of this definition for a ≥ b, based on the illustrated triangle: we draw the vertical line through b to the intersection with the triangle's side, and the ordinate of the intersection is the similarity of a and b. The peak similarity of two series is the mean similarity of their points: (psim(a1, b1) + · · · + psim(an, bn))/n.

Fig. 5. Peak similarity of numbers a and b.

We next give an empirical comparison of the four similarity measures. For each series, we have found the five most similar series, and then determined the mean difference between the given series and the other five. In Table 3, we summarize the results and compare them with the perfect exhaustive-search selection and with random selection. The results show that the peak similarity performs better than the other measures, and that the correlation coefficient is the least effective.

We have also used the four similarity measures to identify close matches for each series, and compared the results with ground-truth neighborhoods. For stocks, we have considered small neighborhoods formed by industry subgroups, as well as large neighborhoods formed by industry groups, according to Standard and Poor's classification. For air and sea temperatures, we have used geographic proximity to define two ground-truth neighborhoods: the first neighborhood is the 1 × 5 rectangle in the grid of buoys, and the second is the 3 × 5 rectangle. For wind speeds, we have also used geographic proximity: the first neighborhood includes all sites within 70 miles, and the second includes the sites within 140 miles.
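The peak similarity and the rescaled correlation coefficient described above can be sketched in Python (an illustrative sketch, not the chapter's code; `psim` assumes positive values, matching the a ≥ b case the chapter illustrates):

```python
def psim(x, y):
    # Peak similarity of two positive numbers:
    # psim(x, y) = 1 - |x - y| / (2 * max(x, y)).
    return 1.0 - abs(x - y) / (2.0 * max(x, y))

def peak_similarity(a, b):
    # Peak similarity of two series: the mean of point-by-point similarities.
    return sum(psim(x, y) for x, y in zip(a, b)) / len(a)

def scaled_correlation(a, b):
    # Correlation coefficient, rescaled to the zero-to-one scale
    # by adding one and dividing by two.
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    num = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    den = (sum((x - ma) ** 2 for x in a) *
           sum((y - mb) ** 2 for y in b)) ** 0.5
    return (num / den + 1.0) / 2.0
```

For example, psim(2.0, 2.0) is 1, psim(1.0, 3.0) is 1 − 2/6 = 2/3, and two perfectly correlated series get a scaled correlation of 1.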
Table 3. Differences between selected similar series. For each given series, we have selected the five most similar series, and then measured the mean difference between the given series and the other five; smaller differences correspond to better selection. We also show the running times of selecting similar series, in seconds. The rows correspond to the similarity measures (perfect selection, random selection, peak similarity, mean similarity, root mean square similarity, and correlation coefficient) at 90% and 95% compression; the columns give the mean difference, the maximum difference, and the time for the stock prices, air temperatures, sea temperatures, wind speeds, and EEG data sets.
For electroencephalograms, the first neighborhood is the 3 × 3 rectangle in the grid of electrodes, and the second is the 5 × 5 rectangle. For each series, we have found the five closest matches, and then determined the average number of matches that belong to the same neighborhood. In Table 4, we give the results and compare them with the perfect selection and random selection; larger numbers correspond to better selections. The results show that the peak similarity is usually more effective than the other three similarities. If we use the 90% compression, the peak similarity outperforms the other measures on the stocks, air and sea temperatures, and EEG; however, it loses to the mean similarity and correlation coefficient on the small-neighborhood selection for the wind speeds. If we use the 95% compression, the peak similarity gives better selection than the other similarity measures for the stock prices, temperatures, EEG, and the large-neighborhood selection for the wind speeds; however, it gives worse results for the small neighborhoods of the wind speeds.

We have also checked how well the peak similarity of original series correlates with the peak similarity of their compressed versions (Figure 6). We have observed a good linear correlation, which gracefully degrades with an increase of compression rate.

5. Pattern Retrieval

We give an algorithm that inputs a pattern series and retrieves similar series from a database. It includes three steps: identifying a "prominent feature" of the pattern, finding similar features in the database, and comparing the pattern with each series that has a similar feature.

We begin by defining a leg of a series, which is the segment between two consecutive important points. For each leg, we store the values listed in Figure 7; we give an example of these values in Figure 8. The prominent leg of a pattern is the leg with the greatest endpoint ratio.

The retrieval procedure inputs a pattern and searches for similar segments in a database (Figure 9). First, it finds the pattern leg with the greatest endpoint ratio, denoted ratiop, and determines the length of this leg, denoted lengthp. Next, it identifies all legs in the database that have a similar endpoint ratio and length. A leg is considered similar to the pattern leg if its ratio is between ratiop/C and ratiop · C, and its length is between lengthp/D and lengthp · D, where C and D are parameters for controlling the search.
Table 4. Finding members of the same neighborhood. For each series, we have found the five closest matches, and then determined the average number of the series among them that belong to the same ground-truth neighborhood. The rows correspond to the similarity measures (perfect selection, random selection, peak similarity, mean similarity, root mean square similarity, and correlation coefficient) at 90% and 95% compression; the columns correspond to the two ground-truth neighborhoods for the stocks, air temperatures, sea temperatures, wind speeds, and EEG data sets.
Fig. 6. Correlation between the peak similarity of original series and the peak similarity of their compressed versions. We show the correlation for three compression rates (80%, 90%, and 95%) for the stock prices, air temperatures, sea temperatures, wind speeds, and EEG data sets; the observed correlations range from .57 to .98.

We index all legs in the database by their ratio and length using a range tree, which is a standard structure for indexing points by two coordinates [Edelsbrunner (1981), Samet (1990)]. If the total number of legs is l, and the number of retrieved legs with an appropriate ratio and length is k, then the retrieval time is O(k + lg l).

Fig. 7. Basic data for a leg:
  vl      value of the left important point of the leg
  vr      value of the right important point of the leg
  il      index of the left important point
  ir      index of the right important point
  ratio   ratio of the endpoints, defined as vr/vl
  length  length of the leg, defined as ir − il
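The leg data of Figure 7 and the ratio/length selection window can be sketched in Python (an illustration with hypothetical helper names; a real implementation would answer the window query with a range tree in O(k + lg l) time rather than with the linear scan used here):

```python
from dataclasses import dataclass

@dataclass
class Leg:
    vl: float  # value of the left important point of the leg
    vr: float  # value of the right important point of the leg
    il: int    # index of the left important point
    ir: int    # index of the right important point

    @property
    def ratio(self) -> float:
        # Endpoint ratio, defined as vr / vl.
        return self.vr / self.vl

    @property
    def length(self) -> int:
        # Leg length, defined as ir - il.
        return self.ir - self.il

def legs_of(important):
    # Build the legs from consecutive important points,
    # given as (index, value) pairs in left-to-right order.
    return [Leg(v0, v1, i0, i1)
            for (i0, v0), (i1, v1) in zip(important, important[1:])]

def select_legs(legs, ratio_p, length_p, C, D):
    # Stand-in for the range-tree query: keep the legs whose ratio lies in
    # [ratio_p / C, ratio_p * C] and whose length lies in
    # [length_p / D, length_p * D].
    return [g for g in legs
            if ratio_p / C <= g.ratio <= ratio_p * C
            and length_p / D <= g.length <= length_p * D]

# The legs of Figure 8, recovered from their important points:
important = [(9, 0.6), (15, 1.4), (21, 1.6), (27, 1.0)]
legs = legs_of(important)
```

With C = D = 2 and a pattern leg of ratio 2.33 and length 6, only the first of these three legs falls inside the selection window.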
Fig. 8. Example legs. We show the basic data for the two legs marked by thick lines: the first leg has vl = 0.6, vr = 1.4, il = 9, ir = 15, ratio = 2.33, and length = 6; the second leg has vl = 1.6, vr = 1.0, il = 21, ir = 27, ratio = 0.62, and length = 6.

We use three parameters to control the search: maximal ratio deviation C, maximal length deviation D, and similarity threshold T. After selecting the legs with an appropriate ratio and length, the procedure identifies the segments that contain the selected legs (Figure 10) and computes their similarity to the pattern. If the similarity is above the threshold T, the procedure outputs the segment as a match; thus, the output is a list of segments from the database that match the pattern.

PATTERN-RETRIEVAL
The procedure inputs a pattern series and searches a time-series database.
  Identify the pattern leg p with the greatest endpoint ratio, denoted ratiop.
  Determine the length of this pattern leg, denoted lengthp.
  Find all legs in the database that satisfy the following conditions:
    • their endpoint ratios are between ratiop/C and ratiop · C, and
    • their lengths are between lengthp/D and lengthp · D.
  For each leg in the set of selected legs:
    Identify the segment corresponding to the pattern (Figure 10).
    Compute the similarity between the segment and the pattern.
    If the similarity is above a given threshold T, then output the segment.

Fig. 9. Search for segments similar to a given pattern.

Fig. 10. Identifying a segment that may match the pattern: (a) prominent leg in a pattern; (b) similar leg in a series; (c) align the right end of these legs; (d) identify the respective segment in the series.

Finally, we give an example of a stock-price pattern and similar segments retrieved from the stock database (Figure 11).
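The retrieval loop of Figure 9, including the right-end alignment of Figure 10, might look as follows (a sketch with hypothetical names, not the chapter's implementation; legs are (vl, vr, il, ir) tuples, a linear scan stands in for the range-tree query, and the peak similarity serves as the similarity function):

```python
def peak_similarity(a, b):
    # Mean peak similarity of two equal-length, positive-valued segments.
    return sum(1 - abs(x - y) / (2 * max(x, y)) for x, y in zip(a, b)) / len(a)

def pattern_retrieval(pattern, pattern_legs, series, series_legs, C, D, T):
    """Sketch of the PATTERN-RETRIEVAL procedure of Figure 9.

    Legs are (vl, vr, il, ir) tuples with ratio = vr / vl and
    length = ir - il. The candidate segment is cut out of the series
    by aligning the right ends of the matched leg and the pattern's
    prominent leg, as in Figure 10.
    """
    vl, vr, il, ir = max(pattern_legs, key=lambda g: g[1] / g[0])
    ratio_p, length_p = vr / vl, ir - il
    matches = []
    for wl, wr, jl, jr in series_legs:
        ratio, length = wr / wl, jr - jl
        if not (ratio_p / C <= ratio <= ratio_p * C):
            continue
        if not (length_p / D <= length <= length_p * D):
            continue
        start = jr - ir  # align the right ends of the two legs
        segment = series[start:start + len(pattern)]
        if (start >= 0 and len(segment) == len(pattern)
                and peak_similarity(segment, pattern) >= T):
            matches.append((start, segment))
    return matches
```

For instance, a four-point pattern embedded verbatim in a longer series is recovered at its true offset, since the aligned segment has peak similarity 1.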
Fig. 11. Example of retrieved stock charts.

The described procedure can miss a matching segment that does not have a leg corresponding to the pattern's prominent leg. We illustrate this problem in Figure 12, where the prominent leg of the pattern has no equivalent in the matching series: the pattern (a) matches the series (b), but the pattern's prominent leg has no equivalent in the series. To avoid this problem, we introduce the notion of an extended leg, which is a segment that would be a leg under a higher compression rate (Figure 12c). We identify all extended legs, and index them in the same way as normal legs; if we identify the extended legs (c), the prominent leg matches one of them. The advantage of this approach is more accurate retrieval, and the disadvantage is a larger indexing structure.

Formally, points ai and aj of a series a1, ..., an form an extended upward leg if
  • ai is a local minimum,
  • aj is a local maximum, and
  • for every m ∈ [i, j], we have ai < am < aj.
The definition of an extended downward leg is similar.

Fig. 12. Example of extended legs: (a) prominent leg of a pattern; (b) dissimilar legs; (c) extended legs.

In Figure 13, we give an algorithm for identifying upward extended legs; the procedure for finding downward extended legs is similar. We assume that normal upward legs in the input series are numbered from 1 to l. First, the procedure processes important maxima: for each maximum irk, it identifies the next larger maximum and stores its index in next[k]. Second, it uses this information to identify extended legs.
E. Fink and K. B. Pratt
EXTENDED-LEGS
The input is the list of legs in a series; the output is a list of all extended legs. We denote the series values by a, and the left and right endpoints of leg k by il_k and ir_k; the first pass compares the values of the maxima, a[ir_k].

    initialize an empty stack S of leg indices
    PUSH(S, 1)
    for k = 2 to l do
        while S is not empty and a[ir_TOP(S)] < a[ir_k] do
            next[TOP(S)] = k; POP(S)
        PUSH(S, k)
    while S is not empty do
        next[TOP(S)] = NIL; POP(S)
    initialize an empty list of extended legs
    for k = 1 to l − 1 do
        m = next[k]
        while m is not NIL do
            add (il_k, ir_m) to the list of extended legs
            m = next[m]

Fig. 13. Identifying extended legs. We assume that normal legs are numbered 1 to l.
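As a cross-check, the two passes of EXTENDED-LEGS can be written in Python (a sketch; representing legs as (il, ir) index pairs into a value array `a` is an assumption for illustration, not the chapter's own data structure):

```python
def extended_upward_legs(legs, a):
    """Identify extended upward legs, following the two passes of Figure 13.

    legs: normal upward legs as (il, ir) index pairs, ordered left to right;
    a: the series values. The first pass uses a stack to find, for each
    leg's maximum a[ir], the first later leg with a larger maximum; the
    second pass chains these next-pointers to emit the extended legs.
    """
    if not legs:
        return []
    l = len(legs)
    next_larger = [None] * l
    stack = [0]  # stack of leg indices
    for k in range(1, l):
        while stack and a[legs[stack[-1]][1]] < a[legs[k][1]]:
            next_larger[stack.pop()] = k
        stack.append(k)
    # Legs left on the stack have no later, larger maximum (next = NIL).
    extended = []
    for k in range(l - 1):
        m = next_larger[k]
        while m is not None:
            extended.append((legs[k][0], legs[m][1]))
            m = next_larger[m]
    return extended
```

For the series a = [0, 3, 1, 2, 0.5, 4] with normal upward legs (0, 1), (2, 3), and (4, 5), both earlier maxima are exceeded by the final maximum, so the procedure emits the extended legs (0, 5) and (2, 5).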
The running time of the first part is linear in the number of normal legs, and the time of the second part is linear in the number of extended legs. To evaluate the retrieval accuracy, we have compared the retrieval results with the matches identified by a slow exhaustive search. We have ranked the matches found by the retrieval algorithm from most to least similar. In Figures 14 and 15, we plot the ranks of matches found by the fast algorithm versus the ranks of exhaustive-search matches. For instance, if the fast algorithm has found only three among seven closest matches, the graph includes the point (3, 7). Note that the fast algorithm never returns "false positives" since it verifies each candidate match. The retrieval time grows linearly with the pattern length and with the number of candidate segments identified at the initial step of the retrieval algorithm. If we increase C and D, the procedure finds more candidates and misses fewer matches, but the retrieval takes more time. In Table 5, we give the mean number of candidate segments in the retrieval experiments, for three different values of C and D; note that this number does not depend on the pattern length. We have measured the speed of a Visual Basic implementation on a 300 MHz PC. If the pattern has m legs, and the procedure identifies k candidate matches, then the retrieval time is 70 · m · k microseconds. For the stock
database with 60,000 points, the retrieval takes from 0.1 to 2.5 seconds. For the database of air and sea temperatures, which includes 450,000 points, the time is between 1 and 10 seconds.

Fig. 14. Retrieval of stock charts. The horizontal axes show the ranks of matches retrieved by the fast algorithm; the vertical axes are the ranks assigned to the same matches by the exhaustive search. If the fast algorithm has found all matches, the graph is a forty-five degree line; otherwise, it is steeper. We show the search for (a) four-leg, (b) six-leg, and (c) thirty-leg patterns, for C = D = 5, C = D = 2, and C = D = 1.5; the respective retrieval times range from 0.07 to 2.50 seconds.

Fig. 15. Retrieval of weather and electroencephalogram patterns. The horizontal axes show the similarity ranks assigned by the fast algorithm, and the vertical axes are the exhaustive-search ranks. We show the search for (a) twelve-leg patterns of air and sea temperatures, (b) nine-leg patterns of wind speeds, and (c) electroencephalogram patterns with twenty legs, for C = D = 5, C = D = 2, and C = D = 1.5.

6. Concluding Remarks

The main results include a procedure for compressing time series, indexing of compressed series by their prominent features, and retrieval of series
whose compressed representation is similar to the compressed pattern. The experiments have shown the effectiveness of this technique for indexing of stock prices, weather data, and electroencephalograms. We plan to apply it to other time-series domains and study the factors that affect its effectiveness. We are working on an extended version of the compression procedure, which will assign different importance levels to the extrema of time series,
Table 5. Average number of the candidate segments that match a pattern's prominent leg. The retrieval algorithm identifies these candidates and then compares them with the pattern. The number of candidates depends on the search parameters C and D.

    Search parameters           C = D = 1.5    C = D = 2    C = D = 5
    Stock prices                        270          440        1,090
    Air and sea temperatures          1,300        2,590       11,230
    Wind speeds                         970        1,680        2,510
    EEG                                  40           70          220
and allow construction of a hierarchical indexing structure [Gandhi (2003)]. We also aim to extend the developed technique for finding patterns that are stretched over time, and apply it to identifying periodic patterns, such as weather cycles.

Acknowledgements

We are grateful to Mark Last, Eamonn Keogh, Dmitry Goldgof, and Rafael Perez for their valuable comments and suggestions. We also thank Savvas Nikiforou for his comments and help with software installations.

References
1. Aggarwal, C.C. and Yu, Ph.S. (2000). The IGrid Index: Reversing the Dimensionality Curse for Similarity Indexing in High-Dimensional Space. Proceedings of the Sixth ACM International Conference on Knowledge Discovery and Data Mining, pp. 119–129.
2. Agrawal, R., Psaila, G., Wimmers, E.L., and Zait, M. (1995). Querying Shapes of Histories. Proceedings of the Twenty-First International Conference on Very Large Data Bases, pp. 502–514.
3. Agrawal, R., Mehta, M., Shafer, J.C., Srikant, R., Arning, A., and Bollinger, T. (1996). The Quest Data Mining System. Proceedings of the Second ACM International Conference on Knowledge Discovery and Data Mining, pp. 244–249.
4. André-Jönsson, H. and Badal, D.Z. (1997). Using Signature Files for Querying Time-Series Data. Proceedings of the First European Symposium on Principles of Data Mining and Knowledge Discovery, pp. 211–220.
5. Bollobás, B., Das, G., Gunopulos, D., and Mannila, H. (1997). Time-Series Similarity Problems and Well-Separated Geometric Sets. Proceedings of the Thirteenth Annual Symposium on Computational Geometry, pp. 454–456.
6. Bozkaya, T. and Özsoyoglu, Z.M. (1999). Indexing Large Metric Spaces for Similarity Search Queries. ACM Transactions on Database Systems 24(3), 361–404.
7. Bozkaya, T., Yazdani, N., and Özsoyoglu, Z.M. (1997). Matching and Indexing Sequences of Different Lengths. Proceedings of the Sixth International Conference on Information and Knowledge Management, pp. 128–135.
. (2000).. Das.. Robotics Institute. 19. Barry.P.C. (2000). Renganathan. Proceedings of the Fifteenth International Conference on Data Engineering. D. A.. 88–100. H. 11. United Kingdom. (1998). and Riedl.. K. A.A. K. Chi. CaracaValente. 21. 34–40. KI. R. Fu. Computer Science and Engineering. Englewood Cliﬀs. A Note on Dynamic Range Searching..H.. 14. Eﬃcient Time Series Matching by Wavelets. (1997). 15. Visualization of Biological Sequence Similarity Search Results. Berlin. (2000). P. D.. and Guo. pp. Rule Discovery from Time Series.. Hancock: A Language for Extracting Signatures from Data Streams. Topics in Nonlinear Time Series Analysis with Implications for EEG Analysis. Cortes. pp.H.. T. H. Fink and K. Franses. Singapore. (1996). Germany. K. Das..P. Chan. Domingos. 15. P. 18. Gopinath.S. Brockwell. (2003). J. C. (2000). (1997). Prentice Hall. NJ. and Smyth. Edelsbrunner. Proceedings of the Sixth ACM International Conference on Knowledge Discovery and Data Mining. G. and Yu.W. pp. 9. 12. Gandhi. Discovering Similar Patterns in Time Series. (1998). Carnegie Mellon University. and LopezChavarrias. Fountain. Deng.G. A..C. (1999). H. Mining HighSpeed Data Streams. 13. 18–25. Chan. Gunopulos. Dietterich. (2000). Haar Wavelets for Eﬃcient Similarity Search of TimeSeries: With and Without Time Warping. pp. Proceedings of the Sixth ACM International Conference on Knowledge Discovery and Data Mining. (1981).S.H. H. Shoop. J. Fisher. OMEGA: OnLine MemoryBased General Purpose System Classiﬁer. World Scientiﬁc. Important Extrema of Time Series: Theory and Applications. Lin. pp. Pregibon. Introduction to Time Series and Forecasting. A. G.. C.. and Davis. Master’s thesis.. Mannila. (1998). P. 22. Introduction to Wavelets and Wavelet Transforms: A Primer.. G. R. G. Proceedings of the IEEE Conference on Visualization. B. Proceedings of the Sixth ACM International Conference on Knowledge Discovery and Data Mining.62
E. 23. E.J. Mining IC Test Data to Optimize VLSI Testing.. Carlis. I. Burrus. Technical Report CMURITR9833. Proceedings of the Fourth ACM International Conference on Knowledge Discovery and Data Mining. Time Series Models for Business and Economic Forecasting. (2003). 497–505. T. and Mannila. pp. B. Ph.. 17. Forthcoming.. PhD thesis. Retzel. Cambridge University Press. and Sudyka. 16..V. K. Bulletin of the European Association for Theoretical Computer Science. 44–51. IEEE Transactions on Knowledge and Data Engineering. SpringerVerlag. E. and Rogers. P. J. Forthcoming. Pratt
8. 20. Proceedings of the First European Conference on Principles of Data Mining and Knowledge Discovery. University of South Florida. H. Finding Similar Time Series. Proceedings of the Sixth ACM International Conference on Knowledge Discovery and Data Mining. Cambridge. Galka.W. pp. 9–17. E. 71–80. and Hulten. 16–22. pp. C. 10.A. and Fu. 126–133. (1995).P.
.
. SpaceTime Modeling with LongMemory Dependence: Assessing Ireland’s Wind Power Resource. and Srivastava. K.C. and Quint. W. 30. An Introduction to Wavelets.. A.J. Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Y. 34. Proceedings of the First International Conference on Principles and Practice of Constraint Programming. A. 3–11. Graps.J. Scaling up Dynamic Time Warping for Data Mining Applications. 35. Anguelov. Adaptive Query Processing for TimeSeries Data. Huang. MD. 38. 31. Gavrilov. Proceedings of the Sixth ACM International Conference on Knowledge Discovery and Data Mining. (1998). Ikeda. IEEE Computational Science and Engineering.. Pattern Recognition Letters. and Kanellakis. Haslett.R. D. J.W. (1998).V. 282–286. P. J. and Sequences: Computer Science and Computational Biology. (1989). Baltimore.J. Gusﬁeld. 2(2). Proceedings of the Sixth ACM International Conference on Knowledge Discovery and Data Mining. (1995). 33–42. Cambridge. and Pazzani. Clustering and Relevance Feedback. pp. Wijesekera. pp. Johns Hopkins University Press. R. Keogh. pp. (1998)..E. Mining SegmentWise Periodic Patterns in TimeRelated Databases. (1999). Trees. 50–61. 29. Gong. and Srivastava. An Enhanced Representation of Time Series which Allows Fast and Accurate Classiﬁcation. Elements of Pattern Theory. pp. 33. pp.. 214–218.. and Yin. 487–496. P. Geva.. (1996). M. 137–153. (1997). Proceedings of the Fifth International Conference on Knowledge Discovery and Data Mining.. pp. and Raftery. Guralnik.. 1519–1532. U. P. Cambridge University Press. 27. Han. J. V. Grenander. 1–50. and Pazzani. A. E. E. Wavelet Decomposition of Heart Period Data. 26. (1999). Proceedings of the Fourth ACM International Conference on Knowledge Discovery and Data Mining. 28. 25.Q. On Similarity Queries for TimeSeries Data: Constraint Speciﬁcation and Implementation.J. D. Goldin. Mining the Stock Market: Which Measure is Best. 
HierarchicalFuzzy Clustering of Temporal Patterns and its Application for TimeSeries Prediction. Applied Statistics. V. Y.. pp.
. Guralnik. 239–243.B. (2000). (2000). 51–57. Event Detection from Time Series Data. Proceedings of the Fourth ACM International Conference on Knowledge Discovery and Data Mining. 36.. M.Indexing of Compressed Time Series
63
24. J. Vaughn. 20(14). Proceedings of the IEEE First Joint BMES/EMBS Conference. S. M. 285–289.. (1999). D. United Kingdom.L. and Motwani. pp.. and Yu. 37. pp. (1995). (1999). 32. Indyk. PatternDirected Mining of Sequence Data.. D. Algorithms on Strings. Proceedings of the Fourth ACM International Conference on Knowledge Discovery and Data Mining.S. Keogh. B.
K.J. (2000). MALM: A Framework for Mining Sequence Database at Multiple Abstraction Levels. W. (1998). 23–32.. 39. 42. S. C. (1991). pp. 28(3). and Feng.. E. Eﬃcient Searches for Similar Subsequences of Diﬀerent Lengths in Sequence Databases. and Chu.. 41. 47. Lippincott. pp. 43. and Chung.W. Yu. and Pazzani. pp. pp. 160–169. An Online Algorithm for Segmenting Time Series. Stock Movement Prediction and N Dimensional InterTransaction Association Rules. and Risch. V. and Chu. Chu. and Hsu. and Wong.. Keogh. Yoon. Niedermeyer. Lee... Man. A Fast Projection Algorithm for Sequence Data Searching.L. E. M. 49. Similarity Search for Multidimensional Data Sequences. Y. Last. Williams and Wilkins.W. S.W. pp. 48. Electroencephalography: Basic Principles. Clinical Applications. Fast Similarity Search in the Presence of Longitudinal Scaling in Time Series Databases.W.. P. 44. 321–339. (1998). pp. Proceedings of the Third IEEE Knowledge and Data Engineering Exchange Workshop. Lin. Pratt
38..64
E. 248–252. (1998). pp. Querying Continuous Time Sequences. (2001). Maurer. M.J. D. SegmentBased Approach for Subsequence Searches in Sequence Databases. Germany.. 46. S. Proceedings of the Sixteenth ACM Symposium on Applied Computing.. H. pp. A. and Chu. Proceedings of the Ninth IEEE International Conference on Tools with Artiﬁcial Intelligence.S. and Kandel. Chun.W.. W. 607–614. 267–272. Proceedings of the TwentyFourth International Conference on Very Large Data Bases.. Lee. 599–608. MD. 170–181. S. Han. IEEE Transactions on Systems. S. (1999). Park. 51. (1999).. Proceedings of the ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery. S. Fast Retrieval of Similar Subsequences in Long Sequence Databases. Fink and K. W. (1997). S. Park. L.S. Lam.W.K..J. Keogh. 50. Park.. E. Park. 12:1–7. S. pp. and Castelli. S. and Dierks. 578–584. D. 31(1). and Related Fields. Berlin. M. Kim. T. Proceedings of the IEEE International Conference on Data Mining. (1998). J. An IndexBased Approach for Similarity Search Supporting Time Warping in Large Sequence Databases. (2001). Chu. fourth edition.. Data and Knowledge Engineering. W. Lee. 40. Li. Lu. Proceedings of the Sixteenth International Conference on Data Engineering. SpringerVerlag. C. (2000).H.H. Proceedings of the Seventeenth International Conference on Data Engineering. C. S.. Klein. 45.. Atlas of Brain Mapping: Topographic Mapping of EEG and Evoked Potentials... Kim. 289–296. Baltimore. J. Kim. and Lopes DaSilva..
. Proceedings of the Seventh International Conference on Information and Knowledge Management.. Proceedings of the Sixteenth International Conference on Data Engineering. F.W... Hart.J. (2001).H. L. B. (2001). T. D.. Knowledge Discovery in Time Series Databases. J. and Cybernetics. Part B.
and Chen. T.S. Wang. P... Reading. E. and Wang.Indexing of Compressed Time Series
65
52. and McAtackney. (1998). Online Data Mining for CoEvolving Time Series. 13–22.K. H.. 339–343. P.
.B. (2001). CA.R.S. Supporting Fast Search in Time Series for Movement Patterns in Multiple Scales. Computer Science and Engineering. Master’s thesis. 428–439. (1998).V.A. 94. (2000). AddisonWesley. 57. pp. Y. A.. (2000).. A. 54. N. H. Wang. Samet. Dynamic TimeSeries Forecasting Using Local Approximation. 53. and Geva. and Cybernetics. pp. B.S. 89–106. Landmarks: A New Model for SimilarityBased Pattern Querying in Time Series Databases. Jagadish. (2002).. pp. 63. Locating patterns in discrete time series. and Biliris. A. R. and McGee. Stoﬀer. 60. Graphics and Image Processing. Yi. Introduction to Time Series Analysis and Forecasting.K. K. Sahoo. and Scott Parker. 56.. 58.B. Singh. C. Wong. Pratt. 1341–1356. pp. Faloutsos. Yaﬀee. (2000). Man. H. Proceedings of the Sixteenth International Conference on Data Engineering.C. 59. 33–42. (1999). S. San Diego. 392–399.. Search for Patterns in Compressed Time Series. 2(1). Y. S.. and Fink. X. S. Journal of the American Statistical Association. 62. WaveCluster: A MultiResolution Clustering Approach for Very Large Spatial Databases. (1990). 30(2)... Sidiropoulos. Johnson. Soltani... D. Proceedings of the TwentyFourth International Conference on Very Large Data Bases. Part B.. Zhang. M. 251–258. International Journal of Image and Graphics. A Survey of Thresholding Techniques. Policker. Sheikholeslami.. Pratt. Perng. Proceedings of the Sixteenth International Conference on Data Engineering. S. K. (2000). D. Proceedings of the Tenth IEEE International Conference on Tools with Artiﬁcial Intelligence. Academic Press. 41. Qu. The Design and Analysis of Spatial Data Structures. NonStationary Time Series Analysis by Temporal Clustering. (1998).B. C. (1988). A.K. Computer Vision. 55. and Zhang. S.. Detecting Common Signals in Multiple Time Series Using the Spectral Envelope.D.C. Chatterjee. MA. 61. G. Proceedings of the Seventh International Conference on Information and Knowledge Management. University of South Florida. 233–260. pp. 
IEEE Transactions on Systems. C.
CHAPTER 4

INDEXING TIME-SERIES UNDER CONDITIONS OF NOISE

Michail Vlachos and Dimitrios Gunopulos
Department of Computer Science and Engineering
Bourns College of Engineering, University of California, Riverside
Riverside, CA 92521, USA
Email: {mvlachos, dg}@cs.ucr.edu

Gautam Das
Microsoft Research
One Microsoft Way, Redmond, WA 98052, USA
Email: gautamd@microsoft.com

We present techniques for the analysis and retrieval of time series under conditions of noise. This is an important topic because the data obtained using various sensors (examples include GPS data or video tracking data) are typically noisy, and the performance of previously used measures is generally degraded under noisy conditions. Here we formalize non-metric similarity functions based on the Longest Common Subsequence that are very robust to noise. Furthermore, they provide an intuitive notion of similarity between time series by giving more weight to the similar portions of the sequences. Stretching of sequences in time is allowed, as well as global translation of the sequences in space. Efficient approximate algorithms that compute these similarity measures are also provided. We compare these new methods to the widely used Euclidean and Time Warping distance functions (for real and synthetic data) and show the superiority of our approach, especially under the strong presence of noise. We prove a weaker version of the triangle inequality and employ it in an indexing structure to answer nearest-neighbor queries. Finally, we present experimental results that validate the accuracy and efficiency of our approach.

Keywords: Longest Common Subsequence, time warping, outliers, time series, spatiotemporal.
Other examples of the data that we are interested in include features extracted from video clips.68
M. Timeseries data come up in a variety of domains. The problem is hard. sensor and GPS technology have made it possible to collect large amounts of spatiotemporal data and there is increasing interest to perform data analysis tasks over this data [4]. because the similarity model should allow for imprecise matches. users equipped with mobile devices move in space and register their location at diﬀerent time instants to spatiotemporal databases via wireless links. or to classify an object based on a set of known examples. Vlachos. the objective is to cluster diﬀerent objects into similar groups. In environmental information systems. Therefore. medical and ﬁnancial data. Web data that count the number of clicks on given sites. In general the timeseries will be obtained during a tracking procedure. Data analysis in such data includes determining and ﬁnding objects that moved in a similar way or followed a certain motion pattern.
. tracking animals and weather conditions is very common and large datasets can be created by storing locations of observed objects over time. 1. Start & ending contain many outliers. Here also
athens 1 athens 2
X movement
outliers
outliers
time
Fig. in mobile computing. the advances in mobile computing. D. Gunopulos and G. with the aid of devices such as GPS transmitters. sign language recognition. and so on. especially under the presence of noise. In the last few years. data gloves etc. or model the usage of diﬀerent pages are also modeled as time series. Introduction We consider the problem of discovering similar timeseries. including stock market analysis. For example. environmental data. Das
1. Examples of videotracked data representing 2 instances of the word ‘athens’. telecommunication data.
Characteristic examples are environmental data collected on a daily basis, or satellite image databases of the earth.¹ Here also lies the main obstacle of such data: they may contain a significant amount of outliers (Figure 1), or in other words incorrect data measurements.

If we would like to allow the user to explore these vast databases, we should organize the data in such a way that one can retrieve the data of interest accurately and efficiently. So, the problem we consider is: "Given a database D of time series and a query Q (not already in the database), find the sequence T that is closest to Q." We need to define the following:

1. A realistic distance function that will match the user's perception of what is considered similar.
2. An efficient indexing scheme, which will speed up the user queries.

We argue that the use of non-metric distance functions is preferable, since such functions can effectively 'ignore' the noisy sections of the sequences. Furthermore, our objective is the automatic classification of time series using Nearest Neighbor Classification (NNC). In NNC the time-series query is classified according to the majority of its nearest neighbors. This classification technique is particularly suited for our setting, since we are going to use a non-metric distance function. The NNC is conceptually a simple technique, but provides very good results in practical situations. Moreover, the technique has good theoretical properties: it has been shown that the one nearest neighbor rule has an asymptotic error rate that is at most twice the Bayes error rate [13].

The rest of the chapter is organized as follows. In section 2 we review the most prevalent similarity measures used in time-series databases; more specifically, we demonstrate the need for non-metric distance functions and we also consider related work. In section 3 we formalize the new similarity functions by extending the Longest Common Subsequence (LCSS) model. Section 4 demonstrates efficient algorithms to compute these functions, and section 5 elaborates on the indexing structure. Section 6 provides the experimental validation of the accuracy and efficiency of the proposed approach. Finally, section 7 concludes the chapter.

¹ http://www.terraserver.com

2. Background

Indexing of time series has concentrated great attention in the research community, especially in the last decade, and this can partly be attributed to the explosion of database sizes. Two topics are central here: the similarity measure and the indexing scheme. We will briefly discuss some issues associated with these two topics.
2.1. Time Series Similarity Measures

Most of the related work on time series has concentrated on the use of some metric Lp norm. The simplest approach to define the distance between two sequences is to map each sequence into a vector and then use a p-norm to calculate their distance. The p-norm distance between two n-dimensional vectors x and y is defined as:

    Lp(x, y) = (Σ_{i=1..n} |x_i − y_i|^p)^{1/p}

For p = 2 it is the well-known Euclidean distance, and for p = 1 the Manhattan distance. The advantage of this simple model is that it allows efficient indexing by a dimensionality reduction technique [2,19,21]. On the other hand, the model cannot deal well with outliers and is very sensitive to small distortions in the time axis. There are a number of interesting extensions to the above model to support various transformations such as scaling [10,15], shifting [10,21], normalization [21] and moving average [36]. A domain-independent framework for defining queries in terms of similarity of objects is presented in [24]. Other recent works on indexing time-series data for similarity queries assuming the Euclidean model include [25,26,31]. In [29], Lee et al. propose methods to index sequences of multidimensional points, by computing the distances between multidimensional Minimum Bounding Rectangles. The similarity model is based on the Euclidean distance, and they extend the ideas presented by Faloutsos et al. in [16].

Another, more flexible way to describe similar but out-of-phase sequences can be achieved by using Dynamic Time Warping (DTW) [38]. DTW was first used to match signals in speech recognition; Berndt and Clifford [5] were the first to introduce this measure in the data-mining community. DTW between two sequences A and B is described by the following equation:

    DTW(A, B) = Dbase(a_m, b_n) + min{DTW(Head(A), Head(B)),
                                      DTW(Head(A), B),
                                      DTW(A, Head(B))}

where Dbase is some Lp norm, and Head(A) of a sequence A are all the elements of A except for the last one, a_m. Recent works on DTW are [27,44,45].
Although the flexibility provided by DTW is very important, DTW is nonetheless not the appropriate distance function for noisy data. The difficulties imposed by DTW include the fact that it is a non-metric distance function, and also that its performance deteriorates in the presence of a large amount of outliers (Figure 2). The support of out-of-phase matching is important; however, the DTW matches all points (so, the outliers as well), therefore distorting the true distance between the sequences.

Fig. 2. By matching all the points, DTW also matches the outliers, distorting the true distance between the sequences.

Another technique to describe the similarity is to find the longest common subsequence (LCSS) of two sequences and then define the distance using the length of this subsequence [3,12,41]. The LCSS shows how well the two sequences can match one another if we are allowed to stretch them, but we cannot rearrange the sequence of values. Since the values are real numbers, we typically allow approximate matching, rather than exact matching. The LCSS model can support time-shifting and efficiently ignore the noisy parts. A technique similar to our work is that of Agrawal et al. [3], where the authors propose an algorithm for finding an approximation of the LCSS between two time series, allowing local scaling (that is, different parts of each sequence can be scaled by a different factor before matching). In [7,12], fast probabilistic algorithms to compute the LCSS of two time series are presented. Other techniques to define time-series similarity are based on extracting certain features (Landmarks [32] or signatures [14]) from each time series and then using these features to define the similarity. An interesting approach that represents a time series using the direction of the sequence at regular time intervals is presented in [35].

Ge and Smyth [18] present an alternative approach for sequence similarity that is based on probabilistic matching, using Hidden Markov Models. However, the scalability properties of this approach have not been investigated yet. A recent work that proposes a method to cluster trajectory data is due to Gaffney and Smyth [17]. They use a variation of the EM (expectation maximization) algorithm to cluster small sets of trajectories. However, their method is a model-based approach that usually has scalability problems. The most related paper to our work is that of Bozkaya et al. [8]. They discuss how to define similarity measures for sequences of multidimensional points using a restricted version of the edit distance, which is equivalent to the LCSS. Also, they present two efficient methods to index the sequences for similarity retrieval. However, they focus on sequences of feature vectors extracted from images, and they do not discuss transformations or approximate methods to compute the similarity.

Lately, there has also been some work on indexing moving objects to answer spatial proximity queries (range and nearest neighbor queries) [1,28,39]. Also, in [33], Pfoser et al. present index methods to answer topological and navigational queries in a database that stores trajectories of moving objects. However, these works do not consider a global similarity model between sequences; they concentrate on finding objects that are close to query locations during a time instant, or a time period, that is also specified by the query.

2.2. Indexing Time Series

Indexing refers to the organization of data in such a way so as not only to group together similar items, but also to facilitate the pruning of irrelevant data. The second part is greatly affected by whether our distance function is metric or not.

Definition 1. A distance function d(x, y) between two objects x and y is metric when it complies with the following properties:

1. Positivity: d(x, y) ≥ 0, and d(x, y) = 0 iff x = y.
2. Symmetry: d(x, y) = d(y, x).
3. Triangle Inequality: d(x, z) ≤ d(x, y) + d(y, z).

Example: The Euclidean distance is a metric distance function. The pruning power of such a function can be shown in the following example (Figure 3). Assume we have a set of sequences, including Seq1, Seq2 and Seq3, and we have already recorded their pairwise distances in a table.
Suppose we pose a query Q, and we want to find the closest sequence to Q. The recorded pairwise distances are:

    Pairwise Distances    Seq1    Seq2    Seq3
    Seq1                    0      20     110
    Seq2                   20       0      90
    Seq3                  110      90       0

Assume we have already found a sequence, bestmatchsofar, whose distance from Q is 20. Comparing the query sequence Q to Seq2 yields D(Q, Seq2) = 150. Because D(Seq2, Seq1) = 20, it is true that D(Q, Seq1) ≥ D(Q, Seq2) − D(Seq2, Seq1) ⇒ D(Q, Seq1) ≥ 150 − 20 = 130, so we can safely prune Seq1, since it is not going to offer a better solution. Similarly, we can prune Seq3.

Fig. 3. Pruning power of the triangle inequality.

We are going to use non-metric distance functions, so we cannot apply the technique outlined above. This choice imposes difficulties on how to prune data in an index; as a result, it is much more difficult to design efficient indexing techniques for non-metric distance functions. In the next section we show that, despite this fundamental drawback, it is important to use non-metric distance functions in conditions of noise. In section 5 we will also show that the proper choice of a distance measure can overcome this drawback to some degree.

2.3. Motivation for Non-Metric Distance Functions

Distance functions that are robust to extremely noisy data will typically violate the triangular inequality. These functions achieve this by not considering the most dissimilar parts of the objects. However, they are useful, because they represent an accurate model of the human perception.
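The pruning step in the example above reduces to a one-line lower-bound test for any metric distance (a minimal sketch; the function name is ours, and the numbers in the usage example are taken from the text):

```python
def can_prune(d_query_pivot, d_pivot_candidate, best_so_far):
    """For a metric d, the triangle inequality implies
    d(Q, C) >= |d(Q, P) - d(P, C)|; candidate C can be skipped
    when this lower bound already exceeds the best distance found."""
    return abs(d_query_pivot - d_pivot_candidate) > best_so_far

# D(Q, Seq2) = 150, D(Seq2, Seq1) = 20, bestmatchsofar distance = 20:
# the lower bound 130 exceeds 20, so Seq1 is pruned without ever
# computing D(Q, Seq1); Seq3 (lower bound 150 - 90 = 60) is pruned too.
```

This is exactly the pruning that becomes unavailable once the triangle inequality is given up, which motivates the weaker bound developed in section 5.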
When comparing any kind of data (images, time series, etc.), we mostly focus on the portions that are similar, and we are willing to pay less attention to regions of great dissimilarity. Non-metric distances are used nowadays in many domains, such as string (DNA) matching, collaborative filtering (where customers are matched with stored 'prototypical' customers), and retrieval of similar images from databases. Moreover, psychology research suggests that human similarity judgments are also non-metric.

The time series that we obtain are not guaranteed to be the outcome of sampling at fixed time intervals. For this kind of data we need distance functions that can address the following issues:

• Different sampling rates or different speeds. Two time series moving in exactly the same way, but with one moving at twice the speed of the other, will (most probably) result in a very large Euclidean distance. Furthermore, the sensors collecting the data may fail for some period of time, leading to inconsistent sampling rates.
• Outliers. Noise might be introduced due to an anomaly in the sensor collecting the data, or can be attributed to human 'failure' (e.g. jerky movement during a tracking process). In this case the Euclidean distance will completely fail and result in a very large distance.
• Different lengths. The Euclidean distance deals with time series of equal length. In the case of different lengths, we have to decide whether to truncate the longer series, pad the shorter with zeros, etc. In general, its use gets complicated and the distance notion more vague.
• Efficiency. The similarity model has to be sufficiently complex to express the user's notion of similarity, yet simple enough to allow efficient computation of the similarity.

To cope with these challenges we use the Longest Common Subsequence (LCSS) model. The LCSS is a variation of the edit distance [30,37]. The basic idea is to match two sequences by allowing them to stretch, without rearranging the sequence of the elements, but allowing some elements to be unmatched. A simple extension of the LCSS model is not sufficient, because (for example) this model cannot deal with parallel movements. So, in our similarity model we consider a set of translations, and we find the translation that yields the optimal solution to the LCSS problem.
3. Similarity Measures Based on LCSS

3.1. Original Notion of the LCSS

Suppose that we have two time series A = (a_1, a_2, ..., a_n) and B = (b_1, b_2, ..., b_m). For A, let Head(A) = (a_1, a_2, ..., a_{n-1}), and similarly for B.

Definition 2. The LCSS between A and B is defined as follows:

    LCSS(A, B) =
        0                                          if A or B is empty,
        1 + LCSS(Head(A), Head(B))                 if a_n = b_m,
        max(LCSS(Head(A), B), LCSS(A, Head(B)))    otherwise.

The above definition is recursive and would require exponential time to compute. However, the LCSS problem can easily be solved in quadratic time and space using dynamic programming [11,42]. The basic idea behind this solution lies in the fact that the problem of the sequence matching can be dissected into smaller problems, which can be combined after they are solved optimally. So, what we have to do is solve a smaller instance of the problem (with fewer points), and then continue by adding new points to our sequences, modifying the LCSS accordingly.

Dynamic Programming Solution [11,42]. The solution can be found in O(n*m) time by computing the following equation using dynamic programming (Figure 4):

    LCSS[i, j] =
        0                                     if i = 0 or j = 0,
        1 + LCSS[i-1, j-1]                    if a_i = b_j,
        max(LCSS[i-1, j], LCSS[i, j-1])       otherwise,

where LCSS[i, j] denotes the longest common subsequence between the first i elements of sequence A and the first j elements of sequence B. Finally, LCSS[n, m] will give us the length of the longest common subsequence between the two sequences A and B. The same dynamic programming technique can be employed in order to find the Warping Distance between two sequences.
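The quadratic-time recurrence above translates directly into code (a short sketch; the function name is ours):

```python
def lcss(a, b):
    """Length of the longest common subsequence of a and b via the
    O(n*m) dynamic program: L[i][j] holds the LCSS of the first i
    elements of a and the first j elements of b."""
    n, m = len(a), len(b)
    L = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if a[i - 1] == b[j - 1]:
                L[i][j] = 1 + L[i - 1][j - 1]
            else:
                L[i][j] = max(L[i - 1][j], L[i][j - 1])
    return L[n][m]

print(lcss("ABCBDAB", "BDCABA"))  # 4 (e.g. "BCAB")
```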
3.2. Extending the LCSS Model

The LCSS model matches exact values; however, in our model we want to allow a more flexible matching between two sequences, when the values are within a certain range. Moreover, in certain applications, the stretching that is being provided by the LCSS algorithm needs only to be within a certain range, too. Having seen that there exists an efficient way to compute the LCSS between two sequences, we extend this notion in order to define a new, more flexible, similarity measure. We assume that the measurements of the time series are at fixed and discrete time intervals; if this is not the case, then we can use interpolation [23,34].

Definition 3. Given an integer δ and a real positive number ε, we define LCSS_{δ,ε}(A, B) as follows:

    LCSS_{δ,ε}(A, B) =
        0                                   if A or B is empty,
        1 + LCSS_{δ,ε}(Head(A), Head(B))    if |a_n − b_m| < ε and |n − m| ≤ δ,
        max(LCSS_{δ,ε}(Head(A), B),
            LCSS_{δ,ε}(A, Head(B)))         otherwise.

The constant δ controls how far in time we can go in order to match a given point from one sequence to a point in another sequence. The constant ε is the matching threshold (see Figure 5). The dynamic programming solution provided above is still the same.

Fig. 4. Solving the LCSS problem using dynamic programming. The gray area indicates the elements that are examined if we confine our search window.

The first similarity function is based on the LCSS, and the idea is to allow time stretching; objects that are close in space at different time instants can be matched if the time instants are also close.

Definition 4. We define the similarity function S1 between two sequences A and B, given δ and ε, as follows:

    S1(δ, ε, A, B) = LCSS_{δ,ε}(A, B) / min(n, m)

Essentially, using this measure, if there is a matching point within the region ε we increase the LCSS by one.

Fig. 5. The notion of the LCSS matching within a region of δ & ε for a sequence. The points of the two sequences within the gray region can be matched by the extended LCSS function.

Next, we define a second notion of the similarity. First, we consider the set of translations. A translation simply causes a vertical shift, either up or down. Let F be the family of translations. Then a function f_c belongs to F if f_c(A) = (a_{x,1} + c, ..., a_{x,n} + c). We use function S1 to define another, more flexible, similarity measure based on the above family of functions.

Definition 5. Given δ, ε and the family F of translations, we define the similarity function S2 between two sequences A and B as follows:

    S2(δ, ε, A, B) = max_{f_c ∈ F} S1(δ, ε, A, f_c(B))
So the similarity functions S1 and S2 range from 0 to 1. Therefore, we can define the distance functions between two sequences as follows:

Definition 6. Given δ, ε and two sequences A and B, we define the following distance functions:

    D1(δ, ε, A, B) = 1 − S1(δ, ε, A, B)   and   D2(δ, ε, A, B) = 1 − S2(δ, ε, A, B)

Note that D1 and D2 are symmetric: LCSS_{δ,ε}(A, B) is equal to LCSS_{δ,ε}(B, A), and the transformation that we use in D2 is translation, which preserves the symmetric property.

In Figure 6 we show an example where a sequence A matches another sequence B after a translation is applied.

Fig. 6. Translation of sequence A.

The similarity function S2 is a significant improvement over S1, because: (i) now we can detect parallel movements, and (ii) the use of normalization does not guarantee that we will get the best match between two time series. Usually, because of the significant amount of noise, the average value and/or the standard deviation of the time series that are being used in the normalization process can be distorted, leading to improper translations.

3.3. Differences between DTW and LCSS

Time Warping and the LCSS share many similarities, but they are not identical. Like DTW, the LCSS model allows stretching and displacement in time, so we can detect similarities in movements that happen with different speeds, or at different times. In addition, by allowing translations, we can detect similarities between movements that are parallel. Here, we argue that the LCSS is a better similarity function for correctly identifying noisy sequences.
The reasons are the following:

1. The LCSS model matches only the similar portions of the two sequences, avoiding the outliers. Time Warping, by matching all elements, is also going to try to match the outliers, which, most likely, is going to distort the real distance between the examined sequences. Taking under consideration that a large portion of the sequences may be just outliers, we need a similarity function that will be robust under noisy conditions and will not match the incorrect parts. This property of the LCSS is depicted in Figure 7.

Fig. 7. DTW is not robust under noisy conditions: it matches all elements, including the outliers. Using the LCSS, we only match the similar portions.

In Figure 8 we can see an example of a hierarchical clustering produced by the DTW and the LCSS distances between four time series that contain a significant amount of outliers, especially in the beginning and in the end of the sequences. The DTW fails to distinguish the two classes of words; using the LCSS similarity measure, we can obtain the most intuitive clustering, as shown in the same figure.

Fig. 8. Hierarchical clustering of time series with a significant amount of outliers. Left: The presence of many outliers in the beginning and the end of the sequences leads to incorrect clustering. Right: The LCSS, focusing on the common parts, achieves the correct clustering.
2. Simply normalizing the time series (by subtracting the average value) does not guarantee that we will achieve the best match (Figure 9): due to the great amount of outliers, the average value itself can be distorted. Using the Euclidean distance, we obtain even worse results. Even though the ending portion of the Boston 2 time series differs significantly from the Boston 1 sequence, the LCSS correctly focuses on the start of the sequence. Obviously, an even better matching can be found for the two sequences; we are going to show in the following section that we can try a set of translations which will provably give us the optimal matching (or close to optimal, within some user-defined error bound).

Fig. 9. Left: Two sequences and their mean values. Right: After normalization. The sequences represent data collected through a video tracking process (see Section 6).

4. Efficient Algorithms to Compute the Similarity

4.1. Computing the Similarity Function S1

To compute the similarity function S1, we have to run an LCSS computation for the two sequences. The LCSS can be computed by a dynamic programming algorithm in O(n*m) time, with |A| = n and |B| = m. However, we only allow matchings when the difference in the indices is at most δ, and this allows the use of a faster algorithm. The following lemma has been shown in [12].

Lemma 1. Given two sequences A and B, with |A| = n and |B| = m, we can find the LCSS_{δ,ε}(A, B) in O(δ(n + m)) time.
If δ is small, the dynamic programming algorithm is very efficient. However, for some applications, δ may need to be large. In this case, we can speed up the above computation using random sampling: we compute two subsets RA and RB by sampling each sequence, and then we use the dynamic programming algorithm to compute the LCSS on RA and RB. We can show that, with high probability, the result of the algorithm over the samples is a good approximation of the actual value. We describe this technique in detail in [40].

4.2. Computing the Similarity Function S2

We now consider the more complex similarity function S2. Here, given two sequences A, B, and constants δ, ε, we have to find the translation fc that maximizes the length of the longest common subsequence of A and fc(B), that is, LCSS_{δ,ε}(A, fc(B)), over all possible translations. A one-dimensional translation fc is a function that adds a constant to all the elements of a 1-dimensional sequence: fc(x_1, ..., x_m) = (x_1 + c, ..., x_m + c).

Let the lengths of sequences A and B be n and m respectively, and let us assume that the translation fc1 is the translation that, when applied to B, gives a longest common subsequence LCSS_{δ,ε}(A, fc1(B)) = a, and that it is also the translation that maximizes this length:

    LCSS_{δ,ε}(A, fc1(B)) = max_{c ∈ R} LCSS_{δ,ε}(A, fc(B))

The key observation is that, although there is an infinite number of translations that we can apply to B, each translation fc results in a longest common subsequence between A and fc(B), and there is a finite set of possible longest common subsequences. In this section we show that we can efficiently enumerate a finite set of translations, such that this set provably includes a translation that maximizes the length of the longest common subsequence of A and fc(B).

It is instructive to view this as a stabbing problem. Consider the vertical line segments ((b_i, a_j − ε), (b_i, a_j + ε)), one for every pair of elements b_i, a_j of B and A that are within δ positions of each other (|i − j| < δ) and can therefore be matched by the LCSS algorithm if their values are within ε (Figure 10). These line segments lie on a two-dimensional plane, where on the x axis we put elements of B and on the y axis we put elements of A: each segment is centered at the point (b_i, a_j) and extends ε above and below this point. Since each element in A can be matched with at most 2δ + 1 elements in B, the total number of such line segments is O(δn).

A translation fc in one dimension is a function of the form fc(b_i) = b_i + c, so in the plane described above it is a line of slope 1. After translating B by fc, an element b_i of B can be matched to an element a_j of A if and only if the line fc(x) = x + c intersects the line segment ((b_i, a_j − ε), (b_i, a_j + ε)). Therefore, each line of slope 1 defines a set of possible matchings between the elements of sequences A and B, and two different translations can result in different longest common subsequences only if the respective lines intersect different sets of line segments. For example, the translations f0(x) = x and f2(x) = x + 2 in Figure 10 intersect different sets of line segments and result in longest common subsequences of different length. Note that the number of intersected line segments is actually an upper bound on the length of the longest common subsequence, because the ordering of the elements is ignored.

Fig. 10. An example of a translation. By applying the translation to a sequence and executing the LCSS for δ = 1, we can achieve perfect matching.

The following lemma gives a bound on the number of possible different longest common subsequences, by bounding the number of possible different sets of line segments that are intersected by lines of slope 1.

Lemma 2. Given two one-dimensional sequences A and B, there are O(δ(n + m)) lines of slope 1 that intersect different sets of line segments.
Proof: Let fc(x) = x + c be a line of slope 1, and let Lfc be the set of line segments it intersects. If we move this line slightly to the left or to the right, it still intersects the same set of line segments, unless we cross an endpoint of a line segment. In this case, the set of intersected line segments increases or decreases by one. There are O(δ(n + m)) endpoints; a line of slope 1 that sweeps all the endpoints will therefore intersect at most O(δ(n + m)) different sets of line segments during the sweep. In addition, we can enumerate the O(δ(n + m)) translations that produce different sets of potential matchings by finding the lines of slope 1 that pass through the endpoints.

This set of O(δ(n + m)) translations gives all possible matchings for a longest common subsequence of A and B. Since running the LCSS algorithm takes O(δ(n + m)) time, we have shown the following theorem:

Theorem 1. Given two sequences A and B, with |A| = n and |B| = m, we can compute S2(δ, ε, A, B) in O((n + m)²δ²) time.

4.3. An Efficient Approximate Algorithm

Theorem 1 gives an exact algorithm for computing S2, but this algorithm runs in quadratic time. In this section we present a much more efficient approximate algorithm. The key in our technique is that we can bound the difference between the sets of line segments that different lines of slope 1 intersect, based on how far apart the lines are.

Let us consider the O(δ(n + m)) translations that result in different sets of intersected line segments. Each such translation corresponds to a line fc(x) = x + c. Let f1(x) = x + c1, ..., fN(x) = x + cN be these different translations, sorted by c (c1 ≤ ... ≤ cN), and for a given translation fc, let Lfc be the set of line segments it intersects. The following lemma shows that neighboring translations in this order intersect similar sets of line segments.

Lemma 3. The symmetric difference satisfies |Lfi Δ Lfj| ≤ |i − j|.

Proof: Without loss of generality, assume i < j. The difference between fi and fi+1 is simply that we cross one additional endpoint. This means that we have to either add or subtract one line segment to Lfi to get Lfi+1. By repeating this step (j − i) times, we create Lfj.

We can now prove our main theorem:

Theorem 2. Given sequences A and B with lengths n and m respectively (m < n), and constants δ, ε and 0 < β < 1, we can find an approximation AS2_{δ,ε,β}(A, B) of the similarity S2(δ, ε, A, B), such that S2(δ, ε, A, B) − AS2_{δ,ε,β}(A, B) < β, in O(nδ²/β) time.

Proof: Let a = S2(δ, ε, A, B). There exists a translation fi such that Lfi is a superset of the matches in the optimal LCSS of A and B. In addition, by the previous lemma, the 2b translations (fi−b, ..., fi+b) have at most b different matchings from the optimal. Therefore, if we use the translations f_{ib}, for i = 1, ..., δn/b, in the ordering described above, we are within b different matchings from the optimal matching of A and B (Figure 11). We can find these translations in O(δn log n) time if we find and sort all the translations; alternatively, we can find them in O((δn/b)·δn) time if we run ⌈δn/b⌉ quantile operations. So we have a total of δn/b translations, and setting b = βn completes the proof.

Fig. 11. [Translations sampled every b positions in the sorted order are within b endpoints of the optimal translation, and thus miss at most b matchings.]
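The exact computation behind Theorem 1 can be sketched directly from the stabbing view: collect the values c = a_j − b_i ∓ ε at which a line of slope 1 crosses a segment endpoint, and evaluate S1 once strictly inside each interval between consecutive values, where the set of possible matchings is constant (a self-contained sketch; function names are ours):

```python
def lcss_de(a, b, delta, eps):
    """O(n*m) dynamic program for LCSS_{delta,eps}."""
    n, m = len(a), len(b)
    L = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if abs(a[i - 1] - b[j - 1]) < eps and abs(i - j) <= delta:
                L[i][j] = 1 + L[i - 1][j - 1]
            else:
                L[i][j] = max(L[i - 1][j], L[i][j - 1])
    return L[n][m]

def s2_exact(a, b, delta, eps):
    """Exact S2: enumerate the O(delta*(n+m)) critical translation
    values c = a_j - b_i -/+ eps, then run the LCSS once for a
    translation chosen inside each interval between consecutive
    critical values (where the matching set stays constant)."""
    ends = set()
    for i, bi in enumerate(b):
        for j in range(max(0, i - delta), min(len(a), i + delta + 1)):
            ends.add(a[j] - bi - eps)   # line through a segment's lower end
            ends.add(a[j] - bi + eps)   # line through a segment's upper end
    ends = sorted(ends)
    best = 0.0
    for c1, c2 in zip(ends, ends[1:]):
        c = 0.5 * (c1 + c2)             # a translation inside the interval
        shifted = [x + c for x in b]
        best = max(best, lcss_de(a, shifted, delta, eps) / min(len(a), len(b)))
    return best
```

For instance, `s2_exact([0, 1, 2, 3], [10, 11, 12, 13], delta=0, eps=0.5)` tries the translation c = −10 and returns 1.0. The approximate algorithm of Theorem 2 evaluates only every βn-th of the sorted critical values instead of all of them.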
Therefore, the approximation algorithm works as follows:

(i) Find all sets of translations for sequences A and B.
(ii) Find the iβn-th quantiles of this set of translations, for 1 ≤ i ≤ 2δ/β.
(iii) Run the LCSS_{δ,ε} algorithm on A and B for the computed translations.
(iv) Return the highest result.

5. Indexing Trajectories for Similarity Retrieval

In this section we show how to use the hierarchical tree of a clustering algorithm in order to efficiently answer nearest neighbor queries in a dataset of sequences. First we define:

    LCSS_{δ,ε,F}(A, B) = max_{fc ∈ F} LCSS_{δ,ε}(A, fc(B))

Clearly, D2(δ, ε, A, B) = 1 − LCSS_{δ,ε,F}(A, B) / min(|A|, |B|). The distance function D2 is not a metric, because it does not obey the triangle inequality. An example is shown in Figure 12, where we observe that the similarity of A to B and of B to C is 1, so the respective distances are both zero, while obviously D2(δ, ε, A, C) = 1 > D2(δ, ε, A, B) + D2(δ, ε, B, C) = 0 + 0. This makes the use of traditional indexing techniques difficult, since we cannot apply the pruning technique outlined in section 2. We can, however, prove a weaker version of the triangle inequality, which can help us avoid examining a large portion of the database objects.

Fig. 12. An example where the triangle inequality does not hold for the new LCSS model.
Now we can show the following lemma:

Lemma 4. Given trajectories A, B and C:

LCSSδ,2ε,F(A, C) ≥ LCSSδ,ε,F(A, B) + LCSSδ,ε,F(B, C) − |B|,

where |B| is the length of sequence B.

Proof: Clearly, if an element of A can match an element of B within ε, and the same element of B matches an element of C within ε, then the element of A can also match the element of C within 2ε. Since there are at least |B| − (|B| − LCSSδ,ε,F(A, B)) − (|B| − LCSSδ,ε,F(B, C)) elements of B that match with elements of A and with elements of C, it follows that

LCSSδ,2ε,F(A, C) ≥ |B| − (|B| − LCSSδ,ε,F(A, B)) − (|B| − LCSSδ,ε,F(B, C)) = LCSSδ,ε,F(A, B) + LCSSδ,ε,F(B, C) − |B|.

M. Vlachos, D. Gunopulos and G. Das

5.1. Indexing Structure

We first partition all the sequences into sets according to length, so that the longest sequence in each set is at most a times the shortest (typically we use a = 2). We apply a hierarchical clustering algorithm on each set, and we use the tree that the algorithm produced as follows. For every node C of the tree we store the medoid (MC) of the cluster. The medoid is the sequence that has the minimum distance (or maximum LCSS) to every other sequence in the cluster, i.e. the sequence that achieves max_{vi ∈ C} min_{vj ∈ C} LCSSδ,ε,F(vi, vj).

5.2. Searching the Index Tree for Nearest Trajectories

We assume that we search an index tree that contains time-series with minimum length min_l and maximum length max_l. Given the tree and a query sequence Q, for simplicity we discuss the algorithm for the 1-Nearest-Neighbor query, where we try to find the sequence in the set that is the most similar to Q. The search procedure takes as input a node N of the tree, the query Q, and the distance mindist to the closest time-series found so far. For each of the children C of N, we check whether the child is a single sequence or a cluster. In case it is a sequence, we just compare its distance to Q with that of the current nearest sequence. If it is a cluster, we want to examine whether to follow the subtree that is rooted at C: we compute a lower bound L on the distance of the query to any sequence in the cluster, and we need to examine the cluster only if L is smaller than mindist.

From the previous lemma we know that for any sequence B in the cluster C:

LCSSδ,ε,F(B, Q) ≤ LCSSδ,2ε,F(MC, Q) + |B| − LCSSδ,ε,F(MC, B),

and therefore, in terms of distance:

D2(δ, ε, F, B, Q) = 1 − LCSSδ,ε,F(B, Q)/min(|B|, |Q|) ≥ 1 − LCSSδ,2ε,F(MC, Q)/min(|B|, |Q|) − (|B| − LCSSδ,ε,F(MC, B))/min(|B|, |Q|).

In order to obtain a lower bound L we have to maximize the expression |B| − LCSSδ,ε,F(MC, B) over the cluster. Therefore, for every node of the tree, along with the medoid we also keep the sequence rc that maximizes this expression. To evaluate min(|B|, |Q|), we check the length of the query: if it is smaller than the shortest length of the sequences we are currently considering, we use that; otherwise we use the minimum and maximum lengths of the set to obtain an approximate result. Finally, in our scheme we use the approximate algorithm of Section 4 to compute the LCSS, so the value of LCSSδ,2ε,F(MC, Q) that we compute can differ from the exact value by up to β ∗ min(|MC|, |Q|); consequently, we have to subtract β ∗ min(|MC|, |Q|) from the bound we compute for D2(δ, ε, B, Q).

6. Experimental Evaluation

We implemented the proposed approximation and indexing techniques as described in the previous sections, and here we present experimental results evaluating them. The purpose of our experiments is twofold: first, to evaluate the efficiency and accuracy of the approximation algorithm presented in Section 4, and second, to evaluate the indexing technique that we discussed in the previous section. We describe the datasets and then we continue by presenting the results. Our experiments were run on a PC with a Pentium III at 1 GHz, 128 MB RAM and a 60 GB hard disk.
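The pruning search of Sections 5.1–5.2 can be sketched as follows. This is a deliberately simplified sketch: it assumes an exact LCSS without translations or the β correction, and it uses the medoid length as a stand-in for |B| when normalizing the bound; all class and function names are illustrative assumptions.

```python
class Node:
    """Cluster-tree node: either a leaf (one sequence) or an internal cluster."""
    def __init__(self, medoid=None, max_gap=0, children=None, sequence=None):
        self.medoid = medoid        # representative sequence M_C of the cluster
        self.max_gap = max_gap      # max over B in cluster of |B| - LCSS(M_C, B)
        self.children = children or []
        self.sequence = sequence    # set only for leaves


def lcss(a, b, eps):
    """Plain LCSS with matching threshold eps (no warping window, for brevity)."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            if abs(a[i - 1] - b[j - 1]) <= eps:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]


def d2(a, b, eps):
    return 1.0 - lcss(a, b, eps) / min(len(a), len(b))


def nn_search(node, query, eps, best=(float("inf"), None)):
    """1-NN search; a cluster is descended only if its lower bound beats mindist."""
    for child in node.children:
        if child.sequence is not None:            # a single sequence: compare directly
            d = d2(child.sequence, query, eps)
            if d < best[0]:
                best = (d, child.sequence)
        else:                                     # a cluster: weak-triangle-inequality bound
            m = min(len(child.medoid), len(query))
            # LCSS(B, Q) <= LCSS_2eps(M_C, Q) + |B| - LCSS(M_C, B)
            lb = 1.0 - (lcss(child.medoid, query, 2 * eps) + child.max_gap) / m
            if lb < best[0]:
                best = nn_search(child, query, eps, best)
    return best
```

Because the bound never overestimates the true distance of a cluster member, skipping a subtree whose bound exceeds mindist causes no false dismissals.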
6.1. Time and Accuracy Experiments

Here we present the results of some experiments using the approximation algorithm to compute the similarity function S2, when the similarity is defined using the LCSS model. Our dataset here comes from marine mammals' satellite tracking data². It consists of sequences of geographic locations of various marine animals (dolphins, sea lions, whales, etc.) tracked over different periods of time, ranging from one to three months (SEALS dataset). The length of the sequences is close to 100. We ran the exact and the approximate algorithm for different values of δ and ε, and we report here some indicative results.

In Table 1 we show the computed similarity between a pair of sequences in the SEALS dataset. K is the number of times the approximate algorithm invokes the LCSS procedure (that is, the number of translations c that we try). As we can see, using only a few translations we get very good results. We got similar results for synthetic datasets.

Table 1. Similarity values between two sequences from our SEALS dataset, for δ = 2, 4, 6 and ε = 0.25, 0.5: exact similarity, approximate similarity for K = 15, 30 and 60 tries, and error (%) for K = 60.

Also, in Table 2 we report the running times needed to compute the similarity measure between two sequences of the same dataset; the approximation algorithm again uses from 15 to 60 different runs. The running time of the approximation algorithm is much lower, even for K = 60. As can also be observed from the experimental results, the running time of the approximation algorithm is not proportional to the number of runs (K). This is achieved by reusing the results of previous translations and by terminating the execution of the current translation early, if it is not going to yield a better result. The main conclusion of the above experiments is that the approximation algorithm provides a very tractable time vs. accuracy tradeoff for computing the similarity between two sequences.

² http://whale.wheelock.edu/whalenet-stuff/stop_cover.html
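The K-tries behavior measured here can be illustrated with a simplified randomized version of the approximation. In this sketch the candidate translations are sampled uniformly from the pairwise element differences rather than chosen through the quantile construction of Section 4, and all names are illustrative assumptions:

```python
import random


def lcss(a, b, eps):
    """Plain LCSS with matching threshold eps (no warping window, for brevity)."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            if abs(a[i - 1] - b[j - 1]) <= eps:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]


def s2_approx(a, b, eps, k, seed=None):
    """Try k candidate translations c of B and keep the best normalized LCSS score."""
    rng = random.Random(seed)
    diffs = [x - y for x in a for y in b]   # every pairwise difference is a candidate shift
    best = 0.0
    for _ in range(k):
        c = rng.choice(diffs)
        shifted = [y + c for y in b]
        best = max(best, lcss(a, shifted, eps) / min(len(a), len(b)))
    return best
```

As in the experiments, increasing k can only improve the estimate, and each try is independent, so early termination and result reuse are easy to add.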
Table 2. Running time (sec) to compute the similarity between two sequences from the SEALS dataset, for δ = 2, 4, 6 and ε = 0.25, 0.5: exact algorithm versus the approximate algorithm with K = 15, 30 and 60 tries.

6.2. Clustering using the Approximation Algorithm

We compare the clustering performance of our method to the widely used Euclidean and DTW distance functions. We tried to offer the best possible comparison between every pair of sequences. Specifically:

(i) The Euclidean distance is only defined for sequences of the same length (and the lengths of our sequences vary considerably). We therefore slide the shorter of the two sequences across the longer one and record their minimum distance; like this we can get the best possible results out of the Euclidean method.
(ii) The DTW can be used to match sequences of different lengths.
(iii) For LCSS we used a randomized version, with and without sampling, and for various values of δ.

In both DTW and Euclidean we normalized the data before computing the distances. Our method does not need any normalization, since it computes the necessary translations. Of the three, the computation of the Euclidean distance is the least complicated one, imposing the smallest time overhead. The times and the numbers of correct clusterings we report represent the average values of 15 runs of each experiment; this is necessary due to the randomized nature of our approach.

6.2.1. Determining the Values for δ and ε

The choice of values for the parameters δ and ε is clearly dependent on the application and the dataset. For most datasets we had at our disposal, we discovered that setting δ to more than 20–30% of the sequence length did not yield significant improvement. Furthermore, after some point the
similarity stabilizes to a certain value (Figure 13): in most datasets, the similarity stabilizes around a certain point after some value of δ.

Fig. 13. Similarity as a function of δ (as a percentage of the sequence length), for the whole sequence and for samples of 80%, 60%, 40%, 30%, 25% and 20% of its length.

The determination of ε is application dependent. In our experiments we used a value equal to some percentage (usually 0.3–0.5) of the smallest standard deviation of the two sequences that were examined at any time, which yielded good and intuitive results. Note, however, that when we use the index, the value of ε has to be the same for all pairs of sequences. If we want to provide more accurate results for the similarity between two sequences, we can use weighted matching: if two sequence points match for a very small ε, we increase the LCSS by 1; otherwise we increase it by some amount in the range r, where 0 ≤ r < 1. This allows us to give more gravity to the points that match very closely, and less gravity to the points that match only marginally within the search area of ε. Such matching functions can be found in Figure 14.

Fig. 14. Performing unweighted and weighted matching in order to provide more accuracy in the matching between sequences.

6.2.2. Experiment 1 — Video Tracking Data

These time series represent the X and Y position of a human tracking feature (e.g., the tip of a finger). In conjunction with a "spelling program", the user can "write" various words [20]. In this experiment we used only the X coordinates and we kept 3 recordings for each of 5 different words. The data correspond to the following words: 'athens', 'berlin', 'london', 'boston', 'paris'. The average length of the series is around 1100 points; the shortest one is 834 points and the longest one 1719 points.

To determine the efficiency of each method, we performed hierarchical clustering after computing the N(N − 1)/2 pairwise distances for all three distance functions. We take all possible pairs of words (in this case 5 ∗ 4/2 = 10 pairs) and use the clustering algorithm to partition each pair into two classes. We evaluate the total time required by each method, as well as the quality of the clustering. While at the lower levels of the dendrogram the clustering is subjective, the top level should provide an accurate division into two classes, based on our knowledge of which word each sequence actually represents. We clustered using single, complete and average linkage. Since the best results for every distance function are produced using the complete linkage, we report only the results for this approach (Table 3, Fig. 15). Experiments have been conducted for different sample sizes and values of δ (as a percentage of the original series length).

Table 3. Results using the video tracking data: time (sec) required for the pairwise distance computations and correct clusterings (out of 10, complete linkage), for the Euclidean distance, DTW, and S2 with sample sizes s = 5%, 10%, 15%, 20% and δ = 25%.

Fig. 15. Time required to compute the pairwise distance computations for various sample sizes and values of δ.

The results with the Euclidean and the Time Warping distance have many classification errors. For the LCSS, the only real variations in the clustering are for sample sizes s ≤ 10% (Figure 16); for 15% sampling or more, there were no errors. Still, the average number of incorrect clusterings for these cases was constantly less than 2 (<1.85). The same experiment is conducted with the rest of the datasets.

Fig. 16. Average number of correct clusterings (out of 10) for 15 runs of the algorithm using different sample sizes and δ values.

6.2.3. Experiment 2 — Australian Sign Language Dataset (ASL)³

The dataset consists of various parameters (such as the X, Y, Z hand position, roll, pitch, yaw, thumb bend, etc.) tracked while different writers sign one of the 95 words of the ASL. We used only the X coordinates and collected 5 recordings of each of the following 10 words: 'Norway', 'cold', 'crazy', 'eat', 'forget', 'happy', 'innocent', 'later', 'lose', 'spend'. Examples of this dataset can be seen in Figure 17. These series are relatively short (50–100 points). This is also the experiment conducted in [27].

Fig. 17. Three recordings of the word 'norway' in the Australian Sign Language. The graph depicts the x position of the writer's hand.

In this experiment, none of the distance functions achieves significant clustering accuracy. Specifically, the performance of the LCSS is similar to the DTW (both recognized correctly 20 clusters), which is expected, since this dataset does not contain excessive noise, and furthermore the data seem to be already normalized and rescaled within the range [−1, 1]. The reason for the limited accuracy is that we used only 1 parameter out of the possible 10; therefore accuracy is compromised. Since the translations were not going to achieve any further improvement, in this experiment we also used the similarity function S1 (no translation). As a consequence, even though we don't gain much in accuracy, our execution time is comparable to the Euclidean (without performing any translations) (Table 4, Figure 18). Sampling is only performed down to 75% of the series length (these sequences are already short).

³ http://kdd.ics.uci.edu
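The ε heuristic and the weighted matching described above can be sketched as follows (illustrative helper names; the `tight` threshold and the default value of r are assumed choices, not prescribed by the chapter):

```python
import statistics


def choose_eps(a, b, frac=0.5):
    """Pick eps as a fraction (the text suggests 0.3-0.5) of the smaller
    standard deviation of the two sequences being compared."""
    return frac * min(statistics.stdev(a), statistics.stdev(b))


def match_weight(x, y, eps, tight=0.5, r=0.5):
    """Weighted matching: full credit 1 for a very close match (within tight*eps),
    partial credit r in [0, 1) for a match that is only marginally within eps."""
    gap = abs(x - y)
    if gap <= tight * eps:
        return 1.0
    if gap <= eps:
        return r
    return 0.0
```

Replacing the `+ 1` in the LCSS recurrence with `+ match_weight(...)` yields the weighted variant; with `tight = 1` it reduces to the unweighted scheme.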
Table 4. Results for ASL data and ASL with added noise: time (sec) required to compute the pairwise distances, and correct clusterings (out of 45) for the original ASL data and for ASL with noise, for the Euclidean distance, DTW, and S1 with sample sizes s = 75%, 100% and δ = 40%, 100%.

Fig. 18. ASL data: Time required to compute the pairwise distances of the 45 combinations (same for ASL and ASL with noise).

6.2.4. Experiment 3 — ASL with Added Noise

In this last experiment we wanted to see how the addition of noise would affect the performance of the three distance functions. We added noise to every sequence of the ASL dataset, at a random starting point and for a duration equal to 15% of the series length. The noise was added using the function

x_noise, y_noise = x, y + randn ∗ rangeValues,

where randn produces a random number chosen from a normal distribution with mean zero and variance one, and rangeValues is the range of values on the X or Y coordinates. Again, the running time is the same as with the original ASL data.
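The noise-injection procedure of this experiment can be sketched as follows (a one-dimensional sketch with assumed helper names; the chapter applies the same formula to both the X and Y coordinates):

```python
import random


def add_noise(x, fraction=0.15, seed=None):
    """Add N(0,1)-distributed noise, scaled by the range of values, over a
    randomly placed window covering `fraction` of the series length."""
    rng = random.Random(seed)
    values = list(x)
    dur = max(1, int(fraction * len(values)))
    start = rng.randrange(0, len(values) - dur + 1)   # random starting point
    value_range = max(values) - min(values)           # the rangeValues factor
    for i in range(start, start + dur):
        values[i] += rng.gauss(0.0, 1.0) * value_range
    return values
```

Because the noise is scaled by the range of the sequence, the corruption is substantial relative to the signal, which is what makes the experiment a stress test for the distance functions.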
The LCSS proves to be more robust than the Euclidean and the DTW under noisy conditions (Table 4, Figures 17 and 18). The Euclidean again performed poorly, recognizing only 1 cluster; the DTW recognized 2, and the LCSS up to 12 clusters (consistently recognizing more than 6 clusters). These results are a strong indication that the models based on LCSS are generally more robust to noise.

Fig. 19. Noisy ASL data: The correct clusterings of the LCSS method using complete linkage.

6.3. Evaluating the Quality and Efficiency of the Indexing Technique

In this part of our experiments we evaluated the efficiency and effectiveness of the proposed indexing scheme. To generate large realistic datasets, we used real sequences (from the SEALS and ASL datasets) as "seeds" to create larger datasets that follow the same patterns. We performed tests over datasets of different sizes and different numbers of clusters. To perform the tests, we used queries that do not have exact matches in the database, but that are, on the other hand, similar to some of the existing sequences. For each experiment we ran 100 different queries and we report the averaged results.

We executed a set of K-Nearest-Neighbor (KNN) queries for K = 1, 5, 10, 15 and 20, and we plot the fraction of the dataset that has to be examined in order to guarantee that we have found the best match for the KNN query. Note that in this fraction we include the medoids that we check during the search, since they are also part of the dataset. In Figure 20 we show some results for KNN queries. We have tested the index performance for different numbers of clusters in a dataset consisting of a total of 2000 sequences, using datasets with 5, 8 and 10 clusters. As we can see, the results indicate that the algorithm has good performance even for queries with large K.

Fig. 20. Performance for increasing number of Nearest Neighbors.

We also performed similar experiments where we varied the number of clusters in the datasets (Figure 21). As the number of clusters increased, the performance of the algorithm improved considerably; this behavior is expected, and it is similar to the behavior of recently proposed index structures for high-dimensional data [6,9,43]. On the other hand, if the dataset has no clusters, the performance of the algorithm degrades, since the majority of the sequences have almost the same distance to the query. This behavior again follows the same pattern as high-dimensional indexing methods [6,22].

Fig. 21. Index performance for variable number of data clusters.

7. Conclusions

We have presented efficient techniques to accurately compute the similarity between time-series with a significant amount of noise, especially when full rigid transformations (e.g., shifting, scaling and rotation) are necessary. Our distance measure is based on the LCSS model and performs very well for noisy signals. Since the exact computation is inefficient, we presented approximate algorithms with provable performance bounds. Our experiments indicate that the approximation algorithm can be used to get an accurate and fast estimation of the distance between two time-series, even under noisy conditions.

Moreover, we presented an efficient index structure for similarity (nearest neighbor) queries, which is based on hierarchical clustering. The distance that we use is not a metric, and therefore the triangle inequality does not hold; however, we prove that a similar (although weaker) inequality holds, which allows us to prune parts of the dataset without any false dismissals. The results from the index evaluation show that we can achieve good speedups when searching for similar sequences, compared with the brute-force linear scan.

Another approach to indexing time-series for similarity retrieval is to use embeddings and map the set of sequences to points in a low-dimensional Euclidean space [15]. The challenge, of course, is to find an embedding that approximately preserves the original pairwise distances and gives good approximate results to similarity queries. We also plan to investigate biased sampling to improve the running time of the approximation algorithms.

References

1. Agrawal, R., Faloutsos, C. and Swami, A. (1993). Efficient Similarity Search in Sequence Databases. In Proc. of the 4th FODO, pp. 69–84.
2. Agarwal, P.K., Arge, L. and Erickson, J. (2000). Indexing Moving Points. In Proc. of the 19th ACM Symp. on Principles of Database Systems (PODS), pp. 175–186.
3. Agrawal, R., Lin, K.-I., Sawhney, H.S. and Shim, K. (1995). Fast Similarity Search in the Presence of Noise, Scaling, and Translation in Time-Series Databases. In Proc. of VLDB, pp. 490–501.
4. André-Jönsson, H. and Badal, D.Z. (1997). Signature Technique for Similarity-Based Queries. In SEQUENCES '97.
5. Barbara, D. (1999). Mobile Computing and Databases — A Survey. IEEE TKDE.
6. Beyer, K., Goldstein, J., Ramakrishnan, R. and Shaft, U. (1999). When is "Nearest Neighbor" Meaningful? In Proc. of ICDT, pp. 217–235.
7. Berndt, D. and Clifford, J. (1994). Using Dynamic Time Warping to Find Patterns in Time Series. In Proc. of KDD Workshop.
8. Bollobas, B., Das, G., Gunopulos, D. and Mannila, H. (1997). Time-Series Similarity Problems and Well-Separated Geometric Sets. In Proc. of the 13th SCG.
9. Chakrabarti, K. and Mehrotra, S. (2000). Local Dimensionality Reduction: A New Approach to Indexing High Dimensional Spaces. In Proc. of VLDB, pp. 89–100.
10. Bozkaya, T., Yazdani, N. and Ozsoyoglu, M. (1997). Matching and Indexing Sequences of Different Lengths. In Proc. of the CIKM.
11. Chu, K. and Wong, M. (1999). Fast Time-Series Searching with Scaling and Shifting. In Proc. of the 18th ACM Symp. on Principles of Database Systems (PODS), pp. 237–248.
12. Cormen, T.H., Leiserson, C.E. and Rivest, R.L. (1990). Introduction to Algorithms. The MIT Electrical Engineering and Computer Science Series, MIT Press/McGraw Hill.
13. Das, G., Gunopulos, D. and Mannila, H. (1997). Finding Similar Time Series. In Proc. of the First PKDD Symp., pp. 88–100.
14. Duda, R. and Hart, P. (1973). Pattern Classification and Scene Analysis. John Wiley and Sons, Inc.
15. Faloutsos, C. and Lin, K.-I. (1995). FastMap: A Fast Algorithm for Indexing, Data-Mining and Visualization of Traditional and Multimedia Datasets. In Proc. of ACM SIGMOD, pp. 163–174.
16. Faloutsos, C., Ranganathan, M. and Manolopoulos, Y. (1994). Fast Subsequence Matching in Time-Series Databases. In Proc. of ACM SIGMOD, pp. 419–429.
17. Gaffney, S. and Smyth, P. (1999). Trajectory Clustering with Mixtures of Regression Models. In Proc. of the 5th ACM SIGKDD, pp. 63–72.
18. Ge, X. and Smyth, P. (2000). Deformable Markov Model Templates for Time-Series Pattern Matching. In Proc. of the 6th ACM SIGKDD, Boston, MA (July, 2000).
19. Gionis, A., Indyk, P. and Motwani, R. (1999). Similarity Search in High Dimensions via Hashing. In Proc. of the 25th VLDB, pp. 518–529.
20. Betke, M., Gips, J. and Fleming, P. (2000). The Camera Mouse: Preliminary Investigation of Automated Visual Tracking for Computer Access. In Proceedings of the Rehabilitation Engineering and Assistive Technology Society of North America 2000 Annual Conference, Orlando, FL, pp. 98–100.
21. Goldin, D. and Kanellakis, P. (1995). On Similarity Queries for Time-Series Data. In Proc. of CP '95, Cassis, France.
22. Goldstein, J. and Ramakrishnan, R. (2000). Contrast Plots and P-Sphere Trees: Space vs. Time in Nearest Neighbour Searches. In Proc. of the VLDB, pp. 429–440.
23. Grumbach, S., Rigaux, P. and Segoufin, L. (2000). Manipulating Interpolated Data is Easier than You Thought. In Proc. of VLDB, pp. 156–165.
24. Jagadish, H.V., Mendelzon, A.O. and Milo, T. (1995). Similarity-Based Queries. In Proc. of the 14th ACM PODS, pp. 36–45.
25. Kahveci, T. and Singh, A. (2001). Variable Length Queries for Time Series Data. In Proc. of ICDE, pp. 273–282.
26. Keogh, E., Chakrabarti, K., Mehrotra, S. and Pazzani, M. (2001). Locally Adaptive Dimensionality Reduction for Indexing Large Time Series Databases. In Proc. of ACM SIGMOD, pp. 151–162.
27. Keogh, E. and Pazzani, M. (2000). Scaling up Dynamic Time Warping for Datamining Applications. In Proc. of the 6th ACM SIGKDD.
28. Kollios, G., Gunopulos, D. and Tsotras, V.J. (1999). On Indexing Mobile Objects. In Proc. of ACM Principles of Database Systems, pp. 261–272.
29. Lee, S.L., Chun, S.J., Kim, D.H., Lee, J.H. and Chung, C.W. (2000). Similarity Search for Multidimensional Data Sequences. In Proc. of ICDE, pp. 599–608.
30. Levenshtein, V. (1966). Binary Codes Capable of Correcting Deletions, Insertions, and Reversals. In Soviet Physics — Doklady 10, pp. 707–710.
31. Park, S., Chu, W.W., Yoon, J. and Hsu, C. (2000). Efficient Searches for Similar Subsequences of Different Lengths in Sequence Databases. In Proc. of ICDE, pp. 23–32.
32. Perng, C.-S., Wang, H., Zhang, S. and Parker, D.S. (2000). Landmarks: A New Model for Similarity-Based Pattern Querying in Time Series Databases. In Proc. of ICDE, pp. 33–42.
33. Pfoser, D. and Jensen, C.S. (1999). Capturing the Uncertainty of Moving-Object Representations. In Proc. of SSD, Lecture Notes in Computer Science 1651.
34. Pfoser, D., Jensen, C.S. and Theodoridis, Y. (2000). Novel Approaches in Query Processing for Moving Objects. In Proc. of the 26th VLDB, Cairo, Egypt.
35. Qu, Y., Wang, C. and Wang, X.S. (1998). Supporting Fast Search in Time Series for Movement Patterns in Multiple Scales. In Proc. of the ACM CIKM, pp. 251–258.
36. Rafiei, D. and Mendelzon, A.O. (2000). Querying Time Series Data Based on Similarity. IEEE Transactions on Knowledge and Data Engineering, 12(5), pp. 675–693.
37. Rice, S.V., Bunke, H. and Nartker, T.A. (1997). Classes of Cost Functions for String Edit Distance. Algorithmica, 18(2), pp. 271–280.
38. Sakoe, H. and Chiba, S. (1978). Dynamic Programming Algorithm Optimization for Spoken Word Recognition. IEEE Trans. Acoustics, Speech and Signal Processing (ASSP), 26(1), pp. 43–49.
39. Saltenis, S., Jensen, C.S., Leutenegger, S.T. and Lopez, M.A. (2000). Indexing the Positions of Continuously Moving Objects. In Proc. of the ACM SIGMOD, pp. 331–342.
40. Vlachos, M., Kollios, G. and Gunopulos, D. (2002). Discovering Similar Multidimensional Trajectories. In Proc. of IEEE Int. Conference in Data Engineering (ICDE).
41. Vlachos, M. (2001). Efficient Similarity Measures and Indexing Structures for Multidimensional Trajectories. MSc Thesis, UC Riverside.
42. Wagner, R.A. and Fischer, M.J. (1974). The String-to-String Correction Problem. Journal of the ACM, 21(1), pp. 168–173.
43. Weber, R., Schek, H.J. and Blott, S. (1998). A Quantitative Analysis and Performance Study for Similarity Search Methods in High-Dimensional Spaces. In Proc. of VLDB, pp. 194–205.
44. Yi, B.K. and Faloutsos, C. (2000). Fast Time Sequence Indexing for Arbitrary Lp Norms. In Proc. of the VLDB, Cairo, Egypt.
45. Yi, B.K., Jagadish, H.V. and Faloutsos, C. (1998). Efficient Retrieval of Similar Time Sequences Under Time Warping. In Proc. of the 14th ICDE, pp. 201–208.
CHAPTER 5

CHANGE DETECTION IN CLASSIFICATION MODELS INDUCED FROM TIME SERIES DATA

Gil Zeira∗ and Oded Maimon†
Department of Industrial Engineering, Tel-Aviv University, Tel-Aviv 69978, Israel
Email: ∗zeiragil@post.tau.ac.il, †maimon@eng.tau.ac.il

Mark Last
Department of Information Systems Engineering, Ben-Gurion University, Beer-Sheva 84105, Israel
Email: mlast@bgumail.bgu.ac.il

Lior Rokach
Department of Industrial Engineering, Tel-Aviv University, Tel-Aviv 69978, Israel
Email: liorr@eng.tau.ac.il

Most classification methods are based on the assumption that the historic data involved in building and verifying the model is the best estimator of what will happen in the future. One important factor that must not be set aside is the time factor. As more data is accumulated into the problem domain, one must examine whether the new data agrees with the previous datasets and make the relevant assumptions about the future. This work presents a new change detection methodology, which detects significant changes, incrementally over time, in the classification model induced from time series data, using a set of statistical estimators. These changes can be detected independently of the data mining algorithm that is used for constructing the corresponding model. By implementing the novel approach on a set of artificially generated datasets, all significant changes were detected in the relevant periods. In the real-world datasets evaluation, the method produced similar results.

Keywords: Classification, change detection, incremental learning, info-fuzzy network, time series.
1. Introduction

As mass of data is incrementally accumulated into large databases over time, we tend to believe that the new data "acts" somehow resembling the prior knowledge we have on the operation or facts that it describes. An important question that one must examine whenever a new mass of data is accumulated is: "Is it really wise to keep on reconstructing or verifying the current model, when everything or something in our notion of the model could have significantly changed?" In other words, the main problem is not how to reconstruct better, but rather how to detect a change in a model based on a time series database.

There are many algorithms and methods that deal with the incremental learning problem, which is concerned with updating an induced model upon receiving new data. The main topic in most incremental learning theories is how the model (this could be a set of rules, a decision tree, neural networks, and so on) is refined or reconstructed efficiently as new amounts of data are encountered. For example: Utgoff's method for incremental induction of decision trees (ITI) [35,36], Wei-Min Shen's semi-incremental learning method (CDL4) [34], the simple Backpropagation algorithm for neural networks [27], Byoung-Tak Zhang's method for feedforward neural networks (SELF) [40], Alfonso Gerevini's network constraints updating technique [12], the Cheung technique for updating association rules in large databases [5], Liu and Setiono's incremental feature selection (LVI) [24], and more. This problem has been challenged by many of the algorithms mentioned above, generally when the records were received online and changes had a low magnitude, and many of them performed significantly better than running the algorithm from scratch. However, change detection in classification is still not well elaborated.

Change detection in time series is not a new subject, and it has always been a topic of continued interest. For instance, David W. Jones et al. [17] have developed a change detection mechanism for serially correlated multivariate data, and Yao [39] has estimated the number of change points in time series using the BIC criterion. These methods are specific to the underlying data mining model.

Some researchers have proposed various representations of the problem of large time-changing databases and populations, including:
• Defining robustness and discovering robust knowledge from databases [15], or learning stable concepts [13] in domains with hidden changes in concepts.
• Identifying and modeling a persistent drift in a database [11].
• Adapting to concept and population drift [14,18,19,21,34].
• Activity monitoring [9].

Rather than challenging the problem of detecting significant changes, the above methods deal directly with data mining in a changing environment. This chapter introduces a novel methodology for detecting a significant change in a classification model of data mining, by identifying distinct categories of changes and implementing a set of statistical estimators. The major contribution of our change detection procedure (as will be described in later sections) is the ability to make a confident claim that a model that was pre-built based on a sufficiently large dataset is no longer valid for any future use such as prediction, and that, consequently, a new model must be constructed.

The rest of the chapter is organized as follows. In Section 2 we present our change detection procedure in data mining models, by defining the data mining classification model characteristics (2.1), describing the variety of possible significant changes (2.2), defining the hypothesis testing (2.3), and presenting the methodology for the change detection procedure (2.4). In Section 3, we describe an experimental evaluation of the change detection methodology by introducing artificial changes in databases and implementing the change detection methodology to identify these changes. Section 4 describes evaluation of the change detection methodology on real-world datasets. Section 5 presents validation of the work's assumptions, and Section 6 concludes this chapter by summing up the main contributions of our method and presenting several options for future research in implementation and extension of our methodology.

2. Change Detection in Classification Models of Data Mining

2.1. Classification Model Characteristics

Classification is the task which involves constructing a model predicting the (usually categorical) label of a classifying attribute and using it to classify new data. The induced classification model can be represented as a set of rules (see the RISE system [8], the BMA method for rule induction [6], PARULEL and PARADISER, etc.), a decision tree (see [34–36]), neural networks (see [35]), information-theoretic connectionist networks (see IFN [25]) and so on.

The following notation is a general description of the data mining classification modeling task. Given a database D, containing a complete set of records X = (A, T), where A = (A1, A2, ..., An) is a vector of candidate input variables (attributes from the examined phenomenon) and T is a target variable (i.e., the target concept), each record is regarded as a complete conjunction of attributes and a target concept, such as Xi = (A1i = a1j(1), A2i = a2j(2), ..., Ani = anj(n), Ti = tj). The goal of the classification task is to find the best set of hypotheses H within the available hypothesis space to describe the model M, using a data mining algorithm G that generalizes the relationship between the set of candidate input variables and the target variable, generating the following general equation:

MG: A → T̂.  (1)

It is obvious that not all attributes are proved to be statistically significant enough to be included in the set of hypotheses. In most algorithms, the database D is divided into two parts — learning (Dlearn) and validation (Dval) sets. The first is supposed to hold enough information for assembling a statistically significant and stable model based on the DM algorithm. The second part is supposed to ensure that the algorithm achieves its goals, by validating the built models on unseen records. Evaluating the prediction accuracy of any model M, which is built by a classification algorithm G, is commonly performed by estimating the validation error rate of the examined model.

When the database D is not fixed but, alternatively, accumulated over time, the classification task should be altered: in every period K, a new set of records XK is accumulated to the database. dK will denote the set of records XK that was added at the start of period K, and DK will denote the accumulated database DK = ∪ dk. To simplify, in every period K the task is to generate the best set of hypotheses HK to describe the accumulated model MK, and a new question is encountered: is MK,G = MK+1,G, in every K = 1, ..., k?
G. containing a complete set of records XK .g. An ) is a known set of attributes of the desired phenomenon and T (also noted in DM literature as Y ) is a known discrete or continuous target variable or concept. which might have some inﬂuence on the target concept). Rokach
tests H within the available hypothesis space. . given a database DK .
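The periodic setting just described can be made concrete with a small sketch. This is an illustrative assumption of how the bookkeeping might look — the record format, the 1/3 holdout ratio and the helper names are ours, not the chapter's:

```python
# Sketch of the periodic bookkeeping: D_K is the union of the record
# sets d_1 ... d_K, and each D_K is split into learning and validation
# parts (1/3 holdout is an illustrative choice).
import random

def accumulate(periods):
    """Return D_K = d_1 U ... U d_K as a flat list of records."""
    D = []
    for d_k in periods:
        D.extend(d_k)
    return D

def holdout_split(D, val_fraction=1 / 3, seed=0):
    """Divide D into learning (D_learn) and validation (D_val) sets."""
    records = D[:]
    random.Random(seed).shuffle(records)
    n_val = int(len(records) * val_fraction)
    return records[n_val:], records[:n_val]

# toy usage: three periods of (input attribute vector, target) records
d1 = [((0, 1), 0), ((1, 1), 1)]
d2 = [((1, 0), 1), ((0, 0), 0)]
d3 = [((1, 1), 1), ((0, 1), 0)]
D3 = accumulate([d1, d2, d3])
D_learn, D_val = holdout_split(D3)
assert len(D_learn) == 4 and len(D_val) == 2
```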
As noted before, most existing methods have dealt with questions such as "how can the model M be updated efficiently when a new period K is encountered?" or "how can we adapt to the time factor?", rather than asking the following questions:

• "Was the model significantly changed during the period K?"
• "What was the nature of the change?"
• "Should we consider several of the past periods as redundant, or as not required, in order for an algorithm G to generate a better model M?"

Hence, the objective of this work is to define and evaluate a change detection methodology for identifying a significant change that happened during period K in a classification model which was incrementally built over periods 1 to K−1.

2.2. Variety of Changes

Various significant changes might occur when inducing the model M using the algorithm G. There are several possible causes for significant changes in the data mining model:

(i) A change in the probability distribution of one or more of the input attributes (A). For instance, if the database in periods 1 to K−1 consists of 45% males and 55% females, while in period K all records represent males, then the distribution of an input attribute has changed.

(ii) A change in the distribution of the target variable (T). For example, consider examining the rate of failures in final exams, based on the characteristics of the students, over consecutive years. If in the year 1999 the average failure rate was 20% and in the year 2000 it was 40%, then a change in the target distribution has occurred.

(iii) A change in the "patterns" (rules) which define the relationship of the input attributes to the target variable, derived from a change in the set of hypotheses H — that is, a change in the model M itself. In the same example, if in 1999 male students had 60% failures and female students had 5% failures, while in 2000 the situation was the opposite, then it is obvious that there was a change in the patterns of behavior. This work defines this cause of a significant change in a data mining model M as follows: a change C is encountered in period K if the validation error of the model M_{K−1} (the model that is based on D_{K−1}) on the database D_{K−1} is significantly different from the validation error rate of the model M_{K−1} over d_K.

(iv) Inconsistency of the DM algorithm. The basic assumption here is that the DM algorithms in use were proven to be consistent, and are therefore not prone to be a major cause of a change in the model. Although this option should not be omitted, this work does not intend to deal with consistency or inconsistency measures of existing DM algorithms. We just point out that the DM algorithm chosen for our experiments (IFN) was shown to be stable in our previous work [21].

As noted in subsection 2.1, the data mining model is generally described as M_G: A → T̂, that is, a relationship between A and T. Therefore, if the cause of change in the relevant period is a change in the target variable (e.g., France has stopped producing white wine, and the percentage of white wine consumption has dropped from 50% to 30%), it is possible that the rules relating the relevant attributes to the target variable(s) will be affected as well. A change in the input attributes can likewise cause a change in the target variable: if the percentage of women rises from 50% to 80%, and women drink more white wine than men, then the overall percentage of white wine consumption (T) will also increase. This can also affect the overall error rate of the data mining model, for instance if most rules were generated for men. Since it is not trivial to identify the dominant cause of change, this work claims that a variety of possible changes might occur in a relevant period. Table 1 defines this variety of changes.

Table 1. Definition of the variety of changes in a data mining model.

A    T    Rules    Description
−    −    −        No change
−    +    −        A change in the target variable
+    −    −        A change in the input attribute variable(s)
+    +    −        A change in the target and in the input variable(s)
−    −    +        A change in the "patterns" (rules) of the data mining model
−    +    +        A change in the "patterns" (rules) and a change in the target variable
+    −    +        A change in the "patterns" (rules) and a change in the input variable(s)
+    +    +        A change in the "patterns" (rules) and a change in the target and the input variables
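The eight rows of Table 1 are simply all on/off combinations of the three causes, which can be enumerated directly (a trivial illustration of the counting argument, not part of the original procedure):

```python
# Eight possible change types = all combinations of the three causes:
# input distribution (A), target distribution (T), and rules (R).
from itertools import product

change_types = [dict(zip(("A", "T", "Rules"), combo))
                for combo in product((False, True), repeat=3)]
assert len(change_types) == 8
assert {"A": False, "T": False, "Rules": False} in change_types  # "no change"
```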
The definition of this variety of possible changes in a data mining model is quite a new concept: so far, researchers have tended to deal separately with concept change, activity monitoring, population change, etc. The change in a given period can be caused by one of the three main causes mentioned above, or by a combination of them; Table 1 presents the possible combinations of significant causes in a given period. The notion that all three major causes interact and affect each other is also quite new, and it is tested and validated in this work for the first time.

2.3. Statistical Hypothesis Testing

In order to determine whether or not a significant change has occurred in period K, a set of statistical estimators is presented in this chapter. The use of these estimators is subject to several conditions:

1. The same DM algorithm is used in all periods to build the classification model (e.g., C4.5 or IFN).
2. The same method is used for estimating the validation error rate in all periods (e.g., 1/3 holdout, 5-fold or 10-fold cross-validation, etc.).
3. Every period contains a sufficient amount of training data to rebuild a model for that specific period. The decision of whether a period contains a sufficient number of records should be based on the classification algorithm in use, the inherent noisiness of the training data, the acceptable difference between the training and validation error rates, and so on.

Detecting a change in "patterns" (rules). "Patterns" (rules) define the relationship between the input and target variables. As noted, a change in the rules (R) is encountered during period K if the validation error of the model M_{K−1} (the model that is based on D_{K−1}) on the database D_{K−1} is significantly different from the validation error rate of the model M_{K−1} over d_K (the records obtained during period K). Since massive data streams are usually involved in building an incremental model, we can safely assume that the true validation error rate of a given incremental model is accurately estimated by the previous K−1 periods. Therefore, the parameter of interest for the statistical hypothesis testing is the true validation error rate, and the null hypothesis is as follows:

H0: e_{M_{K−1},K} = e_{M_{K−1},K−1}
H1: e_{M_{K−1},K} ≠ e_{M_{K−1},K−1}
In order to detect a significant difference between the two error rates, the following statistic is tested (two-sided hypothesis):

d̂ = ê_{M_{K−1},K} − ê_{M_{K−1},K−1}

σ̂d² = ê_{M_{K−1},K} (1 − ê_{M_{K−1},K}) / n_K + ê_{M_{K−1},K−1} (1 − ê_{M_{K−1},K−1}) / n_{K−1(val)}

If |d̂| ≥ z_{1−α/2} · σ̂d, then reject H0: a change has occurred in period K.

where:
ê_{M_{K−1},K−1} is the validation error rate of the model M_{K−1} measured on the D_{K−1(val)} set of records (the standard validation error of the model);
ê_{M_{K−1},K} is the validation error rate of the aggregated model M_{K−1} on the set of records d_K;
n_{K−1(val)} = |D_{K−1(val)}| is the number of records selected for validation from periods 1, ..., K−1;
n_K = |d_K| is the number of records in period K.

The foundations of this way of hypothesis testing, when comparing error rates of classification algorithms, can be found in [29].

Detecting changes in variable distributions. The second statistical test is Pearson's chi-square statistic for comparing multinomial variables (see [28]). This test examines whether a sample of a variable is drawn from a given probability distribution, assumed to be the true distribution of that variable. Again, since massive data streams are usually involved in building an incremental model, it is safe to assume that the stationary distribution of any variable in a given incremental model is accurately estimated by the previous K−1 periods. The objective of this estimator in the change detection procedure is to validate the assumption that the distribution of an input or a target variable has changed significantly in the statistical sense. The following null hypothesis is tested for every variable of interest X:

H0: the variable X's distribution is stationary (time-invariant);
H1: otherwise.

The decision is based on the following formula:

X_p² = n_K · Σ_{i=1}^{j} (x_{iK}/n_K − x_{iK−1}/n_{K−1})² / (x_{iK−1}/n_{K−1})    (2)

where:
n_K is the number of records in the Kth period;
x_{iK} is the number of records in the ith class of variable X in period K.
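A minimal sketch of the CD test defined above may help: a two-sided z-test comparing the validation error rate of the frozen model M_{K−1} on D_{K−1(val)} with its error rate on the new records d_K. This restates the formulas in Python — it is not the authors' code; `NormalDist` is from the standard library, and the example error rates and sample sizes are invented:

```python
# Two-sided z-test on the two validation error rates (the CD estimator).
from statistics import NormalDist

def cd_test(err_prev, n_prev_val, err_new, n_new, alpha=0.05):
    """Return (change_detected, 1 - p-value) for the rules-change test."""
    d_hat = abs(err_new - err_prev)                      # |d-hat|
    sigma_d = (err_new * (1 - err_new) / n_new           # sigma_d from the Eq. above
               + err_prev * (1 - err_prev) / n_prev_val) ** 0.5
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)         # z_{1-alpha/2}
    p_value = 2 * (1 - NormalDist().cdf(d_hat / sigma_d))
    return d_hat >= z_crit * sigma_d, 1 - p_value

# e.g. 20% error on 5000 past validation records vs. 24% on 1000 new records
detected, confidence = cd_test(0.20, 5000, 0.24, 1000)
assert detected  # the jump from 20% to 24% is significant at alpha = 0.05
```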
x_{iK−1} is the number of records in the ith class of variable X in periods 1, ..., K−1;
n_{K−1} is the number of records in periods 1, ..., K−1.

If X_p² > χ²_{1−α}(j−1), where j is the number of classes of the tested variable, then the null hypothesis, that the variable X's distribution in period K is stationary as in the previous periods, is rejected. A full explanation of Pearson's statistical hypothesis testing is provided in [28].

2.4. Methodology

This section describes the algorithmic usage of the previous estimators.

Inputs:
• G — the DM algorithm used for constructing the classification model (e.g., C4.5 or IFN).
• M — the classification model constructed by the DM algorithm (e.g., a decision tree).
• V — the validation method in use (e.g., 1/3 holdout, 5-fold cross-validation).
• K — the cumulative number of periods in the data stream.
• α — the desired significance level for the change detection procedure (the probability of a false alarm when no actual change is present).

Outputs:
• CD(α) — the error-based change detection estimator (1 − p-value).
• XP(α) — the Pearson's chi-square estimator of distribution change (1 − p-value).

Change Detection Procedure

Stage 1: For periods 1, ..., K−1, build the model M_{K−1} using the DM algorithm G. Define the validation data set D_{K−1(val)} and count its records, n_{K−1(val)} = |D_{K−1(val)}|. Calculate the validation error rate ê_{M_{K−1},K−1} according to the validation method V. Calculate x_{iK−1} and n_{K−1} for every input and target variable existing in periods 1, ..., K−1.
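The XP estimator can likewise be sketched in a few lines. The chi-square critical values below are the standard table values for α = 0.05, the counts in the usage example are invented, and this is an illustration of Eq. (2) rather than the authors' implementation:

```python
# Pearson's statistic (Eq. 2): compare the class frequencies of a variable
# in period K against its distribution in periods 1..K-1.

# chi-square 95th-percentile critical values for small degrees of freedom
CHI2_95 = {1: 3.841, 2: 5.991, 3: 7.815, 4: 9.488, 5: 11.070}

def xp_test(counts_K, counts_prev, chi2_crit=CHI2_95):
    """Return (change_detected, X_p^2) for one multinomial variable."""
    n_K, n_prev = sum(counts_K), sum(counts_prev)
    x_p2 = n_K * sum((x_k / n_K - x_p / n_prev) ** 2 / (x_p / n_prev)
                     for x_k, x_p in zip(counts_K, counts_prev))
    j = len(counts_K)                     # number of classes of the variable
    return x_p2 > chi2_crit[j - 1], x_p2

# stationary split of roughly 45%/55%, shifted to 70%/30% in period K
detected, stat = xp_test([700, 300], [4500, 5500])
assert detected  # X_p^2 is about 252, far above the 3.841 critical value
```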
Stage 2: For period K, define the set of records d_K and count its records, n_K = |d_K|. Calculate ê_{M_{K−1},K} according to the validation method V. Calculate the difference d̂ = |ê_{M_{K−1},K} − ê_{M_{K−1},K−1}| and the rejection threshold H0 = z_{1−α/2} · σ̂d. Calculate and return CD(α).

Stage 3: For every input and target variable existing in periods 1, ..., K, calculate x_{iK}, n_K and X_p². Calculate and return XP(α).

The procedure thus has three major stages. The first performs the initialization of the procedure. The second is aimed at detecting a significant change in the "patterns" (rules) of the pre-built data mining model. The third tests whether the distribution of one or more variables in the set of input or target variables has changed, as described in the previous section. It is obvious that the complexity of this procedure is at most O(n_K). Moreover, it is very easy to store information about the distributions of the target and input variables in order to simplify the change detection methodology.

The methodology is not restricted to databases with a constant number of variables. The basic assumption is that if the addition of a new variable influences the relationship between the target variable and the input variables in a manner that changes the validation accuracy, this will be identified as a significant change.

Based on the outputs of the change detection procedure, the user can make a distinction between the eight possible variations of a change in the data mining classification model (see the subsection above). Knowing the causes of the change (if any), the user of this new methodology can act in several ways: explore the type of the change, its magnitude, and its effect on other characteristics of the DM model; absorb the new period and update the model by using an incremental algorithm; reapply the algorithm from scratch to the new data; or incorporate other known methods dealing with the specific change(s) detected. One may also apply multiple-model approaches such as boosting, bagging, or voting. Finally, set K = K + 1 and perform the change detection procedure again for the next period.
The basic assumption behind this procedure is that there are sufficient statistics for a run of the algorithm in every period. If this assumption is not valid, it is necessary to merge two or more periods in order to maintain statistically significant outcomes.

3. Experimental Evaluation

3.1. Design of Experiments

In order to evaluate the change detection algorithm, a set of artificially generated datasets was built with the following characteristics:

• a predetermined definition and distribution of all variables (candidate input and target);
• a predetermined set of rules;
• pure random generation of records;
• minimal randomly generated noise;
• no missing data;
• non-correlated datasets (between periods).

All datasets were mined with the IFN (Information-Fuzzy Network) program (version 1.2 beta), which is based on the information-theoretic fuzzy approach to knowledge discovery in databases [Maimon and Last (2000)]. This novel method, developed by Mark Last and Oded Maimon, was shown to have better dimensionality reduction capability, interpretability, and stability than other data mining methods [e.g., see Last et al. (2002)], and was therefore found suitable for this study.

This chapter uses two sets of experiments to evaluate the performance of the change detection procedure:

• The first set is aimed at estimating the hit rate (also called the "true positive rate") of the change detection methodology. We introduced and tested a series of artificially generated, non-correlated changes of various types. Twenty-four different changes in two different databases were designed under the rules mentioned above, in order to confirm the expected outcomes of the change detection procedure. All changes were tested independently at the minimum 5% significance level by the set of hypotheses defined above, and all hypotheses were tested separately, with the purpose of evaluating the relationship between the tests. Table 2 below summarizes the distribution of the artificially generated changes in the experiments on Database#1 and Database#2.
• The second set of experiments on the artificially generated datasets evaluated the change detection procedure on time series data. During the third and the sixth periods (out of seven consecutive periods), two changes (in the classification "rules") were introduced into the database, disregarding the previous and the consecutive periods.

In addition, 12 non-correlated sets of experiments which do not contain any change were implemented, in order to estimate the actual Type I error rate (α) — the "detection" of a change that does not occur. This error is also called the "false positive rate" or "false alarm rate".

Table 2. Distribution of the artificially generated changes in experiment part 1.

              Change in A    Change in Y    Change in the "Rules"    Total
Database #1        4              2                   6               12
Database #2        4              2                   6               12
Sum                8              4                  12               24

The expected outcomes of the hypothesis testing on the various types of changes are described in Tables 3 and 4. It is expected that every uncorrelated change implemented will lead to the associated outcome of the relevant test, as indicated in Tables 3 and 4; in the time series experiment, it is expected that the Change Detection Procedure will reveal the changes in the two relevant periods only, and not in any other period.

Table 3. Expected outcomes of the Change Detection in Rules (CD) estimator.

           Change in A         Change in T         Change in the "patterns" (rules)
CD(5%)     TRUE/FALSE (N/A)    TRUE/FALSE (N/A)    TRUE

Table 4. Expected outcomes of Pearson's estimator for comparing distributions of variables (XP).

Database section tested                  Change in A         Change in T         Change in the "patterns" (rules)
Candidate variables XP(5%)               TRUE                TRUE/FALSE (N/A)    TRUE/FALSE (N/A)
Target variables XP(5%)                  TRUE/FALSE (N/A)    TRUE                TRUE/FALSE (N/A)
Candidate & target variables XP(5%)      TRUE                TRUE                TRUE/FALSE (N/A)
3.2. Results — Part 1 (Hit Rate and False Alarm Rate)

Based on the 24 change detection trials, the following accumulated results were obtained, as described in detail in Tables 5 and 6:

• All artificial changes in the "patterns" (rules) relating the input variables to the target variable were detected by the CD(5%) procedure. The average detection rate of CD in these trials was 100%, meaning that all such changes were recognized as significant by the procedure.
• The trials generated by an artificial change in the target variable were likewise detected by the CD procedure.
• All the changes which were mainly introduced to affect the target variable were recognized as significant by the XP procedure, with an average detection rate of 100%.
• All the changes which were mainly introduced to affect the relationship between the target and the candidate input variables were recognized as significant by the XP procedure, resulting in an average detection rate of 100% for the input attributes and 98% for the target attribute.
• According to the XP hypothesis testing, 50% of the changes in the candidate input variables produced a change in the target variable as well.
• The actual Type I error (false alarm) rate of the change detection methodology in these experiments was at most 6%. When implementing the methodology on the trials which did not include changes, the CD estimator did not produce false alarms (0% error rate), and the XP estimator failed to produce an accurate estimation in only five out of 84 tests (5.9%).
• The actual Type II error (false negative) rate of the change detection methodology in this experiment was 0%: no change was left undetected by the methodology.

To conclude this part of the experiments, the following assumptions are validated by the results described above: a major change can cause secondary effects and generate other changes, and the change detection procedure can detect these changes with theoretically and practically low Type I and Type II error rates.

Table 5. Average detection rate in the 24 trials (all significant attributes included in XP).

Type of change:            A       T       R       Average
Number of changes:         8       4       12      24
CD                         67%     99%     100%    90%
XP (candidate)             100%    100%    100%    100%
XP (target)                99%     98%     98%     100%
XP (target + candidate)    —       —       —       99%

Table 6. Distribution of 5%-significant outcomes in the 24 trials.

Type of change:            A       T       R       Sum
Number of changes:         8       4       12      24
CD                         0       3       12      15
XP (candidate)             4       1       9       14
XP (target)                4       3       3       10
XP (target + candidate)    —       —       —       10

3.3. Results — Part 2 (Time Series Data)

Table 7 and Figure 1 describe the outcomes of applying the IFN algorithm to seven consecutive periods of the same artificially generated database (Database#1), with the two introduced changes, and of using the change detection methodology to detect significant changes that occurred during these periods. The results of this experiment can be summarized as follows:

• All changes were detected by the change detection procedure only in the relevant periods, where the corresponding changes had occurred.
• Whenever CD attained significance at the 5% level, XP also reached a significant value at that level.
• The effect of a change on CD decreases after one period (when a change occurs in period K, the significance level in period K+1 is most similar to that of period K−1).
• The XP estimator of distribution change in the input and target variables is quite sensitive to the effect of any change.

These results support the assumptions that a change in the target variable does not necessarily affect the classification "rules" of the database, and that a change can mainly be detected in the first period after its occurrence. Hence, if a change is not properly detected in time, it will be absorbed and discarded.
Table 7. Results of the CD hypothesis testing on an artificially generated time series database.

[Numeric table not recoverable from this copy. Columns: period (1–7); change introduced (No, No, Yes, No, No, Yes, No); ê_{M_{K−1},K−1}; ê_{M_{K−1},K}; d̂; H(95%); CD (1 − p-value); XP (1 − p-value).]
Fig. 1. Summary of implementing the change detection methodology on an artificially generated time series database (1 − p-value). [Chart not reproducible here.]

If our procedure detects a change in period K of a time series, the incrementally built classification model should be modified accordingly. The modification method could be one of the following [see Last (2002)]:

(a) rebuild a new model based on period K only;
(b) combine the DM model from periods 1, ..., K with a new model by using a voting schema (unified, deterministic choice, any weighted majority, exponential smoothing, etc.) in order to improve the predictive accuracy in the future;
(c) add j more periods, in order to rebuild a model based on periods K, ..., K+j.

In order to illustrate the influence of discarding a detected change, we look at the rules induced from the database by the IFN algorithm before and after the second change has occurred (Period 6).

Table 8. Influence of discarding the detected change (illustration).

[Table only partially recoverable from this copy. For each candidate-variable combination (Products 0–8, with the Holidays attribute relevant to some of the products), the prediction of the model built on Periods 1–5 is compared with the prediction for Period 6; all rows match except one.]
It can be shown by a simple calculation that if all rules have the same frequency (that is, assuming a uniform distribution of all the input variables), the expected increase in the error rate as a result of the change in Period 6 is 9% (1 rule out of 11).

4. A Real-World Case Study

4.1. Dataset Description

In this section we apply our change detection method to a real-world time series dataset. The dataset was obtained from a large manufacturing plant in Israel and represents daily production orders of products, accumulated over a period of several months; from now on it is referred to as "Manufacturing". The target variable indicates whether an order was delivered on time or not (0 or 1). The candidate input attributes are:

Catalog number group (CATGRP) — a discrete categorical variable.
Market code group (MRKTCODE) — a discrete categorical variable.
Customer code group (CUSTOMERGRP) — a discrete categorical variable.
Quantity (QUANTITY) — a discrete categorical variable which describes the quantity of items in a production order; each value represents a distinct quantity interval.
Processing duration (DURATION) — a discrete categorical variable which represents the processing times as disjoint intervals of variable size.
Time left to operate in order to meet demand (TIME TO OPERATE) — a discrete categorical variable which stands for the amount of time between the starting date of a production order and its due date; each value represents a distinct time interval.

The 'Manufacturing' database was extracted from a continuous production sequence. Without further knowledge of the process, or any other relevant information about the nature of change of that process, we may assume that no significant changes of the operation characteristics are expected over such a short period of time. The objectives of the case study are therefore to determine: (1) whether or not our method detects changes which have occurred during some time periods; and (2) whether or not our method produces "false alarms" when the dataset's characteristics do not change significantly over time.

4.2. Presentation and Analysis of Results

Table 9 and Figure 2 describe the results of applying the IFN algorithm to six consecutive months of the 'Manufacturing' database, and of using our change detection methodology to detect significant changes that occurred during these months.
Table 9. Results of the CD hypothesis testing on the 'Manufacturing' database.

[Numeric table not recoverable from this copy. Columns: month (1–6); ê_{M_{K−1},K−1}; ê_{M_{K−1},K}; d̂; H(95%); CD (1 − p-value); XP (1 − p-value).]

Fig. 2. Summary of implementing the change detection methodology on the 'Manufacturing' database (1 − p-value). [Chart not reproducible here.]

The XP statistics, as described in Table 9 and Figure 2, refer only to the target variable (delivery on time). The magnitude of change in the candidate input variables, as evaluated across the monthly intervals, is shown in Table 10.

Table 10. XP confidence level of all independent and dependent variables in the 'Manufacturing' database (1 − p-value).

Variable           Domain    Month 2    Month 3    Month 4    Month 5    Month 6
CAT GRP            18        100%       100%       100%       100%       100%
MRKT Code          19        100%       100%       99.8%      99.9%      100%
Duration           19        100%       100%       100%       100%       100%
Time to operate    19        100%       100%       100%       100%       100%
Quantity           15        100%       100%       100%       100%       100%
Customer GRP       18        100%       100%       100%       100%       100%
According to the change detection methodology, during all six consecutive months there was no significant change in the rules describing the relationships between the candidate and the target variables (which is our main interest), although XP produced extremely high confidence levels, indicating a drastic change in the distribution of all candidate and target variables. That is, major changes were revealed by the XP statistic in the distributions of most target and candidate input variables; however, this phenomenon did not affect the CD statistic. The data mining model was not affected, and it kept producing similar validation error rates (which were statistically evaluated by CD). An interesting phenomenon is the increasing rate of the CD confidence level from month 2 to month 6. One can expect that variables with a large number of values need larger data sets in order to reduce the variation of their distribution across periods.

In order to further investigate whether a change in frequency distribution nevertheless occurred during the six consecutive months without resulting in a significant CD confidence level, we validated the sixth month on the fifth month and on the first month. Table 11 and Figure 3 describe the outcomes of the change detection methodology in this setting. Implementing the change detection methodology by validating the sixth month on the fifth and on the first month did not produce contradicting results: the CD confidence level of both months ranges within only ±8% of the original CD estimation that was based on all five previous months.

Table 11. Outcomes of XP when validating the sixth month on the fifth and on the first month of the 'Manufacturing' database (p-value).

[Numeric table only partially recoverable from this copy. Rows: month 5 validated by month 6; month 1 validated by month 6; months 1 to 5 validated by month 6. Columns: CAT GRP (domain 18), MRKT Code (19), Duration (19), Time to Operate (19), Quantity (15), Customer GRP (18), and Target (2). All candidate-variable cells read 100%; the recoverable Target cells include values around 98% and 63.1%.]

The following statements summarize the case study's detailed results:

• Our expectation that in the 'Manufacturing' database there are no significant changes in the relationship between the candidate input variables and the target variable over time is validated by the change detection procedure.
O. No results of the changed detection metric (CD) exceeding the 95% conﬁdence level were produced in any period.120
G. M.0% 76. 3. Last and L. • The CD metric implemented by our method can also be used to determine whether an incrementally built model is stable. most methods of batch learning are based on the assumption that the training data involved in building and verifying the
. it should produce increasing conﬁdence levels of the CD metric over the initial periods of the time series. Thus the results obtained from a realworld time series database conﬁrm the conclusions of the experiments on artiﬁcial datasets with respect to reliability of the proposed change detection methodology.1% 75. to an accumulated amount of data.0% 90. This means that no “false alarms” were issued by the procedure. as more data supports the induced classiﬁcation model.0%
70. Zeira. CD conﬁdence level (1 − pvalue) outcomes of validating the sixth month on the ﬁfth and the ﬁrst month in ‘Manufacturing’ database.
procedure. Maimon. Rokach
95. like the InfoFuzzy Network.0% 85.0% 88. If we are applying a stable data mining algorithm. • Statistically signiﬁcant changes in the distributions of the candidate input (independent) variables and the target (dependent) variable across monthly intervals have not generated a signiﬁcant change in the rules. Conclusions and Future Work As mentioned above. which are induced from the database.0% months 1 to 5 validated by month 6 month 1 validated by month 6 month 5 validated by month 6
Fig.
5.5%
80.
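As a concrete illustration of the XP test discussed above, the following minimal sketch (our own illustration, not the authors' code; the two-valued variable and the monthly samples are invented) compares the frequency distribution of a categorical variable in two periods with Pearson's statistic:

```python
# Hedged sketch: Pearson chi-square comparison of a categorical variable's
# frequency distribution in two periods. Example data is hypothetical.

from collections import Counter

def pearson_statistic(period_a, period_b):
    """Pearson X^2 statistic for the null hypothesis that two samples
    of a categorical variable come from the same distribution."""
    counts_a, counts_b = Counter(period_a), Counter(period_b)
    n_a, n_b = len(period_a), len(period_b)
    x2 = 0.0
    for value in set(counts_a) | set(counts_b):
        o_a, o_b = counts_a[value], counts_b[value]
        pooled = (o_a + o_b) / (n_a + n_b)      # pooled proportion
        e_a, e_b = n_a * pooled, n_b * pooled   # expected counts
        x2 += (o_a - e_a) ** 2 / e_a + (o_b - e_b) ** 2 / e_b
    return x2

month_5 = ["low"] * 50 + ["high"] * 50
month_6 = ["low"] * 20 + ["high"] * 80          # distribution has shifted
x2 = pearson_statistic(month_5, month_6)
# df = (#values - 1) = 1; the chi-square critical value at 95% is 3.841
print(x2 > 3.841)   # True: a significant change in distribution
```

A value exceeding the critical threshold for the appropriate degrees of freedom corresponds to a confidence level above 95%, as in the table above.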
Change Detection in Classification Models Induced from Time Series Data

6. Conclusions and Future Work

As mentioned above, most methods of batch learning are based on the assumption that the training data involved in building and verifying the model is the best estimator of what will happen in the future. As more data is accumulated in a time series database, incrementally over time, one must examine whether the data in a new period agrees with the data in previous periods and take the relevant decisions about learning in the future. An important factor that must not be set aside is the time factor.

This work presents a novel change detection method for detecting significant changes in classification models induced from continuously accumulated data streams by batch learning methods. This work has shown that although there are three main causes for significant changes in the data-mining models, it is common that these main causes coexist in the same data stream. Moreover, these causes affect each other in a manner and magnitude that depend on the database being mined and the algorithm in use. The following statements can summarize the major contributions of this work to the area of data mining and knowledge discovery in databases:

(i) This work defines three main causes for a statistically significant change in a data-mining model:
• A change in the probability distribution of one or more of the candidate input variables A.
• A change in the distribution of the target variable T.
• A change in the "patterns" (rules), which define the relationship of the candidate input to the target variable.

(ii) The change can be detected by the change detection procedure using a three-stage validation technique, based on the definition of a significant change C in classification "rules" and a change in the model M, deriving eight possible combinations for a significant change in a classification model induced from time series data. This technique is designed to detect all possible significant changes.

(iii) The change detection method relies on the implementation of two statistical tests: (a) Change Detection hypothesis testing (CD) of every period K, with respect to the previous K − 1 periods; and (b) Pearson's estimator (XP) for testing matching proportions of variables, to detect a significant change in the probability distribution of candidate input and target attributes.
(iv) The effect of a change is relevant and can be mainly detected in the period of change. If not detected, as indicated above, the influence of a change in a database can be absorbed in successive periods.

(v) All procedures, hypothesis tests and definitions of the Change Detection Procedure were validated on artificially generated databases and on a real-world database. These changes can be detected independently of the data mining algorithm used (e.g. C4.5 and ID3 by Quinlan, ITI and DMTI by Utgoff, CDL4 by Shen, IDTM by Kohavi, IFN by Last and Maimon, etc.) or the induced classification model (decision trees, rules, networks, etc.).

(vi) The change detection procedure has low computational costs, since it requires only testing whether the new data agrees with the model induced from previously aggregated data. Its complexity is only O(nK), where nK is the total number of validation records in K periods.

(vii) The CD metric can also be used to determine whether an incrementally built model is stable. In our real-world experiment, the stability of the info-fuzzy network has been confirmed by the increasing confidence levels over the initial periods of the data stream.

As change detection is quite a new application area in the field of data mining, many future issues could be developed, including the following:

(i) Implementing meta-learning techniques according to the cause(s) and magnitude(s) of a change(s) detected in period K for combining several models, with the use of the statistical estimators (e.g. KS2), such as: exponential smoothing, voting weights based on the CD confidence level, ignoring old or problematic periods, etc. This option is also applicable for meta-learning and multi-model methods.

(ii) Increasing the granularity of significant changes. There should be several subtypes of changes, with various magnitudes of the model structure and parameters, that could be identified by the change detection procedure.

(iii) Integrating the change detection methodology in an existing data mining algorithm, in order to give the user extra information about the mined time series. One can implement this procedure in an existing incremental (online) learning algorithm, which will continue efficiently rebuilding an existing model if the procedure does not indicate a significant change in the newly obtained data.

(iv) Using the CD statistical hypothesis testing for continuous monitoring of specific attributes.

(v) Using the CD statistical estimator or its variations as a metric for measuring the stability of an incrementally built data mining model.

(vi) Further analysis of the relationship between the two error rates α (false positive) and β (false negative) and its impact on the performance of the change detection procedure.

(vii) Further validation of the Change Detection Procedure on other DM models, algorithms, and data types.
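The CD test itself is a hypothesis test on validation error rates. The following sketch is our own simplified stand-in (a two-proportion z-test, not necessarily the authors' exact CD formulation) for flagging a significant rise in the error rate of a model validated on a new period:

```python
# Hedged sketch (not the chapter's exact CD statistic): compare the model's
# validation error rate on a new period against its error rate on previous
# periods, and flag a change when the confidence level exceeds 95%.

import math

def cd_confidence(errors_prev, n_prev, errors_new, n_new):
    """Confidence level (1 - one-sided p-value) that the error rate
    in the new period is higher than in the previous periods."""
    p_prev, p_new = errors_prev / n_prev, errors_new / n_new
    pooled = (errors_prev + errors_new) / (n_prev + n_new)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_prev + 1 / n_new))
    z = (p_new - p_prev) / se
    # standard normal CDF via the error function
    p_value = 1 - 0.5 * (1 + math.erf(z / math.sqrt(2)))
    return 1 - p_value

conf = cd_confidence(errors_prev=120, n_prev=1000, errors_new=200, n_new=1000)
print(conf > 0.95)   # True: the jump from 12% to 20% error is significant
```

When the model is stable and data accumulates, this confidence level stays below the threshold while growing smoother, mirroring the behaviour reported for the real-world experiment.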
7. References

1. Ali, K. and Pazzani, M. (1995). Learning Multiple Descriptions to Improve Classification Accuracy. International Journal on Artificial Intelligence Tools, 4(1–2).
2. Case, J., Jain, S., Lange, S., and Zeugmann, T. (1999). Incremental Concept Learning for Bounded Data Mining. Information & Computation, 152(1), 74–110.
3. Chan, P. and Stolfo, S. (1995). A Comparative Evaluation of Voting and Meta-learning on Partitioned Data. Proceedings of the Twelfth International Conference on Machine Learning, pp. 90–98.
4. Chan, P. and Stolfo, S. (1996). Sharing Learned Models among Remote Database Partitions by Local Meta-learning. Proceedings of the Second ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-96), pp. 2–7.
5. Cheung, D.W., Han, J., Ng, V., and Wong, C.Y. (1996). Maintenance of Discovered Association Rules in Large Databases: An Incremental Updating Technique. Proceedings of the 1996 International Conference on Data Engineering (ICDE'96), New Orleans, Louisiana, USA, Feb. 1996.
6. Domingos, P. (1996). Using Partitioning to Speed Up Specific-to-General Rule Induction. Proceedings of the AAAI-96 Workshop on Integrating Multiple Learned Models, AAAI Press, pp. 29–34.
7. Domingos, P. (1997). Knowledge Acquisition from Examples Via Multiple Models. Proceedings of the Fourteenth International Conference on Machine Learning, Nashville, TN: Morgan Kaufmann, pp. 98–106.
8. Domingos, P. (1997). Bayesian Model Averaging in Rule Induction. Preliminary Papers of the Sixth International Workshop on Artificial Intelligence and Statistics, Ft. Lauderdale, FL: Society for Artificial Intelligence and Statistics, pp. 157–164.
9. Fawcett, T. and Provost, F. (1999). Activity Monitoring: Noticing Interesting Changes in Behavior. Proceedings of the Fifth International Conference on Knowledge Discovery and Data Mining (KDD-99).
10. Fayyad, U., Piatetsky-Shapiro, G., and Smyth, P. (1996). Knowledge Discovery and Data Mining: Towards a Unifying Framework. Proc. Second Intl. Conf. on Knowledge Discovery and Data Mining (KDD-96).
11. Freund, Y. and Mansour, Y. (1997). Learning under persistent drift. Proceedings of EuroColt, pp. 109–118.
12. Gerevini, A., Perini, A., and Ricci, F. (1996). Incremental Algorithms for Managing Temporal Constraints. IEEE International Conference on Tools with Artificial Intelligence (ICTAI'96).
13. Harries, M. and Horn, K. (1996). Learning stable concepts in domains with hidden changes in context. Proceedings of the ICML-96 Workshop on Learning in Context-Sensitive Domains, Bari, Italy, pp. 21–31.
14. Helmbold, D.P. and Long, P.M. (1994). Tracking Drifting Concepts By Minimizing Disagreements. Machine Learning, 14(1), 27–46.
15. Hines, W.W. and Montgomery, D.C. (1990). Probability and Statistics in Engineering and Management Science, Third Edition. Wiley, New York, NY.
16. Hsu, C. and Knoblock, C.A. (1998). Discovering Robust Knowledge from Databases that Change. Journal of Data Mining and Knowledge Discovery, 2(1).
17. Hulten, G., Spencer, L., and Domingos, P. (2001). Mining Time-Changing Data Streams. Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-2001).
18. Jones, R.H., Crowell, D.H., and Kapuniai, L.E. (1970). Change detection model for serially correlated multivariate data. Biometrics, 26, 269–280.
19. Kelly, M.G., Hand, D.J., and Adams, N.M. (1999). The Impact of Changing Populations on Classifier Performance. Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-99).
20. Kohavi, R. (1995). The Power of Decision Tables. Proceedings of the European Conference on Machine Learning (ECML).
21. Lane, T. and Brodley, C.E. (1998). Approaches to Online Learning and Concept Drift for User Identification in Computer Security. Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining (KDD-98), pp. 259–263.
22. Last, M., Maimon, O., and Minkov, E. (2002). Improving Stability of Decision Trees. International Journal of Pattern Recognition and Artificial Intelligence, 16(2), 145–159.
23. Last, M. (2002). Online Classification of Nonstationary Data Streams. Intelligent Data Analysis, 6(2), 129–147.
24. Liu, H. and Setiono, R. (1998). Incremental Feature Selection. Journal of Applied Intelligence, 9(3), 217–230.
25. Maimon, O. and Last, M. (2000). Knowledge Discovery and Data Mining, the Info-Fuzzy Network (IFN) Methodology. Kluwer.
26. Mangasarian, O.L. and Solodov, M.V. (1994). Backpropagation Convergence via Deterministic Nonmonotone Perturbed Minimization. Advances in Neural Information Processing Systems, 6.
27. Martinez, T. (1990). Consistency and Generalization in Incrementally Trained Connectionist Networks. Proceedings of the International Symposium on Circuits and Systems, IEEE, pp. 706–709.
28. Minium, E.W., Clarke, R.C., and Coladarci, T. (1999). Elements of Statistical Reasoning. Wiley, New York.
29. Mitchell, T.M. (1997). Machine Learning. McGraw Hill.
30. Montgomery, D.C. and Runger, G.C. (1999). Applied Statistics and Probability for Engineers, Second Edition. Wiley, New York.
31. Nouira, R. and Fouet, J.M. (1996). A Knowledge Based Tool for the Incremental Construction, Validation and Refinement of Large Knowledge Bases. Preliminary Proceedings of the Workshop on Validation, Verification and Refinement of KBS (ECAI-96).
32. Ohsie, D., Da Silva, S., and Stolfo, S. (1994). Performance of Incremental Update in Database Rule Processing.
33. Shen, W.M. Bayesian Probability Theory — A General Method for Machine Learning. MCC-Carnot-101-93, Microelectronics and Computer Technology Corporation, Austin, TX. Available at: http://www.isi.edu/∼shen/BayesML.ps
34. Shen, W.M. (1997). An Active and Semi-Incremental Algorithm for Learning Decision Lists. Technical Report USC-ISI-97, Information Sciences Institute, University of Southern California. Available at: http://www.isi.edu/∼shen/activecdl4.ps
35. Utgoff, P.E. (1994). An improved algorithm for incremental induction of decision trees. Machine Learning: Proceedings of the Eleventh International Conference, pp. 318–325.
36. Utgoff, P.E. (1995). Decision tree induction based on efficient tree restructuring. Technical Report 95-18, Department of Computer Science, University of Massachusetts. Available at: ftp://ftp.cs.umass.edu/pub/techrept/techreport/1995/UM-CS-1995-018.ps
37. Utgoff, P.E. and Clouse, J.A. (1996). A Kolmogorov-Smirnoff metric for decision tree induction. Technical Report 96-3, Department of Computer Science, University of Massachusetts. Available at: ftp://ftp.cs.umass.edu/pub/techrept/techreport/1996/UM-CS-1996-003.ps
38. Widmer, G. and Kubat, M. (1996). Learning in the Presence of Concept Drift and Hidden Contexts. Machine Learning, 23(1), 69–101.
39. Yao, Y. (1988). Estimating the number of change points via Schwarz' criterion. Statistics and Probability Letters, 6, 181–189.
40. Zhang, B.T. (1994). An Incremental Learning Algorithm that Optimizes Network Size and Sample Size in One Trial. Proceedings of the International Conference on Neural Networks (ICNN'94), IEEE, pp. 215–220.
CHAPTER 6

CLASSIFICATION AND DETECTION OF ABNORMAL EVENTS IN TIME SERIES OF GRAPHS

H. Bunke
Institute of Computer Science and Applied Mathematics
University of Bern, CH-3012 Bern, Switzerland
Email: bunke@iam.unibe.ch

M. Kraetzl
ISR Division, Defense Science and Technology Organization
Edinburgh SA 5111, Australia
Email: mkraetz@nsa.gov

Graphs are widely used in science and engineering. In a graph representation, the nodes typically model objects, while the edges describe relations between the nodes, for example, spatial, temporal, or conceptual dependencies between the objects that are modelled through the nodes. In this chapter, the problem of detecting abnormal events in time series of graphs is investigated. A number of graph similarity measures are introduced. These measures are useful to quantitatively characterize the degree of change between two graphs in a time series. Based on any of the introduced graph similarity measures, an abnormal change is detected if the similarity between two consecutive graphs in a time series falls below a given threshold. The approach proposed in this chapter is not geared towards any particular application. However, to demonstrate its feasibility, its application to abnormal event detection in the context of computer network monitoring is studied.

Keywords: graph; graph similarity; time series of graphs; abnormal event detection; computer network monitoring.

1. Introduction

Graphs are a powerful and flexible data structure useful for the representation of objects and concepts in many disciplines of science and engineering. In a graph representation, the nodes typically model objects, object parts, or object properties, while the edges describe relations between the nodes. Examples of applications
where graph representations have been successfully used include chemical structure analysis [1], molecular biology [2], software engineering [3], and database retrieval [4]. In artificial intelligence, graphs have been successfully applied to case-based reasoning [5], machine learning [6], planning [7], knowledge representation [8], and data mining [9]. Also in computer vision and pattern recognition graphs are widely used. Particular applications include character recognition [10], graphics recognition [11], shape analysis [12], and object classification [13].

In many artificial intelligence applications, object representations in terms of feature vectors or lists of attribute-value pairs are used [14]. On the one hand, many operations on graphs are computationally much more costly than equivalent operations on feature vectors and attribute-value lists. For example, computing the dissimilarity, or distance, of two objects that are represented through feature vectors is an operation that is linear in the number of features, but computing the edit distance of two graphs is exponential in the number of their nodes [15]. On the other hand, graphs have a much higher representational power than feature vectors and attribute-value lists, as they are able to model not only unary, but higher order relations and dependencies between various entities. In this chapter we focus on a special class of graphs that is characterized by a low computational complexity with regard to a number of operations. The computational efficiency of the graph operations results from constraining the graphs to have unique node labels. Hence this class of graphs offers the high representational power of graphs together with the low computational complexity typically found with feature vector representations.

Data mining is generally concerned with the detection and extraction of meaningful patterns and rules from large amounts of data [16,17]. Classification is considered to be one of the main subfields of data mining [18]. The goal of classification is to assign an unknown object to one out of a given number of classes, or categories. In particular, we are concerned with time series of graphs. In such series, each individual graph represents the state of an object, or a system, at a particular point of time. The task under consideration is the detection of abnormal events, i.e. the detection of abnormal change of the state of the system when going from one point in time to the next. From the general point of view, the task considered in the present paper is a particular instance of classification, where the transitions between the individual elements of a given sequence of graphs are to be classified as normal or abnormal. Obviously, such a classification may serve as the first step of a more complex data mining and knowledge extraction process that is able to infer hitherto hidden relationships between normal and abnormal events in time series.

The procedures introduced in this chapter do not make any particular assumptions about the underlying problem domain. That is, the given sequence of graphs may represent any time series of objects or systems which change their behaviour or properties over time. However, to demonstrate the feasibility of the proposed approach, a special problem in the area of computer network monitoring is considered [19]. In this application, each graph in a sequence represents a computer network at a certain point of time, for example, at a certain time each day. Changes of the network are captured using a graph distance measure. Hence a sequence of graph similarity values is derived from a given time series of graphs. Each similarity value is an indicator of the degree of change that occurred to the network between two consecutive points of time. In this sequence of similarity values, abnormal change can be detected by means of thresholding and similar techniques. In other words, it is assumed that an abnormal change has occurred if the similarity between two consecutive graphs is below a given threshold. There are other applications where complex systems, which change their behaviour or properties over time, are modelled through graphs. Examples of such systems include electrical power grids [20], regulatory networks controlling the mammalian cell cycle [21], and co-authorship and citation networks in science [22]. The formal tools introduced in this chapter can also be used in these applications to detect abnormal change in the system's behaviour.

The remainder of this chapter is organized as follows. In Section 2, our basic terminology is introduced. Graph similarity measures based on graph spectra are presented in Section 3. Another approach to measuring graph similarity, using graph edit distance, is discussed in Section 4. The median of a set, or sequence, of graphs is introduced in Section 5, and its use for the classification of events in a sequence of graphs is discussed in Section 6. Then, in Section 7, the application of the similarity measures introduced in the previous sections to the detection of abnormal change in communication networks is discussed. Finally, some conclusions are drawn in Section 8.

2. Preliminaries

A graph G = (V, E) consists of a finite set of vertices V and a finite set of edges E, which are pairs of vertices; a pair of vertices denotes the endpoints of an edge. Two vertices u, v ∈ V are said to be adjacent if they are endpoints of the same edge [23].
Objects such as vertices or edges (or their combinations) associated with a graph are referred to as elements of the graph. The number of vertices in G = (V, E) is denoted by |V|, and likewise the number of edges is denoted by |E|.

A weight is a function whose domain is a set of graph elements in G. The domain can be restricted to that of edge or vertex elements only, where the function is referred to as edge-weight or vertex-weight respectively. A weight whose domain is all vertices and edges is called total. The set of possible values in the range of the weight function are called attributes. Values of weight assigned to elements in G may be numerical (e.g. the amount of traffic in a computer network as values of edge-weight), or symbolic (e.g. node identifiers as values of vertex-weight). A unique labeling is a one-to-one weight; for a vertex-weight this serves to uniquely identify the vertices of a graph, w_V(u) ≠ w_V(v), ∀u, v ∈ V, u ≠ v. A graph with a unique labeling is called a labeled graph.

The graph G = (V, E, w_V, w_E) is vertex-labeled with weight w_V : V → L_V, assigning vertex identifier attributes L_V to individual vertices, where the value of an assigned vertex-weight is w_V(u) for vertex u. Edges are also weighted with weight w_E : E → R+. The notation w_E^G is used to indicate that edge weight w_E belongs to graph G; the superscript is omitted where no confusion arises, since the graphs in our application possess unique vertex-labelings. Edges in G can be directed. In this case edge (u, v) ∈ E originates at node u ∈ V and terminates at node v ∈ V.

3. Analysis of Graph Spectra

One of the known approaches to the measurement of change in a time series of graphs is through the analysis of graph spectra. Algebraic aspects of spectral graph theory are useful in the analysis of graphs [24–27]. There are several ways of associating matrix spectra (sets of all eigenvalues) with a given weighted graph G. The most obvious way is to investigate the structure of a finite digraph G by analyzing the spectrum of its adjacency matrix A_G. For a given ordering of the set of n vertices V in a graph G = (V, E), one can investigate the spectrum

σ(G) = {λ1, λ2, ..., λn},

where λi are the eigenvalues of the weighted adjacency matrix A_G. Obviously, σ(G) does not depend on the ordering of V. Note that A_G is not a symmetric matrix for directed graphs, because weight may not be a symmetric function. Also, the matrix spectrum does not change under any non-singular transformation. Because of that fact, and because isomorphic graphs are isospectral, one might expect that the classes of isospectral and isomorphic graphs coincide, so that the class of all isospectral graphs (graphs having identical spectra) must be relatively small. However, it can be shown that the class of isospectral graphs is larger [24]. Since the adjacency matrices have non-negative entries, it is easy to verify that, by (arbitrary) assignment of the xi's in an eigenvector x = [x1, x2, ..., xn] of A_G to the n various vertices in V, the components of x can be interpreted as positive vertex-weight values (attributes) of the corresponding vertices in the digraph G.

A different approach to graph spectra for undirected graphs [25,30] investigates the eigenvalues of the Laplace matrix L_G = D_G − A_G, where the degree matrix D_G of graph G is defined as D_G = diag{ Σ_{v∈V} w_E^G(u, v) | u ∈ V_G }. Note that in the unweighted case, the diagonal elements of D_G are simply the vertex degrees of the indexed vertices V. In the case of digraphs, the Laplacian is defined as L_G = D_G − (A_G + A_G^T) (in order to ensure that σ(L_G) ⊆ {0} ∪ R+). It follows that the Laplacian spectrum is always non-negative, and that the number of zero eigenvalues of L_G equals the number of connectivity components of G. This property is extremely useful in finding the connectivity components of G. The second smallest eigenvalue (called the algebraic connectivity of G) provides graph connectivity information and is always smaller than or equal to the vertex connectivity of G [30]. Laplacian spectra of graphs have many applications in graph partitions, isoperimetric problems, semidefinite programming, random walks on graphs, and infinite graphs [25,29], and in various clustering, coarsening, or graph condensing implementations [28].

The relationship between graph spectra and graph distance measures can be established using an eigenvalue interpretation [27]. Consider the weighted graph matching problem for two graphs G = (V_G, E_G) and H = (V_H, E_H), with positive edge-weights w_E^G and w_E^H, where |V_G| = |V_H| = n, which is defined as finding a one-to-one function Φ : V_G → V_H such that the graph distance function

dist(Φ) = Σ_{u,v∈V_G} |w_E^G(u, v) − w_E^H(Φ(u), Φ(v))|²    (3.1)

is minimal. This is equivalent to finding a permutation matrix P that minimises J(P) = ||P A_G P^T − A_H||², where A_G and A_H are the (weighted) adjacency matrices of the two graphs, and ||·|| is the ordinary Euclidean matrix norm. Notice that this graph matching problem is computationally expensive, because it is a combinatorial optimisation problem.
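To make the combinatorial cost concrete, here is a deliberately naive brute-force solver for the matching problem of Eq. (3.1); the edge-weight dictionaries and vertex names are our illustrative assumptions:

```python
# Illustrative brute-force solver for the weighted graph matching problem
# of Eq. (3.1): over all one-to-one maps Phi, minimise the squared
# difference of edge weights. Exponential in |V|, which is exactly why
# spectral approximations are attractive.

from itertools import permutations

def matching_distance(w_g, w_h, vertices_g, vertices_h):
    """w_g, w_h map directed (u, v) pairs to edge weights (0.0 if absent)."""
    best = float("inf")
    for image in permutations(vertices_h):
        phi = dict(zip(vertices_g, image))
        cost = sum(
            (w_g.get((u, v), 0.0) - w_h.get((phi[u], phi[v]), 0.0)) ** 2
            for u in vertices_g for v in vertices_g)
        best = min(best, cost)
    return best

w_g = {("a", "b"): 1.0, ("b", "c"): 2.0}
w_h = {("x", "y"): 1.0, ("y", "z"): 2.0}   # the same graph, relabelled
print(matching_distance(w_g, w_h, ["a", "b", "c"], ["x", "y", "z"]))  # 0.0
```

A distance of zero is reached exactly when some relabelling makes the two weighted graphs coincide; any structural or weight difference yields a strictly positive value.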
In the case of two graphs G and H with the same vertex set V, and (different) edge-weights w_E^G and w_E^H, the (Umeyama) graph distance is expressed as:

dist(G, H) = Σ_{u,v∈V} |w_E^G(u, v) − w_E^H(u, v)|²    (3.2)

In spite of the computational difficulty of finding a permutation matrix P which minimises J(P), one can use the spectra (eigendecompositions) of the adjacency matrices A_G and A_H. In fact, for Hermitian matrices A, B ∈ M_nn(R) with corresponding spectra σ(A) = {α1 > α2 > ··· > αn} and σ(B) = {β1 > β2 > ··· > βn}, and with eigendecompositions A = U_A D_A U_A^T and B = U_B D_B U_B^T (the U_(·) are unitary, and the D_(·) are diagonal matrices):

Σ_{i=1}^{n} (αi − βi)² = min_Q ||Q A Q^T − B||²    (3.3)

where the minimum is taken over all unitary matrices Q. Equation (3.3) justifies the use of the adjacency matrix spectra in the restricted case of orthogonal (permutation) matrices. Thus, given two weighted graphs G and H with respective spectra σ(A_G) = {λ1, λ2, ..., λ_n1} and σ(A_H) = {μ1, μ2, ..., μ_n2}, the k largest positive eigenvalues are incorporated into the graph distance measure [26] as:

dist(G, H) = [ Σ_{i=1}^{k} (λi − μi)² / min( Σ_{j=1}^{k} λj², Σ_{j=1}^{k} μj² ) ]^{1/2}    (3.4)

where k is an arbitrary summation limit, but empirical studies in pattern recognition and image analysis show that k ≈ 20 is a good choice. Notice that similar approaches to distance measures can be applied in the case of Laplacian spectra.

4. Graph Edit Distance

In this section we introduce a graph distance measure, termed graph edit distance (ged), that is based on graph edit operations. This distance measure evaluates the sequence of edit operations required to modify an input graph such that it becomes isomorphic to some reference graph. This can include the possible insertion and deletion of edges and vertices, in addition to weight value substitutions [31]. Generally, ged algorithms assign costs to each of the edit operations and use efficient tree search techniques to identify the sequence of edit operations resulting in the lowest total edit cost [15,32]. The resultant lowest total edit cost is a measure of the distance between the two graphs.
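A minimal sketch of the spectral distance of Eq. (3.4), using NumPy's symmetric eigensolver (the tiny two-vertex matrices are invented examples; in practice k ≈ 20, as noted above):

```python
# Hedged sketch of the spectral distance of Eq. (3.4): compare the k
# largest eigenvalues of two weighted adjacency matrices.

import numpy as np

def spectral_distance(A_g, A_h, k):
    lam = np.sort(np.linalg.eigvalsh(A_g))[::-1][:k]   # k largest of G
    mu = np.sort(np.linalg.eigvalsh(A_h))[::-1][:k]    # k largest of H
    num = np.sum((lam - mu) ** 2)
    den = min(np.sum(lam ** 2), np.sum(mu ** 2))
    return np.sqrt(num / den)

A = np.array([[0.0, 1.0], [1.0, 0.0]])   # hypothetical adjacency matrix
print(spectral_distance(A, A, k=1))      # 0.0 for identical graphs
```

The normalisation by the smaller spectral energy keeps the measure comparable across graphs of different sizes and weight scales.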
In general graph matching problems with unlabeled graphs, a unique sequence of edit operations does not exist, due to the occurrence of multiple possible vertex mappings. However, for the class of graphs introduced in Section 2, the combinatorial search reduces to the simple identification of the elements (vertices and edges) inserted or deleted from one graph G to produce the other graph H. The implementation requires linear time in the size of the problem.

In the first graph edit distance measure studied in this paper, we consider the topology only: vertex-weight value substitution is not a valid edit operation, because vertex-weight values are unique, and edge-weight value substitution is not considered. If the cost associated with the insertion or deletion of individual elements is one, the edit sequence cost becomes the difference between the total number of elements in both graphs and all graph elements in common. The graph edit distance d1(G, H), describing the topological change that takes place when going from graph G to H, then becomes:

d1(G, H) = |V_G| + |V_H| − 2|V_G ∩ V_H| + |E_G| + |E_H| − 2|E_G ∩ E_H|    (4.1)

Clearly the edit distance, as a measure of topology change, increases with increasing degree of change. Edit distance d1(G, H) is bounded below by d1(G, H) = 0 when H and G are isomorphic (i.e. there is no change), and above by d1(G, H) = |V_G| + |V_H| + |E_G| + |E_H| when G ∩ H = ∅, the case where the graphs are completely different.

In the second graph edit distance measure studied in this paper, we consider not only change in graph topology, but also change in edge weight. For this purpose, we assign a cost c to each vertex deletion and insertion, where c > 0 is a constant. The cost of changing weight w_E^G(e) on edge e ∈ E_G into w_E^H(e) on e ∈ E_H is defined as |w_E^G(e) − w_E^H(e)|. To simplify our notation, we let our graphs be completely connected (i.e. there is an edge e ∈ E_G between any two vertices in G and an edge e ∈ E_H between any two vertices in H) and assign a weight equal to zero to an edge e ∈ E_G (e ∈ E_H) if this edge does not exist in G (H). Hence substitution of edge weights, edge deletions, and edge insertions can be treated uniformly.
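Because vertex labels are unique, the distance d1 of Eq. (4.1) reduces to plain set arithmetic; a direct sketch (the small example graphs are our own):

```python
# Direct transcription (as a sketch) of the topological distance d1 of
# Eq. (4.1) for graphs with unique vertex labels: set operations replace
# the combinatorial search, so the cost is linear in the graph sizes.

def d1(vertices_g, edges_g, vertices_h, edges_h):
    """d1 = |VG| + |VH| - 2|VG ∩ VH| + |EG| + |EH| - 2|EG ∩ EH|."""
    return (len(vertices_g) + len(vertices_h)
            - 2 * len(vertices_g & vertices_h)
            + len(edges_g) + len(edges_h)
            - 2 * len(edges_g & edges_h))

V_g, E_g = {"a", "b", "c"}, {("a", "b"), ("b", "c")}
V_h, E_h = {"a", "b", "d"}, {("a", "b")}

print(d1(V_g, E_g, V_h, E_h))   # 2 vertices + 1 edge differ -> 3
```

Each vertex or edge present in exactly one of the two graphs contributes one unit of cost, so identical graphs yield zero.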
1) is called the set median of S. d1 (G. Then G is a median graph of the sequence S = (G1 . The term “median graph” is used because the median x of an ordered ¯ sequence of real numbers (x1 . . Note that the deletion of an G G edge e ∈ EG with weight wE (e) has a cost equal to wE (e). wE (e).134
H. . . Gn ) is a single graph that represents the given Gi ’s in the best possible manner. Median Graphs Intuitively speaking. . Bunke and M. for example. e∈EH \(EG ∩EH )
+
e∈EG \(EG ∩EH )
G wE (e)
+
(4. . . H) or d2 (G. the median of a sequence of graphs S is deﬁned as a graph that minimises the sum of the edit distances to all members of sequence S. assigned as its cost. . 5. It is to be noted that median graphs need not be unique. However. in the context of the present paper where we have uniquely labelled graphs. we will exclusively focus on median graph computation and its application to abnormal change detection. Similarly the H insertion of an edge e ∈ EH has the weight of that edge. . . . let U be the family of all graphs that can be constructed using labels from LV for vertices and real numbers for ¯ edges.2)
The constant c is a parameter that allows us to weight the importance of a vertex deletion or insertion relatively to the weight changes on the edges. which is deﬁned as if n is even. xn ). . . xn ) = xn/2 .1)
¯ If we constrain G to be a member of S then the graph that satisﬁes Eq. the graph edit distance under this cost function becomes d2 (G. . . . . (5. . Using any of the available graph distance measures.
(5. . the median of a sequence of graphs S = (G1 . . Gi ). Formally. ¯ else x = median(x1 . xn ) = x ¯
n/2
. H) = c[VG  + VH  − 2VG ∩ VH ] +
e∈EG ∩EH G H wE (e) − wE (e) H wE (e). Gn ) if:
n
¯ G = arg min
G∈U i=1
d(G. In a general context where node labels are not unique a set median is usually easier to compute than a median graph. Kraetzl
and edge insertions can be treated uniformly. then x = median(x1 . H) introduced in Section 4. Consequently.
Next we describe a practical procedure for the computation of the median Ḡ of a sequence of graphs (G1, . . . , Gn) under the topological graph distance measure d1 defined in Eq. (4.1). We consider the union of all vertices and all edges in S. Let G = (V, E) with V = ∪_{i=1}^{n} Vi and E = ∪_{i=1}^{n} Ei. For each vertex u a counter C(u), and for each edge (u, v) a counter C(u, v), is defined; for each occurrence of vertex u (and edge (u, v)) in sequence S, the counter C(u) (and C(u, v)) is incremented by one. That is, C(u) denotes the total number of occurrences of vertex u in V1, . . . , Vn, and it is defined by the following procedure:

C(u) = 0;
for i = 1 to n do
  if u ∈ Vi then C(u) = C(u) + 1

An analogous procedure can be defined for the edges (u, v) ∈ E:

C(u, v) = 0;
for i = 1 to n do
  if (u, v) ∈ Ei then C(u, v) = C(u, v) + 1

Next we define the graph Ḡ = (V̄, Ē) where:

V̄ = {u | u ∈ V ∧ C(u) > n/2},
Ē = {(u, v) | (u, v) ∈ E ∧ C(u, v) > n/2}.

That is, each vertex u and each edge (u, v) occurring in sequence S is included in the median if and only if it occurs in more than n/2 of the given graphs. Then it can be proven that Ḡ is a median of the sequence (G1, . . . , Gn) under the graph distance measure d1 defined in Eq. (4.1) [33]. In contrast to the general method discussed in [34], which is computationally very costly, this procedure is linear in the number of vertices and edges, and in the number of graphs in S. Consequently, it can be expected that the procedure introduced in this paper can handle long sequences of large graphs. Furthermore, it is easy to see that the median graph computed by this procedure is always unique. In [34], however, it was shown that the median of a set of graphs is not necessarily unique in general. Indeed, the conditions C(u) > n/2 and C(u, v) > n/2 can be relaxed by inclusion in Ḡ of a vertex u, or an edge (u, v), with C(u) = n/2 (or C(u, v) = n/2); obviously, any graph resulting from this relaxed procedure is also a median of sequence S. Hence several medians for a given sequence of graphs may exist; they result from either inclusion or noninclusion of a vertex u (or an edge (u, v)) with C(u) = n/2 (or C(u, v) = n/2).

An example of a sequence whose median graph is not unique is shown in Figure 1. Here we have the choice, for both vertex 0 and vertex 3 and their incident edges, of either including them in the median graph or not. This leads to a total of nine possible median graphs, depicted in Figure 2.

Another property worth mentioning is the convergence property of the median under distance d1: for any given sequence S = (G1, . . . , Gn) of graphs, adding its median to S results in a median that is identical to the median of S, i.e. median(G1, . . . , Gn) = median(G1, . . . , Gn, median(G1, . . . , Gn)).

In the remainder of this section we derive a procedure for median graph computation using the generalized graph distance measure d2 defined in Eq. (4.2). We start again from S = (G1, . . . , Gn) and G = (V, E) with V = ∪_{i=1}^{n} Vi and E = ∪_{i=1}^{n} Ei, and let C(u) and C(u, v) be the same as introduced before. Let the graph Ĝ = (V̂, Ê, ŵE) be defined as follows:

V̂ = {u | u ∈ V ∧ C(u) > n/2},
Ê = {(u, v) | u, v ∈ V̂},
ŵE(u, v) = median{wE^Gi(u, v) | i = 1, . . . , n},

where wE^Gi(u, v) = 0 if edge (u, v) does not exist in Gi.
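The counting procedure above can be sketched in a few lines of Python (our own naming; each graph is given as a (vertex_set, edge_set) pair):

```python
from collections import Counter

def median_graph_d1(graphs):
    """Median under d1 for uniquely labelled graphs: a vertex or edge is
    kept iff it occurs in more than n/2 of the n given graphs."""
    n = len(graphs)
    cv, ce = Counter(), Counter()
    for V, E in graphs:
        cv.update(V)
        ce.update(E)
    V_med = {u for u, k in cv.items() if k > n / 2}
    # An edge occurring in more than n/2 graphs implies that both of its
    # endpoints do too, so the endpoints are guaranteed to be in V_med.
    E_med = {e for e, k in ce.items() if k > n / 2}
    return V_med, E_med
```

Replacing the strict test k > n / 2 with k >= n / 2 gives the relaxed variant discussed above, under which several medians may exist.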
Fig. 1. Two graphs (G1, G2).
Then it can be proven that the graph Ĝ is a median of sequence S under d2 [33].

Comparing the median graph construction procedure for d1 with the one for d2, we notice that the former is a special case of the latter, obtained by constraining the edge weights to assume only binary values: edge weight zero (or one) indicates the absence (or presence) of an edge. The convergence property holds also for d2, because for any sequence of edge weights (x1, . . . , xk) one has median(x1, . . . , xk) = median(x1, . . . , xk, median(x1, . . . , xk)).
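A matching sketch for the d2 median (again with our own naming, graphs given as (vertex_set, edge_weight_dict) pairs): vertices are kept by the same majority rule, while every edge between kept vertices receives the median of its n weights, counting a weight of zero wherever the edge is absent.

```python
import statistics

def median_graph_d2(graphs):
    """Median under d2: majority vote on vertices, median edge weights.
    The edge weights of the result are unique, as argued above."""
    n = len(graphs)
    count = {}
    for V, _ in graphs:
        for u in V:
            count[u] = count.get(u, 0) + 1
    V_med = {u for u, k in count.items() if k > n / 2}
    all_edges = set()
    for _, W in graphs:
        all_edges |= set(W)
    W_med = {}
    for (u, v) in all_edges:
        if u in V_med and v in V_med:
            w = statistics.median(W.get((u, v), 0.0) for _, W in graphs)
            if w > 0:  # weight zero encodes "no edge"
                W_med[(u, v)] = w
    return V_med, W_med
```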
Fig. 2. All possible medians of the sequence (G1, G2) from Fig. 1.

Including an edge (u, v) in the median graph Ḡ because it occurs in more than n/2 of the given graphs is equivalent to labeling that edge in Ĝ with the median of the weights assigned to it in the given graphs. The median of a set of numbers according to Eq. (5.2) is unique; hence, when constructing a median graph under graph distance measure d2, there will be no ambiguity in the edge weights. But the nonuniqueness of the existence of vertices observed for graph distance measure d1 still exists under d2.
We conclude this section with an example of a median graph under graph distance d2. Three different graphs G1, G2 and G3 are shown in Figure 3. Their median is unique and is displayed in Figure 4.

Fig. 3. Three graphs (G1, G2, G3).

Fig. 4. The median Ĝ of (G1, G2, G3) in Fig. 3 using d2.

6. Median Graphs and Abnormal Change Detection in Sequences of Graphs

The measures defined in Sections 3 and 4 can be applied to consecutive graphs in a time series of graphs (G1, . . . , Gn) to detect abnormal change. That is, values of d(Gi−1, Gi) are computed for i = 2, . . . , n, and the change from time i − 1 to i is classified abnormal if d(Gi−1, Gi) is larger than a certain threshold. However, it can be argued that measuring network change only between consecutive points of time is potentially vulnerable to noise. For example, the random appearance or disappearance of some vertices, together with some random fluctuation of the edge weights, may lead to a graph distance larger than the chosen threshold, though these changes are not really significant. One expects that a more robust change detection procedure is obtained through the use of median graphs. In statistical signal processing the median
filter is widely used for removing impulsive noise. A median filter is computed by sliding a window of length L over the data values in a time series; at each step the output of this process is the median of the values within the window. This process is also termed the running median [35]. In the following we discuss four different approaches to abnormal change detection that utilise median filters. All these approaches assume that a time series of graphs (G1, . . . , Gn, Gn+1, . . .) is given. The median graph of a subsequence of these graphs can be computed using either graph distance measure d1 or d2.

6.1. Median vs. Single Graph, Adjacent in Time (msa)

Given the time series of graphs, we compute the median graph in a window of length L, where L is a parameter that is to be specified by the user, dependent on the underlying application. Let G̃n be the median of the sequence (Gn−L+1, . . . , Gn). Then d(G̃n, Gn+1) can be used to measure abnormal change: we classify the change between Gn and Gn+1 as abnormal if d(G̃n, Gn+1) is larger than some threshold. Increased robustness can be expected if we take the average deviation ϕ of the graphs (Gn−L+1, . . . , Gn) into account. We compute

ϕ = (1/L) Σ_{i=n−L+1}^{n} d(G̃n, Gi)   (6.1)

and classify the change between Gn and Gn+1 abnormal if

d(G̃n, Gn+1) ≥ αϕ   (6.2)

where α is a parameter that needs to be determined from examples of normal and abnormal change. Note that the median G̃n is, by definition, a graph that minimises ϕ in Eq. (6.1).

In Section 5 it was pointed out that G̃n is not necessarily unique. If several instances G̃n^1, . . . , G̃n^t of G̃n exist, one can apply Eqs. (6.1) and (6.2) to all of them. This will result in a series of values ϕ1, . . . , ϕt and a series of values d(G̃n^1, Gn+1), . . . , d(G̃n^t, Gn+1). Under a conservative scheme, an abnormal change will be reported if d(G̃n^1, Gn+1) ≥ αϕ1 ∧ · · · ∧ d(G̃n^t, Gn+1) ≥ αϕt. By contrast, a more sensitive change detector is obtained if a change is reported as soon as there exists at least one i, 1 ≤ i ≤ t, for which d(G̃n^i, Gn+1) ≥ αϕi.
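The msa scheme of Eqs. (6.1) and (6.2) can be sketched generically in Python (our own naming): dist and median_fn stand in for one of the graph distances (d1 or d2) and a median-graph routine, so the same skeleton can be exercised here with plain numbers.

```python
import statistics

def msa_flags(series, L, alpha, dist, median_fn):
    """Slide a window of length L over the series; flag the transition
    to the next element as abnormal when dist(median, next) >= alpha * phi,
    with phi the average deviation of the window from its median."""
    flags = []
    for n in range(L, len(series)):
        window = series[n - L:n]
        med = median_fn(window)
        phi = sum(dist(med, g) for g in window) / L  # Eq. (6.1)
        flags.append(dist(med, series[n]) >= alpha * phi)  # Eq. (6.2)
    return flags

# Stand-in demo with scalars: only the jump to 9 is flagged.
flags = msa_flags([1, 2, 1, 2, 9, 2, 1], L=4, alpha=2.0,
                  dist=lambda a, b: abs(a - b),
                  median_fn=statistics.median)
```

With graphs, series would hold the Gi, dist would be d1 or d2, and median_fn a routine like the median-graph procedures of Section 5.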
6.2. Median vs. Median Graph, Adjacent in Time (mma)

Here we compute two median graphs, G̃1 and G̃2, in windows of length L1 and L2, respectively, i.e., G̃1 is the median of the sequence (Gn−L1+1, . . . , Gn) and G̃2 is the median of the sequence (Gn+1, . . . , Gn+L2). We now measure the abnormal change between time n and n + 1 by means of d(G̃1, G̃2). That is, we compute ϕ1 and ϕ2 for each of the two windows using Eq. (6.1) and classify the change from Gn to Gn+1 as abnormal if

d(G̃1, G̃2) ≥ α (L1ϕ1 + L2ϕ2)/(L1 + L2).

Measure mma can be expected to be even more robust against noise and outliers than measure msa. If the considered median graphs are not unique, techniques similar to those discussed for measure msa can be applied.

6.3. Median vs. Single Graph, Distant in Time (msd)

If graph changes are evolving rather slowly over time, it may be better not to compare two consecutive graphs, Gn and Gn+1, with each other, but Gn and Gn+l, where l > 1. Instead of msa, as proposed above, we use d(G̃n, Gn+l) as a measure of change between Gn and Gn+l, where l is a parameter defined by the user and is dependent on the underlying application.

6.4. Median vs. Median Graph, Distant in Time (mmd)

This measure is a combination of the measures mma and msd. We use G̃1 as defined for mma, and let G̃2 = median(Gn+l+1, . . . , Gn+l+L2). Then d(G̃1, G̃2) can serve as a measure of change between time n and n + l + 1. Obviously, Eqs. (6.1) and (6.2) can be adapted to msd and mmd similarly to the way they are adapted to mma.

7. Application to Computer Network Monitoring

7.1. Problem Description

In managing large enterprise data networks, the ability to measure network changes in order to detect abnormal trends is an important performance monitoring function [36]. The early detection of abnormal network events and trends can provide advance warning of possible fault conditions
[37], or at least assist with identifying the causes and locations of known problems. Network performance monitoring typically uses statistical techniques to analyse variations in traﬃc distribution [38,39], or changes in topology [40]. Visualisation techniques are also widely used to monitor changes in network performance [41]. To complement these approaches, speciﬁc measures of change at the network level in both logical connectivity and traﬃc variations are useful in highlighting when and where abnormal events may occur in the network [42]. Using these measures, other network management tools may then be focused on problem regions of the network for more detailed analysis. In the previous sections, various graph similarity measures are introduced. The aim of the study described in the present section is to identify whether using these techniques, signiﬁcant changes in logical connectivity or traﬃc distributions can be observed between large groups of users communicating over a wide area data network. This data network interconnects some 120,000 users around Australia. For the purposes of this study, a network management probe was attached to a physical link on the wide area network backbone and collected traﬃc statistics of all data traﬃc operating over the link. From this information a logical network of users communicating over the physical link is constructed. Communications between user groups (business domains) within the logical network over any one day is represented as a directed graph. Edge direction indicates the direction of traﬃc transmitted between two adjacent nodes (business domains) in the network, with edgeweight indicating the amount of traﬃc carried. A subsequent graph can then describe communications within the same network for the following day. This second graph can then be compared with the original graph, using a measure of graph distance between the two graphs, to indicate the degree of change occurring in the logical network. 
The more dissimilar the graphs, the greater the graph distance value. By continuing network observations over subsequent days, the graph distance scores provide a trend of the logical network’s relative dynamic behaviour as it evolves over time. In the study described in this section, log ﬁles were collected continuously over the period 9 July 1999 to 24 December 1999. Weekends, public holidays and days where probe data was unavailable were removed to produce a ﬁnal set of 102 log ﬁles representing the successive business days’ traﬃc. The graph distance measures examined in this paper produce a distance score indicating the dissimilarity between two given graphs.
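As an illustration of this setup (an entirely hypothetical example of our own; the domain names and traffic figures are invented), one day's traffic records can be aggregated into a weighted digraph and consecutive days compared with a d2-style distance; the d2 sketch is repeated here so the snippet is self-contained:

```python
def day_graph(records):
    """Build one day's logical network from (src, dst, traffic) records:
    vertices are business domains, edge weights are summed traffic."""
    vertices, weights = set(), {}
    for src, dst, traffic in records:
        vertices.update((src, dst))
        weights[(src, dst)] = weights.get((src, dst), 0) + traffic
    return vertices, weights

def d2(g, h, c=1.0):
    """d2-style distance between two (vertices, weights) day graphs."""
    (Vg, Wg), (Vh, Wh) = g, h
    dist = c * (len(Vg) + len(Vh) - 2 * len(Vg & Vh))
    for e in set(Wg) | set(Wh):
        dist += abs(Wg.get(e, 0) - Wh.get(e, 0))
    return dist

day1 = day_graph([("sales", "hr", 10), ("hr", "it", 5)])
day2 = day_graph([("sales", "hr", 12), ("hr", "it", 5), ("it", "ops", 2)])
score = d2(day1, day2)  # the larger the score, the more the network changed
```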
Successive graphs derived from the 102 log ﬁles of the network data set are compared using the various graph distance measures to produce a set of distance scores representing the change experienced in the network from one day to the next.
7.2. Experimental Results Figures 5 and 6 represent the outcomes of connectivity and traﬃc (weighted graph) spectral distances, respectively, as introduced in Section 3, applied to the time series of graphs derived from the network data. In spite of the less intuitive interpretation of these graph distance measures, there exists reasonable correlation with the peaks of other distance measures and far less sensitivity to daily variations using this approach (see below). Figure 7 shows results for edit distance applied to consecutive graphs of the time series for the topology only measure d1 . This measure produces three signiﬁcant peaks (on days 25, 65 and 90). The ﬁgure also shows several secondary peaks that may also indicate events of potential interest. Also, there is signiﬁcant minor ﬂuctuation throughout the whole data set
Fig. 5. Spectral distance (connectivity).
Fig. 6. Spectral distance (traffic).
Fig. 7. Consecutive day using measure d1.
appearing as background noise. The findings for this technique indicate that the application of measures to consecutive graphs in a time series is most suited to detecting daily fluctuations in network behaviour.

Fig. 8. Consecutive day using measure d2.

Figure 8 shows the results for distance measure d2 being applied to consecutive graphs in the time series. The main difference with this measure is that it has increased the amplitude of the first two peaks so that they now appear as major peaks. This suggests that these peaks were the result of network change consisting of a large change in edge weights. Whilst this measure considers both topology and traffic, it does not appear to provide any indicators of change additional to those found using the topology only measure.

Ideally, it would be desirable that each point of significant change be validated against known performance anomalies. Unfortunately, because all data used in the experiments were collected from a network under real operating conditions, the information generated by the network management system of the network under observation is not sufficient for validation purposes. That is, the data are unlabelled, as there is no 'superinstance' that would know whether a network change
that has occurred is normal or abnormal. This shortfall of useful information for the validation of network fault and performance anomalies is common [43]. Visualisation tools have been used to assist with validating some of the major changes by observing the network before and after a significant change event. More work needs to be done in producing a reliable, validated data set.

A number of additional experiments using median graphs have been reported in [44]. In these experiments it was demonstrated that using the median graph, computed over a running window of the considered time series, rather than individual graphs has a smoothing effect, which reduces the influence of outliers. This is especially useful for identifying outliers in a time series of graphs.

8. Conclusion

Graphs are one of the most popular data structures in computer science and related fields. This chapter has examined a special class of graphs that are characterised by the existence of unique node labels. This property greatly reduces the computational complexity of many graph operations and allows a user to deal with large sets of graphs, each consisting of many nodes and edges. A number of graph distance measures have been analysed; they are based on eigenvalues from spectral graph theory and on graph edit distance.

Abnormal events in a sequence of graphs representing elements of a time dependent process can be detected by computing the distance of consecutive pairs of graphs in the sequence. If the distance is larger than a threshold, then it can be concluded that an abnormal event has occurred. However, to make the abnormal change detection procedure more robust against noise and outliers, one can compute the median of the time series of graphs over a window of a given span and compare it to an individual graph following that window. A novel concept that can be used to measure the similarity of graphs is the median graph. The median of a set, or a sequence, S of graphs is a graph that minimises the average edit distance to all members of S; thus the median can be regarded as the best single representative of a given set or sequence of graphs.

None of the formal concepts introduced in this paper is geared towards a particular application. However, to demonstrate their usefulness in practice, an application in computer network monitoring was studied in this chapter. Communication transactions, collected by a network management
In R. Plaza. J. Academic Press. Applications of Graph Theory. Baxter. Knowledge Acquisition Via Incremental Conceptual Clustering. Leake and E. Wilson and L. Advances in Casebased Reasoning. S. B. B¨rner.J. Pattern Recognition. Kettler. Introduction to Graph Grammars with Applications to Semantic Networks. D. In D. 557–572. 8.J. 177–221.. 58–75. Computers and Mathematics with Applications. Rodgers. C. network operators are able to more readily identify the individual user groups contributing to this change in aggregated traﬃc. eds.. Faltings.G. A Graph–Rewriting Visual Language for Database Programming. 6. 23. 1266 of Lecture Notes in Computer Science. pp.146
H. K. 34. 641–674. 179–186.. 7. A. Vision Interface 2000. Shearer. As a practical implication. and Coulon. H. Rouvray. 245–254. Protein Structure Determination: Combinnig Inexact Graph Matching and Deformable Templates. 8. K. 5. Bunke and M. Using these data in a number of experiments. and Balaban. and Glasgow. In J. 2. pp..W. and King. E. The graph distance measures introduced in this chapter are used to assess the changes in communication between user groups over successive time intervals. (1997). D. E. Kraetzl
system between logical nodes occurring over periodic time intervals are represented as a series of weighted graphs. The graph distance measures introduced in this chapter are used to assess the changes in communication between user groups over successive time intervals. Using these data in a number of experiments, the usefulness of the proposed concepts could be established. As a practical implication, the proposed methods have the potential to relieve human network operators from the need to continually monitor a network, if connectivity and traffic patterns can be shown to be similar to the activity over previous time periods. Alternatively, if traffic volumes are seen to abruptly increase or decrease over the physical link, network operators are able to more readily identify the individual user groups contributing to this change in aggregated traffic. Future work will concentrate on enhancing the present tools for computer network monitoring and identifying other application areas for the proposed formal concepts.
References
1. CaseBase Reasoning Research and Development. J. (1997). K. if connectivity and traﬃc patterns can be shown to be similar to the activity over previous time periods.W. Proc. In I. eds. Structural Simio larity and Adaption.. Readings in Machine Learning. P. Springer. pp. Shavlik and T. if traﬃc volumes are seen to abruptly increase or decrease over the physical link. (1992). Springer. (1979).
. K.H. and Hendler. H. and Venkatesh.H. Fisher. 4. pp.H.J. Alternatively. the proposed methods have the potential to relive human network operators from the need to continually monitor a network.
B.J. Measuring Abnormal Change in Large Data Networks. H.S. S.M. Hierarchical Attributed Graph Representation and Recognition of Handwritten Chinese Characters. H. Quantitative Measures of Change Based on Feature Organization: Eigenvalues and Eigenvectors. and Dickinson. Nat. Sanniti di Baja.... Cvetkovic. 268–276. pp. and Uthurusamy. (2001). 18. A Uniﬁed Framework for Indexing and Matching Hierarchical Shape Structures. 67–84. 98. and Bunke. In C. Visual Form 2001. 866–883.. 10. Exploring Complex Networks. and Boyer. B. 493–504. Sarkar. B. Data Mining and Computational Intelligence. (1996). An Eigendecomposition Approach to Weighted Graph Matching Problems. Proc. and Bunke. A New Algorithm for ErrorTolerant Subgraph Isomorphism Detection. Pattern Anal. 575–586. Machine Learning. and Villanueva. P.. 617–632. 9. (2001). Pattern Recognition. Pattern Anal.J. 10. 40... 10(5). Cordella. S.. Fayyad.E. New Jersey. The Structure of Scientiﬁc Collaboration Networks. Physica Verlag.
. S. AAAI/MIT Press. Cell.S. (1996). Arcelli. M. and Holder.T. J. Y. IEEE Trans. 8. 27. D.. (2001). Visual Form 2001. Newman. PiatetskySharpiro. IEEE Trans. Bunke. Introduction to Graph Theory. K. Prentice Hall. 24. Shokonfandeh. (1998). 20. H. IEEE Trans. 23. Marti. Kandel. LNCS 2059. Knowledge and Data Engineering.. pp. 19. 13.Classiﬁcation and Detection of Abnormal Events in Time Series of Graphs
147
9. On Matching Algorithms for the Recogniton of Objects in Cluttered Background. Eds. 11.L.. Mc GrawHill. J. E. H. 110–136. Data & Knowledge Engineering.. D. Chen. Sanniti di Baja. Data Mining: An Overview from a Database Perspective. 51–66. Ren. and Yu. 15. Symbol Recognition by ErrorTolerant Subgraph Matching Between Region Adjacency Graphs. Academic Press. L. Kittler. Springer Verlag. A. T. 20. and Suen. (1980). (1996). A. 22. IEEE Trans. S.W.B. Llados. 24. New York. 17. Laplace Eigenvalues of Graphs — A Survey. 71. and Ahmadyfard.M. Cordella. 171–183. Strogaz. Molecular Interaction Map of the Mammalian Cell Cycle Control and DNA Repair System. West. In C. J.. U. Kuhn.. (2001). Last. 26. Nature. 1144–1151.Y. M. Springer Verlag. IEEE Trans. (1999). P. LNCS 2059.. and Wallis. Submitted for publications. pp. M. Spectra of Graphs. Djoko. and G. 14. 695–703. Shoubridge. (2001). Cook. 404–409. Computer Vision and Image Understanding. eds. (1988). (1997). Umeyama.W. Lu. (2001). Mohar. D. Messmer. J. Mitchell. (1991). and Sachs.& Machine Intell. 109. 23. K. Arcelli. 2703–2734. Discrete Math. P. S.. and G. Acad. Han. 25.J.. R. & Machine Intell.. A. An Empirical Study of Domain Knowledge and its Beneﬁts to Substructure Discovery.D.. G. M. S. 21. L. Doob.H. Advances in Knowledge Discovery and Data Mining. L. C. Kraetzl. 16. Science USA. W. (1997).. (1998). Biol. Mol. 12. Smyth. Pattern Analysis and Machine Intelligence. M. (1992).
L. A. In IEEE Global Telecommunications Conference. 30. K. Cybernetics and Informatics. P.. 4. IDC ’99 conference. (1995). CBMS. and Wang. C. J. Pattern Anal. 169–203. B. Matrix Anal. Application of Median Graphs in Detection of Anomalous Change in Time Series of Networks. 34. Becker. (2001). H. Submitted for publication. (1998). Sydney. Bunke.E. 29. IEEE Communications Magazine. (1997). (1973). IEEE Trans. Orlando. M. White.. Astola.. Math.. 701–719. A Distance Measure Between Attributed Relational Graphs for Pattern Recognition. On the Quality of Spectral Separators. In IEEE Information.. Decision and Control. Czech. Australia. (1997). Guemhioui. Boutaba. M. 12(5). (2001). 31. Hood. (1983). 37. pp. 11(1).. and Morrow.. C. 44. 572– 574. IEEE Trans. and Kraetzl. 5. Bunke. International J. D.A. Chung. Appl. (1989)... IEEE Network. S. Recent Advances in Graph Matching. X. Proactive Anomaly Detection Using Distributed Intelligent Agents. Visualizing Network Data.G. J.S. IEEE Transactions on Systems Management and Cybernetics. 13(3). Dadej. 1. F. Algebraic Connectivity of Graphs. 1(1)..N. and Ji. A Close Look at Traﬃc Measurements from Packet Networks. M. M.L. C. 42. 38. Higginbottom. 40.A. Adelaide. 39. On median Graphs: Properties. (1998). A. (1995). Dickinson. An Outlook on Intranet Management. R. E. 36.C.R. G.. 33. 43. and Wilks.L. J. Kraetzl.. (1998). Shoubridge. In Proc. Visualization and Computer Graphics. (1997). Dickinson. (1997). Sanfeliu. H. T. 487–491..T.
. Jiang. On Computation of the Running Median. 557–562. and Ray. Bunke. and Kraetzl. and Signal Processing. P. pp. Spectral Graph Theory. and Dini. and Messmer. Jerkins. and Machine Intell. Fiedler. Guattery. (1999) Detection of Abnormal Change in Dynamic Networks. M. 35.A. 23(10). P. J. Kraetzl
28.R. and Bunke. Regional Conference Series in Mathematics. and Ji. 397–413. SIAM J. R. H. Performance Evaluation of Communication Networks. 1144– 1151. and Campbell. American Mathematical Society. 298–305. Artech House. Sykes. Speech. Beyond Thresholds: An Alternative Method for Extracting Information from Network Measurements. 194–197. IEEE Trans. World Multiconference on Systemics. 353–362. Acoustics. 37(4). Algorithms.. S. In IEEE GLOBECOM 98. Dadej.. Bunke and M. An Analytical Approach to the Dynamic Topology Problem. and Fu. 21–27. US.S. J.G. 23(98).148
H. (1998). Massachusetts. A. Thottan. A. 19(3). Application of Median Graphs in Detection of Abnomalous Change in Communication Networks. GLOBECOM ’97. H. 41. K.K. and Applications. Pattern Recognition and Artiﬁcial Intelligence..T. Eick. G. and Miller. 92–99. 3. C. 2405–2411. 32. pp. Telecommunication Systems. P. 16–21..
CHAPTER 7

BOOSTING INTERVAL-BASED LITERALS: VARIABLE LENGTH AND EARLY CLASSIFICATION∗

Carlos J. Alonso González
Grupo de Sistemas Inteligentes, Dpto. de Informática, Universidad de Valladolid, Spain
Email: calonso@infor.uva.es

Juan J. Rodríguez Diez
Lenguajes y Sistemas Informáticos, Universidad de Burgos, Spain
Email: jjrodriguez@ubu.es

This work presents a system for supervised time series classification. The induced classifiers consist of a linear combination of literals, obtained by boosting base classifiers that contain only one literal. These literals are specifically designed for the task at hand, and they test properties of fragments of the time series on temporal intervals. The method had already been developed for fixed length time series; this work exploits the symbolic nature of the classifier to add two new features. First, the system has been slightly modified so that it is now able to learn directly from variable length time series. Second, the classifier can be used to identify partial time series, making it capable of learning from series of different lengths and able to provide a classification when only part of a series is presented to the classifier. This "early classification" is essential in some tasks, like on-line supervision or diagnosis, where it is necessary to give an alarm signal as soon as possible. Several experiments on different data sets are presented, which illustrate that the proposed method is highly competitive with previous approaches in terms of classification accuracy.

Keywords: Interval based literals; boosting; machine learning; time series classification.

∗This work has been supported by the Spanish CYCIT project TAP 99-0344 and the "Junta de Castilla y León" project VA101/01.
C. J. Alonso González and J. J. Rodríguez Diez

1. Introduction

Multivariate time series classification is useful in those classification tasks where time is an important dimension. Instances of this kind of task may be found in very different domains, for example analysis of biomedical signals [Kubat et al. (1998)], diagnosis of continuous dynamic systems [Alonso González and Rodríguez Diez (1999)] or data mining in temporal databases [Berndt and Clifford (1996)].

Time series classification may be addressed like a static classification problem, extracting features of the series through some kind of preprocessing and using some conventional machine learning method. However, this approach has several drawbacks [Kadous (1999)]: the preprocessing techniques are usually ad hoc and domain specific, and the descriptions obtained using these features can be hard to understand. The design of specific machine learning methods for the induction of time series classifiers allows for the construction of more comprehensible classifiers in a more efficient way because, firstly, they may manage comprehensible temporal concepts difficult to capture by the preprocessing technique (for instance, the concept of permanence in a region for a certain amount of time) and, secondly, there are several heuristics applicable to temporal domains that preprocessing methods fail to exploit.

The method for learning time series classifiers that we propose in this work is based on literals over temporal intervals (such as increases or always in region) and boosting (a method for the generation of ensembles of classifiers from base or weak classifiers) [Schapire (1999)], and was first introduced in [Rodríguez et al. (2000)]. The input for this learning task consists of a set of examples and associated class labels, where each example consists of one or more time series. Although the series are often referred to as variables, since they vary over time, from a machine learning point of view each point of each series is an attribute of the example. The output of the learning task is a weighted combination of literals, reflecting the fact that the base classifiers consist of clauses with only one literal. These base classifiers, sometimes named stumps, are inspired by the good results of works using ensembles of very simple classifiers [Schapire (1999)].

Although the method has already been tested over several data sets [Rodríguez et al. (2001)], providing very accurate classifiers, it imposes two restrictions that limit its application to real problems. On the one hand, it requires that all the time series be of the same length, which is not the case in every task, as the Auslan example (Australian sign language, see Section 6.4) [Kadous (1999)] illustrates. Regarding variable length time series, it is always possible to preprocess the data set in order to obtain a new one with fixed length series. Nevertheless, this cannot be considered a generic solution because, usually, the length of the series itself provides essential information for its classification. On the other hand, it requires as an input the completed time series to be classified. This may be a severe drawback if the classifiers are to be used on line in a dynamic environment and a classification is needed as soon as possible. For instance, consider a supervision task of a dynamic system. In this case, the example to classify is the current state, considering the historical values of the variables, and it is necessary to indicate a possible problem as soon as possible; in order to confirm the problem, it will be necessary to wait and see how the variables evolve. Hence, it is necessary to obtain classifications using as input series of different lengths.

This work shows how the original method can be extended to cope with these weaknesses. The method can now be used with variable length series because the literals are allowed to abstain (that is, the result of their evaluation can be true, false or an abstention) if the series is not long enough to evaluate some literal. A variant of boosting that can work with base classifiers with abstentions is used. Regarding the classification of partial examples, or early classification, the capability of the literals to abstain allows tackling this problem as well: the system has been modified to add the capacity of assigning a preliminary classification to an incomplete example. This feature is crucial when the examples to classify are being generated dynamically and the time necessary to generate an example is long.

The rest of the chapter is organized as follows. Section 2 is a brief introduction to boosting. The base classifiers, interval literals, suited to our method, are described in Section 3. Sections 4 and 5 deal, respectively, with variable length series and early classification. Section 6 presents experimental results. Finally, we give some concluding remarks in Section 7.

2. Boosting

At present, an active research topic is the use of ensembles of classifiers. They are obtained by generating and combining base classifiers, constructed using other machine learning methods. The target of these ensembles is to increase the accuracy with respect to the base classifiers. One of the most popular methods for creating ensembles is boosting [Schapire (1999)], a family of methods, of which AdaBoost is the most
2. we give some concluding remarks in Section 7. constructed using other machine learning methods. respectively. This may be a severe drawback if the classiﬁers are to be use on line in a dynamic environment and a classiﬁcation is needed as soon as possible. A variant of boosting that can work with base classiﬁers with abstentions is used. with variable length series and early classiﬁcation. This work shows how the original method can be extended to cope with these weaknesses. Boosting At present. are described in Section 3. this cannot be considered as a generic solution. In order to conﬁrm the problem. or early classiﬁcation. it will be necessary to wait and see how the variables evolve. Regarding the classiﬁcation of partial examples. consider a supervision task of a dynamic system. Section 6 presents experimental results. They are obtained by generating and combining base classiﬁers. For this problem. Sections 4 and 5 deal. the capability of the literals to abstain allows tackling this problem. false or an abstention) if the series is not long enough to evaluate some literal. the result of their evaluation can be true.
For y. Afterwards. A real value. l ∈ Y. but there are several methods of extending AdaBoost to the multiclass case. l) = 1/(mk) For t = 1. which belongs to a ﬁnite label space Y . In each iteration a base (also named weak) classiﬁer is constructed. l) = Dt (i. then Given (x1 . according to the distribution of weights. (xm . J. Rodr´ a ıguez Diez
prominent member. the weights are readjusted.
Two open questions are how to select αt and how to train the weak learner. y1 ). the version presented here does not include this case. They work by assigning a weight to each example. l) exp(−αt yi [l] ht (xi . if l = y. Initially. the weight of each example is readjusted. AdaBoost. AdaBoost is only for binary problems.152
C.
1 Although
AdaBoost. .MH associates a weight to each combination of examples and labels. Alonso Gonz´lez and J.
AdaBoost. based on the correctness of the class assigned to the example by the base classiﬁer. ym ) where xi ∈ X.MH1 [Schapire and Singer (1998)]. . is selected. l))/Zt where Zt is a normalization factor (chosen so that Dt+1 will be a distribution) Output the ﬁnal hypothesis
Fig. the weight of the base classiﬁer. .MH [Schapire and Singer (1998)]. If the base classiﬁer returns a value in {−1. all the examples have the same weight. . The base learner generates a base hypothesis ht .MH also considers multilabel problems (an example can be simultaneously of several classes). . . T : • • • • Train weak learner using distribution Dt Get weak hypothesis ht : X × Y → R Choose αt ∈ R Update Dt+1 (i. y[l] is deﬁned as y[l] = +1 −1 if l = y. according to the weights. . yi ∈ Y Initialize D1 (i. Each instance xi belongs to a domain X and has an associated label yi .
. The ﬁnal result is obtained by weighted votes of the base classiﬁers. J. +1}. For two class problems. αt . The ﬁrst question is addressed in [Schapire and Singer (1998)]. Then. 1. . Figure 1 shows AdaBoost.
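As a concrete illustration of this scheme, the binary special case with literal stumps can be sketched as follows. This is a minimal illustrative sketch, not the implementation used in the chapter, and the names (`boost`, `predict`) are ours: each literal is a function returning +1 or −1, the literal with the smallest weighted error is chosen in each iteration, α is set from the correctly and incorrectly classified weight, and the example weights are readjusted and renormalized.

```python
import math

def boost(examples, labels, literals, T=10, eps=1e-10):
    """Binary AdaBoost with literal stumps.

    `literals` is a pool of functions f(x) -> +1 or -1,
    `labels` are +1/-1 class labels."""
    m = len(examples)
    d = [1.0 / m] * m                       # initially all examples weigh the same
    ensemble = []                           # list of (alpha, literal) pairs
    for _ in range(T):
        def weighted_error(f):
            return sum(w for w, x, y in zip(d, examples, labels) if f(x) != y)
        best = min(literals, key=weighted_error)   # train under distribution d
        err = weighted_error(best)
        # alpha = 1/2 ln(W+ / W-): W+ correctly classified weight, W- the rest
        alpha = 0.5 * math.log((1.0 - err + eps) / (err + eps))
        ensemble.append((alpha, best))
        # readjust the weights: misclassified examples gain weight
        d = [w * math.exp(-alpha * y * best(x))
             for w, x, y in zip(d, examples, labels)]
        z = sum(d)
        d = [w / z for w in d]              # renormalize to a distribution
    return ensemble

def predict(ensemble, x):
    """Final hypothesis: weighted vote of the base classifiers."""
    return 1 if sum(a * f(x) for a, f in ensemble) >= 0 else -1
```

For instance, with one-dimensional examples and threshold literals of the form x ≤ t, a few iterations suffice to separate two classes split by a threshold.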
The first question is addressed in [Schapire and Singer (1998)]. If the base classifier returns a value in {−1, +1}, the best choice is

  α = (1/2) ln(W+ / W−),

where W+ and W− are, respectively, the sums of the weights of the examples well and badly classified.

Regarding how to train the base learner, in principle it must be a multiclass learner, since ht is defined over X × Y, while we want to use binary learners (only one literal), so hB(x) is binary. Then, in each iteration we train the weak learner using a binary problem: one class (selected randomly) against the others. Since this is a binary problem, the selection of the values can be done as indicated in [Schapire and Singer (1998)]. Moreover, we do not generate a unique αt, but an αtl for each class, for each iteration. They are selected considering how good the weak learner is at discriminating between the class l and the rest of the classes. Then we can define ht(x, l) = αtl hB(x) and αt = 1, and we use AdaBoost.MH.

3. Interval Based Literals

Figure 2 shows a classification of the predicates used to describe the series. This section only introduces the predicates; [Rodríguez et al. (2001)] gives a more detailed description, including how to select them efficiently.

  Predicates
    Point based: point_le
    Interval based
      Relative: increases, decreases, stays
      Region based: always, sometimes, true_percentage

Fig. 2. Classification of the predicates.

Point based predicates use only one point of the series:

• point_le(Example, Variable, Point, Threshold). It is true if, for the Example, the value of the Variable at the Point is less than or equal to Threshold.

Note that a learner that only uses this predicate is equivalent to an attribute-value learning algorithm. This predicate is introduced to test the results obtained with boosting without using interval based predicates.

Two kinds of interval predicates are used: relative and region based. Relative predicates consider the differences between the values in the interval, while region based predicates are based on the presence of the values of a variable in a region during an interval.
3.1. Relative Predicates

A natural way of describing series is to indicate when they increase, decrease or stay. These predicates deal with these concepts:

• increases(Example, Variable, Beginning, End, Value). It is true, for the Example, if the difference between the values of the Variable for End and Beginning is greater than or equal to Value.
• decreases(Example, Variable, Beginning, End, Value). It is defined analogously.
• stays(Example, Variable, Beginning, End, Value). It is true, for the Example, if the range of values of the Variable in the interval is less than or equal to Value.

The parameter Value is necessary for indicating the amount of change. For increases and decreases we consider what happens only in the extremes of the interval: a strict definition, i.e., one requiring that the relation holds for all the points in the interval, is not useful, because series are frequently noisy. It is possible to filter the series prior to the learning process, but we believe that a system for time series classification must not rely on the assumption that the data is clean. For the predicate stays it is not useful either to use a strict definition. In this case all the points in the interval are considered, and the parameter Value is used to indicate the maximum allowed difference between the values in the interval.

3.2. Region Based Predicates

The selection and definition of these predicates is based on the ones used in a visual rule language for dynamic systems [Alonso González and Rodríguez Diez (1999)]. Once it is decided to work with temporal intervals, the use and definition of the predicates always and sometimes is natural, due to the fact that they are the extension of the conjunction and disjunction to intervals. These predicates are:

• always(Example, Variable, Region, Beginning, End). It is true, for the Example, if the Variable is always in this Region in the interval between Beginning and End.
• sometimes(Example, Variable, Region, Beginning, End). It is defined analogously.
• true_percentage(Example, Variable, Region, Beginning, End, Percentage). It is true, for the Example, if the percentage of the time between Beginning and End where the Variable is in the Region is greater than or equal to Percentage.
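These predicate definitions can be rendered directly in code. The following sketch assumes 0-based point indices, inclusive interval ends, and regions given as closed intervals of values; these conventions are our assumptions, since the chapter leaves such details to [Rodríguez et al. (2001)].

```python
def point_le(series, point, threshold):
    """point_le: the value at a single point is <= threshold."""
    return series[point] <= threshold

def increases(series, beginning, end, value):
    """Only the extremes of the interval are compared."""
    return series[end] - series[beginning] >= value

def decreases(series, beginning, end, value):
    return series[beginning] - series[end] >= value

def stays(series, beginning, end, value):
    """The range of values in the interval is at most `value`."""
    window = series[beginning:end + 1]
    return max(window) - min(window) <= value

def in_region(v, region):
    low, high = region            # a region is an interval of values
    return low <= v <= high

def always(series, region, beginning, end):
    return all(in_region(v, region) for v in series[beginning:end + 1])

def sometimes(series, region, beginning, end):
    return any(in_region(v, region) for v in series[beginning:end + 1])

def true_percentage(series, region, beginning, end, percentage):
    """Relaxed `always`: the fraction of points inside the region
    must reach `percentage` (given in percent)."""
    window = series[beginning:end + 1]
    inside = sum(1 for v in window if in_region(v, region))
    return 100.0 * inside / len(window) >= percentage
```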
Since always appears too demanding and sometimes too flexible, a third predicate, true_percentage, has been introduced. It is a "relaxed always" (or a "restricted sometimes"), and its additional parameter indicates the degree of flexibility (or restriction).

3.3. Regions

The regions that appear in the previous predicates are intervals in the domain of values of the variable. In some cases the definitions of these regions can be obtained from an expert, as background knowledge. Otherwise, they can be obtained with a discretization preprocess, which obtains r disjoint, consecutive, fixed and not overlapped intervals. The regions considered are these r intervals (equality tests) and others formed by the union of the intervals 1 ... i (less than or equal tests). The reasons for fixing the regions before the classifier induction, instead of obtaining them while inducing, are efficiency and comprehensibility: the literals are easier to understand if the regions are few.

3.4. Classifier Example

Table 1 shows a classifier. It was obtained from the data set CBF (Section 6.1), and it is composed of 10 base classifiers. The first column shows the literal; for each class in the data set there is another column, so a weight is associated to every combination of literal and class.

Table 1. Classifier example. Each row is a literal (point based, relative or region based) with a weight for each of the three classes Cylinder, Bell and Funnel.

In order to classify a new example, a weight is calculated for each class, and then the example is assigned to the class with the greatest weight. Initially, the weight of each class is 0. For each base classifier, its literal is evaluated. If it is true, the weight of each class is updated adding the weight associated to the literal for that class. If the literal is false, then the weight associated to the literal for that class is subtracted from the weight of the class.

For instance, the first literal in the table, a true_percentage literal, has a clearly positive weight for the second class and negative, similar weights for the other two classes. This means that if the literal is true (false), it is likely that the class is (not) the second one; the literal is not useful for discriminating between the first and the third class.
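The classification procedure just described, starting every class at weight 0 and adding or subtracting the literal weights, can be sketched as follows. The names are illustrative, and a literal is represented here simply as a boolean function of the example:

```python
def classify(base_classifiers, class_names, example):
    """base_classifiers: list of (literal, {class: weight}) pairs,
    where a literal is a boolean function of the example."""
    totals = {c: 0.0 for c in class_names}        # every class starts at 0
    for literal, weights in base_classifiers:
        sign = 1.0 if literal(example) else -1.0  # add if true, subtract if false
        for c in class_names:
            totals[c] += sign * weights[c]
    # the example is assigned to the class with the greatest weight
    return max(totals, key=totals.get)
```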
4. Variable Length Series

Due to the method used to select the literals that form the base classifiers, the learning system introduced in the preceding sections requires series of equal length. If the series are of different lengths, there will be literals with intervals that are after the end of the shortest series. There are several methods, more or less complex, that allow normalizing the length of a set of series by preprocessing the data set, and of course the learning system can still be used with preprocessed data sets of uniform length. These methods can be adequate for some domains. However, their use is not a general solution, because they destroy a piece of information that may be useful for the classification: the own length of the series.

Therefore, in order to apply the method to series of different lengths, we have opted for a slight modification of the learning algorithm that allows a literal to inhibit, or abstain, when there are not enough data for its evaluation. Until now, a literal could only be true or false, because all the series had the same length; now the result of the evaluation of a literal can also be an abstention.

This modification also requires some change in the boosting method. With the basic boosting method, for binary problems, the base classifiers return +1 or −1. Nevertheless, there are variants that use confidence-rated predictions [Schapire and Singer (1998)]: the base learner returns a real value, whose sign indicates the classification and whose absolute value indicates the confidence in this prediction. A special case consists in the use of the three values −1, 0 and +1, where the value 0 indicates that the base classifier abstains. The value of α is again selected as α = (1/2) ln(W+/W−).
When learning from series of different lengths, the evaluation of a literal must take into account that a series may end inside, or before, the interval of the literal. A very simple approach would be to abstain whenever there are point values in the interval that are still unknown for the example, but due to the nature of the literals the result will not always be 0: sometimes the available points are enough to decide. If the end of the series is before the beginning of the interval, the result will be 0. If the end of the series is in the interval, the result depends on the predicate:

• For increases and decreases the result will always be 0, because only the extremes of the interval are considered.
• For stays, the result will be −1 if the available points are not in the allowed range. In other case the result will be 0.
• For always, the result will be −1 if there is some available point in the interval that is out of the region. In other case the result will be 0.
• For sometimes, the result will be 1 if there is some available point in the interval that is inside the region. In other case the result will be 0.
• For true_percentage, the result will be 1 if the series already has enough points in the interval inside the region. In other case the result will be 0.

5. Early Classification

The capability of obtaining an initial classification, as soon as the available data give some cue about the possible class they belong to, is a very desirable property if the classifiers are to be used in a dynamic environment where data are generated and analyzed on line. Now the setting is different from the variable length learning problem of the previous section: the classifier has already been obtained, maybe from series of different lengths, and the variations occur in the length of the series to be classified, from partial examples to full series.

When a partial example is presented to the classifier, some of its literals can be evaluated, because their intervals refer to areas that are already available in the example. Certainly, some literals cannot be evaluated, because the intervals that they refer to are still not available for the example: their value is still unknown. Somewhat surprisingly, this early classification ability can be obtained without modifying the learning method, exploiting the same ideas that allow learning from series of different lengths. Given that the classifier consists of a linear combination of literals, the simple omission of the literals with unknown value allows obtaining a classification from the available data: the classification given to a partial example will be the linear combination of those literals that have known results. Moreover, even if the values of some points in the interval of a literal are still unknown, it is sometimes possible to know the result of the literal; this is done in the same way as when evaluating literals for variable length series.
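The abstention rules above can be gathered into a single three-valued evaluation function. The sketch below is our reading of those rules (+1 true, −1 false, 0 abstention), again with the assumed conventions of 0-based indices and inclusive interval ends; it also covers the ordinary evaluation when the interval is fully available.

```python
def in_region(v, region):
    low, high = region
    return low <= v <= high

def eval_literal(name, series, beginning, end,
                 region=None, value=None, percentage=None):
    """Three-valued evaluation of an interval literal on a possibly
    incomplete series: +1 true, -1 false, 0 abstention."""
    n = len(series)                          # points available so far
    if n <= beginning:                       # series ends before the interval
        return 0
    complete = n > end                       # whole interval available?
    window = series[beginning:min(n, end + 1)]
    if name == "increases":
        if not complete:
            return 0                         # both extremes are needed
        return 1 if series[end] - series[beginning] >= value else -1
    if name == "decreases":
        if not complete:
            return 0
        return 1 if series[beginning] - series[end] >= value else -1
    if name == "stays":
        if max(window) - min(window) > value:
            return -1                        # range already exceeded
        return 1 if complete else 0
    if name == "always":
        if any(not in_region(v, region) for v in window):
            return -1                        # a point out of the region decides
        return 1 if complete else 0
    if name == "sometimes":
        if any(in_region(v, region) for v in window):
            return 1                         # a point in the region decides
        return -1 if complete else 0
    if name == "true_percentage":
        inside = sum(1 for v in window if in_region(v, region))
        total = end - beginning + 1          # length of the full interval
        if 100.0 * inside / total >= percentage:
            return 1                         # already enough points inside
        return -1 if complete else 0
    raise ValueError("unknown predicate: " + name)
```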
6. Experimental Validation

The characteristics of the data sets used to evaluate the behavior of the learning systems are summarized in Table 2. Unless a training/test partition is specified, 10 fold stratified cross validation was used, and the results are obtained by averaging 5 experiments. Normally, 100 literals were used. For some data sets this value is more than enough; nevertheless, the results of some data sets can be improved using more literals.

Table 2. Characteristics of the data sets.

  Data set     Classes  Variables  Examples  Training/Test  Length (min-max)
  CBF          3        1          798       10 fold CV     128-128
  CBFVar       3        1          798       10 fold CV     64-128
  Control      6        1          600       10 fold CV     60-60
  ControlVar   6        1          600       10 fold CV     30-60
  Trace        16       4          1600      800/800        67-99
  Auslan       10       8          200       10 fold CV     33-102

In the result tables, the setting named Relative uses the predicates increases, decreases and stays. The setting named Interval also includes the region based literal true_percentage; it does not include always and sometimes, because they can be considered as special cases of true_percentage.

Table 3. Results (error rates) for the different data sets using the different literal settings (Point, Relative, Always/sometimes, True_percentage, Interval) and different numbers of literals.
Table 3 shows the obtained results for the different data sets with different combinations of predicates. An interesting issue of these results is that the error rate achieved using point based literals is always the worst; hence, the usefulness of interval literals is clearly confirmed. The comparison between relative and region based literals indicates that their adequacy depends on the data set, while the results obtained for the setting named Interval are the best or very close to the best. Table 3 also shows, for some data sets, the results obtained using more literals, and it shows clearly that the method does not overfit. A detailed study of this topic for fixed length time series can be found in [Rodríguez et al. (2001)].

Table 4 shows some results for early classification, for 60% and 80% of the series length. The considered lengths are expressed in terms of the percentage of the length of the longest series in the data set.

Table 4. Early classification. Results (error rates) for the different data sets using the different literals, with 60% and 80% of the series length.

The rest of this section contains a detailed discussion for each data set, including its description, the results for boosting point based and interval based literals, and other known results for the data set.

6.1. CBF (Cylinder, Bell and Funnel)

This is an artificial problem, introduced in [Saito (1994)]. The learning task is to distinguish between three classes: cylinder (c), bell (b) or funnel (f). Examples are generated using the following functions:

  c(t) = (6 + η) · χ[a,b](t) + ε(t),
  b(t) = (6 + η) · χ[a,b](t) · (t − a)/(b − a) + ε(t),
  f(t) = (6 + η) · χ[a,b](t) · (b − t)/(b − a) + ε(t),
Fig. 3. Examples of the CBF data set. Two examples of the same class are shown in each graph.
where

  χ[a,b](t) = 0 if t < a ∨ t > b,  and  χ[a,b](t) = 1 if a ≤ t ≤ b,

and η and ε(t) are obtained from a standard normal distribution N(0, 1). a is an integer obtained from a uniform distribution in [16, 32] and b − a is another integer obtained from another uniform distribution in [32, 96]. The examples are generated evaluating those functions for t = 1, 2, ..., 128. Figure 3 shows some examples of this data set.

The obtained results are shown in Figure 4(A). Two kinds of graphs are used. The first one shows the error rate as a function of the number of literals; in each graph, two results are plotted, one using point based literals and another using interval based literals. The second also shows the error rate, but in this case using early classification and the maximum number of considered literals; the error rate is shown as a function of the percentage of the length of the series. For early classification, the results obtained with 60% of the series length are very close to those obtained with the full series. The results are also shown in Table 5.

Fig. 4. Error graphs for the CBF data set: (A) original data set, (B) variable length data set. Each part plots the error against the number of literals and against the series length percentage, for point based and interval based literals.
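Under the stated distributions, a generator for this data set can be sketched as follows. This is illustrative code under the assumptions stated in the text; any sampling detail beyond those (for example, drawing η once per example) is our reading.

```python
import random

def cbf_example(kind, n=128):
    """Generate one CBF series; kind is 'c' (cylinder), 'b' (bell) or 'f' (funnel)."""
    a = random.randint(16, 32)                 # start of the event, uniform in [16, 32]
    b = a + random.randint(32, 96)             # b - a uniform in [32, 96]
    eta = random.gauss(0.0, 1.0)               # per-example amplitude offset, N(0, 1)
    series = []
    for t in range(1, n + 1):
        chi = 1.0 if a <= t <= b else 0.0      # characteristic function of [a, b]
        if kind == 'c':                        # cylinder: flat plateau
            shape = chi
        elif kind == 'b':                      # bell: ramp up over the interval
            shape = chi * (t - a) / (b - a)
        else:                                  # funnel: ramp down over the interval
            shape = chi * (b - t) / (b - a)
        series.append((6.0 + eta) * shape + random.gauss(0.0, 1.0))  # per-point noise
    return series
```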
Table 5. Results for the CBF data set: error rates as a function of the number of literals and of the series length percentage, for point based and interval based literals.

CBFVar, Variable Length Version. All the examples of the CBF data set have the same length. In order to check the method for variable length series, the data set was altered: for each example, a random number of values was deleted from its end. The maximum number of values eliminated is 64, half of the original length. The obtained results for this new data set, named CBFVar, are shown in Figure 4(B) and in Table 6.

The error reported in [Kadous (1999)] for the original data set is 1.9, using event extraction, event clustering and decision trees. Our results using interval based literals are better than this value using 20 or more literals. An interesting issue is that the results obtained with this altered data set are still better than the results reported in [Kadous (1999)] using the original data set. Moreover, our method is able to obtain a better result than 1.9 using early classification with 100 literals, when only 60% of the series length is available.

When using a variable length data set and early classification, the series length percentages in graphs and tables are relative to the length of the longest series in the data set. For instance, because the longest series of this data set have 128 points, 25% corresponds to the first 32 points of each series.

Table 6. Results for the CBFVar data set: error rates as a function of the number of literals and of the series length percentage, for point based and interval based literals.
6.2. Control Charts

In this data set there are six different classes of control charts, synthetically generated by the process in [Alcock and Manolopoulos (1999)]. The data used was obtained from the UCI KDD Archive [Hettich and Bay (1999)]. Each time series is of length n = 60, and it is defined by y(t), with 1 ≤ t ≤ n:

• Normal: y(t) = m + rs
• Cyclic: y(t) = m + rs + a sin(2πt/T)
• Increasing: y(t) = m + rs + gt
• Decreasing: y(t) = m + rs − gt
• Upward: y(t) = m + rs + kx
• Downward: y(t) = m + rs − kx

where m = 30, s = 2 and r is a random number in [−3, 3]; a and T are in [10, 15]; g is in [0.2, 0.5]; x is in [7.5, 20]; and k = 0 before time t3 and 1 after this time, with t3 in [n/3, 2n/3]. Figure 5 shows two examples of each class.

The results are shown in Figure 6(A) and in Table 7. The results are clearly better for interval based literals, with the exception of early classification with very short fragments. In any case, early classification is not useful with so short fragments, because the error rates are too large.

ControlVar, Variable Length Version. The Control data set was also altered to obtain a variable length variant. In this case the resulting series have lengths between 30 and 60. The results are shown in Figure 6(B) and in Table 8.
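A generator following these definitions can be sketched as follows. Whether the noise term r is drawn once per series or once per point is not stated explicitly above; the sketch assumes one draw per point, which is the usual reading of this benchmark, and all function and parameter names are illustrative.

```python
import math
import random

def control_chart(kind, n=60):
    """Generate one synthetic control chart series; kind is one of
    'normal', 'cyclic', 'increasing', 'decreasing', 'upward', 'downward'."""
    m, s = 30.0, 2.0
    a = random.uniform(10, 15)                   # cycle amplitude
    T = random.uniform(10, 15)                   # cycle period
    g = random.uniform(0.2, 0.5)                 # trend gradient
    x = random.uniform(7.5, 20)                  # shift magnitude
    t3 = random.uniform(n / 3.0, 2.0 * n / 3.0)  # shift time: k = 0 before, 1 after
    series = []
    for t in range(1, n + 1):
        r = random.uniform(-3, 3)                # noise term (assumed drawn per point)
        y = m + r * s
        if kind == 'cyclic':
            y += a * math.sin(2.0 * math.pi * t / T)
        elif kind == 'increasing':
            y += g * t
        elif kind == 'decreasing':
            y -= g * t
        elif kind == 'upward':
            y += x if t >= t3 else 0.0
        elif kind == 'downward':
            y -= x if t >= t3 else 0.0
        series.append(y)
    return series
```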
6.3. Trace

This data set is introduced in [Roverso (2000)]. It is proposed as a benchmark for classification systems of temporal patterns in the process industry, and it was generated artificially. There are four variables, and each variable has two possible behaviors; the combination of the behaviors of the variables produces 16 different classes, as shown in Figure 7. The length of the series is between 268 and 394. The running times of the algorithms depend on the length of the series, so the data set was preprocessed, averaging the values in groups of four; the examples in the resulting data set have lengths between 67 and 99. For this data set a training/test partition is specified.

The results are shown in Figure 8 and in Table 9. For this data set, boosting point based literals is not able to learn anything useful: the error is always greater than 70%. Nevertheless, using boosting of interval based literals and more than 80 literals, the error is less than 1%. The result reported in [Roverso (2000)], using recurrent neural networks and wavelets, is an error of 1.4%, but with 4.5% of the examples not assigned to any class. For early classification, the error is under 1% with 80% of the series length.
Fig. 5. Examples of the Control data set. Two examples of the same class are shown in each graph.
Table 7. Results for the Control data set: error rates as a function of the number of literals and of the series length percentage, for point based and interval based literals.

Table 8. Results for the ControlVar data set: error rates as a function of the number of literals and of the series length percentage, for point based and interval based literals.

Fig. 6. Error graphs for the Control data set: (A) original data set, (B) variable length data set.
Fig. 7. Trace data set. In the graphs, two examples of each behavior are shown. Each example is composed of four variables, and each variable has two possible behaviors.
Fig. 8. Error graphs for the Trace data set.

Table 9. Results for the Trace data set: error rates as a function of the number of literals (up to 200) and of the series length percentage, for point based and interval based literals.

6.4. Auslan

Auslan is the Australian sign language, the language of the Australian deaf community. Instances of the signs were collected using an instrumented glove [Kadous (1999)]. Each example is composed of 8 series: x, y and z position, wrist roll, and thumb, fore, middle and ring finger bend.

Fig. 9. Error graphs for the Auslan data set.
J.50
Error/Series length percentage Percentage Points Intervals 10 76. Conclusions This chapter has presented a novel approach to the induction of multivariate time series classiﬁers. because the examples are generated very quickly.50 11.00 3. and in this data set the lengths vary a lot.50
10 classes and 20 examples of each class.5 using 300 ıve interval based literals. These kind of predicates. J.50 80 5.00 1.50 70 5.50 4. From only one literal. because it only makes use of very general techniques.00 1. Alonso Gonz´lez and J.50.00 1.00 82. include many examples that are already completed. Relative predicates consider the evolution of the values in the interval.00 2.00
120 8.00 40 5.00
210 7.50 2.50
90 8. the obtained results for a length of 50%.00 1. First.50
180 8. The second fact is that the used percentage is for the longest series.00 1. event clustering and Na¨ Bayes Classiﬁers.50
270 6.50 30 10.00 13. In this work we have resorted to two kind of interval predicates: relative and region based. The minimum length is 33 and the maximum is 102. while region based predicates consider the occurrence of the values of a variable in a region during an interval. speciﬁcally design for
.168
C.00
240 6. the base classiﬁers. Error/Number of literals
Literals Points Intervals
Table 10. Results for the Auslan data set.

The results are shown in Figure 9 and in Table 10. The best result reported in [Kadous (1999)], using event extraction, is an error of 2.50. Our result is 1.50. With respect to the results for early classification, it is necessary to consider that for this problem it is not practical to use early classification.

7. Conclusions

This work summarizes a supervised learning method for time series classification that works by boosting very simple base classifiers: interval-based literals. Boosting creates classifiers consisting of a linear combination of literals. The learning method is highly domain independent, since it only employs very generic descriptions of the time series.

An important aspect of the learning method is that it can deal directly with variable-length series. The symbolic nature of the base classifiers facilitates their capacity to abstain: the simple mechanism of allowing that the evaluation of a literal may give as a result an abstention, when there are not enough data to evaluate the literal, makes it possible to learn from series of different lengths. It also requires the use of a boosting method able to work with abstentions in the base classifiers.

Another feature of the method is the ability to classify incomplete examples. This early classification capacity is another consequence of the symbolic nature of the classifier. It is important to notice that this early classification does not influence the learning process; in order to get early classifiers, neither additional classifiers nor more complex classifiers are necessary. This early classification is indispensable for some applications, when the time necessary to generate an example can be rather big, and where it is not an option to wait until the full example is available. Nevertheless, an open issue is whether it would be possible to obtain better results for early classification by modifying the learning process. For instance, literals with later intervals could be somewhat penalized.

An interesting advantage of the method is its simplicity. From a user point of view, the method has only one free parameter, the number of iterations. Moreover, the classifiers created with more iterations include the previously obtained ones; hence, it is possible (i) to select only an initial fragment of the final classifier and (ii) to continue adding literals to a previously obtained classifier. From the programmer point of view the method is also rather simple: the implementation of boosting stumps is one of the easiest among classification methods.

Experiments on different data sets show that in terms of error rate the proposed method is highly competitive with previous approaches. On several data sets, it achieves better results than all previously reported ones we are aware of. Moreover, the experimental results using point-based predicates show that the incorporation of interval predicates can significantly improve the obtained classifiers, especially when using fewer iterations.

The main focus of this work has been classification accuracy. The strength of the method is based on boosting, which allows the assembly of very accurate classifiers at the expense of classifier comprehensibility. There are methods that produce
more comprehensible models than weighted literals, such as decision trees or rules, and literals over intervals can also be used with these models. A first attempt to learn rules of literals is described in [Rodríguez et al. (2000)], where we obtained less accurate but more comprehensible classifiers. Nevertheless, the use of these literals with other learning models requires more research.

Finally, a short remark about the suitability of the method for time series classification. Since the only properties of the series that are considered for their classification are the attributes tested by the interval-based literals, the method will be adequate as long as the series may be discriminated according to what happens in intervals. Hence, although the experimental results are rather good, they cannot be generalized for arbitrary data sets. Nevertheless, this learning framework can be used with other kinds of literals, more suited for the problem at hand. For instance, [Rodríguez and Alonso (2000)] uses literals based on distances between series.

References

1. Alcock, R.J. and Manolopoulos, Y. (1999). Time-Series Similarity Queries Employing a Feature-Based Approach. In Proceedings of the 7th Hellenic Conference on Informatics, Ioannina, Greece.
2. Alonso González, C.J. and Rodríguez Diez, J.J. (1999). A Graphical Rule Language for Continuous Dynamic Systems. In International Conference on Computational Intelligence for Modelling, Control and Automation (CIMCA'99), Vienna, Austria.
3. Berndt, D.J. and Clifford, J. (1996). Finding Patterns in Time Series: A Dynamic Programming Approach. In Fayyad, U., Piatetsky-Shapiro, G., Smyth, P. and Uthurusamy, R., eds., Advances in Knowledge Discovery and Data Mining, AAAI Press/MIT Press, pp. 229–248.
4. Hettich, S. and Bay, S.D. (1999). The UCI KDD Archive. http://kdd.ics.uci.edu/.
5. Kadous, M.W. (1999). Learning Comprehensible Descriptions of Multivariate Time Series. In Bratko, I. and Dzeroski, S., eds., Proceedings of the 16th International Conference on Machine Learning (ICML-99), Morgan Kaufmann.
6. Kubat, M., Koprinska, I. and Pfurtscheller, G. (1998). Learning to Classify Biomedical Signals. In Michalski, R., Bratko, I. and Kubat, M., eds., Machine Learning and Data Mining, John Wiley & Sons, pp. 409–428.
7. Rodríguez, J.J. and Alonso, C.J. (2000). Applying Boosting to Similarity Literals for Time Series Classification. In Proceedings of the 1st International Workshop on Multiple Classifier Systems (MCS 2000), Springer.
8. Rodríguez, J.J., Alonso, C.J. and Boström, H. (2000). Learning First Order Logic Time Series Classifiers: Rules and Boosting. In Proceedings of the 4th European Conference on Principles of Data Mining and Knowledge Discovery (PKDD 2000), Springer.
9. Rodríguez, J.J., Alonso, C.J. and Boström, H. (2001). Boosting Interval Based Literals. Intelligent Data Analysis, 5(3), 245–262.
10. Roverso, D. (2000). Multivariate Temporal Classification by Windowed Wavelet Decomposition and Recurrent Neural Networks. In 3rd ANS International Topical Meeting on Nuclear Plant Instrumentation, Control and Human-Machine Interface.
11. Saito, N. (1994). Local Feature Extraction and its Applications Using a Library of Bases. PhD thesis, Department of Mathematics, Yale University.
12. Schapire, R.E. (1999). A Brief Introduction to Boosting. In 16th International Joint Conference on Artificial Intelligence (IJCAI-99), Morgan Kaufmann, pp. 1401–1406.
13. Schapire, R.E. and Singer, Y. (1998). Improved Boosting Algorithms Using Confidence-Rated Predictions. In 11th Annual Conference on Computational Learning Theory (COLT 1998), ACM, pp. 80–91.
CHAPTER 8

MEDIAN STRINGS: A REVIEW

Xiaoyi Jiang
Department of Electrical Engineering and Computer Science
Technical University of Berlin
Franklinstrasse 28/29, D-10587 Berlin, Germany
Email: jiang@iam.unibe.ch

Horst Bunke
Department of Computer Science, University of Bern
Neubrückstrasse 10, CH-3012 Bern, Switzerland
Email: bunke@iam.unibe.ch

Janos Csirik
Department of Computer Science, Attila Jozsef University
Arpad tér 2, H-6720 Szeged, Hungary

Strings provide a simple and yet powerful representation scheme for sequential data. In particular, time series can be effectively represented by strings. The median concept is useful in various contexts. In this chapter its adaptation to the domain of strings is discussed. We introduce the notion of median string and provide related theoretical results. Then, we give a review of algorithmic procedures for efficiently computing median strings. Some experimental results will be reported to demonstrate the median concept and to compare some of the considered algorithms.

Keywords: String distance, set median string, generalized median string, online handwritten digits.

1. Introduction

Strings provide a simple and yet powerful representation scheme for sequential data. Numerous applications have been found in a broad range of fields including computer vision [2], speech recognition, and molecular biology [13,34].
A large number of operations and algorithms have been proposed to deal with strings [1,13,34,36]. Some of them are inherent to the special nature of strings, such as the shortest common superstring and the longest common substring, while others are adapted from other domains.

The median concept is useful in various contexts. It represents a fundamental quantity in statistics. In sensor fusion, multisensory measurements of some quantity are averaged to produce the best estimate. Averaging the results of several classifiers is used in multiple classifier systems in order to achieve more reliable classifications. In data mining, clustering and machine learning, a typical task is to represent a set of (similar) objects by means of a single prototype. Interesting applications of the median concept have been demonstrated in dealing with 2D shapes [16,33], binary feature maps [23], 3D rotation [9], geometric features (points, lines, or 3D frames) [32], brain models [12], anatomical structures [37], and facial images [31]. In this paper we discuss the adaptation of the median concept to the domain of strings.

The outline of the chapter is as follows. We first formally introduce the median string problem in Section 2 and provide some related theoretical results in Section 3. Sections 4 and 5 are devoted to algorithmic procedures for efficiently computing set median and generalized median strings, respectively. In Section 6 we report experimental results to demonstrate the median concept and to compare some of the considered algorithms. Finally, some discussions conclude this paper.

2. Median String Problem

Assuming an alphabet Σ of symbols, a string x is simply a sequence of symbols from Σ, i.e. x = x1 x2 · · · xn where xi ∈ Σ for i = 1, . . . , n. Given the space U of all strings over Σ, we need a distance function d(p, q) to measure the dissimilarity between two strings p, q ∈ U. Let S be a set of N strings from U. The essential information of S is captured by a string p̄ ∈ U that minimizes the sum of distances of p̄ to all strings from S, also called the consensus error ES(p̄):

p̄ = arg min_{p ∈ U} ES(p),  where  ES(p) = Σ_{q ∈ S} d(p, q).

The string p̄ is called a generalized median of S. If the search is constrained to the given set S, the resultant string

p̂ = arg min_{p ∈ S} ES(p)
is called a set median of S. Note that in a broad sense, the median string concept is related to the center of gravity of a collection of masses. While the latter represents the point where all the weight of the masses can be considered to be concentrated, the median string corresponds to a single representation of strings based on a string distance. This definition was introduced by Kohonen [20]. Martinez-Hinarejos et al. [27] returned to this definition and investigated the possibility of using mean instead of median. Here we would like to search for p̄ that minimizes Σ_{q ∈ S} d²(p̄, q). A different possibility is mentioned by Kohonen [20] too. For a given set S, neither the generalized median nor the set median is necessarily unique. Note that different terminology has been used in the literature. In [13] the set median string and the generalized median string are termed center string and Steiner string, respectively. In [24] the generalized median was called consensus sequence.

Several string distance functions have been proposed in the literature. The most popular one is doubtlessly the Levenshtein edit distance. Given two strings A and B, the Levenshtein edit distance d(A, B) is defined in terms of elementary edit operations which are required to transform A into B. Usually, three different types of edit operations are considered, namely (1) substitution of a symbol a ∈ A by a symbol b ∈ B, a ≠ b, (2) insertion of a symbol a ∈ Σ in B, and (3) deletion of a symbol a ∈ A. Symbolically, we write a → b for a substitution, ε → a for an insertion, and a → ε for a deletion. To model the fact that some distortions may be more likely than others, costs of edit operations, c(a → b), c(ε → a), and c(a → ε), are introduced. Let A = a1 a2 · · · an and B = b1 b2 · · · bm be two words over Σ, and let s = l1 l2 · · · lk be a sequence of edit operations transforming A into B. We define the cost of this sequence by c(s) = Σ_{i=1}^{k} c(li). The Levenshtein edit distance is then given by

d(A, B) = min{ c(s) | s is a sequence of edit operations transforming A into B }.

To illustrate the Levenshtein edit distance, let us consider two words A = median and B = mean built on the English alphabet. Examples of
sequences of edit operations transforming A into B are:
• s1 = (d → a, i → n, a → ε, n → ε),
• s2 = (d → a, i → ε, a → ε),
• s3 = (d → ε, i → ε).
Under the edit cost c(a → ε) = c(ε → a) = c(a → b) = 1, a ≠ b, s3 represents the optimal sequence with the minimum total cost 2 for transforming median into mean among all possible transformations. Therefore, we observe d(median, mean) = 2. In [38] an algorithm is proposed to compute the Levenshtein edit distance by means of dynamic programming. Other algorithms are discussed in [13,34,36]. Further string distance functions are known from the literature, for instance, normalized edit distance [28], maximum posterior probability distance [20], feature distance [20], and others [8]. The Levenshtein edit distance is by far the most popular one. Actually, some of the algorithms we discuss later are tightly coupled to this particular distance function.

3. Theoretical Results

In this section we summarize some theoretical results related to median strings. The generalized median is a more general concept and usually a better representation of the given strings than the set median. From a practical point of view, the set median can be regarded as an approximate solution of the generalized median. As such it may serve as the start for an iterative refinement process to find more accurate approximations. Interestingly, we have the following result (see [13] for a proof):

Theorem 1. Assume that the string distance function satisfies the triangle inequality. Then ES(p̂)/ES(p̄) ≤ 2 − 2/|S|.

That is, the set median has a consensus error relative to S that is at most 2 − 2/|S| times the consensus error of the generalized median string.

Independent of the distance function we can always find the set median of N strings by means of N(N − 1)/2 pairwise distance computations. This computational burden can be further reduced by making use of special properties of the distance function (e.g. metric) or by resorting to approximate procedures. Section 4 will present examples of these approaches.
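The dynamic programming scheme of [38] can be sketched as follows. This is a minimal illustration with unit costs; the function and variable names are our own, not taken from the cited work.

```python
def levenshtein(a: str, b: str) -> int:
    """Unit-cost edit distance between strings a and b via dynamic programming."""
    n, m = len(a), len(b)
    prev = list(range(m + 1))          # row 0: distance from the empty prefix of a
    for i in range(1, n + 1):
        curr = [i] + [0] * m           # column 0: deleting i symbols of a
        for j in range(1, m + 1):
            sub = prev[j - 1] + (a[i - 1] != b[j - 1])  # substitution (or match)
            ins = curr[j - 1] + 1                        # insertion
            dele = prev[j] + 1                           # deletion
            curr[j] = min(sub, ins, dele)
        prev = curr
    return prev[m]

print(levenshtein("median", "mean"))   # -> 2, matching the example above
```

With general operation costs, the three `+ 1` terms would be replaced by the corresponding c(·) values; the table structure stays the same.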
Compared to set median strings, the computation of generalized median strings represents a much more demanding task. This is due to the huge search space which is substantially larger than that for determining the set median string. This intuitive understanding of the computational
complexity is supported by the following theoretical results. Under the two conditions that
• every edit operation has cost one, i.e., c(a → b) = c(ε → a) = c(a → ε) = 1, and
• the alphabet is not of fixed size,
it is proved in [15] that computing the generalized median string is NP-hard. Sim and Park [35] proved that the problem is NP-hard for a finite alphabet and for a metric distance matrix. Another result comes from computational biology. The optimal evolutionary tree problem there turns out to be equivalent to the problem of computing generalized median strings if the tree structure is a star (a tree with n + 1 nodes, n of them being leaves). In [39] it is proved that in this particular case the optimal evolutionary tree problem is NP-hard. The distance function used is problem dependent and does not even satisfy the triangle inequality. All these theoretical results indicate the inherent difficulty in finding generalized median strings. Not surprisingly, the algorithms we will discuss in Section 5 are either exponential or approximate.
4. Fast Computation of Set Median Strings

The naive computation of set median requires O(N²) distance computations. Considering the relatively high computational cost of each individual string distance, this straightforward approach may not be appropriate, especially in the case of a large number of strings. The problem of fast set median search can be tackled by making use of properties of metric distance functions or by developing approximate algorithms. Several solutions [19,30] have been suggested for fast set median search in arbitrary spaces. They apply to the domain of strings as well.
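The naive O(N²) computation can be sketched as follows; a minimal illustration in which the distance function is a placeholder (any string distance, e.g. the Levenshtein distance, can be plugged in), and `toy_dist` is an invented example, not from the literature.

```python
from itertools import combinations

def set_median(S, dist):
    """Return the string of S minimizing the sum of distances to the others (naive O(N^2))."""
    d = {}
    for p, q in combinations(S, 2):      # cache all pairwise distances (symmetry assumed)
        d[(p, q)] = d[(q, p)] = dist(p, q)
    consensus = lambda p: sum(d.get((p, q), 0) for q in S if q != p)
    return min(S, key=consensus)

# Toy distance for demonstration: length difference plus positional mismatches.
def toy_dist(p, q):
    return abs(len(p) - len(q)) + sum(c1 != c2 for c1, c2 in zip(p, q))

print(set_median(["mean", "median", "medan", "meaan"], toy_dist))  # prints 'medan'
```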
4.1. Exact Set Median Search in Metric Spaces

In many applications the underlying string distance function is a metric, which satisfies:
(i) d(p, q) ≥ 0 and d(p, q) = 0 if and only if p = q,
(ii) d(p, q) = d(q, p),
(iii) d(p, q) + d(q, r) ≥ d(p, r).
A property of metrics is:

|d(p, r) − d(r, q)| ≤ d(p, q),  ∀ p, q, r ∈ S,  (1)

which can be utilized to reduce the number of string distance computations. The approach proposed in [19] partitions the input set S into subsets Su (used), Se (eliminated), and Sa (alive). The set Sa keeps track of those strings that have not been fully evaluated yet; initially Sa = S. A lower bound g(p) is computed for each string p in Sa, i.e., the consensus error of p satisfies:

ES(p) = Σ_{q ∈ S} d(p, q) ≥ g(p).
Clearly, strings with small g values are potentially better candidates for set median. For this reason the string p with the smallest g(p) value among all strings in Sa is transferred from Sa to Su. Then, the consensus error ES(p) is computed and, if necessary, the current best median candidate p∗ is updated. Then, the lower bound g is computed for all strings that are alive, and those whose g is not smaller than ES(p∗) are moved from Sa to Se. They will not be considered as median candidates any longer. This process is repeated until Sa becomes empty. In each iteration, the consensus error for p with the smallest g value is computed by:

ES(p) = Σ_{q ∈ Su} d(p, q) + Σ_{q ∈ Se ∪ (Sa − {p})} d(p, q).
Using (1), the term d(p, q) in the second summation is estimated by:

d(p, q) ≥ d(p, r) − d(r, q),  ∀ r ∈ Su.

Taking all strings of Su into account, we obtain the lower bound:

ES(p) ≥ Σ_{q ∈ Su} d(p, q) + Σ_{q ∈ Se ∪ (Sa − {p})} max_{r ∈ Su} ( d(p, r) − d(r, q) ) = g(p).  (2)
The critical point here is to see that all the distances in this lower bound involve p and strings from Su, and were therefore already computed before. When strings in Sa are eliminated (moved to Se), their consensus errors need not be computed in the future. This fact results in savings of distance computations. In addition to (2), two other lower bounds within the same algorithmic framework are given in [19]. They differ in the resulting ratio of the number of distance computations to the remaining overhead, with the lower bound (2) requiring the smallest number of distance computations.
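The elimination scheme above can be sketched compactly. This is our own simplified rendering of the idea in [19], not the authors' code; it assumes `dist` is a metric, and the bound computation reuses only distances involving fully evaluated strings.

```python
def set_median_pruned(S, dist):
    """Set median search that prunes candidates via the metric lower bound g(p)."""
    d = {}                                # cache of computed distances
    def D(p, q):
        if (p, q) not in d:
            d[(p, q)] = d[(q, p)] = dist(p, q)
        return d[(p, q)]

    alive = set(S)                        # Sa: not yet fully evaluated
    used = []                             # Su: fully evaluated strings
    g = {p: 0 for p in S}                 # lower bounds on the consensus error
    best, best_err = None, float("inf")

    while alive:
        p = min(alive, key=lambda x: g[x])
        alive.remove(p)
        used.append(p)
        err = sum(D(p, q) for q in S if q != p)   # exact consensus error of p
        if err < best_err:
            best, best_err = p, err
        # Tighten bounds using only already-cached distances (no new computations).
        for q in list(alive):
            g[q] = sum(D(q, r) for r in used) + \
                   sum(max(D(q, r) - D(r, s) for r in used)
                       for s in S if s != q and s not in used)
            if g[q] >= best_err:
                alive.remove(q)           # move to Se: cannot beat the current best
    return best
```

Since the pruned strings always satisfy ES(q) ≥ g(q) ≥ ES(p∗), the result is still an exact set median; only the number of distance evaluations changes.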
Ideally, the distance function is desired to be a metric, in order to match the human intuition of similarity. The triangle inequality excludes the case in which d(p, r) and d(r, q) are both small, but d(p, q) is very large. In practice, however, there may exist distance functions which do not satisfy the triangle inequality. To judge the suitability of these distance functions, the work [6] suggests quasi-metrics with a relaxed triangle inequality. Instead of the strict triangle inequality, the relation:

d(p, r) + d(r, q) ≥ d(p, q) / (1 + ε)

is required now. Here ε is a small nonnegative constant. As long as ε is not very large, the relaxed triangle inequality still retains the human intuition of similarity. Note that the strict triangle inequality is a special case with ε = 0. The fast set median search procedure [19] sketched above easily extends to quasi-metrics. In this case the relationship (1) is replaced by:

d(p, q) ≥ max( d(p, r)/(1 + ε) − d(q, r),  d(q, r)/(1 + ε) − d(p, r) ),  ∀ p, q, r ∈ S,
which can be used in the same manner to establish a lower bound g(p).

4.2. Approximate Set Median Search in Arbitrary Spaces

Another approach to fast set median search makes no assumption on the distance function and therefore covers non-metrics as well. The idea of this approximate algorithm is simple. Instead of computing the sum of distances of each string to all the other strings of S to select the best one, only a subset of S is used to obtain an estimation of the consensus error [30]. The algorithm first calculates such estimations and then calculates the exact consensus errors only for strings that have low estimations. This approximate algorithm proceeds in two steps. First, a random subset Sr of Nr strings is selected from S. For each string p of S, the consensus error ESr(p) relative to Sr is computed and serves as an estimation of the consensus error ES(p). In the second step Nt strings with the lowest consensus error estimations are chosen. The exact consensus error ES(p) is computed for the Nt strings, and the string with the minimum ES(p) is regarded as the (approximate) set median string of S.

5. Computation of Generalized Median Strings

While the set median problem is characterized by selecting one particular member out of a given set of strings, the computation of generalized median
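The two-step procedure of Section 4.2 can be sketched as follows; parameter names Nr and Nt follow the text, everything else is our own illustrative choice.

```python
import random

def approx_set_median(S, dist, Nr=10, Nt=5):
    """Two-step approximate set median: estimate consensus errors on a random
    subset Sr, then evaluate exactly only the Nt most promising strings."""
    Sr = random.sample(S, min(Nr, len(S)))                    # step 1: random subset
    estimate = {p: sum(dist(p, q) for q in Sr) for p in S}    # cheap estimations
    shortlist = sorted(S, key=estimate.get)[:Nt]              # step 2: best candidates
    exact = lambda p: sum(dist(p, q) for q in S)
    return min(shortlist, key=exact)
```

The cost drops from N² to roughly N·Nr + Nt·N distance computations, at the price of possibly missing the true set median when the estimation is poor.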
strings is inherently constructive. The theoretical results from Section 3 about computational complexity indicate the fundamental difficulties we are faced with. In the following we describe various algorithms for computing generalized median strings. Not surprisingly, they are either of exponential complexity or approximate.

5.1. An Exact Algorithm and Its Variants

An algorithm for the exact computation of generalized median strings under the Levenshtein distance is given in [21]. Let ε be the empty string and Σ' = Σ ∪ {ε} the extended alphabet. We define:

δ(r1, r2, . . . , rN) = min_{v ∈ Σ'} [ c(v → r1) + c(v → r2) + · · · + c(v → rN) ].

The operator δ can be interpreted as a voting function, as it determines the best value v at a given stage of computation. Finding an optimal value of v requires an exhaustive search over Σ' in the most general case, but in practice the cost function is often simple, such that a shortcut can be taken and the choice of the optimal v is not costly.

Having defined δ this way, the generalized median string can be computed by means of dynamic programming in an N-dimensional array, similarly to string edit distance computation [38]. For the sake of notational simplicity, we only discuss the case N = 3; a generalization to arbitrary N is straightforward. Assume the three input strings are u1 u2 · · · ul, v1 v2 · · · vm, and w1 w2 · · · wn. A three-dimensional distance table of dimension l × m × n is constructed as follows:

Initialization: d_{0,0,0} = 0.

Iteration (0 ≤ i ≤ l, 0 ≤ j ≤ m, 0 ≤ k ≤ n):

d_{i,j,k} = min of:
  d_{i−1,j,k} + δ(ui, ε, ε)
  d_{i,j−1,k} + δ(ε, vj, ε)
  d_{i,j,k−1} + δ(ε, ε, wk)
  d_{i−1,j−1,k} + δ(ui, vj, ε)
  d_{i−1,j,k−1} + δ(ui, ε, wk)
  d_{i,j−1,k−1} + δ(ε, vj, wk)
  d_{i−1,j−1,k−1} + δ(ui, vj, wk)

End: if (i = l) ∧ (j = m) ∧ (k = n).

The computation requires O(lmn) time and space. The path in the distance table that leads from d_{0,0,0} to d_{l,m,n} defines the generalized median string p̄, with d_{l,m,n} being the consensus error ES(p̄). If the strings of S are
of length O(n), both the time and space complexity amount to O(n^N) in this case. Despite its mathematical elegance, the exact algorithm above is impractical because of the exponential complexity. There have been efforts to shorten the computation time using heuristics or domain-specific knowledge.

Such an approach from [24] assumes that the strings of S are quite similar. Under reasonable constraints on the cost function (c(a → ε) = c(ε → a) = 1 and c(a → b) nonnegative), the generalized median string p̄ satisfies ES(p̄) ≤ k with k being a small number. In this case the optimal dynamic programming path must be close to the main diagonal in the distance table. Therefore only part of the N-dimensional table needs to be considered, and the asymptotic time complexity becomes O(nk^N). While this remains exponential, k is typically much smaller than n, resulting in a substantial speedup compared to the full search of the original algorithm [21]. Details of the restricted search can be found in [24].

More generally, we may also use any domain-specific knowledge to limit the search space. An example is the approach in the context of classifier combination for handwritten sentence recognition [25]. An ensemble of classifiers provides multiple classification results of a scanned text. The input strings from the individual classifiers are associated with additional information of position, i.e. the location of each individual word in a sequence of handwritten words. Obviously, it is very unlikely that a word at the beginning of one sequence corresponds to a word at the end of another sequence. More generally, only words at a similar position in the text image are meaningful candidates for being matched to each other. The work [25] makes use of this observation to exclude a large portion of the full N-dimensional search space from consideration.

5.2. Approximate Algorithms

Because of the NP-hardness of generalized median string computation, efforts have been undertaken to develop approximate approaches which provide suboptimal solutions in reasonable time. In this section we will discuss several algorithms of this class.

5.2.1. Greedy Algorithms

The following algorithm was proposed by Casacuberta and de Antoni [4]. Starting from an empty string, a greedy algorithm constructs an
approximate generalized median string p̄ symbol by symbol. When we are going to generate the lth symbol al (l ≥ 1), the substring a1 · · · al−1 (ε for l = 1) has already been determined. Then, each symbol from Σ is considered as a candidate for al. All the candidates are evaluated and the final decision on al is made by selecting the best candidate. The process is continued until a termination criterion is fulfilled. A general framework of greedy algorithms is given in Figure 1.

  p̄0 = ε
  for l = 1, 2, . . . {
      al = arg min_{a ∈ Σ} ES(p̄l−1 a)
      p̄l = p̄l−1 al
      if (termination criterion fulfilled) break
  }
  choose prefix of p̄l for output

Fig. 1. General framework of greedy algorithms.

There are several possible choices for the termination criterion and the prefix. The greedy algorithm proposed in [4] stops the iterative construction process when ES(p̄l) > ES(p̄l−1). Then, p̄l−1 is regarded as the approximate generalized median string. Alternatively, the author of [22] suggests the termination criterion l = max_{p ∈ S} |p|. The output is then the prefix of p̄l with the smallest consensus error relative to S. In the general framework in Figure 1 nothing is stated about how to select al if there exist multiple symbols from Σ with the same value of ES(p̄l−1 a). Besides a simple random choice, the history of the selection process can be taken into account to make a more reliable decision [22]. For both variants a suitable data structure [4] enables a time complexity of O(n²N|Σ|) for the Levenshtein distance. The space complexity amounts to O(nN|Σ|).
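The framework of Figure 1 can be sketched as follows, here with the stopping rule of [4] (stop once an extension worsens the consensus error). This is a minimal sketch; the edit distance is repeated for self-containment and all names are our own.

```python
def edit_distance(a: str, b: str) -> int:
    """Unit-cost Levenshtein distance (standard dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def greedy_median(S, alphabet, dist):
    """Greedy construction of an approximate generalized median string:
    repeatedly append the symbol minimizing the consensus error; stop as soon
    as an extension makes the consensus error strictly worse."""
    consensus = lambda p: sum(dist(p, q) for q in S)
    p = ""
    best = consensus(p)
    while True:
        a = min(alphabet, key=lambda s: consensus(p + s))
        if consensus(p + a) > best:
            break                      # output the last non-worse prefix
        p += a
        best = consensus(p)
    return p

S = ["median", "mediun", "mexian"]
print(greedy_median(S, sorted(set("".join(S))), edit_distance))
```

As the text notes, this simple termination criterion can stop too early on long strings; the prefix-selection rule of [22] is one way around that.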
p0 = ε. The process is continued until a termination criterion is fulﬁlled. The space complexity amounts to O(nN Σ). They are generalpurpose optimization methods that have been successfully applied to a large number of complex tasks. pl−1 is regarded the approximate generalized median string.
5. 1. Alternatively. Bunke and J. Genetic Search Genetic search techniques are motived by concepts from the theory of biological evolution. Jiang. ¯ if (termination criterion fulﬁlled) break. . The ﬁrst approach is based on a straightforward representation of strings
. } choose preﬁx of pl for output. the substring a1 · · · al−1 (ε for l = 1) has already been determined. the author of [22] suggests the ¯ termination criterion l = maxp∈S p. In [17] two genetic algorithms are proposed to construct generalized median strings. A general framework of greedy algorithms is given in Figure 1. General framework of greedy algorithms. There are several possible choices for the termination criterion and the preﬁx. Then. In the general framework in Figure 1 nothing is stated about how to select al if there exist multiple symbols from Σ with the same value of p ES (¯l−1 a). Besides a simple random choice. H. For both variants a suitable data structure [4] enables a time complexity of O(n2 N Σ) for the Levenshtein distance. The greedy algorithm proposed in [4] stops the iterative construction prop ¯ p cess when ES (¯l ) > ES (¯l−1 ).
implying a substantial speedup. The string “median”. The input strings are used as part of the initial population. insertion or substitution is performed at each position of a chromosome. When using the Levenshtein edit distance. see [17] for details.2. The homogeneity is measured by means of the average and the variance of the ﬁtness values of a population. PerturbationBased Iterative Reﬁnement The set median represents an approximation of the generalized median string. The rest of the initial population is ﬁlled by randomly selecting some input strings and applying the mutation operator on them. The population evolution process terminates if one of the following two criteria is fulﬁlled. The time complexity becomes O(N nmpP ). But no algorithmic details are speciﬁed there.
5. The greedy algorithms and the genetic search techniques also give approximate solutions. Let P denote the population size and p the replacement percentage. For each
. for instance. the ﬁtness function is simply deﬁned by the consensus error of a chromosome string. if the population becomes very homogeneous. Second. An approximate solution p can be further improved ¯ by an iterative process of systematic perturbations. m << n. the process terminates as well.3. the population is considered enough homogeneous to stop the evolution. If their product is smaller than a prespeciﬁed threshold. A concrete algorithm for realizing systematic perturbations is given in [26]. With a mutation probability either deletion. The roulette wheel selection is used to create oﬀspring. This idea was ﬁrst suggested in [20]. For crossover the singlepoint operator is applied. is coded as chromosome m e d i a n
Given a set S of N input strings of length O(n) each. For better computational eﬃciency a second genetic algorithm is designed using a more elaborate chromosome coding scheme of strings. the time complexity of this genetic algorithm amounts to O(N n2 pP ) per generation. The genetic algorithm above requires quite a large number of string distance computations due to the ﬁtness function evaluation in each iteration. The ﬁrst criterion is that the maximal number of generations has been reached.Median Strings: A Review
183
in terms of chromosomes by onedimensional arrays of varying lengths consisting of symbols from Σ.
the following operations are performed: (i) Build perturbations • Substitution: Replace the ith symbol of p by each symbol of Σ in ¯ turn and choose the resulting string x with the smallest consensus error relative to S. the set S t of existing strings is incremented by a new string. p2 . t. y. • Insertion: Insert each symbol of Σ in turn at the ith position of p ¯ and choose the resulting string y with the smallest consensus error relative to S. however. Csirik
• Deletion: Delete the ith symbol of p̄ to generate z.

(ii) Replace p̄ by the one from {p̄, x, z} with the smallest consensus error relative to S.

The process is repeated until there is no more improvement possible. For the Levenshtein distance one global iteration that handles all positions of the initial p̄ needs O(n³N|Σ|) time.

5.3. Dynamic Computation of Generalized Median Strings

In a dynamic context we are faced with the situation of a steady arrival of new data items, represented by strings. At each point of time, the newly arrived string is added to the existing set S^t, resulting in S^{t+1}, and its generalized median string is to be computed. Doubtlessly, a trivial solution consists in applying any of the approaches discussed above to S^{t+1}. By doing this, we compute the generalized median string of S^{t+1} from scratch without utilizing any knowledge about S^t, in particular its generalized median string. All algorithms for computing generalized median strings are of such a static nature and thus not optimal in a dynamic context.

The work [18] proposes a genuinely dynamic approach, in which the update scheme only considers the generalized median string of S^t together with the new data item, but not the individual members of S^t. The inspiration for the algorithm comes from a fundamental fact in real space. Under the distance function d(p_i, p_j) = (p_i − p_j) · (p_i − p_j), i.e. the squared Euclidean distance of p_i and p_j, the generalized median of a given set S^t = {p_1, p_2, ..., p_t} of t points is the well-known mean:

    p̄_t = (1/t) · Σ_{i=1}^{t} p_i

When an additional point p_{t+1} is added to S^t, the resultant new set S^{t+1} = S^t ∪ {p_{t+1}} has the generalized median

    p̄_{t+1} = (1/(t+1)) · Σ_{i=1}^{t+1} p_i = (t/(t+1)) · p̄_t + (1/(t+1)) · p_{t+1}

which is the so-called weighted mean of p̄_t and p_{t+1} satisfying

    d(p̄_{t+1}, p̄_t) = (1/(t+1)) · d(p̄_t, p_{t+1})
    d(p̄_{t+1}, p_{t+1}) = (t/(t+1)) · d(p̄_t, p_{t+1})

Geometrically, p̄_{t+1} is a point on the straight line segment connecting p̄_t and p_{t+1} that has distances 1/(t+1) · d(p̄_t, p_{t+1}) and t/(t+1) · d(p̄_t, p_{t+1}) to p̄_t and p_{t+1}, respectively.

On a heuristic basis the special case in real space can be extended to the domain of strings. Given a set S^t = {p_1, ..., p_t} of t strings and its generalized median p̄_t, the generalized median of a new set S^{t+1} = S^t ∪ {p_{t+1}} is estimated by a weighted mean of p̄_t and p_{t+1}, i.e. by a string p̄_{t+1} such that

    d(p̄_{t+1}, p̄_t) = α · d(p̄_t, p_{t+1})
    d(p̄_{t+1}, p_{t+1}) = (1 − α) · d(p̄_t, p_{t+1})

where α ∈ [0, 1]. The dynamic algorithm uses the method described in [3] for computing the weighted mean of two strings. It is an extension of the Levenshtein distance computation [38]. In real space α takes the value 1/(t+1). For strings, however, we have no possibility to specify α in advance. Remember that our goal is to find p̄_{t+1} that minimizes the consensus error relative to S^{t+1}. Therefore, we resort to a search procedure. To determine the optimal α value a series of α values 0, 1/k, ..., (k−1)/k is probed and the α value that results in the smallest consensus error is chosen.

6. Experimental Evaluation

In this section we report some experimental results to demonstrate the median string concept and to compare some of the computational procedures described above. The used data are online handwritten digits from a subset(1) of the UNIPEN database [14]. An online handwritten digit is a time sequence of 2D points recorded from a person's writing on a special tablet. Each digit is originally given as a sequence d = ((x_1, y_1, t_1), ..., (x_k, y_k, t_k)) of points in the xy-plane, where t_k is the time of recording the kth point during writing a digit. In order to transform such a sequence of points into a string, we first resample the given data points such that the distance between any consecutive pair of points has a constant value ∆. That is, d is transformed into a sequence d̄ = ((x̄_1, ȳ_1), ..., (x̄_l, ȳ_l)) where ||(x̄_{i+1}, ȳ_{i+1}) − (x̄_i, ȳ_i)|| = ∆ for i = 1, ..., l − 1. Then, a string s = a_1 a_2 ··· a_{l−1} is generated from sequence d̄ where a_i is the vector pointing from (x̄_i, ȳ_i) to (x̄_{i+1}, ȳ_{i+1}). In our experiments we fixed ∆ = 7 so that the digit strings have between 40 and 100 points. The costs of the edit operations are defined as follows:

    c(a → ε) = c(ε → a) = ||a|| = ∆,    c(a → b) = ||a − b||

Notice that the minimum cost of a substitution is equal to zero (if and only if a = b), while the maximum cost is 2∆. The latter case occurs if a and b are parallel and have opposite direction.

We conducted experiments using 10 samples of digits 1, 2 and 3 each, and 98 samples of digit 6, see Figure 2 for an illustration (for space reasons only ten samples of digit 6 are shown there). The results of four algorithms are shown in Figure 3: the greedy algorithm [4], the genetic algorithm [17] (GA), the dynamic algorithm [18] (Section 5.3), and the set median. In the dynamic approach, starting from a set consisting of the first sample, one sample is added to the existing set each time. Therefore, totally 9 (97) generalized medians have effectively been computed for digits 1/2/3 (6). For comparison purposes the consensus error E_S and the computation time t in seconds on a SUN Ultra 60 workstation are also listed in Figure 3. Note that the consensus errors of digit 6 are substantially larger than those of the other digits because of the definition of consensus error as the sum, but not the average, of the distances to all input samples.

The best results are achieved by GA, followed by the dynamic approach. Except for digit 1, the greedy algorithm reveals some weakness. Looking at the median for digits 2, 3 and 6 it seems that the iterative process terminates too early, resulting in a string (digit) much shorter than it should be. The reason lies in the simple termination criterion defined in [4]. It works well for the (short) words used there, but obviously encounters difficulties in dealing with longer strings occurring in our study.

At first glance, the dynamic approach needs more computation time than the greedy algorithm. But one has to take into account that the recorded time is the total time of the dynamic process of adding one sample to the existing set each time. Taking the actual computation time for each update step of the whole dynamic process into account, the dynamic algorithm is attractive compared to static ones such as the greedy algorithm. If we use the static greedy algorithm to handle the same dynamic process, then the computation for digit 3, for instance, would require 46.10 seconds in contrast to 22.8 seconds of the dynamic approach. Considering the computational efficiency alone, set median is the best among the compared methods. But the generalized median string is a more precise abstraction of the given set of strings with a smaller consensus error.

(1) Available at ftp://ftp.ics.uci.edu/pub/machine-learning-databases/pendigits/
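The update scheme of the dynamic approach can be sketched in Python. This is a minimal illustrative sketch under simplifying assumptions: unit-cost Levenshtein distance instead of the vector costs used in the experiments, and a weighted mean obtained by applying a prefix of one optimal edit script (in the spirit of [3]); the function names are ours, not those of the original implementation.

```python
def dp_matrix(a, b):
    """Unit-cost Levenshtein DP table for strings a and b."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = i
    for j in range(1, n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(d[i - 1][j] + 1,                            # deletion
                          d[i][j - 1] + 1,                            # insertion
                          d[i - 1][j - 1] + (a[i - 1] != b[j - 1]))   # substitution / match
    return d

def levenshtein(a, b):
    return dp_matrix(a, b)[len(a)][len(b)]

def weighted_mean(a, b, alpha):
    """A string z on an optimal edit path from a to b with
    d(a, z) = round(alpha * d(a, b)) and d(z, b) = d(a, b) - d(a, z)."""
    d = dp_matrix(a, b)
    i, j, ops = len(a), len(b), []
    while i > 0 or j > 0:  # backtrack; ops are collected from right to left
        if i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + (a[i - 1] != b[j - 1]):
            if a[i - 1] != b[j - 1]:
                ops.append(('sub', i - 1, b[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and d[i][j] == d[i - 1][j] + 1:
            ops.append(('del', i - 1, None))
            i -= 1
        else:
            j -= 1
            ops.append(('ins', i, b[j]))
    z = list(a)
    # positions are descending, so applying a prefix never invalidates indices
    for op, pos, ch in ops[:round(alpha * len(ops))]:
        if op == 'sub':
            z[pos] = ch
        elif op == 'del':
            del z[pos]
        else:
            z.insert(pos, ch)
    return ''.join(z)

def consensus_error(p, strings):
    return sum(levenshtein(p, q) for q in strings)

def dynamic_update(median, new_string, strings, k=10):
    """Probe alpha in {0, 1/k, ..., (k-1)/k} and keep the weighted-mean
    candidate with the smallest consensus error on the enlarged set."""
    enlarged = strings + [new_string]
    cands = {weighted_mean(median, new_string, a / k) for a in range(k)}
    return min(cands, key=lambda z: consensus_error(z, enlarged)), enlarged
```

Note that, as in the text, the consensus error is still evaluated against the full enlarged set; only the construction of candidate medians is restricted to the weighted mean of the old median and the new string.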
Fig. 2. Samples of digits 1, 2, 3 and 6.
Fig. 3. Median computation of digits.

7. Discussions and Conclusion

In this paper we have reviewed the concept of median string. Particularly, we have briefly described several procedures for computing median strings. Experimental results were reported to demonstrate the median concept in dealing with a special case of time series, namely online handwritten digits, and to compare some of the discussed algorithms.
The majority of the algorithms described in this paper are based on the Levenshtein edit distance. The algorithms' applicability to an arbitrary string distance function is summarized in Table 1. Note that an extension to an arbitrary string distance function usually means a computational complexity different from that for the Levenshtein edit distance.

In the definition of median string, all the input strings have a uniform weight of one. If necessary, this basic definition can be easily extended to weighted median string to model the situation where each string has an individual importance, confidence, etc. Given the weights w_q, q ∈ S, the weighted generalized median string is simply

    p̄ = arg min_{p ∈ U} Σ_{q ∈ S} w_q · d(p, q)

All the computational procedures discussed before can be modified to handle this extension in a straightforward manner.

The generalized median string represents one way of capturing the essential characteristics of a set of strings. There do exist other possibilities. One example is the so-called center string [15] defined by:

    p* = arg min_{p ∈ U} max_{q ∈ S} d(p, q)

It is important to note that the same term is used in [13] to denote the set median string. Under the two conditions given in Section 3, it is proved in [15] that computing the center string is NP-hard. Another result is given in [7] where the NP-hardness of the center string problem is proved for the special case of a binary alphabet (i.e. Σ = {0, 1}) and the Hamming string distance. The ability of the algorithms to compute the center string is summarized in Table 1.

Table 1. Characteristics of median computation algorithms.

    Algorithm                                         String distance      Extension to         Handling          Handling
                                                      (original paper)     arbitrary distance   weighted median   center median
    Exact algorithm and its variants [21,22]          Levenshtein          No                   Yes               No
    Greedy algorithms [4,25]                          Levenshtein          Yes                  Yes               Yes
    Perturbation-based iterative refinement [20,26]   Levenshtein          Yes                  Yes               Yes
    Genetic algorithm [17]                            Levenshtein          No                   Yes               No
    Dynamic algorithm [18]                            Arbitrary distance   N/A                  Yes               Yes
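Restricted to a finite candidate set (a common practical surrogate for searching all of U), the two definitions above can be written down directly. This is a sketch under that restriction, with unit-cost Levenshtein distance standing in for an arbitrary d; the helper names are ours.

```python
def levenshtein(a, b):
    """Unit-cost Levenshtein distance (stand-in for an arbitrary d)."""
    m, n = len(a), len(b)
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cur[j] = min(prev[j] + 1,                            # deletion
                         cur[j - 1] + 1,                         # insertion
                         prev[j - 1] + (a[i - 1] != b[j - 1]))   # substitution / match
        prev = cur
    return prev[n]

def weighted_median(candidates, strings, weights):
    """argmin over the candidates of sum_q w_q * d(p, q)."""
    return min(candidates,
               key=lambda p: sum(w * levenshtein(p, q)
                                 for w, q in zip(weights, strings)))

def center_string(candidates, strings):
    """argmin over the candidates of max_q d(p, q)."""
    return min(candidates,
               key=lambda p: max(levenshtein(p, q) for q in strings))
```

Taking candidates = strings recovers the (weighted) set median and a set-restricted center string; searching larger candidate universes runs into the NP-hardness results cited above.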
Another issue of general interest is concerned with cyclic strings. Several methods have been proposed to efficiently compute the Levenshtein distance of cyclic strings [10,11,29]. It remains an open problem to determine medians of this kind of strings.

Acknowledgments

The authors would like to thank K. Abegglen for her contribution to parts of the chapter.

References

1. Apostolico, A. and Galil, Z. (eds.) (1997). Pattern Matching Algorithms. Oxford University Press.
2. Bunke, H. (ed.) (1992). Advances in Structural and Syntactic Pattern Recognition. World Scientific.
3. Bunke, H., Jiang, X., Abegglen, K. and Kandel, A. (2002). On the Weighted Mean of a Pair of Strings. Pattern Analysis and Applications, 5(1), 23–30.
4. Casacuberta, F. and de Antonio, M.D. (1997). A Greedy Algorithm for Computing Approximate Median Strings. Proc. of National Symposium on Pattern Recognition and Image Analysis, pp. 193–198.
5. Crochemore, M. and Rytter, W. (1994). Text Algorithms. Oxford University Press.
6. Fagin, R. and Stockmeyer, L. (1998). Relaxing the Triangle Inequality in Pattern Matching. Int. Journal on Computer Vision, 28(3), 219–231.
7. Frances, M. and Litman, A. (1997). On Covering Problems of Codes. Theory of Computing Systems, 30(2), 113–119.
8. Fred, A.L.N. and Leitão, J.M.N. (1998). A Comparative Study of String Dissimilarity Measures in Structural Clustering. Proc. of Int. Conf. on Advances in Pattern Recognition, pp. 385–394.
9. Gramkow, C. (2001). On Averaging Rotations. Int. Journal on Computer Vision, 42(1/2), 7–16.
10. Gregor, J. and Thomason, M.G. (1993). Dynamic Programming Alignment of Sequences Representing Cyclic Patterns. IEEE Trans. on Pattern Analysis and Machine Intelligence, 15(2), 129–135.
11. Gregor, J. and Thomason, M.G. (1996). Efficient Dynamic Programming Alignment of Cyclic Strings by Shift Elimination. Pattern Recognition, 29(7), 1179–1185.
12. Guimond, A., Meunier, J. and Thirion, J.-P. (2000). Average Brain Models: A Convergence Study. Computer Vision and Image Understanding, 77(2), 192–210.
13. Gusfield, D. (1997). Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology. Cambridge University Press.
14. Guyon, I., Schomaker, L., Plamondon, R., Liberman, M. and Janet, S. (1994). UNIPEN Project on On-Line Exchange and Recognizer Benchmarks. Proc. of 12th Int. Conf. on Pattern Recognition, pp. 29–33.
15. de la Higuera, C. and Casacuberta, F. (2000). Topology of Strings: Median String is NP-Complete. Theoretical Computer Science, 230(1/2), 39–48.
16. Jiang, X. and Bunke, H. (2000). Computation of Median Shapes. Proc. of 4th Asian Conf. on Computer Vision, Taipei, pp. 300–305.
17. Jiang, X., Schiffmann, L. and Bunke, H. Genetic Computation of Generalized Median Strings. (Accepted for publication.)
18. Jiang, X., Abegglen, K., Bunke, H. and Csirik, J. Dynamic Computation of Generalized Median Strings. (Submitted for publication.)
19. Juan, A. and Vidal, E. (1998). Fast Median Search in Metric Spaces. In A. Amin and D. Dori (eds.), Advances in Pattern Recognition, Springer-Verlag, pp. 905–912.
20. Kohonen, T. (1985). Median Strings. Pattern Recognition Letters, 3, 309–313.
21. Kruskal, J.B. (1983). An Overview of Sequence Comparison: Time Warps, String Edits, and Macromolecules. SIAM Review, 25(2), 201–237.
22. Kruzslicz, F. (1999). Improved Greedy Algorithm for Computing Approximate Median Strings. Acta Cybernetica, 14, 331–339.
23. Lewis, S., Owens, R. and Baddeley, A. (1999). Averaging Feature Maps. Pattern Recognition, 32(9), 1615–1630.
24. Lopresti, D. and Zhou, J. (1997). Using Consensus Sequence Voting to Correct OCR Errors. Computer Vision and Image Understanding, 67(1), 39–47.
25. Marti, U.-V. and Bunke, H. (2001). Use of Positional Information in Sequence Alignment for Multiple Classifier Combination. In J. Kittler and F. Roli (eds.), Multiple Classifier Systems, Springer-Verlag, pp. 388–398.
26. Martinez-Hinarejos, C.D., Juan, A. and Vidal, E. (2000). Use of Median String for Classification. Proc. of 15th Int. Conf. on Pattern Recognition, Barcelona, Spain, pp. 907–910.
27. Martinez-Hinarejos, C.D., Juan, A. and Vidal, E. (2001). Improving Classification Using Median String and NN Rules. Proc. of National Symposium on Pattern Recognition and Image Analysis.
28. Marzal, A. and Vidal, E. (1993). Computation of Normalized Edit Distance and Applications. IEEE Trans. on Pattern Analysis and Machine Intelligence, 15(9), 926–932.
29. Marzal, A. and Barrachina, S. (2000). Speeding up the Computation of the Edit Distance for Cyclic Strings. Proc. of 15th Int. Conf. on Pattern Recognition, pp. 895–898.
30. Mico, L. and Oncina, J. (2001). An Approximate Median Search Algorithm in Non-Metric Spaces. Pattern Recognition Letters, 22(10), 1145–1151.
31. O'Toole, A.J., Price, T., Vetter, T., Bartlett, J.C. and Blanz, V. (1999). 3D Shape and 2D Surface Textures of Human Faces: The Role of "Averages" in Attractiveness and Age. Image and Vision Computing, 18(1), 9–19.
32. Pennec, X. and Ayache, N. (1998). Uniform Distribution, Distance and Expectation Problems for Geometric Features Processing. Journal of Mathematical Imaging and Vision, 9(1), 49–67.
33. Sanchez, G., Llados, J. and Tombre, K. (2002). A Mean String Algorithm to Compute the Average Among a Set of 2D Shapes. Pattern Recognition Letters, 23(1–3), 203–213.
34. Sankoff, D. and Kruskal, J.B. (eds.) (1983). Time Warps, String Edits, and Macromolecules: The Theory and Practice of String Comparison. Addison-Wesley.
35. Sim, J.S. and Park, K. (2001). The Consensus String Problem for a Metric is NP-Complete. Journal of Discrete Algorithms, 2(1).
36. Stephen, G.A. (1994). String Searching Algorithms. World Scientific.
37. Subramanyan, K. and Dean, D. (2000). A Procedure to Average 3D Anatomical Structures. Medical Image Analysis, 4(4), 317–334.
38. Wagner, R.A. and Fischer, M.J. (1974). The String-to-String Correction Problem. Journal of the ACM, 21(1), 168–173.
39. Wang, L. and Jiang, T. (1994). On the Complexity of Multiple Sequence Alignment. Journal of Computational Biology, 1(4), 337–348.