
Statement of Research:

‘‘Similarity Search and Data Mining in Large Databases of Time Sequences’’


Byoung-Kee Yi

Time sequences of real numbers arise in many application domains such as finance, science/medicine,
and multimedia. Examples include daily stock prices, human electrocardiograms (ECG), and voice clips.
Two important and closely related issues are similarity search and data mining. Unlike traditional
atomic data, the comparison between two time sequences is based on similarity rather than exact
matching. Since the size of these sequences is potentially very large, efficient indexing methods are
needed to make possible fast and interactive queries by similarity. Data mining is a process to uncover
useful, yet previously unknown information from a large collection of raw data. It is perceived as a vital
technology for companies to succeed in today’s ever-competitive global market. Since time sequences
tend to grow indefinitely as new data become available, efficient incremental algorithms are essential for
mining such sequences.
Applications of similarity search and data mining in time sequence databases are numerous. In
finance, for example, a stock market analyst may be interested in such queries as ‘‘Find all stocks whose
prices moved similarly to those of company XYZ over the last two months.’’ Query results can be
used for further analysis to find out the market trends and/or key factors behind certain market events.
An economist may want to keep track of correlations among some currencies in the exchange market. A
sudden deviation in the correlation structure may indicate an important event or a change in economic
policies. In medicine, a doctor watching ECG readings of a patient may want to find similar patterns in
ECG readings of other patients to find the most probable cause of the symptom. In network
management, correlations in traffic between certain network nodes/switches can be used for
resource planning and for early warning of a traffic surge or a node failure.
I investigated efficient methods for similarity search and data mining. The primary focus is on
scalable methods for very large databases of time sequences which reside on disks. For similarity search,
I examined the case when the given similarity measure is defined by the time warping distance which
allows for deformation of signals in time. I further developed efficient indexing techniques that can
support multiple distance functions simultaneously. Our method reuses the same index structure for
arbitrary Lp-norm based distance functions. For data mining, I examined the case of large collections of
co-evolving time sequences which grow indefinitely. In particular, I studied on-line, incremental
algorithms for estimation of missing, delayed, or corrupted values. These algorithms also allow for
quantitative mining and outlier detection. More specifically, my work focuses on the following
problems:

1) Indexing for Time Warping: Time warping is useful for comparing signals deformed in time and
has been used successfully, mainly in the (voice) signal processing area. Earlier work, however, focused on
a small number of templates for which most of the computation is done in main memory. I
investigated the case of large, disk-based databases of time sequences in which efficient indexing
techniques are crucial for the fast retrieval of sequences. The challenges are (1) the time warping
distance violates the triangle inequality which is the most common assumption for correct pruning of
search space, (2) no particular indexable features (e.g., a few DFT coefficients) for time warping are
known, and (3) sequential scanning is even costlier because of its quadratic computation complexity in
the length of sequences. I addressed these issues by proposing two basic techniques. I proposed to use a
method called "FastMap" for feature extraction and to apply it on the square root of the distance to
reduce false dismissals. For fast sequential scanning, I defined a new distance function that is cheaper to
compute and lower-bounds the original distance so that no false dismissals would occur. In experiments,
the combination of the two basic techniques achieved a significant speedup (up to 12 times) over the
naive sequential scanning method, with almost no false dismissals.
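
The second technique (a cheaper distance that lower-bounds the time warping distance, used to filter candidates during a scan) can be illustrated with a short Python sketch. This is a minimal sketch under assumptions, not the method from the work itself: the base cost |x - y|, the function names, and the simple range-based bound are all chosen here for illustration. The bound holds because every element of one sequence must be matched to at least one element of the other.

```python
def dtw_distance(a, b):
    """Time warping distance with |x - y| as base cost; O(len(a) * len(b)) time."""
    inf = float("inf")
    m = len(b)
    # prev[j] is the best cost of aligning the prefix of a seen so far with b[:j].
    prev = [0.0] + [inf] * m
    for x in a:
        curr = [inf] * (m + 1)
        for j in range(1, m + 1):
            cost = abs(x - b[j - 1])
            curr[j] = cost + min(prev[j], curr[j - 1], prev[j - 1])
        prev = curr
    return prev[m]


def one_sided_bound(a, b):
    """Sum of how far each element of a falls outside the range [min(b), max(b)]."""
    lo, hi = min(b), max(b)
    return sum(x - hi for x in a if x > hi) + sum(lo - x for x in a if x < lo)


def lower_bound(a, b):
    """Linear-time value that never exceeds dtw_distance(a, b) (illustrative bound)."""
    return max(one_sided_bound(a, b), one_sided_bound(b, a))


def range_search(query, database, eps):
    """Indices of sequences within eps of query; the bound prunes without false dismissals."""
    hits = []
    for idx, seq in enumerate(database):
        if lower_bound(query, seq) > eps:
            continue  # true distance is at least the bound, so skipping is safe
        if dtw_distance(query, seq) <= eps:
            hits.append(idx)
    return hits
```

Because the bound never overestimates the true distance, any sequence it rejects would also be rejected by the full computation, so the quadratic-time routine runs only on the surviving candidates.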

2) Indexing for Arbitrary Lp Norms: One severe drawback of previous approaches in time sequence
indexing is that they support only a single similarity model and tend to focus on Euclidean distance or its
derivatives. I believe, however, that the choice of similarity model is in the hands of application engineers
and that a DBMS must support a broad class of similarity models efficiently. To that end, I
examined the problem of efficient indexing for arbitrary Lp norms, since they by themselves are the
most popular models used in diverse applications and, also, constitute basic building blocks for more
sophisticated similarity measures. I introduced the concept of "segmented means" as features, and
showed how to use them for fast indexing with no false dismissals. My proof uses a well-known
mathematical result on convex functions. The novelty of the approach is that it can support all Lp norms
simultaneously with a single index structure and, hence, makes it simpler to implement other DBMS
functions such as query optimization. Experimental results showed that our method was up to 10 times
faster than the state-of-the-art method based on the discrete wavelet transform (DWT).
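
To make the idea of segmented means concrete, the following Python sketch computes the features and a distance on them that lower-bounds the Lp distance between the original sequences. It is a sketch under assumptions: the function names are hypothetical, sequences are assumed to split evenly into segments of length seg_len, and the scaling factor seg_len ** (1 / p) follows from Jensen's inequality applied to the convex function |t| ** p, which is assumed here to be the kind of convexity result the proof relies on.

```python
def segmented_means(seq, seg_len):
    """Mean of each consecutive block of seg_len values."""
    if seg_len <= 0 or len(seq) % seg_len != 0:
        raise ValueError("sequence length must be a positive multiple of seg_len")
    return [sum(seq[i:i + seg_len]) / seg_len for i in range(0, len(seq), seg_len)]


def lp_distance(x, y, p):
    """Lp distance between equal-length sequences; p may be float('inf')."""
    diffs = [abs(a - b) for a, b in zip(x, y)]
    if p == float("inf"):
        return max(diffs)
    return sum(d ** p for d in diffs) ** (1.0 / p)


def feature_lower_bound(x, y, seg_len, p):
    """Distance in feature space, scaled so it never exceeds lp_distance(x, y, p)."""
    fx = segmented_means(x, seg_len)
    fy = segmented_means(y, seg_len)
    # Within a segment, sum |x_i - y_i|^p >= seg_len * |mean(x) - mean(y)|^p.
    scale = 1.0 if p == float("inf") else seg_len ** (1.0 / p)
    return scale * lp_distance(fx, fy, p)
```

The same feature vectors serve every p; only the scaling factor and the norm applied at query time change, which is why a single index over the features can answer queries under different Lp norms.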

3) On-line Data Mining: In many applications, the data of interest comprise multiple sequences that
evolve over time. Examples include currency exchange rates, network traffic data, and demographic data
on multiple variables. I developed a fast method to analyze such co-evolving time sequences jointly to
allow (1) estimation/forecasting of missing, delayed, or future values, (2) quantitative data mining,
discovering correlations (with or without lag) among the given sequences, and (3) outlier detection. Our
method, MUSCLES (for MUlti-SequenCe LEast Square), adapts to changing correlations among
sequences over time. It can handle indefinitely long sequences efficiently using an incremental algorithm
and requires only a small amount of storage and fewer I/O operations. To make it scale to a large number
of sequences, I proposed a variation, the Selective MUSCLES method, and an efficient algorithm to
reduce the problem size. Experiments on real datasets showed that MUSCLES outperformed some
popular competitors by up to 10 times in prediction accuracy and discovered interesting correlations.
Moreover, Selective MUSCLES scaled very well to large numbers of sequences, reducing response
time by up to 110 times over MUSCLES and sometimes even improving the prediction quality.
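
An incremental least-squares estimator of the kind described above can be sketched with the textbook recursive least squares (RLS) update. This is a generic illustration, not the MUSCLES implementation: the class name, the forgetting parameter (values below 1 discount older samples so the model can track changing correlations), and the delta initialization are assumptions made for the example.

```python
class RecursiveLeastSquares:
    """Incrementally fits y ~ w . x, using O(v^2) memory and time per sample."""

    def __init__(self, num_vars, forgetting=1.0, delta=1e6):
        self.lam = forgetting          # 1.0 = ordinary least squares, <1 = forget old data
        self.w = [0.0] * num_vars      # current regression coefficients
        # Large initial gain matrix: the first samples dominate the zero prior.
        self.P = [[delta if i == j else 0.0 for j in range(num_vars)]
                  for i in range(num_vars)]

    def predict(self, x):
        """Estimate the target (e.g. a missing or delayed value) from x."""
        return sum(wi * xi for wi, xi in zip(self.w, x))

    def update(self, x, y):
        """Absorb one new sample without revisiting earlier ones; returns the error."""
        Px = [sum(pij * xj for pij, xj in zip(row, x)) for row in self.P]
        denom = self.lam + sum(xi * pxi for xi, pxi in zip(x, Px))
        gain = [pxi / denom for pxi in Px]
        err = y - self.predict(x)
        self.w = [wi + gi * err for wi, gi in zip(self.w, gain)]
        # P stays symmetric, so the row vector x^T P equals Px transposed.
        self.P = [[(pij - gi * pxj) / self.lam for pij, pxj in zip(row, Px)]
                  for row, gi in zip(self.P, gain)]
        return err
```

In a co-evolving setting, x would hold the current and recent values of the other sequences and y the value of the sequence being estimated. A large prediction error returned by update can serve as a simple outlier signal, and the fitted coefficients indicate which sequences the target correlates with.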

In-Progress and Future work: I plan to work further on the issues in similarity search and data mining
as well as other interesting database problems, including, but not limited to:

- Mining time sequences of heterogeneous data types (real, categorical, binary, etc.)
- Mining data on the web (stock quotes available from Yahoo!, etc.)
- Similarity search for multimedia data (video, audio, etc.) by approximate examples
- Visualization of long time sequences (‘Visualize the correlations among all stocks in NYSE.’)
- Classification/clustering of long time sequences (‘Group Internet users by their usage patterns.’)
- Database support for forecasting of large-scale, non-linear, chaotic time sequences
- Approximate estimation of distributions of datasets for mining and query optimization, using
  state-of-the-art statistical and signal-processing techniques (e.g., wavelets, PCA, fractals)

In the long run, my goal is to design and implement fast, scalable algorithms for information discovery
problems, so that average people connected to the Internet can easily find what they are looking for,
through easy-to-use graphical user interfaces.