
Data & Knowledge Engineering 68 (2009) 318–337


Establishing relationships among patterns in stock market data


Dietmar H. Dorr, Anne M. Denton *
Department of Computer Science and Operations Research, North Dakota State University, Fargo, ND, USA

Article info

Article history:
Received 6 September 2007
Received in revised form 1 October 2008
Accepted 6 October 2008
Available online 17 October 2008

Keywords:
Knowledge discovery
Pattern mining
Financial applications
Stock market
Time series data

Abstract

Similarities among subsequences are typically regarded as categorical features of sequential data. We introduce an algorithm for capturing the relationships among similar, contiguous subsequences. Two time series are considered to be similar during a time interval if every contiguous subsequence of a predefined length satisfies the given similarity criterion. Our algorithm identifies patterns based on the similarity among sequences, captures the sequence–subsequence relationships among patterns in the form of a directed acyclic graph (DAG), and determines pattern conglomerates that allow the application of additional meta-analyses and mining algorithms. For example, our pattern conglomerates can be used to analyze time information that is lost in categorical representations. We apply our algorithm to stock market data as well as several other time series data sets and show the richness of our pattern conglomerates through qualitative and quantitative evaluations. An exemplary meta-analysis determines timing patterns representing relations between time series intervals and demonstrates the merit of pattern relationships as an extension of time series pattern mining.

© 2008 Elsevier B.V. All rights reserved.

1. Introduction

Time series data are ubiquitous in fields as diverse as economics, science, and industry; hence, it is not surprising that
there has been a strong interest in applying data mining techniques to time series data. Time series can be very long, and
users are often interested in similarities that extend over a comparatively short time interval, which suggests the use of
sliding-window techniques. An approach that is based on sliding windows starts with all possible fixed length, contiguous
subsequences of the time series under consideration. Note that the term "subsequence" has multiple meanings in the literature. We use subsequence in the sense of a contiguous section of a sequence that is also sometimes called "substring". In
order to address the properties of time series data, special similarity measures have been devised that are defined over var-
iable-length subsequences, as well as making other generalizations [45,7,6,12]. With well-established similarity measures in
place, researchers have pursued pattern mining, clustering and classification tasks, as they are common in data mining.
The richness of temporal data is, however, not captured by modified similarity measures alone. In sequential data, there are strong reasons why it can be beneficial to revise even the concept of pattern mining itself: conventionally, pattern mining is seen as returning isolated, frequent occurrences in the data. Although relationships among patterns have been
extensively used as a basis for pruning through closure properties [1], these set–subset relationships do not normally con-
tribute much to the expressiveness of the result when time series are considered. In comparison to record data, time series
data inherently provides an additional dimension (time) for each data item. The time dimension can be utilized not only for
mining patterns but also for capturing the relationships among patterns. In our interpretation, a revised concept of pattern
mining should include the interrelations among patterns.

* Corresponding author. Tel.: +1 701 231 6748; fax: +1 701 231 8255.
E-mail address: anne.denton@ndsu.edu (A.M. Denton).
URL: http://www.cs.ndsu.nodak.edu/adenton/ (A.M. Denton).


For example, knowing that a group of stock series shares a pattern over a long period of time, while other stock series show a related pattern over a much shorter interval, can provide valuable insights into the price developments of stocks.
The relationships among patterns have important information content by themselves. It is our goal to capture the similarities
among stock market time series such that their sequence–subsequence relationships are preserved. We identify patterns
representing collections of contiguous subsequences that share the same shape for a particular time interval. Patterns are
defined on the basis of contiguous sections of normalized sliding windows that show pairwise similarities among sequences.
The relationships among sliding-window patterns are represented using a directed acyclic graph (DAG) that is constructed
based on the overlap between patterns. Leaf nodes within the DAG denote entire sequences, internal nodes represent pat-
terns, and the sequence–subsequence relationships among patterns are represented by the edges. In a directed graph, an
internal node, in contrast to a leaf node, has at least one directed edge to another node. The information contained within
the DAG, as well as timing information, is represented using a pattern conglomerate notation that constitutes a new level of
abstraction. The pattern conglomerate concept is designed to allow meta-analyses. In the context of this paper, a meta-anal-
ysis is an analysis applied to the results of another analysis, i.e., our pattern conglomerates (result of the first analysis) can be
used as input to another, second analysis (meta-analysis). A pattern conglomerate incorporates the structure of the DAG and
the order of clustered sequences, as well as the extent of the subsequences considered during the execution of our algorithm
(Section 3.3). Panel (a) of Fig. 1 depicts an example of four time series that show a total of three characteristic shapes.
The sliding-window pattern that is signified by ◇ is shared by all four sequences. Sequences A and B show a longer pattern
that extends as far as the section marked with a □. Time series C and D have a different extended pattern comprised of ◇ and △. The
corresponding DAG representation is shown in panel (b) of Fig. 1. Each time series is represented by a leaf node, and all three
patterns are represented as internal nodes. The root node, ◇, connects to the two other internal nodes, which represent the
longer patterns. Note that the DAG is different from similarity-based representations that are common in hierarchical clustering,
where degrees of similarity are used to group sequences. In our case, the length of overlap determines the position in
the DAG, and similarity is defined through a single window-based threshold. Accordingly, the ◇ node is created based on the
overlap between the patterns of A/B (□) and C/D (△) rather than the degree of similarity between the sequences. The third
panel (c) of Fig. 1 depicts the abstraction of the DAG in the form of a pattern conglomerate. The structure of the DAG is represented
using parentheses, and the beginning and ending of regions of similarity between pairs of sequences are indicated by
braces with subscripts.

[Figure 1, panel (a): plot of four time series A–D, value versus time.]

Fig. 1. An example of four time series that are similar to each other over different time intervals (a). The clustering result of the time series is shown in
panel (b) using a DAG and (c) by the corresponding pattern conglomerate. In all three panels, the similarities among the time series are denoted by the
symbols □, ◇, and △.

We demonstrate the usefulness of our pattern conglomerates by determining timing patterns of the form begins earlier,
ends later, and is longer between time series of the same pattern conglomerate. Examples of timing patterns in Fig. 1a are that A
and B begin earlier than C and D. We apply our algorithm to 460 stock market time series of the S&P 500 index as well as to
four additional time series data sets (Section 5.1). The additional data sets serve as a means to highlight the applicability of
our approach to different time series data sets (Section 5.5) and to provide a more comprehensive performance analysis (Sec-
tion 5.7).

1.1. Financial interpretation

The stock price of a company is influenced by a wealth of internal and external factors. An internal factor may be the per-
ceived potential of the company to be successful in the future (e.g., competent management or ability to generate profit), and
an external factor could be the future expectations of a market in which the company operates. There have been several
studies addressing the influence of external factors such as news on stock market behavior [48,26,49,5,30]. We do not re-
strict our analysis by the assumption that there is a single external factor, such as a news report, affecting stock prices. It
is our objective to observe the effects of combinations of external influences that have an impact on the stock prices of
two or more companies. Note that we do not attempt to identify the nature of any factors but rather observe their effects.
We assume that stocks of two companies may show a similar shape when major influences or economic pressures on these
companies are similar. For example, if the future expectations of a particular market (e.g., e-commerce) are very positive (or
negative), then the stock time series of companies that operate in this market are likely to show a very similar shape. The
application of our algorithm to stock time series results in a DAG representation, e.g., Fig. 1b, where contiguous subse-
quences of stocks that exhibit a similar shape for some time interval are grouped together. Based on the above interpretation,
the companies that issue these stocks are under the pressure of similar factors for that particular interval. Our exemplary
meta-analysis focuses on the onset and progression of factors affecting two companies. Temporal relationships of interest
include the observation that the impact of factors on some companies begin earlier, end later, and are longer than others.

1.2. Related work

Traditionally, work on stock market data has focused on predictive modeling [4,11] and study of anomalies [41]. In recent
years, data mining approaches have increasingly gained importance [19,25,40,31,15], despite negative connotations of the
term "data mining", which is sometimes interpreted as being "synonymous with data dredging and fishing" [21]. Predictive
tasks are still in the foreground of data mining [25] and machine learning [43,46,27] technique development. Applications
have been introduced to address the technical challenges of monitoring and mining time-critical financial data in conjunc-
tion with mobile computing devices [23].
The utilization of standard clustering algorithms for grouping fixed length, contiguous subsequences of time series [14]
has been shown to be a challenging problem [24]. Although the observed problems are not insurmountable [16,10], this pa-
per avoids them by only comparing windows at a fixed time point and only considering windows that have matches that are
statistically significant. Relationships between different time series have been studied in [2,50,36,9]. These techniques are
based on the interpretation of a sequence as an ordered list of events and are usually discussed under the term sequential
pattern mining. Sequential pattern mining addresses the identification of frequent, but not necessarily contiguous subse-
quences [2,36].
Analysis of time series is also discussed in the area of stream mining [18]. Typically, the application of mining algorithms
to continuous data streams has real-time constraints and requires one-pass searches or fast responses [28,47]. Accordingly,
stream mining approaches are limited by the available computational resources and the frequency of newly arriving data.
Similar techniques to the ones discussed in this work have been applied to categorical gene sequences [17]. Fundamental
ideas can often be applied both to sequences of categorical values such as gene sequences and to time series data. An exam-
ple is dynamic time warping for time series data which corresponds to the Needleman–Wunsch alignment algorithm for cat-
egorical sequences. Differences in normalization, similarity measures and evaluation of thresholds require substantial new
algorithm development. The focus of Dorr and Denton [17] is on the identification of motifs in biological sequences and it is
shown that the identified motifs are useful for assigning functional annotations to protein sequences. In contrast, this paper
addresses the sequence–subsequence relationships among stock market time series, and the usefulness of abstracting these
relationships to pattern conglomerates is shown through timing patterns. Since time series are based on real numbers (Def-
inition 1) and protein sequences are composed of categorical values, time series data must be processed differently than pro-
tein sequences. Several algorithms have been proposed for discovering motifs in general, and for addressing specific aspects
of the discovery problem in particular. Some algorithms focus on the discovery of motifs with a particular length [8,13], and
others address the problem of identifying motifs satisfying certain composition criteria [51,52]. The maximization of the
number of sequences associated with a motif is the focus of Gouzy et al. [20] as well as Sonnhammer and Kahn [42], and
the discovery of motifs that cannot be extended without reducing the number of supporting sequences is addressed by
the algorithms TEIRESIAS [39] and Gemoda [22].
This paper explicitly focuses on the identification of relationships among patterns and their abstraction to pattern con-
glomerates. Pattern conglomerates are derived through a clustering-like algorithm that establishes a DAG representing
the relationships among patterns. Algorithms have been proposed focusing on decomposition of clusterings [34], or utilizing

a DAG for representing relationships among entire time series [38]. In contrast to Queen et al. [38], it is our objective to rep-
resent the relationships among subsequences of time series. Villafane et al. [44] focus on containment relationships among
time series subsequences and use a DAG for representing these relationships. Our approach also captures the containment
relationships among time series subsequences and additionally considers the overlap between subsequences that do not sat-
isfy the containment criterion. Our timing patterns (begins earlier, ends later, and is longer) are a subset of Allen’s as well as
Freksa's relations between intervals [31,33]. We address the identification and abstraction of relationships among patterns
and show the usefulness of our pattern conglomerates by deriving timing patterns.
In Section 2, fundamental definitions related to time series are introduced. Our approach is discussed in Section 3, and Section 4
provides an exemplary meta-analysis that utilizes our pattern conglomerates. The experimental evaluation in Section 5 pro-
vides several examples (Section 5.2), results for particular sectors (Section 5.3), identified timing patterns (Section 5.4) and a
discussion addressing additional data sets (Section 5.5), as well as a significance (Section 5.6) and performance analysis (Sec-
tion 5.7).

2. Background

2.1. Time series

Often sequences are formed by collecting attributes of the same type at different points in time. If those data are real-valued, which is the case for stock prices, we commonly talk of time series data. For this paper we will limit our discussion to real-valued time series.

Definition 1. A time series T = t_1, . . ., t_n is a sequence of real numbers, corresponding to values of an observed quantity, collected at regular time intervals.

Definition 2. A subsequence of a time series T = t_1, . . ., t_n with length w′ is a contiguous sequence T′ = t_u, . . ., t_{u+w′−1} with 1 ≤ u ≤ n − w′ + 1.
The process of extracting contiguous subsequences of a fixed length w′ by incrementing u in steps of one is called application of a sliding window. A subsequence of length w′ represents the atomic unit for which a similarity to other subsequences may be observed and sequence–subsequence relationships determined. We use differences between successive time points to define n − w′ + 1 vectors ṽ in a w = w′ − 1 dimensional vector space. The rationale for this choice is that noise in time series data often has the shape of a random walk time series [37] rather than that of white noise. Given the definition of a random walk time series in [37], differences between adjacent time points generate a white noise time series, which satisfies standard assumptions on the Gaussian nature of noise. The vector ṽ^(T,u) contains w differences between w + 1 successive time points of the time series T beginning at position u

\tilde{v}^{(T,u)}_i = t_{u+i+1} - t_{u+i} \quad \text{for } 0 \le i < w,\ 1 \le u \le n - w. \qquad (1)
Following [19], we apply piecewise z-normalization, i.e., we subtract the mean and divide by the standard deviation for each vector ṽ^(T,u). The normalization adjusts the amplitude of the signal (window) without affecting its shape. Hence, similarly shaped signals can be identified independently from the amplitude:

v^{(T,u)} = \frac{\tilde{v}^{(T,u)} - \mu(\{\tilde{v}^{(T,u)}_i\})}{\sigma(\{\tilde{v}^{(T,u)}_i\})} \quad \text{for } 0 \le i < w,\ 1 \le u \le n - w, \qquad (2)

where μ is the mean of the values of the vector, and σ is the standard deviation.
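As a concrete illustration of Eqs. (1) and (2), the following sketch (our own illustration, not the authors' code; the function name normalized_windows and the NumPy-based setup are assumptions) extracts the difference vectors of a time series and applies piecewise z-normalization per window.

```python
import numpy as np

def normalized_windows(t, w):
    """Difference vectors of Eq. (1), z-normalized per window as in Eq. (2).

    t : 1-D array of time series values t_1, ..., t_n
    w : number of differences per window (w = w' - 1)
    Assumes no window is constant, so the standard deviation is non-zero.
    """
    t = np.asarray(t, dtype=float)
    diffs = np.diff(t)                       # t_{u+i+1} - t_{u+i}
    n_windows = len(diffs) - w + 1           # n - w' + 1 windows
    windows = np.array([diffs[u:u + w] for u in range(n_windows)])
    mu = windows.mean(axis=1, keepdims=True)
    sigma = windows.std(axis=1, keepdims=True)
    return (windows - mu) / sigma            # piecewise z-normalization
```

Each row of the returned array corresponds to one vector v^(T,u), so two series can be compared window by window with the Euclidean distance.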

2.2. Alignment

In order to evaluate the similarity between two time series, we compare vectors as defined in Eq. (2) using the Euclidean distance measure. If the distance is smaller than a threshold θ for one or more consecutive windows, we consider the two corresponding subsequences to be aligned. We refer to regions in which two time series satisfy a similarity criterion as alignments, in analogy to the terminology used for genomic sequences.

Definition 3. Two time series subsequences T^a and T^b have an alignment of length l ≥ w if the following holds

|v^{(T^a,\,u+i)} - v^{(T^b,\,x+i)}| < \theta \quad \text{for } 0 \le i < l - w + 1. \qquad (3)
In general, the two subsequences of an alignment do not have to start at the same time point. However, for the purpose of
this paper we only consider alignments for which u = x. There are multiple reasons for restricting the subsequences within
alignments to begin and end at the same time point. First, the concept of timing patterns that are used in the evaluation
depends on comparisons that are specific to points in time (Section 4). It is in the nature of our meta-analysis that relative
ordering in actual time is to be studied. Second, we use a threshold θ such that spurious matches are rare (Section 5.6). If
we were to allow arbitrary alignments, θ would have to be chosen much smaller to achieve the same level of significance.

That would mean that many of the alignments discussed in the evaluation would be lost. Allowing a choice of θ that disre-
gards statistical significance is not considered an option due to the known problems related to clustering of arbitrary time
series windows [24]. Notice that the performance is primarily governed by large hierarchies that are expected when there
are strong external factors that affect several time series simultaneously. If the algorithm were applied without the con-
straint that alignments start and end at the same time we would expect the additional hierarchies to be comparatively small.

Definition 4. A sliding-window alignment of length l between two sequences T^a and T^b, each beginning at the time point u, is maximal if there is no i ∈ {−1, l − w + 1} such that |v^{(T^a, u+i)} − v^{(T^b, u+i)}| < θ holds. A maximal sliding-window alignment is represented by {T^a, T^b}_{u:u+l−1}, where the subscripts denote the first (u) and last (u + l − 1) time point of the alignment.
Note that there can be longer, non-overlapping alignments between the same two sequences. The above definition only
excludes shorter alignments that are included in a maximal alignment.
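The two definitions can be realized directly for the restricted case u = x used in this paper. The sketch below (hypothetical names; it assumes the normalized window vectors produced by the snippet in Section 2.1) flags every window pair whose Euclidean distance is below θ and reports each maximal run of consecutive matches as one alignment.

```python
import numpy as np

def maximal_alignments(va, vb, w, theta):
    """Maximal sliding-window alignments (Definitions 3 and 4) between two
    series, given their normalized window vectors va, vb (one row per start
    position u) and the restriction u = x."""
    match = np.linalg.norm(va - vb, axis=1) < theta   # one flag per window u
    alignments, u = [], 0
    while u < len(match):
        if not match[u]:
            u += 1
            continue
        start = u
        while u < len(match) and match[u]:
            u += 1
        k = u - start              # consecutive matching windows
        l = k + w - 1              # alignment length (l >= w, Definition 4)
        alignments.append((start, start + l - 1))
    return alignments
```

A run of k matching windows corresponds to an alignment of length l = k + w − 1, i.e., the pair {T^a, T^b}_{start:start+l−1}.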

3. Approach

3.1. Clustering

Our algorithm represents sequence–subsequence relationships as edges in a DAG. Each leaf node within the DAG denotes
a time series. An internal node represents a set of subsequences that show a direct or indirect mutual similarity. We refer to
the internal nodes as sliding-window patterns.
Definition 5. A sliding-window pattern is given by a set of sequence sections of uniform length, all of which are nodes in a
connected graph of sliding-window alignments (see Definition 4). All alignments are required to extend at least over the full
length of the sequence sections, but may extend longer.

[Figure 2, panel (a): plot of four time series A–D, value versus time.]

Fig. 2. Example time series (a) together with the corresponding DAG (b) and pattern conglomerates (c). In comparison to Fig. 1, only sequence B is modified,
but due to its impact on the sequence–subsequence relationships, the representations of the clustering results differ.

The directed edges of the DAG of sliding-window patterns can be interpreted as representing a "has-subsequence(s)-of"
relationship. Edges begin at an internal node and end either at another internal node or a leaf node. In Fig. 1b, the △ node
has subsequences of the time series C and D. Similarly, the root node has subsequences of the remaining two internal nodes,
which in turn have subsequences of the time series A and B as well as C and D.
Since our algorithm also has some similarity to agglomerative hierarchical clustering, we refer to nodes as clusters. A subgraph
of the DAG that is induced by a set of nodes connected to a common root node is equivalent to a hierarchy in agglomerative
clustering. The higher a cluster is in a hierarchy, the more subsequences it contains. Since our DAG represents "has-subsequence(s)-of"
relationships, the subsequences of a cluster are at most as long as the subsequences within a connected cluster.
A cluster can have direct "has-subsequence(s)-of" relationships with two time series, another cluster and a time series, or
two different clusters. If a cluster combines subsequences of two time series, then it extends over the full length of the corresponding
alignment. Clusters that combine a time series and a cluster, or two clusters, only contain those sections of subsequences
that overlap with each other. Thus, non-overlapping sections of subsequences are not represented in clusters. For
example, in Fig. 1b, the root node (◇) only represents the subsequences signified by ◇, although the connected clusters
represent longer subsequences. Alignments are defined to start at the same time point (u = x in Definition 3). As a result,
all subsequences within a cluster begin and end at the same positions u and u + l − 1.
Non-overlapping parts of a long alignment can be contained in separate hierarchies within the DAG. An example depicting
such a situation is shown in Fig. 2a. In comparison to Fig. 1a, only the sequence B is modified so that it also has a part of
the pattern signified by △. This modification has a direct influence on the resulting DAG (panel (b) of Fig. 2), causing a change
in the sequence–subsequence relationships. The sections of the sequences that are signified by △ in Fig. 1a are divided into △
and + sections in Fig. 2a. As before, the longest alignment is clustered first, which combines the subsequences of C and D

[Figure 3, panel (a): plot of four time series A–D, value versus time.]

Fig. 3. An example of four time series (a) that results in the same lattice structure (b) as the example shown in Fig. 1, but has a different DAG (c).

(signified by △+). The second cluster, □, represents sections of the time series A and B. This is followed by the creation of
the cluster signified by ◇, which combines sections of all four time series. Finally, the fourth cluster represents the △ sections
of the sequences B–D. Note that after the creation of the third cluster (◇), non-overlapping subsequences of the second cluster
(□) are reused for creating the last cluster (△). If reusing non-overlapping subsequences were not allowed, the △ cluster
could not have been created. The corresponding pattern conglomerate of the example shown in Fig. 2a is depicted in panel
(c). A discussion addressing pattern conglomerates follows in Section 3.3. One might think that our DAG representation bears
strong similarities to the lattice structure of frequent itemset generation. However, an itemset lattice structure strictly represents
set–subset relationships, and the sequential information is lost. This becomes clear when applying frequent itemset
mining to sequences as transactions, where subsequences are considered to be the items. The relationships among the subsequences
are lost in such an approach, and an itemset lattice structure cannot capture the sequence–subsequence relationships
among the time series. For the example shown in Fig. 1, all itemsets with non-zero support count are depicted in Fig. 3b
in the form of an itemset lattice structure. In this setting, an item is a characteristic shape (□, ◇, or △), and the number of sequences
exhibiting a particular shape combination denotes the support count. The same itemset lattice structure, as shown in
Fig. 3b, can also be obtained from a different example, depicted in panel (a) of Fig. 3. This indicates that the itemset lattice structure
is less suitable for representing the sequence–subsequence relationships among time series. In contrast, the DAG representation
corresponding to the example shown in Fig. 3a is depicted in panel (c) of Fig. 3. While the representations of the
examples in Figs. 1a and 3a using the itemset lattice structure result in the same depiction, our approach using a DAG provides
two very different representations.

3.2. Algorithm

The pseudocode of the clustering algorithm is shown in Algorithm 1. Inputs to the algorithm are the set of all maximal
sliding-window alignments (Definition 4) and the window length w. At each step the next longest alignment is considered,
and it is determined whether it is at least as long as w. It is possible that there are alignments shorter than w due to splitting
(described later). Since these alignments do not satisfy the sliding-window alignment criterion, the clustering process stops
once all alignments with length ≥ w have been processed, and the created clusters are returned. After validating the length,
clusters are identified that overlap with a subsequence of the currently processed alignment. Depending on the presence of
overlapping clusters and the type of overlap, one of two different paths is taken.

Algorithm 1. Clustering algorithm.

(1) If a subsequence of the alignment partially overlaps with a cluster, then the alignment is split into fragments. The posi-
tion(s) of the split(s) are given by the beginning and ending of the subsequences within the overlapping cluster(s). The
fragments are then added to the list of alignments for later iterations. The splitting can be considered as a preprocess-
ing step for the clustering of the fragments.
(2) If neither subsequence of the alignment partially overlaps with a cluster the second path is taken; hence, an existing
cluster (cluster1 or cluster2) either is not present (null) or fully includes a subsequence of the alignment. In either case,
a new cluster is created that is based on the beginning and ending of the subsequences in the alignment. If no existing
cluster (cluster1 or cluster2) is found that overlaps with either subsequence of the alignment then the newly created

cluster is merely based on the alignment’s subsequences. Otherwise one or two existing clusters are found that fully
include a subsequence of the alignment. In that case, the newly created cluster is based on both the subsequences of
the alignment and the subsequences of the existing cluster(s) (cluster1 or cluster2). After the creation of a new cluster,
its subsequences are trimmed such that only the overlap region of the alignment and the existing clusters (if present)
are represented. Furthermore, the non-overlapping subsequences of the existing clusters are returned to the set of all
unprocessed alignments. This is achieved by splitting the underlying alignments used to create the existing clusters
and adding the non-overlapping fragments to the alignments list. Here, the position(s) of the split(s) are given by
the beginning and ending of the alignment’s subsequences.

There are some minor details that are not shown in the clustering algorithm: in some situations, there are multiple op-
tions for assigning a cluster to the variables cluster1 and cluster2. In this case, the clustering algorithm determines the best
combination of clusters (cluster1 and cluster2) that yields the longest overlap with the alignment. Another detail has been
omitted addressing the trimming of clusters. In deep hierarchies of three or more levels, clusters may be trimmed multiple
times. After each addition of a level, the subsequences within lower levels are trimmed and the resultant fragments are
added to the set of all unprocessed alignments. In order to guarantee that the alignments considered by the clustering algo-
rithm are maximal, alignment fragments resulting from trimming clusters are combined and represented as a single align-
ment in the alignments list.
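The control flow described above can be summarized in a short skeleton. The code below is a simplified sketch of our reading of Algorithm 1, not the authors' implementation: alignments are reduced to (start, end) intervals, the overlap test and the cluster creation/trimming are passed in as stand-in callables (find_partial_overlap, make_cluster), and the re-queuing of trimmed cluster fragments is omitted.

```python
def split_alignment(aln, cluster):
    """Path (1): split an alignment interval at the boundaries of a partially
    overlapping cluster interval; the fragments are re-queued for later
    iterations.  Both arguments are (start, end) pairs of time points."""
    (a, b), (c, d) = aln, cluster
    fragments = []
    if a < c:
        fragments.append((a, c - 1))              # part before the cluster
    fragments.append((max(a, c), min(b, d)))      # part overlapping the cluster
    if b > d:
        fragments.append((d + 1, b))              # part after the cluster
    return fragments

def cluster_loop(alignments, w, find_partial_overlap, make_cluster):
    """Main loop: process alignments longest first; stop at the first
    alignment shorter than w; split on partial overlap (path 1), otherwise
    create a new, trimmed cluster (path 2)."""
    by_length = lambda iv: iv[1] - iv[0] + 1
    work = sorted(alignments, key=by_length, reverse=True)
    clusters = []
    while work:
        aln = work.pop(0)
        if by_length(aln) < w:
            break                                  # remaining alignments are shorter
        overlapping = find_partial_overlap(aln, clusters)
        if overlapping is not None:                # path (1)
            work.extend(split_alignment(aln, overlapping))
            work.sort(key=by_length, reverse=True)
        else:                                      # path (2)
            clusters.append(make_cluster(aln, clusters))
    return clusters
```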

3.3. Pattern conglomerate

Typically, the DAG of a clustering result contains multiple hierarchies. Each hierarchy represents a subgraph of the DAG
that is induced by the set of nodes connected to a common root node. We represent the internal nodes (clusters) of each
hierarchy in postorder and distinguish between partial subgraphs using parentheses. Each internal node is denoted by
the alignment (Definition 4) that has been used to create the cluster. While the DAG aids the clustering algorithm, a general-
ized representation is more useful for the application of meta-analyses and additional mining algorithms, because the begin-
ning and ending of clustered subsequences are included within the pattern conglomerates.
Definition 6. A pattern conglomerate is a notation for sliding-window patterns (Definition 5) that are uniquely specified by
overlapping sliding-window alignments (Definition 4).
In the panels (c) of Figs. 1 and 2 our pattern conglomerates are shown for the two example clusterings. Since the
DAG in Fig. 2b contains two hierarchies, two pattern conglomerates are depicted. The left hierarchy is based on the clusters
□, △+, and the root cluster ◇. In our representation the cluster □ is denoted by the alignment {A, B}2:8. The
△+ and ◇ clusters are represented by the alignments {C, D}5:13 and {B, C}5:10, respectively. The second hierarchy to
the right shares its alignments with the left hierarchy, but denotes a different sequential feature (△). The alignments
in the second representation are shown in gray in order to indicate that the same alignments are also used for representing
the left hierarchy.
The pattern conglomerates can be used to determine the time interval each cluster represents. The time interval of a cluster
corresponds to the maximum overlap of the alignments that have been used to create the cluster. For example, the root
node ◇ of the left hierarchy in Fig. 2b is based on the alignments {A, B}2:8, {C, D}5:13, and {B, C}5:10. The maximum overlap of
these alignments represents the time interval from max{2, 5, 5} = 5 to min{8, 13, 10} = 8, which corresponds to the time interval
denoted by the root cluster (◇).
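The max/min computation in this example is simple enough to state directly; the following helper (a hypothetical illustration, not the authors' code) reproduces the interval of the root cluster ◇ from its three alignments.

```python
def cluster_interval(alignments):
    """Time interval of a cluster: the maximum overlap of the alignments
    used to create it (latest start, earliest end)."""
    return max(u for u, v in alignments), min(v for u, v in alignments)

# Root cluster of the left hierarchy in Fig. 2b: {A,B}_{2:8}, {C,D}_{5:13}, {B,C}_{5:10}
print(cluster_interval([(2, 8), (5, 13), (5, 10)]))   # -> (5, 8)
```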

4. Example meta-analysis: timing patterns

We show the usefulness of our pattern conglomerates by utilizing them in a meta-analysis. Timing patterns are deter-
mined that describe relationships between time series represented in the same pattern conglomerate. We determine
whether a subsequence of one time series begins earlier, ends later, or is longer than a subsequence of another time series.
Since a time series can be involved in multiple pattern conglomerates, relationships between subsequences of several time
series can be observed multiple times. The example in Fig. 1a supports the following timing patterns: The subsequences of C
and D are longer and end later than the subsequences of A and B. The timing patterns represent a proper subset of Allen’s as
well as Freksa's relations between intervals [31]. We focus on the subset of these relations that appears most immediately
interesting to a potential user, but all relations are in principle applicable. Note that our restriction to a proper subset of these
relations between intervals does not constitute a limitation of our approach. Additional relations or operators could be deter-
mined through a different meta-analysis, e.g., some of Roddick’s 49 midpoint interval operators [31]. In order to discover
timing patterns, at least three sequences have to have a similar shape within the same time interval. Since sliding-window
patterns are identified based on alignments between two time series, three or more subsequences must be present within a
pattern conglomerate to possibly observe differences in the beginning and ending positions. Also note that we only focus on
timing patterns relating pairs of sequences. Timing patterns among three or more time series could be easily identified by
evaluating multiple time series within each pattern conglomerate.

For identifying timing patterns, only pattern conglomerates with three or more time series have to be considered. Within
a pattern conglomerate, we focus on each path of alignments from a leaf node to the root node, and compare the beginning
and ending positions of the alignments to each other. For example, the time series Ta begins earlier than another time series
Tb, whenever the longest alignment of Ta begins earlier than the longest alignment of Tb for that particular path. Within a
pattern conglomerate there are possibly multiple, overlapping paths of alignments from a leaf node to the root node. For
example, in Fig. 2b, the pattern conglomerate of the left hierarchy is composed of two overlapping paths of alignments:
{A, B}2:8{B, C}5:10 and {C, D}5:13{B, C}5:10. Furthermore, it is permissible in our clustering algorithm that the same alignment
is used multiple times to cluster time series. In order to prevent the same pair of alignments from being used multiple times
for determining a timing pattern between two time series, a function "alreadyProcessed" is introduced that tests whether the
combination of alignments and time series has already been considered.
We formalize the criteria for timing patterns using the above test function. The formulae are based on two different time
series T^a and T^b as well as two different alignments {T^a, T^c}_{u:v} and {T^b, T^d}_{x:y} that are on a common alignment path within a
pattern conglomerate. The time series T^a begins earlier than the time series T^b, if

u < x \;\wedge\; \neg\,\mathrm{alreadyProcessed}(T^a, T^b, \{T^a, T^c\}_{u:v}, \{T^b, T^d\}_{x:y}). \qquad (4)

The time series T^a ends later than the time series T^b, if

v > y \;\wedge\; \neg\,\mathrm{alreadyProcessed}(T^a, T^b, \{T^a, T^c\}_{u:v}, \{T^b, T^d\}_{x:y}). \qquad (5)

The time series T^a is longer than the time series T^b, if

v - u + 1 > y - x + 1 \;\wedge\; \neg\,\mathrm{alreadyProcessed}(T^a, T^b, \{T^a, T^c\}_{u:v}, \{T^b, T^d\}_{x:y}). \qquad (6)

Support count and confidence measures can be defined as usual in pattern mining. The support count denotes the number of
identified occurrences (pattern conglomerates) of the corresponding timing pattern.

Definition 7. The support count for a relationship T^a ∘ T^b of type ∘ between two time series T^a and T^b is defined as the
number of pattern conglomerates r exhibiting the relationship: suppc(T^a ∘ T^b) = |{r | T^a_r ∘ T^b_r}|.

The confidence is the ratio of the support count and the number of occurrences identified for the particular combination
of time series and relationship type. For the relationship begins earlier and two time series T^a and T^b, the denominator of the
confidence value is the sum of the support counts for T^a begins earlier than T^b and for T^b begins earlier than T^a.

Definition 8. The confidence for a relationship T^a ∘ T^b of type ∘ between two time series T^a and T^b is defined as
conf(T^a ∘ T^b) = suppc(T^a ∘ T^b) / (suppc(T^a ∘ T^b) + suppc(T^b ∘ T^a)).
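Eqs. (4)–(6) and Definitions 7 and 8 translate into a few lines of bookkeeping. The sketch below is our own illustration (names such as compare and the tuple encoding of alignments are assumptions, not the paper's code); the set `seen` plays the role of the alreadyProcessed test, and support counts are kept as sets of conglomerate identifiers so that each conglomerate is counted once.

```python
from collections import defaultdict

# suppc[rel][(Ta, Tb)] is the set of pattern conglomerates exhibiting the
# relationship, so its size is the support count of Definition 7.
suppc = {rel: defaultdict(set) for rel in ("begins_earlier", "ends_later", "is_longer")}

def compare(Ta, Tb, aln_a, aln_b, conglomerate_id, seen):
    """Apply Eqs. (4)-(6) to the longest alignments of Ta and Tb on a common
    alignment path.  Alignments are (series_pair, start, end) tuples, e.g.
    (("A", "B"), 2, 8) for {A, B}_{2:8}."""
    key = (Ta, Tb, aln_a, aln_b)
    if aln_a == aln_b or key in seen:
        return
    seen.add(key)
    (_, u, v), (_, x, y) = aln_a, aln_b
    if u < x:
        suppc["begins_earlier"][(Ta, Tb)].add(conglomerate_id)   # Eq. (4)
    if v > y:
        suppc["ends_later"][(Ta, Tb)].add(conglomerate_id)       # Eq. (5)
    if v - u + 1 > y - x + 1:
        suppc["is_longer"][(Ta, Tb)].add(conglomerate_id)        # Eq. (6)

def confidence(rel, Ta, Tb):
    """Definition 8: suppc(Ta rel Tb) / (suppc(Ta rel Tb) + suppc(Tb rel Ta))."""
    ab, ba = len(suppc[rel][(Ta, Tb)]), len(suppc[rel][(Tb, Ta)])
    return ab / (ab + ba) if ab + ba else 0.0
```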

4.1. Algorithm

In Algorithm 2, the pseudocode is shown for determining timing patterns based on our pattern conglomerates (Section
3.3). Input to the algorithm is a list of all pattern conglomerates (R) that are derived from a clustering DAG. The support
counts of the begins earlier (be), ends later (el), and is longer (il) relationships are determined for each combination of time
series. The algorithm traverses each alignment path p of each pattern conglomerate r and considers every combination
of time series {T^a, T^b} within the alignment path. Subsequently, for both time series T^a and T^b the longest alignment
of each time series on the path is determined. In the case that the two longest alignments {T^a, T^c}_{u:v} and {T^b, T^d}_{x:y} are equivalent
or have already been processed, the algorithm proceeds to the next combination of time series. Otherwise, the beginning and
ending positions of the alignments are compared to each other, in order to determine whether one time series begins earlier,
ends later, or is longer than the other time series.

Table 1
List of sectors and their abbreviations.

Sector name Sector abbreviation


Energy E
Materials M
Industrials I
Consumer discretionary CD
Consumer staples CS
Health care H
Financials F
Information technology IT
Telecommunication services T
Utilities U

Algorithm 2. Example meta-analysis: timing patterns.

5. Experimental evaluation

5.1. Data and parameter choices

We use daily historical data for stocks of the S&P 500 index from http://kumo.swcp.com/stocks/. The obtained data set
includes stock prices for an entire year from 02/01/2006 to 01/31/2007. The analysis of the stocks is done using the closing
values of each day, and excludes all those stocks that have not been a member of the S&P 500 index for the entire year. How-
ever, our approach does not need to be limited to time series of a particular length. Overall, the data set consists of 460 dif-
ferent stocks each having 250 prices. For brevity, we refer to specific stocks using their ticker symbol.
The stocks in the S&P 500 index are assigned a sector based on the type of business conducted. This categorization is
available at the Standard & Poor’s website (http://www2.standardandpoors.com/spf/csv/index/sp500.csv), and includes 10
sectors, such as energy, financials and health care. The entire list of sectors together with their abbreviations is shown in
Table 1. Ticker symbols and corresponding company names of the stocks in the energy sector are provided in Table 2.
The crude oil price shown in panel (a) of Fig. 4 is the Cushing, Oklahoma WTI (West Texas Intermediate) spot price FOB
(Free On Board) in dollars per barrel obtained from the Energy Information Administration (http://tonto.eia.doe.gov/dnav/
pet/pet_pri_spt_s1_d.htm), which is a statistics agency of the US Department of Energy.
Furthermore, we use three multivariate time series data sets as well as an artificial data set from E. Keogh’s UCR Time
Series Data Mining Archive (http://www.cs.ucr.edu/eamonn/TSDMA/index.html). The buoy_sensor data set consists of four
time series, each with 13,991 values. Both the spot_exrates and wind data sets contain 12 time series with 2567 and 6574
values, respectively. The artificial random_walk data set consists of a single time series with 65,536 values.
The time series are aligned and clustered using a single parameter, window size w. The window size w of 10 is chosen
such that the shortest similarity between two subsequences extends over two workweeks in the case of the stock market
data and 10 consecutive values in case of the additional data sets. Especially considering the volatility of the stock market,
we regard similarities between stock time series that extend over 2 weeks of particular interest. The maximum Euclidean
distance h varies for different data sets. The threshold is calculated such that a similarity between two subsequences of

Table 2
List of ticker symbols and corresponding company names of the stocks in the energy sector.

Ticker symbol Company name


APA Apache Corp.
APC Anadarko Petroleum
BHI Baker Hughes
BJS BJ Services
COP ConocoPhillips
CVX Chevron Corp.
DVN Devon Energy Corp.
EOG EOG Resources
EP El Paso Corp.
HAL Halliburton Co.
KMI Kinder Morgan
MRO Marathon Oil Corp.
MUR Murphy Oil
NBR Nabors Industries Ltd.
NE Noble Corporation
NOV National Oilwell Varco Inc.
OXY Occidental Petroleum
RDC Rowan Cos.
RIG Transocean Inc.
SLB Schlumberger Ltd.
SUN Sunoco Inc.
VLO Valero Energy
WFT Weatherford International Ltd.
WMB Williams Cos.
XOM Exxon Mobil Corp.
XTO XTO Energy Inc.

distance θ is chosen such that a similarity between two subsequences of length w has a p-value of approximately 0.04. See Section 5.6 for a discussion addressing the significance. A change of either
value, w or θ, affects the definition of an alignment (Definition 3). A more stringent similarity criterion is obtained by
decreasing the similarity threshold θ or increasing the window length w. In the latter case, more time series data points
are considered in the determination of the similarity between two windows. However, a modification of the values also affects
the p-value of a similarity between two subsequences (Section 5.6). An equivalent p-value can be obtained by multiple
value combinations of w and θ. The longer the window size w chosen, the higher the maximum Euclidean distance θ may be
to achieve an equivalent p-value. However, the relationship between the window size w and the distance threshold θ is not linear.
By varying the value combination of w and θ, different but equally significant sets of alignments may be obtained that
result in different DAGs and pattern conglomerates. Since the focus of this paper is on the capturing of similarities among sequences
in the form of a DAG and their abstraction to pattern conglomerates, we merely analyze the data sets using a single
window size.
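Under the χ² approximation worked out in Section 5.6, the chance-match probability for a window pair, and from it the per-hierarchy p-value, can be estimated as sketched below. This is our own hedged illustration (hypothetical function names, SciPy assumed); the degrees-of-freedom bookkeeping follows the description in Section 5.6, so the numbers are estimates rather than a reproduction of the reported values.

```python
import math
from scipy.stats import chi2

def match_p_value(theta, w):
    """Estimated probability that two z-normalized windows of w differences
    match by chance: the squared distance divided by 2 is approximately
    chi-square distributed with w - 2 degrees of freedom (Section 5.6)."""
    return chi2.cdf(theta ** 2 / 2.0, df=w - 2)

def any_match_p_value(theta, w, n):
    """Probability of at least one chance match among n candidate sequences,
    p = 1 - exp(-p1 * n); only an estimate, as the paper notes."""
    return 1.0 - math.exp(-match_p_value(theta, w) * n)
```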

5.2. Examples

The alignment algorithm determines 1269 alignments for the stocks in the energy sector, which are grouped into 213
clusters and result in 67 pattern conglomerates. Fig. 4a shows the clustered subsequences of one hierarchy extending from
10/05/06 to 11/01/06. In Fig. 4a, it can be seen that the sequences are clustered over a variety of lengths. The longest sub-
sequences are represented by the cluster combining RIG and WFT and consisting of 19 prices, and the shortest subsequences
are represented by the root cluster, which includes 12 consecutive prices of each stock shown in the hierarchy. The stock
prices of RIG and WFT are clustered first, since their alignment is longest. Other stock series that match the pattern over
shorter time intervals are clustered later.
Fig. 4a also includes the crude oil price. Although crude oil is not part of the S&P 500 index, this time series is shown,
because of the apparent economic connections between crude oil and stocks in the energy sector. Surprisingly, we find
that similarities among different stocks dominate the pattern DAG, which is depicted in panel (b) of Fig. 4. If we clustered
the crude oil time series with the stocks in the energy sector, only the highlighted subsequence of crude oil, which is
shorter than any of the stock series sections, would be included in the hierarchy. The corresponding DAG would be ex-
tended and receive a new root cluster that combines the root cluster of the DAG shown with an additional node repre-
senting crude oil. This indicates that we can find patterns that are not directly a consequence of a single external factor
such as crude oil. Multiple timing patterns are supported by the corresponding pattern conglomerate shown in Fig. 4c. For
example, the subsequences of RIG and WFT are longer, and end later than the subsequence of RDC, but the subsequence of
RDC begins earlier than the subsequences of RIG and WFT. Furthermore, the subsequences of RIG and WFT begin and end
at the same time and therefore have the same length. In this situation none of the timing patterns under consideration is
observed.

[Figure 4, panel (a): prices in dollars of the stocks BHI, RIG, DVN, APA, CVX, WFT, and RDC together with the crude oil price, 10/05/06–11/01/06.]

Fig. 4. Clustered subsequences of stocks in the energy sector (a) together with resultant DAG (b) and pattern conglomerate (c). The crude oil prices are
included in panel (a) in order to show that there seems to be a dependency between crude oil and stocks of the energy sector. See Table 2 for the company
names of the ticker symbols shown.

Table 3 lists 10 pattern conglomerates utilizing the notation introduced in Section 3.3. The table includes all clustering
results among the sequences of the energy sector that are based on a hierarchy whose root cluster represents a time interval
within the first 30 days. Note that half of these pattern conglomerates cannot be used for determining timing patterns, be-
cause they are only based on a single alignment between two sequences.

Table 3
List of pattern conglomerates among the stocks of the energy sector. The table shows all pattern conglomerates that are based on hierarchies with a root cluster
representing a time interval within the first 30 days. Table 2 provides the company names of the ticker symbols shown.

Pattern conglomerate
({MUR, WMB}16:27)
({BJS, WFT}5:15)
((({BJS, RIG}17:32){BJS, APC}18:29){BJS, NBR}19:30)
({MRO, CVX}13:23)
(({MUR, VLO}4:14){SUN, VLO}2:14)
({BHI, NE}1:10)
((({HAL, SLB}10:24)(({BHI, RDC}2:25)({OXY, APA}11:27)
{APA, RDC}12:25){RDC, SLB}11:24)
((({COP, VLO}12:30){DVN, COP}13:26){DVN, EOG}15:27){HAL, COP}15:26)
({MRO, XOM}1:10)
(({NOV, XTO}16:32){EP, NOV}19:28)
((({APC, XTO}0:15){APC, EOG}3:13)({APA, DVN}4:13){DVN, EOG}4:13)

5.3. Results per sector

It is in the nature of our cluster hierarchies that they can contain clusters and thereby sequences of varying length. We
now look at the length distribution of sequences. For this analysis we consider only the longest subsequence of each time
series in each hierarchy. Referring back to the DAG representation in Fig. 1b, it can be seen that subsequences of the same
time series may be represented in two connected clusters with different lengths. For example, a subsequence of A is included
in the □ and ◇ cluster as well as in the ◇ cluster. For this analysis we consider only the subsequence in the longest cluster,
i.e., the □ and ◇ cluster.
Fig. 5 depicts the histogram for the clustering results of the energy and financials sectors. Since pattern conglomerates
only consisting of two subsequences cannot be used to derive timing patterns, the subsequences of these hierarchies are
not included in the histogram shown. Also note that there is no subsequence shorter than the window size w of 10, because
the clustering algorithm terminates once an alignment shorter than w is encountered. Fig. 5 shows that the time series of the
energy sector appear to be similar over longer time intervals. A second histogram is depicted by Fig. 6. It shows the number
of time series clustered in different hierarchies for the energy and financials sectors. As with the previous histogram, Fig. 6

[Figure 5: histogram of the number of subsequences versus the length of the longest subsequences in hierarchies, for the energy (E) and financials (F) sectors.]

Fig. 5. Histogram of the length of longest clustered subsequence over all time series and hierarchies for the energy and financials sectors.

[Figure 6: histogram of the number of hierarchies versus the number of sequences, for the energy (E) and financials (F) sectors.]

Fig. 6. Histogram of the number of time series clustered in different hierarchies for the energy and financials sectors.

only includes those hierarchies that combine three or more time series. It can be seen that in the energy sector there are
hierarchies that combine most of the 26 time series, while the largest hierarchy in the financials sector only includes 10
stocks out of a total of 78.

5.4. Resulting timing pattern

We now address timing patterns as introduced in Section 4 and focus on one relationship type, begins earlier, in order to
discuss a visualization in the form of a matrix. Table 4 shows such a visualization matrix for the begins earlier relationship and
the stocks in the energy sector. Each row and column represents a time series, and the relationship between two time series
is represented by the color of the corresponding matrix element. A white matrix element indicates that the time series of the
row begins earlier than the time series of the column in more than 50% of the cases. Black matrix elements represent begins
earlier relationships that are true for fewer than 50% of the cases. The gray coloring indicates either a tie or the absence of a
relationship. By comparing the number of white matrix elements in individual rows, it can be observed that some time series
begin earlier than most other time series.
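The coloring rule of the matrix follows directly from the confidence values of the begins earlier relationship; a minimal sketch (our own illustration, hypothetical name) is:

```python
def matrix_cell(conf_row_before_col):
    """Table 4 coloring: white if the row series begins earlier than the column
    series in more than 50% of the cases, black if in fewer than 50%, and gray
    for a tie or when no relationship was observed (None)."""
    if conf_row_before_col is None or conf_row_before_col == 0.5:
        return "gray"
    return "white" if conf_row_before_col > 0.5 else "black"
```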
Example timing patterns for several sectors of the stock data set and different relationship types are shown in Table 5. The
support count and confidence values are provided in parentheses following each timing pattern. The table also includes tim-
ing patterns that are determined using pattern conglomerates obtained from clustering all 460 stock market time series.

Table 4
A matrix showing the begins earlier relationships between stocks of the energy sector. A white element indicates that in more than 50% of the cases, clustered
subsequences of the stock in the row begin earlier than subsequences of the stock in the column. Black coloring is used for representing less than 50% of the
cases. Gray elements of the matrix represent either a tie or the absence of a relationship. The company names of the depicted ticker symbols are listed in
Table 2.

Table 5
Examples of discovered timing patterns for different sectors, data sets, and relationship types. Patterns are represented as Ta > Tb denoting that the time series
Ta is longer, begins earlier, or ends later than the time series Tb. The numbers in parentheses show the support count and confidence for the corresponding
pattern. The row of the stock-ALL data set contains patterns across the boundary of sector categorizations. The particular sectors are denoted by their
abbreviation in brackets. See Table 1 for the sector names.

Data Is longer Begins earlier Ends later


E XOM > APA COP > XTO RIG > APC
(5, 1.0) (6, 1.0) (5, 1.0)
M NUE > MWV EMN > SEE AA > MWV
(2, 1.0) (2, 1.0) (2, 1.0)
I BNI > CSX BNI > CSX TYC > CSX
(3, 1.0) (3, 1.0) (1, 1.0)
CD DHI > KBH CTX > KBH DHI > KBH
(3, 1.0) (3, 1.0) (3, 1.0)
F SPG > AIV GS > FII VNO > EQR
(3, 1.0) (2, 1.0) (2, 1.0)
U XEL > PPL DTE > PPL EXC > EIX
(2, 1.0) (2, 1.0) (2, 1.0)
Stock-ALL BMS[M] > KBH[CD] PHM[CD] > MAS[I] BMS[M] > KBH[CD]
(2, 1.0) (1, 1.0) (2, 1.0)
Buoy_sensor 3>2 1>3 3>1
(4, 0.67) (3, 0.6) (4, 0.67)
Spot_exrates NLG > CHF NLG > CHF NLG > ESP
(87, 0.92) (75, 0.87) (66, 0.90)
Wind CLA > BEL BIR > MAL BIR > VAL
(114, 0.89) (99, 0.83) (110, 0.86)
Random – – –

Such timing patterns are listed in the row labeled stock-ALL, and represent a relationship between two time series that be-
long to different sectors. The sector abbreviation of the ticker symbols are provided in brackets.

5.5. Additional data sets

We have applied our approach to four additional data sets (Section 5.1) in order to highlight its applicability to different
time series data sets, as well as to provide a more comprehensive performance analysis (Section 5.7).
Since our approach is designed for multiple time series, the random_walk data set, which only consists of a single time
series, must be processed before the algorithm can be applied. We utilize the first 6500 values of the random_walk data set
in order to create 26 time series each with 250 values. The resultant data set (called random), thereby, resembles the setup of
the energy sector, which is also composed of 26 sequences with 250 values. We choose the same values (w = 10 and θ = 1.16)
used for the time series of the energy sector in order to determine sliding-window alignments (Definition 4) and clusters
among the 26 random time series of the constructed data set. Only three sliding-window alignments are found among
the 26 random sequences. Two of these alignments have the minimum length of 10, and the third alignment is based on
12 value differences. In contrast, 1269 alignments are found among the sequences of the energy sector. The clustering of
the three alignments results in three clusters as well as three hierarchies. Each hierarchy is merely composed of a single clus-
ter and is based on a single alignment. Since none of these hierarchies combines three or more time series, no timing patterns
are determined between pairs of the 26 random sequences.
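The preprocessing of the random_walk data is a simple reshape of the first 6500 values into 26 rows of 250 values; a minimal sketch (with a synthetic random walk standing in for the UCR file) is:

```python
import numpy as np

# Stand-in for the 65,536-value random_walk series from the UCR archive.
random_walk = np.cumsum(np.random.randn(65536))

# First 6500 values -> 26 non-overlapping series of 250 values each,
# mirroring the setup of the energy sector (26 sequences, 250 prices).
random_series = random_walk[:6500].reshape(26, 250)
```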
The last four rows of Table 5 provide example timing patterns for the additional data sets and different relationship
types. Note that the support depends on the number of similarities among the sequences under consideration, which in turn
is affected by the length of the time series. The sequences of the stock market data set only contain 250 prices and the time
series of the buoy_sensor, spot_exrates, and wind data sets are composed of more than 2000 values (Section 5.1). The sup-
port counts of timing patterns from different data sets have to be normalized before they can be compared. The normaliza-
tion has to take the length of the time series into consideration, because more alignments (Definition 4), pattern
conglomerates (Section 3.3), and timing patterns (Section 4) may be derived based on longer sequences. In addition, we take
the lengths of alignments into account for normalizing the support count of timing patterns. The alignments between time
series of one data set may be very long, so that only a few timing patterns can be derived and the alignments between time
series of another data set could be rather short resulting in many pattern conglomerates and timing patterns. For each data
set, we normalize the support count by multiplying it by the ratio of the average alignment length L used in pattern conglomerates
and the length of the time series |T^a|:

\mathrm{n\_suppc}(T^a \circ T^b) = \mathrm{suppc}(T^a \circ T^b)\,\frac{L}{|T^a|}, \qquad (7)
where we assume that |Ta| = |Tb|. The normalized support count measure n_suppc typically results in a number between zero
and one that indicates the cumulative portion of a time series represented in timing patterns with a particular support count.
For example, a normalized support count close to one means that the respective timing patterns represent nearly all

Table 6
Average alignment lengths (L) and normalized support counts (n_suppc) for the begins earlier timing patterns in Table 5.

Data Average alignment length Support count Normalized support count


E 13.91 6 0.33
M 11.92 2 0.10
I 10.87 3 0.13
CD 11.65 3 0.14
F 11.83 2 0.09
U 12.32 2 0.10
Stock-ALL 11.07 1 0.04
Buoy_sensor 12.24 3 2.62E−3
Spot_exrates 25.77 75 0.75
Wind 17.43 99 0.26
Random 10.66 – –

Correspondingly, only a small portion of the time series is represented by timing patterns with a normalized support count close to zero. Table 6 shows the average alignment length and the normalized support count for the begins earlier timing patterns in Table 5. For most data sets, the average alignment length used in pattern conglomerates is close to the window size w of 10. The spot_exrates and wind data sets show longer average alignment lengths of 25.77 and 17.43, respectively. The normalized support count of the begins earlier timing patterns also varies across the data sets. In particular, the timing patterns of the energy sector, as well as those of the spot_exrates and wind data sets, show relatively high normalized support counts, which indicates that these timing patterns represent extended portions of the corresponding time series.
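As an illustration of Eq. (7), the following sketch computes the normalized support count from a raw support count, the average alignment length, and the series length; the function name and its arguments are ours and only reproduce the arithmetic, not the authors' implementation.

```python
def normalized_support_count(support_count, avg_alignment_length, series_length):
    """Eq. (7): scale the support count by the ratio of the average
    alignment length to the time series length (|Ta| = |Tb| assumed)."""
    return support_count * avg_alignment_length / series_length

# Checking the energy-sector row of Table 6 (250 values per series):
print(round(normalized_support_count(6, 13.91, 250), 2))  # 0.33
```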

5.6. Significance

In general, the probability of finding an alignment with a Euclidean distance smaller than the threshold depends on the
properties of the time series. Stock market data typically have statistical properties that are similar to those of a random walk
time series; hence, taking differences between successive time points is expected to result in attributes that have an approx-
imately normal distribution. Fig. 7 shows the distribution of attribute values for the energy sector time series discussed in
the previous sections. Note that all individual attributes share the same general shape. This observation is important: had we
used the data without taking differences, then the first and last values within each window would have a much broader dis-
tribution than values in the middle of the window. Taking differences effectively removes this systematic dependence on
position within a window. The figure also shows that the attributes are not perfectly represented by a normal distribution
with standard deviation equal to one. This is not surprising since we base the normalization on only w attribute values. This
results in much smaller tails than would be expected for a normal distribution. We nevertheless use the normal approxima-
tion for estimating the significance of window matches, but it should be understood that the result is only an estimate.
The probability distribution of the Euclidean distance among the w components of vectors x and y that are normally distributed with μx = μy = 0 and σx = σy = 1 can be derived as follows:

$\langle |x - y|^2 \rangle \propto \int \{dx_i\}\{dy_i\}\, |x - y|^2\, e^{-x^2/2}\, e^{-y^2/2}. \qquad (8)$

Fig. 7. Histogram of attribute values of normalized vectors.



Coordinate transformation z = |x − y| results in

$\langle z^2 \rangle \propto \int \{dx_i\}\{dz_i\}\, z^2\, e^{-x^2/2}\, e^{-(x-z)^2/2}. \qquad (9)$

The integration over $x_i$ represents a convolution of Gaussians, which can be performed using a Fourier transform. The result is a Gaussian with $\sigma = \sqrt{\sigma_x^2 + \sigma_y^2} = \sqrt{2}$:

$\int \{dx_i\}\, e^{-x^2/2}\, e^{-(z-x)^2/2} \propto e^{-z^2/(2 \cdot 2)}. \qquad (10)$
Hence, the distribution of distances is a normal distribution with $\sigma = \sqrt{2}$:

$\langle z^2 \rangle \propto \int \{dz_i\}\, z^2\, e^{-z^2/(2 \cdot 2)}. \qquad (11)$
That means that z²/√2, with z being the distance, is expected to be distributed according to a χ² statistic. The number of degrees of freedom is w − 2, since mean and variance were set as part of the normalization. For w = 10 and a distance threshold of θ = 1.16, the p-value for each match between two windows being the result of random chance is p1 = 1.5 × 10⁻³. The p-value for finding a spurious match in any of N sequences is p = 1 − exp(−p1N). For the 26 sequences in the energy sector we get p = 0.037, which is below the 5% significance level. That means that θ = 1.16 can be considered a threshold for which spurious matches are expected to be rare. In practice, we consider hierarchies that involve many windows and are far less likely to be encountered by random chance.
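A minimal sketch of this significance estimate, assuming SciPy is available: it evaluates the χ² approximation with w − 2 degrees of freedom for the statistic z²/√2 and the bound p = 1 − exp(−p1N). The numbers it prints approximately reproduce the values reported above for the energy sector.

```python
import math
from scipy.stats import chi2

def match_p_value(threshold, w):
    """Estimated p-value that a single window pair matches by chance:
    treat z**2 / sqrt(2) as chi-square with w - 2 degrees of freedom."""
    statistic = threshold ** 2 / math.sqrt(2)
    return chi2.cdf(statistic, df=w - 2)

def spurious_match_p_value(p1, n_sequences):
    """Probability of at least one spurious match among n_sequences."""
    return 1.0 - math.exp(-p1 * n_sequences)

p1 = match_p_value(threshold=1.16, w=10)
print(f"p1 ~ {p1:.1e}")                              # roughly 1.5e-03
print(f"p  ~ {spurious_match_p_value(p1, 26):.3f}")  # roughly 0.04
```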

5.7. Performance

The performance of the clustering algorithm depends on (1) the number and length of time series under consideration, (2)
the number of alignments between the sequences, and (3) the number of overlaps among the alignments. The data sets (Sec-
tion 5.1) can be categorized into three distinct groups: (1) data sets that are not expected to show any patterns (i.e., random),
(2) data sets that are expected to provide some patterns (i.e., the time series of the individual sectors as shown in Table 1 and
stock-ALL), and (3) data sets that are expected to have many patterns (i.e., buoy_sensor, spot_exrates, and wind). The time
series of different data sets show varying degrees of similarity between each other, which results in different numbers of
alignments. Fig. 8 shows the number of alignments determined for each sector and data set in relation to the number of time
series under consideration. In addition, the graph of the function x(x − 1)/2 is depicted to serve as an orientation. If each pair of
sequences showed one alignment, we would expect the points to lie on this line. The time series of the energy sector and the
sequences of three additional data sets (buoy_sensor, spot_exrates, and wind) show a particularly large number of align-
ments in relation to the number of time series. Given the quantities measured in the buoy_sensor, spot_exrates, and wind data sets, it is not surprising that their time series show many similarities to each other; the time series within each of these data sets are related and therefore expected to be similar.
In bioinformatics, efficient algorithms have been devised for identifying alignments among genome sequences. These
heuristic algorithms, e.g., BLAST [3] and FASTA [35], have the limitation that they are not guaranteed to find all alignments, or complete alignments. However, since these algorithms balance efficiency and sensitivity considerations, they are useful for

Fig. 8. Number of alignments determined for each sector and data set, plotted against the number of sequences (log–log scale). The function x(x − 1)/2 shows the scaling under the assumption that one alignment is determined for each sequence pair. The sectors are denoted by their abbreviations; see Table 1 for the sector names.

Fig. 9. Execution time of the clustering algorithm (in ms) for each sector and data set, plotted against the number of alignments (log–log scale). The sectors are denoted by their abbreviations according to the list provided in Table 1.

Fig. 10. Execution time of aligning and clustering the sequences of each sector and data set (in ms), plotted against the number of sequences (log–log scale). Sectors are abbreviated as explained in Table 1.

identifying alignments among a massive number of sequences. An algorithm that utilizes similar heuristics for identifying
alignments among time series could be a means for improving the performance of our approach. In order to take advantage
of an existing algorithm that is designed for sequences of categorical values, the time series have to be discretized. Ap-
proaches like SAX [29] and Persist [32] address time series discretization and can be used as a preliminary step (a sketch of such a discretization is given after this paragraph). As a consequence of utilizing a heuristic algorithm for identifying alignments, (1) the obtained alignments do not necessarily satisfy Definition 3, (2) a different DAG as well as different pattern conglomerates could be determined, and (3) a different set of timing patterns may be identified. Nevertheless, timing patterns with a high support count (Definition 7) and confidence (Definition 8) are likely to be determined using either alignment algorithm. An alternative approach towards decreasing the running time for determining the alignments between time series is to modify the combination of the parameter values w and θ. For a discussion of these parameter choices see Section 5.1.
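As a rough illustration of such a preliminary discretization step, the sketch below follows the general idea of SAX [29] (z-normalization, piecewise aggregation, symbol assignment via Gaussian breakpoints). It is a simplified reading of that approach with a four-symbol alphabet, not the pipeline used in this paper.

```python
import numpy as np

# Breakpoints that divide N(0, 1) into four equiprobable regions (SAX-style).
BREAKPOINTS = np.array([-0.6745, 0.0, 0.6745])
ALPHABET = "abcd"

def discretize(series, n_segments):
    """Z-normalize a series, reduce it to n_segments segment means,
    and map each mean to a symbol."""
    x = np.asarray(series, dtype=float)
    x = (x - x.mean()) / x.std()
    segments = np.array_split(x, n_segments)
    means = np.array([seg.mean() for seg in segments])
    symbols = np.searchsorted(BREAKPOINTS, means)   # indices 0..3
    return "".join(ALPHABET[i] for i in symbols)

print(discretize(np.sin(np.linspace(0, 2 * np.pi, 100)), n_segments=8))
```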
Two limiting cases can be considered to determine the performance of our algorithm. If the alignments among the se-
quences do not overlap with each other, then our clustering algorithm has a linear time complexity in relation to the number
of alignments. In the worst-case scenario, however, all determined alignments overlap with each other. The clustering algorithm may then determine fragments of all previously considered alignments and add them to the list of unprocessed alignments, as the toy sketch below illustrates. This results in a quadratic worst-case time complexity in relation to the number of alignments.
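The toy sketch below only illustrates this work-list behavior, with alignments reduced to index intervals; the fragmentation rule is a simplification introduced for illustration and is not the clustering algorithm used in this paper.

```python
def cluster_intervals(alignments):
    """Toy work-list loop: when an interval overlaps an existing cluster,
    the covered span is dropped and the uncovered remainder is pushed back
    onto the work list to be reconsidered against the remaining clusters,
    which is what can drive the quadratic worst case."""
    clusters = []
    work = list(alignments)                  # (start, end) index intervals
    while work:
        start, end = work.pop()
        for cs, ce in clusters:
            if start < ce and cs < end:      # overlaps this cluster
                if start < cs:
                    work.append((start, cs)) # left fragment, reprocessed later
                if end > ce:
                    work.append((ce, end))   # right fragment, reprocessed later
                break
        else:
            clusters.append((start, end))    # no overlap: new cluster
    return clusters

print(cluster_intervals([(0, 12), (8, 20), (18, 30)]))  # three disjoint clusters
```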

Empirical results are depicted in Fig. 9. It shows the execution times for clustering the alignments of each sector and data
set. The worst performance is obtained by clustering the spot_exrates and wind data sets, which also show the two largest
numbers of alignments in Fig. 8. On average, each time series of the wind data set participates in 952 alignments, and a se-
quence of the spot_exrates data set shows an average of 282 similarities. In contrast, all other data sets, as well as the individual sectors, have an average number of alignments per sequence below 85, and for the majority it is even below 10.
Hence, it is the large number of overlapping alignments among few sequences that causes the unusually high execution time
for these two data sets.
In Fig. 10, the overall performance of our approach is shown in relation to the number of sequences under consideration.
Note the strong similarity between the number of alignments of a data set (Fig. 8) and the overall performance of our ap-
proach (Fig. 10). The performance of our clustering algorithm is mostly affected by the number of alignments determined
among the sequences. The number of alignments, in turn, depends on the similarities among the time series data. Strongly
similar data, e.g., wind and spot_exrates data sets, show a much larger number of alignments and running time than less
similar data, e.g., random data set.

6. Conclusions

We introduce an algorithm for representing the sequence–subsequence relationships among patterns based on subse-
quence similarities. The relationships between similar, contiguous subsequences are based on their overlap and result in
a directed acyclic graph (DAG). Our DAG representation is abstracted to pattern conglomerates, which in turn are evaluated
by examining the differences between the beginning and ending positions of similar subsequences. We apply our approach
to stock market time series of the S&P 500 index as well as to four additional time series data sets, and determine timing
patterns that capture relations between time series intervals. The extension of pattern discovery to include temporal rela-
tionships among patterns, in the form of pattern conglomerates, opens up the field of time series pattern mining to further
meta-analyses and mining algorithms.

Acknowledgement

This material is based upon work supported by the National Science Foundation under Grant No. IDM-0415190.

References

[1] R. Agrawal, R. Srikant, Fast algorithms for mining association rules in large databases, in: Proceedings of the 20th International Conference of Very
Large Data Bases, Morgan Kaufmann Publishers Inc., 1994, pp. 487–499.
[2] R. Agrawal, R. Srikant, Mining sequential patterns, in: Proceedings of the 11th International Conference on Data Engineering, IEEE Computer Society
Press, 1995, pp. 3–14.
[3] S.F. Altschul, T.L. Madden, A.A. Schaffer, J. Zhang, Z. Zhang, W. Miller, D.J. Lipman, Gapped BLAST and PSI-BLAST: a new generation of protein database
search programs, Nucleic Acids Research 25 (17) (1997) 3389–3402.
[4] E.M. Azoff, Neural Network Time Series Forecasting of Financial Markets, John Wiley & Sons Inc., 1994.
[5] B.S. Bernanke, K.N. Kuttner, What explains the stock market’s reaction to federal reserve policy?, The Journal of Finance 60 (3) (2005) 1221–1257.
[6] D.J. Berndt, J. Clifford, Using dynamic time warping to find patterns in time series, in: Proceedings of the Workshop on Knowledge Discovery and
Databases, AAAI Press, 1994, pp. 229–248.
[7] D.J. Berndt, J. Clifford, Advances in knowledge discovery and data mining, chap, Finding Patterns in Time Series: A Dynamic Programming Approach,
AAAI Press, 1996.
[8] J. Buhler, M. Tompa, Finding motifs using random projections, in: Proceedings of the Fifth International Conference on Computational Biology, ACM
Press, 2001, pp. 69–76.
[9] G. Chen, X. Wu, X. Zhu, Sequential pattern mining in multiple streams, in: Proceedings of the Fifth International Conference on Data Mining, IEEE
Computer Society, 2005, pp. 585–588.
[10] J. Chen, Making clustering in delay-vector space meaningful, Knowledge and Information Systems 11 (3) (2007) 369–385.
[11] S.H. Chen, P.P. Wang, Computational Intelligence in Economics and Finance (Advanced Information Processing), Springer, 2006.
[12] Y. Chen, M.A. Nascimento, B.C. Ooi, A.K.H. Tung, SpADe: On shape-based pattern detection in streaming time series, in: Proceedings of the 23rd
International Conference on Data Engineering, IEEE, 2007, pp. 786–795.
[13] B. Chiu, E. Keogh, S. Lonardi, Probabilistic discovery of time series motifs, in: Proceedings of the Ninth International Conference on Knowledge
Discovery and Data Mining, ACM Press, 2003, pp. 493–498.
[14] G. Das, K.I. Lin, H. Mannila, G. Renganathan, P. Smyth, Rule discovery from time series, in: Proceedings of the Fourth International Conference on
Knowledge Discovery and Data Mining, AAAI Press, 1998, pp. 16–22.
[15] S. de Amo, D. Furtado, First-order temporal pattern mining with regular expression constraints, Data and Knowledge Engineering 62 (2007) 401–420.
[16] A. Denton, Kernel-density-based clustering of time series subsequences using a continuous random-walk noise model, in: Proceedings of the Fifth
International Conference on Data Mining, IEEE Computer Society, 2005, pp. 122–129.
[17] D.H. Dorr, A.M. Denton, Clustering sequences by overlap, International Journal of Data Mining and Bioinformatics, in press.
[18] M.M. Gaber, A. Zaslavsky, S. Krishnaswamy, Mining data streams: a review, ACM SIGMOD Record 34 (2) (2005) 18–26.
[19] M. Gavrilov, D. Anguelov, P. Indyk, R. Motwani, Mining the stock market: which measure is best?, in: Proceedings of the Sixth International Conference
on Knowledge Discovery and Data Mining, ACM Press, 2000, pp 487–496.
[20] J. Gouzy, F. Corpet, D. Kahn, Whole genome protein domain analysis using a new method for domain clustering, Computers & Chemistry 23 (3–4)
(1999) 333–340.
[21] D. Hand, Data mining: statistics and more?, The American Statistician 52 (2) (1998) 112–118
[22] K.L. Jensen, M.P. Styczynski, I. Rigoutsos, G.N. Stephanopoulos, A generic motif discovery algorithm for sequential data, Bioinformatics 22 (1) (2006)
21–28.
[23] H. Kargupta, B.H. Park, S. Pittie, L. Liu, D. Kushraj, K. Sarkar, MobiMine: monitoring the stock market from a PDA, ACM SIGKDD Explorations Newsletter
3 (2) (2002) 37–46.

[24] E. Keogh, J. Lin, Clustering of time-series subsequences is meaningless: implications for previous and future research, in: Proceedings of the Third
International Conference on Data Mining, IEEE Computer Society, 2003, pp. 115–122.
[25] B. Kovalerchuk, E. Vityaev, Data Mining in Finance: Advances in Relational and Hybrid Methods, Kluwer Academic Publishers, 2000.
[26] V. Lavrenko, M. Schmill, D. Lawrie, P. Ogilvie, D. Jensen, J. Allan, Mining of concurrent text and time-series, in: Proceedings of the Sixth International
Conference on Knowledge Discovery and Data Mining, Workshop on Text Mining, ACM Press, 2000, pp. 37–44.
[27] C.H. Lee, A. Liu, W.S. Chen, Pattern discovery of fuzzy time series for financial prediction, IEEE Transactions on Knowledge and Data Engineering 18 (5)
(2006) 613–625.
[28] X. Lian, L. Chen, J.X. Yu, G. Wang, G. Yu, Similarity match over high speed time-series streams, in: Proceedings of the 23rd International Conference on
Data Engineering, IEEE, 2007, pp. 1086–1095.
[29] J. Lin, E. Keogh, S. Lonardi, B. Chiu, A symbolic representation of time series, with implications for streaming algorithms, in: Proceedings of the Eighth
SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery, ACM Press, 2003, pp. 2–11.
[30] M.A. Mittermayer, G. Knolmayer, News CATS: a news categorization and trading system, in: Proceedings of the Sixth International Conference on Data
Mining, IEEE Computer Society, 2006, pp. 1002–1007.
[31] F. Mörchen, Unsupervised pattern mining from symbolic temporal data, SIGKDD Explorations 9 (1) (2007) 41–55.
[32] F. Mörchen, A. Ultsch, Optimizing time series discretization for knowledge discovery, in: Proceedings of the 11th International Conference on
Knowledge Discovery and Data Mining, ACM Press, 2005, pp. 660–665.
[33] F. Mörchen, A. Ultsch, Efficient mining of understandable patterns from multivariate interval time series, Data Mining and Knowledge Discovery 15 (2)
(2007) 181–215.
[34] J. Park, S.A. Teichmann, DIVCLUS: an automatic method in the GEANFAMMER package that finds homologous domains in single- and multi-domain
proteins, Bioinformatics 14 (2) (1998) 144–150.
[35] W.R. Pearson, D.J. Lipman, Improved tools for biological sequence comparison, Proceedings of the National Academy of Sciences of the United States of
America 85 (8) (1988) 2444–2448.
[36] J. Pei, J. Han, B. Mortazavi-Asl, H. Pinto, Q. Chen, U. Dayal, M.C. Hsu, PrefixSpan: Mining sequential patterns efficiently by prefix projected pattern
growth, in: Proceedings of the 17th International Conference on Data Engineering, IEEE Computer Society, 2001, pp. 215–226.
[37] M. Priestley, Non-linear and Non-stationary Time Series Analysis, Academic Press, 1988.
[38] C.M. Queen, B.J. Wright, C.J. Albers, Eliciting a directed acyclic graph for a multivariate time series of vehicle counts in a traffic network, Australian &
New Zealand Journal of Statistics 49 (3) (2007) 221–239.
[39] I. Rigoutsos, A. Floratos, Combinatorial pattern discovery in biological sequences: the TEIRESIAS algorithm, Bioinformatics 14 (1) (1998) 55–67.
[40] J.F. Roddick, M. Spiliopoulou, A survey of temporal knowledge discovery paradigms and methods, IEEE Transactions on Knowledge and Data
Engineering 14 (4) (2002) 750–767.
[41] G. Schwert, Anomalies and Market Efficiency, first ed., vol. 1, Elsevier, 2003 (chapter 15).
[42] E. Sonnhammer, D. Kahn, Modular arrangement of proteins as inferred from analysis of homology, Protein Science 3 (3) (1994) 482–492.
[43] H. Teoh, C.H. Cheng, H.H. Chu, J.S. Chen, Fuzzy time series model based on probabilistic approach and rough set rule induction for empirical research in
stock markets, Data and Knowledge Engineering 67 (2008) 103–117.
[44] R. Villafane, K.A. Hua, D. Tran, B. Maulik, Mining interval time series, in: Proceedings of the First International Conference on Data Warehousing and
Knowledge Discovery, Springer, 1999, pp. 318–330.
[45] M. Vlachos, D. Gunopoulos, G. Kollios, Discovering similar multidimensional trajectories, in: Proceedings of the 18th International Conference on Data
Engineering, IEEE Computer Society, 2002, pp. 673–684.
[46] Y.F. Wang, On-demand forecasting of stock prices using a real-time predictor, IEEE Transactions on Knowledge and Data Engineering 15 (4) (2003)
1033–1037.
[47] H. Wu, B. Salzberg, D. Zhang, Online event-driven subsequence matching over financial data streams, in: Proceedings of the International Conference
on Management of Data, ACM Press, 2004, pp. 23–34.
[48] B. Wuthrich, V. Cho, S. Leung, D. Permunetilleke, K. Sankaran, J. Zhang, Daily stock market forecast from textual web data, in: Proceedings of the
International Conference on Systems, Man, and Cybernetics, vol. 3, IEEE, 1998, pp. 2720–2725.
[49] T. Yu, T. Jan, J. Debenham, S. Simoff, Classify unexpected news impacts to stock price by incorporating time series analysis into support vector machine,
in: Proceedings of the International Joint Conference on Neural Networks, IEEE, 2006, pp. 2993–2998.
[50] M.J. Zaki, SPADE: an efficient algorithm for mining frequent sequences, Machine Learning 42 (1–2) (2001) 31–60.
[51] Y. Zhang, M.J. Zaki, ExMotif: efficient structured motif extraction, Algorithms for Molecular Biology 1 (21) (2006).
[52] Y. Zhang, M.J. Zaki, SMOTIF: efficient structured pattern and profile motif search, Algorithms for Molecular Biology 1 (22) (2006).

Dietmar Dorr is a Ph.D. student in Computer Science at North Dakota State University (NDSU). He received his M.S. in
Software Engineering at the University of St. Thomas, St. Paul, MN, USA. Recently, Dietmar joined the Research & Development
team at Thomson Reuters. His research interests include data mining, information retrieval, and natural language processing.

Anne Denton is Assistant Professor in the Computer Science Department at North Dakota State University (NDSU). She received
her Ph.D. in Physics from the University of Mainz, Germany, in 1996, and an M.S. in Computer Science from NDSU in 2003. Her
research interests center on data mining of diverse data, including time series, sequence-, graph-, vector- and item data. She
serves on the editorial board of the Biomed Central journal Source Code for Biology and Medicine.
