Ts Art 1

Upps al a univ ersitets l ogotyp
UPTEC F 23021
Examensarbete 30 hp
Juni 2023
Evaluating clustering
techniques in financial time
series
Johan Millberg Civilingenj örspr ogrammet i tek nisk fysik
Civilingenjörsprogrammet i teknisk fysik

Upps al a univ ersitets l ogotyp
Evaluating clustering techniques in financial time series

Johan Millberg
Abstract
This degree project aims to investigate different evaluation strategies for clustering methods
used to cluster multivariate financial time series. Clustering is a type of data mining technique
with the purpose of partitioning a data set based on similarity to data points in the same cluster,
and dissimilarity to data points in other clusters. By clustering the time series of mutual fund
returns, it is possible to help individuals select funds matching their current goals and portfolio. It
is also possible to identify outliers. These outliers could be mutual funds that have not been
classified accurately by the fund manager, or potentially fraudulent practices.
To determine which clustering method is the most appropriate for the current data set it is
important to be able to evaluate different techniques. Using robust evaluation methods can
assist in choosing the parameters to ensure optimal performance. The evaluation techniques
investigated are conventional internal validation measures, stability measures, visualization
methods, and evaluation using domain knowledge about the data. The conventional internal
validation methods and stability measures were used to perform model selection to find viable
clustering method candidates. These results were then evaluated using visualization techniques
as well as qualitative analysis of the result. Conventional internal validation measures tested
might not be appropriate for model selection of the clustering methods, distance metrics, or data
sets tested. The results often contradicted one another or suggested trivial clustering solutions,
where the number of clusters is either 1 or equal to the number of data points in the data sets.
Similarly, a stability validation metric called the stability index typically favored clustering results
containing as few clusters as possible. The only method used for model selection that
consistently suggested clustering algorithms producing nontrivial solutions was the CLOSE
score. The CLOSE score was specifically developed to evaluate clusters of time series by
taking both stability in time and the quality of the clusters into account.
We use cluster visualizations to show the clusters. Scatter plots were produced by applying
different methods of dimension reduction to the data, Principal Component Analysis (PCA) and
t-Distributed Stochastic Neighbor Embedding (t-SNE). Additionally, we use cluster evolution
plots to display how the clusters evolve as different parts of the time series are used to perform
the clustering thus emphasizing the temporal aspect of time series clustering. Finally, the results
indicate that a manual qualitative analysis of the clustering results is necessary to finely tune the
candidate clustering methods. Performing this analysis highlights flaws of the other validation
methods, as well as allows the user to select the best method out of a few candidates based on
the use case and the reason for performing the clustering.
Tek nisk-naturvetensk apliga fak ulteten, Upps ala universitet. Utgiv nings ort U pps al a. H andl edare: Erik Br odi n, Ämnesgransk are: Antônio Horta Ri beir o, Ex aminator: T omas Ny berg
Teknisk-naturvetenskapliga fakulteten
Uppsala universitet, Utgivningsort Uppsala
Handledare: Erik Brodin Ämnesgranskare: Antônio Horta Ribeiro

Examinator: Tomas Nyberg
Acknowledgements
I would like to express my deepest gratitude to my supervisor Erik Brodin for his
support and guidance throughout the course of this thesis. Our discussions helped
me immensely, both with my research and the management of the project itself. I
would also like to thank the subject reviewer of this thesis, Antônio Horta Ribeiro. His
insights and feedback has helped me significantly with the refinement of the thesis as
well as development of ideas. Finally, I want to thank my family including my wonderful
girlfriend Emilia.
3
Populärvetenskaplig sammanfattning
I denna rapport behandlades olika metoder för att utvärdera klustringsalgoritmer som
applicerades på dataset av finansiella tidsserier. Klustring är en typ av oövervakat
lärande, som är en klass av maskininlärning där datapunkterna i dataseten inte har
några förbestämda etiketter, utan uppgiften istället går ut på att hitta strukturer i
datasetet. Klustring innebär att datapunkterna delas upp i olika grupper eller kluster,
beroende på någon form av avståndsmetrik som används för att definiera hur lika eller
olika två datapunkter är.
Finansiella tidsserier är en typ av data som beskriver exempelvis hur exempelvis en

fond eller en akties pris eller avkastning varierar över tid. Genom att exempelvis klustra
dataset av fonders avkastning med hjälp av ett korrelationsmått som avståndsmetrik kan
kluster av fonder som tenderar att följa samma trender identifieras. Genom att välja
fonder från olika kluster till sin fondportfölj kan individen i teorin öka sin diversifiering,
eftersom fonderna i olika kluster tenderar att inte följa samma mönster i samma grad
som om endast fonder från ett kluster hade valts.
För att välja den optimala klustringsmetoden behövs evalueringsmetoder som kan hjälpa
användaren att avgöra vilken metod som ger optimala kluster. De metoder som behand-
lades i denna rapport är konventionella interna valideringsmetoder, stabilitetsbaserade
metoder, samt visualiseringsmetoder. De konventionella interna valideringsmetoderna
består av ett antal metriker vars gemensamma faktor är att metrikerna endast är baser-
ade på klustringsresultaten, det vill säga distanserna mellan de olika datapunkterna
samt vilket kluster de har blivit indelade i. Stabilitetsbaserade metoder värderar hur
stabila resultaten av en klustringsalgoritm är, antingen över tid eller när ny data in-
troduceras. Visualiseringsmetoder är som namnet antyder metoder för att visualisera
klustringsresultatet. I denna rapport presenterades både metoder för att visualisera
klustren i sig, samt en metod som visar hur klustren förändras över tid när olika delar
av tidsserierna klustras. Slutligen så analyserades klustringsresultaten kvalitativt i syfte
att tolka resultaten samt utvärdera de andra evalueringsmetoderna.
Experimenten genomfördes genom att utvärdera olika klustringsresultat med de tidi-

gare nämnda evalueringsmetoderna. Resultaten indikerade att de konventionella in-
terna valideringsmetoderna var otillräckliga för att välja den bästa klustringsmetoden för
majoriteten av klustringsmetoderna, då metrikerna antydde att de bästa klustringsme-
toderna var de som resulterade i antingen ett kluster, eller lika många kluster som
det fanns datapunkter i datasetet. En av de stabilitetsbaserade metoderna kallas för
CLOSE. Denna metod värderar både stabilitet över tid för klustringsresultaten samt
kvaliteten av klustren. Klustringsmetoder som fick högst CLOSE bedömdes fånga mer
komplicerade mönster i datan. Av den anledningen ansågs denna metod vara den som
var bäst lämpad för kvantitativ klustringsvärdering av de metoder som testades. Att en
klustringsmetod är stabil i tiden och resulterar i ungefär samma kluster oavsett vilken
tidsperiod som används för att skapa klustrena tyder på en pålitlighet hos metoden. Det
observerades dock att det sker en avvägning mellan klusterkvalitet och stabilitet. Detta
eftersom en samling större kluster tenderar att vara mer stabila över tid, men leder till
att kvaliteten av klustren blir sämre. Kvaliteten av ett kluster mättes genom att an-
4
vända avstånden som datapunkterna i ett kluster hade till den teoretiska mittpunkten
av klustret, och i takt med att ett kluster blev större och innefattade fler datapunkter
så ökade även detta avstånd. Därmed minskade kvaliteten.
Den mest informativa visualiseringsmetoden kallas för klusterevolutions-diagram. Detta

diagram visade hur klustrena förändras över tid när olika delar av tidsserierna klustras.
På det sättet kunde metodens stabilitet visualiseras, då det blev möjligt att se om
klustrenas storlek var ungefär konstant eller om de splittrades eller sattes ihop med andra
kluster. Slutligen så drogs slutsatsen att en kvalitativ analys av klustringsresultaten är
nödvändig för att säkerställa att resultaten är användningsbara, samt för att utvärdera
hur effektiva de andra evalueringsmetoderna varit. Denna analys genomfördes genom att
klustren och dess innehåll granskades, och de klustrade finansiella instrumentens likheter
och skillnader observerades. Genom att kvalitativt utvärdera resultaten kan den som
utför analysen dessutom avgöra om stabilitet över tid eller klusterkvalitet är viktigast
för slutresultatet, och använda den klustringsmetod som bäst uppfyller önskemålen.
5
Contents
1 Introduction 8
2 Background 9
3 Theory 10
3.1 Financial time series . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
3.1.1 Stylized facts of financial time series . . . . . . . . . . . . . . . . 10
3.2 Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
3.2.1 Clustering algorithms . . . . . . . . . . . . . . . . . . . . . . . . . 11
3.3 Measures of similarity . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
3.3.1 Hellinger distance . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
3.3.2 Kendall’s tau . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
3.4 Robustness and stability . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
4 Data 15
4.1 Mutual funds . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
4.2 Stocks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
4.3 Data frequency and length . . . . . . . . . . . . . . . . . . . . . . . . . . 17
5 Methodology 17
5.1 Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
5.2 Evaluation methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
5.2.1 Qualitative evaluation . . . . . . . . . . . . . . . . . . . . . . . . 18
5.2.2 Internal clustering evaluation . . . . . . . . . . . . . . . . . . . . 22
5.2.3 Measures of stability in time . . . . . . . . . . . . . . . . . . . . . 25
5.2.4 Replication stability evaluation . . . . . . . . . . . . . . . . . . . 28
5.3 Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
5.3.1 Benchmarking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
5.3.2 Bootstrapping . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
5.3.3 Pre-evaluation processing of cluster results . . . . . . . . . . . . . 32
5.3.4 Reference methods based on a priori knowledge . . . . . . . . . . 33
5.3.5 Clustering experiments . . . . . . . . . . . . . . . . . . . . . . . . 33
6 Results 35
6.1 Model selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
6.1.1 Agglomerative clustering using Kendall’s tau . . . . . . . . . . . . 36
6.1.2 Agglomerative clustering using the Hellinger distance . . . . . . . 39
6.1.3 Hybrid hierarchical clustering using Kendall’s tau . . . . . . . . . 42
6.1.4 Hybrid hierarchical clustering using the Hellinger distance . . . . 45
6.1.5 Summary of model selection results . . . . . . . . . . . . . . . . . 48
6.2 Results of reference clustering methods . . . . . . . . . . . . . . . . . . . 49
6.2.1 Mutual fund data set . . . . . . . . . . . . . . . . . . . . . . . . . 49
6.2.2 Stock data set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
6.3 Results of tuned methods . . . . . . . . . . . . . . . . . . . . . . . . . . 50
6
6.3.5 Summary and comparison to reference methods . . . . . . . . . . 52
6.3.6 Selected clustering methods for mutual fund data . . . . . . . . . 52
6.3.7 Selected clustering methods for stock data . . . . . . . . . . . . . 52
6.4 Visualization of clustering results . . . . . . . . . . . . . . . . . . . . . . 53
6.4.1 Scatter plots created using dimension reduction . . . . . . . . . . 53
6.4.2 Cluster evolution plots . . . . . . . . . . . . . . . . . . . . . . . . 55
6.5 Qualitative analysis of clustering results . . . . . . . . . . . . . . . . . . 58
6.5.1 Comparing the clustering results . . . . . . . . . . . . . . . . . . 61
7 Discussion 61
7.1 The clustering results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
7.2 Quantitative evaluation methods . . . . . . . . . . . . . . . . . . . . . . 62
7.2.1 Conventional internal validation measures . . . . . . . . . . . . . 62
7.2.2 CLOSE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
7.2.3 Stability index . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
7.3 Cluster visualization methods . . . . . . . . . . . . . . . . . . . . . . . . 64
7.3.1 Scatter plots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
7.3.2 Cluster evolution plots . . . . . . . . . . . . . . . . . . . . . . . . 65
7.4 Qualitative analysis using domain knowledge . . . . . . . . . . . . . . . . 66
7.5 Limitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
7.6 Future work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
8 Conclusions 68
9 References 71
10 Appendix A 74
10.1 Internal validation measures of bootstrapped time series . . . . . . . . . 74
11 Appendix B 80
11.1 Process description for performing clustering of financial time series . . . 80
11.1.1 Data inspection . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
11.1.2 Model selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
7
1 Introduction
Clustering is a form of unsupervised machine learning used for partitioning data into
groups. This can help identify patterns in the data previously not observed by analysts.
The manner in which the data is partitioned is heavily dependent on the data and which
features are chosen to be used when clustering. The choice of features depends on the
reasons why the clustering is carried out in the first place, and will lead to different
classifications of the data.
One special application of clustering techniques is clustering of time series, and partic-
ularly financial time series. A general time series is simply data measured over a period
of time, in evenly spaced intervals. Financial time series can describe a multitude of
different measured quantities, such as the price of fund or stock over a period of time.
There are a number of properties that empirical financial time series display, and these
properties are referred to as the stylized facts of financial time series [1]. When perform-
ing analysis financial time series, it is important to keep these properties in mind since
these properties are some of the unique concepts of financial time series.
Due to the nature of time series the data has many dimensions, since each observation in
the series corresponds to one unique dimension. One of the main challenges of clustering
financial time series is therefore dimension reduction when selecting features to cluster
on. One can for example choose to calculate the similarity between the financial time
series in the data set, and use these distances to cluster time series that are close to each
other in the same partition. Another approach is to use basic statistic metrics such as
mean, variance, or skew of the distribution of returns of the time series.
The choice of clustering algorithm is another factor that will heavily affect the result.
Many clustering algorithms rely on some kind of defined similarity between the data
points to perform the clustering, so which metric that will be used to calculate the sim-
ilarity is another choice that needs to be made.
Due to the options previously mentioned as well as many more, the choice of clustering
method is not straight forward. In order to enable data driven decision making when de-
signing clustering algorithms, it is therefore important to be able to evaluate clustering
techniques to see which design choices result in the best partitioning of the data. Bear
in mind that it is not always trivial to define which clustering result is optimal, and this
is the case when clustering financial time series. The evaluation of the clustering results
will in the cases where a known cluster solution is not available rely on the properties of
the clusters themselves, such as intra-cluster distance between data, distance to other
clusters, or stability.
The main objectives of this thesis will be determining methods and metrics that can
be used to evaluate clustering results, as well as how these evaluation methods can
measure "robustness" of a clustering algorithm in order to make the results of the al-
gorithm as robust as possible. Since financial time series by definition vary over time,
the clustering result of an algorithm likely varies depending on the time period used to
perform the clustering. Examining how these results change will help determining ro-
8
bustness, and can give insight into how to develop better, robuster clustering techniques.
Our goal is to investigate how clustering techniques applied to financial time series can
be evaluated. Our specific objectives are to:
1. Determine metrics for evaluating how well a clustering algorithm performs.
2. Investigate robustness of clustering results of financial time series
3. Using the metrics defined, compare the results of different clustering algorithms.
A framework for classifying financial time series using different clustering algorithms has
been implemented by Kidbrooke® , and it is a subset of these algorithms that will be
evaluated using the defined metrics. Kidbrooke® is a Swedish company providing finan-
cial software for financial institutions [2]. This master thesis is written at Kidbrooke® in
order to investigate how their clustering methods can be evaluated, so that they in turn
can use the methods to find new, better clustering methods to use in their applications.
2 Background
During recent years, the amount of available tools for planning and structuring the per-
sonal finances of individuals have seen a steady increase. The digitalization of these tools
have been enabled by an increased use of mathematics, statistics and machine learning.
An example of a use case of these tools is to help individuals not only choose mutual
funds to invest in, but also over time evaluate if the choice of funds is still good.
Clustering algorithms can be used in order to classify a large number of mutual funds.
The resulting clusters can then be used in order to recommend mutual funds to cus-
tomers. If a customer owns a number of mutual funds within a cluster, better funds
within that same cluster can be recommended, similarly to how recommendation algo-
rithms in other applications operate.
There exists a multitude of different clustering algorithms, and an infinite amount of

different combinations of hyperparameters for each algorithm. Thus, it is important to
be able to evaluate the different clustering algorithms that are used in order to determine
which one that succeeds best. Clearly defined metrics for evaluating clusters of financial
time series need to exist in order to facilitate this task.
Additionally, due to the time dependent nature of financial time series, the resulting
clusters of each clustering algorithm may change over time. It is therefore of interest
to define a metric of robustness for clustering methods used on financial time series.
This enables the investigation of how time series are moved between clusters when the
clustering algorithms are applied to different time periods.
9
3 Theory
3.1 Financial time series
Time series are defined as sequences of ordered data T = ⟨t1 , t2 , ..., td ⟩, where T ∈ Rd
and d ∈ N+ . A multivariate time series is a set D = {T1 , T2 , ..., Tn } of n ∈ N+ time
series [3].
3.1.1 Stylized facts of financial time series

Financial time series such as stock prices and exchange rates are a significant area of
study in economics and statistics. Financial time series tend to display certain statistical
properties, commonly known as the stylized facts of financial time series. In "Empiri-
cal properties of asset returns: stylized facts and statistical issues" by Rama Cont [1],
a number of stylized facts are described. Since stylized facts are properties that are
shared by many different financial time series of a wide range of different financial in-
struments, they are determined by finding common patterns in the data. Since many
different financial time series are considered, the generality of the facts is increased. At
the same time, the gain in generality results in a loss of precision [1]. For example if
the only type of financial time series investigated would have been returns of companies
operating within the same industry, additional stylized facts applicable to the same type
of financial time series could be found. Being aware of the different properties of the
data can explain why certain methods do or do not work well.
A description of each stylized fact as specified by Cont follows:
• Autocorrelations of returns are usually not significant, except for when examining
shorter periods of time during intraday trading.
• The distribution of the returns is heavy tailed, meaning that extreme events are
more likely compared to if the returns had been normally distributed. If that
would have been the case, events with large negative or positive returns would
have been more unlikely.
• There exists an asymmetry when it comes to gains and losses. In empirical financial
time series data, large decreases in stock price can be observed while increases in
the same range are significantly less common.
• As the time period over which the returns are calculated increases, the distribution
of returns start to more and more resemble a Gaussian distribution.
• Returns of financial time series display a significant amount of irregularity. This

phenomenon is quantified by large irregular changes of a number of different volatil-
ity estimators thoughout the time series.
• Estimators of volatility display a positive autocorrelation spanning multiple days.

This suggests that large changes in prices are usually followed by more large
changes in the near future, and that low volatility events also tend to follow each
other. This is known as volatility clustering.
10
• Despite correcting returns to account for volatility clustering, the distribution of
returns of the time series still display heavy tails. These tails are however less
heavy than in the distribution of the uncorrected returns.
• The autocorrelation function of the absolute value of the returns decay slowly.
This means that the correlation between a data point and the previous data points
decreases more and more for data points further back in time.
• The correlation between the majority of volatility measures and the returns of an
asset is negative. This is called the leverage effect.
• The trading volume of a financial instrument is correlated with the volatility.
• Measures of volatility that are calculated using coarser data is able to more ac-
curately predict measures of volatility calculated using finer data than the other
way around [1].
3.2 Clustering
The task of partitioning a data set into groups is called clustering. This partitioning is
carried out in such a way that the data points within the same cluster are similar to
each other in some way. Data points that are different from one another are partitioned
into separate clusters. The methods used when carrying out the partitioning of data
are called clustering algorithms. Clustering is a form of unsupervised learning, and a
useful technique for discovering patterns and common traits of unlabeled data points [4].
One application of clustering algorithms is partitioning financial time series. More for-
mally, when clustering time series data the goal is to partition multivariate time series
D = {T1 , T2 , ..., Tn }, n ∈ N+ into clusters C = {C1 , C2 , ..., Cm }, where Ci ⊆ D and
1 ≤ m ≤ n such that the time series in D are clustered together based on some metric
of similarity [3]. There exists a multitude of different clustering algorithms, as well as
different ways to measure similarity between data points of different kinds of data. Some
clustering algorithms perform clustering using the raw time series, while others utilize
some reduction method to reduce the dimensionality of the data [4].
In this section, the clustering methods evaluated in this report will be presented. All
time series clustering algorithms discussed are based on a pair-wise distance matrix of
time series. The distance measures between time series used will for this reason also be
presented here.
3.2.1 Clustering algorithms

Hierarchical clustering
Hierachical clustering is a subset of clustering algorithms that include agglomerative and
divisive clustering. An agglomerative clustering algorithm starts by dividing each data
point into a separate cluster, such that the total amount of clusters is the same as the
amount of data points. The algorithm then recursively merges the clusters based on
the distances between them [4]. Clusters are merged until a certain stop condition is
met. The way that the algorithm determines which clusters are merged is called link-
age. One example of a linkage criterion is average linkage, that calculates the distances
11
between each data point in the two clusters and uses the average of these. Clusters are
then merged such that this average distance between clusters is minimized. Another
example is single linkage, which minimizes the minimum value of the distance between
data points in separate clusters [5]. An agglomerative clustering algorithm will be one of
the clustering methods evaluated in this degree project, and the stop condition for this
method will be a predetermined number of clusters. When the algorithm has reached
a certain number of clusters, the partitioning task is considered finished. When per-
forming divisive clustering the starting number of clusters is one, in stark contrast to
agglomerative clustering. This initial cluster contains all data points in the data set.
The cluster is then partitioned recursively into more and more clusters until the stopping
condition is met [4].
Another type of hierarchical clustering that will be evaluated is a method that will
be referred to as hybrid hierarchical clustering. This method combines elements from
agglomerative and divisive clustering algorithms, and incorporates a stopping criterion
that is based on the maximum distance within the clusters. An elaborate description of
this algorithm follows:
1. Begin by using agglomerative clustering to divide the entire data set into two
clusters. This step is similar to the first step of a divisive clustering algorithm,
where the entire data set is split in two.
2. For both clusters that were obtained in the previous step, check whether the max-
imum distance between the data points within the cluster exceeds the predefined
maximum distance threshold. If it does, this means that this cluster requires
further partitioning.
3. For the clusters whose internal distances exceed the maximum distance threshold,
an agglomerative clustering algorithm is used to again split the cluster into two
new clusters.
4. This process is repeated recursively until no clusters in the entire set of clusters
have members whose distance between each other exceed the predetermined max-
imum distance threshold. Once this condition has been fulfilled, the clustering
task is considered finished.
In this degree project, the hierarchical clustering methods are implemented using aver-
age linkage for scope limiting reasons. Additionally, as opposed to the Ward linkage this
linkage is available for an arbitrary distance metric in scikit-learn’s implementation of
agglomerative clustering [6].
Sequential clustering
Kidbrooke® uses a clustering technique called sequential clustering. Sequential cluster-
ing uses multiple clustering methods to partition the data recursively. When sequentially
clustering a data set D, the first clustering method is applied to the data set, which
leads to the data being partitioned into clusters C = {C1 , C2 , ..., Cm }, where m is the
total amount of clusters after the first clustering round. In the next clustering round,
the next clustering algorithm is applied to each cluster Ci in order to partition the
data further. This methodology creates a flexibility, since many clustering algorithms
12
can be combined in different ways, yielding different results. This sequential clustering
algorithm will be used as a reference method when experiments are performed. Assum-
ing that the sequential algorithm consists of two hierarchical clustering methods this
method will have two parameters, one for each clustering step. In order to find the
optimal parameters to use when clustering the data sets, the clustering result that the
algorithm produces is evaluated for a range of different parameter values. When the
number of parameters two rather than one, the number of experiments that need to be
evaluated increase drastically. Since the main focus of this thesis is evaluation methods
rather than the clustering methods themselves, the sequential clustering method will
be excluded from the majority of the experiments, and will only be used as a reference
method when comparing clusterings of mutual funds.
3.3 Measures of similarity

In this section, the different metrics used to calculate similarity between time series are
described. By calculating the similarity between each time series in the data set allows
the creation of a distance matrix, which will be used in the clustering algorithms in
order to partition the data.
3.3.1 Hellinger distance

The Hellinger distance is a metric used to measure the similarity between two probability
distributions P = (p1 , ..., pk ) and Q = (q1 , ..., qk ). If the distributions are discrete over
some set Ω = {1, ..., k}, the Hellinger distance is defined as follows [7]:
v
u k
u1 X √ √ 1 √
(1)
p
dH (P, Q) = t ( pi − qi )2 = √ ∥ P − Q∥2 .
2 i=1 2
In the context of financial time series, P and Q are the distributions of the returns
of two time series. In order to find the distributions, the probability density function
of the returns are calculated by determining the histogram bins of each time series’
returns. In order to calculate the histogram bin the returns are divided into a number
of sub-intervals, and the number of return values that fall into each bin (sub-interval)
are counted. The probability density function is then defined for each bin by dividing
the number of values in each bin with the total number of return values. Since the
Hellinger distance is a measure of similarity between distributions, using the distance
when clustering could enable the separation of different types of mutual funds. Mutual
funds of different kinds have different objectives and strategies, and the choices made
regarding these factors will affect the distribution of the returns. For example, the
distribution of the returns of equity mutual funds may have heavier tails than fixed-
income mutual funds that invest in bonds. This difference in distribution will create
a distance between the two time series, and it would be possible to divide them into
separate partitions using a clustering algorithm.
3.3.2 Kendall’s tau

Kendall’s tau is used as a measure of correlation between two rankings, X and Y . In
order to calculate Kendall’s tau, first let {(X1 , Y1 ), ..., (Xn , Yn )} be a collection of pairs
13
of data from the two rankings. Let one pair of observation pairs be (Xi , Yi ) and (Xj , Yj ).
This pair of observations is classified as concordant if Xj − Xi and Yj − Yi has the same
sign. If this is not the case for the pair, they are discordant instead [8].
The algorithms used for generating the distance matrix between time series based on
Kendall’s tau uses a specific version of the metric, called Tau-b. This statistic is imple-
mented in the scientific computation library SciPy [9], and accounts for the case where
Xi = Xj or Yi = Yj . The Tau-b metric is calculated as follows:
C −D
τB = p , (2)
(C + D + T )(C + D + U )
where C is the total amount of concordant observation pairs and D is the amount of
discordant pairs. T is the total number of tied values present in the first ranking, and
U is the number of tied values in the second ranking [10]. When using Kendall’s tau
to measure correlation between two financial time series, Xi and Yi in each observation
pair is the returns of each time series at the index i.
Kendall’s tau is bounded to be in the range −1 ≤ τB ≤ 1, where a value of -1 indicates

a perfect disagreement and a value of 1 indicates a perfect agreement between the series
being compared [8]. In order to utilize Kendall’s tau as a distance between two time
series T1 and T2 , the following calculations are made:
1 − τB (T1 , T2 )
dKendall (T1 , T2 ) = . (3)
2
Kendall’s tau distance dKendall (T1 , T2 ) is then bounded in the range 0 ≤ dKendall (T1 , T2 ) ≤
1, where a distance of 0 indicates perfect agreement and a distance of 1 indicates perfect
inversion, or a total disagreement. Due to the fact that Kendall’s tau is a measure of
correlation, it can help identifying the relationships between financial time series in a
data set. Partitioning financial time series using Kendall’s tau can facilitate the creation
of a more diversified portfolio, since investing in assets that have a low correlation with
each other can help minimize the risk of the portfolio. By partitioning according to
Kendall’s tau, clusters of mutual funds that have a high correlation between one another
can be identified. One can then create a portfolio containing assets from different clusters
in order to reduce the overall risk.
3.4 Robustness and stability

One of the main goals of this degree project is to investigate the robustness of clustering
methods applied to financial time series. In this context, robustness will be defined as
the level of stability of a clustering result as a clustering method is applied to different
time periods of time series data. A clustering method with a higher level of robustness
will partition the data into similar or even identical clusters for all time intervals tested.
On the other hand, a clustering method that is not deemed as robust will produce clus-
ter results that vary at a greater rate.
14
Additionally, how appropriate a clustering method is for clustering a given data set can
be evaluated by examining another type of clustering stability. This stability is based
on the similarity between the clustering results from clustering methods applied to two
different data sets of the same characteristics. More formally, the two data sets that
are clustered are assumed to contain data drawn from the same probability distribution
[11]. A measure of this type of stability, henceforth referred to as replication stability
is discussed in Section 5.2.4. As previously described in Section 3.1.1 regarding the
stylized facts of financial time series, the distribution of returns is typically heavy tailed
to a varying degree. The return distributions of the time series in the data sets will
likely have assume a range of different shapes depending on the historical events of the
underlying asset the time series is describing. When examining this type of stability it is
therefore important to be mindful of the properties of the two data sets whose clustering
results are being compared, and ensure that the data consists similar types of financial
time series.
4 Data
In this section, a description of the data that will be clustered will be provided. In order
to evaluate the performance of the clustering algorithms, different kinds of financial
time series will be used. The purpose of this is to evaluate how the different clustering
methods perform on a wider variety of time series data. While the main purpose of this
report is evaluating the clustering of mutual funds, mixing the types of financial time
series used in the data set allows the performance to be evaluated for different scenarios.
The meta data of the different kinds of financial time series will also be utilized in
the evaluation of clustering methods. Having access to meta data enables additional
analysis of the clustering result. For example, it is possible to investigate how well the
clustering techniques can partition the financial time series into clusters based on the
specific region of the financial time series. This facilitates the detection of outliers, and
poor classification of the fund made by the fund manager for example. Since the relevant
meta data available for financial time series differs between the kind of underlying asset
that the time series represents, the meta data used for each data type is also presented
in this section. The returns of the financial time series are adjusted to have a mean
of 0, allowing for a better comparison between the different time series based on their
patterns and fluctuations rather than a mean value.
4.1 Mutual funds

A mutual fund is a type of financial instrument managed by a professional investor. It
consists of a collection of financial assets. The type of financial assets in the portfolio
differ greatly between funds, and are often based on the strategy or objective specified
for the mutual fund. For example, mutual funds may be focused on a specific region or
business sector. Individuals can buy shares of the instrument, and by doing so accept
that the money committed will be invested according to the goals of the mutual fund by
the manager [12]. Since mutual funds are made up of a collection of different assets, the
volatility of mutual funds is typically smaller than the volatility of the price of a single
stock. The financial time series of mutual funds that are clustered describe the price of
one share.
15
The mutual funds that will be clustered are a subset of those available on the Swedish
market. By ensuring that a diverse collection from different industries and regions are
clustered, it is possible to analyze how well these regional classifications seem to fit the
fund, for example. The meta data available for financial time series describing mutual
funds is presented in the following table. It will be used to perform a reference clustering
method as well as facilitate qualitative analysis of the clusters since the data may help
explain similarities between funds. The meta data may for example help the qualitative
analysis of the cluster results. For example, one can look at the region of each fund in
one cluster and determine that this cluster contains most of the Swedish funds in the
data set.
Data headers
Ticker
ISIN
Yearly fee
Name
Region
Asset class
Category
Currency
Table 1: Mutual fund meta data
4.2 Stocks
As opposed to mutual funds, stocks describe the price of one share of a single company.
Stock prices tend to display a larger volatility and level of unpredictability. There is no
diversification in a single stock, and the price of the stock is impacted by a significant
number of parameters, apart from the performance of the company. The stock price
is ultimately based on the supply and demand and influenced by market sentiment,
financial reports, and news [13]. The portfolio of a mutual fund is affected by these
phenomena as well, but the diversification tends to soften the short term effects.
Including stocks as part of the benchmark data enables the evaluation of clustering
methods on different kinds of financial time series, and provides an insight into the clus-
tering tasks that each clustering algorithm is most fit to perform. It is entirerly possible
that a clustering algorithm struggles to cluster mutual funds, but performs significantly
better when clustering financial time series representing stock prices.
The time series of stocks that will be clustered represent a number of stocks of companies
listed on Nasdaq Stockholm. The reason for choosing these stocks is to investigate
how well the clustering algorithm partitions the time series with respect to variables
unrelated to geographic region. Since the companies listed on Nasdaq Stockholm are
based in Sweden, their main differences may be caused by other factors. The meta data
available for financial time series of stocks is described in the table below, and will be
16
used to perform a reference clustering. Similarly to the mutual funds, the data can aid
in explaining differences and similarities between stocks.
Data headers
Ticker
SEDOL
Name
Industry sector
GICS Industrial name
GICS Sub industry name
Currency
Table 2: Meta data of the stocks used
4.3 Data frequency and length

The frequency of the financial time series that are clustered will be weekly. The reason-
ing behind this choice is that weekly time series data is less noisy than daily data. For
example, the time of the day when the price of the fund is set varies between funds and
contributes to noise. Additionally, previous experience acquired when clustering mutual
funds indicates that certain interest funds are more easily partitioned for weekly data.
Choosing to use monthly time series would possibly decrease the amount of noise in the
time series even further, but would decrease the number of observations available.
The amount of observations in each financial time series will have a strong impact on the
result of the clustering algorithms. A time period of 5 years is chosen for the experiments
in this report. Due to the cyclical nature of the global economy, a relatively short time
frame of five years could possibly capture patterns and trends that would not be present
if the time series extended further back in time. Using a time period of 5 years also
enables the inclusion of funds and stocks which made an entry into the market in more
recent years, since their time series data does not extend as far back in time as funds or
stocks which have been present on the market longer.
5 Methodology
5.1 Clustering
The clustering methods described in Section 3.2.1 will be used to partition data sets
containing financial time series that represent mutual funds and stocks. The clustering
process starts by calculating the pairwise distance between the returns of the financial
time series in the data set.
Let a time series describing the pricing of an asset with n observations be T = {p1 , p2 , ..., pn }.
The returns at observation i is then calculated as:
pi − pi−1
ri = . (4)
pi−1
17
The returns of each asset are calculated and stored as time series of d observations
T = ⟨r1 , r2 , ..., rd ⟩. The return time series of each asset are then used to calculate the
distance between each asset using a metric of similarity, as described in Section 3.3.
This enables the creation of the distance matrix D, that is a M × M matrix where M
is the total amount of time series in the data set.
 
d1,1 d1,2 . . . d1,M
 d2,1 d2,2 . . . d2,M 
D =  .. .. ... ..  .
 
 . . . 
dM,1 dM,2 . . . dM,M
Once the distance matrix has been acquired, the clustering algorithm of choice is applied
in order to partition the data set into clusters. The clustering result of each method and
choice of hyperparameters is then saved to facilitate evaluation and further analysis of
the clusters.
5.2 Evaluation methods

This section describes the different methods of clustering evaluation that were imple-
mented and tested in this thesis.
5.2.1 Qualitative evaluation

Qualitative evaluation methods are methods that can be used to evaluated clustering
results without relying on a numerical value or score derived from the results them-
selves. One qualitative method of cluster evaluation is cluster visualization. Visualizing
the clustering results in informative and intuitive plots can give insight into the struc-
ture of the underlying data as well as the similarity between data points in clusters and
dissimilarity between data points in separate clusters.
Dimension reduction
Since financial time series are high dimensional where each entry in the series adds an-
other dimension, it can be difficult to visualize the time series in a way that maintains
the differences between the different time series in a data set. This is due to the fact that
in order to create visualizations such as two dimensional scatter plots, the dimensions
of the data need to be reduced to two dimensions. This procedure is called dimension
reduction, and there exists many algorithms for carrying out this task. The goal when
performing dimension reduction is, apart from reducing the dimensions of data, to pre-
serve the structure of the data set so that the as little information as possible is lost in
the dimension reduction process [14].
While performing dimension reduction on the data set enables the creation of visual-
izations, these visualizations can become misleading if too much information about the
local or global structure of the data is lost. Using our solar system as an example,
preserving the global structure could be likened to preserving the distances between the
planets. Preserving the local structure would instead be preserving the relative dis-
tances between planets and their moons [14]. In other words, the global structure is
the structure between the different clusters of the data and the local structure is the
18
internal structure of the clusters.
When the dimensions of the data is reduced to two dimensions, the projection may even
seem to display patterns and structures in the data that is not there in the original,
non reduced data. Methods of dimension reduction tend to either preserve the global or
local structure. Due to this, a number of different dimension reduction methods will be
used to create data visualizations. One method of dimension reduction that preserves
global structure is Principal Component Analysis (PCA), and a method for local struc-
ture preservation is t-Distributed Stochastic Neighbor Embedding (t-SNE) [14]. Scatter
plots will be created using both methods to reduce the dimensions of the time series
data, and the cluster result in order to assign labels to the data points.
PCA is a dimension reduction, and is generally defined as the orthogonal projection of

each data vector in the data set onto a linear space of lower dimension. This linear space
is called the principal subspace, where the variance of the data that has been projected
is the largest [15]. Let X be a data set with N observations. Each vector xn in the
data set has D features (dimensions). Let x̄ be the data set mean, and let S be the
covariance matrix of the data such that:
N
1 X
x̄ = xn , (5)
N n=1
N
1 X
S= (xn − x̄)(xn − x̄)T . (6)
N n=1
Let K be the number of dimensions of the principal subspace such that K < D. The
direction of his subspace can be defined using K number of D dimensional vectors
{u1 , u2 , ..., uK }. The variance of the projected data for each vector un is given by:
N
1 X
{un T xn − un T x̄}2 = un T Sun . (7)
N n=1
The goal is then to maximize the variance with respect to un , constrained by the
normalization condition un T un = 1. A Lagrange multiplier λn is introduced, so that
the following can be maximized:
un T Sun + λn (1 − un T un ). (8)
By deriving with respect to un and letting the derivative be 0, the variance will be
stationary when
un T Sun = λn . (9)
This means that the variance will be maximized when un is the eigenvector of S that
corresponds to the eigenvalue with the largest magnitude λn This eigenvector is known
as the first principal component, and additional principal components can be extracted
by using the eigenvector corresponding to the second largest eigenvector, and so on [15].
The projection of the original data into this subspace is then defined by:
19
Z = XU, (10)
where U is the matrix where all columns are the K eigenvectors of the covariance ma-
trix S. In the context of plotting a two dimensional scatter plot, K = 2. The Z matrix
would in this case have the dimensions (N × 2), and could be visualized in a scatter plot.
The SNE algorithm defines a distance metric pi|j and then attempts to find a low di-
mensional mapping such that the relative distances in the low dimensional space qi|j
resemble the high dimensional distances as much as possible. The first step of the al-
gorithm is to convert the original, high dimensional distances between the data points
into conditional probabilities that can be viewed as similarity between points [16]. Let
the similarity between data points xi and xj in the high dimensional space be:
−||xi −xj ||2

exp( 2σ 2i
)
pj|i = P 2 , (11)
k̸=i exp( −||x2σ
i −xk ||
2 )
i
where σ i is the variance of a Gaussian distribution centered on data point i. This pa-
rameter is found by performing a binary search, such that σ i produces a value Pi with a
user specified fixed perplexity [16]. In the equation above, the distance between the data
points is the Euclidean distance. It is important to note that the method is versatile,
and other distance metrics can be used as well.
The perplexity is defined as:
Perp(Pi ) = 2H(Pi ) , (12)

X
H(Pi ) = − pj|i log2 pj|i . (13)
j
H(Pi ) is the Shannon entropy of Pi in bits. The conditional probability that indicates
similarity between data points yi and yj in the low dimensional space is defined as:
exp(−||y i − y j ||2 )
qj|i = P 2
. (14)
k̸=i exp(−||y i − y k || )
The goal of SNE is to find a mapping to lower dimensional space such that the difference
between pj|i and qj|i . This is done by minimizing the cost function C, which is the sum
of the Kullback-Leibler divergences:
X XX pj|i
C= KL(Pi ||Qi ) = pj|i log . (15)
i i j
qj|i
The optimization is performed iteratively using gradient descent [16]. Once the opti-
mization is complete, the mapped data set will be Y, where each row is a lower dimension
20
mapping of one data point.
t-SNE builds on this method. It uses a Student t-distribution to calculate the similarity
between two data points in the low dimensional space. Alternatively to minimizing
the cost function in Equation 15, one can minimize the Kullback-Leibler divergence
between a joint probability distribution in the high dimensional space called P , and a
joint probability distribution in the low dimensional space called Q [16]:
XX pij
C = KL(P ||Q) = pij log . (16)
i j
qij
The joint probabilities qij is calculated using a Student t-distribution with one degree
of freedom:
(1 + ||yi − yj ||2 )−1

qij = P 2
. (17)
k̸=1 (1 + ||yk − y1 || )−1)
The joint probabilities in the high dimensional space, pij , are defined as the symmetrized
p +p
conditional probabilities pij = j|i2n i|j , where n is the number of points in the data set.
Using a Student t-distribution to calculate the low dimensional joint probabilities allows
a moderate distance in the high dimensional distance in the high dimensional space to be
transformed into a longer distance in the lower dimensional space due to the heavy tails
of the Student t-distribution, providing a more faithful low dimensional representation
[16].
PCA and t-SNE have been implemented in the Python machine learning library Scikit-
learn [17]. It is these implementations that will be used to perform dimension reduction
in this project. Since PCA is performed using the data vector of each data point, the
returns of each time series will be the input to be reduced. Since t-SNE is based on
finding lower dimensional mappings using the distances between the data points, a dis-
tance matrix will describing the point-wise distances between the financial time series
will be used instead. In order to produce scatter plots of the clusters the data will be
reduced to two dimensions.
Cluster evolution tracking

Another method of cluster result visualization is to use a directed network graph to dis-
play how the clustering results change when a clustering method is applied to different
subsections of time series in a data set. One such method is proposed by Arratia and
Cabaña [18]. As opposed to the previously described methods of cluster result visual-
ization by dimension reduction, this method includes the temporal aspect of time series
clustering into account as well. The proposed method results in a directed weighted
graph where the nodes represent clusters containing financial assets, such as mutual
funds or stocks. Edges in the graph link clusters with a non empty intersection to sig-
nify that time series has moved from one cluster to another. The weights of the edges are
determined by the amount of time series that has moved from one cluster to another [18].
More formally, let S be a set of financial time series. Let T be the time period that will
be analyzed. T is then split into m subsequences τ1 , τ2 , ..., τm [18]. For each subsequence
21
of T , the returns of the price of each financial time series during this period is used to
calculate the distance matrix between the time series. This distance matrix is then used
as input to a clustering method that partitions the data into clusters.
The set of nodes is defined as the set of clusters produced by clustering each time inter-
val. All clusters that only contain one element are excluded from this set. The relational
edges between all nodes in adjacent subsequences are computed, by adding an edge be-
tween them when the intersection of the time series within the two clusters is non zero.
The weight assigned to each specific edge is the cardinality of the intersection [18].
Clustering evaluation using domain knowledge

Having sufficient domain knowledge regarding the data set is essential when performing
cluster validation. Knowledge about the data type that is being clustered helps inter-
pret the meaning behind the clusters that the clustering algorithms applied to the data
produce. Having sufficient knowledge about the data ensures that the results makes
sense to the user and it helps reveal the structure of the data. Additionally, knowledge
of the data sets that are being clustered can be used in order to evaluate whether the
clustering results align with expected patterns and can help evaluate how well the clus-
tering method that was applied to the data set is able to capture these patterns.
In a similar vein, domain knowledge can help in outlier and anomaly detection. For ex-
ample, a person knowledgeable in mutual funds can inspect the clustering result and find
clusters where mutual funds that have been clustered together or apart unexpectedly.
For example, let one cluster contain three Swedish interest funds. The user observes
that a fourth interest fund has been placed in a different cluster, or even been classified
as an outlier. This would possibly warrant a more thorough investigation into the fourth
interest fund, since it has not been clustered with the other funds with similar strategies.
Since there is usually no ground truth available when performing clustering on financial
time series, domain knowledge could possibly be invaluable when determining the use-
fulness and quality of the clustering results. One of the main problems regarding this
type of cluster evaluation is the fact that it requires manual inspection. For this reason,
the method is less feasible when performing model selection and parameter tuning since
the amount of clustering results to inspect become very large. Ideally, this type of eval-
uation would instead be used once quantitative scores have singled out a few clustering
method candidates.
5.2.2 Internal clustering evaluation

The term internal clustering evaluation method refers to measures of clustering result
quality that only utilizes information available in the data used to perform the clus-
tering [19]. Different evaluation metrics measure different aspects and features of the
clustering result, and it is therefore important to use a multitude of different metrics
to get a more general idea of how well a clustering algorithm performs. Many different
quantitative methods will be used in this thesis, and the amount of information gained
by applying each evaluation metric will be investigated. The evaluation metrics rely
on a distance matrix where the similarities between each time series is stored, and the
results will depend heavily on the similarity metric chosen.
22
Silhouette score
One evaluation metric that is high when the clusters in the result are clearly separated
and dense is the Silhouette score. The score is based on the Silhouette coefficient, which
is calculated using the cluster labels of each data point, as well as the distances between
the objects in the clusters. Here, a distance matrix created some measure of time series
similarity is used.
Let a(i) be the average distance of an object in a cluster i to all other objects in the
same cluster. Let d(i, C) be the average distance between an object i in cluster A and
all objects in a cluster C [20]. Finally, let:
b(i) = min d(i, C). (18)

C̸=A
Let the cluster where the minimum described in 18 is found be B. This is the cluster
that is determined to be the cluster that is the closest to the object i that is not the
cluster A where the object is currently located. B is in this way determined to be the
neighboring cluster to i. The Silhouette coefficient for each sample in the data set can
then be calculated as follows [20]:
b(i) − a(i)
s(i) = . (19)
max{a(i), b(i)}
The Silhouette score can then be acquired by calculating the mean of all Silhouette
coefficients in the data set [21]:
n
1X
S= s(i). (20)
n i=1
Calinski-Harabasz score
The Calinski-Harabasz score is also known as the Variance Ratio Criterion (VRC). Let
n be the total number of data points in the data set P = {p1 , p2 , ..., pn }. If each data
point (observation) has v features, the data matrix X has dimensions v × n. This ma-
trix can be written as X = (x̄1 , x̄2 , ..., x̄n ), where x̄i is the feature vector of data point
pi Let k be the total number of clusters C = {C1 , C2 , ..., Ck }. The score is defined as [22]:
BGSS n − k
V RC = , (21)
W GSS k − 1
where W GSS is the within-group sum of squares and BGSS is the between-group sum
of squares. These are calculated using the trace of two different matrices:
23
W GSS = tr(Wk ), (22)
k X
X
Wk = (x̄ − cq )(x̄ − cq )T (23)
q=1 x̄∈Cq
BGSS = tr(Bk ), (24)

k
X
Bk = nq (cq − cP )(cq − cP )T , (25)
q=1
where Cq is the q:th cluster, cq is the centroid of cluster q, nq is the number of data
points in the cluster, and cP is the center of the data set P .
Davies-Bouldin score
The Davies-Bouldin score is a similarity measure that defines a measure of cluster sep-
aration R. R will then be used to determine the average similarity between each cluster
and the cluster in the set that is the most similar to it [23]. Let a cluster C in a cluster
set contain the data points {X1 , X2 , ..., Xm } ∈ Ep . Here, Ep is a Euclidean space of p
dimensions. Let S(X1 , X2 , ..., Xm ) be a measure of dispersion such that:
S(X1 , X2 , ..., Xm ) ≥ 0 (26)

S(X1 , X2 , ..., Xm ) = 0 iff Xi = Xj ∀Xi , Xj ∈ C. (27)
While keeping the properties of S in mind, Si is defined to be the dispersion of cluster C

[23]. Si is calculated as the average distance between the points inside of cluster and the
centroid of the cluster. In order to calculate the Davies-Bouldin score, Mij is defined as
the distance between the cluster centroids of clusters Ci and Cj . Using these definitions
of Si and Mij , a measure of cluster separation R can be defined:
Si + Sj
Rij = . (28)
Mij
The Davies-Bouldin score can then be defined as the average of the similarity measure
R of each cluster and the cluster closest to it:
N
1 X
R̄ = max Rij . (29)
N i=1 i̸=j
The optimal choice of cluster algorithm when only considering the Davies-Bouldin score
will thus be the algorithm which results in clusters that minimize Equation 29. A lower
score indicates that the similarity between the different clusters are smaller, which in-
dicates a better cluster separation [23].
Modifications for non-Euclidean distances between time series

The Calinski-Harabasz score as well as the Davies-Bouldin score are both calculated
by utilizing the centroid of each cluster or the the entire data set in some capacity.
24
The centroid of a cluster or data set is defined as the arithmetic mean of all points
in the cluster or data set [24]. This works well when the dissimilarity between the
points in the data set is described by a Euclidean distance. In the case of clustering
time series however, this distance metric may not be the best choice. The Euclidean
distance between two time series T and S is the distances between each corresponding
observation:
v
u n
uX
d(T, S) = t (Ti − Si )2 . (30)
i=1
This limits the comparison to time series of the exact same length. This is not unique
for the Euclidean distance, but the distance metric is also sensitive to noise and outliers
in the data as well as signal transformations [25]. In the interest of clustering financial
time series in order to potentially diversify a portfolio, measures such as the Hellinger
distance and Kendall’s tau is likely more appropriate. Constructing a portfolio of funds
whose returns has a low or negative historic correlation can increase the diversification,
since the funds has historically responded differently to conditions of the market. This
can reduce the impact of individual fund return fluctuations. According to the styl-
ized facts of financial time series, there exists an asymmetry when it comes to gains
and losses. Large decreases in price is more common in empirical financial data than
large increases, and diversification by correlation is a method for mitigating the large
decreases. This is the main motivation why Kendall’s tau is appropriate for measuring
dissimilarity between financial time series. The Hellinger distance is in this case on the
other hand a measure of similarity between the distributional properties of the time
series, which allows it to capture both shape and spread of the distributions. Since the
Hellinger distance is a measure between distributions, it is also not sensitive to outliers
in the data.
An issue that arises with these distances is that the concept of a centroid is not defined
in the context of non-Euclidean distances. In order to utilize the quantitative metrics
that depend on a center of a cluster or data set, the metrics need to be modified. Instead
of calculating the centroids when computing the different metrics, the medioid will be
computed instead. The medioid of a set is the object where the sum of its dissimilarities
to the other objects in the set is the smallest [26]. By using this definition of a cluster
center instead of the centroid, metrics that utilize centroids can also be extended to non-
Euclidean distances between objects. More formally, the medioid of cluster C containing
data points {p1 , p2 , ..., pn } is calculated as follows for any distance metric d(p, q) [26]:
n
X
pmedioid = arg min d(q, pi ). (31)
q∈C
i=1
5.2.3 Measures of stability in time

It is desirable that a clustering method that is applied to a data set of financial time
series results in a similar clustering result regardless of the time period the financial
time series describe. It it therefore necessary to find measures that can quantify the
robustness in time of the clustering method being evaluated. This section introduces
25
one measure of robustness that will be evaluated. Do note that this evaluation measure
is an internal validation measure as well, but the since the focus of the method lies
within the temporal stability of the results this will be classified as a separate type of
metric compared to the scores in in the previous section.
CLOSE score
When performing clustering for multiple successive time periods it is desirable that the
clustering methods produces similar results for each time period, while also producing
reasonable clusters given the current data. Encapsulating both of these criteria into an
evaluation method will for this reason allow the user to find clustering methods that are
both robust over time, and produces clusters of quality. One such evaluation method is
presented by Klassen et al. [27]. The algorithm is called CLOSE, which is an acronym
for Cluster Over-Time Stability Evaluation. The method is designed to evaluate multi-
variate time series clustering. The CLOSE algorithm can be used to evaluate clustering
results of methods that produce crisp clusters, meaning that each data point the the
set is assigned to one cluster only. Additionally, it was created specifically to evaluate
evolutionary clustering results. In evolutionary clustering the time series that will be
clustered are split into subsequences, effectively creating k data sets where k is the num-
ber of subsequences the time series data has been split into. The clustering algorithm
to be evaluated is then applied to each subsequence data set, and the clustering result
is stored in the order of the subsequences. Klassen et al. designed the algorithm to
compare the cluster result similarity between all subsequences, rather than two succes-
sive sequences. This strategy enables the capturing of large changes in cluster contents
caused by small changes over longer periods of time. This way of quantifying change
in clusters is called over-time stability [27]. The CLOSE method is parameter free, and
makes no assumptions regarding the nature of the clustering algorithm that has pro-
duced the clusters. This facilitates the comparison of a wide range of different clustering
algorithms, granted that the clusters have been generated based on the same distance
metric.
The following notation is used by Klassen et al. [27] and will be used here as well
in order to describe the algorithm and its implementation. Let one time series be
T = {ot1 , ot2 , ..., otn }, where oti is one observation at time point i and n is the total num-
ber of observations. The set O = ot1 ,1 , ..., otn ,m includes the vectors of all time series in
the data set, and Oti is used as shorthand for all data points at time i. Let a subsequence
of one time series be defined as Tti ,tj l = {oti ,l , ..., otj ,l } where j > i. In other words, it
is a section of a time series Tl beginning at time ti and ending at time tj . The CLOSE
algorithm makes a distinction between data points oti ,l that are cluster members, and
points that are noise (also known as outliers). One cluster containing time series data
points based on subsequence i of each time series is denoted as Cti ,j ⊆ Oti . Any data
point in Oti that has been partitioned into cluster Cti ,j is a member of this cluster. A
data point in the same set that is not assigned to any cluster during time period ti is
defined as noise. The second index of the cluster j indicates which unique identifier the
cluster has in the entire evolutionary clustering results. Here, j ∈ {1, ..., NC } where NC
is the total number of clusters across all subsequence clusterings [27].
The CLOSE score of an evolutionary clustering result depends on the over-time stability
of each cluster in the clustering result. The over-time stability of each cluster is in turn
26
dependent the subsequence scores of all data points that have been assigned to that
cluster. The subsequence score of data point otk ,l is defined as
k−1
1 X
subseq_score(otk ,l ) = p(cid(oti ,l ), cid(otk ,l )), (32)
ka i=1
where ka is the number of previous timestamps where the data point is assigned to a
cluster, and cid(oti ,j) is the cluster identity function that returns the cluster assignment
of data point oj at time ti . The function p describes the proportion of data points that
have remained together from one cluster in the previous timestamp to another cluster
in the current timestamp:
|Cti ,a ∩t Ctj ,b |
p(Cti ,a , Ctj ,b ) = , ti < tj . (33)
|Cti ,a |
Klassen et al. [27] describes ∩t as the temporal cluster intersection, and is defined as
∩t {Cti ,a , Ctj ,b } = {Tl |oti ,l ∈ Cti ,a ∧ otj ,l ∈ Ctj ,b }. (34)
In words, the temporal cluster intersection is the set of data points that have been
partitioned into the same clusters on different time intervals. An additional factor that
influences the over-time stability of a cluster is how many clusters from previous time
intervals that have merged when forming the cluster. More formally:
m(Ctk ,i ) = |{Ctl ,j |tl < tk ∧ ∃a : otl ,a ∈ Ctl ,j ∧ otk ,a ∈ Ctk ,i }|. (35)
The over-time stability of a cluster Ctk ,i is the calculated as:
1
P
|Ctk ,i | otk ,l ∈Ctk ,i subseq_score(otk ,l )
ot_stability(Ctk ,i ) = 1 . (36)
k−1
m(Ctk ,i )
Finally, the CLOSE score for an evolutionary clustering result ξ can be defined:
2 ! X !
1 n
CLOSE(ξ) = 1− ot_stability(C)(1 − quality(C)) . (37)
NC NC C∈ξ
The function quality(C) is measure of cluster quality. Klassen et al. suggests the usage
of the mean squared error [27], this measure is thus chosen to be used in this degree
project. The mean squared error of each cluster is defined in the inner sum of Equation
23, where cq is the medioid of cluster q.
The formula for the CLOSE score described in Equation 37 does not punish outliers
directly. Instead, one can choose to include an exploitation term in the equation, which
serves the purpose of punishing outliers more harshly. This exploitation term is defined
27
by Klassen et al. [27] as the number of data points that are assigned to clusters Nco
divided by all the data points in the entire data set, No :
Nco
exp_term = , (38)
No
2 ! !
1 n X Nco
CLOSE(ξ) = 1− ot_stability(C)(1 − quality(C)) . (39)
NC NC C∈ξ
N o
It is worth noting clustering methods that result in outliers are not directly undesired,
since it can be of great interest why a certain fund was not clustered with other funds
that have been labeled similarly by the fund managers. However, a clustering result with
a very large amount of outliers may become less useful, as the purpose of performing
the clustering in the first place may be lost. For this reason, the exploitation term will
be used when performing calculations of CLOSE in this thesis.
During evaluation, the financial time series that will be clustered will be split into sub-
sequences of equal size. Each subsequence will contain weekly data, and have a length
of 200 weeks. A better choice would possibly have been to let each subsequence have a
length of 260 weeks (5 years), which corresponds to the length of the time series in the
clustering results that are evaluated using the other evaluation methods in this report.
The main problem with this is that this approach would require that the entire time se-
ries would cover 20 years. The historical data available for the stocks in the benchmark
stock data set does not extend far enough back in time for many stocks. This would in
turn result in the stock data set shrinking drastically.
Since the CLOSE algorithm takes both temporal robustness as well as cluster quality into
account, this measure is deemed a promising candidate for selecting hyperparameters of
the clustering methods that will be evaluated using the validation methods described in
this report.
5.2.4 Replication stability evaluation

In addition to the temporal stability that is evaluated using the CLOSE score, another
desired quality of a clustering method is replication stability. When a clustering al-
gorithm is used to partition two data sets of the same kind with similar properties,
the clustering task should result in two similar clustering results. In Stability-Based
Validation of Clustering Solutions by Roth et al. [11], the authors define this type of
cluster stability by considering the average dissimilarity of two different clustering results
computed by the same clustering algorithm on two different data sets from the same
probabilistic source. When performing clustering tasks, it is often the case however that
only one data set is available. The solution presented to this problem is to implement
a subsampling scheme in order to emulate two different independent data set from the
same probabilistic source. This can be performed by randomly splitting the available
data set into two disjoint data sets, X and X ′ . After the data set has been divided,
the clustering method ξ that will be evaluated is used to perform clustering on both
data sets. The cluster labels of each data point in the data sets are stored in vectors
ξ(X) = Y and ξ(X ′ ) = Y ′ [11].
28
A classifier ϕ is then trained on data set X and the clustering result Y to perform
cluster label predictions. The trained classifier is then used to predict the cluster labels
of the other data set, X ′ . The predicted labeling ϕ(X ′ ) is used as a method to extend
the clustering solution Y of the training data set X to X ′ . The labels predicted by
the classifier can then be compared to the actual cluster assignments ξ(X ′ ). By quan-
tifying the dissimilarity between the predicted labels and the labels retrieved from the
clustering solution, a stability measurement for the clustering method can be calculated.
The choice of classifier will have a significant impact on the calculated stability of the
method, and it is therefore important that the classifier is chosen such that the misclas-
sification rate is as low as possible.
In order to compare the predicted clustering solution ϕ(X ′ ) and the actual clustering
labels Y ′ , Roth et al. proposes the normalized Hamming distance [11]:
n
1X
d(ϕ(X ′ ), Y ′ ) := 1{ϕ(Xi′ ) ̸= Yi′ }. (40)
n i=1
1{ϕ(Xi′ ) ̸= Yi′ } = 1 if the prediction ϕ(Xi′ ) ̸= Yi′ . If the prediction is correct, it is 0.

This distance effectively measures the rate of misclassification of the classifier ϕ. One
issue with that arises when training a classifier on one clustering result and using the
classifier to make predictions regarding another clustering result is the fact that it is not
guaranteed that the clustering labels in Y ′ correspond to the labels of the clusters in
Y . For example, a cluster that has the identifier 1 in Y may have the identifier 3 in Y ′ .
In order to facilitate a comparison between the clustering solutions, the cluster labels
in one of the clustering solutions are mapped to maximize the similarity between the
clustering solutions. This optimization problem can be solved by finding the permutation
of the labels in the prediction ϕ(X ′ ) that minimizes the Hamming distance between the
predicted labels and the actual cluster solution Y ′ . The solution to this optimization
problem is described in additional detail by Roth et al. [11]. The minimzed Hamming
distance will henceforth be denoted dSk . The stability index of a clustering algorithm ξ
can then be defined as:
S(ξ) := EX,X ′ dSk (ϕ(X), Y ′ ). (41)
The stability index is thus a measure of the average dissimilarity between the predicted
labels given the data set X ′ and the actual clustering result Y ′ . In order to compute
this empirical expectation value, the entire data set is split in half randomly r times. For
each random split of the data, both splits are clustered using the clustering algorithm
ξ, and the classifier is trained on one half of the data X and the clustering labels of
the training data Y . The predictions of the cluster labels of the other half of the data
set, ϕ(X ′ ), is then computed and the distance between ϕ(X ′ ) and Y ′ is calculated and
stored.
Once r splits resulting in r dissimilarity calculations have been performed, the empirical
average stability index of the cluster method can be computed as
29
r
1X
Ŝ(ξ) = dS (ϕ(X)i , Y ′ i ). (42)
r i=1 k
In order to use the stability index to evaluate clustering methods equally, the stability
index needs to be normalized. A classification accuracy of 0.5 when the total amount of
clusters is 2 indicates that the classifier does not do a better job than simply guessing the
cluster label of data points. However, if the amount of clusters is 50, an accuracy of 0.5
is significantly better. In order to facilitate a comparison between the stability indices
for clustering solutions that result in different amounts of clusters, the stability index
is normalized using the asymptotic random misclassification rate. This normalizing
factor can be estimated by producing two new clustering results by clustering n points
randomly twice, where the number of clusters k is equal to the number of clusters
received when applying the clustering algorithm to X and X ′ . Here, n is the number
of data points in X and X ′ . By calculating the Hamming distance between the two
random clustering solutions, the normalizing factor Ŝ(Rk ) is acquired. The random
Using this, the normalized stability index can be calculated for clustering method ξ:
Ŝ(ξ)
S̄(ξ) := . (43)
Ŝ(Rk )
When comparing multiple clustering methods during model selection, the choice of pa-
rameters that results in the smallest value of S̄(ξ) is the most stable of the clustering
methods being compared [11]. These are the best choice of parameters according to the
stability index.
The classifier chosen in this project is a K nearest neighbor classifier, or KNN. KNN is
a non parametric classifier that classifies data points according to the majority vote of
the K closest data points in the training data set. Given a data point and a training
the data set, the distances between the data point and all data points in the entire data
set is computed. The class assignments of the majority of the K closest data points in
the training set will determine the classification of the unknown data point [28]. The
distance metric that is chosen in this application is the same distance metric used to
perform the clustering.
The parameter K is chosen to be K = 1, and this selection is based on the accuracy of

the classifications made by KNN classifiers with different values of K. In this particular
case, K = 1 yielded the best accuracy but it is important to note that the optimal
choice of K will vary between application and data sets and it is up to the user to tune
the classifier for optimal performance.
5.3 Analysis
This describes more in detail how the different experiments and evaluations where carried
out. It also describes the reference methods used for comparison.
30
5.3.1 Benchmarking
In order to evaluate the different clustering algorithms, a benchmark data set is needed.
Using a benchmark data set ensures that clustering algorithms are evaluated using the
same underlying data, and will therefore provide a fair comparison. Establishing a
benchmark data set containing different kinds of financial time series such as mutual
funds and stocks can also showcase how well the different clustering algorithms performs
on different kinds of data. This is due to the fact that a financial time series describing
a mutual fund will be different to a time series describing the price of a stock. For
example, prices of stocks tend to display more volatility than time series of mutual fund
prices due to the diversification of mutual funds. The meta data available will also differ
between stocks and mutual funds.
Performing clustering on mutual funds and stocks separately will showcase how well the
methods performs on different kinds of data. Some method may perform better when
clustering mutual funds rather than stocks, or vice versa. The benchmark data sets will
therefore be one mutual fund data set, and one stock data set. The mutual fund data
set consists of 430 funds from the Swedish fund market. The stock data set is made up
of 116 stocks that are listed on Nasdaq Stockholm. The time series in both data sets
consist of 273 observations.
5.3.2 Bootstrapping
In order to construct confidence regions for scores received from quantitative evaluation
methods, a technique known as bootstrapping will be utilized. The technique is used
to create a number of bootstrap samples that are drawn from the same empirical dis-
tribution as the data points in the original data sets. Multiple bootstrap samples can
be created using the underlying data sets, and then clustered and evaluated in order to
estimate the median and variance of a given evaluation score [29]. By doing this, it is
possible to get an idea how a clustering method may perform on new data that is similar
to the data in the bootstrapped data set.
It is important to ensure that the bootstrapped samples are able to successfully emulate
the empirical distribution of the original data. In the case of multivariate time series
such as the returns of mutual funds and stocks that are clustered in this project, a
bootstrapping method called moving block bootstrap [30]. The method is a resampling
scheme applicable to dependent time series data. Instead of simply sampling individual
observations at a time to form a new bootstrapped sample, the moving block bootstrap
method samples blocks of consecutive observations from the time series. The purpose of
this is that the structural dependence within each sampled block is maintained. Once
the total length of all sampled blocks is the same as the length of the original time series,
the blocks are stitched together to form the bootstrapped sample.
More formally, let a sequence of stationary random variables be defined as Xn =

{X1 , ..., Xn }. Let l be an integer in the range l ≡ ln ∈ [1, n]. Let Bi = (Xi , ..., Xi+l−1 )
the a block from the sequence Xn that starts at observation Xi , where 1 ≤ i ≤ N ,
N = n − l + 1. In order to create the bootstrapped sample, a predetermined number
31
of blocks from {B1 , ..., BN } are selected randomly. This is accomplished by randomly
selecting the start indices of the blocks, and extracting l observations forward in time.
By concatenating the blocks of consecutive observations into a new sequence, the boot-
strapped sample Xn∗ can be acquired [30].
Since the time series in the data sets that are clustered in this thesis have a dependence
on one another, it is important that the blocks are extracted from the time series in
such a way that the temporal relationships between the multivariate time series are
maintained. For this reason, the starting indices and length of all blocks are determined
before the bootstrapping process is started. The same indices and block lengths are then
used to create bootstrap samples from all time series in the data sets. By performing the
moving block bootstrap in this manner, the temporal dependence between the financial
time series are transferred to their bootstrapped counterparts.
When calculating the internal validation measures as well as the CLOSE score during
model selection, both bootstrapped cluster results as well as the cluster results of the
actual data will be evaluated. In order to determine medians and confidence intervals
of the evaluation metrics, the bootstrapped evaluation will be performed a number of
times, where each iteration evaluates results of new bootstrap samples. The stability
index will on the other hand only be calculated for clusterings of the actual data sets.
The reason for this is the fact that the method relies on calculating the metric for
an ideally large number of random splits of the data. Due to limitations in computer
hardware, bootstrapping will not be used for this metric.
5.3.3 Pre-evaluation processing of cluster results

Throughout this report, outliers are treated as singleton clusters. For this reason, the
presence of a significant amount of outliers in the clustering result may have a signifi-
cant impact on the internal validation measures. The silhouette coefficient of a singleton
cluster is guaranteed to be positive, a(i) in Equation 19 will be zero. For this reason,
a cluster result with a significant number of outliers is likely to have a higher, better
silhouette score. In a similar fashion, the Calinski-Harabasz index is impacted by a sig-
nificant number of outliers. This is due to the fact that the within group sum of squares
in Equation 23 will be zero, leading to a higher Calinski-Harabasz index. Finally, the
Davies-Bouldin score of a clustering result with a large number of outliers will also be
misleading. This is because the similarity measure in Equation 29 will be zero, resulting
in a better score.
For this reason, all singleton clusters will be removed from the clustering result before
performing evaluation using internal validation measures. When performing evaluation
using the other methods presented in this report, the singleton clusters will still be
considered. The CLOSE score handles outliers using the exploitation term in Equation
39, and the stability index is normalized with respect to the total number of clusters in
Equation 43.
32
5.3.4 Reference methods based on a priori knowledge
In order to facilitate the evaluation of the clustering methods, a number of different
baseline clustering methods will be evaluated as well. The evaluation methods will then
be applied to the baseline results, and compared to the results of the clustering methods
that have achieved the best CLOSE score. The reference methods will be evaluated
using the internal validation measures, as well as the stability index.
Filter clustering
By utilizing the meta data of the financial time series, it is possible to perform filter
clustering. These clustering methods are executed by partitioning the financial time
series according to some chosen meta data. For the data set containing mutual funds,
the attributes that will be used to partition the data is the asset type, region and the
currency of the mutual funds. For the data set containing stocks, the data will instead
be partitioned according to GICS sector name [31]. GICS is an acronym for Global
Industry Classification Standard, and is an analysis framework with the purpose of clas-
sifying companies according to their sector, industry group, industry, and sub-industry.
The sector classification is the coarsest classification in the GICS framework, which will
theoretically result in a clustering result with fewer, larger clusters.
The filter clustering methods will not be evaluated using the CLOSE score. This is
due to the fact that the filter clustering will result in the exact same result regardless of
which time period it is being applied to, and will for this reason achieve an unreasonably
high CLOSE score.
Sequential method with known efficacy for clustering fund data

In addition to the filter clustering, another clustering method will be used as reference.
This clustering method is a sequential clustering method that produces clustering results
that have been determined to have a high quality and usability when the data being
clustered is mutual funds available on the Swedish fund market. In addition to the
internal validation measures and the stability index, this reference clustering method
will also be evaluated using the CLOSE score.
5.3.5 Clustering experiments

As previously stated, there are many different approaches when evaluating clustering
results. The results of the clustering experiments performed will be evaluated using eval-
uation methods presented in this section, and it is therefore necessary to apply these
to cluster results that have been acquired using different methods and distance metrics,
as well as different sets of data. This will provide information regarding how well the
clustering methods perform on a range of different financial time series such as funds
and stocks. The different evaluation metrics can also be compared with one another,
in order to draw conclusions regarding how well each evaluation method fits the given
method and underlying data.
The different sub classes of evaluation methods will be evaluated over different desirable
properties. These properties will differ between internal validation measures, qualitative
clustering methods, and stability evaluation methods. It is not trivial to determine these
33
properties, since what is classified as an appropriate measure of similarity or a good clus-
tering result differs between applications and data sets. The purpose of performing the
clustering is to detect similar financial time series and partition them together, as well
as detecting outliers. One example of an outlier in this case could be if a Swedish equity
fund is partitioned into the same cluster as an American bond fund. For this reason,
it is not beneficial to use partitions according to meta data from the fund managers
such as the region or asset type of a mutual fund as a ground truth. Instead, the best
cluster result when comparing two methods is the result where the similarity of data
points within clusters is the highest, and the dissimilarity between data points in differ-
ent clusters is the highest.
The methods for cluster evaluation will be evaluated in terms of usability in analysis.
This way of evaluating a visualization method is admittedly a bit vague. When evalu-
ating the usability of a cluster visualization method, these are the criteria that will be
examined:
• The extent to which a plot enables a user to quickly and efficiently comprehend
the clustering results
• How well the visualization method scales for large data sets and clusters
• How easily a user can discern the clustered groups
All quantitative evaluation methods such as the internal validation measures, the CLOSE
score, and the stability index will be calculated for each tuned clustering method and will
be presented in tables. By comparing the results obtained from the variety of clustering
validation methods, it is possible to discuss their usefulness for evaluating clustering
results of financial time series.
Experiments
A number of different clustering algorithms will be applied to the benchmark data in
order for their result to be evaluated. The clustering methods and distance metrics that
will be evaluated are:
Table 3: Specification of the clustering methods and distance metrics that will be tuned
Clustering method Distance metric(s)

Agglomerative clustering Kendall’s tau
Agglomerative clustering Hellinger
Hybrid hierarchical clustering Kendall’s tau
Hybrid hierarchical clustering Hellinger
In order to compare the clustering methods in a way that is as fair as possible, the pa-
rameters of each method are fine tuned with the purpose of maximizing CLOSE score.
Other measures of stability or internal validation measures can be used as well, one can
for example use the "elbow" of the plot of the within cluster sum of squares in order
34
to find the optimal parameters for a model. This metric is utilized in the method com-
monly known as the Elbow method [32]. Due to the occasional ambiguity regarding the
exact number of clusters that represent the elbow in the plot, the method of maximizing
the CLOSE score is chosen. Most importantly, this method of cluster validation has
been specifically developed in order to evaluate clusterings of time series. In addition
to evaluating the temporal stability of the clustering results, the CLOSE score takes
the quality of the clusters into account as well. The optimal parameters will have to be
determined for each data set that is being clustered, due to the nature of different kinds
of financial time series.
The reason for performing different evaluations for different kinds of financial time series
is that the method that succeeds best at clustering mutual funds may not be as success-
ful when clustering stocks. When performing clustering analysis it is important to have
knowledge about the data that is being clustered, and which methods performing well
on this particular kind of data.
Once the different clustering methods have been fine tuned to each data set, the cluster-
ing results of each clustering method applied to each data set in use is saved for further
evaluation. It is these clustering results that will be used when evaluating different
methods of cluster result evaluation. The collection of different cluster validation meth-
ods will then be used to draw conclusions regarding the performance of the evaluated
cluster methods, as well as the usefulness of the validation methods themselves for these
particular methods and data sets.
6 Results
6.1 Model selection
In this section, the CLOSE score, stability index and internal validation measures of
each method applied to each data set is presented for a range of parameters. Addi-
tionally, the CLOSE score of the bootstrapped clustering results are included as well.
The CLOSE score of the bootstrapped results are displayed using a box plot. The box
plotted for each parameter choice ranges from the first quartile to the third quartile of
the data, and the median is shown as a green line inside the box. The whiskers of the
plot show the range of the data, extending no longer than 1.5(Q3 − Q1) from the edges
of the box. Data points that fall outside of this interval are plotted separately as outliers
[33].
The internal validation measures are applied to the clusters of bootstrapped samples,
and these can be viewed in the appendix. In the following results, each parameter has
been evaluated 15 times, using new bootstrapped samples for each evaluation. Ideally,
this number of evaluations is as large as possible. Due to limitations in computer hard-
ware, 15 evaluations is deemed sufficient for the purpose. The number of randoms splits
of the data when calculating the stability index is also 15 in the evaluation experiments
shown below.
The parameters that are selected for each clustering method for further evaluation are
35
the parameters that result in the highest CLOSE score. The internal validation measures
have all been normalized to be in the range 0 to 1 for each clustering method. The
Davies-Bouldin score has also been inverted, so that it is easily compared to the other
validation measures where a higher score is better. It is important to bear in mind
that this means that the internal validation measure plots can not be compared to other
methods or distance metrics, and that the main point of including this plot is to showcase
which parameter yields the optimal clustering result according to these measures. The
actual values of the internal validation measures will be presented in the results of the
tuned methods.
6.1.1 Agglomerative clustering using Kendall’s tau

Mutual fund data set
Figure 1: CLOSE score of clustering results of mutual funds
Figure 2: CLOSE score of clustering results of bootstrapped mutual funds
36
Figure 3: Stability index of clustering results of mutual funds
Figure 4: Internal validation scores of clustering results of mutual funds
In Figure 1, it can be observed that the CLOSE score achieves its maximum value when
the number of clusters is approximately 50. This choice of cluster method parameter is
strengthened further when observing the bootstrapped result in Figure 2. According to
the stability index in Figure 3 the optimal number of clusters is the smallest number
of clusters tested, which in this case is four clusters. The internal validation scores in
Figure 4 seem to instead favor the largest amount of clusters tested, 150 clusters. It is
worth noting however, that the slope of all three scores increase drastically when the
number of clusters is approximately 50.
37
Stock data set
Figure 5: CLOSE score of clustering results of stocks
Figure 6: CLOSE score of clustering results of bootstrapped stocks
Figure 7: Stability index of clustering results of stocks
38
Figure 8: Internal validation scores of clustering results of stocks
For the stock data set, the highest CLOSE score is achieved when the number of clusters
is 16 clusters. Please also note that the difference in CLOSE score between the different
number of clusters is significantly smaller in Figure 5 than in Figure 1. The stability
index seems to increase linearly as the number of clusters increases in Figure 7, suggesting
that the optimal number of clusters is instead 6. The silhouette score and the Davies-
Bouldin index in Figure 8 seem to once again suggest that the largest number of clusters
is the most optimal. The Calinski-Harabasz index has a local maximum the number of
clusters is 16, but the index suggests that the smallest number of clusters tested results
in the best results. Since the scores are normalized for the purpose of being comparable
in one plot, the actual values of the scores are not shown. The reader is referred to the
figures in Section 10.1.1 in the appendix, where the bootstrapped results are presented
in box plots.
6.1.2 Agglomerative clustering using the Hellinger distance

39
The CLOSE score in Figure 9 indicates that the optimal number of clusters is 24, while
the bootstrapped results in Figure 10 instead seem to indicate that the optimal number
of clusters is 4. Other than this discrepancy, the two graphs are quite similar. Since
the median of the bootstrapped CLOSE score for 4 clusters is significantly higher than
40
the CLOSE score for 24 clusters, 4 clusters are chosen as the optimal parameter. This
choice is particularly motivated by the fact that the difference in CLOSE score between
4 and 24 clusters in Figure 9 is fairly small.
Stock data set
41
As can be observed in Figure 13, the have two maxima when the number of clusters is
12 and 18. Due to the fact that the stability index is increasing steadily as the number
of clusters increase and the internal validation measures worsen, the smaller number of
clusters is picked as the optimal parameter.
6.1.3 Hybrid hierarchical clustering using Kendall’s tau

42
This clustering method achieves its highest CLOSE score when the maximum distance
between the points in the clusters is 0.24. The stability index seen in Figure 19 instead
seems to suggest that the optimal maximum distance is significantly larger, resulting
43
in clusterings containing fewer and larger clusters. On the contrary, the internal val-
idation measures in Figure 20 suggest that the maximum distance is kept at a minimum.
Stock data set
44
As can be observed in Figure 21, the maximum CLOSE score is achieved when the
maximum intra cluster distance is 0.4. Again, the stability index in Figure 23 indicates
essentially that the greater the allowed distance between the points in the clusters are,
the better. This is equivalent to selecting a clustering result with very few and large
clusters. When observing the internal validation scores in Figure 24, it appears that
the silhouette score and the Davies-Bouldin index favors a smaller distance between the
points in the clusters, while the Calinski-Harabasz index favors larger and fewer clusters.
6.1.4 Hybrid hierarchical clustering using the Hellinger distance

45
As previously seen in Figure 9, the CLOSE score is the largest for the parameters that
result in fewer clusters when the clustering is performed using the Hellinger distance. In
this case, the CLOSE score has a maximum when the maximum intra cluster distance
is 0.75, as can be seen in Figure 25. The stability index in Figure 27 seems to support
46
this choice of parameter as well, since one of the local minima of the index is achieved
when the maximum distance is 0.75. The other minima of the stability index occurs
when the maximum distance is 0.6, but this is approximately where the CLOSE score
is has its global minima. The silhouette score and the Davies-Bouldin index in Figure
28 also indicate that a larger maximum distance in the range [0.7, 0.9] achieves the best
score. The Calinski-Harabasz index suggests a smaller value of 0.4 instead. The volatile
movements of the normalized internal validation scores in Figure 28 are a result of the
fact that the values of the internal validation scores do not change much when changing
the parameters, which can be seen in Section 10.1.1 in the appendix.
Stock data set
47
The CLOSE score in Figure 29 does not change much once it has reached a value of
approximately 0.35. The maximum CLOSE score is approximately the same for the
parameter values 0.2 and 0.3. Since the stability index in Figure 31 indicate a better
stability index for 0.3, this value is chosen.
6.1.5 Summary of model selection results

As can be seen in the figures describing the CLOSE scores of the different clustering
methods for different parameters, the CLOSE score seems to be the only cluster val-
idation method that is able to consistently give a decisive result regarding the best
parameter choice for the clustering method given the data that is being clustered. The
plotted CLOSE scores typically has a maximum value that can used when performing
model selection. The plots of the stability indices for the different parameter values
seems to be less conclusive, in the sense that the index consistently seems to favor a
smaller amount of clusters, regardless of cluster quality. When it comes to the inter-
nal validation measures, it varies whether the scores suggest similar parameters as the
48
CLOSE score. Often, the internal validation measures contradict one another, and tend
to favor clustering results with a very large amount of small clusters and many data
points not assigned to clusters at all.
6.2 Results of reference clustering methods

In the following tables, the quantitative evaluation methods have been used to evaluate
the reference clustering methods. The sequential clustering method used as a reference
is evaluated using the Kendall’s tau distance between the time series in the data set.
Kendall’s tau is the distance metric that is used in the first clustering sequence. The
clusters are then further partitioned using the Hellinger distance. Since clustering using
Kendall’s tau seemingly leads to a larger amount of clusters, the Kendall’s tau distance
may have the largest impact on the cluster result. For this reason, this metric will also
be used during evaluation.
Note that the filter clustering does not use a distance matrix to perform the clustering
task, and the results of the filter clustering have thus been calculated for both distance
metrics to allow comparison to the tuned clustering methods. The reference clustering
method primarily uses Kendall’s tau to perform the clustering, and will therefore be
evaluated using the Kendall’s tau distance between the time series. Please bear in mind
that the Davies-Bouldin score was previously inverted in the plots of internal validation
scores to facilitate comparison to the other scores. In the tables below, a smaller value
of the Davies-Bouldin index is a better score.
6.2.1 Mutual fund data set
Table 4: Quantitative evaluation metrics of reference methods for fund data set
Sequential Filter
Filter
reference clustering
clustering
clustering (Hellinger
(Kendall’s tau)
method distance)
CLOSE score 0.45 N/A N/A
Stability index 0.89 0.72 0.73
Silhouette score 0.22 -0.06 -0.25
Calinski-
2695 699 171
Harabasz index
Davies-Bouldin
1.55 3.39 4.21
index
49
6.2.2 Stock data set
Table 5: Quantitative evaluation metrics of reference methods for stock data set
Filter clustering Filter clustering

(Kendall’s tau) (Hellinger distance)
Stability index 0.65 0.84
Silhouette score 0.01 -0.07
Calinski-Harabasz index 96 73
Davies-Bouldin index 2.23 2.69
6.3 Results of tuned methods

Here, the clustering methods using the optimal parameters determined in the previous
sections will be evaluated using all quantitative evaluation methods presented. The
clustering methods with the best performance on the two data sets will be selected for
further analysis. Due to the fact that internal validation measures are computed using
the distances between the data points, these can not be compared when comparing two
clustering methods using different distance metrics. Additionally, the different distance
metrics can possibly capture different structures within the data that are of interest when
further analysing the clusterings. By examining and selecting two clustering methods
using different distance metrics for each data set, a more comprehensive evaluation of
the patterns within the data can be performed.
Table 6: Quantitative evaluation metrics of tuned clustering method for both data sets
Mutual fund data set Stock data set

Parameter value 50 16
CLOSE score 0.67 0.36
Silhouette score 0.27 0.07
50

Parameter value 4 12

Parameter value 0.23 0.4

Parameter value 0.75 0.3
51
6.3.5 Summary and comparison to reference methods
When comparing the quantitative scores of the reference methods with the results of the
tuned clustering methods, it is clear all tuned clustering methods outperform the filter
clustering methods. These results are expected, but it is noted how poorly the filter
clustering performs. This further motivates the need to perform clustering of financial
time series, rather than only relying on labels assigned to the instruments. Additionally,
it is noted that both the hybrid hierarchical clustering method and the agglomerative
clustering method outperform the reference clustering algorithm. During the qualitative
analysis of the clustering results, the clusterings produced by the tuned methods will be
briefly compared to the clusterings produced by the reference method.
6.3.6 Selected clustering methods for mutual fund data

When comparing the different evaluation metrics of the mutual fund clustering results, it
appears that the method that uses Kendall’s tau with the highest overall performance is
the agglomerative clustering algorithm, which can be observed in Table 6. This method
achieved a slightly higher CLOSE score than the hybrid hierarchical method in Table 8.
Additionally, the agglomerative method achieved a better clustering index. It is worth
noting however, that the hybrid hierarchical method achieved slightly better internal
validation measures, possibly indicating tighter, more separated clusters. Despite the
slightly worse internal validation measures, the agglomerative clustering algorithm is
chosen for its stability.
The clustering method using Hellinger that has the highest performance on the mutual
fund data is determined to be the hybrid hierarchical clustering method in Table 9. This
clustering method results in a slightly higher CLOSE score compared to the agglomera-
tive method in Table 7. Other than this, the performance of the methods according the
quantitative evaluation measures are very similar.
6.3.7 Selected clustering methods for stock data

For the stock data set, the method using Kendall’s tau that achieves the highest overall
performance is the hybrid hierarchical method in Table 8. While the stability index is
higher for this method than for the agglomerative clustering method in Table 6, the hy-
brid method achieves a higher CLOSE score. Since the internal validation measures are
fairly similar between the two clustering methods, the temporal stability that a higher
CLOSE score indicates motivates this choice.
Finally, the evaluation metrics indicate that the agglomerative clustering method in
Table 7 is the best performing method when using the Hellinger distance to cluster the
stock data. This method achieves a higher CLOSE score than the hybrid hierarchical
method in Table 9, and a slightly worse stability index. However, the choice is motivated
further by the fact that the internal validation measures all indicate a better cluster
result for the agglomerative method.
52
6.4 Visualization of clustering results
In this section, visualizations of clustering results created by the previously selected
clustering methods will be exhibited.
6.4.1 Scatter plots created using dimension reduction

PCA
The following scatter plots have been produced by using PCA to perform dimension
reduction, and choosing the two most significant components. Since PCA is performed
using the eigenvalues of the covariance matrix of the data rather than a distance matrix,
the return series of each financial instrument is used as input for the algorithm. The
plots where created using cluster results obtained by clustering mutual funds or stocks
that span a period of slightly over five years of weekly data. More specifically, each time
series consists of 273 observations.
(a) Cluster result of agglomerative clustering (b) Cluster result of hybrid hierarchical
using Kendall’s tau method using the Hellinger distance
Figure 33: Scatter plots showing reduced data points and cluster assignments for
mutual fund data
(a) Cluster result of hybrid hierarchical (b) Cluster result of agglomerative clustering
method using Kendall’s tau method using the Hellinger distance
Figure 34: Scatter plots showing reduced data points and cluster assignments for stock
data
53
When observing the scatter plots of the fund data, it is difficult to see significant sepa-
ration between the clusters. Additionally, in Figure 33a, the data points that have been
classified as outliers by the clustering algorithm appear to be located very close and
even on top of data points assigned to clusters. Data points in clusters also appear to
be placed quite far apart, and the shapes of the clusters are hard to comprehend. This
task is easier in Figure 33b, since the amount of different clusters is much smaller. Here,
the clusters are more simple to tell apart, but they still appear oddly shaped.
Figure 34a, which displays the stock clustering result partitioned using Kendall’s tau is
also difficult to interpret for similar reasons as Figure 33a. When observing Figure 34b,
it appears that the cluster assignment of the data points are based on distance from the
center of the cluster of data points shown.
t-SNE
The scatter plots shown in this paragraph were acquired by using the t-SNE method
to perform dimension reduction. In contrast to PCA, t-SNE is used to find lower di-
mensional mappings of the data while maintaining the distance between the data point.
In order to find these mappings, the distance matrix used to perform the clustering are
used as input to the t-SNE algorithm in order to obtain the reduced data.
(a) Cluster result of agglomerative clustering (b) Cluster result of hybrid hierarchical
using Kendall’s tau method using the Hellinger distance
Figure 35: Scatter plots showing reduced data points and cluster assignments for
mutual fund data
54
(a) Cluster result of hybrid hierarchical (b) Cluster result of agglomerative clustering
method using Kendall’s tau method using the Hellinger distance
Figure 36: Scatter plots showing reduced data points and cluster assignments for stock
data
When observing the figures showing cluster results of the mutual fund data set, it
is apparent that the scatter plots offer a bit more in terms of interpretability when
the dimension reduction is performed using t-SNE. While the outliers still appear to
be placed in separate clusters in Figure 35a, the clusters in the plot are more easily
discerned and separated. This figure offers the observer an intuition regarding the size
of the clusters, as well as the distance between the clusters themselves. Figure 35b
displays clearly separated, albeit unevenly shaped clusters. Here, an observer can more
easily comprehend the clustering results and see the separation between the clusters.
6.4.2 Cluster evolution plots

The cluster evolution plots shown in this section are created using the method described
in Section 5.2.1 in the paragraph labeled Cluster evolution tracking. The plots show
how the n largest clusters in each time sequence intersect with one another, and how
the clusters evolve as time advances. Here, n is a chosen number of clusters to display
in the plot that is determined by the user. The edges indicate the temporal intersection
between the clusters in the different timestamps, and the nodes represent the clusters
themselves. The numbers that are shown on the nodes show how many data points
that have been partitioned into the cluster, and the number on the edges between the
nodes show how many data points originating in one cluster that are placed in the other
cluster in the next timestamp. Some of the nodes lack edges, and this can happen when
a cluster is created when all of its data points originate in clusters that were not in the
set of n largest clusters in the previous time stamp. If a node has no outgoing edge,
this means that the cluster was likely split into smaller clusters that are not in the n
largest. It is also likely that the sum of outgoing data points from a cluster into the next
timestamp are not equal to the total number of data points in the node. If this happens,
some of the time series in the cluster have been partitioned into smaller clusters, or been
classified as an outlier.
Each timestamp shown in the cluster evolution plots are based on clustering results of
subsequences of the financial time series, each with a total length of 260 weeks. The
start and end dates of each subsequence is displayed on the x-axis of the figures. The
55
mutual fund data set used to create the first two cluster evolution plots contain a total
of 275 financial time series. The stock data is made up of 116 assets.
Mutual fund clustering results
Figure 37: Cluster evolution of agglomerative clustering using Kendall’s tau
Figure 37 shows how the largest clusters are fairly equal in size, apart from the second to
last subsequence where the largest cluster becomes significantly larger than the others.
This appears to be due to the fact that two clusters in the previous subsequence merge
to form this cluster, allowing a fourth smaller cluster with 10 time series to be plotted
as well. In the last subsequence the largest cluster with 97 data points is split in two,
forming the two largest clusters in the final subsequence.
56
Figure 38: Cluster evolution of hybrid hierarchical clustering using the Hellinger
distance
In contrast to the previous figure, Figure 38 shows how the largest cluster is practi-
cally static throughout all subsequences. This cluster is also significantly larger than
all other clusters, containing between 83%− 92% of all data points in the entire data set.
Stock clustering results
Figure 39: Cluster evolution of hybrid hierarchical clustering using Kendall’s tau
The cluster evolution shown in Figure 39 shows how the cluster results are made up of
one large cluster and many smaller clusters in the first three subsequences. It can also be
57
observed however, that approximately half of the time series present in the cluster in the
second sequence have been partitioned into other smaller clusters by the last sequence.
Figure 40: Cluster evolution of agglomerative clustering using the Hellinger distance
Again, the cluster methods that use the Hellinger distance to partition the data seem-
ingly clusters the data into one very large cluster, and a few smaller clusters. The result
in Figure 40 is similarly to the cluster result in Figure 38 fairly static, where the majority
of the time series in the largest cluster stays in this cluster throughout all subsequences.
The cluster evolution plots may not be directly comparable to the the scatter plots pre-
viously shown, since they display how the size of the clusters change over time while the
purpose of the scatter plots are to show the cluster results in one time step. Regardless,
the cluster evolution plots seem to scale better to large data sets and clusters. Since
the data points are represented as numbers that indicate the amount of time series in a
cluster rather than points in a two dimensional space, the plots become less crowded and
easier to comprehend. The evolution plot lacks details regarding the within cluster dis-
tances, but instead displays the relationships between clusters in different subsequences,
might be of interest when performing time series clustering.
6.5 Qualitative analysis of clustering results

In order to evaluate the usefulness of a clustering method in a given use case, it is also
important to study the clusters and the data points that have been assigned to them.
This analyzation requires specific domain knowledge regarding the data that is being
clustered, since domain knowledge can facilitate the explanation of why certain cluster-
ings are reasonable or not. It can also help the detection of outliers, since an individual
with significant knowledge about the data set can find outliers that theoretically should
be similar to data points that have been assigned to clusters. Available meta data of
mutual funds is determined to be more abundant, and slightly easier to interpret than
the descriptive data of stocks. For this reason, only cluster results of the mutual fund
58
data set will be presented here.
The results obtained using the reference sequential clustering algorithm will be compared
to one of the clustering algorithms tested. The chosen method for comparison is the
agglomerative clustering method using Kendall’s tau distance, and the clustered data
set will be the mutual fund set. Only a selection of the clusters created by the different
algorithms are included below in order to maintain brevity.
Table 10: Reference clustering method
Cluster ID Fund region Fund name

1 Sweden SEB SWEDEN EQUITY
1 Sweden SKANDIA SVERIGE EXPONERING
.. .. ..
. . .
1 Nordic SEB NORDENFOND
1 Sweden LANNEBO SVERIGE
2 Global SKANDIA TIME GLOBAL
2 Global AMF AKTIEFOND GLOBAL
.. .. ..
. . .
2 North America AMERIKA TEMA-A1 SEK
2 U.S SKANDIA USA
4 Sweden SPILTAN SMÅBOLAGSFOND
4 Nordic Region AMF AKTIEFOND SMÅBOLAG
.. .. ..
. . .
4 Sweden EVLI SWEDISH SMALL CAP
4 Sweden ÖHMAN SWEDEN MICRO CAP
13 Global LANNEBO TEKNIK
13 Global SEB TEKNOLOGIFOND
.. .. ..
. . .
13 Global SWEDBANK ROBUR TECHNOLOGY
13 Global ÖHMAN GLOBAL GROWTH
23 Sweden SKANDIA REALRÄNTEFONDEN
23 European Region NORDEA SVE REAALIKORKO-GRSEK
23 Global ÖHMAN REALRÄNTEFOND
In the reference clustering result, Swedish equity funds are primarily placed in the cluster
1, with a few Nordic funds added as well. This might indicate that the Nordic equity
funds in this cluster is largely made up of Swedish stocks. Cluster 2 is a global and North
American equity fund cluster. This indicates that the global funds in this cluster contain
primarily North American stocks. Cluster 4 is a Swedish small cap cluster with a few
Nordic small cup funds. It is clear that this clustering algorithm is able to partition small
cap funds separately from the other Swedish equity funds. Cluster 13 is a technology
cluster, containing most of the technology funds in the data set, as well as a few growth
funds. This may be due to the fact that the growth funds present in the cluster have a
significant portion of their placements within companies developing technology. This is
the case for the Öhman Global Growth fund [34], for example. Cluster 23 consists of all
59
inflation linked bond funds. These should all be strongly linked to the inflation, so this
clustering is expected. The total number of singleton clusters in the reference clustering
result is 97.
Table 11: Agglomerative method using Kendall’s tau

1 Sweden SEB SWEDEN EQUITY
1 Sweden SKANDIA SVERIGE EXPONERING
1 European Region ÖHMAN MARKNAD EUROPA
1 Sweden EVLI SWEDISH SMALL CAP
.. .. ..
. . .
1 Nordic SEB NORDENFOND
2 Global SWEDBANK ROBUR TECHNOLOGY
2 Global AMF AKTIEFOND GLOBAL
.. .. ..
. . .
2 North America AMERIKA TEMA-A1 SEK
2 Global LANNEBO TEKNIK
2 U.S SKANDIA USA
7 Nordic SPILTAN HÖGRÄNTEFOND
7 Sweden SPILTAN RÄNTEFOND SVERIGE
.. .. ..
. . .
7 Sweden LANNEBO RÄNTEFOND KORT
7 Sweden STOREBRAND KORTÄNTA
23 Sweden SKANDIA REALRÄNTEFONDEN
23 European Region NORDEA SVE REAALIKORKO-GRSEK
23 Global ÖHMAN REALRÄNTEFOND
The agglomerative clustering result is reminiscent of the reference clustering result, but
the clusters are larger and some clusters that are present in the reference result seems
to have been merged in this clustering result. This clustering method does not partition
Swedish small cap funds separately, and instead places most Swedish equity funds in
cluster 1. The global equity cluster has also been merged with the technology cluster
that is present in the reference clustering algorithm. Cluster 7 consists of a range of
different interest funds. In the reference clustering result, these are partitioned further
into many different clusters, or classified as noise. The total number of singleton clusters
in this clustering result is 19 time series.
As a comparison between the different distance metrics, the clustering result of mutual
funds using the Hellinger distance is also included below:
60
Table 12: Hybrid hierarchical method using the Hellinger distance

1 Sweden SPILTAN SMÅBOLAGSFOND
.. .. ..
. . .
1 U.S S BANK EQUITY - A
1 Finland AKTIA CAPITAL - B
2 Sweden SWEDISH BOND STARS
.. .. ..
. . .
2 Nordic Region ÖHMAN FÖRETAGSOBLIGATIONSFOND
3 Sweden SPILTAN RÄNTEFOND SVERIGE
.. .. ..
. . .
3 Global HANDELSBANKEN KORTRÄNTA
6.5.1 Comparing the clustering results

Compared to the tested agglomerative clustering method, the reference clustering method
results in smaller, more specific clusters. For example, the reference method is able to
partition the Swedish small cap funds into a different cluster than the rest of the Swedish
equity funds. Additionally, it is able to partition the global technology funds into its
own cluster. This approach may lead to a clustering result that is more easily inter-
pretable, since the different fund categories are split into finer clusters containing mostly
one or two types of funds. The fact that the agglomerative method received a better
CLOSE score indicates that this method is more stable over time when compared to
the reference method. The fact that the clusters tend to be larger in this result may be
the cause of this. The reason for this could be that larger clusters capture more general
patterns in the data, and a cluster method producing smaller clusters may capture more
niche patterns specific to fewer funds.
When it comes to the clustering result produced using the Hellinger distance, the results
are vastly different. Here, the funds have been split into three clusters. The first
cluster contain most equity funds in the entire data set. The second cluster contains
all bond funds, and the third cluster contains most interest funds. This indicates that
these different kinds of funds have distinct returns distributions, but no other significant
patterns that the Hellinger distance could help make out.
7 Discussion
7.1 The clustering results
By analyzing the results in section 6.3, it seems that clustering results acquired using
the Kendall’s tau distance between the time series resulted in more and smaller clusters
compared to the results acquired using the Hellinger distance. The Hellinger distance
has seemingly captured the patterns in the data that are specific for the equity funds,
bond funds, and interest funds. Since the Hellinger distance is a measure of similarity
61
between two distributions, the returns distributions may follow a certain pattern for
each of these fund types. By clustering using the Kendall’s tau distance, more distinct
patterns in the data were captured by the clustering algorithms. For example, the clus-
tering method that generated the result visible in table 11 placed Swedish and global
equity funds into separate clusters. For this reason, using Kendall’s tau to partition
mutual funds appears to be a more efficient approach when the goal is portfolio opti-
mization and exploring more intricate patterns in the data set. When it comes to the
stock data set, the clustering methods using Kendall’s tau also seems to create a few
larger clusters and many small. This is likely due to the fact that the data set only
consists of Swedish stocks that all correlate in to some degree.
The filter clustering methods appear to display fairly poor performance when evaluated
using the quantitative metrics in tables 4 and 5. This is another motivator why other
types of clustering are needed in the first place to detect correlated mutual funds and
outliers. The filter clustering methods would not detect that the Nordic fund in cluster
1 in Table 10 is more correlated to the Swedish equity funds than the other Nordic funds
present in another cluster. An investor may then falsely believe that buying this Nordic
fund could increase portfolio diversity if the investor already owns Swedish equity funds.
It is important to note however that the filter clustering methods may perform better
if the data had been partitioned using other properties of the time series. For example,
the results may have been better if the mutual fund filter method only partitioned using
asset type and region instead of also grouping by currency. Making beneficial choices
regarding which properties to use when performing filter clustering requires more in-
tricate knowledge of the time series type in the data set. Using a different clustering
circumvents this choice, and might for this reason be easier to implement.
7.2 Quantitative evaluation methods

7.2.1 Conventional internal validation measures
When comparing the results of different clustering techniques, it becomes apparent that
determining which clustering method that performs best is difficult using only internal
validation metrics. Some clustering methods perform better in regards to one evalua-
tion metric, such as the Silhouette score, while some clustering methods perform better
when looking at the Calinski-Harabasz index. A pattern that can be seen when observ-
ing the internal validation scores for the different methods is that they tend to favor
clustering results that contain small clusters with many outliers. Since these results fail
to capture the underlying structure of the data, this indicates that the internal valida-
tion measures may not be fit for usage in model selection when clustering these data sets.
The internal validation measures used in this project are altered in order to allow the use
of non-Euclidean distances. One reason why these validation metrics do not perform well
during model selection could be that the clustering is performed using distance metrics
that are different than the point-to-point comparison that using the Euclidean distance
would entail. Instead, the Hellinger distance measures the similarity between the return
distributions of two time series in this case, and Kendall’s tau quantifies the correlation
between two time series. It is entirely possible that the internal validation metrics would
62
have been more useful if the time series were compared using the Euclidean distance
instead. Since one of the main purposes of performing the clustering in the first place is
portfolio diversification, measures such as the ones used throughout this project might
be more effective however.
7.2.2 CLOSE
By observing the results in Section 6.1, it becomes apparent that the CLOSE score is
the only cluster validation method applied to the different clustering results that con-
sistently has a maxima for parameters that do not result in trivial clustering solutions.
Trivial clustering solutions in this case is when the clustering method produces as few
clusters as possible, or as many clusters as possible. Additionally, it is the only validation
method applied in this project that takes both the stability over time and the quality
of the clusters into account. For example, a cluster result with a significant amount
of smaller clusters may have a high cluster quality if they are dense. However, these
results may be less stable since a smaller change in distance between the time series is
needed for the cluster assignments to change. Larger clusters on the other hand may be
very stable as time progresses, but are likely to have a lower quality since the distance
between the data points in a larger cluster is likely more significant than the distance
between data points in smaller clusters.
When performing the qualitative analysis of the reference clustering results and results
acquired using the agglomerative clustering method in Section 6.5, the reference method
seems to produce a finer clustering solution compared to the agglomerative. This finer
partitioning of the financial time series is not necessarily the better solution, but it may
be easier to comprehend at first glance. Since the agglomerative clustering solution re-
ceived a higher CLOSE score than the reference method, the temporal stability of this
solution is likely higher for slightly larger clusters while the cluster quality has not begin
to decline significantly. The exploitation term in Equation 39 may also be the reason
why the reference method receives a lower CLOSE score. In the results in Section 6.5,
the number of singleton clusters is significantly higher in the reference result compared
to the result of the agglomerative method. As a result, the score of the reference method
is punished more harshly than the agglomerative method.
Despite the fact that the clusters produced when using the method with the highest
CLOSE score seem to not capture some of the finer patterns in the data sets, it is still
the only quantitative clustering method tested in this thesis that favors clustering so-
lutions where more general patterns have been captured. For example, the clustering
solution in Table 11 fails to partition Swedish small cap funds separately from the other
equity funds but is still able to cluster most Swedish equity funds in one cluster. The
method has also resulted in a cluster consisting exclusively of interest funds.
When observing some of the plots of the CLOSE score during model selection, it can be
seen that the scores are similar for different parameter values. In Figure 1, the CLOSE
score does not differ much between N = 50 and N = 64. Another example is in Figure
17. Here, there is another local maxima when the maximum distance d = 0.16. The
CLOSE score is approximately 0.625 at this maxima, and can be compared to the global
63
maximum value of 0.66. While the score of the clustering result acquired by using a
maximum intra cluster distance of d = 0.16 is worse than the the score when d = 0.24,
this would result in a solution where the clusters are smaller and the number of clusters
is larger. This result would possibly capture the finer patterns in the data, and separate
the large clusters, such as the Swedish equity cluster in Table 11 further.
7.2.3 Stability index

Performing model selection using the stability index proved difficult because of the
seemingly linear increase of the index as the number of clusters in the clustering solu-
tion increased. This means that the optimal number of clusters in the data set when
evaluating using the stability index is the choice of parameter that leads to the fewest
amount of clusters in the result. This indicates that the data set does not contain any
additional structures that would motivate further partitioning. Since the data that is
being clustered is multivariate financial time series, it is deemed unlikely that this is the
case. The reference solution in Table 10 shows that there are groups of financial time
series that correlate strongly enough to partition them together.
The behaviour of the stability index may be caused by the fact that the classifier used to
predict the clustering labels of half of the data set is significantly more successful when
the number of different clustering labels in the result is small. Intuitively this makes
sense, since there are fewer labels that the classifier is able to pick. This effect should
theoretically be offset by the normalizing factor in Equation 43, but it still appears that
solutions with few, large clusters are favored by the stability index. Due to this, the
clustering validation method is may not be appropriate when clustering the data sets
introduced in this thesis. It is not possible at this stage to draw conclusions regarding
any other data sets of financial time series however.
Unlike the CLOSE score, the stability index was not developed specifically for clusters
of time series. Time series have a temporal structure, and multivariate time series in
a data set also have a dependence on each other. This added complexity may require
additional considerations when designing a cluster validation method. It is also possible
that the stability index would have worked well in model selection if the data set would
have consisted of univariate time series, with no dependence on one another at all. This
theory is based on the fact that Roth et al. assume that the data points in the two
splits of the data set are independent [11]. Since the multivariate time series data in the
mutual fund and stock data sets are not independent, this assumption does not hold for
this data type. The validation method was still included though, in order to investigate
its effectiveness on dependent time series data. Overall, it appears that it might be
better to use methods specifically developed for time series and have no assumptions
regarding the independence of the data.
7.3 Cluster visualization methods

7.3.1 Scatter plots
By observing the scatter plots in Section 6.4.1, the plots appear to not be an effective
way of visualizing clusters of the financial time series present in the data sets. Since the
64
returns are used as input to the PCA algorithm, the data vector corresponding to each
time series is high dimensional. Each observation in the time series corresponds to one
individual dimension, meaning that as the number of observations increase in a time
series so does the dimensionality of the data. The data points shown in the scatter plots
in Section 6.4.1 consists of 273 dimensions. It appears that reducing the returns data
to two dimensions removes too much of the information, and the resulting plot becomes
difficult to interpret. Another issue with using PCA to reduce the dimensionality of the
time series data is the fact that according to the stylized facts of financial time series, the
returns data can exhibit some autocorrelation [1]. When performing PCA, it is usually
assumed that the data is independent in time [35]. If this assumption is broken, it may
negatively impact the descriptive ability of PCA. Vanhataloa and Kulahci [36] found
that if the data is autocorrelated, the number of principal components needed to keep a
determined fraction of the variability in the data could increase. This is one explanation
why the two principal components displayed in the scatter plots fail to preserve the
variability and patterns in the original data.
Compared to the scatter plots created using PCA, the plots whose data has been re-
duced using t-SNE appears to show a slightly improved separation between the clusters.
This is especially true in Figure 35b, where three clusters can easily be distinguished.
In Figure 35a, the clusters appear a bit more bunched together. At the same time, all
data points in a cluster are seemingly not plotted in the vicinity of one another. One
example is the two orange data points in the top middle of the figure. Additionally, the
data points determined to be outliers by the clustering algorithm appear to form their
own clusters. The same patterns can be observed in figures 36a and 36b. The clusters
are however slightly more difficult to make out in Figure 36a, since many of the data
points seem to have a fairly even distance to the points next to them. This may be
caused by the fact that the data set consists only of Swedish stocks, and that most of
the Swedish stocks tend to be correlated and influenced similarly by events. The slight
improvement in visualization quality over the scatter plots produced using PCA may
be caused by the fact that t-SNE is nonlinear technique that makes no assumptions
regarding the time dependence of the data. The distances between the data points in
Equation 11 are Euclidean, but one can choose to use other distance metrics as well.
This flexibility allows visualizations to be created based on the distance metrics actually
used when performing the clustering, which seems to have resulted in a slightly better
visualization compared to PCA. Additionally, using Students t-distribution to calcu-
late the lower dimensional representation of the data is more robust to outliers, due to
the heavy tails of the distribution. Longer distances between the time series in the high
dimensional space may transfer better to the lower dimensional mapping because of this.
While these scatter plots may be slightly easier to comprehend, they still provide a fairly
crowded cluster visualization with a limited usage in analysis. Increasing the number
of data points in the data set further would result in even more crowded plots, so the
scaling capabilities of cluster visualization via scatter plots is questioned.
7.3.2 Cluster evolution plots

This method of visualization may provide less information regarding the distances be-
tween the points in each cluster, but at the same time provides a temporal perspective
65
since it shows how the largest clusters evolve over time. Since the method of visualiza-
tion shows the intersection between the clusters in adjacent subsequences, the method is
reminiscent of how the CLOSE score is calculated. A clustering result where the mem-
bers of the clusters are generally clustered together in the next timestamp is considered
a beneficial clustering result since it is consistent over time. In the cluster evolution
plots, this means that the edges from one cluster should be as few and large as possi-
ble since this indicates that most cluster members are transferred together to the next
cluster. On the contrary, if a cluster has many edges connected to many clusters in
the next subsequence, this indicates that the cluster has been split into many smaller
clusters. A cluster result with a higher CLOSE score is likely to display less changes
per subsequence than a clustering result with a lower score. One issue with the cluster
evolution plots as they are displayed in Section 6.4.2 is that the time series are simply
represented as numbers, both in the edges and in the cluster nodes themselves. This
fact makes it impossible to get an intuition regarding which time series tend to follow
one another, or which ones are placed in different clusters after a split of a larger cluster.
However, this method of cluster visualization is considered the most informative out of
the methods explored in this thesis. In the context of time series clustering, temporal
stability is of great interest considering the fact that a cluster method that has been
shown to produce approximately the same results for multiple different sequences of
time may be more reliable for future use as well. The cluster evolution plots provide
an intuition regarding the stability of the method, as well as an idea of the general
size of the clusters in the solution. One additional limitation of the method is that the
user is forced to determine a maximum number of clusters shown to prevent a messy
visualization with edges covering the figure. Another area were the method struggles is
when one or two clusters are significantly larger than the others. An example of this
can be seen in Figure 40, where the smaller edges are barely visible.
7.4 Qualitative analysis using domain knowledge

While evaluation methods such as the quantitative measures presented in this report are
helpful in model selection in order to get a general idea regarding the optimal choice of
parameters, the ultimate decision of which method to use seems to be best decided by
qualitatively analyzing the actual clustering results. In the analysis done in Section 6.5,
the clustering method that received worse scores chosen to be the better method with the
actual use case in mind. The problem with this type of evaluation is that it requires more
intricate knowledge about the data set, and that it is more time demanding than simply
using a score of some kind to evaluate the result. Qualitative analysis of the clustering
results is for this reason best employed when a few candidate clustering methods have
been singled out using other validation methods. In this case, a method calculated to
result in less stable clusters were still selected as the superior method due to its finer
clusters. As for the requirement of domain knowledge in order to perform the analysis
in the first place, this knowledge is likely required to draw meaningful conclusions from
the clustering result anyway.
7.5 Limitations
One limitation and source of error in this project was the lack of access to Swedish
stock data that extended far enough back in time. Due to this, the data set containing
66
Swedish became comparatively small, possibly affecting the clustering and evaluation
outcome adversely. The CLOSE score is even explicitly mentioned to have an increased
sensitivity when the sample size is small [27]. This may explain the consistently lower
CLOSE score of the clustering results of the stock data set. As can be observed in the
cluster evolution plots in Section 6.4.2, the cluster results of the stock data set consisted
of one large cluster and many smaller clusters. This may be caused by the fact that the
difference between the distributions or correlations of the returns of Swedish stocks are
likely not as significant as the differences between different kinds of funds. In a way,
performing clustering of this data set is similar to clustering the Swedish equity cluster
in shown in Table 11. A more equal comparison to the mutual fund data set would
possibly have been clustering stocks from other stock markets than the Swedish as well.
It is likely that the different patterns in the data set would be more obvious, since the
different stocks would have been impacted differently by phenomena such as currency
value fluctuations.
Another limitation of this study is the fact that the cluster validation methods that
utilize the distance matrix between the time series to calculate a score are limited to
only using one type of distance at a time. This becomes an issue when the sequential
clustering method used as a reference in this study is evaluated. Since this clustering
method uses two different distance metrics to perform the clustering initially, the choice
of which distance metric to use for validation must be made. Using only one of the
distance metrics to evaluate the clustering method may not provide a fair evaluation of
the method. Ideally, a measure of cluster quality not dependant on the distance between
the points in the cluster or a way to combine the different distance matrices would be
developed. This would enable a better comparison to the clustering methods that only
use one distance metric to perform the clustering.
The issue of handling data points that are not partitioned into clusters is another factor
that may impact the validation results. These data points are not necessarily outliers.
During model selection, parameters that only create clusters if two data points are ex-
tremely close may consider data points as outliers while another method may consider
them parts of a cluster. Instead of filtering these singleton clusters from the actual
clustering result, it may be a better approach to use some kind of outlier detection
technique to preemptively remove these points from the data set before performing the
clustering. This would ensure that the only points that are removed from the data set
are actual outliers, and not just data points that an algorithm wouldn’t place in a cluster.
Finally, the results that were presented in this report are primarily based on the mutual
fund data set. While the stock data set provided interesting results as well, the inter-
pretability of labeled mutual fund data facilitated the qualitative analysis. If one was
to do the same analysis of the clustering results containing stocks, more effort would
have to go to analyzing the companies themselves. It is important to bear in mind that
the results presented in this report is likely not ubiquitous for all kinds of financial time
series, and conclusions should be drawn carefully.
67
7.6 Future work
First and foremost, it would be of great interest to study new cluster validation meth-
ods that are able to evaluate clustering results created using the sequential clustering
method more fairly. In this report, the Kendall’s tau distance matrix was chosen since
this is the distance metric used in the first of the two clustering rounds that make up
the sequential method. One possibility could be to calculate the quantitative validation
scores separately for each round of clustering, considering only the distance metric that
was used to perform the clustering in the round. The total score would then be the
mean of the scores of each round.
Another interesting area to study further would be additional development of the CLOSE
score. The method performed remarkably well for model selection when clustering fi-
nancial time series, but a version of the method that placed additional focus on cluster
quality rather than stability would be of interest. While stability of the method is im-
portant in order to produce reliable clustering results for different periods of time, this
particular use case focusing on portfolio diversification might call for more attention to
cluster quality.
Applying the validation methods presented in this report to a wider range of financial
time series data sets would be of interest in further studies. By doing so, it would be
possible to draw conclusions regarding a wider range of financial time series, and possi-
bly investigate whether some validation measures work better on one particular kind of
financial asset.
Relating to the visualization of the clusterings, further development of the cluster evo-
lution graph would be of great interest. One feature that might enhance the cluster
evolution plot further could be to select a few data points that differ from one another
in some way, and follow how these move from cluster to cluster as time progresses. One
way to do this would be to draw the cluster nodes as rectangles rather than smaller
circles where the identifiers of the selected time series as well as the total number of
time series in the cluster could be written.
8 Conclusions
Based on the results presented in this thesis, choosing methods of cluster validation
when evaluating clusters of financial time series must be done with intent, and should
be chosen with regard to the data type being clustered, the algorithm used, as well as
the primary use case of the clustering.
The goals of this thesis were to find metrics that could be useful when evaluating cluster-
ings of financial time series. We investigate robustness of clustering methods and ways
to quantify it. Three internal validation scores were used to evaluate the clusterings.
The silhouette score, the Calinski-Harabasz index, and the Davies-Bouldin score. The
results in the model selection part of this thesis indicated that these metrics are not
useful when tuning the parameters of clustering methods used to cluster the data sets
presented in this report. The scores occasionally contradicted each other, but for most
68
of the methods tested the scores suggested that the optimal number of clusters to use
is either the largest or smallest number of clusters tested. This may be caused by the
fact that non-Euclidean distances were used to perform the clustering, and both the
Calinski-Harabasz index as well as the Davies-Bouldin score had to be adapted to use
medioids instead of centroids in the score calculation. It is possible that this adaptation
was not sufficient to successfully validate the clusterings.
Similarly, the stability index was not successful at the task of model selection either.
Here, the index consistently suggested that the optimal parameter choice was the pa-
rameter that resulted in the smallest number of clusters. It was theorized that the issues
with the stability index are caused by the multivariate nature of the time series in the
data sets, since the authors of the method originally assume the independence of the
data points.
The CLOSE score combines the main goals of this thesis, it quantifies both the tem-
poral robustness as well as the quality of the clusters themselves by using the mean
squared error. The method shows how there is a trade off during model selection, since
the quality of the clusters decreases the larger the clusters become, but the temporal
stability increases. By choosing the parameter that results in the highest close score,
one can possibly acquire a clustering result that is both stable and contains clusters of
quality. For this reason, this metric is considered the most successful at quantifying
both robustness and cluster quality out of the quantitative validation metrics tested.
Two different methods of cluster visualization was tested in this thesis, scatter plots
created using dimension reduction and a plot displaying the evolution of clusters over
time. Two methods of dimension reduction was applied to the data, Principal Compo-
nent Analysis, and t-Distributed Stochastic Neighbor Embedding. t-SNE was determined
to be the most effective method to visualize the distances between the time series in the
data set, but the scatter plots were determined to be difficult to interpret as well as not
scalable to very large data sets. Overall, it appears to be a difficult task to reduce high
dimensional data to only two dimensions and still retain sufficient information contained
in the original data.
While the cluster evolution plots provided no information regarding the distance be-
tween the data points in the cluster result, it instead provided a visualization of the
temporal stability quantified by the CLOSE score. Using the cluster evolution plot, one
could observe how the clusters changed over time as well as how the time series moved
from one cluster to another as time progressed. Additionally, the plot allowed the user
to quickly comprehend the size of the clusters as well as the general division of data
points between the clusters as a whole. The cluster evolution plot was determined to
be the most appropriate method of cluster visualization, given the use case and data set.
Finally, the manual method of cluster validation of using domain knowledge was dis-
cussed. It was found that domain knowledge is crucial in cluster validation, since it
allows the user to perform fine adjustments to cluster methods selected by methods
such as the CLOSE score in order to acquire the type of clustering most appropriate for
the use case. It was observed that using the clustering method that received the highest
CLOSE score resulted in clusters that captured more general patterns, but were more
69
temporally stable than the reference method. Again, it seems that there is a trade off
between temporal stability and cluster quality, and in the end it is for the performer of
the clustering to decide what is most important.
Different kinds of evaluation methods need to be applied to each clustering algorithm

in order to make data driven choices when designing a fitting clustering algorithm to
financial time series data. The evaluation methods of choice will also depend largely
on the type of financial time series in the underlying data. Financial time series have
common properties defined by the stylized facts, a financial time series describing the
returns of a fund will behave differently than one describing the returns of a stock. It is
possible that the internal validation measures and the stability index may work well for
other data sets than the ones clustered in this report. It is beneficial to consider both the
temporal stability of a clustering method as well as the quality of the clusters themselves.
The results indicate that the CLOSE score is able to weigh these properties against
one another, and provide a parameter choice that results in clusterings that are both
stable and contain time series with a smaller distance between them. For this reason,
this is considered the primary candidate method to both evaluate the performance and
robustness of a clustering algorithm. The comparison to the reference clustering method
indicates however that the emphasis is placed on cluster stability rather than quality.
The cluster evolution plot presented seems to be an efficient way to visualize the stability
and changes of the cluster solutions over time, but still needs improvement in order to
include some more information regarding the time series and how they move together.
70
9 References
[1] R. Cont. “Empirical properties of asset returns: stylized facts and statistical issues”.
In: Quantitative Finance 1.2 (2001), pp. 223–236.
[2] Kidbrooke. About Us. 2023. url: https://kidbrooke.com/about.
[3] Anton Yeshchenko et al. “Comprehensive process drift detection with visual an-
alytics”. In: Conceptual Modeling: 38th International Conference, ER 2019, Sal-
vador, Brazil, November 4–7, 2019, Proceedings 38. Springer. 2019, pp. 119–135.
[4] Saeed Aghabozorgi, Ali Seyed Shirkhorshidi, and Teh Ying Wah. “Time-series
clustering – A decade review”. In: Information Systems 53 (2015), pp. 16–38.
[5] Odilia Yim and Kylee T Ramdeen. “Hierarchical cluster analysis: comparison of
three linkage measures and application to psychological data”. In: The quantitative
methods for psychology 11.1 (2015), pp. 8–21.
[6] Scikit learn Developers. Agglomerative Clustering. Last used: 2023-02-14. url:
https://scikit-learn.org/stable/modules/generated/sklearn.cluster.
AgglomerativeClustering.html.
[7] Clément L Canonne. “A short note on learning discrete distributions”. In: arXiv
preprint arXiv:2002.11457 (2020).
[8] Sorana-Daniela Bolboaca and Lorentz Jäntschi. “Pearson versus Spearman, Kendall’s
tau correlation analysis on structure-activity relationships of biologic active com-
pounds”. In: Leonardo Journal of Sciences 5.9 (2006), pp. 179–200.
[9] Pauli Virtanen et al. “SciPy 1.0: Fundamental Algorithms for Scientific Computing
in Python”. In: Nature Methods 17 (2020), pp. 261–272.
[10] SciPy Developers. Kendall’s tau. Last used: 2023-02-16. url: https : / / docs .
scipy.org/doc/scipy/reference/generated/scipy.stats.kendalltau.html.
[11] Tilman Lange et al. “Stability-Based Validation of Clustering Solutions”. In: Neural
Comput. 16.6 (2004), 1299–1323.
[12] Francesco Pattarin, Sandra Paterlini, and Tommaso Minerva. “Clustering financial
time series: An application to mutual funds style analysis”. In: Computational
Statistics & Data Analysis 47 (Sept. 2004), pp. 353–372.
[13] David R. Harper. Forces That Move Stock Prices. Last updated: 2022-07. url:
https://www.investopedia.com/articles/basics/04/100804.asp.
[14] Yingfan Wang et al. “Understanding how dimension reduction tools work: an em-
pirical approach to deciphering t-SNE, UMAP, TriMAP, and PaCMAP for data vi-
sualization”. In: The Journal of Machine Learning Research 22.1 (2021), pp. 9129–
9201.
[15] Christopher Bishop. Pattern Recognition and Machine Learning. Springer, 2006.
Chap. 12.1.
[16] Laurens van der Maaten and Geoffrey Hinton. “Visualizing Data using t-SNE”. In:
Journal of Machine Learning Research 9.86 (2008), pp. 2579–2605.
71
[17] F. Pedregosa et al. “Scikit-learn: Machine Learning in Python”. In: Journal of
Machine Learning Research 12 (2011), pp. 2825–2830.
[18] Argimiro Arratia and Alejandra Cabaña. “A Graphical Tool for Describing the
Temporal Evolution of Clusters in Financial Stock Markets”. In: Comput. Econ.
41.2 (2013), 213–231.
[19] Yanchi Liu et al. “Understanding of Internal Clustering Validation Measures”. In:
2010 IEEE International Conference on Data Mining (2010), pp. 911–916.
[20] Peter J. Rousseeuw. “Silhouettes: A graphical aid to the interpretation and vali-
dation of cluster analysis”. In: Journal of Computational and Applied Mathematics
20 (1987), pp. 53–65.
[21] Scikit learn Developers. Silhouette score. Last used: 2023-02-17. url: https://
scikit-learn.org/stable/modules/generated/sklearn.metrics.silhouette_
score.html.
[22] Tadeusz Caliński and Harabasz JA. “A Dendrite Method for Cluster Analysis”. In:
Communications in Statistics - Theory and Methods 3 (Jan. 1974), pp. 1–27.
[23] David Davies and Don Bouldin. “A Cluster Separation Measure”. In: IEEE Trans-
actions on Pattern Analysis and Machine Intelligence PAMI-1 (May 1979), pp. 224
–227.
[24] Friedrich Leisch. “A toolbox for k-centroids cluster analysis”. In: Computational
statistics & data analysis 51.2 (2006), pp. 526–544.
[25] Carmelo Cassisi et al. “Similarity Measures and Dimensionality Reduction Tech-
niques for Time Series Data Mining”. In: Sept. 2012.
[26] Hae-Sang Park and Chi-Hyuck Jun. “A simple and fast algorithm for K-medoids
clustering”. In: Expert Systems with Applications 36.2, Part 2 (2009), pp. 3336–
3341.
[27] Gerhard Klassen, Martha Tatusch, and Stefan Conrad. “Cluster-based stability
evaluation in time series data sets”. In: Applied Intelligence (2022), pp. 1–24.
[28] Mahmoud Mousavi Shiri, Sadegh Bafandeh Imandoust, and Mohammad Bolan-
draftar Pasikhani. “Application of K-Nearest Neighbor (KNN) for Predicting Cor-
porate Financial Distress in Tehran Stock Exchange”. In: Monetary & Financial
Economics 20.6 (2013), pp. 48–66.
[29] B. Efron and R. Tibshirani. “Bootstrap Methods for Standard Errors, Confidence
Intervals, and Other Measures of Statistical Accuracy”. In: Statistical Science 1.1
(1986), pp. 54–75.
[30] SK Lahiri and SN Lahiri. Resampling methods for dependent data. Springer Science
& Business Media, 2003, pp. 25–29.
[31] The Global Industry Classification Standard (GICS). Last used: 2023-05-07. url:
https://www.msci.com/our-solutions/indexes/gics.
[32] Fan Liu and Yong Deng. “Determine the Number of Unknown Targets in Open
World Based on Elbow Method”. In: IEEE Transactions on Fuzzy Systems 29.5
(2021), pp. 986–995.
72
[33] The pandas development team. pandas.DataFrame.boxplot. Feb. 2020. url: https:
//pandas.pydata.org/docs/reference/api/pandas.DataFrame.boxplot.
html.
[34] Öhman Fonder. Öhman Global Growth. Last used: 2023-05-10. url: https : / /
www.ohman.se/fonder/fond/ohman-global-growth/.
[35] Bartolomeu Zamprogno et al. “Principal component analysis with autocorrelated
data”. In: Journal of Statistical Computation and Simulation 90.12 (2020), pp. 2117–
2135.
[36] Erik Vanhatalo and Murat Kulahci. “Impact of autocorrelation on principal com-
ponents and their use in statistical process control”. In: Quality and Reliability
Engineering International 32.4 (2016), pp. 1483–1500.
[37] Fabrice Daniel. “Financial time series data processing for machine learning”. In:
arXiv preprint arXiv:1907.03010 (2019).
[38] Hae-Sang Park and Chi-Hyuck Jun. “A simple and fast algorithm for K-medoids
clustering”. In: Expert Systems with Applications 36.2, Part 2 (2009), pp. 3336–
3341.
73
10 Appendix A
10.1 Internal validation measures of bootstrapped time series
Figure 41: Silhouette score of bootstrapped time series
Figure 42: Calinski-Harabasz index of bootstrapped time series
Figure 43: Davies-Bouldin index of bootstrapped time series
74
Stock data set

75
Stock data set
76

77
Stock data set

78
Stock data set
79
11 Appendix B
11.1 Process description for performing clustering of financial
time series
This section will be provided as a suggestion regarding the process of clustering financial
time series. The suggested course of action is designed using the results and conclusions
of this thesis, and it by no means a definitive guide. For this reason, it is important to
keep the specific properties of the time series data in mind, and remain critical of the
clustering result. As the conclusions of this thesis suggests, it is difficult to quantify
which clustering method that produces the optimal results. A qualitative evaluation
of the clustering results themselves is thus a necessity, once a few candidate clustering
methods have been suggested.
11.1.1 Data inspection

As stated in the main body of this report, it is important to know the data that will be
clustered and become familiar with its properties. For example, the mutual fund data
set used had significant differences to the stock data set apart from its size. Since the
mutual fund data set consisted of a wide variety of funds of different regions, currencies
and categories the data contained many different natural patterns and structures. The
stock data set on the other hand consisted solely of Swedish stocks, and as the results
show the number of patterns and structures in this data set seemed more limited. By
inspecting the data and theorizing what the cluster result could look like, it may be
easier to detect odd clustering results. For example, clustering the mutual funds using
the Hellinger distance quite obviously did not capture the more intricate patterns in
the data, since it essentially only divided by data into clusters of equity funds, interest
funds, and bond funds.
As is normally done when performing many kinds of machine learning, the data is nor-
malized. In the case of financial time series, this can be done by calculating the returns
of the price series of each asset/instrument [37]. In order to ensure that the mean of the
return series is zero, each observation in the entire return series is subtracted with the
mean value.
80
11.1.2 Model selection
This part of the clustering process is essential in order to find the clustering method that
performs optimally when clustering the financial time series in the data set. I recom-
mend selecting a few different clustering algorithms in order to get a nuanced comparison
between the choices available. The clustering algorithms tested in this thesis seemingly
performed adequately on the mutual funds and stocks, it is however recommended to
explore the options available. One clustering method that might be of interest in further
experiments is K-medioids[38]. This method is similar to K-means clustering, but uses
medioids of the clusters instead of centroids. As discussed in Section 5.2.2 in the para-
graph labeled Modifications for non-Euclidean distances between time series,
this facilitates the use of non-Euclidean distance measures. The distance measures that
will be tested when performing model selection will also need to be determined in this
stage. The choice will depend heavily on the use case and the purpose of performing
the clustering in the first place. The results in this thesis indicate however that using
the Kendall’s tau distance as described in Section 3.3.2 results in finer, more numerous
clusters compared to the results when using the Hellinger distance. In this particular
case, Kendall’s was determined to be able to capture more complex patterns in the data,
and was thus the more informative option.
Once the clustering methods that will be used as candidates have been selected, tuning
the parameters for optimal performance when clustering the data set is performed. As
shown in the results section of this report, this can be done by using some quantita-
tive score that outputs a graph that can be interpreted in order to find the optimal
parameter value. For financial time series, the only metric that seemed to provide easily
interpretable results were the CLOSE score. For this reason, this is the metric that is
recommended to use in order to ensure that the clustering method that receives the
highest score outputs partitions that are both stable in time as well as contains clusters
of quality. In general, one of the most significant conclusions of this thesis is that the
evaluation metrics used needs to be tailored according to both the data and the use
case. For this reason, further exploring validation methods specifically made for clusters
of time series is recommended.
Once a few promising clustering method candidates have been singled out, further vali-
dation can be done in order to select the method that is most useful for the particular
use case at hand. Here, cluster visualization methods such as the cluster evolution plot
described in Section 5.2.1 can prove useful since they provide an overview of the results
without showing too much detail. The results of this thesis also indicate that perform-
ing a qualitative analysis of the clustering results produced by the different clustering
methods can give insight into which method produces the most useful results. As dis-
cussed in this thesis, this approach requires considerably more domain knowledge than
simply applying a quantitative score of some kind to the clustering results. This method
of evaluation is important however, since the usefulness of a clustering method is dif-
ficult to quantify will depend on the use case specified by the performer of the clustering.
Once a clustering method has been selected and is being used, it is also important to
consistently reevaluate the performance of the algorithm. Due to the nature of financial
time series, a clustering method that produced useful results a few months ago may
81
perform worse now. This is one of the reasons why the CLOSE score is useful, since
it effectively quantifies how stable over time a clustering method has been in the past.
This might not be a guarantee that the method will remain stable in the future however,
and a regular reevaluation is appropriate in order to ensure optimal performance.
82

Ts Art 1

Uploaded by

Document Information

Original Description:

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Ts Art 1

Uploaded by

Copyright:

Available Formats

Upps al a univ ersitets l ogotyp

Johan Millberg Civilingenj örspr ogrammet i tek nisk fysik

Civilingenjörsprogrammet i teknisk fysik

Evaluating clustering techniques in financial time series

Handledare: Erik Brodin Ämnesgranskare: Antônio Horta Ribeiro

Finansiella tidsserier är en typ av data som beskriver exempelvis hur exempelvis en

Experimenten genomfördes genom att utvärdera olika klustringsresultat med de tidi-

Den mest informativa visualiseringsmetoden kallas för klusterevolutions-diagram. Detta

1. Determine metrics for evaluating how well a clustering algorithm performs.

2. Investigate robustness of clustering results of financial time series

There exists a multitude of different clustering algorithms, and an infinite amount of

3.1.1 Stylized facts of financial time series

A description of each stylized fact as specified by Cont follows:

• Returns of financial time series display a significant amount of irregularity. This

• Estimators of volatility display a positive autocorrelation spanning multiple days.

3.2.1 Clustering algorithms

3.3 Measures of similarity

3.3.1 Hellinger distance

3.3.2 Kendall’s tau

Kendall’s tau is bounded to be in the range −1 ≤ τB ≤ 1, where a value of -1 indicates

3.4 Robustness and stability

4.1 Mutual funds

Table 1: Mutual fund meta data

Table 2: Meta data of the stocks used

4.3 Data frequency and length

5.2 Evaluation methods

5.2.1 Qualitative evaluation

PCA is a dimension reduction, and is generally defined as the orthogonal projection of

−||xi −xj ||2

The perplexity is defined as:

Perp(Pi ) = 2H(Pi ) , (12)

(1 + ||yi − yj ||2 )−1

Cluster evolution tracking

Clustering evaluation using domain knowledge

5.2.2 Internal clustering evaluation

b(i) = min d(i, C). (18)

BGSS = tr(Bk ), (24)

S(X1 , X2 , ..., Xm ) ≥ 0 (26)

While keeping the properties of S in mind, Si is defined to be the dispersion of cluster C

Modifications for non-Euclidean distances between time series

5.2.3 Measures of stability in time

∩t {Cti ,a , Ctj ,b } = {Tl |oti ,l ∈ Cti ,a ∧ otj ,l ∈ Ctj ,b }. (34)

The over-time stability of a cluster Ctk ,i is the calculated as:

5.2.4 Replication stability evaluation

1{ϕ(Xi′ ) ̸= Yi′ } = 1 if the prediction ϕ(Xi′ ) ̸= Yi′ . If the prediction is correct, it is 0.

S(ξ) := EX,X ′ dSk (ϕ(X), Y ′ ). (41)

The parameter K is chosen to be K = 1, and this selection is based on the accuracy of

More formally, let a sequence of stationary random variables be defined as Xn =

5.3.3 Pre-evaluation processing of cluster results

Sequential method with known efficacy for clustering fund data

5.3.5 Clustering experiments

• How easily a user can discern the clustered groups

Clustering method Distance metric(s)

6.1.1 Agglomerative clustering using Kendall’s tau

Figure 1: CLOSE score of clustering results of mutual funds

Figure 2: CLOSE score of clustering results of bootstrapped mutual funds

Figure 4: Internal validation scores of clustering results of mutual funds

Figure 5: CLOSE score of clustering results of stocks

Figure 6: CLOSE score of clustering results of bootstrapped stocks

Figure 7: Stability index of clustering results of stocks

6.1.2 Agglomerative clustering using the Hellinger distance

Figure 9: CLOSE score of clustering results of mutual funds