Conference Paper · September 2016
DOI: 10.1007/978-3-319-58457-7_4


Multi-objective Trace Clustering: Finding More
Balanced Solutions

Pieter De Koninck and Jochen De Weerdt

KU Leuven
Research Centre for Management Informatics
Faculty of Economics and Business
Naamsestraat 69, B-3000 Leuven, Belgium
pieter.dekoninck@kuleuven.be
jochen.deweerdt@kuleuven.be

Abstract. In recent years, a multitude of techniques has been proposed for the task of clustering traces. In general, these techniques optimize their solution based either on a certain type of similarity between the traces, such as the number of insertions and deletions needed to transform one trace into another; on a mapping of the traces onto a vector space model built from certain patterns in each trace; or on the quality of a process model discovered from each cluster. Currently, the main technique of the latter category, ActiTraC, constructs its clusters based on a single objective: fitness. However, a typical view in process discovery is that one needs to balance fitness, generalization, precision and simplicity. Therefore, a multi-objective approach to trace clustering is deemed more appropriate. In this paper, a thorough overview of current trace clustering techniques and potential approaches for multi-objective trace clustering is given. Furthermore, a multi-objective trace clustering technique is proposed. Our solution is shown to provide unique results on a number of real-life event logs, validating its usefulness.

Keywords: trace clustering, process mining, process model quality, multi-objective learning

1 Introduction
Trace clustering is the partitioning of process instances into different groups,
called trace clusters, based on their similarity. A wide variety of trace clustering
techniques have been proposed, differentiated by their clustering methods and
biases. Table 1 contains an overview of these techniques, the data representation
used, the optimization method and the clustering bias. Two main categories of
trace clustering techniques exist: those that map traces onto a vector space model
or quantify the similarity between two traces directly, and those that take the
quality of the underlying process models into account [8],[10]. The driving force
behind these proposed techniques is the observation that real-life event logs are
often quite complex and contain a large degree of variation. Since these event
logs are usually the basis for further analysis like process model discovery or
compliance checking [17], partitioning dissimilar process instances into separate
trace clusters is deemed appropriate.

Table 1: Available trace clustering techniques and their characteristics

Author | Data Representation | Clustering Technique | Clustering Bias
Greco et al. [14] | propositional | k-means | instance similarity: alphabet k-grams
Song et al. [16] | propositional | various | instance similarity: profiles
Ferreira et al. [11] | event log | first-order Markov mixture model by EM | maximum likelihood
Bose and van der Aalst [4] | event log | hierarchical clustering | instance similarity: string edit distance
Bose and van der Aalst [3] | propositional | hierarchical clustering | instance similarity: conserved patterns
Folino et al. [12] | event log | enhanced Markov cluster model | maximum likelihood
De Weerdt et al. [8] | event log | model-driven clustering | combined process model fitness (ICS)
Ekanayake et al. [10] | propositional | complexity-aware clustering + repository | instance similarity, complexity
Delias et al. [9] | event log | spectral | robust instance similarity

Most of the techniques in this table differentiate themselves from previous
work in one or more of three different dimensions. Some techniques are unique in
a data representation sense: what do they consider as input for a trace clustering?
Over time, this has converged to event logs: a well-known grouping of process
instances, where each instance is a sequence of activities or events. Secondly,
most techniques propose a traditional or unique approach for clustering the
traces: this can be based on hierarchical clustering such as in [4] or model-driven
such as in [8]. The main distinction between these techniques, however, is the
clustering bias, or objective they propose. An example could be the instance
similarity metric based on conserved patterns from [3].
The interest in this paper lies mainly in the third differentiator, the cluster-
ing bias. Typically, clustering approaches consider a single objective. There can
be situations, however, in which one is interested in combinations of these objec-
tives, in which case one would want to deploy a multi-objective trace clustering
approach.
Several different strategies are possible for dealing with these multiple objec-
tives, both on the level of the algorithmic clustering technique as on the level of
the clustering bias. Therefore, the main objective of this paper is to provide an
outlook on multi-objective trace clustering. This outlook contains an elaboration
of potential objectives, solution strategies and evaluations.
In light of this objective, this paper is structured as follows: in a first section,
an overview of potential objectives for trace clustering is detailed. In a second
section, we provide a new trace clustering technique, ActiTraC-MO¹, which is
used as a hook for elaborating directions for future improvement and comparison
to the existing trace clustering field. Finally, a small demonstration is included
to highlight the distinctiveness of our technique.

2 Multi-objective trace clustering

2.1 Objectives for trace clustering

When it comes to trace clustering, several possible objectives exist. Four of
them are envisioned and described here. A first objective is to optimize process
model quality, as proposed by [8]. In that publication, trace clusters are con-
structed based on the Improved Continuous Semantics [15], a fitness measure
for heuristic nets. This is done by mining a process model for each cluster, and
evaluating the fitness of this discovered process model. Rather than purely op-
timizing based on fitness, process discovery techniques typically should make a
trade-off between fitness, precision, generalization and simplicity [17] [5]. Fitness
is defined as the extent to which a process model is able to replay the behaviour
in the event log that was used to discover the model. A precise model is a model
that does not underfit: it should not allow too much behaviour that is unre-
lated to the behaviour in the event log. Generalization is related to overfitting:
a process model should allow behaviour that is not present in the event log, but
likely given the behaviour in the log. Finally, simplicity is conceptually related
to Occam’s razor: given two models that score equal on fitness, precision and
generalization, the simpler model of the two is preferred.
A second objective is denoted as similarity, and is related to the traditional clustering objective: maximizing intra-cluster similarity and inter-cluster dissimilarity. This objective is usually conveyed using instance similarity metrics in trace clustering. Two subgroups of these metrics exist: on the one hand, approaches where the similarity between two instances is calculated directly, for example by counting the number of insertions and deletions one would need to transform one process instance into another (Levenshtein edit distance, [4]); on the other hand, approaches where the process instances are mapped onto a vector space model based on the activities they contain, or patterns in these traces [14],[16],[3]. This vector space model can then be clustered in a more traditional way, using a k-means or hierarchical clustering approach.
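As an illustration of the first subgroup, the Levenshtein edit distance between two traces can be computed in a few lines. This is a minimal sketch, not the implementation used by any of the cited techniques; traces are assumed to be plain lists of activity labels, and `edit_distance` is our own helper name:

```python
def edit_distance(trace_a, trace_b):
    """Number of insertions, deletions and substitutions needed to
    turn one activity sequence into the other (Levenshtein)."""
    m, n = len(trace_a), len(trace_b)
    # prev[j] holds the distance between trace_a[:i-1] and trace_b[:j]
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if trace_a[i - 1] == trace_b[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution
        prev = curr
    return prev[n]

print(edit_distance(["register", "check", "pay"],
                    ["register", "pay", "archive"]))  # → 2
```

The resulting pairwise distances can then feed a hierarchical clustering, as in [4].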
¹ The approach is implemented as a ProM 6 plugin, which can be found at http://www.processmining.be/multiobjective.
A third objective relates to the justifiability of a clustering solution, and is related to the expectations of a domain expert. Intuitively, the end user of the clustering may be more interested in trace clustering solutions that are justifiable given his or her expert knowledge. To our knowledge, no techniques in the trace clustering domain incorporate expert knowledge directly yet. Potential methods for doing so could be found in constrained clustering [18], if the expert has perceptions about cases that must be grouped together or cannot be clustered together, or in semi-supervised clustering [2], where the clustering solution is initialized based on a small set of labeled instances obtained from the expert.
A final objective one might want to optimize is the stability of a clustering
solution. If the event log is likely to contain errors or to be incomplete, one might
prefer a clustering solution that is resistant to this and produces more reliable
results. In the trace clustering context, stability has been used to determine an
appropriate number of clusters [6].

2.2 Combining multiple objectives


Apart from deciding which objectives are relevant to a certain clustering or event
log, a decision needs to be made on how to combine these different objectives.
Three distinct approaches are envisaged: one based on a hierarchy between the
objectives, one based on a weighting of the objectives, and one based on consen-
sus clustering.
The first possibility is based on a hierarchy between the different objectives:
given that one objective is more important than the others, the first objective could
be strictly optimized, and the other objectives can then be used as tie-breakers.
An example can be found in the discovered process model quality metrics: gen-
erally, one wants to make a trade-off between fitness, generalization, precision
and simplicity. However, one is often only interested in the simplest model if the
other quality measures are not affected (Occam’s razor). Therefore, one could
treat simplicity as having a lower hierarchical role. If no such hierarchy is present,
multi-objective learning techniques usually search for several Pareto-optimal so-
lutions, i.e. solutions for which no other solution can be found that scores better
on one of the objectives without scoring worse on any of the other objectives.
Typically, multiple solutions will be Pareto-efficient, therefore these algorithms
usually return a set of solutions. An example of Pareto-efficient process discovery
can be found in [5].
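The Pareto filter described above can be sketched in a few lines. This is our own minimal sketch, not the method of [5]: candidate clustering solutions are assumed to be represented as tuples of objective scores, with higher values being better on every objective:

```python
def pareto_front(solutions):
    """Return the Pareto-efficient solutions: those for which no other
    solution scores at least as well on every objective and strictly
    better on at least one (all objectives are maximized)."""
    front = []
    for s in solutions:
        dominated = any(
            all(o >= v for o, v in zip(other, s)) and other != s
            for other in solutions
        )
        if not dominated:
            front.append(s)
    return front

# Hypothetical clusterings scored as (fitness, precision) pairs
candidates = [(0.9, 0.6), (0.8, 0.8), (0.7, 0.7), (0.95, 0.5)]
print(pareto_front(candidates))  # (0.7, 0.7) is dominated by (0.8, 0.8)
```

As the sketch shows, several solutions typically survive the filter, which is why such algorithms return a set of solutions rather than a single one.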
A second approach is to use a weighted objective function. The different objectives receive a weight, and this composed objective can then be used in a single-objective clustering technique. Several drawbacks exist to this approach, the main one being that determining the weights can be non-trivial, especially if one prefers to work with unnormalized objectives [5]. A specific version of this type
of multi-dimensional trace clustering is possible when one only uses features
that can be mapped onto a vector space model, such as in [3]. By simply in-
cluding more features into the vector model, one is effectively including multiple
objectives with an equal weighting.
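A minimal sketch of such a weighted objective is given below. The metric values, weights and function name are illustrative, and we assume all objectives are normalized to [0, 1] so that the weights are comparable:

```python
def weighted_score(metrics, weights):
    """Collapse several normalised quality metrics into a single scalar
    objective that a single-objective clustering technique can optimize."""
    assert abs(sum(weights) - 1.0) < 1e-9, "weights should sum to one"
    return sum(w * m for w, m in zip(weights, metrics))

# Hypothetical scores: fitness 0.9, precision 0.6, simplicity 0.8,
# weighted 0.5 / 0.3 / 0.2 to reflect the analyst's priorities
score = weighted_score([0.9, 0.6, 0.8], [0.5, 0.3, 0.2])
```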
A third route could stem from co-clustering, which is an approach where
cluster solutions are created for each objective separately, and these solutions are
then combined to result in a final clustering. This combination is typically based
on the frequency with which two traces appear in the same cluster across the
different solutions: the more often they appear together, the higher the likelihood
that they should be clustered together. An example of co-clustering applied in the
process mining domain can be found in [1]. The main drawback of this approach
is that there is no guarantee about the objective results on the combined solution:
while separate clusterings might obtain very decent scores for each of the criteria,
the co-clustered solution will not necessarily maintain this high quality, especially
if the objectives are process model quality metrics. Another drawback is that
combining cluster ensembles may not be an easy task, especially for a low number
of objectives. This can be seen as follows: consider a situation where, in a two-
metric solution, clustering 1 puts trace A and trace B in the same cluster while
clustering 2 puts them in separate clusters. How will the consensus be reached?
As before, a resolution could come from hierarchies or weightings.
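The co-occurrence frequencies that drive this consensus step can be collected in a co-association matrix. This is a minimal sketch under the assumption that each input clustering is given as a list of cluster labels over the same set of traces:

```python
from itertools import combinations

def co_association(clusterings):
    """Fraction of the input clusterings in which each pair of traces
    ends up in the same cluster (higher = cluster them together)."""
    n = len(clusterings[0])
    co = [[0.0] * n for _ in range(n)]
    for labels in clusterings:
        for i, j in combinations(range(n), 2):
            if labels[i] == labels[j]:
                co[i][j] += 1
                co[j][i] += 1
    k = len(clusterings)
    return [[v / k for v in row] for row in co]

# Two clusterings of four traces, e.g. from two different objectives
pairs = co_association([[0, 0, 1, 1], [0, 1, 1, 1]])
print(pairs[2][3], pairs[0][1])  # → 1.0 0.5
```

A consensus method would then, for instance, cluster the traces whose co-association exceeds some cut-off.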

3 A multi-objective trace clustering technique

3.1 Approach

As a first step towards acceptable multi-objective trace clustering, we propose Algorithm 1. It is built as an adaptation of ActiTraC [8], a sequential algorithm that constructs clusters based on a process quality metric, namely fitness. It is sequential since it constructs clusters one by one, adding all traces for which a certain quality threshold is met before creating a new cluster. The approach proposed here, however, works in parallel rather than sequentially, meaning that multiple clusters are constructed at the same time, adding traces to whichever cluster would lead to the best score on a certain quality metric. Furthermore, the approach is hierarchical in the sense that a hierarchy between the cluster metrics is presumed: only if the best score is reached for multiple clusters is the next metric inspected to determine which of these clusters should be chosen. Although the approach can be adapted to be valid for any metric, it has been conceived with two specific metrics in mind: the weighted F-score metric, which balances recall and precision [7]; and the place/transition connection degree (P/T-CD) [8], which is a measure for the simplicity of the model. Thus, it combines two approaches from Section 2.1, the weighting approach and the hierarchical approach. Observe that the hierarchical approach makes sense in this case, as simplicity is often treated as an objective of less principal interest than the other process quality dimensions, such as fitness, precision or generalization: given two solutions with equal precision, for example, one prefers the simpler one (Occam's razor), but given two solutions with different precision, one would typically prefer the more precise model.
3.2 Algorithm

In this section, the algorithmic structure of our approach is described. The al-
gorithm consists of three phases, each of which is discussed here.

Input. Firstly, our algorithm requires six inputs. The first is a grouped event log: all process instances that are similar are grouped together into a single distinct process instance or dpi. Whenever the algorithm handles a trace, it is handling all traces that belong to this group of distinct process instances. Furthermore, this grouped event log is ordered, which means that the most frequent dpi's will be treated first.
Secondly, a desired number of clusters is required. If one is uncertain about
an appropriate number of clusters, the approach of [6] can be used.
The next input is a list of metrics, ordered in decreasing importance. The
algorithm is inspired by the approach proposed in [8], so it is conceived with
process model quality metrics in mind. This means that they are evaluated
based on a process model that is discovered for each cluster.
Furthermore, two thresholds are needed. The first is a cluster value threshold cvt, the minimal quality expected from the discovered process model per cluster; it is used while assigning traces to clusters. It is called the cluster value threshold because the value of the quality metric is calculated based on all the traces that are in the cluster at that time. This separates it from the trace value threshold tvt, which is also based on the same discovered process model, but whose value is calculated solely based on the trace or traces that one is testing. This difference can be important when the metric under scrutiny is a replay-related metric, such as a fitness or precision metric.
When the algorithm attempts to add traces to a cluster, it checks the value of the quality metric on the process model discovered from this cluster. If the trace metric value tmv or the cluster metric value cmv is below its respective threshold, the trace is dubbed unassignable. Depending on the final input SeparateBoolean, these unassignable traces are either grouped into a separate cluster (if SeparateBoolean is true) or added to the best possible cluster.
Initialization. The first phase of the algorithm is the initialization phase. Since our approach is centroid-based, each cluster needs to be initialized. This is done as follows: a random trace is taken from the top 10% of the event log in terms of frequency. Then, a process model PM is discovered from this trace using Heuristics Miner [19]. Using this process model, the trace and cluster metric values (tmv, cmv) are calculated using the most important metric. Observe that at this stage, tmv and cmv coincide, since the cluster contains only the trace we are testing. If this trace satisfies the desired quality captured in the thresholds (tvt and cvt), it is added to the cluster, removed from the log, and the next cluster can be initialized. If it does not satisfy the thresholds, the search continues. An iteration counter is included to prevent non-terminating behaviour, which could occur if the thresholds have been set unrealistically high.
Trace assignment. In the following phase, traces are assigned to a cluster,
or included in the list of unassignable traces U. Since the log is ordered from
most frequent to least frequent, the traces will also be assigned in this order. Two
variables are used to store information about the clustering process: multiple is a boolean that denotes whether the highest value has been reached for more than one cluster, and Check is an array of booleans, where Check[c] is true if cluster c still needs to be checked in future iterations.
The algorithm then starts to evaluate the results of the first hierarchical metric on each of the clusters. For each cluster, four situations are possible: (1) the results of the trace metric value or cluster metric value are not above the threshold, in which case this cluster is currently not assignable and calculations
end here for this cluster and metric; (2) its results are above both thresholds and
the cluster metric value is either higher than the current best or equal to the
current best and the trace metric value is higher than the current best: in that
case, the current cluster is temporarily the best one, and all the previous ones
will not have to be checked in potential following rounds for this trace; (3) if the
thresholds are met, but both values are exactly equal to the current best, this
cluster will have to be checked in a following round, and there are now multiple
clusters that are optimal, so multiple becomes true; (4) the thresholds are met,
but the values are lower than the currently best solution. The cluster does not
have to be checked again in a potential next round, so Check[c] is set to false.
After a metric is checked, the trace is assigned to a cluster and removed
from the log, if there were no ties found. If there is a tie, then all clusters for
which Check[c] is still true (i.e. the tied clusters) are evaluated again on the
next metric, until a unique solution is found, or all metrics have been evaluated,
in which case the trace is added to the cluster with the lowest cluster number
for which Check[c] is still true. If the loop ends without a cluster to assign the
trace to because it did not pass the thresholds, the trace is added to the set of
unassignable traces U .
Unassignable resolution. In a final phase, the unassignable traces either get added to an extra, separate cluster (if SeparateBoolean is true), or they get assigned to the best possible existing cluster, according to a similar strategy as utilized in phase 2. There are two main differences: the thresholds are no longer
checked (since these traces are the unassignable ones), and the process model
used for calculating the values is not rediscovered each time a trace is added.
Rather, the process models are discovered using only the ‘not-unassignable’
traces, to prevent bias.

3.3 Possible enhancements


In this section, an overview is given of possible future enhancements to our
algorithm. They are divided into two interrelated categories: improvements to
the initialization process and window-based extensions.
Algorithm 1 Multi-objective Trace Clustering
Input: L := grouped and ordered event log; nb := number of clusters; M := list of metrics, ordered by decreasing importance; cvt := cluster value threshold; tvt := trace value threshold; SeparateBoolean := true if unassignable traces should be grouped in a separate cluster
Output: CS := a set of clusters

Phase 1: Initialization
1: c := 0 % Cluster counter
2: it := 0 % Iteration counter
3: CS := [{}, {}, ..., {}] % Empty cluster set: array of nb lists of traces
4: while (c < nb) ∧ (it < |L|) do
5: t := getRandomTrace(L) % Get a random trace from the top 10% of the log in terms of frequency
6: PM := HM(CS[c] ∪ t) % Mine a process model from the cluster with Heuristics Miner
7: tmv := getMetricValue(M(0), PM, t) % Result of the most important metric on just this trace
8: cmv := getMetricValue(M(0), PM, CS[c] ∪ t) % Result of the metric on the entire cluster
9: if (tmv ≥ tvt) ∧ (cmv ≥ cvt) then
10: CS[c] := CS[c] ∪ t % Add trace to cluster
11: L := L \ t % Remove trace from log
12: c := c + 1 % Increment cluster counter
13: end if
14: it := it + 1 % Increment iteration counter to prevent non-terminating behaviour
15: end while
16: if it ≥ |L| then
17: return [{}, {}, ..., {}] % Failed initialization: return an empty cluster solution
18: end if

Phase 2: Trace assignment
19: U := {} % Set of unassignable traces
20: for t ∈ L do % Loop over the distinct traces
21: bestCluster := −1 % Temporary value for assignment
22: multiple := false % True if multiple clusters share the best scores
23: Check[] := [true, true, ..., true] % One boolean per cluster: should we still check it?
24:
25: for m ∈ M do % Loop over the metrics
26: bestCMV := −1; bestTMV := −1 % Temporary values for optimization
27: for c := 0; c < nb; c := c + 1 do % Inspect each possible cluster
28: if ¬Check[c] then Continue % Skip clusters ruled out earlier
29: end if
30: PM := HM(CS[c] ∪ t) % Mine a process model
31: tmv := getMetricValue(m, PM, t) % Result of the metric on just this trace
32: cmv := getMetricValue(m, PM, CS[c] ∪ t) % Result of the metric on the entire cluster
33: if (tmv ≥ tvt) ∧ (cmv ≥ cvt) then % Check thresholds
34: if cmv > bestCMV ∨ (cmv = bestCMV ∧ tmv > bestTMV) then
35: bestCMV := cmv; bestTMV := tmv; bestCluster := c
36: Check[i] := false ∀i < c % Temporary best cluster: no need to check previous clusters in a potential next round
37: else if (cmv = bestCMV) ∧ (tmv = bestTMV) then
38: multiple := true % Tie detected
39: else Check[c] := false % Worse than the current best: don't check again
40: end if
41: end if
42: end for
43: if ¬multiple ∧ bestCluster ≥ 0 then
44: Break % A unique best cluster was found: no further metrics needed
45: end if
46: end for
47: if bestCluster ≥ 0 then
48: CS[bestCluster] := CS[bestCluster] ∪ t % Add trace to cluster
49: L := L \ t % Remove trace from log
50: Continue % Proceed to the next trace
51: else
52: U := U ∪ t % Add trace to unassignable set
53: L := L \ t % Remove trace from log
54: end if
55: end for

Phase 3: Unassignable resolution
56: if SeparateBoolean then
57: CS[nb + 1] := U % Add the remaining traces to a new, separate cluster
58: else
59: Add each trace in U to a cluster in CS using the same procedure as in phase 2, without checking the thresholds anymore. Furthermore, the trace and cluster metric values are now calculated without rediscovering a process model each time.
60: end if
61: return CS

Considering that our algorithm constructs clusters in parallel, an initialization phase is needed. Currently, this is done by randomly selecting a trace from the top 10% of the log in terms of frequency that is confirmed to have a quality above the imposed thresholds. Intuitively, it seems sensible to start off with a solution that represents a reasonable amount of behaviour (high frequency), and a
sufficiently elevated result on the main imposed quality metric. However, other
initialization strategies are possible. Firstly, one could argue that rather than
initializing from the most frequent traces, one might want to take seeds from the
least frequent traces, since it may be likely that these seeds will then be more
separated than when they are taken from the selection of most frequent traces.
Another alternative could come from instance similarity metrics: one could map
traces onto a vector space model, e.g. using conserved patterns [3], and make
sure that these seeds are sufficiently distant from each other in this space. In such
a case, extra objectives (namely the features used to create the vector space) are
incorporated indirectly through the seeding process, hence rendering the tech-
nique extra ‘multi-objective’. Finally, one could leverage yet another category
of objectives as they were described in Section 2.1, expert knowledge. Instead
of randomly selecting the seeds, an expert could select several traces which are
deemed distinctive enough to be included in separate clusters. This is expected
to raise the objective of attaining a justifiable clustering solution.
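The distance-based seeding alternative can be sketched with a simple farthest-point heuristic. This is our own sketch, not part of the proposed algorithm; `vectors` is a hypothetical mapping from trace identifiers to their vector-space representations:

```python
def distant_seeds(vectors, nb):
    """Greedy farthest-point seeding: start from an arbitrary trace and
    repeatedly pick the trace farthest from the seeds chosen so far."""
    def dist(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v)) ** 0.5

    ids = list(vectors)
    seeds = [ids[0]]
    while len(seeds) < nb:
        # For each candidate, take its distance to the nearest existing
        # seed, and pick the candidate maximizing that distance.
        best = max((t for t in ids if t not in seeds),
                   key=lambda t: min(dist(vectors[t], vectors[s]) for s in seeds))
        seeds.append(best)
    return seeds

print(distant_seeds({"t1": (0, 0), "t2": (10, 0), "t3": (0.5, 0)}, 2))  # → ['t1', 't2']
```

This guarantees that the chosen seeds are mutually well separated in the chosen vector space.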
A second category of enhancements is related to so-called window-based
extensions. Similar to the approach in [8], an extra sub-phase could be added
to phase 2, the trace assignment phase. The main purpose of such a window is
to increase the scalability of an algorithm. This window is defined as a group
of traces with two possibilities: on the one hand, it can be the top p% of traces
in terms of frequency, on the other hand, it can consist of the p% traces that
are closest to the current trace in terms of distance in a vector space model,
like a model created with the Maximal Repeat Alphabet [3]. It could be used as follows: each time a trace is added to a cluster (line 49 in Algorithm 1), all traces in the window are checked based on their trace metric value, on a process model that is discovered using a log with only the traces that are already in the cluster. If their trace metric value exceeds the trace value threshold (for the most important metric in the hierarchy), they are then added to the cluster as well. Clearly, this greedier assignment step can increase the performance of an algorithm, although a trade-off in terms of quality might be present. If the window is based on a metric not included in the main evaluation objectives, such as the MRA, then it could potentially increase the validity of the trace clustering solution as well.
Apart from implementing a window-based approach or adapting the initialization, one could also look into different process discovery techniques for the mining of a process model from each cluster. Currently, Heuristics Miner is used for this, a technique that mines heuristic nets, which are converted to Petri nets if needed for the calculation of certain quality metrics. The main advantages of Heuristics Miner are its scalability and its relative robustness to noise. An alternative possibility would be to use a process discovery technique that is itself multi-dimensional in nature, such as the one proposed in [5]. That technique even returns a collection of process models, so a strategy for selecting the appropriate process model from this collection, or for averaging quality across it, would then be needed as well.

4 Experimental evaluation
The main purpose of this short evaluation is to show the general applicability
of our technique, and its distinctness from existing approaches. Therefore, our
approach is tested on multiple real-life datasets and compared with a variety of
other trace clustering techniques.
Four real-life event logs [7] are subjected to our approach. The number of pro-
cess instances, distinct process instances, number of distinct events and average
number of events per process instance are listed in Table 2. The starting point
is that applying process mining methods such as process discovery techniques
on the entire event log leads to undesirable results [7].
With regards to trace clustering techniques, we have calculated results using six different methods. Three methods are based on 'process-model aware' clustering techniques: our new ActMO; ActFreq, a sequential version of ActiTraC that does not use an MRA-based window; and ActMRA [8], a version that does. Furthermore, the comparison is made with three 'instance-level similarity' methods (MRA [3], GED [4] and K-gram [16]).² Each of these techniques is evaluated at a number of clusters of 4 and a number of clusters of 8.

Table 2: Characteristics of the real-life event logs used for the evaluation: number of process instances (#PI), distinct process instances (#DPI), number of different events (#EV) and average number of events per process instance (#EV/PI).

Log name | #PI | #DPI | #EV | #EV/PI
MOA | 2004 | 71 | 49 | 6.20
ICP | 6407 | 155 | 18 | 5.99
MCRM | 956 | 212 | 22 | 11.73
KIM | 1541 | 251 | 18 | 5.62

For the comparison of the results, the Normalized Mutual Information (NMI, [13]) shared by two clustering solutions is used. It is a measure for the extent to which two clusterings contain the same information, conceptually defined as the extent to which two process instances are clustered together in both clusterings. This NMI is averaged across the four event logs, where L is the set of event logs and P^a and P^b are two clustering solutions:

aNMI(P^a, P^b) = (1/|L|) Σ_{l∈L} NMI(P^a_l, P^b_l)    (1)
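Equation (1) can be computed with a small amount of Python. This is a sketch under the assumptions that each clustering is given as a list of cluster labels per log and that MI is normalized by the arithmetic mean of the two entropies (other normalizations of NMI exist):

```python
from math import log
from collections import Counter

def nmi(a, b):
    """Normalized mutual information between two clusterings given as
    equal-length label lists over the same process instances."""
    n = len(a)
    pa, pb = Counter(a), Counter(b)
    pab = Counter(zip(a, b))
    mi = sum(c / n * log((c / n) / ((pa[x] / n) * (pb[y] / n)))
             for (x, y), c in pab.items())
    entropy = lambda p: -sum(c / n * log(c / n) for c in p.values())
    denom = (entropy(pa) + entropy(pb)) / 2
    return mi / denom if denom > 0 else 1.0

def anmi(sols_a, sols_b):
    """Average NMI across the per-log clustering pairs, as in Eq. (1)."""
    return sum(nmi(x, y) for x, y in zip(sols_a, sols_b)) / len(sols_a)
```

Note that NMI is invariant to relabelling: two partitions that group the instances identically score 1.0 even if the cluster numbers differ.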
² The second and third methods are implemented in the ProM framework for process mining in the ActiTraC plugin. The latter three methods can be found in the GuideTree-Miner plugin.
The results are included in Figures 1 and 2. From these figures, it is clear that our technique leads to unique clustering results, given its low average similarity with the clustering results obtained using other techniques. Its uniqueness is higher at 4 clusters than at 8 clusters. Observe, however, that the similarity between clustering solutions is higher in general at 8 clusters. GED, MRA and ActMRA appear to lead to quite similar solutions.

Fig. 1: aNMI at 4 clusters. Fig. 2: aNMI at 8 clusters.

5 Conclusion

In this paper, a novel approach was presented for the multi-objective clustering of traces. Furthermore, an extensive overview of existing approaches, possible objectives and possible alternative optimization techniques was given. In a short evaluation, the results of our technique were shown to be distinct from those of current trace clustering techniques. Two main elaborations should be made in future work. On the one hand, the approach presented here should be compared to approaches that include the possible enhancements described in Section 3.3, namely distinct initialization strategies, window-based approaches and the utilization of other discovery techniques. On the other hand, these approaches should be thoroughly evaluated based on their runtime, scalability, and the obtained results in terms of their objectives.

References

1. Appice, A., Malerba, D.: A co-training strategy for multiple view clustering in
process mining. IEEE Transactions on Services Computing PP(99), 1–1 (2015)
2. Basu, S., Banerjee, A., Mooney, R.: Semi-supervised clustering by seeding. In: Proceedings of the 19th International Conference on Machine Learning (ICML-2002) (2002)
3. Bose, R.P.J.C., Van Der Aalst, W.M.P.: Trace clustering based on conserved pat-
terns: Towards achieving better process models. In: Lect. Notes Bus. Inf. Process.
vol. 43 LNBIP, pp. 170–181 (2010)
4. Bose, R.P.J.C., van der Aalst, W.M.P.: Context aware trace clustering: Towards improving process mining results. In: Proceedings of the SIAM International Conference on Data Mining (SDM). pp. 401–412 (2009)
5. Buijs, J.C.A.M., van Dongen, B.F., van der Aalst, W.M.P.: Business Process Man-
agement Workshops: BPM 2013 International Workshops, Beijing, China, August
26, 2013, Revised Papers, chap. Discovering and Navigating a Collection of Pro-
cess Models Using Multiple Quality Dimensions, pp. 3–14. Springer International
Publishing, Cham (2014)
6. De Koninck, P., De Weerdt, J.: Determining the number of trace clusters: A
stability-based approach. In: ATAED16. vol. Accepted, forthcoming (2016)
7. De Weerdt, J., De Backer, M., Vanthienen, J., Baesens, B.: A multi-dimensional
quality assessment of state-of-the-art process discovery algorithms using real-life
event logs. Inf. Syst. 37(7), 654–676 (2012)
8. De Weerdt, J., Vanden Broucke, S., Vanthienen, J., Baesens, B.: Active trace clus-
tering for improved process discovery. IEEE Trans. Knowl. Data Eng. 25(12), 2708–
2720 (2013)
9. Delias, P., Doumpos, M., Grigoroudis, E., Manolitzas, P., Matsatsinis, N.: Sup-
porting healthcare management decisions via robust clustering of event logs.
Knowledge-Based Syst. 84, 203–213 (2015)
10. Ekanayake, C.C., Dumas, M., Garcı́a-Bañuelos, L., La Rosa, M.: Slice, mine and
dice: Complexity-aware automated discovery of business process models. Lect.
Notes Comput. Sci. (including Subser. Lect. Notes Artif. Intell. Lect. Notes Bioin-
formatics) 8094 LNCS, 49–64 (2013)
11. Ferreira, D., Zacarias, M., Malheiros, M., Ferreira, P.: Approaching Process Mining
with Sequence Clustering: Experiments and Findings. LNCS 4714, 360–374 (2007)
12. Folino, F., Greco, G., Guzzo, A., Pontieri, L.: Mining usage scenarios in business processes: Outlier-aware discovery and run-time prediction. Data Knowl. Eng. (2011)
13. Fred, A., Lourenço, A.: Cluster ensemble methods: from single clusterings to com-
bined solutions. In: Supervised and unsupervised ensemble methods and their ap-
plications, pp. 3–30. Springer (2008)
14. Greco, G., Guzzo, A., Pontieri, L., Saccà, D.: Discovering expressive process models
by clustering log traces. IEEE Trans. Knowl. Data Eng. 18(8), 1010–1027 (2006)
15. Alves de Medeiros, A.K.: Genetic process mining. PhD thesis, Technische Universiteit Eindhoven (2006)
16. Song, M., Günther, C., van der Aalst, W.M.: Trace Clustering in Business Process
Mining. In: Bus. Process Manag. Work. vol. 17, pp. 109–120. Springer (2009)
17. Van der Aalst, W., Adriansyah, A., Van Dongen, B.: Replaying history on process
models for conformance checking and performance analysis. Wiley Interdiscip. Rev.
Data Min. Knowl. Discov. 2(2), 182–192 (2012)
18. Wagstaff, K., Cardie, C., Rogers, S., Schroedl, S.: Constrained k-means clustering with background knowledge. In: ICML. pp. 577–584. Morgan Kaufmann (2001)
19. Weijters, A., van Der Aalst, W.M., De Medeiros, A.A.: Process mining with the
heuristics miner-algorithm. Technische Universiteit Eindhoven, Tech. Rep. WP
166, 1–34 (2006)
