net/publication/305822636
All content following this page was uploaded by Pieter De Koninck on 04 August 2016.
KU Leuven
Research Centre for Management Informatics
Faculty of Economics and Business
Naamsestraat 69, B-3000 Leuven, Belgium
pieter.dekoninck@kuleuven.be
jochen.deweerdt@kuleuven.be
1 Introduction
Trace clustering is the partitioning of process instances into different groups,
called trace clusters, based on their similarity. A wide variety of trace clustering
techniques have been proposed, differentiated by their clustering methods and
biases. Table 1 contains an overview of these techniques, the data representation
used, the optimization method and the clustering bias. Two main categories of
trace clustering techniques exist: those that map traces onto a vector space model
or quantify the similarity between two traces directly, and those that take the
quality of the underlying process models into account [8],[10]. The driving force
behind these proposed techniques is the observation that real-life event logs are
often quite complex and contain a large degree of variation. Since these event
logs are usually the basis for further analysis like process model discovery or
compliance checking [17], partitioning dissimilar process instances into separate
trace clusters is deemed appropriate.
3.1 Approach
In this section, the algorithmic structure of our approach is described. The
algorithm consists of three phases, each of which is discussed here.
Input. Our algorithm requires six inputs. The first is a grouped event log,
in which all process instances that are similar are grouped together in a single
distinct process instance, or dpi. Whenever the algorithm handles a trace, it
is handling all process instances that belong to that dpi. Furthermore, this
grouped event log is ordered by frequency, which means that the most frequent
dpis are treated first.
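The grouping and ordering of the event log can be illustrated with a minimal Python sketch. It assumes that "similar" means an identical sequence of activities, and the function name `group_event_log` is hypothetical rather than taken from our implementation.

```python
from collections import Counter

def group_event_log(traces):
    """Group identical traces into distinct process instances (dpis).

    Assumption: two process instances belong to the same dpi when their
    activity sequences are identical.
    Returns (dpi, frequency) pairs, most frequent dpi first.
    """
    counts = Counter(tuple(trace) for trace in traces)
    # most_common() sorts by decreasing frequency.
    return counts.most_common()

log = [
    ["a", "b", "c"],
    ["a", "c", "b"],
    ["a", "b", "c"],
    ["a", "b", "c"],
    ["a", "c", "b"],
    ["a", "d"],
]
grouped = group_event_log(log)
```

Here `grouped` holds three dpis with frequencies 3, 2 and 1, so the algorithm would treat the trace a, b, c first.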
Secondly, a desired number of clusters is required. If one is uncertain about
an appropriate number of clusters, the approach of [6] can be used.
The next input is a list of metrics, ordered in decreasing importance. The
algorithm is inspired by the approach proposed in [8], so it is conceived with
process model quality metrics in mind. This means that they are evaluated
based on a process model that is discovered for each cluster.
Furthermore, two thresholds are needed. The first is a cluster value threshold
cvt, which is the minimal quality expected from the discovered process model per
cluster and is used while assigning traces to clusters. It is called the cluster
value threshold because the value of the quality metric is calculated based on
all the traces that are in the cluster at that time. This is what separates it
from the trace value threshold tvt, which is based on the same discovered process
model, but whose value is calculated solely on the trace or traces under test.
This difference can be important when the metric under scrutiny is a
replay-related metric, such as a fitness or precision metric.
When the algorithm attempts to add traces to a cluster, it checks the value
of the quality metric on the process model discovered from this cluster. If the
trace metric value tmv or the cluster metric value cmv is below its respective
threshold, the trace is dubbed unassignable. Depending on the final input,
SeparateBoolean, these unassignable traces are either grouped into a separate
cluster (if SeparateBoolean is true) or added to the best possible cluster
otherwise.
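The difference between tmv and cmv can be made concrete with a small Python sketch. The discovery function and the fitness metric below are toy stand-ins chosen for illustration only (they are assumptions, not Heuristics Miner and not an actual replay metric), and the name `is_assignable` is hypothetical.

```python
from collections import Counter

def discover_model(traces, min_support=0.5):
    """Toy stand-in for a discovery algorithm (assumption): keep the
    directly-follows pairs that occur in at least `min_support` of the traces."""
    pair_counts = Counter()
    for trace in traces:
        pair_counts.update(set(zip(trace, trace[1:])))
    return {p for p, n in pair_counts.items() if n / len(traces) >= min_support}

def fitness(model, traces):
    """Toy replay-style metric (assumption): share of directly-follows
    steps in `traces` that the model allows."""
    steps = [pair for trace in traces for pair in zip(trace, trace[1:])]
    if not steps:
        return 1.0
    return sum(pair in model for pair in steps) / len(steps)

def is_assignable(cluster, trace, tvt, cvt):
    """Check both thresholds for adding `trace` to `cluster`."""
    candidate = cluster + [trace]
    model = discover_model(candidate)
    tmv = fitness(model, [trace])    # evaluated on the tested trace only
    cmv = fitness(model, candidate)  # evaluated on the whole cluster
    return tmv >= tvt and cmv >= cvt, tmv, cmv

cluster = [("a", "b", "c")] * 3
ok, tmv, cmv = is_assignable(cluster, ("a", "c", "b"), tvt=0.5, cvt=0.5)
```

In this example the cluster as a whole still replays well on the discovered model, so cmv meets cvt, but the tested trace itself does not fit, so tmv falls below tvt and the trace would be dubbed unassignable.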
Initialization. The first phase of the algorithm is the initialization phase.
Since our approach is centroid-based, each cluster needs to be initialized. This
is done as follows: a random trace is taken from the top 10% of the event log in
terms of frequency. Then, a process model PM is discovered from this trace using
Heuristics Miner [19]. Using this process model, the trace and cluster metric values
(tmv, cmv) are calculated using the most important metric. Observe that at this
stage, tmv and cmv are the same thing, since the cluster only contains this trace
we are testing. If this trace satisfies the desired quality captured in the thresholds
(tvt and cvt), it is added to the cluster, removed from the log and the next cluster
can be initialized. If it does not satisfy the threshold, the search continues. An
iteration counter is included to prevent non-terminating behaviour, which could
occur if the thresholds have been set unrealistically high.
Trace assignment. In the following phase, traces are assigned to a cluster,
or included in the list of unassignable traces U. Since the log is ordered from
most frequent to least frequent, the traces will also be assigned in this order. Two
variables are used to store information about the clustering process: multiple is a
boolean that denotes whether the highest value has been reached for more than
one cluster and Check[c] is an array of booleans which are true if cluster c still
needs to be checked in future iterations.
The algorithm then starts to evaluate the results of the first metric in the
hierarchy on each of the clusters. For each cluster, four situations are
possible: (1) the trace metric value or the cluster metric value does not meet
its threshold, in which case this cluster is currently not assignable and
calculations end here for this cluster and metric; (2) both thresholds are met
and either the cluster metric value is higher than the current best, or it is
equal to the current best and the trace metric value is higher than the current
best: in that case, the current cluster is temporarily the best one, and none of
the previous ones have to be checked in potential following rounds for this
trace; (3) the thresholds are met, but both values are exactly equal to the
current best: this cluster will have to be checked in a following round, and
there are now multiple clusters that are optimal, so multiple becomes true; (4)
the thresholds are met, but the values are lower than those of the current best
solution: the cluster does not have to be checked again in a potential next
round, so Check[c] is set to false.
After a metric is checked, the trace is assigned to a cluster and removed
from the log, if there were no ties found. If there is a tie, then all clusters for
which Check[c] is still true (i.e. the tied clusters) are evaluated again on the
next metric, until a unique solution is found, or all metrics have been evaluated,
in which case the trace is added to the cluster with the lowest cluster number
for which Check[c] is still true. If the loop ends without a cluster to assign the
trace to because it did not pass the thresholds, the trace is added to the set of
unassignable traces U.
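The assignment of a single trace can be sketched in Python as follows. This is one possible reading of the description above, under stated assumptions: the thresholds are re-checked in every round, the model is rediscovered from the cluster together with the tested trace, and the multiple flag is derived from the number of clusters for which Check[c] is still true. The names `choose_cluster`, `discover` and the lookup-table metrics in the example are hypothetical.

```python
def choose_cluster(trace, clusters, metrics, discover, tvt, cvt):
    """Return the index of the cluster `trace` should join, or None if unassignable.

    metrics  : callables metric(model, traces) -> float, most important first
    discover : callable(traces) -> model (hypothetical stand-in for the miner)
    """
    if not metrics:
        raise ValueError("at least one metric is required")
    check = [True] * len(clusters)  # Check[c]: is cluster c still a candidate?
    for metric in metrics:
        best = None  # (cmv, tmv) of the current best cluster
        for c, cluster in enumerate(clusters):
            if not check[c]:
                continue
            candidate = cluster + [trace]
            model = discover(candidate)
            value = (metric(model, candidate), metric(model, [trace]))
            cmv, tmv = value
            if tmv < tvt or cmv < cvt:          # situation 1: threshold not met
                check[c] = False
            elif best is None or value > best:  # situation 2: new best
                for previous in range(c):
                    check[previous] = False
                best = value
            elif value < best:                  # situation 4: worse than best
                check[c] = False
            # otherwise situation 3: exact tie, the cluster stays a candidate
        tied = [c for c, keep in enumerate(check) if keep]
        if not tied:
            return None     # nothing passed the thresholds: unassignable
        if len(tied) == 1:
            return tied[0]  # unique best cluster found
    return tied[0]          # still tied after all metrics: lowest cluster number

# Toy example (assumption): the "model" is the seed trace of the cluster and
# each metric is a lookup table keyed by that seed.
clusters = [[("x",)], [("y",)], [("z",)]]
first = {("x",): 0.8, ("y",): 0.9, ("z",): 0.9}
second = {("x",): 1.0, ("y",): 0.6, ("z",): 0.7}
discover = lambda traces: traces[0]
metrics = [lambda model, traces: first[model],
           lambda model, traces: second[model]]
chosen = choose_cluster(("new",), clusters, metrics, discover, tvt=0.5, cvt=0.5)
```

Comparing the pair (cmv, tmv) as a tuple reproduces the rule that the cluster metric value decides first and the trace metric value only breaks ties. In the toy example the second and third cluster tie on the first metric, and the second metric then selects the third cluster.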
Unassignable resolution. In a final phase the unassignable traces either
get added to an extra, separate cluster (if SeparateBoolean is true), or they get
assigned to the best possible existing cluster, according to a similar strategy as
utilized in phase 2. There are two main differences: the thresholds are no longer
checked (since these traces are the unassignable ones), and the process model
used for calculating the values is not rediscovered each time a trace is added.
Rather, the process models are discovered using only the ‘not-unassignable’
traces, to prevent bias.
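A minimal Python sketch of this final phase is given below. It assumes that the cluster metric value is computed on the assigned traces of a cluster plus the tested trace, against a model that is discovered once per cluster, and the names `resolve_unassignable`, `discover` and `coverage` are hypothetical.

```python
def resolve_unassignable(clusters, unassignable, metrics, discover, separate):
    """Place the unassignable traces, either in an extra cluster or in the
    best existing cluster; thresholds are no longer checked."""
    if separate:
        return [list(c) for c in clusters] + [list(unassignable)]
    # Models are discovered once, from the assignable traces only.
    models = [discover(cluster) for cluster in clusters]
    result = [list(cluster) for cluster in clusters]
    for trace in unassignable:
        def score(c):
            # One (cmv, tmv) pair per metric, most important metric first.
            return tuple(
                (m(models[c], clusters[c] + [trace]), m(models[c], [trace]))
                for m in metrics
            )
        # max() keeps the first maximum, i.e. the lowest cluster number on ties.
        best = max(range(len(clusters)), key=score)
        result[best].append(trace)
    return result

# Toy stand-ins (assumptions): the model is the set of activities seen in a
# cluster, the metric is the share of activities covered by that set.
discover = lambda traces: {a for t in traces for a in t}

def coverage(model, traces):
    acts = [a for t in traces for a in t]
    return sum(a in model for a in acts) / len(acts)

clusters = [[("a", "b")], [("a", "c")]]
leftover = [("a", "c", "d")]
merged = resolve_unassignable(clusters, leftover, [coverage], discover, separate=False)
split = resolve_unassignable(clusters, leftover, [coverage], discover, separate=True)
```

Because no threshold can eliminate a cluster in this phase, comparing the full tuple of metric values lexicographically gives the same result as evaluating the metrics round by round.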
Phase 1: Initialization
c := 0                                  % Cluster counter
it := 0                                 % Iteration counter
CS := [{}, {}, ..., {}]                 % Empty cluster set: array of nb lists of traces
while (c < nb) ∧ (it < |L|) do
    t := getRandomTrace(L)              % Get random trace from the top 10% of log in terms of frequency
    PM := HM(CS[c] ∪ t)                 % Mine a process model from cluster
    tmv := getMetricValue(M(0), PM, t)  % Get result of metric on just this trace
    cmv := getMetricValue(M(0), PM, CS[c] ∪ t)  % Get result of metric on entire cluster
    if (tmv ≥ tvt) ∧ (cmv ≥ cvt) then
        CS[c] := CS[c] ∪ t              % Add trace to cluster
        L := L \ t                      % Remove trace from log
        c := c + 1                      % Increment cluster counter
    end if
    it := it + 1                        % Increment iteration counter to prevent non-terminating behaviour
end while
if it ≥ |L| then
    return [{}, {}, ..., {}]            % In case of failed initialization: return empty cluster solution
end if
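For illustration, the initialization phase can be sketched in Python. This is a sketch under stated assumptions rather than our implementation: the iteration budget is fixed to the initial log size, failure is reported when fewer than nb clusters were seeded, and `discover` and `metric` are hypothetical callables standing in for Heuristics Miner and the most important metric M(0).

```python
import math
import random

def initialize_clusters(log, nb, discover, metric, tvt, cvt, rng=None):
    """Seed `nb` clusters with one trace each.

    log : list of traces ordered from most to least frequent
    Returns (clusters, remaining_log); all clusters are empty on failure.
    """
    rng = rng or random.Random()
    remaining = list(log)
    clusters = [[] for _ in range(nb)]
    budget = len(remaining)  # iteration cap to prevent non-terminating behaviour
    c = it = 0
    while c < nb and it < budget and remaining:
        top = max(1, math.ceil(0.1 * len(remaining)))  # top 10% by frequency
        t = remaining[rng.randrange(top)]
        candidate = clusters[c] + [t]
        model = discover(candidate)
        tmv = metric(model, [t])        # metric on just this trace
        cmv = metric(model, candidate)  # metric on the entire cluster
        if tmv >= tvt and cmv >= cvt:
            clusters[c].append(t)
            remaining.remove(t)
            c += 1
        it += 1
    if c < nb:  # failed initialization: return the empty cluster solution
        return [[] for _ in range(nb)], list(log)
    return clusters, remaining

# Toy stand-ins (assumptions): the model is the set of traces it was mined
# from, and the metric is the share of traces the model contains.
toy_discover = lambda traces: set(traces)
toy_metric = lambda model, traces: sum(t in model for t in traces) / len(traces)

log = [("a", str(i)) for i in range(20)]  # 20 distinct traces
clusters, remaining = initialize_clusters(
    log, 3, toy_discover, toy_metric, tvt=1.0, cvt=1.0, rng=random.Random(0)
)
```

With 20 distinct traces, each seed is drawn from the two most frequent traces still in the log, so the three seeds always come from the first four traces.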
4 Experimental evaluation
The main purpose of this short evaluation is to show the general applicability
of our technique, and its distinctness from existing approaches. Therefore, our
approach is tested on multiple real-life datasets and compared with a variety of
other trace clustering techniques.
Four real-life event logs [7] are subjected to our approach. The number of pro-
cess instances, distinct process instances, number of distinct events and average
number of events per process instance are listed in Table 2. The starting point
is that applying process mining methods such as process discovery techniques
on the entire event log leads to undesirable results [7].
With regard to trace clustering techniques, we have calculated results using
six different methods. Three are ‘process-model aware’ clustering techniques:
our new ActMO; ActFreq, a sequential version of ActiTrac that does not use an
MRA-based window; and ActMRA [8], a version that does. Furthermore, the
comparison is made with three ‘instance-level similarity’ methods (MRA [3];
GED [4]; and K-gram [16]).² Each of these techniques is evaluated at 4 clusters
and at 8 clusters.
Table 2: Characteristics of the real-life event logs used for the evaluation:
number of process instances (#PI), distinct process instances (#DPI), number
of different events (#EV) and average number of events per process instance
(#EV/PI).
Log name    #PI    #DPI    #EV    #EV/PI
For the comparison of the results, the Normalized Mutual Information (NMI,
[13]) shared by two clustering solutions is used. It is a measure of the extent to
which two clusterings contain the same information, conceptually defined as the
extent to which two process instances are clustered together in both clusterings.
This NMI is averaged across the four event logs, where $L$ is a set of event logs
and $P^a$ and $P^b$ are two clustering solutions, with each NMI term computed
on event log $l$:
$$\mathrm{aNMI}(P^a, P^b) = \frac{1}{|L|} \sum_{l \in L} \mathrm{NMI}(P^a, P^b) \qquad (1)$$
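Equation (1) can be sketched in Python using only the standard library. The normalisation of the mutual information by the arithmetic mean of the two entropies is an assumption made for this sketch, since other normalisations exist, and the function names `nmi` and `average_nmi` are hypothetical.

```python
import math
from collections import Counter

def nmi(labels_a, labels_b):
    """Normalized mutual information between two clusterings of the same
    process instances, given as parallel lists of cluster labels.
    Assumption: NMI = 2 * I(a; b) / (H(a) + H(b))."""
    n = len(labels_a)
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))

    def entropy(counts):
        return -sum((k / n) * math.log(k / n) for k in counts.values())

    h_a, h_b = entropy(count_a), entropy(count_b)
    mutual = sum(
        (k / n) * math.log(k * n / (count_a[a] * count_b[b]))
        for (a, b), k in joint.items()
    )
    if h_a + h_b == 0:
        return 1.0  # both clusterings consist of a single cluster
    return 2 * mutual / (h_a + h_b)

def average_nmi(solutions_a, solutions_b):
    """Average the NMI over all event logs, as in Equation (1).
    Each argument maps a log name to the list of cluster labels for that log."""
    logs = list(solutions_a)
    return sum(nmi(solutions_a[l], solutions_b[l]) for l in logs) / len(logs)

# Two hypothetical logs: identical partitions on the first, independent
# partitions on the second.
p_a = {"log1": [0, 0, 1, 1], "log2": [0, 0, 1, 1]}
p_b = {"log1": [1, 1, 0, 0], "log2": [0, 1, 0, 1]}
score = average_nmi(p_a, p_b)
```

The first log contributes an NMI of 1 because the two solutions differ only in how the clusters are numbered, and the second contributes 0, so the average lies halfway between the two.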
² The second and third methods are implemented in the ProM framework for
process mining in the ActiTrac plugin. The latter three methods can be found in
the GuideTree-Miner plugin.
The results are included in Figures 1 and 2. From these figures, it is clear that
our technique leads to unique clustering results, given its low average similarity
with the clustering results obtained using other techniques. Its uniqueness is
higher at 4 clusters than at 8 clusters. Observe, however, that the similarity
between clustering solutions is higher in general at 8 clusters. GED, MRA and
ActMRA appear to lead to quite similar solutions.
5 Conclusion
In this paper, a novel approach was presented for the multi-objective clustering
of traces. Furthermore, an extensive overview of existing approaches, possible
objectives and possible alternative optimization techniques was given. In a
short evaluation, the results of our technique were shown to be distinct from
those of current trace clustering techniques. Two main elaborations should be
made in future work. On the one hand, the approach presented here should be
compared to approaches that include the possible enhancements described in
Section 3.3: distinct initialization strategies, window-based approaches and
the utilization of other discovery techniques. On the other hand, these
approaches should be thoroughly evaluated based on their runtime, scalability,
and the obtained results in terms of their objectives.
References
1. Appice, A., Malerba, D.: A co-training strategy for multiple view clustering in
process mining. IEEE Transactions on Services Computing PP(99), 1–1 (2015)
2. Basu, S., Banerjee, A., Mooney, R.: Semi-supervised clustering by seeding. In:
Proceedings of the 19th International Conference on Machine Learning (ICML-2002)
(2002)
3. Bose, R.P.J.C., Van Der Aalst, W.M.P.: Trace clustering based on conserved pat-
terns: Towards achieving better process models. In: Lect. Notes Bus. Inf. Process.
vol. 43 LNBIP, pp. 170–181 (2010)
4. Bose, R.P.J.C., Van Der Aalst, W.M.P.: Context Aware Trace Clustering: Towards
Improving Process Mining Results. In: SDM. pp. 401–412 (2009)
5. Buijs, J.C.A.M., van Dongen, B.F., van der Aalst, W.M.P.: Business Process Man-
agement Workshops: BPM 2013 International Workshops, Beijing, China, August
26, 2013, Revised Papers, chap. Discovering and Navigating a Collection of Pro-
cess Models Using Multiple Quality Dimensions, pp. 3–14. Springer International
Publishing, Cham (2014)
6. De Koninck, P., De Weerdt, J.: Determining the number of trace clusters: A
stability-based approach. In: ATAED16. vol. Accepted, forthcoming (2016)
7. De Weerdt, J., De Backer, M., Vanthienen, J., Baesens, B.: A multi-dimensional
quality assessment of state-of-the-art process discovery algorithms using real-life
event logs. Inf. Syst. 37(7), 654–676 (2012)
8. De Weerdt, J., Vanden Broucke, S., Vanthienen, J., Baesens, B.: Active trace clus-
tering for improved process discovery. IEEE Trans. Knowl. Data Eng. 25(12), 2708–
2720 (2013)
9. Delias, P., Doumpos, M., Grigoroudis, E., Manolitzas, P., Matsatsinis, N.: Sup-
porting healthcare management decisions via robust clustering of event logs.
Knowledge-Based Syst. 84, 203–213 (2015)
10. Ekanayake, C.C., Dumas, M., García-Bañuelos, L., La Rosa, M.: Slice, mine and
dice: Complexity-aware automated discovery of business process models. Lect.
Notes Comput. Sci. (including Subser. Lect. Notes Artif. Intell. Lect. Notes Bioin-
formatics) 8094 LNCS, 49–64 (2013)
11. Ferreira, D., Zacarias, M., Malheiros, M., Ferreira, P.: Approaching Process Mining
with Sequence Clustering: Experiments and Findings. LNCS 4714, 360–374 (2007)
12. Folino, F., Greco, G., Guzzo, A., Pontieri, L.: Editorial: Mining Usage Scenarios
in Business Processes: Outlier-aware Discovery and Run-time Prediction. Data
Knowl. Eng. (2011)
13. Fred, A., Lourenço, A.: Cluster ensemble methods: from single clusterings to com-
bined solutions. In: Supervised and unsupervised ensemble methods and their ap-
plications, pp. 3–30. Springer (2008)
14. Greco, G., Guzzo, A., Pontieri, L., Saccà, D.: Discovering expressive process models
by clustering log traces. IEEE Trans. Knowl. Data Eng. 18(8), 1010–1027 (2006)
15. Alves de Medeiros, A.K.: Genetic process mining (2006)
16. Song, M., Günther, C., van der Aalst, W.M.: Trace Clustering in Business Process
Mining. In: Bus. Process Manag. Work. vol. 17, pp. 109–120. Springer (2009)
17. Van der Aalst, W., Adriansyah, A., Van Dongen, B.: Replaying history on process
models for conformance checking and performance analysis. Wiley Interdiscip. Rev.
Data Min. Knowl. Discov. 2(2), 182–192 (2012)
18. Wagstaff, K., Cardie, C., Rogers, S., Schroedl, S.: Constrained k-means clustering
with background knowledge. In: ICML. pp. 577–584. Morgan Kaufmann (2001)
19. Weijters, A., van Der Aalst, W.M., De Medeiros, A.A.: Process mining with the
heuristics miner-algorithm. Technische Universiteit Eindhoven, Tech. Rep. WP
166, 1–34 (2006)