You are on page 1of 8

Wanna Improve Process Mining Results?

It’s High Time We Consider Data Quality Issues Seriously


R.P. Jagadeesh Chandra Bose Ronny S. Mans Wil M.P. van der Aalst
Eindhoven University of Technology Eindhoven University of Technology Eindhoven University of Technology
The Netherlands 5600 MB The Netherlands 5600 MB The Netherlands 5600 MB
Email: j.c.b.rantham.prabhakara@tue.nl Email: r.s.mans@tue.nl Email: w.m.p.v.d.aalst@tue.nl

Abstract—The growing interest in process mining is fueled by While the success stories reported on using process mining
the increasing availability of event data. Process mining tech- are certainly convincing, it is not easy to reproduce these best
niques use event logs to automatically discover process models, practices in many settings due to the quality of event logs and
check conformance, identify bottlenecks and deviations, suggest
improvements, and predict processing times. Lion’s share of the nature of processes. For example, contemporary process
process mining research has been devoted to analysis techniques. discovery approaches have problems in dealing with fine-
However, the proper handling of problems and challenges arising grained event logs and less structured processes. The resulting
in analyzing event logs used as input is critical for the success spaghetti-like process models are often hard to comprehend
of any process mining effort. In this paper, we identify four [1].
categories of process characteristics issues that may manifest in
an event log (e.g. process problems related to event granularity We have applied process mining techniques in over 100
and case heterogeneity) and 27 classes of event log quality issues organizations. These practical experiences revealed that real-
(e.g., problems related to timestamps in event logs, imprecise life logs are often far from ideal and their quality leaves much
activity names, and missing events). The systematic identification to be desired. Most real-life logs tend to be incomplete, noisy,
and analysis of these issues calls for a consolidated effort from the and imprecise. Furthermore, contemporary processes tend to
process mining community. Five real-life event logs are analyzed
to illustrate the omnipresence of process and event log issues. be complex and subject to a wide range of variations. This
We hope that these findings will encourage systematic logging results into event logs that are fine-granular, heterogeneous,
approaches (to prevent event log issues), repair techniques (to and voluminous. Some of the more advanced process discovery
alleviate event log issues) and analysis techniques (to deal with techniques try to address these problems. However, as the say-
the manifestation of process characteristics in event logs). ing “garbage in – garbage out” suggests, more attention should
Key words: Process Mining, Data Quality, Event Log, Prepro-
cessing, Data Cleansing, Outliers be paid to the quality of event logs and the manifestation of
process characteristics in event logs before applying process
I. I NTRODUCTION mining algorithms. The strongest contributions addressing the
Business processes leave trails in a variety of data sources ‘Business Process Intelligence Challenge’ event logs illustrate
(e.g., audit trails, databases, and transaction logs). Process the significance of log preprocessing [2]–[5].
mining is a relatively young research discipline aimed at The process mining manifesto [6] also stresses the need
discovering, monitoring and improving real processes by ex- for high-quality event logs. The manifesto lists five maturity
tracting knowledge from event logs readily available in today’s levels ranging from one star () to five stars (    ). At
information systems [1]. Remarkable success stories have been the lowest maturity level, event logs are of poor quality, i.e.,
reported on the applicability of process mining based on recorded events may not correspond to reality and events may
event logs from real-life workflow management/information be missing. A typical example is an event log where events
systems. In recent years, the scope of process mining broad- are recorded manually. At the highest maturity level, event
ened from the analysis of workflow logs to the analysis of logs are of excellent quality (i.e., trustworthy and complete)
event data recorded by physical devices, web services, ERP and events are well-defined. In this case, the events (and all
(Enterprise Resource Planning) systems, and transportation of their attributes) are recorded automatically and have clear
systems. Process mining has been applied to the logs of high- semantics. For example, the events may refer to a commonly
tech systems (e.g., medical devices such as X-ray machines agreed upon ontology.
and CT scanners), copiers and printers, and mission-critical In this paper, drawing from our experiences, we elicit a
defense systems. The insights obtained through process mining list of common process characteristics issues and data quality
are used to optimize business processes and improve customer issues that we encounter in event logs and their impact on
service. Organizations expect process mining to produce accu- process mining. We describe several classes of process char-
rate insights regarding their processes while depicting only the acteristics problems and data quality problems. For example,
desired traits and removing all irrelevant details. In addition, regarding data quality problems, three timestamp related issues
they expect the results to be comprehensible and context- are identified: (1) missing timestamps (e.g. events with no
sensitive. timestamps), (2) imprecise timestamps (e.g., logs having just

978-1-4673-5895-8/13/$31.00 2013
c IEEE 127

Authorized licensed use limited to: Vrije Universiteit Brussel. Downloaded on December 21,2020 at 14:51:54 UTC from IEEE Xplore. Restrictions apply.
date information such that ordering of events on the same process discovery, we assume that each event is classified
day is unknown), and (3) incorrect timestamps (e.g., events based on its activity. Consider a case for which n events have
referring to dates that do not exist or where timestamps been recorded: e1 , e2 , . . . en . This process instance has a trace
and ordering information are conflicting). Interestingly these e1 , e2 , . . . en associated to it. If we use activity names as a
problems appear in all application domains. For example, classifier, the trace corresponds to a sequence of activities.
when looking at hospital data we often encounter the problem However, if we would classify events based on resource
that only dates are recorded. When looking at X-ray machines, information, a trace could correspond to a sequence of person
the events are recorded with millisecond precision, but due names. We use A to denote the set of activities (event classes)
to buffering and distributed clocks these timestamps may be in an event log.
incorrect. In this paper we systematically identify the different
problems and suggest approaches to address them. We also
case event
evaluate several real-life event logs to demonstrate that the
classification can be used to identify problems. The goal of c attribute1: Val position: Num
this paper is not to provide solutions on how to address these c attribute2: Val activity name: Str
data quality issues but to highlight the issues and their impact .. 1 * timestamp: TS
.
on process mining applications. We hope that these findings belongs to resource: Str
will encourage process mining researchers to think about e attribute1: Val
systematic logging approaches and repair/analysis techniques e attribute2: Val
..
towards alleviating these quality issues. .
The rest of the paper is organized as follows. Section II
provides the preliminaries which are needed for the remainder Fig. 1. Cases and events can have attributes and each event can be associated
of the paper. Section III provides a high level classification to a single case.
of process characteristics problems and event log quality
problems. In Section IV, the high level problems that arise
III. H IGH L EVEL C LASSIFICATION OF P ROBLEMS
in event logs are classified in further detail. We evaluate
the prevalence of the process related issues and event log The problems and challenges arising in analyzing event logs
quality issues highlighted in this paper in five real-life logs and in process mining can be broadly classified into two categories:
discuss their results in Section V. Related work is presented in 1) Process characteristics: This class deals with challenges
Section VI. Finally, conclusions are presented in Section VII. emanating from characteristics such as fine-granular
activities, process heterogeneity/variability, process evo-
II. P RELIMINARIES
lution, and high volume processes.
The starting point for process mining is the notion of an 2) Quality of event log: This class deals with problems
event and an event log. An event log captures the manifestation that stem from issues related to the quality of logging
of events pertaining to the instances of a single process. A manifested in event logs.
process instance is also referred to as a case. Each event We discuss the two classes of problems in detail in the
in the log corresponds to a single case and can be related following subsections.
to an activity or a task. Events within a case need to be
ordered (an optional position attribute can specify the index A. Process Characteristics
of an event in a case). An event may also carry optional
additional information like time, transaction type, resource, Event log case
costs, etc. Timing information such as date and timestamp of
when an event occurred is required to analyze the performance
related aspects of the process. Resource information such as
# distinct activities
# cases

the person executing the activity is useful when analyzing


.. # distinct traces
the organizational perspective. We refer to these additional .
properties as attributes. To summarize,
• an event log captures the execution of a process.
• an event log contains process instances or cases.
• each case consists of an ordered list of events. Avg. # events per case
• a case can have attributes such as case id, etc.
• each event can be associated exactly to a single case. Fig. 2. Process characteristics manifested in event logs.
• events can have attributes such as activity, time, resource,
etc. Given an event log, we can extract a few basic metrics such
Fig. 1 depicts the characteristics of events and cases and as the number of cases in the event log, the average number
their relationship. For analysis, we need a function that maps of events per case, the number of unique activities, and the
any event e into its class e. For example, for control-flow number of distinct traces as illustrated in Fig. 2. Depending

128 2013 IEEE Symposium on Computational Intelligence and Data Mining (CIDM)

Authorized licensed use limited to: Vrije Universiteit Brussel. Downloaded on December 21,2020 at 14:51:54 UTC from IEEE Xplore. Restrictions apply.
on how these metrics manifest in an event log, one can face event logs.
several challenges such as dealing with fine-granular events, Process mining techniques have problems when dealing
case heterogeneity, voluminous data and, concept drifts. with heterogeneity in event logs. For example, process dis-
1) Voluminous Data–Large Number of Cases or Events: covery algorithms produce spaghetti-like incomprehensible
Today, we see an unprecedented growth of data from a wide process models. Analyzing the whole event log in the presence
variety of sources and systems across many domains and of heterogeneity fails to achieve this objective. Moreover, users
applications. For example, high-tech systems such as medical would be interested in learning any variations in process be-
systems and wafer scanners produce large amounts of data, havior and have their insights on the process put in perspective
because they typically capture very low-level events such as to those variations.
the events executed by the system components, application Trace clustering has been shown to be an effective way of
level events, network/communication events, and sensor read- dealing with such heterogeneity [10]–[13]. The basic idea of
ings (indicating status of components etc.). Each atomic event trace clustering is to partition an event log into homogenous
in these environments has a short life-time and hundreds of subsets of cases. Analyzing homogenous subsets of cases is
events can be triggered within a short time span (even within expected to improve the comprehensibility of process mining
a second) (i.e., the average number of events or the number results. In spite of its success, trace clustering still remains
of cases can be very high of the order of a few millions a subjective technique. A desired goal would be to introduce
(thousands) of events (cases)). some objectivity in partitioning the log into homogenous cases.
Boeing jet engines can produce 20 terabytes (TB) of op- 3) Event Granularity–Large Number of Distinct Activities:
erational information per hour. In just one Atlantic crossing, Processes that are defined over a huge number of activities
a four-engine jumbo jet can generate 640 terabytes of data generate event logs that are fine-granular. Fine-granularity
[7]. Such voluminous data is emanating from many areas is more pronounced in event logs of high-tech systems and
such as banking, insurance, finance, retail, healthcare, and in event logs of information systems where events typically
telecommunications. For example, Walmart is logging one correspond to automated statements in software supporting the
million customer transactions per hour and feeding informa- information system. One can also see such phenomena in event
tion into databases estimated at 2.5 petabytes in size [7], a logs of healthcare processes, e.g., one can see events related
global Customer Relationship Management (CRM) company to fine-grained tests performed at a laboratory in conjunction
is handling around 10 million calls a day with 10–12 customer with coarse-grained surgical procedures.
interaction events associated with each call [8]. The term Process mining techniques have difficulties in dealing with
“big data” has emerged to refer to the spectacular growth of fine-granular event logs. For example, the discovered process
digitally recorded data [9]. models are often spaghetti-like and hard to comprehend.
Process mining has become all the more relevant in this For a log with |A| event classes (activities), a flat process
era of “big data” than ever before. The complexity of data model can be viewed as a graph containing |A| nodes with
demands powerful tools to mine useful information and dis- edges corresponding to the causality defined by the execution
cover hidden knowledge. Contemporary process mining tech- behavior in the log. Graphs become quickly overwhelming
niques/tools are unable to cope with massive event logs. There and unsuitable for human perception and cognitive systems
is a need for research in both the algorithmic as well as even if there are more than a few dozens of nodes [14]. This
deployment aspects of process mining. For example, we should problem is compounded if the graph is dense (which is often
move towards developing efficient, scalable, and distributed the case in unstructured processes) thereby compromising the
algorithms in process mining. comprehensibility of models.
2) Case Heterogeneity–Large Number of Distinct Traces: Analysts and end users often prefer higher levels of abstrac-
Many of today’s processes are designed to be flexible. This tion without being confronted with low level events stored
results in event logs containing a heterogeneous mix of usage in raw event logs. One of the major challenges in process
scenarios with diverse and unstructured behaviors (i.e., the mining is to bridge the gap between the higher level conceptual
number of distinct traces is too high). Another source of het- view of the process and the low level event logs. Several
erogeneity stems from operational processes that change over attempts have been reported in the literature on grouping
time to adapt to changing circumstances, e.g., new legislation, events to create higher levels of abstraction ranging from
extreme variations in supply and demand, seasonal effects, etc. the use of semantic ontologies [15], [16] to the grouping of
Although it is desirable to also record the scenario chosen for events based on correlating activities [17], [18] or the use
a particular case, it is infeasible to define all possible variants. of common patterns of execution manifested in the log [19],
There are several scenarios where such processes exist. [20]. However, most of these techniques are tedious (e.g.,
There is a growing interest in analyzing event logs of high-tech semantic ontologies), only partly automated (e.g., abstractions
systems such as X-ray machines, wafer scanners, and copiers based on patterns), lack domain significance (e.g., correlation
and printers. These systems are complex large scale systems of activities), or result in discarding relevant information (e.g.,
supporting a wide range of functionality. For example, medical abstraction).
systems support medical procedures that have hundreds of 4) Process Flexibility and Concept Drifts: Often business
potential variations. These variations create heterogeneity in processes are executed in a dynamic environment which means

2013 IEEE Symposium on Computational Intelligence and Data Mining (CIDM) 129

Authorized licensed use limited to: Vrije Universiteit Brussel. Downloaded on December 21,2020 at 14:51:54 UTC from IEEE Xplore. Restrictions apply.
that they are subject to a wide range of variations (variations as an event, event attribute/value, and relation is missing.
lead to diversity in traces and thereby lead to a high number Missing data mostly reflects a problem in the logging frame-
of distinct traces). Process changes manifest latently in event work/process.
logs. Analyzing such changes is of the utmost importance to Incorrect Data: This corresponds to the scenario where
get an accurate insight on process executions at any instant of although data may be provided in a log, it may turn out that,
time. Based on the duration for which a change is active, one based on context information, the data is logged incorrectly.
can classify changes into momentary and evolutionary. For example, an entity, relation, or value provided in a log is
• Evolutionary change: Evolutionary changes are changes incorrect.
that are persistent. These changes manifest in such a way Imprecise Data: This corresponds to the scenario where the
that for the group of traces subsequent to the point of logged entries are too coarse leading to a loss of precision.
change there are substantial differences in comparison Such imprecise data prohibits in performing certain kinds of
with earlier performed traces with regard to the activities analysis where a more precise value is needed and can also
performed, the ordering of activities the data involved lead to unreliable results. For example, the timestamps logged
and/or the people performing the activities. Fig. 3 depicts can be too coarse (e.g., in the order of a day) thereby making
an example manifestation of an evolutionary change. For the order of entries unreliable.
a given process, events “A”, “B”, and “C” are always Irrelevant Data: This corresponds to the scenario where the
recorded sequentially as part of the normal process flow. logged entries may be irrelevant as it is for analysis but another
Due to an evolutionary change, the activities “B” and relevant entity may have to be derived/obtained (e.g., through
“C” are replaced (substituted) with the activities “D” and filtering/aggregation) from the logged entities. However, in
“E” respectively. Thus events “D” and “E” are recorded many scenarios such filtering/transformation of irrelevant en-
sequentially instead of the events “B” and “C” in the tries is far from trivial and this poses as a challenge for process
traces subsequent to the point of change. mining analysis.
In the next section, we elaborate these four categories of
A B C
event log quality issues in detail.

A B C IV. E VENT L OG Q UALITY I SSUES


process The four categories of problems that are identified in Section
time

changes III-B for an event log, can emanate for different entities
A D E within an event log. TABLE I outlines the manifestation of
the different classes of problems across the various entities
A D E
of an event log. In total, 27 different classes of quality issues
can be identified. For each identified event log quality issue an
Fig. 3. Evolutionary change: Due to an evolutionary change the events “D”
and “E” are recorded as part of the normal process flow instead of the events
unique number is given starting with an “I”. Note that not for
“B” and “C”. all combinations of problem classes and log entities an issue
has been identified as not all of them are applicable.
• Momentary change: Momentary changes are short-lived Of these classes of quality issues, only the most interesting
and affect only a very few cases. Momentary changes ones are discussed in detail as both the manifestation of the
manifest as exceptional executions or outliers in an event issue and the effect on the process mining application is non-
log. trivial. For the other classes of event log quality issues only a
Current process mining algorithms assume processes to be in brief summary is given at the end of the section. Note that for
a steady state. In case a log which contains multiple variants the most interesting issues first a description and an example
of a process is analyzed an overly complex model is obtained. is given followed by a discussion about the impact of the issue
Moreover, for the discovered model there is no information on the process mining application.
on the process variants that existed during the logged period. I2: Missing Events: This quality issue refers to the scenario
Recent efforts in process mining have attempted at addressing where one or more events are missing within the trace although
this notion of concept drifts [21]–[24]. Several techniques have they occurred in reality. For example, Fig. 4 shows a missing
been proposed to detect and deal with changes (if any) in an event in a trace. In reality, events “A”, “B”, and “C”, have
event log. occurred. However, only events “A” and “C” have been logged
(event “B” has not been recorded for the trace).
B. Quality of the Event Log The effect of missing events is that there may be problems
With regard to the quality of an event log there can with the results that are produced by a process mining
be various issues. we distinguish four broad categories of algorithm. In particular, relations may be inferred which
problems. hardly or even do not exist in reality. To date, we are not
Missing Data: This corresponds to the scenario where dif- aware of any process mining algorithm which is specifically
ferent kinds of information can be missing in a log although developed for detecting or dealing with missing events.
it is mandatory. For example, a certain entity of a log such However, algorithms which are most close to dealing with

130 2013 IEEE Symposium on Computational Intelligence and Data Mining (CIDM)

Authorized licensed use limited to: Vrije Universiteit Brussel. Downloaded on December 21,2020 at 14:51:54 UTC from IEEE Xplore. Restrictions apply.
TABLE I
F OR EACH ENTITY OF AN EVENT LOG DEPICTED IN F IGURE 2 IT IS INDICATED WHETHER THE “ MISSING ”, “ INCORRECT ”, “ IMPRECISE ”, AND
“ IRRELEVANT ” PROBLEM TYPE APPLIES . F OR EACH IDENTIFIED EVENT LOG QUALITY ISSUE A UNIQUE NUMBER IS GIVEN STARTING WITH AN “I”.

activity name
c attribute

e attribute
timestamp
belongs to

resource
position
event
case
Missing Data I1 I2 I3 I4 I5 I6 I7 I8 I9
Incorrect Data I10 I11 I12 I13 I14 I15 I16 I17 I18
Imprecise Data I19 I20 I21 I22 I23 I24 I25
Irrelevant Data I26 I27

missing events are the ones which can deal with noise (e.g. thereby plausibly leading to incorrect inference that “B” causes
the fuzzy miner [18] and the heuristics miner [1]). “A”.

Real Time(A)=
A B C 03-12-2011 13:45 Real Time(A)= Real Time(B)=
03-12-2011 13:45 03-12-2011 13:50

A
Fig. 4. Missing events: events “A”, “B”, and “C” have occurred in reality. B A
Only events “A” and “B” are included in the log whereas event “B” is not
included. Logged Time(A)= Logged Time(B)= Logged Time(A)=
03-12-2011 13:59 03-12-2011 13:50 03-12-2011 13:59

(a) (b)
Regarding missing events, one specific sub-issue can be Fig. 5. Incorrect timestamps: (a) event “A” occurred in reality (marked by
identified. a dark vertical line) at “03-12-2011 13:45”. However, the timestamp of the
• Partial / Incomplete Traces: The prefix and/or suffix event recorded in the log is “03-12-2011 13:59” (b) danger of reversal of
cause and effect phenomenon due to incorrect timestamps–one can wrongly
events corresponding to a trace are missing although they infer that “B” causes “A” where in reality “A” occurred before “B”.
occurred in reality. This is more prevalent in scenarios
where the event data for analysis is considered over a I22: Imprecise Activity Names:
defined time interval (e.g., between Jan’12 and Jun’12). This quality issue corresponds to the scenario in which
The initial events (prefix) of cases that have started before activity names are too coarse. As a result, within a trace there
Jan’12 would have been omitted due to the manner of may be multiple events with the same activity name. These
event data selection. Likewise cases that have been started events may have the same connotation as, for example, they
between Jan‘12 and Jun‘12 but not yet completed by occur in a row. Alternatively, the events may have different
Jun‘12 are incomplete and have their suffixes missing. connotations. That is, the activity corresponding to Send
I16: Incorrect Timestamps: This quality issue corresponds Acknowledgment may mean differently depending on
to the scenario where the recorded timestamp of (some or all) the context in which it manifests. In Fig. 6 an example is
events in the log do not correspond to the real time at which given. One occurrence of event “A” is immediately followed
the events have occurred. For example, Fig. 5(a) illustrates an by another occurrence of event “A”. The two events either
incorrect recording of the timestamp of an event. Event “A” belong to the same instance of task “A” or they belong to
has been automatically registered at time “03-12-2011 13:45”. separate instances of task “A”. The separate instances of task
However, the timestamp of the event recorded in the log is “A” may have the same connotation or a different one. The
“03-12-2011 13:59”. Another example is that of a timestamp separate instances of task “A” may have the same connotation
recorded as February 29 in a non-leap year. Such discrepancies or a different one.
manifest in systems where there are different clocks that are
not synchronized with each other and/or systems where there The effect on the application of process mining is that
are delays in logging. algorithms have difficulty in identifying the notion of duplicate
Process mining algorithms discover the control-flow based tasks and thereby produce results that are inaccurate. For
on behavior observed in the log. The possible effect of incor- example, in process discovery, duplicate tasks are represented
rect timestamps is that the discovered control-flow relations with a single node resulting in a large fan-in/fan-out. In
are unreliable or even incorrect. Moreover, in applications such case of duplicate events which have the same connotation, a
as the discovery of signature patterns for diagnostic purposes simple filter may suffice in order to aggregate the multiple
(e.g., fraudulent claims, fault diagnosis, etc.) [25], there is a events into one.
danger of reversal of cause and effect phenomenon due to
incorrect timestamps. Fig. 5(b) depicts an example of such I23: Imprecise Timestamps:
a scenario. Although the events “A” and “B” occur in that This quality issue corresponds to the scenario where times-
particular order in reality, they are logged as “B” and “A” tamps are imprecise as a too coarse level of abstraction is used

2013 IEEE Symposium on Computational Intelligence and Data Mining (CIDM) 131

Authorized licensed use limited to: Vrije Universiteit Brussel. Downloaded on December 21,2020 at 14:51:54 UTC from IEEE Xplore. Restrictions apply.
A A hospital, one would be interested in a high-level process flow
between the various departments in the hospital. However,
when analyzing a particular department workflow, one would
A A A be interested in more finer level details on the activities
?
performed within the department.
Failing in filtering and aggregating the correct events creates
Fig. 6. Imprecise activity names: event “A” occurs two times in succession.
As a consequence, both events may belong to the same instance of task “A” several challenges for process mining and leads to inaccurate
or they may each belong to a separate instance of task “A”. results or capturing results at insufficient level of detail. For
example, ignoring critical events required for the context of
analysis will lead to not being able to uncover the required
for the timestamps of (some of the) events. This implies that results (e.g., unable to discover the causes of failure) while
the ordering of events within the log may not conform to the considering more than required will add to the complexity of
actual ordering in which the events occurred in reality, i.e. analysis in addition to generating non-interesting results (e.g.,
the ordering of events within a trace is unreliable. In Fig. 7 an incomprehensible spaghetti-like model).
an example is given of a too coarse granularity of timestamps. Remaining Event Log Quality Issues: For the remaining 22
For a trace, events “B” and “C” both have “05-12-2011” as the event log quality issues, a similar reasoning can be followed
timestamp. It is not clear whether event “B” occurred before as for the issues discussed above. In general, the impact of
“C” or the other way around. For event “A” having timestamp the other “missing data” and “incorrect data” issues on the
“03-12-2011” it is clear that it occurred before both “B” and process mining application is that relations are inferred which
“C”. Likewise, event “D” occurred after both “B” and “C” as hardly exist in reality. For example, for the “missing resources”
it has the timestamp “06-12-2011”. issue it could be concluded that a person is hardly involved in
Process mining algorithms for discovering the control-flow any work whereas in reality this could be exactly the opposite.
assume that all events within the log are totally ordered. Furthermore, when applying decision mining for incorrect case
As multiple events may exist within a trace with the same attributes or data attributes, incorrect choice criteria may be
timestamp, process mining algorithms may have problems inferred. In case these results are to be used in practice, wrong
with identifying the correct control-flow. In particular, choices will be made within a process.
discovered control-flow models tend to have a substantial For the remaining “imprecise data” issues, the general
amount of activities which occur in parallel. Furthermore, impact is that inaccurate results are obtained and that the
event logs with coarse granular timestamps are incapacitated application of certain types of process mining is prohibited.
for certain types of process mining analysis such as For example, for the “imprecise positions” issue the effects are
performance analysis. comparable as for the “imprecise timestamps” issue. Further-
more, when dealing with imprecise data, there exists the risk
? that wrong weaknesses are identified within a process (e.g. a
A B C D non-existing bottleneck or a group of badly performing people
which in reality perform well). As a result, an organization
may focus on wrong improvements in order to optimize a
Time= Time= Time= Time=
03-12-2011 05-12-2011 05-12-2011 06-12-2011
process.
V. E VALUATION OF E VENT L OGS
Fig. 7. Coarse granularity of timestamps: for events “B” and “C”, it is not
clear in which order they occurred as both occurred on the same day. In this section, we analyze the issues highlighted in Sec-
tions III-A and IV against several real-life logs. The objective
of this analysis is to provide an insight into the manner and
I27: Irrelevant Events: In some applications, certain logged extent to which the identified process related issues and event
events may be irrelevant as it is for analysis. As a result, it is log issues actually appear in real-life data that is used for
needed to filter or aggregate events, in order that only relevant process mining. Due to space constraints, we depict only
events are derived/obtained. This filtering or aggregation may the summarized results of issues manifested in five real-life,
be a far from trivial task. If the event log contains rich viz., Philips Healthcare (P), BPI Challenge 2011 (A) [26],
information in the form of event attributes, one may use them BPI Challenge 2012 (B) [27], Catharina Hospital (C), and
to assist in filtering/aggregating irrelevant events. CoSeLog (G). TABLE II summarizes the process related issues
For example, when analyzing X-ray machine event data, while TABLE III summarizes the event log issues manifested
completely different aspects need to be considered when it in these logs. For a more detailed analysis, the reader is
comes to gaining insights on (i) the real usage of the system referred to [28].
and (ii) recurring problems and system diagnosis, the former The logs used in the table contain information regarding
requiring the analysis of commands/functions invoked on the operational events which capture the use of medical equipment
system while the latter needing error and warning events. As such as X-ray machines and MR machines (Philips Health-
another example, when analyzing the process of an entire care); actions performed within the Dutch AMC hospital in

132 2013 IEEE Symposium on Computational Intelligence and Data Mining (CIDM)

Authorized licensed use limited to: Vrije Universiteit Brussel. Downloaded on December 21,2020 at 14:51:54 UTC from IEEE Xplore. Restrictions apply.
TABLE III
E VALUATION OF EVENT LOG ISSUES ON A SELECTION OF REAL - LIFE LOGS .

activity name
c attribute

e attribute
timestamp
belongs to

resource
position
event
case
missing {A,B,C,G} {C} {A,B} {C,G}
incorrect {B,G} {P} {A,P,C} {B}
imprecise {A } {A,P,C,G} {A,C}
irrelevant {P} {A,B,P}

TABLE II
E VALUATION OF PROCESS RELATED ISSUES ON A SELECTION OF So far, process mining has been applied in many organi-
REAL - LIFE LOGS . zations. In literature, several scholarly publications can be
found in which an application of process mining has been
P A B C G
Event granularity X X X X described. Several publications mention the need of event log
Voluminous data X preprocessing as data quality problems exist. For example, the
Case heterogeneity X X X X healthcare domain is a prime example of a domain where event
Process flexibility and concept drifts X X
logs suffer from various data quality problems. In [34], [35] the
gynecological oncology healthcare process within a university
hospital has been analyzed; in [36] several processes within
order to treat gynaecological oncology patients (BPI Chal- an emergency department have been investigated; in [37] all
lenge 2011); the handling of loan/overdraft applications in a Computer Tomography (CT), Magnetic Resonance Imaging
Dutch financial institute (BPI Challenge 2012); the various (MRI), ultrasound, and X-ray appointments within a radiology
activities that take place in the Intensive Care Unit (ICU) workflow have been analyzed; in [38] the activities that are
of the Dutch Catharina hospital (Catharina); and procedures performed for patients during hospitalization for breast cancer
that are performed in order to grant permission for projects treatment are investigated; in [39] the journey through multiple
like the construction, alteration or use of a house or building wards has been discovered for inpatients; and finally, in [40],
(CoSeLog). the workflow of a laparoscopic surgery has been analyzed. In
From TABLE II it becomes clear that each process related total, these publications indicate problems such as “process
issue is manifested in at least one log. From TABLE III it can flexibility and concept drifts”, “event granularity”, “case het-
be seen that 12 out of the 27 event log issues are manifested erogeneity”, “missing resources”, “missing activity names”,
in one or more event logs. Furthermore, as suggested by the “missing events”, “imprecise activity names”, and “irrelevant
results, the process characteristics “fine-granular events” and events”. In particular, “case heterogeneity” is mentioned as a
“case heterogeneity”, and the event log data quality issues main problem. This is due to the fact that healthcare processes
“missing event” and “imprecise activity name” occur most are typically highly dynamic, highly complex, and ad hoc [36].
frequently for event logs. Additionally, it can be seen that Another domain in which many data quality problems
timestamp related issues occur within three event logs. Such for the associated event logs can be found is Enterprise
issues have a severe impact on the correctness of the obtained Resource Planning (ERP). Here, scholarly publications about
process mining results. Therefore, all before mentioned issues the application of process mining have been published about a
are good candidates for future research which focuses on procurement and billing process within SAP R/3 [19]; a pur-
alleviating them. chasing process within SAP R/3; and the process of booking
gas capacity in a gas company [41]. The data quality problems
VI. R ELATED W ORK arising here concern “event granularity”, “voluminous data”,
and “irrelevant events”. In particular, the identification of
It is increasingly understood that data in data sources is relationships between events together with the large amounts
often “dirty” and therefore needs to be “cleansed” [29]. In of data that can be found within an ERP system are considered
total, there exist five general taxonomies which focus on as important problems.
classifying data quality problems [30]. Although each of these
approaches construct and sub-divide their taxonomies quite VII. C ONCLUSIONS
differently [29], [31]–[33], [33], they arrive at very similar Process mining has made significant progress since its
findings [29]. Alternatively, as data quality problems for time- inception more than a decade ago. The huge potential and
oriented data have distinct characteristics, a taxonomy of dirty various success stories have fueled the interest in process
time-oriented data is provided in [30]. Although some of the mining. However, despite an abundance of process mining
problems within the previous taxonomies are very similar to techniques and tools, it is still difficult to extract the desired
process mining data quality problems, the taxonomies are not knowledge from raw event data. Real-life applications of
specifically geared towards the process mining domain. process mining tend to be demanding due to process related

2013 IEEE Symposium on Computational Intelligence and Data Mining (CIDM) 133

Authorized licensed use limited to: Vrije Universiteit Brussel. Downloaded on December 21,2020 at 14:51:54 UTC from IEEE Xplore. Restrictions apply.
issues manifested in event logs and data quality issues. In [20] R. P. J. C. Bose and W. M. P. van der Aalst, “Abstractions in Process
this paper we highlighted several classes of data quality issues Mining: A Taxonomy of Patterns,” in Business Process Management,
2009, pp. 159–175.
manifested in event logs. These issues hamper the applicability [21] R. P. J. C. Bose, W. M. P. van der Aalst, I. Žliobaitė, and M. Pechenizkiy,
of several process mining techniques and limit the level/quality “Handling Concept Drift in Process Mining,” in Proceedings of CAiSE
of insights gained. There is a pressing need for process mining 2011), 2011, pp. 391–405.
[22] J. Carmona and R. Gavaldà, “Online Techniques for Dealing with
research to also focus on techniques that address these quality Concept Drift in Process Mining,” in Intelligent Data Analysis, 2012,
issues. pp. 90–102.
[23] D. Luengo and M. Sepúlveda, “Applying Clustering in Process Mining
Acknowledgements to Find Different Versions of a Process that Changes Over Time,” in
Business Process Management Workshops, 2012, pp. 153–158.
This research is supported by the Dutch Technology Foun- [24] T. Stocker, “Time-Based Trace Clustering for Evolution-Aware Security
dation STW, applied science division of NWO and the Tech- Audits,” in Business Process Management Workshops, 2012, pp. 471–
476.
nology Program of the Ministry of Economic Affairs. [25] R. P. J. C. Bose, “Process Mining in the Large: Preprocessing, Dis-
covery, and Diagnostics,” Ph.D. dissertation, Eindhoven University of
R EFERENCES Technology, 2012.
[1] W. M. P. van der Aalst, Process Mining: Discovery, Conformance and [26] 3TU Data Center, “BPI Challenge 2011 Event Log,” 2011,
Enhancement of Business Processes. Springer-Verlag, Berlin, 2011. doi:10.4121/uuid:d9769f3d-0ab0-4fb8-803b-0d1120ffcf54.
[2] R. P. J. C. Bose and W. M. P. van der Aalst, “Analysis of Patient [27] ——, “BPI Challenge 2012 Event Log,” 2012,
Treatment Procedures: The BPI Challenge Case Study,” BPMCenter.org, doi:10.4121/uuid:3926db30-f712-4394-aebc-75976070e91f.
Technical Report BPM-11-18, 2011. [28] R. P. J. C. Bose, R. S. Mans, and W. M. P. van der Aalst, “Wanna
[3] ——, “Process Mining Applied to the BPI Challenge 2012: Divide Improve Process Mining Results? - It’s High Time We Consider Data
and Conquer While Discerning Resources,” BPMCenter.org, Technical Quality Issues Seriously,” BPM Center Report BPM-13-02, BPMcen-
Report BPM-12-16, 2012. ter.org, 2013.
[4] ——, “Analysis of Patient Treatment Procedures,” in Business Process [29] W. Kim, B.-J. Choi, E.-K. Hong, S.-K. Kim, and D. Lee, “A Taxonomy
Management Workshops, 2012, pp. 165–166. of Dirty Data,” Data Mining and Knowledge Discovery, pp. 81–99, 2003.
[5] ——, “Process Mining Applied to the BPI Challenge 2012: Divide and [30] T. Gschwandtner, J. Gärtner, W. Aigner, and S. Miksch, “A Taxonomy
Conquer While Discerning Resources,” in Business Process Manage- of Dirty Time-Oriented Data,” in CD-ARES 2012, 2012, pp. 58–72.
ment Workshops, 2013, pp. 221–222. [31] E. Rahm and H. Do, “Data Cleaning: Problems and Current Ap-
[6] IEEE Task Force on Process Mining, “Process Mining Manifesto,” in proaches,” IEEE Bulletin of the Technical Committee on Data Engi-
BPM 2011 Workshops, 2011, pp. 169–194. neering, 2000.
[7] S. Rogers, “Big Data is Scaling BI and Analytics–Data Growth is About [32] H. Müller and J.-C. Freytag, “Problems, Methods, and Challenges in
to Accelerate Exponentially–Get Ready,” Information Management- Comprehensive Data Cleansing,” Humboldt University Berlin, Technical
Brookfield, 2011. Report HUB-IB-164, 2003.
[8] C. W. Olofson, “Managing Data Growth Through Intelligent Partition- [33] P. Oliveira, F. Rodrigues, and P. Henriques, “A Formal Definition of Data
ing: Focus on Better Database Manageability and Operational Efficiency Quality Problems,” in International Conference on Information Quality
with Sybase ASE,” White Paper, IDC, 2010. (MIT IQ Conference), 2005.
[9] J. Manyika, M. Chui, B. Brown, J. Bughin, R. Dobbs, C. Roxburgh, and [34] R. S. Mans, M. H. Schonenberg, M. S. Song, W. M. P. van der Aalst,
A. H. Byers, “Big Data: The Next Frontier for Innovation, Competition, and P. J. M. Bakker, “Application of Process Mining in Healthcare : a
and Productivity,” McKinsey Global Institute, Tech. Rep., 2011. Case Study in a Dutch Hospital,” in BIOSTEC 2008, 2009, pp. 425–438.
[10] J. D. Weerdt, S. K. L. M. vanden Broucke, J. Vanthienen, and B. Bae- [35] R. S. Mans, “Workflow Support for the Healthcare Domain,” Ph.D.
sens, “Leveraging Process Discovery with Trace Clustering and Text dissertation, Eindhoven University of Technology, 2011.
Mining for Intelligent Analysis of Incident Management Processes,” in [36] A. Rebuge and D. Ferreira, “Business Process Analysis in Healthcare
2012 IEEE Congress on Evolutionary Computation (CEC), 2012, pp. Environments: A Methodology Based on Process Mining,” Information
1–8. Systems, vol. 37, no. 2, pp. 99–116, 2012.
[11] G. Greco, A. Guzzo, L. Pontieri, and D. Sacca, “Discovering Expressive [37] M. Lang, T. Bürkle, S. Laumann, and H.-U. Prokosch, “Process Mining
Process Models by Clustering Log Traces,” IEEE Transactions on for Clinical Workflows: Challenges and Current Limitations,” in Medical
Knowledge and Data Engineering, pp. 1010–1027, 2006. Informatics Europe, 2008.
[12] R. P. J. C. Bose and W. M. P. van der Aalst, “Trace Clustering Based [38] J. Poelmans, G. Dedene, G. Verheyden, H. van der Mussele, S. Vi-
on Conserved Patterns: Towards Achieving Better Process Models,” in aene, and E. Peters, “Combining Business Process and Data Discovery
Business Process Management Workshops, 2010, pp. 170–181. Techniques for Analyzing and Improving Integrated Care Pathways,” in
[13] A. K. A. de Medeiros, A. Guzzo, G. Greco, W. M. P. van der Aalst, A. J. ICDM, 2010, pp. 505–517.
M. M. Weijters, B. F. van Dongen, and D. Sacca, “Process Mining Based [39] L. Perimal-Lewis, S. Qin, C. Thompson, and P. Hakendorf, “Gaining
on Clustering: A Quest for Precision,” in Business Process Management Insight from Patient Journey Data using a Process-Oriented Analysis
Workshops, 2008, pp. 17–29. Approach,” in Health Informatics and Knowledge Management, 2012,
[14] C. Görg, M. Pohl, E. Qeli, and K. Xu, “Visual Representations,” in pp. 59–66.
Human-Centered Visualization Environments, 2007. [40] T. Blum, N. Padoy, H. Feuner, and N. Navab, “Workflow Mining
[15] C. Pedrinaci and J. Domingue, “Towards an Ontology for Process for Visualization and Analysis of Surgeries,” International Journal of
Monitoring and Mining,” in Semantic Business Process and Product Computer Assisted Radiology and Surgery, pp. 379–386, 2008.
Lifecycle Management, 2007, pp. 76–87. [41] L. Maruster and N. R. T. P. van Beest, “Redesigning Business Processes:
[16] A. K. A. de Medeiros, W. M. P. van der Aalst, and P. Carlos, “Semantic a Methodology Based on Simulation and Process Mining Techniques,”
Process Mining Tools: Core Building Blocks,” in Proceedings of ECIS Knowledge and Information Systems, pp. 267–297, 2009.
2008, 2008, pp. 1953–1964.
[17] C. W. Günther, A. Rozinat, and W. M. P. van der Aalst, “Activity
Mining by Global Trace Segmentation,” in Business Process Mangement
Workshops, 2010, pp. 128–139.
[18] C. W. Günther and W. M. P. van der Aalst, “Fuzzy Mining: Adaptive
Process Simplification Based on Multi-perspective Metrics,” in Business
Process Management, 2007, pp. 328–343.
[19] R. P. J. C. Bose, H. M. W. Verbeek, and W. M. P. van der Aalst,
“Discovering Hierarchical Process Models Using ProM,” in CAiSE
Forum 2011, 2012, pp. 33–48.

134 2013 IEEE Symposium on Computational Intelligence and Data Mining (CIDM)

Authorized licensed use limited to: Vrije Universiteit Brussel. Downloaded on December 21,2020 at 14:51:54 UTC from IEEE Xplore. Restrictions apply.

You might also like