
2IMI35 ASSIGNMENT 1 30/09/2016

Student ID: S167017 Name: Garcia Torres, D.M.


Goals of this Analysis

The main goals of this assignment are to gain new insights into what can be done with Data Mining,
Machine Learning and Process Mining tools, how I can use them in my own projects, and how I can apply the concepts
learned in this Process Mining course.

Part 1. Disco

I. (2.5 points) For the Disco part, start with answering the following simple questions and executing the following
simple analyses:

1. What are possible first and last activities in the log traces?

a. Possible First Activities b. Possible Last Activities

a) Are there last activities that you would not expect to see in the list of the last activities?

I wouldn't expect to see these SCHEDULE and START activities:


W_Nabellen offertes\\START
W_Valideren aanvraag\\START
W_Wijzigen contractgegevens\\SCHEDULE

I also would not expect to see W_Beoordelen fraude\\COMPLETE, as it is very infrequent (0,86% relative
frequency).

b) Choose the most frequent last activity and apply the endpoints filter to select only activities that end with
that activity. Analyze the model obtained with this filter.
Are there activities that occur more than once in some traces (look at case frequency in Disco)? If so, how
many times maximally?

W_Afhandelen leads\\COMPLETE: Max. repetitions = 13
W_Afhandelen leads\\START: Max. repetitions = 13
W_Afhandelen leads\\SCHEDULE: Max. repetitions = 2
W_Beoordelen fraude\\COMPLETE: Max. repetitions = 7
W_Beoordelen fraude\\START: Max. repetitions = 7
W_Beoordelen fraude\\SCHEDULE: Max. repetitions = 2
W_Completeren aanvraag\\COMPLETE: Max. repetitions = 27
W_Completeren aanvraag\\START: Max. repetitions = 27
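
The "Max. repetitions" indicator can also be cross-checked outside Disco. The following is a minimal Python sketch, assuming the log has been exported as (case id, activity) pairs. The `events` list is a hypothetical toy log and `max_repetitions` is an illustrative helper, not part of Disco:

```python
from collections import Counter, defaultdict

def max_repetitions(events):
    """Return, per activity, the highest number of occurrences within a single case."""
    per_case = defaultdict(Counter)
    for case_id, activity in events:
        per_case[case_id][activity] += 1
    result = {}
    for counts in per_case.values():
        for activity, n in counts.items():
            result[activity] = max(result.get(activity, 0), n)
    return result

# Hypothetical toy log (not taken from the assignment data).
# Raw strings keep the activity names exactly as they are written in this report.
events = [
    ("c1", r"W_Afhandelen leads\\START"),
    ("c1", r"W_Afhandelen leads\\COMPLETE"),
    ("c1", r"W_Afhandelen leads\\START"),
    ("c1", r"W_Afhandelen leads\\COMPLETE"),
    ("c2", r"W_Afhandelen leads\\START"),
    ("c2", r"W_Afhandelen leads\\COMPLETE"),
]
reps = max_repetitions(events)
```

In this toy log both activities occur twice in case c1 and once in case c2, so the maximum per activity is 2.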

2. What are the mostly used resources?

a) Choose the most frequently used resource and filter the log using the resource filter to only keep activities
executed by this resource and look at the model and the log statistics of the filtered log. Can you make an
assumption about the nature of this resource?

The resource is most likely a software application that is part of an automated workflow and available to users 24/7, with a low
workload between 00:00 and 08:00.

3. Mine the model on the whole log without filtering, and then apply filtering to mine A-, O- and W- subprocesses
separately. What conclusions can you draw based on those models with respect to their structure? Why would
the model obtained from an unfiltered log be so different with respect to the structure than the models of the
A-, O- and W-subprocesses?
Both the A- and O- models represent status changes of applications and offers, respectively. As a consequence, they have
a more predefined (more constrained) and less dynamic behavior than the W- model. The latter shows all the
possible activities that an employee can perform in the process, so its behavior is less predictable and
more variable. This can be verified in the overview of each filtered model, where one can see that
A- has 17 variants and O- has 168, while W- accounts for 2921 variants. Additionally, there are more activities in the
W- model (actions that an employee can perform) than activities (statuses) in the A- (10) and O- (19) models.
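
The variant counts per subprocess can be reproduced conceptually by projecting every trace onto the activities of one prefix and counting the distinct sequences that remain. A minimal Python sketch, assuming traces are available as lists of activity names; the toy `traces` and the helper `variants_by_prefix` are hypothetical, and lifecycle suffixes are omitted for brevity:

```python
from collections import Counter

def variants_by_prefix(traces, prefix):
    """Project each trace onto activities starting with `prefix` and count distinct sequences."""
    projected = (tuple(a for a in trace if a.startswith(prefix)) for trace in traces)
    # Traces without any matching activity are dropped, as an attribute filter would do.
    return Counter(p for p in projected if p)

# Hypothetical toy traces (not taken from the assignment data)
traces = [
    ["A_SUBMITTED", "A_DECLINED"],
    ["A_SUBMITTED", "W_Afhandelen leads", "A_DECLINED"],
    ["A_SUBMITTED", "W_Afhandelen leads", "W_Afhandelen leads", "A_DECLINED"],
]
a_variants = variants_by_prefix(traces, "A_")
w_variants = variants_by_prefix(traces, "W_")
```

Here the three traces collapse into a single A- variant, while the repeated employee activity already produces two W- variants, which mirrors the difference in variability described above.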

Model Without Filters:

A- Model: Status changes of applications

W- Model: Activities performed by employees in the process


O- Model: Status changes of offers

4. What are the bottlenecks in the process? (look at the performance shown in the mined model). Which activities
contribute the most to the duration of the process?

It is possible to verify that the transition between W_Nabellen offertes\\COMPLETE and W_Nabellen offertes\\START
(Figure 4.1) accounts for the maximum total duration between two activities. Additionally, after performing an animation
one can clearly see (Figure 4.2) that the main bottlenecks are occurring in the following transitions:
W_Completeren aanvraag\\COMPLETE - W_Completeren aanvraag\\START
W_Nabellen offertes\\COMPLETE - W_Nabellen offertes\\START
W_Nabellen offertes\\START - W_Completeren aanvraag\\COMPLETE

Figure 4.1 Figure 4.2

5. What is the distribution of the total case duration?


Introduce performance filters based on the case duration. Look at very short (w.r.t. the execution time) cases.
What part of cases falls into that category? How long are they? Mine their model. What conclusions can you
make about this category of cases?

The bar graph below shows the distribution of the total case duration along with the following statistics:
6949 cases (53%) of 13087 lasting up to 1 day and 8 hours
Median case duration of 19.4 hours
Mean case duration of 8.6 days

We can consider as very short cases those grouped in the first bar of the distribution graph. More specifically, those
cases with a maximum duration of 32.9 hours.

After applying the performance filter, the overview of the new model shows the following statistics:
6949 cases.
With 35 activities
Median case duration of 3.2 minutes
Mean case duration of 3.4 hours

Looking at the activity view (in the last in case tab) and the model map, one can see that these cases end up being
declined (A_DECLINED\\COMPLETE) or canceled (A_CANCELLED\\COMPLETE), either manually after an employee
activity or automatically by the system. Hence, they most likely belong to the category of cases that do not
pass the initial verifications.
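
The large gap between the median (3.2 minutes) and the mean (3.4 hours) indicates a skewed distribution: most short cases finish almost immediately, while a few take many hours. A minimal Python sketch of a duration-based performance filter, assuming case durations in hours are available per case; the `durations_h` values are hypothetical and `short_cases` is an illustrative helper:

```python
from statistics import mean, median

def short_cases(durations_h, max_hours):
    """Keep only cases whose total duration does not exceed max_hours."""
    return {case: d for case, d in durations_h.items() if d <= max_hours}

# Hypothetical case durations in hours (not taken from the assignment data)
durations_h = {"c1": 0.01, "c2": 0.05, "c3": 20.0, "c4": 300.0, "c5": 700.0}

subset = short_cases(durations_h, max_hours=32.9)  # 32.9 h is the threshold used above
share = len(subset) / len(durations_h)             # fraction of cases kept by the filter
med = median(subset.values())
avg = mean(subset.values())
```

In the toy data one 20-hour case is enough to pull the mean far above the median, which is the same effect visible in the filtered log.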

6. Use the variation filter:


a) Look at the model of the mainstream behavior (choosing the most frequent variant). What is the share of
the cases that fall into this category? What is the duration of these cases? How many of them are
accepted/declined/canceled?

The most frequent variant is Variant 1, with the following statistics:


3429 (26,2%) of 13087 cases.
With 3 activities
Median case duration of 37 seconds
Mean case duration of 38 seconds

100% of them finished in the status A_DECLINED\\COMPLETE (hence, they ended up being declined).

b) Do the same analysis for the second most popular variant of the behavior.
What is the main difference between the first variant and this one?

The second most frequent variant is Variant 2, with the following statistics:
1872 (14,3%) of 13087 cases.
With 6 activities
Median case duration of 50.4 minutes
Mean case duration of 5.2 hours

The difference lies in the additional execution of an employee activity, represented by the following
sequence of W_ events:
W_Afhandelen leads\\SCHEDULE
W_Afhandelen leads\\START
W_Afhandelen leads\\COMPLETE

c) Now choose the cases (from the complete log) whose sequence of activities is shared by at least 39 and at
most 271 cases. Look at their model.

Those cases correspond to Variants 3 to 16. Below, it is possible to see the applied filter and the outcome model:
How many of them are accepted/declined/canceled?

By looking at the variants statistics, one can summarize that:


No variant ends up being accepted.
1089 cases (74,2%) of 1467 ended up declined (A_DECLINED\\COMPLETE).
They correspond to the following sub-variants:
o Variant 1: 271 cases (18,47%)
o Variant 2: 209 cases (14,25%)
o Variant 3: 160 cases (10,91%)
o Variant 5: 126 cases (8,59%)
o Variant 6: 93 cases (6,34%)
o Variant 8: 74 cases (5,04%)
o Variant 10: 58 cases (3,95%)
o Variant 12: 54 cases (3,68%)
o Variant 13: 44 cases (3%)
378 cases (25,8%) of 1467 ended up canceled (A_CANCELLED\\COMPLETE).
They correspond to the following sub-variants:
o Variant 4: 134 cases (9,13%)
o Variant 7: 87 cases (5,93%)
o Variant 9: 63 cases (4,29%)
o Variant 14: 39 cases (2,66%)

This analysis can also be verified in the Map View by checking the Case frequency and the Max. repetitions indicators
of these A_ activities.
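
The variation filter used in this question can be sketched in Python as follows. This is only an illustration, assuming each case is available as an ordered list of activities; the toy `traces` and the helper `filter_by_variant_frequency` are hypothetical, and the toy bounds (2 to 2) stand in for the real bounds (39 to 271):

```python
from collections import Counter

def filter_by_variant_frequency(traces, min_cases, max_cases):
    """Keep cases whose exact activity sequence is shared by min_cases..max_cases cases."""
    counts = Counter(tuple(trace) for trace in traces.values())
    return {
        case_id: trace
        for case_id, trace in traces.items()
        if min_cases <= counts[tuple(trace)] <= max_cases
    }

# Hypothetical toy log: case id -> ordered list of activities
traces = {
    "c1": ["A_SUBMITTED", "A_DECLINED"],
    "c2": ["A_SUBMITTED", "A_DECLINED"],
    "c3": ["A_SUBMITTED", "A_DECLINED"],
    "c4": ["A_SUBMITTED", "W_Afhandelen leads", "A_DECLINED"],
    "c5": ["A_SUBMITTED", "W_Afhandelen leads", "A_DECLINED"],
    "c6": ["A_SUBMITTED", "W_Afhandelen leads", "A_CANCELLED"],
}
kept = filter_by_variant_frequency(traces, min_cases=2, max_cases=2)

# Count how the remaining cases end (declined vs. canceled)
outcomes = Counter(trace[-1] for trace in kept.values())
```

Only the variant shared by exactly two cases survives the filter, and counting the last activity of the kept cases gives the declined/canceled breakdown.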

What takes the most time for those cases?

The transition taking the most time is W_Completeren aanvraag\\COMPLETE - W_Completeren aanvraag\\START (in total
duration as well as in median, mean and maximum duration).

What is their typical duration?

The median case duration for this set of variants is 5.8 hours, while the mean case duration is 14.7 hours.
7. Filter out (remove) the traces that belong to the first two variants of the mainstream behavior (6a and 6b) and
analyze the log with the rest of the traces. Split it into the sublog of the cases that were approved (a trace
contains A_APPROVED\\COMPLETE) and the sublog of the cases that were not approved (use the attribute
filter for that). Mine the models for these sublogs and compare them based on statistical characteristics. What
are the main differences you can find?

After applying the Variation filter to remove the first two variants and an Attribute Filter to filter the approved cases
(filtering mode Mandatory with attribute value A_APPROVED\\COMPLETE), one can see the following characteristics:
2246 (17,16%) of 13087 cases.
With 33 activities
Median case duration of 14,5 days
Mean case duration of 16,7 days

After applying the Variation filter to remove the first two variants and an Attribute Filter to remove the approved cases
(filtering mode Forbidden with Attribute value A_APPROVED\\COMPLETE), one can see the following characteristics:
5540 (42,33%) of 13087 cases.
With 31 activities
Median case duration of 9,7 days
Mean case duration of 13,5 days

There are more than twice as many non-approved cases as approved cases (5540 vs 2246). The number of activities
involved in the two sub-logs is almost the same (31 for non-approved vs 33 for approved). The mean duration of
non-approved cases is more than 3 days shorter than that of approved cases (13,5 days vs 16,7 days).
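
The Mandatory and Forbidden filtering modes correspond to a simple split on whether a trace contains a given activity. A minimal Python sketch, assuming traces are lists of activity names; the toy `traces` and the helper `split_by_activity` are hypothetical, and lifecycle suffixes are omitted for brevity. The last lines check that the case counts reported above are mutually consistent:

```python
def split_by_activity(traces, activity):
    """Return (cases containing the activity, cases not containing it)."""
    mandatory = {cid: t for cid, t in traces.items() if activity in t}
    forbidden = {cid: t for cid, t in traces.items() if activity not in t}
    return mandatory, forbidden

# Hypothetical toy log (not taken from the assignment data)
traces = {
    "c1": ["A_SUBMITTED", "A_APPROVED"],
    "c2": ["A_SUBMITTED", "A_DECLINED"],
    "c3": ["A_SUBMITTED", "A_CANCELLED"],
}
approved, not_approved = split_by_activity(traces, "A_APPROVED")

# Consistency check on the reported figures: the full log minus the two most
# frequent variants should equal the approved plus the non-approved sub-logs.
remaining = 13087 - 3429 - 1872
reported_total = 2246 + 5540
```

Both `remaining` and `reported_total` evaluate to 7786, so the two sub-logs together cover exactly the cases left after removing Variants 1 and 2.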

II. (2.5 points) Continue your analysis based on the insights you gained when answering questions in task I.
Choose three additional perspectives to consider and introduce appropriate filters to use in order to analyze the
process behavior further. Choosing interesting filters is part of the assignment!

Perspective I: Long running cases

Considering the set of cases with a minimum duration of 31 days and 14 hours, this accounts for 4% of all the cases. I
chose that minimum duration because lowering the limit by 1 hour (to 31 days and 13 hours) would make the
percentage of cases jump from 4% to 10%. The resulting set would be more diverse, making it harder
to draw insights and conclusions from it.

One can see a total of 569 cases with a mean duration of 41.6 days:

To verify the statistics of the approved and non-approved cases, one can apply a Mandatory filter (or a Forbidden filter for
the non-approved scenario) over the attribute Activity with value A_APPROVED\\COMPLETE. The output of that filter is
the following Overview of statistics:

For the long-running and Approved cases, we have:


159 (1,2%) of 13087 cases (28% of Long-running cases)
On 33 activities
Median case duration of 38,1 days
Mean case duration of 41,9 days

For the long-running and Non-approved cases, we have:


410 (3,13%) of 13087 cases (72% of Long-running cases)
On 31 activities
Median case duration of 36,6 days
Mean case duration of 41,4 days

Note that the difference between approved and non-approved cases in this "long-running" set is noticeable (28%
vs 72%). Hence, knowing that such cases can last up to 91 days and 10 hours (according to the overview graph on Case Duration),
it could be a good suggestion to automatically cancel cases running for more than 31 days and 14 hours. One way
to implement this would be to establish time limit checkpoints in the process.
Perspective II: Exceptional cases

To analyze the behavior of these exceptional cases one can apply a Variation Filter selecting the range of cases grouped
in the left side of the bar graph. These are cases whose sequence of activities is not shared with other cases (hence, they
are indeed exceptional in our log). They account for 28% of the cases.

We can see the following general statistics for this set:


3754 (28%) of 13087 cases
On 36 activities
Median case duration of 16,19 days
Mean case duration of 20,1 days

Additionally, we can apply Attribute filters in Mandatory mode on the Application Approved, Application Declined and
Application Canceled activities.

The set of approved cases with exceptional behavior is composed of 1952 cases (51,9% of exceptional behavior cases).
The set of declined cases with exceptional behavior is composed of 614 cases (16,3% of exceptional
behavior cases), and the canceled subset is composed of 944 cases (25,1% of exceptional behavior cases).

These numbers suggest that the Exceptional Behavior cases should still be considered by the organization due to the high
percentage of approvals. In other words, one cannot conclude after this analysis that these "strange" cases should be
detected and dismissed in order to improve and simplify the process.

On the other hand, it is noticeable that these 1952 approved cases with exceptional behavior account for 86,9% of all
approved cases in the log (2246). This could suggest that the complexity of the process allows applicants
to find "strange" ways to get their applications approved. Hence, this finding could be used in a monitoring and
supervision process, or as an argument for simplifying or tightening the approval process.
Perspective III: Declined Offers

A possible way to improve the loan request process could be to detect beforehand which loan applications that would pass
all the bank checks and get approved are likely to end up not being accepted by the applicants.

Those applications, approved by the bank but declined by the applicant, can consume considerable resources in the process.
Finding patterns to detect and filter them could therefore be a valuable proposal.

To analyze the behavior of these cases, one can apply a simple Attribute Filter by Activity on Mandatory Mode over the
value Offer Declined (O_DECLINED\\COMPLETE).

This set of cases represents 6% of all cases and 15,99% of all loan offer cases (the number
of cases that include a loan offer can be checked by applying a filter similar to the one applied in question #3 for the O_ activities).
The following general statistics about this set can be seen in the Overview:
802 (6,1%) of 13087 cases
On 30 activities
Median case duration of 13,9 days
Mean case duration of 16,4 days

The snapshots below show the general statistics of cases including a loan offer, followed by the general statistics of the
cases for declined loan offers and the cases of accepted loan offers. All of them along with the corresponding distribution
over case duration:

The overall case duration numbers do not show a noticeable difference in pattern between the accepted and the declined
offers. On the other hand, one might think that a loan offer is more likely to be declined by the applicant if the time
between submitting the application and receiving an offer is long.

To check this hypothesis, one can apply an Attribute filter by Activity with the O_ activity values and the
Application Submitted activity value in Keep Selected filtering mode, and then apply another Attribute filter by Activity with
the value Offer Declined in Mandatory filtering mode. Finally, to make the comparison, one can apply the same
filters with the activity Offer Accepted instead of Offer Declined, as shown below:

The filtered (and simplified) maps above show the mean time between activities as a performance indicator. There is a slight
difference between the mean response time for accepted offers and the mean response time for declined offers.
Another determining factor could be the amount of the loan. Although the analysis would be more interesting if the log
had included the amount the bank finally offered (in order to consider the difference between the amount requested and
the amount offered), it is still possible to analyze the distribution of the amounts requested in these declined cases by simply
checking the (case) AMOUNT_REQ Statistics View graphs:

One can see there is a considerable percentage of offers declined with the minimum amount requested, but the distribution
of this graph is very similar to the complete log graph.

This perspective could be explored further with a Data Mining tool like RapidMiner. The goal would be to
determine (using all available attributes) patterns in the applications that are likely to end up not being accepted by the
applicants. In other words, one would like to predict from the beginning of the process whether a loan offer will be
accepted by the applicant.
Part 2. RapidMiner

I. (1.5 points) Start using RapidMiner by performing the following simple analysis:
Select the cases that took more than 3,0 days (use Filter Examples with the appropriate condition on attribute
values to select rows), then build a decision tree predicting whether a loan application got approved or not (assume
that it got approved if A_APPROVED\\COMPLETE occurred in the trace, and otherwise it is rejected).

Play with the configuration parameters for the decision tree miner and the attributes you use, choose three most
interesting (from your point of view) decision trees and explain what insights into the business process you can
gain from them. Reflect on the reliability of your results (based on confusion matrices, or support, confidence and
lift).

(i) Interesting tree #1: using Case Duration, Number of Activities and Amount Requested to predict the Approval.

With this tree I expect to see how these key attributes, which are not related to the type of activities performed in the
process, can predict the approval of an application. Before building the decision tree, I first apply a filter to discard
the cases with a requested amount below 5000 and the cases with fewer than 22 activities performed (because the latter
condition implies that 100% of the cases are not approved). The workflow of the main process is shown below:

The resulting decision tree shows that if the case duration is larger than 29930, it is very unlikely that the application will be
approved unless the number of activities performed is larger than 75500 (in which case there is a 66% probability of approval).
On the other side of the tree, there is still a lot of uncertainty, which could be reduced by adjusting the decision tree parameters
or by filtering the cases with a case duration below 29930.

The performance vector shows an accuracy of 60.2%, with a fair number of true positives (316 of 444) but a
poor number of true negatives (266 of 522).
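
The reported accuracy can be related to the per-class counts with a short calculation. The Python sketch below assumes that "x of y" means x correctly classified cases out of y cases that actually belong to that class. This reading is an assumption: it reproduces the accuracy reported for tree #1 and has not been checked against the other trees. The helper `accuracy` is illustrative:

```python
def accuracy(correct_pos, total_pos, correct_neg, total_neg):
    """Overall accuracy: correctly classified cases divided by all cases."""
    return (correct_pos + correct_neg) / (total_pos + total_neg)

# Counts reported for interesting tree #1
acc_tree1 = accuracy(316, 444, 266, 522)  # 582 / 966, about 0.602

# Per-class rates under the same assumption
rate_approved = 316 / 444      # share of approved cases predicted correctly
rate_not_approved = 266 / 522  # share of non-approved cases predicted correctly
```

Under this reading about 71% of the approved cases but only about 51% of the non-approved cases are classified correctly, so the tree is barely better than a coin flip on the negative class.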

(ii) Interesting tree #2: using Employee activities

For this tree we take into consideration a data set with records of some employee
activities, such as Fraud verification (W_Beoordelen fraude\\START) and Handling of incomplete dossiers
(W_Nabellen incomplete dossiers\\START). The workflow is shown below:

It is noticeable that the cases in which incomplete dossiers were handled show a high rate of approved applications. On
the other hand, for the complete cases (where no incomplete dossier handling was performed), a high
percentage of applications with fewer than 71 activities and a case duration below 33.634 were not approved.

For this tree, the performance vector shows an accuracy of 72.9%, with a low number of true positives (214 of 450) but a
high number of true negatives (603 of 700).

(iii) Interesting tree #3: using Amount Requested to predict the Approval.

For this new tree, I want to give more importance to the Amount Requested attribute than to the other attributes. To
do so, I discretized the other attributes in order to reduce their visual weight in the tree. In addition, I used a
discretization operator on the Amount Requested attribute to categorize the amount by range. The workflow
used is shown below:

What is noticeable about these results is that the amount requested does not seem to be a determining factor in the approval
of an application. The figure below shows that the approval ratio is very similar across all the requested amount ranges.
Separately, the employee activity for handling incomplete dossiers once again shows, in this new
tree, its relevance to the final approval decision.

In this case the performance vector is very similar to that of the previous tree. One can see an accuracy of 73.25%, with a low
number of true positives (210 of 442) but a high number of true negatives (562 of 642).

II. (2 points) Choose your own question which you would like to analyze further by using decision tree mining (e.g.
predict the case processing time), association rule learning or clustering. Perform the analysis and present its
results, reflecting on their reliability and business insights. If you want to build a decision tree explaining how the
processing time of a case depends on the activities performed in the case, you choose Case duration as the label.
Since labels should be of type nominal for building a decision tree, the operator Discretize can be used on
Processing time (you can change the way it is discretized using an appropriate discretization scheme).

Question: Predict the case processing time

In order to predict the value of the Case duration, I am going to apply the following discretization scheme for this label
attribute:

1-3 days
3-10 days
10-20 days
20-30 days
MT 30 days (more than 30 days)
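
The discretization scheme above can be expressed as a lookup over bin boundaries. A minimal Python sketch, assuming the case duration is given in days and that each upper boundary belongs to the lower bin; the "up to 1 day" label for durations below the first listed bin is a hypothetical addition, and `duration_class` is an illustrative helper rather than a RapidMiner operator:

```python
from bisect import bisect_left

# Upper boundaries (in days) of all bins except the last, open-ended one
BOUNDS = [1, 3, 10, 20, 30]
# "up to 1 day" is an assumed extra class for durations below the listed bins
LABELS = ["up to 1 day", "1-3 days", "3-10 days", "10-20 days", "20-30 days", "MT 30 days"]

def duration_class(days):
    """Map a case duration in days to its nominal class label."""
    return LABELS[bisect_left(BOUNDS, days)]

examples = {d: duration_class(d) for d in (0.5, 2, 10, 30, 45)}
```

With `bisect_left`, a duration of exactly 30 days falls into "20-30 days", and only durations strictly above 30 days are labeled "MT 30 days".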

Assuming that this prediction is needed before the process starts or at least in any checkpoint of the process, I should
consider the following attributes that represent the execution of non-trivial activities and the previously known values of the
applications:

AMOUNT_REQ: Amount Requested


W_Beoordelen fraude\\START: Execution of Fraud verification activity
W_Nabellen incomplete dossiers\\START: Execution of Handling of incomplete dossiers

The workflow discretizes these attributes and executes a split of data to perform the corresponding validations and generate
a decision tree:
The results show that if the dossier is incomplete, the application process is likely to last more than 10 days. If the
dossier is complete, the application process is likely to last less than 1 day.

The performance indicator for this decision tree shows the confusion matrix with an accuracy of 62,4%.

Part 3. Combining Insights

In Perspective III of the Disco part, I analyzed the behavior of the cases with declined offers. I found a slight
difference between the response times of the accepted and not accepted cases. Additionally, I analyzed the Amount
Requested attribute of these cases without reaching any conclusion. Now, it is possible to extend that analysis with a data
mining tool like RapidMiner.

The goal in this section is to determine patterns in the applications that are likely to end up not being accepted by the
applicants. In other words, at the moment of approval, one would like to know the probability of final acceptance of a loan
offer considering the most relevant attributes available from the process, including the Case Duration and the Amount
Requested.

In this case the label attribute will be O_ACCEPTED\\COMPLETE. Then, the attributes taken into consideration for the
decision tree are going to be the following:

AMOUNT_REQ: Amount Requested


Case duration
W_Beoordelen fraude\\START: Execution of Fraud verification activity
W_Nabellen incomplete dossiers\\START: Execution of Handling of incomplete dossiers
As in the process applied for Interesting tree #3, various discretizations are performed, the initial data set
is filtered by Case Duration, and the final data set is split in an 80/20 ratio to perform the validation and to generate the
decision tree. The process workflow is shown below:
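
Independently of the RapidMiner workflow, the 80/20 split itself can be illustrated with a short Python sketch. It assumes a simple shuffled split without stratification, which may differ from the sampling type configured in the actual workflow; `split_80_20` and the toy `rows` are hypothetical:

```python
import random

def split_80_20(rows, train_ratio=0.8, seed=42):
    """Shuffle a copy of the rows and split it into a training and a hold-out part."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]

rows = list(range(100))  # stand-in for 100 cases of the filtered data set
train, holdout = split_80_20(rows)
```

The tree is then built on `train` only, and the confusion matrix is computed on `holdout`, so the reported accuracy reflects cases the tree has not seen.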

For this decision tree I decided to reduce the maximal depth in order to simplify it. That decision came after realizing that
the accuracy difference between a 5-level tree and a 4-level tree in this particular case is less than 1%.
The output tree shows a clear relationship between accepted offers and the execution of the handling incomplete dossiers
activity. The case duration is also an influencing factor in offer acceptance. It is indeed noticeable that the decision tree
classifies a high percentage of cases with a duration below 30 as more likely to be finally accepted.

The performance vector of this tree shows an accuracy of 73.13%, with a low number of true positives (216 of 450) but a
high number of true negatives (607 of 700).