Dante Niewenhuis
11058595
Bachelor thesis
Credits: 18 EC
University of Amsterdam
Faculty of Science
Science Park 904
1098 XH Amsterdam
Supervisor
dr. Sander van Splunter
Informatics Institute
Faculty of Science
University of Amsterdam
Science Park 904
1098 XH Amsterdam
Abstract
In binary classification problems, several features present in a data set do not influence the prediction process. These features are redundant and go unused, but they do cause the learning algorithm to be slower and more prone to overfitting. In this thesis, an attempt is made to create a system that removes these redundant features from a data set using a combination of Weight of Evidence and XGBoost. This system is evaluated using neural networks, comparing both the balanced accuracy and the F-score. This thesis is written in collaboration with ABN-AMRO, using their incident data set. Aside from the ABN data set, three other data sets are evaluated to get a broader understanding of the impact of the method used. All four data sets tested resulted in a significant reduction of the number of features without a drop in predictive power. One of the data sets showed a significant increase in both the balanced accuracy and the F-score. Evaluating the results has shown that the combination of Weight of Evidence and XGBoost gives more consistent and better results than either method on its own.
Contents

1 Introduction
  1.1 Research questions
  1.2 Hypotheses
  1.3 Context: Blue Student Lab and ABN-AMRO
    1.3.1 Related Thesis 1: Predicting resolution time
    1.3.2 Related Thesis 2: Predicting assignment group
    1.3.3 Related Thesis 3: Predicting caused by change
    1.3.4 Related Thesis 4: Clustering events and incidents
2 Background knowledge
  2.1 Weight of Evidence and Information Value
    2.1.1 Binning
    2.1.2 Calculating Weight of Evidence
    2.1.3 Calculating Information Value
    2.1.4 Potential problems
  2.2 Decision Tree Algorithm
    2.2.1 Random Forest
    2.2.2 AdaBoost
    2.2.3 XGBoost
    2.2.4 Potential problems
  2.3 Neural Networks
    2.3.1 Overfitting and speed
  2.4 Encoding
    2.4.1 Ordinal Encoding
    2.4.2 One-Hot Encoding
  2.5 Evaluation Metrics
3 Data
  3.1 Simple data set - Titanic
  3.2 Big data set - WeatherAUS
  3.3 Complex data set - Adult
  3.4 Domain data set - ABN-AMRO OOT
4 Method
5 Results
  5.1 Simple data set - Titanic
  5.2 Big data set - WeatherAUS
  5.3 Complex data set - Adult
  5.4 Domain data set - ABN-AMRO OOT
  5.5 Analysis
6 Conclusion
8 Acknowledgements
References
Appendices
A Data clarification
  A.1 Simple data set - Titanic
  A.2 Big data set - WeatherAUS
  A.3 Complex data set - Adult
  A.4 Domain data set - ABN-AMRO OOT
B Test Results
  B.1 Titanic
    B.1.1 Accuracy
    B.1.2 Loss
  B.2 Weather
    B.2.1 Accuracy
    B.2.2 Loss
  B.3 Adult
    B.3.1 Accuracy
    B.3.2 Loss
  B.4 ABN-AMRO
    B.4.1 Accuracy
    B.4.2 Loss
Abbreviations
bAcc Balanced Accuracy
DTC Decision Tree Classifier
IG Information Gain
IV Information Value
NN Neural Network
OOT Out Of Time
WoE Weight of Evidence
XGBoost Extreme Gradient Boosting
1 Introduction
Large-scale organizations rely on many hundreds of applications. For such applications, lack of availability, reliability, or responsiveness can lead to extensive losses (Wang et al., 2013). For example, customers being unable to place orders could cost Amazon up to $1.75 million per hour (Wang et al., 2013), which means that knowledge of software, hardware and their incidents is vital. This thesis is written in collaboration with ABN-AMRO1 and attempts to gain information from incident data. ABN-AMRO is a large organization based in the Netherlands that deals with a large number of applications in many different fields of operation, ranging from online banking to internal communication systems. Having this many different systems working together creates many possible problems, which need to be solved as quickly as possible. When an incident is reported, it is assigned a priority rating as well as a time of completion. If this time is not met, the result is an out-of-time (OOT) incident. Reducing the number of OOTs is a big priority for ABN-AMRO.
In 2018, ten Kaate attempted to create a system capable of predicting whether an incident would go out of time based on the first documentation (ten Kaate, 2018). This was achieved using a multi-layered neural network and resulted in an accuracy of 0.7679, but only a precision of 0.2169 (ten Kaate, 2018). Neural networks have the positive characteristic that most data problems can be predicted quite accurately without much added knowledge. Neural networks, however, have problems with readability: it is hard to know what the more important features are, or why certain data sets are less complicated to predict than others. This makes neural networks very effective when only predictions are needed, but insufficient when looking for insight into the solution. Knowing why incidents are predicted to be out of time could help ABN-AMRO reduce the number of incidents rather than merely predict them.
In this thesis, an attempt is made to expand on the project by ten Kaate by building a system that removes features that are redundant for making predictions. Besides readability, reducing features has more advantages. The first obvious improvement is the speed of the algorithm: regardless of the kind of algorithm used, more features almost always mean slower execution, so removing redundant data will always have a positive impact on speed. The second advantage is a lower chance of overfitting. Overfitting occurs when the algorithm, instead of finding patterns that help with predicting, merely memorizes the data. Many factors can cause overfitting, and features that do not add new predictive information are one of them. Reducing the features in a data set could therefore lower the possibility of overfitting and thereby improve predictive power.
The system proposed in this thesis is a combination of Weight of Evidence2 (WoE) and Extreme Gradient Boosting3 (XGBoost). WoE is a measure of how much a feature supports or undermines a hypothesis. WoE is ideally used when dealing with binary problems but can be modified to work on classification problems with more than two possible categories. WoE is further explained in Subsection 2.1. XGBoost is a tree boosting algorithm. Tree boosting algorithms combine multiple weak learners to create a strong learner. An advantage of XGBoost and other boosting algorithms is their readability. XGBoost is an
1 https://www.abnamro.nl
2 https://www.listendata.com/2015/03/weight-of-evidence-woe-and-information.html
3 https://xgboost.readthedocs.io/en/latest/
ideal algorithm to determine the importance of features in a data set. XGBoost is further
explained in Subsubsection 2.2.3.
Removing features has many advantages, but those advantages are worthless if the re-
moval causes a significant drop in predictive power. This is why evaluation focuses primarily
on the impact of the removal on the predictive power. Evaluation of the system is done using
neural networks. For the evaluation, three neural networks are trained: one trained on all
features as a reference, one trained on the important features, and one on the unimportant
features. The three networks are compared based on predictive power. Predictive power is
based on both the balanced accuracy (bAcc) and the F-score. In this thesis, a significant
drop in predictive power is defined as a drop of more than 0.05 in either F-score or bAcc.
All evaluation metrics used in this thesis are explained in Subsection 2.5. In an ideal result,
the network trained on the important features would have no significant drop in predictive
power compared to the reference network, while the network trained on the unimportant
features would have a significant drop in predictive power.
1.2 Hypotheses
The first subquestion is expected to succeed since the method used is based on research into variable reduction using Weight of Evidence (Lin & Hsieh, 2014). The second subquestion is also expected to succeed, based on papers written on the possibilities of using random forest algorithms for feature reduction (Genuer, Poggi, & Tuleau-Malot, 2010). XGBoost is, like a random forest, an ensemble of decision trees and is therefore also expected to be capable of feature reduction. The third and fourth subquestions are much harder to predict. The combination of WoE and XGBoost would ideally combine the strengths of both methods and produce better and more reliable results. The questions stated above are answered using the ABN-AMRO incident data sets, as well as three extra data sets. The three extra data sets are chosen based on size and difficulty to predict. This ensures that this thesis provides a broader overview of the reliability of the methods used.
1.3.1 Related Thesis 1: Predicting resolution time
The first thesis in the OOT group is written by Riemersma (2019). In her thesis, Riemersma attempts to expand on the system of ten Kaate by predicting not only whether an incident will be out of time but also by how much. Solving incidents in time is a complex task that can be optimized in several different ways. One aspect that may help this process is knowing the resolution time of an incident beforehand.
1.3.2 Related Thesis 2: Predicting assignment group
The second thesis in the OOT group is written by Wiggerman (2019). Wiggerman attempts to reduce OOT incidents by assigning incidents directly to the right assignment group. When an incident is noticed, it is assigned to an assignment group. If this assignment group is unable to solve the incident, it is passed on to another. This process continues until the incident is solved. The problem is that every assignment group needs to repeat many steps of the solving process, which means that time is spent very inefficiently. It is therefore no surprise that incidents with a high number of different assignment groups are more likely to take too long to solve. Wiggerman attempts to improve this process using neural networks and k-nearest-neighbour clustering algorithms to predict the best assignment group for a given incident.
1.3.3 Related Thesis 3: Predicting caused by change
The third thesis in the OOT group is written by Velez (2019). In his thesis, Velez creates a theoretical model that could predict whether an incident is caused by a change. In large-scale software organizations, up to 80% of the incidents are caused by previous changes made (Scott, 2001). Having a system that could predict the change that caused an incident would be beneficial when trying to solve software incidents and prevent further ones from occurring. Velez attempts to predict whether an incident is caused by a change using PU learning4. PU learning is a niche machine learning technique which uses a combination of machine learning algorithms and a special sampling method to handle incorrectly labelled data.
1.3.4 Related Thesis 4: Clustering events and incidents
The fourth thesis in the OOT group is written by Knigge (2019). At ABN-AMRO there are, besides incidents, also events. Events are incidents that are detected and registered by automatic systems within the organization. An example of an event is a bot that tries to log into the system every few minutes and creates an event every time it fails. Because events are created automatically, there is a tendency to create many events for the same problem. This can be overwhelming for teams solving incidents, and thus many of these events are ignored. In his thesis, Knigge looks at the possibilities of clustering these events so that it is easier to recognize new events and filter out the duplicates. Knigge also tries to connect the events to an incident. In the example given above, this would mean that when an incident is created because a customer could not log in, this incident would be connected to the events created by the bot.
4 https://www.cs.uic.edu/~liub/NSF/PSC-IIS-0307239.html
2 Background knowledge
2.1 Weight of Evidence and Information Value
Weight of Evidence (WoE) is a topic that has appeared in scientific literature for at least the last 50 years (Weed, 2005). It has mostly been used as a method of risk assessment but can also be used for segmentation, variable reduction and various other purposes. In this thesis, WoE is used for variable reduction, using a method that is primarily based on a paper by Lin and Hsieh (2014). Lin and Hsieh use WoE to assess the predictive power of a feature by separating the data into multiple bins and calculating the difference between the proportion of events in each bin and in the rest of the data. The bigger the discrepancy, the higher the WoE. In this thesis, an event means that the target value is true, while a non-event means the target value is false. The target is the feature that the algorithm tries to predict. For example, in the OOT data set, the goal is to predict whether an incident is going to be OOT. This means that the target is the feature OOT, an event is when the incident is OOT, and a non-event is when the incident is not OOT.
2.1.1 Binning
The method used in this thesis consists of four steps. The initial step is to separate the feature into bins. In a paper about WoE, Guoping states that three rules should be followed when binning a data set for WoE (Guoping, 2014). The first rule states that each bin should contain at least 5% of the observations. This is done to prevent the final score from being determined by a small fraction of the data. The second rule states that the missing values have to be placed in a separate bin. The third rule states that every bin should have at least one event and one non-event. The third rule has not been followed in this thesis because the data used did not always allow for it. The problems caused by a bin with either no events or no non-events are solved using an adjusted WoE equation, which is explained in Subsubsection 2.1.2. In this thesis, the data is divided into nine bins plus one bin for missing data. The nine bins for the values are made as similar in size as possible. If the feature has fewer than nine unique values, the number of bins is equal to the number of unique values. The bins are made using the cut function from the Pandas5 package in Python.
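The binning step above can be sketched with pandas. This is a simplified sketch, not the thesis code: the function name is illustrative, and qcut is assumed here because it produces bins of similar size, as the text describes.

```python
import pandas as pd

def bin_feature(series: pd.Series, n_bins: int = 9) -> pd.Series:
    """Split a numeric feature into at most `n_bins` similarly sized bins,
    with missing values placed in a separate 'Missing' bin."""
    # Use at most as many bins as there are unique non-missing values
    n_bins = min(n_bins, series.dropna().nunique())
    # qcut makes the bins as similar in size as possible;
    # duplicates="drop" merges bins whose edges coincide
    binned = pd.qcut(series, q=n_bins, duplicates="drop").astype(object)
    binned[series.isna()] = "Missing"
    return binned
```

The resulting series contains interval labels for observed values and the literal label "Missing" for the separate missing-value bin.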
2.1.2 Calculating Weight of Evidence
The second step is to calculate the WoE for every bin. The equation to calculate the WoE is as follows:

\text{WoE} = \ln\left(\frac{\%\,\text{Events}}{\%\,\text{nonEvents}}\right) \qquad (1)
The WoE is calculated using the percentage of both events and non-events. Note that the
percentage of the events does not mean the percentage of the observations in the bin that
are events, but the percentage of events compared to the total number of events in the data
set. The WoE is positive when the percentage of events is higher than the percentage of non-
events and grows when the discrepancy grows. The WoE is negative when the percentage of
5 https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.cut.html
events is lower than the percentage of non-events and decreases when the discrepancy grows.
The WoE is zero when the percentage of events is equal to the percentage of non-events.
This equation for the WoE works for most cases but does assume that every bin has at least one observation that is an event and at least one that is a non-event, because dividing by zero and taking the natural log of zero are mathematically impossible. As stated in Subsubsection 2.1.1, this thesis does not follow the third rule of the paper by Guoping, and because the system created in this thesis should work for many different data sets, it cannot be guaranteed that bins with either zero events or zero non-events are absent. To accommodate all data, an adjusted equation for WoE is used, which adds a small constant (0.5) to the event and non-event counts of each bin so that the logarithm is always defined. The adjusted equation is as follows:

\text{WoE}_{\text{adj}} = \ln\left(\frac{(\#\text{Events}_{\text{bin}} + 0.5)\,/\,\#\text{Events}}{(\#\text{nonEvents}_{\text{bin}} + 0.5)\,/\,\#\text{nonEvents}}\right) \qquad (2)
2.1.3 Calculating Information Value
The third step is to calculate the Information Value (IV) for all bins. The WoE is the degree of difference between the ratio of events in a single bin and in the whole feature. However, to state something about the predictive power of a feature, the IV is needed. The equation of the IV is as follows:

\text{IV} = \sum_{i=1}^{n} (\%\,\text{Events}_i - \%\,\text{nonEvents}_i) \cdot \text{WoE}_i \qquad (3)
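Steps two and three can be sketched together in pandas. This is a simplified sketch, not the thesis code; the 0.5 smoothing term is an assumption, used to keep the logarithm defined for bins that contain zero events or zero non-events.

```python
import numpy as np
import pandas as pd

def woe_iv_table(binned: pd.Series, target: pd.Series) -> pd.DataFrame:
    """Compute per-bin WoE and IV for a binary target (1 = event)."""
    df = pd.DataFrame({"bin": binned, "event": target.astype(int)})
    grouped = df.groupby("bin", observed=True)["event"].agg(["sum", "count"])
    events = grouped["sum"]
    non_events = grouped["count"] - grouped["sum"]
    # Share of all events / non-events that falls in each bin;
    # the 0.5 term keeps the logarithm defined for empty classes
    pct_events = (events + 0.5) / events.sum()
    pct_non_events = (non_events + 0.5) / non_events.sum()
    woe = np.log(pct_events / pct_non_events)
    # Per-bin IV; the total IV of the feature is iv.sum()
    iv = (pct_events - pct_non_events) * woe
    return pd.DataFrame({"events": events, "non_events": non_events,
                         "WoE": woe, "IV": iv})
```

Because the WoE and the percentage difference always share the same sign, each per-bin IV is non-negative, so the total IV only grows as discrepancies between bins grow.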
2.1.4 Potential problems
Even though WoE has many advantages, it has two potential problems. The first is that WoE depends on the quality of the binning. Because WoE is calculated based on the difference between the bins and the total data, the method of binning can influence the results.
The second potential problem is the fact that WoE is based purely on the feature by itself. This can cause a problem when a feature is important for predicting a subset of the data but not for the data as a whole: because the WoE is calculated over the entire data set, such a feature could be classified as unimportant even though it is important for that subset.
(a) Weight of Evidence table for feature A. Feature A has low predictive power
Range bins nonE E % nonE %E % E - %nonE WoE IV
0-20 1 198 97 19.7 19.5 -0.2 -0.0002 0
20-40 2 204 105 20.3 21.1 0.8 0.0009 0.0007
40-60 3 197 98 19.7 19.6 -0.1 -0.0002 0
60+ 4 196 102 19.6 20.5 0.8 0.001 0.001
Missing 5 207 98 20.7 19.7 -1.0 -0.0012 0.0012
Total 1002 498 0.0029
(b) Weight of Evidence table for feature B. Feature B has high predictive power
Range bins nonE E % nonE %E % E - %nonE WoE IV
0-20 1 250 80 23.7 14.6 -9.1 -0.013 0.117
20-40 2 180 140 17.1 25.5 8.4 0.0094 0.080
40-60 3 250 80 23.7 14.6 -9.2 -0.013 0.117
60+ 4 196 120 18.6 21.8 3.2 0.0039 0.012
Missing 5 180 130 17.1 23.6 6.5 0.0079 0.05
Total 1056 550 0.376
Figure (1) Weight of Evidence tables for two different features. The low total IV score of feature A suggests low predictive power, while the high total IV score of feature B suggests high predictive power. (Note that "events" was abbreviated to E.)
E = -\sum_{i=1}^{n} p_i \ln(p_i) \qquad (4)

In this equation, p_i is the fraction of the total observations that is part of category i. When dealing with binary problems, there are only two possible categories, True or False. This simplifies the equation into:

E = -p \ln(p) - (1 - p) \ln(1 - p) \qquad (5)

The entropy function behaves like a parabola that has its peak at 0.5 with a value of ln 2 ≈ 0.69 and has a value of 0.0 if either everything is true or everything is false. After being split into subsets, the entropy of the data set is calculated using the weighted average of the subsets. The equation to calculate this weighted average is as follows:

E(S) = \sum_{i=1}^{n} P_i \cdot E(i) \qquad (6)

In this equation, E(S) is the entropy of the whole data set, while E(i) is the entropy of subset i. P_i is the fraction of the data that is part of subset i, which means that the entropy of larger subsets is weighted more heavily.
Figure 2 shows an example of an effective split. The data set consists of 11 observations, of which five are red stars and six are blue diamonds. The best prediction that could be made from this initial data set would be to predict all observations to be diamonds, which would result in only 55% of the predictions being correct. The difficulty of prediction is also shown by the high entropy value of 0.69.

The data set is split into two subsets, one consisting of all the observations with feature X larger than 30 and one consisting of the remaining observations. The entropies of the two subsets, 0.45 and 0.5 respectively, are lower than the entropy of the root. The entropy of the data set after the split is calculated using the weighted average and results in 0.47. The IG of the split has a value of 0.22, indicating that the split is effective.

Figure (2) Example of a simple decision tree
The example given in Figure 2 is of a simple tree consisting of only one split, while in reality many more splits are needed to predict complex data sets correctly. It is not uncommon for trees to grow to many hundreds of splits. When using decision tree classifiers (DTCs), it is advised to limit the number of splits to prevent overfitting.
2.2.1 Random Forest
Even though DTCs can be good classifiers and offer great readability, there are classification problems that are very hard to solve using normal DTCs. One method to improve the predictive power of tree algorithms is to extend them into a random forest algorithm.
Random forest algorithms function by creating a high number of simple DTCs that are all
trained on subsets of the data set. These small DTCs are called weak learners because they
have low predictive power by themselves. When a random forest algorithm wants to make
a prediction, all weak learners make a prediction. The predictions from the weak learners
are evaluated, and the most common prediction is chosen as the final prediction. Results
from research done by Breiman show that random forest algorithms are more reliable and
accurate when compared to algorithms that are based on a single tree (Breiman, 2001).
2.2.2 AdaBoost
AdaBoost is one of the most popular boosting implementations of tree ensemble algorithms. AdaBoost uses boosting to create and evaluate the large number of weak learners used in such ensembles. AdaBoost was the first practical boosting algorithm and is still one of the most widely used (Schapire, 2013). The first step of AdaBoost is to create a weak
learner similar to the one shown in Figure 2 based on the full data set. Note that every weak learner used by AdaBoost consists of a single split; such trees are also called stumps. A subset
correctly by the first tree. A second tree is created based on this new subset. This process
will repeat until either the desired number of weak learners are created, or the predictions
made by the algorithm have reached the desired accuracy. In AdaBoost, not all weak learners are weighted equally when predicting; each is assigned a weight which determines how much it influences the prediction. The weight of a weak learner is determined by the fraction of the data it correctly predicts.
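The procedure described above is available off the shelf in scikit-learn; a minimal sketch follows. The synthetic data set and the parameter values are illustrative, not those of the thesis; the default weak learner of AdaBoostClassifier is a depth-1 decision tree, i.e. a stump, matching the description above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

# Synthetic binary data as an illustrative stand-in for the thesis data sets
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 100 stumps are created sequentially by boosting, each one focusing
# on the observations the previous stumps predicted incorrectly
model = AdaBoostClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)
accuracy = model.score(X_test, y_test)
```

The held-out accuracy illustrates how many weighted stumps combine into a strong learner, even though each stump predicts poorly on its own.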
2.2.3 XGBoost
In this thesis, Extreme Gradient Boosting (XGBoost) is used instead of AdaBoost. XGBoost is similar to AdaBoost but has certain advantages which make it more suitable for this project. The first reason to use XGBoost is its optimization for sparse data: XGBoost has been shown to run 50 times faster on sparse data than naive boosting algorithms (Chen & Guestrin, 2016). Effective handling of sparse data is vital given the amount of sparse data used in this project. Benchmarks comparing different types of boosting algorithms6 show that XGBoost is among the fastest and most accurate boosting algorithms. XGBoost has proven to be very successful and is widely used in machine learning competitions. An example of this is the KDDCup 2015, where all top-10 finishers used XGBoost (Chen & Guestrin, 2016).
This thesis uses XGBoost to determine the importance of each feature. First, a model is trained using XGBoost. When using the Python version of XGBoost7, it is possible to obtain a list of feature importances. From this list, the most important features can be selected. Determining which features should be selected can be done using various methods,
6 http://datascience.la/benchmarking-random-forest-implementations/
7 https://xgboost.readthedocs.io/en/latest/python/python_intro.html
but in this thesis, a simple threshold is used. If a feature has higher importance than the threshold, it is selected; otherwise, it is removed. Increasing this threshold will decrease the number of selected features but will increase the possibility of a significant drop in performance.
2.2.4 Potential problems
Even though XGBoost has many advantages, it has two potential problems. The first potential problem is that XGBoost is a greedy algorithm, which means that XGBoost generates its splits using heuristics rather than processing the whole data set. This can result in XGBoost making locally optimal choices that are not always globally optimal, which could affect the importance value given to a feature.
The second potential problem is XGBoost's approach to features containing similar information. If two features contain similar information, for example by being correlated, XGBoost only needs one of the two for predicting. This means that one of the two features can get a very low importance rating, even though it is as important as the other feature.
Many machine learning models are prone to overfitting. Overfitting is the phenomenon
where instead of finding patterns in the data, the algorithm starts to memorize the data.
An example of overfitting is shown in Figure 3. In this figure, two algorithms attempt to
predict the value of feature Y based on feature X. Figure 3a shows a line that predicts the
value of feature Y while not being too complex. Figure 3b shows a line that predicts the
value of feature Y with a very complex line. While Figure 3b is much more accurate when
predicting the training data, it is much worse when predicting the validation data.
NNs have the advantage that they can learn very complicated relationships between
inputs and outputs. NNs are however very susceptible to overfitting (Lawrence, Giles, &
Tsoi, 1997). Lawrence et al. state that one reason for overfitting is the high number of weights present. A reduction of the number of features in a data set, using the methods proposed in this thesis, reduces the number of weights and could thereby reduce the possibility of overfitting (Lawrence et al., 1997). Another advantage of feature reduction is the
improvement in learning speed. The number of calculations done by a NN is based on
the number of weights; if this number is reduced it will automatically increase the training
speed.
Figure (3) Example of the difference between an algorithm that is overfitting and one that
is not.
2.4 Encoding
Many algorithms have difficulties when dealing with categorical data. These difficulties
are caused by the fact that most algorithms function using numeric data. To resolve this
problem, categorical data is processed using an encoder. There are many different methods
of encoding data, but only two are used in this thesis.
2.4.1 Ordinal Encoding
The first method used is ordinal encoding, in which each category is replaced by a numeric value. An example of ordinal encoding is shown in Figure 4. This method is sometimes also called numeric or integer encoding. In this thesis, ordinal encoding is executed using sklearn8. Ordinal encoding has the advantage that it is easy to execute and very space-efficient, given that it is one of the only encoding methods that does not add new columns to the data.
While ordinal encoding has many advantages, it also has some problems. One problem with ordinal encoding is that it implies a relationship between categories that might not be present. In the example given, the encoded data could imply that Rotterdam is twice Amsterdam, and London even more, even though this is not the case. Another problem with ordinal encoding is that not all types of algorithms can work with it optimally. In research on the impact of encoding data on the performance of a neural network, ordinal encoding was shown to be the worst-performing method of encoding tested (Potdar, Pardawala, & Pai, 2017).
   City             City
0  Amsterdam     0  1
1  Rotterdam     1  2
2  Amsterdam     2  1
3  Rotterdam     3  2
4  London        4  3

Figure (4) Example of ordinal encoding: the original City column (left) and its ordinal-encoded version (right).
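The encoding of Figure 4 can be reproduced with scikit-learn's OrdinalEncoder; a minimal sketch follows. Note one assumption to be aware of: OrdinalEncoder numbers categories alphabetically starting at 0, so the exact integers differ from the figure, which numbers categories by order of appearance.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

df = pd.DataFrame({"City": ["Amsterdam", "Rotterdam", "Amsterdam",
                            "Rotterdam", "London"]})

# Categories are numbered alphabetically starting at 0:
# Amsterdam -> 0, London -> 1, Rotterdam -> 2
encoder = OrdinalEncoder()
df["City_encoded"] = encoder.fit_transform(df[["City"]]).ravel()
```

No new columns are added beyond the encoded one, which is what makes this method so space-efficient.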
2.4.2 One-Hot Encoding
The second encoding method used in this thesis is one-hot encoding. One-hot encoding is one of the most used encoding methods because it requires no knowledge of the data and works very well with neural networks. One-hot encoding creates a separate column for every unique category in a column. The values in the new columns consist only of the values 1 and 0, stating whether the observation is part of the category or not. Figure 5 shows an example of one-hot encoding: the column City turns into separate columns for Amsterdam, Rotterdam and London, respectively. The reason NNs work so well with one-hot encoding is that they can assign different weights to each of the categories separately.
While one-hot encoding has many advantages, it also has some problems. The primary problem with one-hot encoding is space efficiency. One-hot encoding creates a new column for every unique category, which can lead to a very large data set, especially when the number of categories increases. The space efficiency can be improved by using sparse matrices9, but is still not ideal. To reduce the number of columns created, a method from ten Kaate's thesis is used whereby all categories that occur fewer than 5 times are placed together
8 https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OrdinalEncoder.html#sklearn.preprocessing.OrdinalEncoder
9 https://docs.scipy.org/doc/scipy/reference/sparse.html
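The one-hot encoding with rare-category grouping described above can be sketched as follows. This is a simplified sketch, not the thesis code: the function name, the "Other" label, and the use of pandas get_dummies are illustrative assumptions.

```python
import pandas as pd

def one_hot_with_rare(df: pd.DataFrame, column: str,
                      min_count: int = 5) -> pd.DataFrame:
    """One-hot encode `column`, first collapsing categories that occur
    fewer than `min_count` times into a single 'Other' category."""
    counts = df[column].value_counts()
    rare = counts[counts < min_count].index
    # Keep frequent categories as-is, replace rare ones with 'Other'
    collapsed = df[column].where(~df[column].isin(rare), "Other")
    return pd.get_dummies(collapsed, prefix=column)
```

Grouping the rare categories first caps the number of new columns, which is exactly the space problem the text describes.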
\text{Precision} = \frac{\sum tp}{\sum tp + \sum fp} \qquad (7)

\text{Recall} = \frac{\sum tp}{\sum tp + \sum fn} \qquad (8)

\text{F-score} = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \qquad (9)

\text{TrueNegativeRate} = \frac{\sum tn}{\sum tn + \sum fp} \qquad (10)

\text{Accuracy} = \frac{\sum tp + \sum tn}{\sum tp + \sum fp + \sum tn + \sum fn} \qquad (11)

\text{BalancedAccuracy} = \frac{\text{Recall} + \text{TrueNegativeRate}}{2} \qquad (12)

Figure (6) Equations used for evaluating the results. TrueNegativeRate and Accuracy are not directly used in this research but are given as context.
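The equations of Figure 6 can be computed directly from confusion-matrix counts; a minimal sketch (the function name is illustrative):

```python
def evaluation_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Compute the metrics of Figure 6 from confusion-matrix counts."""
    precision = tp / (tp + fp)                    # equation (7)
    recall = tp / (tp + fn)                       # equation (8)
    f_score = 2 * precision * recall / (precision + recall)  # equation (9)
    true_negative_rate = tn / (tn + fp)           # equation (10)
    accuracy = (tp + tn) / (tp + fp + tn + fn)    # equation (11)
    balanced_accuracy = (recall + true_negative_rate) / 2    # equation (12)
    return {
        "precision": precision,
        "recall": recall,
        "f_score": f_score,
        "true_negative_rate": true_negative_rate,
        "accuracy": accuracy,
        "balanced_accuracy": balanced_accuracy,
    }
```

For example, with tp=40, fp=10, tn=80 and fn=20, the precision is 0.8 while the recall is only 2/3; the F-score then sits between the two but closer to the lower value, illustrating how it punishes disparity.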
The precision of a model is the proportion of correct positive predictions among all positive
predictions. When an algorithm is optimized on precision, it will only predict positive when
it is confident, so true positives strongly outnumber false positives. This can be very
useful when developing a system where false positives are a big problem. An example could
be an automatic fine system in which an algorithm determines an offence and automatically
hands out fines without any human interference. The downside of precision is that it does
not take the number of false-negative predictions into account. Algorithms optimized on
precision tend to predict positive less often because they need to be almost certain to do
so, which means they can accumulate a high number of false negatives.
The recall of a model is the proportion of correct positive predictions among all actual
positives. When an algorithm is optimized on recall, it will predict significantly more
true positives than false negatives. This can be very useful when developing a system where
false positives are not a big problem. An example would be an algorithm used to filter data
before a human inspects it to determine the further process. In this case, false positives
are not a problem because the human evaluation can remove those, while the filtering is
still successful because it saves the observer a significant amount of work. The downside
of such algorithms is that they do not take the number of false-positive predictions into
account. They therefore tend to predict positive much more often, because doing so increases
the chance of catching all the positives.
The F-score combines precision and recall: it is their harmonic mean, which behaves like
an ordinary mean when precision and recall are close together but punishes disparity between
them.
Despite both recall and precision being capable of evaluating results, they both ignore
the true negative predictions. Accuracy incorporates the true negative predictions and could
thereby give a better evaluation of the results. Accuracy does, however, have a big downside
when working with unbalanced prediction targets. A prediction target is unbalanced when one
class is much more present than the other. An example of an unbalanced target is the OOT
data set used in this thesis, which contains far more incidents solved within the time limit
than incidents that are not. In Figure 7 an example of a problematic situation is described:
both the precision and the recall are extremely low while the accuracy is very high. This
problem is caused by the fact that, without a normalizing factor, the dominant class has
much more impact on accuracy. It can be resolved easily by using the balanced accuracy
(bAcc), which averages the recall of both classes and thereby neutralizes the imbalance.
Because not all data used in this thesis is balanced, it is vital that the evaluation
methods used are capable of working with unbalanced data, and thus the bAcc is used.
tp = 1, fp = 9
tn = 941, fn = 49

Precision = 1/10 = 0.1, Recall = 1/50 = 0.02

Accuracy = (1 + 941)/1000 = 0.942

TrueNegativeRate = 941/950 ≈ 0.99

BalancedAccuracy = (0.02 + 0.99)/2 ≈ 0.51
Figure (7) Example of the problem that can occur when using accuracy while working with
unbalanced data
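The metric definitions can be checked on the confusion counts from Figure 7 with a few lines of plain Python:

```python
# Confusion counts from the worked example
tp, fp, tn, fn = 1, 9, 941, 49

precision = tp / (tp + fp)                  # 1 / 10 = 0.1
recall = tp / (tp + fn)                     # 1 / 50 = 0.02
f_score = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + fp + tn + fn)  # 942 / 1000 = 0.942
tnr = tn / (tn + fp)                        # 941 / 950, about 0.99
balanced_accuracy = (recall + tnr) / 2      # about 0.51

# Accuracy looks excellent, while precision, recall and the balanced
# accuracy all expose the failure on the minority (positive) class.
```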
3 Data
In this project, four data sets are used to ensure that the methods tested are effective on
different data sets. The data sets vary in size, number of features and difficulty of
prediction. All data sets are preprocessed in two steps. The first step removes all columns
of which more than 95% of the rows are empty. The second step removes all columns that
either consist of only one unique value or consist of more than 95% unique values. Both
preprocessing steps are used to reduce overfitting. A more elaborate overview of the data
sets can be found in Appendix A.
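The two preprocessing steps can be sketched as follows, assuming the data sets are loaded as pandas DataFrames (the function and column names are illustrative):

```python
import pandas as pd

def preprocess(df, empty_frac=0.95, unique_frac=0.95):
    """Drop columns that are mostly empty, constant, or near-unique."""
    n = len(df)
    keep = []
    for col in df.columns:
        # Step 1: drop columns that are more than 95% empty
        if df[col].isna().mean() > empty_frac:
            continue
        # Step 2: drop constant columns and columns with
        # more than 95% unique values
        n_unique = df[col].nunique(dropna=True)
        if n_unique <= 1 or n_unique / n > unique_frac:
            continue
        keep.append(col)
    return df[keep]

df = pd.DataFrame({
    "id": range(100),                         # 100% unique -> dropped
    "constant": [1] * 100,                    # one value   -> dropped
    "mostly_empty": [None] * 97 + [1, 2, 3],  # 97% empty   -> dropped
    "useful": [0, 1] * 50,                    # kept
})
cleaned = preprocess(df)
```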
4 Method
In this thesis, an attempt is made to divide the set of features into subsets of important
and unimportant features. This is done using three different methods. The first method is
based on the WoE. The IV of every feature is determined, as explained in Subsection 2.1.
The selection of important features is made using a threshold: every feature with a total
IV higher than 0.05 is classified as important, while all other features are classified as
unimportant. The threshold was determined based on initial experimental exploration. The
first experiments were done using a threshold value of 0.02, which Table 1 classifies as
"unpredictable" learners. This threshold, however, classified too many features as important
that did not appear to be, since they could be removed without a significant drop in
predictive power. The next experiments used a threshold value of 0.1, which Table 1
classifies as "weak" learners, but this resulted in a significant drop in predictive power.
The threshold used in this thesis, 0.05, lies between these two values and resulted in the
highest number of removed features without a significant drop in predictive power.
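An illustrative sketch of this selection step follows; the binning and the additive smoothing constant `eps` are simplifications (the thesis computes the IV as described in Subsection 2.1), while the 0.05 threshold comes from the text.

```python
import numpy as np
import pandas as pd

def information_value(feature, target, eps=0.5):
    """Total IV of a categorical feature for a binary target.

    `eps` is additive smoothing to avoid log(0) when a category
    contains only one of the two classes.
    """
    iv = 0.0
    n_pos = (target == 1).sum()
    n_neg = (target == 0).sum()
    for category in feature.unique():
        mask = feature == category
        pos = ((target == 1) & mask).sum() + eps
        neg = ((target == 0) & mask).sum() + eps
        dist_pos = pos / n_pos
        dist_neg = neg / n_neg
        woe = np.log(dist_pos / dist_neg)   # Weight of Evidence of the bin
        iv += (dist_pos - dist_neg) * woe   # each IV contribution is >= 0
    return iv

def select_by_iv(df, target, threshold=0.05):
    """Split columns into important / unimportant by total IV."""
    important = [c for c in df.columns
                 if information_value(df[c], target) > threshold]
    unimportant = [c for c in df.columns if c not in important]
    return important, unimportant

rng = np.random.default_rng(0)
target = pd.Series(rng.integers(0, 2, 5000))
df = pd.DataFrame({
    "informative": target.map({0: "low", 1: "high"}),  # fully predictive
    "noise": pd.Series(rng.choice(["x", "y"], 5000)),  # independent of target
})
important, unimportant = select_by_iv(df, target)
# important == ["informative"], unimportant == ["noise"]
```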
The second method used is based on XGBoost13. XGBoost is used to determine the importance
of every feature using the method explained in Subsubsection 2.2.3. The selection of
important features is again made using a threshold: every feature with an importance value
below 0.01 is removed. The threshold was determined based on initial experimental
exploration. Initially, the same threshold value as for the WoE was used, but this resulted
in a drop in predictive power. After multiple experiments, a threshold value of 0.01 proved
the most reliable.
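The importance-based selection can be sketched as below. Note that scikit-learn's GradientBoostingClassifier is used here as a stand-in for XGBoost, since both follow the scikit-learn API and expose a `feature_importances_` attribute; only the 0.01 threshold comes from the text.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = (X[:, 0] > 0).astype(int)           # only feature 0 carries signal

model = GradientBoostingClassifier(random_state=0).fit(X, y)
importances = model.feature_importances_

threshold = 0.01                        # threshold from the text
important = [i for i, v in enumerate(importances) if v >= threshold]
removed = [i for i, v in enumerate(importances) if v < threshold]
```

On this synthetic data, the informative feature 0 dominates the importance scores, and features with near-zero importance land in `removed`.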
The final method used is a combination of the two methods mentioned above. First, a
selection is made using the IV based method. The XGBoost based method is then applied to
the data set containing only the selected features. Both WoE and XGBoost have potential
problems, as shown in Subsubsection 2.1.4 and Subsubsection 2.2.4 respectively. The
combination of the two methods would ideally utilize the advantages of both while disposing
of the disadvantages.
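The combined method is a simple composition of the two selections, sketched here with stand-in selector functions (the real selectors are the threshold-based methods described above):

```python
import pandas as pd

def combined_selection(df, target, iv_selector, booster_selector):
    """Combined method: filter by IV first, then apply the
    booster-based selection to the surviving features only.

    Both selectors return (important, unimportant) column lists.
    """
    iv_important, iv_unimportant = iv_selector(df, target)
    # The second method only sees the features the IV step kept
    final_important, booster_removed = booster_selector(df[iv_important], target)
    return final_important, iv_unimportant + booster_removed

# Illustrative stand-in selectors with hard-coded outcomes
def fake_iv(df, target):
    return ["a", "b"], ["c"]

def fake_booster(df, target):
    return ["a"], ["b"]

df = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})
important, unimportant = combined_selection(df, None, fake_iv, fake_booster)
# important == ["a"]; "c" was removed by IV, "b" by the booster step
```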
The evaluation is done using neural networks which are built using Tensorflow14 and
13 https://xgboost.readthedocs.io/en/latest/
14 https://www.tensorflow.org/
Keras15 with Python16 as the programming language. The neural networks used are of limited
complexity, consisting of an input layer, one hidden layer and an output layer. The size of
the input layer is based on the input size, the hidden layer has 128 units, and the output
layer has 2 units to predict binary values. The activation function of the hidden layer is
ReLU17 and the activation function of the output layer is Softmax18. The optimizer used for
training is the Adam optimizer19 with default parameter values. All neural networks are
trained for ten epochs with a 30% validation split.
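The network described above can be reconstructed in Keras as follows. This is an illustrative sketch, not the thesis code; in particular, the loss function is not stated in the text, so sparse categorical cross-entropy is assumed here.

```python
import tensorflow as tf

def build_model(input_size):
    """Input layer -> 128-unit ReLU hidden layer -> 2-unit softmax output."""
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(input_size,)),
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(2, activation="softmax"),
    ])
    # Adam with default parameters, as in the text; the loss function
    # is an assumption (see the lead-in above)
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

model = build_model(input_size=20)
# Training as described in the text:
# model.fit(X, y, epochs=10, validation_split=0.3)
```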
In total, seven different neural networks are trained: one using all features, which serves
as a reference, and, for each of the three reduction methods, one based on the important
features and one based on the unimportant features. The results of the experiments will
15 https://keras.io/
16 https://www.python.org/
17 https://www.tensorflow.org/api_docs/python/tf/nn/relu
18 https://www.tensorflow.org/api_docs/python/tf/nn/softmax
19 https://www.tensorflow.org/api_docs/python/tf/train/AdamOptimizer
consist of two parts. The first part is the reduction power of the split: how many features
are removed and how much this improves the speed. It is expected that the more features are
removed, the more the speed improves. The second part is the quality evaluation. Removing
features can only be positive when it does not have a significant negative impact on the
predictive power of the system. The quality is evaluated using the F-score and the bAcc. In
this thesis, a significant drop in predictive power is defined as a drop of more than 0.05
in either the F-score or the bAcc. All evaluations are made using SKlearn model evaluation20
in Python, except for the F-score, which is calculated using Equation 9. In an ideal result,
the predictive power of the network trained on the selected features would not be
significantly worse than that of the network trained on all the features, while the
predictive power of the network trained on the removed features would be.
5 Results
Below are the results gathered during testing. The predictive power is based on both the
balanced accuracy and the F-score. All the networks are trained for ten epochs and line
graphs of the accuracy over time can be found in Appendix B. Appendix B also shows bar
plots with the importance of each feature.
5.5 Analysis
The IV based method is able to reduce the number of features of all the data sets used in
this thesis. It does, however, perform significantly worse than both the XGBoost based and
the combined method on the weatherAUS data set. This is caused by the second potential
problem of the IV based system discussed in Subsubsection 2.1.4. In the weatherAUS data set,
only one feature is required to predict the target 100% correctly, which means that all
other features can be removed. This does not mean that those features are useless for
prediction, but that they add nothing to the information already present in the most
important feature. The features are thus classified as important by the IV based method
because they could be used for prediction by themselves, even though they contribute little
compared to the best feature.
The XGBoost based method can reduce the number of features of three of the four data sets
used in this thesis. It is, however, not able to reduce the number of features of the
Titanic data set. This is possibly caused by the first potential problem discussed in
Subsubsection 2.2.4. When comparing its list of feature importances to the one made by the
IV based method, it is notable that the most important features are the same. The problem
is that, while all the features classified as unimportant by the IV based method are also
the lowest-scoring features in the XGBoost based method, their values are too high to be
classified as unimportant. This problem could be resolved by increasing the threshold, but
that would also increase the risk of a significant drop in predictive power.
The combined method can reduce the number of features of all the data sets used in this
thesis. In all four data sets the combined method was either one of the best methods or the
best method to use. The results suggest that the combined method has the advantages of both
methods while removing the disadvantages. The ABN-AMRO OOT data set is the one data set
where the combined method proved better than either of the two methods alone. This is most
likely caused by the lower number of features present when executing XGBoost: fewer
unimportant features lower the impact of the second potential problem discussed in
Subsubsection 2.2.4. This is an excellent example of the advantage of using the combined
method. Noteworthy is the fact that in both the adult data set and the ABN-AMRO OOT data
set, either the precision decreases and the recall increases, or vice versa. No mention of
this effect was found in any of the literature used for this thesis.
6 Conclusion
Based on the evaluated results discussed in Subsection 5.5 it is possible to answer the
research questions proposed in Subsection 1.1. Before the main research question can be
answered, all four subquestions have to be answered.
The first subquestion is ”Can Weight of Evidence be used to reduce the number of
features of a data set without a significant drop in predictive power when solving a binary
classification problem?”. The results indicate that Weight of Evidence is indeed able to
reduce the number of features without a significant drop in predictive power. The WoE
based method resulted in a reduction of the number of features in every data set used in this
thesis. The reduction of the adult data set even resulted in a small increase in the balanced
accuracy score.
The second subquestion is ”Can XGBoost be used to reduce the number of features of a
data set without a significant drop in predictive power when solving a binary classification
problem?”. The results indicate that XGBoost is indeed able to reduce the number of
features without a significant drop in predictive power. The XGBoost based method resulted
in a reduction of the number of features in three of the four data sets used. The XGBoost
based method was, however, not able to reduce the number of features of the Titanic data set.
The third subquestion is ”Does Weight of Evidence or XGBoost perform better when
reducing the number of features of a data set without a significant drop in predictive power
when solving a binary classification problem?”. Neither of the two methods consistently
performs better than the other. In both the weatherAUS data set and the ABN-AMRO OOT data
set, the XGBoost based method performed better than the WoE based method. In both the
Titanic data set and the Adult data set, the WoE based method performed better than the
XGBoost based method. Noteworthy, however, is the fact that the XGBoost based method was
not able to reduce the Titanic data set, which might indicate that the WoE based method is
more reliable.
The fourth subquestion is ”Does a combination of Weight of Evidence and XGBoost
perform better when reducing the number of features of a data set without a significant
drop in predictive power when solving a binary classification problem than one of them
by themselves?”. The results show that in all four data sets tested in this thesis, the
combined method resulted in the highest number of removed features without a significant
drop in predictive power. Besides being more consistent than the methods by themselves,
the combined method also performs better in the ABN-AMRO OOT data set. The ABN-
AMRO OOT data set is an excellent example of the two methods using their advantages to
resolve their disadvantages.
The overall research question is ”Can a combination of Weight of Evidence and XG-
Boost be used to reduce the number of features of a data set without a significant drop in
predictive power when solving a binary classification problem?”. The results indicate that a
combination of WoE and XGBoost can reduce the number of features, and does this better and
more consistently than either of the two methods separately. There are no indications of
difficulties with complexity or size, which suggests that the method can be used for all
binary classification problems.
suggest this factor to be of significant impact since they also have a much smaller input size,
but still performed much worse than the whole data set.
The third remark concerns the thresholds chosen for both the XGBoost and the Weight of
Evidence methods. Both thresholds were chosen primarily because they provided the best
results during initial experimentation. While these thresholds seem to work well, better
parameters could possibly be found. A possible improvement to the system would be to choose
features based on the importance scores in a way that better suits the needs of the project.
Increasing the thresholds would remove more redundant features and thus leave only the most
important ones, but would cause a loss in predictive power. For some projects such a loss is
much less detrimental, because readability is the more important goal. Another method of
classifying features could be to choose the top X features. This method has the obvious
downside that it requires some knowledge of the data to choose the number of features
returned. Picking a wrong number of features could result in either removing important
features, because too few features are returned, or classifying unimportant features as
important, because too many are returned.
8 Acknowledgements
I would like to thank ABN-AMRO for the use of their time, data and expertise. I am
especially grateful to Monique Gerrits, Ronald van der Veen and Paul ten Kaate for their
time and support within ABN-AMRO. I would also like to thank all my fellow students for
their feedback and collaboration.
References
Breiman, L. (2001, Oct 01). Random forests. Machine Learning, 45 (1), 5–32. Retrieved
from https://doi.org/10.1023/A:1010933404324 doi: 10.1023/A:1010933404324
Chen, T., & Guestrin, C. (2016). Xgboost: A scalable tree boosting system. In Proceedings of
the 22nd acm sigkdd international conference on knowledge discovery and data mining
(pp. 785–794). New York, NY, USA: ACM. Retrieved from http://doi.acm.org/
10.1145/2939672.2939785 doi: 10.1145/2939672.2939785
Genuer, R., Poggi, J.-M., & Tuleau-Malot, C. (2010). Variable selection using random
forests. Pattern Recognition Letters, 31 (14), 2225 - 2236. Retrieved from http://
www.sciencedirect.com/science/article/pii/S0167865510000954 doi: https://
doi.org/10.1016/j.patrec.2010.03.014
Guoping, Z. (2014, 07). A necessary condition for a good binning algorithm in credit scoring.
Applied Mathematical Sciences, Vol. 8 , 3229-3242. doi: 10.12988/ams.2014.44300
Knigge, D. (2019). Event correlation and dependency-graph analysis to support root cause
analysis in ITSM environments (Bachelor's Thesis). University of Amsterdam.
Lawrence, S., Giles, C. L., & Tsoi, A. C. (1997). Lessons in neural network training:
Overfitting may be harder than expected.
Lin, A. Z., & Hsieh, T.-Y. (2014). Expanding the use of weight of evidence and information
value to continuous dependent variables for variable reduction and scorecard development.
SESUG, 2014.
Potdar, K., Pardawala, T., & Pai, C. (2017, 10). A comparative study of categorical variable
encoding techniques for neural network classifiers. International Journal of Computer
Applications, 175 , 7-9. doi: 10.5120/ijca2017915495
Riemersma, R. (2019). Predicting incident duration time. (Bachelor’s Thesis). University
of Amsterdam.
Rosset, S. (2004). Model selection via the AUC. In Proceedings of the twenty-first
international conference on machine learning (pp. 89–). New York, NY, USA: ACM. Retrieved
from http://doi.acm.org/10.1145/1015330.1015400 doi: 10.1145/1015330.1015400
McCulloch, W. S., & Pitts, W. (1943, Dec 01). A logical calculus of the ideas immanent in
nervous activity. The bulletin of mathematical biophysics, 5 (4), 115–133. Retrieved from
https://doi.org/10.1007/BF02478259 doi: 10.1007/BF02478259
Schapire, R. E. (2013). Explaining adaboost. In B. Schölkopf, Z. Luo, & V. Vovk (Eds.),
Empirical inference: Festschrift in honor of vladimir n. vapnik (pp. 37–52). Berlin,
Heidelberg: Springer Berlin Heidelberg. Retrieved from https://doi.org/10.1007/
978-3-642-41136-6 5 doi: 10.1007/978-3-642-41136-6 5
Schmidhuber, J. (2014). Deep learning in neural networks: An overview. CoRR,
abs/1404.7828 . Retrieved from http://arxiv.org/abs/1404.7828
Scott, D. (2001). Nsm: Often the weakest link in business availability. Gartner Group
AV-13-9472 .
Su, J., & Zhang, H. (2006). A fast decision tree learning algorithm. In Proceedings of the
21st national conference on artificial intelligence - volume 1 (pp. 500–505). AAAI
Press. Retrieved from http://dl.acm.org/citation.cfm?id=1597538.1597619
ten Kaate, P. (2018). Automatic detection, diagnosis and mitigation of incidents in multi-
system environments. (Bachelor’s Thesis). University of Amsterdam.
Velez, M. (2019). Predicting causal relations between itsm incidents and changes. (Bachelor’s
Thesis). University of Amsterdam.
Wang, C., Kavulya, S., Tan, J., Hu, L., Kutare, M., Kasick, M., . . . Gandhi, R. (2013,
11). Performance troubleshooting in data centers: an annotated bibliography? ACM
SIGOPS Operating Systems Review , 47 , 50-62. doi: 10.1145/2553070.2553079
Weed, D. L. (2005). Weight of evidence: A review of concept and methods. Risk Analysis,
Vol 25, No 6 .
Widrow, B., Rumelhart, D. E., & Lehr, M. A. (1994, March). Neural networks: Applications
in industry, business and science. Commun. ACM , 37 (3), 93–105. Retrieved from
http://doi.acm.org/10.1145/175247.175257 doi: 10.1145/175247.175257
Wiggerman, M. (2019). Predicting the first assignment group for a smooth incident resolution
process (Bachelor’s Thesis). University of Amsterdam.
Zadrozny, B. (2004, 09). Learning and evaluating classifiers under sample selection bias.
Proceedings, Twenty-First International Conference on Machine Learning, ICML 2004. doi:
10.1145/1015330.1015425
Appendices
A Data clarification
A.1 Simple data set - Titanic
In the Titanic data set, two rows were removed during preprocessing.
B Test Results
B.1 Titanic
B.1.1 Accuracy
[Line plots: validation accuracy over test time for the Important Features, Unimportant
Features and All Features networks; panel (c) shows the combined method.]
Figure (13) Validation accuracy over time when training the Titanic data set. Because no
features were removed by the tree-based method, it was not possible to test the non-selected
features.
B.1.2 Loss
[Line plots: validation loss over test time for the Important Features, Unimportant
Features and All Features networks when training the Titanic data set.]
B.2 Weather
B.2.1 Accuracy
[Line plots: validation accuracy over test time for the Important Features, Unimportant
Features and All Features networks; panel (c) shows the combined method.]
Figure (17) Validation accuracy over time when training the weatherAUS data set.
B.2.2 Loss
[Line plots: validation loss over test time for the Important Features, Unimportant
Features and All Features networks; panel (c) shows the combined method.]
Figure (19) Validation loss over time when training the weatherAUS data set.
B.3 Adult
B.3.1 Accuracy
[Line plots: validation accuracy over test time for the Important Features, Unimportant
Features and All Features networks; panel (c) shows the combined method.]
Figure (21) Validation accuracy over time when training the Adult data set.
B.3.2 Loss
[Line plots: validation loss over test time for the Important Features, Unimportant
Features and All Features networks; panel (c) shows the combined method.]
Figure (23) Validation loss over time when training the Adult data set.
B.4 ABN-AMRO
B.4.1 Accuracy
[Line plots: validation accuracy over test time for the Important Features, Unimportant
Features and All Features networks; panel (c) shows the combined method.]
Figure (25) Validation accuracy over time when training the ABN-AMRO data set.
B.4.2 Loss
[Line plots: validation loss over test time for the Important Features, Unimportant
Features and All Features networks when training the ABN-AMRO data set.]