Feature Reduction For Binary Classification Problems Using Weight of Evidence and Xgboost

Feature Reduction for Binary
Classification Problems using Weight of

Evidence and XGBoost
Dante Niewenhuis
11058595
Bachelor thesis
Credits: 18 EC
Bachelor Opleiding Kunstmatige Intelligentie
University of Amsterdam
Faculty of Science
Science Park 904
1098 XH Amsterdam
Supervisor
dr. Sander van Splunter
Informatics Institute
Faculty of Science
University of Amsterdam
Science Park 904
1098 XH Amsterdam
June 29th, 2019

Bachelor Artificial Intelligence
Abstract
In binary classification problems, several features present in a data set do not influ-
ence the prediction process. These features are redundant and not used, but do cause
the learning algorithm to be slower and to be more prone to overfitting. In this thesis,
an attempt is made to create a system that removes these redundant features from a
data set using a combination of Weight of Evidence and XGBoost. This system is eval-
uated using neural networks comparing both the balanced accuracy and the F-score.
This thesis is written in collaboration with ABN-AMRO, using their incident data set.
Aside from the ABN data set, three other data sets are evaluated to get a broader
understanding of the impact of the method used. All four data sets tested resulted in
a significant reduction of the number of features without a drop in predictive power.
One of the data sets resulted in a significant increase in both the balanced accuracy
as well as the F-score. Evaluating the results has shown that a combination of Weight
of Evidence and XGBoost gives more consistent and better results than one of the
methods by themselves.
Keywords— Feature Reduction, Weight of Evidence, Information Value, Neural Networks,

Readability.
Dante Niewenhuis page 1 of 49

Contents
1 Introduction 4
1.1 Research questions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.2 Hypotheses . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.3 Context: Blue Student Lab and ABN-AMRO . . . . . . . . . . . . . . . . . . 6
1.3.1 Related Thesis 1: Predicting resolution time . . . . . . . . . . . . . . . 6
1.3.2 Related Thesis 2: Predicting assignment group . . . . . . . . . . . . . 6
1.3.3 Related Thesis 3: Predicting caused by change . . . . . . . . . . . . . 7
1.3.4 Related Thesis 4: Clustering events and incidents . . . . . . . . . . . . 7
2 Background knowledge 8
2.1 Weight of Evidence and Information Value . . . . . . . . . . . . . . . . . . . . 8
2.1.1 Binning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.1.2 Calculating Weight of Evidence . . . . . . . . . . . . . . . . . . . . . . 8
2.1.3 Calculating Information Value . . . . . . . . . . . . . . . . . . . . . . 9
2.1.4 Potential problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.2 Decision Tree Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.2.1 Random Forest . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.2.2 AdaBoost . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.2.3 XGBoost . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.2.4 Potential problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.3 Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.3.1 Overfitting and speed . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.4 Encoding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.4.1 Ordinal Encoding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.4.2 One-Hot Encoding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.5 Evaluation Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
3 Data 19
3.1 Simple data set - Titanic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
3.2 Big data set - WeatherAUS . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
3.3 Complex data set - Adult . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
3.4 Domain data set - ABN-AMRO OOT . . . . . . . . . . . . . . . . . . . . . . 20
4 Method 20
5 Results 22
5.1 Simple data set - Titanic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
5.2 Big data set - WeatherAUS . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
5.3 Complex data set - Adult . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
5.4 Domain data set - ABN-AMRO OOT . . . . . . . . . . . . . . . . . . . . . . 24
5.5 Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
6 Conclusion 25

7 Discussion & Future Research 26

7.1 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
7.2 Future Research . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
7.3 Recomendations for ABN-AMRO . . . . . . . . . . . . . . . . . . . . . . . . . 27
8 Acknowledgements 28
References 28
Appendices 30
A Data clarification 30
A.1 Simple data set - Titanic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
A.2 Big data set - WeatherAUS . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
A.3 Complex data set - Adult . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
A.4 Domain data set - ABN-AMRO OOT . . . . . . . . . . . . . . . . . . . . . . 33
B Test Results 42
B.1 Titanic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
B.1.1 Accuracy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
B.1.2 Loss . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
B.2 Weather . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
B.2.1 Accuracy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
B.2.2 Loss . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
B.3 Adult . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
B.3.1 Accuracy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
B.3.2 Loss . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
B.4 ABN-AMRO . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
B.4.1 Accuracy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
B.4.2 Loss . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
Abbreviations
bAcc Balanced Accuracy
DTC Decision Tree Classification algorithm
IG Information Gain
IV Information Value
NN Neural Network
WoE Weight of Evidence
XGBoost Extreme Gradiant Boosting

1 Introduction
Large scale organizations rely on multiple hundreds of applications. For such applications,
lack of availability, reliability, or responsiveness can lead to extensive losses (Wang et al.,
2013). For example, customers being unable to place orders could cost Amazon up to $1.75
million per hour(Wang et al., 2013), which means that knowledge of software, hardware
and their incidents is vital. This thesis is written in collaboration with ABN-AMRO1 and
attempts to gain information from incident data. ABN-AMRO is a large organization based
in the Netherlands, that deals with a large number of applications in many different fields
of operation, ranging from online banking to internal communication systems. Having this
many different systems working together creates many possible problems which need to be
solved as quickly as possible. When an incident is reported, it gets assigned a priority rating
as well as a time of completion. If this time is not met, it will result in an out of time
incident (OOT). Reducing the number of OOTs is a big priority for ABN-AMRO.
In 2018 ten Kaate attempted to create a system capable of predicting if an incident
would go out of time based on the first documentation (ten Kaate, 2018). This was achieved
by using a multi-layered neural network and resulted in an accuracy of 0.7679, but only
a precision of 0.2169(ten Kaate, 2018). Neural networks have the positive characteristic
that most data problems can be predicted quite accurately without much added knowledge.
Neural networks, however, have problems with readability: It is hard to know what the more
important features are, or why certain data sets are less complicated to predict than others.
This makes neural networks very effective when only predictions are needed, but insufficient
when looking for insight into the solution. Knowing the reasons why incidents are predicted
to be out of time could help ABN-AMRO reduce the number of incidents rather than predict
them.
In this thesis, an attempt is made to expand on the project by ten Kaate by making a
system that removes features that are redundant when trying to make predictions. Besides
readability, there are more advantages when reducing features. The first obvious improve-
ment is the speed of the algorithm. Regardless of the kind of algorithm used, more features
are almost always equal to slower execution. Removing redundant data will, therefore, al-
ways have a positive impact on the speed. The second advantage is the lower possibility of
overfitting. Overfitting is the process where the algorithm is not finding patterns that could
help with predicting but is just memorizing the data. Many factors can cause overfitting
and features that do not add new information about predicting is one of them. Reducing
the features in a data set could lower the possibility of overfitting and thereby improve
predictive power.
The system proposed in this thesis is a combination of Weight of Evidence2 (WoE) and
Extreme gradient boosting3 (XGBoost). WoE is a measure of how much a feature supports
or undermines a hypothesis. WoE is ideally used when dealing with binary problems but
can be modified to work on classifying problems with more than two possible categories.
WoE is further explained in Subsection 2.1. XGBoost is a tree boosting algorithm. Tree
boosting algorithms use multiple weak learners and combine them to create a strong learner.
An advantage of XGBoost and other boosting algorithms is the readability. XGBoost is an
1 hhtps://www.abnamro.nl
2 https://www.listendata.com/2015/03/weight-of-evidence-woe-and-information.html
3 https://xgboost.readthedocs.io/en/latest/

ideal algorithm to determine the importance of features in a data set. XGBoost is further
explained in Subsubsection 2.2.3.
Removing features has many advantages, but those advantages are worthless if the re-
moval causes a significant drop in predictive power. This is why evaluation focuses primarily
on the impact of the removal on the predictive power. Evaluation of the system is done using
neural networks. For the evaluation, three neural networks are trained: one trained on all
features as a reference, one trained on the important features, and one on the unimportant
features. The three networks are compared based on predictive power. Predictive power is
based on both the balanced accuracy (bAcc) and the F-score. In this thesis, a significant
drop in predictive power is defined as a drop of more than 0.05 in either F-score or bAcc.
All evaluation metrics used in this thesis are explained in Subsection 2.5. In an ideal result,
the network trained on the important features would have no significant drop in predictive
power compared to the reference network, while the network trained on the unimportant
features would have a significant drop in predictive power.
1.1 Research questions

Given the goals stated in Section 1 the research question 1 is determined. The research
question is divided into four subquestions. Subquestions 1.1 and 1.2 determine if the meth-
ods used in this thesis are capable of reducing the number of features of a data set. This
is important to know when explaining why the combination of the two is effective or non-
effective. These subquestions can also reveal potential problems that the two methods have,
which will be very usefull information when discussing the viability of the combination.
Subquestion 1.3 is used to determine which of the two methods is more effective. If one of
the methods is more effective, it will be used as the reference when answering Subquestion
1.4. Subquestion 1.4 is the most important to answer because even if it is possible to reduce
the amount of features using a combination of WoE and XGBoost, it is only beneficial if it
is more effective than both methods separately. If one of the features is as effective as the
two combined, it is a waste of time to use the combined method.
Can a combination of Weight of Evidence and XGBoost be used to

reduce the number of features of a data set without a significant drop in (1)
predictive power when solving a binary classification problem?
Can Weight of Evidence be used to reduce the number of features of a

data set without a significant drop in predictive power when solving a (1.1)
binary classification problem?
Can XGBoost be used to reduce the number of features of a data set
without a significant drop in predictive power when solving a binary (1.2)
classification problem?
Does Weight of Evidence or XGBoost perform better when reducing the
number of features of a data set without a significant drop in predictive (1.3)
power when solving a binary classification problem?
Does a combination of Weight of Evidence and XGBoost perform better
when reducing the number of features of a data set without a significant
(1.4)
drop in predictive power when solving a binary classification problem
than one of them by themselves?

1.2 Hypotheses
The first subquestion is expected to succeed since the method used is based on a research on
variable reduction using Weight of Evidence(Lin & Hsieh, 2014). The second subquestion is
also expected to succeed based on papers written on the possibilities of using random forest
algorithms for feature reduction(Genuer, Poggi, & Tuleau-Malot, 2010). XGBoost is a type
of a random forest algorithm and is expected also to be capable of feature reduction. The
third and fourth subquestions are much harder to predict. The combination of the WoE and
XGBoost would ideally combine the positives of both methods and produce better and more
reliable results. The questions stated above are answered using the ABN-AMRO incident
data sets, as well as three extra data sets. The three extra data sets are chosen based on
size and difficulty to predict. This ensures that this thesis provides a broader overview of
the reliability of the methods used.
1.3 Context: Blue Student Lab and ABN-AMRO

This thesis is written in the context of the Blue Student lab. The Blue Student Lab is a col-
laboration between the University of Amsterdam and large organizations in which bachelor
students get the opportunity to write their thesis using the data from those organizations.
This project started in 2018. Since then there has only been a collaboration with ABN-
AMRO, but it will be expanded in the future. In 2019 there are ten students working
together with ABN-AMRO divided into two groups: blockchain and incident management.
This thesis is part of the incident management group, and thus, this section will provide an
introduction to other projects in the management group.
1.3.1 Related Thesis 1: Predicting resolution time
The first thesis in the OOT group is written by Riemersma (2019). In her thesis Riemersma
attempts to expand on the system of ten Kaate by not only predicting if an incident will be
out of time but also how much time. Solving incidents in time is a complex task that can
be optimized in several different ways. One aspect that may help this process is knowing
the resolution time of an incident beforehand.
1.3.2 Related Thesis 2: Predicting assignment group
The second thesis in the OOT group is written by Wiggerman (2019). Wiggerman attempts
to reduce OOT incidents by assigning incidents directly to the right assignment group.
When an incident is noticed, it is assigned to an assignment group. If this assignment group
is unable to solve the incident, it will be passed through to another. This process will
continue until the incident is solved. The problem is that every assignment group needs to
repeat many steps of the solving process, which means that time is spent very inefficiently.
It is thereby no surprise that incidents with a high number of different assignment groups
are more likely to take too long to solve. Wiggerman attempts to resolve this process using
neural networks and k-nearest neighbour clustering algorithms to predict the best assignment
group for a given incident.

1.3.3 Related Thesis 3: Predicting caused by change
The third thesis in the OOT group is written by Velez (2019). In his thesis, Velez is creating
a theoretical model that could predict if an incident is caused by a change. In large-scale
software organizations, up to 80% of the incidents are caused by previous changes made
(Scott, 2001). Having a system that could predict the change that caused an incident
would be beneficial when trying to solve software incidents and prevent further ones from
occurring. Velez attempts to predict if an incident is caused by a change using PU learning
4
. PU learning is a niche machine learning technique which uses a combination of machine
learning algorithms and a special sampling method to handle incorrectly labelled data.
1.3.4 Related Thesis 4: Clustering events and incidents
The fourth thesis in the OOT group is written by Knigge (2019). At ABN-AMRO there
are besides incidents also events. Events are incidents that are detected and registered by
automatic systems within the organization. An example of an event is a bot that tries to
log into the system every few minutes and creates an event every time it would not be able
to. Because events are created automatically, there is a tendency to create many events
for the same problem. This can be overwhelming for teams solving incidents, and thus,
many of these events are mostly ignored. In his thesis Knigge looks at the possibilities
of clustering these events so it would be easier to recognize new events and filter out the
duplicates. Knigge also tries to connect the events to an incident. In the example given
above, this would mean that when an incident is created because a customer could not log
in, this incident would be connected to the events created by the bot.
4 https://www.cs.uic.edu/
~liub/NSF/PSC-IIS-0307239.html

2 Background knowledge
2.1 Weight of Evidence and Information Value
Weight of Evidence (WoE) is a topic that has appeared in scientific literature for at least
the last 50 years(Weed, 2005). It has mostly been used as a method of risk-assessment but
can also be used for segmentation, variable reduction and various other things. In this thesis
WoE is used for variable reduction using a method that is primarily based on a paper by
Lin and Hsieh (2014). Lin and Hsieh uses WoE to asses the predictive power of a feature by
separating the data into multiple bins, and calculating the differences between the proportion
of events in the bin compared to rest of the data. The bigger the discrepancy, the higher the
WoE. In this thesis, an event means that the target value is true while a non-event means
the target value is false. The target is the feature that the algorithm tries to predict. For
example, in the OOT data set, the goal is to predict if an incident is going to be OOT. This
means that the target is the feature OOT, an event is when the incident is OOT, and a
non-event is when the incident is not OOT.
2.1.1 Binning
The method used in this thesis consists of four steps. The initial step is to separate the
feature into bins. In a paper about WoE Guoping states that three rules should be followed
while binning a data set for WoE (Guoping, 2014). The first rule states that each bin should
have at least 5% of the observations. This is done to prevent the final score from being
determined by a small fraction of the data. The second rule states that the missing values
have to be binned into a separate bin. The third rule states that every bin should have at
least one event and one non-event. The third rule of binning has not been followed in this
thesis because the used data did not always allow for it. The problems that are caused by
a bin with either no events or no non-events are solved using an adjusted WoE equation,
which is explained in Subsubsection 2.1.2. In this thesis, the data is divided into nine bins
plus one bin for missing data. The nine bins for the values are made as similar in size as
possible. If the feature has less than nine unique values, the number of bins is equal to
the number of unique values. The bins are made using the cut function from the Pandas5
package in Python.
2.1.2 Calculating Weight of Evidence
The second step is to calculate the WoE for every bin. The equation to calculate the WoE
is as follows:
%Events
W oE = ln( ) (1)
%nonEvents
The WoE is calculated using the percentage of both events and non-events. Note that the
percentage of the events does not mean the percentage of the observations in the bin that
are events, but the percentage of events compared to the total number of events in the data
set. The WoE is positive when the percentage of events is higher than the percentage of non-
events and grows when the discrepancy grows. The WoE is negative when the percentage of
5 https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.cut.html

events is lower than the percentage of non-events and decreases when the discrepancy grows.
The WoE is zero when the percentage of events is equal to the percentage of non-events.
This equation for the WoE works for most cases but does assume that every bin has at
least one observation that is an event and at least one that is a non-event. This is caused
by the fact that dividing by zero as well as taking the natural log of zero is mathematically
not possible. As stated in Subsubsection 2.1.1, this thesis does not use the third rule of the
paper from Guoping, and because the system created in this thesis should work for many
different data sets, it is not possible to guarantee that bins with either zero events or zero
non-events are not present. To accommodate for all data, an adjusted equation for WoE is
used. The adjusted equation is as follows:
%nonEvents + 0.5 %Events + 0.5

adjW oE = ln( / ) (2)
%nonEvents %Events
The adjusted equation for the weight of evidence can handle zero events or zero non-
events by adding a small value to both.
2.1.3 Calculating Information Value
The third step is to calculate the Information Value (IV) for all bins. The WoE is the degree
of difference between the ratio of events in a single bin and the whole feature. However, to
state something about the predictive power of a feature, the IV is needed. The equation of
the IV is as follows:
Table (1) Predictive Power of

a feature based on the total In-
IV = W eO ∗ (%Events − %nonEvents) (3) formation Value(Lin & Hsieh,
2014).
The IV is calculated by multiplying the WoE and the Information Predictive
difference between the percentage of events and the per-
Value Power
centage of non-events. The IV of a bin is always positive
because the WoE is always negative if the percentage < 0.02 Unpredictive
of non-events is higher than the percentage of events. 0.02 to 0.1 Weak
The IV gets exponentially higher when the discrepancy
0.1 to 0.3 Medium
between the percentage of events and the percentage of
0.3 to 0.5 Strong
non-events increases. The IV of a feature is calculated by
taking the sum of the IV of all bins and can be used to > 0.5 Suspicious
determine the predictive power of the feature. Figure 1
shows two examples of IV being calculated. Figure 1a shows a feature with a very low total
IV while Figure 1b shows a feature with a high total IV. In a paper about WoE and IV Lin
and Hsieh states that the predictive power of a feature can be determined using Table 1
(Lin & Hsieh, 2014). Using this table, the feature is shown in Figure 1a would be considered
unpredictable, and the feature shown in Figure 1b would be considered strong.
2.1.4 Potential problems
Even though WoE has many advantages, it has two potential problems. The first potential
problem is that WoE is dependent on the quality of the binning. Because WoE is calculated

based on the difference between the bins and the total data, the method of binning can have
consequences for the results.
The second potential problem is the fact that WoE is purely based on the feature by itself.
This could potentially cause a problem when a feature can be important for the prediction
of a subset of the data but not necessarily for the prediction of all the data because the
WoE is calculated for the whole data set, it could be classified as unimportant while being
important.
(a) Weight of Evidence table for feature A. Feature A has low predictive power
Range bins nonE E % nonE %E % E - %nonE WoE IV
0-20 1 198 97 19.7 19.5 -0.2 -0.0002 0
20-40 2 204 105 20.3 21.1 0.8 0.0009 0.0007
40-60 3 197 98 19.7 19.6 -0.1 -0.0002 0
60+ 4 196 102 19.6 20.5 0.8 0.001 0.001
Missing 5 207 98 20.7 19.7 -1.0 -0.0012 0.0012
1002 498 0.0029
(b) Weight of Evidence table for feature B. Feature B has high predictive power
Range bins nonE E % nonE %E % E - %nonE WoE IV
0-20 1 250 80 23.7 14.6 -9.1 -0.013 0.117
20-40 2 180 140 17.1 25.5 8.4 0.0094 0.080
40-60 3 250 80 23.7 14.6 -9.2 -0.013 0.117
60+ 4 196 120 18.6 21.8 3.2 0.0039 0.012
Missing 5 180 130 17.1 23.6 6.5 0.0079 0.05
1056 550 0.376
Figure (1) Weight of Evidence tables for two different features. The low total IV score of
the feature A suggests low predictive power while the high total IV score of feature B suggests
a high predictive power. (Note that events was abbreviated to E.)
2.2 Decision Tree Algorithm

Decision Tree Classification algorithms(DTC) are among the most used learning algorithms.
DTCs have many advantages; they are, for example, straightforward to use (Su & Zhang,
2006). This simplicity is caused by the fact that DTCs do not require many parameters,
and can deal with many different data types effectively. Another advantage of DTCs is
readability. DTCs are easy to understand because they work similar to how we would make
decisions ourselves. DTCs can be used for various types of categorical problems, but in this
thesis, we will only discuss binary classification problems, which means the answer is either
true or false.
DTCs work by splitting the data into subsets based on feature values. The best split
is determined using Information Gain(IG). The IG of a split is the entropy before the split
minus the entropy after the split. The higher the IG, the better the split. The entropy of a

data set is calculated using the following equation:

c
X
E= −pi log(pi ) (4)
i=0
In this equation, pi means the fraction of the total observation is part of category i.
When dealing with binary problems, there are only two possible categories, True or False.
This simplifies the equation into:
E = (−pt ∗ log(pt )) + (−pf ∗ log(pf )) (5)
The equation for the entropy functions like a parabola that has its peak at 0.5 with a
value of 1.0 and has a value of 0.0 if either everything is true or everything is false. After
being split into subsets, the entropy of the data set is calculated using the weighted average
of the subsets. The equation to calculate this weighted average is as follows:
n
X
E(S) = Pi ∗ E(i) (6)
i=1
In this equation, E(S) is the entropy of the whole data set while E(i) is the entropy of
a subset of data. Pi is the fraction of data that is part of subset i, which means that the
entropy of the larger subsets is weighted more.
Figure 2 shows an example of an effec-
tive split made. The data set consists of
11 observations where five observations are
red stars, and six observations are blue di-
amonds. The best prediction that could be
made from this initial data set would be
to predict all observations to be diamonds,
which would result in only 55% of the pre-
dictions being correct. The difficulty of
prediction is also be shown by the high en-
tropy value of 0.69.
The data set is split into two subsets,
one consisting of all the observation with
feature X larger than 30 and one consisting
of the remaining observations. The entropy
of the two subsets is lower than the entropy
of the root by having a value of 0.45 and
0.5, respectively. Calculating the entropy
of the data set after the split is calculated
using the weighted average and results in
Figure (2) Example of a simple decision tree
0.47. The IG of the split has a value of
0.22, indicating the split is effective.
The example given in Figure 2 is of a simple tree consisting of only one split, while
in reality, many more splits are needed to predict complex data sets correctly. It is not
uncommon that trees grow to many hundreds of splits. When using DTCs it is advised to
limit the number of splits to prevent overfitting.

2.2.1 Random Forest
Even though DTCs can be good classifiers and offer great readability, there are classification
problems that are very hard to solve using normal DTCs. One of the methods to improve
the predictive power of the tree algorithms is by extending it into a random forest algorithm.
Random forest algorithms function by creating a high number of simple DTCs that are all
trained on subsets of the data set. These small DTCs are called weak learners because they
have low predictive power by themselves. When a random forest algorithm wants to make
a prediction, all weak learners make a prediction. The predictions from the weak learners
are evaluated, and the most common prediction is chosen as the final prediction. Results
from research done by Breiman show that random forest algorithms are more reliable and
accurate when compared to algorithms that are based on a single tree (Breiman, 2001).
2.2.2 AdaBoost
AdaBoost is one of the most popular implementations of random forest algorithms. Ad-
aBoost uses boosting to create and evaluate the high number of weak learners made for
random forest algorithms. AdaBoost is the first practical boosting algorithm and is still one
of the most widely used (Schapire, 2013). The first step of AdaBoost is to create a weak
learner similar to the one shown in Figure 2 based on the full data set. Note that every weak
learner used by AdaBoost consists of a single split, these are also called stumps. A subset
of the data is created, which consists primarily of the observations that are not predicted
correctly by the first tree. A second tree is created based on this new subset. This process
will repeat until either the desired number of weak learners are created, or the predictions
made by the algorithm have reached the desired accuracy. In Adaboost not all weak learners
are weighted equally when predicting but are assigned weights which determine how much
they influence the prediction. The weight of a weak learner is determined by the fraction of
data it correctly predicts.
2.2.3 XGBoost
In this thesis, Extreme Gradient Boosting (XGBoost) is used instead of AdaBoost. XGBoost
is similar to AdaBoost but has certain advantages which make it more suitable to use.
The first reason to use XGBoost is the optimization for the use of sparse data. XGBoost
has shown to run 50 times faster on sparse data than naive boosting algorithms (Chen
& Guestrin, 2016). Effective functionality when dealing with sparse data is vital given
the amount of sparse data used in this project. Benchmarks made comparing different
types of boosting algorithms6 show that XGBoost is among the fastest and most accurate
boosting algorithms. XGBoost has proven to be very successful and widely used in many
programming competitions. An example of this is the KDDCup 2015, where all top-10
finishers used XGBoost(Chen & Guestrin, 2016).
This thesis uses XGBoost to determine the importance of each feature. First, a model
is trained using XGBoost. When using the Python version of XGBoost7 , it is possible
to get a list of feature importance. From this list, the most important features can be
selected. Determining which features need to be selected can be done using various methods,
6 http://datascience.la/benchMarcing-random-forest-implementations/
7 https://xgboost.readthedocs.io/en/latest/python/python intro.html

but in this thesis, a simple threshold is used. If the feature has higher importance than
the threshold, it is selected, and otherwise, it is removed. Increasing this threshold will
decrease the number of selected feature but will increase the possibility of a significant drop
in performance.
2.2.4 Potential problems
Even though XGBoost has many advantages, it has two potential problems. The first poten-
tial problem is that XGBoost is a greedy algorithm, which means that XGBoost generates
its splits using heuristics rather than processing the whole dataset. This could result in XG-
Boost, making locally optimal choices but not always globally optimal choices. This could
impact the importance of value given to a feature.
The second potential problem is the approach XGBoost has towards dealing with fea-
tures containing similar information. If two features contain similar information, by being
correlated, XGBoost would only need one of the two features for predicting. This means
that one of the two features would get a very low importance rating, even though it is as
important as the other feature.
2.3 Neural Networks

Neural networks (NN) are among the most used learning algorithms when solving classifi-
cation problems. NNs are used for classification problems like credit card fraud detection,
cursive handwriting recognition and cancer screening, to name a few (Widrow, Rumelhart,
& Lehr, 1994). Standard NNs consist of many simple, connected processors called neurons,
each producing a sequence of real-valued activations (Schmidhuber, 2014). Input neurons
are activated through sensors perceiving the environment, and other neurons get activated
through weighted connections from previously active neurons(Schmidhuber, 2014). The sim-
plest forms of NNs have been around for over 50 years, with the first form of neural networks
proposed in 1943 (S. & Pitts, 1943). It was not yet able to learn but was dependant on
static parameters given by the user. Nowadays, neural networks can train on data using
either supervised or unsupervised methods.
2.3.1 Overfitting and speed
Many AI machine learning models are prone to overfitting. Overfitting is the phenomenon
where instead of finding patterns in the data, the algorithm starts to memorize the data.
An example of overfitting is shown in Figure 3. In this figure, two algorithms attempt to
predict the value of feature Y based on feature X. Figure 3a shows a line that predicts the
value of feature Y while not being too complex. Figure 3b shows a line that predicts the
value of feature Y with a very complex line. While Figure 3b is much more accurate when
predicting the training data, it is much worse when predicting the validation data.
NNs have the advantage that they can learn very complicated relationships between
inputs and outputs. NNs are however very susceptible to overfitting (Lawrence, Giles, &
Tsoi, 1997). Lawrence et al. state that one reason for overfitting is the high number of
weights present. A reduction of the number of features in a data set (Lawrence et al.,
1997), using the methods proposed in this thesis, reduces the number of weights and could

thereby reduce the possibility of overfitting. Another advantage of feature reduction is the
improvement in learning speed. The number of calculations done by a NN is based on
the number of weights; if this number is reduced it will automatically increase the training
speed.
(a) Example of an algorithm that is not overfitting.
(b) Example of an algorithm that is overfitting.
Figure (3) Example of the difference between an algorithm that is overfitting and one that
is not.
2.4 Encoding
Many algorithms have difficulties when dealing with categorical data. These difficulties
are caused by the fact that most algorithms function using numeric data. To resolve this
problem, categorical data is processed using an encoder. There are many different methods
of encoding data, but only two are used in this thesis.
2.4.1 Ordinal Encoding
The first method used is ordinal encoding, which means that categorical data is replaced
by numeric values. An example of ordinal encoding is shown in Figure 4. This method

is sometimes also called numeric or integer encoding. In this thesis ordinal encoding is
executed using sklearn8 . Ordinal encoding has the advantage that it is easy to execute and
is very space-efficient given that it is one of the only methods of encoding that does not add
new columns to the data.
While ordinal encoding has many advantages, it also has some problems. One problem
with ordinal encoding is that it implies a relationship between categories that might not
be present. In the example given the encoded data could imply that Rotterdam is twice
Amsterdam, and London is even higher even though this is not the case. Another prob-
lem with ordinal encoding is that not all types of algorithms can work with it optimally.
When researching the impact of encoding data on the performance of a neural network,
ordinal encoding was shown to be the worst-performing method of encoding tested (Potdar,
Pardawala, & Pai, 2017).
City City
0 Amsterdam 0 1
1 Rotterdam 1 2
2 Amsterdam 2 1
3 Rotterdam 3 2
4 London 4 3
Figure (4) Example of column encoded using ordinal encoding
2.4.2 One-Hot Encoding
The second method encoding used in this thesis is one-hot encoding. one-hot encoding is one
of the most used encoding methods because it requires no knowledge of the data and works
very well with neural networks. one-hot encoding creates a separate column for every unique
category in a column. The values in the new columns consist only of the values 1 and 0,
stating if the observation is a part of the category or not. In Figure 5 an example of one-hot
encoding is shown. In the example given in Figure 5 the column City turns into separate
columns for Amsterdam, Rotterdam and London respectively. The reason NNs work so well
using One-hot encoding is that it can assign different weights to all the categories separately.
While one-hot encoding has many advantages, it also has some problems. The primary
problem with one-hot encoding is space efficiency. One-hot encoding creates a new column
for every unique category, which can lead to a very large data set, especially when the
number of categories increases. The space efficiency can be increased when using sparse
matrices9 , but is still not ideal. To reduce the number of columns created, a method from
ten Kaate’s thesis is used whereby all categories present less than 5 times are placed together
8 https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OrdinalEncoder
.html#sklearn.preprocessing.OrdinalEncoder
9 https://docs.scipy.org/doc/scipy/reference/sparse.html

into the category ”uncommon” (ten Kaate, 2018).

Another problem with One-hot encoding is that it completely removes information that
was present in the original data set because it splits columns into multiple columns. When
looking at the encoded data it is not possible to determine which columns were based on
the same original column. This can have a negative impact on performance when using
algorithms that use this kind of information, like DTCs.
City City A City R City L

0 Amsterdam 0 1 0 0
1 Rotterdam 1 0 1 0
2 Amsterdam 2 1 0 0
3 Rotterdam 3 0 1 0
4 London 4 0 0 1
Figure (5) Example of column encoded using one-hot encoding
2.5 Evaluation Metrics

Evaluation is one of the most important stages of research. Knowing the quality of the
results and their implications is vital when discussing the success of the used method. In
this thesis, the success is based on the number of features removed as well as the predictive
power. The predictive power is a combination of the F-score and the balanced accuracy
(bAcc). Precision and recall are also shown in the results of this thesis but will not be used
during the evaluation of the results.
Precision, recall and accuracy make use of true positives(tp), true negatives(tn), false
positives(fp) and false negatives(fn). A prediction is true positive when correctly predicted
positive, is true negative when correctly predicted negative, is false positive when incorrectly
predicted positive and is false negative when incorrectly predicted negative.

P
tp
P recision = P P (7)
tp + f p
P
tp
Recall = P P (8)
tp + f n
P recision ∗ Recall
F − score = 2 ∗ (9)
P recision + Recall
P
tn
T rueN egativeRate = P P (10)
tn + f n
P P
tp + tn
Accuracy = P P P P (11)
tp + f p + tn + f n
Recall + T rueN egativeRate
BalancedAccuracy = (12)
2
Figure (6) equations used for evaluation results, T rueN egativeRate and Accuracy are not
directly used in this research but are give as context.
The precision of a model is the number of correct positive predictions in the collection
of all positive predictions. When an algorithm is optimized on precision, it will cause the
algorithm to predict significantly more true positives than false positives. This can be very
useful when developing a system where false positives are a big problem. An example could
be an automatic fine system where an algorithm would determine an offence and would
automatically hand out fines without any human interference. The downside of precision is
that it does not take into account the number of false-negative predictions. This means that
these algorithms tend to predict positive less often because they need to be entirely sure to
do so. This, however, means that these algorithms can get a high number of false negatives.
The recall of a model is the number of correct positive predictions in the collection
of all positives. When an algorithm is optimized on recall, it will cause the algorithm to
predict significantly more true negatives than false negatives. This could be very useful when
developing a system where false positives are not a big problem. An example of this would
be an algorithm that would be used when filtering data before a human looks at them to
determine the further process. In this case, false positives would not be a problem because
the human evaluation could remove those, but the filtering is still successful because it could
save the observer a significant amount of work. The downside of these algorithms is that
they do not take into account the number of false-positive predictions. This means that
these algorithms tend to predict positive much more often because it increases the chance
of predicting all the positives correctly.
The F-score combines the precision and the recall. The F-score operates like a mean when
the precision and recall are close together but punishes disparity between the precision and
recall.
Despite both recall and precision being capable of evaluating results, they both dismiss
the importance of the true negative predictions. Accuracy incorporates the true negative
prediction and could thereby give a better evaluation of the results. Accuracy does, however,
have a big downside, and that is working with unbalanced prediction targets. A prediction

target is unbalanced when one class is much more present than the other. An example of an
unbalanced target is in the OOT data set used in this thesis. The OOT data set consists of
a much higher number of incidents that are solved within the time limit than incidents that
are not. In Figure 7 an example of a problematic situation is described. In this example,
both the precision and the recall are extremely low while the accuracy is very high. This
problem is caused by the fact that, without a normalizing factor, the dominant class has
much more impact on accuracy. This problem can be resolved easily by using the balanced
accuracy (bAcc). bAcc works similarly to normal accuracy but neutralizes the prediction.
Because not all data used in this thesis is balanced, it is vital that the used evaluation
methods are capable of working with unbalanced data and thus the bAcc is used.
totalpositives(p) = 50, totalnegatives(n) = 950
tp = 1, f p = 9
tn = 941, f n = 49
1 1
P recision = = 0.02, Recall = = 0.1
50 10
1 + 941
Accuracy = = 0.942
1000
941
T rueN egativeRate = = 0.99
950
0.99 + 0.1
BalancedAccuracy = = 0.5
2
Figure (7) Example of the problem that can occur when using accuracy while working with
unbalanced data

3 Data
In this project, four data sets are used to ensure the methods tested are effective on different
data sets. The data sets vary in size, the number of features and the ease to predict. All
data sets are preprocessed in two steps. The first step is removing all columns that consist of
more than 95% of empty rows. The second step is removing all columns that either consist
of only one unique value or consist of more than 95% unique values. The two methods of
preprocessing are used to reduce overfitting. A more elaborate overview of the data sets can
be found in Appendix A.
3.1 Simple data set - Titanic

The first data set used is the titanic data set10 taken from a Kaggle basic machine learning
exercise. In this exercise, the user is asked to predict if a person would have survived based
on features like sex, age or Cabin. This data set is chosen because of the small size and
the expected ease to solve, given that it is a basic exercise. The titanic data set shows if
the methods used are capable of reducing the number of features of a simple data set. The
titanic data set consists of 891 rows and 11 columns. Two columns are removed during the
preprocessing phase, zero due to missing values and two due to unique values. This leaves
nine usable columns. A more in-depth breakdown of the data can be found in Appendix
A.1.
3.2 Big data set - WeatherAUS

The second data set used is the Australian weather data set11 , which was also taken from
Kaggle machine learning exercises. The goal of this data set was to predict if there would
be any rainfall the following day based on various weather features of the current day. This
data set is chosen because it has a much larger number of rows compared to the titanic data
set. The weatherAUS data set shows if the size of a data set influences the effectiveness of
the method used. The weatherAUS data set consists of 142193 rows and 24 columns. No
columns are removed during the preprocessing phase leaving 24 usable columns. A more
in-depth breakdown of the data can be found in Appendix A.2.
3.3 Complex data set - Adult

The third data set used is the adult data set12 . This is a data set that consists of information
about various adults living in America, created in 1996. The goal of this data set is to predict
if one earns more or less than 50k a year. This data set is chosen because it is expected to
be more complex than both the Titanic or the weatherAUS data set. The adult data set
shows if the effectiveness of the methods used is influenced by the difficulty of a problem.
The data set consists of 48,842 rows and falls, thereby right between the two previous data
sets when looking at size. This data set is used many times in different scientific papers and
is commonly known as the ”Census Income” data set (Zadrozny & Bianca, 2004)(Rosset &
Saharon, 2004). The Titanic data set consists of 32,561 rows and 15 columns. No columns
are removed during the preprocessing phase leaving 15 usable columns. A more in-depth
breakdown of the data can be found in Appendix A.3.
10 https://www.kaggle.com/c/titanic
11 https://www.kaggle.com/jsphyg/weather-dataset-rattle-package
12 https://archive.ics.uci.edu/ml/datasets/Adult

3.4 Domain data set - ABN-AMRO OOT

The fourth data set used is the ABN-AMRO OOT data set. This is a data set of the first
documentation of incidents in the ABN-AMRO system. The goal of this data set is to
predict if an incident will be out of time. It is hard to predict how the ABN-AMRO OOT
data set compares to the other three data sets because not much prior research is done on
it. It is expected that the complexity of the problem is comparable to the Adult data set,
given the results achieved by ten Kaate (2018). The ABN-AMRO OOT data set shows if
the effectiveness of the methods used is influenced by the number of features in the data
set. The ABN-AMRO data set consists of 55,583 rows and 269 columns. 186 columns are
removed during the preprocessing phase, 166 due to missing values and 20 due to unique
values. This leaves 83 usable columns. A more in-depth breakdown of the data can be found
in Appendix A.4.
4 Method
In this thesis, an attempt is made to divide the set of features into a subset of important and
unimportant features. This is done using three different methods. The first method used is
based on the WoE. The IV of every feature is determined, as explained in Subsection 2.1.
The selection of important features is made using a threshold. Every feature with a total IV
higher than 0.05 is classified important while all other features are classified unimportant.
The threshold used was determined based on initial experimental exploration. The first
experiments were done using a threshold value of 0.02, which is classified as ”unpredictable”
learners in Table 1. However, this threshold resulted in too many features being classified as
important that did not seem important, because they could be removed without a significant
drop in predictive power. The next experiments were done using a threshold value of 0.1,
which is classified as ”weak” learners in Table 1. However, this threshold resulted in a
significant drop in predictive power. A threshold used in this thesis is between the two
thresholds with a value of 0.05 and resulted in the highest number of removed features
without a significant drop in predictive power.
The second method used is based on XGBoost13 . XGBoost is used to determine the im-
portance of every features using the method explained in Subsubsection 2.2.3. The selection
of important features is made using a threshold. Every feature with an importance value be-
low 0.01 will be removed. The threshold used was determined based on initial experimental
exploration. Initially, the same threshold value as the WoE was used, but this resulted in a
drop in predictive power. After multiple experiments, a threshold value of 0.01 was shown
to be the most reliable.
The final method used is a combination of the two methods mentioned above. First,
a selection is made using the IV based method. The XGBoost based method is used on
the same data set consisting of only the selected features. Both WoE and XGBoost have
potential problems, as shown in Subsubsection 2.1.4 and Subsubsection 2.2.4 respectively.
The combination of the two methods would ideally utilize the advantages of both methods
and dispose of the disadvantages.
The evaluation is done using neural networks which are built using Tensorflow14 and
13 https://xgboost.readthedocs.io/en/latest/
14 https://www.tensorflow.org/

Keras15 with Python16 as the programming language. The neural networks used to train
are of limited complexity, consisting of an input layer, one hidden layer and an output layer.
The size of the input layer is based on the input size, the size of the hidden layer is 128,
and the size of the output layer is 2 to predict binary values. The activation function of the
hidden layer is ReLU17 and the activation function of the output layer is Softmax18 . The
optimizer used to train the data is the Adam optimizer19 with default parameter values. All
neural networks are trained for ten epochs with a 30% validation-split.
Figure (8) Neural network used in this thesis
In total, seven different neural networks are trained. One using all features, which is
used as a reference, and for all three methods of reduction one based on the important
features and one based on the unimportant features. The results of the experiments will
15 https://keras.io/
16 https://www.python.org/
17 https://www.tensorflow.org/api docs/python/tf/nn/relu
18 https://www.tensorflow.org/api docs/python/tf/nn/softmax
19 https://www.tensorflow.org/api docs/python/tf/train/AdamOptimizer

consist of two parts. The first part is the reduction power of the split, meaning how many
features are removed and how much does this improve the speed. Expected is that the more
features are removed, the more the speed will improve. The second part of the results is
the quality evaluation. Removing features can only be positive when it does not have a
significant negative impact on the predictive power of the system. The quality is evaluated
using F-score and bAcc. In this thesis, a significant drop in predictive power is defined as a
drop of more than 0.05 in either F-score or bAcc. All evaluations are made using SKlearn
model evaluation20 in Python, except for the F-score which is calculated using Equation 9.
In an ideal result, the predictive power of the network trained on the selected features
would not be significantly worse than the predictive power of the network trained on all the
features, while the predictive power of the network trained on the removed features would
be.
5 Results
Below are the results gathered during testing. The predictive power is based on both the
balanced accuracy and the F-score. All the networks are trained for ten epochs and line
graphs of the accuracy over time can be found in Appendix B. Appendix B also shows bar
plots with the importance of each feature.
5.1 Simple data set - Titanic

The best results for the Titanic data set were gained using either the IV based or the
combined method. The best result reduces the number of columns by 56% and the speed
by 21% without a significant drop in predictive power. Note that XGBoost classified all
features as important and thus resulted in no reduction. A more in-depth breakdown of the
results and the training process can be found in Appendix B.1.
Method Features OHcolumns Speed Precision Recall F-score bAcc

Reference 9 1195 1.18 0.73 0.73 0.73 0.79
IV
Important 4 404 0.93 0.81 0.67 0.73 0.78
Unimportant 5 791 1.05 0.65 0.37 0.47 0.62
Tree
Important 9 1195 1.21 0.73 0.73 0.73 0.79
Unimportant X X X X X X X
Combined
Important 4 404 0.92 0.77 0.72 0.75 0.79
Unimportant 5 791 1.08 0.6 0.43 0.50 0.61
Figure (9) Results for the Titanic data set, X means that the test has not been done because
there are 0 columns and thus no data. OHcolumns means the number of columns present
after one-hot encoding.
20 https://scikit-learn.org/stable/modules/model evaluation.html

5.2 Big data set - WeatherAUS

The best results for the weatherAUS data set were gained using either the XGBoost based
or the combined method. The best result reduces the number of columns by 95% and the
speed by 60% without a significant drop in predictive power. A more in-depth breakdown
of the results and the training process can be found in Appendix B.2.

Reference 23 2280 271 1 1 1 1
IV
Important 7 337 130 1 1 1 1
Unimportant 16 1943 261 0.57 0.48 0.52 0.69
Tree
Important 1 63 106 1 1 1 1
Unimportant 22 2217 282 0.64 0.53 0.58 0.72
Combined
Important 1 64 107 1 1 1 1
Unimportant 22 2217 279 0.64 0.53 0.58 0.72
Figure (10) Results for the weatherAUS data set. OHcolumns means the number of columns
present after one-hot encoding.
5.3 Complex data set - Adult

The best results for the Adult data set were gained using either the IV or the combined
method. The best result reduces the number of columns by 29% and the speed by 3%
without a significant drop in predictive power. The unimportant features from both the
XGBoost based method, as well as the combined method, result in precision and recall of
0.0. This is caused by predicting all rows as False which is caused by the lack of useful
information present in the given data. A more in-depth breakdown of the results and the
training process can be found in Appendix B.3.


Reference 14 101 31 0.68 0.31 0.43 0.63
IV
Important 10 64 30 0.67 0.43 0.52 0.68
Unimportant 4 37 24 0.53 0.11 0.18 0.54
Tree
Important 12 82 23 0.69 0.32 0.44 0.64
Unimportant 2 19 25 0.0 0.0 X 0.5
Combined
Important 10 64 35 0.6 0.57 0.59 0.72
Unimportant 4 37 32 0.0 0.0 X 0.5
Figure (11) Results for the Adult data set, It is not possible to calculate the F-score when
both the Precision and Recall have a values of 0.0, thereby X is written as value. OHcolumns
means the number of columns present after one-hot encoding.
5.4 Domain data set - ABN-AMRO OOT

The best results for the ABN-AMRO data set were gained using the combined method.
The best result reduces the number of columns by 76% and the speed by 93% without a
significant drop in predictive power. A more in-depth breakdown of the results and the
training process can be found in Appendix B.4.

Reference 83 16202 1770 0.74 0.77 0.76 0.86
IV
Important 30 6319 212 0.8 0.68 0.74 0.83
Unimportant 53 9883 1132 0.47 0.45 0.46 0.68
Tree
Important 30 5649 623 0.76 0.76 0.76 0.86
Unimportant 53 10553 610 0.5 0.38 0.43 0.66
Combined
Important 20 3705 132 0.79 0.70 0.74 0.83
Unimportant 63 12497 1493 0.52 0.51 0.52 0.71
Figure (12) Results for the ABN-AMRO data set. OHcolumns means the number of
columns present after one-hot encoding.
5.5 Analysis
The IV based method is able to reduce the number of features of all the data sets used in this
thesis. The IV based method does, however, perform significantly worse compared to both
the XGBoost and the combined method in the weatherAUS data set. This is caused by the

second potential problem when using the IV based system discussed in Subsubsection 2.1.4.
In the weatherAUS data set, only one feature is required to predict the target 100% correct,
which means that all other features can be removed. This does, however, not mean that
the features are useless when predicting, but that these features do not add anything to the
information already present in the most important feature. The features are thus classified
as important by the IV based method because the features could be used for prediction by
themselves, even though they do not contribute much compared to the best feature.
The XGBoost based method can reduce the number of features of three of the four data
sets used in this thesis. The XGBoost is, however, not able to reduce the number of features
used in the Titanic data set. This is possibly caused by the first potential problem discussed
in Subsubsection 2.2.4. When comparing the list of importance of each feature to the one
made by the IV based method, it is prominent that the most important features are the
same. The problem is that while all the features classified as unimportant in the IV are
also the lowest-scoring features in the XGBoost based method; their value is too high to be
classified as unimportant. This problem could be resolved by increasing the threshold, but
this could also increase the possibility of a significant drop in predictive power.
The combined method can reduce the number of features of all the data sets used in
this thesis. In all four data sets the combined method was either one of the best or the
best method to use. The results suggest that the combined method has all the advantages
of both methods while removing the disadvantages. The ABN-AMRO OOT data set is the
one data set where the combined method proved to be better than either one of the two
methods. This is most likely caused by the lower number of features present when executing
XGBoost. Fewer features that are unimportant lowers the impact of the second potential
problem discussed in Subsubsection 2.2.4. This is an excellent example of the advantage of
using the combined method. Noteworthy is the fact that in both the adult data set and the
ABN-AMRO OOT data set, either the precision decreases and the recall increases, or vice
versa. There has been no mention of this fact in any literature used for this thesis.
6 Conclusion
Based on the evaluated results discussed in Subsection 5.5 it is possible to answer the research
questions proposed in Subsection 1.1. Before the research question can be answered, all four
subquestions have to be answered
The first subquestion is ”Can Weight of Evidence be used to reduce the number of
features of a data set without a significant drop in predictive power when solving a binary
classification problem?”. The results indicate that Weight of Evidence is indeed able to
reduce the number of features without a significant drop in predictive power. The WoE
based method resulted in a reduction of the number of features in every data set used in this
thesis. The reduction of the adult data set even resulted in a small increase in the balanced
accuracy score.
The second subquestion is ”Can XGBoost be used to reduce the number of features of a
data set without a significant drop in predictive power when solving a binary classification
problem?”. The results indicate that XGBoost is indeed able to reduce the number of
features without a significant drop in predictive power. The XGBoost based method resulted

in a reduction of the number of features in three of four used data sets. The XGBoost based
method was however not able to reduce the number of features of the Titanic data set
The third subquestion is ”Does Weight of Evidence or XGBoost perform better when
reducing the number of features of a data set without a significant drop in predictive power
when solving a binary classification problem?”. Neither of the two methods consistently
perform better than the other. In both the weatherAUS data set as well as the ABN-
AMRO OOT data set, the XGBoost based method performed better than the WoE based
method. In both the Titanic data set as well as the Adult data set, the WoE based method
performed better than the XGBoost based method. Noteworthy, however, is the fact that
the XGBoost based method was not able to reduce the Titanic data set which might indicate
that the WoE based method is more reliable.
The fourth subquestion is ”Does a combination of Weight of Evidence and XGBoost
perform better when reducing the number of features of a data set without a significant
drop in predictive power when solving a binary classification problem than one of them
by themselves?”. The results show that in all four data sets tested in this thesis, the
combined method resulted in the highest number of removed features without a significant
drop in predictive power. Besides being more consistent than the methods by themselves,
the combined method also performs better in the ABN-AMRO OOT data set. The ABN-
AMRO OOT data set is an excellent example of the two methods using their advantages to
resolve their disadvantages.
The overall research question is ”Can a combination of Weight of Evidence and XG-
Boost be used to reduce the number of features of a data set without a significant drop in
predictive power when solving a binary classification problem?”. The results indicate that a
combination of WoE and XGBoost can reduce the number of features and does this better
and more consistent than one of the two methods separately. There are no indications that
show difficulties with complexity or size, which would indicate that the method can be used
for all binary problems.
7 Discussion & Future Research

7.1 Discussion
Even though the method used in this thesis has shown to be very promising, three remarks
are identified. The first remark is that currently when two features contain the same infor-
mation, only one of the two is chosen by XGBoost to be of high importance while the other
feature is deemed redundant. This could be problematic when using the method purely for
readability reasons.
The second remark is about the neural network that is used to evaluate the selected
features. This neural network is always built with the same size hidden layer and an output
layer with a varying first layer based on the size of the input size. This is done to make a fair
comparison between the different neural networks, but it can be argued that a transition
from a larger input size is more difficult and should require either more hidden layers or
a larger hidden layer. This could mean that the selected data has a slight edge due to a
smaller input size. If however the results of the removed features are examined, it does not

suggest this factor to be of significant impact since they also have a much smaller input size,
but still performed much worse than the whole data set.
The third remark is about the thresholds chosen for both the XGBoost and the Weight
of Evidence. Both thresholds are primarily chosen because they provided the best results
during initial experimentation. While these thresholds seem to work well, it could be possible
that better parameters could be found. A possible improvement to the system could also
be the method of choosing features based on the importance scores to better suit the needs
of the project. Increasing the thresholds would remove more redundant features and thus
leave only the most important features but would cause a loss in prediction power. For some
projects, a loss in prediction power is much less detrimental because the readability is a more
important goal. Another method of classifying features could be to choose the top X features.
This method obviously has the big downside that it requires some form of knowledge about
the data, to choose the number of features returned. Picking a wrong number of features
could result in either removing important features because not enough features are returned,
or classifying unimportant features as important because many features are returned.
7.2 Future Research

Aside from the methods of determining thresholds, there is more future research possible.
The first possible continuation is to vary the different parameters that can be defined when
using XGBoost.In this project, the basic parameters from the Python packages were used,
but improvements in both execution and results could be made when optimizing these
parameters. A similar continuation could be done exploring the parameters of the NNs.
The second possible continuation is an exploration of the use of other types of tree-based
algorithms like AdaBoost, Microsoft’s LightGBM21 or even basic single Tree algorithms.
A similar continuation could be done exploring alternatives for the Weight of Evidence
algorithm like Chi-squared or Gini 22 .
A third possible continuation is an expansion towards all types of classification opposed to
only binary classification. XGBoost is already capable of all types of classification, but the
IV based method would have to be altered to account for non-binary classification.
7.3 Recomendations for ABN-AMRO

For ABN-AMRO, the following options are given for the usage of this method. The first
option is to analyze the features that are classified as important. This can be done by
looking for either correlation or for exception values that are either much more frequent
OOT or much less frequent OOT. This analysis could also include predicting OOT based
on the data, but without the feature being analyzed. The second option is to analyze the
differences between periods. In this thesis, all accidents between December 2018 and June
2019 are put into one data set, but this could be split into 6 subsets where each subset
contains the data from a single month.
21 https://github.com/microsoft/LightGBM
22 Anumber of different methods can be found at http://documentation.statsoft.com/portals/0/
formula%20guide/Weight%20of%20Evidence%20Formula%20Guide.pdf

8 Acknowledgements
I would like to thank ABN-AMRO for the use of their time, data and expertise. I am
especially grateful for both Monique Gerrits, Ronald van der Veen and Paul ten Kaate for
their time and support within the ABN-AMRO. I would like to thank all my fellow student
for feedback and collaboration.
References
Breiman, L. (2001, Oct 01). Random forests. Machine Learning, 45 (1), 5–32. Retrieved
from https://doi.org/10.1023/A:1010933404324 doi: 10.1023/A:1010933404324
Chen, T., & Guestrin, C. (2016). Xgboost: A scalable tree boosting system. In Proceedings of
the 22nd acm sigkdd international conference on knowledge discovery and data mining
(pp. 785–794). New York, NY, USA: ACM. Retrieved from http://doi.acm.org/
10.1145/2939672.2939785 doi: 10.1145/2939672.2939785
Genuer, R., Poggi, J.-M., & Tuleau-Malot, C. (2010). Variable selection using random
forests. Pattern Recognition Letters, 31 (14), 2225 - 2236. Retrieved from http://
www.sciencedirect.com/science/article/pii/S0167865510000954 doi: https://
doi.org/10.1016/j.patrec.2010.03.014
Guoping, Z. (2014, 07). A necessary condition for a good binning algorithm in credit scoring.
Applied Mathematical Sciences, Vol. 8 , 3229-3242. doi: 10.12988/ams.2014.44300
Knigge, D. (2019). Event correlation and dependenc-graph analysis to suppory root cause
analysis in itsm environments. (Bachelor’s Thesis). University of Amsterdam.
Lawrence, S., Giles, C. L., & Tsoi, A. C. (1997). Lessons in neural network training:
Overfitting may be harder than expected.
Lin, A. Z., & Hsieh, T.-Y. (2014). Expanding the use of weightof evidence and infor-
mation value to continuous dependent variables for variable reduction and scorecard
development. SESUG, 2014 .
Potdar, K., Pardawala, T., & Pai, C. (2017, 10). A comparative study of categorical variable
encoding techniques for neural network classifiers. International Journal of Computer
Applications, 175 , 7-9. doi: 10.5120/ijca2017915495
Riemersma, R. (2019). Predicting incident duration time. (Bachelor’s Thesis). University
of Amsterdam.
Rosset, & Saharon. (2004). Model selection via the AUC. In Proceedings of the twenty-
first international conference on machine learning (pp. 89–). New York, NY, USA:
ACM. Retrieved from http://doi.acm.org/10.1145/1015330.1015400 doi: 10
.1145/1015330.1015400
S., M. W., & Pitts, W. (1943, Dec 01). A logical calculus of the ideas immanent in nervous
activity. The bulletin of mathematical biophysics, 5 (4), 115–133. Retrieved from
https://doi.org/10.1007/BF02478259 doi: 10.1007/BF02478259
Schapire, R. E. (2013). Explaining adaboost. In B. Schölkopf, Z. Luo, & V. Vovk (Eds.),
Empirical inference: Festschrift in honor of vladimir n. vapnik (pp. 37–52). Berlin,
Heidelberg: Springer Berlin Heidelberg. Retrieved from https://doi.org/10.1007/
978-3-642-41136-6 5 doi: 10.1007/978-3-642-41136-6 5
Schmidhuber, J. (2014). Deep learning in neural networks: An overview. CoRR,
abs/1404.7828 . Retrieved from http://arxiv.org/abs/1404.7828

Scott, D. (2001). Nsm: Often the weakest link in business availability. Gartner Group
AV-13-9472 .
Su, J., & Zhang, H. (2006). A fast decision tree learning algorithm. In Proceedings of the
21st national conference on artificial intelligence - volume 1 (pp. 500–505). AAAI
Press. Retrieved from http://dl.acm.org/citation.cfm?id=1597538.1597619
ten Kaate, P. (2018). Automatic detection, diagnosis and mitigation of incidents in multi-
system environments. (Bachelor’s Thesis). University of Amsterdam.
Velez, M. (2019). Predicting causal relations between itsm incidents and changes. (Bachelor’s
Thesis). University of Amsterdam.
Wang, C., Kavulya, S., Tan, J., Hu, L., Kutare, M., Kasick, M., . . . Gandhi, R. (2013,
11). Performance troubleshooting in data centers: an annotated bibliography? ACM
SIGOPS Operating Systems Review , 47 , 50-62. doi: 10.1145/2553070.2553079
Weed, D. L. (2005). Weight of evidence: A review of concept and methods. Risk Analysis,
Vol 25, No 6 .
Widrow, B., Rumelhart, D. E., & Lehr, M. A. (1994, March). Neural networks: Applications
in industry, business and science. Commun. ACM , 37 (3), 93–105. Retrieved from
http://doi.acm.org/10.1145/175247.175257 doi: 10.1145/175247.175257
Wiggerman, M. (2019). Predicting the first assignment group for a smooth incident resolution
process (Bachelor’s Thesis). University of Amsterdam.
Zadrozny, & Bianca. (2004, 09). Learning and evaluating classifiers under sample selection
bias. Proceedings, Twenty-First International Conference on Machine Learning, ICML
2004 , 2004 . doi: 10.1145/1015330.1015425

Appendices
A Data clarification
A.1 Simple data set - Titanic
In the Titanic data set two rows were removed both where removed during preprocessing.
Feature is removed because:

= above 95% unique values = 1 unique value = above 95% missing
faeture name missing missing% unique unique%

PassengerId 0 0.0 891 100.0
Survived 0 0.0 2 0.22
Pclass 0 0.0 3 0.34
Name 0 0.0 891 100.0
Sex 0 0.0 2 0.22
Age 177 19.87 89 9.99
SibSp 0 0.0 7 0.79
Parch 0 0.0 7 0.79
Ticket 0 0.0 681 76.43
Fare 0 0.0 248 27.83
Cabin 687 77.1 148 16.61
Embarked 2 0.22 4 0.45
Table (2) All features in the Titanic data set. Colored rows show the feature was removed
during preprocessing.

A.2 Big data set - WeatherAUS

The waetherAUS data set consists of 24 features and none were removed during preprocess-
ing.

Date 0 0.0 3436 2.42
Location 0 0.0 49 0.03
MinTemp 637 0.45 390 0.27
MaxTemp 322 0.23 506 0.36
Rainfall 1406 0.99 680 0.48
Evaporation 60843 42.79 357 0.25
Sunshine 67816 47.69 146 0.1
WindGustDir 9330 6.56 17 0.01
WindGustSpeed 9270 6.52 68 0.05
WindDir9am 10013 7.04 17 0.01
WindDir3pm 3778 2.66 17 0.01
WindSpeed9am 1348 0.95 44 0.03
WindSpeed3pm 2630 1.85 45 0.03
Humidity9am 1774 1.25 102 0.07
Humidity3pm 3610 2.54 102 0.07
Pressure9am 14014 9.86 547 0.38
Pressure3pm 13981 9.83 550 0.39
Cloud9am 53657 37.74 11 0.01
Cloud3pm 57094 40.15 11 0.01
Temp9am 904 0.64 441 0.31
Temp3pm 2726 1.92 501 0.35
RainToday 1406 0.99 3 0.0
RISK MM 0 0.0 681 0.48
RainTomorrow 0 0.0 2 0.0
Table (3) All features in the weatherAUS data set.

A.3 Complex data set - Adult

The adult data set consists of 14 features and none were removed during preprocessing.

age 0 0.0 73 0.22
workclass 0 0.0 9 0.03
fnlwgt 0 0.0 21648 66.48
education 0 0.0 16 0.05
education-num 0 0.0 16 0.05
marital-status 0 0.0 7 0.02
occupation 0 0.0 15 0.05
relationship 0 0.0 6 0.02
race 0 0.0 5 0.02
sex 0 0.0 2 0.01
capital-gain 0 0.0 119 0.37
capital-loss 0 0.0 92 0.28
hours-per-week 0 0.0 94 0.29
native-country 0 0.0 42 0.13
Table (4) All features in the Adult data set.

A.4 Domain data set - ABN-AMRO OOT

The adult data set consists of 269 features and 186 were removed during preprocessing.
Feature were removed because:

ACTION 55583 100.0 1 0.0
Avail 55583 100.0 1 0.0
Average 55570 99.98 9 0.02
COMMAND 55583 100.0 1 0.0
CPU 55583 100.0 1 0.0
CollateralID 55583 100.0 1 0.0
CompanyName 55583 100.0 1 0.0
Description 55582 100.0 2 0.0
Device 55583 100.0 1 0.0
Duration 55583 100.0 1 0.0
ERR 55570 99.98 6 0.01
Email 55583 100.0 1 0.0
FirstName 55583 100.0 1 0.0
Group 55487 99.83 5 0.01
IpAddress 55583 100.0 1 0.0
IsActive 55583 100.0 1 0.0
LastLoginDate 55583 100.0 1 0.0
LastName 55583 100.0 1 0.0
Maximum 55570 99.98 11 0.02
Message 55581 100.0 3 0.01
Minimum 55570 99.98 10 0.02
MobilePhone 55583 100.0 1 0.0
Name 55583 100.0 1 0.0
Opened by 55583 100.0 1 0.0
PID 55582 100.0 2 0.0
Port 55582 100.0 2 0.0
PostalCode 55583 100.0 1 0.0
Profile.PermissionsApiEnabled 55583 100.0 1 0.0
Profile.PermissionsModifyAllData 55583 100.0 1 0.0
Profile.PermissionsViewSetup 55583 100.0 1 0.0
Table (5) Part of the features in the ABN-AMRO OOT data set. Colored rows show that
the feature was removed during preprocessing.



ProfileId 55583 100.0 1 0.0
REPORT ID 55583 100.0 1 0.0
Reason 55583 100.0 1 0.0
S 55583 100.0 1 0.0
Start 55583 100.0 1 0.0
State 55473 99.8 8 0.01
TTY 55583 100.0 1 0.0
Type 55581 100.0 3 0.01
USER 55583 100.0 1 0.0
Unit 55583 100.0 1 0.0
UserId 55583 100.0 1 0.0
UserRoleId 55583 100.0 1 0.0
UserType 55583 100.0 1 0.0
Username 55583 100.0 1 0.0
action 55515 99.88 9 0.02
active 0 0.0 1 0.0
activity due 50679 91.18 4440 7.99
additional assignee list 55583 100.0 1 0.0
affect dest 55583 100.0 1 0.0
app 55579 99.99 4 0.01
approval 13 0.02 2 0.0
approval history 55583 100.0 1 0.0
approval set 55583 100.0 1 0.0
assigned to 55583 100.0 1 0.0
assigned to name 55583 100.0 1 0.0
assignment group 59 0.11 453 0.81
assignment group name 107 0.19 448 0.81
assignment user name 55583 100.0 1 0.0
assignment user username 55583 100.0 1 0.0
body 55583 100.0 1 0.0



business duration 55583 100.0 1 0.0
business service 57 0.1 608 1.09
business service name 57 0.1 608 1.09
business service offering 55583 100.0 1 0.0
business stc 55583 100.0 1 0.0
calendar duration 55583 100.0 1 0.0
calendar stc 55583 100.0 1 0.0
caller id 22 0.04 17954 32.3
category 0 0.0 1 0.0
caused by 55113 99.15 192 0.35
change 55583 100.0 1 0.0
change state name 55583 100.0 1 0.0
child incidents 16 0.03 2 0.0
ci affected host 55579 99.99 5 0.01
ci affected item 55579 99.99 5 0.01
ci reason code 55579 99.99 5 0.01
close code 20 0.04 4 0.01
close notes 230 0.41 40910 73.6
closed at 0 0.0 50687 91.19
closed by 55583 100.0 1 0.0
cmd 55573 99.98 5 0.01
cmdb ci 18429 33.16 8864 15.95
command 55582 100.0 2 0.0
comments 55583 100.0 1 0.0
comments and work notes 55583 100.0 1 0.0
contact type 0 0.0 3 0.01
contract 55583 100.0 1 0.0
correlation display 16 0.03 2 0.0
correlation id 0 0.0 55583 100.0
cpu count 55583 100.0 1 0.0



delivery plan 55583 100.0 1 0.0
delivery task 55583 100.0 1 0.0
description 57 0.1 24849 44.71
dest 0 0.0 1 0.0
due date 21995 39.57 28710 51.65
endpoint 0 0.0 1 0.0
epoch opened at 0 0.0 55243 99.39
errorCode 55582 100.0 2 0.0
errorMessage 55582 100.0 2 0.0
escalation 8 0.01 2 0.0
eventName 55582 100.0 2 0.0
eventtype 0 0.0 16 0.03
expected start 55583 100.0 1 0.0
follow up 55583 100.0 1 0.0
group list 55583 100.0 1 0.0
hash 55583 100.0 1 0.0
host 0 0.0 2 0.0
id 55554 99.95 21 0.04
impact 12 0.02 4 0.01
incident 57 0.1 24849 44.71
incident state 0 0.0 5 0.01
incident state name 0 0.0 5 0.01
index 0 0.0 1 0.0
instance 55583 100.0 1 0.0
ip 55582 100.0 2 0.0
ip address 55583 100.0 1 0.0
it product 57 0.1 1487 2.68
key 55583 100.0 1 0.0
knowledge 11 0.02 2 0.0
latitude 55458 99.78 9 0.02



level 55582 100.0 2 0.0
linecount 0 0.0 151 0.27
location 46102 82.94 566 1.02
location name 55405 99.68 24 0.04
logname 55582 100.0 2 0.0
longitude 55458 99.78 9 0.02
made sla 17 0.03 2 0.0
mem 55583 100.0 1 0.0
message 55582 100.0 2 0.0
mode 55583 100.0 1 0.0
name 55583 100.0 1 0.0
node 55583 100.0 1 0.0
notify 19 0.03 2 0.0
number 0 0.0 55583 100.0
opened at 0 0.0 55243 99.39
opened by 9 0.02 3 0.01
opened by name 55583 100.0 1 0.0
operational status 55583 100.0 1 0.0
operational status name 55583 100.0 1 0.0
os 55583 100.0 1 0.0
parent 55583 100.0 1 0.0
parent incident 52951 95.26 378 0.68
password 55583 100.0 1 0.0
path 55581 100.0 3 0.01
pid 55575 99.99 9 0.02
platform 55583 100.0 1 0.0
principal 55583 100.0 1 0.0
priority 0 0.0 5 0.01
problem 55583 100.0 1 0.0
problem id 53678 96.57 334 0.6



problem state 55583 100.0 1 0.0
problem state name 55583 100.0 1 0.0
process 55583 100.0 1 0.0
product 55583 100.0 1 0.0
profile city 55583 100.0 1 0.0
profile country 55583 100.0 1 0.0
punct 0 0.0 11269 20.27
reassignment count 0 0.0 44 0.08
reopen count 15 0.03 10 0.02
reopened by 38085 68.52 2 0.0
reopened time 38084 68.52 17388 31.28
resolved by 55583 100.0 1 0.0
rfc 55344 99.57 222 0.4
service offering 55583 100.0 1 0.0
severity 31082 55.92 2 0.0
shell 55583 100.0 1 0.0
short description 57 0.1 24849 44.71
sid 55582 100.0 2 0.0
size 55579 99.99 5 0.01
skills 55583 100.0 1 0.0
sla due 55582 100.0 2 0.0
source 0 0.0 1 0.0
sourcetype 0 0.0 1 0.0
splunk server 0 0.0 13 0.02
splunk server group 55583 100.0 1 0.0
src 55476 99.81 7 0.01
src user 11 0.02 2 0.0
state 13 0.02 4 0.01
state name 55583 100.0 1 0.0
status 55571 99.98 9 0.02



subcategory 55583 100.0 1 0.0
subject 55582 100.0 2 0.0
swap 55583 100.0 1 0.0
sys created by 11 0.02 2 0.0
sys created on 11 0.02 54569 98.18
sys domain 18 0.03 2 0.0
sys domain path 16 0.03 2 0.0
sys id 0 0.0 55583 100.0
sys mod count 13 0.02 131 0.24
sys tags 55583 100.0 1 0.0
sys updated by 21 0.04 2 0.0
sys updated on 8 0.01 45181 81.29
tag 0 0.0 15 0.03
tag::app 55583 100.0 1 0.0
tag::eventtype 0 0.0 9 0.02
tag::host 37711 67.85 2 0.0
tag::shell 55583 100.0 1 0.0
threads 55582 100.0 2 0.0
ticket id 0 0.0 55583 100.0
time 55557 99.95 19 0.03
time submitted 11 0.02 54569 98.18
time worked 55583 100.0 1 0.0
timestamp 0 0.0 1 0.0
title 55582 100.0 2 0.0
type 55579 99.99 5 0.01
u backlog id 55583 100.0 1 0.0
u bsd domain 57 0.1 25 0.04
u bsd domain name 1875 3.37 25 0.04
u bsd subdomain 57 0.1 68 0.12
u bsd subdomain name 9853 17.73 67 0.12



u business value 76 0.14 10 0.02
u cause code 6 0.01 10 0.02
u caused by change 25147 45.24 4 0.01
u close code category 331 0.6 13 0.02
u closure code 55583 100.0 1 0.0
u correlation id out 55583 100.0 1 0.0
u customer 34 0.06 17952 32.3
u direct closed 0 0.0 2 0.0
u environment 0 0.0 1 0.0
u estimated delivery date 55583 100.0 1 0.0
u external parent ref 52951 95.26 378 0.68
u ibm closed 55583 100.0 1 0.0
u ibm closed by 22170 39.89 2457 4.42
u ibm created 10 0.02 55240 99.38
u ibm created by 21 0.04 2 0.0
u ibm opened 55583 100.0 1 0.0
u ibm opened by 55583 100.0 1 0.0
u ibm resolved 55583 100.0 1 0.0
u ibm resolved by 55583 100.0 1 0.0
u ibm sla report 5 0.01 3 0.01
u ibm sys id 0 0.0 55583 100.0
u ibm updated 9 0.02 50661 91.14
u ibm updated by 55583 100.0 1 0.0
u ibm updates 5 0.01 130 0.23
u incident phase 12 0.02 3 0.01
u intensive care 17 0.03 3 0.01
u it service offering 57 0.1 1487 2.68
u itp list 55583 100.0 1 0.0
u knowledge article 55583 100.0 1 0.0
u notify by 55583 100.0 1 0.0



u opened by group 16 0.03 396 0.71
u phase display 17 0.03 3 0.01
u problem candidate 17 0.03 3 0.01
u problem candidate reviewed 18 0.03 2 0.0
u read only 10 0.02 2 0.0
u related tasks 55583 100.0 1 0.0
u security incident category 24152 43.45 8 0.01
u security relevant 14 0.03 2 0.0
u service offering status 25599 46.06 5 0.01
u solution 55075 99.09 131 0.24
u source system 0 0.0 1 0.0
u state display 14 0.03 6 0.01
u supplier reference 55583 100.0 1 0.0
u support offering status 25593 46.04 6 0.01
u vendor incident 55583 100.0 1 0.0
u visible to business user 15 0.03 2 0.0
uid 55576 99.99 6 0.01
upon approval 6 0.01 2 0.0
upon reject 9 0.02 2 0.0
urgency 7 0.01 4 0.01
user 55583 100.0 1 0.0
user input 55583 100.0 1 0.0
vendor 55583 100.0 1 0.0
version 55579 99.99 2 0.0
watch list 55583 100.0 1 0.0
work end 55583 100.0 1 0.0
work notes 55583 100.0 1 0.0
work start 55583 100.0 1 0.0
time 0 0.0 45188 81.3

B Test Results
B.1 Titanic
B.1.1 Accuracy
(a) IV (b) Tree

1 1
0.9 0.9
accuracy
accuracy
0.8 0.8
0.7 0.7
0.6 0.6
0.5 0.5
0 0.2 0.4 0.6 0.8 1 1.2 1.4 0 0.2 0.4 0.6 0.8 1 1.2 1.4
test time test time
(c) Combined
1
0.9
Important Features
accuracy
0.8
Unimportant Features
0.7
All Features
0.6
0.5
0 0.2 0.4 0.6 0.8 1 1.2 1.4
test time
Figure (13) Validation accuracy over time when training the Titanic data set, Because
no features were removed during the Tree-based method it was not possible to test the non-
selected Features

B.1.2 Loss
(a) IV (b) Tree

1 1
0.9 0.9
0.8 0.8
0.7 0.7
0.6 0.6
loss
loss
0.5 0.5
0.4 0.4
0.3 0.3
0.2 0.2
0.1 0.1
0 0
0 0.2 0.4 0.6 0.8 1 1.2 1.4 0 0.2 0.4 0.6 0.8 1 1.2 1.4
test time test time
(c) Combined
1
0.9
0.8
0.7 Important Features
0.6
loss
0.5 Unimportant Features

0.4
0.3 All Features
0.2
0.1
0
0 0.2 0.4 0.6 0.8 1 1.2 1.4
test time
Figure (15) Validation loss over time when training the Titanic data set, Because no fea-
tures were removed during the Tree-based method it was not possible to test the non-selected
Features

B.2 Weather
B.2.1 Accuracy
(a) IV (b) Tree

1 1
0.9 0.9
accuracy
accuracy
0.8 0.8
0.7 0.7
0.6 0.6
0.5 0.5
0 50 100 150 200 250 300 350 400 0 50 100 150 200 250 300 350 400
test time test time
(c) Combined
1
0.9
Important Features
accuracy
0.8
0.7
All Features
0.6
0.5
0 50 100 150 200 250 300 350 400
test time
Figure (17) Validation accuracy over time when training the weatherAUS data set

B.2.2 Loss
(a) IV (b) Tree

2 2
1.8 1.8
1.6 1.6
1.4 1.4
1.2 1.2
loss
loss
1 1
0.8 0.8
0.6 0.6
0.4 0.4
0.2 0.2
0 0
0 50 100 150 200 250 300 350 400 0 50 100 150 200 250 300 350 400
test time test time
(c) Combined
2
1.8
1.6
1.2
loss
1 Unimportant Features
0.8
0.6 All Features
0.4
0.2
0
0 50 100 150 200 250 300 350 400
test time
Figure (19) Validation loss over time when training the weatherAUS data set

B.3 Adult
B.3.1 Accuracy
(a) IV (b) Tree

1 1
0.9 0.9
accuracy
accuracy
0.8 0.8
0.7 0.7
0.6 0.6
0.5 0.5
0 5 10 15 20 25 30 35 40 0 5 10 15 20 25 30 35 40
test time test time
(c) Combined
1
0.9
Important Features
accuracy
0.8
0.7
All Features
0.6
0.5
0 5 10 15 20 25 30 35 40
test time
Figure (21) Validation accuracy over time when training the Adults data set

B.3.2 Loss
(a) IV (b) Tree

2 2
1.8 1.8
1.6 1.6
1.4 1.4
1.2 1.2
loss
loss
1 1
0.8 0.8
0.6 0.6
0.4 0.4
0.2 0.2
0 0
0 5 10 15 20 25 30 35 40 0 5 10 15 20 25 30 35 40
test time test time
(c) Combined
2
1.8
1.6
1.2
loss
1 Unimportant Features
0.8
0.6 All Features
0.4
0.2
0
0 5 10 15 20 25 30 35 40
test time
Figure (23) Validation loss over time when training the Adults data set

B.4 ABN-AMRO
B.4.1 Accuracy
(a) IV (b) Tree

1 1
0.9 0.9
accuracy
accuracy
0.8 0.8
0.7 0.7
0.6 0.6
0.5 0.5
0 400 800 1,200 1,600 2,000 0 400 800 1,200 1,600 2,000
test time test time
(c) Combined
1
0.9
Important Features
accuracy
0.8
0.7
All Features
0.6
0.5
0 400 800 1,200 1,600 2,000
test time
Figure (25) Validation accuracy over time when training the ABN-AMRO data set

B.4.2 Loss
(a) IV (b) Tree

1 1
0.9 0.9
0.8 0.8
0.7 0.7
0.6 0.6
loss
loss
0.5 0.5
0.4 0.4
0.3 0.3
0.2 0.2
0.1 0.1
0 0
0 400 800 1,200 1,600 2,000 0 400 800 1,200 1,600 2,000
test time test time
(c) Combined
1
0.9
0.8
0.7
Important Features
0.6
loss
0.5 Unimportant Features

0.4
0.3 All Features
0.2
0.1
0
0 400 800 1,200 1,600 2,000
test time
Figure (27) Validation loss over time when training the ABN-AMRO data set

Feature Reduction For Binary Classification Problems Using Weight of Evidence and Xgboost

Uploaded by

Document Information

Original Description:

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Feature Reduction For Binary Classification Problems Using Weight of Evidence and Xgboost

Uploaded by

Copyright:

Available Formats

Feature Reduction for Binary

Classification Problems using Weight of

Bachelor Opleiding Kunstmatige Intelligentie

June 29th, 2019

Keywords— Feature Reduction, Weight of Evidence, Information Value, Neural Networks,

Dante Niewenhuis page 1 of 49

Dante Niewenhuis page 2 of 49

7 Discussion & Future Research 26

DTC Decision Tree Classification algorithm

WoE Weight of Evidence

XGBoost Extreme Gradiant Boosting

Dante Niewenhuis page 3 of 49

Dante Niewenhuis page 4 of 49

1.1 Research questions

Can a combination of Weight of Evidence and XGBoost be used to

Can Weight of Evidence be used to reduce the number of features of a

Dante Niewenhuis page 5 of 49

1.3 Context: Blue Student Lab and ABN-AMRO

1.3.1 Related Thesis 1: Predicting resolution time

1.3.2 Related Thesis 2: Predicting assignment group

Dante Niewenhuis page 6 of 49

1.3.3 Related Thesis 3: Predicting caused by change

1.3.4 Related Thesis 4: Clustering events and incidents

Dante Niewenhuis page 7 of 49

2.1.2 Calculating Weight of Evidence

Dante Niewenhuis page 8 of 49

%nonEvents + 0.5 %Events + 0.5

2.1.3 Calculating Information Value

Table (1) Predictive Power of

2.1.4 Potential problems

Dante Niewenhuis page 9 of 49

2.2 Decision Tree Algorithm

Dante Niewenhuis page 10 of 49

data set is calculated using the following equation:

E = (−pt ∗ log(pt )) + (−pf ∗ log(pf )) (5)

Dante Niewenhuis page 11 of 49

2.2.1 Random Forest

Dante Niewenhuis page 12 of 49

2.2.4 Potential problems

2.3 Neural Networks

2.3.1 Overfitting and speed

Dante Niewenhuis page 13 of 49

(a) Example of an algorithm that is not overfitting.

(b) Example of an algorithm that is overfitting.

2.4.1 Ordinal Encoding

Dante Niewenhuis page 14 of 49

Figure (4) Example of column encoded using ordinal encoding

2.4.2 One-Hot Encoding

Dante Niewenhuis page 15 of 49

into the category ”uncommon” (ten Kaate, 2018).

City City A City R City L

Figure (5) Example of column encoded using one-hot encoding

2.5 Evaluation Metrics

Dante Niewenhuis page 16 of 49

Dante Niewenhuis page 17 of 49

totalpositives(p) = 50, totalnegatives(n) = 950

Dante Niewenhuis page 18 of 49

3.1 Simple data set - Titanic

3.2 Big data set - WeatherAUS

3.3 Complex data set - Adult

Dante Niewenhuis page 19 of 49

3.4 Domain data set - ABN-AMRO OOT