
Knowledge Distillation to Improve

Model Performance and Explainability:


A Decision-Critical Scenario Analysis

Boyd Franciscus Cornelis Vosters


STUDENT NUMBER: 1273464

THESIS SUBMITTED IN PARTIAL FULFILLMENT
OF THE REQUIREMENTS FOR THE DEGREE OF
MASTER OF SCIENCE IN DATA SCIENCE & SOCIETY
DEPARTMENT OF COGNITIVE SCIENCE & ARTIFICIAL INTELLIGENCE
SCHOOL OF HUMANITIES AND DIGITAL SCIENCES
TILBURG UNIVERSITY

Thesis committee:
Dr. Juan Sebastian Olier Jauregui
Dr. Peter Hendrix

Tilburg University
School of Humanities and Digital Sciences
Department of Cognitive Science & Artificial Intelligence
Tilburg, The Netherlands
December 2020
Preface

I hereby present to you my master's thesis on the usage of Knowledge Distillation in
practical, decision-critical scenarios to create high performing and explainable models.
This study is performed in partial fulfillment of the requirements for graduation from
my master's in Data Science & Society. This study is the culmination of three months of
hard work and many hours whilst working from home amidst the COVID-19 pandemic. I
want to thank Sebastian Olier for his guidance and support on writing this thesis during
the many online Zoom meetings. Furthermore, I would like to thank my friends and
family for their support during the writing process.
Knowledge Distillation to Improve Model
Performance and Explainability: A
Decision-Critical Scenario Analysis

Boyd Franciscus Cornelis Vosters

Humans are increasingly interacting with opaque Deep Neural Networks that can potentially
impact their lives depending on the outcome. In such an interaction, it can be valuable for the
end-user to receive some form of insight regarding the properties that drove the decision. How-
ever, such an explanation is not inherently possible when interacting with neural networks that
have multiple layers and conduct complicated computations to generate an output. This study
extends research on a proposed solution to transfer knowledge from high performing ’black-box’
algorithms to ’white-box’ algorithms that are both high performing and explainable: Knowledge
Distillation. This study's contribution to the existing literature resides in a practical analysis
of Knowledge Distillation on three new decision-critical datasets, which are datasets that can be
decisive in real-life applications. Thus, this study aims to answer the following main research
question: To what extent can Knowledge Distillation improve performance of an interpretable
student decision tree in decision-critical scenarios? Additionally, this study investigates how
Knowledge Distillation is affected by the problem of class imbalance and proposes two potential
techniques to fix this problem. The results in this study show that Knowledge Distillation is
capable of boosting the performance of interpretable student decision trees in decision-critical
scenarios to a certain extent. It is evident that class imbalance negatively affects the use of
Knowledge Distillation since it relies heavily on the distilled model’s own generalization ability.
Furthermore, the potential fixes for class imbalance, SMOTE oversampling and undersampling,
proved not to be powerful enough to address the class imbalance issue adequately.


1. Introduction

Without mathematical knowledge or a degree in AI or Data Science, Deep Neural
Networks (DNNs) can be perceived as ambiguous. DNNs are criticized for being non-
transparent and for the poor interpretability of their predictions.
When making use of a DNN with multiple hidden layers, it is challenging to trace back
what properties have driven the decision. In decision-critical domains, such as mort-
gage lending, feedback about the decision is crucial to prevent ambiguity. The so-called
’user-friendly’ explanations are required to provide understanding for end-users with
limited Machine Learning knowledge and to give them a sense of trust in the algorithm
(Xie et al. 2020). However, due to the success of Deep Learning applications in many
fields, an increasing number of individuals are interacting with opaque algorithms on
a day-to-day basis (Parloff 2016; Aggarwal 2018; Buhrmester, Münch, and Arens 2019).
Unfortunately, there is a trade-off between explainability and prediction performance
within the domain of Machine Learning. When using DNNs, this trade-off is growing
towards more complex models with high performance, at the cost of explainability
(Shwartz-Ziv and Tishby 2017). The issue of ambiguity within the DNN domain is
currently a trending subject gathered under the overarching topic of Explainable AI
(XAI). The topic of XAI focuses on introducing a set of techniques or tools that will pro-
vide intuitive and comprehensible feedback of an AI decision (Das and Rad 2020). New
techniques have been developed in recent years as means to open the so-called ’black
box’ problem faced when dealing with opaque DNNs (Shwartz-Ziv and Tishby 2017;
Xie et al. 2020). An intriguing application was introduced by Liu, Wang, and Matwin
(2018), which uses the so-called Knowledge Distillation technique to distill a DNN into
a comprehensible decision tree. In their study, Knowledge Distillation showed potential
to create both explainable and high performing decision trees. However, they stress a
need for further research and new applications of Knowledge Distillation for model
explainability improvement (Liu, Wang, and Matwin 2018). Therefore, this case study
investigates the usage of Knowledge Distillation with teacher neural networks and
student decision trees on three new real-life classification problems. These applications
are set in decision-critical scenarios, where the outcome of the decision can have a
significant impact on human lives (Grigorescu et al. 2020). Additionally, datasets that
are set in decision-critical scenarios often suffer from class imbalance because the
negative class occurs less frequently in reality. This issue creates the so-called
class imbalance problem, where the classifier tends to be biased towards predicting the
majority class (Veni and Rani 2018). Given this problem, this study also investigates
the effect that the class imbalance problem has on the usage of Knowledge Distillation
and investigates what techniques are most suitable at addressing the issue. Altogether,
this study complements the academic literature in the aforementioned XAI domain
by further investigating the usage of Knowledge Distillation on new applications, with
varying dataset structures.
In order to structure the aforementioned research focus, the following main research
question is formed:
To what extent can Knowledge Distillation improve performance of an interpretable student
decision tree in decision-critical scenarios?
The main research question is answered by using three different datasets, each with
a different binary classification task and data structure in a decision-critical scenario.
Subsequently, based on a comparison of performance, a conclusion can be drawn to
answer the research question. The three datasets used in this study are:


• Home Credit Default Risk Dataset (Kaggle 2018). Set in the mortgage
assessment domain, this dataset is designed for a binary classification task
to assess whether a client is credit-worthy or not. This dataset is
imbalanced.
• Default of Credit Card Clients Dataset (Yeh and Lien 2009). Set in a credit
assessment setting, this dataset is created for a binary classification task to
predict whether a person will have payment difficulties or not. This
dataset is also imbalanced.
• Breast Cancer Wisconsin Diagnostic Dataset (Wolberg, Street, and
Mangasarian 1995). In contrast to the other finance-focused datasets, the
Breast Cancer Wisconsin Diagnostic Dataset is set in the medical domain
and is designed to predict whether an individual cell from a breast tumor
is cancerous or not. This dataset is balanced.

Additionally, two sub-questions are implemented in this study to focus on the issue
of class imbalance that is commonly present in decision-critical scenarios:

• How does class imbalance affect Knowledge Distillation?


• What is the most effective technique to address class imbalance when
applying Knowledge Distillation?

The first sub-question is answered by comparing the Knowledge Distillation results of
both imbalanced datasets with the Breast Cancer Wisconsin Dataset. The second sub-
question is answered by applying two standard techniques that are capable of address-
ing the class imbalance issue according to existing literature: SMOTE oversampling and
undersampling (Fernández et al. 2018; Veni and Rani 2018).
Based on the results of this study, it is evident that Knowledge Distillation is capable
of boosting the performance of interpretable student decision trees in decision-critical
scenarios to a certain extent. All teacher models performed well on the classification
task, but only the student Breast Cancer Wisconsin Diagnostic Dataset model was able
to achieve a good performance of 0.93 AUC. The other student models showed improve-
ment over their baselines that received no Knowledge Distillation but were unable to
achieve a reliable AUC performance (> 0.7). Therefore, the practical contribution of this
study is that the effect of Knowledge Distillation remains limited in real-life situations.
Additionally, the theoretical contribution of this study is that class imbalance negatively
affects the use of Knowledge Distillation, since it relies heavily on the student model's
own generalization ability. The fixes proposed in this study, SMOTE oversampling and
undersampling, proved not to be consistent enough in addressing this issue, occasionally
even showing negative results. Therefore, this study is unable to answer which
technique is most effective at addressing the class imbalance issue.


2. Related Work

2.1 Explainable Deep Learning

With the rise of Artificial Intelligence (AI) in recent years, various Deep Neural Net-
works (DNNs) run on applications that humans use on a daily basis. DNNs run on cars
to prevent accidents (Grigorescu et al. 2020), DNNs run on our smartphones to improve
user experiences (Ignatov et al. 2019), and are used in the medical domain to support
diagnoses (Bakator and Radosav 2018). DNNs have integrated into our society and are
continuously developed for new areas and industries. However, some decisions made
by DNNs have far-reaching consequences that have a significant impact on people’s
lives. In such a case, the result of a DNN is decisive and is deployed in a decision-
critical scenario (Grigorescu et al. 2020). Notably, deploying algorithms for decision
making is also called Algorithmic Decision-Making (ADM) (Waltl and Vogl 2018). To
illustrate, what if a DNN determines which medical treatment someone will have and
the additional consequences if the prediction is wrong? Likewise, a financial impact
is present when making a major financial decision, such as applying for a mortgage
where the output of a DNN is decisive. What if this decision is negative, resulting in the
application being turned down with no further explanation? This result will inevitably
lead to questions from the end-user about the decision making process. Unfortunately,
intrinsic human-like feedback is not present when using a DNN for decision making
because a DNN is not intrinsically ’explainable.’ The information captured by a DNN
is intertwined and compressed into a value via nonlinear transformations through a
weighted sum of feature values. The compression occurs in multiple hidden layers
with different weight vectors, depending on the previous layer’s amount of activations
(Aggarwal 2018). This process makes it challenging to trace back the properties that
have driven the decision of the DNN, even when diving into the math, let alone for
average users. Due to the complexity of the decision making, an information asymmetry
situation occurs, where only a few data scientists or AI experts have knowledge about
which features resulted in a negative or positive decision (Lepri et al. 2018). The
academic field that focuses on solving the lack of explainability of certain algorithms
and the additional consequences, such as information asymmetry, is called Explainable
Artificial Intelligence (XAI) (Dosilovic, Brcic, and Hlupic 2018). Within the XAI domain,
researchers are investigating tools that create interpretable and intuitive explanations
behind the decision of the algorithm.
The relevance of XAI is further strengthened by the introduction of the General
Data Protection Regulation (GDPR) in Europe. Within multiple articles of the GDPR,
the ’right to explanation’ is mentioned. According to Article 13(2)(f) of the GDPR, this
right includes the following: The controller shall provide data subjects with: "information on
the existence of automated decision-making, including profiling, referred to in Article 22(1) and
(4) and, at least in those cases, meaningful information about the logic involved, as well as the
significance and the envisaged consequences of such processing for the data subject" (GDPR
2016). The right to explanation within the GDPR is designed to solve legal problems
that arise due to information asymmetry between the algorithm’s owner and end-user.
Information asymmetry can be created based on three forms of algorithmic opacity.
Firstly, there is intrinsic opacity, which refers to the ambiguity of algorithms in their
essence. Secondly, there is illiterate opacity, which refers to the average end-user lacking
the needed technical skills to understand the internal process of the algorithm. Lastly,
there is intentional opacity, which refers to the means that firms deploy to prevent end-
users and competitors from figuring out how the algorithm works (Lepri et al. 2018; Janssen


2019). When deploying a DNN for assessing a certain decision, all three forms of opacity
can be present. The main focus of XAI research is that the algorithm must produce infor-
mation for the user that relates characteristics of the input with its output. Therefore, it
is no surprise that there is an increased interest amongst AI academics to investigate
options for AI to propose a course of action with a justifiable explanation rather than
prescribe one (Xie et al. 2020). The unexplainable nature of a DNN is becoming a
restriction now that society is widely adopting DNNs for decision making. The
widespread adoption of DNNs, paired with the lack of explainability, can potentially
harm human well-being (Xie et al. 2020). Furthermore, with the introduction of GDPR,
explainability of algorithmic decision making has become a legal concern. A specific
subtopic of XAI is called Explainable Deep Learning, which focuses on AI driven by
Deep Neural Networks. A useful framework was proposed by Lipton (2018), shown in
Figure 1, regarding intrinsic properties that drive explainability of an algorithm. If such
a property exists within a DNN, it advocates some form of explainability. However,
many high performing DNNs do not inherently possess these properties (Xie et al.
2020).

Figure 1: Taxonomy of Explainability within Automated Decision Making by Lipton (2018)

To investigate Explainable Deep Learning, researchers have developed specific
foundational methods that potentially allow DNNs to become explainable. These meth-
ods are considered foundational since they are concepts that current research builds
upon. There are three foundational methods within Explainable Deep Learning (Xie
et al. 2020):

• Visualization methods: Methods that display an explanation by
highlighting influential features of input through scientific visualizations.
• Model Distillation: A class of methods that focuses on developing a
separate, ’white-box’ model that is trained to imitate the input-output
behavior of the DNN. The distilled ’white-box’ model is intrinsically
explainable and therefore used to identify the decision-making rules that
influence the DNN output.
• Intrinsic methods: Methods where DNNs are specifically developed to
create an explanation in parallel with its output. As a result of the DNNs
design, both the model performance and quality of explanations can be
optimized.

This study will focus on the Model Distillation method and use a specific form of
the technique called Knowledge Distillation, and apply it to DNNs in decision-critical


scenarios to generate interpretable decision trees. The reason for choosing Knowledge
Distillation is based on the lack of additional case studies that investigate the usage of
Knowledge Distillation, whilst the study of Liu, Wang, and Matwin (2018) shows
promising results.

2.2 Knowledge Distillation

Model Distillation refers to post-training explanation methods where the hidden knowl-
edge within a trained DNN is ’distilled’ into a form that is amenable for an explanation
by a user. A specific form of Model Distillation was introduced by Hinton, Vinyals,
and Dean (2015), eventually named Knowledge Distillation (Hinton, Vinyals, and Dean
2015; Gou et al. 2020). In the study of Hinton, Vinyals, and Dean (2015), the knowledge of a
DNN (teacher model) is distilled into a single smaller DNN (student model), visualized
in Figure 2. The underlying concept of Knowledge Distillation is called the Teacher-
Student architecture (Gou et al. 2020).

Figure 2: Visual representation of Knowledge Distillation based on Figure 1 of Gou et al. (2020)

When using a basic form of Knowledge Distillation, the teacher model's hard targets
are used to train the student model. However, using the soft targets for Knowledge
Distillation has proved to outperform models that only transfer hard targets to the
student model (Liu, Wang, and Matwin 2018). Hard targets (also known as one-hot
labels) only contain the information on the predicted label, while soft targets display all
the predicted probabilities across all classes. In order to avoid the loss of knowledge,
the ’matching logits’ technique is used in the work of Hinton, Vinyals, and Dean (2015).
Through applying the matching logits technique, the created soft targets in the form of
logits will preserve information when transferred to a student model. The knowledge
learned by a teacher model and transferred to a student model is called dark knowledge
(Gou et al. 2020). The study of Hinton, Vinyals, and Dean (2015) provided promising
results of Knowledge Distillation with the matching logits technique when applied to
a state of the art Automatic Speech Recognition model which consists of an ensemble
of ten DNNs. The distilled single model (60.8% accuracy) was able to nearly match


the performance of the ensemble model (61.1% accuracy). Additionally, the distilled
model outperformed a baseline model which consisted of an outdated version used by
Android voice search (58.9% accuracy) on the same data. In conclusion, Hinton, Vinyals,
and Dean (2015) reported that through Knowledge Distillation, a smaller model could
be produced from a cumbersome model without a significant drop in prediction performance.
Based on the research of Hinton, Vinyals, and Dean (2015), the application of
Knowledge Distillation was further investigated. In successive research, Knowledge
Distillation was applied as a technique to allow advanced DNNs to run on hardware
with less computational power, such as smartphones, without a significant loss in
performance (Urban et al. 2016; Zhou et al. 2017). Often, the student network was a
basic version of the teacher network with fewer layers (Lan, Zhu, and Gong 2018; Wang
et al. 2019). Depth and width are the two dimensions that determine the complexity of
a DNN, which is related to the required time and resources needed to run a DNN (Gou
et al. 2020). However, in terms of explainability, a shallow DNN can still be challenging
to explain, especially for end-users with no Machine Learning knowledge. Focusing on
the problem of explainability, the study by Liu, Wang, and Matwin (2018) took the basis
of Knowledge Distillation and managed to distill the dark knowledge of a DNN into
a comprehensible decision tree. As Liu, Wang, and Matwin (2018) correctly identified,
the extent to which explainability is present in a model highly depends on which type
of algorithm is used for generating predictions, as shown in Table 1. Decision trees are
commonly known for their similarity to human decision making, making it easier to
explain what properties have driven the decision (Waltl and Vogl 2018).

Table 1: Comparison of common Machine Learning Algorithms by Kotsiantis (2007),
extended by Waltl and Vogl (2018)

                                 Decision   Neural     Naïve                        Deductive
                                 Trees      Networks   Bayes     kNN     SVM        logic based
Accuracy                         **         ***        *         **      ***        **
Speed of Learning                ***        *          ****      ****    *          **
Speed of Classification          ****       ****       ****      *       ****       ****
Tolerance w.r.t. input           ***        **         *         *       ****       **
Transparency of the Process      ****       **         ***       **      **         ***
Transparency of the Model        ****       *          **        ***     **         ****
Transparency of Classification   ****       *          ***       ***     *          ****

Liu, Wang, and Matwin (2018) solved two challenges during their research to
successfully apply Knowledge Distillation with two different algorithms. Firstly, when
dealing with a classification task, the targets are limited to categorical values and cannot be
numerical or continuous. They tackled this issue by treating the objective as a regression
problem. Secondly, when dealing with a regression task, most algorithms can only
support single-output regressions. When using a multi-class dataset, this becomes a
multi-output regression problem. This issue was mitigated by applying the ’algorithm
adaptation’ method, where the decision tree is modified to handle multi-output datasets
simultaneously. The decision trees were built using the Classification and Regression
Trees (CART) algorithm due to CART’s ability to deal with numerical target values
(Breiman et al. 1983). In short, the key innovation of Liu, Wang, and Matwin (2018) is


that they convert the problem into a multi-output regression problem and then transfer
the regression result for classification. This enabled Liu, Wang, and Matwin (2018) to
transfer the dark knowledge of both a Convolutional Neural Network (CNN) and a
Multilayer Perceptron (MLP) network into a comprehensible decision tree. The CNN
was trained on the MNIST dataset (LeCun et al. 1998), whilst the MLP network was
trained on the Connect-4 dataset (Dua and Graff 2017). Both student model decision
trees were limited to a depth of 10, creating interpretable tree models that humans are
still able to comprehend. Additionally, in order to show the transfer of dark knowledge,
’vanilla’ decision trees were also built by using the CART algorithm. A vanilla decision
tree is a standard CART decision tree that received no distillation treatment whatsoever.
The single similarity that both the student and the vanilla decision tree have is that
they are trained on the same data features, but with different target values. Through
Knowledge Distillation, Liu, Wang, and Matwin (2018) achieved interesting results with
their distilled decision trees, as shown in Table 2.

Table 2: Test Accuracy Results by Liu, Wang, and Matwin (2018)

                    Teacher Neural     Student Decision Tree   Vanilla Decision Tree
Dataset             Network Accuracy   Accuracy (depth = 10)   Accuracy (depth = 10)
MNIST dataset       99.25%             86.55%                  84.45%
Connect-4 dataset   86.62%             73.42%                  70.44%

Based on the results, Liu, Wang, and Matwin (2018) concluded that there is a
significant improvement in accuracy for the distilled student models compared to the
vanilla decision trees. They attribute this accuracy improvement to Knowledge Distillation
and the utilization of the dark knowledge hidden in the soft predictions of the
teacher model. Most notably, Liu, Wang, and Matwin (2018) are able to use a
non-interpretable algorithm in the form of a DNN to boost the performance of an
inherently interpretable algorithm: a decision tree. However, they also report that the
effect of Knowledge Distillation is highly dataset dependent since the distillation effect
relied on the student's own generalization ability. In their Future Work paragraph, Liu,
Wang, and Matwin (2018) emphasize investigating new applications and models where
interpretable models can potentially match non-interpretable models.
Although the use of Knowledge Distillation looks promising, the study by Liu,
Wang, and Matwin (2018) does not demonstrate any practical use in a real-life setting.
The MNIST and Connect-4 datasets are commonly used to display the effectiveness
of certain Machine Learning applications but lack the characteristics of realistic datasets
that are often highly imbalanced and noisy (Veni and Rani 2018). Amongst existing lit-
erature, it is known that class imbalance can significantly negatively affect classification
performance (Zhang 2016). Therefore, this study aims to fill this research gap by inves-
tigating the use of Knowledge Distillation to generate interpretable student models in
tangible decision-critical scenarios. Thus, this study will investigate the effectiveness
of Knowledge Distillation by using three datasets with varying characteristics, all in
a practical setting. This study aims to answer the following main research question:
To what extent can Knowledge Distillation improve performance of an interpretable
student Decision Tree in decision-critical scenarios? Furthermore, this primary research
question is supplemented with sub-research questions that focus on how class imbal-
ance affects Knowledge Distillation and investigates possible solutions to combat the
class imbalance problem.


3. Methods

In order to run experiments, this study uses Multi-Layer Feedforward (MLF) neural
networks for classification as teacher models. Subsequently, the hidden knowledge
within the layers of the MLF neural networks is transferred to non-neural networks
in the form of decision trees through Knowledge Distillation. In the following section,
the algorithms and techniques used in this study are discussed.

3.1 Multi-Layer Feedforward Neural Network

The teacher models in this study are sequential Multi-Layer Feedforward (MLF)
neural networks, each varying in size and parameters. A sequential model implies
that each layer is connected directly to the next layer in the network. An MLF
neural network comprises nodes that are arranged into layers. Between the input
and the output layers reside the hidden layers of the neural network. Each node within
a specific layer is connected with all the nodes in the next layer, and each connection is
assigned a specific number called the weight coefficient. The weight given to a certain
connection reflects its degree of importance within the neural network. In order to learn the
relationship between input and output, the model uses the backpropagation algorithm.
The backpropagation algorithm consists of two phases:

• Forward Propagation. The input vector is passed through all the nodes of
the network, and an output is produced at the end of the process. During
Forward Propagation the weights inside the networks are fixed.
• Backward Propagation. The gradient of the loss function is calculated
with respect to the weights inside the neural network through
backpropagation. Subsequently, through gradient descent, the weights are
updated in the direction that minimizes the loss.

During training, the train dataset is fed to the model in epochs. During each epoch,
the MLF performs backward propagation through all the layers of the neural network.
Based on the provided loss function, the weights inside the hidden layers are adjusted
and the weights gradually converge to a local optimum. The following components
are part of the teacher MLF neural networks in this research (Svozil, Kvasnička, and
Pospíchal 1997; Chishti and Awan 2019); a minimal Keras sketch combining them is
shown after the list.

• Dense Layer. Between the input layer and the output layer of the model
the Dense Layers are positioned. The nodes within the Dense Layers are in
direct contact with the nodes in the next layer, hence the densely connected
layers are also called Fully Connected Layers. The Dense Layers perform
nonlinear transformations on the inputs that are entered into the neural
network. Subsequently, the inputs are processed through an activation
function that 'fires' if the input exceeds a certain threshold (Chollet 2018).
• ReLU. The activation function used in all three models is the nonlinear
Rectified Linear Unit (ReLU) function. Without an activation function,
the model can only learn linear representations. Due to ReLU’s
non-saturating nature, ReLU is known to speed up training time in
comparison to saturating nonlinear functions such as tanh (Krizhevsky,


Sutskever, and Hinton 2012). For a given input $x^l_{ij}$, the value after passing
through ReLU is:

$$o^l_{ij} = \max(0, x^l_{ij})$$


• Dropout Layer. A Dropout layer is included in models where overfitting
on the training data poses a problem. When applying a Dropout layer,
certain hidden nodes inside a neural network are 'dropped'. The choice of
which nodes are dropped is made at random, where parameter p is the
probability of retaining each unit. This results in a thinner network
consisting of the nodes that
survived the Dropout layer. As a result, the hidden nodes inside the neural
network are much more robust and rely less on each other and the
generalization error is much lower when new unseen data is introduced to
the model (Srivastava et al. 2014).
• Batch Normalization Layer. Batch Normalization reduces the internal
covariate shift by normalizing the hidden units inside the model. As a result,
the layer learns a more stable distribution of inputs, which reduces the effect
of overfitting (Ioffe and Szegedy 2015).
• Categorical Crossentropy. The Categorical Crossentropy loss function is
applied in all the MLF neural networks in this study. The Categorical
Crossentropy is used to generate two softmax outputs in the last layer of
the model, which is required to extract the logits for Knowledge
Distillation (Hinton, Vinyals, and Dean 2015). The Categorical
Crossentropy function is further explained in section 3.3.
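
As an illustration of how these components fit together, the following is a minimal Keras sketch, not the exact thesis architecture; the layer sizes are assumptions chosen for illustration:

```python
# A minimal sketch combining the components above: Dense layers with ReLU,
# Batch Normalization, Dropout, a 'logits' layer before the softmax output,
# and the categorical cross-entropy loss. Layer sizes are illustrative only.
import tensorflow as tf
from tensorflow.keras import layers, models

def build_teacher(n_features: int) -> tf.keras.Model:
    model = models.Sequential([
        tf.keras.Input(shape=(n_features,)),
        layers.Dense(128, activation="relu"),   # densely connected hidden layer
        layers.BatchNormalization(),            # stabilizes the input distribution
        layers.Dropout(0.5),                    # randomly drops units against overfitting
        layers.Dense(64, activation="relu"),
        layers.Dense(2, name="logits"),         # linear layer producing the logits z
        layers.Softmax(),                       # turns logits into class probabilities
    ])
    model.compile(optimizer="adam",
                  loss="categorical_crossentropy",
                  metrics=[tf.keras.metrics.AUC(name="auc")])
    return model
```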


3.2 CART Algorithm

The decision trees that receive Knowledge Distillation treatment and the vanilla deci-
sion trees are all generated with the Classification and Regression Trees (CART) algo-
rithm. The CART algorithm divides the feature space and groups occurrences with the
same labels together. At first, the root node is constructed with all training examples
$S$, with features $x_i \in \mathbb{R}^n$ for $i = 1 \ldots l$ and labels $y_i \in \mathbb{R}^l$, and the node is split into
two child nodes recursively. Subsequently, a splitting criterion is applied in the form
of $C = (a, t_n)$, where $a$ is the feature to split on and $t_n$ is the threshold at node $n$. The
criterion $C$ splits each partition $S$ into:

$$S_{left}(C) = \{(x, y) \mid x_a \le t_n\}$$

$$S_{right}(C) = S \setminus S_{left}(C)$$

For classification, the vanilla decision tree calculates the impurity at node $n$ with an
impurity function $I$. However, the student decision tree uses the Mean Squared Error to
calculate the impurity, since it is dealing with a regression task due to the targets being
logits. $I$ is then calculated as:

$$y'_n = \frac{1}{M_n} \sum_{i \in M_n} y_i$$

$$I(X_n) = \frac{1}{M_n} \sum_{i \in M_n} (y_i - y'_n)^2$$

The number of instances corresponding to the child node is $M_n$. Given $I$, the combined
impurity of both child nodes can be calculated as:

$$f(S, C) = \frac{M_{left}}{M_n} I(S_{left}(C)) + \frac{M_{right}}{M_n} I(S_{right}(C))$$

Subsequently, the parameters in $C$ are optimized by minimizing $f(S, C)$:

$$C^* = \underset{C}{\operatorname{argmin}} \, f(S, C)$$

As a result, the optimal feature and splitting threshold are found, and the algorithm
keeps splitting $S_{left}(C)$ and $S_{right}(C)$ until the given maximum depth is reached,
$M_n < min_{samples}$, or $M_n = 1$.
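
For illustration, a minimal scikit-learn sketch of the two CART variants used in this study; except for the depth limit, hyperparameters are left at their defaults:

```python
# A minimal sketch, assuming scikit-learn. The vanilla tree is CART for
# classification on hard labels; the student tree is CART for regression,
# whose MSE impurity corresponds to I(X_n) above.
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

vanilla_tree = DecisionTreeClassifier(max_depth=10)

# 'squared_error' is called 'mse' in older scikit-learn versions.
student_tree = DecisionTreeRegressor(criterion="squared_error", max_depth=10)
```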


3.3 Matching Logits

When applying Knowledge Distillation, the generalization ability of an advanced
teacher model is transferred to a simpler student model. By using the soft targets for
Knowledge Distillation, the distilled models are supposed to outperform models that
only transfer hard targets to the student model (Liu, Wang, and Matwin 2018). Hard
targets only contain the information on the predicted label, while soft targets display all
the predicted probabilities across all classes as shown in Table 3.

Table 3: Example of hard and soft targets

               Apple   Pear   Banana   Car
Hard Targets   0       1      0        0
Soft Targets   0.1     0.9    10^-5    10^-9

To give an example based on Table 3, we observe that the soft target probabilities
of ’Banana’ and ’Car’ are much smaller than the probabilities of ’Apple’ and ’Pear’. In
other words, the soft targets hold the information that an 'Apple' is more similar to a
'Pear' than to a 'Banana' or a 'Car'. The probabilities contain information about
similarity structure within the data, that is useful for the student model to know for
boosting performance. The model is being encouraged to produce more meaningful
predictions based on the knowledge of wrong predictions given by the soft targets. As
a result, Knowledge Distillation is used as a regularization technique to increase perfor-
mance (Bagherinezhad 2020). Unfortunately, the small probabilities in the soft targets
will vanish to zero when applied to a cross-entropy loss function in the student model.
The vanishing of the probabilities will consecutively result in a loss of knowledge (Liu,
Wang, and Matwin 2018). To illustrate, the last layer before the softmax layer in an MLF
neural network is a Dense layer with logits $z$ as the output:

$$z_i = \sum_{j} W_{ij}\, o^{l-1}_{j} + b_i$$

In this case, $z_i$ is the logit for one of the labels $i$. The number of hidden nodes is $j$ for
layer $l - 1$, where $W$ are the weights, $o^{l-1}_{j}$ the activations of layer $l - 1$, and $b$ the bias.
Up next, the softmax layer calculates the output probabilities for each class as:

$$q_i = \frac{e^{z_i}}{\sum_j e^{z_j}}$$

Subsequently, the categorical cross-entropy function is applied to calculate the loss of
the model:

$$H_p(q) = -\sum_i p_i \log(q_i)$$


Thus, to prevent the loss of information it is preferable to use the logits z instead of the
predicted probabilities q. This technique is called matching logits, which is based on
the work by Hinton, Vinyals, and Dean (2015). Through applying the matching logits
technique, the created soft targets in the form of logits will preserve information when
transferred to a student model. Usually, the student model cannot exactly match the soft
targets provided, but is guided into the right direction which gives better performance
results (Liu, Wang, and Matwin 2018).
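
To make this concrete, a small numerical sketch with made-up values in the spirit of Table 3: the softmax probabilities of unlikely classes are nearly zero and therefore vanish in the cross-entropy loss, whereas the underlying logits keep the similarity structure on a usable scale:

```python
# Illustrative only: hypothetical teacher logits, roughly matching Table 3.
import numpy as np

def softmax(z: np.ndarray) -> np.ndarray:
    e = np.exp(z - z.max())      # subtract the max for numerical stability
    return e / e.sum()

logits = np.array([10.0, 12.2, -1.5, -10.7])  # Apple, Pear, Banana, Car
print(softmax(logits))   # approx. [0.1, 0.9, 1e-6, 1e-10]: tiny values vanish
print(logits)            # the logits keep the class similarities visible
```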

3.4 Distillation process

The aforementioned matching logits technique is applied when distilling an MLF neural
network into a decision tree. The architecture used is displayed in Figure 3, where an
MLF neural network is distilled into a CART decision tree. The MLF Neural Network
displayed in Figure 3 is based on the MLF Neural Network used for the Home Credit
Default Risk dataset.

Figure 3: Distillation architecture of this study


When the MLF neural network is trained, feature part $X$ of the original training
dataset is fed to the trained MLF neural network to extract the logits $Z$. The logits are
obtained by removing the last softmax layer of the trained MLF neural network.
Successively, the CART algorithm is trained with feature part $X$, where logits $Z$ are used
as the targets. When using logits $Z$ as continuous targets, the CART algorithm is used
for regression instead of classification. The regression data used should have features $X$
with $x_i \in \mathbb{R}^n$ for $i = 1 \ldots l$ and labels in the form of $Z$ with $z_i \in \mathbb{R}^l$ for $i = 1 \ldots l$. The
impurity function (Mean Squared Error) in CART is calculated as:

$$z'_n = \frac{1}{M_n} \sum_{i \in M_n} z_i$$

$$I(X_n) = \frac{1}{M_n} \sum_{i \in M_n} (z_i - z'_n)^2$$

When CART is trained, the final prediction $k$ for class $i$ is obtained by applying the
softmax layer $O_i$ on the predictions, as displayed below. This results in the continuous
test predictions becoming categorical predictions, and metric scores can be calculated in
comparison to the correct labels.

$$O_i = \frac{e^{k_i}}{\sum_j e^{k_j}}$$
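
A minimal sketch of this distillation pipeline, assuming a trained Keras teacher whose penultimate Dense layer is named 'logits' (as in the models of section 4.2):

```python
# A minimal sketch of the distillation process described above.
import numpy as np
import tensorflow as tf
from sklearn.tree import DecisionTreeRegressor

def distill(teacher: tf.keras.Model, X_train: np.ndarray, max_depth: int = 10):
    # Remove the softmax layer by reading the 'logits' layer output directly.
    logit_model = tf.keras.Model(inputs=teacher.input,
                                 outputs=teacher.get_layer("logits").output)
    Z = logit_model.predict(X_train)          # logits Z become regression targets
    student = DecisionTreeRegressor(max_depth=max_depth)
    student.fit(X_train, Z)                   # CART trained on (X, Z)
    return student

def predict_classes(student, X_test: np.ndarray) -> np.ndarray:
    k = student.predict(X_test)               # continuous logit predictions k
    probs = tf.nn.softmax(k, axis=1).numpy()  # softmax O_i on the predictions
    return probs.argmax(axis=1)               # categorical test predictions
```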


4. Experimental Setup

During this study, experiments are performed on three binary classification datasets
in decision-critical scenarios. These datasets are selected to demonstrate the effect of
Knowledge Distillation with various characteristics and with the usage of different
sampling techniques. All the models built in this study are made using Keras due to
its ease of use and scalability. Additionally, all models are trained on Google's server-
based GPUs offered by Google Colaboratory to speed up the training process. All the
notebooks used in this study are available online on Github (Vosters 2020).

4.1 Datasets

The three datasets used in this study are the Home Credit Default Risk Dataset (Kaggle
2018), the Default of Credit Card Clients Dataset (Yeh and Lien 2009) and the Breast
Cancer Wisconsin Diagnostic Dataset (Wolberg, Street, and Mangasarian 1995). All three
datasets are set in a decision-critical scenario, where the potential output can have
significant consequences for the end-user. To evaluate model performance, all datasets
are split using Scikitlearn's 'train_test_split' function in an 85-15 percent ratio.
Furthermore, all continuous features are scaled using Scikitlearn's 'Standard
Scaler' and 'MinMax Scaler' functions to improve the speed of gradient descent and to
make all the inputs comparable in range (Ioffe and Szegedy 2015). Excluding the Breast
Cancer Wisconsin Diagnostic Dataset (Figure 4a), both the Home Credit Default Risk
Dataset (Figure 4b) and the Default of Credit Card Clients Dataset (Figure 4c) suffer
from significant class imbalance.

Figure 4: Label occurrence of each dataset: (a) Breast Cancer Wisconsin Diagnostic
Dataset, (b) Home Credit Default Risk Dataset, (c) Default of Credit Card Clients Dataset
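
As a sketch of the splitting and scaling described above (variable names and the stand-in data are assumptions; the scalers are fitted on the training split only):

```python
# A minimal sketch with stand-in data; the actual study uses the three
# datasets described below.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=1000, weights=[0.9], random_state=0)

# 85-15 percent train-test split.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15,
                                                    random_state=42)

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)   # fit the scaler on the training data
X_test = scaler.transform(X_test)         # apply the same scaling to the test data
```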


Therefore, two different sampling techniques are applied to the datasets to overcome
the problem of the model having low predictive accuracy on the minority class, and
to validate their effectiveness when conducting Knowledge Distillation. In order to
make comparisons and analyze the effect of class imbalance, the imbalanced default
datasets are also used. However, the neural networks are trained with pre-set class
weights. During training the minority class is given a higher weight compared to
the majority class. The model is then penalized with additional loss when giving a
wrong classification of the minority class. This ’penalty’ can bias the model to increase
its attention to the minority class, and provide better classification results. The class
weights are generated by using the ’class_weight’ module from Scikitlearn. The two
sampling techniques used in this study are SMOTE oversampling and undersampling.
In case a dataset was SMOTE oversampled or undersampled, the splitting of the dataset
was conducted before the sampling to prevent data leakage (Becker 2016). When us-
ing Synthetic Minority Oversampling Technique (SMOTE), the parameter k is set to
k = 5. SMOTE creates synthetic observations based on the k-nearest neighbours of
the minority class of similar observations. SMOTE then chooses one of the k-nearest
neighbours, and randomly tweaks it to generate a new observation (Fernández et al.
2018). Through applying SMOTE, a 50-50 percent class balance ratio within the data
is achieved. When downsampling the dataset, the dataset is randomly shuffled and
subsequently separated based on the classes. Given the size of the minority class, the
equivalent amount of data entries are randomly picked from the majority class until a
50-50 percent ratio is reached.
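
A minimal sketch of the three imbalance treatments described above, assuming the imbalanced-learn library for the sampling and the training split from the previous sketch:

```python
# Sampling is applied to the training split only, after splitting, to prevent
# data leakage.
import numpy as np
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from sklearn.utils import class_weight

# SMOTE oversampling with k = 5 nearest neighbours, up to a 50-50 class ratio.
X_sm, y_sm = SMOTE(k_neighbors=5).fit_resample(X_train, y_train)

# Undersampling: randomly pick majority entries until a 50-50 ratio is reached.
X_us, y_us = RandomUnderSampler().fit_resample(X_train, y_train)

# Pre-set class weights for the default (imbalanced) dataset: the minority
# class gets a higher weight, penalizing its misclassification more heavily.
weights = class_weight.compute_class_weight(class_weight="balanced",
                                            classes=np.unique(y_train),
                                            y=y_train)
class_weights = dict(enumerate(weights))  # later passed to Keras model.fit(...)
```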

4.1.1 Home Credit Default Risk Dataset. The Home Credit Default Risk Dataset used
in this study is a merged dataset that comprises 7 individual datasets, joined based
on each entry's identification number. These datasets are obtained from Kaggle, where
the Home Credit company set up a Machine Learning competition in the late summer of
2018. Home Credit aimed to obtain successful submissions from participants to investigate
various statistical models that could help them to predict their clients' repayment
abilities. The models built for this decision-critical dataset could potentially decide
whether a person will get a mortgage or not. Unfortunately, there is no information
provided on how the data was obtained (Kaggle 2018). Nonetheless, each individual
dataset contains information about clients that can potentially be valuable to assess
the chance of mortgage default. An overview of the individual datasets can be seen in
Appendix A. Due to the large size of the dataset, a full feature list is not included in this
document. However, a feature description list can be downloaded from Kaggle (Kaggle
2018). The Home Credit Dataset lacks extensive research regarding the effectiveness of
neural networks on the dataset. Nonetheless, this dataset is explicitly chosen to assess
the effectiveness of Knowledge Distillation when dealing with a sizeable imbalanced
dataset in a finance setting. The dimensions of the Home Credit Default Risk Datasets
are found in Table 4 below. The merged dataset was obtained from Kaggle based on the
public notebook provided by user James Shepherd (Shepherd 2018). In the merging
process, Shepherd conducted extensive feature engineering by adding new features
such as Loan/Income ratio and Annuity/Income ratio. Given Shepherd's dataset, all
categorical features were label encoded using Scikitlearn's 'Label Encoder' package, for
the model to handle categorical features. Furthermore, feature columns consisting of
over 80% empty values are dropped from the dataset due to the potential noise they
would introduce when applying a specific imputation technique, reducing the dataset
to 368 features. Due to computational constraints given the size of the dataset, the
remaining missing values are filled with their respective feature mean value.
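
A minimal pandas sketch of this preprocessing, with the thresholds as described; the merged dataframe itself is assumed:

```python
# A minimal sketch: label-encode categoricals, drop columns with more than
# 80% missing values, and mean-impute the remaining gaps.
import pandas as pd
from sklearn.preprocessing import LabelEncoder

def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    # Label encode all categorical (object-typed) feature columns.
    for col in df.select_dtypes(include="object").columns:
        df[col] = LabelEncoder().fit_transform(df[col].astype(str))
    # Drop feature columns with more than 80% empty values.
    df = df.loc[:, df.isna().mean() <= 0.80]
    # Fill the remaining missing values with the respective feature mean.
    return df.fillna(df.mean())
```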


Table 4: Dimensions of Home Credit Default Risk Dataset

Dataset                                              #Features   #Train set (85%)   #Test set (15%)
Home Credit Default Risk Dataset Default             368         261,384            46,127
Home Credit Default Risk Dataset SMOTE Oversampled   368         480,552            84,820
Home Credit Default Risk Dataset Undersampled        368         42,202             7,448

4.1.2 Default of Credit Card Clients Dataset. The Default of Credit Card Clients Dataset
is obtained from the Machine Learning Repository website. The dataset consists of
information about credit card holders from an anonymous Taiwanese bank collected
in 2005 (Yeh and Lien 2009). All the features of the dataset are shown in Appendix B.
The models used on this decision-critical dataset can be used to determine whether a
person will receive credit from a bank. This dataset is explicitly chosen since it requires
no missing data imputation, and the study by Yeh and Lien (2009) proved that neural
networks perform best at estimating the probability of default compared to other
common Machine Learning techniques. Furthermore, this
dataset has class imbalance and therefore, this dataset is also SMOTE oversampled and
undersampled. The dimensions of the different datasets are shown in Table 5.

Table 5: Dimensions of Default of Credit Card Clients Dataset

Dataset                                                    #Features   #Train set (85%)   #Test set (15%)
Default of Credit Card Clients Dataset Default             23          25,500             4,500
Default of Credit Card Clients Dataset SMOTE Oversampled   23          39,734             6,994
Default of Credit Card Clients Dataset Undersampled        23          11,280             1,991


4.1.3 Breast Cancer Wisconsin Diagnostic Dataset. The Breast Cancer Wisconsin Di-
agnostic Dataset is generated based on digitized images of a cell sample drawn from
a breast mass and analyzed under a microscope. This decision-critical dataset can
conceivably be employed to train models to give a diagnosis for cancer treatment. The
features, as displayed in Appendix C, describe the characteristics of the cell nuclei present
in the image (Street, Wolberg, and Mangasarian 1993). This dataset is chosen because
it does not have significant class imbalance, which is required to draw conclusions on
the effect of class imbalance and Knowledge Distillation in decision-critical scenarios.
Furthermore, existing research proved that both neural networks and decision trees
could easily grasp the dataset due to its partly linear nature (Mangasarian, Street, and
Wolberg 1994). Therefore, the Breast Cancer Dataset is also used to verify the validity of
the Knowledge Distillation pipeline used in this study. The dimensions of the dataset
are shown in Table 6.

Table 6: Dimensions of Breast Cancer Wisconsin Diagnostic Dataset

Dataset                                      #Features   #Train set (85%)   #Test set (15%)
Breast Cancer Wisconsin Diagnostic Default   30          483                86

4.2 Models

During each of the three models' optimization process, 15% of the training set is used
as a validation set to select the best performing model with different parameters. This
split is done using the built-in 'validation_split' argument of Keras's 'model.fit'
function. The optimization process for each model was done incrementally by conducting
an extensive grid search, tweaking various parameters and adding or removing
layers. The models reported in this section were eventually selected as best performing
based on the used validation set. All three models start with an input layer, that is
given the dimensions of the input data. Furthermore, at the end of the model, a softmax
output layer is positioned to match the required output labels for classification. Before
each output layer, an extra Dense layer is inserted to extract the logits for Knowledge
Distillation. Although we are dealing with a binary classification problem, the categor-
ical cross-entropy loss function is applied to all the neural networks in this study to
facilitate Knowledge Distillation. Throughout this study, the evaluation metric of choice
is the AUC-ROC curve, or AUC in short. AUC stands for Area Under the Curve and
represents the degree of separability between classes. The AUC is calculated as the
area under the entire Receiver Operating Characteristic curve, which plots the True
Positive rate against the False Positive rate at different classification thresholds. The
AUC is a common metric when dealing with binary classification problems in existing
literature since it incorporates how much a model is capable of distinguishing between
classes (Mandrekar 2010; Moi et al. 2018). The degree of separability is highly important
in decision-critical scenarios, where giving the wrong prediction to an end-user can
have big consequences. Lastly, to save time during training and prevent the model
from eventually overfitting, the Keras EarlyStopping module is used to keep track of
improvement on the validation AUC with parameter patience set to 5 epochs (Song
et al. 2019). If the validation AUC does not significantly improve over 5 epochs, the
model stops training and the best performing model is subsequently saved. Therefore,


the number of epochs is set to an arbitrary value of 100 for all models, since this limit
was never exceeded during training.
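
A minimal sketch of this training setup, assuming the teacher builder from the sketch in section 3.1 and the class weights computed earlier:

```python
# Validation split, AUC as the monitored metric, and early stopping with a
# patience of 5 epochs, as described above.
import tensorflow as tf

early_stopping = tf.keras.callbacks.EarlyStopping(
    monitor="val_auc",             # track AUC on the 15% validation split
    patience=5,                    # stop after 5 epochs without improvement
    mode="max",                    # a higher AUC is better
    restore_best_weights=True)     # keep the best performing model

y_train_onehot = tf.keras.utils.to_categorical(y_train, num_classes=2)

model = build_teacher(n_features=X_train.shape[1])
history = model.fit(
    X_train, y_train_onehot,       # one-hot labels for categorical cross-entropy
    validation_split=0.15,
    epochs=100,                    # upper bound; never reached due to early stopping
    batch_size=64,
    class_weight=class_weights,    # only for the default (imbalanced) datasets
    callbacks=[early_stopping])
```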

4.2.1 Home Credit Default Risk Model. For the Home Credit Default Risk Dataset,
a deep Multi-layer Feedforward neural network is built from scratch consisting of
five Dense layers, two Batch Normalization layers and one Dropout layer. The exact
parameters are shown in Figure 5 and were tuned based on the validation AUC
performance on the default dataset. The last Dense layer, named 'logits', is given two
hidden nodes to match the output and to obtain the logits for Knowledge Distillation. Early
in the optimization process, it became evident that the model suffered from overfitting.
Therefore, multiple regularization methods were applied to the model, including L1
and L2 regularization, Dropout layers, and Batch Normalization layers (Goodfellow,
Bengio, and Courville 2016). Based on validation set performance, eventually two Batch
Normalization layers were added, together with one Dropout layer with the dropout rate
set to 0.5, which proved to give better generalization results. Additionally, multiple acti-
vation functions have been tested such as tanh and sigmoid, but ReLU provided the
best overall performance. The optimizer that proved to give the best performance is the
Adam optimizer, whilst also improving the speed of learning (Kingma and Ba 2014).
Furthermore, using small batches of 64 during training helped to improve performance
slightly.

Figure 5: Overview of the MLF neural network for the Home Credit Default Risk
Dataset


4.2.2 Default of Credit Card Clients Model. The MLF neural network for the Default
of Credit Card Clients consists of four Dense Layers, with the parameters shown in Fig-
ure 6. This model is based on the work of Hussain (2018) and slightly tweaked to meet
the requirements for Knowledge Distillation in the last Dense layer. During optimiza-
tion, the validation AUC is monitored to check performance improvements. Adding an
additional Batch Normalization layer did improve the model’s performance slightly.
As with the previous model, the ReLU activation function and Adam optimizer
outperformed other common activation and optimization functions. Additionally, the
batch size is set to a small value of 64, since this improved performance by a few
percent.

Figure 6: Overview of the MLF neural network for the Default of Credit Card Clients
Dataset


4.2.3 Breast Cancer Wisconsin Diagnostic Model. The model for the Breast Cancer Wis-
consin Diagnostic Dataset consists of 4 Dense layers and one Dropout layer displayed
in Figure 7. This model was based on the work of Sekaran, Mouli, and Ramalingam
(2018), and altered with an extra Dense layer to make it fit for Knowledge Distillation.
Additionally, the model seemed to suffer somewhat from overfitting, so an extra Dropout
layer was added. Once again, the ReLU activation function and Adam optimizer
outperformed other activation and optimization functions on this dataset. The batch
size for this model was set to 10, which improved performance slightly.

Figure 7: Overview of the MLF neural network for Breast Cancer Wisconsin Diagnostic
Dataset


5. Results

In the following section, test AUC performance of the teacher models and the subse-
quent results of Knowledge Distillation are reported. For each classification task, two
tables and a figure with results are reported. Within each table, the highest performance
is shown in bold. The first table, named Test AUC Results Teacher Models, contains the
performance of the MLF neural network and a baseline model for each version of
the dataset. Excluding the Breast Cancer Wisconsin Dataset, the datasets are SMOTE
oversampled, undersampled, or left default, in which case the model was trained with
pre-set class weights. This yields three performances, where each individual dataset is shown in
the ’Dataset’ column of the table. The baseline model AUC performance, shown in
column ’AUC_Baseline’, is reported in the form of a Logistic Regression model by using
Scikitlearn’s ’LogisticRegression’ module, to validate the performance of the neural
networks. The parameters of the Logistic Regression model are left default according
to the Scikitlearn module. Lastly, the performance of the teacher MLF neural network is
reported in the ’AUC_Teacher’ column of the table. The second table, named Test AUC
Results of Knowledge Distillation, contains the performance of the vanilla decision
trees and the student decision trees that received Knowledge Distillation. The columns
contain the AUC performance of the vanilla decision trees (AUC_VanillaTree) and
student decision trees (AUC_StudentTree) with each depth specified between brackets.
The vanilla decision trees generated in this study are built using Scikitlearn's
'DecisionTreeClassifier' module and did not receive any Knowledge Distillation
treatment. Additionally, to maintain the explainable nature of the decision
trees, the maximum tree depths are varied between 5 and 10 for both the vanilla and
the student trees. The reasoning for limiting the tree depth to 10 is based on the limit
of human cognition to comprehend a decision tree beyond a depth of 10 (Liu, Wang,
and Matwin 2018). All other parameters are left at their default values in accordance with the
respective Scikitlearn module. For additional visualization, a bar chart figure is added
where the representation for each specific bar is specified in the legend below.
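
A minimal sketch of how the vanilla-versus-student comparison can be computed (variable names assumed; the study's exact notebooks are on Github):

```python
# Test AUC for a vanilla tree and a distilled student tree, both depth-limited.
import tensorflow as tf
from sklearn.metrics import roc_auc_score
from sklearn.tree import DecisionTreeClassifier

vanilla = DecisionTreeClassifier(max_depth=10).fit(X_train, y_train)
vanilla_auc = roc_auc_score(y_test, vanilla.predict_proba(X_test)[:, 1])

student = distill(teacher, X_train, max_depth=10)   # see the section 3.4 sketch
student_probs = tf.nn.softmax(student.predict(X_test), axis=1).numpy()
student_auc = roc_auc_score(y_test, student_probs[:, 1])
```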


5.1 Home Credit Default Dataset

Based on Table 7, we observe that the teacher neural network using the default dataset
performs with an AUC of 0.8607, indicating that the model has an 86% chance of
distinguishing a creditworthy customer from a non-creditworthy customer. The SMOTE
oversampled dataset improves the teacher model performance by 4%, resulting in
the highest performance of 0.9012 AUC. The teacher model with the undersampled
dataset performs the worst, with an AUC of 0.7428. Additionally, it is clear that the
baseline Logistic Regression struggles with all datasets, since it achieves poor AUC
scores between 0.5 and 0.7, a range that indicates poor discrimination, barely better
than a coin flip (Hosmer and Lemeshow 2000). Analyzing Table 8 and Figure 8,
the results show mixed Knowledge Distillation
effects. On the default dataset, a performance increase due to Knowledge Distillation is
observed between 9% and 11% for each respective depth. However, most likely because
of class imbalance, both the vanilla and the student models cannot achieve an AUC
comparable to that of their teacher model. Unfortunately, the proposed
solutions for class imbalance, SMOTE oversampling and undersampling, are not able
to adequately fix this issue. Looking at the SMOTE oversampling results, although
the overall performance of both the vanilla decision trees and the student decision
trees increases to around 0.8 AUC, the student decision trees are outperformed by the
vanilla decision trees by 4% and 6%. This unexpected negative result of Knowledge
Distillation is potentially caused by the teacher model not being powerful enough to
transfer the needed hidden knowledge to boost the performance of the student since
the vanilla decision trees already perform decently. Lastly, the results on the undersampled
dataset display both a negative and a positive Knowledge Distillation boost for each
depth respectively, but the differences are too small to be called significant. These inconsistent
results imply that undersampling is not able to effectively address the class imbalance
problem. Furthermore, the overall performances on the undersampled dataset are not
good enough to be useful in a real-life setting (< 0.7 AUC).

Table 7: Test AUC Results Teacher Models


Dataset                                              AUC_Baseline            AUC_Teacher
                                                     (Logistic Regression)   (MLF Neural Network)
Home Credit Default Risk Dataset Default             0.5095                  0.8607
Home Credit Default Risk Dataset SMOTE Oversampled   0.5075                  0.9012
Home Credit Default Risk Dataset Undersampled        0.6959                  0.7428


Table 8: Test AUC Results of Knowledge Distillation


                                                     AUC_Vanilla   AUC_Student   AUC_Vanilla    AUC_Student
Dataset                                              Tree          Tree          Tree           Tree
                                                     (depth = 5)   (depth = 5)   (depth = 10)   (depth = 10)
Home Credit Default Risk Dataset Default             0.5           0.5972        0.5119         0.6239
Home Credit Default Risk Dataset SMOTE Oversampled   0.7859        0.7492        0.8669         0.8029
Home Credit Default Risk Dataset Undersampled        0.656         0.6462        0.6363         0.6545

Figure 8: Knowledge Distillation Results of the Home Credit Default Risk Dataset


5.2 Default of Credit Card Clients Dataset

When examining Table 9, we observe that the default version of the Default of Credit
Card Clients Dataset, trained on an MLF neural network with pre-set weights, outper-
forms the other teacher models with an AUC of 0.876. This performance implies that
the model has an 87.6% chance of distinguishing a client who will run into payment
issues from one who will not. Neither oversampling (0.7515) nor undersampling
(0.7073) the dataset improves the performance of the teacher model compared to the
default dataset. In line with expectations, the baseline model is outperformed by each
neural network. Examining Table 10 and Figure 9, we conclude that Knowledge
Distillation improves the performance of all the student models across the board, with
boosts varying between 2% and 13%. Although the performance increases are higher on
the oversampled and undersampled datasets than on the default dataset, the overall
performances of both the vanilla and the student trees are not great, with no tree
achieving an AUC higher than 0.7. Once again, it seems that the class imbalance of the
dataset limits the effect of Knowledge Distillation. Moreover, the proposed sampling
techniques are not able to fix this issue properly.

Table 9: Test AUC Results Teacher Models


Dataset                                                     AUC_Baseline            AUC_Teacher
                                                            (Logistic Regression)   (MLF Neural Network)
Default of Credit Card Clients Dataset, Default             0.6389                  0.8760
Default of Credit Card Clients Dataset, SMOTE Oversampled   0.6537                  0.7515
Default of Credit Card Clients Dataset, Undersampled        0.6189                  0.7073

Table 10: Test AUC Results of Knowledge Distillation


Dataset                                                     AUC_Vanilla Tree   AUC_Student Tree   AUC_Vanilla Tree   AUC_Student Tree
                                                            (depth = 5)        (depth = 5)        (depth = 10)       (depth = 10)
Default of Credit Card Clients Dataset, Default             0.6325             0.6586             0.5740             0.6625
Default of Credit Card Clients Dataset, SMOTE Oversampled   0.6169             0.6803             0.5535             0.6885
Default of Credit Card Clients Dataset, Undersampled        0.6107             0.6819             0.5607             0.6801


Figure 9: Knowledge Distillation Results of the Default of Credit Card Clients Dataset


5.3 Breast Cancer Wisconsin Diagnostic Dataset

Lastly, Table 11 shows the results of the default Breast Cancer Wisconsin Diagnostic
Dataset. The MLF neural network (0.9996) outperforms an already solid performing
Logistic Regression baseline model (0.9537) by about 4%. Furthermore, Table 12 shows
that Knowledge Distillation can significantly improve the performance of the student
model: at a depth of 5, the student model outperforms the vanilla model by 3%, and at
a depth of 10, by almost 6%. With a structured and balanced dataset, containing a data
representation that can be learned adequately by both neural networks and decision
trees, the full potential of Knowledge Distillation can be exploited to achieve high
performance with interpretable decision trees, as shown in Table 12 and Figure 10. The
best performing student decision tree has a 93% chance of distinguishing cancerous
from non-cancerous cells, whilst being limited to a depth of 10 and thus being
inherently interpretable.
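
To make the distillation step itself concrete, the sketch below shows one common
formulation (in the spirit of Liu, Wang, and Matwin 2018): the teacher's soft
predictions on the training data become the targets for a depth-limited student tree.
The scikit-learn MLP used here is only a stand-in for the thesis' Keras MLF network,
and all names are illustrative:

from sklearn.datasets import load_breast_cancer
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeRegressor

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# Teacher: a small neural network (stand-in for the MLF teacher network).
teacher = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=1000, random_state=0),
).fit(X_tr, y_tr)
soft_labels = teacher.predict_proba(X_tr)[:, 1]  # the teacher's soft predictions

# Student: an interpretable, depth-limited tree that regresses on the teacher's
# probabilities instead of the hard ground-truth labels.
student = DecisionTreeRegressor(max_depth=10, random_state=0).fit(X_tr, soft_labels)

# The student's continuous outputs can be scored directly with AUC.
print("Student test AUC:", roc_auc_score(y_te, student.predict(X_te)))

An alternative is to threshold the soft labels and fit a classifier on the teacher's
hard predictions; regressing on the probabilities preserves more of the teacher's
soft, graded knowledge.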

Table 11: Test AUC Results Teacher Models


Dataset                                               AUC_Baseline            AUC_Teacher
                                                      (Logistic Regression)   (MLF Neural Network)
Breast Cancer Wisconsin Diagnostic Dataset, Default   0.9537                  0.9996

Table 12: Test AUC Results of Knowledge Distillation


Dataset                                               AUC_Vanilla Tree   AUC_Student Tree   AUC_Vanilla Tree   AUC_Student Tree
                                                      (depth = 5)        (depth = 5)        (depth = 10)       (depth = 10)
Breast Cancer Wisconsin Diagnostic Dataset, Default   0.8825             0.9132             0.8733             0.9317

Figure 10: Knowledge Distillation Results of the Breast Cancer Wisconsin Diagnostic Dataset


6. Discussion

This study investigates the usage of Knowledge Distillation with teacher Deep Neural
Networks and student decision trees, on three new decision-critical applications. A
decision-critical scenario is a real-life setting where the model's output can drive a
decision with a significant impact on human life (Grigorescu et al. 2020). In order to
set a clear goal for this research, the following main research question is answered:
to what extent can Knowledge Distillation improve the performance of an interpretable
student decision tree in decision-critical scenarios? Datasets in decision-critical
scenarios often suffer from class imbalance, since one of the classes occurs with a
much smaller probability in reality, and existing literature shows that class imbalance
can negatively affect classification performance (Zheng and Jin 2020). Therefore, a
sub-research question is included to investigate this effect: how does class imbalance
affect Knowledge Distillation? Finally, this study proposes two potential solutions to
the class imbalance problem by answering the final sub-question: what is the most
effective technique to address class imbalance when applying Knowledge Distillation?
Firstly, to answer the main research question, Knowledge Distillation was applied
to three datasets set in decision-critical scenarios. Based on the performance results,
this study shows that Knowledge Distillation is capable of boosting the performance
of interpretable student decision trees in decision-critical scenarios to a certain extent.
On the one hand, the results of this study show that Knowledge Distillation can signif-
icantly improve the student model's performance. The teacher neural networks trained
on the default datasets performed well, between 0.86 and 0.99 AUC, and most student
models were able to outperform their non-distillation counterparts, with performance
increases ranging between 2% and 13% AUC. Most notably, the Breast Cancer Wisconsin model displays a
near-perfect instance where Knowledge Distillation contributes a performance increase
of 6%, resulting in a high performing student decision tree (0.9317) that is inherently
interpretable with a depth of 10. However, the effect of Knowledge Distillation depends
highly on the student decision tree's own performance on the data, which is consistent
with the findings of Liu, Wang, and Matwin (2018). Knowledge Distillation proved not
to be consistently powerful enough to boost the student model to an acceptable AUC
score (>0.7) on the other two datasets. Therefore, the effect of Knowledge Distillation
is limited, since low AUC scores are not useful in decision-critical scenarios where a
wrong output can have severe consequences.
Secondly, a comparison of the results between the two imbalanced datasets and the
balanced dataset indicates that class imbalance negatively affects Knowledge Distilla-
tion. Only the model that uses the balanced Breast Cancer Wisconsin Dataset displays
solid Knowledge Distillation results across the board; the two models that use
imbalanced datasets are not able to reach a reasonable AUC score above 0.7 even with
the Knowledge Distillation treatment, making the results highly dataset dependent.
Given this outcome, we can assume that class imbalance contributes negatively to the
effect of Knowledge Distillation. This finding is in line with expectations and with
existing research on the effect of class imbalance on classification tasks (Zheng and
Jin 2020).
Thirdly, this study investigated techniques to combat the class imbalance problem by
applying SMOTE oversampling and undersampling to the given imbalanced datasets. The
results showed that it is premature to conclude which sampling technique is most
effective at addressing the class imbalance
problem when conducting Knowledge Distillation. The results on the Home Credit
Default Risk Dataset are not consistent enough to draw adequate conclusions. For the
Default of Credit Card Clients Dataset, the SMOTE oversampling and undersampling
techniques show almost identical performance increases when using Knowledge
Distillation, but not nearly enough to match the performance of the teacher model.
Once again, the produced student models do not perform well enough (<0.7 AUC) to be
used in real-life applications and are therefore of little practical use. As a result,
this study cannot conclude which technique is most effective at addressing the class
imbalance problem.
The theoretical implication for the academic domain of Explainable Deep Learning
is that Knowledge Distillation works best on well structured and balanced datasets since
the effect of Knowledge Distillation is severely limited by the student model’s own
generalization ability on the given data. The practical implication for businesses and
other domains that deploy algorithms in decision-critical scenarios is that Knowledge
Distillation might not be the ’silver bullet’ to fulfill the need for high performing
interpretable models. Within decision-critical domains, the luxury of a balanced dataset
is not always present, and therefore the effect of Knowledge Distillation remains limited
according to this study.
A primary limitation of this study is the limited access to comparable, structured,
balanced datasets that are relatively equal in size and within the scope of this study's
decision-critical approach. Therefore, the comparisons and additional findings regard-
ing class imbalance might also depend on other specific characteristics of the datasets,
such as the degree of linearity within the data. A possible solution could be to analyze multiple
other dataset characteristics that potentially influence the effect of Knowledge Distilla-
tion and subsequently choose datasets accordingly. Furthermore, this study focuses on
just two techniques to fix the class imbalance problem, leaving other existing techniques
untouched.


7. Conclusion

This study concludes that Knowledge Distillation can improve the performance of
an interpretable student decision tree in decision-critical scenarios to a certain extent.
This answer is found by providing an extensive analysis of Knowledge Distillation
on three varying datasets in decision-critical scenarios. On the one hand, the results
of this study show that Knowledge Distillation can improve the performance of the
student model. On the other hand, the effect of Knowledge Distillation highly depends
on the student decision tree’s own performance on the data. As a result, Knowledge
Distillation proved not to be consistently powerful enough to boost the student model
to an acceptable AUC score (>0.7). Consequently, the ability of Knowledge Distillation
to produce high performing and interpretable student models remains limited. In
decision-critical scenarios where the outcome can be life-changing, high performance
ultimately takes precedence, and the produced student models are therefore of little
use in real life. Furthermore, this study investigated how class imbalance affects
Knowledge Distillation. Through dataset comparisons, this study showed that class
imbalance negatively affects Knowledge Distillation: only with the balanced Breast
Cancer Wisconsin Dataset could this study produce a high performing and interpretable
student decision tree (0.93 AUC). Given this result, this study investigated sampling
techniques that could potentially address class imbalance and examined whether they
improved the Knowledge Distillation results. Due to inconsistent results with both
SMOTE oversampling and undersampling, no definitive answer can be given to this
sub-question.
The broader implication in the academic domain of Explainable Deep Learning is
that Knowledge Distillation works best on structured, balanced datasets since the effect
of Knowledge Distillation is severely limited by the student model’s own generalization
ability. The practical implication of this finding is that Knowledge Distillation might not
be the ’silver bullet’ to fulfill the need for high performing interpretable models. Within
decision-critical domains, the luxury of a balanced dataset is not always present, and
therefore the effect of Knowledge Distillation remains limited. However, there are sev-
eral directions for future work to take. For instance, it would be valuable to investigate
further what dataset characteristics are most suitable for effective Knowledge Distilla-
tion to produce interpretable and high performing student decision trees. Furthermore,
this study was limited in terms of using only two sampling techniques to solve the
problem of class imbalance. It would be interesting to investigate if other techniques can
effectively resolve class imbalance to produce useful Knowledge Distillation results.


8. Acknowledgements

I would like to thank Sebastian Olier for assistance and insightful discussions during
the process of writing this thesis. Furthermore, many thanks to the online communities
of Kaggle, StackOverflow and Github for publishing insightful code and notebooks that
benefited this study.


9. References

Aggarwal, Charu C. 2018. Neural Networks and Deep Learning. Springer.
Bagherinezhad, Hessam. 2020. Regularizing Predictions via Class-wise Self-Knowledge
Distillation. pages 1–11.
Bakator, Mihalj and Dragica Radosav. 2018. Deep learning and medical diagnosis: A review of
literature. Multimodal Technologies and Interaction, 2(3).
Becker, Nick. 2016. The Right Way to Oversample in Predictive Modeling.
Breiman, L, J Friedman, R Olshen, and C J Stone. 1983. Classification and Regression Trees.
Buhrmester, Vanessa, David Münch, and Michael Arens. 2019. Analysis of Explainers of Black
Box Deep Neural Networks for Computer Vision: A Survey.
Chishti, Waseem Ahmad and Shahid Mahmood Awan. 2019. Deep Neural Network a Step by
Step Approach to Classify Credit Card Default Customer. 3rd International Conference on
Innovative Computing, ICIC 2019.
Chollet, Francois. 2018. Deep Learning with Python.
Das, Arun and Paul Rad. 2020. Opportunities and Challenges in Explainable Artificial
Intelligence (XAI): A Survey. pages 1–24.
Dosilovic, Filip Karlo, Mario Brcic, and Nikica Hlupic. 2018. Explainable artificial intelligence: A
survey. 2018 41st International Convention on Information and Communication Technology,
Electronics and Microelectronics, MIPRO 2018 - Proceedings, (May):210–215.
Dua, Dheeru and Casey Graff. 2017. UCI Connect-4 Data Set.
Fernández, Alberto, Salvador García, Francisco Herrera, and Nitesh V. Chawla. 2018. SMOTE for
Learning from Imbalanced Data: Progress and Challenges, Marking the 15-year Anniversary.
Journal of Artificial Intelligence Research, 61:863–905.
GDPR. 2016. General Data Protection Regulation.
Goodfellow, Ian, Yoshua Bengio, and Aaron Courville. 2016. Deep Learning. MIT Press.
Gou, Jianping, Baosheng Yu, Stephen John Maybank, and Dacheng Tao. 2020. Knowledge
Distillation: A Survey.
Grigorescu, Sorin, Bogdan Trasnea, Tiberiu Cocias, and Gigel Macesanu. 2020. A survey of deep
learning techniques for autonomous driving. Journal of Field Robotics, 37(3):362–386.
Hinton, Geoffrey, Oriol Vinyals, and Jeff Dean. 2015. Distilling the Knowledge in a Neural
Network.
Hosmer, David and S. Lemeshow. 2000. Area under the ROC curve. Applied Logistic Regression,
pages 160–164.
Hussain, Saad. 2018. Credit Card Default Prediction Using TensorFlow.
Ignatov, Andrey, Radu Timofte, Andrei Kulik, Seungsoo Yang, Ke Wang, Felix Baum, Max Wu,
Lirong Xu, and Luc Van Gool. 2019. AI benchmark: All about deep learning on smartphones
in 2019. Proceedings - 2019 International Conference on Computer Vision Workshop, ICCVW 2019,
pages 3617–3635.
Ioffe, Sergey and Christian Szegedy. 2015. Batch Normalization: Accelerating Deep Network
Training by Reducing Internal Covariate Shift.
Janssen, Janneke. 2019. The right to explanation: means for ’white-boxing’ the black-box?
13(January).
Kaggle. 2018. Home credit default risk. Available at:
https://www.kaggle.com/c/home-credit-default-risk/data.
Kingma, Diederik P. and Jimmy Ba. 2014. Adam: A Method for Stochastic Optimization.
Kotsiantis, S B. 2007. Supervised Machine Learning: A Review of Classification Techniques. In
Proceedings of the 2007 Conference on Emerging Artificial Intelligence Applications in Computer
Engineering: Real Word AI Systems with Applications in EHealth, HCI, Information Retrieval and
Pervasive Technologies, pages 3–24, IOS Press, NLD.
Krizhevsky, Alex, Ilya Sutskever, and Geoffrey Hinton. 2012. ImageNet Classification with Deep
Convolutional Neural Networks. Advances in Neural Information Processing Systems,
25:1097–1105.
Lan, Xu, Xiatian Zhu, and Shaogang Gong. 2018. Knowledge Distillation by On-the-Fly Native
Ensemble.
LeCun, Yann, Léon Bottou, Yoshua Bengio, and Patrick Haffner. 1998. Gradient-based learning
applied to document recognition. Proceedings of the IEEE, 86(11):2278–2323.


Lepri, Bruno, Nuria Oliver, Emmanuel Letouzé, Alex Pentland, and Patrick Vinck. 2018. Fair,
Transparent, and Accountable Algorithmic Decision-making Processes: The Premise, the
Proposed Solutions, and the Open Challenges. Philosophy and Technology, 31(4):611–627.
Lipton, Zachary C. 2018. The Mythos of Model Interpretability.
Liu, Xuan, Xiaoguang Wang, and Stan Matwin. 2018. Improving the Interpretability of Deep
Neural Networks with Knowledge Distillation.
Mandrekar, Jayawant N. 2010. Receiver Operating Characteristic Curve in Diagnostic Test
Assessment. Journal of Thoracic Oncology, 5(9):1315–1316.
Mangasarian, Olvi, Nick Street, and William Wolberg. 1994. Breast Cancer Diagnosis and
Prognosis Via Linear Programming. Operations Research, 43.
Moi, Sin-Hua, Yi-Chen Lee, Li-Yeh Chuang, Shyng-Shiou F Yuan, Fu Ou-Yang, Ming-Feng Hou,
Cheng-Hong Yang, and Hsueh-Wei Chang. 2018. Cumulative receiver operating
characteristics for analyzing interaction between tissue visfatin and clinicopathologic factors
in breast cancer progression. Cancer cell international, 18:19.
Parloff, Roger. 2016. The AI Revolution: Why Deep Learning Is Suddenly Changing Your
Life. pages 1–15.
Sekaran, Karthik, Chandra Mouli, and Srinivasa Perumal Ramalingam. 2018. Breast Cancer
Classification Using Deep Neural Networks. (February):1–293.
Shepherd, James. 2018. Deep learning in TF with upsampling. Available at:
https://www.kaggle.com/shep312/lightgbm-with-weighted-averages-dropout-787.
Shwartz-Ziv, Ravid and Naftali Tishby. 2017. Opening the Black Box of Deep Neural Networks
via Information.
Song, Hwanjun, Minseok Kim, Dongmin Park, and Jae-Gil Lee. 2019. How does Early Stopping
Help Generalization against Label Noise?
Srivastava, Nitish, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov.
2014. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine
Learning Research, 15:1929–1958.
Street, W. N., W. H. Wolberg, and O. L. Mangasarian. 1993. Nuclear Feature Extraction For Breast
Tumor Diagnosis. Biomedical Image Processing and Biomedical Visualization, 1905(January
1999):861–870.
Svozil, Daniel, Vladimír Kvasnička, and Jiří Pospíchal. 1997. Introduction to multi-layer
feed-forward neural networks. Chemometrics and Intelligent Laboratory Systems, 39(1):43–62.
Urban, Gregor, Krzysztof J. Geras, Samira Ebrahimi Kahou, Ozlem Aslan, Shengjie Wang, Rich
Caruana, Abdelrahman Mohamed, Matthai Philipose, and Matt Richardson. 2016. Do Deep
Convolutional Nets Really Need to be Deep and Convolutional?
Veni, Krishna and Sobha Rani. 2018. On the Classification of Imbalanced Data Sets.
Vosters, Boyd. 2020. Data Science in Action Thesis. Available at:
https://github.com/vostersb16/Data-Science-In-Action-Thesis.
Waltl, Bernhard and Roland Vogl. 2018. Explainable artificial intelligence - The new frontier in
legal informatics. Jusletter IT, (February).
Wang, Junpeng, Liang Gou, Wei Zhang, Hao Yang, and Han-Wei Shen. 2019. DeepVID: Deep
Visual Interpretation and Diagnosis for Image Classifiers via Knowledge Distillation. IEEE
Transactions on Visualization and Computer Graphics, PP:1.
Wolberg, William H., Nick Street, and Olvi L. Mangasarian. 1995. Breast Cancer Wisconsin
(Diagnostic) Dataset.
Xie, Ning, Gabrielle Ras, Marcel van Gerven, and Derek Doran. 2020. Explainable Deep
Learning: A Field Guide for the Uninitiated.
Yeh, I-Cheng and Che-hui Lien. 2009. The comparisons of data mining techniques for the
predictive accuracy of probability of default of credit card clients. Expert Systems with
Applications, 36(2, Part 1):2473–2480.
Zhang, Zhongheng. 2016. Missing data imputation: focusing on single imputation. Annals of
translational medicine, 4(1):9.
Zheng, Wanwan and Mingzhe Jin. 2020. The Effects of Class Imbalance and Training Data Size
on Classifier Learning: An Empirical Study. SN Computer Science, 1(2):1–13.
Zhou, Guorui, Ying Fan, Runpeng Cui, Weijie Bian, Xiaoqiang Zhu, and Kun Gai. 2017. Rocket
Launching: A Universal and Efficient Framework for Training Well-performing Light Net.


10. Appendices

Appendix A: Overview of all datasets in the Home Credit Default Risk Dataset


Appendix B: Default of Credit Card Clients Dataset Features

Feature Name Details


Target Variable Default payment next month (Yes = 1, No = 0)
X1 Amount of the given credit in dollars
X2 Gender (1 = male, 2 = female)
X3 Education (1 = graduate school, 2 = university, 3 = high school, 4 = others)
X4 Marital status (1 = married, 2 = single, 3 = others)
X5 Age (year)
X6 Repayment status in September, 2005
X7 Repayment status in August, 2005
X8 Repayment status in July, 2005
X9 Repayment status in June, 2005
X10 Repayment status in May, 2005
X11 Repayment status in April, 2005
X12 Amount of bill statement in September, 2005
X13 Amount of bill statement in August, 2005
X14 Amount of bill statement in July, 2005
X15 Amount of bill statement in June, 2005
X16 Amount of bill statement in May, 2005
X17 Amount of bill statement in April, 2005
X18 Amount paid in September, 2005
X19 Amount paid in August, 2005
X20 Amount paid in July, 2005
X21 Amount paid in June, 2005
X22 Amount paid in May, 2005
X23 Amount paid in April, 2005


Appendix C: The Breast Cancer Wisconsin Diagnostic Dataset Features

Feature Name         Details

Target Variable      M = malignant (cancerous), B = benign (not cancerous)
Radius               Mean of distances from center to points on the perimeter
Texture              Standard deviation of gray-scale values
Perimeter            Distance around the edge of the contour
Area                 Size of the contour
Smoothness           Local variation in radius length
Compactness          Perimeter² / Area − 1
Concavity            Severity of concave portions of the contour
Concave points       Number of concave portions of the contour
Symmetry             Level of symmetry
Fractal dimension    Coastline approximation

*Note that the dataset was extended to 30 features by adding derived features, such as
the mean values of the variables above.
