
J. Intell. Syst.

2020; 29(1): 640–652

Mais Haj Qasem* and Loai Nemer*

Extreme Learning Machine for Credit Risk Analysis
https://doi.org/10.1515/jisys-2018-0058
Received January 25, 2018; previously published online June 18, 2018.

Abstract: Credit risk analysis is important for financial institutions that provide loans to businesses and indi-
viduals. Banks and other financial institutions generally face risks that are mostly of financial nature; hence,
such institutions must balance risks and returns. Analyzing or determining risk levels involved in credits,
finances, and loans can be performed through predictive analytic techniques, such as an extreme learning
machine (ELM). In this work, we empirically evaluated the performance of an ELM for credit risk problems and
compared it to naive Bayes, decision tree, and multi-layer perceptron (MLP). The comparison was conducted
on the basis of a German credit risk dataset. The simulation results of statistical measures of performance cor-
roborated that the ELM outperforms naive Bayes, decision tree, and MLP classifiers by 1.8248%, 16.6346%,
and 5.8934%, respectively.

Keywords: Credit risk analysis; extreme learning machine; naive Bayes; decision tree; multi-layer perceptron.

1 Introduction
Credit risk analysis is crucial for financial institutions that provide loans to businesses and individuals. Such
loans occur for various reasons, including bank mortgages, motor vehicle purchase finances, and credit card
purchases. The importance of performing credit risk analysis for customers and prospects is to build a full
picture of the clientele, help mitigate the risk of default and non-payment, and promote the long-term success
of any banking organization [6].
Most bankers are said to make sound decisions when they have a clear overview of the amount of risk
involved in the current transaction, allowing them to ensure that part of the earnings are kept for these risks.
Granting any form of credit is typical for any bank, and credit risk is profoundly common [2]. Credit providers
normally collect vast amounts of information on borrowers. Statistical predictive analytic techniques can be
employed to analyze or determine risk levels involved in credits, finances, and loans, as well as default risk
levels [1].
An extreme learning machine (ELM) is one of the predictive analytic techniques that can learn patterns
of different credit default ratios and can be used to predict the risk levels of future credit loans [14]. The
ELM was developed to address challenges that neural networks face, such as intensive human intervention,
slow learning speed, and poor learning scalability. Feedforward neural networks play key roles in machine
learning and data analysis and are usually regarded as a distinct learning technique in the computational
intelligence community. The essential aim of the ELM is to overcome these challenges by simultaneously
ensuring high accuracy, minimal user intervention, and fast real-time learning, as illustrated in Figure 1.
ELMs are feedforward neural networks for classification, regression, clustering, sparse approximation,
compression, and feature learning. They can produce good generalization performance and learn thousands
of times faster than conventional neural networks, which train using backpropagation: because slow
gradient-based learning algorithms are used extensively for training, all the parameters of the network
must be tuned iteratively.

*Corresponding authors: Mais Haj Qasem and Loai Nemer, King Abdullah II School for Information Technology, The University
of Jordan, Amman, Jordan, e-mail: mais_hajqasem@hotmail.com (M.H. Qasem); l.nemer@ju.edu.jo (L. Nemer)

Open Access. © 2020 Walter de Gruyter GmbH, Berlin/Boston. This work is licensed under the Creative Commons Attribution
4.0 Public License.

Figure 1: Essential Considerations of ELM.

Figure 2: ELM Theories.

An ELM is an easy-to-use and effective algorithm for a single-hidden-layer feedforward neural network.
A traditional neural network learning algorithm must set numerous training parameters and can easily
become trapped in a local optimum. By contrast, the ELM algorithm only requires the number of hidden-layer
nodes to be set, and its implementation does not need to adjust the network input weights or hidden biases.
Furthermore, the ELM generates a unique optimal solution, with the advantages of fast learning speed and
good generalization performance. Therefore, the ELM outperforms the
traditional neural network [11].
The ELM differs from conventional neural networks, which treat multi-layer networks as a black box. The ELM
handles both single-hidden-layer feedforward networks (SLFNs) and multi-hidden-layer networks in the same
way: it regards a multi-hidden-layer network as a white box and trains it layer by layer, as illustrated in
Figure 2. However, unlike deep learning, which requires intensive tuning of hidden layers and hidden neurons,
ELM theory shows that although hidden neurons are important, they need not be tuned (for both SLFNs and
multi-hidden-layer networks); learning can be performed without iteratively tuning the hidden neurons.
In this paper, an ELM was implemented on a German credit risk dataset, the most popular dataset on
credit risk analysis that was prepared by Prof. Hofmann [17]. Different classifiers, namely naive Bayes [18],
decision tree [21], and multi-layer perceptron (MLP) [23], were also used to compare their statistical
measures of performance with the ELM.
The rest of this paper is organized as follows. Section 2 reviews work related to the ELM. Section 3
presents credit risk analysis. Section 4 describes the methodology, including the German credit risk
dataset, the ELM, the baseline classifiers, and the evaluation measures. Section 5 reports the experimental
results. Section 6 concludes this paper.

2 Related Work
The ELM algorithm has been widely used and has become a research focus in data mining, machine learning,
image processing, and other areas. In recent years, the ELM has attracted increasing attention from researchers
in the pattern recognition field [3, 4, 14].
Huang et al. [14] reviewed the theory and applications of the ELM. They compared feedforward
neural networks with support vector machines (SVMs), and showed in their simulations that the learning
phase of an ELM can be completed in seconds or faster for many applications. They concluded that the ELM may

learn faster than SVM by a factor of up to thousands, especially in forest-type prediction applications. More-
over, the response speed of a trained SVM to external new unknown observations is much slower than that of
feedforward neural networks because SVM algorithms normally generate a much larger number of support
vectors, whereas feedforward neural networks require very few hidden nodes for the same applications.
Miche et al. [16] presented the optimally pruned ELM (OP-ELM), a methodology based on the original
ELM algorithm with additional steps that render it more robust and generic. The OP-ELM applies to
both classification and regression problems and uses a leave-one-out criterion to select an appro-
priate number of neurons. Experiments on 12 different data sets, covering both regression and
classification problems, examined the speed and accuracy of the OP-ELM methodology and revealed that it
achieves roughly the same level of accuracy as other well-known methods such as SVM, MLP, or GP. Although
the original ELM is much faster than the OP-ELM built on it, the accuracy of the original ELM can be
problematic in many cases, whereas the OP-ELM remained robust across all tested data sets.
Zhu et al. [26] developed a hybrid learning algorithm called evolutionary ELM (E-ELM), which makes use
of the advantages of both ELM and differential evolution (DE). Their algorithm utilizes the fast minimum-norm
least-squares scheme to analytically determine the output weights instead of tuning them, and a modified form of
DE is employed to optimize the input weights and hidden biases. Unlike the gradient-based methods, E-ELM
does not require the activation functions to be differentiable, implying that E-ELM can be used to train SLFNs
with many non-linear hidden units, such as threshold units, which are reportedly easier for hardware imple-
mentation. Results revealed that E-ELM generally achieves higher generalization performance than other
algorithms, including backpropagation, globally asynchronous locally synchronous (GALS), and the origi-
nal ELM. Findings also verified that GALS is relatively slow and requires more memory due to the large
storage involved.
Mahmud and Al Mamun [15] proposed a facial expression recognition model based on the ELM. They detected
salient facial feature segments, and then converted the feature segments into binary images from gray-level
images by using Otsu’s optimum global thresholding algorithm. Subsequently, they applied morphological
operation on the segments with binary features to omit noises and render the edges of feature segments
smooth. Laplacian of Gaussian filter was used to detect edges from the morphologically operated image,
and then feature vectors were created. The ELM algorithm was used as a classifier to recognize the six basic
types of expressions. They concluded that the training time of ELM is much less than that of the traditional
gradient-based neural network, and the overall recognition accuracy of ELM is satisfactory.
Golestaneh et al. [9] presented a novel structure called fuzzy wavelet ELM (FW-ELM) based on the the-
ory of multi-resolution analysis of wavelet transforms, fuzzy concepts, and ELM. They aimed to significantly
reduce network complexity by minimizing the number of linear learning parameters, and decrease the sen-
sitivity to the random initialization procedure while maintaining acceptable accuracy and generalization
performances. They compared their algorithm with other well-known methods by using real machine learn-
ing benchmark problems in function approximation, regression, classification, identification of non-linear
systems, and time series prediction. The results of comparison affirmed that the number of linear learning
parameters is efficiently reduced in comparison to the other reported works. For the identification of plants
and prediction of time series, in comparison to fuzzy wavelet neural networks, FW-ELM achieves the best
performance with a smaller number of learning parameters and using a one-pass learning method. For the
classification and regression problems, the performance of FW-ELM is comparable to that of the online
sequential fuzzy-ELM and better than that of other reported works, with fewer linear learning parameters
and smaller standard deviations (SDs).
Finally, Zhang et al. [25] put forward a multi-kernel ELM (MKELM)-based method for motor imagery
electroencephalogram (EEG) classification associated with motor imagery tasks in brain-computer interface
(BCI) applications. Two different types of kernels, Gaussian and polynomial kernels, were exploited to map
the original CSP features to different non-linear feature spaces that provide richer discriminant informa-
tion. The results of extensive comparison with two public EEG datasets indicate that the MKELM method
achieves higher classification accuracy than those of the other competing algorithms. Their paper concludes
the superiority of the proposed MKELM-based method for accurate classification of EEG associated with motor
imagery in BCI applications, and provides a promising and generalized solution to investigate the complex
and non-linear information for various applications in the fields of expert and intelligent systems.

3 Credit Risk Analysis


Credit risk analysis is a process that has long challenged financial organizations. Credit risk is the
potential risk that the counterparty of a loan agreement fails to meet its obligations under the original
agreement and eventually defaults on the obligation. Credit risk arises in many forms of transactions, such
as options, equities, mutual funds, bonds, loans, and other financial instruments, including the extension
of guarantees and the settlement of these transactions [19].
Risk managers understand the importance of identifying and quantifying the various sources of credit
risk. Credit risk analysis consists of building a full picture of the clientele; hence, analysis of customers and
prospects helps mitigate the risk of default and non-payment. Better credit risk analysis also presents an
opportunity to significantly improve overall performance and secure a competitive advantage. As discussed,
most bankers are said to make sound decisions when they have a clear overview of the amount of risk involved
in the current transaction. Such bankers ensure that some of the earnings are kept for these risks. Granting
of any form of credit is usual for any bank, and the said risk is very common.
In recent years, banks realized that credit risk is important and that they must monitor, identify, control,
and measure it [7]. Effective analysis of credit risk has become a critical component of approaching risk man-
agement. Banks now ensure that they have sizeable capital against any form of credit risks such that they can
adequately tackle any risks incurred.
Banks also incur risk associated with individual credits or any other transactions that have to be managed
appropriately. The relationship between the credit risk and other forms of risks must always be considered
seriously to increase shareholder value through value creation, value preservation, and value optimization,
and thus increase confidence in the market.
– Credit risk analysis advantages
1. Credit risk management allows for predicting, forecasting, and measuring the potential risk factor in
any transaction.
2. Bank management can also make use of certain credit models that can act as a valuable tool to
determine the correct level of lending by measuring the risk.
3. It presents alternative techniques and strategies for transferring credit risk and for pricing and
hedging options.
– Credit risk analysis disadvantages
1. Deciding how good a risk a client is cannot be entirely scientific, so the bank must also apply
judgment.
2. It incurs the cost and control overhead associated with operating a credit scoring system.
3. Deciding on a model is difficult, and companies often take a one-model-fits-all approach to credit
risk, which can result in wrong decisions.

The credit risk level faced by a bank is generated by the structure of a bank’s credit portfolio. If the portfolio
consists of large loans in a certain asset class, then this circumstance might be an indication of an increased
risk [8]. Similarly, the presence of complex financial transactions, such as lending, may also suggest a larger
risk.
Banks have recently started to employ models to assess the risks for the credit they lend. Credit risk mod-
els are highly complex and include algorithm-based methods of assessing credit risk. Such models aim to
help banks in quantifying, aggregating, and managing credit risk. Regardless of the method used, the focus
remains on credit risk assessment to maintain credit quality and risk exposure.

4 Methodology
In this research, the German credit dataset, the most popular dataset for credit risk analysis, is used.
The German credit risk data are imbalanced, with 700 positive and 300 negative instances. The Synthetic
Minority Oversampling Technique (SMOTE) filter in Weka was used to balance the dataset to 700 positive and
600 negative instances. The German credit dataset and the SMOTE filter are discussed in the following
subsections.
The attributes of the said dataset differ in value ranges, where one column has values ranging from 1 to 4
and another column has values ranging from 2 to 148. The considerable difference in the scale of the numbers
could cause problems when one attempts to combine the values as features during modeling.
Normalization is used to avoid these problems by creating new values that maintain the general distri-
bution and ratios in the source data while keeping values within a scale applied across all numeric columns
used in the model. In this research, normalization was applied to every column, rescaling all values to a
common scale by using Eq. (1):

$$X_{\text{new}} = 0.5 \times \frac{X_{\text{old}}}{\max(X_i)}. \qquad (1)$$

The German credit risk dataset was shuffled randomly by row to distribute the attribute values across the
training and testing sets. Subsequently, the dataset was divided into two sets for training and testing
(60% was used for training and the rest for testing).
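The preprocessing pipeline can be summarized in a short sketch. The following Python snippet is a minimal illustration, assuming the numeric Strathclyde version of the dataset is already loaded as a NumPy feature matrix `X` (with non-negative columns) and a label vector `y`; it applies the per-column scaling of Eq. (1), shuffles the rows, and performs the 60/40 split:

```python
import numpy as np

def preprocess(X, y, train_frac=0.6, seed=42):
    """Scale columns by Eq. (1), shuffle rows, and split 60/40."""
    # Eq. (1): X_new = 0.5 * X_old / max(X_i), applied column-wise.
    # Assumes non-negative attributes with a positive column maximum.
    X = 0.5 * X / X.max(axis=0)
    # Shuffle rows so attribute values are distributed over both sets.
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    X, y = X[idx], y[idx]
    # 60% of the rows for training, the rest for testing.
    cut = int(train_frac * len(X))
    return X[:cut], y[:cut], X[cut:], y[cut:]

X_train, y_train, X_test, y_test = preprocess(X, y)  # X, y are hypothetical inputs
```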
Different classifiers, namely naive Bayes, decision tree, and MLP (discussed in the following subsections),
were used to compare their statistical measures of performance with the ELM on the credit risk analysis
dataset. Sensitivity, specificity, and predictive values are statistical measures of the performance of a
binary classification test (also known in statistics as a classification function); they quantify the
performance of a case definition or the results of a diagnostic test or algorithm.

4.1 German Credit Dataset

Prepared by Prof. Hofmann [17], the German credit dataset is the most popular dataset on credit risk analy-
sis. It encompasses two datasets. The original dataset contains the categorical/symbolic values, whereas the
other dataset was created by Strathclyde University for algorithms that require numerical attributes. They
edited the original and added several indicator variables to make it suitable for algorithms that cannot cope
with categorical variables. Several attributes that are ordered categorical have been coded as integers. As
the ELM is an algorithm that requires numerical attributes, we used the second dataset created by Strathclyde
University. The dataset contains 1000 instances and 24 attributes.
This dataset requires the use of a cost matrix, as shown in Figure 3, where rows represent the actual
classification and columns the predicted classification. Classifying a customer as good when they are bad
is worse (cost 5) than classifying a customer as bad when they are good (cost 1).
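To make the role of the cost matrix concrete, the sketch below is an illustration (not part of the original study); the class encoding 0 = good, 1 = bad is an assumption. It accumulates the total misclassification cost of a set of predictions:

```python
import numpy as np

# Cost matrix for the German credit dataset: rows are the actual class,
# columns the predicted class (0 = good, 1 = bad). Predicting "good" for
# a bad customer costs 5; predicting "bad" for a good customer costs 1.
COST = np.array([[0, 1],
                 [5, 0]])

def total_cost(y_true, y_pred):
    """Sum the cost-matrix entries over all (actual, predicted) pairs."""
    y_true = np.asarray(y_true, dtype=int)
    y_pred = np.asarray(y_pred, dtype=int)
    return int(COST[y_true, y_pred].sum())
```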

4.2 ELM

As discussed, the ELM is a learning algorithm for feedforward neural networks, used for classification and
regression, clustering, sparse approximation, compression, and feature learning, with a single layer or mul-
tiple layers of hidden nodes. The term “extreme learning machine” was given to such models by its main
inventors, Huang et al. [10].

Figure 3: Cost Matrix.



The hidden node parameters are not only the weights connecting the inputs to the hidden nodes; more
importantly, they do not need to be tuned. These nodes can be randomly assigned and never updated: they are
random projections followed by non-linear transforms, or they can be inherited from their ancestors without
being changed. The input-to-hidden weights are thus constant once initialized, and they remain effective
despite being semi-random [13]. In most cases, the output weights of the hidden nodes are learned in a
single step, which essentially amounts to learning a linear model, hence the speed [12].
The ELM is an effective solution for SLFNs. Consider an SLFN, as illustrated in Figure 4, with N hidden
nodes and activation functions g_i(·), where g can be a sigmoid function, a sine function, or a radial basis
function (RBF); a_i is the input weight vector that connects the input layer to the ith hidden node, b_i is
the bias of the ith hidden node, and β_i is the output weight. The network is mathematically modeled as
follows:

$$f_L(x) = \sum_{i=1}^{N} g_i(x, a_i, b_i) \, \beta_i, \qquad a_i \in \mathbb{R}^d, \; b_i \in \mathbb{R}. \qquad (2)$$

For additive nodes, the activation function g can be a sigmoid function (Eq. (3)) or a sine function
(Eq. (4)); for RBF nodes, it is defined as in Eq. (5):

$$g_i(x, a_i, b_i) = \frac{1}{1 + e^{-(a_i \cdot x + b_i)}}, \qquad (3)$$

$$g_i(x, a_i, b_i) = \sin(a_i \cdot x + b_i), \qquad (4)$$

$$g_i(x, a_i, b_i) = g(b_i \, \| x - a_i \|). \qquad (5)$$

An ELM can be built with randomly initialized hidden nodes. Given N arbitrary distinct samples
{(x_i, t_i) | x_i ∈ R^d, t_i ∈ R^m, i = 1, ..., N}, where x_i is the training data vector and t_i is the
target of each sample, Eq. (2) can be written as

$$H\beta = T, \qquad (6)$$

where

$$H = \begin{bmatrix} g(x_1, a_1, b_1) & \cdots & g(x_1, a_k, b_k) \\ \vdots & \ddots & \vdots \\ g(x_n, a_1, b_1) & \cdots & g(x_n, a_k, b_k) \end{bmatrix} \in \mathbb{R}^{n \times k},$$

with n the number of training samples and k the number of hidden nodes, and

$$\beta = \begin{bmatrix} \beta_1^T \\ \vdots \\ \beta_k^T \end{bmatrix} = \begin{bmatrix} \beta_{11} & \cdots & \beta_{1m} \\ \vdots & \ddots & \vdots \\ \beta_{k1} & \cdots & \beta_{km} \end{bmatrix}, \qquad T = \begin{bmatrix} t_1^T \\ \vdots \\ t_n^T \end{bmatrix} = \begin{bmatrix} t_{11} & \cdots & t_{1m} \\ \vdots & \ddots & \vdots \\ t_{n1} & \cdots & t_{nm} \end{bmatrix},$$

where H is the hidden-layer output matrix (the randomized matrix of the neural network) and T is the
training data target matrix.

Figure 4: SLFN Example.



The ELM is a type of regularized neural network, but with non-tuned hidden-layer mappings; it aims to reach
both the smallest training error and the smallest norm of the output weights. Its objective function is

$$\text{Minimize:} \quad \|\beta\|_p^{\sigma_1} + h \, \|H\beta - T\|_q^{\sigma_2},$$

where σ1 > 0, σ2 > 0, and p, q = 0, 1/2, 1, 2, ..., ∞. Different combinations of σ1, σ2, p, and q can be
used, resulting in distinct learning algorithms for regression, classification, sparse coding, compression,
feature learning, and clustering.
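The whole training procedure described above fits in a few lines of linear algebra. The following NumPy sketch is a minimal single-hidden-layer ELM with sigmoid additive nodes (the class name and parameter choices are illustrative, not the authors' implementation): it randomly assigns the input weights and biases, builds the hidden-layer output matrix H, and solves Hβ = T in one step through the Moore-Penrose pseudo-inverse:

```python
import numpy as np

class ELM:
    """Minimal single-hidden-layer ELM with sigmoid additive nodes."""

    def __init__(self, n_hidden=200, seed=0):
        self.n_hidden = n_hidden
        self.rng = np.random.default_rng(seed)

    def fit(self, X, T):
        d = X.shape[1]
        # Randomly assign input weights a_i and biases b_i; never tune them.
        self.A = self.rng.normal(size=(d, self.n_hidden))
        self.b = self.rng.normal(size=self.n_hidden)
        # Hidden-layer output matrix H of Eq. (6), using the sigmoid of Eq. (3).
        H = 1.0 / (1.0 + np.exp(-(X @ self.A + self.b)))
        # Output weights solve H beta = T in the least-squares sense via the
        # Moore-Penrose pseudo-inverse: the single learning step of the ELM.
        self.beta = np.linalg.pinv(H) @ T
        return self

    def predict(self, X):
        H = 1.0 / (1.0 + np.exp(-(X @ self.A + self.b)))
        return H @ self.beta
```

For the binary credit task, the targets T can, for instance, be encoded as ±1 and the sign of the network output thresholded to obtain class labels.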

4.3 Naive Bayes

The naive Bayes classifier is a straightforward and powerful algorithm for the classification task. It is easy
to build and particularly useful for very large data sets. Despite its simplicity, naive Bayes can perform
comparably to far more sophisticated classification methods. The naive Bayesian classifier is based on Bayes'
theorem with independence assumptions among predictors. In simple terms, a naive Bayes classifier assumes
that the presence of a particular feature in a class is unrelated to the presence of any other feature.
The Bayes theorem works on conditional probability (the probability that something will happen, given that
something else has already occurred). Naive Bayes uses conditional probability as expressed in Eq. (7); a
short sketch follows the list below:

$$P(h|x) = \frac{P(x|h) \times P(h)}{P(x)}, \qquad (7)$$

where
– P(h|x) = posterior probability of class (h, target) given predictor (x, attributes);
– P (h) = prior probability of class;
– P(x|h) = likelihood, which is the probability of predictor given class;
– P (x) = prior probability of predictor.
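As a concrete illustration of Eq. (7), the sketch below uses scikit-learn's Gaussian naive Bayes as an assumed stand-in (the paper's experiments used Weka), where the likelihood P(x|h) is factored into independent per-attribute Gaussians and the posterior P(h|x) is maximized over the classes. `X_train`, `y_train`, and `X_test` are assumed to come from the split in Section 4:

```python
from sklearn.naive_bayes import GaussianNB

nb = GaussianNB()
nb.fit(X_train, y_train)               # estimates P(h) and per-feature P(x|h)
posteriors = nb.predict_proba(X_test)  # P(h|x) for each class h, per Eq. (7)
y_pred = nb.predict(X_test)            # argmax of the posterior over classes
```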

4.4 Decision Tree

The decision tree classifier is a predictive machine learning model that decides the target value of a new sam-
ple based on various attribute values of the available data. Decision tree builds classification or regression
models in the form of a tree structure. The internal nodes of a decision tree denote the different attributes,
the branches between the nodes depict the possible values that these attributes can have in the observed
samples, whereas the terminal nodes express the final value of the dependent variable. Decision trees can
handle both categorical and numerical data.
The core algorithm for building decision trees, ID3, developed by J. R. Quinlan, employs a top-down, greedy
search through the space of possible branches with no backtracking. ID3 uses entropy and information gain
to construct a decision tree.
The ID3 algorithm uses entropy to calculate the homogeneity of a sample: a decision tree is built top-down
from a root node and involves partitioning the data into subsets that contain instances with similar
(homogeneous) values. If the sample is completely homogeneous, the entropy is zero; if the sample is equally
divided, the entropy is 1. To build a decision tree, we need to calculate two types of entropy: Eq. (8) uses
the frequency table of one attribute, and Eq. (9) uses the frequency table of two attributes:

$$E(S) = \sum_{i=1}^{c} -p_i \log_2 p_i. \qquad (8)$$

$$E(T, X) = \sum_{c \in X} P(c) \, E(c). \qquad (9)$$

The information gain is based on the decrease in entropy after a dataset is split on an attribute.
Constructing a decision tree is all about finding the attribute that returns the highest information gain.
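Equations (8) and (9) and the resulting information gain can be computed directly from class frequencies, as the following sketch illustrates (a minimal version for a single categorical attribute; the helper names are ours):

```python
import numpy as np

def entropy(y):
    """Eq. (8): E(S) = sum_i -p_i * log2(p_i) over class frequencies in y."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def information_gain(y, x):
    """Entropy decrease from splitting the labels y on attribute values x."""
    # Eq. (9): entropy of each branch c, weighted by the branch probability.
    e_split = sum((x == v).mean() * entropy(y[x == v]) for v in np.unique(x))
    return entropy(y) - e_split
```

ID3 then picks, at every node, the attribute with the largest `information_gain` and recurses on each branch.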

4.5 MLP

The MLP, a classifier based on the feedforward artificial neural network, consists of multiple layers of
nodes (at least three layers). Each layer is fully connected to the next layer in the network. Nodes in the
input layer represent the input data. All other nodes map inputs to outputs by performing a linear
combination of the inputs with the node weights w and bias b and applying an activation function. For an
MLP with K + 1 layers, this can be written in matrix form as expressed in Eq. (10):

$$y(x) = f_K(\ldots f_2(w_2^T f_1(w_1^T x + b_1) + b_2) \ldots + b_K). \qquad (10)$$

If an MLP had a linear activation function in all neurons, i.e. a linear function mapping the weighted
inputs to the output of each neuron, then linear algebra shows that any number of layers could be reduced
to a two-layer input-output model. The two commonly used MLP activation functions are both sigmoids,
described by Eqs. (11) and (12):

$$y(v_i) = \tanh(v_i), \qquad (11)$$

$$y(v_i) = (1 + e^{-v_i})^{-1}, \qquad (12)$$

where v_i is the weighted sum of the inputs to the ith node, and the number of nodes in the output layer
corresponds to the number of classes.
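Equation (10) is simply a chain of fully connected layers; a minimal forward-pass sketch (illustrative only, using the tanh activation of Eq. (11) in every layer) looks as follows:

```python
import numpy as np

def mlp_forward(x, weights, biases):
    """Eq. (10): per layer, a linear combination w^T a + b, then tanh."""
    a = x
    for W, b in zip(weights, biases):  # one (W, b) pair per layer
        a = np.tanh(W.T @ a + b)       # Eq. (11) as the activation
    return a
```

In training, the weights and biases of every layer would be tuned iteratively by backpropagation, which is exactly the cost the ELM avoids.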

4.6 SMOTE Filter

In this work, the case of imbalanced data is handled using the SMOTE algorithm. Sampling is one of the
easiest methods for balancing datasets; it involves oversampling and undersampling. Oversampling is a
sampling technique that balances the dataset by replicating examples of the minority class [17].

SMOTE, proposed by Chawla et al. [5], is an oversampling method designed for generating new minority-class
data [22]. The minority class is oversampled by taking each minority-class sample and introducing new
synthetic examples along the line segments joining any or all of its k nearest minority-class neighbors.
Depending on the amount of oversampling required, neighbors from the k nearest neighbors are randomly
chosen [24].
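In practice, the same balancing can be reproduced with the SMOTE implementation in the imbalanced-learn library; the sketch below is an assumed equivalent of the Weka filter used in the paper, oversampling the minority class toward the 700/600 balance described in Section 4:

```python
from collections import Counter
from imblearn.over_sampling import SMOTE

# sampling_strategy = desired minority/majority ratio after resampling;
# 600/700 doubles the 300 minority samples to 600, as in this paper.
smote = SMOTE(sampling_strategy=600 / 700, k_neighbors=5, random_state=0)
X_bal, y_bal = smote.fit_resample(X, y)
print(Counter(y_bal))  # e.g. 700 majority and 600 minority instances
```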

4.7 Statistical Performance Evaluation Measures

Prediction methods are increasingly used to go from observations about an item to conclusions about the
item's target value. Binary classifiers are the most widely used prediction methods and are usually based
on machine learning approaches [20]. The end user should be able to understand how the evaluation is done
and how to interpret the results.
After classification methods build their model on the dataset, it is necessary to have a quantitative way
of evaluating their classification model, by measuring whether the model assigns the correct class value
to the test instances. In this research, five main performance evaluation measures are used. These include
sensitivity, specificity, positive predictive value, negative predictive value, and accuracy.
A classification problem has only two classes: positive and negative (0 or 1, true or false, etc.). Each
instance in the dataset is mapped to one of these two classes. Given a classifier and an instance, there are
four possible outcomes:
– True positive (TP) = the instance is positive and is classified as positive.
– False negative (FN) = the instance is positive but is classified as negative.
– True negative (TN) = the instance is negative and is classified as negative.
– False positive (FP) = the instance is negative but is classified as positive.

The outcomes of the classification test can then be summarized in a confusion matrix, as shown below. The
main performance evaluation measures use the values held in the confusion matrix to evaluate how well the
model assigns the correct class value to the test instances.

|                   | Predicted positive | Predicted negative |
|-------------------|--------------------|--------------------|
| Actual = positive | True positive      | False negative     |
| Actual = negative | False positive     | True negative      |

– Accuracy: calculates the probability of correctly classified instances.

Accuracy = (TP + TN)/(TP + TN + FP + FN).

– Sensitivity: calculates the probability of actual positives that are correctly identified as positives by the
classifier (true positive rate).

Sensitivity = TP/(TP + FN).

– Specificity: calculates the probability of actual negatives that are correctly identified as negative by the
classifier (true negative rate).

Specificity = TN/(TN + FP).

– Positive predictive value (PPV): calculates the probability that the positive is present when the test is
positive.

$$\text{PPV} = \frac{\text{Sensitivity} \times \text{Prevalence}}{\text{Sensitivity} \times \text{Prevalence} + (1 - \text{Specificity}) \times (1 - \text{Prevalence})}.$$

– Negative predictive value (NPV): calculates the probability that the negative is present when the test is
negative.

$$\text{NPV} = \frac{\text{Specificity} \times (1 - \text{Prevalence})}{(1 - \text{Sensitivity}) \times \text{Prevalence} + \text{Specificity} \times (1 - \text{Prevalence})},$$

where Prevalence = (TP + FN)/(TP + FP + FN + TN). A sketch computing all five measures follows.
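All five measures follow mechanically from the four confusion-matrix counts; the sketch below is a direct transcription of the formulas above:

```python
def binary_metrics(tp, fn, fp, tn):
    """The five performance measures used in this paper."""
    total = tp + fn + fp + tn
    accuracy    = (tp + tn) / total
    sensitivity = tp / (tp + fn)        # true positive rate
    specificity = tn / (tn + fp)        # true negative rate
    prevalence  = (tp + fn) / total     # proportion of actual positives
    ppv = (sensitivity * prevalence) / (
        sensitivity * prevalence + (1 - specificity) * (1 - prevalence))
    npv = (specificity * (1 - prevalence)) / (
        (1 - sensitivity) * prevalence + specificity * (1 - prevalence))
    return {"accuracy": accuracy, "sensitivity": sensitivity,
            "specificity": specificity, "ppv": ppv, "npv": npv}
```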

5 Experiments and Results


Different classifiers were used to compare their statistical measures of performance with ELM on the credit risk
analysis dataset, namely naive Bayes, decision tree, and MLP, in order to obtain the best statistical measures
of performance for each classifier in terms of sensitivity, specificity, and predictive value.
The ELM was tested with the sigmoid and sine functions for different numbers of hidden nodes. Beyond 200
hidden nodes, the training accuracy began to overfit, rendering any further increase in the number of hidden
nodes inefficient. The sensitivity, specificity, and predictive values for different numbers of hidden nodes
are shown in Table 1 and illustrated in Figure 5 for the sigmoid function, and in Table 2 and Figure 6 for
the sine function. The naive Bayes, decision tree, and multi-layer perceptron (MLP) results for sensitivity,
specificity, and predictive value are shown in Table 3. These classifiers were tested by using Weka.
The final statistical performance measures of the ELM, naive Bayes, decision tree, and MLP are summarized
in Table 4 and illustrated in Figure 7. The results are presented in terms of accuracy, sensitivity,
specificity, and predictive value. As can be seen from the evaluation results, the ELM showed the best
statistical measures of performance on the German credit risk dataset.

Table 1: Extreme Learning Machine Using Sigmoid Function Result.

| Hidden nodes | 50      | 100     | 200     |
|--------------|---------|---------|---------|
| Time         | 0.030   | 0.057   | 0.036   |
| Accuracy     | 77.115% | 76.154% | 76.346% |
| Sensitivity  | 80.755% | 80.370% | 79.182% |
| Specificity  | 73.770% | 72.653% | 71.713% |
| PPV          | 77.422% | 76.042% | 73.983% |
| NPV          | 86.304% | 83.604% | 79.465% |

Figure 5: ELM Statistical Performance Measures Plotting (Sigmoid Function).

Table 2: Extreme Learning Machine Using Sine Function Result.

| Hidden nodes | 50      | 100     | 200     |
|--------------|---------|---------|---------|
| Time         | 0.023   | 0.019   | 0.033   |
| Accuracy     | 76.731% | 76.538% | 75.769% |
| Sensitivity  | 81.061% | 80.989% | 78.277% |
| Specificity  | 73.362% | 73.663% | 72.654% |
| PPV          | 76.670% | 76.583% | 76.264% |
| NPV          | 84.453% | 83.446% | 84.041% |

Figure 6: ELM Statistical Performance Measures Plotting (Sine Function).



Table 3: Naive Bayes, Decision Tree, and MLP Results.

| Measure     | Naive Bayes | Decision tree | MLP     |
|-------------|-------------|---------------|---------|
| Accuracy    | 76.462%     | 69.039%       | 73.231% |
| Sensitivity | 80.370%     | 71.674%       | 76.506% |
| Specificity | 73.200%     | 66.889%       | 69.811% |
| PPV         | 69.592%     | 46.165%       | 66.221% |
| NPV         | 83.746%     | 55.554%       | 77.258% |

Table 4: Comparison Results.

| Measure (%) | ELM    | Naive Bayes | Decision tree | MLP    |
|-------------|--------|-------------|---------------|--------|
| Accuracy    | 77.115 | 76.462      | 69.039        | 73.231 |
| Sensitivity | 81.061 | 80.370      | 71.674        | 76.506 |
| Specificity | 73.770 | 73.200      | 66.889        | 69.811 |
| PPV         | 77.422 | 69.592      | 46.165        | 66.221 |
| NPV         | 86.304 | 83.746      | 55.554        | 77.258 |

Figure 7: Statistical Performance Measure Comparison Results Plot.



Naive Bayes, in turn, performed better than both the decision tree and the MLP. We can therefore conclude
that the ELM is very efficient when applied to credit risk analysis problems, and the goal of our paper is
achieved.

6 Conclusion
In this work, an ELM was evaluated for a credit risk analysis problem. For the purpose of benchmarking and
evaluation, a German credit risk dataset (the most popular dataset in credit risk analysis) was utilized. The
evaluation results of the ELM were compared with three well-known classifiers, naive Bayes, decision tree,
and MLP, and indicated that ELM achieved the highest statistical measures of performance.
The simulation results of statistical measures of performance corroborated that ELM outperforms naive
Bayes, decision tree, and MLP classifiers by 0.653%, 8.076%, and 3.884% for accuracy; 0.691%, 9.387%, and
4.555% for sensitivity; 0.57%, 6.881%, and 3.959% for specificity; 1.288%, 24.715%, and 4.659% for PPV; and
5.922%, 34.114%, and 12.41% for NPV, respectively.

Bibliography
[1] E. I. Altman and A. Saunders, Credit risk measurement: developments over the last 20 years, J. Banking Finance 21 (1997),
1721–1742.
[2] L. Andersen and J. Sidenius, Extensions to the Gaussian copula: random recovery and random factor loadings, J. Credit
Risk 1 (2004), 5.
[3] E. Avci, A new method for expert target recognition system: genetic wavelet extreme learning machine (GAWELM), Expert
Syst. Appl. 40 (2013), 3984–3993.
[4] E. Avci and R. Coteli, A new automatic target recognition system based on wavelet extreme learning machine, Expert Syst.
Appl. 39 (2012), 12340–12348.
[5] N. V. Chawla, K. W. Bowyer, L. O. Hall and W. P. Kegelmeyer, SMOTE: synthetic minority over-sampling technique, J. Artif.
Intell. Res. 16 (2002), 321–357.
[6] D. Cossin and H. Pirotte, Advanced Credit Risk Analysis: Financial Approaches and Mathematical Models to Assess, Price,
and Manage Credit Risk, Wiley, New York, 2000.
[7] M. Crouhy, D. Galai and R. Mark, A comparative analysis of current credit risk models, J. Banking Finance 24 (2000),
59–117.
[8] B. Ganguin and J. Bilardello, Standard & Poor’s Fundamentals of Corporate Credit Analysis, McGraw Hill Professional,
New York, 2004.
[9] P. Golestaneh, M. Zekri and F. Sheikholeslam, Fuzzy wavelet extreme learning machine, Fuzzy Sets Syst. 342 (2018),
90–108.
[10] G. B. Huang, Q. Y. Zhu, and C. K. Siew, Extreme learning machine: a new learning scheme of feedforward neural networks,
in: Proceedings 2004 IEEE International Joint Conference on Neural Networks, vol. 2, pp. 985–990, IEEE, 2004.
[11] G.-B. Huang, Q.-Y. Zhu and C.-K. Siew, Extreme learning machine: theory and applications, Neurocomputing 70 (2006),
489–501.
[12] G. B. Huang, D. H. Wang and Y. Lan, Extreme learning machines: a survey, Int. J. Mach. Learn. Cybernet. 2 (2011), 107–122.
[13] G. B. Huang, H. Zhou, X. Ding and R. Zhang, Extreme learning machine for regression and multiclass classification, IEEE
Trans. Syst. Man Cybernet. Pt. B (Cybernet.) 42 (2012), 513–529.
[14] G. Huang, G. B. Huang, S. Song and K. You, Trends in extreme learning machines: a review, Neural Netw. 61 (2015), 32–48.
[15] F. Mahmud and M. Al Mamun, Facial expression recognition system using extreme learning machine, Int. J. Sci. Eng. Res. 8
(2017), 26–30.
[16] Y. Miche, A. Sorjamaa, P. Bas, O. Simula, C. Jutten and A. Lendasse, OP-ELM: optimally pruned extreme learning machine,
IEEE Trans. Neural Netw. 21 (2010), 158–162.
[17] P. Murphy and D. Aha, UCI Repository of Machine Learning Databases, Department of Information and Computer Science,
University of California, Irvine, CA, 1994.
[18] T. R. Patil and S. S. Sherekar, Performance analysis of naive Bayes and J48 classification algorithm for data classification,
Int. J. Comput. Sci. Appl. 6 (2013), 256–261.
[19] H. Pirotte and D. Cossin, Advanced credit risk analysis: financial approaches and mathematical models to assess, price,
and manage credit risk (No. 2013/191833), ULB – Universite Libre de Bruxelles, 2000.
[20] M. H. Qasem, H. Faris, A. Rodan and A. Sheta, Empirical evaluation of the cycle reservoir with regular jumps for time series
forecasting: a comparison study, in: Computer Science On-line Conference, pp. 115–124, Springer, Cham, 2017.

[21] S. R. Safavian and D. Landgrebe, A survey of decision tree classifier methodology, IEEE Trans. Syst. Man Cybernet. 21
(1991), 660–674.
[22] A. Sonak and R. Patankar, A survey on methods to handle imbalance dataset, Int. J. Comput. Sci. Mobile Comput. 4 (2015),
338–343.
[23] H. Taud and J. F. Mas, Multilayer perceptron (MLP), in: Geomatic Approaches for Modeling Land Change Scenarios,
pp. 451–455, Springer, Cham, 2018.
[24] S. Wang and X. Yao, Diversity analysis on imbalanced data sets by using ensemble models, in: IEEE Symposium on
Computational Intelligence and Data Mining, CIDM’09, pp. 324–331, IEEE, Nashville, TN, USA, 2009.
[25] Y. Zhang, Y. Wang, G. Zhou, J. Jin, B. Wang, X. Wang and A. Cichocki, Multi-kernel extreme learning machine for EEG
classification in brain-computer interfaces, Expert Syst. Appl. 96 (2018), 302–310.
[26] Q. Y. Zhu, A. K. Qin, P. N. Suganthan and G. B. Huang, Evolutionary extreme learning machine, Pattern Recogn. 38 (2005),
1759–1763.
