
College of Computer and Information Sciences

Computer Science Department

“Learning from Imbalanced Data”


CSC 496 – Final Report
Prepared by:
Noura Alomran 439200686
Sarah Al-mubarraz 439200728
Shahad Alhuwaymil 439200886
Haya Alojairi 439201214
Supervised by:
Dr. Sultanah Alotaibi
Research project for the degree of Bachelor in Computer Science
I. English Abstract
Data can broadly be divided into two types: balanced and imbalanced. Machine learning was originally designed to deal with balanced data; in the real world, however, it often faces imbalanced data. This challenging problem is known as learning from imbalanced data. It is a classification problem in which the distribution of examples across the class labels is biased or skewed, which increases the chance of inaccurate results. The imbalanced data problem appears widely in real-life applications such as medical applications and spam filtering. In this research, we focus on the problem of learning from imbalanced data by studying the current state of the field and exploring different types of problems and data from various domains. We analyze the performance of several approaches previously proposed to counter the issue of imbalanced data and test how well these approaches extend to big and complex data problems. Among the proposed approaches, we choose three models to evaluate on different datasets: Fraud Credit Card Detection, Melanoma, and Depression Tweets. Undersampling techniques combined with weighted SVMs, DSMOTE, and a combination model will be used to achieve the objectives of this research.
Keywords: Imbalanced data, balanced data, majority class, minority class, undersampling techniques combined with weighted-SVM model, DSMOTE model, combination model.

II. Arabic Abstract
Data around the world falls into two types: balanced data and imbalanced data. Machine learning was initially limited to dealing with balanced data, yet it frequently encounters imbalanced data in real-world settings; this difficult problem later became known as learning from imbalanced data. The problem centers on classification, where the distribution of examples across class labels is biased or skewed toward one class over another, leading to a higher number of inaccurate results. The imbalanced data problem is common in real-world applications such as medical applications and spam filtering. This research focuses on the problem of learning from imbalanced data by studying the current state of the field and investigating all types of problems for data drawn from different domains. It also analyzes the performance of a number of previously proposed approaches for overcoming the imbalanced data problem and studies their feasibility when applied to big and complex data problems. From among these approaches, three models were chosen for evaluation: the datasets used are credit card fraud detection, melanoma (skin cancer), and depression tweets from the Twitter platform. The research uses an undersampling-with-weighted-SVM model, a density-based synthetic minority oversampling (DSMOTE) model, and a combination model to achieve its objectives.
Keywords: imbalanced data, balanced data, majority class, minority class, undersampling combined with weighted-SVM model, density-based synthetic minority oversampling model, combination model.

Table of Contents
I. English Abstract............................................................................................................ II

II. Arabic Abstract ....................................................................................................... III

Chapter 1: Introduction ....................................................................................................... 7


1.1 Problem Statement ............................................................................................................ 7
1.2 Goals and Objectives ........................................................................................................ 8
1.3 Proposed Solution .............................................................................................................. 8
1.4 Research Scope ................................................................................................................... 9
1.5 Research Significance ....................................................................................................... 9
1.6 Ethical and Social Implications ...................................................................................... 9
1.7 Report Organization ...................................................................................................... 10

Chapter 2: Background..................................................................................................... 10

Chapter 3: Related Work ................................................................................................. 32

Chapter 4: Methodology .................................................................................................. 46


4.1 Models ............................................................................................................................... 46
4.1.1 DSMOTE model ..................................................................................................................... 46
4.1.2 Combination model............................................................................................................... 47
4.1.3 Undersampling techniques combined with weighted-SVM ................................. 48

Chapter 5: Experimental Design ................................................................................... 51


5.1 Datasets ............................................................................................................................. 51
5.1.1 Fraud Credit Card Detection ........................................................................................... 51
5.1.2 Melanoma ................................................................................................................................. 52
5.1.3 Depressed Tweet..................................................................................................................... 52
5.2 Performance measures .................................................................................................. 52

Chapter 6: Conclusion ...................................................................................................... 54

References............................................................................................................................. 55

List of Tables
TABLE 1: PERFORMANCE EVALUATION METRICS. 30
TABLE 2: RELATED WORKS SUMMARY. 45

List of Figures
FIGURE 1: IMBALANCED DATASET IN TRANSACTIONS [3]. 7
FIGURE 2. CLASSIFICATION OF DATA BY SUPPORT VECTOR MACHINE (SVM) [13] 14
FIGURE 3: MULTILAYER NEURAL NETWORK [16]. 16
FIGURE 4: CONVOLUTIONAL NEURAL NETWORK [16]. 17
FIGURE 5: EXAMPLE OF A TWO-CLASS IMBALANCED DATA PROBLEM [23]. 18
FIGURE 6: EXAMPLE OF DIFFICULTIES IN IMBALANCED DATASETS. (A) CLASS OVERLAPPING. (B)
SMALL DISJUNCTS [23]. 20
FIGURE 7: EXAMPLE OF GOOD BEHAVIOR (NO DATASET SHIFT) IN IMBALANCED DOMAINS: ECOLI4
DATASET, 5TH PARTITION [21]. 22
FIGURE 8: EXAMPLE OF BAD BEHAVIOR CAUSED BY DATASET SHIFT IN IMBALANCED DOMAINS:
ECOLI4 DATASET, 1ST PARTITION [21]. 23
FIGURE 9: CONFUSION MATRIX FOR A TWO-CLASS PROBLEM [36] 29
FIGURE 10: IMAGE GENERATIONS ON YOUTUBE-FACES [40]. 38
FIGURE 11: DATASET DSIE10D20 [44]. 43
FIGURE 12: DATASET DSIE50D60 [44]. 43
FIGURE 13: FLOW CHART OF DSMOTE [37]. 46
FIGURE 14: ILLUSTRATION OF COMBINATION MODEL [29]. 48
FIGURE 15: THE APPROACH DESIGN FOR UNDERSAMPLING TECHNIQUES COMBINED WITH
WEIGHTED-SVM. 50

Chapter 1: Introduction
It has become increasingly important for decision-making processes to understand the
fundamentals behind knowledge discovery and analysis from raw data, as a result of the
rapid development of extensive, sophisticated systems. Existing knowledge discovery
techniques have been successfully applied to a variety of applications; however, learning
from imbalanced data remains a challenge that requires further exploration. This problem
concerns the behavior of learning algorithms in the presence of underrepresented data and
severely skewed class distributions. In order to effectively represent massive quantities of
raw data as information and knowledge, novel principles, algorithms, and techniques must
be developed [1].

1.1 Problem Statement


An imbalanced dataset problem occurs when the distribution of examples across classes
is uneven: the majority class holds most of the examples, while the minority class holds
very few [2].

Figure 1: Imbalanced dataset in transactions [3].

Figure 1 illustrates the problem: valid financial transactions are numerous, while frauds
occur infrequently. Even though fraudulent transactions are rare events, they cause
enormous losses because of the large sums involved. Companies resort to machine learning
to avoid this dilemma; however, standard machine learning often fails to detect fraud
because of the imbalanced data [3].

Many approaches have been proposed to address the imbalanced data problem. The first
category deals with the problem at the data level, modifying the data so that it becomes
easier and simpler for a typical learning algorithm to handle, by:
• Generating new examples for minority classes (oversampling).
• Removing examples from majority classes (undersampling).

The second category of approaches focuses on the algorithm level; the idea is to improve
an existing data mining algorithm. Hybrid methods, in turn, denote a methodology that
integrates two techniques to enhance performance [4].

1.2 Goals and Objectives


This research's aim is to examine the issue of imbalanced data in classification by
accomplishing the following:
• Conduct a literature review on learning from imbalanced data.
• Examine the existing methods employed to handle the issue of imbalanced data.
• Study the available datasets and investigate different types of data and
applications.
Perform a comparative study based on the previous objectives:
A. Implement the proposed algorithms.
B. Conduct a controlled set of experiments on different datasets.
C. Evaluate the performance of various algorithms employed in the studies.
D. Analyze the results.
E. Select the best performing algorithm.

1.3 Proposed Solution

We intend to implement three chosen models that have been proposed to handle the
problem of imbalanced data, selected based on our review of prior research addressing this
issue. They will be evaluated using public datasets to see whether they can also help in
tackling big and complex datasets.

1.4 Research Scope

In this research, we seek the best solution for balancing imbalanced classification arising
in text, numeric, or image data. This is accomplished by exploring shallow learning
approaches and models. The scope of this research is limited to binary classification
problems, which deal with two classes only; multi-class classification is excluded.

1.5 Research Significance

Imbalanced data affects a wide range of sensitive fields, such as economics and security.
Machine learning can deal with classification problems; however, it handles imbalanced
data poorly, since it assumes by default that the data it works with is balanced. As a result,
the most critical (rare) examples are effectively discarded. Learning from imbalanced
datasets may therefore prove beneficial for handling rare events that can cause substantial
losses. For example, organizations lose income every year as a consequence of fraud, and
this research helps mitigate some of these losses.

1.6 Ethical and Social Implications

The ethical and social implications of data-related research are essential to take into
consideration. Since this research deals with multiple datasets that may contain personally
and socially sensitive information, including the depressed tweets dataset, great care has
been taken not to compromise any personal information or privacy. Furthermore, the
datasets are utilized in this research only for the purpose of training and testing the models.
No laws are violated in this research, as the datasets used were legally gathered and
obtained from open sources; they are detailed in Chapter 5.

1.7 Report Organization

The remainder of this report is organized as follows: Chapter two is the background,
devoted to an overview of machine learning and deep learning, along with a discussion of
the classification algorithms relevant to this research. It also introduces the nature of the
class imbalance problem with some of its application domains, presents the types of
solutions available for dealing with imbalanced datasets, and, toward its end, briefly
describes the standard performance measures used in assessment methodologies. A review
of related work of particular interest to this research is presented in chapter three. Chapter
four describes the methodology employed in this research. In chapter five the experimental
design is introduced. Lastly, chapter six concludes this research.

Chapter 2: Background

2.1 Machine Learning.


A significant outcome of the latest computer revolution is that a computer has become
capable of reliably classifying new data - given the right conditions - due to being trained.
This capability is achieved by machine learning. As a result, machine learning has been
applied in technology, commerce, and medicine [5].
It is possible for computers to think and learn without being explicitly programmed.
This is known as machine learning (ML). A computer program is designed to modify its
actions so that it may achieve greater accuracy. Accuracy can be measured in how many
times the chosen actions are actually successful [6].
Typically, machine learning data consists of two types: training data and test data.
Training data comes from the first split of the data, i.e., the initial reserve of data used to
develop the model. Once a model has been successfully developed and its accuracy is
satisfactory, it can be tested against the remaining data, otherwise known as the test data.
The machine learning model is ready for use once both the training and test data are
processed [7].
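The split into training and test data described above can be sketched as follows; this is a minimal illustration assuming the scikit-learn library, with a synthetic dataset and an 80/20 split chosen purely for demonstration.

```python
# Minimal sketch of the train/test workflow (assumes scikit-learn; dataset and split are illustrative).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# Illustrative synthetic dataset; any labeled data could be used instead.
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)

# Reserve most of the data for training and hold out the rest as test data.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("Accuracy on the held-out test data:", model.score(X_test, y_test))
```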
The ML major fields (domains) include data mining, deep learning, natural language
processing, data science, etc.

It is possible to differentiate four different machine learning paradigms based on how
an algorithm is trained and whether it can utilize outputs during training. These are:
supervised, semi-supervised, unsupervised, and reinforcement learning [6].
As for the first category which is supervised learning, a labeled data set is provided by
the operator, and then a model is trained using the labeled data set. The algorithm must
obtain the correct output. Hence, the algorithm recognizes patterns and gains knowledge
until it reaches a high level of accuracy [6].
Supervised learning can be divided into two groups or types. Firstly, classification
involves using a model to predict unknown values (outputs) based on some known values
(inputs). As well, regression constitutes a supervised learning model designed to provide
predictions based on the known input variables [8] [9].
Another type of learning is unsupervised learning. Contrary to supervised learning, it
receives unlabeled input and analyzes it by applying algorithms whose purpose is usually
to detect patterns or structures hidden in the data. Consequently, unsupervised learning
offers greater flexibility, since it can work with larger datasets that do not need to be
labeled [7] [8].
A standard model of unsupervised learning is clustering. Clustering arises when there are
no explicit labels specifying which portions of the observed data are desirable (or how
future data should be predicted), which makes it a challenging problem. Clustering
procedures have been developed in many different ways, based on specific assumptions
regarding what constitutes a cluster [10].
The third category is a combination of labeled and unlabeled examples known as semi-
supervised. It is common for the number of unlabeled examples to exceed the number of
labeled examples. As with supervised learning, a semi-supervised learning algorithm aims
to achieve the same objective. There is the expectation that many unlabeled examples can
help the learning algorithm construct a more accurate model [11].
Finally, reinforcement learning is another type of machine learning and the most
advanced algorithm category. Reinforcement learning utilizes feedback from previous
iterations to improve its model continuously, in contrast to supervised and unsupervised
learning, which effectively come to an end once a model has been formed from the training
and test sets [7] [12].

In this method, an agent interacts with an uncertain environment. Upon receiving
feedback from its actions, the agent learns that it has a reward to maximize (i.e., when the
agent is far from the goal the reward is lowered, and when it is closer to the goal the reward
is higher), and a decision is then made accordingly. The agent repeats the steps that lead to
the optimal reward. Reinforcement learning is thus a method that allows agents to learn by
trial and error based on feedback from their actions [6].
Since this research focuses exclusively on imbalanced data, only the machine learning
algorithms applied to this problem will be discussed.

• Decision tree
A standard classification method is the decision tree. Decision trees offer high efficiency
and simplicity in interpretation. Those two features make it popular in the area of machine
learning. This method is primarily used for solving classification problems that require
supervised learning. Both quantitative and categorical data can be incorporated into
decision trees to model categorical outcomes [7].
As the decision tree grows, it begins with a root node (the top), followed by splits that
produce branches. Each of these branches is called an edge. As branches become leaves,
nodes become decision points. As a result, a leaf will eventually produce no new branches
and become a terminal node [7] [11].
Decision trees are implemented using a variety of algorithms. Chi-squared Automatic
Interaction Detection (CHAID), Iterative Dichotomiser 3 (ID3), Classification and
Regression Tree (CART), C4.5, C5.0, and M5 are the most popular ones [6].

• Logistic Regression (LR)


LR is a supervised learning algorithm that operates by taking a collection of weighted
inputs and combining them linearly, which means that each input is multiplied by a weight
and the results are summed. In LR, the result is a discrete binary value (zero or one), so the
algorithm is typically used for binary classification, and the predicted output is bounded to
a finite set of values. Logistic regression uses the logistic (sigmoid) function
Y = 1 / (1 + e^(−z)), which converts any real value into a value between zero and one to
decide whether the value should pass into the final result. The predicted value Y approaches
one as Z goes to positive infinity, and approaches zero as Z goes to negative infinity. If the
overall result is less than 0.5, the outcome is classified as the negative class; if it is greater
than 0.5, the outcome is classified as the positive class [6] [12].
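To make the thresholding behavior concrete, the following is a small sketch of the logistic function and the 0.5 decision rule described above; the weight and input values are made up for illustration.

```python
import numpy as np

def sigmoid(z):
    """Logistic (sigmoid) function: maps any real value into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def predict(weights, bias, x, threshold=0.5):
    """Multiply each input by its weight, sum, squash with the sigmoid,
    and classify as positive (1) if the result exceeds the threshold."""
    z = np.dot(weights, x) + bias
    return 1 if sigmoid(z) >= threshold else 0

# Illustrative (made-up) weights and example.
w = np.array([0.8, -1.2, 0.3])
x = np.array([1.0, 0.5, 2.0])
print(predict(w, 0.1, x))  # prints 1 if sigmoid(w.x + b) >= 0.5, else 0
```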

• Naïve Bayes (NB)


NB is a supervised learning algorithm based on Bayes' theorem: P(C|D) = P(D|C) P(C) / P(D),
where C and D are events, P(C|D) denotes the posterior probability of class C given predictor
D, P(D|C) is the probability of the predictor given the class, P(C) is the prior probability of
class C, and P(D) is the prior probability of the predictor. NB algorithms assume that all
features are independent, meaning that the presence of one feature does not affect the
presence of another, even when the features are in fact dependent [6] [12].
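As a small worked example of Bayes' theorem with made-up numbers (the probabilities below are illustrative assumptions, not values from any dataset used in this research):

```python
# Posterior P(C|D) = P(D|C) * P(C) / P(D), with illustrative numbers.
p_c = 0.01          # prior probability of class C (e.g., a rare positive class)
p_d_given_c = 0.9   # probability of observing predictor D given class C
p_d = 0.05          # prior probability of predictor D

p_c_given_d = p_d_given_c * p_c / p_d
print(f"P(C|D) = {p_c_given_d:.2f}")  # 0.18
```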

• Support Vector Machines

Support vector machines (SVMs) are considered to be among the most popular
optimization-based, margin-based classifiers. They are primarily employed to determine
the maximum-margin decision boundary between data points of different classes: SVMs
find the optimal hyperplane that lies at the maximum distance from the closest examples
of each of the two classes (see Figure 2).
The optimal SVM solves the following optimization problem:

Minimize  (1/2) w·w + C Σ_{i=1..n} ξ_i                                      (1)

s.t.  y_i (w·φ(x_i) + b) ≥ 1 − ξ_i,   ξ_i ≥ 0,   i = 1, 2, …, n              (2)

Figure 2. Classification of data by support vector machine (SVM) [13]

The SVM is specified by the parameters w and b, where w = [w_1, w_2, …, w_d]^T is the
weight vector and b is the bias. x_i denotes the training examples and y_i ∈ {−1, 1} is the
class label of the i-th data point [13]. C is the cost parameter that balances the trade-off
between training accuracy and generalization by controlling the magnitude of the
penalization. The SVM is generated using the radial basis function (RBF) kernel, with the
cost C and the kernel parameter γ (Eq. 3) [14] [15]:

k(x_i, x_j) = exp(−γ ‖x_i − x_j‖²),   γ ≥ 0                                  (3)
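A minimal sketch of evaluating the RBF kernel of Eq. (3) for a pair of points, written here with the usual exponential form; the input vectors and the value of γ are illustrative.

```python
import numpy as np

def rbf_kernel(x_i, x_j, gamma=0.5):
    # RBF kernel value k(x_i, x_j) = exp(-gamma * ||x_i - x_j||^2), cf. Eq. (3).
    return np.exp(-gamma * np.sum((x_i - x_j) ** 2))

a = np.array([1.0, 2.0])
b = np.array([1.5, 1.0])
print(rbf_kernel(a, b))  # close to 1 for nearby points, approaches 0 for distant ones
```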

• Weighted Support Vector Machines

Many standard SVM algorithms show a noticeable decline in performance when dealing
with imbalanced datasets. One well-known remedy is to increase the penalty on the
minority class. This can be accomplished in several ways, but the most accessible approach
is to scale the cost parameter C for each class by a factor inversely proportional to that
class's number of examples relative to the overall number of examples. We call this
approach the weighted SVM (WSVM). The WSVM differs from the standard SVM in that
the misclassification costs of the positive class, C+, and the negative class, C−, are
different [14] [15]:
Minimize  (1/2) w·w + C+ Σ_{i: y_i = +1} ξ_i + C− Σ_{i: y_i = −1} ξ_i        (4)

s.t.  y_i (w·φ(x_i) + b) ≥ 1 − ξ_i,   ξ_i ≥ 0,   i = 1, 2, …, n              (5)


Each data class is given a different weight through the C+ and C− parameters [15]. The
misclassification costs of the positive and negative classes are assigned as C+ = C · w+ and
C− = C · w−, where w+ = (n+ + n−) / n+ and w− = (n+ + n−) / n−; here n+ is the number of
positive training examples and n− the number of negative training examples [14].
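The per-class cost weighting described above can be sketched as follows, assuming the scikit-learn library; the synthetic data and the 95/5 class proportions are illustrative assumptions.

```python
# Weighted RBF-kernel SVM: each class's cost C is scaled inversely to its frequency.
from sklearn.svm import SVC
from sklearn.datasets import make_classification

# Illustrative imbalanced data: roughly 95% negatives (label 0), 5% positives (label 1).
X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)

n_pos, n_neg = (y == 1).sum(), (y == 0).sum()
class_weight = {
    1: (n_pos + n_neg) / n_pos,  # w+ : minority class receives the larger weight
    0: (n_pos + n_neg) / n_neg,  # w- : majority class weight stays close to one
}

# class_weight multiplies the cost parameter C per class; class_weight='balanced'
# is a built-in shortcut with a similar effect.
clf = SVC(kernel='rbf', C=1.0, gamma='scale', class_weight=class_weight)
clf.fit(X, y)
print("Training accuracy:", clf.score(X, y))
```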

• Random Forests (RF)


RF is a supervised learning method used for regression and classification. Using the
bagging method, RF generates decision trees from random picks of the data, and the output
is produced by combining all of the resulting decision trees. RF has two stages: the first
stage generates the random forest, and the second stage produces a prediction from the RF
classifier built in the first stage [6].

• K Means Clustering Algorithm (KMC)


KMC is an unsupervised learning method that iteratively splits the data into 'k' clusters.
KMC works as follows: first, set the number of clusters 'k' and shuffle the dataset. Pick 'k'
instances at random, without replacement, to initialize the centroids. Then repeat until there
is no change in the assignment of instances to clusters: assign each instance to the closest
cluster (using the sum of squared distances), and recompute each cluster centroid as the
average of all instances assigned to that cluster [6].
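A minimal sketch of the clustering procedure just described, assuming scikit-learn; the two-dimensional data and the choice of k = 2 are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Illustrative 2-D data drawn around two made-up centers.
data = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])

# k-means repeats assignment and centroid-averaging until assignments stop changing.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(data)
print("Cluster centroids:", kmeans.cluster_centers_)
print("First ten assignments:", kmeans.labels_[:10])
```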

• K nearest neighbor (KNN)


KNN is a supervised learning method that assumes that similar examples from a dataset
share similar characteristics and reside close to one another. If the examples have been
labeled with a classification label, the label of an unclassified example can be determined
by observing the classes of its closest neighbors: KNN finds the k examples closest to the
query point and assigns it the single most common class among them [6].

2.2 Deep Learning

Recently, deep learning has been extensively studied in imbalanced data environments.
The deep learning concept is part of the broader family of machine learning algorithms that
are based on the learning of data representations. Deep learning has contributed to the use
of machine learning in a wide variety of applications of practical significance, including
e-commerce, medicine, advertising, and entertainment, extending the field of Artificial
Intelligence as a whole. Moreover, the increased ability to learn from abstract information
is a significant benefit of deep learning, which is achieved by building deeper
architectures [16] [17].
One of the primary characteristics of deep learning algorithms is that they are capable
of learning feature representations automatically, thereby avoiding a lot of engineering
time and effort. Generally, in the standard approach to machine learning, shallow networks
are used, meaning only a single layer for the input and a single layer for the output, with
no more than one hidden layer between the layers. In a deep learning setup, more than three
layers are present in the network, including input and output layers. It is, therefore, evident
that the greater the number of layers hidden within the network, the deeper the network
becomes [16], as shown in Figure. 3. This section reviewed the fundamentals of deep
learning, and now we will show the most popular algorithm in deep learning, which is the
Convolutional Neural Network (CNN).

• CNN

Figure 3: Multilayer Neural Network [16].

Convolutional neural networks (CNNs) have been widely applied to a wide range of
research fields and have been shown to be effective time series classifiers. A CNN is a special
type of artificial neural network for dealing with pixel-related information used in the
processing and recognition of images. This approach is also considered to be a prominent
algorithm of deep learning that involves the training of multiple layers simultaneously
throughout the deep learning process [17].
CNNs are feedforward neural networks that have similar hierarchical structures but
employ a variety of different node layers. The architecture of this algorithm typically
consists of convolutional layers, pooling (subsampling) layers, batch normalization layers,
as well as fully connected layers as shown in Figure. 4.
Convolutional layers can extract features from input images and encode them. Several
components are required, including input data, a filter that is considered a feature detector
that checks whether a feature is present in the receptive field of the image, and the last
component is a feature map. Additionally, this layer enables the filter bank to slide over
the input, thereby activating each of the receptive fields, which combine to form a feature
map that is used as an input in the next layer. In other words, a specific feature is detected
with the same set of weights [16][18].
In order to combine semantically similar features and reduce the number of input
parameters as well as the dimension of the result, a pooling layer or downsampling is added
after using one or more convolutional layers [18]. There are many types of pooling layers,
but the most popular ones are max pooling, which passes the largest value in each receptive
field to the output array, and average pooling, which passes the average value to the output
array [17].

Figure 4: Convolutional Neural Network [16].

Once the output has been convolved and pooled, it is flattened and used by fully connected
layers as the input for classification. The SoftMax activation function is usually used in the
fully connected layer to classify inputs appropriately, producing a probability between zero
and one: SoftMax squashes an arbitrary vector of real-valued scores into a vector of values
between zero and one that add up to one [19].
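The layer sequence described above (convolution, pooling, flattening, fully connected SoftMax output) can be sketched as follows, assuming TensorFlow/Keras; the input shape, layer sizes, and two-class output are illustrative assumptions rather than a prescribed architecture.

```python
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Input(shape=(28, 28, 1)),                      # e.g., small grayscale images
    layers.Conv2D(32, kernel_size=3, activation='relu'),  # convolutional feature extraction
    layers.MaxPooling2D(pool_size=2),                     # pooling / downsampling
    layers.Conv2D(64, kernel_size=3, activation='relu'),
    layers.MaxPooling2D(pool_size=2),
    layers.Flatten(),                                     # flatten the feature maps
    layers.Dense(128, activation='relu'),                 # fully connected layer
    layers.Dense(2, activation='softmax'),                # class probabilities summing to one
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
model.summary()
```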
In large-scale datasets, CNNs can achieve higher classification accuracy due to their
ability to learn jointly from features and classifiers. The CNN has proven to be successful
at classifying large datasets. In the meantime, the imbalance in the data presents significant
problems in the training process and could have a negative impact on the performance of
the classification model [20].

2.3 The Nature of the Class Imbalance Problem


Classification problems are prevalent nowadays in machine learning; we aim to predict
the class by looking at the input data or predictors, the target or output variable being
categorical [21]. In technical terms, any dataset with an uneven or biased distribution
between its classes is said to be imbalanced [1]. Also, if a dataset has more examples from
one class than the other classes, it is called
imbalanced [22]. A class that is represented by a small number of examples is called the
minority class “Positive”, while the remaining classes represent the majority class
“Negative” (See Figure 5). Usually, the minority class represents the most crucial concept
to grasp, and it is difficult to identify since it may be associated with exceptional and
noteworthy occurrences or because obtaining data for these examples is expensive [21].

Figure 5: Example of a two-class imbalanced data problem [23].

In light of the fact that the traditional criteria of training are highly influenced by the
majority class, a majority-class classifier can display reasonable accuracy results.
Unfortunately, for the minority class(es), the accuracy will be relatively low in this type of
case [22].
Binary classification is related to the class imbalance problem. Binary classification
problems deal with two classes only; this is considered one of the branches of learning from
imbalanced data, the most advanced one, and it is the area that we will focus on in this
research. Furthermore, for datasets that consist of more than two classes, a skewed
distribution is also common. That means several minority classes may exist, and finding a
solution becomes more difficult, because improving the performance on one class can easily
degrade the performance on the others. Therefore, a deep and specialized understanding of
the nature of the class imbalance problem is mandatory. When trying to find or build a
specialized solution for this problem, it is necessary to know the areas in which the
performance of standard multi-class classifiers is affected by class imbalance [4].
Most of the studies conducted on standard classifiers to monitor their behavior on datasets
suffering from class imbalance showed that skewness of the class distribution is one of the
most explicit characteristics leading to a significant deterioration in performance. Moreover,
this uneven distribution across classes can easily be measured using the imbalance ratio
(IR), defined as the ratio of the number of instances in the majority class to the number of
instances in the minority class [21] [24]. The ratio can be as high as 1:100, 1:1000, or even
higher in actual situations [25]. IR is a positive value greater than or equal to one.
The dataset is properly balanced when IR = 1. The larger the value, the greater the gap in
class sizes. Any dataset with an imbalance ratio more than 1.5 is considered imbalanced,
with IR = 9 being a common cut-off point for very imbalanced datasets [24]. However,
according to several theoretical and experimental investigations, IR is not the only factor
that influences whether a classifier can identify these uncommon events, nor the only source
of obstacles that disrupt the proper functioning of learning algorithms. Indeed, studies have
shown that the IR alone is insufficient to predict how well a classification algorithm will
perform on an imbalanced dataset.

Furthermore, several authors claim that the dataset's actual complexity and fundamental
properties have an impact on the classifier's recognition capacity [24]. Other significant
factors include:
• Small sample size: minority class examples are frequently lacking in imbalanced
datasets [25]. Sample size can be regarded as a critical component of the quality of a
classification model: when the number of examples is restricted, it is unreliable to uncover
the regularities inherent in the small classes. Moreover, the error rate caused by imbalanced
class distributions can be reduced by enlarging the training set; if a sufficient amount of
data is available, it becomes easier for the classifier to identify the small class and to
distinguish its rare examples from the majority class. Hence, given a sufficiently large
dataset, the imbalanced class distribution will not hinder classification, provided the time
needed to learn from a large-scale dataset is acceptable [26].
• Class separability or overlapping (Figure. 6a): It may be difficult to separate the
small class from the dominant class, as it is regarded as a critical issue in the small class
situation [26]. Assuming each class has highly discriminative patterns, class objects can be
distinguished with relatively simple criteria. When examples from both classes are
combined to some extent in the feature space, however, discriminative rules are difficult to
generate. The degree of overlap across classes varies, according to the previous
experiments [22].

Figure 6: Example of difficulties in imbalanced datasets. (a) Class overlapping. (b) Small disjuncts [23].

An imbalanced class distribution in and of itself is not necessarily problematic. Even so,
when combined with heavily overlapping classes, it can dramatically reduce the number of
minority class instances that are correctly categorized. Linearly separable domains are
largely unaffected by any degree of imbalance; in reality, the more complex the concept,
the more sensitive the system becomes to imbalance. If there is no overlap between the
classes, the class distribution of the instances matters less, since the problem can be solved
by almost any simple technique regardless of the IR [22] [26].
• Small disjuncts (Figure. 6b): this is closely related to the occurrence of imbalanced
classes. The problem occurs when the minority class concept is represented within small
clusters, which arise from underrepresented sub-concepts. These small disjuncts are
sometimes considered an implicit issue that complicates the class imbalance problem when
it exists, since it is difficult to know what such examples represent: a sub-concept, or just
noise? [21] [26].
• Noisy data: Any data mining system is known to be affected by noisy data. In
imbalanced data, noise has a more substantial impact on the minority class than on the
majority classes: because the positive class has fewer instances to begin with, it takes fewer
"noisy" examples to impact the learned sub-concept. There are two key lessons about the
impact of class imbalance and noise on various classification algorithms and data sampling
approaches:
1. Noise affects classification algorithms more than imbalance does; however, its effect
on the performance of the classifiers and sampling strategies grows as the severity of the
imbalance increases.
2. Among pre-processing procedures, simple undersampling strategies such as random
undersampling performed the best overall, at all degrees of noise and imbalance [21].
• Within-class concept: In many classification problems, a single class comprises several
sub-clusters or sub-concepts, and a class's examples are gathered from these various
sub-concepts. The sub-concepts do not usually contain the same number of instances. This
phenomenon is known as within-class imbalance, and it refers to the unequal distribution
of examples within a class. The presence of intra-class sub-concepts exacerbates the
imbalanced distribution problem (whether within or between classes) in two ways:
1. The presence of within-class sub-concepts increases the difficulty of learning the
dataset's concept.
2. In most circumstances, the presence of within-class sub-concepts is assumed [22].
• Data shift: There is a problem when the distributions of training and test data are
not the same. This is a prevalent issue that can affect any classification problem, and it
frequently occurs as a result of example selection bias difficulties. Most real-world
problems have some dataset shift, although general classifiers can usually handle it without
suffering too much performance loss. The dataset shift problem becomes especially serious
when dealing with imbalanced classification, as the small number of examples in the
minority class means that individual classification mistakes matter; in extreme
circumstances, a few incorrectly categorized minority class examples can result in a
significant deterioration in performance [21]. For clarity, the experiments in [21] offer two
examples of the influence of dataset shift in imbalanced classification, as seen in Figs. 7
and 8.
With respect to the first example (Figure. 7), classes that are ideally translated from the
training set to the test set are very clearly differentiable. However, for the second example
(Figure. 8), it's worth noting that several minority class examples in the test set are

Figure 7: Example of good behavior (no dataset shift) in imbalanced domains: ecoli4 dataset, 5th partition [21].

concentrated in the bottom and rightmost regions. In contrast, they're scattered throughout
the training set, resulting in a performance disparity between training and testing.

Figure 8: Example of bad behavior caused by dataset shift in imbalanced domains: ecoli4 dataset, 1st partition [21].

Given how vital dataset shift is in imbalanced classification, it's simple to understand
why it'd be a fascinating angle to explore in future studies. In the investigation of the dataset
shift in imbalanced domains, two distinct methodologies are possible. The first one focuses
on intrinsic dataset shift, which means that the data of interest contains some degree of
shift, causing a significant performance reduction. We might devise tools for detecting and
quantifying dataset change in this situation, but we'd tailor them to focus on the minority
group. The second is a dataset shift that is induced. The majority of contemporary state-of-
the-art research is validated using stratified cross-validation approaches, another source of
learning shift. However, to avoid artificially causing dataset shift concerns, a more
appropriate validation technique must be devised [21].

2.4 Application Domain


It is well established that class imbalance exists across various fundamental fields in
many real-world applications. This section reviews some application domains where class
imbalance occurs; the following examples illustrate such cases [27].

2.4.1 Medicine:
Applications that deal with medical issues are common examples of imbalanced class
distributions. Quality control refers to data mining strategies used to improve health care
services, such as creating a system for forecasting lung cancer patients' post-operative life
expectancy, where the number of patients surviving the assumed interval was much greater
than the number of deaths. These types of technologies can assist clinicians in determining
which patients should be chosen for surgery and in identifying those who are at the greatest
risk following surgery. In hospitals, vast amounts of data on patients and their medical
conditions are recorded in massive databases. Data mining techniques are used to
understand the etiology, progression, and characteristics of certain diseases and to identify
patterns and links between clinical and pathological data; the new information can then be
utilized to make an early diagnosis. For instance, a computer-assisted technique has been
created for detecting lung nodules in computed tomography images; pulmonary nodules
are a significant clinical indicator for diagnosing lung cancer in its early stages. Fortunately,
there are many more images without nodules than with them, which creates an imbalanced
classification problem [27] [28] [23].

2.4.2 Fraud Detection:


Frauds such as credit card fraud and check fraud can be pretty costly. Approximately
5% of the revenue of a company is lost as a result of fraud, according to the Association of
Certified Fraud Examiners. According to the research, global card fraud losses totaled
$22.8 billion in 2016, up 4.4% from the previous year. In addition, cellphone fraud is an expensive
issue for many businesses. The United States' telecom industry is impacted by cellular
fraud each year to the tune of hundreds of millions of dollars. By utilizing huge databases
of client data, data mining can help mitigate some of these losses. The fraud-detection task
is plagued by technical issues such as an imbalanced dataset [27] [28].

2.4.3 Education:
Educational Data Mining is a relatively new field of study in which data mining
techniques are applied to improve such a critical service to society. Predicting student
failure, for example, is an effective way to better understand why so many youngsters fail
to complete their school education. This is a difficult undertaking because numerous
elements can contribute to school failure. To predict whether a student would pass or fail,
socioeconomic, personal, social, family, and school factors, as well as previous and current
grades, were chosen as inputs [23] [3].

2.4.4 Security:
Security has become one of the dominant challenges to our society. There are many
sources of security risk, such as terrorism, where the danger of missing a terrorist is far
greater than the cost of searching an innocent individual for a bomb. Face recognition is
also used to determine the presence of target individuals of interest in video surveillance
applications under complicated and changing circumstances. The issue is that insufficient
reference target data is usually accessible at learning time, which produces an undesirable
class imbalance problem [23] [22].

2.4.5 Bioinformatics
Imbalanced data appears in a wide range of bioinformatics applications, such as protein
sub-cellular prediction. In recent years, the field of protein research has
gained a great deal of attention in bioinformatics and biotechnology. Protein structure and
function identification is a critical problem in this subject because they are directly related
to the functioning of an organism. Protein categorization is one of the most efficient
methods for resolving the issue. However, protein databases are invariably imbalanced.
Moreover, protein determination is not the only area where an imbalance is present in
bioinformatics [27] [23].

2.5 Dealing with Imbalanced Data Sets


The imbalanced classification problem has been addressed in many ways. Three types of
solutions are available: data-level, algorithm-level, and cost-sensitive approaches [29].

2.5.1 Data-Level Approach


An approach at the data level is one that modifies the training data. Data-level methods
make it possible to provide more suitable training data: they act on the data directly and
are intended to minimize the imbalance between classes, and the resampling can be carried
out in a variety of ways [27]. At the data level, several resampling methods have been
proposed [30].
• Oversampling is the process by which the size of the minority class is augmented to
achieve balance between the classes. Random oversampling duplicates randomly selected
minority examples [30], which increases the class size; however, oversampling is prone to
over-fitting [27].
An example of an oversampling method is the synthetic minority oversampling technique
(SMOTE). The basic idea of this method is to create new minority class examples by
interpolating between closely related minority class instances. This approach significantly
expands the minority class's decision region [31].
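A minimal sketch of SMOTE oversampling, assuming the imbalanced-learn library; the synthetic dataset and its roughly 1:9 class ratio are illustrative assumptions.

```python
from collections import Counter
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

# Illustrative imbalanced dataset (roughly 1:9 minority-to-majority ratio).
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
print("Before:", Counter(y))

# SMOTE interpolates between neighboring minority examples to create synthetic ones.
X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
print("After:", Counter(y_res))
```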
• Undersampling has as its primary purpose reducing the number of members of the
majority class. In order to achieve balance between the classes and obtain a more even
distribution of examples, randomly selected examples from the majority class are
eliminated. Undersampling can therefore discard potentially valuable data. Some techniques
of this kind are one-sided selection (OSS), the Condensed Nearest Neighbor Rule (CNNR),
and the Neighborhood Cleaning Rule (NCL) [31].
Another is Tomek links, a method for removing noisy and borderline majority class
examples by undersampling. A Tomek link is a pair (Ii, Ij), where Ii and Ij are two examples
from different classes and d(Ii, Ij) is the distance between them; the pair forms a Tomek
link if there is no example Il such that d(Ii, Il) < d(Ii, Ij) or d(Ij, Il) < d(Ii, Ij). When used as
an undersampling technique, only the majority class examples of each Tomek link are
removed, whereas when used as a data cleaning method, the examples from both classes
are removed. One of the most significant disadvantages of Tomek link undersampling is
that it can remove data that is potentially valuable for the induction process [31].
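The undersampling techniques above can be sketched as follows, again assuming the imbalanced-learn library; random undersampling and Tomek-link removal are shown on the same illustrative dataset.

```python
from collections import Counter
from imblearn.under_sampling import RandomUnderSampler, TomekLinks
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
print("Original:", Counter(y))

# Randomly discard majority examples until the classes are balanced.
X_rus, y_rus = RandomUnderSampler(random_state=0).fit_resample(X, y)
print("Random undersampling:", Counter(y_rus))

# Remove only the majority-class members of Tomek links (noisy/borderline pairs).
X_tl, y_tl = TomekLinks().fit_resample(X, y)
print("Tomek links:", Counter(y_tl))
```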

2.5.2 Algorithm-Level Approach
In an algorithm-level approach, skewness in the class distribution is handled without
affecting the training data. The process involves only adapting the learning phase of the
classifiers so that their decisions account for the class distribution [29].
AdaBoost.MH, for instance, is a boosting strategy that increases the weights of poorly
classified examples and decreases the weights of well-classified examples without
considering the imbalance of the data sets. To better understand the AdaBoost.MH we
should know that one of the most used and well-known boosting methods is AdaBoost
(Adaptive Boosting). An ensemble of M classifiers is learned progressively in its approach.
AdaBoost's basic principle is to amplify the effect of misclassified data by increasing their
weights in subsequent boosting iterations. As a result, the example set is fixed in all
algorithm iterations, with just their weights changing. Classifiers can learn from current
classifiers' mistakes and improve accuracy by doing so [32].
Several variants of the boosting algorithm exist. AdaBoost.M1 is a multiclass AdaBoost
variant that employs a multiclass base classifier, where the error rate determines the weight
of each base classifier. AdaBoost.M2 is another multiclass AdaBoost variant, in which a
pseudo-loss is used to determine the weight of each base classifier. CSB2, a cost-sensitive
variant of AdaBoost, has been proposed for managing imbalanced data: a cost item is added
to AdaBoost's weight update formula, and the step size is considered in the weight update
formula as in AdaBoost. AdaBoostCost boosts the weights of costly misclassified examples
and, conversely, lowers the weights of high-cost examples that are classified correctly,
based on a cost-adjustment function. RAMOBoost was proposed for handling imbalanced
data by combining ranked minority oversampling with AdaBoost.M2, using a sampling
probability distribution to rank the minority class examples [33] [32].
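A minimal sketch of the basic AdaBoost idea (misclassified examples gain weight across boosting rounds), assuming scikit-learn; note that this is plain AdaBoost, not the .M2, CSB2, or RAMOBoost variants discussed above, and the dataset is illustrative.

```python
from sklearn.ensemble import AdaBoostClassifier
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

# By default the base learner is a decision stump; each round re-weights the
# training examples so that previously misclassified ones receive more attention.
ada = AdaBoostClassifier(n_estimators=50, random_state=0)
ada.fit(X, y)
print("Training accuracy:", ada.score(X, y))
```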

2.5.3 Cost-Sensitive Approach


Cost-sensitive learning is a machine learning subfield that considers the costs of
prediction errors when training a machine learning model, and it is closely linked to class
imbalance. The costs of different misclassification mistakes can differ significantly. The
cost of predicting an instance from class i as class j is denoted by C(i, j). A positive (rare
class) instance has a misclassification cost of C(+, −), while a negative (common class)
instance has a misclassification cost of C(−, +). In imbalanced problems it is usually the
positive class that is of greater interest, so the cost of incorrectly classifying a positive
instance must be greater than that of a negative instance (i.e., C(+, −) > C(−, +)). Correct
classification typically carries no penalty (i.e., C(+, +) = C(−, −) = 0) [21] [25].
The cost matrix is considered during the model construction process, and the model with
the lowest cost is generated. It is possible to divide cost-sensitive learning into two
categories in general. In the first category, cost-sensitive classifiers are created. They are
referred to as direct methods. Cost-sensitive decision trees are an example of direct cost-
sensitive learning. The tree-building algorithms are tweaked to reduce the costs of
misclassification. We can determine which attribute is most suitable to split the data and
determine whether a subtree needs to be pruned by analyzing the cost data. The other
category is to create a "wrapper" that transforms existing cost-insensitive classifiers into
cost-sensitive ones. Wrapper methods, which are also known as cost-sensitive meta-
learning methods, are divided into two categories: thresholding and sampling.
Thresholding is based on fundamental decision theory, which allocates instances to classes
with the lowest projected cost. Sampling is accomplished by modifying the training dataset
[21] [34].
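As a small sketch of the sampling/weighting flavor of cost-sensitive learning (assuming scikit-learn), the example below weighs errors on the rare positive class ten times more heavily than errors on the negative class; the 10:1 cost ratio and the dataset are illustrative assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

cost_fn = 10.0  # C(+, -): cost of predicting a positive (rare) instance as negative
cost_fp = 1.0   # C(-, +): cost of predicting a negative (common) instance as positive
sample_weight = np.where(y == 1, cost_fn, cost_fp)

# The tree-building criterion now penalizes errors on the rare class more heavily.
clf = DecisionTreeClassifier(random_state=0)
clf.fit(X, y, sample_weight=sample_weight)
print("Accuracy on positive examples:", clf.score(X[y == 1], y[y == 1]))
```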

2.6 Performance Analysis


With the expanding number of models and algorithms developed in the machine learning
and statistics communities, it is important to standardize assessment methodologies so that
different algorithms can be compared. This section explains the performance metrics used
to assess approaches to imbalanced data. The metrics used to analyze and quantify a model's
performance are known as performance metrics, and the confusion matrix is probably the
most straightforward and extensively used performance summary in classification
problems [27] [35].

Figure 9: Confusion matrix for a two-class problem [36]

The confusion matrix in Figure 9 divides the examples into four categories [36]. Its
columns represent the possible predicted conditions and its rows the possible true
conditions. The first category is true positives, which contains the positive examples
correctly predicted as positive. The second category is false positives, which contains the
negative examples incorrectly predicted as positive. The third category is false negatives,
which contains the positive examples incorrectly predicted as negative. The fourth category
is true negatives, which contains the negative examples correctly predicted as negative.
Table 1 lists the performance metrics that can be derived from the confusion matrix,
including Accuracy, Sensitivity, Specificity, Precision, Recall, F-measure, G-mean, and the
AUC [27] [35].

Metric        Formula
Accuracy      (TP + TN) / (TP + FP + FN + TN)                                (6)
Sensitivity   TP / (TP + FN)                                                 (7)
Specificity   TN / (TN + FP)                                                 (8)
Precision     TP / (TP + FP)                                                 (9)
Recall        TP / (TP + FN)                                                 (10)
Error rate    (FP + FN) / (TP + FP + FN + TN)                                (11)
F-measure     (2 × Precision × Recall) / (Precision + Recall)                (12)
G-mean        √(Sensitivity × Specificity)                                   (13)
AUC           (1 + TPR − FPR) / 2                                            (14)
FP rate       FP / (TN + FP)                                                 (15)
F-value       ((1 + β²) × Recall × Precision) / (β² × Recall + Precision)    (16)

Table 1: Performance Evaluation Metrics.

Accuracy counts how many correctly predicted examples there are out of all the
examples. Accuracy can be quite high even when the examples are not evenly distributed
among the classes and the model is biased toward the dominating class, so Accuracy is not
a good indicator of model performance on imbalanced data [27] [21].
Sensitivity, sometimes also called True Positive Rate (TPR), counts how many
measures of positive examples a model correctly predicts [21].

Specificity, sometimes also called True Negative Rate (TNR), counts how many
measures of negative examples a model correctly predicts [27].
Precision represents the proportion of correctly classified positive examples to
examples classified as positive [27].
Recall represents the proportion of correctly classified positive examples to all actual
positive examples [27].
Error rate: This metric counts how many wrongly predicted examples there are out of
all the potential ones [21].
F-measure is used to assess prediction accuracy when dealing with binary decision
problems. It is essentially the harmonic mean of Precision and Recall [21].
G-mean: This is known as the square root of the product of sensitivity and specificity
[27].
Area Under the Curve (AUC), computed here from a single point as (1 + TPR − FPR)/2,
summarizes the trade-off between the true positive rate and the false positive rate. The ideal
model has a True Positive Rate of 1 and a False Positive Rate of 0 [27] [21].
FP rate is the proportion of misclassified negative instances [6].
F-value is a combination of two metrics useful for information extraction, recall and
precision; the value of F is high only if both recall and precision are high. F-value may be
altered by adjusting the value of β, which is normally set to 1 [27].
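The metrics in Table 1 can be computed directly from the four confusion-matrix counts; the counts below are illustrative assumptions.

```python
from math import sqrt

TP, FP, FN, TN = 40, 10, 20, 930  # illustrative confusion-matrix counts

accuracy    = (TP + TN) / (TP + FP + FN + TN)
sensitivity = TP / (TP + FN)           # recall / true positive rate
specificity = TN / (TN + FP)
precision   = TP / (TP + FP)
f_measure   = 2 * precision * sensitivity / (precision + sensitivity)
g_mean      = sqrt(sensitivity * specificity)
fp_rate     = FP / (TN + FP)
auc         = (1 + sensitivity - fp_rate) / 2   # single-point form of Eq. (14)

print(f"accuracy={accuracy:.3f}  sensitivity={sensitivity:.3f}  "
      f"specificity={specificity:.3f}  precision={precision:.3f}")
print(f"F-measure={f_measure:.3f}  G-mean={g_mean:.3f}  AUC={auc:.3f}")
```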

Chapter 3: Related Work
To deal with imbalanced data classification, numerous approaches have been proposed
over the past decade, and various solutions have been presented from different perspectives
to address this issue. Below is a brief review of articles that present solutions, arranged by
their approach. See Table 2 for a general overview of related work.

• Data-Level Approach
Different forms of resampling are used in a data-level approach to adjust the training
data, which can make the training data more suitable for learning (cf. Section 2.5.1). Several
researchers have studied resampling and investigated its use when learning from
imbalanced data [30].
To address the issue of imbalanced data, the authors of [29], [37], [38], and [39] have
suggested some solutions based on the data-level approach.

[29] is a study conducted in 2004 that focused on finding the best possible way of
adjusting resampling. The paper chose datasets from three different domains: datasets
reflecting target concepts with different complexities, the UCI Repository, and the
Reuters-21578 dataset. For the first variety of data, a training set was designed. Next,
positive examples were randomly removed from
the training set to achieve a 1:5 class imbalance, leaving the negative class with a higher
percentage.
Further, extra examples were randomly removed from the training set, creating a 1:25
class imbalance in favor of the negative class. This process was then repeated to tip the
scales toward the positive class instead. In the end there were five cases: one with no class
imbalance, and cases with a 1:5 and a 1:25 class imbalance created for each class. The
results gathered using C4.5 were averaged over ten runs of multiple domains of
similar complexity. For the second type of datasets, the authors chose a highly challenging
binary classification; however, this classification had smaller domains than the first and
third versions. The author's objective was to observe closely the behavior of C4.5 in a very
small area with a high level of class imbalance. However, only two of the top categories
from the Reuters-21578 data were used in the third version.

When all three domains were tested, it was discovered that they achieved the same
results. This proved that class imbalances could affect the classification performance of
C4.5. Furthermore, this research also studied the effects of oversampling and under-
sampling to full balance. Eventually, the study concluded that these two methods do not
individually solve every imbalanced problem. However, when combined, they might be
just the solution. The authors also examined the effects of oversampling and under-
sampling at various rates. Following experiments, it was determined that resampling to full
balance is not the optimal resampling rate. Optimal resampling rates vary from one domain
to another and according to the resampling method used. There is, then, a need to combine
the two methods of oversampling and under-sampling, a combination that could prove
useful since each method is capable of processing imbalanced classification in its own way.
According to the authors, the combination proposed includes two main parts; the first
explains which classifiers will be combined, and the second explains how these will be
combined.
A further consequence of this combination is a hierarchy of three levels, as follows. At
the output level, the decisions of the oversampling expert and the undersampling expert
are combined. At the expert level, each expert combines the results of 10 classifiers from
the classifier level. At the classifier level, learners are trained on data sets that have been
resampled at different rates. Furthermore, this combination depends on two different
assumptions:
1. Different classifiers could classify different test points within the same testing set.
2. Classifiers that operate in class-imbalanced domains make numerous errors in
classifying the non-dominant class.
To address the first assumption, rather than letting different classifiers decide on a
given testing point, the authors let a classifier that was considered "good enough" handle
that point. It is important to note that the classifier selected for one data point may not be
the one selected for another, so relying on a single unreliable classifier for a data point
may lead to unreliable results. Therefore, to prevent ineffective classifiers from taking
part at the classifier level, the authors developed an elimination procedure, which
resulted in modifications to the suggested combination. These modifications consist of
three components:
1. They must be applied to each expert at their level of expertise.
2. The combination is applied at the output level.
3. The elimination procedure must be applied at the classifier level.
Both the expert and output levels are combined using a very simple decision rule. If
one of the non-eliminated classifiers declares an example positive, the expert level to
which that classifier belongs also declares it positive; likewise, if one of the two experts
declares an example positive, the output level also deems it positive. The example is
therefore labeled positive throughout the system.
Observe that both the expert and output levels of this scheme are biased in favor of the
underrepresented class; the objective is to counter the natural bias against that class shown
by individual classifiers trained in a class-imbalanced domain. The elimination procedure,
however, limits excessive bias toward the minority class by excluding classifiers that are
too strongly biased toward it.
Using the resampling-based scheme, the paper compared the results obtained by C4.5 on
the data sets with those obtained by AdaBoost. The results indicate that the suggested
method is more effective than a single learner, and the research also emphasizes that a
combination method, such as AdaBoost, is more effective when dealing with imbalanced
class problems.

The authors of [37] developed an enhanced version of the density-based spatial
clustering of applications with noise (DBSCAN) algorithm, integrated it with the
synthetic minority over-sampling technique (SMOTE), and proposed the density-based
synthetic minority over-sampling technique (DSMOTE). Since DBSCAN is not very
efficient at handling examples near the borderline, the enhanced version classifies
minority class examples into three groups: core examples, noisy examples, and
borderline examples. The noisy examples of the minority class are then eliminated, and
different methods are applied to oversample both the core and the borderline examples.
The DSMOTE algorithm workflow is as follows: first, the dataset is taken as input;
second, the optimized DBSCAN classifies the minority class into the three groups; third,
the noise examples are removed; fourth, each remaining example is checked, and core
examples and borderline examples are oversampled with their respective methods;
lastly, the new synthetic minority examples are generated.
This work applied the J48 classification algorithm, the Java implementation of the C4.5
algorithm, which builds decision trees from a set of training data. The authors of
"Over-sampling algorithm for imbalanced data classification" applied it to four datasets,
two binary-class and two multi-class, to measure the efficiency of three over-sampling
algorithms: DSMOTE, SMOTE, and Borderline-SMOTE. The experimental results showed
that DSMOTE performs best. A drawback of the DSMOTE algorithm is that it is not very
practical, since some of its parameters need to be set manually.

The authors of [38] presented an experimental analysis of 35 real-world datasets using
several classification algorithms and sampling techniques. The sampling techniques are
random undersampling (RUS), random oversampling (ROS), one-sided selection (OSS),
cluster-based oversampling (CBOS), Wilson's editing (WE), SMOTE (SM), and
borderline-SMOTE (BSM). The classification algorithms are C4.5, the k-nearest-neighbors
classifier, the Naive Bayes (NB) classifier, the multilayer perceptron (MLP) learner, the
logistic regression (LR) learner, the random forest (RF) classifier, and the support vector
machine (SVM) learner.
The area under the ROC curve (AUC), geometric mean (G), F-measure (F), and
accuracy (Acc) are used to evaluate the performance of the algorithms. The AUC
measures the ability of the classifier to separate the minority and majority examples,
while G, F, and Acc use an implicit classification threshold of 0.5: for example, if the
posterior probability of the minority class is greater than 0.5, the example is classified as
belonging to the minority class.
The results of the conducted experiments showed that the advantage gained from
sampling depends on numerous factors. Different types of sampling work well with
different algorithms; for example, RUS worked well for C4.5D and RF, while ROS
worked well with LR. Moreover, the performance measure being used heavily affects the
apparent value of sampling; for example, AUC generated different results than the other
performance measures G, F, and Acc.
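The kind of factor analysis described above can be reproduced on a small scale with off-the-shelf tools. The sketch below (assuming the scikit-learn and imbalanced-learn packages; it is not the experimental setup of [38], and the dataset and parameters are illustrative assumptions) pairs three of the sampling techniques with two learners and compares their cross-validated AUC.

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from imblearn.over_sampling import RandomOverSampler, SMOTE
from imblearn.pipeline import Pipeline
from imblearn.under_sampling import RandomUnderSampler

X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)
samplers = {"RUS": RandomUnderSampler(random_state=0),
            "ROS": RandomOverSampler(random_state=0),
            "SM": SMOTE(random_state=0)}
learners = {"tree": DecisionTreeClassifier(random_state=0),   # stand-in for C4.5
            "LR": LogisticRegression(max_iter=1000)}

for s_name, sampler in samplers.items():
    for l_name, learner in learners.items():
        pipe = Pipeline([("sampler", sampler), ("clf", learner)])  # resample inside each fold only
        auc = cross_val_score(pipe, X, y, cv=5, scoring="roc_auc").mean()
        print(s_name, "+", l_name, "AUC =", round(auc, 3))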

The authors of [39] proposed two algorithms, borderline-SMOTE1 and borderline-
SMOTE2, based on the synthetic minority over-sampling technique (SMOTE). SMOTE
oversamples the minority class by generating synthetic examples throughout the class;
unlike that algorithm, borderline-SMOTE1 and borderline-SMOTE2 focus on
oversampling and strengthening only the borderline examples of the minority class.
The procedure of borderline-SMOTE1 is as follows. First, for every example in the
minority class, compute its m nearest neighbors; the number of majority examples among
these m neighbors is denoted m′ (0 ≤ m′ ≤ m). Second, if m′ = m the minority class example
is regarded as noise, if m/2 ≤ m′ < m it is regarded as being in danger, and if 0 ≤ m′ < m/2
it is regarded as safe. Third, the examples in the danger set are the borderline examples
of the minority class, and for every example in the danger set its k nearest neighbors from
the minority class are computed. Lastly, synthetic minority class examples are produced
from the examples in the danger set: their number is s × num, where num is the number of
examples in the danger set and s is an integer between 1 and k. Since the new synthetic
examples are produced along the line between the borderline examples of the minority
class and their nearest neighbors of the same class, the borderline examples are thereby
strengthened.
The authors of [39] applied the C4.5 classification algorithm, which builds decision
trees, transforms them into a collection of rules, and prunes each rule by removing any
preconditions whose removal improves its accuracy. They used it on four datasets, one of
which is simulated while the other three are from UCI. The experimental results, obtained
after running the four oversampling algorithms borderline-SMOTE1, borderline-SMOTE2,
SMOTE, and random over-sampling (ROS), showed that the performance of
borderline-SMOTE1 and borderline-SMOTE2 is better.
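Both borderline variants are available in the imbalanced-learn package, which provides a BorderlineSMOTE implementation based on [39]. The short sketch below (the dataset and parameter values are illustrative assumptions) shows how the two variants change the class counts.

from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import BorderlineSMOTE

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
print("original:", Counter(y))
for kind in ("borderline-1", "borderline-2"):
    sm = BorderlineSMOTE(kind=kind, k_neighbors=5, m_neighbors=10, random_state=0)
    X_res, y_res = sm.fit_resample(X, y)   # only the danger (borderline) examples seed new points
    print(kind, ":", Counter(y_res))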

• Algorithm-Level Approach
Algorithm-level approaches adapt existing learning algorithms to mine data with
skewed distributions and to alleviate their bias towards majority objects (cf. result of
Section 2.5.2) [4].
Several approaches that fall under the algorithm-level approach have been proposed by
Guo and Viktor [33], Ojha et al. [40], and Ertekin et al. [41].

In their study, Guo and Viktor [33] use DataBoost-IM, which was considered one of the
newest classification methods at that time. DataBoost-IM takes advantage of the boosting
process and combines it with data generation, resulting in improved predictive accuracy
for both the majority and the minority classes.
In this study, the researchers followed a different path from others who used the same
approach. The DataBoost-IM algorithm was compared with the C4.5 decision tree and the
SMOTEBoost, AdaBoostM1, AdaCost, and CSB2 boosting algorithms to evaluate its
performance. The performance was evaluated using the TP and FP rates, F-measures,
G-mean, and overall accuracy. Sixteen data sets were used in this study, each with a
different size, class distribution, and feature characteristics. Weka was used to carry out
the trials, and the findings of five conventional 10-fold cross-validation studies were
averaged: each data set was divided equally into ten parts, each part in turn was used as a
test set, and the classifier was trained on the remaining nine parts. A stratified sampling
procedure was used to keep the proportion of the distinct classes the same in each fold,
and an ensemble of ten component classifiers was developed for each fold.
A comparison of the G-mean and F-measures indicates that DataBoost-IM and the other
strategies produce comparable results. The findings against severely unbalanced data sets,
in particular, are promising. Notably, the DataBoost IM technique gives the best results for
minority and majority class F-measures for several severely imbalanced data sets.
According to the results, DataBoost-IM does not sacrifice one class for another. Instead, it
seeks to create an ensemble that achieves good results against both.
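DataBoost-IM itself is not reimplemented here, but the evaluation protocol it was judged with (stratified 10-fold cross-validation with per-class F-measures and the G-mean) can be sketched as follows, with an ordinary AdaBoost ensemble standing in for the boosted classifier; the dataset and parameter choices are illustrative assumptions.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import f1_score, recall_score
from sklearn.model_selection import StratifiedKFold

X, y = make_classification(n_samples=1500, weights=[0.9, 0.1], random_state=0)
clf = AdaBoostClassifier(n_estimators=10, random_state=0)    # stand-in boosted ensemble
f_min, f_maj, g_means = [], [], []
for train, test in StratifiedKFold(n_splits=10, shuffle=True, random_state=0).split(X, y):
    clf.fit(X[train], y[train])
    pred = clf.predict(X[test])
    f_min.append(f1_score(y[test], pred, pos_label=1))       # minority-class F-measure
    f_maj.append(f1_score(y[test], pred, pos_label=0))       # majority-class F-measure
    tpr = recall_score(y[test], pred, pos_label=1)
    tnr = recall_score(y[test], pred, pos_label=0)
    g_means.append(np.sqrt(tpr * tnr))                       # G-mean of the fold
print(np.mean(f_min), np.mean(f_maj), np.mean(g_means))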

In the Elastic-InfoGAN approach [40], the authors improve InfoGAN in two
straightforward and intuitive ways. First, they change the way the latent variables are
drawn from the latent distribution. The second improvement stems from an observed
failure case of InfoGAN, in which the model struggles to generate consistent images from
the same category for a latent dimension. InfoGAN is an unsupervised learning algorithm
that maximizes the mutual information between the hidden variables and the generated
examples. The authors compared their approach to baselines such as Uniform InfoGAN
and JointVAE: Uniform InfoGAN is the original InfoGAN, which uses a fixed, uniform
categorical distribution, while JointVAE is a VAE-based baseline that combines discrete
and continuous factors in a single model. They demonstrate the advantage of their
Elastic-InfoGAN model for disentangling imbalanced datasets such as (1) MNIST, (2) 3D
Cars, (3) 3D Chairs, (4) ShapeNet, and (5) YouTube-Faces. In general, Elastic-InfoGAN
generates images that are more consistent within a latent code than Uniform InfoGAN and
JointVAE. For instance, in Figure 10, their model produces the same person's face more
often than the baselines do, whereas Uniform InfoGAN and JointVAE, by incorrectly
assuming a uniform prior distribution, often mix up identities within the same categorical
code. The authors' hope with this paper is to smooth the path for unsupervised
learning-based methods to work well on class-imbalanced data.

Figure 10: Image generations on YouTube-Faces [40].

[41] is another paper that focuses on using active learning to overcome the problem of
imbalanced data classification. The authors suggest using Active Learning (AL), a
technique that is beneficial when dealing with large amounts of data, to label examples by
choosing the most informative ones. The method uses a Support Vector Machine (SVM)
to determine which example is the most informative by selecting the examples closest to
the margin (hyperplane), because the data within the margin is less imbalanced than the
rest. After the examples closest to the margin are determined, the selected examples are
added to the training set and the model is retrained. The original form of AL can be very
slow and expensive to apply, since the entire dataset has to be searched to find the most
informative example. This led the researchers to design a new method, called the
"59 trick," that does not require going through the entire dataset: their suggested method
picks 59 random examples and chooses the one closest to the margin at each learning step.
This method (AL with a random pool) proved to give good prediction performance
compared to the classical form of AL, and it was also ten times faster.
Additionally, this paper uses AL with early stopping, meaning that training is stopped
when the example selected by AL is no closer to the hyperplane than any existing support
vector, since beyond that point the margin no longer provides informative examples. The
criterion can be applied during the AL training process by counting the support vectors:
when their number becomes stable, all possible support vectors have been selected.
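A rough sketch of this margin-based selection with the random-pool trick is given below (assuming scikit-learn; the seed-set sizes, labeling budget, and dataset are illustrative assumptions rather than the authors' setup): at every step, 59 random candidates are drawn and the one closest to the current hyperplane is added to the labeled set.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=3000, weights=[0.95, 0.05], random_state=0)
pos, neg = np.where(y == 1)[0], np.where(y == 0)[0]
labeled = list(rng.choice(pos, 5, replace=False)) + list(rng.choice(neg, 45, replace=False))
unlabeled = [i for i in range(len(X)) if i not in set(labeled)]

svm = SVC(kernel="linear")
for step in range(100):                                   # fixed labeling budget for the sketch
    svm.fit(X[labeled], y[labeled])
    pool = rng.choice(unlabeled, size=59, replace=False)  # the "59 trick" random pool
    margins = np.abs(svm.decision_function(X[pool]))
    chosen = pool[int(np.argmin(margins))]                # candidate closest to the hyperplane
    labeled.append(chosen)
    unlabeled.remove(chosen)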
To evaluate the results of their methods, the researchers used three metrics commonly
applied to imbalanced data, namely G-means, AUC, and PRBEP. The research was
conducted on various real-world datasets: eight of the popular Reuters-21578 text-mining
categories, five categories from the CiteSeer data, and four datasets from the UCI Machine
Learning Repository. All these datasets share a common feature: a wide range of data
imbalance ratios.
Initially, the researchers compared the performance of AL (random pool) with AL (full
search), which showed that the suggested method is much faster. They then compared
early-stopped AL with the batch algorithm, which showed that AL can achieve prediction
performance similar to or higher than the batch algorithm. Next, applying AL (with early
stopping) and the resampling method (RC) to the datasets showed that the AUC of RC was
lower than that of AL, confirming that AL is faster while using fewer but more informative
examples than RC.
Additionally, the research examined how the number of support vectors changes in AL
and in Random Sampling (RS). With RS, which selects examples at random, the support
vector imbalance ratio converges toward the data imbalance ratio within the margin. In
contrast, AL selects only examples close to the margin at each learning step, where the
imbalance ratio is lower. Notably, by the end of learning, the two methods reached similar
support vector imbalance ratios, but AL reached that ratio at much earlier steps of learning.
Lastly, SMOTE was also compared to AL; SMOTE took much longer, which slows
training on larger datasets. Overall, AL proved better than the other resampling methods at
processing imbalanced data.

• Hybrid Approach
At the hybrid level, the advantages of algorithm-level and data-level approaches are
combined; in other words, the approaches are combined to gain their advantages and
reduce their disadvantages [4].
The studies [42], [14], [43], and [44] propose solutions that address imbalanced data
using the hybrid approach.
In Wang's paper [42], a new hybrid sampling SVM strategy is proposed to handle the
imbalance problem in datasets. The proposed technique combines undersampling, which
excludes certain examples from the majority class, with oversampling, which
progressively generates more positive examples.
Six datasets from KEEL are used in the trials, each with a different degree of imbalance:
Letter4, Vowel, Glass7, Abalone7, Yeast, and Cmc2. The researchers analyze the
performance of the imbalanced data classification techniques using a variety of measures,
including the accuracy rate, F-measure, geometric mean (G-mean), and AUC; the
F-measure and G-mean are used to measure the performance of the presented method and
compare it with the other suggested methods. Results are reported for four methods: the
proposed method, undersampling (random sampling), EasyEnsemble (with the C4.5
decision tree), and SMOTE with five neighbors. Ten-fold cross-validation is undertaken
in all of the studies.
On all six datasets, the proposed strategy was shown to have a higher F-measure than the
other tested methods, while on the Glass7 dataset EasyEnsemble surpasses the other
compared approaches. The results show that the proposed method can further improve the
F-measure in imbalanced learning.
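A pipeline in this spirit, combining partial oversampling of the minority class with undersampling of the majority class before an SVM, can be sketched with the imbalanced-learn package as follows; the sampling rates and dataset are illustrative assumptions and do not reproduce the exact procedure of [42].

from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.svm import SVC
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from imblearn.under_sampling import RandomUnderSampler

X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
hybrid = Pipeline([
    ("over", SMOTE(sampling_strategy=0.5, random_state=0)),                # oversample minority to a 1:2 ratio
    ("under", RandomUnderSampler(sampling_strategy=1.0, random_state=0)),  # undersample majority to 1:1
    ("svm", SVC(kernel="rbf", gamma="scale")),
])
scores = cross_validate(hybrid, X, y, cv=10, scoring=["f1", "roc_auc"])
print(scores["test_f1"].mean(), scores["test_roc_auc"].mean())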

In the paper by Anand et al. [14], the authors suggest an undersampling strategy that
targets border examples, which are notoriously difficult for all classifiers to cope with,
whereas instances that are not close to the decision boundary are relatively simple to
classify. Their proposed method selects the examples from the majority class that are most
likely to be closest to the decision border, and the weighted SVM is then integrated with
the proposed undersampling approach. They employ a two-loop cross-validation technique
to achieve trustworthy performance estimates and avoid overfitting: in the inner loop, with
fivefold cross-validation, the best values of the parameters are determined for the
classifier, and the outer loop, with tenfold cross-validation, is used to measure the
performance of the classifier generated from the inner loop.
In their simulation studies, they used four datasets: Micropred, Xwchen, Active-site, and
Cysteine. The ratio of positive to negative examples varied from 1:9 to 1:100 across the
datasets, and there were also notable differences in the overall numbers of examples and
features. The results for the proposed algorithm were compared with the best previously
available results for the weighted SVM and SMOTE algorithms, and the performance was
evaluated using the average G-mean, sensitivity, specificity, and average overall accuracy.
The proposed approach shows consistent performance compared with the traditional
weighted SVM, which uses all of the data; sensitivity and specificity are used to
demonstrate this. On the Micropred dataset, the proposed approach improves on the
previous best performance in terms of G-mean and sensitivity, by 6% and 10%,
respectively. For all of the datasets, the weighted SVM showed the best specificity, but on
the Xwchen and Active-site datasets it failed in terms of sensitivity and G-mean, both of
which were improved by the proposed method. On the Active-site and Cysteine data, the
proposed technique likewise increased sensitivity and G-mean.

The authors in [43] utilize three ensemble models (UnderBagging, OverBagging, and
SMOTEBagging) to investigate how diversity influences classification on eight UCI data
sets; diversity refers to how differently the classifiers decide on the same problem.
According to the findings, recall is greatly influenced by diversity: greater diversity leads
to better recall for minority classes but worse recall for majority classes. The tendency of
G-mean and F-measure is determined by both classifier accuracy and diversity; in their
experiments, the best F-measure and G-mean values emerge at medium accuracy and
medium diversity rather than at high accuracy and low diversity. Their multi-class setting
is more flexible and therefore well suited to diversity analysis. According to the findings,
diversity has a similar influence on each class in the two-class and multi-class cases;
however, the impact observed in the multi-class case is lower because of the drop in the
imbalance rate. Eventually, SMOTE adds variability to the ensemble system on the
multi-class data sets, improving both the overall performance (G-mean) and the degree of
diversity. Only two multi-class data sets are investigated in this research, which is
considered enough for exploring diversity.

The authors in [44] evaluated the performance of the undersampling approaches
proposed in their manuscript on real datasets as well as synthetic datasets. The precision
rate P, recall rate R, and F-measure were used to determine the accuracy of classifying the
minority class. Their methods are compared to AT, RT, and NearMiss-2 with a
neural-network classifier: AT does not select examples when training classifiers and uses
all of them; RT is one of the most common random undersampling methods, randomly
selecting samples from the majority class; and NearMiss-2 chooses majority class
examples based on their proximity to the three farthest minority class examples. The
SBC, SBCNM-1, SBCNM-2, SBCNM-3, SBCMD, and SBCMF based undersampling
methods are used to evaluate the effectiveness of their approaches. A classifier with high
precision tends to have low recall, so the two criteria trade off against each other and
neither can be used alone to assess a classifier's performance. This motivates the
F-measure, given in the expression below, which combines the precision and recall rates.
$$\text{MI's } F\text{-measure} = \frac{2 \times P \times R}{P + R} \qquad (17)$$
where P is the precision rate and R is the recall rate. A dataset can be separated into
i clusters, represented as DSi, and DSi with j% exceptional examples and k% disordered
examples is described as DSiEjDk. Figure 11 illustrates the experimental outcomes: the
authors generated several synthetic datasets DSiE10D20, where i is in the range of 2 to 16,
and MI's F-measure for SBC is higher than for the other methods on average. In addition,
when the percentages of exceptional and disordered examples are increased to 50% and
60%, respectively, the experimental results shown in Figure 12 indicate that SBCMD has
the highest F-measure on both synthetic datasets and is the most stable method. Although
RT is also a stable method in the experiments, SBCMD performs better than RT in most
cases.

Figure 11: Dataset DSiE10D20 [44].

Figure 12: Dataset DSiE50D60 [44].

| Research Title | Define the Classification Problem | Domains | Algorithm's Name | Approach | Performance Metrics | Challenges Encountered |
| --- | --- | --- | --- | --- | --- | --- |
| A Hybrid Sampling SVM Approach to Imbalanced Data Classification [42] | - | ✓ | Hybrid Sampling SVM | Hybrid Approach | F-measure, G-mean, accuracy, AUC | - |
| A Multiple Resampling Method for Learning from Imbalanced Data Sets [29] | ✓ | ✓ | Combination scheme | Data-Level Approach | Error rate | - |
| An Approach for Classification of Highly Imbalanced Data Using Weighting and Under Sampling [14] | ✓ | ✓ | Undersampling approach with weighted-SVM | Hybrid Approach | Accuracy, sensitivity, specificity, G-mean | - |
| Borderline-SMOTE: A New Over-Sampling Method in Imbalanced Data Sets Learning [39] | ✓ | ✓ | Borderline-SMOTE | Data-Level Approach | F-value | - |
| Cluster-Based Under-Sampling Approaches for Imbalanced Data Distribution [44] | - | ✓ | Under-sampling based on clustering | Hybrid Approach | Precision, recall, F-measure | - |
| Diversity Analysis on Imbalanced Data Sets by Using Ensemble Models [43] | ✓ | ✓ | Ensemble Model | Hybrid Approach | F-measure, G-mean | ✓ |
| Elastic-InfoGAN: Unsupervised Disentangled Representation Learning in Class-Imbalanced Data [40] | - | - | Elastic-InfoGAN | Algorithm-Level Approach | - | - |
| Experimental Perspectives on Learning from Imbalanced Data [38] | - | - | - | Data-Level Approach | AUC, G-mean, F-measure, accuracy | ✓ |
| Learning from Imbalanced Data Sets with Boosting and Data Generation: The DataBoost-IM Approach [33] | ✓ | ✓ | DataBoost-IM | Algorithm-Level Approach | F-measure, G-mean, accuracy | ✓ |
| Learning on the Border: Active Learning in Imbalanced Data Classification [41] | ✓ | ✓ | Active Learning with SVM | Algorithm-Level Approach | G-mean | - |
| Over-Sampling Algorithm for Imbalanced Data Classification [37] | ✓ | ✓ | DSMOTE | Data-Level Approach | Precision, recall, F-value | ✓ |

Table 2: Related Works Summary.

Chapter 4: Methodology
The research's goal is to review the performance of several approaches previously
proposed to address the issue of imbalanced data on a variety of datasets. Ultimately, we
will select the best algorithm to solve that issue.
In the present chapter, we identify three models, the density-based synthetic minority
over-sampling technique (DSMOTE), the Combination model, and Undersampling
techniques combined with a weighted SVM, and present an overview of each.

4.1 Models
4.1.1 DSMOTE model
Inspired by borderline-SMOTE, the authors of [37] proposed DSMOTE, a combination
of their optimized version of DBSCAN and SMOTE. The optimized DBSCAN divides the
minority class into three parts, core examples, borderline examples, and noisy examples;
it then removes the noisy examples, and both the core and the borderline examples are
oversampled. SMOTE improves learning from imbalanced data sets by synthetically
oversampling the borderline and core examples of the minority class. DSMOTE utilizes
the k-nearest neighbors (KNN) algorithm, which we explained earlier (cf. Chapter 2.1). In
addition to removing the limitation of linear interpolation only within the positive
examples, DSMOTE can also eliminate the noise in the dataset. The input of this model is
the dataset, Eps (the radius of the neighborhood around a point, i.e., the junction of the two
clusters), the minimum number of points required to form a region for a cluster, and the
number of minority examples; the output is a set of synthetic minority class examples.
Some parameters of DSMOTE need to be set manually to obtain better performance.

Figure 13: Flow Chart of DSMOTE [37].
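A simplified sketch of this idea is given below, assuming scikit-learn: it uses the standard DBSCAN (not the authors' optimized version) to split the minority class into core, borderline, and noise examples, discards the noise, and applies SMOTE-style interpolation to the remaining points. The function name and parameters are illustrative assumptions, and, as noted above, Eps and the minimum number of points must be tuned manually.

import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors

def dsmote_like(X_min, eps=0.5, min_pts=5, k=5, n_new=100, seed=0):
    """Split the minority class with DBSCAN, drop noise, then interpolate new points."""
    rng = np.random.default_rng(seed)
    db = DBSCAN(eps=eps, min_samples=min_pts).fit(X_min)   # Eps and MinPts must be tuned manually
    core = np.zeros(len(X_min), dtype=bool)
    core[db.core_sample_indices_] = True
    noise = db.labels_ == -1
    borderline = ~core & ~noise
    print("core:", core.sum(), "borderline:", borderline.sum(), "noise:", noise.sum())
    keep = X_min[~noise]                                   # the full method treats core and borderline
    nn = NearestNeighbors(n_neighbors=k + 1).fit(keep)     # differently; this sketch oversamples both alike
    _, idx = nn.kneighbors(keep)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(keep))
        j = idx[i][rng.integers(1, k + 1)]                 # one of the k nearest minority neighbors
        gap = rng.random()
        synthetic.append(keep[i] + gap * (keep[j] - keep[i]))  # SMOTE-style linear interpolation
    return np.asarray(synthetic)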

4.1.2 Combination model


Both oversampling and undersampling are valuable techniques when dealing with
imbalanced data sets, but each yields different results depending on the rate of imbalance.
Combining the two techniques is therefore highly advantageous, and we have chosen to
use a hybrid method that combines them.
The combination model involves three levels. At the output level, the results of the
oversampling expert and the undersampling expert are combined. At the expert level, each
expert combines the results of 10 classifiers from the classifier level. At the classifier
level, learners are trained on data sets sampled at varying rates (the figure below illustrates
the architecture of the combined model). More precisely, the 10 oversampling classifiers
and the 10 undersampling classifiers are trained on datasets that have been oversampled or
undersampled at rates ranging from 0% to 100%. For each test point, the combination
model allows a different classifier to decide on that point, which is risky: if a single
classifier is unreliable, the result for the given data point will be equally unreliable. An
elimination process addresses this problem by preventing any unfit classifier from
participating in the decision at the classifier level. Using C4.5, the elimination program
applies tenfold cross-validation to the original imbalanced training data; each individual
classifier of the combination scheme whose error rate is lower than the average tenfold
cross-validation error is selected, the learning systems from which these classifiers were
derived are trained again without cross-validation, and the remaining classifiers are
eliminated [29].

Figure 14: Illustration of combination model [29].
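A condensed sketch of this scheme is given below, assuming scikit-learn and imbalanced-learn, with decision trees standing in for C4.5; the resampling rates and dataset are illustrative assumptions. Classifiers are trained on data resampled at several rates, members whose cross-validated error exceeds the average are eliminated, and a test point is labelled positive if any surviving classifier in either expert votes positive.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier
from imblearn.over_sampling import RandomOverSampler
from imblearn.pipeline import Pipeline as ImbPipeline
from imblearn.under_sampling import RandomUnderSampler

X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

def build_expert(sampler_cls, rates):
    pipes = [ImbPipeline([("s", sampler_cls(sampling_strategy=r, random_state=0)),
                          ("t", DecisionTreeClassifier(random_state=0))]) for r in rates]
    errors = [1 - cross_val_score(p, X_tr, y_tr, cv=10).mean() for p in pipes]
    keep = [p for p, e in zip(pipes, errors) if e <= np.mean(errors)]   # elimination step
    return [p.fit(X_tr, y_tr) for p in keep]                            # retrain survivors without CV

rates = np.linspace(0.15, 1.0, 10)              # 10 resampling rates up to full balance
experts = [build_expert(RandomOverSampler, rates), build_expert(RandomUnderSampler, rates)]
expert_votes = [np.any([m.predict(X_te) for m in expert], axis=0) for expert in experts]
y_pred = np.any(expert_votes, axis=0).astype(int)   # output level: positive if either expert says positive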

4.1.3 Undersampling techniques combined with weighted-SVM

Based on what the authors of [14] present in their research, the undersampling technique
targets one of the hardest issues that all classifiers deal with: the boundary examples. The
majority class examples most likely to lie near the decision boundary are the ones chosen
by this undersampling technique, and the proposed technique is then combined with the
weighted SVM. Moreover, undersampling in general was already described briefly
(cf. result in Chapter 2.5.1).
• The undersampling technique
The closer the examples from the two classes are to each other, the more likely any
classifier is to misclassify them, and the decision boundary is likely to lie near these
"boundary examples." This property is used by the undersampling technique when
selecting negative examples. If examples were instead selected at random, their location
would not be taken into account; selecting examples far from the decision boundary, or
examples that do not lie near the actual decision boundary, would position the optimal
separating hyperplane incorrectly and result in incorrect classification. For this reason,
multiple rounds of random undersampling would be needed for an SVM trained by random
sampling to assess its true predictive performance. Based on the property discussed above
[14], the undersampling technique is as follows:
To determine the distance between each negative example and each positive example,
the weighted Euclidean distance is computed. Fisher's score (Eq. 18) assigns a weight to
every feature; the Fisher ratio $FR(x_i)$ is calculated for each feature $x_i$ as

$$FR(x_i) = \frac{(\bar{x}_{i,p} - \bar{x}_{i,n})^2}{\hat{\sigma}_{i,p}^2 + \hat{\sigma}_{i,n}^2} \qquad (18)$$

where $\bar{x}_{i,p}$ is the mean value of feature $x_i$ over the positive examples,
$\bar{x}_{i,n}$ is its mean value over the negative examples, and $\hat{\sigma}_{i,p}^2$ and
$\hat{\sigma}_{i,n}^2$ are the variances of feature $x_i$ for the positive and negative
examples, respectively. For any two examples $X_1 = (x_{11}, \ldots, x_{1i}, \ldots, x_{1n})$
and $X_2 = (x_{21}, \ldots, x_{2i}, \ldots, x_{2n})$ [14], the weighted Euclidean distance is
calculated as

$$D(X_1, X_2) = \sqrt{\sum_{i} FR(x_i) \times (x_{1i} - x_{2i})^2} \qquad (19)$$

For each positive example, the negative examples are sorted by their weighted distance
to it. A user-defined number of negative examples, corresponding to the desired percentage
of negative examples per positive example, is then selected for each positive example. At
this stage, repeatedly selecting the same negative example is avoided: if a negative
example has already been chosen, the next closest available negative example is selected
instead [14].
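The selection step described above (Eqs. 18 and 19) can be sketched with NumPy as follows; X_pos and X_neg are assumed to hold the positive and negative training examples, and per_positive is the user-defined number of negatives kept per positive example.

import numpy as np

def fisher_weights(X_pos, X_neg):
    num = (X_pos.mean(axis=0) - X_neg.mean(axis=0)) ** 2
    den = X_pos.var(axis=0) + X_neg.var(axis=0) + 1e-12    # epsilon only as a numerical safeguard
    return num / den                                        # Eq. 18, one weight per feature

def select_negatives(X_pos, X_neg, per_positive=3):
    fr = fisher_weights(X_pos, X_neg)
    chosen = set()
    for p in X_pos:
        d = np.sqrt((fr * (p - X_neg) ** 2).sum(axis=1))    # weighted Euclidean distance, Eq. 19
        picked = 0
        for j in np.argsort(d):                             # nearest negatives first
            if j not in chosen:                             # never reuse an already selected negative
                chosen.add(int(j))
                picked += 1
            if picked == per_positive:
                break
    return X_neg[sorted(chosen)]

# The reduced negative set is then merged with the positives and passed to a
# weighted SVM, e.g. scikit-learn's SVC(class_weight="balanced").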

• Approach design
The model's design uses a two-loop cross-validation technique to avoid overfitting and
to obtain reliable performance estimates. A classifier is constructed in the inner loop,
which uses fivefold cross-validation, and performance is estimated in the outer loop using
tenfold cross-validation. The inner loop is also used to select the most suitable model, by
determining the best classifier parameters [14].
In summary, the approach proceeds as follows [14]:
1. For the outer loop, the data are divided into K stratified partitions.
2. For the inner loop, for each fold i: the remaining (k − 1) parts are used as training data
and the i-th part as test data; in the training data, the weighted Euclidean distance from
every negative example to all positive examples is calculated; negative examples are then
selected, and together with the positive examples from the training data they form the
modified training data, which is split into L parts; the best parameters for the model are
chosen according to the best average G-mean over the L-fold experiment.
3. The modified training data and the selected best parameters are used to build a
classifier model.
4. The performance measures on the i-th test data are obtained.
5. Steps 2 through 4 are repeated for every fold.
6. Finally, the performance measures over the k folds are computed.

Figure 15: The approach design for Undersampling techniques combined with weighted-SVM.
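The two-loop structure itself can be sketched with scikit-learn's nested cross-validation, as below; the per-positive undersampling step is omitted, balanced accuracy stands in for the G-mean selection criterion, and the parameter grid and dataset are illustrative assumptions.

from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
inner = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)    # parameter-selection loop
outer = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)   # performance-estimation loop
grid = GridSearchCV(SVC(class_weight="balanced"),                    # weighted SVM
                    {"C": [0.1, 1, 10], "gamma": ["scale", 0.1]},
                    cv=inner, scoring="balanced_accuracy")
scores = cross_val_score(grid, X, y, cv=outer, scoring="balanced_accuracy")
print(scores.mean())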
Chapter 5: Experimental Design
One of the objectives of this research is to determine the most appropriate method for
handling imbalanced datasets. This can be accomplished by first implementing the three
previously described models, DSMOTE, Combination, and Undersampling techniques
combined with weighted-SVM, and then designing and conducting controlled experiments
on different data sets to test the performance of these models. Once the results have been
analyzed, we will select the most effective model.

5.1 Datasets
In order to conduct the experiments, three different datasets were selected: Fraud Credit
Card Detection¹, the Melanoma² dataset, and depressed-tweet detection. The first two have
already been collected and are available online, while for the last one we will build a
labelled tweet dataset from scratch.

5.1.1 Fraud Credit Card Detection


The Credit Card Fraud Detection dataset was obtained from Kaggle. It is a numerical
dataset collected in 2013 containing 284,807 transactions made by credit cards, 492 of
which are frauds; the dataset is therefore clearly imbalanced, with the minority class
(frauds) representing 0.172% of all transactions. Credit Card Fraud Detection has two
classes, a fraudulent class and a genuine class. Unfortunately, the original features are not
available for privacy reasons and were replaced with numerical input variables that are the
result of a Principal Component Analysis (PCA) transformation. The only untransformed
features are 'Time', which refers to the time elapsed between a transaction and the first
transaction in the dataset, and 'Amount', which refers to the amount of money transferred;
the 'Class' feature takes the value 1 for fraudulent transactions and 0 otherwise. The
'Amount' feature can be useful in cost-sensitive learning situations.

¹ https://www.kaggle.com/datasets/mlg-ulb/creditcardfraud
² https://www.kaggle.com/datasets/drscarlat/melanoma
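Before any resampling, the degree of imbalance can be verified directly from the downloaded file; the sketch below assumes pandas is available and that the Kaggle CSV has been saved locally as creditcard.csv.

import pandas as pd

df = pd.read_csv("creditcard.csv")            # columns: Time, V1..V28, Amount, Class
counts = df["Class"].value_counts()           # 0 = genuine, 1 = fraud
print(counts)
print("fraud ratio: {:.3%}".format(counts[1] / len(df)))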

5.1.2 Melanoma
The melanoma dataset was obtained from [45] in 2019. This imbalanced dataset contains
10,015 images of skin lesions, of which 1,113 are melanoma and 8,902 are not melanoma.
[45] applied data augmentation to the melanoma class so that its size became similar to
that of the not-melanoma class; the previously imbalanced 10k-image set thus became a
dataset of 17,805 images, 8,902 of which are not melanoma and 8,903 of which are
melanoma. The melanoma dataset has two classes, melanoma and not melanoma, and the
dermoscopic images were curated and normalized by Alexander Scarlat in terms of
resolution, colors, and luminosity. The dataset has been used in several studies: the authors
of [45] used 10,682 images for training, 3,562 for validation, and 3,561 for testing, while
the authors of [46] randomly used 7,122 images of each class to compose the training set,
with the remaining images, 1,781 of the melanoma class and 1,780 of the not-melanoma
class, making up the test set. We will apply the three previously chosen models, DSMOTE,
Combination, and Undersampling techniques combined with weighted-SVM, to the
melanoma dataset before data augmentation.

5.1.3 Depressed Tweet


We will collect and prepare a dataset of Arabic tweets in order to examine whether
individuals express depressed feelings. This dataset will be extracted from Twitter based
on a set of keywords (such as "اكتئاب" [depression] and "حزين" [sad]), and it will
specifically exclude social media accounts related to tech, politics, etc. Each tweet will be
labelled with 1 or 0, where 1 indicates a DepressedTweet and 0 indicates a
NotDepressedTweet.

5.2 Performance measures


It is essential to assess the effectiveness of the models reviewed in Chapter 4. We
consider the accuracy, precision, recall, and error rate metrics to measure the performance
of our models in this research. For the performance evaluation metric equations, see
Section 2.6, Eqs. 6, 9, 10, and 11, respectively.

In summary, we will use DSMOTE, Combination, and Undersampling techniques
combined with weighted-SVM on three datasets: Fraud Credit Card Detection, Melanoma,
and Twitter Depression. For evaluating the models, the Accuracy, Recall, Error rate, and
Precision performance measures are applied to find the most suitable model for dealing
with imbalanced datasets.

Chapter 6: Conclusion
Learning from imbalanced data sets is a significant problem in machine learning and has
therefore attracted many studies, both theoretical and practical. Traditional data mining
methods, however, are unsatisfactory on such data. One direct way to handle the imbalance
problem is to balance the class distribution artificially, which can be accomplished by
under-sampling the majority class, over-sampling the minority class, or doing both; various
studies in the literature show the efficiency of these techniques in practice. At the same
time, some research indicates that explicitly re-balancing the class distributions has little
impact on the performance of the derived classifier, since certain learning settings are not
influenced by differences in class distribution. A better understanding of how class
distributions affect each phase of the learning process is therefore still needed, and such an
understanding of the fundamentals will enable more effective methods to be designed. The
objective of this research is to apply and compare several models on multiple datasets to
assess which model performs best.

References
[1] Haibo He and E. A. Garcia, ‘Learning from Imbalanced Data’, IEEE Trans. Knowl.
Data Eng., vol. 21, no. 9, pp. 1263–1284, Sep. 2009.
[2] S. Visa and A. Ralescu, ‘Issues in Mining Imbalanced Data Sets - A Review Paper’,
p. 7.
[3] G. Haixiang, L. Yijing, J. Shang, G. Mingyun, H. Yuanyue, and G. Bing, ‘Learning
from class-imbalanced data: Review of methods and applications’, Expert Syst. Appl., vol.
73, pp. 220–239, May 2017.
[4] B. Krawczyk, ‘Learning from imbalanced data: open challenges and future
directions’, Prog. Artif. Intell., vol. 5, no. 4, pp. 221–232, Nov. 2016.
[5] I. El Naqa, R. Li, and M. J. Murphy, Eds., Machine Learning in Radiation
Oncology. Cham: Springer International Publishing, 2015.
[6] J. Alzubi, A. Nayyar, and A. Kumar, ‘Machine Learning from Theory to
Algorithms: An Overview’, J. Phys. Conf. Ser., vol. 1142, p. 012012, Nov. 2018.
[7] O. Theobald, ‘Machine Learning For Absolute Beginners’, p. 128.
[8] K. Liakos, P. Busato, D. Moshou, S. Pearson, and D. Bochtis, ‘Machine Learning
in Agriculture: A Review’, Sensors, vol. 18, no. 8, p. 2674, Aug. 2018.
[9] Southwest Jiaotong University, China, I. Muhammad, Z. Yan, and Southwest
Jiaotong University, China, ‘SUPERVISED MACHINE LEARNING APPROACHES: A
SURVEY’, ICTACT J. Soft Comput., vol. 05, no. 03, pp. 946–952, Apr. 2015.
[10] M. I. Jordan and T. M. Mitchell, ‘Machine learning: Trends, perspectives, and
prospects’, Science, vol. 349, no. 6245, pp. 255–260, Jul. 2015.
[11] A. Burkov, ‘“All models are wrong, but some are useful.” — George Box’, p. 152.
[12] V. Nasteski, ‘An overview of the supervised machine learning methods’,
HORIZONS.B, vol. 4, pp. 51–62, Dec. 2017.
[13] E. García-Gonzalo, Z. Fernández-Muñiz, P. García Nieto, A. Bernardo Sánchez,
and M. Menéndez Fernández, ‘Hard-Rock Stability Analysis for Span Design in Entry-
Type Excavations with Learning Classifiers’, Materials, vol. 9, no. 7, p. 531, Jun. 2016.

[14] A. Anand, G. Pugalenthi, G. B. Fogel, and P. N. Suganthan, ‘An approach for
classification of highly imbalanced data using weighting and undersampling’, Amino
Acids, vol. 39, no. 5, pp. 1385–1391, Nov. 2010.
[15] T. Razzaghi, O. Roderick, I. Safro, and N. Marko, ‘Multilevel Weighted Support
Vector Machine for Classification on Healthcare Data with Missing Values’, PLOS ONE,
vol. 11, no. 5, p. e0155119, May 2016.
[16] J. Zhang, M. Zheng, J. Nan, H. Hu, and N. Yu, ‘A Novel Evaluation Metric for
Deep Learning-Based Side Channel Analysis and Its Extended Application to Imbalanced
Data’, IACR Trans. Cryptogr. Hardw. Embed. Syst., pp. 73–96, Jun. 2020.
[17] Y. Guo, Y. Liu, A. Oerlemans, S. Lao, S. Wu, and M. S. Lew, ‘Deep learning for
visual understanding: A review’, p. 22.
[18] J. M. Johnson and T. M. Khoshgoftaar, ‘Survey on deep learning with class
imbalance’, J. Big Data, vol. 6, no. 1, p. 27, Dec. 2019.
[19] Department of Electrical and Electronics Engineering, Firat University, Elazığ,
Turkey et al., ‘AN OVERVIEW OF POPULAR DEEP LEARNING METHODS’, Eur. J.
Tech., vol. 7, no. 2, pp. 165–176, Dec. 2017.
[20] I. Valova, C. Harris, T. Mai, and N. Gueorguieva, ‘Optimization of Convolutional
Neural Networks for Imbalanced Set Classification’, Procedia Comput. Sci., vol. 176, pp.
660–669, 2020.
[21] V. López, A. Fernández, S. García, V. Palade, and F. Herrera, ‘An insight into
classification with imbalanced data: Empirical results and current trends on using data
intrinsic characteristics’, Inf. Sci., vol. 250, pp. 113–141, Nov. 2013.
[22] V. Ganganwar, ‘An overview of classification algorithms for imbalanced datasets’,
vol. 2, no. 4, p. 6, 2012.
[23] A. Fernández, S. García, M. Galar, R. C. Prati, B. Krawczyk, and F. Herrera,
Learning from Imbalanced Data Sets. Cham: Springer International Publishing, 2018.
[24] S. Vluymans, ‘Learning from Imbalanced Data’, in Dealing with Imbalanced and
Weakly Labelled Data in Machine Learning using Fuzzy and Rough Set Methods, vol. 807,
Cham: Springer International Publishing, 2019, pp. 81–110.

[25] Y. Sun, M. S. Kamel, A. K. C. Wong, and Y. Wang, ‘Cost-sensitive boosting for
classification of imbalanced data’, Pattern Recognit., vol. 40, no. 12, pp. 3358–3378, Dec.
2007.
[26] P. K. Singh, A. K. Kar, Y. Singh, M. H. Kolekar, and S. Tanwar, Eds., Proceedings
of ICRIC 2019: Recent Innovations in Computing, vol. 597. Cham: Springer International
Publishing, 2020.
[27] H. Kaur, H. S. Pannu, and A. K. Malhi, ‘A Systematic Review on Imbalanced Data
Challenges in Machine Learning: Applications and Solutions’, ACM Comput. Surv., vol.
52, no. 4, pp. 1–36, Jul. 2020.
[28] Y. Sun, A. K. C. Wong, and M. S. Kamel, ‘CLASSIFICATION OF
IMBALANCED DATA: A REVIEW’, Int. J. Pattern Recognit. Artif. Intell., vol. 23, no.
04, pp. 687–719, Jun. 2009.
[29] A. Estabrooks, T. Jo, and N. Japkowicz, ‘A Multiple Resampling Method for
Learning from Imbalanced Data Sets’, Comput. Intell., vol. 20, no. 1, pp. 18–36, Feb. 2004.
[30] S. Kotsiantis, D. Kanellopoulos, and P. Pintelas, ‘Handling imbalanced datasets: A
review’, p. 13, 2006.
[31] ‘[No title found]’, presented at the 2008 International Conference on Advanced
Computer Theory and Engineering (ICACTE), Phuket, Thailand, 2008.
[32] J. Tanha, Y. Abdi, N. Samadi, N. Razzaghi, and M. Asadpour, ‘Boosting methods
for multi-class imbalanced data classification: an experimental review’, J. Big Data, vol.
7, no. 1, p. 70, Dec. 2020.
[33] H. Guo and H. L. Viktor, ‘Learning from Imbalanced Data Sets with Boosting and
Data Generation: The DataBoost-IM Approach’, p. 10.
[34] C. X. Ling and V. S. Sheng, ‘Cost-Sensitive Learning and the Class Imbalance
Problem’, p. 8.
[35] H. Chen, ‘Novel Machine Learning Approaches for Modeling Variations in
Semiconductor Manufacturing’, p. 97.
[36] O. Bulut and C. Desjardins, ‘Exploring, Visualizing, and Modeling Big Data with
R’, p. 154.
[37] X. Xu, W. Chen, and Y. Sun, ‘Over-sampling algorithm for imbalanced data
classification’, JSEE, vol. 30, no. 6, pp. 1182–1191, 2019.

[38] J. Van Hulse, T. M. Khoshgoftaar, and A. Napolitano, ‘Experimental perspectives
on learning from imbalanced data’, in Proceedings of the 24th international conference on
Machine learning - ICML ’07, Corvalis, Oregon, 2007, pp. 935–942.
[39] H. Han, W.-Y. Wang, and B.-H. Mao, ‘Borderline-SMOTE: A New Over-
Sampling Method in Imbalanced Data Sets Learning’, in Advances in Intelligent
Computing, vol. 3644, D.-S. Huang, X.-P. Zhang, and G.-B. Huang, Eds. Berlin,
Heidelberg: Springer Berlin Heidelberg, 2005, pp. 878–887.
[40] U. Ojha, K. K. Singh, C.-J. Hsieh, and Y. J. Lee, ‘Elastic-InfoGAN: Unsupervised
Disentangled Representation Learning in Class-Imbalanced Data’, p. 13.
[41] S. Ertekin, J. Huang, L. Bottou, and L. Giles, ‘Learning on the border: active
learning in imbalanced data classification’, in Proceedings of the sixteenth ACM
conference on Conference on information and knowledge management - CIKM ’07,
Lisbon, Portugal, 2007, p. 127.
[42] Q. Wang, ‘A Hybrid Sampling SVM Approach to Imbalanced Data Classification’,
Abstr. Appl. Anal., vol. 2014, pp. 1–7, 2014.
[43] S. Wang and X. Yao, ‘Diversity analysis on imbalanced data sets by using ensemble
models’, in 2009 IEEE Symposium on Computational Intelligence and Data Mining,
Nashville, TN, USA, Mar. 2009, pp. 324–331.
[44] S.-J. Yen and Y.-S. Lee, ‘Cluster-based under-sampling approaches for imbalanced
data distributions’, Expert Syst. Appl., vol. 36, no. 3, pp. 5718–5727, Apr. 2009.
[45] Md. F. Rasul, Md. F. Rasul, N. K. Dey, N. K. Dey, M. M. A. Hashem, and M. M.
A. Hashem, ‘A Comparative Study of Neural Network Architectures for Lesion
Segmentation and Melanoma Detection’, in 2020 IEEE Region 10 Symposium
(TENSYMP), Dhaka, Bangladesh, 2020, pp. 1572–1575.
[46] C. E. B. Sousa, L. B. G. Nascimento, C. M. S. Medeiros, and I. A. Trajano,
‘Decision Support Software for Melanoma Skin Cancer Detection (DECIME)’, p. 9.
