
The Class Imbalance Problem: Significance and Strategies

Nathalie Japkowicz
Faculty of Computer Science
DalTech/Dalhousie University
6050 University, Halifax, N.S.
Canada, B3H 1W5

Abstract: Although the majority of concept-learning systems previously designed usually assume that their training sets are well-balanced, this assumption is not necessarily correct. Indeed, there exist many domains for which one class is represented by a large number of examples while the other is represented by only a few. The purpose of this paper is 1) to demonstrate experimentally that, at least in the case of connectionist systems, class imbalances hinder the performance of standard classifiers, and 2) to compare the performance of several approaches previously proposed to deal with the problem.

(The author thanks Danny Silver and Afzal Upal for their very helpful comments on a draft of this paper.)

1 Introduction

As the field of machine learning makes a rapid transition from the status of "academic discipline" to that of "applied science", a myriad of new issues, not previously considered by the machine learning community, is coming to light. One such issue is the class imbalance problem: it arises in domains for which one class is represented by a large number of examples while the other is represented by only a few. (In this paper, we only consider the case of concept-learning.)

The class imbalance problem is of crucial importance since it is encountered by a large number of domains of great environmental, vital or commercial importance, and was shown, in certain cases, to cause a significant bottleneck in the performance attainable by standard learning methods that assume a balanced distribution of the classes. For example, the problem occurs and hinders classification in applications as diverse as the detection of oil spills in satellite radar images [5], the detection of fraudulent telephone calls [1] and in-flight helicopter gearbox fault monitoring [2].

To this point, there have only been a few attempts at dealing with the class imbalance problem ([7], [2], [6], [4], [1], [5]), and these attempts were mostly conducted in isolation. In particular, there has not been, to date, any systematic effort to link specific types of imbalances to the degree of inadequacy of standard classifiers. Furthermore, no comparison of the various methods proposed to remedy the problem has yet been performed.

The purpose of this paper is to address these two concerns in an attempt to unify the research conducted on this problem. In a first part, the paper concentrates on finding out what type of imbalance is most damaging for a standard classifier that expects balanced class distributions; in a second part, several implementations of three categories of methods previously proposed to tackle the problem are tested and compared on the domains of the first part.

The remainder of the paper is divided into four sections. Section 2 is a statement of the specific questions asked in this study. Section 3 describes the part of the study focusing on what types of class imbalance problems create difficulties for a standard classifier. Section 4 describes the part of the study designed to compare the three categories of approaches previously attempted and considered here, on the problems of Section 3. Sections 5 and 6 conclude the paper.

2 Questions of Interest

The study presented in this paper can be thought of as a first step in the investigation of the following two questions:

Question 1: What types of imbalances hinder the accuracy performance of standard classifiers?

Question 2: What approaches for dealing with the class imbalance problem are most appropriate?

These questions are important since their answers may suggest fruitful directions for future research. In particular, they may help researchers focus their inquiry on the particular type of solution found most promising, given the particular characteristics identified in their application domain.

Question 1 raises the issue of when class imbalances are damaging. While the studies previously mentioned identified specific domains for which an imbalance was shown to hurt the performance of certain standard classifiers, they did not discuss whether imbalances are always damaging, nor to what extent different types of imbalances affect classification performance. This paper takes a global stance and answers these questions in the context of the DMLP classifier on a series of artificial domains spanning a large combination of characteristics. (DMLP refers to the standard multi-layer perceptron trained to associate an output value of "1" with instances of the positive class and an output value of "0" with instances of the negative class [8].)

Question 2 considers three categories of approaches previously proposed by independent researchers for tackling the class imbalance problem. (In this study, we focus on "external" approaches, which do not bring any modifications to the classifier itself. The study of "internal" approaches, which bias the classifier in order to deal with class imbalances [7], has been left for future work.)

1. Methods in which the class represented by a small data set gets over-sampled so as to match the size of the other class [6].

2. Methods in which the class represented by the large data set is down-sized so as to match the size of the other class [4].

3. Methods that mostly ignore one of the two classes altogether, by using a recognition-based instead of a discrimination-based inductive scheme ([2], [5]).

This part of the study is aimed at finding out which approaches are most appropriate given certain specific domain conditions. In order to answer this question, each scheme was implemented using closely related methods, namely, various versions of Discrimination-based and Recognition-based MLP networks (DMLP and RMLP; RMLP is discussed in Section 4.1 below and in [2]), in an attempt to limit the amount of bias that could be introduced by different and unrelated learning paradigms. All the schemes were tested on the artificial domains previously generated to answer Question 1.

Note that although this study is restricted to artificial domains, it could easily be extended to real-world domains. In particular, text classification domains could be good test beds given the imbalanced nature of their data sets, the wide availability of text data, and the ease with which concept complexity can be controlled, e.g., by merging several categories of text data together in order to form a super-category.
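The first two categories of methods listed above amount to simple resampling operations on the training set. As a rough sketch (illustrative Python only, not code from the paper; the function names are mine):

```python
import random

def rand_oversample(small_class, large_class, seed=0):
    """Category 1: re-sample the small class at random, with
    replacement, until it matches the size of the large class."""
    rng = random.Random(seed)
    extra = [rng.choice(small_class)
             for _ in range(len(large_class) - len(small_class))]
    return small_class + extra

def rand_downsize(large_class, small_class, seed=0):
    """Category 2: discard elements of the large class at random
    until it matches the size of the small class."""
    rng = random.Random(seed)
    return rng.sample(large_class, len(small_class))
```

Both operations produce a balanced training set; the trade-off is duplicated minority examples versus discarded majority information. The third, recognition-based category requires a different learner altogether and is described in Section 4.1.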
3 When does a Class Imbalance Matter?

In order to answer Question 1, a series of artificial concept-learning domains was generated that varies along three different dimensions: the degree of concept complexity, the size of the training set, and the level of imbalance between the two classes. The standard classifier tested on these domains was a simple DMLP system such as the one described in [8]. This section first discusses the domain generation process and then reports the results obtained by DMLP on the various domains.

3.1 Domain Generation

For the experiments of this section, 125 domains were created with various combinations of concept complexity, training set size, and degree of imbalance. The generation method was inspired by Schaffer, who designed a similar framework for testing the effect of overfitting avoidance in sparse data sets [9]; however, the two data generation schemes present a number of differences.

In more detail, each of the 125 generated domains is one-dimensional, with inputs in the [0, 1] range associated with one of the two classes (1 or 0). The input range is divided into a number of regular intervals (i.e., intervals of the same size), each associated with a different class value. Contiguous intervals have opposite class values, and the degree of concept complexity corresponds to the number of alternating intervals present in the domain. Actual training sets are generated from these backbone models by sampling points at random (using a uniform distribution) from each of the intervals. The number of points sampled from each interval depends on the size of the domain as well as on its degree of imbalance. An example of a backbone model is shown in Figure 1.

[Figure 1: A Backbone Model of Complexity 3. The [0, 1] input range is divided into eight regular intervals with boundaries at 0, .125, .25, .375, .5, .625, .75, .875 and 1, alternately labeled + (class 1) and - (class 0).]

Five different complexity levels were considered (c = 1..5), where each level c corresponds to a backbone model composed of 2^c regular intervals. For example, the domains generated at complexity level c = 1 are such that every point whose input is in the range [0, .5) is associated with a class value of 1, while every point whose input is in the range (.5, 1] is associated with a class value of 0; at complexity level c = 2, points in intervals [0, .25) and (.5, .75) are associated with class value 1 while those in intervals (.25, .5) and (.75, 1] are associated with class value 0; etc., regardless of the size of the training set and its degree of imbalance. (In this paper, complexity is varied along a single, very simple dimension. Other, more sophisticated models could be used in order to obtain finer-grained results.)

Five training set sizes were considered (s = 1..5), where each size s corresponds to a training set of size round((5000/32) × 2^s). Since this training set size includes all the regular intervals in the domain, each regular interval is, in fact, represented by round(((5000/32) × 2^s)/2^c) training points (before the imbalance factor is considered). For example, at a size level of s = 1 and at a complexity level of c = 1, and before any imbalance is taken into consideration, intervals [0, .5) and (.5, 1] are each represented by 157 examples; if the size is the same but the complexity level is c = 2, then each of the intervals [0, .25), (.25, .5), (.5, .75) and (.75, 1] contains 78 training examples; etc.

Finally, five levels of class imbalance were also considered (i = 1..5), where each level i corresponds to the situation where each sub-interval of class 1 is represented by all the data it is normally entitled to (given c and s), but each sub-interval of class 0 contains only 1/(32/2^i)th (rounded) of its normally entitled data. This means that each of the sub-intervals of class 0 is represented by round((((5000/32) × 2^s)/2^c)/(32/2^i)) training examples. For example, for c = 1, s = 1 and i = 2, interval [0, .5) is represented by 157 examples and (.5, 1] is represented by 79; if c = 2, s = 1 and i = 3, then [0, .25) and (.5, .75) are each represented by 78 examples while (.25, .5) and (.75, 1] are each represented by 20; etc.

In the reported results, the number of testing points representing each sub-interval was kept fixed at 50. This means that all domains of complexity level c = 1 are tested on 50 positive and 50 negative examples; all domains of complexity level c = 2 are tested on 100 positive and 100 negative examples; etc.

3.2 Results for DMLP

The results for DMLP are displayed in Figure 2, which plots the error DMLP obtained for each combination of concept complexity, training set size, and imbalance level. Each plot in Figure 2 corresponds to a different size: the leftmost plot corresponds to the smallest size (s = 1) and the rightmost to the largest (s = 5). Within each plot, each cluster of five bars represents a concept complexity level: the leftmost cluster corresponds to the simplest concept (c = 1) and the rightmost to the most complex (c = 5). Within each cluster, finally, each bar corresponds to a particular imbalance level: the leftmost bar corresponds to the most imbalanced level (i = 1) and the rightmost bar to the most balanced level (i = 5, or no imbalance). The height of each bar represents the average percent error rate obtained by DMLP (over five runs on different domains generated from the same backbone model) for the complexity, class size and imbalance level this bar represents. Please note that all graphs indicate a large amount of variance in the results, despite the fact that all results were averaged over five different trials. The conclusions derived from these graphs thus reflect general trends rather than specific results. Because the scaling of the different graphs is not necessarily the same, lines were drawn at the 5, 10, 15, etc. percent error marks in order to facilitate the interpretation of the results.

[Figure 2: Experimental Results # 1. Five bar plots, (a) Size=1 through (e) Size=5; within each plot, clusters 1-5 correspond to complexity levels c = 1..5, and the vertical axis shows percent error from 0 to 40.]

Because the performance of DMLP depends upon the number of hidden units it uses, we experimented with 2, 4, 8 and 16 hidden units and reported only the results obtained with the optimal network capacity. Other default values were kept fixed: all the networks were trained by the Levenberg-Marquardt optimization method, the learning rate was set at 0.01, the networks were all trained for a maximum of 300 epochs or until the performance gradient descended below 10^-10, and the threshold for discrimination between the two classes was set at 0.5. This means that the results are reported a posteriori (after checking all the possible network capacities, the best results are reported). Given that each experiment is re-run 5 times, it is believed that the a posteriori view is sufficient, especially since all the systems are tested under the same conditions.

The results indicate several points of interest. First, no matter what the size of the training set is, linearly separable domains (domains of complexity level c = 1) do not appear sensitive to any amount of imbalance. Related to this observation is the fact that, as the degree of concept complexity increases (to a point where the problem still obtains an acceptable accuracy when the domain is balanced, i.e., with complexity levels of c <= 4 in our particular case), so does the system's sensitivity to imbalances. Indeed, the gap between the different imbalance levels seems to increase as the degree of concept complexity increases (again, up to c = 4) in all the plots of Figure 2.

Finally, it can also be observed that the size of the training set does not appear to be a factor in the size of the error-rate gap between balanced and imbalanced data sets. This suggests that the imbalance problem is a relative problem (i.e., it depends on the proportion of imbalance experienced by the domain) rather than a problem of intrinsic training set size (i.e., it is meaningless to say that a system will perform poorly on a domain that contains only n negative training examples without specifying the size of the positive class; note, however, that too small a class size is also inherently harmful, but this issue is separate from the one considered here).

4 A Comparison of Various Strategies

Having identified the domains for which a class imbalance does impair the accuracy of a regular classifier such as DMLP, this section now compares a few of the methodologies that have been proposed to deal with this problem. First, the various schemes used for this comparison are described, followed by a report on their performance. Rather than comparing specific methods, this study compares various kinds of methods. These methods are all implemented in the connectionist paradigm and are closely related, so as to minimize differences in performance caused by phenomena other than their particular methodology.

4.1 Schemes for Dealing with Class Imbalances

Re-Sampling. Two re-sampling methods were considered in this category. The first one, rand resamp, consists of re-sampling the small class at random until it contains as many examples as the other class. The second method, focused resamp, consists of re-sampling the small class only with data occurring close to the boundaries between the concept and its negation. A factor of .25 was chosen to represent closeness to the boundaries. (This factor means that for an interval [a, b], data considered close to the boundary are those in [a, a + .25 × (b-a)] and [a + .75 × (b-a), b]. If no data were found in these intervals after 500 random trials were attempted, then the data were sampled from the full interval [a, b], as in the rand resamp methodology.)

Down-Sizing. Two down-sizing methods, closely related to the re-sampling methods, were considered in this category. The first one, rand downsize, consists of eliminating, at random, elements of the over-sized class until it matches the size of the other class. The second one, focused downsize, consists of eliminating only elements further away from the boundaries (where, again, the factor .25 represents closeness to the boundaries).

Learning by Recognition. Two methods were, once again, considered in this category. Both are based on the autoassociation-based classification approach described in [2]. The approach consists of training an autoassociator (a multi-layer perceptron designed to reconstruct its input at the output layer) to learn how to recognize one of the two classes. Once trained, the network will either recognize a testing example as an example of the class it was trained on, or reject it as belonging to the class on which it was not trained. This training scheme was used first on the over-represented class of the domain (over recog) and then on the under-represented class (under recog). On every domain, the threshold for discriminating between recognized and non-recognized examples was set by comparing the accuracy obtained with 100 different threshold values and retaining the one yielding optimal performance.
4.2 Results

The results for rand resamp, rand downsize and over recog are reported in Figure 3, while the results for focused resamp, focused downsize and under recog are reported in Figure 4. We only report the results obtained for a single domain size (Size=3); results involving different domain sizes can be found in [3].

[Figure 3: Experimental Results # 1. Three bar plots, (a) rand resamp, (b) rand downsize and (c) over recog; within each plot, clusters 1-5 correspond to complexity levels c = 1..5, and the vertical axis shows percent error.]

[Figure 4: Experimental Results # 2. Three bar plots, (a) focused resamp, (b) focused downsize and (c) under recog; within each plot, clusters 1-5 correspond to complexity levels c = 1..5, and the vertical axis shows percent error.]

The results of Figures 3(a), 3(b), 4(a) and 4(b), as compared to those of Figure 2(c) (i.e., DMLP at Size=3), indicate that both re-sampling and down-sizing methods are very effective, especially as the concept complexity gets larger. In addition, comparisons of Figures 3(a) and 4(a), as well as 3(b) and 4(b), indicate that there is no clear advantage in using sophisticated re-sampling or down-sizing schemes, at least in our particular domains.

On the other hand, the performance of over recog and under recog is generally not as good as that of rand resamp, rand downsize, focused resamp and focused downsize: the overall results obtained by over recog are less accurate than those of the re-sampling and down-sizing methods. It is only when the complexity of the concept reaches c = 5 (i.e., when, we assume, the problem of recognizing one class is simpler than that of discriminating between two classes) that over recog becomes slightly more accurate. Furthermore, there does not seem to be any advantage to using under recog, since its results are comparable to those of DMLP used on unaltered imbalanced domains.

5 Conclusion

The purpose of this paper was to unify some of the research that has been conducted in isolation on the problem of class imbalance and to guide future research in the area. The paper was concerned with two issues: (1) When does the class imbalance problem matter? and (2) How do the various categories of methods attempted to solve the problem (and their different realizations) compare?

It concluded that while a standard multi-layer perceptron is not sensitive to the class imbalance problem when applied to linearly separable domains, its sensitivity increases with the complexity of the domain. The size of the training set does not appear to be a factor.

The paper also showed that both over-sampling the minority class and down-sizing the majority class are very effective methods of dealing with the problem. In addition, it showed that using more sophisticated over-sampling or down-sizing methods than a simple uniformly random approach appears unnecessary (at least in the case of feedforward neural networks and simple artificial domains of the type designed for this study).

The recognition-based approach was shown to have the potential to help when used on the majority class. However, it appears less effective than the re-sampling and down-sizing approaches. When applied to the minority class, it does not present any advantages over the discrimination-based approach (DMLP) applied to unaltered imbalanced domains.

6 Future Work

There are many directions left to explore in the future. First, it would be useful to test different types of imbalances: so far, only "balanced imbalances" were considered. "Imbalanced imbalances", in which different subclusters of a class have different numbers of examples representing them, should also be surveyed.

In order to get a more precise understanding of the different results, it would also be useful to report the results in terms of false positives and false negatives, or to run ROC analyses.

A third issue has to do with the type of classifier used. In this study, only feedforward neural networks were considered. It would be worthwhile to check the performance of other standard classifiers (e.g., C4.5, Nearest-Neighbours, etc.) on the problems of Section 3.

Finally, the use of artificial domains throughout this study may have occulted important issues pertaining to practical problems. It would thus be useful to repeat the experiments on real-world domains.

References

[1] Tom E. Fawcett and Foster Provost. Adaptive Fraud Detection. Data Mining and Knowledge Discovery, 3(1):291-316, 1997.

[2] Nathalie Japkowicz, Catherine Myers and Mark Gluck. A Novelty Detection Approach to Classification. Proceedings of the Fourteenth Joint Conference on Artificial Intelligence, 518-523, 1995.

[3] Nathalie Japkowicz. Learning from Imbalanced Data Sets: A Comparison of Various Solutions. Proceedings of the AAAI'2000 Workshop on Learning from Imbalanced Data Sets, 2000.

[4] Miroslav Kubat and Stan Matwin. Addressing the Curse of Imbalanced Data Sets: One-Sided Sampling. Proceedings of the Fourteenth International Conference on Machine Learning, 179-186, 1997.

[5] Miroslav Kubat, Robert Holte and Stan Matwin. Machine Learning for the Detection of Oil Spills in Satellite Radar Images. Machine Learning, 30:195-215, 1998.

[6] Charles X. Ling and Chenghui Li. Data Mining for Direct Marketing: Problems and Solutions. International Conference on Knowledge Discovery and Data Mining, 1998.

[7] M. Pazzani, C. Merz, P. Murphy, K. Ali, T. Hume and C. Brunk. Reducing Misclassification Costs. Proceedings of the Eleventh International Conference on Machine Learning, 217-225, 1994.

[8] David E. Rumelhart, Geoffrey E. Hinton and R. J. Williams. Learning Internal Representations by Error Propagation. In Parallel Distributed Processing, David E. Rumelhart and J. L. McClelland (Eds), MIT Press, Cambridge, MA, 318-364, 1986.

[9] Cullen Schaffer. Overfitting Avoidance as Bias. Machine Learning, 10:153-178, 1993.
