Strategies
Nathalie Japkowicz
Faculty of Computer Science
DalTech/Dalhousie University
6050 University, Halifax, N.S.
Canada, B3H 1W5
Abstract

Although the majority of concept-learning systems previously designed usually assume that their training sets are well-balanced, this assumption is not necessarily correct. Indeed, there exist many domains for which one class is represented by a large number of examples while the other is represented by only a few. The purpose of this paper is 1) to demonstrate experimentally that, at least in the case of connectionist systems, class imbalances hinder the performance of standard classifiers and 2) to compare the performance of several approaches previously proposed to deal with the problem.

1 Introduction

As the field of machine learning makes a rapid transition from the status of "academic discipline" to that of "applied science", a myriad of new issues, not previously considered by the machine learning community, is now coming to light. One such issue is the class imbalance problem. The class imbalance problem corresponds to domains for which one class is represented by a large number of examples while the other is represented by only a few.¹

The class imbalance problem is of crucial importance since it is encountered in a large number of domains of great environmental, vital or commercial importance, and was shown, in certain cases, to cause a significant bottleneck in the performance attainable by standard learning methods which assume a balanced distribution of the classes. For example, the problem occurs and hinders classification in applications as diverse as the detection of oil spills in satellite radar images [5], the detection of fraudulent telephone calls [1] and in-flight helicopter gearbox fault monitoring [2].

To this point, there have only been a few attempts at dealing with the class imbalance problem ([7], [2], [6], [4], [1], [5]), and these attempts were mostly conducted in isolation. In particular, there has not been, to date, any systematic effort to link specific types of imbalances to the degree of inadequacy of standard classifiers. Furthermore, no comparison of the various methods proposed to remedy the problem has yet been performed.

The purpose of this paper is to address these two concerns in an attempt to unify the research conducted on this problem. In the first part, the paper concentrates on finding out what type of imbalance is most damaging for a standard classifier that expects balanced class distributions; in the second part, several implementations of three categories of methods previously proposed to tackle the problem are tested and compared on the domains of the first part.

The remainder of the paper is divided into four sections. Section 2 is a statement of the specific questions asked in this study. Section 3 describes the part of the study focusing on what types of class imbalance problems create difficulties for a standard classifier. Section 4 describes the part of the study designed to compare the three categories of approaches previously attempted and considered here, on the problems of section 3. Sections 5 and 6 conclude the paper.

I would like to thank Danny Silver and Afzal Upal for their very helpful comments on a draft of this paper.

¹ In this paper, we only consider the case of concept-learning.

researchers for tackling the class imbalance problem³:

1. Methods in which the class represented by a small data set gets over-sampled so as to match the size of the other class [6].
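The over-sampling idea in item 1 can be illustrated with a minimal sketch. It assumes a two-class data set and resamples the smaller class at random with replacement until both classes have the same size; this resampling scheme is only one possible choice and is not necessarily the procedure used in [6]. The function name `random_oversample` and the toy data are hypothetical and introduced purely for illustration.

```python
import random
from collections import Counter


def random_oversample(examples, labels, seed=0):
    """Duplicate randomly chosen minority-class examples until both
    classes have the same number of examples (assumed scheme)."""
    rng = random.Random(seed)
    counts = Counter(labels)
    if len(counts) != 2:
        raise ValueError("expected exactly two classes")
    # most_common(2) lists the larger class first.
    (_, n_major), (minority, n_minor) = counts.most_common(2)
    minority_examples = [x for x, y in zip(examples, labels) if y == minority]
    # Draw as many extra minority examples as needed to close the gap.
    extra = [rng.choice(minority_examples) for _ in range(n_major - n_minor)]
    return list(examples) + extra, list(labels) + [minority] * len(extra)


# Toy data (hypothetical): four negative examples and one positive example.
xs = [0.1, 0.2, 0.3, 0.4, 0.9]
ys = ["-", "-", "-", "-", "+"]
balanced_xs, balanced_ys = random_oversample(xs, ys)
print(Counter(balanced_ys))  # both classes now have the same count
```

The original examples are kept unchanged and only copies of minority examples are appended, so the majority class loses no information; the cost is that the minority class contains exact duplicates.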
ance Matter?

[Figure: five panels, (a) Size=1, (b) Size=2, (c) Size=3, (d) Size=4, (e) Size=5; series marked + and -; horizontal axis ticks 1 to 5.]