An Analysis for Mining Imbalanced Datasets
T. Deepa, Dr. M. Punithavalli
DeepaRaman12@gmail.com, mpunitha_srcw@yahoo.co.in

Faculty of Computer Science Department, Sri Ramakrishna College of Arts and Science for Women, Coimbatore, Tamil Nadu, India.
Director & Head, Sri Ramakrishna College of Arts & Science for Women, Coimbatore, Tamil Nadu, India.

(IJCSIS) International Journal of Computer Science and Information Security, Vol. 8, No. 1, April 2010, ISSN 1947-5500
 
Summary:

Mining imbalanced datasets in a real-world domain is an obstacle where the number of instances of one (majority) class greatly outnumbers the other (minority) class. This paper traces some of the recent progress in the field of learning from imbalanced data. It reviews approaches adopted for this problem, identifies challenges, and points out future directions in the field. A systematic study is developed around three questions: 1) What type of imbalance hinders accuracy? 2) Are imbalances always damaging, and to what extent? 3) Can down-sizing (under-sampling) and over-sampling approaches be proposed to deal with the problem? Finally, the paper leads to a profitable discussion of what the problem is and how it might be addressed most effectively.

Keywords:
Imbalanced Datasets, Undersampling, Oversampling
1. Introduction:

When the field of machine learning transitioned from the status of "academic discipline" to "applied science", a myriad of new issues arose; one such issue is the class imbalance problem. The class imbalance problem addresses the case where the training examples of one class (the majority) outnumber those of the other class (the minority). Imbalanced data is common in business, industry, and scientific research, and the problem's importance grew as more and more researchers realized that it poses a significant bottleneck for the performance of standard learning methods. On the other hand, it is observed that the available datasets in many real-world domains are imbalanced. In the literature, the imbalanced-dataset problem is also discussed under the labels of rare classes or skewed data.
2. The Class Imbalance Problem:
The class imbalance problem occurs when, in a classification problem, there are many more instances of some classes than of others. The problem is pervasive and ubiquitous, causing trouble to a large segment of the data mining community [N. Japkowicz, 2000]. To better understand it, the situation is illustrated in Figure 1: in Fig. 1(a) there is a large imbalance between the majority class (-) and the minority class (+), while in Fig. 1(b) the classes are balanced.

Figure 1: (a) Many negative cases against some sparse positive cases. (b) A balanced data set with well-defined clusters.

The problem is prevalent in many applications, including fraud/intrusion detection, risk management, text classification, and medical diagnosis/monitoring, among many others. It is worth noting that in certain domains (like those just mentioned) the class imbalance is intrinsic to the problem. For example, within a given setting, there are very few cases of fraud compared to the large number of honest uses of the offered facilities. However, class imbalances sometimes occur in domains that do not have an intrinsic imbalance. This happens when the data collection process is limited (e.g., due to economic or privacy reasons), thus creating artificial imbalances. Conversely, in certain cases the data abounds and it is up to the scientist to decide which examples to select and in what quantity [G. Weiss and F. Provost, 2003]. In addition, there can also be an imbalance in the costs of making different errors, and these costs can vary per case [N. V. Chawla, N. Japkowicz, and A. Kolcz, 2003].
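To make the trouble concrete, here is a small worked example (our own illustration, not from the paper): on a 99:1 dataset, a degenerate classifier that always predicts the majority class scores 99% accuracy while never detecting a single minority case.

```python
# Why imbalance hurts standard accuracy-driven learners: a 99:1 toy dataset.
y_true = [0] * 99 + [1] * 1          # 99 majority examples, 1 minority example
y_pred = [0] * 100                   # degenerate "always majority" classifier

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
minority_recall = sum(t == p == 1 for t, p in zip(y_true, y_pred)) / 1

print(accuracy)         # 0.99 -> looks excellent
print(minority_recall)  # 0.0  -> useless for the class we actually care about
```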
3. Reasons for Imbalanced Datasets:

1) Imbalanced ratio: The data are naturally imbalanced (e.g., credit card fraud and rare diseases), where the imbalance ratio is IR = (number of minority examples) / (number of majority examples). A minimal computation of IR is sketched in the code after this list.

2) Lack of information: The data are not naturally imbalanced, but it is too expensive to obtain enough data for learning the minority class.
Figure 2: Lack of positive data.

3) Complexity: When the complexity of the data rises, learning from the dataset becomes harder.
Figure 3: High-complexity data.

4) Overlapping classes: Data points belong to both of the classes.
Figure 4: Overlapping data.
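The following minimal Python sketch (our own; the helper name is hypothetical) computes the imbalance ratio defined in item 1 for a binary label vector:

```python
# A minimal sketch: IR = n_minority / n_majority for a binary label list.
from collections import Counter

def imbalance_ratio(labels):
    """Return IR = (# minority examples) / (# majority examples)."""
    counts = Counter(labels)
    n_minority = min(counts.values())
    n_majority = max(counts.values())
    return n_minority / n_majority

# Example: 10 positives vs. 990 negatives -> IR ~ 0.0101
print(imbalance_ratio([1] * 10 + [0] * 990))
```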
4. Empirical Methods Dealing with Imbalanced Datasets:

A number of solutions to the class-imbalance problem have been proposed at both the data and algorithmic levels [Sofia Visa, Anca Ralescu, 2005] [N. V. Chawla, N. Japkowicz, and A. Kolcz, 2003].

At the data level, these solutions include different forms of re-sampling, such as random over-sampling with replacement, random under-sampling, directed over-sampling (in which no new examples are created, but the choice of samples to replace is informed rather than random), directed under-sampling (where, again, the choice of examples to eliminate is informed), over-sampling with informed generation of new samples, and combinations of the above techniques.

At the algorithmic level, solutions include adjusting the costs of the various classes so as to counter the class imbalance, adjusting the probabilistic estimate at the tree leaf (when working with decision trees), adjusting the decision threshold, and recognition-based (i.e., learning from one class) rather than discrimination-based (two-class) learning.
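As a small illustration of one of these algorithmic-level ideas, the sketch below (ours, not from the paper) adjusts the decision threshold so that the rare positive class is predicted more often than the default 0.5 cutoff would allow:

```python
# Lowering the decision threshold for the positive (minority) class.
def predict_with_threshold(probabilities, threshold=0.2):
    """probabilities: iterable of P(positive | x) from any probabilistic classifier."""
    return [1 if p >= threshold else 0 for p in probabilities]

# With the default 0.5 cutoff the second case would be predicted negative.
print(predict_with_threshold([0.9, 0.35, 0.1]))  # -> [1, 1, 0]
```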
4.1 Solutions Based on the Data Level for Handling Imbalanced Datasets

As listed above, data-level solutions comprise many different forms of re-sampling: random over-sampling with replacement, random under-sampling, directed over-sampling, directed under-sampling, over-sampling with informed generation of new samples, and combinations of these techniques. The two most common families, under-sampling and over-sampling, are discussed below.
4.1.1 Under-sampling

Random under-sampling [Sofia Visa, Anca Ralescu, 2005] is a non-heuristic method that aims to balance the class distribution through the random elimination of majority class examples. The logic behind this is to balance the dataset in an attempt to overcome the idiosyncrasies of the machine learning algorithm. The major drawback of random under-sampling is that it can discard potentially useful data that could be important for the induction process.
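As an illustration of the method just described, here is a minimal Python sketch (the function name is our own) that removes randomly chosen majority examples until the two classes are the same size:

```python
# Random under-sampling: keep all minority examples and an equally sized
# random subset of majority examples.
import random

def random_undersample(X, y, majority_label, seed=0):
    """Return a balanced (X, y) by discarding random majority examples."""
    rng = random.Random(seed)
    majority = [i for i, label in enumerate(y) if label == majority_label]
    minority = [i for i, label in enumerate(y) if label != majority_label]
    kept = minority + rng.sample(majority, len(minority))
    rng.shuffle(kept)
    return [X[i] for i in kept], [y[i] for i in kept]
```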
4.1.2 Over-sampling

Random over-sampling is a non-heuristic method that aims to balance the class distribution through the random replication of minority class examples. Several authors [Sotiris Kotsiantis, Dimitris Kanellopoulos, Panayiotis Pintelas, 2006], [N. V. Chawla, L. O. Hall, K. W. Bowyer, and W. P. Kegelmeyer, 2002] agree that random over-sampling can increase the likelihood of overfitting, since it makes exact copies of the minority class examples.

SMOTE generates synthetic minority examples to over-sample the minority class. Its main idea is to form new minority class examples by interpolating between minority class examples that lie close together. For every minority example, its k nearest neighbors of the same class are calculated (k is set to 5 in SMOTE); then some of these neighbors are randomly selected according to the over-sampling rate. New synthetic examples are generated along the line between the minority example and each selected neighbor. Thus the overfitting problem is avoided, and the decision boundaries for the minority class spread further into the majority class space.
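The following is a simplified SMOTE-style sketch (our own condensation of the idea in Chawla et al., 2002, not the reference implementation): each synthetic example is interpolated between a minority example and one of its k nearest minority-class neighbors.

```python
# SMOTE-style synthetic over-sampling of the minority class.
import numpy as np

def smote_like(X_min, n_synthetic, k=5, seed=0):
    """X_min: (n, d) array of minority-class examples.
    Returns an (n_synthetic, d) array of interpolated synthetic examples."""
    X_min = np.asarray(X_min, dtype=float)
    rng = np.random.default_rng(seed)
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.integers(len(X_min))
        x = X_min[i]
        # k nearest minority-class neighbors of x (excluding x itself)
        dists = np.linalg.norm(X_min - x, axis=1)
        neighbors = np.argsort(dists)[1:k + 1]
        neighbor = X_min[rng.choice(neighbors)]
        gap = rng.random()  # random point on the segment x -> neighbor
        synthetic.append(x + gap * (neighbor - x))
    return np.array(synthetic)
```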
4.2 Solutions Based on the Algorithm Level for Handling Imbalance

4.2.1 One-class learning

An interesting aspect of one-class (recognition-based) learning is that, under certain conditions such as multi-modality of the domain space, one-class approaches to the classification problem can be superior to discriminative (two-class) approaches such as decision trees or neural networks. A related recognition-oriented option is a rule induction system that utilizes a separate-and-conquer approach to iteratively build rules covering previously uncovered training examples. Each rule is grown by adding conditions until no negative examples are covered, and rules are normally generated for each class from the rarest class to the most common class. Given this architecture, it is quite straightforward to learn rules only for the minority class. One-class learning is particularly useful on extremely unbalanced data sets composed of a high-dimensional noisy feature space. The one-class approach is related to aggressive feature selection methods, but is more practical, since feature selection can often be too expensive to apply.
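As a minimal sketch of recognition-based learning (ours, using scikit-learn's OneClassSVM as one possible recognizer rather than the rule-induction system described above), the model is trained only on minority examples and then flags test points as inside or outside that class:

```python
# One-class (recognition-based) learning: train on minority examples only.
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X_minority = rng.normal(loc=0.0, scale=1.0, size=(50, 2))  # one class only
X_test = np.array([[0.1, -0.2], [8.0, 8.0]])               # near vs. far from the class

clf = OneClassSVM(kernel="rbf", gamma="auto", nu=0.1).fit(X_minority)
print(clf.predict(X_test))  # e.g., [ 1 -1]: first point accepted, second rejected
```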
4.2.2 Cost-sensitive learning

Changing the class distribution is not the only way to improve classifier performance when learning from imbalanced datasets. A different approach is to incorporate costs into decision-making by defining fixed and unequal misclassification costs between classes [6]. The cost model takes the form of a cost matrix, where the cost of classifying a sample from true class j as class i corresponds to the matrix entry λ_ij. This matrix is usually expressed in terms of average misclassification costs for the problem, and the diagonal elements are usually set to zero, meaning correct classification has no cost. The conditional risk of making decision α_i is defined as

    R(α_i | x) = Σ_j λ_ij P(j | x)

The equation states that the risk of choosing class i is determined by the fixed misclassification costs and by the uncertainty of our knowledge about the true class of x, expressed by the posterior probabilities P(j | x). The goal in cost-sensitive classification is to minimize the expected cost of misclassification, which is realized by choosing the class (v_j) with the minimum conditional risk.
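A minimal sketch (our illustration) of this decision rule: with a hypothetical 2-class cost matrix λ, choose the class whose conditional risk is smallest.

```python
# Cost-sensitive decision rule: pick argmin_i R(a_i | x) = sum_j L[i][j] * P(j | x).
import numpy as np

# Hypothetical cost matrix L[i][j]: cost of predicting class i when the true
# class is j; the diagonal is zero (correct decisions cost nothing). Here
# misclassifying the rare class (j=1) as the common one (i=0) costs 10.
L = np.array([[0.0, 10.0],
              [1.0, 0.0]])

def min_risk_decision(posteriors):
    """posteriors: array of P(j | x). Returns the class index with minimum risk."""
    risks = L @ posteriors          # risks[i] = sum_j L[i][j] * P(j | x)
    return int(np.argmin(risks))

# Even though class 0 is more probable (0.8 vs 0.2), the high cost of missing
# class 1 makes predicting class 1 the lower-risk decision: risks = [2.0, 0.8].
print(min_risk_decision(np.array([0.8, 0.2])))  # -> 1
```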
4.2.3 Feature Selection
Feature selection is an important and relevant step for mining various data sets [I. Guyon & A. Elisseeff, 2003]. Learning from high-dimensional spaces can be very expensive and usually not very accurate. Feature selection is particularly relevant to various real-world problems such as bioinformatics, image processing, text classification, and Web categorization. High-dimensional real-world datasets are often accompanied by another problem: high skew in the class distribution, with the class of interest being relatively rare. This makes it particularly important to select features that capture the high skew in the class distribution and lead to higher separability between the two classes. The majority of work on feature selection for imbalanced data sets has focused on the text classification and Web categorization domains [D. Mladenic & M. Grobelnik, 1999]. A couple of papers concentrate on feature selection for imbalanced data sets, albeit in text classification or Web categorization. [Zheng and Srihari, 2004] suggest that existing measures used for feature selection are not very appropriate for imbalanced data sets. They propose a feature selection framework which selects features for the positive and negative classes separately and then explicitly combines them, and they show simple ways of performing this combination.
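A hedged sketch (our own, loosely in the spirit of Zheng and Srihari's framework rather than their exact method): score each feature by the difference of its class-conditional means, treat the two extremes as positive and negative features, and explicitly combine the top-k of each.

```python
# Per-class feature selection: pick features indicative of the positive class
# and features indicative of the negative class, then take their union.
import numpy as np

def select_per_class(X, y, k=10):
    """X: (n_samples, n_features) matrix; y: binary labels (1 = positive/minority).
    Returns sorted indices of the combined positive and negative features."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    diff = X[y == 1].mean(axis=0) - X[y == 0].mean(axis=0)
    top_positive = np.argsort(diff)[-k:]   # most indicative of the positive class
    top_negative = np.argsort(diff)[:k]    # most indicative of the negative class
    return np.union1d(top_positive, top_negative)
```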
