to decide which examples to select and in what quantity [G. Weiss and F. Provost, 2003]. In addition, there can also be an imbalance in the costs of making different errors, and these costs can vary from case to case [N. V. Chawla, N. Japkowicz, and A. Kolcz, 2003].
3. Reasons for Imbalanced Datasets:
1) The data are naturally imbalanced (e.g. credit card fraud and rare diseases), i.e. IR = (number of minority class examples) / (number of majority class examples); a small computational sketch follows this list.
2) Lack of information: the data are not naturally imbalanced, but it is too expensive to obtain data for learning the minority class. (Figure 2: Lack of positive data)
3) High complexity: when the complexity of the data rises, learning such datasets becomes a crucial problem. (Figure 3: High complexity data)
4) Overlapping data: the data points belong to both the classes. (Figure 4: Overlapping data)
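To make the imbalance ratio concrete, here is a minimal Python sketch; the function name imbalance_ratio and the example label list are illustrative, not from the paper. It computes IR as the ratio of minority to majority class counts, which is how the formula above reads (note that some authors define the ratio the other way around).

from collections import Counter

def imbalance_ratio(labels):
    """IR = (# minority class examples) / (# majority class examples),
    following the definition used in this section."""
    counts = Counter(labels)
    n_minority = min(counts.values())
    n_majority = max(counts.values())
    return n_minority / n_majority

# Example: 10 fraud cases against 990 legitimate transactions.
labels = ["fraud"] * 10 + ["legit"] * 990
print(imbalance_ratio(labels))  # ~0.0101, i.e. a highly imbalanced dataset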
4. Empirical Methods dealing with Imbalanced Datasets:
A number of solutions to the class-imbalance problem have previously been proposed at both the data and algorithmic levels [Sofia Visa, Anca Ralescu, 2005] [N. V. Chawla, N. Japkowicz, and A. Kolcz, 2003]. At the data level, these solutions include different forms of re-sampling such as random over-sampling with replacement, random under-sampling, directed over-sampling (in which no new examples are created, but the choice of samples to replace is informed rather than random), directed under-sampling (where, again, the choice of examples to eliminate is informed), over-sampling with informed generation of new samples, and combinations of the above techniques. At the algorithmic level, solutions include adjusting the costs of the various classes so as to counter the class imbalance, adjusting the probabilistic estimate at the tree leaf (when working with decision trees), adjusting the decision threshold, and recognition-based (i.e., learning from one class) rather than discrimination-based (two-class) learning.
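As a hedged sketch of two of these algorithmic-level ideas, the example below uses scikit-learn (an assumption; the paper does not name a library) to counter the imbalance by weighting the classes and then by lowering the decision threshold for the minority class. The synthetic dataset, the model, and the threshold value of 0.3 are purely illustrative.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic imbalanced data (95% majority, 5% minority), for illustration only.
X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)

# Cost adjustment: class_weight="balanced" raises the penalty for
# misclassifying the rare class in proportion to the imbalance.
clf = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X, y)

# Threshold adjustment: instead of the default 0.5 cut-off, predict the
# minority class whenever its estimated probability exceeds a lower threshold.
threshold = 0.3  # illustrative value; in practice tuned on validation data
proba_minority = clf.predict_proba(X)[:, 1]
y_pred = (proba_minority >= threshold).astype(int)
print("minority predictions:", y_pred.sum(), "of", len(y_pred))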
4.1 Solution based on Data Level for handling Imbalanced Datasets
Data level solutions include many different forms of re-sampling such as random over-sampling with replacement, random under-sampling, directed over-sampling, directed under-sampling, over-sampling with informed generation of new samples, and combinations of the above techniques.
4.1.1 Under sampling
Random under-sampling [Sofia Visa, Anca Ralescu, 2005] is a non-heuristic method that aims to balance class distribution through the random elimination of majority class examples. The logic behind this is to try to balance out the dataset in an attempt to overcome the idiosyncrasies of the machine learning algorithm. The major drawback of random under-sampling is that this method can discard potentially useful data that could be important for the induction process.
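As a hedged illustration of this idea (not code from the cited work; the helper name random_undersample is hypothetical), the NumPy sketch below balances a binary dataset by randomly discarding majority-class rows until both classes are the same size. The discarded rows are simply lost, which is exactly the drawback noted above.

import numpy as np

def random_undersample(X, y, random_state=0):
    """Balance a binary dataset by randomly dropping majority-class rows."""
    rng = np.random.default_rng(random_state)
    classes, counts = np.unique(y, return_counts=True)
    minority = classes[np.argmin(counts)]
    majority = classes[np.argmax(counts)]
    minority_idx = np.where(y == minority)[0]
    majority_idx = np.where(y == majority)[0]
    # Keep only as many majority examples as there are minority examples.
    kept_majority = rng.choice(majority_idx, size=minority_idx.size, replace=False)
    keep = np.concatenate([minority_idx, kept_majority])
    return X[keep], y[keep]

# Example: 900 majority vs. 100 minority examples.
X = np.random.randn(1000, 5)
y = np.array([0] * 900 + [1] * 100)
X_bal, y_bal = random_undersample(X, y)
print(np.bincount(y_bal))  # [100 100]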
4.1.2 Over sampling
Random over-sampling is a non-heuristic method that aims to balance class distribution through the random replication of minority class examples. Several authors [Sotiris Kotsiantis, Dimitris Kanellopoulos, Panayiotis, 2006], [N. V. Chawla, L. O. Hall, K. W. Bowyer, and W. P.