International Journal of Advanced Research in Computer Science & Technology (IJARCST 2014), Vol. 2, Issue 2, Ver. 3 (April - June 2014), ISSN: 2347-8446 (Online), ISSN: 2347-9817 (Print)

Handling Class Imbalance Problem Using Feature Selection
Deepika Tiwari
Dept. of Computer Science, BIT (MESRA), Noida Campus, Noida, India

The class imbalance problem is encountered in real-world applications of machine learning and results in suboptimal classifier performance. Traditional classification algorithms struggle with imbalanced data. This paper proposes a feature selection algorithm that modifies the original RELIEFF algorithm to handle the class imbalance problem: it assigns higher weight to attributes while dealing with samples from minority classes, which results in higher weight for attributes that cater to minority samples. The experimental results show that the proposed method obtains better performance compared to the original RELIEFF algorithm.

I. Introduction
The class imbalance problem is a challenge to machine learning and data mining, and it has attracted significant research in recent years. A classifier affected by the class imbalance problem on a specific data set can show strong overall accuracy but very poor performance on the minority class. Imbalanced data sets are pervasive in real-world applications; examples include biological data analysis, text classification, image classification, and web page classification, among many others. The skew of an imbalanced data set can be severe: some imbalanced data sets have only one minority sample for every 100 majority samples. Researchers have crafted many techniques to combat the class imbalance problem, including resampling, new algorithms, and feature selection. With imbalanced data, classification rules that predict the small classes tend to be fewer and weaker than those that predict the prevalent classes; consequently, test samples belonging to the small classes are misclassified more often than those belonging to the prevalent classes. Standard classifiers usually perform poorly on imbalanced data sets because they are designed to generalize from training data and output the simplest hypothesis that best fits the data, and the simplest hypothesis pays less attention to rare cases. However, in many cases identifying rare objects is of crucial importance, and classification performance on the small classes is the main concern in judging the quality of a classification model.

Why is the class imbalance problem so prevalent and difficult to overcome? First, modern classifiers assume that the unseen data points on which the classifier will be asked to make predictions are drawn from the same distribution as the training data. If the testing and validation samples were drawn from a different distribution, the trained classifier may give poor results because of the flawed model. Under this assumption, a classifier will almost always produce poor accuracy on an imbalanced data set. In a bi-class application, the imbalance problem is observed as one class being represented by a large number of samples while the other is represented by only a few. The class with very few training samples, usually associated with high identification importance, is referred to as the positive class; the other is the negative class. The learning objective for this kind of data is to obtain satisfactory identification performance on the positive (small) class.

At the data level, the objective is to re-balance the class distribution by re-sampling the data space: oversampling instances of the positive class, undersampling instances of the negative class, or sometimes combining the two techniques. At the algorithm level, solutions try to adapt existing classifier learning algorithms to bias towards the positive class, as in cost-sensitive learning and recognition-based learning. In addition to these solutions, another approach is boosting. Boosting algorithms change the underlying data distribution and apply the standard classifier learning algorithms to the revised data space iteratively; from this point of view, boosting approaches can be categorized as solutions at the data level. Because the class imbalance problem is commonly accompanied by high dimensionality of the data set [9], applying feature selection techniques is a necessary course of action; ingenious sampling techniques and algorithmic methods alone may not be enough to combat high-dimensional class imbalance problems. In this paper, we modify the existing RELIEFF feature selection algorithm to handle the class imbalance problem. The modified algorithm assigns higher weight to attributes while dealing with samples from minority classes, which results in higher weight for attributes that cater to minority samples. The experimental results show that the proposed method obtains better performance compared to other methods.

The remainder of this paper is organized as follows: Section II describes related work. Section III describes the original RELIEFF algorithm. Section IV gives details of the modified RELIEFF algorithm. Section V describes the experiment details. Section VI discusses the results. Finally, we conclude our work in Section VII and discuss future scope in Section VIII.

II. Related Work

A. Data Level Treatments
These techniques aim to correct problems with the distribution of a data set. Weiss and Provost noted that the original distribution of samples is sometimes not the optimal distribution to use for a given classifier [10], and different sampling techniques can modify the distribution to one that is closer to the optimal distribution.

1. Sub Sampling
Sub-sampling is the process of drawing a subset of data from a larger data set.

2. Over Sampling
Over-sampling, in its basic form, duplicates the minority-class examples.

3. SMOTE (Synthetic Minority Over-Sampling)
This sampling technique involves artificially resampling the data set by creating synthetic minority-class examples.

The end result of these sampling techniques is a data set that has a balanced distribution.
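The three sampling strategies above can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the paper's implementation: the helper names, the tuple-based toy representation, and the brute-force neighbour search are all assumptions, and the SMOTE step follows the usual interpolate-toward-a-random-minority-neighbour recipe.

```python
import random

def sub_sample(majority, n):
    """Sub-sampling: draw a smaller subset of the majority class."""
    return random.sample(majority, n)

def over_sample(minority, n):
    """Basic over-sampling: duplicate minority examples until n are drawn."""
    return [random.choice(minority) for _ in range(n)]

def smote_like(minority, n, k=2):
    """SMOTE-style over-sampling: synthesize points on the line segment
    between a minority example and one of its k nearest minority
    neighbours."""
    synthetic = []
    for _ in range(n):
        x = random.choice(minority)
        # k nearest minority neighbours of x (excluding x itself)
        neighbours = sorted(
            (p for p in minority if p is not x),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(x, p)))[:k]
        nb = random.choice(neighbours)
        gap = random.random()  # interpolation factor in [0, 1)
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(x, nb)))
    return synthetic
```

Because each synthetic point is a convex combination of two minority samples, it always stays inside the minority class's bounding region, which is what distinguishes SMOTE from naive duplication.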

© 2014, IJARCST All Rights Reserved

While these methods can result in greatly improved results over the original data set, there are significant issues surrounding their use. Oversampling methods, whether they duplicate existing samples or synthetically create new ones, can cause a classifier to overfit the data [10]. The additional training cases introduced by over-sampling also increase the time complexity of the classifier, and in the worst case exact copies of examples after over-sampling may lead to classifier over-fitting. After under-sampling, classifier performance may degrade due to the potential loss of useful majority-class examples. While many studies have shown some benefits of artificial rebalancing schemes, many classifiers are relatively insensitive to a distribution's skew [13], so the question of whether simply modifying the class ratio of a data set will result in significant improvement is considered open by some researchers [12].

In some fields, the cost of the data gathering procedure limits the number of samples we can collect for use in a data set and results in an artificial imbalance [10]. In other fields, the imbalance is an inherent part of the data [10]: there are many more people living without cancer at any time than those with cancer. In other cases the cost of using a classifier can be very high, and machine learning researchers simply find that they are limited to the data they have [12].

B. Algorithmic Level Treatments
One-class learning methods aim to combat the overfitting problem that occurs with most classifiers learning from imbalanced data by approaching it from an unsupervised perspective. A one-class learner is built to recognize samples from a given class and reject samples from other classes. These methods accomplish this goal by learning using only positive data points and no other background information. Such learners often give a confidence level of the resemblance between unseen data points and the learned class, and a classification can be made from these values by requiring a minimum threshold of similarity between a novel sample and the class [14]. The two most prominent types of one-class learners are one-class SVMs [15] and neural networks or autoencoders [14]. Raskutti and Kowalczyk [15] showed that one-class SVMs performed better than two-class SVMs on genomic data. However, Manevitz and Yousef [16] believe that the results from a classifier trained using only positive data will not be as good as the results obtained using both positive and negative data. Elkan and Noto [17] agreed and also showed that when a user has negative samples, using them merely as unlabeled samples hindered a classifier's performance. Thus, unless one has only training samples from one class and no other information, one-class learners are likely not the best approach.

Boosting schemes, similar to the AdaBoost algorithm, create a series of classifiers all applied to the same data set. The AdaBoost algorithm samples with replacement from a data set with initially equal probabilities for each sample. After each iteration, the probabilities are updated: a sample that is correctly classified receives a lower probability of being drawn in the next iteration, and a misclassified sample receives a higher probability [18]. The goal of each iteration is thus to draw a training set consisting of hard-to-classify samples. Chawla et al. [19] applied boosting to the SMOTE algorithm to change the weight of each sample and adjust for skew in the distribution. Sun et al. [20] used a genetic algorithm to identify the best cost matrix for an imbalanced problem and then applied a standard boosting algorithm to classify the data set; they showed this method generalizes well to imbalanced data sets.

C. Feature Selection
The goal of feature selection, in general, is to select a subset of j features that allows a classifier to reach optimal performance, where j is a user-specified parameter. For high-dimensional data sets, we use filters that score each feature independently based on a rule. Most of the filtering methods developed in data mining and machine learning assume conditional independence among the attributes. Feature selection is a key step for many machine learning algorithms, especially when the data is high-dimensional: microarray-based classification data sets often have tens of thousands of features [26], and text classification data sets using just a bag-of-words feature set have orders of magnitude more features than documents [24]. Van der Putten and van Someren [21] analyzed the data sets from the CoIL Challenge 2000 and found that feature selection was more significant to strong performance than the choice of classification algorithm and helped combat the overfitting problem the most. Another analysis of the CoIL Challenge 2000 by Elkan [27] found that the applied feature selection metrics were not good enough for that task; the biggest flaw he found with most of them is that they did not consider selecting highly correlated features, because such features were thought to be redundant. However, the interaction between different features also needs to be considered in the selection phase. Guyon and Elisseeff [28] gave a strong theoretical analysis of the limits of feature selection metrics, showing that irrelevant features on their own can be useful in conjunction with other features, and that the combination of two highly correlated features can be better than either feature independently. Forman [24] noted a similar observation on highly imbalanced text classification problems and stated that "no degree of clever induction can make up for a lack of predictive signal in the input space." Research shows that in high-dimensional data sets, feature selection alone can combat the class imbalance problem [23, 24].

III. Existing RELIEFF Algorithm
The Relief algorithm is one of the most successful filtering algorithms known to date. The Relief family of algorithms is very useful because these algorithms do not make the assumption that the attributes are independent of each other; they are context sensitive and can correctly estimate the quality of attributes in problems with strong dependencies among the attributes [29]. This makes Relief one of the most successful preprocessing algorithms [30]. The first version of the RELIEF algorithm [31] could deal only with binary classification problems.
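The weight update at the heart of the original binary RELIEF can be sketched in plain Python: for a randomly drawn instance, find its nearest hit (same class) and nearest miss (other class), then reward attributes that separate the miss and penalize attributes that separate the hit. This is an illustrative sketch, not WEKA's implementation; the brute-force neighbour search, the span-based normalization, and the toy data in the test are assumptions.

```python
import random

def diff(a, x, y, span):
    """Normalized distance between two instances on numerical attribute a."""
    return abs(x[a] - y[a]) / span[a]

def relief(data, labels, m, seed=0):
    """Original (binary) RELIEF: W[a] grows when attribute a separates the
    nearest miss and shrinks when it separates the nearest hit."""
    rng = random.Random(seed)
    n_attr = len(data[0])
    # attribute value ranges, used to normalize diff into [0, 1]
    span = [max(r[a] for r in data) - min(r[a] for r in data) or 1.0
            for a in range(n_attr)]
    w = [0.0] * n_attr
    for _ in range(m):
        i = rng.randrange(len(data))
        ri, ci = data[i], labels[i]

        def nearest(same_class):
            cands = [j for j in range(len(data))
                     if j != i and (labels[j] == ci) == same_class]
            return data[min(cands, key=lambda j: sum(
                diff(a, ri, data[j], span) for a in range(n_attr)))]

        hit, miss = nearest(True), nearest(False)
        for a in range(n_attr):
            w[a] += (diff(a, ri, miss, span) - diff(a, ri, hit, span)) / m
    return w
```

On a toy set where attribute 0 separates the two classes and attribute 1 is noise, the returned weight for attribute 0 comes out clearly larger, which is exactly the context-sensitive ranking behaviour described above.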

Relief algorithm. Even though. if Ri is from majority class. Modified RELIEFF Algorithm The algorithm RELIEFF is an instance-based filtering method. from class C find k nearest misses Mj(C). sample gives robustness and more stability for RELIEFF than [2] FOR i=1 to m DO original RELIEF algorithm. for each attribute can be stated in the following way. attributes.9817 (Print) with binary classification problems. we put higher weight by using the following equation. It calculates the probability that two given instances (I1. Issue 2. The process of updating the weight [3] randomly select an instance Ri. In this as follows. the algorithm first takes one instance from the Mj(C) and checks whether an attribute has different value in Ri and Mj(C). END FOR. END FOR. find k nearest hits Hj. It could estimate the rank of an attribute whether that attribute is nominal or numerical but it could not deal with missing values of attributes in the dataset. the The way that the algorithm updates the weight for each attribute Algorithm is dependent on class distribution of instances. Output: the vector W of estimations of the qualities of attributes The selection of k-hits and k-misses for each randomly selected [1] set all weights W[A] = 0. RELIEFF is highly dependent on the number of instances from 10. However. FOR each class C ≠ class (Ri) DO 6. we attribute. 1. it does is simple and intuitive. Ver. If the value of the attribute is different in Ri and Mj(C). The pseudo code for RELIEFF is shown below the weight and if both instances are from same class but their attribute values are different then the algorithm decreases the Algorithm RELIEFF: weight .com . To compensate the degradation due to data imbalance we propose The distance measures for nominal and numerical attributes are a method for calculating the weight for each attribute. It means that the ability to estimate the quality of attributes by 9. an extension of original Here Pr denotes the conditional probability.0. 
different classes in the dataset. called RELIEFF that can deal with multi-class Simply speaking. we keep the original weight that attribute and adds that distance to the weight for that function to estimate the weight for the attributes. if the randomly selected instance Ri is from minority class. FOR i=1 to m DO 3. for all the k-misses in Mj(C). IJARCST All Rights Reserved 518 www. The contribution for each class of the misses is multiplied Input: for each training instance a vector of attribute values and by the prior probability of that class P(C) to keep the weight in the class value the range [0. 2. 8. In this way.ijarcst. 2. proposed modified version of RELIEFF . distance from the weight of that attribute. Then while updating the weight for each attribute. are able to handle the minority class more effectively than the The distances are always normalized so that the weights for the original RELIEFF algorithm. attributes are always in the same range and comparable to each The proposed RELIEFF filtering method is stated below. the algorithm decides this situation as desirable and calculates the distance for However. set all weights W[A] = 0. © 2014. Later in his 1994 paper [32]. 4. For For each attribute in randomly selected sample Ri. the class value. 7. we first determine the class distribution in the training data. Similarly. 5. randomly select an instance Ri. 3 (April .In case of missing values the RELIEFF algorithm modifies Input: for each training instance a vector of attribute values and the ‘diff’ function to calculate the distance for a particular attribute.June 2014) ISSN : 2347 . I2) have Output: the vector W of estimations of the qualities of different attribute values over the class value. not consider the class distribution while ranking the attributes. END FOR. performance degrades significantly when the dataset is imbalanced then the algorithm takes this as not desirable and subtracts the or biased towards any specific class. 1] and symmetric. 
For example if I1 has missing value but I2 does not then. if that attribute balanced datasets this filtering procedure works fine. [4] find k nearest hits Hj.8446 (Online) Computer Science & Technology (IJARCST 2014) Vol. the has a value different for that same attribute for a sample in Hj. other. if both instances are from different class and classification problem and missing attribute values in the their attribute values do not match then the algorithm increases dataset.International Journal of Advanced Research in ISSN : 2347 .0. FOR A=1 to a DO IV. Igor Kononenko proposed an algorithm.
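Since the paper's exact re-weighting equation is not reproduced above, the following sketch shows one plausible reading of the scheme: a standard multi-class RELIEFF update (k nearest hits and misses, miss contributions weighted by P(C)/(1 - P(class(Ri)))) whose per-instance contribution is scaled up when the sampled instance Ri comes from a minority class. The scaling rule used here (majority-class count divided by the count of Ri's class) and all function names are assumptions, not the authors' formula.

```python
from collections import Counter
import random

def relieff_minority_boost(data, labels, m, k=3, seed=0):
    """RELIEFF-style attribute weighting; updates for minority-class
    instances are scaled up (assumed rule: majority count / class count,
    which equals 1.0 for the majority class)."""
    rng = random.Random(seed)
    n, n_attr = len(data), len(data[0])
    counts = Counter(labels)
    majority = max(counts.values())
    prior = {c: counts[c] / n for c in counts}
    # attribute value ranges, used to normalize diff into [0, 1]
    span = [max(r[a] for r in data) - min(r[a] for r in data) or 1.0
            for a in range(n_attr)]

    def d(a, x, y):
        return abs(x[a] - y[a]) / span[a]

    w = [0.0] * n_attr
    for _ in range(m):
        i = rng.randrange(n)
        ri, ci = data[i], labels[i]
        scale = majority / counts[ci]  # assumed minority boost
        # k nearest neighbours of Ri within each class (hits and misses)
        by_class = {}
        for c in counts:
            cands = [j for j in range(n) if j != i and labels[j] == c]
            cands.sort(key=lambda j: sum(d(a, ri, data[j])
                                         for a in range(n_attr)))
            by_class[c] = cands[:k]
        for a in range(n_attr):
            hit_term = sum(d(a, ri, data[j]) for j in by_class[ci]) / (m * k)
            miss_term = sum(
                prior[c] / (1.0 - prior[ci]) *
                sum(d(a, ri, data[j]) for j in by_class[c])
                for c in counts if c != ci) / (m * k)
            w[a] += scale * (miss_term - hit_term)
    return w
```

On an imbalanced toy set (twelve majority samples, three minority samples) where attribute 0 carries the class signal, the boosted updates still rank attribute 0 above the noise attribute.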

V. Experiment

A. Weka
WEKA is a powerful and rich data mining software package, available for download free of cost from http://www.cs.waikato.ac.nz/ml/weka/; it is open source software written in the Java programming language. The standalone software provides tools for all kinds of data mining and machine learning tasks, such as data pre-processing, classification, regression, and clustering. All the source code for the different data mining and machine learning algorithms is accessible to the research community, so researchers can easily use these algorithms and make whatever modifications a particular data mining problem requires. In our research, we have modified the RELIEFF pre-processing filtering algorithm, whose source code is freely available with the software itself. The file format that WEKA uses for input is called the ARFF file format. This format simply defines the names of the attributes and their types, followed by a set of examples; each example is a vector of the attribute values.

B. Data Set
Data Set                  Features   Instances   Class Ratio
Central Nervous System    7130       60          39:29
WebKb                     973        5212        845:128

C. Performance Metrics (Area under ROC Curve)
Researchers use statistics like the F-measure [24], [25] and the area under the receiver operating characteristic curve, or ROC (AUC) [10], to better evaluate minority-class performance. There are many evaluation measures in data mining; some of the most relevant to imbalanced data sets are precision, recall, F-measure, the Receiver Operating Characteristic (ROC) curve, the cost curve, and the precision-recall curve. The commonality shared between them is that they are all class-independent measurements. The ROC curve is well received in the imbalanced data set community, and it is becoming the standard evaluation method in the area.

VI. Results

A. Area under ROC Curve
(Table: AUC of the original RELIEFF versus the modified algorithm on the Central Nervous System and WebKb data sets with 50 and 100 selected features; the recoverable entries show .79 for RELIEFF against .83 and .84 for the modified algorithm.)

VII. Conclusion
In this paper we modified the existing RELIEFF feature selection algorithm to handle the class imbalance problem. The modified algorithm shows improved performance on imbalanced data sets. Moreover, reducing the number of features also reduces the classifier's running time, which results in low time and space complexity as well.

VIII. Future Work
As future work, cross validation of the experiments can be performed, and the experiments can be repeated on other data sets.
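The AUC used as the evaluation measure above can be computed without tracing the full ROC curve: it equals the probability that a randomly chosen positive sample is scored above a randomly chosen negative one, with ties counted as half. A minimal sketch (the scores in the usage example are made-up illustrative values, not the paper's results):

```python
def auc(pos_scores, neg_scores):
    """Area under the ROC curve via the rank (Mann-Whitney) statistic:
    the fraction of positive/negative pairs ranked correctly, ties 0.5."""
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos_scores for n in neg_scores)
    return wins / (len(pos_scores) * len(neg_scores))
```

For example, auc([0.9, 0.8, 0.4], [0.7, 0.3, 0.2]) ranks 8 of the 9 positive/negative pairs correctly and returns 8/9 ≈ 0.89. Because the statistic depends only on the ranking of scores and not on any decision threshold, it is insensitive to the class ratio, which is why it suits imbalanced data sets better than accuracy.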

pp. Vincent Myers. 258-267. Zheng. [12] Proc. G.” ACM SIGKDD [7] Sathi T Marath . 2. Yousef. 1-2 (Oct.International Journal of Advanced Research in ISSN : 2347 . 2011 Sixth Int’l Conf. T. vol. Japkowicz. 97-122. 1999. VOL. Seventh European Conf. [17] C.A Genetic Algorithm Based Optimal [20] Y. Sixth International Conference on 18-22 Dec. and A.2012 IEEE Learning Research. and [22] X.” J. 2. International Conference on Systems. no. Class imbalance Classification. 80-89. 57. 1-6. [21] P. 1883-1888.Effect Of Feature Selection. Secaucus. “An Extensive Empirical Study of Feature [6] Xiaolong Zhang. Kamel. Provost. 2004. Explorations Newsletter. 97--136. Member.” Proc. nos. 2003. Nova Scotia November 2010. 171-182. Issue 2.Mine Classification With Selection Metric for Small Samples and Imbalanced Data Imbalanced Data. pp. Hall. ICDM Challenge 2000.” 53.” J.pp. pp. “Kernel-Based Distance Metric degree of Doctor of Philosophy at Dalhousie University Learning for Microarray Data Classification. Artificial Intelligence Research. 315-354. “Learning Classifiers from Only Positive and Unlabeled Data. vol. no. 2001. pp. 3. “An Introduction to Variable and Issue on Learning from Imbalanced Data Sets.ijarcst. Wang. Neural Networks. 42. pp.Imbalanced Data Classification Algorithm Selection Metrics for Text Classification. pp. Machine Learning Research. Manevitz and M. 177-195. AAAI Press. pp. Int’l problem: Traditionalmethods and a new algorithm". “Magical Thinking in Data Mining: Lessons from Genetic Algorithm Based Optimal Feature Selection for Web CoIL Challenge 2000. Forman. Principles and ON KNOWLEDGE AND DATA ENGINEERING. JULY 2009. 2003. Casasent. J. 2006. International Symposium on15-18 June 2011. 1/2. pp. Senior [19] N. AAAI ’00 Workshop Learning from Imbalanced Data [32] Kononenko. 3. Springer-VerlagNew York. [30] Dietterich. NJ. Chen. Seoul. [11] X. 2003. 139-154. “FAST: A ROC-Based Feature Miranda Schatten Silvious. Drummond and R.1182. 1994. 2000. [24] G. 
De Raedt.June 2014) ISSN : 2347 . vol.9817 (Print) References 360-363. Machine Learning Research. pp. M.” Machine Learning. Man.” Proc. Estimating attributes: analysis and Sets. 22. Chawla. “One-Class SVMs for Document Classification. in Intelligent Systems and Applications (INISTA). ed. NO. 1-11. Learn. 6. extensions of RELIEF. “Pruning Support [31] Kira. L. “A Bias-Variance [3] Yanmin Sun. 60-69. pp. vol. Elkan and K. pp. Ver. 6. and Xue-wen Chen.V. © 2014. 2004 [29] Robnik-Šikonja. current . Grobelnik. I. (1992). vol. 6. Weiss and F. [5] Nadeem Qazi. van Someren. ACM SIGKDD ’08. Machine Learning. vol. and Cybernetics [25] Z. 129-134. I. SIGKDD Explorations Newsletter. and Kononenko. 2006. 1.C. [14] N. 2006. “Learning when Training Data Are Empirical Analysis of ReliefF and RReliefF. 426- Page Classification. Practice of Knowledge Discoveryin Databases (PKDD). Noto. 2003. Kowalczyk. Chen. vol. 2004. Korea Text Categorization on Imbalanced Data. pp. pp. and R. pp. Theoretical and [10] G. Elkan. Introduction to Machine Learning. SENSING LETTERS. “Supervised versus Unsupervised Binary Learning by Feedforward Neural Networks. 107-119. M. IEEE. VOL. Wasikowski. Eds.M. 299. Japkowicz. Gerlach. 2004. Machine Learning. 17th (Catania. vol. “Editorial: Special [28] Guyon and A. vol. Kotcz.8446 (Online) Computer Science & Technology (IJARCST 2014) Vol. in: Joint Conf. 7.IEEE GEOSCIENCE AND REMOTE Classification Problems. Chawla.” J. Machine learning research: Four 2003. AI Magazine 18(4). Wu. 23-69.” Proc. K. “Feature Selection for Minority Over-sampling(SMOTE) And Under-sampling On Unbalanced Class Distribution and Naive Bayes. 592-602. Lazarevic. pp.” Proc.. '06. ACM SIGKDD ’01. (1997). “Boosting for Learning Feature Selection for Web Page Classification.” Proc. Alpaydin. Synthetic [23] D. Submitted in partial fulfilment of the requirements for the [26] H. Williams. Raskutti and A. 
Combating the Small Sample Class Imbalance “SMOTEBoost: Improving Prediction of the Minority Class Problem Using Feature Selection. 2012 14th International 16th Int’l Conf. and K. F. N. 2003).” ACM SIGKDD Explorations Newsletter. 213-220. vol. “Extreme Rebalancing for SVMs: A SVM Study. 2000. ACM SIGKDD ’08. “Exploiting the Cost (In) Conference on Machine Learning on Machine Learning sensitivity of Decision Tree Splitting Criteria. Italy). NO. Mladeni_c and M. [8] Selma Ayşe Özel Department of Computer Engineering.N. Data Mining. 2005. and D. pp. Holte. COEX. 43-45. [4] David P. “Feature Selection for October 14-17. 2004. no. 19. A. IEEE. 3. 124-133. OCTOBER 2010. [1] Mike Wasikowski. 1/2. Japkowicz.” ACM Feature Selection. [2] Selma Ayşe Özel. 2012. 2006.” BMC Halifax. nos. X. and Rendell. IJARCST All Rights Reserved 520 www. 1157. Bergadano and L. Elisseeff. and Y. 1. Bowyer. B. 2008. 10. 239-246. "The feature selection Vectorsfor Imbalanced Data Classification. Srihari.Innovations Multiple Classes with Imbalanced Class Distribution. Conference on Modelling and Simulation. IEEE. L.A [27] C.” Proc.” Proc. Member.Data Mining. Costly: The Effect of Class Distribution on Tree Induction. 2008. Proceedings of AAAI-92. IEEE TRANSACTIONS in Boosting. Sun. Date of Conference: 15-18 June 2011 431. Mach. 1289-1305. pp. MIT Press. Int’l Conf.” Machine Learning. 3 (April .” Proc. [16] L. der Putten and M. 2001. In Proceedings of the European [13] C. Large-Scale Web Page Classification . [18] E. [9] N. 6. Machine Based on Boosting and Cascade Model. Xiong and X. Boosting for Learning Multiple Classes with Analysis of a Real World Learning Problem: The CoIL Imbalanced Class Distribution. 2001 [15] A. Bioinformatics. Member. Chen and M. pp.