
CHAPTER 1

INTRODUCTION

1.1 INTRODUCTION TO THE DOMAIN


Data mining is the process of extracting, or mining, knowledge from large amounts of
data. It is an analytic process designed to explore large amounts of data in search of
consistent patterns and systematic relationships between variables, and then to validate the
findings by applying the detected patterns to new subsets of data. It can be viewed as a natural
evolution of information technology, driven by the development of functionalities such as data
collection, database creation, data management, and data analysis. It is the process in which
intelligent methods are applied in order to extract data patterns from databases, data warehouses,
or other information repositories. Data mining is one step in the knowledge discovery process,
and this step interacts with a user or a knowledge base. Mining can be performed on different
data repositories; the major ones are relational databases, transactional databases, time-series
databases, text databases, heterogeneous databases, and spatial databases.
Data mining tasks can be classified into two types: descriptive and predictive.
Descriptive mining tasks characterize the general properties of the data in the database, while
predictive mining tasks perform inference on the current data in order to make predictions.
Data mining systems can be classified according to the kinds of databases mined, the kinds of
knowledge mined, the techniques used, or the applications addressed, and a data mining query
language can be designed to support ad hoc and interactive mining. The main functionalities are
concept and class description, association and correlation analysis, classification and prediction,
cluster analysis, and outlier analysis. Concise and precise descriptions of a class or a concept are
called concept and class descriptions. Frequent patterns are patterns that occur frequently in the
data, and mining them leads to the discovery of interesting associations and correlations within
the data. Classification is the process of finding a model that describes and distinguishes data
classes or concepts. The process of finding interesting, interpretable, useful and novel knowledge
in a large set of data is known as Knowledge Discovery in Databases (KDD). The steps involved
in mining the data are pre-processing, mining the data, and interpreting the results.
Typically, these patterns cannot be discovered by traditional data exploration because
the relationships are too complex or because there is too much data. Knowledge discovery in
databases (KDD) is a relatively young and interdisciplinary field of computer science; it is the
process of discovering new patterns in large data sets using methods at the
intersection of artificial intelligence, machine learning, statistics and database systems. The
goal of data mining is to extract knowledge from a data set in a human-understandable
structure. Data mining is the entire process of applying computer-based methodology,
including new techniques for knowledge discovery, to data. Databases, text documents,
computer simulations, and social networks are typical sources of data for mining, and the
activity is also referred to as knowledge extraction, data/pattern analysis, data archaeology,
or business intelligence.
1.1.1 THE FOUNDATIONS OF DATA MINING
Data mining techniques are the result of a long process of research and product
development. This evolution began when business data was first stored on computers,
continued with improvements in data access, and more recently, generated technologies that
allow users to navigate through their data in real time. Data mining takes this evolutionary
process beyond retrospective data access and navigation to prospective and proactive
information delivery. Data mining is ready for application in the business community because
it is supported by three technologies that are now sufficiently mature:
 Massive data collection
 Powerful multiprocessor computers
 Data mining algorithms
Commercial databases are growing at unprecedented rates. A recent META Group
survey of data warehouse projects found that 19% of respondents are beyond the 50 gigabyte
level, while 59% expect to be there by the second quarter of 1996. In some industries, such as
retail, these numbers can be much larger. The accompanying need for improved
computational engines can now be met in a cost-effective manner with parallel
multiprocessor computer technology. Data mining algorithms embody techniques that have
existed for at least 10 years, but have only recently been implemented as mature, reliable,
understandable tools that consistently outperform older statistical methods.
In the evolution from business data to business information, each new step has built
upon the previous one. For example, dynamic data access is critical for drill-through in data
navigation applications, and the ability to store large databases is critical to data mining.
From the user’s point of view, the four steps listed in Table 1 were revolutionary because
they allowed new business questions to be answered accurately and quickly.
1.1.2 STEPS IN THE DATA MINING PROCESS
Figure 1.1 shows the steps to be followed in data mining process.
 Data Integration: First of all, the data are collected and integrated from all the
different sources.
 Data Selection: Not all the data collected in the first step are needed, so in this step
only the data that are useful for data mining are selected.
 Data Cleaning: The collected data are not clean and may contain errors, missing values,
or noisy or inconsistent records, so different techniques must be applied to remove such
anomalies.
 Data Transformation: Even after cleaning, the data are not ready for mining and need
to be transformed into forms appropriate for mining. The techniques used to
accomplish this are smoothing, aggregation, normalization, etc.
 Data Mining: The data are now ready for the data mining techniques that discover
interesting patterns. Clustering and association analysis are among
the many different techniques used for data mining.
 Pattern Evaluation and Knowledge Presentation: This step involves visualization,
transformation, and the removal of redundant patterns from the generated patterns.
 Decisions / Use of Discovered Knowledge: This step helps the user make use of the
acquired knowledge to take better decisions.

Figure 1.1 Steps in Data Mining


1.1.3 DATA MINING PROCESS
1.1.3.1 DEFINING THE DATA MINING PROBLEM
Most data-based modeling studies are performed for a particular application domain.
Hence, domain-specific knowledge and experience are usually necessary in order to come up
with a meaningful problem statement. Unfortunately, many application studies tend to focus
on the data mining technique at the cost of a clear problem statement. In this step,
a model usually specifies a set of variables for the unknown dependency and, if possible, a
general form of this dependency as an initial hypothesis. The first step therefore requires the
combined expertise of the application domain and of data mining modeling. In successful data
mining applications, this cooperation does not stop in the initial phase; it continues during the
entire data mining process. The prerequisite for knowledge discovery is to understand the data
and the business; without this understanding, no algorithm, regardless of its sophistication, is
able to provide results in which one can be confident.
1.1.3.2 COLLECTING THE DATA MINING DATA
This process is concerned with the collection of data from different sources and
locations. The current methods used to collect data are:
• Internal data: data are usually collected from existing databases, data warehouses, and
OLAP systems. Actual transactions recorded by individuals are the richest source of information
and, at the same time, the most challenging to turn into useful knowledge.
• External data: data items can be collected from demographics, psychographics and web
graphics, in addition to data shared within a company.
1.1.3.3 DETECTING AND CORRECTING THE DATA
Raw data sets initially prepared for data mining are often large; many are related to
human activity and therefore tend to be messy. Real-world databases are
subject to noisy, missing, and inconsistent data due to their typically huge size, often several
gigabytes or more. Data preprocessing is commonly used as a preliminary data mining
practice. It transforms the data into a format that can be processed easily and effectively by
the users. There are a number of data preprocessing techniques. Data cleaning
can be applied to remove noise and to correct inconsistencies, outliers and missing values.
Data integration merges data from multiple sources into a coherent data store, such as a data
warehouse or a data cube. Data transformations, such as normalization, may then be applied;
normalization improves the accuracy and efficiency of mining algorithms involving distance
measurements. Data reduction can reduce the data size by aggregating data and eliminating
redundant features. These preprocessing techniques, when applied prior to mining, can
significantly improve the overall data mining results. Since multiple data sets may be used in
various transactional formats, extensive data preparation may be required. There are various
commercial software products specifically designed for data preparation, which can
facilitate the task of organizing the data prior to importing it into a data mining tool. A minimal
sketch of the normalization step is given below.
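As an illustration of the normalization technique mentioned above, the following is a minimal Java sketch (class and method names are illustrative and not taken from any specific tool) that rescales every numeric attribute to the range [0, 1] using min-max normalization:

// Minimal sketch of min-max normalization: each column of a numeric data
// matrix is rescaled to the range [0, 1].
public class MinMaxNormalizer {

    /** Rescales every column of data to [0, 1] in place. */
    public static void normalize(double[][] data) {
        if (data.length == 0) {
            return;
        }
        int columns = data[0].length;
        for (int j = 0; j < columns; j++) {
            double min = Double.POSITIVE_INFINITY;
            double max = Double.NEGATIVE_INFINITY;
            for (double[] row : data) {
                min = Math.min(min, row[j]);
                max = Math.max(max, row[j]);
            }
            double range = max - min;
            for (double[] row : data) {
                // Constant columns are mapped to 0 to avoid division by zero.
                row[j] = (range == 0) ? 0.0 : (row[j] - min) / range;
            }
        }
    }

    public static void main(String[] args) {
        double[][] data = { { 2.0, 100.0 }, { 4.0, 300.0 }, { 6.0, 200.0 } };
        normalize(data);
        for (double[] row : data) {
            System.out.println(java.util.Arrays.toString(row));
        }
    }
}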
1.1.3.4 ESTIMATION AND BUILDING THE MODEL
Figure 1.2 represents the process involved in estimation and building the model.
This process includes four parts:
1. Select data mining task,
2. Select data mining method,
3. Select suitable algorithm
4. Extract knowledge

Select Data Mining Task → Select Data Mining Method → Select suitable algorithm → Extract knowledge
Figure 1.2 Estimation and building the model


1. Select Data Mining Task(s)
Selecting which task to use depends on whether the model is predictive or
descriptive. Predictive models predict the values of data using known results and/or
information found in large data sets or historical data, or use some variables or fields in the data
set to predict unknown values; classification, regression, time-series analysis, prediction and
estimation are tasks for a predictive model.
A descriptive model identifies patterns or relationships in data and serves as a way to
explore the properties of the data examined; clustering, summarization, association rules and
sequence discovery are tasks for a descriptive model. The relative importance of prediction and
description for particular data mining applications can vary considerably, which is why selecting
the task depends on whether the model is predictive or descriptive.
2. Select Data Mining Method(s)
After selecting the task, the method is chosen; for example, assuming a predictive model
whose task is classification, the method could be rule induction, a decision tree or a neural
network. In most research in this area, researchers estimate the model that is most likely to
produce acceptable results. There are a number of methods for model estimation, including,
but not limited to, neural networks, decision trees, association rules, genetic algorithms,
cluster detection and fuzzy logic.
3. Select suitable algorithm
The next step is to select a specific algorithm that implements the general
method. All data mining algorithms include three primary components:
(1) model representation,
(2) model evaluation,
(3) search.
4. Extracting knowledge
This is the last step in building the model; it yields the results (the answers to the
problem being solved by data mining) after running the algorithm. This can be
best explained by presenting an example such as auction fraud detection.
5. Model Description and Validation
In all cases, data mining models should assist users in decision making. Hence, such
models need to be interpretable in order to be useful, because humans are not likely to base
their decisions on complex “black-box” models; the goals of model accuracy and
interpretability are therefore somewhat contradictory. Modern data mining methods are
expected to yield highly accurate results using high-dimensional models. The problem of
interpreting these models is very important and is considered a separate task, with
specific techniques to validate the results.
Model validity is a necessary but insufficient condition for the credibility and
acceptability of data mining results. If, for example, the initial objectives are incorrectly
identified or the data set is improperly specified, the data mining results expressed through
the model will not be useful even though the model itself may still be valid. One always has to
keep in mind that a problem correctly formulated is a problem half-solved. The ultimate goal of
a data mining process should not be just to produce a model for the problem at hand, but to
provide one that is sufficiently credible and acceptable to be implemented by the decision-makers;
this may require considering all the data, i.e., using a dynamic database.
1.1.4 PROFITABLE APPLICATIONS
A wide range of companies have deployed successful applications of data mining.
While early adopters of this technology have tended to be in information-intensive industries
such as financial services and direct mail marketing, the technology is applicable to any
company looking to leverage a large data warehouse to better manage their customer
relationships. Two critical factors for success with data mining are: a large, well-integrated
data warehouse and a well-defined understanding of the business process within which data
mining is to be applied (such as customer prospecting, retention, campaign management, and
so on).
Some successful application areas include:
 A pharmaceutical company can analyze its recent sales force activity and their results
to improve targeting of high-value physicians and determine which marketing
activities will have the greatest impact in the next few months. The data needs to
include competitor market activity as well as information about the local health care
systems. The results can be distributed to the sales force via a wide-area network that
enables the representatives to review the recommendations from the perspective of the
key attributes in the decision process. The ongoing, dynamic analysis of the data
warehouse allows best practices from throughout the organization to be applied in
specific sales situations.
 A credit card company can leverage its vast warehouse of customer transaction data to
identify customers most likely to be interested in a new credit product. Using a small
test mailing, the attributes of customers with an affinity for the product can be
identified. Recent projects have indicated more than a 20-fold decrease in costs for
targeted mailing campaigns over conventional approaches.
 A diversified transportation company with a large direct sales force can apply data
mining to identify the best prospects for its services. Using data mining to analyze its
own customer experience, this company can build a unique segmentation identifying
the attributes of high-value prospects. Applying this segmentation to a general
business database such as those provided by Dun & Bradstreet can yield a prioritized
list of prospects by region.
 A large consumer package goods company can apply data mining to improve its sales
process to retailers. Data from consumer panels, shipments, and competitor activity
can be applied to understand the reasons for brand and store switching. Through this
analysis, the manufacturer can select promotional strategies that best reach their target
customer segments.
1.2. INTRODUCTION OF RESEARCH
Many algorithms and methods have been proposed to ameliorate the effect of class
imbalance on the performance of learning algorithms. There are three main approaches to
these methods.
• Internal approaches acting on the algorithm. These approaches modify the learning
algorithm to deal with the imbalance problem. They can adapt the decision threshold to create
a bias toward the minority class or introduce costs in the learning process to compensate for the
minority class.
• External approaches acting on the data. These algorithms act on the data instead of the
learning method. They have the advantage of being independent of the classifier used.
There are two basic approaches: oversampling the minority class and undersampling the
majority class (a minimal undersampling sketch is given after this list).

• Combined approaches that are based on boosting accounting for the imbalance in the
training set. These methods modify the basic boosting method to account for minority class
underrepresentation in the data set.
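As referenced above, the following is a minimal Java sketch of random undersampling of the majority class; the Instance class and the label convention (1 = minority, 0 = majority) are assumptions made only for illustration:

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Minimal sketch of random undersampling: majority-class instances are shuffled
// and only as many as there are minority-class instances are kept.
public class RandomUndersampler {

    public static class Instance {
        public final double[] features;
        public final int label;            // assumed convention: 1 = minority, 0 = majority
        public Instance(double[] features, int label) {
            this.features = features;
            this.label = label;
        }
    }

    public static List<Instance> undersample(List<Instance> training, long seed) {
        List<Instance> minority = new ArrayList<Instance>();
        List<Instance> majority = new ArrayList<Instance>();
        for (Instance inst : training) {
            if (inst.label == 1) {
                minority.add(inst);
            } else {
                majority.add(inst);
            }
        }
        Collections.shuffle(majority, new Random(seed));
        List<Instance> balanced = new ArrayList<Instance>(minority);
        // Keep only as many majority instances as there are minority instances.
        int keep = Math.min(minority.size(), majority.size());
        balanced.addAll(majority.subList(0, keep));
        return balanced;
    }
}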
There are two principal advantages of choosing sampling over cost-sensitive methods.
First, sampling is more general as it does not depend on the possibility of adapting a certain
algorithm to work with classification costs. Second, the learning algorithm is not modified,
since modifying it can cause difficulties and add additional parameters to be tuned.
As stated earlier, our goal is to obtain a method that is both scalable and able to
sample the most relevant instances to deal with class-imbalanced data sets. Scalability will be
achieved using a divide-and-conquer approach. The ability to sample instances to deal with
class-imbalanced data sets will be achieved by means of the combination of several rounds of
instance selection in balanced subsets of the whole data set.
1.3 PROBLEM STATEMENT
Most learning algorithms expect an approximately even distribution of instances
among the different classes and suffer, to different degrees, when that is not the case. Dealing
with the class-imbalance problem is a difficult but relevant task, as many of the most
interesting and challenging real-world problems have a very uneven class distribution. Existing
systems, however, do not consider the multi-class problem. In particular, many ensemble methods
have been proposed to deal with such imbalance, but most efforts so far are focused only on two-
class imbalance problems. There are unsolved issues in multi-class imbalance problems,
which exist in real-world applications, and no existing method can deal with multi-class
imbalance problems efficiently and effectively.
1.4 OBJECTIVE
It is desirable to develop a more effective and efficient method to handle multi-class
imbalance problems. This work studies the impact of multiple classes on the performance of
random oversampling and undersampling techniques by discussing the “multi-minority” and
“multi-majority” cases in depth; both cases negatively affect the overall and minority-class
performance. A set of benchmark data sets with multiple minority and/or majority classes
is used with the aim of tackling multi-class imbalance problems.

CHAPTER 2
LITERATURE REVIEW

2.1 CLASSIFIER LEARNING: AN EMPIRICAL STUDY


Machine learning and data mining methods, and the acceptance of these methods,
have advanced to the point where they are commonly being applied to very large, real-world
problems. Addressing these real-world problems has focused attention, and research, on
problems that were once only rarely considered. For example, many research papers and
several workshops have recently been directed at the problems of learning from data sets with
unbalanced class distributions and where the costs of misclassifying examples are non-
uniform. In order for the acceptance and the use of these methods to grow, research must
continue to address the practical concerns that arise when dealing with real-world data sets.
This research is motivated by the fact that obtaining data in a form suitable for
learning is often costly and that learning from large data sets may also be costly. The costs
associated with creating a useful data set include the cost of obtaining the data, cleaning the
data, transporting/storing the data, labelling the data, and transforming the raw data into a
form suitable for learning. The costs associated with learning from the data involve the cost
of computer hardware, the “cost” associated with the time it takes to learn from the data, and
the “opportunity cost” associated with not being able to learn from extremely large data sets
due to limited computational resources.
This article describes and analyzes the results from a comprehensive set of
experiments designed to investigate the effect that the class distribution of the training set has
on classifier performance. F. Provost and G. M. Weiss [11] evaluate classifier performance using
two performance measures and show that in many cases the naturally occurring class distribution
is not best for learning and, consequently, that when the training-set size needs to be restricted,
a class distribution other than the natural class distribution should be chosen. They also
characterize how the optimal class distribution relates to the naturally occurring class
distribution and answer a number of basic questions about how the class distribution of the
training set affects learning.
This article analyzes the effect of class distribution on classifier learning. It begins by
describing the different ways in which class distribution affects learning and how it affects
the evaluation of learned classifiers (J. R. Cano, F. Herrera, and M. Lozano [3]). It then presents
the results of two comprehensive experimental studies. The first study compares the
performance of classifiers generated from unbalanced data sets with the performance of
classifiers generated from balanced versions of the same data sets. This comparison allows us
to isolate and quantify the effect that the training set’s class distribution has on learning and
to contrast the performance of the classifiers on the minority and majority classes. The second
study assesses which distribution is "best" for training, with respect to two performance
measures: classification accuracy and the area under the ROC curve (AUC). A tacit
assumption behind much research on classifier induction is that the class distribution of the
training data should match the “natural” distribution of the data. This study shows that the
naturally occurring class distribution often is not best for learning, and that substantially
better performance can often be obtained by using a different class distribution (see C. J.
Carmona, J. Derrac, S. García, F. Herrera and I. Triguero [2]). Understanding how classifier
performance is affected by class distribution can help practitioners choose training data,
since in real-world situations the number of training examples often must be limited due to
computational costs or the costs associated with procuring and preparing the data.
2.2 LEARNING IMBALANCED DATA
The imbalanced learning problem has drawn a significant amount of interest from
academia, industry, and government funding agencies. The fundamental issue with the
imbalanced learning problem is the ability of imbalanced data to significantly compromise
the performance of most standard learning algorithms. Most standard algorithms assume or
expect balanced class distributions or equal misclassification costs. Therefore, when
presented with complex imbalanced data sets, these algorithms fail to properly represent the
distributive characteristics of the data and resultantly provide unfavorable accuracies across
the classes of the data. When translated to real-world domains, the imbalanced learning
problem represents a recurring problem of high importance with wide-ranging implications,
warranting increasing exploration.
With the great influx of attention devoted to the imbalanced learning problem and the
high level of activity in this field, remaining knowledgeable of all current
developments can be an overwhelming task. An estimate of the number of publications on
the imbalanced learning problem over the past decade can be obtained from the Institute of
Electrical and Electronics Engineers (IEEE) and Association for Computing Machinery (ACM)
databases. As can be seen, the number of publications in this field is growing at an explosive
rate. Due to the relatively young age of this field and because of its rapid expansion, consistent
assessments of past and current works, in addition to projections for future
research, are essential for long-term development. This work seeks to provide a survey of the
current understanding of the imbalanced learning problem and the state-of-the-art solutions
created to address it, and in order to stimulate future research it also highlights the major
opportunities and challenges for learning from imbalanced data (A. Estabrooks, N. Japkowicz
and T. Jo [5]).
The imbalanced learning problem is concerned with the performance of learning
algorithms in the presence of underrepresented data and severe class distribution skews. Due
to the inherent complex characteristics of imbalanced data sets, learning from such data
requires new understandings, principles, algorithms, and tools to transform vast amounts of
raw data efficiently into information and knowledge representation. This work provides a
comprehensive review of the development of research in learning from imbalanced data (E. A.
Garcia and H. He [7]).
The focus is to provide a critical review of the nature of the problem, the state-of-the-art
technologies, and the current assessment metrics used to evaluate learning performance under
the imbalanced learning scenario, and, in order to stimulate future research in this
field, to highlight the major opportunities and challenges as well as potentially important
research directions for learning from imbalanced data.
2.3 EVOLUTIONARY-BASED INSTANCE SELECTION
The class imbalance classification problem is one of the current challenges in data
mining. It appears when the number of instances of one class is much lower than the
instances of the other class(es). Since standard learning algorithms are developed to minimize
the global measure of error, which is independent of the class distribution, in this context this
causes a bias towards the majority class in the training of classifiers and results in a lower
sensitivity in detecting the minority class examples. Imbalance in class distribution is
pervasive in a variety of real-world applications, including but not limited to
telecommunications, WWW, finance, biology and medicine.
A main process in data mining is the one known as data reduction. In classification, it
aims to reduce the size of the training set mainly to increase the efficiency of the training
phase and even to reduce the classification error rate. Instance Selection (IS) is one of the
most known data reduction techniques in data mining. The problem of yielding an optimal
number of generalized examples for classifying a set of points is NP-hard. A large but finite
subset of them can be easily obtained following a simple heuristic algorithm acting over the
training data. However, almost all generalized examples produced could be irrelevant and, as
a result, the most influential ones must be distinguished.
This work proposes the use of EAs for generalized instance selection in imbalanced
classification domains. The objective is to increase the accuracy of this type of representation
by means of selecting the best suitable set of generalized examples to enhance its
classification performance over imbalanced domains. A large collection of imbalanced
data sets from the KEEL-dataset repository is selected for the experimental analysis (J. R.
Cano, F. Herrera, and M. Lozano [3]). In order to deal with the problem of imbalanced data
sets, the study also involves the use of a preprocessing technique, the ‘‘Synthetic
Minority Over-sampling Technique’’ (SMOTE), to balance the distribution of training
examples in both classes (W. Bowyer, N. V. Chawla, L. O. Hall, and W. P. Kegelmeyer
[1]). The empirical study has been checked via non-parametric statistical testing, and the
results show an improvement in accuracy for this approach, while the number of
generalized examples stored in the final subset is much lower.
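A simplified Java sketch of the SMOTE idea mentioned above is given below; it interpolates between a minority instance and one of its k nearest minority-class neighbours using plain Euclidean distance, with no handling of nominal attributes, so it illustrates only the core of the technique rather than the full algorithm:

import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.Random;

// Simplified sketch of SMOTE-style oversampling: every synthetic minority instance
// lies on the segment between a real minority instance and one of its k nearest
// minority-class neighbours.
public class SmoteSketch {

    public static List<double[]> generate(List<double[]> minority, int k,
                                          int syntheticPerInstance, Random rnd) {
        List<double[]> synthetic = new ArrayList<double[]>();
        for (double[] sample : minority) {
            List<double[]> neighbours = kNearest(sample, minority, k);
            if (neighbours.isEmpty()) {
                continue;                                   // cannot interpolate with a single instance
            }
            for (int s = 0; s < syntheticPerInstance; s++) {
                double[] neighbour = neighbours.get(rnd.nextInt(neighbours.size()));
                double[] created = new double[sample.length];
                double gap = rnd.nextDouble();              // interpolation factor in [0, 1)
                for (int j = 0; j < sample.length; j++) {
                    created[j] = sample[j] + gap * (neighbour[j] - sample[j]);
                }
                synthetic.add(created);
            }
        }
        return synthetic;
    }

    private static List<double[]> kNearest(final double[] sample, List<double[]> pool, int k) {
        List<double[]> candidates = new ArrayList<double[]>(pool);
        candidates.remove(sample);                          // never pair a point with itself
        Collections.sort(candidates, new Comparator<double[]>() {
            public int compare(double[] a, double[] b) {
                return Double.compare(distance(sample, a), distance(sample, b));
            }
        });
        return candidates.subList(0, Math.min(k, candidates.size()));
    }

    private static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int j = 0; j < a.length; j++) {
            sum += (a[j] - b[j]) * (a[j] - b[j]);
        }
        return sum;                                         // squared distance suffices for ranking
    }
}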
2.4 ENSEMBLE CONSTRUCTION
An ensemble of classifiers consists of a combination of different classifiers,
homogeneous or heterogeneous, to jointly perform a classification task. Ensemble
construction is one of the fields of Artificial Intelligence that is receiving most research
attention, mainly due to the significant performance improvements over single classifiers that
have been reported with ensemble methods.
Techniques using multiple models usually consist of two independent phases: model
generation and model combination. Most techniques are focused on obtaining a group of
classifiers which are as accurate as possible but which disagree as much as possible. These
two objectives are somewhat conflicting, since if the classifiers are more accurate, it is
obvious that they must agree more frequently. Many methods have been developed to enforce
diversity on the classifiers that form the ensemble; the literature identifies four fundamental
approaches: (i) using different combination schemes, (ii) using different classifier models, (iii)
using different feature subsets, and (iv) using different training sets. Perhaps the last one is the
most commonly used. The algorithms in this last approach can be divided into two groups:
algorithms that adaptively change the distribution of the training set based on the
performance of the previous classifiers, and algorithms that do not adapt the distribution (N.
Garcia-Pedrajas [6]). Boosting methods are the most representative methods of the first
group. The most widely used boosting methods are ADABOOST and its numerous variants,
and Arc-x4. They are based on adaptively increasing the probability of sampling the instances
that are not classified correctly by the previous classifiers (L. Kuncheva and C. J.
Whitaker [9]).
This work proposes a novel approach for ensemble construction based on the use of
nonlinear projections to achieve both accuracy and diversity of individual classifiers. The
proposed approach combines the philosophy of boosting, putting more effort on difficult
instances, with the basis of the random subspace method. Our main contribution is that,
instead of using a random subspace, we construct a projection taking into account the instances
which have posed most difficulties to previous classifiers. In this way, consecutive nonlinear
projections are created by a neural network trained using only incorrectly classified instances.
The feature subspace induced by the hidden layer of this network is used as the input space to
a new classifier. The method is compared with bagging and boosting techniques, showing an
improved performance on a large set of 44 problems from the UCI Machine Learning
Repository. An additional study showed that the proposed approach is less
sensitive to noise in the data than boosting methods.
2.5 VARIABLE AND FEATURE SELECTION
Variable and feature selection serves several objectives: facilitating data visualization and
data understanding, reducing the measurement and storage requirements, reducing training and
utilization times, and defying the curse of dimensionality to improve prediction performance.
Some methods put more emphasis on one aspect than another, and this is another point of
distinction between
this special issue and previous work. The papers in this issue focus mainly on constructing
and selecting subsets of features that are useful to build a good predictor. This contrasts with
the problem of finding or ranking all potentially relevant variables. Selecting the most
relevant variables is usually suboptimal for building a predictor, particularly if the variables
are redundant. Conversely, a subset of useful variables may exclude many redundant, but
relevant, variables.
The depth of treatment of various subjects reflects the proportion of papers covering
them: the problem of supervised learning is treated more extensively than that of
unsupervised learning; classification problems serve more often as illustration than regression
problems, and only vector input data is considered. Complexity is progressively introduced
throughout the sections: The first section starts by describing filters that select variables by
ranking them with correlation coefficients. Limitations of such approaches are illustrated by a
set of constructed examples. Subset selection methods are then introduced. These include
wrapper methods that assess subsets of variables according to their usefulness to a given
predictor. This shows how some embedded methods implement the same idea, but proceed
more efficiently by directly optimizing a two-part objective function with a goodness-of-fit
term and a penalty for a large number of variables (A. Elisseeff and I. Guyon [4]). The issue then
turns to the problem of feature construction, whose goals include increasing the predictor
performance and building more compact feature subsets. All of the previous steps benefit from
reliably assessing the statistical significance of the relevance of features, so model selection
methods and statistical tests used to that effect are briefly reviewed.
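As an illustration of the filter approach described above, the following is a minimal Java sketch that scores a single feature by the absolute Pearson correlation between the feature column and a numeric class label; the names are purely illustrative, and a real study would rank all features and also assess statistical significance:

// Minimal sketch of a filter-style feature score: the absolute Pearson correlation
// between one feature column and the label vector.
public class CorrelationRanking {

    /** Returns |Pearson correlation| between column 'column' of data and the labels. */
    public static double score(double[][] data, int column, double[] labels) {
        int n = data.length;
        double meanX = 0.0, meanY = 0.0;
        for (int i = 0; i < n; i++) {
            meanX += data[i][column];
            meanY += labels[i];
        }
        meanX /= n;
        meanY /= n;
        double cov = 0.0, varX = 0.0, varY = 0.0;
        for (int i = 0; i < n; i++) {
            double dx = data[i][column] - meanX;
            double dy = labels[i] - meanY;
            cov += dx * dy;
            varX += dx * dx;
            varY += dy * dy;
        }
        if (varX == 0.0 || varY == 0.0) {
            return 0.0;                       // constant columns carry no information
        }
        return Math.abs(cov / Math.sqrt(varX * varY));
    }
}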
2.6 WEIGHTED INSTANCE SELECTION
This work approaches the problem of constructing ensembles of classifiers from the
point of view of instance selection. Instance selection is aimed at obtaining a subset of the
instances available for training able to achieve, at least, the same performance as the whole
training set. In this way instance selection algorithms try to keep the performance of the
classifiers while reducing the number of instances in the training set. Meanwhile, boosting
methods construct an ensemble of classifiers iteratively, focusing each new member on the
most difficult instances by means of a biased distribution of the training instances (N.
Garcia-Pedrajas [6]). This work shows how these two methodologies can be combined
advantageously: the instance selection algorithm is focused on boosting the classifiers by
evaluating the selection algorithm under the biased distribution of the instances given by the
boosting method.
This method can be considered as boosting by instance selection. Instance selection
has been mostly developed and used for k-nearest neighbor (k-NN) classifiers (D. L.
Wilson [12]), so, as a first step, our methodology is suited to constructing ensembles of k-NN
classifiers. Constructing ensembles of classifiers by means of instance selection has the
important feature of reducing the complexity of the final ensemble as only a subset of the
instances is selected for each classifier.
2.7 DATA REDUCTION IN KDD
Advances in digital and computer technology that have led to the huge expansion of
the Internet means that massive amount of information and collection of data have to be
processed. Scientific research, ranging from astronomy to human natural genome, faces the
same problem of how to deal with vast amounts of information. Raw data is rarely used
directly, and manual analysis simply cannot keep up with the fast growth of data. Knowledge
discovery in databases (KDD) and data mining (DM) can help deal with this problem, because
their aim is to turn raw data into nuggets of knowledge and to create a competitive edge. Due to
the enormous amounts of data, much of the current research is based on scaling up DM
algorithms. Other research has also tackled scaling down the data. The main problem of scaling
down data is how to select the relevant data and then apply a DM algorithm. This task is carried
out in the data preprocessing phase of a KDD process (J. R. Cano, F. Herrera, and M. Lozano [3]).
The aim of this work is to study the application of some representative EA models for
data reduction, and to compare them with non-evolutionary instance selection algorithms
(hereafter referred to as classical ones). In order to do this, our study is carried out from a
twofold perspective.
1) IS-PS: the analysis of the results obtained when selecting prototypes (instances) for a 1-
NN (nearest neighbor) algorithm. This approach will be denoted as instance selection-
prototype selection (IS-PS).
2) IS-TSS: the analysis of the behavior of EAs as instance selectors for data reduction, when
selecting instances to compose the training set that will be used by C4.5, a well-known
decision-tree induction algorithm (T. R. Martinez and D. R. Wilson [10]). In this
approach, the selected instances are first used to build a decision tree, and then the tree is
used to classify new examples. This approach will be denoted as instance selection–training
set selection (IS-TSS). The analysis of the behavior of EAs for data reduction in KDD is, in
fact, the most important and novel aspect of this paper.
As with any algorithm, the issue of scalability and the effect of increasing the size of
the data on algorithm behavior are always present. To address this, a number of
experiments on IS-PS and IS-TSS are carried out with increasing complexity and size of data.
2.8 EVOLUTIONARY TRAINING SET SELECTION
The data used in a classification task may not be perfect. Data can present different
types of imperfections, such as the presence of errors, missing values or an imbalanced
distribution of classes. In recent years, the class imbalance problem has become one of the
emergent challenges in data mining (DM). The problem appears when the data present a class
imbalance, i.e., when they contain many more examples of one class than of the other
and the less represented class represents the most interesting concept from the point of
view of learning. The imbalanced classification problem is closely related to the cost-
sensitive classification problem. Imbalance in class distribution is pervasive in a variety of
real-world applications, including but not limited to telecommunications, WWW, finance,
ecology, biology and medicine.
In the field of class-imbalanced classification, EAs have recently begun to be applied. In
some works, an EA is used to search for an optimal tree in a global manner for cost-sensitive
classification; in others, the authors propose new heuristics and metrics for improving the
performance of several genetic programming classifiers in imbalanced domains. EAs have also
been applied for under-sampling the data in imbalanced domains in instance-based learning
(N. García-Pedrajas, D. Ortiz-Boyer and J. A. Romero del Castillo [8]).
In this contribution, the use of EAs for TSS in imbalanced data sets is proposed. The
objective is to increase the effectiveness of a well-known decision tree classifier, C4.5, and a
rule induction algorithm, PART, by means of removing instances guided by an evolutionary
under-sampling algorithm. The approach is compared with other under-sampling and
over-sampling methods and with hybrid proposals combining over-sampling and under-sampling
studied in the literature. The empirical study is contrasted via non-parametric statistical testing
in a multiple-data-set environment.
2.9 NEAREST NEIGHBOR CLASSIFICATION
The Nearest Neighbor (NN) rule is one of the oldest and better-known algorithms for
performing supervised nonparametric classification. The entire training set (TS) is stored in
the computer memory. To classify a new pattern, its distance to each one of the stored
training patterns is computed. The new pattern is then assigned to the class represented by its
nearest training pattern. From this definition, it is obvious that this classifier suffers from a
main drawback: a large memory requirement to store the whole TS and a correspondingly large
response time. This disadvantage is more critical in contexts like data mining, where huge
databases are commonly dealt with.
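A minimal Java sketch of the plain NN rule just described follows; the whole training set is kept in memory and a new pattern receives the label of the closest stored pattern under (squared) Euclidean distance, which is exactly the source of the memory and response-time drawback mentioned above (names are illustrative):

// Minimal sketch of the nearest neighbour (1-NN) rule: store the whole training
// set and assign a new pattern the class of its closest training pattern.
public class NearestNeighbourRule {

    private final double[][] trainingPatterns;
    private final int[] trainingLabels;

    public NearestNeighbourRule(double[][] patterns, int[] labels) {
        this.trainingPatterns = patterns;
        this.trainingLabels = labels;
    }

    public int classify(double[] pattern) {
        int best = 0;
        double bestDistance = Double.POSITIVE_INFINITY;
        for (int i = 0; i < trainingPatterns.length; i++) {
            double d = squaredDistance(pattern, trainingPatterns[i]);
            if (d < bestDistance) {
                bestDistance = d;
                best = i;
            }
        }
        return trainingLabels[best];
    }

    private static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int j = 0; j < a.length; j++) {
            sum += (a[j] - b[j]) * (a[j] - b[j]);
        }
        return sum;
    }
}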
The above mentioned drawback has been considerably cut down by the development
of suitable data structures and associated search algorithms and also by proposals to reduce
the TS size. Hart’s idea of a consistent subset has become a milestone in the latter research
line, stimulating a sequel of new algorithms aimed at eliminating as many training patterns as
possible without seriously affecting the predictive accuracy of the classifier. Most of the
research done in this direction is reviewed from slightly different viewpoints.
The present work is concerned with reducing the TS size while trying to maintain (or
even improve) the accuracy rate of the NN rule (D. L. Wilson [12]). A novel reduction
technique based on some features of the Selective Subset, the Modified Selective Subset
(MSS), is presented. In comparison to the original Selective algorithm, the MSS yields a
better approximation of the decision boundaries as induced by the plain NN rule when using
the whole training sample. As a byproduct, an algorithm much simpler (in terms of storage
and computational time requirements) is obtained. This algorithm may be of crucial interest
in situations in which particularly good decision boundaries need to be accurately represented
by a reduced set of prototypes.

CHAPTER 3
SYSTEM ANALYSIS

3.1 EXISTING SYSTEM


The existing system is a framework called oligarchic instance selection, which is specifically
designed for class-imbalanced data sets. The method has two major objectives: 1) improving the
performance of previous approaches based on instance selection for class-imbalanced data
sets; and 2) developing a method that is able to scale up to very large, and even huge,
problems.
The class-imbalanced nature of the problem is dealt with by means of two
mechanisms. First, the selection of instances from the majority and minority classes is
performed separately. Second, selection is driven by a fitness function that takes accuracy in
both classes into account. Furthermore, at its inner level, all the selection process is always
performed in balanced sets. Its divide-and-conquer philosophy addresses the problem of
scalability without compromising its performance. The method is based on applying an
instance selection algorithm to balanced subsets of the whole training set and on combining
the results obtained from those subsets by means of a voting scheme. As an additional and
very useful feature, the method has linear time complexity and can be easily implemented on
a shared or distributed memory parallel machine.
The system is primarily based on the divide-and-conquer approach. Instead of
applying the instance selection method to the whole data set, it first performs a random partition
of the instances and applies the selection to each one of the subsets obtained. To account for
the class-imbalanced nature of the data sets, the subsets used always contain the same number
of instances from both classes. Because the method treats majority-class instances unfairly,
favoring minority-class instances, it is referred to as oligarchic instance selection (OLIGOIS).
On its own, each round would not be able to achieve good performance. However, the
combination of several rounds using a voting scheme is able to improve the performance of
an instance selection algorithm applied to the whole data set, with a large reduction in the
execution time of the algorithm.
3.2 DISADVANTAGES
 This technique is not suitable for multi-class problems.
 The feature selection algorithm is also not based on multi-class values.

3.3 PROPOSED SYSTEM
The aim of our study is to investigate how class imbalance affects multi-class
classification for high-dimensional class-imbalanced data, a problem that to our knowledge
has not been systematically addressed so far. The study focuses mainly on DLDA because of its
good behaviour in two-class problems with high-dimensional class-imbalanced data;
another reason for choosing DLDA is the straightforward generalization of the two-class
DLDA to the multi-class situation (multi-class DLDA, mDLDA). mDLDA is compared with
Friedman’s one-versus-one approach, which breaks down the multi-class problem into a
series of two-class classification problems and assigns new samples to the class having most
votes. Friedman’s approach was chosen because of its wide applicability and simplicity, and
because it was previously indicated as beneficial when the classes are imbalanced or when
the number of classes is large. This leads to choosing a one-versus-one rather than a one-
versus-all strategy, because the former is less affected by class imbalance.
3.4 ADVANTAGES
 It is recognized that multi-class classification tasks are generally significantly harder
than binary classification tasks.
 The main aim is improving accuracy; if a method achieves the same accuracy using fewer
instances, that method is preferable.
 Moreover, many of the most relevant class-imbalanced problems appear in very large
data sets, where data reduction is a must.

CHAPTER 4
SYSTEM SPECIFICATION

4.1 HARDWARE REQUIREMENTS


Processor : Pentium IV 2.4 GHz
Hard Disk : 80 GB
RAM : 2 GB
4.2 SOFTWARE REQUIREMENTS
OPERATING SYSTEM : WINDOWS XP
BACK END : JAVA (JDK 1.6/1.7)
4.3 SOFTWARE DESCRIPTION
Java is an object-oriented programming language with a built-in application
programming interface (API) that can handle graphics and user interfaces and that can be
used to create applications or applets. Because of its rich set of API's, similar to Macintosh
and Windows, and its platform independence, Java can also be thought of as a platform in
itself. Java also has standard libraries for doing mathematics.
Much of the syntax of Java is the same as that of C and C++. One major difference is that
Java does not have pointers. However, the biggest difference is that you must write object-
oriented code in Java; procedural pieces of code can only be embedded in objects. In the
following, it is assumed that the reader has some familiarity with a programming language. In
particular, some familiarity with the syntax of C/C++ is useful.
Java distinguishes between applications, which are programs that perform the same
functions as those written in other programming languages, and applets, which are programs
that can be embedded in a Web page and accessed over the Internet. Our initial focus will be
on writing applications. When a program is compiled, byte code is produced that can be
read and executed by any platform that can run Java.
Java has the following properties:
 Variable Declaration: The types of all variables must be declared. The primitive
types are byte, short, int, long (8, 16, 32, and 64 bit integer variables, respectively),
float and double (32 and 64-bit floating point variables), boolean (true or false), and
char. Boolean is a distinct type rather than just another way of using integers. Strings
are not a primitive type, but are instances of the String class. Because they are so
common, string literals may appear in quotes just as in other languages.
 Naming Conventions: Java distinguishes between upper and lower case variables.
The convention is to capitalize the first letter of a class name. If the class name
consists of several words, they are run together with successive words capitalized
within the name (instead of using underscores to separate the names). The name of the
constructor is the same as the name of the class. All keywords (words that are part of
the language and cannot be redefined) are written in lower case.
 Instance variables and methods can be accessed from any method within the class.
The x in the argument list of the Particle constructor refers to the local value of the
parameter, which is set when Particle is called. The this keyword is used to refer to the
variables defined for the entire class, in contrast to those defined locally within a
method and those that are arguments to a method; in this example, this.x refers
to the variable x defined just after the first line of the class definition. Classes are
effectively new programmer-defined types; each class defines data
(fields) and methods to manipulate the data. Fields in the class are the template for the
instance variables that are created when objects are instantiated (created) from that
class. A new set of instance variables is created each time an object is instantiated
from the class (see the sketch after this list).
 The members of a class (variables and methods) are accessed by referring to an
object created from the class using the dot operator. For example, suppose that a class
Particle contains an instance variable x and a method step. If an object of this class is
named p, then the instance variable in p is accessed as p.x and the method
as p.step.
 A semicolon is used to terminate individual statements.
 Comments. There are three comment styles in Java. A single line comment starts
with // and can be included anywhere in the program. Multiple line comments begin
with /* and end with */; these are also useful for commenting out a portion of the text
on a line. Finally, text enclosed within /** ... */ serves to generate documentation
using the javadoc command.
 Type casting changes the type of a value from its normal type to some other type.
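The Particle example referred to in the list above can be sketched as follows; only the members named in the text are shown, and the argument of step is an assumption made for illustration:

// Sketch of the Particle example: the constructor parameter x shadows the field x,
// so this.x selects the instance variable; members of an object p are then reached
// with the dot operator.
public class Particle {

    private double x;          // instance variable (field)

    public Particle(double x) {
        this.x = x;            // this.x is the field, x is the constructor argument
    }

    public void step(double dx) {
        x = x + dx;            // methods of the class access the fields directly
    }

    public double getX() {
        return x;
    }

    public static void main(String[] args) {
        Particle p = new Particle(1.0);   // a new set of instance variables is created
        p.step(0.5);                      // members are accessed through the dot operator
        System.out.println(p.getX());    // prints 1.5
    }
}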
Multiple Constructors
The arguments of a constructor specify the parameters for the initialization of an
object. Multiple constructors provide the flexibility of initializing objects of the same class
with different sets of arguments.

 The multiple constructors (all named Particle) are distinguished only by the number of
arguments, and they can be defined in any order. The call to this in the first constructor
invokes the next constructor in the sequence because that call passes two
arguments. The first constructor has no arguments and creates a particle of unit mass
at the origin; the next is defined with two arguments, the spatial coordinates of the
particle. The second constructor in turn references the third constructor, which uses the
spatial coordinates and the mass. The third and fourth constructors each refer to the
final constructor, which uses all five arguments. (The order of the constructors is
unimportant.) Once the Particle class with its multiple constructors is defined, any
class can call the constructor Particle using the number of arguments appropriate to
that application (see the sketch after this list). The advantage of having multiple
constructors is that applications that use a particular constructor are unaffected by later
additions made to the class Particle, whether variables or methods. For example, adding
acceleration as an argument does not affect applications that rely only on the definitions
given above.
 Using multiple constructors is called method overloading -- the method name is used
to specify more than one method. The rule for overloading is that the argument lists
for all of the different methods must be unique, including the number of arguments
and/or the types of the arguments.
 All classes have at least one implicit constructor method. If no constructor is defined
explicitly, the compiler creates one with no arguments.
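A sketch of the overloaded Particle constructors described above is given below, extending the earlier Particle sketch; the text names the coordinates and the mass, while the two velocity components used to reach five arguments are an assumption, and the chain is condensed to four constructors for brevity:

// Sketch of constructor overloading and chaining: each shorter constructor delegates
// to a longer one via this(...), supplying default values.
public class Particle {

    private double x, y;       // spatial coordinates
    private double mass;
    private double vx, vy;     // assumed extra state to reach a five-argument constructor

    public Particle() {
        this(0.0, 0.0);                      // unit mass at the origin
    }

    public Particle(double x, double y) {
        this(x, y, 1.0);                     // chains to the three-argument constructor
    }

    public Particle(double x, double y, double mass) {
        this(x, y, mass, 0.0, 0.0);          // chains to the full constructor
    }

    public Particle(double x, double y, double mass, double vx, double vy) {
        this.x = x;
        this.y = y;
        this.mass = mass;
        this.vx = vx;
        this.vy = vy;
    }
}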
History of Java
James Gosling initiated the Java language project in June 1991 for use in one of his
many set-top box projects. The language, initially called Oak after an oak tree that stood
outside Gosling's office, also went by the name Green and was later renamed Java,
from a list of random words.
There were five primary goals in the creation of the Java language:
1. It should use the object-oriented programming methodology.
2. It should allow the same program to be executed on multiple operating systems.
3. It should contain built-in support for using computer networks.
4. It should be designed to execute code from remote sources securely.
5. It should be easy to use by selecting what was considered the good parts of other object-
oriented languages.

Java Arrays
Arrays are objects that store multiple variables of the same type; however, an array
itself is an object on the heap. How to declare, construct and initialize arrays is covered in the
upcoming chapters.
Java Enums
Enums were introduced in Java 5.0. Enums restrict a variable to one of only a
few predefined values; the values in this enumerated list are called enums. With the use of
enums it is possible to reduce the number of bugs in your code. For example, in an
application for a fresh juice shop it would be possible to restrict the glass size to small,
medium and large, which would make sure that no one could order any size
other than small, medium or large (see the sketch below).
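The juice-shop example above can be sketched in Java as follows (names are illustrative):

// Sketch of an enum restricting a value to a fixed set of constants.
public class JuiceShop {

    public enum GlassSize { SMALL, MEDIUM, LARGE }

    public static void order(GlassSize size) {
        System.out.println("Ordered a " + size + " juice");
    }

    public static void main(String[] args) {
        order(GlassSize.MEDIUM);              // only SMALL, MEDIUM or LARGE compile
        // order("EXTRA_LARGE");              // would be rejected at compile time
    }
}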
Inheritance
In Java, classes can be derived from other classes. Basically, if you need to create a new class
and there is already a class that has some of the code you require, then it is possible to derive
your new class from the already existing code.
This concept allows you to reuse the fields and methods of the existing class without
having to rewrite the code in a new class. In this scenario the existing class is called the
superclass and the derived class is called the subclass.
Interfaces
In Java language an interface can be defined as a contract between objects on how to
communicate with each other. Interfaces play a vital role when it comes to the concept of
inheritance.
An interface defines the methods that a deriving class (subclass) should implement, but the
implementation of those methods is totally up to the subclass.
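A minimal sketch of an interface as such a contract follows; the Classifier interface and its trivial implementation are assumptions chosen only to match the domain of this work:

// Sketch of an interface as a contract: the interface declares the method and the
// implementing class supplies the behaviour.
public class InterfaceDemo {

    interface Classifier {
        int classify(double[] instance);      // the contract: what, not how
    }

    static class MajorityClassifier implements Classifier {
        public int classify(double[] instance) {
            return 0;                         // trivially predicts the majority class
        }
    }

    public static void main(String[] args) {
        Classifier c = new MajorityClassifier();
        System.out.println(c.classify(new double[] { 1.0, 2.0 }));
    }
}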
The Java programming language was originally developed by Sun Microsystems; it was
initiated by James Gosling and released in 1995 as a core component of Sun
Microsystems’s Java platform (Java 1.0 [J2SE]). Sun Microsystems later renamed the J2
versions as Java SE, Java EE and Java ME respectively. Java is guaranteed to be Write Once,
Run Anywhere.
Object Oriented
In java everything is an Object. Java can be easily extended since it is based on the
Object model.

Platform Independent
Unlike many other programming languages, including C and C++, when Java is
compiled it is not compiled into platform-specific machine code but rather into platform-
independent byte code. This byte code is distributed over the web and interpreted by the Java
Virtual Machine (JVM) on whichever platform it is being run.
Simple
Java is designed to be easy to learn. If you understand the basic concept of OOP java
would be easy to master.
Architectural- neutral
The Java compiler generates an architecture-neutral object file format, which makes the
compiled code executable on many processors in the presence of the Java runtime system.
Portable
Being architecture-neutral and having no implementation-dependent aspects of
the specification makes Java portable. The Java compiler is written in ANSI C with a clean
portability boundary, which is a POSIX subset.
Robust
Java makes an effort to eliminate error-prone situations by emphasizing compile-time
error checking and runtime checking.
Multi-threaded
With Java’s multi-threaded feature it is possible to write programs that can do many
tasks simultaneously. This design feature allows developers to construct smoothly running
interactive applications.
Interpreted
Java byte code is translated on the fly to native machine instructions and is not stored
anywhere. The development process is more rapid and analytical since the linking is an
incremental and light weight process.
High Performance
With the use of Just-In-Time compilers Java enables high performance.
Distributed
Java is designed for the distributed environment of the internet.
Dynamic
Java is considered to be more dynamic than C or C++ since it is designed to adapt to
an evolving environment. Java programs can carry extensive amount of run-time information
that can be used to verify and resolve accesses to objects on run-time.
CHAPTER 5
SYSTEM IMPLEMENTATION

5.1 RANDOM PARTITION


Class-imbalanced data sets are handled using instance selection algorithms, which can remove
instances from both the minority and majority classes. Scalability is achieved using a
divide-and-conquer approach. The ability to sample instances to deal with class-imbalanced
data sets is achieved by means of the combination of several rounds of instance
selection in balanced subsets of the whole data set. Consider a training set T, with n instances,
n+ from the minority or positive class, and n− from the majority or negative class. First, the
training data T is divided into t disjoint subsets Dj of approximately equal size s as follows:
$$T = \bigcup_{j=1}^{t} D_j$$
The partition is carried out using a random algorithm. The result of this first step is t
subsets in which the distribution of classes is roughly as imbalanced as the distribution of the
whole data set, due to the use of a random algorithm. To avoid effects derived from these
uneven distributions, all subsets are balanced by adding randomly selected instances of the
minority class. These instances are randomly sampled without replacement to avoid repeated
instances in any subset. In heavily imbalanced data sets it may happen that there are not
enough instances from the minority class to construct balanced data sets.
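A minimal Java sketch of this partition-and-balance step, under the definitions above, is given below; the Instance class and the way subsets are filled are illustrative assumptions rather than the exact implementation of the algorithm:

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Sketch of the random partition step: T is shuffled and split into t disjoint
// subsets, and each subset is then topped up with randomly chosen minority-class
// instances so that both classes are balanced inside it.
public class RandomPartition {

    public static class Instance {
        public final double[] features;
        public final boolean minority;
        public Instance(double[] features, boolean minority) {
            this.features = features;
            this.minority = minority;
        }
    }

    public static List<List<Instance>> partition(List<Instance> training, int t, long seed) {
        Random rnd = new Random(seed);
        List<Instance> shuffled = new ArrayList<Instance>(training);
        Collections.shuffle(shuffled, rnd);

        // Disjoint subsets D_1 .. D_t of approximately equal size.
        List<List<Instance>> subsets = new ArrayList<List<Instance>>();
        for (int j = 0; j < t; j++) {
            subsets.add(new ArrayList<Instance>());
        }
        for (int i = 0; i < shuffled.size(); i++) {
            subsets.get(i % t).add(shuffled.get(i));
        }

        // Pool of minority instances used to balance each subset.
        List<Instance> minorityPool = new ArrayList<Instance>();
        for (Instance inst : training) {
            if (inst.minority) {
                minorityPool.add(inst);
            }
        }

        for (List<Instance> subset : subsets) {
            int positives = 0;
            for (Instance inst : subset) {
                if (inst.minority) {
                    positives++;
                }
            }
            int negatives = subset.size() - positives;
            // Add missing minority instances, sampled without replacement within the subset.
            List<Instance> candidates = new ArrayList<Instance>(minorityPool);
            candidates.removeAll(subset);
            Collections.shuffle(candidates, rnd);
            for (int k = 0; k < negatives - positives && k < candidates.size(); k++) {
                subset.add(candidates.get(k));
            }
        }
        return subsets;
    }
}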
5.2 INSTANCE BASED SELECTION WITH OLIGOIS
Improvement is achieved using separate thresholds for the minority and majority
classes. Two thresholds are then used: t+ is used for selecting minority-class instances and t−
for majority-class instances. The evaluation of a pair of thresholds, i.e., t+ and t−, is made
using the subset S(t+, t−) selected with these two thresholds. The key difference is that
a larger set of values, [0, r] × [0, t · r], is evaluated. Thus, for each pair of thresholds, the
fitness f(S(t+, t−)) is evaluated, and the best pair of thresholds is selected. The evaluation of this
number of thresholds might preclude the scalability achieved by the divide-and-conquer
approach. To avoid this negative effect, the evaluation of a pair of thresholds is also approached
using a divide-and-conquer method. Instead of evaluating the accuracy of S(t+, t−) with the
whole data set, the same partition philosophy used in the previous step is applied: the training
set is divided into random disjoint subsets, accuracy is estimated separately in each subset, and
the average evaluation over all the subsets is used as the fitness of each pair of thresholds.

This procedure obtains a selected set of instances that may be imbalanced. To obtain a
balanced data set, a last step is performed. The class with more selected instances is
undersampled, removing first the instances with fewer votes. If this achieves a better evaluation, the
balanced selected data set is used as the final result of the algorithm; otherwise, the selection
obtained using the best thresholds is kept.
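A simplified sketch of the threshold-based selection is shown below, assuming the votes of every instance have already been accumulated over the instance-selection rounds; the names ThresholdSelection, tPlus and tMinus are illustrative, and the fitness evaluation of each threshold pair with the k-NN classifier is omitted:

import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class ThresholdSelection {

    // Keep an instance when its vote count exceeds the threshold of its class:
    // tPlus for minority-class instances, tMinus for majority-class instances.
    static List<String> select(Map<String, Integer> votes, Set<String> minorityIds,
                               int tPlus, int tMinus) {
        List<String> selected = new ArrayList<>();
        for (Map.Entry<String, Integer> e : votes.entrySet()) {
            int threshold = minorityIds.contains(e.getKey()) ? tPlus : tMinus;
            if (e.getValue() > threshold) {
                selected.add(e.getKey());
            }
        }
        return selected;
    }
}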
5.3 CLASS PREDICTION METHODS
Denote the number of samples by n, the number of variables by p, and the number of
variables selected and used in the classification rule by G; these G variables are the most
informative about class distinction. K is the number of classes, and the class membership of
the samples is indicated with integers from 1 to K; the classes are non-overlapping and each
sample belongs to exactly one class. The number of samples in Class k is denoted by nk. Let
xij be the expression of the jth variable (j = 1, ..., p) on the ith sample (i = 1, ..., n). For
sample i, denote the set of G selected variables by xi. Let $\bar{x}_g^{(k)}$ denote the mean
expression of the gth selected variable in Class k, defined as
$\bar{x}_g^{(k)} = \frac{1}{n_k} \sum_{i \in C_k} x_{ig}$

and let $x^*$ represent the set of selected variables for a new sample.
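For illustration, the per-class mean of a selected variable can be computed directly from the expression matrix, as in the following sketch (all names are hypothetical):

public class ClassMeans {

    // Mean expression of the g-th selected variable over the samples belonging to class k,
    // i.e. (1 / n_k) * sum over i in C_k of x_ig.
    static double classMean(double[][] x, int[] classLabel, int g, int k) {
        double sum = 0.0;
        int nk = 0;
        for (int i = 0; i < x.length; i++) {
            if (classLabel[i] == k) {
                sum += x[i][g];
                nk++;
            }
        }
        return sum / nk;
    }
}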
5.4 MULTI-CLASS DLDA AND FRIEDMAN’S APPROACH
Discriminant analysis methods are used to find linear combinations of variables that
maximize the between-class variance and at the same time minimize the within-class
variance. Diagonal linear discriminant analysis (DLDA) is a special case of discriminant
analysis that assumes that the variables are independent and have the same variance in all
classes. The multi-class DLDA (mDLDA) classification rule for a new sample x* is linear and
is defined as
$C(x^*) = \arg\min_{k} \sum_{g=1}^{G} \frac{\left(x_g^* - \bar{x}_g^{(k)}\right)^2}{s_g^2}$

where $s_g^2$ is the sample estimate of the pooled variance for variable g and $x_g^*$ is the gth selected
variable of the new sample. The two-class DLDA is a special case of mDLDA.
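A minimal sketch of this rule is shown below, assuming the class means and the pooled variances of the G selected variables have already been estimated on the training set (classMeans and pooledVar are illustrative names):

public class MultiClassDLDA {

    // mDLDA rule: assign the new sample x* to the class k that minimises
    // the sum over g of (x*_g - mean_g^(k))^2 / s_g^2.
    static int classify(double[] xStar, double[][] classMeans, double[] pooledVar) {
        int best = -1;
        double bestScore = Double.POSITIVE_INFINITY;
        for (int k = 0; k < classMeans.length; k++) {
            double score = 0.0;
            for (int g = 0; g < xStar.length; g++) {
                double d = xStar[g] - classMeans[k][g];
                score += (d * d) / pooledVar[g];
            }
            if (score < bestScore) {          // argmin over the K classes
                bestScore = score;
                best = k;
            }
        }
        return best;                          // predicted class index (0-based)
    }
}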
In Friedman’s approach, also known as the max-wins rule, the class-prediction
problem for K > 2 classes is divided into $\binom{K}{2}$ binary class-prediction problems, one for each
pair of classes. Within each binary class-prediction problem a rule for class prediction is built
(a classifier is trained) and a new sample is classified into one of the two classes. The final
class prediction into one of the K classes is made by majority voting, assigning the new
sample to the class with the most votes.
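The pairwise voting can be sketched as follows, assuming a binary classifier is available for every pair of classes; the PairwiseClassifier interface is purely illustrative:

public class PairwiseVoting {

    interface PairwiseClassifier {
        // Returns the winning class index (k or j) for the binary problem (k, j).
        int predict(int k, int j, double[] xStar);
    }

    // Friedman's approach: one vote from each of the K*(K-1)/2 pairwise classifiers;
    // the new sample is assigned to the class with the most votes.
    static int classify(double[] xStar, int numClasses, PairwiseClassifier c) {
        int[] votes = new int[numClasses];
        for (int k = 0; k < numClasses; k++) {
            for (int j = k + 1; j < numClasses; j++) {
                votes[c.predict(k, j, xStar)]++;
            }
        }
        int best = 0;
        for (int k = 1; k < numClasses; k++) {
            if (votes[k] > votes[best]) best = k;
        }
        return best;
    }
}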
5.5 SIMPLE UNDER SAMPLING AND VARIABLE SELECTION
Simple undersampling (down-sizing) consists of obtaining a class-balanced training
set by removing a subset of randomly selected samples from the larger class. For mDLDA,
undersampling consisted of using min(n1, n2, n3) samples from each class, randomly
selecting which samples from the majority class(es) should be removed. With Friedman’s
approach each pairwise comparison was undersampled if the size of the classes was not equal
(nk ≠ nj). The classification rule was derived on the balanced training set as described for the
original data, and evaluated on the test set.
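A minimal sketch of this down-sizing step, assuming the samples are already grouped by class (all names are illustrative):

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class SimpleUndersampling {

    // Down-size every class to the size of the smallest class by random removal.
    static List<List<double[]>> undersample(List<List<double[]>> samplesPerClass) {
        int minSize = Integer.MAX_VALUE;
        for (List<double[]> cls : samplesPerClass) {
            minSize = Math.min(minSize, cls.size());      // min(n1, n2, n3, ...)
        }
        List<List<double[]>> balanced = new ArrayList<>();
        for (List<double[]> cls : samplesPerClass) {
            List<double[]> copy = new ArrayList<>(cls);
            Collections.shuffle(copy);                    // randomly choose which samples to keep
            balanced.add(new ArrayList<>(copy.subList(0, minSize)));
        }
        return balanced;
    }
}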
The G < p variables that were most informative about class distinction were selected
on the training set and used to define the classification rules (Eq. 2). Variable selection was
based on the two-sample t-test with assumed equal variances for Friedman’s approach, or on
the F-test for the equality of more than two means for mDLDA.
5.6 PERFORMANCE EVALUATION
The performance measures must take the imbalanced nature of the problems into account.
Given the number of true positives (TPs), false positives (FPs), true negatives (TNs), and
false negatives (FNs), several measures can be defined. Perhaps the most common are the TP
rate TPrate, recall R, or sensitivity Sn, i.e.,
TPrate = R = Sn = TP / (TP + FN)
which is relevant when only the performance on the positive class is of interest, and the TN
rate TNrate or specificity Sp, as follows: TNrate = Sp = TN / (TN + FP)
When the performance of both the negative and positive classes matters, the G-mean measure is used:
G-mean = √(Sp · Sn)
For this reason, four different measures of performance were considered: (i) overall
predictive accuracy (PA, the number of correctly classified subjects from the test set divided
by the total number of subjects in the test set), (ii) predictive accuracy of Class 1 (PA1, i.e.,
PA evaluated using only samples from Class 1), (iii) predictive accuracy of Class 2 (PA2, i.e.,
PA evaluated using only samples from Class 2) and (iv) predictive accuracy of Class 3 (PA3).
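These measures translate directly into code; the following sketch (class and method names are illustrative) computes them from the confusion-matrix counts:

public class PerformanceMeasures {

    // TP rate / recall / sensitivity: TP / (TP + FN).
    static double sensitivity(int tp, int fn) {
        return (double) tp / (tp + fn);
    }

    // TN rate / specificity: TN / (TN + FP).
    static double specificity(int tn, int fp) {
        return (double) tn / (tn + fp);
    }

    // G-mean: geometric mean of sensitivity and specificity.
    static double gMean(int tp, int fn, int tn, int fp) {
        return Math.sqrt(sensitivity(tp, fn) * specificity(tn, fp));
    }

    // Overall predictive accuracy: correctly classified test samples over all test samples.
    static double predictiveAccuracy(int correct, int total) {
        return (double) correct / total;
    }
}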

Overall Architecture Diagram

[Figure 5.1 components: data samples, class-imbalanced input data set, majority and minority class identification, divide-and-conquer strategy, instance-based selection (OligoIS), feature selection algorithm, multiclass classification, output classification]

Figure 5.1 Overall Architecture


Description
Figure 5.1 represents the overall process involved in the project. The data set is given as
input, and the majority and minority classes are identified. After that, the divide-and-conquer
strategy is applied and the instances are selected.

CHAPTER 6

CONCLUSION

A new method of instance selection for class-imbalanced data sets that is applicable to
very large data sets has been presented. The method partitions the data set according to its
size and applies instance selection several times on small, class-balanced subsets of the
original data, producing sets of frequently selected instances. The rounds are combined by a
voting method, calculating a fitness value for each candidate selection and setting different
thresholds for minority and majority class samples. The amount of bias also depends jointly
on the magnitude of the differences between classes and on the sample size, i.e., the bias
diminishes when the differences between the classes are larger or when the sample size is
increased. Variable selection also plays an important role in the class-imbalance problem,
and the most effective strategy depends on the type of differences that exist between the
classes.

APPENDIX 1
SOURCE CODE
Oligois.java
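// Main loop of the OligoIS driver (excerpt): in each round the data set is re-partitioned,
// instance selection runs on every partition, and votes for the selected instances are
// accumulated by votecalc().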
for(int i=1;i<=round;i++)
{
Main.ta.append("\n\nRound "+i+"\n");
System.out.println("\n\nRound "+i+"\n");
Partition.main(args);
for(int u=1;u<=partsize;u++)
{
inputFile="Partition"+u;
preprocess pc=new preprocess(inputFile);
String fname[]=new String[2];
fname[0]=inputFile;
fname[1]=""+i;
InstanceSelect.main(fname);
System.out.println("\nPartition "+u+" instance selection finished");
Main.ta.append("\nPartition "+u+" instance selection finished");
votecalc(inputFile+""+i);
}
rnd++;
}
System.out.println("\nTotal number of Selected Instance over "+round+"
rounds"+map.size());
int mm=0;
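// Grid search over candidate threshold pairs: tmin is applied to class-4.0 instances and
// tmax to class-2.0 instances; each candidate selection is written out and evaluated with
// the k-NN classifier to obtain its fitness.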
for(tmax=1;tmax<thma;tmax++)
{
for(tmin=1;tmin<thmi;tmin++)
{
int recnt1=0,redcnt2=0;
Iterator<Map.Entry<Integer, Integer>> entries = map.entrySet().iterator();
while (entries.hasNext())
{
Map.Entry<Integer, Integer> entry = entries.next();
String tmp=""+entry.getKey();
String[] tmpar=tmp.split(",");
if((tmpar[tmpar.length-1].equals("4.0"))&&(entry.getValue()>tmin))
{
recnt1++;
if(redcnt2<map.size()/3)
{
writer1.append(""+tmpar[tmpar.length-1]);
for(int t=1;t<tmpar.length;t++)
writer1.append(" "+t+":"+tmpar[t-1]);
writer1.newLine();
}
else
{
writer2.append(""+tmpar[tmpar.length-1]);
for(int t=1;t<tmpar.length;t++)
writer2.append(" "+t+":"+tmpar[t-1]);
writer2.newLine();
}
}
else if((tmpar[tmpar.length-1].equals("2.0"))&&(entry.getValue()>tmax))
{
recnt1++;
if(redcnt2<map.size()/3)
{
writer1.append(""+tmpar[tmpar.length-1]);
for(int t=1;t<tmpar.length;t++)
writer1.append(" "+t+":"+tmpar[t-1]);
writer1.newLine();
}
else
{
writer2.append(""+tmpar[tmpar.length-1]);
for(int t=1;t<tmpar.length;t++)
writer2.append(" "+t+":"+tmpar[t-1]);
writer2.newLine();
}
}
redcnt2++;
}
String arg[]={"red.train.txt","red.test.txt","1","0"};
knn.main(arg);
double r=0;
if(accur1!=1.0)
{
fli.add(r=(accur1+(double)(recnt1)/(double)redcnt2));
}
tp[mm]=tmin;
tm[mm]=tmax;
mm++;
}
}
double t1=0,t2=0;
int s1=0,s2=0;
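// Find the threshold pair (t1 for class 4.0, t2 for class 2.0) whose selection achieved the
// highest fitness value.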
double mx=Collections.max(fli);
for(int g=0;g<tmp.size();g++)
{
if(tmp.get(g)==mx)
{
t1=tp[g];
t2=tm[g];
}
}
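// Final selection: keep class-4.0 instances with more than t1 votes and class-2.0 instances
// with more than t2 votes.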
Iterator<Map.Entry<Integer, Integer>> entries = map.entrySet().iterator();
while (entries.hasNext())
{
Map.Entry<Integer, Integer> entry = entries.next();
String tmp1=""+entry.getKey();
String[] tmpar1=tmp1.split(",");
if((tmpar1[tmpar1.length-1].equals("4.0"))&&(entry.getValue()>t1))
{
s1++;
tmpl1.add(""+tmp1);
}
if((tmpar1[tmpar1.length-1].equals("2.0"))&&(entry.getValue()>t2))
{
s2++;
map1.put(tmp1,entry.getValue());
tmpl2.add(""+tmp1);
tmpl3.add(entry.getValue());
}
}
System.out.println("\n\nTotal dataset size Before Undersampling Majority class : "+(s1+s2));
System.out.println("\nMinority Class Instance :"+s1+"\n Majority Class Instance : "+s2);
Main.ta.append("\n\nTotal dataset size Before Undersampling Majority class : "+(s1+s2));
Main.ta.append("\nMinority Class Instance :"+s1+"\n Majority Class Instance : "+s2);
Iterator<Map.Entry<Integer, Integer>> entries1 = map1.entrySet().iterator();
int mk=0;
while (entries1.hasNext())
{
Map.Entry<Integer, Integer> entry1 = entries1.next();
if(mk<s1)
{
tmpl1.add(""+entry1.getKey());
}
mk++;
}
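// Shuffle the balanced selection and write the first third to the training file and the rest
// to the test file, one instance per line in label index:value format.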
Collections.shuffle(tmpl1);
for(int k=0;k<tmpl1.size();k++)
{
if(k<tmpl1.size()/3)
{
String[] kkk=tmpl1.get(k).split(",");
writer1.append(""+kkk[kkk.length-1]);
for(int t=1;t<kkk.length;t++)
writer1.append(" "+t+":"+kkk[t-1]);
writer1.newLine();
}
else
{
String[] kkk=tmpl1.get(k).split(",");
writer2.append(kkk[kkk.length-1]+" ");
for(int t=1;t<kkk.length;t++)
writer2.append(" "+t+":"+kkk[t-1]);
writer2.newLine();
}
}
writer1.close();
writer2.close();
System.out.println("\n\n Total dataset size After Undersampling Majority class:
"+tmpl1.size());
System.out.println("\nMinority Class Instance :"+tmpl1.size()/2+"\n Majority Class Instance :
"+tmpl1.size()/2);
Main.ta.append("\n\n Total dataset size After Undersampling Majority class: "+tmpl1.size());
Main.ta.append("\nMinority Class Instance :"+tmpl1.size()/2+"\n Majority Class Instance :
"+tmpl1.size()/2);
Thread.sleep(500);
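// Train and test the k-NN classifier on the final balanced selection and report its accuracy
// alongside the SSO result.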
String arg[]={"Final.train.txt","Final.test.txt","1","0"};
knn.main(arg);
Main.ac2=accur1;
Main.ta.append("\n\nAccuracy\n\n SSO: "+Main.ac1);
Main.ta.append("\n Oligols: "+Main.ac2);
}

APPENDIX 2
SNAPSHOTS

Figure A2.1 Sample Subspace Optimization


The above figure shows the accuracy obtained using the sample subspace optimization algorithm.

Figure A2.2 Specifying Rounds


The above figure shows the number of rounds to be performed during the instance selection process.

Figure A2.3 Specifying Partition
The above figure shows the number of partitions to be created during the instance selection process.

Figure A2.4 Partitioned Data Set


The above figure shows the partitioned data set and its size, with the number of majority and
minority class instances.

Figure A2.5 Balanced Data Set
The above figure shows the balanced data set after sampling.

Figure A2.6 Accuracy of Algorithm


The above figure shows the accuracy of the OligoIS algorithm.

