INTRODUCTION
with a meaningful problem statement. Unfortunately, many application studies tend to focus on the data mining technique at the expense of a clear problem statement. In this step, a model usually specifies a set of variables for the unknown dependency and, if possible, a general form of this dependency as an initial hypothesis. The first step therefore requires the combined expertise of the application domain and of data mining. In successful data mining applications this cooperation does not stop at the initial phase; it continues throughout the entire data mining process. A prerequisite for knowledge discovery is an understanding of the data and of the business. Without this understanding, no algorithm, regardless of its sophistication, can provide results that can be trusted.
1.1.3.2 COLLECTING THE DATA MINING DATA
This process is concerned with collecting data from different sources and locations. The current methods used to collect data are:
• Internal data: usually collected from existing databases, data warehouses, and OLAP systems. Actual transactions recorded by individuals are the richest source of information and, at the same time, the most challenging to make useful.
• External data: items collected from demographics, psychographics, and web graphics, in addition to data shared within a company.
1.1.3.3 DETECTING AND CORRECTING THE DATA
All raw data sets initially prepared for data mining are often large; many are related to humans and have the potential for being messy. Real-world databases are subject to noise, missing values, and inconsistent data because of their typically huge size, often several gigabytes or more. Data preprocessing is commonly used as a preliminary data mining practice. It transforms the data into a format that can be processed easily and effectively by the users. Data preprocessing techniques include: data cleaning, which can be applied to remove noise and correct inconsistencies, outliers, and missing values; data integration, which merges data from multiple sources into a coherent data store, such as a data warehouse or a data cube; data transformation, such as normalization, which improves the accuracy and efficiency of mining algorithms involving distance measurements; and data reduction, which can reduce the data size by aggregating and eliminating redundant features. These preprocessing techniques, when applied prior to mining, can significantly improve the overall data mining results. Since multiple data sets may be used in various transactional formats, extensive data preparation may be required. Various commercial software products are specifically designed for data preparation and can facilitate the task of organizing the data prior to importing it into a data mining tool.
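One of the transformation steps mentioned above, min-max normalization, can be sketched as follows. This is a minimal illustration, not the routine of any specific tool; the class and method names are ours:

```java
// Min-max normalization: rescale each value of a variable into [0, 1].
// Distance-based mining algorithms benefit because no single variable
// dominates the distance measure.
public class Normalize {
    static double[] minMax(double[] v) {
        double min = v[0], max = v[0];
        for (double x : v) {
            min = Math.min(min, x);
            max = Math.max(max, x);
        }
        double range = max - min;
        double[] out = new double[v.length];
        for (int i = 0; i < v.length; i++) {
            // Guard against a constant variable (zero range).
            out[i] = range == 0 ? 0.0 : (v[i] - min) / range;
        }
        return out;
    }
}
```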
1.1.3.4 ESTIMATION AND BUILDING THE MODEL
Figure 1.2 represents the process involved in estimation and building the model.
This process includes four parts:
1. Select the data mining task,
2. Select the data mining method,
3. Select a suitable algorithm,
4. Extract knowledge.
• Combined approaches based on boosting that account for the imbalance in the training set. These methods modify the basic boosting method to account for minority-class underrepresentation in the data set.
There are two principal advantages of choosing sampling over cost-sensitive methods. First, sampling is more general, as it does not depend on the possibility of adapting a given algorithm to work with classification costs. Second, the learning algorithm itself is not modified; modifying it can cause difficulties and add parameters that must be tuned.
As stated earlier, our goal is to obtain a method that is both scalable and able to
sample the most relevant instances to deal with class-imbalanced data sets. Scalability will be
achieved using a divide-and-conquer approach. The ability to sample instances to deal with
class-imbalanced data sets will be achieved by means of the combination of several rounds of
instance selection in balanced subsets of the whole data set.
1.3 PROBLEM STATEMENT
Most learning algorithms expect an approximately even distribution of instances among the different classes and suffer, to different degrees, when that is not the case. Dealing with the class-imbalance problem is a difficult but relevant task, as many of the most interesting and challenging real-world problems have a very uneven class distribution. Existing systems, however, do not consider the multi-class problem. In particular, many ensemble methods have been proposed to deal with such imbalance, but most efforts so far focus only on two-class imbalance problems. Unsolved issues remain in multi-class imbalance problems, which exist in real-world applications, and no existing method can deal with them both efficiently and effectively.
1.4 OBJECTIVE
It is desirable to develop a more effective and efficient method to handle multi-class imbalance problems. The objectives are: to study the impact of multiple classes on the performance of random oversampling and undersampling techniques by discussing the “multi-minority” and “multi-majority” cases in depth, since both negatively affect overall and minority-class performance; and to build a set of benchmark data sets with multiple minority and/or majority classes, with the aim of tackling multi-class imbalance problems.
CHAPTER 2
LITERATURE REVIEW
CHAPTER 3
SYSTEM ANALYSIS
3.3 PROPOSED SYSTEM
The aim of our study is to investigate how class imbalance affects multi-class classification for high-dimensional class-imbalanced data, a problem that, to our knowledge, has not been systematically addressed so far. The study focuses mainly on DLDA because of its good behaviour in two-class problems with high-dimensional class-imbalanced data; another reason for choosing DLDA is the straightforward generalization of the two-class DLDA to the multi-class situation (multi-class DLDA, mDLDA). We compare mDLDA with Friedman’s one-versus-one approach, which breaks the multi-class problem down into a series of two-class classification problems and assigns new samples to the class with most votes. Friedman’s approach was chosen because of its wide applicability and simplicity, and because it was previously reported to be beneficial when the classes are imbalanced or when the number of classes is large. This also motivates choosing a one-versus-one rather than a one-versus-all strategy, as the former is less affected by class imbalance.
3.4 ADVANTAGES
• It is recognized that multi-class classification tasks are generally significantly harder than binary classification tasks.
• The main aim is to improve accuracy; if a method achieves the same accuracy using fewer instances, that method is preferable.
• Moreover, many of the most relevant class-imbalanced problems appear in very large data sets, where data reduction is a must.
CHAPTER 4
SYSTEM SPECIFICATION
The multiple constructors (all named Particle) are distinguished only by the number of their arguments, and they can be defined in any order. The keyword this in the first constructor refers to the next constructor in the sequence because the latter has two arguments. The first constructor has no arguments and creates a particle of unit mass at the origin; the next is defined with two arguments, the spatial coordinates of the particle. The second constructor in turn references the third constructor, which uses the spatial coordinates and the mass. The third and fourth constructors each refer to the final constructor, which uses all five arguments. (The order of the constructors is unimportant.) Once the Particle class with its multiple constructors is defined, any class can call the constructor Particle using the number of arguments appropriate to that application. The advantage of having multiple constructors is that applications using a particular constructor are unaffected by later additions made to the class Particle, whether variables or methods. For example, adding acceleration as an argument does not affect applications that rely only on the definitions given above. Using multiple constructors is called method overloading: the method name is used to specify more than one method. The rule for overloading is that the argument lists of the different methods must be unique, in the number of arguments and/or the types of the arguments.
All classes have at least one implicit constructor method. If no constructor is defined
explicitly, the compiler creates one with no arguments.
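The Particle class discussed above might be sketched as follows. The text does not name the five arguments of the final constructor, so the choice of position, mass, and velocity components here is an assumption:

```java
// Sketch of a Particle class with multiple (overloaded) constructors.
// Each shorter constructor delegates forward with this(...) until the
// full five-argument constructor is reached.
public class Particle {
    private double x, y;     // position (assumed two spatial coordinates)
    private double mass;
    private double vx, vy;   // velocity components (assumed)

    // No arguments: unit mass at the origin.
    public Particle() {
        this(0.0, 0.0);
    }

    // Spatial coordinates only.
    public Particle(double x, double y) {
        this(x, y, 1.0);
    }

    // Coordinates and mass; delegates to the full constructor.
    public Particle(double x, double y, double mass) {
        this(x, y, mass, 0.0, 0.0);
    }

    // Coordinates, mass, and one velocity component.
    public Particle(double x, double y, double mass, double vx) {
        this(x, y, mass, vx, 0.0);
    }

    // The final constructor: all five arguments.
    public Particle(double x, double y, double mass, double vx, double vy) {
        this.x = x; this.y = y; this.mass = mass;
        this.vx = vx; this.vy = vy;
    }

    public double getMass() { return mass; }
}
```

An application that calls `new Particle()` continues to work unchanged if further constructors are added later.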
History of Java
James Gosling initiated the Java language project in June 1991 for use in one of his many set-top box projects. The language, initially called Oak after an oak tree that stood outside Gosling's office, also went by the name Green and was later renamed Java, from a list of random words.
There were five primary goals in the creation of the Java language:
1. It should use the object-oriented programming methodology.
2. It should allow the same program to be executed on multiple operating systems.
3. It should contain built-in support for using computer networks.
4. It should be designed to execute code from remote sources securely.
5. It should be easy to use by selecting what was considered the good parts of other object-
oriented languages.
Java Arrays
Arrays are objects that store multiple variables of the same type; the array itself is an object on the heap. How to declare, construct, and initialize arrays is covered in the upcoming chapters.
Java Enums
Enums were introduced in Java 5.0. An enum restricts a variable to one of only a few predefined values; the values in this enumerated list are called enums. Using enums makes it possible to reduce the number of bugs in your code. For example, in an application for a fresh juice shop it would be possible to restrict the glass size to small, medium, and large, ensuring that no one can order any size other than these three.
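The juice-shop example above can be sketched as follows (the class and method names are illustrative):

```java
// An enum restricts the glass size to three predefined values;
// passing anything else is a compile-time error.
public class JuiceShop {
    enum GlassSize { SMALL, MEDIUM, LARGE }

    static String order(GlassSize size) {
        return "One " + size + " juice";
    }

    public static void main(String[] args) {
        System.out.println(order(GlassSize.MEDIUM));
    }
}
```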
Inheritance
In Java, classes can be derived from other classes. Basically, if you need to create a new class and there is already a class that has some of the code you require, you can derive your new class from the existing one.
This concept allows you to reuse the fields and methods of the existing class without having to rewrite the code in the new class. In this scenario the existing class is called the superclass and the derived class is called the subclass.
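A minimal sketch of this superclass/subclass relationship (the class names are illustrative):

```java
// Shape is the superclass; Circle is the subclass. Circle reuses the
// inherited field and method without rewriting them.
class Shape {
    protected String name = "shape";

    String describe() { return "A " + name; }
}

class Circle extends Shape {
    Circle() { name = "circle"; }            // reuses the inherited field

    double area(double r) { return Math.PI * r * r; }  // subclass addition
}
```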
Interfaces
In the Java language, an interface can be defined as a contract between objects on how to communicate with each other. Interfaces play a vital role in the concept of inheritance.
An interface defines the methods that a deriving class (subclass) should implement, but the implementation of those methods is entirely up to the subclass.
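This contract can be sketched as follows (names are illustrative):

```java
// The interface declares the method; the implementing class decides
// how to fulfil the contract.
interface Messenger {
    String message();          // declared, not implemented
}

class Greeter implements Messenger {
    public String message() {  // implementation is up to this class
        return "hello";
    }
}
```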
The Java programming language was originally developed by Sun Microsystems, initiated by James Gosling, and released in 1995 as a core component of Sun Microsystems’s Java platform (Java 1.0 [J2SE]). Sun Microsystems later renamed the J2 versions as Java SE, Java EE and Java ME respectively. Java is guaranteed to be Write Once, Run Anywhere.
Object Oriented
In Java, everything is an object. Java can be easily extended since it is based on the object model.
Platform Independent
Unlike many other programming languages, including C and C++, Java is not compiled into platform-specific machine code but into platform-independent byte code. This byte code is distributed over the web and interpreted by the Java Virtual Machine (JVM) on whichever platform it is run.
Simple
Java is designed to be easy to learn. If you understand the basic concepts of OOP, Java is easy to master.
Architecture-neutral
The Java compiler generates an architecture-neutral object file format, which makes the compiled code executable on many processors in the presence of the Java runtime system.
Portable
Being architecture-neutral and having no implementation-dependent aspects in its specification makes Java portable. The Java compiler is written in ANSI C with a clean portability boundary, which is a POSIX subset.
Robust
Java makes an effort to eliminate error-prone situations by emphasizing compile-time error checking and runtime checking.
Multi-threaded
With Java’s multi-threaded feature it is possible to write programs that can do many
tasks simultaneously. This design feature allows developers to construct smoothly running
interactive applications.
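A minimal multi-threading sketch (task contents are illustrative): two tasks run in separate threads at the same time and the main thread waits for both to finish.

```java
// Two independent computations run concurrently; join() blocks until
// each background thread completes.
public class TwoTasks {
    static long countTo(long n) {
        long s = 0;
        for (long i = 1; i <= n; i++) s += i;
        return s;
    }

    public static void main(String[] args) throws InterruptedException {
        long[] results = new long[2];
        Thread a = new Thread(() -> results[0] = countTo(1000));
        Thread b = new Thread(() -> results[1] = countTo(2000));
        a.start(); b.start();   // both tasks run concurrently
        a.join();  b.join();    // wait for both to finish
        System.out.println(results[0] + " " + results[1]);
    }
}
```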
Interpreted
Java byte code is translated on the fly to native machine instructions and is not stored anywhere. The development process is more rapid and analytical since linking is an incremental and lightweight process.
High Performance
With the use of Just-In-Time compilers Java enables high performance.
Distributed
Java is designed for the distributed environment of the internet.
Dynamic
Java is considered to be more dynamic than C or C++ since it is designed to adapt to
an evolving environment. Java programs can carry extensive amount of run-time information
that can be used to verify and resolve accesses to objects on run-time.
CHAPTER 5
SYSTEM IMPLEMENTATION
This procedure obtains a selected set of instances that may still be imbalanced. To obtain a balanced data set, a last step is performed: the class with more selected instances is undersampled, removing first the instances with fewer votes. If this achieves a better evaluation, the balanced selected data set is used as the final result of the algorithm; otherwise, the selection obtained using the best thresholds is kept.
5.3 CLASS PREDICTION METHODS
Denote the number of samples by n, the number of variables by p, and the number of variables selected and used in the classification rule by G; these G variables are the most informative about class distinction. K is the number of classes, and the class membership of the samples is indicated with integers from 1 to K; the classes are non-overlapping and each sample belongs to exactly one class. The number of samples in Class k is denoted by n_k. Let x_{ij} be the expression of the jth variable (j = 1, ..., p) on the ith sample (i = 1, ..., n). For sample i, denote the set of G selected variables by x_i. The mean expression of the gth selected variable in Class k is defined as

\bar{x}_g^{(k)} = \frac{1}{n_k} \sum_{i \in C_k} x_{ig}

and let x^* represent the set of selected variables for a new sample.
5.4 MULTI-CLASS DLDA AND FRIEDMAN’S APPROACH
Discriminant analysis methods find linear combinations of variables that maximize the between-class variance and at the same time minimize the within-class variance. Diagonal linear discriminant analysis (DLDA) is a special case of discriminant analysis that assumes the variables are independent and have the same variance in all classes. The multi-class DLDA (mDLDA) classification rule for a new sample x^* is linear and is defined as

C(x^*) = \arg\min_k \sum_{g=1}^{G} \frac{(x_g^* - \bar{x}_g^{(k)})^2}{s_g^2}

where s_g^2 is the sample estimate of the pooled variance for variable g and x_g^* is the gth selected variable of the new sample. The two-class DLDA is a special case of mDLDA.
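The mDLDA rule can be sketched in Java as follows. Class means and pooled variances are assumed to be precomputed, and classes are indexed 0..K-1 here instead of 1..K:

```java
// Assign x* to the class k that minimizes the sum over the G selected
// variables of (x*_g - mean_{g,k})^2 / s2_g  (the mDLDA rule).
public class MDlda {
    static int classify(double[] x, double[][] classMeans, double[] s2) {
        int best = 0;
        double bestScore = Double.POSITIVE_INFINITY;
        for (int k = 0; k < classMeans.length; k++) {
            double score = 0.0;
            for (int g = 0; g < x.length; g++) {
                double d = x[g] - classMeans[k][g];
                score += d * d / s2[g];   // variance-scaled squared distance
            }
            if (score < bestScore) { bestScore = score; best = k; }
        }
        return best;
    }
}
```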
In Friedman’s approach, also known as the win-max rule, the class-prediction problem for K > 2 classes is divided into K(K-1)/2 binary class-prediction problems, one for each pair of classes. Within each binary problem a rule for class prediction is built (a classifier is trained) and a new sample is classified into one of the two classes. The final class prediction into one of the K classes is defined with majority voting, assigning the new sample to the class with most votes.
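The voting step above can be sketched as follows, with each of the K(K-1)/2 pairwise classifiers contributing one vote (classes indexed 0..K-1 here; ties resolved toward the lower index, an assumption the text does not specify):

```java
// Tally one vote per pairwise classifier and return the class with
// the most votes (win-max / majority voting).
public class WinMax {
    static int majorityVote(int[] pairwiseVotes, int K) {
        int[] votes = new int[K];
        for (int v : pairwiseVotes) votes[v]++;   // tally the votes
        int best = 0;
        for (int k = 1; k < K; k++)
            if (votes[k] > votes[best]) best = k; // class with most votes
        return best;
    }
}
```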
5.5 SIMPLE UNDER SAMPLING AND VARIABLE SELECTION
Simple undersampling (down-sizing) consists of obtaining a class-balanced training set by removing a subset of randomly selected samples from the larger class(es). For mDLDA, undersampling consisted of using min(n1, n2, n3) samples from each class, randomly selecting which samples from the majority class(es) should be removed. With Friedman’s approach, each pairwise comparison was undersampled if the sizes of the two classes were not equal (n_k ≠ n_j). The classification rule was derived on the balanced training set as described for the original data, and evaluated on the test set.
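Simple random undersampling as described above might be sketched as follows; the data layout (a map from class label to that class's samples) is illustrative:

```java
import java.util.*;

// Keep min(class sizes) samples from each class, randomly choosing
// which majority-class samples to drop.
public class Undersample {
    static Map<Integer, List<double[]>> downSize(
            Map<Integer, List<double[]>> byClass, Random rnd) {
        int minSize = Integer.MAX_VALUE;
        for (List<double[]> c : byClass.values())
            minSize = Math.min(minSize, c.size());
        Map<Integer, List<double[]>> out = new HashMap<>();
        for (Map.Entry<Integer, List<double[]>> e : byClass.entrySet()) {
            List<double[]> copy = new ArrayList<>(e.getValue());
            Collections.shuffle(copy, rnd);  // random selection of survivors
            out.put(e.getKey(), new ArrayList<>(copy.subList(0, minSize)));
        }
        return out;
    }
}
```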
The G < p variables that were most informative about class distinction were selected on the training set and used to define the classification rules (Eq. 2). Variable selection was based on the two-sample t-test with assumed equal variances for Friedman’s approach, or on the F-test for the equality of more than two means for mDLDA.
5.6 PERFORMANCE EVALUATION
Performance evaluation must take the imbalanced nature of the problems into account. Given the number of true positives (TPs), false positives (FPs), true negatives (TNs), and false negatives (FNs), several measures can be defined. Perhaps the most common is the TP rate TPrate, also called recall R or sensitivity Sn, i.e.,

TPrate = R = Sn = TP / (TP + FN),

which is relevant if one is only interested in the performance on the positive class, and the TN rate TNrate or specificity Sp, as follows:

TNrate = Sp = TN / (TN + FP).

When the performance on both the negative and the positive class matters, the G-mean measure is used:

G-mean = sqrt(Sp · Sn).
For this reason, four different measures of performance were considered: (i) overall predictive accuracy (PA, the number of correctly classified subjects from the test set divided by the total number of subjects in the test set), (ii) predictive accuracy of Class 1 (PA1, i.e., PA evaluated using only samples from Class 1), (iii) predictive accuracy of Class 2 (PA2, i.e., PA evaluated using only samples from Class 2), and (iv) predictive accuracy of Class 3 (PA3).
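The count-based measures above can be computed directly; a minimal sketch (method names are ours):

```java
// Evaluation measures from counts of true/false positives and negatives.
public class Metrics {
    static double tpRate(int tp, int fn) { return (double) tp / (tp + fn); }  // recall / Sn
    static double tnRate(int tn, int fp) { return (double) tn / (tn + fp); }  // specificity / Sp
    static double gMean(double sn, double sp) { return Math.sqrt(sn * sp); }  // G-mean
}
```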
[Figure: overall system flow — data samples are fed to OligoIS, which produces the output classification.]
CHAPTER 6
CONCLUSION
APPENDIX 1
SOURCE CODE
Oligois.java
for(int i=1;i<=round;i++)
{
Main.ta.append("\n\nRound "+i+"\n");
System.out.println("\n\nRound "+i+"\n");
Partition.main(args);
for(int u=1;u<=partsize;u++)
{
inputFile="Partition"+u;
preprocess pc=new preprocess(inputFile);
String fname[]=new String[2];
fname[0]=inputFile;
fname[1]=""+i;
InstanceSelect.main(fname);
System.out.println("\nPartition "+u+" instance selection finished");
Main.ta.append("\nPartition "+u+" instance selection finished");
votecalc(inputFile+""+i);
}
rnd++;
}
System.out.println("\nTotal number of Selected Instances over "+round+" rounds: "+map.size());
int mm=0;
for(tmax=1;tmax<thma;tmax++)
{
for(tmin=1;tmin<thmi;tmin++)
{
int recnt1=0,redcnt2=0;
Iterator<Map.Entry<Integer, Integer>> entries = map.entrySet().iterator();
while (entries.hasNext())
{
Map.Entry<Integer, Integer> entry = entries.next();
String tmp=""+entry.getKey();
String[] tmpar=tmp.split(",");
if((tmpar[tmpar.length-1].equals("4.0"))&&(entry.getValue()>tmin))
{
recnt1++;
if(redcnt2<map.size()/3)
{
writer1.append(""+tmpar[tmpar.length-1]);
for(int t=1;t<tmpar.length;t++)
writer1.append(" "+t+":"+tmpar[t-1]);
writer1.newLine();
}
else
{
writer2.append(""+tmpar[tmpar.length-1]);
for(int t=1;t<tmpar.length;t++)
writer2.append(" "+t+":"+tmpar[t-1]);
writer2.newLine();
}
}
else if((tmpar[tmpar.length-1].equals("2.0"))&&(entry.getValue()>tmax))
{
recnt1++;
if(redcnt2<map.size()/3)
{
writer1.append(""+tmpar[tmpar.length-1]);
for(int t=1;t<tmpar.length;t++)
writer1.append(" "+t+":"+tmpar[t-1]);
writer1.newLine();
}
else
{
writer2.append(""+tmpar[tmpar.length-1]);
for(int t=1;t<tmpar.length;t++)
writer2.append(" "+t+":"+tmpar[t-1]);
writer2.newLine();
}
}
redcnt2++;
}
String arg[]={"red.train.txt","red.test.txt","1","0"};
knn.main(arg);
double r=0;
if(accur1!=1.0)
{
fli.add(r=(accur1+(double)(recnt1)/(double)redcnt2));
}
tp[mm]=tmin;
tm[mm]=tmax;
mm++;
}
}
double t1=0,t2=0;
int s1=0,s2=0;
double mx=Collections.max(fli);
for(int g=0;g<fli.size();g++)
{
if(fli.get(g)==mx)
{
t1=tp[g];
t2=tm[g];
}
}
Iterator<Map.Entry<Integer, Integer>> entries = map.entrySet().iterator();
while (entries.hasNext())
{
Map.Entry<Integer, Integer> entry = entries.next();
String tmp1=""+entry.getKey();
String[] tmpar1=tmp1.split(",");
if((tmpar1[tmpar1.length-1].equals("4.0"))&&(entry.getValue()>t1))
{
s1++;
tmpl1.add(""+tmp1);
}
if((tmpar1[tmpar1.length-1].equals("2.0"))&&(entry.getValue()>t2))
{
s2++;
map1.put(tmp1,entry.getValue());
tmpl2.add(""+tmp1);
tmpl3.add(entry.getValue());
}
}
System.out.println("\n\nTotal dataset size Before Undersampling Majority class : "+(s1+s2));
System.out.println("\nMinority Class Instance :"+s1+"\n Majority Class Instance : "+s2);
Main.ta.append("\n\nTotal dataset size Before Undersampling Majority class : "+(s1+s2));
Main.ta.append("\nMinority Class Instance :"+s1+"\n Majority Class Instance : "+s2);
Iterator<Map.Entry<Integer, Integer>> entries1 = map1.entrySet().iterator();
int mk=0;
while (entries1.hasNext())
{
Map.Entry<Integer, Integer> entry1 = entries1.next();
if(mk<s1)
{
tmpl1.add(""+entry1.getKey());
}
mk++;
}
Collections.shuffle(tmpl1);
for(int k=0;k<tmpl1.size();k++)
{
if(k<tmpl1.size()/3)
{
String[] kkk=tmpl1.get(k).split(",");
writer1.append(""+kkk[kkk.length-1]);
for(int t=1;t<kkk.length;t++)
writer1.append(" "+t+":"+kkk[t-1]);
writer1.newLine();
}
else
{
String[] kkk=tmpl1.get(k).split(",");
writer2.append(kkk[kkk.length-1]+" ");
for(int t=1;t<kkk.length;t++)
writer2.append(" "+t+":"+kkk[t-1]);
writer2.newLine();
}
}
writer1.close();
writer2.close();
System.out.println("\n\n Total dataset size After Undersampling Majority class: "+tmpl1.size());
System.out.println("\nMinority Class Instance :"+tmpl1.size()/2+"\n Majority Class Instance : "+tmpl1.size()/2);
Main.ta.append("\n\n Total dataset size After Undersampling Majority class: "+tmpl1.size());
Main.ta.append("\nMinority Class Instance :"+tmpl1.size()/2+"\n Majority Class Instance : "+tmpl1.size()/2);
Thread.sleep(500);
String arg[]={"Final.train.txt","Final.test.txt","1","0"};
knn.main(arg);
Main.ac2=accur1;
Main.ta.append("\n\nAccuracy\n\n SSO: "+Main.ac1);
Main.ta.append("\n Oligols: "+Main.ac2);
}
APPENDIX 2
SNAPSHOTS
Figure A2.3 Specifying Partition
The above figure shows the number of partitions to be done during the instance selection process.
Figure A2.5 Balanced Data Set
The above figure shows the balanced data set after sampling.
REFERENCES